Perspective

Why your cloud infrastructure will fail at the moment you need it most

Infrastructure built for today collapses under tomorrow's load. The 2024 cloud outages, the CrowdStrike incident, and the rising cost of cloud waste reveal what separates resilient organisations from fragile ones.

Reading time · 8 minutes Published · Q1 2025 By NovasIQ Insights team

2024 was the year cloud infrastructure stopped being invisible. From a single faulty CrowdStrike update that grounded 8.5 million Windows systems, to AWS, Azure, and Google Cloud disruptions that took thousands of dependent services offline, the year exposed an uncomfortable truth: the infrastructure most enterprises rely on is more brittle, more concentrated, and more expensive than the cloud-first narrative suggested. The companies that scale successfully build resilience in before the pressure hits.

Outages are getting rarer — and more expensive

Uptime Institute's Annual Outage Analysis 2025 reports that the proportion of operators experiencing significant downtime has been declining since 2021 [1]. Yet when outages do occur, they are increasingly costly: more than half (54%) of respondents said their most recent significant or severe outage cost over $100,000, and one in five said it exceeded $1 million [1]. Power-related issues remain the leading cause of impactful data centre outages, while IT and networking issues, driven by configuration complexity and change management failures, accounted for 23% of impactful outages in 2024 [1].

The Uptime data shows a clear pattern: outages are becoming less frequent but more severe. Modern infrastructure is denser and more interdependent, so when something does break, it cascades.

$5.4B
Estimated potential financial loss from the July 2024 CrowdStrike incident, in which a faulty Falcon Sensor update affected 8.5 million Windows machines globally, described by Microsoft as the largest IT outage in history [2].

The 2024 outages that proved the point

The CrowdStrike incident on 19 July 2024 was the inflection point. A defective configuration update to the Falcon Sensor security agent triggered Blue Screen of Death cascades on roughly 8.5 million Windows devices. That is under 1% of all Windows endpoints, but it was enough to halt airlines, hospitals, banks, payment terminals, and emergency services worldwide [3]. More than 3,300 flights were cancelled within hours. Parametrix estimated $5.4 billion in potential financial losses [2]. Recovery took days for many large organisations because remediation required manual intervention on each affected machine.

Parametrix's Cloud Outage Risk Report 2024 found that critical cloud service disruptions involving AWS, Microsoft Azure, and Google Cloud increased 18% year-on-year in 2024, and 52% since 2022 [4]. The total duration of critical cloud outages rose to 221 hours in 2024, up 51% since 2022 [4]. Six of those outages lasted more than ten hours each, together accounting for nearly 100 hours of high-impact downtime [4].

8.5M
Windows devices affected by the CrowdStrike Falcon update of 19 July 2024 [3]
52%
Increase in critical cloud disruptions across AWS, Azure, GCP since 2022 [4]
$1M+
Cost of most recent severe outage for 1 in 5 organisations surveyed [1]
29%
of cloud spend wasted in 2025, the first increase after a five-year decline [5]
84%
of organisations cite managing cloud spend as their top cloud challenge [5]
17%
average overrun on cloud budgets, a persistent gap year-on-year [5]

The hidden cost: waste, not just downtime

While outages capture headlines, the more insidious cost of fragile cloud infrastructure is wasted spend. Flexera's 2025 State of the Cloud Report, based on responses from 759 IT decision-makers, found that 84% of organisations identify managing cloud spend as their top challenge, surpassing concerns about security or compliance [5]. The 2026 edition reported that estimated wasted cloud spend rose to 29% in 2025, reversing a five-year downward trend [5].

Industry analysis based on the 27% waste figure from Flexera's 2024 report estimates approximately $180 billion in wasted cloud spend globally each year [6]. The drivers are familiar: over-provisioning to avoid performance risk, forgotten test environments, mis-sized instances, untagged resources, and AI workloads with unpredictable cost spikes.

"Outages overall have slowed down. Data centre operators are facing a growing number of external risks beyond their control, including power grid constraints, extreme weather, network provider failures, and third-party software issues."

Why infrastructure built for today fails tomorrow

The Uptime Institute data highlights an uncomfortable pattern: nearly 40% of organisations have suffered a major outage caused by human error in the past three years, and 85% of those incidents stem from staff failing to follow procedures, or from flaws in the procedures themselves [1]. The proportion of human-error outages caused by failure to follow procedures rose by ten percentage points between 2024 and 2025 [1].

This is not primarily a technology problem. It is a complexity problem. As infrastructure grows — more services, more dependencies, more vendors, more configurations — the operational discipline required to keep it running scales nonlinearly. Tools that were sufficient for a 50-service estate begin to fail at 500 services. Documentation that worked for one cloud region breaks across multi-region, multi-cloud deployments.
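
One rough way to see the nonlinearity is to count how many pairs of services could, in principle, interact. This is an illustrative assumption rather than a measured model: real estates are far sparser than "every service may affect every other", and the function name below is hypothetical. The sketch only shows how quickly the upper bound grows.

```python
from math import comb


def potential_pairs(n_services: int) -> int:
    """Upper bound on service-to-service interactions: n * (n - 1) / 2."""
    return comb(n_services, 2)


for n in (50, 500):
    # 50 services  -> 1,225 possible pairs
    # 500 services -> 124,750 possible pairs
    print(f"{n} services: {potential_pairs(n):,} possible pairs")
```

Under this assumption, a tenfold increase in services produces roughly a hundredfold increase in the interactions that runbooks, tooling, and change reviews may need to account for.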

The companies that scale successfully share three characteristics:

1. They architect for failure, not just performance

Resilience is treated as a first-class design property (multi-region deployment, graceful degradation, circuit breakers, and explicit blast-radius limits), not an afterthought. Investments in distributed resiliency tooling have measurably improved availability, although Uptime cautions that this complexity also introduces new failure modes [1].
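
To make the circuit breaker and graceful degradation ideas concrete, here is a minimal sketch in Python. It is not drawn from any of the cited reports or from a specific library: the class name, the thresholds, and the fallback convention are all assumptions chosen for illustration, and a production system would more likely rely on an established resilience library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical breaker: stop calling a failing dependency for a while."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: skip the dependency and degrade gracefully.
                return fallback()
            # Cool-down elapsed: allow one trial call (half-open).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success closes the breaker fully
        return result
```

A caller would wrap each remote dependency in its own breaker, for example `breaker.call(fetch_live_prices, serve_cached_prices)` with both functions being hypothetical. Keeping one breaker per dependency is one simple way to limit blast radius, since a failing service then degrades only the feature that depends on it.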

2. They invest in FinOps before the bill spirals

Flexera's 2026 data shows 63% of organisations now have a dedicated FinOps team, up from 51% in 2024 [5]. Organisations that combine FinOps with engineering practices such as rightsizing, commitment discounts, and automated anomaly detection typically reduce cloud spend by 15–25% within the first year [5].
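
Automated anomaly detection can start very simply. The sketch below is an assumption-laden illustration, not Flexera's method or any vendor's algorithm: it flags a day whose cost sits well above the mean of the preceding week. The function name, the seven-day window, the three-sigma threshold, and the sample figures are all hypothetical.

```python
from statistics import mean, stdev
from typing import List


def cost_anomalies(daily_cost: List[float], window: int = 7,
                   threshold: float = 3.0) -> List[int]:
    """Return indexes of days costing more than `threshold` standard
    deviations above the mean of the preceding `window` days."""
    flagged = []
    for i in range(window, len(daily_cost)):
        history = daily_cost[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Floor sigma so a perfectly flat history cannot divide by zero.
        sigma = max(sigma, 0.01 * mu, 1e-9)
        if (daily_cost[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Made-up daily spend in dollars; day 7 is a deliberate spike.
spend = [100, 102, 98, 101, 99, 100, 103, 180, 101]
print(cost_anomalies(spend))  # [7]
```

A trailing window keeps the check cheap and explainable, at the price of missing slow drift. Real billing data may also need seasonality handling and per-team or per-service baselines, which this sketch omits.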

3. They reduce single points of dependency

The 2024 outages exposed how concentrated cloud risk has become. AWS, Azure, and Google Cloud together account for approximately 63% of the cloud market [7]. Organisations that survived 2024 with minimal disruption typically had multi-region, multi-cloud, or hybrid failover paths, not as a cost-saving measure, but as a continuity safeguard.
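
At the application level, a failover path can be as simple as an ordered list of targets tried in turn. The following is a minimal sketch under stated assumptions: the function name and the target labels are hypothetical, and real deployments usually add health checks, timeouts, and DNS or load-balancer failover on top of anything done in application code.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def call_with_failover(
    targets: Sequence[Tuple[str, Callable[[], T]]]
) -> Tuple[str, T]:
    """Try each (name, callable) in priority order; return the first success."""
    errors: List[str] = []
    for name, fn in targets:
        try:
            return name, fn()
        except Exception as exc:
            # Record the failure and move on to the next region or provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all targets failed: " + "; ".join(errors))


def primary_down() -> str:
    raise ConnectionError("region unavailable")


# Hypothetical labels: a primary region, then a second provider.
served_by, payload = call_with_failover([
    ("primary-region", primary_down),
    ("secondary-provider", lambda: "ok"),
])
print(served_by, payload)  # secondary-provider ok
```

The ordering encodes the continuity policy: the secondary path exists to keep the service up, so it is only exercised when the primary fails. It is worth testing that path regularly rather than discovering during an outage that it has gone stale.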

What this means for technology leaders

The data points to a few hard conclusions:

The bottom line

Cloud infrastructure was sold as elastic, infinite, and self-healing. The 2024 data reveals a more honest picture: highly capable, but fragile in ways that only show up at scale. The companies scaling smartly are not the ones spending the most on cloud — they are the ones building resilience, operational discipline, and cost discipline into the architecture from day one.

The good news is that the playbook is now well-established. The bad news is that most organisations still treat resilience as something to add later — usually after the first major outage has already cost them.

Sources & References
Citations to publicly available primary research

All statistics and findings cited in this report are drawn from publicly available primary research published by the named organisations. NovasIQ has not produced original survey data for this report; figures are reproduced as published, with full source attribution below.

  1. Uptime Institute. Annual Outage Analysis 2025 (7th annual edition), Uptime Intelligence, May 2025. Based on Uptime's 2024 annual operator survey of 412 respondents and a separate severity survey of 97 respondents, plus tracking of publicly reported outages. Available at: https://intelligence.uptimeinstitute.com/resource/annual-outage-analysis-2025 and https://uptimeinstitute.com/about-ui/press-releases/uptime-announces-annual-outage-analysis-report-2025
  2. Parametrix Insurance. Loss estimate for the 19 July 2024 CrowdStrike Falcon Sensor incident, cited in industry coverage. Parametrix initially estimated approximately $5.4 billion in potential US Fortune 500 financial losses (excluding Microsoft itself) from the outage. Reported widely in financial media and analysed in: Cloud Security Alliance, What We Can Learn from the 2024 CrowdStrike Outage, July 2025. Available at: https://cloudsecurityalliance.org/blog/2025/07/03/what-we-can-learn-from-the-2024-crowdstrike-outage
  3. Microsoft Corporation. Helping our customers through the CrowdStrike outage, blog post by David Weston, 20 July 2024. Confirms that approximately 8.5 million Windows devices were affected, less than one per cent of all Windows machines. See also: IBM Corporation, Recent CrowdStrike outage: What you should know, 22 July 2024. Available at: https://www.ibm.com/think/news/recent-crowdstrike-outage-what-you-should-know and https://en.wikipedia.org/wiki/2024_CrowdStrike-related_IT_outages
  4. Parametrix Insurance. Cloud Outage Risk Report 2024. Analysis of cloud service disruptions across AWS, Microsoft Azure, and Google Cloud Platform. Reports a 52% increase in critical cloud disruptions since 2022 and an 18% year-on-year increase in 2024. Cited in coverage at: https://cyberinsurancenews.org/cloud-outages-2024-report/
  5. Flexera. 2025 State of the Cloud Report (March 2025) and 2026 State of the Cloud Report. Annual surveys of 750+ global IT decision-makers and cloud practitioners. Available at: https://www.flexera.com/blog/finops/the-latest-cloud-computing-trends-flexera-2025-state-of-the-cloud-report/ and https://info.flexera.com/CM-REPORT-State-of-the-Cloud
  6. Williams, D. Cloud Cost Management is Broken: Why FinOps Tools are Failing and What's Needed, October 2024. Industry analysis citing the 27% wasted cloud spend figure from Flexera's 2024 report and estimating approximately $180 billion in annual global cloud waste. Available at: https://medium.com/@dpwilliams03/cloud-cost-management-is-broken-why-finops-tools-are-failing-and-whats-needed-9c16939fe439
  7. Synergy Research Group. Quarterly cloud market share analysis. AWS, Microsoft Azure, and Google Cloud collectively held approximately 63% of the global cloud infrastructure market as of Q2 2025. See also Synergy Research's quarterly publications at: https://www.srgresearch.com/

Where research firms have published differing methodologies for the same metric, this report cites the most recent figure from the named primary source. URLs were valid at time of publication; some primary reports require free registration to access in full. Numerical figures are rounded as published in original sources. NovasIQ is not affiliated with any of the cited research organisations.
