Why your cloud infrastructure will fail at the moment you need it most

2024 was the year cloud infrastructure stopped being invisible. From a single faulty CrowdStrike update that crashed 8.5 million Windows systems to AWS, Azure, and Google Cloud disruptions that took thousands of dependent services offline, the year exposed an uncomfortable truth: the infrastructure most enterprises rely on is more brittle, more concentrated, and more expensive than the cloud-first narrative suggested. The companies that scale successfully build resilience in before the pressure hits.

Outages are getting rarer — and more expensive

Uptime Institute's Annual Outage Analysis 2025 reports that the proportion of operators experiencing significant downtime has been declining since 2021¹. Yet when outages do occur, they are increasingly costly: more than half (54%) of respondents said their most recent significant or severe outage cost over $100,000, and one in five said it exceeded $1 million¹. Power-related issues remain the leading cause of impactful data centre outages, while IT and networking issues — driven by configuration complexity and change management failures — accounted for 23% of impactful outages in 2024¹.

The Uptime data shows a clear pattern: outages are becoming less frequent but more severe. Modern infrastructure is denser and more interdependent, so when something does break, it cascades.

$5.4B: estimated potential financial loss from the July 2024 CrowdStrike incident, in which a faulty Falcon Sensor update affected 8.5 million Windows machines globally. Microsoft described it as the largest IT outage in history².

The 2024 outages that proved the point

The CrowdStrike incident on 19 July 2024 was the inflection point. A defective configuration update to the Falcon Sensor security agent triggered Blue Screen of Death cascades on roughly 8.5 million Windows devices — under 1% of all Windows endpoints, but enough to halt airlines, hospitals, banks, payment terminals, and emergency services worldwide³. More than 3,300 flights were cancelled within hours. Parametrix estimated $5.4 billion in potential financial losses². Recovery took days for many large organisations because remediation required manual intervention on each affected machine.

Parametrix's Cloud Outage Risk Report 2024 found that critical cloud service disruptions involving AWS, Microsoft Azure, and Google Cloud increased 18% year-on-year in 2024, and 52% since 2022⁴. The duration of critical cloud outages rose to 221 hours in 2024, up 51% since 2022⁴. Six outages lasted more than ten hours each in 2024, totalling nearly 100 hours of high-impact downtime⁴.
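As a rough sanity check on these figures, the quoted percentages can be inverted to see what they imply about earlier years. The sketch below assumes each percentage is a simple ratio against its baseline year; the derived values are back-of-envelope estimates, not numbers reported by Parametrix.

```python
# Back-of-envelope check of the figures quoted above.
# Assumption: each percentage is a simple ratio against its baseline year.

hours_2024 = 221.0          # critical cloud outage hours in 2024 (from the text)
growth_since_2022 = 0.51    # "up 51% since 2022"

# hours_2024 = baseline * (1 + growth), so the implied 2022 baseline is:
implied_hours_2022 = hours_2024 / (1 + growth_since_2022)

disruptions_growth_since_2022 = 0.52  # "52% since 2022"
disruptions_growth_yoy = 0.18         # "18% year-on-year in 2024"

# If 2022 -> 2024 is +52% and 2023 -> 2024 is +18%, the 2022 -> 2023 step is:
implied_growth_2023 = (
    (1 + disruptions_growth_since_2022) / (1 + disruptions_growth_yoy) - 1
)

print(f"Implied 2022 outage hours: {implied_hours_2022:.0f}")
print(f"Implied 2022-2023 growth in disruptions: {implied_growth_2023:.1%}")
```

Under that assumption, the 2022 baseline works out to roughly 146 hours, and disruptions would have grown by roughly 29% between 2022 and 2023, which suggests the upward trend predates 2024.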

8.5M Windows devices affected by the CrowdStrike Falcon update of 19 July 2024³

52% increase in critical cloud disruptions across AWS, Azure, and GCP since 2022⁴

$1M+ cost of most recent severe outage for 1 in 5 organisations surveyed¹

29% of cloud spend wasted in 2025, the first increase after a five-year decline⁵

84% of organisations cite managing cloud spend as their top cloud challenge⁵

17% average overrun on cloud budgets, a persistent gap year-on-year⁵

The hidden cost: waste, not just downtime

While outages capture headlines, the more insidious cost of fragile cloud infrastructure is wasted spend. Flexera's 2025 State of the Cloud Report — based on responses from 759 IT decision-makers — found that 84% of organisations identify managing cloud spend as their top challenge, surpassing concerns about security or compliance⁵. The 2026 edition reported that estimated wasted cloud spend rose to 29% in 2025, reversing a five-year downward trend⁵.

Industry analysis estimates this represents approximately $180 billion in wasted cloud spend globally each year⁶. The drivers are familiar: over-provisioning to avoid performance risk, forgotten test environments, mis-sized instances, untagged resources, and AI workloads with unpredictable cost spikes.
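Untagged resources are among the easier drivers to detect automatically. The following is a minimal sketch of a tag audit over an exported resource inventory. The inventory shape, the resource IDs, and the `REQUIRED_TAGS` policy are all hypothetical; a real audit would read from a cloud provider's inventory or billing export.

```python
# Hypothetical tagging policy: every resource must carry these tag keys.
REQUIRED_TAGS = {"owner", "environment", "cost-centre"}


def audit_resources(resources, required_tags=REQUIRED_TAGS):
    """Return {resource_id: sorted missing tag keys} for non-compliant resources."""
    findings = {}
    for res in resources:
        missing = required_tags - set(res.get("tags", {}))
        if missing:
            findings[res["id"]] = sorted(missing)
    return findings


# Illustrative inventory (made-up resources, not real data).
inventory = [
    {"id": "vm-001", "tags": {"owner": "payments", "environment": "prod",
                              "cost-centre": "cc-12"}},
    {"id": "vm-002", "tags": {"environment": "test"}},
    {"id": "bucket-7"},  # no tags at all
]

for resource_id, missing in audit_resources(inventory).items():
    print(f"{resource_id}: missing {', '.join(missing)}")
```

A resource with no owner or cost centre cannot be charged back to a team, which is usually how forgotten test environments survive unnoticed.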

"Outages overall have slowed down. Data centre operators are facing a growing number of external risks beyond their control, including power grid constraints, extreme weather, network provider failures, and third-party software issues."

Why infrastructure built for today fails tomorrow

The Uptime Institute data highlights an uncomfortable pattern: nearly 40% of organisations have suffered a major outage caused by human error in the past three years, and 85% of those incidents stem from staff failing to follow procedures, or from flaws in the procedures themselves¹. The proportion of human-error outages caused by failure to follow procedures rose by ten percentage points between 2024 and 2025¹.

This is not primarily a technology problem. It is a complexity problem. As infrastructure grows — more services, more dependencies, more vendors, more configurations — the operational discipline required to keep it running scales nonlinearly. Tools that were sufficient for a 50-service estate begin to fail at 500 services. Documentation that worked for one cloud region breaks across multi-region, multi-cloud deployments.

The companies that scale successfully share three characteristics:

1. They architect for failure, not just performance

Resilience is treated as a first-class design property — multi-region deployment, graceful degradation, circuit breakers, and explicit blast-radius limits — not an afterthought. Investments in distributed resiliency tooling have measurably improved availability, although Uptime cautions that this complexity also introduces new failure modes¹.
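To make two of these terms concrete, here is a minimal sketch of a circuit breaker combined with graceful degradation. It is an illustration under simplifying assumptions (single-threaded, in-memory state, hypothetical class and function names), not a production implementation or any particular library's API.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency temporarily disabled")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def with_fallback(breaker, func, fallback):
    """Graceful degradation: return a fallback if the call fails or is skipped."""
    try:
        return breaker.call(func)
    except Exception:
        return fallback
```

Once the breaker opens, callers stop sending traffic to the failing dependency for `reset_timeout` seconds and serve the fallback instead (a cached value, for example), which keeps one failure from cascading into its callers.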

2. They invest in FinOps before the bill spirals

Flexera's 2026 data shows 63% of organisations now have a dedicated FinOps team, up from 51% in 2024⁵. Organisations that combine FinOps with engineering practices — rightsizing, commitment discounts, automated anomaly detection — typically reduce cloud spend by 15–25% within the first year⁵.
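Automated anomaly detection can start very simply. The sketch below flags days whose spend sits far above a trailing-window average. The window length, the z-score threshold, and the sample figures are illustrative assumptions rather than values from the Flexera report, and a real FinOps pipeline would typically also account for seasonality.

```python
from statistics import mean, stdev


def find_spend_anomalies(daily_spend, window=7, z_threshold=3.0):
    """Return (day_index, amount) for days far above the trailing-window average."""
    anomalies = []
    for i in range(window, len(daily_spend)):
        history = daily_spend[i - window:i]
        mu = mean(history)
        # Floor the spread at 1% of the mean so a flat history cannot divide by zero.
        sigma = max(stdev(history), 0.01 * mu)
        if sigma == 0:
            continue  # all-zero history: nothing meaningful to compare against
        z_score = (daily_spend[i] - mu) / sigma
        if z_score > z_threshold:
            anomalies.append((i, daily_spend[i]))
    return anomalies


# Made-up daily spend in dollars; day 7 contains a spike.
spend = [100, 102, 98, 101, 99, 100, 103, 250, 101]
print(find_spend_anomalies(spend))
```

One limitation of this approach is that a spike stays in the trailing window for the following days and inflates the baseline, so a second spike shortly after the first is harder to detect.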

3. They reduce single points of dependency

The 2024 outages exposed how concentrated cloud risk has become. AWS, Azure, and Google Cloud together account for approximately 63% of the cloud market⁷. Organisations that survived 2024 with minimal disruption typically had multi-region, multi-cloud, or hybrid failover paths — not as a cost-saving measure, but as a continuity safeguard.
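At its simplest, a failover path is an ordered list of alternatives tried in turn. The sketch below assumes each region or provider is wrapped in a callable handler; the endpoint names and handlers are hypothetical stand-ins for real service clients.

```python
def call_with_failover(endpoints, request):
    """Try each endpoint in priority order and return (name, result) from the first success."""
    errors = {}
    for name, handler in endpoints:
        try:
            return name, handler(request)
        except Exception as exc:  # deliberately broad in this sketch
            errors[name] = exc
    raise RuntimeError(f"all endpoints failed: {sorted(errors)}")


# Hypothetical handlers standing in for real regional or provider clients.
def primary_region(request):
    raise ConnectionError("region unavailable")


def secondary_cloud(request):
    return f"ok:{request}"


served_by, response = call_with_failover(
    [("primary-region", primary_region), ("secondary-cloud", secondary_cloud)],
    "ping",
)
print(served_by, response)
```

The routing code is the easy part. The secondary path only helps if its data, credentials, and capacity are kept ready, which is why failover is a continuity investment and not a cost saving.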

What this means for technology leaders

The data points to a few hard conclusions:

Outages will continue to become rarer and more severe. One in five organisations surveyed said their most recent significant or severe outage cost more than $1 million¹.
The cloud bill is the new infrastructure tax. Without active FinOps practice, organisations should expect to waste approximately 29% of their cloud spend⁵.
Vendor concentration is a strategic risk. A single-vendor failure of the kind seen in 2024 (CrowdStrike, AWS, Azure, Google Cloud) is now the most likely cause of a multi-million-dollar disruption²,⁴.
Human error is the primary attributable cause. 85% of human-error outages stem from procedural failures¹. Process discipline is a higher-leverage investment than additional tooling for most organisations.

The bottom line

Cloud infrastructure was sold as elastic, infinite, and self-healing. The 2024 data reveals a more honest picture: highly capable, but fragile in ways that only show up at scale. The companies scaling smartly are not the ones spending the most on cloud — they are the ones building resilience, operational discipline, and cost discipline into the architecture from day one.

The good news is that the playbook is now well-established. The bad news is that most organisations still treat resilience as something to add later — usually after the first major outage has already cost them.
