On October 20, 2025, AWS's Northern Virginia region (US-EAST-1) experienced a severe outage lasting nearly 16 hours, disrupting 113 services and taking major platforms like Zoom, DoorDash, Capital One, Coinbase, and Reddit offline. These companies, which collectively spend eight to nine figures annually on backup and disaster recovery solutions, were completely powerless.
The incident exposed an under-acknowledged truth: backups protect data, not infrastructure. And without infrastructure, your data isn’t worth much.
The $100 Million Question Nobody Wants to Ask
Think about that for a moment. Zoom, a platform critical to modern work. DoorDash, serving millions of meals daily. Capital One and Coinbase, handling billions in financial transactions.
- All of them have disaster recovery plans.
- All of them invest heavily in backup solutions.
- All of them went dark when a single DNS configuration error cascaded through AWS's US-EAST-1 region.
The hard truth: backing up your data is great, but on its own it does almost nothing for business continuity. And if some of the biggest, most seemingly well-prepared brands are paying the price for not being disaster-ready, what's in store for the rest?
A Pattern of Failure: AWS Outage History and the Growing DR Risk
This wasn't a one-off event. It's the latest in a string of major outages over the past five years tied to AWS's US-EAST-1 region, which carries much of the internet's traffic.
The historical record tells a troubling story:
- November 2020: A Kinesis data stream failure in US-EAST-1 affected authentication, delivery tracking, and streaming services across the internet.
- December 2021: API Gateway issues in US-EAST-1 caused widespread disruption lasting over seven hours, impacting DynamoDB, EC2, and the AWS Console.
- June 2023: Lambda service disruptions in US-EAST-1 affected serverless applications worldwide.
- July 2024: Another Kinesis event in US-EAST-1 disrupted data streaming services.
And now, October 2025: A DNS race condition in DynamoDB's automated management system cascaded into a day-long outage that exposed the fragility of global internet infrastructure.
The recurring theme? US-EAST-1, Amazon's oldest and largest region, located in Northern Virginia, remains a single point of failure for much of the internet. AWS controls approximately 30% of the global cloud market, which means that when it fails, the ripple effects are widespread.
Enter CAIRS: The Category That Changes Everything
This kind of widespread trouble with disaster recovery is precisely why Gartner created a new category in 2025: Cloud Application Infrastructure Recovery Solutions (CAIRS).
According to Gartner, CAIRS solutions automate discovery, protection, and restoration of full-stack cloud applications: not just data, but infrastructure and configurations, too. For decades, backup and disaster recovery practices focused almost exclusively on safeguarding data, leaving a critical gap: if infrastructure and configurations are compromised, protected data remains inaccessible.
Cloud resiliency is two things: knowing which parts of your cloud are backup-ready, and being able to automatically redeploy both the backups and the infrastructure they run on. But not all solutions make true cloud resiliency possible.
The False Promise of Traditional DR
Here's the dirty little secret about traditional disaster recovery that no one talks about: it was designed for a reality that no longer exists.
Legacy DR solutions assume your infrastructure is static, predictable, and owned by you. They focus obsessively on data backup: creating snapshots, replicating databases, archiving files. But in cloud-native environments, data is only half the equation. Without the infrastructure layer (compute instances, networking configurations, security groups, IAM policies), your backed-up data might as well not exist.

When AWS's US-EAST-1 region failed, companies discovered they couldn't simply "restore from backup." They needed to:
- Spin up infrastructure in a different region
- Reconfigure networking and security
- Update DNS and load balancers
- Restore application state
- Reconnect integrations and dependencies
All while their SLAs were bleeding, and their competitors were gaining ground.
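To make just one of those steps concrete, here's a minimal sketch of "update DNS" done in code. It assumes, purely for illustration, a warm standby already running behind a load balancer in us-west-2 and a Route 53 hosted zone you control; the zone ID, record name, and ALB hostname below are placeholders.

```python
import boto3

# Hypothetical values -- replace with your own hosted zone, record, and standby endpoint.
HOSTED_ZONE_ID = "Z0000000EXAMPLE"
RECORD_NAME = "app.example.com."
STANDBY_ALB = "standby-alb-123456.us-west-2.elb.amazonaws.com"

route53 = boto3.client("route53")

# Point the application record at the standby region's load balancer.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Fail over to us-west-2 during a us-east-1 outage",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": STANDBY_ALB}],
            },
        }],
    },
)
```

Even this simple step has a catch: Route 53's control plane, the API this script calls, lives in US-EAST-1, which is why many teams prefer pre-configured failover routing policies with health checks (evaluated by the globally distributed data plane) over making record changes mid-incident.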
Why ClickOps Is Your Single Point of Failure
The AWS outage exposed another uncomfortable truth: most cloud infrastructure exists as undocumented, manually configured resources that can't be easily replicated or recovered. This is the ClickOps problem.
Engineers log into the AWS console, click through wizards, and deploy resources directly. It's fast, it's intuitive, and it's a disaster waiting to happen.
When those manually configured resources fail (or when an entire region goes dark), how do you recreate them? From memory? From scattered documentation? From screenshots? The organizations recovering fastest and maintaining business continuity best are the ones with infrastructure defined as code, automated deployment pipelines, and multi-region strategies.
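One way to start sizing the ClickOps problem is to compare what's actually running against what your IaC pipeline claims to own. The sketch below is a rough heuristic, not a complete inventory: it assumes your pipeline tags everything it creates with a hypothetical managed-by tag, and it lists EC2 instances in one region that carry no such tag.

```python
import boto3

# Assumption: IaC-managed resources carry a "managed-by" tag (e.g. "terraform").
# Anything without it is a candidate ClickOps resource.
IAC_TAG_KEY = "managed-by"

ec2 = boto3.client("ec2", region_name="us-east-1")

unmanaged = []
for page in ec2.get_paginator("describe_instances").paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if IAC_TAG_KEY not in tags:
                unmanaged.append(instance["InstanceId"])

print(f"{len(unmanaged)} instances with no IaC ownership tag:")
for instance_id in unmanaged:
    print(f"  {instance_id}")
```

The same idea extends to security groups, load balancers, IAM roles, and every other resource type, which is exactly why tooling that automates this discovery exists.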
What Should You Do Right Now?
The answer isn't spending more on traditional backup solutions. It's fundamentally rethinking your approach to cloud resilience:
Step 1: Audit Your Infrastructure Resilience
Can you answer these questions right now? (Hint: If you can't answer confidently, you're vulnerable. A starting-point sketch for the first question follows the list.)
- Which parts of your cloud infrastructure are backed up?
- Can you recreate your entire stack in a different region?
- How long would recovery actually take?
- What percentage of your infrastructure exists as undocumented ClickOps resources?
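For the first question, you don't need a product to get an initial answer; a few API calls will do. Here's a minimal sketch (assuming default boto3 credentials) that checks one slice of it: which DynamoDB tables in a region actually have point-in-time recovery enabled.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Collect every table name in the region.
tables = []
for page in dynamodb.get_paginator("list_tables").paginate():
    tables.extend(page["TableNames"])

# Flag tables without point-in-time recovery enabled.
for table in tables:
    description = dynamodb.describe_continuous_backups(TableName=table)
    status = description["ContinuousBackupsDescription"][
        "PointInTimeRecoveryDescription"]["PointInTimeRecoveryStatus"]
    marker = "OK " if status == "ENABLED" else "!! "
    print(f"{marker}{table}: point-in-time recovery {status}")
```

Repeat the exercise for every service you depend on, and the gaps tend to become obvious quickly.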
Step 2: Adopt Infrastructure-as-Code Everywhere
Every resource in your cloud should be defined as code. Not some. Not most. All of it. This isn't optional anymore. When (not if) the next outage hits, you need to be able to redeploy your entire infrastructure with a single command.
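What does "all of it" look like in practice? Here's a minimal AWS CDK (Python) sketch; the stack contents are a placeholder, but the point is that the target region is just a parameter, so standing the stack up somewhere else is one deploy command rather than an afternoon of console clicks.

```python
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct


class AppStack(cdk.Stack):
    """A trivial stand-in for your real application stack."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Every resource lives in code; this bucket is just a placeholder.
        s3.Bucket(self, "AppData", versioned=True)


app = cdk.App()
# The same definition targets both the primary and the standby region:
#   cdk deploy AppStack-us-east-1
#   cdk deploy AppStack-us-west-2
for region in ["us-east-1", "us-west-2"]:
    AppStack(app, f"AppStack-{region}", env=cdk.Environment(region=region))
app.synth()
```

Whether you use CDK, Terraform, Pulumi, or something else matters far less than the discipline of never creating a resource outside of it.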
Step 3: Implement Cloud Resiliency Posture Management (CRPM)
Don't wait for the next outage to discover you're unprepared. CRPM gives you continuous visibility into which parts of your infrastructure are resilient and which are vulnerable.
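Dedicated CRPM tooling does this continuously, across every account and resource type. But the underlying questions are concrete enough to hand-roll for a single check, which is a useful way to understand what you're buying. A sketch (assuming boto3 credentials, and looking only at EBS volumes) that flags anything without a snapshot in the last 24 hours:

```python
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

# Gather all EBS volume IDs in the region.
volume_ids = [
    volume["VolumeId"]
    for page in ec2.get_paginator("describe_volumes").paginate()
    for volume in page["Volumes"]
]

# Flag volumes with no snapshot newer than the cutoff.
for volume_id in volume_ids:
    snapshots = ec2.describe_snapshots(
        Filters=[{"Name": "volume-id", "Values": [volume_id]}],
        OwnerIds=["self"],
    )["Snapshots"]
    if not any(snapshot["StartTime"] >= cutoff for snapshot in snapshots):
        print(f"VULNERABLE: {volume_id} has no snapshot in the last 24 hours")
```

Multiply that by every resource type, region, and account you run, and the case for continuous posture management rather than one-off scripts makes itself.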
Step 4: Test Your Actual Recovery Capabilities
Most disaster recovery plans look great on paper and fail spectacularly in practice. When was the last time you actually tested failing over to a different region? Not a tabletop exercise but an actual test with your production workloads.
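A test only counts if you measure the number that matters: how long until the application actually answers from the standby region. Here's a bare-bones game-day timer, assuming you've already triggered the failover (for example with a DNS change like the one sketched earlier) and that a hypothetical health endpoint exists in the standby region.

```python
import time
import urllib.request

STANDBY_URL = "https://standby.app.example.com/healthz"  # hypothetical endpoint
DRILL_WINDOW_SECONDS = 1800  # give up after 30 minutes

start = time.monotonic()
while time.monotonic() - start < DRILL_WINDOW_SECONDS:
    try:
        with urllib.request.urlopen(STANDBY_URL, timeout=5) as response:
            if response.status == 200:
                elapsed = time.monotonic() - start
                print(f"Standby healthy after {elapsed:.0f}s")
                break
    except OSError:
        pass  # not reachable yet; keep polling
    time.sleep(10)
else:
    print("Standby never became healthy within the drill window")
```

Compare that measured time against the RTO written in your DR plan; the gap between the two is the real finding of the exercise.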
Step 5: Consider Cloud Application Infrastructure Recovery Solutions (CAIRS)
Gartner's recognition of CAIRS underscores the industry's shift toward solutions that restore the entire cloud stack, so organizations can resume operations quickly after an incident. And Firefly is one of only three vendors recognized by Gartner in the CAIRS category for fast, automated recovery through IaC, reducing recovery times from days to minutes. It’s a solution purpose-built for cloud-native infrastructure recovery.
The Moment of Truth Is Coming
Some learned it the hard way; others watched a cautionary tale unfold. Either way, cloud leaders are finally starting to understand that they must pay attention to resiliency, not just data backup. The threat landscape is evolving faster than traditional DR strategies can adapt.
As we saw with AWS, simple human errors can cascade into global outages. The difference between those who stay afloat and those who go down is a comprehensive approach to preparedness: one that covers your data as well as your infrastructure and all of its configurations.
🔗 Dive into an overview of CAIRS solutions and why they matter
🔗 Explore Firefly’s disaster recovery and cloud backup capabilities
🔗 Get the scoop direct from Gartner
