On October 20, 2025, AWS's Northern Virginia region (US-EAST-1) experienced a severe outage lasting nearly 16 hours, disrupting 113 services and taking major platforms like Zoom, DoorDash, Capital One, Coinbase, and Reddit offline. These companies, which collectively spend eight to nine figures annually on backup and disaster recovery solutions, were completely powerless.
The incident exposed an under-acknowledged truth: backups protect data, not infrastructure. And without infrastructure, your data isn’t worth much.
The $100 Million Question Nobody Wants to Ask
Think about that for a moment. Zoom, a platform critical to modern work. DoorDash, serving millions of meals daily. Capital One and Coinbase, handling billions in financial transactions.
- All of them have disaster recovery plans.
- All of them invest heavily in backup solutions.
- All of them went dark when a single DNS configuration error cascaded through AWS's US-EAST-1 region.
The hard truth: backing up your data is great, but on its own it does almost nothing for business continuity. And if some of the biggest, most seemingly well-prepared brands are paying the price for not being disaster-ready, what's in store for the rest?
A Pattern of Failure: AWS Outage History and the Growing DR Risk
This wasn't a one-off event. It's the latest in a string of major outages over the past five years tied to AWS's US-EAST-1 region, which carries much of the internet's traffic.
The historical record tells a troubling story:
- November 2020: A Kinesis data stream failure in US-EAST-1 affected authentication, delivery tracking, and streaming services across the internet.
- December 2021: API Gateway issues in US-EAST-1 caused widespread disruption lasting over seven hours, impacting DynamoDB, EC2, and the AWS Console.
- June 2023: Lambda service disruptions in US-EAST-1 affected serverless applications worldwide.
- July 2024: Another Kinesis event in US-EAST-1 disrupted data streaming services.
And now, October 2025: A DNS race condition in DynamoDB's automated management system cascaded into a day-long outage that exposed the fragility of global internet infrastructure.
The recurring theme? US-EAST-1, Amazon's oldest and largest region, located in Northern Virginia, remains a single point of failure for much of the internet. AWS controls approximately 30% of the global cloud market, which means that when it fails, the ripple effects are widespread.
Enter CAIRS: The Category That Changes Everything
This kind of widespread trouble with disaster recovery is precisely why Gartner created a new category in 2025: Cloud Application Infrastructure Recovery Solutions (CAIRS).
According to Gartner, CAIRS solutions automate discovery, protection, and restoration of full-stack cloud applications: not just data, but infrastructure and configurations, too. For decades, backup and disaster recovery practices focused almost exclusively on safeguarding data, leaving a critical gap: if infrastructure and configurations are compromised, protected data remains inaccessible.
Cloud resiliency is two things: knowing which parts of your cloud are backup-ready, and being able to automatically redeploy both the backups and the infrastructure they run on. But not all solutions make true cloud resiliency possible.
The False Promise of Traditional DR
Here's the dirty little secret about traditional disaster recovery that no one talks about: it was designed for a reality that no longer exists.
Legacy DR solutions assume your infrastructure is static, predictable, and owned by you. They focus obsessively on data backup: creating snapshots, replicating databases, archiving files. But in cloud-native environments, data is only half the equation. Without the infrastructure layer (compute instances, networking configurations, security groups, IAM policies), your backed-up data might as well not exist.

When AWS's US-EAST-1 region failed, companies discovered they couldn't simply "restore from backup." They needed to:
- Spin up infrastructure in a different region
- Reconfigure networking and security
- Update DNS and load balancers
- Restore application state
- Reconnect integrations and dependencies
All while their SLAs were bleeding, and their competitors were gaining ground.
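To make just one of those steps concrete, here's a minimal sketch of "update DNS" done in code. It assumes, purely for illustration, a warm standby already running behind a load balancer in us-west-2 and a Route 53 hosted zone you control; the zone ID, record name, and ALB hostname below are placeholders.

```python
import boto3

# Hypothetical values -- replace with your own hosted zone, record, and standby endpoint.
HOSTED_ZONE_ID = "Z0000000EXAMPLE"
RECORD_NAME = "app.example.com."
STANDBY_ALB = "standby-alb-123456.us-west-2.elb.amazonaws.com"

route53 = boto3.client("route53")

# Point the application record at the standby region's load balancer.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Fail over to us-west-2 during a us-east-1 outage",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": STANDBY_ALB}],
            },
        }],
    },
)
```

Even this simple step has a catch: Route 53's control plane, the API this script calls, lives in US-EAST-1, which is why many teams prefer pre-configured failover routing policies with health checks (evaluated by the globally distributed data plane) over making record changes mid-incident.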
Why ClickOps Is Your Single Point of Failure
The AWS outage exposed another uncomfortable truth: most cloud infrastructure exists as undocumented, manually configured resources that can't be easily replicated or recovered. This is the ClickOps problem.
Engineers log into the AWS console, click through wizards, and deploy resources directly. It's fast, it's intuitive, and it's a disaster waiting to happen.
When those manually configured resources fail (or when an entire region goes dark), how do you recreate them? From memory? From scattered documentation? From screenshots? The organizations recovering fastest and maintaining business continuity best are the ones with infrastructure defined as code, automated deployment pipelines, and multi-region strategies.
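One way to start sizing the ClickOps problem is to compare what's actually running against what your IaC pipeline claims to own. The sketch below is a rough heuristic, not a complete inventory: it assumes your pipeline tags everything it creates with a hypothetical managed-by tag, and it lists EC2 instances in one region that carry no such tag.

```python
import boto3

# Assumption: IaC-managed resources carry a "managed-by" tag (e.g. "terraform").
# Anything without it is a candidate ClickOps resource.
IAC_TAG_KEY = "managed-by"

ec2 = boto3.client("ec2", region_name="us-east-1")

unmanaged = []
for page in ec2.get_paginator("describe_instances").paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if IAC_TAG_KEY not in tags:
                unmanaged.append(instance["InstanceId"])

print(f"{len(unmanaged)} instances with no IaC ownership tag:")
for instance_id in unmanaged:
    print(f"  {instance_id}")
```

The same idea extends to security groups, load balancers, IAM roles, and every other resource type, which is exactly why tooling that automates this discovery exists.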
What Should You Do Right Now?
The answer isn't spending more on traditional backup solutions. It's fundamentally rethinking your approach to cloud resilience:
Step 1: Audit Your Infrastructure Resilience
Can you answer these questions right now? (Hint: If you can't answer confidently, you're vulnerable. A starting-point sketch for the first question follows the list.)
- Which parts of your cloud infrastructure are backed up?
- Can you recreate your entire stack in a different region?
- How long would recovery actually take?
- What percentage of your infrastructure exists as undocumented ClickOps resources?
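For the first question, you don't need a product to get an initial answer; a few API calls will do. Here's a minimal sketch (assuming default boto3 credentials) that checks one slice of it: which DynamoDB tables in a region actually have point-in-time recovery enabled.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Collect every table name in the region.
tables = []
for page in dynamodb.get_paginator("list_tables").paginate():
    tables.extend(page["TableNames"])

# Flag tables without point-in-time recovery enabled.
for table in tables:
    description = dynamodb.describe_continuous_backups(TableName=table)
    status = description["ContinuousBackupsDescription"][
        "PointInTimeRecoveryDescription"]["PointInTimeRecoveryStatus"]
    marker = "OK " if status == "ENABLED" else "!! "
    print(f"{marker}{table}: point-in-time recovery {status}")
```

Repeat the exercise for every service you depend on, and the gaps tend to become obvious quickly.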
Step 2: Adopt Infrastructure-as-Code Everywhere
Every resource in your cloud should be defined as code. Not some. Not most. All of it. This isn't optional anymore. When (not if) the next outage hits, you need to be able to redeploy your entire infrastructure with a single command.
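What does "all of it" look like in practice? Here's a minimal AWS CDK (Python) sketch; the stack contents are a placeholder, but the point is that the target region is just a parameter, so standing the stack up somewhere else is one deploy command rather than an afternoon of console clicks.

```python
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct


class AppStack(cdk.Stack):
    """A trivial stand-in for your real application stack."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Every resource lives in code; this bucket is just a placeholder.
        s3.Bucket(self, "AppData", versioned=True)


app = cdk.App()
# The same definition targets both the primary and the standby region:
#   cdk deploy AppStack-us-east-1
#   cdk deploy AppStack-us-west-2
for region in ["us-east-1", "us-west-2"]:
    AppStack(app, f"AppStack-{region}", env=cdk.Environment(region=region))
app.synth()
```

Whether you use CDK, Terraform, Pulumi, or something else matters far less than the discipline of never creating a resource outside of it.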
Step 3: Implement Cloud Resiliency Posture Management (CRPM)
Don't wait for the next outage to discover you're unprepared. CRPM gives you continuous visibility into which parts of your infrastructure are resilient and which are vulnerable.
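Dedicated CRPM tooling does this continuously, across every account and resource type. But the underlying questions are concrete enough to hand-roll for a single check, which is a useful way to understand what you're buying. A sketch (assuming boto3 credentials, and looking only at EBS volumes) that flags anything without a snapshot in the last 24 hours:

```python
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

# Gather all EBS volume IDs in the region.
volume_ids = [
    volume["VolumeId"]
    for page in ec2.get_paginator("describe_volumes").paginate()
    for volume in page["Volumes"]
]

# Flag volumes with no snapshot newer than the cutoff.
for volume_id in volume_ids:
    snapshots = ec2.describe_snapshots(
        Filters=[{"Name": "volume-id", "Values": [volume_id]}],
        OwnerIds=["self"],
    )["Snapshots"]
    if not any(snapshot["StartTime"] >= cutoff for snapshot in snapshots):
        print(f"VULNERABLE: {volume_id} has no snapshot in the last 24 hours")
```

Multiply that by every resource type, region, and account you run, and the case for continuous posture management rather than one-off scripts makes itself.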
Step 4: Test Your Actual Recovery Capabilities
Most disaster recovery plans look great on paper and fail spectacularly in practice. When was the last time you actually tested failing over to a different region? Not a tabletop exercise but an actual test with your production workloads.
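A test only counts if you measure the number that matters: how long until the application actually answers from the standby region. Here's a bare-bones game-day timer, assuming you've already triggered the failover (for example with a DNS change like the one sketched earlier) and that a hypothetical health endpoint exists in the standby region.

```python
import time
import urllib.request

STANDBY_URL = "https://standby.app.example.com/healthz"  # hypothetical endpoint
DRILL_WINDOW_SECONDS = 1800  # give up after 30 minutes

start = time.monotonic()
while time.monotonic() - start < DRILL_WINDOW_SECONDS:
    try:
        with urllib.request.urlopen(STANDBY_URL, timeout=5) as response:
            if response.status == 200:
                elapsed = time.monotonic() - start
                print(f"Standby healthy after {elapsed:.0f}s")
                break
    except OSError:
        pass  # not reachable yet; keep polling
    time.sleep(10)
else:
    print("Standby never became healthy within the drill window")
```

Compare that measured time against the RTO written in your DR plan; the gap between the two is the real finding of the exercise.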
Step 5: Consider Cloud Application Infrastructure Recovery Solutions (CAIRS)
Gartner's recognition of CAIRS underscores the industry's shift toward solutions that restore the entire cloud stack, so organizations can resume operations quickly after an incident. And Firefly is one of only three vendors recognized by Gartner in the CAIRS category for fast, automated recovery through IaC, reducing recovery times from days to minutes. It’s a solution purpose-built for cloud-native infrastructure recovery.
The Moment of Truth Is Coming
Some learned it the hard way; others watched a cautionary tale unfold. Either way, cloud leaders are finally starting to understand that they must pay attention to resiliency, not just data backup. The threat landscape is evolving faster than traditional DR strategies can adapt.
As we saw with AWS, simple human errors can cascade into global outages. The difference between those who stay afloat and those who go down is a comprehensive approach to preparedness: one that covers your data as well as your infrastructure and all of its configurations.
🔗 Dive into an overview of CAIRS solutions and why they matter
🔗 Explore Firefly’s disaster recovery and cloud backup capabilities
🔗 Get the scoop direct from Gartner
