Recovery time objectives (RTOs) are everywhere: in compliance reports, risk assessments, vendor service-level agreements (SLAs) and disaster recovery plans. They promise a return to operational normalcy within N hours or minutes, whatever target an organization commits to. On paper, these timelines sound reasonable. In practice, they’re often aspirational at best, and dangerously misleading at worst.

RTOs are more than just metrics; they’re promises that intersect directly with reliability, trust and customer impact. Yet most teams are flying blind when it comes to validating whether those promises can actually be kept.

As organizations modernize their infrastructure with Kubernetes, serverless components, multicloud architectures and Infrastructure as Code (IaC), the complexity of restoring full system functionality increases exponentially. Configuration is often fragmented across multiple layers — Terraform, Helm charts, Kubernetes manifests and embedded cloud console settings — making it difficult to reassemble the full picture during an outage.

While teams may test isolated recoveries, few simulate full-stack restoration at production scale. The result? RTOs that look good in audits but collapse under real-world pressure. This article, originally published in The New Stack, is your RTO tell-all.

Why RTOs Fail in Complex Systems

Traditionally, RTOs were defined during tabletop exercises or disaster recovery (DR) drills focused on restoring a small set of core systems, like databases, file servers and VMs. Today’s environments are far more interconnected. A functioning database doesn’t mean your SaaS identity provider is up. A redeployed microservice doesn’t mean its config maps or secrets were correctly restored. It’s easy to restore pieces. It’s hard to restore everything that matters.

One of the most common sources of failure is the assumption that cloud infrastructure can be rehydrated quickly using “just the data.” In reality, recovery requires the full stack: the compute, the networking, the access policies, the DNS, the autoscaling rules, the observability stack and more. If one layer lags — or is missing entirely — your entire RTO commitment becomes meaningless.

The Hidden Gaps in DR Planning

Conversations with dozens of platform engineering and DevOps leaders reveal a common pattern: DR plans often stop at the backup layer. Teams assume they can pull snapshots from S3 or recover databases from a backup tool. What they don’t account for is the reconfiguration time required to stitch everything back together.

Even teams with excellent backup coverage are often missing critical components:

  • Kubernetes workloads that depend on ephemeral configs or secrets.
  • Cloud functions tied to region-specific identity and access management (IAM) policies.
  • IaC templates that were never updated post-deployment.
  • Services that rely on mutable cloud components or SaaS plugins.

Without continuously validated, end-to-end recovery workflows, theoretical RTOs quickly unravel. This was made abundantly clear in several real-world incidents.

When the Clock Starts Ticking: Real RTO Failures

An anonymized case from a FinTech company illustrates the danger of overconfidence. Its DR plan stated a two-hour RTO across regions. But when an IAM misconfiguration led to a cascading production outage, the team quickly discovered that while data backups were intact, IaC didn’t reflect the current production state. Reapplying the old templates introduced inconsistencies, leading to repeated rollbacks. 

The actual recovery time? Just over 11 hours. (And that’s a comparatively good outcome; there have been far worse horror stories with much more widespread impact.)

Another example from a media company involved a partial regional failover. Its Kubernetes cluster was backed up, but restoring across clouds revealed major drift between staging and prod Helm charts. Environment-specific logic was hardcoded, breaking portability. The promised one-hour RTO became a two-day scramble, damaging SLAs and client trust.

Complexity as the RTO Killer

What do these incidents have in common? 

Infrastructure complexity, not data loss, was the primary factor driving prolonged outages. And this complexity often hides in plain sight:

  • Dozens of git repos with IaC fragments and configuration drift.
  • Region- or account-specific customizations that don’t translate across environments.
  • Event-driven systems with opaque dependencies and retry behaviors.
  • Platform teams using staging environments that don’t reflect real production topology.

The more fragmented and layered your system is, the harder it is to recover it under pressure — especially if your DR process is manually triggered or partially documented.

Codifying and Continuously Testing Recovery

So what’s the path forward? 

RTOs need to be redefined through the lens of operational reality and validated through regular, full-system DR rehearsals. This is where IaC and automation come in.

By codifying all layers of your infrastructure — not just compute and storage, but IAM, networking, observability and external dependencies too — you gain the ability to version, test and rehearse your recovery plans. Tools like Terraform, Helm, OpenTofu and Crossplane allow you to build immutable blueprints of your infrastructure, which can be automatically redeployed in disaster scenarios.
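
To make that concrete, here is a minimal sketch of a timed rehearsal: it rebuilds a stack from its codified blueprint in an isolated environment and reports how long the rebuild actually took. The Terraform module path, variable file and Helm chart location are hypothetical placeholders, not a prescribed layout.

```python
"""Minimal sketch of a timed recovery rehearsal: rebuild the stack from its
codified blueprint in an isolated environment and see how long it really takes.
The module path, chart path and variable file below are hypothetical."""

import subprocess
import time

IAC_DIR = "infra/terraform"                    # hypothetical Terraform root module
HELM_CHART = "deploy/charts/app"               # hypothetical Helm chart for workloads
VAR_FILE = "environments/dr-rehearsal.tfvars"  # isolated rehearsal settings


def run(cmd, cwd=None):
    """Run a command, echoing it first, and fail loudly if it breaks."""
    print("$ " + " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)


def rehearse_recovery():
    """Rebuild infrastructure and workloads from code; return elapsed minutes."""
    start = time.monotonic()

    # Infrastructure layer: recreate networking, IAM, clusters, etc. from code.
    run(["terraform", "init", "-input=false"], cwd=IAC_DIR)
    run(["terraform", "apply", "-auto-approve", "-input=false",
         "-var-file=" + VAR_FILE], cwd=IAC_DIR)

    # Workload layer: redeploy from the same charts production uses.
    run(["helm", "upgrade", "--install", "app", HELM_CHART,
         "--namespace", "dr-rehearsal", "--create-namespace", "--wait"])

    return (time.monotonic() - start) / 60


if __name__ == "__main__":
    print(f"Full-stack rehearsal took {rehearse_recovery():.1f} minutes")
```

Even a crude timer like this replaces a number in a spreadsheet with a number you have actually measured.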

But codification alone isn’t enough. Continuous testing is critical. Just as CI/CD pipelines validate application changes, DR validation pipelines should simulate failover scenarios, verify dependency restoration and track real MTTR (Mean Time to Recovery) metrics over time. This moves DR from an annual checkbox to a living, breathing operational function.
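
As a sketch of what one such validation stage could check after a simulated failover, the snippet below verifies that DNS resolves, health endpoints answer and Kubernetes deployments report Available. The hostnames, endpoints and namespace are invented, and the checks are deliberately coarse.

```python
"""Sketch of a post-failover validation stage: confirm the dependencies a
service needs actually came back, not just the data layer. Hostnames,
endpoints and the namespace below are invented placeholders."""

import socket
import subprocess
import urllib.request

DNS_NAMES = ["api.dr-rehearsal.example.com", "db.dr-rehearsal.internal"]
HEALTH_ENDPOINTS = [
    "https://api.dr-rehearsal.example.com/healthz",   # application tier
    "https://auth.dr-rehearsal.example.com/healthz",  # identity provider
]
NAMESPACE = "dr-rehearsal"


def check_dns(name):
    """A restored service nobody can resolve is not restored."""
    try:
        socket.getaddrinfo(name, 443)
        return True
    except socket.gaierror:
        return False


def check_http(url):
    """Require a 2xx answer from each health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False


def check_workloads(namespace):
    """Wait for every deployment in the namespace to report Available."""
    result = subprocess.run(
        ["kubectl", "wait", "--for=condition=Available", "deployment", "--all",
         "-n", namespace, "--timeout=120s"],
    )
    return result.returncode == 0


if __name__ == "__main__":
    failures = [n for n in DNS_NAMES if not check_dns(n)]
    failures += [u for u in HEALTH_ENDPOINTS if not check_http(u)]
    if not check_workloads(NAMESPACE):
        failures.append("deployments in " + NAMESPACE)
    if failures:
        raise SystemExit("DR validation failed: " + ", ".join(failures))
    print("All dependency checks passed")
```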

Measuring the Gap: The Real MTTR

It’s also time to stop relying on aspirational RTOs and start measuring actual MTTR. That’s the metric that matters when things go wrong: how long it really takes to get from incident to resolution. Unlike RTOs, which are often set arbitrarily, MTTR is a tangible, trackable indicator of resilience.

And like any metric, MTTR improves only with practice. Platform teams that run frequent recovery drills and invest in codifying full-stack restoration consistently reduce their MTTR over time. They build confidence, gain visibility into hidden dependencies and reduce the human error factor during high-stress scenarios.
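
A minimal sketch of that measurement is below, using invented incident and drill records: compute MTTR from detection-to-resolution times and compare it against the stated RTO.

```python
"""Minimal sketch of tracking real MTTR from incident and drill records,
rather than quoting an aspirational RTO. The sample records are invented."""

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    name: str
    detected: datetime
    resolved: datetime

    @property
    def time_to_recover(self) -> timedelta:
        return self.resolved - self.detected


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average wall-clock time from detection to resolution."""
    total = sum((i.time_to_recover for i in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    # Invented examples: real numbers would come from your incident tracker.
    history = [
        Incident("regional failover drill Q1",
                 datetime(2024, 2, 10, 9, 0), datetime(2024, 2, 10, 12, 30)),
        Incident("IAM misconfiguration outage",
                 datetime(2024, 4, 2, 14, 0), datetime(2024, 4, 3, 1, 15)),
        Incident("regional failover drill Q2",
                 datetime(2024, 5, 18, 9, 0), datetime(2024, 5, 18, 11, 10)),
    ]
    stated_rto = timedelta(hours=2)
    mttr = mean_time_to_recovery(history)
    print(f"Stated RTO: {stated_rto}, measured MTTR: {mttr}")
    if mttr > stated_rto:
        print("Gap detected: the RTO is aspirational, not validated.")
```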

Practical Recommendations to Close the RTO Gap

If you’re looking to align your disaster recovery posture with reality, here’s where to start:

  • Audit your RTOs: Take a hard look at your current RTOs. Are they based on validated recovery exercises or are they aspirational numbers in a spreadsheet?

  • Map dependencies: Inventory all services, data stores, configs and secrets involved in critical workflows. If it’s not codified, assume it’s missing.

  • Codify the full stack: Move beyond backing up data to backing up your environment state. Use IaC to ensure repeatable, portable infrastructure restoration (see the drift-audit sketch after this list).

  • Integrate testing into CI/CD: Automate recovery drills, failover simulations and post-incident reviews. Track real MTTR and benchmark progress over time.

  • Train for the chaos: Disaster recovery is part process, part muscle memory. Run training days, simulate real failures and ensure every team knows their role.
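
To make the “Codify the full stack” step concrete, here is a hedged sketch of a drift audit: it asks Terraform whether each codified module still matches live state, the exact gap that stretched the FinTech team’s two-hour RTO into an 11-hour recovery. The module paths are hypothetical; the exit-code convention (0 = in sync, 2 = pending changes, anything else = an error) comes from terraform plan -detailed-exitcode.

```python
"""Drift audit sketch: check whether each codified module still matches what is
actually running, so a recovery never replays stale templates. Module paths are
hypothetical; exit codes follow `terraform plan -detailed-exitcode`."""

import subprocess
import sys

IAC_MODULES = ["infra/network", "infra/iam", "infra/kubernetes"]  # placeholders


def plan_exit_code(module_dir):
    """Return Terraform's detailed exit code for a plan against live state."""
    subprocess.run(["terraform", "init", "-input=false"], cwd=module_dir, check=True)
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=module_dir,
    )
    return result.returncode  # 0 = in sync, 2 = drift/unapplied changes, else error


if __name__ == "__main__":
    drifted, errored = [], []
    for module in IAC_MODULES:
        code = plan_exit_code(module)
        if code == 2:
            drifted.append(module)
        elif code != 0:
            errored.append(module)
    if drifted or errored:
        print("Drifted (code no longer matches reality):", drifted)
        print("Plan errors:", errored)
        sys.exit(1)
    print("All modules match live state; the blueprint is safe to rehearse with.")
```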

From Wishful Thinking to Operational Readiness

Seconds of downtime can cost millions, and pretending your systems can recover in two hours when they can’t is more than just bad planning. It’s also a major business risk. Today, codifying your infrastructure, continuously testing recovery and grounding your objectives in reality are no longer optional.

The gap between RTO and reality is fixable, but only if we stop deluding ourselves about what recovery really looks like. That means going beyond data backups to ensuring full coverage of every critical layer: infrastructure, configurations, access controls and dependencies.

Recovery starts not with promises, but with preparation.
