In 2025, managing a SaaS e-commerce platform running on multi-region EKS clusters with microservices spread across AWS accounts means dealing with complex infrastructure. Our core transaction system relies on an RDS PostgreSQL instance and several S3 buckets for order and payment data.
Suppose it’s late at night during peak traffic, and an API Gateway in us-west-2 starts timing out. CloudWatch dashboards show everything green, average service response hovers around 200ms, and no alerts are firing. But when you dig into the [.code]prod-api-gateway[.code] logs, you find 99th percentile latencies spiking to 5 seconds on 5% of requests.
The root cause is a spike in checkout traffic exhausting the RDS connection pool, silently throttling backend services and cascading into gateway timeouts. Connection pool pressure isn’t something a standard “Golden Signals” dashboard of latency, traffic, errors, and saturation surfaces on its own. In today’s containerized, auto-scaling environments, relying solely on these standard signals leaves critical blind spots.
This article outlines 7 infrastructure metrics every DevOps or platform engineer should track to improve operational resilience, reduce incidents, and manage costs effectively. They draw on experience managing 200+ Terraform resources in complex AWS environments and go beyond the basics to catch early warning signs before issues impact users.
Why Do Infrastructure Metrics Matter?
Infrastructure metrics are quantitative measures of your infrastructure’s health, performance, and cost, providing early warnings of degradation in compute, storage, networking, or database resources. Unlike application-level metrics (e.g., API response codes), infrastructure metrics focus on the underlying cloud and Kubernetes components like EKS nodes, RDS instances, S3 buckets, or VPC networks that support your services.
Nowadays, with architectures ranging from serverless functions to container clusters, and an increasing emphasis on cost management through FinOps, these metrics are essential for spotting potential issues early, such as resource limits being approached, configuration mismatches, or unexpected cost spikes, before they affect your users or your organization’s budget. For our multi-region cloud setup, running various microservices on EKS with Terraform-managed resources and GitLab CI/CD pipelines, tracking the right metrics has reduced the downtime of our applications by 25% and incident response time by 40%.
Below are 7 metrics for maintaining reliability, performance, and cost efficiency in a cloud-native setup.
The Core Metrics DevOps Engineers Must Track

1. Saturation (System Resource Pressure)
Saturation measures how heavily your compute resources are being used, specifically CPU, memory, or thread pools, across EKS nodes or EC2 instances. When these reach high usage, the system can’t keep up with new workload requests.
In practical terms, if your nodes are running close to full capacity, say above 85% CPU or with little available memory, Kubernetes can start evicting pods, throttling workloads, or failing to schedule new pods. This directly affects service reliability. In high-traffic environments, like payment gateways or checkout services, high saturation usually leads to latency spikes, timeouts, or failed transactions.
Saturation is one of the four Golden Signals popularized by Google’s SRE practice, and it is tied directly to system availability and operational performance.
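If your EKS node groups are backed by Auto Scaling groups, a simple first guardrail is a CloudWatch alarm on sustained node CPU, managed in the same Terraform as the nodes themselves. Here’s a minimal sketch; the Auto Scaling group name is hypothetical, and in practice you’d pair it with memory metrics from the CloudWatch agent or Container Insights.

```hcl
resource "aws_cloudwatch_metric_alarm" "node_cpu_saturation" {
  alarm_name          = "prod-eks-node-cpu-saturation"
  alarm_description   = "Node CPU above 85% for 15 minutes - scheduling pressure and pod evictions likely"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 85
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"

  dimensions = {
    # Hypothetical Auto Scaling group backing the EKS node group
    AutoScalingGroupName = "prod-api-node-group-asg"
  }
}
```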
2. Infrastructure Drift & Change Frequency
Moving from saturation metrics to infrastructure drift, it’s important to measure how often infrastructure changes are made outside your Infrastructure-as-Code (IaC) pipeline. Drift refers to any shift in cloud resources that isn’t tracked in your Terraform state, which is usually caused by manual edits in the AWS console, scripts, or misconfigured automation.
Take this scenario: your Terraform config defines an EKS node group using AWS t3.large instances. Someone on the team, while debugging a capacity issue, updates the node group's Launch Template in the AWS console to use m5.large instead.
The next time you run [.code]terraform apply[.code], Terraform sees this as a drifted resource. Since the instance type now differs from what’s defined in the module, Terraform attempts to recreate the entire node group with the original t3.large setting. This triggers a replacement of all nodes in that group, which means:
- Kubernetes drains the running pods from the old nodes (gracefully or forcefully, depending on readiness probes and pod disruption budgets).
- Pods scheduled on the new nodes may fail if the required ENIs, IAM roles, or DaemonSets aren’t set up identically.
- Autoscaler policies tuned for [.code]m5.large[.code] performance now behave unpredictably on [.code]t3.large[.code].
- If your workloads use instance-store-backed nodes, you may also lose ephemeral data during the replacement.
- HorizontalPodAutoscaler (HPA) behavior may break due to different CPU performance baselines.
Even worse, if your GitLab CI/CD pipeline runs [.code]terraform plan[.code] and doesn’t explicitly check for drift, this replacement happens during a routine deploy, causing a partial cluster outage without clear warnings. In our case, it led to 502 errors at the ingress and impacted order processing for 30 minutes.
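To make the scenario concrete, here’s a minimal sketch of that node group in Terraform (cluster name, role ARN, and subnet IDs are placeholders). The [.code]instance_types[.code] argument is the source of truth, and a scheduled drift check in CI is what surfaces a hand-edit before a routine deploy does.

```hcl
resource "aws_eks_node_group" "prod_api" {
  cluster_name    = "prod-api-eks"                                 # placeholder cluster name
  node_group_name = "prod-api-workers"
  node_role_arn   = "arn:aws:iam::123456789012:role/prod-eks-node" # placeholder role
  subnet_ids      = ["subnet-0aaa1111bbb22222c", "subnet-0ddd3333eee44444f"]

  # Source of truth for instance size. A hand-edit to m5.large in the console
  # is invisible here until the next plan, which will propose replacing the
  # whole node group to get back to t3.large.
  instance_types = ["t3.large"]

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 6
  }

  # A scheduled `terraform plan -detailed-exitcode` job (exit code 2 means
  # pending changes or drift) surfaces the mismatch before a routine deploy does.
}
```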
And if Terraform isn’t run regularly, say for 10 days or more, these drifted resources sit in a mismatched state, quietly increasing risk. Firefly gives you drift detection out of the box by continuously scanning your cloud environment and Terraform state, so you don’t have to reinvent the wheel when it comes to drift detection and remediation.
Here’s a snapshot from the Firefly Inventory dashboard, showing a detailed view of your cloud assets:

This immediate visibility into your infrastructure lets you remediate drift before it triggers unexpected downtime or costly replacements. Firefly also offers Slack integration, sending automatic updates about any drift occurring in your cloud, so you don’t have to keep checking dashboards to catch it.
3. Latency Percentiles (Not Just Averages)
This metric measures response times at specific percentiles like p95 and p99, not just average latency. Percentiles expose the slowest requests: for example, when the Redis cache used for session tokens misses, the API falls back to a slow database query. That only affects a few requests, but those users are stuck waiting.
If you had only tracked average latency, this issue would’ve stayed hidden.
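If your traffic runs through an ALB, you can alarm directly on a percentile instead of the average. Here’s a minimal Terraform sketch; the load balancer identifier and the 1-second threshold are placeholders you’d tune for your own service.

```hcl
resource "aws_cloudwatch_metric_alarm" "api_p99_latency" {
  alarm_name          = "prod-api-p99-latency"
  alarm_description   = "p99 target response time above 1s for 5 minutes - averages may still look fine"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  extended_statistic  = "p99" # percentile statistic instead of Average
  period              = 60
  evaluation_periods  = 5
  threshold           = 1
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"

  dimensions = {
    # Placeholder ALB identifier: the suffix portion of the load balancer ARN
    LoadBalancer = "app/prod-api-alb/0123456789abcdef"
  }
}
```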
4. Cloud Waste Metrics
Cloud waste happens when you’re paying for resources that aren’t doing useful work or are misconfigured in a way that inflates your bill and risks stability. This includes idle instances running at low CPU, unattached EBS volumes, orphaned Elastic IPs, or autoscaling that spins up more capacity than the workload needs.
For example, your autoscaler might react to a spike in traffic by launching many instances, but if those instances sit idle most of the time or your scaling rules are too aggressive, you’re burning money on unused capacity. Similarly, leftover storage volumes that aren’t attached to any instance keep accumulating charges without adding value.
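A simple way to surface that idle capacity is an alarm that fires when an Auto Scaling group has been coasting for hours. Here’s a hedged Terraform sketch with a hypothetical ASG name and a deliberately conservative threshold:

```hcl
resource "aws_cloudwatch_metric_alarm" "idle_capacity" {
  alarm_name          = "prod-batch-workers-idle-capacity"
  alarm_description   = "Average CPU under 5% for 12 hours - likely over-provisioned or idle"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 3600
  evaluation_periods  = 12
  threshold           = 5
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "notBreaching" # a scaled-to-zero group should not alarm

  dimensions = {
    AutoScalingGroupName = "prod-batch-workers-asg" # hypothetical ASG name
  }
}
```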
This waste also ties directly into reliability. Bursty workloads caused by inefficient autoscaling or repeated retries from backend errors increase cost and make troubleshooting harder.
Firefly’s Analytics dashboard shows these problems in detail. It tracks where cloud spend is going, highlights underutilized or orphaned resources, and links those to error patterns or configuration issues. This lets you pinpoint exactly what’s causing cost spikes and wasted resources, so you can clean up aggressively without risking service availability.
Here’s what you’d see in Firefly:

It tracks 1.0k resources for waste, identifies 588 untagged assets, and highlights $0.16 in cloud waste costs. The dashboard also flags 127 critical violations and tracks shifts in resource usage over time. With this analysis, Firefly helps you address inefficiencies, control costs, and ensure service reliability.
5. MTTR & MTTD
MTTD (Mean Time to Detect) tells you how long it takes for your team to notice a problem after it starts. MTTR (Mean Time to Resolve) tells you how long it takes to fully fix that problem once it’s been detected.
Together, they’re among the most underrated measures of how effective your monitoring and incident response processes really are.
Let’s say a production PostgreSQL RDS instance starts hitting high CPU usage above 85%. This could be caused by a specific query pattern from the frontend that’s scanning too many rows without an index. If you haven’t set up an alert in Prometheus that checks average CPU usage for that database, it might take 30 minutes or more before anyone notices. During that time, requests from your app to the database start backing up. As a result, users might see very slow responses and get errors while using the application.
Now, once someone on your team does notice the alert, if there's no documented troubleshooting process for high RDS CPU usage, the engineer might waste time checking logs, guessing at causes, and trying random changes. A better approach would be to have a clearly written step-by-step response: check slow query logs in CloudWatch, look for missing indexes, and if needed, temporarily scale up the instance or drop traffic from non-critical reporting jobs.
To track MTTD and MTTR, use incident management tools like PagerDuty or Opsgenie to log the exact start and end times of incidents. This data helps calculate how quickly issues are detected and resolved. In Prometheus, set up alerts on critical resources such as your RDS instances, like one that will notify the team when the CPU goes above 80%, giving them a chance to act before it becomes a full outage.
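The example above uses Prometheus; the CloudWatch equivalent, kept next to the rest of your Terraform-managed AWS resources, looks roughly like this. The instance identifier and topic name are placeholders, and PagerDuty or Opsgenie can subscribe to the SNS topic so detection time is stamped automatically.

```hcl
resource "aws_sns_topic" "db_alerts" {
  name = "prod-db-alerts" # PagerDuty or Opsgenie subscribes to this topic
}

resource "aws_cloudwatch_metric_alarm" "rds_cpu_high" {
  alarm_name          = "prod-orders-rds-cpu-high"
  alarm_description   = "RDS CPU above 80% for 10 minutes - check slow query logs and missing indexes"
  namespace           = "AWS/RDS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.db_alerts.arn]

  dimensions = {
    DBInstanceIdentifier = "prod-orders-postgres" # placeholder RDS instance identifier
  }
}
```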
6. Disk I/O and Storage Latency
Disk I/O latency measures how long it takes to complete read or write operations on storage volumes. In AWS, this typically means deriving latency from EBS metrics: total time spent on reads or writes ([.code]VolumeTotalReadTime[.code], [.code]VolumeTotalWriteTime[.code]) divided by the number of operations ([.code]VolumeReadOps[.code], [.code]VolumeWriteOps[.code]). For S3, latency is harder to monitor directly and is usually inferred from application-level performance metrics. High latency usually indicates disk contention, exhausted IOPS credits, or storage misconfiguration, and it will slow down any system that depends on fast disk access, including databases, queues, and application servers.
This matters because even if your CPU and memory are healthy, your application can still crawl if the underlying storage is slow. For example, your API might fetch user data from a database that depends on fast disk reads. If read latency on the EBS volume spikes, queries start to lag, and users begin to see timeouts or long page loads. In microservice environments where databases, caches, and queues all rely on persistent volumes, unmonitored storage latency becomes a silent bottleneck.
In one production case, we saw 50ms average read latency on an EBS volume backing PostgreSQL in prod-payments-eks. This was caused by burst credit exhaustion on a GP2 volume. The result: around 20% of queries were delayed, and key endpoints like /transactions and /payouts saw response times degrade. Since we had no latency alert configured, the issue wasn't caught for nearly an hour, and users were already impacted by the time engineering got involved.
To catch this earlier, you should monitor storage latency just like any other core infrastructure metric. In CloudWatch, use metric math to derive read and write latency from [.code]VolumeTotalReadTime[.code] / [.code]VolumeReadOps[.code] (and the write-side equivalents) and alert when it exceeds 10ms; the same derived values can be surfaced in Prometheus via a CloudWatch exporter.
Don’t wait for users to give feedback. Set up disk I/O and storage latency alerts alongside CPU and memory. It’s often the root cause of slow databases and flaky services, especially in production environments with unpredictable workloads.
7. Network Saturation & Retransmits
Network saturation and TCP retransmit rates measure how well your network is handling traffic within your VPC or through load balancers like an ALB. High retransmit rates indicate packet loss, which often results from network congestion, hardware issues, or noisy neighbors competing for bandwidth.
Packet loss and saturation cause requests to time out and degrade overall application performance, especially for latency-sensitive services.
For example, we observed a 10% TCP retransmit rate on our ALB, which pointed to a noisy neighbor issue in the same availability zone. We identified this using VPC flow logs and network metrics.
To monitor this, track load balancer reset counts in CloudWatch, such as [.code]TCP_Client_Reset_Count[.code] (published for Network Load Balancers in the [.code]AWS/NetworkELB[.code] namespace), and TCP retransmit counters from node_exporter in Prometheus, such as [.code]node_netstat_Tcp_RetransSegs[.code] alongside NIC-level [.code]node_network_transmit_errs_total[.code]. Set alerts when the retransmission rate exceeds 5%. Early detection of rising retransmits helps prevent network-related timeouts and keeps your services responsive.
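If your node metrics already land in Amazon Managed Service for Prometheus, the retransmit alert can be provisioned from Terraform too. This is a sketch under a few assumptions: node_exporter’s netstat collector is enabled, the workspace alias is hypothetical, and 5% over ten minutes is a starting threshold rather than a universal rule.

```hcl
resource "aws_prometheus_workspace" "monitoring" {
  alias = "prod-monitoring" # hypothetical workspace
}

resource "aws_prometheus_rule_group_namespace" "network" {
  name         = "network-health"
  workspace_id = aws_prometheus_workspace.monitoring.id
  data         = <<-EOT
    groups:
      - name: tcp-retransmits
        rules:
          - alert: HighTcpRetransmitRate
            # Share of TCP segments that had to be retransmitted, per node
            expr: |
              rate(node_netstat_Tcp_RetransSegs[5m])
                / rate(node_netstat_Tcp_OutSegs[5m]) > 0.05
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "TCP retransmit rate above 5% on {{ $labels.instance }}"
  EOT
}
```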
Turning Raw Metrics into SLO-Backed Signals
Most infrastructure teams monitor raw metrics like CPU usage or memory consumption, but these numbers often fail to capture what users actually experience. A metric like “Database CPU < 70%” might sound healthy to engineers, but it tells you nothing about whether your app feels fast or reliable to users. That’s where Service Level Indicators (SLIs) and Service Level Objectives (SLOs) come in. They translate noisy system metrics into actionable, user-focused signals.
Instead of monitoring infrastructure for its own sake, SLOs help teams measure what matters: “Are we fast enough? Are we available enough?” For example, a meaningful latency SLO could be: “99% of SELECT queries complete in under 100ms over 28 days.” This ties backend behavior directly to the quality of service your customers feel. You can express this using PromQL, or define it declaratively with tools like Sloth, which lets you generate Prometheus SLOs via YAML, or use platforms like Nobl9 that support multi-source SLO management and alerting.
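As a sketch of what that looks like in raw PromQL, here’s the latency SLI from the example recorded as a 28-day ratio, provisioned through a Terraform-managed Amazon Managed Prometheus rule group; the histogram metric name, job label, 100ms bucket, and workspace ID are all assumptions about how your checkout service exposes latency.

```hcl
resource "aws_prometheus_rule_group_namespace" "checkout_slo" {
  name         = "checkout-latency-slo"
  workspace_id = "ws-12345678-1234-1234-1234-123456789012" # placeholder AMP workspace ID
  data         = <<-EOT
    groups:
      - name: checkout-latency-sli
        rules:
          # Fraction of requests served under 100ms over the 28-day SLO window.
          # Compare the recorded value against the 0.99 objective and alert on burn.
          - record: sli:checkout_latency:ratio_28d
            expr: |
              sum(rate(http_request_duration_seconds_bucket{job="checkout", le="0.1"}[28d]))
                / sum(rate(http_request_duration_seconds_count{job="checkout"}[28d]))
  EOT
}
```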
Monitoring raw metrics and defining SLOs are crucial for your infrastructure, but the real value comes from putting these into practice with automated alerts that notify you before problems impact users. For DevOps teams, disk I/O latency on EBS volumes is a common bottleneck that can silently degrade database and application performance.
To ensure you catch these issues early, the next example walks through how to set up a Terraform-managed CloudWatch alarm for EBS read latency. This practical approach helps you move from theory to action, automating the detection of storage latency spikes so you can respond quickly and keep your systems healthy.
How to Monitor and Alert on EBS I/O Latency Using Terraform
High I/O latency on Amazon EBS volumes, derived from metrics like [.code]VolumeTotalReadTime[.code] (total time spent on read operations) and [.code]VolumeReadOps[.code] (the number of read operations), can silently degrade application performance, especially when these volumes back latency-sensitive workloads like PostgreSQL, MySQL, or MongoDB running on EC2 instances. Instead of reacting after users report slowness, you can proactively monitor EBS read latency and trigger alerts using CloudWatch, all provisioned via Terraform for automation and consistency.
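Below is a minimal sketch of that setup. The volume ID and SNS topic are placeholders, the 10ms threshold matches the guidance from the disk I/O section above, and read latency is computed with CloudWatch metric math since EBS doesn’t publish a latency metric directly.

```hcl
# Hypothetical SNS topic for notifications (PagerDuty, Slack, email, etc.)
resource "aws_sns_topic" "storage_alerts" {
  name = "prod-storage-alerts"
}

resource "aws_cloudwatch_metric_alarm" "ebs_read_latency_high" {
  alarm_name          = "prod-payments-ebs-read-latency"
  alarm_description   = "Average EBS read latency above 10ms for 15 minutes"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  threshold           = 10
  treat_missing_data  = "notBreaching" # idle volumes report no read datapoints
  alarm_actions       = [aws_sns_topic.storage_alerts.arn]

  # Latency isn't a native EBS metric, so derive it with metric math:
  # (total time spent on reads / number of reads) * 1000 = average ms per read.
  metric_query {
    id          = "read_latency_ms"
    expression  = "(read_time / read_ops) * 1000"
    label       = "Average read latency (ms)"
    return_data = true
  }

  metric_query {
    id = "read_time"
    metric {
      namespace   = "AWS/EBS"
      metric_name = "VolumeTotalReadTime"
      period      = 300
      stat        = "Sum"
      dimensions = {
        VolumeId = "vol-0123456789abcdef0" # placeholder: the volume backing PostgreSQL
      }
    }
  }

  metric_query {
    id = "read_ops"
    metric {
      namespace   = "AWS/EBS"
      metric_name = "VolumeReadOps"
      period      = 300
      stat        = "Sum"
      dimensions = {
        VolumeId = "vol-0123456789abcdef0"
      }
    }
  }
}
```

Run [.code]terraform plan[.code] and [.code]terraform apply[.code] as usual; a matching alarm for write latency just swaps in [.code]VolumeTotalWriteTime[.code] and [.code]VolumeWriteOps[.code].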
Firefly: Monitor Infra Drift, Changes, and Cost Without Manual Guesswork
Firefly focuses on the core metrics that actually impact your infrastructure — like saturation, configuration drift, latency spikes, and error budget burn. Tracking these with precision cuts incident response time by 40% and reduces downtime by 25%. Firefly collects telemetry across your cloud accounts and IaC state, then surfaces exactly which resources have drifted and how that affects your cloud spend.
The dashboard below shows real-time data from Firefly’s Unified Drift & Cost view. You can immediately see how many resources are unmanaged or drifted, spot assets with high-severity violations, and understand the dollar impact of cloud waste down to resource type, like AWS GP2 EBS volumes.

This tight integration of drift detection and cost visibility lets teams quickly prioritize remediation efforts, reduce unexpected cloud charges, and prevent outages caused by hidden infrastructure issues.
What’s Next? How Do You Help Your Boss Make Sense of What You’re Tracking?
Tracking the right metrics, but still not sure how to translate the value of cloud to your boss, your execs, and your board? Download our Value of Cloud slide template to start proving the value you’re creating.