
Reliability Metrics

By Atif Alam

Reliability metrics summarize how quickly you detect and recover from failures, and how often failures occur.

They complement SLOs (which define what good looks like) by measuring how you’re doing over time:

  • MTTD (Mean Time To Detection),
  • MTTR (Mean Time To Recovery),
  • MTBF (Mean Time Between Failures).

This page covers definitions, how to measure them, how they relate to the Incident Lifecycle and alerting, and how to use them to track reliability improvements.

MTTD is the average time from when a failure or degradation starts until the team is aware of it (e.g. alert fires, someone notices, incident is declared).

It answers: “How fast do we find out something is wrong?”

  • Why it matters — The longer detection takes, the longer users are impacted and the later mitigation can start. Improving MTTD (better alerting, better dashboards, better symptom coverage) directly reduces time-to-mitigation.
  • How to measure — You need a start time (when the failure began—often approximated by when an SLI first breached or when the first error spike occurred) and a detection time (when the alert fired or the incident was declared). MTTD = mean of (detection time − start time) over a set of incidents. In practice, “start time” is often inferred from telemetry after the fact, so MTTD is easiest to compute for incidents where you have a clear timeline from post-incident reviews.
  • Connection to Incident Lifecycle — Detection is Phase 1 (Detection & Declaration). Alerting design (what you alert on, burn rate, multi-window) directly affects how quickly Phase 1 happens. See Alerting for reducing detection delay and avoiding alert fatigue.
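The MTTD calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the record layout and the field names `started_at` and `detected_at` are assumptions, and the timestamps are made-up sample data of the kind you would pull from post-incident reviews.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident timelines; field names and values are illustrative.
incidents = [
    {"started_at": datetime(2024, 3, 1, 10, 0), "detected_at": datetime(2024, 3, 1, 10, 4)},
    {"started_at": datetime(2024, 3, 9, 22, 30), "detected_at": datetime(2024, 3, 9, 22, 42)},
    {"started_at": datetime(2024, 3, 20, 6, 15), "detected_at": datetime(2024, 3, 20, 6, 17)},
]


def mean_time_to_detection(records):
    """Return MTTD as a timedelta: the mean of (detected_at - started_at)."""
    if not records:
        raise ValueError("need at least one incident")
    delays = [(r["detected_at"] - r["started_at"]).total_seconds() for r in records]
    if any(d < 0 for d in delays):
        # A negative delay usually means the inferred start time is wrong.
        raise ValueError("detection time precedes start time")
    return timedelta(seconds=mean(delays))


mttd = mean_time_to_detection(incidents)
print(f"MTTD: {mttd.total_seconds() / 60:.1f} minutes")
```

Because the start time is usually inferred after the fact, it can help to exclude or flag incidents whose timeline is uncertain rather than let them skew the mean.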

MTTR is the average time from when an incident is detected (or declared) until the service is recovered—e.g. SLIs back in the green, incident resolved.

It answers: “How fast do we fix it?”

  • Why it matters — MTTR drives how long users are impacted. Fast recovery (e.g. rollback, failover, fix) reduces impact and supports availability SLOs. Release Engineering (rollback, feature flags, progressive delivery) and Disaster Recovery (failover, restore) practices are aimed at keeping MTTR low.
  • How to measure — Recovery time = (time when the incident is declared resolved or SLIs are restored) − (time when the incident was detected or declared). MTTR = mean of recovery time over a set of incidents. Use a consistent definition of “resolved” (e.g. the incident commander declares the incident resolved once traffic is stable and SLIs are green) so MTTR is comparable across incidents.
  • Connection to Incident Lifecycle — Recovery spans Phase 3 (Triage & Mitigation) through Phase 5 (Recovery & Verification). Runbooks, rollback paths, and failover procedures all affect MTTR. Track MTTR by incident class (e.g. deploy-related vs dependency failure) to see where to invest.
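As a rough sketch of tracking MTTR by incident class, the example below groups recovery times by a class label before averaging. The field names (`incident_class`, `detected_at`, `resolved_at`), the class labels, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records; field names and values are illustrative.
incidents = [
    {"incident_class": "deploy",
     "detected_at": datetime(2024, 3, 1, 10, 4), "resolved_at": datetime(2024, 3, 1, 10, 24)},
    {"incident_class": "deploy",
     "detected_at": datetime(2024, 3, 9, 22, 42), "resolved_at": datetime(2024, 3, 9, 23, 22)},
    {"incident_class": "dependency",
     "detected_at": datetime(2024, 3, 20, 6, 17), "resolved_at": datetime(2024, 3, 20, 8, 17)},
]


def mttr_by_class(records):
    """Return {incident_class: MTTR}, where MTTR is the mean of (resolved_at - detected_at)."""
    durations = defaultdict(list)
    for r in records:
        recovery = (r["resolved_at"] - r["detected_at"]).total_seconds()
        durations[r["incident_class"]].append(recovery)
    return {cls: timedelta(seconds=mean(vals)) for cls, vals in durations.items()}


result = mttr_by_class(incidents)
for cls, mttr in sorted(result.items()):
    print(f"{cls}: MTTR {mttr.total_seconds() / 60:.0f} minutes")
```

In this sample, dependency failures take much longer to recover from than deploy-related incidents, which is the kind of signal that points at where to invest.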

Often “MTTR” is used loosely to mean “time to recover”; in some contexts it’s defined as “Mean Time To Repair” (fix the root cause).

For consistency with impact and SLOs, use recovery (service restored) unless you explicitly mean repair. Document your definition when you report MTTR.

MTBF is the average time between the start of one failure (or incident) and the start of the next.

It answers: “How often do we have incidents?”

  • Why it matters — Higher MTBF means failures are less frequent. It reflects both the inherent stability of the system and the effectiveness of prevention (better testing, safer deploys, fewer config errors, capacity headroom). MTBF complements MTTR: you care both how often you fail and how long each failure lasts.
  • How to measure — For each incident, you need a “start” time (when the failure began or when the incident was declared). MTBF = (total time period) / (number of incidents in that period), or the mean of (start of incident N+1 − start of incident N) over consecutive incidents. The two estimates are close when there are many incidents but can differ noticeably for small samples, so pick one and report it consistently. Use a consistent incident definition (e.g. every declared SEV-1/SEV-2, or every incident that consumed error budget) so MTBF is meaningful.
  • Relation to SLOs — Availability SLOs are related to both frequency and duration of failure: more frequent or longer outages burn more error budget. Improving MTBF (fewer incidents) and MTTR (shorter incidents) both improve availability. See Error Budgets for how budget connects to reliability work.
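The two ways of measuring MTBF mentioned above can be compared side by side. The sketch below uses hypothetical incident start times and a hypothetical 90-day reporting period; the function names are illustrative only.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical reporting period (90 days) and incident start times.
period_start = datetime(2023, 1, 1)
period_end = datetime(2023, 4, 1)
incident_starts = [datetime(2023, 1, 10), datetime(2023, 2, 9), datetime(2023, 3, 21)]


def mtbf_from_period(start, end, incident_count):
    """MTBF as (total time period) / (number of incidents in that period)."""
    if incident_count == 0:
        raise ValueError("no incidents in period; MTBF is undefined")
    return (end - start) / incident_count


def mtbf_from_gaps(starts):
    """MTBF as the mean gap between consecutive incident start times."""
    ordered = sorted(starts)
    if len(ordered) < 2:
        raise ValueError("need at least two incidents to measure a gap")
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(ordered, ordered[1:])]
    return timedelta(seconds=mean(gaps))


by_period = mtbf_from_period(period_start, period_end, len(incident_starts))
by_gaps = mtbf_from_gaps(incident_starts)
print(f"period / count: {by_period.days} days, mean gap: {by_gaps.days} days")
```

With only three incidents the two estimates differ (30 days versus 35 days here), which is one reason to fix a single method before comparing periods.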

MTBF is sometimes used in hardware/reliability engineering with a different definition (time between failures of a component).

In an operational context, “between failures” usually means between incidents affecting the service.
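The definition matters when you turn these metrics into an availability estimate. The sketch below assumes the operational definition used on this page (MTBF measured from the start of one incident to the start of the next) and treats the whole detect-plus-recover window as downtime, so the unavailable fraction is roughly (MTTD + MTTR) / MTBF. The function name and the sample values are illustrative, and the result is an approximation, not a substitute for measuring SLIs directly.

```python
from datetime import timedelta


def estimated_availability(mttd, mttr, mtbf_start_to_start):
    """Approximate availability as 1 - (MTTD + MTTR) / MTBF.

    Assumes MTBF is measured start-to-start, so each MTBF-long cycle
    contains one impact window of length MTTD + MTTR.
    """
    impact = mttd + mttr
    if mtbf_start_to_start <= impact:
        raise ValueError("MTBF must exceed the mean impact window")
    return 1 - impact / mtbf_start_to_start


availability = estimated_availability(
    mttd=timedelta(minutes=6),
    mttr=timedelta(minutes=54),
    mtbf_start_to_start=timedelta(days=30),
)
print(f"Estimated availability: {availability:.4%}")
```

If MTBF is instead measured as uptime between failures (the classical hardware convention), the corresponding estimate is MTBF / (MTBF + downtime per failure), so the two definitions should not be mixed in one report.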

  • MTTD + MTTR — Because MTTR is measured from detection, time to detect plus time to recover gives the total time users are impacted. Reducing either one reduces user impact. In the Incident Lifecycle, detection corresponds to Phase 1 and recovery to Phases 3–5.
  • MTBF — Tracks how often you’re in incident mode. Use it to see if incidents are becoming more or less frequent and to prioritize prevention (e.g. deploy safety, dependency resilience).
  • Trend over time — Track MTTD, MTTR, and MTBF (e.g. monthly or quarterly) to see if reliability is improving. Report them in reliability reviews and SLO reviews so the team can tie actions (better alerting, runbooks, rollback, capacity) to outcomes.
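One way to produce the monthly trend described above is to bucket incidents by the month in which they started and compute the metrics per bucket. This is a sketch under assumptions: the field names and sample incidents are invented, and incident count per month is used as a simple stand-in for frequency.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names and values are illustrative.
incidents = [
    {"started_at": datetime(2024, 1, 5, 9, 0), "detected_at": datetime(2024, 1, 5, 9, 10),
     "resolved_at": datetime(2024, 1, 5, 10, 10)},
    {"started_at": datetime(2024, 1, 19, 14, 0), "detected_at": datetime(2024, 1, 19, 14, 20),
     "resolved_at": datetime(2024, 1, 19, 14, 50)},
    {"started_at": datetime(2024, 2, 8, 3, 0), "detected_at": datetime(2024, 2, 8, 3, 5),
     "resolved_at": datetime(2024, 2, 8, 3, 25)},
]


def monthly_trend(records):
    """Return {"YYYY-MM": {"incidents", "mttd_min", "mttr_min"}} ordered by month."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["started_at"].strftime("%Y-%m")].append(r)
    trend = {}
    for month in sorted(buckets):
        rows = buckets[month]
        detect = [(r["detected_at"] - r["started_at"]).total_seconds() for r in rows]
        recover = [(r["resolved_at"] - r["detected_at"]).total_seconds() for r in rows]
        trend[month] = {
            "incidents": len(rows),
            "mttd_min": mean(detect) / 60,
            "mttr_min": mean(recover) / 60,
        }
    return trend


trend = monthly_trend(incidents)
for month, row in trend.items():
    print(month, row)
```

Months with very few incidents produce noisy means, so quarterly buckets may be a better fit for low-incident services.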

Reliability metrics are inputs to prioritization: if MTTD is high, invest in alerting and detection; if MTTR is high, invest in runbooks, rollback, and failover; if MTBF is low, invest in prevention (deploy safety, testing, capacity).
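The prioritization rule above can be written as a small check against targets. The target values below are hypothetical placeholders, not recommendations; each team would set its own, and the function name is illustrative.

```python
from datetime import timedelta

# Hypothetical targets for illustration only; choose values that fit your service.
TARGETS = {
    "mttd": timedelta(minutes=5),
    "mttr": timedelta(minutes=30),
    "mtbf": timedelta(days=14),
}


def suggest_investments(mttd, mttr, mtbf, targets=TARGETS):
    """Return the investment areas suggested by metrics that miss their targets."""
    suggestions = []
    if mttd > targets["mttd"]:
        suggestions.append("alerting and detection")
    if mttr > targets["mttr"]:
        suggestions.append("runbooks, rollback, and failover")
    if mtbf < targets["mtbf"]:
        suggestions.append("prevention (deploy safety, testing, capacity)")
    return suggestions


areas = suggest_investments(
    mttd=timedelta(minutes=12),
    mttr=timedelta(minutes=20),
    mtbf=timedelta(days=7),
)
print(areas)
```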