SRE Runbook Template

First PublishedApr 9, 2026ByAtif Alam

A runbook is a step-by-step procedure for a known scenario (see Runbooks and Playbooks for the distinction from playbooks). The sections below are a practical skeleton you can copy into your wiki or repo; trim or extend per service. Optimize for someone paged at 3am, not for a leisurely read.

1. Header and metadata

Title — Specific scenario (e.g. “High 5xx rate on checkout API”).
Owner — Team or named owner responsible for accuracy.
Last updated — Date (and optionally review cadence, e.g. quarterly or after every related incident).
Severity — Typical severity when this runbook applies, if helpful.
Linked alert / paging policy — Which alert routes here; link to on-call rotation or escalation group if your tool supports it.
Where this doc lives — Wiki URL, git path, or “source of truth” so copies do not diverge.

2. Overview

What this runbook covers — Symptom or failure mode in one short paragraph.
Service or system — Name, boundaries, critical dependencies.
Business impact — What degrades for users or internal customers if this issue is real (latency, failed orders, data staleness).
Out of scope (optional) — e.g. “Not for physical data center access” or “Not for client-side app bugs.”
Dependencies — Other teams, vendors, or services often involved—so responders know when to hand off early.

3. Detection and symptoms

How the issue surfaces — Alert names, dashboard links, typical log patterns or metrics.
Normal vs broken — Baseline (“p99 ~40ms”) vs failure (“p99 over 2s sustained 5m”).
Noise vs signal — Known false positives, flapping, or noisy alerts; when to wait one evaluation vs act immediately.
Correlation — Check recent deploys, config changes, or change calendar; many incidents follow a change event.

Link alerts to this runbook from your alerting tool so the page is one click from the page.

4. Impact assessment

Who or what is affected — Regions, tenants, feature flags, percentage of traffic.
Blast radius — How big could this get if unchecked (see multi-region ideas for geographic scope).
SLO / SLA / error budget — Whether this scenario burns budget or breaches a customer commitment.
When this becomes an incident — Pointer to Severity and classification so responders know when to declare and pull in comms or leadership.

5. Triage steps

Quick checks — Short, numbered diagnostics (commands, URLs) with expected output when possible.
Decision tree — “If X → go to section / runbook Y”; “If not X → continue.”
Stopping rule — When to stop triage and escalate (time box, failed check, missing access).
Trace and correlation IDs — How to pull request IDs or trace IDs from logs so you can follow one path through the system.

6. Remediation steps

Numbered, unambiguous steps — No assumed expert knowledge; define terms or link to glossary once.
Commands — Show expected output or “success looks like …” for critical steps.
Preflight — Preconditions (maintenance window, quorum, backups) before destructive actions.
Rollback — If a fix can worsen things: how to revert config, flag, or deployment; who approves. For release-driven rollback criteria and change-window planning (what to document before a deploy), see Rollback plans—this runbook template stays incident-centric.
Secrets — Never paste credentials in the runbook. Link to your vault or secret manager and name the key reference.
Disruption — If remediation needs a window or customer-visible change, note it and link Communicating during incidents when user impact is possible.

Automate stable, repeated steps over time; keep human approval for risky changes—see Toil and automation.

7. Escalation path

When to escalate — Time thresholds (e.g. “no progress in 15 minutes”), severity, or safety criteria.
Technical escalation — Next on-call line, platform team, or subject-matter expert; how to page (tool, Slack).
Non-engineering — When to involve vendor support, executive, or communications (customer-facing status)—align with On-call and escalation.

8. Post-incident

Incident ticket — Link pattern for Jira, PagerDuty incident, or internal tracker.
Postmortem — Pointer to your template or Post-incident review process.
Update this runbook — If steps were wrong, incomplete, or order mattered in a way the doc did not capture, edit the runbook as part of follow-up (blameless).
Feedback — Optional “Was this helpful?” link or owner inbox for continuous improvement.

9. References

Architecture — Diagram or link to system design doc.
Related runbooks — Link, do not duplicate overlapping procedures.
Dashboards and logs — Deep links to saved views.
Code and config repos — Repositories or IaC paths that matter for this scenario.
Status page / comms — Where customer-facing status is updated if applicable.
Internal incident channel — Template for Slack/Teams incident room if your org uses one.

Principles that separate great from good

Principle	Meaning
Actionable over explanatory	Steps and checks, not essays. Long context belongs in linked docs.
Tested	Runbooks rot. Validate in game-days and drills or fire drills; update when reality changes.
Linked, not duplicated	One canonical procedure; other runbooks point here instead of copying steps.
Audience-aware	Written for a tired on-call with standard access, not only for senior experts.