SRE Runbook Template
A runbook is a step-by-step procedure for a known scenario (see Runbooks and Playbooks for the distinction from playbooks). The sections below are a practical skeleton you can copy into your wiki or repo; trim or extend per service. Optimize for someone paged at 3am, not for a leisurely read.
1. Header and metadata
- Title — Specific scenario (e.g. “High 5xx rate on checkout API”).
- Owner — Team or named owner responsible for accuracy.
- Last updated — Date (and optionally review cadence, e.g. quarterly or after every related incident).
- Severity — Typical severity when this runbook applies, if helpful.
- Linked alert / paging policy — Which alert routes here; link to on-call rotation or escalation group if your tool supports it.
- Where this doc lives — Wiki URL, git path, or “source of truth” so copies do not diverge.
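If runbooks live in a repo, the header fields above can be kept as structured data so tooling can flag stale documents. The following is a minimal sketch, not a prescribed schema: the class name, field names, and every example value (team, date, alert name, path) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RunbookHeader:
    """Header metadata for one runbook (hypothetical schema)."""
    title: str
    owner: str
    last_updated: date
    review_every_days: int  # e.g. 90 for a quarterly review cadence
    severity: str
    linked_alert: str
    source_of_truth: str

    def is_stale(self, today: date) -> bool:
        """True when the runbook has gone longer than its review cadence without an update."""
        return today - self.last_updated > timedelta(days=self.review_every_days)


header = RunbookHeader(
    title="High 5xx rate on checkout API",
    owner="payments-sre",                             # hypothetical team
    last_updated=date(2024, 1, 15),                   # hypothetical date
    review_every_days=90,
    severity="SEV2",                                  # hypothetical label
    linked_alert="checkout-5xx-rate",                 # hypothetical alert name
    source_of_truth="runbooks/checkout/high-5xx.md",  # hypothetical path
)
```

A scheduled job could call `header.is_stale(date.today())` for every runbook and notify the owner, which keeps the "Last updated" field honest.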
2. Overview
- What this runbook covers — Symptom or failure mode in one short paragraph.
- Service or system — Name, boundaries, critical dependencies.
- Business impact — What degrades for users or internal customers if this issue is real (latency, failed orders, data staleness).
- Out of scope (optional) — e.g. “Not for physical data center access” or “Not for client-side app bugs.”
- Dependencies — Other teams, vendors, or services often involved—so responders know when to hand off early.
3. Detection and symptoms
- How the issue surfaces — Alert names, dashboard links, typical log patterns or metrics.
- Normal vs broken — Baseline (“p99 ~40ms”) vs failure (“p99 over 2s sustained 5m”).
- Noise vs signal — Known false positives, flapping, or noisy alerts; when to wait one evaluation vs act immediately.
- Correlation — Check recent deploys, config changes, or change calendar; many incidents follow a change event.
Link alerts to this runbook from your alerting tool so the runbook is one click away from the page.
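The "normal vs broken" and "noise vs signal" bullets both come down to requiring a breach to be sustained before acting. Here is a rough sketch of that idea, assuming one p99 sample per minute; the function name and sample values are illustrative, and real alerting tools usually provide this as a built-in "for" or "duration" setting.

```python
def sustained_breach(samples_ms, threshold_ms, window):
    """Return True only if the most recent `window` samples all exceed the threshold.

    A single spike, or a metric that flaps above and below the threshold,
    does not count as a sustained breach.
    """
    if window <= 0:
        raise ValueError("window must be a positive number of samples")
    if len(samples_ms) < window:
        return False  # not enough data yet to call it sustained
    return all(sample > threshold_ms for sample in samples_ms[-window:])


# Hypothetical p99 latency samples, one per minute.
broken = [40, 45, 2100, 2300, 2500, 2200, 2400]
flapping = [40, 2100, 38, 2300, 41]

print(sustained_breach(broken, threshold_ms=2000, window=5))    # prints True
print(sustained_breach(flapping, threshold_ms=2000, window=5))  # prints False
```

Writing the threshold and window into the runbook next to the baseline lets a responder tell at a glance whether the alert reflects a real failure or known flapping.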
4. Impact assessment
- Who or what is affected — Regions, tenants, feature flags, percentage of traffic.
- Blast radius — How big could this get if unchecked (see multi-region ideas for geographic scope).
- SLO / SLA / error budget — Whether this scenario burns budget or breaches a customer commitment.
- When this becomes an incident — Pointer to Severity and classification so responders know when to declare and pull in comms or leadership.
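To make the error budget bullet concrete, the arithmetic can be sketched as below. This assumes a request-based availability SLO; the function names and the example numbers are hypothetical, and your SLO tooling may define the window or the burn rate differently.

```python
def _check_target(slo_target: float) -> None:
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1, e.g. 0.999")


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left in the window; negative means overspent."""
    _check_target(slo_target)
    if total <= 0:
        raise ValueError("total must be a positive request count")
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


def burn_rate(slo_target: float, total: int, failed: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window;
    higher values exhaust it proportionally faster.
    """
    _check_target(slo_target)
    if total <= 0:
        return 0.0  # no traffic, nothing burned
    return (failed / total) / (1.0 - slo_target)


# Hypothetical window: 99.9% SLO, 1,000,000 requests, 250 failures.
remaining = error_budget_remaining(0.999, total=1_000_000, failed=250)
rate = burn_rate(0.999, total=1_000_000, failed=250)
print(f"budget remaining: {remaining:.0%}, burn rate: {rate:.2f}")
```

With a 99.9% target, 1,000,000 requests allow about 1,000 failures, so 250 failures leave roughly three quarters of the budget.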
5. Triage steps
- Quick checks — Short, numbered diagnostics (commands, URLs) with expected output when possible.
- Decision tree — “If X → go to section / runbook Y”; “If not X → continue.”
- Stopping rule — When to stop triage and escalate (time box, failed check, missing access).
- Trace and correlation IDs — How to pull request IDs or trace IDs from logs so you can follow one path through the system.
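The quick checks, decision tree, and stopping rule can be pictured as one small loop: run each check in order, compare against its expected output, follow the branch on the first mismatch, and escalate if the time box runs out. The sketch below illustrates that shape only; the `Check` type, the probe functions, and the messages are hypothetical, and real probes would call your own commands or URLs.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Check:
    name: str
    probe: Callable[[], str]  # returns the observed output
    expected: str             # what "healthy" looks like
    on_fail: str              # where the decision tree sends the responder


def run_triage(checks: Sequence[Check], time_box_s: float,
               clock: Callable[[], float] = time.monotonic) -> str:
    """Run checks in order; stop at the first mismatch or when the time box expires."""
    start = clock()
    for check in checks:
        if clock() - start > time_box_s:
            return "escalate: time box exceeded"
        observed = check.probe()
        if observed != check.expected:
            return (f"{check.name}: expected {check.expected!r}, "
                    f"got {observed!r} -> {check.on_fail}")
    return "all checks passed: continue to next section"


# Hypothetical probes standing in for real commands or health URLs.
checks = [
    Check("health endpoint", lambda: "ok", "ok", "go to remediation steps"),
    Check("database", lambda: "timeout", "connected", "see database runbook"),
]
print(run_triage(checks, time_box_s=900))
# prints: database: expected 'connected', got 'timeout' -> see database runbook
```

Keeping `expected` and `on_fail` next to each check mirrors what the runbook should state in prose: what good output looks like, and where to go when it is not.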
6. Remediation steps
- Numbered, unambiguous steps — No assumed expert knowledge; define terms or link to glossary once.
- Commands — Show expected output or “success looks like …” for critical steps.
- Preflight — Preconditions (maintenance window, quorum, backups) before destructive actions.
- Rollback — If a fix can worsen things: how to revert config, flag, or deployment; who approves. For release-driven rollback criteria and change-window planning (what to document before a deploy), see Rollback plans—this runbook template stays incident-centric.
- Secrets — Never paste credentials in the runbook. Link to your vault or secret manager and name the key reference.
- Disruption — If remediation needs a window or customer-visible change, note it and link Communicating during incidents when user impact is possible.
Automate stable, repeated steps over time; keep human approval for risky changes—see Toil and automation.
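When a remediation step does get automated, the preflight, success check, and rollback bullets above suggest a natural structure. The following is a sketch under stated assumptions: `remediate`, the feature flag name, and the in-memory `flags` dictionary are all hypothetical stand-ins, and approval for risky changes would still happen outside this function.

```python
from typing import Callable, Sequence, Tuple

Preflight = Tuple[str, Callable[[], bool]]  # (description, check)


def remediate(preflight: Sequence[Preflight],
              apply: Callable[[], object],
              verify: Callable[[], bool],
              rollback: Callable[[], object]) -> str:
    """Apply a fix only if every precondition holds; revert if verification fails."""
    failed = [name for name, check in preflight if not check()]
    if failed:
        return "aborted: preflight failed (" + ", ".join(failed) + ")"
    apply()
    if verify():
        return "success"
    rollback()
    return "rolled back: verification failed"


# Hypothetical example: disable a feature flag held in an in-memory dict.
flags = {"new_checkout_flow": True}

result = remediate(
    preflight=[("backup exists", lambda: True)],
    apply=lambda: flags.update(new_checkout_flow=False),
    verify=lambda: flags["new_checkout_flow"] is False,
    rollback=lambda: flags.update(new_checkout_flow=True),
)
print(result)  # prints "success"
```

The design choice worth copying into prose runbooks is the ordering: nothing destructive runs until every precondition passes, and each step names its own success criterion and revert path.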
7. Escalation path
- When to escalate — Time thresholds (e.g. “no progress in 15 minutes”), severity, or safety criteria.
- Technical escalation — Next on-call line, platform team, or subject-matter expert; how to page (tool, Slack).
- Non-engineering — When to involve vendor support, executive, or communications (customer-facing status)—align with On-call and escalation.
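A time-based escalation policy can be written down as data so there is no ambiguity about who should be involved by when. This is a minimal sketch: only the 15-minute threshold comes from the example above, while the 30-minute tier and the target names are assumptions for illustration. Paging tools typically implement this natively as an escalation policy.

```python
from datetime import timedelta

# (time without progress, who should be involved by then) - hypothetical tiers
ESCALATION_POLICY = [
    (timedelta(minutes=0), "primary on-call"),
    (timedelta(minutes=15), "secondary on-call"),
    (timedelta(minutes=30), "platform team"),
]


def escalation_targets(time_without_progress, policy=ESCALATION_POLICY):
    """Return everyone who should have been paged after this long without progress."""
    return [target for threshold, target in policy
            if time_without_progress >= threshold]


print(escalation_targets(timedelta(minutes=20)))
# prints ['primary on-call', 'secondary on-call']
```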
8. Post-incident
- Incident ticket — Link pattern for Jira, PagerDuty incident, or internal tracker.
- Postmortem — Pointer to your template or Post-incident review process.
- Update this runbook — If steps were wrong, incomplete, or order mattered in a way the doc did not capture, edit the runbook as part of follow-up (blameless).
- Feedback — Optional “Was this helpful?” link or owner inbox for continuous improvement.
9. References
- Architecture — Diagram or link to system design doc.
- Related runbooks — Link, do not duplicate overlapping procedures.
- Dashboards and logs — Deep links to saved views.
- Code and config repos — Repositories or IaC paths that matter for this scenario.
- Status page / comms — Where customer-facing status is updated if applicable.
- Internal incident channel — Template for Slack/Teams incident room if your org uses one.
Principles that separate great from good
| Principle | Meaning |
|---|---|
| Actionable over explanatory | Steps and checks, not essays. Long context belongs in linked docs. |
| Tested | Runbooks rot. Validate in game days or fire drills; update when reality changes. |
| Linked, not duplicated | One canonical procedure; other runbooks point here instead of copying steps. |
| Audience-aware | Written for a tired on-call with standard access, not only for senior experts. |
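One cheap guard against rot is a structural check that every runbook still contains the sections from this template. The sketch below assumes runbooks are plain-text or markdown files whose heading lines end with the section name; the function name and matching rule are illustrative, and a structural check complements drills rather than replacing them.

```python
REQUIRED_SECTIONS = [
    "Header and metadata",
    "Overview",
    "Detection and symptoms",
    "Impact assessment",
    "Triage steps",
    "Remediation steps",
    "Escalation path",
    "Post-incident",
    "References",
]


def missing_sections(runbook_text: str) -> list:
    """Return template sections with no matching heading line in the runbook."""
    lines = [line.strip().lower() for line in runbook_text.splitlines()]
    return [
        section for section in REQUIRED_SECTIONS
        if not any(line.endswith(section.lower()) for line in lines)
    ]


# Hypothetical, deliberately incomplete runbook.
draft = """1. Header and metadata
- Title: High 5xx rate on checkout API
2. Overview
5. Triage steps
"""
print(missing_sections(draft))
```

Run in CI against the runbook directory, a non-empty result can fail the build or open a ticket for the owning team.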
See also
- Runbooks and Playbooks — Conceptual overview and playbook distinction.
- DR planning and testing — DR runbooks and test cadence.
- Failover and Failback — Structure for failover-style runbooks.