Post-Incident Review
A repeatable 10-step checklist for a solid post-mortem / root-cause analysis (RCA). Each step calls out what you do, the key outputs, and why it matters.
10-Step Checklist
| Step | What You Do | Key Outputs | Why It Matters |
|---|---|---|---|
| 1 — Trigger & Timing | Decide this incident meets your post-mortem criteria (e.g., SEV-1, ≥ 5% SLO burn). Schedule the session within 24–72 h of resolution. | Calendar invite; unique doc link | Fresh memories give better data and keep a blameless tone. |
| 2 — Assemble the Crew | IC nominates Facilitator, Scribe, and SMEs. Include anyone who paged or pushed a button. | Attendee list, communication channel | Shared context avoids blind spots and blame games. |
| 3 — State the Facts First | Pull logs, metrics, chat transcripts, deploy history; freeze them in the doc. | Evidence bundle | Stops hindsight bias and “he said/she said.” |
| 4 — Build a Precise Timeline | Minute-by-minute: trigger → detection → mitigation → recovery. | Timeline table | Enables gap analysis (detection delay, escalation hops). |
| 5 — Describe Impact | Customer / business effect (error rate, revenue hit, user complaints). | SLO diff, $ or user minutes lost | Shows why the fix is worth doing. |
| 6 — Root-Cause & Contributing Factors | Use 5 Whys, fish-bone, or fault-tree; also list conditions that let it slip through. | RC statement, factor list | Focuses on system weakness, not people. |
| 7 — What Went Well / Didn’t | Call out tools, runbooks, comms that helped—or hurt. | Two bullet lists | Reinforces good practice and surfaces toil. |
| 8 — Corrective & Preventive Actions | SMART tasks with owner + due date + status (code fix, alert, test, doc). | Action-item table tracked in Jira / backlog | Turns lessons into reliability capital. |
| 9 — Approval & Publish | Tech lead or Bar-Raiser reviews for clarity and completeness; doc goes to the shared post-mortem repo / runbooks. | Signed-off PDF / Confluence page | Keeps tribal knowledge searchable and auditable. |
| 10 — Follow-through | Check action-item burn-down in sprint retro; reopen if deadlines slip. | KPI: action completion rate | Prevents “write-and-forget” post-mortems. |
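
The gap analysis that step 4 enables can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the timestamps are invented, and the `phase_gaps` helper and the four milestone names are assumptions taken from the trigger → detection → mitigation → recovery sequence in the table.

```python
from datetime import datetime, timedelta

# Hypothetical timeline for illustration only; these timestamps are invented.
TIMELINE = [
    ("trigger", "2024-01-10T14:02:00"),
    ("detection", "2024-01-10T14:19:00"),
    ("mitigation", "2024-01-10T14:41:00"),
    ("recovery", "2024-01-10T15:05:00"),
]


def phase_gaps(timeline):
    """Return the elapsed time between consecutive timeline milestones."""
    parsed = [(name, datetime.fromisoformat(ts)) for name, ts in timeline]
    gaps = {}
    for (prev_name, prev_ts), (name, ts) in zip(parsed, parsed[1:]):
        # A milestone that precedes the previous one signals a data-entry error.
        if ts < prev_ts:
            raise ValueError(f"{name} is earlier than {prev_name}")
        gaps[f"{prev_name} -> {name}"] = ts - prev_ts
    return gaps


gaps = phase_gaps(TIMELINE)
for label, delta in gaps.items():
    print(f"{label}: {int(delta.total_seconds() // 60)} min")

# Total incident duration, from trigger to recovery.
total = sum(gaps.values(), timedelta())
print(f"total: {int(total.total_seconds() // 60)} min")
```

The largest gap is usually the first place to look: a long trigger-to-detection interval points at alerting, while a long detection-to-mitigation interval points at escalation or runbooks.
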
Guiding principles
- Blameless by default: Assume everyone made the best decision they could with the information they had. This encourages full disclosure and yields better data.
- Data > opinions: Screenshots, Grafana exports, and chat logs trump memory.
- Time-boxed & facilitated: A neutral moderator and a limit of 60–90 minutes keep the session productive.
- Actionable & tracked: A post-mortem without ticketed follow-ups is just a story.
- Share widely: Transparency multiplies learning across teams (post-mortem culture).
30-second summary
We run a blameless post-mortem within 48 hours of resolution. The facilitator gathers evidence → builds the timeline and impact → digs into root cause with 5 Whys → captures SMART corrective actions with owners and due dates → publishes and tracks them in Jira until closed. The goal is learning and prevention, not finger-pointing.
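
The tracking half of that summary (steps 8 and 10 of the checklist) could look roughly like the sketch below. It assumes action items have been exported from the tracker into plain records; the `ActionItem` fields, the status values, and the sample data are hypothetical and do not reflect any Jira API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    status: str  # assumed values: "open", "in_progress", "done"


def completion_rate(items):
    """Fraction of action items marked done (0.0 for an empty list)."""
    if not items:
        return 0.0
    return sum(1 for item in items if item.status == "done") / len(items)


def overdue(items, today):
    """Items that are not done and whose due date has already passed."""
    return [item for item in items if item.status != "done" and item.due < today]


# Hypothetical action items for illustration only.
ITEMS = [
    ActionItem("Add alert on queue depth", "alice", date(2024, 2, 1), "done"),
    ActionItem("Add regression test for config parser", "bob", date(2024, 2, 8), "in_progress"),
    ActionItem("Update failover runbook", "carol", date(2024, 2, 15), "open"),
    ActionItem("Patch retry logic", "dave", date(2024, 2, 5), "done"),
]

today = date(2024, 2, 10)
rate = completion_rate(ITEMS)
late = overdue(ITEMS, today)

print(f"completion rate: {rate:.0%}")
print("overdue:", [item.title for item in late])
```

Reviewing the overdue list in each sprint retro is one way to apply the "reopen if deadlines slip" rule from step 10.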