Post-Incident Review

First published by Atif Alam

A repeatable 10-step checklist for a solid post-mortem / root-cause analysis (RCA). Each step calls out what you do, the key outputs, and why it matters.

| # | What You Do | Key Outputs | Why It Matters |
|---|---|---|---|
| 1. Trigger & Timing | Decide this incident meets your post-mortem criteria (e.g., SEV-1, ≥ 5% SLO burn). Schedule the session within 24–72 h of resolution. | Calendar invite; unique doc link | Fresh memories give better data and keep a blameless tone. |
| 2. Assemble the Crew | IC nominates Facilitator, Scribe, and SMEs. Include anyone who paged or pushed a button. | Attendee list, communication channel | Shared context avoids blind spots and blame games. |
| 3. State the Facts First | Pull logs, metrics, chat transcripts, deploy history; freeze them in the doc. | Evidence bundle | Stops hindsight bias and “he said/she said.” |
| 4. Build a Precise Timeline | Minute-by-minute: trigger → detection → mitigation → recovery. | Timeline table | Enables gap analysis (detection delay, escalation hops). |
| 5. Describe Impact | Customer / business effect (error rate, revenue hit, user complaints). | SLO diff, $ or user minutes lost | Shows why the fix is worth doing. |
| 6. Root-Cause & Contributing Factors | Use 5 Whys, fish-bone, or fault-tree; also list conditions that let it slip through. | RC statement, factor list | Focuses on system weakness, not people. |
| 7. What Went Well / Didn’t | Call out tools, runbooks, comms that helped or hurt. | Two bullet lists | Reinforces good practice and surfaces toil. |
| 8. Corrective & Preventive Actions | SMART tasks with owner + due date + status (code fix, alert, test, doc). | Action-item table tracked in Jira / backlog | Turns lessons into reliability capital. |
| 9. Approval & Publish | Tech lead or Bar-Raiser reviews for clarity and completeness; doc goes to the shared post-mortem repo / runbooks. | Signed-off PDF / Confluence page | Keeps tribal knowledge searchable and auditable. |
| 10. Follow-through | Check action-item burn-down in sprint retro; reopen if deadlines slip. | KPI: action completion rate | Prevents “write-and-forget” post-mortems. |
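
The gap analysis that step 4 enables can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the timestamps are invented for the example, and the phase names simply mirror the trigger → detection → mitigation → recovery sequence from the table.

```python
from datetime import datetime

# Hypothetical timeline; every timestamp here is made up for illustration.
timeline = {
    "trigger":    datetime(2024, 1, 15, 10, 0),
    "detection":  datetime(2024, 1, 15, 10, 12),
    "mitigation": datetime(2024, 1, 15, 10, 40),
    "recovery":   datetime(2024, 1, 15, 11, 5),
}

# Phase order taken from step 4 of the checklist.
PHASES = ["trigger", "detection", "mitigation", "recovery"]


def phase_gaps(timeline):
    """Return minutes elapsed between each pair of consecutive phases."""
    gaps = {}
    for start, end in zip(PHASES, PHASES[1:]):
        delta = timeline[end] - timeline[start]
        gaps[f"{start} -> {end}"] = delta.total_seconds() / 60
    return gaps


gaps = phase_gaps(timeline)
for name, minutes in gaps.items():
    print(f"{name}: {minutes:.0f} min")
# trigger -> detection: 12 min
# detection -> mitigation: 28 min
# mitigation -> recovery: 25 min
```

The first gap is the detection delay. A large value there usually points at alerting rather than at the response itself.
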
  • Blameless by default: Assume everyone made the best decision with the info they had. This encourages full disclosure and yields better data.
  • Data > opinions: Screenshots, Grafana exports, and chat logs trump memory.
  • Time-boxed & facilitated: Capping the session at 60–90 minutes with a neutral moderator keeps it productive.
  • Actionable & tracked: A post-mortem without ticketed follow-ups is just a story.
  • Share widely: Transparency multiplies learning across teams (post-mortem culture).
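
To make "actionable and tracked" concrete, the action-item table from step 8 and the completion-rate KPI from step 10 could be modeled roughly as follows. This is a sketch under assumptions: the `ActionItem` fields, the `"open"`/`"done"` status values, and the sample items are all hypothetical, and a real setup would read this data from the ticket tracker instead of a hard-coded list.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    status: str  # hypothetical values: "open" or "done"


def completion_rate(items):
    """Fraction of action items marked done (0.0 for an empty list)."""
    if not items:
        return 0.0
    return sum(1 for i in items if i.status == "done") / len(items)


def overdue(items, today):
    """Items not yet done whose due date has passed: candidates to reopen."""
    return [i for i in items if i.status != "done" and i.due < today]


# Invented sample data, for illustration only.
items = [
    ActionItem("Add alert on queue depth", "alice", date(2024, 2, 1), "done"),
    ActionItem("Add rollback test", "bob", date(2024, 2, 8), "open"),
    ActionItem("Update runbook", "carol", date(2024, 3, 1), "open"),
    ActionItem("Fix retry config", "dave", date(2024, 2, 5), "done"),
]
today = date(2024, 2, 15)

rate = completion_rate(items)
late = overdue(items, today)
print(f"completion rate: {rate:.0%}")        # completion rate: 50%
print("overdue:", [i.title for i in late])   # overdue: ['Add rollback test']
```
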

We run a blameless post-mortem within 48 hours of resolution. The facilitator gathers evidence → builds the timeline and impact → digs into root cause with 5 Whys → captures SMART corrective actions with owners and due dates → publishes them and tracks them in Jira until closed. The goal is learning and prevention, not finger-pointing.
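
The trigger criteria and scheduling window from step 1 can also be written down as a small check, which removes ambiguity about when a review is owed. The sketch below uses the example thresholds from the checklist (SEV-1, ≥ 5% SLO burn, 24–72 h after resolution). The function names are hypothetical, and reading "SLO burn" as a percentage of error budget consumed is an assumption.

```python
from datetime import datetime, timedelta

# Example threshold from step 1; tune to your own post-mortem criteria.
SLO_BURN_THRESHOLD_PCT = 5.0


def needs_postmortem(severity, slo_burn_pct):
    """True if the incident is SEV-1 or burned at least the threshold percentage."""
    return severity == "SEV-1" or slo_burn_pct >= SLO_BURN_THRESHOLD_PCT


def review_window(resolved_at):
    """Earliest and latest session time: 24 h and 72 h after resolution."""
    return resolved_at + timedelta(hours=24), resolved_at + timedelta(hours=72)


# Hypothetical incident resolved at 11:05 on 15 Jan 2024.
resolved = datetime(2024, 1, 15, 11, 5)
earliest, latest = review_window(resolved)
print(needs_postmortem("SEV-2", 6.0))  # True
print(earliest, "to", latest)
```

A 48-hour target, as in the summary above, sits in the middle of this window.
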