Post-Incident Review

First PublishedFeb 10, 2026ByAtif Alam

A repeatable 10-step checklist for a solid post-mortem / root-cause analysis (RCA). Each step calls out what you do, the key outputs, and why it matters.

10-Step Checklist

#	What You Do	Key Outputs	Why It Matters
1 — Trigger & Timing	Decide this incident meets your post-mortem criteria (e.g., SEV-1, ≥ 5% SLO burn). Schedule the session within 24–72 h of resolution.	Calendar invite; unique doc link	Fresh memories give better data and keep a blameless tone.
2 — Assemble the Crew	IC nominates Facilitator, Scribe, and SMEs. Include anyone who paged or pushed a button.	Attendee list, communication channel	Shared context avoids blind spots and blame games.
3 — State the Facts First	Pull logs, metrics, chat transcripts, deploy history; freeze them in the doc.	Evidence bundle	Stops hindsight bias and “he said/she said.”
4 — Build a Precise Timeline	Minute-by-minute: trigger → detection → mitigation → recovery.	Timeline table	Enables gap analysis (detection delay, escalation hops).
5 — Describe Impact	Customer / business effect (error rate, revenue hit, user complaints).	SLO diff, $ or user minutes lost	Shows why the fix is worth doing.
6 — Root-Cause & Contributing Factors	Use 5 Whys, fish-bone, or fault-tree; also list conditions that let it slip through.	RC statement, factor list	Focuses on system weakness, not people.
7 — What Went Well / Didn’t	Call out tools, runbooks, comms that helped—or hurt.	Two bullet lists	Reinforces good practice and surfaces toil.
8 — Corrective & Preventive Actions	SMART tasks with owner + due date + status (code fix, alert, test, doc).	Action-item table tracked in Jira / backlog	Turns lessons into reliability capital.
9 — Approval & Publish	Tech lead or Bar-Raiser reviews for clarity and completeness; doc goes to the shared post-mortem repo / runbooks.	Signed-off PDF / Confluence page	Keeps tribal knowledge searchable and auditable.
10 — Follow-through	Check action-item burn-down in sprint retro; reopen if deadlines slip.	KPI: action completion rate	Prevents “write-and-forget” post-mortems.

Guiding principles

Blameless by default — Assume everyone made the best decision with the info they had. Encourages full disclosure and better data.
Data > opinions — Screenshots, Grafana exports, and chat logs trump memory.
Time-boxed & facilitated — 60–90 minutes max with a neutral moderator keeps it productive.
Actionable & tracked — A post-mortem without ticketed follow-ups is just a story.
Share widely — Transparency multiplies learning across teams (post-mortem culture).

30-second summary

We run a blameless post-mortem within 48 hours. Facilitator gathers evidence → builds the timeline and impact → digs into root cause with 5 Whys → captures SMART corrective actions with owners and due dates → publishes and tracks them in Jira until closed. The goal is learning and prevention, not finger-pointing.