Incident Lifecycle

First Published By Atif Alam

A field-tested sequence for SRE/DevOps on-call.

Each step calls out what to do, why it matters, and the concrete artifacts you should leave behind.

| # | Phase | What Success Looks Like | Key Actions & Artifacts |
|---|-------|-------------------------|-------------------------|
| 0 | Preparation (always-on) | People, playbooks and telemetry are ready before anything breaks. | Clear on-call rotation and escalation matrix; severity matrix, incident channel template, runbooks, dashboards; regular game-days & tool fire-drills. |
| 1 | Detection & Declaration | The first responder spots the symptom and explicitly “calls the incident.” | Monitoring/alert fires, engineer verifies; declare severity (SEV-x), open incident ticket / Slack channel. |
| 2 | Assign Roles (ICS / 3 Cs) | Everyone knows Coordinate, Communicate, Control roles. | Incident Commander (IC), Operations Lead (OL), Communications Lead (CL). |
| 3 | Triage & Mitigation | Stop the bleeding and shrink blast radius. | Gather current state → form hypotheses; execute safest mitigation first (rollback, failover, feature-flag); IC keeps a running timeline. For rollback and feature flag strategies, see Release Engineering. For region or site failover, see Disaster Recovery. |
| 4 | Internal & External Comms | Stakeholders know impact, action, ETA; no rumor mill. | CL posts regular updates (e.g., every 15 min); status-page entries, customer success briefings. |
| 5 | Recovery & Verification | Service Level Indicators (SLIs) back in the green. | Gradual traffic ramp-up / canary; IC declares “Resolved” when steady-state proven. |
| 6 | Blameless Post-Incident Review (PIR) | Learning is captured, not buried. | Within 24–72 h, write postmortem (timeline, impact, root cause(s), contributing factors, “could this have been detected sooner?”). |
| 7 | Action-Items & Continuous Improvement | The same failure mode can’t bite twice. | Track remediation tasks in backlog with owners and due dates; trend analysis of incident classes to guide larger investments. |
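Phases 1 through 4 leave behind a few concrete artifacts: a declared severity, named roles, a running timeline, and regular stakeholder updates. The sketch below shows one way those could be kept in a single record. It is only an illustration: the `Incident` class, its field names, and the 15-minute default cadence (taken from the "e.g., every 15 min" example in the table) are assumptions, not part of any particular incident tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; names and fields are illustrative only."""
    title: str
    severity: str  # e.g. "SEV-2", declared in phase 1
    roles: dict = field(default_factory=dict)     # role name -> person
    timeline: list = field(default_factory=list)  # (timestamp, note) pairs
    last_update: Optional[datetime] = None        # last stakeholder update

    def log(self, note: str, at: Optional[datetime] = None) -> None:
        """Append a timestamped entry to the running timeline."""
        self.timeline.append((at or datetime.now(timezone.utc), note))

    def assign(self, role: str, person: str, at: Optional[datetime] = None) -> None:
        """Record who holds a role (IC, OL, CL) and note it on the timeline."""
        self.roles[role] = person
        self.log(f"{role} assigned to {person}", at)

    def post_update(self, message: str, at: Optional[datetime] = None) -> None:
        """Record a stakeholder update and remember when it went out."""
        at = at or datetime.now(timezone.utc)
        self.last_update = at
        self.log(f"UPDATE: {message}", at)

    def update_overdue(self, now: datetime,
                       cadence: timedelta = timedelta(minutes=15)) -> bool:
        """True if no update has gone out within the assumed cadence."""
        if self.last_update is None:
            return True
        return now - self.last_update > cadence
```

A Communications Lead could poll `update_overdue` as a reminder, while the timeline entries feed directly into the postmortem written in phase 6.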

For how to define severity and when to declare an incident, see Severity and Classification. For on-call rotation and escalation design, see On-Call and Escalation.

  • Role clarity cuts cognitive load; the Incident Command System scales from a two-person page to a cross-org SEV-0.
  • Rapid mitigation beats root-cause hunting during the fire; root cause can wait until systems are stable.
  • Blameless culture encourages full disclosure of mistakes, giving you the data needed for systemic fixes.
  • Postmortem action tracking turns lessons into reliability capital instead of “shelved docs.”
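Action tracking only pays off if open items stay visible. As a minimal sketch, assuming remediation tasks can be exported from the backlog with an owner and a due date, the following lists the ones that have slipped. The `ActionItem` type and the `overdue` helper are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical remediation task from a postmortem."""
    summary: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Return open items whose due date has passed, oldest first."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)
```

Reviewing the output of `overdue` in a regular reliability meeting is one way to keep postmortem follow-ups from turning into shelved docs.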

P-D-R-M-C-R-P-A

Prepare → Detect/Declare → Roles → Mitigate → Communicate → Restore → Postmortem → Action-items.
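For tooling that needs to tag an incident with its current phase, the sequence can be encoded as an ordered enumeration. This is a sketch under assumptions: the `Phase` enum and the `next_phase` and `mnemonic` helpers are illustrative names, with values chosen to match the phase numbers 0 to 7 used in the table.

```python
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    """Lifecycle phases, numbered as in the phase table (0 to 7)."""
    PREPARE = 0
    DETECT_DECLARE = 1
    ROLES = 2
    MITIGATE = 3
    COMMUNICATE = 4
    RESTORE = 5
    POSTMORTEM = 6
    ACTION_ITEMS = 7


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last one."""
    if current is Phase.ACTION_ITEMS:
        return None
    return Phase(current + 1)


def mnemonic() -> str:
    """Build the mnemonic from the first letter of each phase name."""
    return "-".join(phase.name[0] for phase in Phase)
```

In practice communication runs alongside mitigation rather than strictly after it, so the ordering here is a memory aid and not a rule that phases never overlap.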

For the detailed post-incident phase (phase 6), including a repeatable 10-step post-mortem checklist and guiding principles, see Post-incident review.