
Game-Days and Drills

By Atif Alam

A chaos experiment tests a component. A game-day tests the whole system—technology, people, and process together.

Game-days are scheduled exercises where you simulate a failure scenario and walk through the response end-to-end: detection, diagnosis, mitigation, communication, and recovery.

They answer questions that automated experiments can’t: Do the right people get paged? Can the on-call follow the runbook? Does the team know how to communicate during an incident?

|  | Game-Day | Drill |
| --- | --- | --- |
| Scope | Broad: end-to-end failure scenario involving multiple teams or systems | Narrow: single procedure or tool (e.g. “restore this database from backup”) |
| Duration | 1–4 hours | 15–60 minutes |
| Participants | Cross-functional: engineering, on-call, sometimes leadership and comms | Usually one team or one person |
| Goal | Validate the full response chain: detection → response → recovery → communication | Validate that a specific runbook step or tool works |

Both are valuable: drills build muscle memory for individual procedures, and game-days test how those procedures fit together under pressure.

To plan a game-day:

  1. Pick a Scenario. Choose a realistic failure that would have high impact. Examples:

    • Primary database becomes unavailable.
    • A major dependency returns errors for 30 minutes.
    • A region goes down and you need to fail over.
    • A bad deploy causes elevated error rates for a critical API.
  2. Define Objectives. What are you testing? Examples:

    • “Can the on-call detect the problem within 5 minutes?”
    • “Does the failover runbook work end-to-end?”
    • “Can we communicate status to stakeholders within 10 minutes?”
  3. Assign Roles. At minimum:

    • Facilitator — Runs the exercise, injects the failure, keeps things on track. Does not participate as a responder.
    • Observers — Take notes on what happens, what goes well, and where things break down.
    • Responders — The team that would normally respond to this incident. They should not know the scenario in advance (or at least not the details).
  4. Set Safety Boundaries. Will you inject real failures or simulate them? If injecting, define abort conditions (see Chaos Experiments — Blast Radius Control). If simulating, prepare mock alerts, dashboards, or narrated scenarios.

  5. Notify Stakeholders. Let affected teams and leadership know a game-day is happening so real alerts during the exercise don’t cause confusion.
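For the “inject real failures” path, the injection itself can be as small as a wrapper around a dependency client. A minimal sketch, assuming a hypothetical client object with a `call()` method; the error rate, fault window, and abort flag are illustrative, with the abort flag standing in for the abort conditions from step 4:

```python
import random
import time

class FaultInjector:
    """Sketch: make a wrapped dependency return errors for a window
    (the 'major dependency returns errors for 30 minutes' scenario).
    The wrapped client is a hypothetical stand-in."""

    def __init__(self, client, error_rate=1.0, duration_s=1800):
        self.client = client
        self.error_rate = error_rate  # fraction of calls to fail
        self.fault_until = time.monotonic() + duration_s
        self.aborted = False

    def abort(self):
        # Facilitator calls this when an abort condition is met.
        self.aborted = True

    def call(self, *args, **kwargs):
        # Fail the configured fraction of calls while the window is
        # open and the exercise has not been aborted.
        in_window = not self.aborted and time.monotonic() < self.fault_until
        if in_window and random.random() < self.error_rate:
            raise RuntimeError("injected fault: dependency unavailable")
        return self.client.call(*args, **kwargs)
```

The abort path matters as much as the injection path: the facilitator must be able to stop the failure instantly if real-user impact exceeds the agreed blast radius.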

With planning done, run the exercise:

  • Inject or Simulate the Failure. Start the clock.
  • Let the Team Respond. Don’t coach—observe. The value is in seeing how the team actually responds, not how they respond with hints.
  • Track Timeline. Note when the alert fired, when the on-call responded, when the runbook was opened, when mitigation was applied, when recovery was confirmed.
  • Enforce Time Limits. If the team is stuck for too long, the facilitator can offer a hint or end the exercise. A game-day that runs 4 hours without progress isn’t productive.
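The timeline tracking above needs nothing more than a timestamped event log. A minimal sketch (the milestone names are illustrative, not a standard):

```python
from datetime import datetime, timezone

class GameDayTimeline:
    """Records timestamped milestones during the exercise and
    computes elapsed time between them for the debrief."""

    def __init__(self):
        self.events = []  # list of (milestone_name, timestamp)

    def mark(self, name, ts=None):
        # ts override is for testing; normally use the wall clock.
        self.events.append((name, ts or datetime.now(timezone.utc)))

    def elapsed(self, start="failure_injected", end=None):
        # Seconds between two recorded milestones; end defaults to
        # the most recently recorded event.
        times = dict(self.events)
        end_ts = times[end] if end else self.events[-1][1]
        return (end_ts - times[start]).total_seconds()
```

An observer marking `failure_injected`, `alert_fired`, `runbook_opened`, `mitigation_applied`, and `recovery_confirmed` gives the debrief hard numbers for detection and recovery time instead of recollections.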

Run a debrief (similar to a post-incident review) within a day:

  • What Worked? — Detection was fast, runbook was accurate, communication was clear.
  • What Didn’t? — Alert didn’t fire, runbook was outdated, nobody knew who to escalate to, recovery took too long.
  • Action Items — Update the runbook, fix the alert, add a missing dashboard, schedule a follow-up drill on the weak area.

The debrief is the most valuable part. A game-day without a debrief is a missed opportunity.

| Exercise Type | Suggested Cadence | Notes |
| --- | --- | --- |
| Focused drills (single runbook or tool) | Monthly or quarterly | Low overhead, high value for muscle memory |
| Team game-day (one team, one scenario) | Quarterly | Core practice for on-call teams |
| Cross-team game-day (multi-team, complex scenario) | Annually or semi-annually | Higher coordination cost; tests organizational response |
| DR exercise (failover, backup restore) | See DR Planning and Testing | Specific cadence for disaster recovery scenarios |

Adjust based on your team’s maturity and how often your architecture changes.

New services or major redesigns are good triggers for an unscheduled game-day.

Game-days frequently reveal:

  • Runbooks Are Outdated. The infrastructure changed but the runbook wasn’t updated. Steps reference old tools or missing dashboards.
  • Detection Is Slow. The failure happened but nobody noticed for 15 minutes because the alert was too noisy or didn’t exist.
  • Escalation Paths Are Unclear. The on-call didn’t know who to page next, or the escalation contact was out of date.
  • Communication Gaps. Stakeholders weren’t updated, or updates were too technical or too vague.
  • Recovery Is Harder Than Expected. The “simple failover” turns out to require manual steps nobody practiced.
  • Tool Access Issues. Someone needed to access a system but didn’t have the right credentials or permissions.
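The last finding is cheap to catch before the exercise rather than during it. A minimal preflight sketch that checks required CLI tools are on a responder’s PATH; the tool list is an assumption, and a real preflight would also verify credentials, VPN access, and dashboard permissions:

```python
import shutil

# Illustrative list; substitute the tools your runbooks actually invoke.
REQUIRED_TOOLS = ["kubectl", "psql"]

def preflight_tool_check(tools=REQUIRED_TOOLS):
    """Return the subset of required CLI tools missing from PATH,
    so access gaps surface before the game-day, not mid-incident."""
    return [t for t in tools if shutil.which(t) is None]
```

Running this for each responder the day before the exercise turns “I can’t log in” from a game-day finding into a pre-exercise fix.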

DR testing (failover drills, backup restores) is a specialized form of game-day focused on disaster recovery scenarios.

DR Planning and Testing covers DR-specific exercises, including backup restore cadence and failover drills.

The practices overlap: a DR failover drill is a game-day with a DR scenario.

Use whichever framing fits your organization, but make sure both your operational failure scenarios (a dependency is slow, a deploy is bad) and your disaster scenarios (a region is down, data is corrupted) get exercised regularly.