
Game-Days and Drills

By Atif Alam

A chaos experiment tests a component. A game-day tests the whole system—technology, people, and process together.

Game-days are scheduled exercises where you simulate a failure scenario and walk through the response end-to-end: detection, diagnosis, mitigation, communication, and recovery.

They answer questions that automated experiments can’t: Do the right people get paged? Can the on-call follow the runbook? Does the team know how to communicate during an incident?

|  | Game-Day | Drill |
| --- | --- | --- |
| Scope | Broad: end-to-end failure scenario involving multiple teams or systems | Narrow: single procedure or tool (e.g. “restore this database from backup”) |
| Duration | 1–4 hours | 15–60 minutes |
| Participants | Cross-functional: engineering, on-call, sometimes leadership and comms | Usually one team or one person |
| Goal | Validate the full response chain: detection → response → recovery → communication | Validate that a specific runbook step or tool works |

Both are valuable: drills build muscle memory for individual procedures, and game-days test how those procedures fit together under pressure.

To plan a game-day:

  1. Pick a Scenario. Choose a realistic failure that would have high impact. Examples:

    • Primary database becomes unavailable.
    • A major dependency returns errors for 30 minutes.
    • A region goes down and you need to fail over.
    • A bad deploy causes elevated error rates for a critical API.
  2. Define Objectives. What are you testing? Examples:

    • “Can the on-call detect the problem within 5 minutes?”
    • “Does the failover runbook work end-to-end?”
    • “Can we communicate status to stakeholders within 10 minutes?”
  3. Assign Roles. At minimum:

    • Facilitator — Runs the exercise, injects the failure, keeps things on track. Does not participate as a responder.
    • Observers — Take notes on what happens, what goes well, and where things break down.
    • Responders — The team that would normally respond to this incident. They should not know the scenario in advance (or at least not the details).
  4. Set Safety Boundaries. Will you inject real failures or simulate them? If injecting, define abort conditions (see Chaos Experiments — Blast Radius Control). If simulating, prepare mock alerts, dashboards, or narrated scenarios.

  5. Notify Stakeholders. Let affected teams and leadership know a game-day is happening so real alerts during the exercise don’t cause confusion.
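For the “inject real failures” path, the injection itself can be as small as a wrapper around a dependency client. A minimal sketch, assuming a hypothetical client object with a `call()` method; the error rate, fault window, and abort flag are illustrative, with the abort flag standing in for the abort conditions from step 4:

```python
import random
import time

class FaultInjector:
    """Sketch: make a wrapped dependency return errors for a window
    (the 'major dependency returns errors for 30 minutes' scenario).
    The wrapped client is a hypothetical stand-in."""

    def __init__(self, client, error_rate=1.0, duration_s=1800):
        self.client = client
        self.error_rate = error_rate  # fraction of calls to fail
        self.fault_until = time.monotonic() + duration_s
        self.aborted = False

    def abort(self):
        # Facilitator calls this when an abort condition is met.
        self.aborted = True

    def call(self, *args, **kwargs):
        # Fail the configured fraction of calls while the window is
        # open and the exercise has not been aborted.
        in_window = not self.aborted and time.monotonic() < self.fault_until
        if in_window and random.random() < self.error_rate:
            raise RuntimeError("injected fault: dependency unavailable")
        return self.client.call(*args, **kwargs)
```

The abort path matters as much as the injection path: the facilitator must be able to stop the failure instantly if real-user impact exceeds the agreed blast radius.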

With planning done, run the exercise:

  • Inject or Simulate the Failure. Start the clock.
  • Let the Team Respond. Don’t coach—observe. The value is in seeing how the team actually responds, not how they respond with hints.
  • Track Timeline. Note when the alert fired, when the on-call responded, when the runbook was opened, when mitigation was applied, when recovery was confirmed.
  • Enforce Time Limits. If the team is stuck for too long, the facilitator can offer a hint or end the exercise. A game-day that runs 4 hours without progress isn’t productive.
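The timeline tracking above needs nothing more than a timestamped event log. A minimal sketch (the milestone names are illustrative, not a standard):

```python
from datetime import datetime, timezone

class GameDayTimeline:
    """Records timestamped milestones during the exercise and
    computes elapsed time between them for the debrief."""

    def __init__(self):
        self.events = []  # list of (milestone_name, timestamp)

    def mark(self, name, ts=None):
        # ts override is for testing; normally use the wall clock.
        self.events.append((name, ts or datetime.now(timezone.utc)))

    def elapsed(self, start="failure_injected", end=None):
        # Seconds between two recorded milestones; end defaults to
        # the most recently recorded event.
        times = dict(self.events)
        end_ts = times[end] if end else self.events[-1][1]
        return (end_ts - times[start]).total_seconds()
```

An observer marking `failure_injected`, `alert_fired`, `runbook_opened`, `mitigation_applied`, and `recovery_confirmed` gives the debrief hard numbers for detection and recovery time instead of recollections.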

Run a debrief (similar to a post-incident review) within a day:

  • What Worked? — Detection was fast, runbook was accurate, communication was clear.
  • What Didn’t? — Alert didn’t fire, runbook was outdated, nobody knew who to escalate to, recovery took too long.
  • Action Items — Update the runbook, fix the alert, add a missing dashboard, schedule a follow-up drill on the weak area.

The debrief is the most valuable part. A game-day without a debrief is a missed opportunity.

| Exercise Type | Suggested Cadence | Notes |
| --- | --- | --- |
| Focused drills (single runbook or tool) | Monthly or quarterly | Low overhead, high value for muscle memory |
| Team game-day (one team, one scenario) | Quarterly | Core practice for on-call teams |
| Cross-team game-day (multi-team, complex scenario) | Annually or semi-annually | Higher coordination cost; tests organizational response |
| DR exercise (failover, backup restore) | See DR Planning and Testing | Specific cadence for disaster recovery scenarios |

Adjust based on your team’s maturity and how often your architecture changes.

New services or major redesigns are good triggers for an unscheduled game-day.

Game-days frequently reveal:

  • Runbooks Are Outdated. The infrastructure changed but the runbook wasn’t updated. Steps reference old tools or missing dashboards.
  • Detection Is Slow. The failure happened but nobody noticed for 15 minutes because the alert was too noisy or didn’t exist.
  • Escalation Paths Are Unclear. The on-call didn’t know who to page next, or the escalation contact was out of date.
  • Communication Gaps. Stakeholders weren’t updated, or updates were too technical or too vague.
  • Recovery Is Harder Than Expected. The “simple failover” turns out to require manual steps nobody practiced.
  • Tool Access Issues. Someone needed to access a system but didn’t have the right credentials or permissions.
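The last finding is cheap to catch before the exercise rather than during it. A minimal preflight sketch that checks required CLI tools are on a responder’s PATH; the tool list is an assumption, and a real preflight would also verify credentials, VPN access, and dashboard permissions:

```python
import shutil

# Illustrative list; substitute the tools your runbooks actually invoke.
REQUIRED_TOOLS = ["kubectl", "psql"]

def preflight_tool_check(tools=REQUIRED_TOOLS):
    """Return the subset of required CLI tools missing from PATH,
    so access gaps surface before the game-day, not mid-incident."""
    return [t for t in tools if shutil.which(t) is None]
```

Running this for each responder the day before the exercise turns “I can’t log in” from a game-day finding into a pre-exercise fix.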

DR testing (failover drills, backup restores) is a specialized form of game-day focused on disaster recovery scenarios.

DR Planning and Testing covers DR-specific exercises, including backup restore cadence and failover drills.

The practices overlap: a DR failover drill is a game-day with a DR scenario.

Use whichever framing fits your organization, but make sure both your operational failure scenarios (a dependency is slow, a deploy is bad) and your disaster scenarios (a region is down, data is corrupted) get exercised regularly.