Chaos Experiments
A chaos experiment is a controlled test where you deliberately inject a failure into a system and observe how it responds.
The goal is not to break things—it’s to verify that the system handles failure the way you expect.
Every chaos experiment starts with a hypothesis: “We believe that when X fails, the system will do Y.”
If it does, you’ve validated your resilience. If it doesn’t, you’ve found a gap to fix—before users find it for you.
The Experiment Loop
- Form a Hypothesis. State what you expect: “If we kill one of three database replicas, read traffic will shift to the remaining replicas with no user-visible errors.”
- Define The Scope. What are you testing? Which component, which environment, which traffic?
- Set Blast Radius Controls. Limit the impact: target a small percentage of traffic, a single availability zone, or a non-critical service. Define abort conditions.
- Inject The Failure. Execute the experiment.
- Observe. Watch SLIs (error rate, latency, availability), dashboards, and alerts. Did the system behave as expected?
- Analyze and Act. If the hypothesis held, document and move on. If it didn’t, file a finding, fix the gap, and retest.
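The loop above can be sketched as a small harness: inject, observe against an abort condition, and always revert. A minimal illustration only—`inject`, `revert`, and `measure_error_rate` are hypothetical callables the experimenter supplies, not part of any real tool.

```python
import time

def run_experiment(inject, revert, measure_error_rate,
                   max_error_rate=0.01, duration_s=300, interval_s=10):
    """Run a time-boxed chaos experiment with an abort condition.

    inject / revert / measure_error_rate are hypothetical callables
    supplied by the experimenter.
    """
    findings = []
    inject()  # step 4: inject the failure
    try:
        deadline = time.time() + duration_s  # time-box the experiment
        while time.time() < deadline:        # step 5: observe
            error_rate = measure_error_rate()
            if error_rate > max_error_rate:  # pre-agreed abort condition
                findings.append(f"aborted: error rate {error_rate:.2%}")
                break
            time.sleep(interval_s)
    finally:
        revert()  # kill switch: always restore normal behavior
    return findings
```

The `finally` block matters: the failure injection is reverted whether the experiment completes, aborts, or crashes.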
Types Of Failure Injection
| Category | Examples | What It Tests |
|---|---|---|
| Instance/Process Failure | Kill a container, terminate an instance, restart a service | Redundancy, health checks, auto-restart, load balancing |
| Network Failure | Add latency, drop packets, partition a network segment, block a dependency | Timeouts, retries, circuit breakers, failover |
| Dependency Failure | Return errors from a downstream service, slow a database, throttle an API | Fallback behavior, graceful degradation, error handling |
| Resource Exhaustion | Fill disk, exhaust memory, saturate CPU, exhaust connection pool | Backpressure, resource limits, OOM handling, alerting |
| Clock/Time | Skew clocks, NTP drift, expired certificates | Time-dependent logic, certificate rotation, scheduled jobs |
| Region/Zone Failure | Take down an entire availability zone or region | Multi-AZ/multi-region failover, DR plans |
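To make the “Dependency Failure” and “Network Failure” rows concrete, here is a framework-free sketch of application-level fault injection: wrapping a dependency call so that a fraction of invocations fail or slow down. Real tools typically inject these faults at the network layer (proxies, service meshes) instead; this wrapper, its parameters, and the default `ConnectionError` are illustrative assumptions.

```python
import random
import time

def faulty(dep_call, error_rate=0.1, extra_latency_s=0.0):
    """Wrap a dependency call so some invocations fail or slow down.

    Hypothetical sketch of dependency/network fault injection.
    """
    def wrapper(*args, **kwargs):
        if extra_latency_s:
            time.sleep(extra_latency_s)      # latency injection
        if random.random() < error_rate:     # probabilistic error injection
            raise ConnectionError("injected fault")
        return dep_call(*args, **kwargs)
    return wrapper
```

Wrapping the client this way exercises exactly the behaviors the table lists: timeouts, retries, fallbacks, and error handling in the caller.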
Blast Radius Control
An experiment that causes a real outage has failed as an experiment.
Blast radius control is what separates chaos engineering from recklessness.
How to limit blast radius:
- Start small. Begin with a single instance, a small percentage of traffic, or a non-critical service. Expand scope only after gaining confidence.
- Use canary traffic. Route only a fraction of requests through the experiment path. See Progressive Delivery for traffic-splitting patterns.
- Define abort conditions. Before the experiment starts, agree on what triggers an immediate stop: “If error rate exceeds 1% or latency p99 exceeds 500ms, abort.”
- Have a kill switch. A way to immediately stop the failure injection and restore normal behavior. This should be tested before the experiment.
- Run in pre-production first. Validate the experiment mechanism itself before running against production traffic.
- Time-box experiments. Set a maximum duration. Don’t let experiments run indefinitely.
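The “start small” and “canary traffic” controls above boil down to assigning a small, stable slice of traffic to the experiment path. One common approach (sketched here with an assumed request/user identifier; the function name and percentages are illustrative) is hash-based bucketing, which keeps a given request or user consistently inside or outside the blast radius:

```python
import hashlib

def in_experiment(request_id: str, percent: float = 1.0) -> bool:
    """Deterministically place `percent`% of traffic in the experiment.

    Hashing the identifier (rather than random sampling) keeps a given
    user's assignment stable across requests.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10000
    return bucket / 100.0 < percent
```

Raising `percent` gradually implements the “expand scope only after gaining confidence” guidance without reshuffling who is affected.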
Environments
| Environment | Risk | Fidelity | Best For |
|---|---|---|---|
| Development/Staging | Low | Low (different scale, config, data) | Testing experiment tooling, new experiment types, team training |
| Pre-Production (production-like) | Medium | Medium-high | Validating failure modes before production, load + chaos combined |
| Production (with controls) | Higher | Highest | Verifying real resilience with real traffic, real dependencies, real config |
Production chaos is the gold standard for fidelity, but it requires mature observability, blast radius controls, and team confidence. Start in lower environments and work up.
Writing A Good Hypothesis
A hypothesis should be:
- Specific — “The system will…” not “The system should be fine.”
- Observable — You can measure the outcome with SLIs, logs, or dashboards.
- Falsifiable — There’s a clear way to determine if it failed.
Good examples:
- “If we add 200ms latency to the payment gateway, checkout success rate stays above 99.5% because the client timeout is 2 seconds.”
- “If we terminate 1 of 3 app server instances, the load balancer routes traffic to the remaining instances with no 5xx errors.”
- “If Redis becomes unavailable, the API falls back to database reads with latency increasing to ~50ms (from ~5ms) but no errors.”
Weak examples:
- “The system handles failures.” (not specific, not measurable)
- “Nothing bad happens.” (not falsifiable)
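A specific, observable, falsifiable hypothesis can be written directly as assertions over measured SLIs. This sketch encodes the latency-injection example above; the thresholds come from that hypothesis, while the function name and metric inputs are illustrative assumptions:

```python
def check_hypothesis(checkout_success_rate: float, p99_latency_ms: float) -> list:
    """Evaluate the payment-gateway latency hypothesis as explicit checks.

    Returns a list of failures; an empty list means the hypothesis held.
    """
    failures = []
    if checkout_success_rate < 0.995:  # "stays above 99.5%"
        failures.append(
            f"success rate {checkout_success_rate:.3%} below 99.5%")
    if p99_latency_ms > 500:           # illustrative latency bound
        failures.append(f"p99 latency {p99_latency_ms}ms above 500ms")
    return failures
```

If the hypothesis cannot be expressed this way, it is probably one of the weak examples above.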
Common Findings
Chaos experiments frequently reveal:
- Missing or Misconfigured Timeouts — A service waits forever for a dependency that’s not responding, causing cascading delays.
- No Fallback Behavior — When a cache or dependency is down, the service returns 500 errors instead of degrading gracefully.
- Alert Gaps — The failure happened but nobody was alerted, or alerts fired too late.
- Runbook Gaps — The runbook didn’t cover this scenario, or the steps were outdated.
- Single Points of Failure — Something marked as “redundant” turns out not to be.
- Retry Storms — Retries without backoff amplify the problem instead of recovering from it.
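The standard fix for retry storms is exponential backoff with jitter. A generic sketch, not tied to any particular client library; the function name, attempt counts, and delays are illustrative:

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_s=0.1, cap_s=5.0):
    """Retry with exponential backoff and full jitter.

    Jitter spreads retries out in time so many clients recovering at
    once don't amplify the original failure.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = min(cap_s, base_s * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter
```

Capping both the delay and the attempt count keeps a persistent outage from turning the retry loop itself into load.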
See Also
- Game-Days and Drills — Taking experiments from individual component tests to team-wide exercises.
- Load and Stress Testing — Establish performance baselines before running chaos experiments.
- Alerting — Chaos experiments often reveal alerting gaps.
- DR Planning and Testing — DR drills are a specialized form of reliability testing.