Chaos Experiments

First published by Atif Alam

A chaos experiment is a controlled test where you deliberately inject a failure into a system and observe how it responds.

The goal is not to break things—it’s to verify that the system handles failure the way you expect.

Every chaos experiment starts with a hypothesis: “We believe that when X fails, the system will do Y.”

If it does, you’ve validated your resilience. If it doesn’t, you’ve found a gap to fix—before users find it for you.

  1. Form a Hypothesis. State what you expect: “If we kill one of three database replicas, read traffic will shift to the remaining replicas with no user-visible errors.”
  2. Define The Scope. What are you testing? Which component, which environment, which traffic?
  3. Set Blast Radius Controls. Limit the impact: target a small percentage of traffic, a single availability zone, or a non-critical service. Define abort conditions.
  4. Inject The Failure. Execute the experiment.
  5. Observe. Watch SLIs (error rate, latency, availability), dashboards, and alerts. Did the system behave as expected?
  6. Analyze and Act. If the hypothesis held, document and move on. If it didn’t, file a finding, fix the gap, and retest.
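The steps above can be sketched as a small experiment loop. This is a minimal illustration, not a real tool: `inject_failure`, `stop_injection`, and `read_error_rate` are hypothetical placeholders for whatever injection and observability tooling you actually use.

```python
import time

def run_experiment(inject_failure, stop_injection, read_error_rate,
                   abort_threshold=0.01, duration_s=60, poll_s=5):
    """Inject a failure, watch an SLI, and abort if it breaches the threshold.

    Returns True if the hypothesis held (error rate stayed below the
    threshold for the whole window), False if an abort condition fired.
    """
    inject_failure()
    try:
        deadline = time.monotonic() + duration_s   # time-box the experiment
        while time.monotonic() < deadline:
            if read_error_rate() > abort_threshold:
                return False        # abort condition hit: hypothesis falsified
            time.sleep(poll_s)
        return True                 # SLI stayed healthy for the full window
    finally:
        stop_injection()            # kill switch: always restore normal behavior
```

The `finally` block is the important part: injection is stopped whether the hypothesis holds, fails, or the loop raises, which is exactly the kill-switch guarantee described later.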
| Category | Examples | What It Tests |
| --- | --- | --- |
| Instance/Process Failure | Kill a container, terminate an instance, restart a service | Redundancy, health checks, auto-restart, load balancing |
| Network Failure | Add latency, drop packets, partition a network segment, block a dependency | Timeouts, retries, circuit breakers, failover |
| Dependency Failure | Return errors from a downstream service, slow a database, throttle an API | Fallback behavior, graceful degradation, error handling |
| Resource Exhaustion | Fill disk, exhaust memory, saturate CPU, exhaust connection pool | Backpressure, resource limits, OOM handling, alerting |
| Clock/Time | Skew clocks, NTP drift, expired certificates | Time-dependent logic, certificate rotation, scheduled jobs |
| Region/Zone Failure | Take down an entire availability zone or region | Multi-AZ/multi-region failover, DR plans |
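Dependency failures in particular are easy to prototype in process. The sketch below is a toy stand-in for proxy-level tools such as Toxiproxy or `tc netem`: it wraps any zero-argument call so that a configurable fraction of requests fail or slow down. The names and error type are illustrative assumptions.

```python
import random
import time

def with_fault_injection(call, error_rate=0.1, extra_latency_s=0.0):
    """Wrap a downstream call so some requests fail or gain latency.

    call            -- zero-argument function representing the dependency
    error_rate      -- probability in [0, 1] that a call raises
    extra_latency_s -- fixed delay added to every call
    """
    def wrapped():
        if extra_latency_s:
            time.sleep(extra_latency_s)       # injected latency
        if random.random() < error_rate:
            raise ConnectionError("injected dependency failure")
        return call()
    return wrapped
```

Wrapping the client like this lets you test timeouts, retries, and fallback paths in unit tests before reaching for infrastructure-level injection.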

An experiment that causes a real outage has failed as an experiment.

Blast radius control is what separates chaos engineering from recklessness.

How to limit blast radius:

  • Start small. Begin with a single instance, a small percentage of traffic, or a non-critical service. Expand scope only after gaining confidence.
  • Use canary traffic. Route only a fraction of requests through the experiment path. See Progressive Delivery for traffic-splitting patterns.
  • Define abort conditions. Before the experiment starts, agree on what triggers an immediate stop: “If error rate exceeds 1% or p99 latency exceeds 500ms, abort.”
  • Have a kill switch. A way to immediately stop the failure injection and restore normal behavior. This should be tested before the experiment.
  • Run in pre-production first. Validate the experiment mechanism itself before running against production traffic.
  • Time-box experiments. Set a maximum duration. Don’t let experiments run indefinitely.
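The abort-condition bullet above can be made concrete as a predicate that the experiment loop evaluates on every poll. A minimal sketch; the metric names and thresholds are illustrative assumptions, not a prescribed schema:

```python
def should_abort(metrics, max_error_rate=0.01, max_p99_ms=500):
    """Return True if any agreed abort condition is breached.

    metrics -- dict of current SLI readings, e.g.
               {"error_rate": 0.002, "p99_ms": 180}
    """
    return (metrics["error_rate"] > max_error_rate
            or metrics["p99_ms"] > max_p99_ms)
```

Keeping the abort logic in one small, reviewable function makes it easy to agree on thresholds before the experiment starts rather than debating them mid-incident.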
| Environment | Risk | Fidelity | Best For |
| --- | --- | --- | --- |
| Development/Staging | Low | Low (different scale, config, data) | Testing experiment tooling, new experiment types, team training |
| Pre-Production (production-like) | Medium | Medium-high | Validating failure modes before production, load + chaos combined |
| Production (with controls) | Higher | Highest | Verifying real resilience with real traffic, real dependencies, real config |

Production chaos is the gold standard for fidelity, but it requires mature observability, blast radius controls, and team confidence. Start in lower environments and work up.

A hypothesis should be:

  • Specific — “The system will…” not “The system should be fine.”
  • Observable — You can measure the outcome with SLIs, logs, or dashboards.
  • Falsifiable — There’s a clear way to determine if it failed.

Good examples:

  • “If we add 200ms latency to the payment gateway, checkout success rate stays above 99.5% because the client timeout is 2 seconds.”
  • “If we terminate 1 of 3 app server instances, the load balancer routes traffic to the remaining instances with no 5xx errors.”
  • “If Redis becomes unavailable, the API falls back to database reads with latency increasing to ~50ms (from ~5ms) but no errors.”

Weak examples:

  • “The system handles failures.” (not specific, not measurable)
  • “Nothing bad happens.” (not falsifiable)
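One way to enforce the specific/observable/falsifiable test is to represent each hypothesis as a statement paired with an executable check. A minimal sketch, using one of the good examples above; `read_success_rate` is a stub standing in for a real metrics query:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hypothesis:
    statement: str                 # specific: says what the system will do
    check: Callable[[], bool]      # observable and falsifiable: returns pass/fail

def read_success_rate():
    """Stub for a metrics query (e.g. from your monitoring system)."""
    return 0.999

good = Hypothesis(
    statement=("Checkout success rate stays above 99.5% with 200ms added "
               "latency to the payment gateway"),
    check=lambda: read_success_rate() > 0.995,
)
```

A vague hypothesis like “the system handles failures” simply cannot be written as a `check` function, which is a useful smell test.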

Chaos experiments frequently reveal:

  • Missing or Misconfigured Timeouts — A service waits forever for a dependency that’s not responding, causing cascading delays.
  • No Fallback Behavior — When a cache or dependency is down, the service returns 500 errors instead of degrading gracefully.
  • Alert Gaps — The failure happened but nobody was alerted, or alerts fired too late.
  • Runbook Gaps — The runbook didn’t cover this scenario, or the steps were outdated.
  • Single Points of Failure — Something marked as “redundant” turns out not to be.
  • Retry Storms — Retries without backoff amplify the problem instead of recovering from it.
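The retry-storm finding usually has a standard fix: capped exponential backoff with jitter, so clients of a failing dependency spread their retries out instead of hammering it in lockstep. A minimal sketch (the parameter defaults are illustrative):

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_s=0.1, cap_s=2.0):
    """Retry a call with capped exponential backoff and full jitter.

    Sleeps a random interval in [0, min(cap, base * 2**attempt)] between
    attempts, re-raising the last exception if all attempts fail.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise               # out of attempts: surface the failure
            backoff = min(cap_s, base_s * 2 ** attempt)
            time.sleep(random.uniform(0, backoff))   # full jitter
```

A chaos experiment that injects dependency errors is a good way to verify that every client in the call path actually uses backoff like this, rather than naive immediate retries.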