Chaos Experiments
A chaos experiment is a controlled test where you deliberately inject a failure into a system and observe how it responds.
The goal is not to break things—it’s to verify that the system handles failure the way you expect.
Every chaos experiment starts with a hypothesis: “We believe that when X fails, the system will do Y.”
If it does, you’ve validated your resilience. If it doesn’t, you’ve found a gap to fix—before users find it for you.
The Experiment Loop
- Form a Hypothesis. State what you expect: “If we kill one of three database replicas, read traffic will shift to the remaining replicas with no user-visible errors.”
- Define The Scope. What are you testing? Which component, which environment, which traffic?
- Set Blast Radius Controls. Limit the impact: target a small percentage of traffic, a single availability zone, or a non-critical service. Define abort conditions.
- Inject The Failure. Execute the experiment.
- Observe. Watch SLIs (error rate, latency, availability), dashboards, and alerts. Did the system behave as expected?
- Analyze and Act. If the hypothesis held, document and move on. If it didn’t, file a finding, fix the gap, and retest.
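The loop above can be sketched as a small harness: inject, observe against an abort condition, and always revert. A minimal illustration only—`inject`, `revert`, and `measure_error_rate` are hypothetical callables the experimenter supplies, not part of any real tool.

```python
import time

def run_experiment(inject, revert, measure_error_rate,
                   max_error_rate=0.01, duration_s=300, interval_s=10):
    """Run a time-boxed chaos experiment with an abort condition.

    inject / revert / measure_error_rate are hypothetical callables
    supplied by the experimenter.
    """
    findings = []
    inject()  # step 4: inject the failure
    try:
        deadline = time.time() + duration_s  # time-box the experiment
        while time.time() < deadline:        # step 5: observe
            error_rate = measure_error_rate()
            if error_rate > max_error_rate:  # pre-agreed abort condition
                findings.append(f"aborted: error rate {error_rate:.2%}")
                break
            time.sleep(interval_s)
    finally:
        revert()  # kill switch: always restore normal behavior
    return findings
```

The `finally` block matters: the failure injection is reverted whether the experiment completes, aborts, or crashes.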
Types Of Failure Injection
| Category | Examples | What It Tests |
|---|---|---|
| Instance/Process Failure | Kill a container, terminate an instance, restart a service | Redundancy, health checks, auto-restart, load balancing |
| Network Failure | Add latency, drop packets, partition a network segment, block a dependency | Timeouts, retries, circuit breakers, failover |
| Dependency Failure | Return errors from a downstream service, slow a database, throttle an API | Fallback behavior, graceful degradation, error handling |
| Resource Exhaustion | Fill disk, exhaust memory, saturate CPU, exhaust connection pool | Backpressure, resource limits, OOM handling, alerting |
| Clock/Time | Skew clocks, NTP drift, expired certificates | Time-dependent logic, certificate rotation, scheduled jobs |
| Region/Zone Failure | Take down an entire availability zone or region | Multi-AZ/multi-region failover, DR plans |
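To make the “Dependency Failure” and “Network Failure” rows concrete, here is a framework-free sketch of application-level fault injection: wrapping a dependency call so that a fraction of invocations fail or slow down. Real tools typically inject these faults at the network layer (proxies, service meshes) instead; this wrapper, its parameters, and the default `ConnectionError` are illustrative assumptions.

```python
import random
import time

def faulty(dep_call, error_rate=0.1, extra_latency_s=0.0):
    """Wrap a dependency call so some invocations fail or slow down.

    Hypothetical sketch of dependency/network fault injection.
    """
    def wrapper(*args, **kwargs):
        if extra_latency_s:
            time.sleep(extra_latency_s)      # latency injection
        if random.random() < error_rate:     # probabilistic error injection
            raise ConnectionError("injected fault")
        return dep_call(*args, **kwargs)
    return wrapper
```

Wrapping the client this way exercises exactly the behaviors the table lists: timeouts, retries, fallbacks, and error handling in the caller.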
Blast Radius Control
An experiment that causes a real outage has failed as an experiment.
Blast radius control is what separates chaos engineering from recklessness.
How to limit blast radius:
- Start small. Begin with a single instance, a small percentage of traffic, or a non-critical service. Expand scope only after gaining confidence.
- Use canary traffic. Route only a fraction of requests through the experiment path. See Progressive Delivery for traffic-splitting patterns.
- Define abort conditions. Before the experiment starts, agree on what triggers an immediate stop: “If error rate exceeds 1% or latency p99 exceeds 500ms, abort.”
- Have a kill switch. A way to immediately stop the failure injection and restore normal behavior. This should be tested before the experiment.
- Run in pre-production first. Validate the experiment mechanism itself before running against production traffic.
- Time-box experiments. Set a maximum duration. Don’t let experiments run indefinitely.
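The “start small” and “canary traffic” controls above boil down to assigning a small, stable slice of traffic to the experiment path. One common approach (sketched here with an assumed request/user identifier; the function name and percentages are illustrative) is hash-based bucketing, which keeps a given request or user consistently inside or outside the blast radius:

```python
import hashlib

def in_experiment(request_id: str, percent: float = 1.0) -> bool:
    """Deterministically place `percent`% of traffic in the experiment.

    Hashing the identifier (rather than random sampling) keeps a given
    user's assignment stable across requests.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10000
    return bucket / 100.0 < percent
```

Raising `percent` gradually implements the “expand scope only after gaining confidence” guidance without reshuffling who is affected.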
Environments
| Environment | Risk | Fidelity | Best For |
|---|---|---|---|
| Development/Staging | Low | Low (different scale, config, data) | Testing experiment tooling, new experiment types, team training |
| Pre-Production (production-like) | Medium | Medium-high | Validating failure modes before production, load + chaos combined |
| Production (with controls) | Higher | Highest | Verifying real resilience with real traffic, real dependencies, real config |
Production chaos is the gold standard for fidelity, but it requires mature observability, blast radius controls, and team confidence. Start in lower environments and work up.
Writing A Good Hypothesis
A hypothesis should be:
- Specific — “The system will…” not “The system should be fine.”
- Observable — You can measure the outcome with SLIs, logs, or dashboards.
- Falsifiable — There’s a clear way to determine if it failed.
Good examples:
- “If we add 200ms latency to the payment gateway, checkout success rate stays above 99.5% because the client timeout is 2 seconds.”
- “If we terminate 1 of 3 app server instances, the load balancer routes traffic to the remaining instances with no 5xx errors.”
- “If Redis becomes unavailable, the API falls back to database reads with latency increasing to ~50ms (from ~5ms) but no errors.”
Weak examples:
- “The system handles failures.” (not specific, not measurable)
- “Nothing bad happens.” (not falsifiable)
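A specific, observable, falsifiable hypothesis can be written directly as assertions over measured SLIs. This sketch encodes the latency-injection example above; the thresholds come from that hypothesis, while the function name and metric inputs are illustrative assumptions:

```python
def check_hypothesis(checkout_success_rate: float, p99_latency_ms: float) -> list:
    """Evaluate the payment-gateway latency hypothesis as explicit checks.

    Returns a list of failures; an empty list means the hypothesis held.
    """
    failures = []
    if checkout_success_rate < 0.995:  # "stays above 99.5%"
        failures.append(
            f"success rate {checkout_success_rate:.3%} below 99.5%")
    if p99_latency_ms > 500:           # illustrative latency bound
        failures.append(f"p99 latency {p99_latency_ms}ms above 500ms")
    return failures
```

If the hypothesis cannot be expressed this way, it is probably one of the weak examples above.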
Common Findings
Chaos experiments frequently reveal:
- Missing or Misconfigured Timeouts — A service waits forever for a dependency that’s not responding, causing cascading delays.
- No Fallback Behavior — When a cache or dependency is down, the service returns 500 errors instead of degrading gracefully.
- Alert Gaps — The failure happened but nobody was alerted, or alerts fired too late.
- Runbook Gaps — The runbook didn’t cover this scenario, or the steps were outdated.
- Single Points of Failure — Something marked as “redundant” turns out not to be.
- Retry Storms — Retries without backoff amplify the problem instead of recovering from it.
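The standard fix for retry storms is exponential backoff with jitter. A generic sketch, not tied to any particular client library; the function name, attempt counts, and delays are illustrative:

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_s=0.1, cap_s=5.0):
    """Retry with exponential backoff and full jitter.

    Jitter spreads retries out in time so many clients recovering at
    once don't amplify the original failure.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = min(cap_s, base_s * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter
```

Capping both the delay and the attempt count keeps a persistent outage from turning the retry loop itself into load.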
See Also
- Game-Days and Drills — Taking experiments from individual component tests to team-wide exercises.
- Load and Stress Testing — Establish performance baselines before running chaos experiments.
- Alerting — Chaos experiments often reveal alerting gaps.
- DR Planning and Testing — DR drills are a specialized form of reliability testing.