Active-Passive vs Active-Active Multi-Region
Normal request path (before failures)
A useful mental model for steady state:
- Client resolves DNS or is routed by a global entry (anycast, global load balancer, or geo-DNS).
- Traffic lands in a region on your API tier (containers, functions, or VMs) behind a regional load balancer.
- The app may read regional cache (for example Redis), then databases or queues—often a primary for writes and replicas for reads, depending on design.
Nothing here requires multiple regions; it is the happy path you secure before you reason about regional failover.
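The read half of this path is typically cache-aside. A minimal sketch, with plain dicts standing in for the regional Redis and the database (all names and data are illustrative):

```python
# Cache-aside read path: regional cache first, database on a miss,
# then populate the cache. Plain dicts stand in for Redis and the DB.
cache = {}
database = {"user:1": "Ada"}  # illustrative data

def read(key):
    if key in cache:           # cache hit: no database round trip
        return cache[key]
    value = database.get(key)  # miss: fall through to the database
    if value is not None:
        cache[key] = value     # warm the cache for subsequent reads
    return value
```

The same shape applies per-AZ or per-region; what matters for failover is that a cold cache sends every read to the database.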
Multi-AZ vs multi-region
- Availability zones (AZs) are separate data centers within one region, connected by low-latency networks. Surviving one AZ is usually solved with redundant subnets, load balancers, and synchronous or fast failover inside the region.
- A region is a geographic failure and isolation boundary in public clouds: power, networking, and control-plane scope are independent per region. Multi-region design addresses loss of a whole region (or placing workloads closer to users across continents).
Solve AZ-level problems with multi-AZ first. Add another region when you need cross-region DR, latency to distant users, or residency constraints that one region cannot satisfy.
What RTO and RPO mean here
- RTO (recovery time objective): how long service may be degraded after a regional disaster before you recover.
- RPO (recovery point objective): how much data you may lose (measured in time or volume) at the moment of failover.
These pair with replication: async replication usually improves write latency but widens RPO; sync replication tightens RPO but adds latency and coupling. See RTO and RPO for definitions and tradeoffs.
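To make the async tradeoff concrete: worst-case RPO under async replication is roughly the replication lag at the moment of failure, expressed either as time or as a count of at-risk writes. A hedged back-of-envelope helper (names and numbers are illustrative):

```python
def failover_budget(replication_lag_s, writes_per_s):
    # Async replication: writes committed locally but not yet shipped to
    # the remote region are lost on failover, so worst-case RPO is at
    # least the replication lag, and the at-risk write count is
    # lag * write rate.
    return {
        "rpo_seconds": replication_lag_s,
        "writes_at_risk": int(replication_lag_s * writes_per_s),
    }
```

For example, 5 seconds of lag at 1,000 writes per second puts roughly 5,000 writes at risk.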
Core architecture comparison
| Dimension | Active-Passive | Active-Active |
|---|---|---|
| Traffic handling | One region serves production traffic; standby is cold or warm | Multiple regions serve traffic simultaneously |
| Complexity | Lower | Higher (routing, data, conflicts, ops) |
| Cost | Lower if standby is mostly idle | Higher—often full capacity in each active region |
| RTO | Minutes (failover, DNS, promotion) | Near zero for read paths if others absorb load |
| RPO | Seconds to minutes typical with async replication | Can approach zero with the right stores and sync model |
| Write conflicts | None across regions if single writer | Must be designed for (or avoided by funneling writes) |
Sync vs async replication (plain language)
- Synchronous replication to another region waits for the remote copy before acknowledging the write. RPO can be very small, but write latency includes cross-region round trips (often tens to hundreds of milliseconds).
- Asynchronous replication acknowledges after the local commit; the remote copy catches up in the background. Faster writes, but on failover you may lose lag worth of data—your RPO is at least that lag unless you have a compensating design.
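A toy model of the two acknowledgment strategies. An in-process list stands in for the remote region's copy, and a `time.sleep` for the cross-region round trip; the 50 ms RTT is an assumption, not a measurement:

```python
import time

CROSS_REGION_RTT_S = 0.05  # assumed round trip; real values vary widely

local_log = []    # this region's committed writes
replica_log = []  # stands in for the remote region's copy

def replicate(entry):
    time.sleep(CROSS_REGION_RTT_S)  # simulate the cross-region hop
    replica_log.append(entry)

def write(entry, sync):
    start = time.perf_counter()
    local_log.append(entry)  # local commit
    if sync:
        replicate(entry)     # sync: wait for the remote copy before acking
    # async: ack now; the replica catches up in the background
    return time.perf_counter() - start
```

Any write acknowledged asynchronously but not yet replicated when the region dies is exactly the RPO window.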
At 50k RPS and sub-200ms latency
Rough targets like 50,000 requests per second and P99 under 200ms push you toward:
- Connection pooling at the database boundary (pooler or proxy) so failover does not cold-start thousands of connections.
- Regional caches to absorb read variance and protect the database.
- Fast health checks (for example on the order of seconds, not half a minute) so a bad region does not absorb traffic long enough to burn your error budget.
- Warm standby or shadow traffic in active-passive so the passive region is not completely cold (JVM caches, connection pools).
Holding a strict global P99 under 200ms during a regional failover is hard in pure active-passive: cutover and cold caches create latency spikes. Active-active with pre-provisioned regional capacity and latency-based routing usually fits tight latency SLOs better, if you can afford the data-layer complexity.
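The health-check point is easy to quantify. A sketch of detection cost, assuming a region is only declared unhealthy after several consecutive failed checks (all parameters are illustrative):

```python
def detection_cost(check_interval_s, failures_to_trip, rps):
    # Time to mark a region unhealthy is roughly the check interval times
    # the consecutive-failure threshold; every request routed to the bad
    # region during that window can fail.
    time_to_detect = check_interval_s * failures_to_trip
    return {
        "time_to_detect_s": time_to_detect,
        "requests_at_risk": int(time_to_detect * rps),
    }
```

At 50k RPS, 30-second checks with a threshold of 3 expose about 4.5 million requests; 5-second checks cut that to 750,000.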
Active-passive design
Section titled “Active-passive design”Traffic layer
- Global load balancer or DNS (patterns exist on AWS, GCP, Azure, Cloudflare, and others) with health checks aimed at the primary region.
- The passive region receives no production traffic until failover (unless you use shadow traffic to keep it warm).
- DNS TTL often kept low (tens of seconds) to shorten cutover time—at the cost of more DNS churn and operational attention.
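A rough, hedged upper bound on the cutover window, assuming DNS-based failover where clients can keep hitting the old region until their cached DNS answer expires:

```python
def cutover_window_s(detect_s, dns_ttl_s, promote_s):
    # Rough upper bound on RTO for a DNS-based active-passive cutover:
    # detect the failure, promote the standby database, and wait for
    # clients' cached DNS answers to expire (bounded by the TTL).
    return detect_s + promote_s + dns_ttl_s
```

With 15 seconds to detect, a 30-second TTL, and a 2-minute promotion, the window is under 3 minutes; a 5-minute TTL alone would dominate it.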
Database layer
- Primary in the active region takes all writes (typical pattern).
- Replica in the passive region via async or sync replication.
- Sync tightens RPO but adds write latency; async is faster for writes but widens the data-loss window on promotion.
Failure modes (active-passive)
| Failure | Response |
|---|---|
| Active region down | LB marks it unhealthy; traffic shifts to passive; the degraded period is roughly DNS TTL plus promotion time |
| DB primary fails | Promote replica in passive region; replay or accept replication lag |
| Replication lag spike | RPO risk: monitor lag aggressively and alert before it exceeds your RPO budget |
| Partial failure (one AZ) | Prefer AZ failover inside the active region before declaring a region dead |
Cold standby risk: After cutover, caches and pools may be cold—expect minutes of elevated latency unless you warm the passive path regularly.
Active-active design
Section titled “Active-active design”Traffic layer
- Latency-based routing at the global tier so users hit a nearby region.
- Each region is often sized for a large fraction of peak (for example 60–70%) so one region can absorb the surge when another fails.
- Circuit breakers per region to shed load when overloaded.
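The sizing rule of thumb follows from simple arithmetic, assuming traffic is spread evenly across N active regions:

```python
def required_fraction_of_peak(n_regions):
    # If traffic is spread evenly across N active regions and one fails,
    # each survivor's share of peak becomes 1 / (N - 1), so each region
    # must be provisioned for at least that fraction of total peak.
    if n_regions < 2:
        raise ValueError("need at least two active regions")
    return 1.0 / (n_regions - 1)
```

With three regions, each survivor takes 50% of peak after a failure, so provisioning at 60–70% leaves headroom; with two regions, each must be able to carry the full peak.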
Database layer (patterns)
Three common patterns:
- Global or multi-region database with multi-master or consensus (for example DynamoDB Global Tables, CockroachDB, Spanner): writes land in multiple regions; conflict resolution is product-specific (for example last-write-wins).
- Geo-sharding: each record has a home region; writes go to the home region; reads may be served locally from replicas.
- CQRS and event sourcing: writes append to a log replicated across regions; each region builds its own read models; consistency is eventual, which often suits read-heavy workloads.
Additional components
- Distributed tracing across regions (OpenTelemetry, vendor APM) to debug cross-region paths.
- Conflict strategy wherever two regions can write the same entity.
- Idempotency keys on mutating APIs so retries do not double-apply.
- Session affinity only if stateful; otherwise externalize session state to a shared store or token.
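A minimal idempotency-key sketch. An in-memory dict stands in for a durable key store; in production the check-and-record step must itself be atomic, and all names here are illustrative:

```python
processed = {}  # stands in for a durable idempotency-key store

def charge(idempotency_key, amount):
    # A retry carrying the same key returns the stored result instead of
    # applying the side effect a second time.
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = {"charged": amount, "status": "ok"}  # the mutating side effect
    processed[idempotency_key] = result
    return result
```

This is what lets a client safely retry a write against whichever region it reaches after a failover.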
Failure modes (active-active)
| Failure | Response |
|---|---|
| One region down | Global routing stops sending traffic there; survivors must have headroom |
| Network partition | Regions may continue serving; divergent writes possible—reconcile on heal |
| Replication lag | Reads may be stale—define per-endpoint staleness SLOs |
| Split-brain on multi-master | Database or app conflict rules; idempotency limits duplicate side effects |
| Cascading overload | One region fails → surge elsewhere → second failure; needs load shedding and autoscaling buffer |
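Load shedding in its simplest form is an admission check against a threshold below full capacity; a sketch (the 90% threshold is illustrative):

```python
def admit(inflight, capacity, shed_at=0.9):
    # Start rejecting work before the region saturates, so a surge from a
    # failed peer degrades some requests instead of toppling this region
    # and triggering the cascade described above.
    return inflight < capacity * shed_at
```

Rejected requests get a fast error (or are retried against another region) rather than queueing until everything times out.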
How traffic shifts when a region fails
Conceptually (cloud products differ):
- Health checks mark the region’s endpoints unhealthy.
- The global layer stops steering new connections to that region (DNS updates, anycast withdrawal, or LB pool removal).
- In active-passive, you may promote a database replica and point traffic to the former passive region.
- In active-active, surviving regions take the share of traffic—pre-provisioned capacity and circuit breakers matter.
Active-passive often looks like a cutover (DNS or LB flip). Active-active looks like shedding a bad region while others absorb load.
Single-region outage: operational steps (high level)
This is not a full runbook (see Failover and Failback and DR planning and testing), but the arc is:
- Detect: alarms, health checks, synthetic probes, or a regional error-rate spike.
- Decide: confirm a regional failure versus a transient blip; invoke incident command.
- Shift: route traffic away from the bad region; scale or throttle surviving regions.
- Stabilize: promote databases if needed, validate RPO, warm caches, communicate status.
- Record: keep a timeline for post-incident review and runbook updates.
When the failed region comes back (failback)
At the architecture level:
- Verify replication and disk health before sending user traffic.
- Reattach the region as a replica or secondary, or rebuild it from backup if needed.
- Reconcile divergent data if the outage allowed split writes (details depend on store—see Data, ordering, and stores).
- Ramp traffic gradually; warm caches; watch replication lag and error rates.
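The gradual ramp is often implemented as weighted routing at the global tier; a toy sketch using weighted random choice (region names and weights are illustrative):

```python
import random

def pick_region(weights):
    # Weighted routing for a staged failback: the recovered region starts
    # with a small weight that is raised step by step as health checks,
    # error rates, and replication lag stay green.
    regions = list(weights)
    return random.choices(regions, weights=[weights[r] for r in regions])[0]
```

Starting the recovered region at a few percent of traffic warms its caches and pools without betting the whole fleet on it.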
Phased implementation
Align spend and risk with need:
| Phase | Typical focus | Cost / complexity |
|---|---|---|
| 1 | Single region hardened: multi-AZ, backups, tested restore | Lowest |
| 2 | Async DR replica in second region | Low–medium |
| 3 | Active-passive with tested failover; optional warm traffic | Medium |
| 4 | Multi-region reads (replicas + cache); write still centralized | Medium–high |
| 5 | Active-active or geo-partitioned writes or global DB | Highest |
Signals to move up: latency SLOs across geographies, hard RTO/RPO, regulatory in-region serving, or write/read scale that demands distribution.
Observability by phase: early on, prioritize success rate, latency percentiles, and replication lag. As you add active regions, add per-region saturation, cross-region traffic share, and failover annotations on dashboards—see Grafana multi-region dashboards.
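Replication-lag alerting can be tied directly to the RPO budget; a hedged sketch (thresholds are illustrative):

```python
def lag_alert(lag_s, rpo_target_s, warn_fraction=0.5):
    # Page before replication lag consumes the whole RPO budget; warn when
    # it crosses a fraction of the budget so there is time to react.
    if lag_s >= rpo_target_s:
        return "page"
    if lag_s >= rpo_target_s * warn_fraction:
        return "warn"
    return "ok"
```

Alerting on the budget rather than a fixed lag value keeps the threshold meaningful as the RPO target changes.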
Security (light touch)
- TLS for all cross-region traffic; private connectivity where supported.
- Rotate and scope secrets and keys so that a compromise in one region does not yield credentials that work globally.
- Least privilege for replication agents, operators, and automation that can promote a database.
Cross-region blast radius
Design so that one compromised region or one bad deploy can be contained: separate accounts or projects where appropriate, feature flags, and bulkheads for cross-region calls. See Scalability Patterns and the Infrastructure redundancy example.
Recommendation summary
- If you need strict low P99 globally and can fund engineering complexity, active-active with a clear write strategy (for example geo-sharded writes or a global database) is often a better fit than pure active-passive at large scale.
- If the workload is mostly single geography and you need DR, active-passive with a warm standby (not cold) is simpler and often sufficient.
See also
- Multi-region overview
- Data, ordering, and stores
- RTO and RPO
- Failover and Failback
- Consistency models
- Chaos engineering: game days and drills for failover behavior