
Scalability Patterns

By Atif Alam

Scalability patterns are the recurring design choices that let a system handle more load—more users, more traffic, more data—without collapsing.

This page consolidates the main patterns: how you scale (horizontal vs vertical), how you shape the system (stateless, queue-decoupled), how you protect it when load exceeds capacity (backpressure, load shedding), how you isolate faults when dependencies fail (circuit breakers, retries, bulkheads), and how you serve traffic from multiple regions (active-active).

For building blocks like caches, queues, replication, and sharding, see System Design Checklist (Sections 2, 5, 6, 7). For infrastructure redundancy and failover design, see the Infrastructure redundancy example.

Vertical scaling (scale up) — Add more CPU, memory, or disk to the same machine. Simple to reason about and no distribution overhead, but you hit hardware limits and single points of failure.

Use for databases or stateful nodes where horizontal scaling is harder, or as a short-term lever.

Horizontal scaling (scale out) — Add more machines and distribute load across them. Gives you elasticity and fault tolerance (lose one node, others continue).

Requires stateless or sharded state, load balancing, and often replication and failover. See Capacity Planning for how to plan and operate at scale.

Choose based on workload and constraints: vertical for simplicity and single-node limits; horizontal for long-term growth and resilience.

Stateless services don’t store request-scoped state on the server; any request can be handled by any instance. That makes horizontal scaling straightforward: add instances, put them behind a load balancer, and traffic spreads. Session state, if needed, lives in a shared store (e.g. database, cache, or client token).

Stateful services hold session or affinity state on the node. You need sticky sessions or routing by key, and losing a node can lose that state unless it’s replicated. Stateless design avoids that and simplifies deployment and scaling—see System Design Checklist Section 1 (compute) for stateless app servers and scaling strategy.
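A stateless handler can be sketched as follows. This is a minimal illustration, not a specific framework's API: `session_store` is a plain dict standing in for a shared store such as Redis, so that any instance of the service can answer any request.

```python
# Stand-in for a shared session store (e.g. Redis keyed by session token).
# In a real deployment this must live outside the instance; a per-process
# dict is used here only to keep the sketch self-contained.
session_store = {}

def handle_request(session_token: str, payload: dict) -> dict:
    # Load session state from the shared store, never from instance memory,
    # so any instance behind the load balancer can serve this request.
    session = session_store.get(session_token, {"visits": 0})
    session["visits"] += 1
    session_store[session_token] = session  # write back to the shared store
    return {"visits": session["visits"], **payload}
```

Because the handler reads and writes only the shared store, instances can be added or removed freely and no sticky sessions are needed.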

Inserting a queue (or event log) between producers and consumers decouples them: producers don’t block on consumer speed, and consumers can scale independently and process at their own rate. That smooths spikes, isolates backpressure to the queue, and often improves availability (queue absorbs load when a consumer is down or slow).

Patterns: Async job queues for background work; event-driven pipelines for stream processing; request/response over a queue when you need durability or guaranteed delivery. See System Design Checklist Section 6 (messaging) for components and delivery semantics. Queue depth and consumer lag are key metrics for capacity and backpressure—see Capacity Planning and Load and Stress Testing.
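The producer/consumer decoupling above can be shown in miniature with an in-process queue. A real system would use a broker (e.g. Kafka or RabbitMQ); the bounded `queue.Queue` here is just a stand-in, and the sentinel-based shutdown is one common convention, not the only one.

```python
import queue
import threading

jobs = queue.Queue(maxsize=100)   # bounded buffer between producer and consumer
results = []

def producer(n: int) -> None:
    for i in range(n):
        jobs.put(i)               # enqueue work; blocks only if the queue is full
    jobs.put(None)                # sentinel: no more work is coming

def consumer() -> None:
    while True:
        item = jobs.get()
        if item is None:          # sentinel seen: drain complete, stop
            break
        results.append(item * 2)  # stand-in for real processing

t = threading.Thread(target=consumer)
t.start()
producer(5)
t.join()
```

The producer finishes as soon as its items are enqueued; the consumer drains at its own pace, and you could start more consumer threads (or processes) without touching the producer.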

Backpressure is the signal that flows backward when a downstream component can’t keep up: “slow down” or “stop sending.” Without it, fast producers overwhelm slow consumers, queues grow unbounded, and memory or latency blows up.

Ways to apply it: Bounded queues with blocking or reject-when-full semantics; protocol-level flow control (e.g. HTTP/2 or gRPC flow-control windows); consumer lag and queue depth as scaling or alerting signals; backpressure propagated through the pipeline so producers throttle or shed load. Designing for backpressure keeps the system stable under load and ties into load shedding (below).
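The bounded-queue, reject-when-full variant can be sketched as below: instead of buffering without limit, the producer gets an immediate "slow down" signal it can propagate upstream. The queue size and function name are illustrative.

```python
import queue

buf = queue.Queue(maxsize=2)  # small bound for illustration

def try_submit(item) -> bool:
    """Non-blocking enqueue. False means backpressure: the caller should
    throttle, retry later, or shed the item rather than buffer forever."""
    try:
        buf.put_nowait(item)
        return True
    except queue.Full:
        return False
```

The boolean result is the backpressure signal; a caller that ignores it is exactly the unbounded-producer failure mode the pattern exists to prevent.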

When the system is at or over capacity, load shedding is deliberately dropping or rejecting some work so the rest can succeed. Better to degrade gracefully (e.g. reject low-priority or excess requests) than to let everything time out or crash.

Approaches: Reject new requests when queue depth or error rate exceeds a threshold; rate limit or throttle by client or priority; serve a subset of traffic (e.g. critical paths only) and return 503 or a fallback for the rest. Combine with SLOs and error budgets so you know when to shed and how much. Load shedding is a last line of defense; capacity planning and autoscaling should reduce how often you need it. See Capacity Planning for headroom and surge policies.
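A minimal admission-control sketch of the first approach: reject low-priority work once in-flight requests exceed a threshold, so critical requests keep succeeding. The threshold, priority labels, and status codes here are illustrative assumptions.

```python
MAX_IN_FLIGHT = 3  # capacity threshold; in practice derived from load testing
in_flight = 0

def admit(priority: str) -> int:
    """Return 200 if the request is accepted, 503 if it is shed.
    The caller must call release() when an accepted request finishes."""
    global in_flight
    if in_flight >= MAX_IN_FLIGHT and priority != "critical":
        return 503  # shed: over capacity, tell the client to back off
    in_flight += 1
    return 200

def release() -> None:
    global in_flight
    in_flight -= 1
```

Returning 503 quickly is the point: a fast, explicit rejection is cheaper for both sides than letting every request queue up and time out.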

Fault Isolation (Circuit Breakers, Retries, Bulkheads)


When a dependency (downstream service, DB, or API) is slow or failing, your system can cascade: threads block, queues back up, and one bad dependency takes everyone down. Fault isolation patterns limit the blast radius and give dependencies time to recover.

  • Circuit breaker — Treat the dependency as a “circuit.” Track failures (or error rate); after a threshold, “open” the circuit and stop calling for a period (or until a probe succeeds). Fail fast instead of timing out repeatedly; optionally return a fallback. When the circuit is “half-open,” allow a test request; if it succeeds, close the circuit and resume normal traffic.
  • Retries — Retry failed calls with backoff (e.g. exponential) so transient failures can succeed. Use with timeouts so you don’t wait forever, and with idempotency where the operation must not be applied twice (e.g. payments, writes).
  • Bulkheads — Isolate resources (thread pools, connection pools, queues) per dependency or priority. If one dependency is slow, it only consumes its share of resources; others keep working. Prevents one bad dependency from exhausting all capacity.
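The retry bullet above can be sketched as exponential backoff with full jitter. Parameter values are illustrative, and the wrapped call should be idempotent, as noted above.

```python
import random
import time

def call_with_retries(fn, attempts: int = 3, base_delay: float = 0.1):
    """Call fn(), retrying on any exception with exponential backoff.
    The final failure is re-raised so the caller sees the real error."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            # Full jitter: sleep a random amount in [0, base * 2^attempt),
            # which avoids synchronized retry storms across clients.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

In production this sits alongside per-call timeouts (so a hung call counts as a failure) and a retry budget, so retries do not themselves become a load amplifier.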

These often go together: retries with backoff for transient failures; circuit breaker to stop hammering a clearly failing dependency; bulkheads so that dependency’s load doesn’t starve the rest of the system. See System Design Checklist Section 1 (compute) for “Retry policy + timeouts + circuit breakers (service resilience).”
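The circuit-breaker state machine described above (closed → open → half-open probe) can be sketched as a small class. Thresholds and naming are illustrative assumptions, not a specific library's API.

```python
import time

class CircuitBreaker:
    """After max_failures consecutive failures the circuit opens and calls
    fail fast for reset_after seconds; the first call after that window
    is the half-open probe. Success closes the circuit again."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Window elapsed: half-open, let this one probe call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping the dependency call in `call()` gives the fail-fast behavior: while the circuit is open, callers get an immediate error (or a fallback) instead of burning a timeout on every request.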

Active-active means multiple regions (or sites) serve traffic at the same time. You get lower latency (route users to the nearest region) and better availability (if one region fails, others continue). The tradeoff is complexity: data replication and consistency across regions, conflict resolution, and operational and cost overhead.

Considerations: Data placement and replication (sync vs async; see RTO and RPO); routing and failover (DNS, global load balancing); consistency (see Consistency Models) when data is writable in more than one region. The Infrastructure redundancy example and Failover and Failback cover failover and capacity for N-1 regions.