
Staged Design Examples

By Atif Alam

Every staged design example in this library (Feed, Chat, Search, URL shortener, and the rest) follows three stages. You don’t plan for Stage 3 on day one—you move forward when real signals tell you to.

  • Stage 1 — MVP: One path in, one store (or a minimal set). Ship, learn, and validate. Keep it simple enough that one team can build and run it.
  • Stage 2 — Growth: Bottlenecks or triggers appear. Add load balancing, cache, replicas, queues or a message bus, workers, or a dedicated index/store—whatever the signal calls for.
  • Stage 3 — Advanced Scale: Multi-region, retention and replay, ranking or stream processing, invalidation at scale, cost control. You’re here because Stage 2 solutions have hit their limits.

Find your signal in the left column, then read across for the component to consider and the stage at which it typically appears.

| Signal / Pain | Consider Adding | Stage |
|---|---|---|
| Read latency high, repeated same queries | Cache (Redis/Memcached), CDN | 2 |
| DB overloaded on reads | Cache, read replicas | 2 |
| DB overloaded on writes | Sharding / partitioning | 2–3 |
| Slow work in request path (e.g. transcode, email) | Queue + workers | 2 |
| Fan-out to many servers (chat, collab, cache invalidation) | Message bus (pub/sub) | 2 |
| Need ordering, retention, replay | Event stream / log (e.g. Kafka) | 2–3 |
| Complex queries, full-text, filtering + ranking | Search index (e.g. Elasticsearch) | 2 |
| Geographic latency for end users | CDN (static), multi-region (data) | 2–3 |
| Hot keys or uneven load | Cache + sharding, dedicated partitions | 2–3 |
| Many independent consumers of same data | Event stream / log | 2 |
| Connection count limit (WebSocket, long-poll) | Load balancer, horizontal servers | 2 |
| Need retention or “load history” | Event log, history/cold store | 2–3 |
| Data or index too stale for users | Cache invalidation (write-through/invalidate), indexing pipeline refresh, replication-lag SLO | 2 |
| Read-your-writes or strong consistency required | Primary/session routing for reads, write-through cache, sync replication or quorum reads | 2 |
| Cost growing fast (storage, scans) | Tiered storage, downsampling, retention | 3 |
| Manual recovery or ops toil | Orchestration, automation | 3 |
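
If it helps to keep the table next to your runbooks, a few rows can be encoded as a plain lookup. This is only a sketch: the signal keys (such as `db_overloaded_reads`) and the `Suggestion` type are made-up names for illustration, and only a handful of rows from the table above are included.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Suggestion:
    consider: str  # component(s) to consider adding
    stage: str     # stage at which the component typically appears


# Hypothetical keys; the values are copied from rows of the table above.
SIGNALS: Dict[str, Suggestion] = {
    "db_overloaded_reads": Suggestion("Cache, read replicas", "2"),
    "db_overloaded_writes": Suggestion("Sharding / partitioning", "2-3"),
    "slow_work_in_request_path": Suggestion("Queue + workers", "2"),
    "need_ordering_retention_replay": Suggestion("Event stream / log (e.g. Kafka)", "2-3"),
    "cost_growing_fast": Suggestion("Tiered storage, downsampling, retention", "3"),
}


def suggest(signal: str) -> Optional[Suggestion]:
    """Return the table row for a signal, or None if the signal is not listed."""
    return SIGNALS.get(signal)
```

An unknown signal returns `None` rather than a guess, which matches the intent of the table: add a component only when a real signal calls for it.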

For goals, typical mechanisms, and risks per category, see Optimization Quick Reference.

Use these tables to identify where you are and what to consider adding. Find your pain on the left, then read across for what to add and an example to explore.

Stage 1 → 2

| Signal / Pain | Consider adding | Example(s) |
|---|---|---|
| Single DB is the bottleneck (CPU, connections, disk) | Read replicas, cache | Feed, Chat, Search |
| Read or write latency growing | Cache (reads), queue + workers (writes) | URL shortener, Analytics |
| Need to fan out to many servers | Message bus (pub/sub) | Chat, Collaborative editing |
| Need ordering or replay | Operation log or event stream | Collaborative editing, Streaming |
| Need search or complex queries | Dedicated search index | Autocomplete, Marketplace |
| Connection count or CPU limit on single server | Load balancer, multiple servers | Chat, Game backend |

Stage 2 → 3

| Signal / Pain | Consider adding | Example(s) |
|---|---|---|
| Users in multiple regions need low latency | Multi-region replicas, regional routing | Chat, Distributed cache, Feed |
| Need retention or replay (e.g. “load last 7 days”) | Event log, history store | Chat, Streaming, IoT |
| Very high QPS or CCU | Sharding, dedicated stores | Feed, Search, Game backend |
| Need ranking or personalization | Ranking pipeline, dedicated feed store | Feed, Autocomplete |
| Invalidation at scale (many keys, many nodes) | Pub/sub for invalidation fan-out | Distributed cache |
| Cost growing fast | Tiered storage, downsampling, retention policy | IoT, Analytics |
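
Putting a cache in front of an overloaded store is one of the most common additions listed in the tables above. The sketch below shows the cache-aside pattern with a TTL, under some loud assumptions: an in-memory dict stands in for Redis or Memcached, the `load` callable stands in for your database query, and the `CacheAside` class name is invented for this example.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Cache-aside with a TTL: check the cache first, fall back to the store on a miss."""

    def __init__(self, load: Callable[[str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load          # hypothetical stand-in for a database read
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: read from the store and repopulate the cache.
        self.misses += 1
        value = self._load(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key after a write so the next read goes back to the store."""
        self._entries.pop(key, None)
```

The `hits` and `misses` counters are there because the hit ratio is a useful thing to watch: a cache that rarely hits adds a hop without relieving the database.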

When Do You Add Messages, Streams, or Queues?


This is where most confusion happens. The three types—queue, pub/sub, and stream—solve different problems. Find what you need on the left and see which type fits.

I need to… | Queue + Workers | Message Bus (Pub/Sub) | Event Stream / Log
Move slow work off the request path
Retry failed deliveries with backoff + DLQ
Async fan-out on write (e.g. feed to followers)
Fan out to many servers in real time (e.g. chat)
Invalidate cache across nodes
Broadcast edits to all subscribers of a doc
Order events per key and replay from any point
Support multiple independent consumer groups
Buffer high-volume ingest (IoT, analytics)
Deliver or replicate across regions

Quick distinction: Queue = task goes to one worker. Pub/sub = message goes to all subscribers. Stream = ordered log, multiple consumers, replay.
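
The quick distinction above can be made concrete with three toy in-memory classes. This is a sketch of the delivery semantics only, not of any real broker: `WorkQueue`, `PubSub`, and `EventLog` are names invented here, and a production system (for example Redis or Kafka) adds persistence, acknowledgements, and partitioning that this example leaves out.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional


class WorkQueue:
    """Queue: each task is handed to exactly one worker."""

    def __init__(self) -> None:
        self._tasks: Deque[str] = deque()

    def put(self, task: str) -> None:
        self._tasks.append(task)

    def take(self) -> Optional[str]:
        # Taking a task removes it, so no other worker will see it.
        return self._tasks.popleft() if self._tasks else None


class PubSub:
    """Pub/sub: each message goes to every current subscriber; nothing is retained."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, message: str) -> None:
        for handler in self._subscribers:
            handler(message)


class EventLog:
    """Stream: an ordered, retained log; each consumer group tracks its own offset."""

    def __init__(self) -> None:
        self._events: List[str] = []
        self._offsets: Dict[str, int] = {}

    def append(self, event: str) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def poll(self, group: str) -> List[str]:
        start = self._offsets.get(group, 0)
        batch = self._events[start:]
        self._offsets[group] = len(self._events)
        return batch

    def seek(self, group: str, offset: int) -> None:
        """Rewind (or skip ahead) so the group replays from the given offset."""
        self._offsets[group] = offset
```

Note the difference in what happens after delivery: the queue forgets a task once one worker takes it, pub/sub forgets a message as soon as it is published, and the log keeps every event so a new consumer group, or a group that calls `seek`, can read from an earlier point.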

For a deeper comparison, see Redis vs Kafka: When to Use Which.

  1. Map your problem to the quality dimensions in the Quality Requirements Checklist on the System Design Requirements page. Which dimensions matter most?
  2. Identify your stage from the trigger tables above. Are you seeing Stage 1 → 2 signals, or Stage 2 → 3?
  3. Pick one or two example docs that match your problem shape:

Browse the full list on the System Design overview.