Distributed Cache — Designed in Stages
You don’t need to design for scale on day one.
This guide is a staged design playbook: it tells you what to build at MVP, Growth, and Advanced scale so you don’t over- or under-build.
Define what you need—get, set, delete, and optionally invalidate by pattern—then build the simplest key-value cache that works and evolve as capacity, availability, and consistency requirements grow.
When to use this: Use this when you’re designing or evolving a distributed cache (Memcached-, Redis-style) from scratch or from an existing MVP; when you need low-latency key-value access, eviction (LRU/TTL), and optional replication and failover. Skip or adapt if you need only a local in-process cache (single app) or a full database.
Unlike designing for max scale on day one, this adds complexity only when triggers appear (e.g. capacity, availability, multi-region). Unlike ad-hoc growth with no structure, you get a clear sequence: MVP → Growth → Advanced. If you over-build (e.g. multi-region and invalidation fan-out before you need them), you pay in ops and consistency. If you under-invest in triggers (e.g. no replication when node failure hurts availability), you hit outages and load spikes. The stages tie additions to triggers so you avoid both.
Here we use a distributed cache (Memcached-, Redis-style) as the running example: cache keys, values, optional TTL, and nodes/shards. The same staged thinking applies to any system that offloads read load or accelerates access with in-memory key-value storage: low latency, high throughput, eviction policy (LRU, TTL), and fault tolerance (replication, failover) are central.
Requirements and Constraints (no architecture yet)
Functional Requirements
- Get — given a key, return the stored value (or miss); low latency; dominant read path.
- Set — store (key, value); optional TTL (time-to-live); overwrite if key exists.
- Delete — remove key; subsequent get returns miss.
- Invalidate — optional; remove key(s) by pattern or tag (e.g. “all keys for user X”); used when source data changes so cache stays consistent.
Quality Requirements
- Low latency — get and set should complete in sub-millisecond to low single-digit ms; in-memory access; avoid disk on hot path.
- High throughput — support many get/set per second; scale with more nodes and connection pooling.
- Consistency — cache is typically eventual or session consistency with source of truth (e.g. DB); define read-your-writes if needed; invalidation or TTL limit staleness.
- Eviction policy — when memory is full, evict entries; LRU (least recently used) or TTL-based; avoid unbounded growth.
- Fault tolerance — node failure should not lose all data if replicated; failover so clients can keep using the cache; optional persistence (snapshot, AOF) for durability.
- Expected scale — key count, value size, QPS, memory per node, number of clients.
Key Entities
- Cache key — unique identifier (string or binary); used for get/set/delete; key-space is partitioned across nodes (sharding).
- Value — opaque blob (string, bytes); stored with key; optional metadata (TTL, flags).
- TTL (optional) — time-to-live; entry expires after TTL; get returns miss after expiry; set can specify TTL.
- Node / Shard — one cache server or process; holds a subset of the key-space; multiple nodes form a cluster; replication = copy of shard on another node.
Primary Use Cases and Access Patterns
- Get — read path; key → value or miss; must be very fast; cache hit reduces load on backing store (e.g. DB).
- Set — write path; (key, value, optional TTL); overwrite; often after compute or DB read to populate cache.
- Delete — write path; remove key; used when source data changes or to force refresh.
- Invalidate — write path; remove multiple keys (by pattern or tag); fan-out to nodes or pub/sub so all nodes drop relevant keys.
- Key-space partitioning — which node owns which key; consistent hashing or modulo so same key always routes to same node (for non-replicated); rebalance when nodes join/leave.
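The get/set/delete access patterns above combine into the classic cache-aside read path: try the cache, fall back to the backing store on a miss, then populate the cache. A minimal sketch, assuming a hypothetical `SimpleCache` class and a plain dict standing in for the DB:

```python
import time

class SimpleCache:
    """Minimal in-memory KV store with optional per-key TTL (illustrative stand-in)."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at or None)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None  # miss
        value, expires_at = entry
        if expires_at is not None and time.time() >= expires_at:
            del self._store[key]  # lazy expiry on read
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = time.time() + ttl if ttl is not None else None
        self._store[key] = (value, expires_at)

    def delete(self, key):
        self._store.pop(key, None)  # subsequent get returns miss

def get_user(cache, db, user_id):
    """Cache-aside read: try cache, fall back to the backing store, populate on miss."""
    key = f"user:{user_id}"
    value = cache.get(key)
    if value is None:
        value = db[user_id]            # hypothetical DB lookup
        cache.set(key, value, ttl=60)  # TTL bounds staleness
    return value
```

The TTL on the populating `set` is what bounds staleness when the source data changes without an explicit delete.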
Given this, start with the simplest MVP: single cache node or small cluster, in-memory key-value store, consistent hashing or simple modulo for key partitioning, get/set/delete, basic TTL or LRU eviction—then add multiple nodes with sharding, replication for availability, connection pooling and load balancing, and monitoring as capacity and availability demands grow.
Stage 1 — MVP (simple, correct, not over-engineered)
Goal
Ship a working cache: clients can get, set, and delete by key; keys are partitioned across one or a few nodes; memory is bounded by eviction (TTL or LRU). Single node or small fixed cluster; no replication yet.
Components
- Cache node(s) — in-memory key-value store (e.g. hash map or off-the-shelf like Memcached, Redis in single-node mode); get(key), set(key, value, optional ttl), delete(key); single process or small cluster (e.g. 2–3 nodes for capacity).
- Key partitioning — which node owns which key: consistent hashing (key → hash → ring → node) or simple modulo (hash(key) % num_nodes); client or proxy routes request to correct node; same key always to same node so state is deterministic (e.g. 2–3 nodes at MVP).
- Eviction — when memory limit reached, evict entries: LRU (least recently used) or TTL (expire after set time); configurable max memory per node; document eviction policy.
- Protocol — clients speak cache protocol (e.g. Memcached ASCII/binary, Redis RESP); or thin API (HTTP/gRPC) in front of cache; keep protocol simple at MVP.
- No replication — each key lives on one node; node failure = loss of that shard’s data; acceptable for cache (re-populate from DB); or use 1 replica per node for durability if required.
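The key-partitioning component above can be sketched as a consistent-hash ring: hash each node onto a circle, then route a key to the first node clockwise from the key's hash. This is a minimal illustration (vnodes omitted; `HashRing` and its node names are hypothetical):

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring mapping keys to nodes (sketch; no virtual nodes)."""
    def __init__(self, nodes):
        # Place each node at a deterministic point on the ring.
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First node point at or after the key's hash, wrapping around the ring.
        i = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node1", "node2", "node3"])
owner = ring.node_for("user:42")  # same key always routes to the same node
```

Because routing is a pure function of the key and the node set, any client or proxy computes the same owner, which is what makes state deterministic at the MVP stage.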
Minimal Diagram
Stage 1: clients partition by key (consistent hashing or modulo); each node is an in-memory KV store with eviction (LRU/TTL).
Clients
   |
   v
Partition (consistent hashing or modulo)
   |
   v
+--------+  +--------+  +--------+
| Node 1 |  | Node 2 |  | Node 3 |
|   KV   |  |   KV   |  |   KV   |
| in-mem |  | in-mem |  | in-mem |
| LRU/TTL|  | LRU/TTL|  | LRU/TTL|
+--------+  +--------+  +--------+

(e.g. Redis, Memcached — single-node or small cluster)

Still Avoid (common over-engineering here)
- Replication and failover before you have availability or capacity requirements that justify them.
- Multi-region replication and invalidation fan-out (pub/sub) before you have geographic or consistency demands.
Patterns and Concerns (don’t overbuild)
- Key design: keys should be uniform and high-cardinality to avoid hot spots; avoid one node holding all popular keys (consistent hashing helps spread).
- Value size: cap value size to avoid large allocations and eviction storms; document limit (e.g. 1 MB).
- Basic monitoring: hit rate, miss rate, latency (p50, p95), eviction count, memory usage.
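The LRU eviction described in the MVP components can be sketched with an ordered map: reads move an entry to the "most recently used" end, and inserts past capacity evict from the other end. A minimal sketch (entry count as the bound; real caches bound by memory):

```python
from collections import OrderedDict

class LRUCache:
    """Bounded cache: evicts the least recently used entry when full."""
    def __init__(self, max_entries):
        self._max = max_entries
        self._data = OrderedDict()  # insertion/access order tracks recency

    def get(self, key):
        if key not in self._data:
            return None  # miss
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the LRU entry
```

Production caches combine this with TTL expiry and a max-memory limit per node, but the recency bookkeeping is the core of LRU.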
Why This Is a Correct MVP
- Single or small cluster, in-memory KV, get/set/delete, partitioning, eviction → enough to offload read load and speed up access; easy to reason about.
- Vertical scaling (bigger nodes) and a few nodes buy you time before you need replication and multi-region.
Stage 2 — Growth Phase (sharding, replication, load balancing)
You have a working MVP (single or small cluster, in-memory KV, partitioning, eviction). Now one or more of the triggers below are true.
What Triggers the Growth Phase?
- Need capacity: when key count or QPS exceeds what a few nodes can handle → add more nodes with clear sharding (consistent hashing); adding nodes doesn’t require a full rehash; virtual nodes (vnodes) for smoother distribution.
- Need availability: when single copy per key means node failure loses that shard’s data and overloads others → add replication (primary-replica or multi-copy); one node failure doesn’t lose the shard; failover (promote replica or use remaining copies).
- Many clients: when connection count or load is uneven → add connection pooling and load balancing; route get/set to correct node(s); health checks so failed nodes are skipped.
Components to Add (incrementally)
- Multiple nodes with sharding — scale to N nodes; consistent hashing so key → node is deterministic and adding/removing nodes only moves a fraction of keys (minimal disruption); virtual nodes (vnodes) for smoother distribution.
- Rehash cost minimized; document rebalance process; same key always to same primary (and replicas).
- Replication — each shard has 1 or more replicas; primary-replica: write to primary, replicate to replica(s); read from primary or replica (read-your-writes vs eventual). Multi-copy: write to N nodes (e.g. quorum); read from any; failover when node dies (promote replica or use remaining copies).
- Async replication = lower write latency, replica may lag; sync = stronger consistency, higher latency; choose per use case.
- Connection pooling and load balancing — clients use connection pool per node to avoid connection churn; load balancer or client-side partition map: route get/set to correct node(s); health checks so failed nodes are skipped.
- Partition map (e.g. consistent hashing); health checks so failed nodes are skipped; pool per node to limit connections.
- Monitoring — hit rate, latency (per node and global), evictions, replication lag (if async), node health; alert on failure and eviction spikes.
- Alert on node failure, eviction spikes, replication lag; runbooks for failover and rebalance.
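The client-side routing described above can be sketched as a partition map plus health checks: hash the key to a shard, prefer the primary, and fall back to a replica when the primary is unhealthy. A minimal sketch (the `RoutingClient` class and node names are hypothetical; modulo is used for brevity where a real client would use consistent hashing, and health state is assumed to come from an external checker):

```python
import hashlib

class RoutingClient:
    """Route reads to the right shard, skipping unhealthy nodes."""
    def __init__(self, shards, healthy):
        # shards: list of (primary, replica) node names; healthy: set of live nodes
        self._shards = shards
        self._healthy = healthy

    def _shard_for(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self._shards[h % len(self._shards)]  # modulo for brevity

    def node_for_read(self, key):
        primary, replica = self._shard_for(key)
        if primary in self._healthy:
            return primary
        if replica in self._healthy:
            return replica  # failover read; may be stale if replication is async
        raise RuntimeError("shard unavailable")

healthy = {"p1", "r1", "r2"}  # p2 is down
client = RoutingClient([("p1", "r1"), ("p2", "r2")], healthy)
```

In a real deployment, each node name would map to a connection pool, so routing a request also picks an existing connection rather than opening a new one.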
Growth Diagram
Stage 2: we add load balancing, more nodes with sharding, and replication (primary-replica) for availability.
Clients
   |
   v
Load balancer / client partition map (consistent hashing)
   |
   v
+--------+      +--------+   +--------+
|Primary1|----->|Replica1|   |Primary2|-----> Replica2 ...
| Shard 1|      | Shard 1|   | Shard 2|
+--------+      +--------+   +--------+
    ^                            ^
    |                            |
Consistent hashing      Read from primary or replica

Patterns and Concerns to Introduce (practical scaling)
- Replication consistency: async replication = lower write latency but replica may lag; sync replication = stronger consistency, higher latency; choose per use case.
- Failover: when primary dies, promote replica or redirect to replica; automatic or manual; clients may see brief errors or stale reads during failover.
- Rehash cost: when adding/removing nodes, consistent hashing minimizes keys moved; document rebalance process and impact.
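The rehash-cost point above can be made concrete by counting how many keys change owner when a node joins: with modulo almost everything moves, while with a consistent-hash ring only the new node's share moves. A small demonstration under assumed node names (`n1`–`n4`) and a toy key set:

```python
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def modulo_owner(key, nodes):
    return nodes[h(key) % len(nodes)]

def build_ring(nodes, vnodes=50):
    # Each node gets several ring points (vnodes) for smoother distribution.
    return sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

def ring_owner(key, ring):
    kh = h(key)
    for ph, node in ring:  # first point clockwise from the key's hash
        if ph >= kh:
            return node
    return ring[0][1]  # wrap around

keys = [f"key:{i}" for i in range(2000)]
old_nodes, new_nodes = ["n1", "n2", "n3"], ["n1", "n2", "n3", "n4"]
old_ring, new_ring = build_ring(old_nodes), build_ring(new_nodes)

moved_mod = sum(modulo_owner(k, old_nodes) != modulo_owner(k, new_nodes) for k in keys)
moved_ring = sum(ring_owner(k, old_ring) != ring_owner(k, new_ring) for k in keys)
# Expect moved_mod near 75% of keys, moved_ring near the new node's ~25% share.
```

The moved keys are exactly the shard data that must be rebalanced (or re-populated from the DB), which is why consistent hashing keeps node additions cheap.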
Still Avoid (common over-engineering here)
- Multi-region replication before you have geographic latency or DR requirements.
- Cache warming and invalidation fan-out before you have strict consistency or scale that demands it.
- Persistence (disk) until you need cache to survive restarts; many caches are “best effort” and repopulate from DB.
Stage 3 — Advanced Scale (multi-region, warming, invalidation)
You have sharding, replication, load balancing, and monitoring. Now you need multi-region, cache warming, or invalidation at scale.
What Triggers Advanced Scale?
- Need multi-region: when users or services are in multiple regions and need low-latency local reads → replicate cache to multiple regions (async or eventual); clients in each region read from local cache; define consistency and conflict handling.
- Cache warming: when deploy or failover leaves cache cold and causes thundering herd on backing store → proactively populate hot keys (from DB or source); batch get from DB and set in cache; reduce miss storm.
- Invalidation at scale: when source data changes and many keys must be invalidated (by pattern or tag) → fan-out invalidation (e.g. pub/sub) so all nodes drop relevant keys; avoid stale reads.
- Capacity and cost: plan for growth; capacity planning and cost control; optional tiered cache (L1 in-process, L2 distributed) or persistence (snapshot, AOF) for durability.
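The cache-warming trigger above boils down to a simple job: batch-read the hot keys from the backing store and pre-populate the cache before traffic arrives. A minimal sketch, with a hypothetical `DictCache` standing in for a real cache client and a dict standing in for the DB:

```python
class DictCache:
    """Stand-in cache client exposing set(key, value, ttl)."""
    def __init__(self):
        self.data = {}

    def set(self, key, value, ttl):
        self.data[key] = value  # ttl ignored in this stand-in

def warm_cache(cache, db, hot_keys, batch_size=100, ttl=300):
    """Preload hot keys in batches so a cold start doesn't cause a miss storm."""
    for i in range(0, len(hot_keys), batch_size):
        batch = hot_keys[i:i + batch_size]
        rows = {k: db[k] for k in batch if k in db}  # one batched DB read per chunk
        for key, value in rows.items():
            cache.set(key, value, ttl)
```

The hot-key list would come from config or access logs, as noted below; batching keeps the warming job itself from hammering the DB.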
Components to Add (incrementally)
- Multi-region replication — replicate cache data to multiple regions (async or eventual); clients in each region read from local cache; writes may go to primary region and replicate; define consistency (e.g. eventual) and conflict handling.
- Replication often eventual; document staleness; use TTL and invalidation to bound staleness where needed.
- Cache warming — on startup or after failover, preload hot keys (list from config or derive from access logs); batch get from DB and set in cache; reduce miss storm.
- Hot key list from config or access logs; batch load; single-flight or warming to mitigate thundering herd.
- Invalidation fan-out — when a key or pattern must be invalidated, publish event (e.g. pub/sub: “invalidate key X” or “invalidate pattern user:*”); all nodes (or relevant shards) receive and delete; optional version or tag so clients can invalidate by tag.
- Pub/sub (e.g. Redis Pub/Sub, Kafka) so all nodes receive invalidate; optional tag-based invalidation.
- Scale and capacity planning — monitor memory, QPS, and hit rate; add nodes or resize; optional persistence (snapshot, AOF) for durability; document SLO (latency, availability) and runbooks.
- SLO: latency (p50, p95), availability, hit rate; runbooks for failover and rebalance; optional persistence (e.g. RDB, AOF) for restart survival.
Advanced Diagram (conceptual)
Stage 3: cache cluster per region, pub/sub for invalidation fan-out, warming job, and optional cross-region replication.
Clients (multi-region)
   |
   v
Regional load balancer
   |
   v
Cache cluster (per region; e.g. Redis cluster)
  - sharded + replicated
  - optional persistence (e.g. RDB, AOF)
   |
   v
Pub/sub (invalidation fan-out; e.g. Redis Pub/Sub)
   |
   v
Warming job (hot keys from DB or log)
   |
   v
Cross-region replication (optional; eventual)

Patterns and Concerns at This Stage
- Cross-region consistency: replication is often eventual; document staleness; use TTL and invalidation to bound staleness where needed.
- Thundering herd: when many clients miss the same key, they all hit the DB; use single-flight (one load, others wait) or warming to mitigate.
- SLO-driven ops: latency (p50, p95), availability (uptime), hit rate; error budgets and on-call; runbooks for failover and rebalance.
Still Avoid (common over-engineering here)
- Strong consistency across regions before you have proven staleness or conflict issues.
- Tiered L1/L2 cache and custom persistence until a single-tier cache is the bottleneck.
- Over-complicated invalidation (e.g. fine-grained versioning everywhere) before you have strict consistency requirements.
Summarizing the Evolution
MVP delivers a distributed cache with single or small cluster, in-memory key-value store, get/set/delete, key partitioning (consistent hashing or modulo), and basic TTL or LRU eviction. That’s enough to offload read load and improve latency.
As you grow, you add more nodes with sharding, replication (primary-replica or multi-copy) for availability, connection pooling and load balancing, and monitoring. You keep latency low and availability high.
At advanced scale, you add multi-region replication, cache warming, and invalidation fan-out (pub/sub). You plan capacity and cost and optionally add persistence. You scale without over-building on day one.
This approach gives you:
- Start Simple — single or few nodes, get/set/delete, partition, evict; ship and learn.
- Scale Intentionally — add sharding and replication when capacity and availability demand it; add load balancing when clients scale.
- Add Complexity Only When Required — avoid multi-region and invalidation fan-out until consistency and geography justify them; keep latency and hit rate first.
Example: Session store behind a web app
Stage 1: Small Redis or Memcached cluster (2–3 nodes), consistent hashing, get/set/delete with TTL; eviction LRU. Stage 2: When traffic grows, add more nodes (sharding), primary-replica replication for availability, load balancer and connection pooling; monitoring and failover runbooks. Stage 3: When you have multiple regions or strict invalidation needs, add cache per region, pub/sub for invalidation fan-out, and optional warming after deploy; cross-region replication if needed (eventual).
Limits and confidence
This approach fits distributed key-value cache for offloading read load and improving latency; adjust if you need a database, a local in-process cache only, or strong consistency. Use it as a heuristic, not a spec.
What do I do next?
- Capture your requirements using the sections above (functional, quality, entities, access patterns).
- Map your current system to Stage 1, 2, or 3.
- If you’re in Growth or Advanced, pick one trigger that applies and add the corresponding components first.