Chat / Real-time — Designed in Stages
You don’t need to design for scale on day one.
This guide is a staged design playbook: it tells you what to build at MVP, Growth, and Advanced scale so you don’t over- or under-build.
Define what you need—send and receive messages, presence, maybe typing indicators—then build the simplest thing that works and evolve as concurrent users and channels grow.
When to use this: Use this when you’re designing or evolving a real-time chat or messaging system from scratch or from an existing MVP; when you need low-latency delivery, message ordering per channel, and presence. Skip or adapt if you need only async messaging (queues) or no real-time push (see other patterns).
Unlike designing for max scale on day one, this adds complexity only when triggers appear (e.g. connection capacity, fan-out, multi-region). Unlike ad-hoc growth with no structure, you get a clear sequence: MVP → Growth → Advanced. If you over-build (e.g. multi-region and replay before you need them), you pay in ops and consistency. If you under-invest in triggers (e.g. no message bus when a single server is the bottleneck), you hit capacity and reliability issues. The stages tie additions to triggers so you avoid both.
Here we use a real-time chat system as the running example: users, channels or rooms, messages, and presence. The same staged thinking applies to live notifications, collaborative cursors, or any system where low latency and ordered delivery matter.
Requirements and Constraints (no architecture yet)
Functional Requirements
- Send message to a channel or room
- List messages (paginated) for a channel
- Presence: who is online or in the room
- Optional: typing indicators, read receipts, message edit/delete
Quality Requirements
- Low latency for delivery (e.g. p95 < 200–500 ms for message receipt)
- Message ordering per channel (messages in a room appear in a consistent order to all participants)
- Delivery guarantees (at-least-once vs exactly-once; handling duplicates client-side or server-side)
- Expected scale: concurrent connections (CCU), messages per second (QPS), channels/rooms, retention
Key Entities
- User — identity, auth
- Session — a user’s connection(s); one user may have multiple devices
- Channel / Room — conversation container; access control (public, private, membership)
- Message — content, sender, channel, timestamp, optional metadata
- Presence — user online/offline or per-channel; optional typing state
Primary Use Cases and Access Patterns
- Send message (write path; must be fast and durable enough)
- List messages (read path; paginated by time or cursor; often read-heavy)
- Subscribe to new messages in a channel (real-time push)
- Presence updates (who joined/left; can be high churn)
- Typing indicators (ephemeral; low latency, may be best-effort)
Given this, start with the simplest MVP that delivers messages reliably and in order for a single server, then evolve as connection count and fan-out grow.
Stage 1 — MVP (simple, correct, not over-engineered)
Goal
Ship a working chat: users can join channels, send messages, see history, and see who’s online. One service, one store, minimal moving parts.
Components
- Client (web or mobile) — connects for real-time and calls API for history/auth
- Single API — REST or similar for auth, list messages, maybe create message (or messages go via real-time path)
- WebSocket or long-polling service — maintains persistent connections; receives new messages, broadcasts to subscribers in the same channel; may write messages to DB or hand off to API
- Primary DB — stores messages and optionally presence (or presence in memory for MVP); indexes by channel + time for paginated list
- Optional: simple in-memory or Redis cache for recent messages per channel to speed list and reduce DB load (e.g. last 50–100 messages per channel, short TTL)
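The components above can be sketched as a single in-process server: one object holds all subscriptions, so broadcast is a local loop and "durable write" is an append to the history store. This is a minimal sketch under assumptions; `ChatServer`, `outbox`, and `history` are illustrative names, not a real library API, and a real MVP would put connections behind WebSockets and history behind the primary DB.

```python
from collections import defaultdict

class ChatServer:
    """Stage 1: one process owns every connection, so no fan-out is needed."""

    def __init__(self):
        self.subscribers = defaultdict(set)   # channel -> subscribed client ids
        self.outbox = defaultdict(list)       # client id -> messages pushed to it
        self.history = defaultdict(list)      # channel -> stored messages ("the DB")

    def join(self, client_id, channel):
        self.subscribers[channel].add(client_id)

    def send(self, sender, channel, text):
        msg = {"sender": sender, "channel": channel, "text": text}
        self.history[channel].append(msg)          # write path: persist first
        for client in self.subscribers[channel]:   # then broadcast to local subscribers
            self.outbox[client].append(msg)

server = ChatServer()
server.join("alice", "general")
server.join("bob", "general")
server.send("alice", "general", "hi")
```

Because every subscriber lives in the same process, ordering falls out for free: messages are delivered in the order `send` is called.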
Minimal Diagram
Stage 1: clients connect to one real-time layer and one API; messages and presence in one DB (optional cache for recent messages).
Client A          Client B
    |                 |
    v                 v
+-----------------+
|  WebSocket or   |
|  long-polling   |
+-----------------+
        |
        v
+-----------------+
|   Single API    |
+-----------------+
        |
        v
Primary DB (messages + presence)
(+ optional cache for recent messages)

Still Avoid (common over-engineering here)
- Load balancer and multiple connection servers before you hit connection or CPU limits on a single server.
- Message bus (pub/sub) and dedicated real-time layer before you need to fan out to many servers or channels.
Patterns and Concerns (don’t overbuild)
- Auth: validate session or token on connect and on send; attach user id to messages
- Authorization: user can only send/list to channels they are allowed to access
- Ordering: single writer (e.g. API or single connection server) plus timestamp or sequence id per channel so clients can order messages
- Basic logging and metrics (connections, messages in/out, errors)
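The ordering pattern above can be shown in a few lines: the single writer stamps each message with a monotonically increasing sequence id per channel, so clients can sort (or detect gaps in) whatever order messages arrive over the wire. A minimal sketch; `counters` and `stamp` are hypothetical names for illustration.

```python
import itertools
from collections import defaultdict

# One counter per channel; the single writer is the only place ids are assigned.
counters = defaultdict(itertools.count)  # channel -> iterator of sequence ids

def stamp(channel, payload):
    """Attach the next per-channel sequence id to an outgoing message."""
    return {"channel": channel, "seq": next(counters[channel]), "payload": payload}

a = stamp("general", "first")
b = stamp("general", "second")
c = stamp("random", "each channel numbers independently")
```

Clients then order by `seq` rather than by arrival time, which stays correct even if the transport reorders pushes.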
Why This Is a Correct MVP
- One connection server, one DB → clear ordering, simple deployment, easy to reason about
- Vertical scaling (bigger instance, more connections per process) buys you time before splitting connection handling from storage
Stage 2 — Growth Phase (more users, more channels, bottlenecks appear)
You have a working MVP (one connection server, one DB, optional cache). Now one or more of the triggers below are true.
What Triggers the Growth Phase?
- Need to scale connections: when a single connection server runs out of capacity (CPU, memory, or file descriptors) → add a load balancer and multiple connection servers; sticky session so each user’s connection stays on one server.
- Need to fan out messages: when multiple servers hold subscribers for the same channel and you need to deliver each message to all of them → add a message bus (pub/sub); any server can publish, all servers with subscribers for that channel receive and push to clients.
- DB or cache bottleneck: when DB write or read throughput or hot channels become a bottleneck → add or tune cache for recent messages; reduce DB load for list and history.
Components to Add (incrementally)
- Load balancer — route new connections to connection servers (sticky session or connection state so a user’s connection stays on one server where possible)
- Sticky by session or user so presence and subscription state stay on one server; avoids cross-server coordination for a single user.
- Multiple connection servers — each holds a subset of connections; same WebSocket/long-poll logic, horizontally scaled.
- Each server subscribes to the channels its connected clients care about; publish once, all relevant servers receive.
- Message bus (pub/sub) — when a message is published to a channel, all connection servers that have subscribers for that channel receive it and push to their local clients; decouples “who wrote” from “who is listening”.
- Publisher (API or connection server) writes to DB and publishes to channel; subscribers push to their local connections; no single server needs global connection list.
- Cache — for recent messages and/or session data; reduces DB load for hot channels and fast history.
- e.g. last 50–100 messages per channel; TTL and invalidate on new message; cache-aside for list.
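The fan-out mechanics above can be sketched in-process: each connection server subscribes the bus to the channels its clients care about, and a single publish reaches every server holding a subscriber, which then pushes to its own local connections. `Bus` and `ConnectionServer` are illustrative stand-ins (a real deployment would use e.g. Redis pub/sub or a message broker), not a definitive implementation.

```python
from collections import defaultdict

class Bus:
    """Toy pub/sub: publish once, every subscribed server receives."""

    def __init__(self):
        self.subs = defaultdict(set)  # channel -> subscribed connection servers

    def subscribe(self, channel, server):
        self.subs[channel].add(server)

    def publish(self, channel, msg):
        for server in self.subs[channel]:
            server.deliver(channel, msg)

class ConnectionServer:
    def __init__(self, name, bus):
        self.name, self.bus = name, bus
        self.clients = defaultdict(set)   # channel -> local client ids
        self.pushed = []                  # (client, msg) pairs pushed to clients

    def connect(self, client_id, channel):
        self.clients[channel].add(client_id)
        self.bus.subscribe(channel, self)  # subscribe to channels clients care about

    def deliver(self, channel, msg):
        for client in self.clients[channel]:
            self.pushed.append((client, msg))

bus = Bus()
a, b = ConnectionServer("A", bus), ConnectionServer("B", bus)
a.connect("alice", "general")
b.connect("bob", "general")
bus.publish("general", "hi")  # reaches subscribers on both servers
```

Note that no server holds a global connection list: the bus only knows which servers subscribe to a channel, and each server only knows its own clients.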
Growth Diagram
Stage 2: we add load balancer, multiple connection servers, message bus (pub/sub), and cache.
             +------------------+
Clients ---> |  Load Balancer   |  (sticky by session or user)
             +------------------+
               |        |        |
       +-------+        |        +--------+
       v                v                 v
Connection Server A  Connection Server B  Connection Server C
       |                |                 |
       +--------+-------+--------+--------+
                        |
                        v
                +---------------+
                |  Message Bus  |
                |   (pub/sub)   |
                +---------------+
                        |
                        v
       Primary DB + Cache (recent messages / sessions)

Patterns and Concerns to Introduce (practical scaling)
- Sticky sessions or connection state: so that a user’s connection is pinned to one server; presence and subscription state stay local; when using pub/sub, subscribe each connection server to the channels its connections care about
- Fan-out via pub/sub: publisher writes message (or API writes to DB + publishes event); subscribers (connection servers) receive and push to clients; no single server needs to know all connections for a channel
- Cache-aside for recent messages: reduce DB reads for “list last N messages”; TTL and invalidation when new messages arrive
- Monitoring: connection count per server, publish/subscribe lag, DB and cache latency
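The cache-aside pattern for "list last N messages" can be sketched concretely: read from the cache, fall back to the DB on a miss, and invalidate on every write so the next read repopulates. `db`, `cache`, and `stats` here are plain dicts standing in for the primary store and e.g. Redis; an assumption for illustration, and a real cache would also set a TTL.

```python
N = 50
db = {"general": [f"msg-{i}" for i in range(200)]}  # channel -> full history
cache = {}                                          # channel -> last N messages
stats = {"hits": 0, "misses": 0}

def list_recent(channel):
    """Cache-aside read: serve from cache if present, else read DB and populate."""
    if channel in cache:
        stats["hits"] += 1
        return cache[channel]
    stats["misses"] += 1
    recent = db.get(channel, [])[-N:]   # DB range read: last N by channel + time
    cache[channel] = recent             # populate on miss (add a TTL in practice)
    return recent

def send(channel, msg):
    db.setdefault(channel, []).append(msg)
    cache.pop(channel, None)            # invalidate so the next read is fresh

first = list_recent("general")    # miss: goes to the DB
second = list_recent("general")   # hit: served from cache
send("general", "new message")
third = list_recent("general")    # miss again after invalidation
```

Invalidate-on-write keeps the cache simple; the alternative (updating the cached list in place) saves a DB read but risks divergence if the update path has bugs.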
Still Avoid (common over-engineering here)
- Multi-region active-active real-time before you need it.
- Durable queue per channel for replay before product needs it.
- Splitting into many microservices (e.g. separate “presence service”) until boundaries are clear.
Stage 3 — Advanced Scale (very high CCU, global, retention/replay)
You have load balancer, multiple connection servers, message bus, and cache. Now you need retention/replay, multi-region, or isolation of presence at scale.
What Triggers Advanced Scale?
- Need retention and replay: when message storage or history query becomes the bottleneck, or compliance / product needs “load last 7 days” on join → add a message log or store that supports range read by channel + time; new joiners fetch history from this store.
- Need low latency in multiple regions: when users in EU and US (or more regions) need local connection endpoints → add multi-region connection routing; users connect to a nearby region; cross-region pub/sub or replication so messages reach subscribers in other regions.
- Presence or typing at very large scale: when presence or typing updates need isolation from the message path → consider dedicated store or eventual consistency; avoid presence blocking message delivery.
Components to Add (incrementally)
- Dedicated real-time layer — connection servers plus a well-defined pub/sub or message backbone; clear separation between “ingest message,” “store message,” and “deliver to connected clients”.
- Ingest → store (DB/log) and publish to channel; deliver → subscribe and push to local connections.
- Message retention and replay — messages written to a log or store that supports range read by channel + time; new joiners or reconnects can fetch history from this store; may be same DB with partitioning or a separate log store.
- Range read by channel + time; define retention (e.g. 7 days, 90 days); idempotency or dedup on replay.
- Multi-region connection routing — users connect to a nearby region; cross-region pub/sub or replication so a message in one region is delivered to subscribers in another; or regional channels with explicit routing.
- Geo DNS or routing to nearest region; cross-region pub/sub or replication; define ordering if writes can happen in multiple regions.
- Possibly separate history store — e.g. cold storage or search index for old messages; hot path uses recent-message store or cache.
- Hot path: recent store or cache; cold: search index or archive for “load older than X.”
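The retention/replay components above boil down to one access pattern: append per channel with a timestamp, then "load last 7 days on join" becomes a range read by channel + time. A minimal sketch under assumptions; `MessageLog` is a hypothetical in-memory stand-in, where a binary search over sorted timestamps mimics the range scan a partitioned table or log store would do.

```python
import bisect

class MessageLog:
    """Append-only per-channel log with range reads by timestamp."""

    def __init__(self):
        self.log = {}  # channel -> list of (ts, msg_id, payload), ts ascending

    def append(self, channel, ts, msg_id, payload):
        self.log.setdefault(channel, []).append((ts, msg_id, payload))

    def range_read(self, channel, since_ts):
        """Return all entries at or after since_ts (the 'last 7 days' query)."""
        entries = self.log.get(channel, [])
        timestamps = [e[0] for e in entries]
        start = bisect.bisect_left(timestamps, since_ts)  # find range-scan start
        return entries[start:]

log = MessageLog()
for ts in (100, 200, 300):
    log.append("general", ts, f"id-{ts}", f"payload at {ts}")

# A new joiner or reconnecting client fetches history from a cutoff:
replay = log.range_read("general", since_ts=200)
```

Keeping `msg_id` in each entry matters: it is what lets clients deduplicate when a reconnect replays messages they already received.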
Advanced Diagram (conceptual)
Stage 3: multi-region connection routing, message backbone, retention/replay store, and optional separate history.
        Global DNS / Geo routing
                  |
                  v
         +------------------+
         |  Load Balancer   |
         +------------------+
           |       |       |
    Region A    Region B    Region C
           |       |       |
           v       v       v
     Connection  Connection  Connection
      servers     servers     servers
           |       |       |
           +-------+-------+
                   |
                   v
  Message backbone (pub/sub, possibly multi-region)
                   |
         +---------+---------+
         v         v         v
   Message log  Primary DB   Cache (recent)
   (retention)  (metadata,   (per channel)
                 presence)

Patterns and Concerns at This Stage
- Ordering across regions: if messages can be written in multiple regions, define ordering (e.g. single-writer per channel, or vector clocks / timestamps + merge); avoid conflicting order guarantees
- Replay and idempotency: clients may see the same message after reconnect; idempotency keys or client-side dedup
- Presence at scale: may move to a dedicated store or eventual consistency; avoid presence updates blocking message path
- SLO-driven ops: connection availability, message delivery latency, history read latency; error budgets and on-call
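The replay-and-idempotency pattern above is mostly client-side bookkeeping: after a reconnect the server may resend messages from the last checkpoint, so the client keeps the set of message ids it has already applied and drops repeats. A sketch with illustrative names (`ChatClient`, `seen`, `timeline`); message ids double as idempotency keys.

```python
class ChatClient:
    """Deduplicates replayed messages by id so reconnects are safe."""

    def __init__(self):
        self.seen = set()     # message ids already applied
        self.timeline = []    # payloads in the order first seen

    def receive(self, msg_id, payload):
        if msg_id in self.seen:
            return False      # duplicate after reconnect: drop silently
        self.seen.add(msg_id)
        self.timeline.append(payload)
        return True

client = ChatClient()
client.receive("m1", "hello")
client.receive("m2", "world")
# Reconnect: the server replays from the last checkpoint, resending m2.
accepted = client.receive("m2", "world")
```

This is what makes at-least-once delivery acceptable on the server side: duplicates are cheap to produce and cheap to discard, whereas exactly-once delivery across regions is expensive to guarantee.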
Still Avoid (common over-engineering here)
- Full global active-active with strong consistency before you have latency or compliance requirements.
- Separate microservices for presence, typing, and message delivery until boundaries and bottlenecks are clear.
- Custom message ordering (e.g. vector clocks) across regions until you have proven ordering issues.
Summarizing the Evolution
MVP delivers messages in order with one connection server and one DB (and optional cache). That’s enough to ship and learn.
As you grow, the first bottlenecks are usually connection capacity and fan-out—so you add a load balancer, multiple connection servers, and a message bus (pub/sub) so any server can broadcast to subscribers in a channel. Cache for recent messages keeps the DB from becoming the bottleneck on hot channels.
At very high scale, you add a dedicated real-time layer with clear retention and replay, multi-region connection routing, and possibly a separate history store. You keep the message path simple and add complexity only where latency, availability, or compliance require it.
This approach gives you:
- Start Simple — one connection server, one DB, clear ordering, ship and learn.
- Scale Intentionally — add connection servers and pub/sub when connection count or fan-out justify it; add retention and multi-region when product and SLOs demand it.
- Add Complexity Only When Required — avoid separate presence/message/history services until boundaries and bottlenecks are clear.
Example: Team chat product
Stage 1: Single WebSocket server, single API, one DB for messages and presence; optional Redis cache for last 50 messages per channel.
Stage 2: When connection count grows, add load balancer (sticky), multiple WebSocket servers, and a message bus (pub/sub); cache for recent messages.
Stage 3: When you need “load last 7 days” on join or EU/US regions, add message log for retention/replay and multi-region connection routing; keep message path simple.
Limits and confidence
This approach fits real-time chat and messaging where low latency and per-channel ordering matter; adjust if you need only async queues or no real-time push. Use it as a heuristic, not a spec.
What do I do next?
- Capture your requirements using the sections above (functional, quality, entities, access patterns).
- Map your current system to Stage 1, 2, or 3.
- If you’re in Growth or Advanced, pick one trigger that applies and add the corresponding components first.