Collaborative Document Editing — Designed in Stages
You don’t need to design for scale on day one.
This guide is a staged design playbook: it tells you what to build at MVP, Growth, and Advanced scale so you don’t over- or under-build.
Define what you need—open a doc, apply edits, sync edits to others, show presence or cursors, and optionally keep version history—then build the simplest thing that works and evolve as concurrent editors and latency requirements grow.
When to use this: Use this when you’re designing or evolving a collaborative document editing system (Google Docs–style) from scratch or from an existing MVP; when you need real-time sync, conflict resolution (OT or CRDT), and optional presence. Skip or adapt if you need only async comments or no concurrent editing (single-writer flows).
Unlike designing for max scale on day one, this adds complexity only when triggers appear (e.g. connection scale, operation log, multi-region). Unlike ad-hoc growth with no structure, you get a clear sequence: MVP → Growth → Advanced. If you over-build (e.g. dedicated sync service and multi-region before you need them), you pay in ops and consistency. If you under-invest in triggers (e.g. no real-time layer when single-server broadcast is the bottleneck), you hit latency and scale limits. The stages tie additions to triggers so you avoid both.
Here we use collaborative document editing (Google Docs–style) as the running example: documents, users/sessions, operations/edits, cursors/selections, and versions/snapshots. The same staged thinking applies to any real-time collaborative editor: conflict resolution (OT or CRDT), low latency, and eventual consistency of doc state are central.
Requirements and Constraints (no architecture yet)
Functional Requirements
- Open doc — load document state (or operation log to reconstruct); auth and permissions (who can view/edit).
- Apply edit — user types or changes content; edit is applied locally and sent to server; other collaborators receive and apply it.
- Sync edits to others — edits are propagated so everyone sees a consistent document; ordering and conflict resolution matter.
- Presence / cursors — show who is in the doc and where their cursor or selection is; optional but expected in collaborative UIs.
- Version history — optional snapshots or replay of operations for “restore to previous version” or audit.
Quality Requirements
- Low latency — edits should appear on other clients quickly (e.g. sub-second); round-trip and processing add delay; optimize for perceived responsiveness.
- Eventual consistency — all clients converge to the same document state; no permanent divergence; conflict resolution (e.g. OT or CRDT) ensures no lost updates.
- No lost updates — every acknowledged edit is reflected in final state; retries and ordering (e.g. operation log, sequence numbers) prevent drops or duplicates.
- Expected scale — doc size, number of concurrent editors per doc, number of docs, geographic distribution.
Key Entities
- Document — the shared artifact; has current state (e.g. text or structured content) and/or an operation log; doc_id, permissions.
- User / Session — user identity and session (e.g. connection); used for auth and presence.
- Operation / Edit — a single change (e.g. insert “hello” at index 5, delete range 3–7); has position, content, and ordering (e.g. version or sequence).
- Cursor / Selection — ephemeral; user’s cursor position or selection in the doc; used for presence; often not persisted long-term.
- Version / Snapshot — optional; saved state or checkpoint at a point in time for history or restore.
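The Operation entity above can be sketched as a small value type plus an apply function. This is a minimal sketch: the `Operation` class, its field names, and `apply_op` are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class Operation:
    """A single edit: insert text at a position, or delete a range."""
    op_id: str          # client-generated id, later useful for idempotency
    kind: str           # "insert" or "delete"
    pos: int            # index in the document
    content: str = ""   # text to insert (empty for delete)
    length: int = 0     # number of characters to delete (0 for insert)

def apply_op(doc: str, op: Operation) -> str:
    """Apply one operation to a plain-text document."""
    if op.kind == "insert":
        return doc[:op.pos] + op.content + doc[op.pos:]
    if op.kind == "delete":
        return doc[:op.pos] + doc[op.pos + op.length:]
    raise ValueError(f"unknown op kind: {op.kind}")

doc = apply_op("world", Operation("op-1", "insert", 0, "hello "))  # "hello world"
```

A structured-content editor would use richer positions (paths into a tree), but the insert/delete shape and the ordering fields carry over.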
Primary Use Cases and Access Patterns
- Open doc — read path; load doc state or replay operation log; check permissions.
- Apply edit — write path; submit operation; server orders it (e.g. with sequence), applies conflict resolution, persists, and broadcasts to other clients.
- Sync edits — read path for other clients; receive operations (push via WebSocket or pull); apply locally with same conflict-resolution rules.
- Presence — read/write; send cursor/selection updates; receive others’ presence; often separate channel or message type from edits.
- Version history — read path; list snapshots or replay log up to a point; restore creates new state or branch.
Given this, start with the simplest MVP: one API, one primary DB (doc state or operation log), single server, broadcast or simple sync of edits in API server RAM (e.g. last-write-wins or minimal OT), and optional presence in API server RAM or DB—then add a real-time layer, operation log or event stream for ordering, and dedicated sync and presence at scale.
Stage 1 — MVP (simple, correct, not over-engineered)
Goal
Ship working collaborative editing: users open a doc, apply edits, and see each other’s changes with minimal delay. One API, one DB, single server; simple sync or minimal conflict resolution; optional presence.
Components
- API — REST or similar; auth; open doc (return doc state or operation log); submit edit (operation); optional: get presence, heartbeat. Single server so ordering is easier (e.g. single writer or sequence generator).
- Primary DB — store doc state (e.g. current content) and/or operation log (append-only list of operations with order); indexes by doc_id. Doc state is the source of truth for “open doc”; log supports replay and optional history.
- Broadcast or simple sync — when an edit is received, persist it, then push to other connected clients. With a single server, broadcast lives in API server RAM (e.g. pub/sub per doc in process); or use poll. No separate message broker at MVP.
- Conflict resolution — keep it minimal: last-write-wins for tiny teams, or a minimal Operational Transformation (OT) or CRDT for text so concurrent edits don’t corrupt the doc. Choose one approach and implement it consistently (e.g. server assigns sequence per doc; clients apply in order).
- Optional presence — store or broadcast cursor/selection in API server RAM (e.g. per-doc map) or in DB with short TTL; clients send heartbeat and receive others’ presence; no need for durable presence at MVP.
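The persist-then-broadcast flow above can be sketched as a per-doc subscriber map held in the single API process. This is a sketch under MVP assumptions (one server, in-RAM fan-out); `subscribers`, `op_log`, and `submit_edit` are hypothetical names.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Per-doc state held in API server RAM (single-server MVP assumption).
subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
op_log: Dict[str, List[dict]] = defaultdict(list)

def persist(doc_id: str, op: dict) -> None:
    # Stand-in for a DB write; here an append-only in-memory log.
    op_log[doc_id].append(op)

def subscribe(doc_id: str, callback: Callable[[dict], None]) -> None:
    # One callback per connected client viewing this doc.
    subscribers[doc_id].append(callback)

def submit_edit(doc_id: str, op: dict, sender: Callable[[dict], None]) -> None:
    # Durable first, then fan out to every other connected client.
    persist(doc_id, op)
    for cb in subscribers[doc_id]:
        if cb is not sender:  # don't echo the edit back to its author
            cb(op)
```

The same shape later moves behind a WebSocket layer or pub/sub; only the transport changes, not the persist-then-fan-out order.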
Minimal Diagram
Stage 1: clients talk to one API; doc state and optional op log in one DB; broadcast in API server RAM for edits and optional presence.
Client A        Client B
   |               |
   v               v
+----------------------+
| API (single server)  |
+----------------------+
   |               |
   v               v
Primary DB (doc state + optional op log; e.g. Postgres)
   |
   v
Broadcast in API server RAM (edits + optional presence)
   ^               ^
   |               |
Client A        Client B
Still Avoid (common over-engineering here)
- Real-time layer (WebSocket) and operation log before you hit connection or latency limits on single-server broadcast.
- Dedicated sync service and multi-region doc routing before you have scale or SLO requirements for them.
Patterns and Concerns (don’t overbuild)
- Ordering: assign a sequence or version per operation (e.g. server-side counter per doc); clients apply operations in order; single server keeps ordering simple.
- Idempotency: clients may retry “submit edit”; use idempotency key or operation id so the same op is not applied twice.
- Basic monitoring: open-doc latency, edit round-trip time, presence freshness, error rate.
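The ordering and idempotency patterns above can be sketched together: a server-side counter per doc assigns sequence numbers, and a map keyed by operation id makes client retries safe. The names (`accept_edit`, `applied_ops`) are illustrative.

```python
import itertools
from collections import defaultdict
from typing import Dict

# Per-doc monotonic counters; a new count() starts at 0 for each doc.
seq_counters = defaultdict(itertools.count)
# Maps "doc_id:op_id" to the sequence assigned on first submission.
applied_ops: Dict[str, int] = {}

def accept_edit(doc_id: str, op_id: str) -> int:
    """Assign a per-doc sequence exactly once per operation id.

    A retried submit with the same op_id gets the originally assigned
    sequence back instead of being applied a second time.
    """
    key = f"{doc_id}:{op_id}"
    if key in applied_ops:
        return applied_ops[key]  # duplicate submit: idempotent no-op
    seq = next(seq_counters[doc_id])
    applied_ops[key] = seq
    return seq
```

With a single server, this counter is the whole ordering story; at Growth scale the same role moves into the operation log or event stream.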
Why This Is a Correct MVP
- One API, one DB, single server, broadcast or simple sync, minimal OT or CRDT → enough to ship a small-team collaborative doc; easy to reason about.
- Vertical scaling (bigger server) and a single ordering point buy you time before you need a dedicated sync service and multi-region.
Stage 2 — Growth Phase (real-time layer, operation log, scale)
You have a working MVP (one API, one DB, single-server broadcast, minimal conflict resolution). Now one or more of the triggers below are true.
What Triggers the Growth Phase?
- Need lower latency and scale: when single-server broadcast doesn’t scale (many docs, many connections) or round-trip is too high → add a dedicated real-time layer (WebSocket) for edit and presence fan-out; API stays for auth and persistence.
- Need ordering and replay: when operation ordering or “open doc” from log matters → add an operation log or event stream (append-only per doc); server-authoritative ordering; conflict-resolution (OT/CRDT) can run in the API server, a dedicated service, or in clients (server only orders).
- Doc and presence scale: when many concurrent editors per doc or many docs → scale connections (e.g. connection manager, sticky sessions or shared pub/sub); scale doc access (e.g. partition by doc_id).
Components to Add (incrementally)
- Real-time layer (WebSocket or similar) — clients maintain a long-lived connection; server pushes edits and presence; reduces polling and round-trip; scale connections (e.g. connection manager, horizontal API servers with sticky sessions or shared pub/sub).
- Sticky session or pub/sub so edits for a doc reach all subscribers; API for auth and open doc.
- Operation log or event stream — append-only log of operations per doc (or event stream with doc partition); ordering and durability; clients or server replay for “open doc” and for conflict resolution. Where it’s kept: in the same Primary DB as an append-only table, or in a dedicated log/stream store (e.g. Kafka topic per doc or partition key).
- Ordering source for conflict resolution; replay for “open doc” and optional history; retention and snapshots for fast load.
- Conflict-resolution (OT/CRDT) — run in API server, in a dedicated service, or in clients. In API server or dedicated service: receives ops, transforms/merges, emits ordered ops. In clients: transformation in clients; server only orders and stores. Ensure no lost updates and convergent state.
- Choose one algorithm (OT or CRDT) and use it consistently; server assigns sequence; ensure convergent state.
- Doc and presence scale — doc access by doc_id (shard or partition if needed); presence can be in-memory per server with pub/sub across servers, or short-lived in a cache; scale out API and real-time layer.
- Presence: ephemeral, fan-out via real-time layer; optional cache with TTL for “who was recently in doc”; avoid storing every cursor move in primary DB.
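As a minimal illustration of the OT option above, here is the classic position transform for two concurrent inserts, assuming the server has ordered one of them first. This is a sketch of a single transform case (insert vs. insert), not a full OT implementation; ties are broken toward the op the server ordered first.

```python
def insert(doc: str, pos: int, text: str) -> str:
    return doc[:pos] + text + doc[pos:]

def transform(pos_a: int, pos_b: int, len_b: int) -> int:
    """Shift op A's insert position past a concurrent insert B
    that the server ordered first (a tie at the same position
    also yields to B)."""
    return pos_a + len_b if pos_b <= pos_a else pos_a

# Two clients edit "ac" concurrently:
#   A inserts "b" at position 1, B inserts "x" at position 0.
# The server orders B first, so A's position is transformed
# before A's op is applied.
doc = "ac"
doc = insert(doc, 0, "x")                    # B applied first -> "xac"
doc = insert(doc, transform(1, 0, 1), "b")   # A shifted to 2 -> "xabc"
```

A production editor needs the full case matrix (insert/delete, ranges, client-side transformation of unacknowledged ops), which is exactly why the text above says to pick a well-understood algorithm and apply it consistently.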
Growth Diagram
Stage 2: we add real-time layer (WebSocket), operation log for ordering/replay, and scale doc and presence.
Client A   Client B   Client C
   |          |          |
   v          v          v
+----------------------+
| Real-time layer      |  (WebSocket; edit + presence fan-out)
| (scale connections)  |
+----------------------+
   |          |
   v          v
+----------------------+
| API (auth, open doc) |
+----------------------+
   |
   v
Operation log / event stream (ordering, replay)
  stored in Primary DB (table) or dedicated log (e.g. Kafka)
   |
   v
Primary DB (doc state, snapshots; e.g. Postgres)
   |
   v
Conflict-resolution (OT/CRDT) — in API server or dedicated service,
  or in clients (server only orders)
Patterns and Concerns to Introduce (practical scaling)
- Connection scaling: many WebSocket connections; use connection manager, load balancer with sticky session or pub/sub so edits for a doc reach all subscribers.
- Presence at scale: presence is ephemeral; fan-out via same real-time layer; optional presence store (e.g. cache with TTL) for “who was recently in doc”; avoid storing every cursor move in primary DB.
- Replay and open doc: “open doc” = load latest doc state from DB, or replay operation log from a snapshot; balance replay cost vs storage (snapshots + incremental log).
- Monitoring: connection count, edit latency (submit to broadcast), presence latency, op log growth, conflict-resolution errors.
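The replay path for “open doc” above can be sketched as snapshot plus incremental log: load the latest snapshot, then apply only the operations sequenced after it. The tuple layout of log entries here is an assumption for illustration.

```python
from typing import List, Tuple

# Each log entry: (seq, kind, pos, payload).
# For "insert", payload is the text; for "delete", payload is a range length.
LogEntry = Tuple[int, str, int, object]

def open_doc(snapshot: str, snapshot_seq: int, log: List[LogEntry]) -> str:
    """Rebuild a doc from a snapshot plus the ops sequenced after it."""
    doc = snapshot
    for seq, kind, pos, payload in log:
        if seq <= snapshot_seq:
            continue  # already folded into the snapshot
        if kind == "insert":
            doc = doc[:pos] + str(payload) + doc[pos:]
        else:  # "delete"
            doc = doc[:pos] + doc[pos + int(payload):]
    return doc
```

The snapshot interval is the tuning knob: frequent snapshots make open-doc fast but cost storage; sparse snapshots mean longer replays.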
Still Avoid (common over-engineering here)
- Multi-region doc routing and global replication before you have clear latency or availability requirements.
- Separate presence and history services before doc and presence load justify the split.
- Over-complicated OT/CRDT; use a well-understood algorithm and keep it consistent across clients and server.
Stage 3 — Advanced Scale (dedicated sync, multi-region, history)
You have a real-time layer, operation log, and scaled doc and presence. Now you need dedicated sync, multi-region, or first-class history.
What Triggers Advanced Scale?
- Need dedicated sync: when many concurrent editors per doc or many docs globally and the real-time layer or API is the bottleneck → add a dedicated sync service that owns edit ingestion, ordering, conflict resolution, and fan-out; API and DB own auth and durable state.
- Operation log first-class: when replay, versioning, and history are product requirements → make the log the source of truth for ordering and replay; snapshots plus log for fast “open doc” and version history; define retention and compaction.
- Multi-region or separate services: when latency and availability SLOs require multi-region doc routing, or presence and history justify separate services for scale and isolation → route users to nearest region; optionally separate presence (ephemeral) and history (replay, versions) from the hot path.
Components to Add (incrementally)
- Dedicated sync service — separate service that owns edit ingestion, ordering, conflict resolution, and fan-out; API and DB own auth and durable state; sync service subscribes to doc log or writes to it; scales independently.
- Sync subscribes to or writes operation log; API for auth and open doc; clear separation of hot path from persistence.
- Operation log as first-class store — log is the source of truth for ordering and replay; snapshots (e.g. periodic or on demand) plus log for fast “open doc” and version history; retention and compaction policy.
- Snapshots (e.g. daily or on demand) + log; retention and compaction to balance storage and replay cost.
- Multi-region doc routing — route user to nearest or preferred region for low latency; doc may be replicated or primary in one region; sync traffic stays within region or replicates with clear consistency model.
- Define ordering if writable in multiple regions (e.g. single primary per doc); read replicas for open-doc where appropriate.
- Separate presence and history — presence service (ephemeral, low latency) and history/version service (replay, snapshots, restore) can be separate from the hot path; scale and optimize each for its workload.
- Presence: ephemeral, low latency; history: replay, snapshots, restore; avoid blocking edit path.
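The ephemeral-presence idea above can be sketched as a heartbeat map with a TTL check. In production this would live in a cache like Redis (keys with TTL) or per-server memory with pub/sub; the `PRESENCE_TTL` value and function names are illustrative.

```python
from typing import Dict, Tuple

PRESENCE_TTL = 30.0  # seconds; assumed threshold for "still in the doc"

# (doc_id, user_id) -> (cursor position, timestamp of last heartbeat)
presence: Dict[Tuple[str, str], Tuple[int, float]] = {}

def heartbeat(doc_id: str, user_id: str, cursor_pos: int, now: float) -> None:
    # Overwrite is the point: only the latest cursor position matters,
    # so every cursor move is NOT a durable write to the primary DB.
    presence[(doc_id, user_id)] = (cursor_pos, now)

def who_is_here(doc_id: str, now: float) -> Dict[str, int]:
    """Users whose last heartbeat for this doc is within the TTL."""
    return {
        user: cursor
        for (d, user), (cursor, ts) in presence.items()
        if d == doc_id and now - ts <= PRESENCE_TTL
    }
```

Stale entries simply age out of the read path; a periodic sweep (or the cache's own expiry) reclaims memory.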
Advanced Diagram (conceptual)
Stage 3: dedicated sync per region, operation log as first-class store, optional separate presence and history.
Clients (region A)      Clients (region B)
        |                       |
        v                       v
+------------------+    +------------------+
| Sync service (A) |    | Sync service (B) |
| edit + presence  |    | edit + presence  |
+------------------+    +------------------+
        |                       |
        v                       v
Operation log (partitioned / replicated by doc; e.g. Kafka or DB)
        |
        v
Primary DB / doc store (doc state, snapshots; e.g. Postgres)
        |
        v
Presence (ephemeral; e.g. Redis or in-memory)
History (replay, versions; e.g. DB or object store)
Patterns and Concerns at This Stage
- Consistency across regions: if doc is writable in multiple regions, define ordering (e.g. single primary vs multi-primary) and conflict resolution; often single primary per doc with read replicas for open-doc.
- Latency for many editors: sync service must fan out to many connections with low latency; optimize in-memory path and avoid unnecessary DB reads on the hot path.
- Version history and restore: snapshots + operation log enable “restore to date” or “list versions”; retention and cost (storage) vs product needs.
- SLO-driven ops: edit latency (submit to visible on others), presence latency, open-doc time, sync service availability; error budgets and on-call.
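The single-primary-per-doc model above can be sketched as deterministic routing: every region agrees on one write primary per doc, while open-doc reads go to the nearest replica. The region names and helper functions are hypothetical; a real system would use a routing table that supports rebalancing rather than a bare hash.

```python
import hashlib
from typing import List

REGIONS: List[str] = ["us-east", "eu-west"]  # illustrative region names

def primary_region(doc_id: str) -> str:
    """Deterministically pin each doc's write primary to one region,
    so ordering for that doc has a single authority."""
    h = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16)
    return REGIONS[h % len(REGIONS)]

def route_write(doc_id: str, user_region: str) -> str:
    # Writes always go to the doc's primary, even if it is remote;
    # this trades some write latency for a simple consistency model.
    return primary_region(doc_id)

def route_read(doc_id: str, user_region: str) -> str:
    # Open-doc reads are served locally from a replica (possibly
    # slightly stale, caught up via the replicated operation log).
    return user_region
```

This is the “single primary per doc with read replicas” option named above; multi-primary writes would instead require cross-region conflict resolution.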
Still Avoid (common over-engineering here)
- Multi-primary doc writes across regions before you have proven consistency and conflict-resolution needs.
- Custom or experimental OT/CRDT before a well-understood algorithm is the bottleneck.
- Separate presence and history services until scale and SLOs clearly justify the split.
Summarizing the Evolution
MVP delivers collaborative editing with one API, one DB, single server, broadcast or simple sync in API server RAM, and minimal conflict resolution (e.g. OT or CRDT). Optional presence in API server RAM or DB. That’s enough to ship for small teams.
As you grow, you add a real-time layer (WebSocket) for edit and presence fan-out, an operation log or event stream for ordering and replay, and explicit conflict-resolution (in API server, dedicated service, or clients). You scale doc and presence access without over-building multi-region or separate history on day one.
At advanced scale, you add a dedicated sync service, operation log as first-class store (replay, versioning), and optionally multi-region doc routing and separate presence and history services. You optimize for latency and scale for many concurrent editors.
This approach gives you:
- Start Simple — one API, one DB, single server, broadcast or simple sync in API server RAM, minimal OT/CRDT; ship and learn.
- Scale Intentionally — add real-time layer and operation log when latency and scale demand it; add dedicated sync when connections and docs grow.
- Add Complexity Only When Required — avoid multi-region and separate presence/history until SLOs and product justify them; keep the edit path consistent and convergent first.
Example: Google Docs–style editor
- Stage 1: Single API, one DB (doc state + optional op log), broadcast in API server RAM for edits and optional presence; minimal OT or CRDT; small team.
- Stage 2: When connection count and latency demand it, add WebSocket real-time layer, operation log for ordering and replay, scale connections (sticky or pub/sub).
- Stage 3: When many concurrent editors or global users, add dedicated sync service, operation log as first-class store (snapshots + retention), and optionally multi-region routing and separate presence/history services.
Limits and confidence
This approach fits real-time collaborative editing where conflict resolution and eventual consistency matter; adjust if you need only async comments or single-writer flows. Use it as a heuristic, not a spec.
What do I do next?
- Capture your requirements using the sections above (functional, quality, entities, access patterns).
- Map your current system to Stage 1, 2, or 3.
- If you’re in Growth or Advanced, pick one trigger that applies and add the corresponding components first.