Streaming / Events — Designed in Stages

First published by Atif Alam

You don’t need to design for scale on day one.

Define what you need—producing events, consuming them (possibly with multiple consumers), retaining them, and maybe replaying them—then build the simplest thing that works and evolve as throughput and consumer count grow.

Here we use an event streaming or log-based system as the running example: topics, events, and consumer groups. The same staged thinking applies to event-driven architectures, activity logs, or any system where ordering, retention, and multiple consumers are central.

Requirements and Constraints (no architecture yet)

Functional Requirements

  • Produce event — publishers send events to a topic (or queue); events are persisted and ordered within a partition or stream.
  • Consume — one or more consumers read events (in order); each consumer or consumer group may have its own position (offset); multiple consumers can read the same stream independently.
  • Retention — how long events are kept (time or size); affects storage and replay capability.
  • Replay — consumers can reset to an earlier offset (or timestamp) and re-read events; required for reprocessing, backfill, or new consumers.

Quality Requirements

  • Ordering — per partition or per key; events in the same partition are read in write order.
  • Durability — events are persisted before ack to producer; replication (in log systems) for fault tolerance.
  • Throughput — events per second (EPS) for both produce and consume; partition count and consumer parallelism drive scale.
  • Latency — produce latency (time to ack) and consume latency (end-to-end from produce to process); trade-off with batching.

Key Entities

  • Event — a record (key, value, timestamp, optional headers); the unit of data in the stream.
  • Topic — logical stream or channel; divided into partitions (or similar) for parallelism and ordering scope.
  • Consumer group — set of consumers that share the workload (each partition consumed by one member); each group has its own offset per partition.
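A minimal sketch of these entities as data types (the names and fields here are illustrative, not any particular client library's):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Event:
    key: Optional[str]    # used for partition assignment and per-key ordering
    value: bytes          # payload
    timestamp: float      # event or append time
    headers: dict = field(default_factory=dict)  # optional metadata

@dataclass
class Topic:
    name: str
    num_partitions: int   # ordering is guaranteed only within a partition

@dataclass
class GroupOffsets:
    """A consumer group's position: one committed offset per partition."""
    group_id: str
    offsets: dict = field(default_factory=dict)  # partition -> next offset to read
```

Note that the offset lives with the group, not the topic: that is what lets several groups read the same stream at independent positions.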

Primary Use Cases and Access Patterns

  • Produce — write path; high throughput, low latency ack; batching for efficiency.
  • Consume — read path; poll or push; commit offset after processing (at-least-once or exactly-once semantics).
  • Retention and replay — storage and deletion policy; ability to seek to offset or timestamp for replay.
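The commit-after-processing ordering on the consume path is what makes delivery at-least-once: a crash between processing and commit means redelivery, never loss. A toy sketch (the consumer here is a stub, not a real client):

```python
# At-least-once: commit the offset only AFTER processing succeeds. If the
# consumer crashes between processing and committing, events since the
# last commit are redelivered on restart (so processing should be idempotent).

class StubConsumer:
    def __init__(self, events):
        self.events = events
        self.committed = 0            # offset of the next unprocessed event

    def poll(self, max_records=10):
        return self.events[self.committed:self.committed + max_records]

    def commit(self, n):
        self.committed += n

def consume_once(consumer, process):
    batch = consumer.poll()
    for event in batch:
        process(event)                # side effects first...
    consumer.commit(len(batch))       # ...then save the position

processed = []
c = StubConsumer(["e1", "e2", "e3"])
consume_once(c, processed.append)
# processed == ["e1", "e2", "e3"], and the offset is committed past them
```

Committing before processing would invert the guarantee to at-most-once: a crash after commit but before processing silently drops events.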

Given this, start with the simplest MVP: one API or gateway, one log/stream (e.g. Kafka) or queue, and a single producer and consumer. Then evolve with multiple consumer groups, retention tuning, and stream processing as throughput and use cases grow.

Stage 1 — MVP (simple, correct, not over-engineered)

Goal

Ship working event flow: producers send events to a topic, a consumer reads and processes them. One stream (or queue), minimal moving parts.

Components

  • API or produce gateway — accept events (REST, SDK, or native protocol); validate and forward to stream; return ack after write (and optional replication).
  • Log/stream (e.g. Kafka) or queue — durable, ordered log per partition; single broker or small cluster for MVP; retention policy (e.g. 7 days, or size-based).
  • Single producer, single consumer — one application (or service) produces; one consumer process (or group) consumes; enough to validate the flow and ordering.
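The whole MVP flow fits in a toy in-memory log: append-only produce with an offset ack, per-group read positions, and replay by seeking to an earlier offset. This is a sketch of the semantics, not of Kafka's API:

```python
# A toy single-partition log capturing the MVP semantics.

class Log:
    def __init__(self):
        self.events = []       # durable, ordered log (in memory here)
        self.positions = {}    # group_id -> next offset to read

    def produce(self, event):
        self.events.append(event)     # persist before acking
        return len(self.events) - 1   # ack: the assigned offset

    def consume(self, group_id, max_records=10):
        pos = self.positions.get(group_id, 0)
        batch = self.events[pos:pos + max_records]
        self.positions[group_id] = pos + len(batch)
        return batch

    def seek(self, group_id, offset):
        self.positions[group_id] = offset  # replay from an earlier offset

log = Log()
for e in ["signup", "click", "purchase"]:
    log.produce(e)

first = log.consume("billing")   # all three events, in write order
log.seek("billing", 0)           # replay from the beginning
again = log.consume("billing")   # same events, same order
```

Every later stage keeps these semantics and changes only where the log lives (a broker cluster), how many partitions it has, and how many groups hold positions.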

Minimal Diagram

   Producer(s)
         |
         v
+-----------------+
|  API / Gateway  |
+-----------------+
         |
         v
+-----------------+
|  Log / Stream   |
|  (e.g. Kafka)   |
| 1 topic, N part |
+-----------------+
         |
         v
Consumer (single or one group)

Patterns and Concerns (don’t overbuild)

  • Partitioning: partition by key (e.g. user_id) so order is preserved per key; or single partition for global order if throughput is low.
  • Ack and durability: producer gets ack only after event is written (and replicated if configured); consumer commits offset after processing (at-least-once to start).
  • Basic monitoring: produce rate, consume lag (offset behind), broker storage, errors.
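Key-based partitioning comes down to a stable hash of the key modulo the partition count. The hash function below is illustrative (real clients use murmur2, crc32, etc.), but the invariant is the same: the same key always lands on the same partition, so per-key order is preserved while different keys spread out for parallelism:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Deterministic: identical keys always map to the same partition.
    return zlib.crc32(key.encode()) % num_partitions

p = partition_for("user-42", 8)
assert all(partition_for("user-42", 8) == p for _ in range(100))
```

The corollary is that changing the partition count remaps keys, which breaks per-key ordering across the change—one reason to pick a partition count with headroom up front.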

Why This Is a Correct MVP

  • One stream, one consumer (or one group) → clear flow, easy to reason about and operate.
  • Vertical scaling (more partitions, larger broker) buys you time before you need multiple consumer groups and separate processing clusters.

Stage 2 — Growth Phase (more consumers, higher throughput, lag and retention)

What Triggers the Growth Phase?

  • Multiple teams or services need to consume the same stream (multiple consumer groups).
  • Produce or consume throughput outgrows a single broker or small cluster.
  • Consumer lag grows (consumers can’t keep up); need to scale consumers or tune batch size and parallelism.
  • Retention or replay requirements increase (longer retention, or more replay use cases).

Components to Add (incrementally)

  • Multiple consumers (consumer groups) — each group has its own offset; same topic can feed search indexing, analytics, and notifications independently; scale each group by adding instances (partition assignment).
  • Retention tuning — increase retention (time or size) for replay and backfill; monitor storage and enforce limits to avoid unbounded growth.
  • Monitoring lag — track consumer lag (current offset vs latest offset per partition); alert when lag exceeds threshold; dashboards for produce/consume rates and broker health.
  • Optional stream processing — lightweight processing (filter, map, aggregate in windows) inside the stream layer (e.g. Kafka Streams, ksqlDB) or a separate worker that consumes and writes to another topic or store.
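Lag monitoring reduces to simple arithmetic per partition: the broker's latest (end) offset minus the group's committed offset. A sketch with made-up offsets and an illustrative alert threshold:

```python
# Consumer lag per partition: end offset minus committed offset.
# The offset maps are illustrative inputs you would fetch from the broker.

def lag_per_partition(end_offsets, committed_offsets):
    return {p: end_offsets[p] - committed_offsets.get(p, 0)
            for p in end_offsets}

end = {0: 1000, 1: 980, 2: 1020}       # latest offset per partition
committed = {0: 990, 1: 900, 2: 1020}  # this group's committed offsets

lag = lag_per_partition(end, committed)   # {0: 10, 1: 80, 2: 0}
total = sum(lag.values())                 # 90
should_alert = total > 50                 # threshold is a tuning choice
```

Tracking lag per partition (not just the total) matters: a single hot or stuck partition can hide behind a healthy-looking aggregate.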

Growth Diagram

                    +------------------+
Producers --------> |  API / Gateway   |
                    +------------------+
                             |
                             v
                    +------------------+
                    |   Log / Stream   |
                    | (broker cluster) |
                    | retention tuned  |
                    +------------------+
                      |      |      |
         +------------+      |      +------------+
         v                   v                   v
 Consumer Group A    Consumer Group B    Consumer Group C
  (e.g. search)      (e.g. analytics)     (e.g. notify)
         |                   |                   |
         v                   v                   v
    Processing          Processing          Processing

Patterns and Concerns to Introduce (practical scaling)

  • Consumer scaling: add consumer instances within a group; partitions are assigned to members; max parallelism per topic is often the partition count.
  • Lag and backpressure: if lag grows, scale consumers or optimize processing; avoid blocking the consumer loop (offload heavy work to a pool or another queue).
  • Idempotent consumers: consumers may see the same event twice (at-least-once); design processing to be idempotent (e.g. upsert by key, or dedup by event id).
  • Monitoring: lag per group and partition, produce/consume throughput, broker disk and CPU, retention and compaction metrics.
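A minimal sketch of the idempotent-consumer pattern, deduplicating by event id (the in-memory seen-set is illustrative; in practice it would be a keyed store, often with a TTL):

```python
# Idempotent consumption under at-least-once delivery: deduplicate by
# event id so a redelivered event produces no duplicate side effect.

def make_idempotent(process):
    seen = set()
    def handle(event_id, payload):
        if event_id in seen:
            return False          # duplicate: skip the side effect
        process(payload)
        seen.add(event_id)
        return True
    return handle

results = []
handle = make_idempotent(results.append)
handle("evt-1", "charge $10")
handle("evt-1", "charge $10")     # redelivery: ignored
# results == ["charge $10"]  (charged exactly once)
```

The upsert-by-key variant achieves the same end without any dedup store: applying the same write twice leaves the same row, so redelivery is harmless by construction.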

Still Avoid (common over-engineering here)

  • Multi-region replication before you have latency or DR requirements.
  • Exactly-once semantics across producers and consumers before you have proven duplication issues.
  • Separate stream processing cluster before a single pipeline or embedded processing is the bottleneck.

Stage 3 — Advanced Scale (multi-region, schema, exactly-once, dedicated processing)

What Triggers Advanced Scale?

  • You need events in multiple regions (disaster recovery, or low-latency consume in each region).
  • Many producers and consumers; schema evolution and compatibility become critical.
  • You need exactly-once processing (no duplicate side effects) or transactional produce across topics.
  • Stream processing workload (aggregations, joins, windows) justifies a dedicated processing cluster separate from the log cluster.

Components (common advanced additions)

  • Multi-region replication — replicate topics across regions (e.g. mirroring); consumers in each region read from local replica; handle replication lag and failover.
  • Schema registry — central place for event schemas (e.g. Avro, Protobuf); enforce compatibility (forward/backward); producers and consumers validate against registry.
  • Exactly-once semantics — transactional produce (all-or-nothing across partitions); consumer with transactional commit (process and commit offset in one transaction); or idempotent producer + idempotent consumer with careful offset handling.
  • Separate processing clusters — dedicated cluster (e.g. Flink, Spark Streaming) that consumes from the log, does heavy aggregation or join, and writes to another topic or store; scale processing independently from the log.
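A toy version of the compatibility check a schema registry performs: under backward compatibility, a new schema may only add fields that have defaults and may not change the type of an existing field. Real registries apply richer, format-specific rules (e.g. Avro's); this only sketches the idea:

```python
# Toy backward-compatibility check: can a consumer on the NEW schema
# still read data written with the OLD schema?

def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    added = set(new_fields) - set(old_fields)
    # Every added field needs a default, or old data cannot be decoded.
    added_ok = all("default" in new_fields[f] for f in added)
    # No existing field may change type.
    types_ok = all(new_fields[f]["type"] == old_fields[f]["type"]
                   for f in set(old_fields) & set(new_fields))
    return added_ok and types_ok

v1 = {"user_id": {"type": "string"}}
v2 = {"user_id": {"type": "string"},
      "country": {"type": "string", "default": "unknown"}}  # OK: has default
v3 = {"user_id": {"type": "string"},
      "country": {"type": "string"}}                        # breaks old data

assert backward_compatible(v1, v2) is True
assert backward_compatible(v1, v3) is False
```

Running a check like this in CI, against the registered schema, is what turns "schema evolution" from a convention into an enforced gate.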

Advanced Diagram (conceptual)

                    +------------------+
Producers --------> |  API / Gateway   | ---> Schema Registry
                    +------------------+
                             |
           +-----------------+-----------------+
           v                 v                 v
       Region A          Region B          Region C
      Log cluster       Log cluster       Log cluster
      (replicated)      (replicated)      (replicated)
           |                 |                 |
           v                 v                 v
       Consumers         Consumers         Consumers
           |                 |                 |
           +-----------------+-----------------+
                             |
                             v
                     Processing cluster
                (e.g. Flink; exactly-once)
                             |
                             v
                   Sink (DB, topic, etc.)

Patterns and Concerns at This Stage

  • Replication lag: cross-region replication is eventually consistent; define RTO/RPO and failover procedure; consumers may see different offsets per region until caught up.
  • Schema evolution: add optional fields, avoid breaking changes; use schema registry and compatibility checks in CI.
  • Exactly-once: understand scope (producer only, consumer only, or end-to-end); trade-offs with latency and complexity; idempotency and dedup as fallback.
  • SLO-driven ops: produce/consume availability, end-to-end latency, consumer lag, replication lag; error budgets and on-call.

MVP delivers event flow with one API, one log/stream (e.g. Kafka) or queue, and a single producer/consumer (or one group). That’s enough to ship and learn.

As you grow, you add multiple consumer groups so different services can consume the same stream, tune retention for replay, and monitor lag. You optionally add stream processing (embedded or lightweight) and scale consumers by adding instances and partitions.

At very high scale, you introduce multi-region replication, a schema registry, exactly-once semantics where required, and possibly a separate processing cluster for heavy aggregation or join workloads. You keep the log simple and add complexity only where geography, schema, or processing demands it.

This approach gives you:

  • Start Simple — one stream, one consumer (or one group), clear ordering and retention; ship and learn.
  • Scale Intentionally — add consumer groups and tune retention when multiple consumers and replay justify it; add replication and schema registry when regions and evolution demand it.
  • Add Complexity Only When Required — avoid multi-region and exactly-once until product and SLOs require them.