System Design Checklist

First published: Feb 7, 2026. Last updated: Feb 13, 2026. By Atif Alam.

This checklist walks through 10 system-design areas: compute, data, clients and edge, traffic management, caching, messaging, data distribution, consistency, observability, and security.

In each area, Components are the concrete building blocks; Patterns & Concerns are the design choices, strategies, and trade-offs you apply to them.

Use it as a mental checklist when you’re designing or reviewing systems.

1. Compute Layer

Components

API servers / app servers (stateless)
Service layout (monolith app vs microservices)
Background workers
Real-time servers (WebSocket/SSE)
Ranking / scoring service (e.g. feed ranking pipeline, search reranker, ML-based scoring)

Patterns & Concerns

Horizontal scaling strategy (statelessness, autoscaling triggers)
Backpressure + load shedding
Retry policy + timeouts + circuit breakers (service resilience)
Deployment strategy (blue/green, canary) if relevant — see Progressive Delivery
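
To make the resilience item above more concrete, here is a minimal circuit-breaker sketch. The class name, the thresholds, and the injectable clock are illustrative assumptions, not something the checklist prescribes; a production setup would normally add per-call timeouts and bounded retries around it.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    # Hypothetical defaults; tune them per dependency.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open gives a struggling dependency room to recover instead of piling retries on top of it.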

2. Data Layer (Core Stores)

Components

Relational databases (SQL)
NoSQL data stores (KV / document / wide-column / graph)
Object/blob storage
Search index
Time-series DB (when relevant)
Dedicated / purpose-built store (e.g. precomputed feed store, graph store) — when a single DB can’t serve mixed workloads

Patterns & Concerns

Data modeling choices (schema, indexing, access patterns)
Query patterns (read-heavy vs write-heavy, joins vs denormalization)
Hot partition avoidance (key design)
Data lifecycle (retention policy, archival, GDPR delete flows)
Polyglot persistence — using different store types for different access patterns (e.g. OLTP + search + cache + analytics)
Tiered storage (hot/warm/cold) — moving older or less-accessed data to cheaper storage for cost optimization
Downsampling — reducing granularity of old data (e.g. time-series, analytics) to save storage cost
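
As a rough illustration of the downsampling item, the sketch below averages raw points into fixed-width buckets. The function name, the (timestamp, value) tuple shape, and the choice of mean as the aggregate are assumptions; real systems often keep min, max, and count as well.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)  # align to the bucket boundary
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Five raw samples collapse into two one-minute buckets.
raw = [(0, 10.0), (20, 20.0), (40, 30.0), (60, 40.0), (80, 60.0)]
print(downsample(raw, 60))  # [(0, 20.0), (60, 50.0)]
```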

3. Clients and Edge

Components

Clients: web, mobile, desktop, other services
DNS
Global routing / Anycast (conceptually “global traffic steering”)
CDN
WAF (Web Application Firewall) / DDoS (Distributed Denial of Service) protection

Patterns & Concerns

Edge caching strategy (what gets cached, TTLs, invalidation)
Geo routing goals (latency vs compliance vs cost)
Regional routing — directing user traffic to the nearest regional deployment for low latency (beyond DNS; includes request routing at the LB/app level)

4. Traffic Management and Entry

Components

Load balancer (L4/L7)
API Gateway / Reverse proxy

Patterns & Concerns

Rate limiting (token/leaky bucket; per-user/IP/key)
AuthN/AuthZ (OAuth/OIDC, JWT, service-to-service auth)
Request shaping: throttling, quotas, spike arrest
API versioning / backward compatibility (often discussed here)
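
The token-bucket approach mentioned under rate limiting can be sketched as follows. This is a single-process illustration with hypothetical names and defaults; a gateway would typically keep the counters in a shared store so that all instances enforce the same limit.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


buckets = {}  # one bucket per caller key (user ID, IP, or API key)


def allow_request(key, rate=5.0, capacity=10):
    """Hypothetical per-key entry point; rate and capacity are example values."""
    if key not in buckets:
        buckets[key] = TokenBucket(rate, capacity)
    return buckets[key].allow()
```

The capacity sets how large a burst is tolerated, while the rate sets the sustained throughput each key can reach.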

5. Caching and Acceleration

Components

Client/device cache
CDN cache
In-memory cache (e.g. Redis/Memcached)

Patterns & Concerns

Cache strategy: cache-aside, write-through, write-behind — see Caching Strategies for detailed patterns
TTLs, eviction policies, invalidation
Read scaling via replicas
Stampede protection (locks, singleflight, request coalescing)
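
Cache-aside, TTLs, and stampede protection fit together roughly as in the sketch below. It is an in-process illustration only: the class, the loader callback, and the per-key lock table are assumptions standing in for a real cache such as Redis and a real backing store.

```python
import threading
import time


class CacheAside:
    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self.loader = loader            # hypothetical function that reads the backing store
        self.ttl = ttl_seconds
        self.clock = clock
        self.data = {}                  # key -> (value, expires_at)
        self.key_locks = {}             # key -> lock guarding loads for that key
        self.guard = threading.Lock()   # protects key_locks

    def _fresh(self, key):
        entry = self.data.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]             # cache hit
        with self.guard:
            lock = self.key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have loaded the key while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self.loader(key)
            self.data[key] = (value, self.clock() + self.ttl)
            return value

    def invalidate(self, key):
        self.data.pop(key, None)        # call after writing to the backing store
```

Only one thread per key reaches the loader on a miss; the others wait and then read the freshly cached value. The lock table here grows without bound, which a real implementation would need to address.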

6. Messaging and Async

Components

Queue (e.g. RabbitMQ, Amazon SQS, Redis)
Pub/Sub / event bus (e.g. Kafka, Google Pub/Sub, AWS SNS, Redis Pub/Sub)
Event log / durable stream (e.g. Kafka with retention) — ordered, replayable log for history and replay
Stream processing (e.g. Kafka Streams, Apache Flink, ksqlDB) when needed

Patterns & Concerns

Delivery semantics: at-most-once, at-least-once, or exactly-once (in practice, usually at-least-once delivery plus idempotent consumers)
Retries + dead-letter queues (DLQ)
Ordering guarantees (per key/partition)
Consumer scaling + backpressure
Retention and replay — how long to keep messages/events, replay from offset or timestamp, history/cold store for older data
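
The retry and dead-letter items above can be sketched with an in-memory queue. The message shape (a dict with "body" and "attempts"), the function name, and the attempt limit are assumptions for illustration; brokers such as SQS or RabbitMQ provide their own redelivery and DLQ configuration.

```python
from collections import deque


def consume(queue, handler, dead_letters, max_attempts=3):
    """Drain the queue; requeue failures until max_attempts, then dead-letter them."""
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
        except Exception as exc:
            message["attempts"] += 1
            if message["attempts"] >= max_attempts:
                message["error"] = repr(exc)
                dead_letters.append(message)   # parked for inspection or replay
            else:
                queue.append(message)          # redelivered later (at-least-once)
```

Because a failed message is redelivered, the handler may see the same body more than once, which is why consumers are usually written to be idempotent.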

7. Data Distribution and Reliability Primitives

Components

Replication setup (as a system capability; often in the DB/platform)
Sharding/partitioning mechanism (db-level or app-level)
Coordination system (leader election / consensus) when needed
ID generator service (if not using UUID)

Patterns & Concerns

Partitioning strategy: hash/range/geo; consistent hashing
Replication mode: sync vs async; RPO/RTO implications
Failover strategy (active-active vs active-passive)
Rebalancing and resharding strategy
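
Consistent hashing, listed under partitioning strategy, can be illustrated with a small hash ring. The class name, the virtual-node count, and the use of MD5 as a stable (non-security) hash are assumptions chosen for brevity.

```python
import bisect
import hashlib


def _hash(value):
    # Stable hash; Python's built-in hash() is randomized per process for strings.
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual nodes per physical node, to smooth the distribution
        self.ring = []        # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [entry for entry in self.ring if entry[1] != node]

    def lookup(self, key):
        if not self.ring:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key's hash, wrapping around.
        index = bisect.bisect_right(self.ring, (_hash(key), ""))
        return self.ring[index % len(self.ring)][1]
```

When a node is added, only the keys that now fall on the new node's ring segments move; every other key stays where it was, which is what keeps rebalancing cheap compared with plain modulo hashing.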

8. Consistency, Correctness, and Safety

Components

Usually no new infrastructure—you apply consistency patterns to existing DBs, queues, and services; optionally:
Transaction coordinator / workflow engine — coordinates multi-step or cross-service transactions (e.g. two-phase commit, saga orchestration; tools like Temporal, AWS Step Functions)
Lock service — distributed locking so processes can coordinate on “who holds the lock” (e.g. etcd, ZooKeeper, Redis with Redlock)

Patterns & Concerns

Transactions and isolation choices — which level (read uncommitted → serializable) and where to use them
Idempotency keys — same request applied once even if retried; client sends a key, server dedupes
Distributed locks (and alternatives) — exclusive access across nodes; or use DB row locks / optimistic concurrency
Saga / outbox pattern (cross-service consistency) — multi-step flow with compensating actions, or write events to outbox table then publish
Deduplication (handling at-least-once) — accept duplicate deliveries and make them harmless (e.g. idempotency, unique constraint)
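
The idempotency-key item can be sketched as a server that stores the first response per key and replays it on retry. The service, method, and field names are hypothetical, and the in-memory dict stands in for a durable store with a unique constraint on the key.

```python
import threading


class PaymentService:
    """Hypothetical service; names are illustrative only."""

    def __init__(self):
        self.responses = {}           # idempotency key -> stored response
        self.lock = threading.Lock()  # makes check-and-record atomic
        self.charges = []

    def charge(self, idempotency_key, account, amount_cents):
        with self.lock:
            if idempotency_key in self.responses:
                # Retry of a request already applied: replay the first result.
                return self.responses[idempotency_key]
            self.charges.append((account, amount_cents))
            response = {"charge_id": len(self.charges), "status": "ok"}
            self.responses[idempotency_key] = response
            return response
```

A client that times out can resend the same key safely: the second call returns the original response and no second charge is recorded.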

9. Observability and Operations

Components

Logging pipeline (collector/store) (e.g. ELK, Loki, Fluentd, CloudWatch Logs)
Metrics system (time-series metrics backend) (e.g. Prometheus, InfluxDB, Datadog; Grafana for dashboards)
Tracing system (e.g. Jaeger, Zipkin, X-Ray; OpenTelemetry for instrumentation)
Alerting/on-call tooling (e.g. PagerDuty, Opsgenie, Alertmanager)
Health checks system (e.g. load balancer health checks, Kubernetes liveness/readiness)

Patterns & Concerns

SLOs/SLIs (availability/latency/error budgets)
Sampling strategy for tracing/logging
Autoscaling policy (CPU/QPS/latency)
Runbooks + incident response expectations
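
The error-budget idea behind SLOs comes down to simple arithmetic, sketched below. The function names and the 30-day window are assumptions; teams define their own SLIs and measurement windows.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, fraction_of_budget_remaining) for a request-based SLO."""
    allowed = (1.0 - slo_target) * total_requests
    remaining = 1.0 - failed_requests / allowed if allowed else 0.0
    return allowed, remaining


def allowed_downtime_minutes(slo_target, window_days=30):
    """Downtime permitted by an availability SLO over the window, in minutes."""
    return (1.0 - slo_target) * window_days * 24 * 60
```

For example, a 99.9% target over 1,000,000 requests allows about 1,000 failures, and over a 30-day window it allows about 43.2 minutes of downtime.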

10. Security and Compliance

Components

Secrets manager / KMS — store and serve secrets (passwords, API keys) and encryption keys with access control and audit (e.g. HashiCorp Vault, AWS Secrets Manager, AWS KMS, Azure Key Vault)
Audit log store

Patterns & Concerns

Encryption in transit (TLS) and at rest
Least privilege / IAM model
PII boundaries, retention, deletion
Threat modeling basics (abuse prevention, auth bypass, replay)
