System Design Checklist
This checklist walks through 10 system-design areas: compute, data, clients and edge, traffic management, caching, messaging, data distribution, consistency, observability, and security.
In each area, Components are the concrete building blocks; Patterns & Concerns are the design choices, strategies, and trade-offs you apply to them.
Use it as a mental checklist when you’re designing or reviewing systems.
1. Compute Layer
Components
- API servers / app servers (stateless)
- Service layout (monolith app vs microservices)
- Background workers
- Real-time servers (WebSocket/SSE)
- Ranking / scoring service (e.g. feed ranking pipeline, search reranker, ML-based scoring)
Patterns & Concerns
- Horizontal scaling strategy (statelessness, autoscaling triggers)
- Backpressure + load shedding
- Retry policy + timeouts + circuit breakers (service resilience)
- Deployment strategy (blue/green, canary) if relevant — see Progressive Delivery
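To make the resilience item above concrete, here is a minimal circuit-breaker sketch. The class name, thresholds, and the injectable `clock` parameter are illustrative assumptions, not a reference to any particular library; production setups usually combine this with per-call timeouts and bounded retries with backoff.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, so one trial call is let through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open is what protects a struggling dependency from retry storms; the half-open trial call is how the caller discovers recovery.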
2. Data Layer (Core Stores)
Components
- Relational databases (SQL)
- NoSQL data stores (KV / document / wide-column / graph)
- Object/blob storage
- Search index
- Time-series DB (when relevant)
- Dedicated / purpose-built store (e.g. precomputed feed store, graph store) — when a single DB can’t serve mixed workloads
Patterns & Concerns
- Data modeling choices (schema, indexing, access patterns)
- Query patterns (read-heavy vs write-heavy, joins vs denormalization)
- Hot partition avoidance (key design)
- Data lifecycle (retention policy, archival, GDPR delete flows)
- Polyglot persistence — using different store types for different access patterns (e.g. OLTP + search + cache + analytics)
- Tiered storage (hot/warm/cold) — moving older or less-accessed data to cheaper storage for cost optimization
- Downsampling — reducing granularity of old data (e.g. time-series, analytics) to save storage cost
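As an illustration of the downsampling item, the sketch below averages raw points into fixed-width time buckets. The function name, the `(timestamp_s, value)` tuple shape, and the choice of a mean are assumptions for the example; real systems often keep min, max, and count alongside the mean.

```python
from collections import defaultdict


def downsample(points, bucket_s=60):
    """Average (timestamp_s, value) points into fixed-width time buckets.

    Returns a list of (bucket_start_s, mean_value), sorted by bucket start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_s].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]
```

For example, four per-second samples spread over two minutes collapse into two per-minute rows, which is the storage saving the checklist item refers to.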
3. Clients and Edge
Components
- Clients: web, mobile, desktop, other services
- DNS
- Global routing / Anycast (conceptually “global traffic steering”)
- CDN
- WAF (Web Application Firewall) / DDoS (Distributed Denial of Service) protection
Patterns & Concerns
- Edge caching strategy (what gets cached, TTLs, invalidation)
- Geo routing goals (latency vs compliance vs cost)
- Regional routing — directing user traffic to the nearest regional deployment for low latency (beyond DNS; includes request routing at the LB/app level)
4. Traffic Management and Entry
Components
- Load balancer (L4/L7)
- API Gateway / Reverse proxy
Patterns & Concerns
- Rate limiting (token/leaky bucket; per-user/IP/key)
- AuthN/AuthZ (OAuth/OIDC, JWT, service-to-service auth)
- Request shaping: throttling, quotas, spike arrest
- API versioning / backward compatibility (often discussed here)
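The token-bucket limiter mentioned above can be sketched in a few lines. This is a single-process illustration with assumed names (`TokenBucket`, `allow`, an injectable `clock`); a gateway would typically keep one bucket per user, IP, or API key, often in a shared store.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: refills at rate_per_s, bursts up to capacity."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock  # injectable for deterministic tests
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill lazily based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The capacity controls how large a burst is tolerated, while the refill rate sets the sustained request rate.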
5. Caching and Acceleration
Components
- Client/device cache
- CDN cache
- In-memory cache (e.g. Redis/Memcached)
Patterns & Concerns
- Cache strategy: cache-aside, write-through, write-behind — see Caching Strategies for detailed patterns
- TTLs, eviction policies, invalidation
- Read scaling via replicas
- Stampede protection (locks, singleflight, request coalescing)
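The sketch below combines two items from this list: cache-aside reads and stampede protection through a per-key lock. It is an in-process illustration only; `CacheAside`, `loader`, and `clock` are assumed names, and a distributed cache would need a different coordination mechanism (for example, a lock or lease held in the cache itself).

```python
import threading
import time


class CacheAside:
    """Hypothetical in-process cache-aside wrapper with per-key load locking."""

    def __init__(self, loader, ttl_s=60.0, clock=time.monotonic):
        self.loader = loader  # function that reads from the source of truth
        self.ttl_s = ttl_s
        self.clock = clock
        self._data = {}       # key -> (value, expires_at)
        self._key_locks = {}  # key -> lock (never pruned in this sketch)
        self._guard = threading.Lock()

    def _fresh(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]  # cache hit
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have loaded the value while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self.loader(key)
            self._data[key] = (value, self.clock() + self.ttl_s)
            return value
```

Only the first caller for a cold key pays the load cost; concurrent callers wait on the lock and then read the freshly cached value instead of hitting the backing store.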
6. Messaging and Async
Components
- Queue (e.g. RabbitMQ, Amazon SQS, Redis)
- Pub/Sub / event bus (e.g. Kafka, Google Pub/Sub, AWS SNS, Redis Pub/Sub)
- Event log / durable stream (e.g. Kafka with retention) — ordered, replayable log for history and replay
- Stream processing (e.g. Kafka Streams, Apache Flink, ksqlDB) when needed
Patterns & Concerns
- Delivery semantics: at-most-once, at-least-once, exactly-once (in practice, usually at-least-once delivery plus idempotent consumers)
- Retries + dead-letter queues (DLQ)
- Ordering guarantees (per key/partition)
- Consumer scaling + backpressure
- Retention and replay — how long to keep messages/events, replay from offset or timestamp, history/cold store for older data
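A minimal sketch of the retry and dead-letter items above, using an in-memory queue. The `consume` function and its `max_attempts` parameter are hypothetical; a real broker would handle redelivery and DLQ routing itself, usually with a delay between attempts.

```python
from collections import deque


def consume(messages, handler, max_attempts=3):
    """Process messages at-least-once; park repeated failures in a DLQ.

    Returns the dead-letter list as (message, error_repr) pairs.
    """
    queue = deque((msg, 1) for msg in messages)
    dead_letters = []
    while queue:
        msg, attempt = queue.popleft()
        try:
            handler(msg)
        except Exception as exc:
            if attempt >= max_attempts:
                # Give up and keep the message for inspection or manual replay.
                dead_letters.append((msg, repr(exc)))
            else:
                # Redeliver later; note this reorders the message behind others.
                queue.append((msg, attempt + 1))
    return dead_letters
```

Because a failed message is redelivered behind newer ones, retries like this give up strict ordering, which is one reason ordering guarantees are usually scoped to a key or partition.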
7. Data Distribution and Reliability Primitives
Components
- Replication setup (as a system capability; often in the DB/platform)
- Sharding/partitioning mechanism (db-level or app-level)
- Coordination system (leader election / consensus) when needed
- ID generator service (if not using UUID)
Patterns & Concerns
- Partitioning strategy: hash/range/geo; consistent hashing
- Replication mode: sync vs async; RPO/RTO implications
- Failover strategy (active-active vs active-passive)
- Rebalancing and resharding strategy
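Consistent hashing, listed under partitioning strategy, is easiest to see in code. The sketch below is one common formulation (a sorted ring with virtual nodes); the `HashRing` class, the `vnodes` default, and the use of SHA-256 are assumptions for illustration.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns many small arcs, which evens out the key distribution.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def node_for(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring entry at or after the key's position, wrapping around.
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

The property that matters for rebalancing: when a node joins, the only keys that change owner are the ones the new node takes over, so resharding moves a fraction of the data rather than all of it.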
8. Consistency, Correctness, and Safety
Components
- Usually no new infrastructure—you apply consistency patterns to existing DBs, queues, and services; optionally:
- Transaction coordinator / workflow engine — coordinates multi-step or cross-service transactions (e.g. two-phase commit, saga orchestration; tools like Temporal, AWS Step Functions)
- Lock service — distributed locking so processes can coordinate on “who holds the lock” (e.g. etcd, ZooKeeper, Redis with Redlock)
Patterns & Concerns
- Transactions and isolation choices — which level (read uncommitted → serializable) and where to use them
- Idempotency keys — same request applied once even if retried; client sends a key, server dedupes
- Distributed locks (and alternatives) — exclusive access across nodes; or use DB row locks / optimistic concurrency
- Saga / outbox pattern (cross-service consistency) — multi-step flow with compensating actions, or write events to outbox table then publish
- Deduplication (handling at-least-once) — accept duplicate deliveries and make them harmless (e.g. idempotency, unique constraint)
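The idempotency-key item can be sketched as follows. `PaymentService` and its fields are hypothetical, and the in-memory dict stands in for durable storage; in a real service the key and the side effect would be written in the same transaction, typically guarded by a unique constraint.

```python
class PaymentService:
    """Hypothetical service that applies each idempotency key at most once."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored response
        self.charges = []   # side effects actually performed

    def charge(self, idempotency_key, account, amount_cents):
        if idempotency_key in self._results:
            # Retry of a request we already handled: replay the stored response.
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        response = {"charge_id": len(self.charges), "status": "charged"}
        self._results[idempotency_key] = response
        return response
```

A client that times out and retries with the same key gets the original response back, and the account is charged only once.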
9. Observability and Operations
Components
- Logging pipeline (collector/store) (e.g. ELK, Loki, Fluentd, CloudWatch Logs)
- Metrics system (time-series metrics backend) (e.g. Prometheus, InfluxDB, Datadog; Grafana for dashboards)
- Tracing system (e.g. Jaeger, Zipkin, AWS X-Ray; OpenTelemetry for instrumentation and collection)
- Alerting/on-call tooling (e.g. PagerDuty, Opsgenie, Alertmanager)
- Health checks system (e.g. load balancer health checks, Kubernetes liveness/readiness)
Patterns & Concerns
- SLOs/SLIs (availability/latency/error budgets)
- Sampling strategy for tracing/logging
- Autoscaling policy (CPU/QPS/latency)
- Runbooks + incident response expectations
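The error-budget arithmetic behind an availability SLO is short enough to write out. The function names and the 30-day default window are assumptions for the example.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed 'bad' minutes for an availability SLO over a rolling window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(bad_minutes, slo, window_days=30):
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    return bad_minutes / error_budget_minutes(slo, window_days)
```

A 99.9% SLO over 30 days leaves about 43.2 minutes of budget, so a single 21.6-minute outage spends roughly half of it.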
10. Security and Compliance
Components
- Secrets manager / KMS — store and serve secrets (passwords, API keys) and encryption keys with access control and audit (e.g. HashiCorp Vault, AWS Secrets Manager, AWS KMS, Azure Key Vault)
- Audit log store
Patterns & Concerns
- Encryption in transit (TLS) and at rest
- Least privilege / IAM model
- PII boundaries, retention, deletion
- Threat modeling basics (abuse prevention, auth bypass, replay)