
Infrastructure / Redundancy — Designed in Stages

First published by Atif Alam

You don’t need to design for scale on day one.

Define what you need—deploy the workload across multiple nodes, health-check it, fail over when something breaks, and scale—then build the simplest redundant setup and evolve as availability and blast-radius requirements grow.

Here we use compute-cluster infrastructure (the classic “build redundancy for the compute cluster” prompt) as the running example: cluster, nodes, workload, and health. The same staged thinking applies to any system where availability (no single point of failure), detection and recovery, and minimal blast radius are central. This is an infra/SRE pattern: we focus on deployment, redundancy, and failover—not application-level logic.

Requirements and Constraints (no architecture yet)


Functional Requirements

  • Deploy workload — run application (or services) on compute nodes; deploy new version (rolling or blue-green); scale out (add nodes) or scale up (larger nodes).
  • Health check — periodically check that each node or instance is healthy (HTTP endpoint, TCP, or script); mark unhealthy so traffic stops going to it; trigger alert or failover.
  • Failover — when a node or zone fails, traffic and workload shift to healthy capacity; manual (ops runs script) or automatic (load balancer stops sending to unhealthy, or orchestrator reschedules).
  • Scale — add or remove nodes (or replicas) to meet load; manual or autoscale; capacity planning.

Quality Requirements

  • Availability (no single point of failure) — loss of one node (or one AZ) should not take down the service; multiple nodes behind load balancer; optional multi-AZ or multi-region.
  • Detection and recovery — detect failure quickly (health checks, timeouts); recover by failing over to healthy node or restarting; mean time to detection (MTTD) and mean time to recovery (MTTR) matter.
  • Minimal blast radius — failure of one component (node, AZ, region) should affect as few users or as little workload as possible; isolation (e.g. cells, shards) and redundancy boundaries.
  • Expected scale — number of nodes, regions, deployment frequency, acceptable downtime or RTO/RPO.
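As a back-of-the-envelope check on the availability requirement: with n independent nodes, each available a fraction a of the time, at least one node is up 1 − (1 − a)^n of the time. A small sketch (numbers are illustrative; real nodes are rarely fully independent, so treat this as an upper bound):

```python
def combined_availability(node_availability: float, n_nodes: int) -> float:
    """Probability that at least one of n independent nodes is up."""
    return 1 - (1 - node_availability) ** n_nodes

# One node at 99% uptime vs. three nodes behind a load balancer:
single = combined_availability(0.99, 1)  # 0.99      (~3.65 days down/year)
triple = combined_availability(0.99, 3)  # 0.999999  (~32 seconds down/year)
```

This is why "multiple nodes behind a load balancer" is the first move: redundancy multiplies failure probabilities together.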

Key Entities

  • Cluster — logical group of nodes (or VMs/containers) that run the workload; may span one or more availability zones or regions.
  • Node — single unit of compute (VM, container host, or pod); runs one or more workload instances; has identity (e.g. IP, instance id) for health and routing.
  • Workload — the application or service to run; deployed onto nodes; may be stateless (any node can serve) or stateful (replication/failover rules).
  • Health — status of a node or instance (healthy / unhealthy); derived from health checks; used by load balancer or orchestrator to route or reschedule.
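The entities above can be sketched as a minimal data model. Names and fields here are illustrative, not any specific cloud API:

```python
from dataclasses import dataclass, field
from enum import Enum

class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"

@dataclass
class Node:
    instance_id: str              # identity for health and routing
    ip: str
    zone: str                     # availability zone the node lives in
    health: Health = Health.HEALTHY

@dataclass
class Cluster:
    name: str
    nodes: list[Node] = field(default_factory=list)

    def healthy_nodes(self) -> list[Node]:
        """Nodes the LB or orchestrator may route traffic to."""
        return [n for n in self.nodes if n.health is Health.HEALTHY]
```

The workload itself is deliberately absent from the model: whether it is stateless or stateful changes the failover rules, not the cluster/node/health shape.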

Primary Use Cases and Access Patterns

  • Deploy — write path; push new version to nodes (rolling update, canary, or blue-green); orchestration (e.g. Kubernetes, ECS) or scripted; rollback if deploy fails or metrics degrade.
  • Health check — read path; LB or orchestrator calls health endpoint on each node; success = healthy, failure or timeout = unhealthy; remove from rotation or reschedule workload.
  • Failover — read + write; on failure (node down, AZ down): LB stops sending to failed targets; optional: orchestrator replaces the failed node with a new one; stateful: promote a replica or fail over to a standby.
  • Scale — write path; add nodes (manual or autoscale policy); register with LB or cluster; remove nodes on scale-in (drain, then remove).
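The scale-in step above (drain, then remove) can be sketched as follows. The `lb_targets` set and `in_flight` counter are stand-ins for whatever load-balancer API and connection metrics you actually have:

```python
import time

def scale_in(lb_targets: set[str], node: str, in_flight: dict[str, int],
             drain_timeout_s: float = 30.0, poll_s: float = 0.01) -> bool:
    """Drain a node before removing it: stop new traffic, wait for
    in-flight requests to finish, then report success. Returns False
    if the drain times out (escalate to ops instead of hard-killing)."""
    lb_targets.discard(node)               # stop sending new requests
    deadline = time.monotonic() + drain_timeout_s
    while in_flight.get(node, 0) > 0:
        if time.monotonic() > deadline:
            return False                   # timed out; node still busy
        time.sleep(poll_s)
        # Simulate requests completing; real code would re-read a metric.
        in_flight[node] = max(0, in_flight[node] - 1)
    return True
```

The key ordering is deregister first, then wait: removing a node that is still serving requests turns a routine scale-in into user-visible errors.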

Given this, start with the simplest MVP: multiple nodes, load balancer, health checks, manual or scripted failover, single region—then add auto-failover (LB + healthy nodes), multi-AZ or multi-region, deployment and rollback, and monitoring and alerting as availability demands grow.

Stage 1 — MVP (simple, correct, not over-engineered)


Goal

Ship redundant compute: workload runs on multiple nodes behind a load balancer; health checks determine which nodes receive traffic; on node failure, manual or scripted failover (remove from LB, restart, or replace). Single region.

Components

  • Multiple nodes — run at least 2 (or 3+) instances of the workload (e.g. 2 VMs, 2 containers); no single point of failure; nodes can be in same AZ or spread across AZs if available.
  • Load balancer — in front of nodes; distributes traffic (round-robin, least connections, or simple); clients hit LB only; LB has list of backend nodes (or target group).
  • Health checks — LB (or orchestrator) calls a health endpoint on each node (e.g. GET /health every 10–30 s); 2xx = healthy, any other response or a timeout = unhealthy; LB stops sending traffic to unhealthy nodes; optional: mark a node unhealthy only after N consecutive failures.
  • Manual or scripted failover — when a node fails: ops removes it from LB backend list (or script does it on alert); optional: restart node or replace with new instance; no automatic replacement yet; document runbook.
  • Single region — all nodes in one region (or one AZ); acceptable for MVP; multi-AZ or multi-region later.
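A minimal health-check round with the N-consecutive-failures threshold might look like this. The `probe` callable stands in for a real GET /health request with a timeout; everything else is a sketch:

```python
from typing import Callable

def run_health_checks(nodes: list[str], probe: Callable[[str], bool],
                      failures: dict[str, int], threshold: int = 2) -> set[str]:
    """One check round: returns the set of nodes currently unhealthy.
    A node is unhealthy after `threshold` consecutive failed probes;
    a single success resets its failure counter."""
    unhealthy = set()
    for node in nodes:
        if probe(node):                       # 2xx within the timeout
            failures[node] = 0
        else:
            failures[node] = failures.get(node, 0) + 1
        if failures.get(node, 0) >= threshold:
            unhealthy.add(node)
    return unhealthy
```

The consecutive-failure threshold is what keeps one dropped packet from flapping a node out of rotation; the trade-off is slower detection (threshold × check interval).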

Minimal Diagram

              Clients
                 |
                 v
           Load balancer
     (health-checks each node)
      |         |         |
      v         v         v
  +-------+ +-------+ +-------+
  | Node1 | | Node2 | | Node3 |
  +-------+ +-------+ +-------+
      |         |         |
      v         v         v
   Workload (same app on each node)

Failover: manually remove the failed node from the LB

Patterns and Concerns (don’t overbuild)

  • Stateless workload: prefer stateless app so any node can serve any request; session affinity (sticky) only if needed; stateful requires replication or failover design.
  • Health endpoint: must be cheap and fast; exclude heavy checks from critical path; optional: liveness (am I running?) vs readiness (can I take traffic?).
  • Basic monitoring: node health, LB backend count, request count and errors; alert when node goes unhealthy or backend count drops.
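The liveness/readiness split above can be sketched as two cheap in-process checks: liveness answers "is the process running?", readiness answers "can this instance take traffic?". The dependency flag here is illustrative (in practice it might mean config loaded, caches warm, or downstream connections established):

```python
class Probes:
    """Backing state for cheap /livez and /readyz endpoints."""

    def __init__(self) -> None:
        self.started = True            # process is up once constructed
        self.deps_ready = False        # e.g. config loaded, caches warm

    def liveness(self) -> int:
        # 200 while the process runs; a crashed process answers nothing,
        # which the checker treats as a timeout (unhealthy).
        return 200 if self.started else 503

    def readiness(self) -> int:
        # 200 only when this instance should receive traffic.
        return 200 if self.started and self.deps_ready else 503
```

Routing on readiness (not liveness) is what lets a node finish warming up, or shed traffic during a deploy, without being restarted.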

Why This Is a Correct MVP

  • Multiple nodes, LB, health checks, manual failover → no single point of failure for compute; one node down, others serve; easy to reason about.
  • Single region and manual recovery buy you time before you need auto-failover and multi-region.

Stage 2 — Growth Phase (auto-failover, multi-AZ, deploy and rollback)


What Triggers the Growth Phase?

  • Need automatic failover: when node fails, LB already stops sending traffic (health-based); optionally auto-replace failed node (orchestrator or autoscale); reduce MTTR without manual steps.
  • Need multi-AZ or multi-region: survive AZ or region failure; spread nodes across AZs; optional active-active or active-passive in second region.
  • Need safe deployment and rollback: rolling update or blue-green; health checks gate rollout; automatic rollback if error rate or latency degrades.

Components to Add (incrementally)

  • Auto-failover — LB only sends traffic to healthy nodes (from health checks); no manual removal needed when node fails; optional: orchestrator (e.g. Kubernetes) replaces failed pod/node automatically; or autoscale group replaces unhealthy instance; define “unhealthy” threshold (e.g. 2 failed checks).
  • Multi-AZ or multi-region — place nodes in multiple availability zones (same region) so AZ failure doesn’t take all nodes; or replicate to second region for DR; LB or DNS routes to healthy region; data layer (DB, cache) may need replication too—separate concern from compute redundancy.
  • Deployment and rollback — rolling update: deploy new version to subset of nodes, health check passes, then next subset; or blue-green (full new set, switch traffic); monitor error rate and latency; automatic rollback if SLO breach (e.g. error rate > 5%); feature flags or canary for gradual rollout.
  • Monitoring and alerting — metrics: request rate, error rate, latency per node and globally; health status; alert when backend count drops, when error rate spikes, or when deployment fails; runbooks for common failures.
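The rolling update described above can be sketched as: deploy one batch, run the health gate, then proceed or roll back everything deployed so far. The `deploy`, `healthy`, and `rollback` callables are stand-ins for real orchestrator calls:

```python
from typing import Callable

def rolling_update(nodes: list[str], batch_size: int,
                   deploy: Callable[[str], None],
                   healthy: Callable[[str], bool],
                   rollback: Callable[[str], None]) -> bool:
    """Deploy in batches; if any node in a batch fails its health gate,
    roll back every node deployed so far (newest first) and stop."""
    done: list[str] = []
    for i in range(0, len(nodes), batch_size):
        batch = nodes[i:i + batch_size]
        for node in batch:
            deploy(node)
            done.append(node)
        if not all(healthy(n) for n in batch):
            for n in reversed(done):
                rollback(n)
            return False
    return True
```

Blue-green is the degenerate case of the same gate: one "batch" of entirely new capacity, with the traffic switch as the commit step.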

Growth Diagram

                  Clients
                     |
                     v
      Load balancer (multi-AZ targets)
           |                   |
           v                   v
   AZ1: Node1, Node2    AZ2: Node3, Node4
           |                   |
           v                   v
   Health check → auto-remove unhealthy
                     |
                     v
   Orchestrator: replace failed node
                     |
                     v
   Deploy: rolling update, rollback on SLO breach
                     |
                     v
          Monitoring & alerting

Patterns and Concerns to Introduce (practical scaling)

  • AZ balance: spread nodes evenly across AZs so one AZ down leaves enough capacity; LB should balance across AZs.
  • Rollback criteria: define what triggers rollback (e.g. error rate > X%, latency p99 > Y); automate so no human in loop for obvious failures.
  • Monitoring: dashboard for node count, health, request/error/latency; alert on threshold; on-call runbooks.
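The rollback criteria above reduce to a single automated gate. Thresholds here are examples only; tune them to your own SLOs:

```python
def should_rollback(error_rate: float, p99_latency_ms: float,
                    max_error_rate: float = 0.05,
                    max_p99_ms: float = 500.0) -> bool:
    """Trip automatically on obvious failures so no human is in the
    loop: error rate above X% or p99 latency above Y ms after deploy."""
    return error_rate > max_error_rate or p99_latency_ms > max_p99_ms
```

In practice this check runs against a post-deploy observation window (e.g. 5 minutes of metrics), not a single sample, so one noisy data point does not trigger a rollback.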

Still Avoid (common over-engineering here)

  • Chaos engineering and automated recovery until you have baseline stability and observability.
  • Capacity and placement optimization until scale and cost justify it.
  • Full multi-region active-active (complexity in data and routing) until RTO/RPO demand it.

Stage 3 — Advanced Scale (chaos, automated recovery, optimization)


What Triggers Advanced Scale?

  • Need to validate resilience: chaos engineering—inject failures (kill node, drop AZ) and verify system recovers; find gaps before production incidents.
  • Automated recovery: self-healing beyond simple restart—replace failed node, repair unhealthy deployment, or trigger runbook automatically.
  • Capacity and placement optimization: right-size nodes, bin-packing, or scale based on cost and performance; multi-region placement for latency or cost.

Components (common advanced additions)

  • Chaos engineering — run experiments: terminate random node, simulate AZ failure, or inject latency; measure impact (availability, error rate) and recovery time; automate in pre-prod or controlled prod; tools (e.g. Chaos Monkey, Chaos Mesh) or custom scripts; document findings and fix gaps.
  • Automated recovery — on alert (e.g. “node unhealthy for 5 min”), trigger runbook: replace node, rollback deploy, or failover to standby; reduce MTTR; optional: auto-scale out when load spikes so capacity is added before SLO breach.
  • Capacity and placement optimization — analyze utilization; right-size instance types (e.g. smaller nodes if CPU low); bin-packing for cost; multi-region: place workload in region closest to user or cheapest; spot/preemptible for fault-tolerant workloads with savings.
  • SLO and error budgets — define availability SLO (e.g. 99.9%); track error budget; use budget to gate releases (burn too fast = freeze) and prioritize reliability work; post-incident review and blameless culture.
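Error budgets follow directly from the SLO arithmetic: a 99.9% availability target over a 30-day window allows about 43 minutes of downtime. A sketch of budget tracking (window length and release-freeze rule are illustrative):

```python
def error_budget_minutes(slo: float, window_minutes: float = 30 * 24 * 60) -> float:
    """Allowed downtime in the window for a given availability SLO."""
    return (1 - slo) * window_minutes

def budget_remaining(slo: float, downtime_minutes: float,
                     window_minutes: float = 30 * 24 * 60) -> float:
    """Fraction of the error budget left; <= 0 means freeze releases
    and spend engineering time on reliability instead."""
    budget = error_budget_minutes(slo, window_minutes)
    return (budget - downtime_minutes) / budget
```

The budget is what turns the SLO from a report into a decision rule: burning it too fast gates releases, and the remaining fraction prioritizes reliability work against features.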

Advanced Diagram (conceptual)

                   Clients
                      |
                      v
   Global LB / DNS (multi-region if needed)
                      |
                      v
     Per-region: LB + nodes (multi-AZ)
           |                    |
           v                    v
   Chaos experiments     Automated recovery
   (kill node, drop AZ)  (replace, rollback)
           |                    |
           v                    v
   Validate recovery     Capacity & placement
           |             (right-size, region choice)
           v
    SLO & error budget

Patterns and Concerns at This Stage

  • Chaos safety: run in pre-prod first; in prod use blast-radius controls (e.g. one node, one AZ) and business approval; never run chaos experiments against critical stateful systems without replication.
  • Automation balance: automate recovery for known failures; avoid auto-acting on ambiguous alerts (could make things worse); human in loop for novel incidents.
  • SLO-driven ops: availability and latency SLOs; error budget; incident response and postmortems; continuous improvement.
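A blast-radius-limited chaos experiment can be sketched as: terminate one random node in a single AZ, then assert the cluster still meets its capacity floor. This is a pure in-memory simulation of the control logic; a real experiment would call your cloud or orchestrator API and measure live error rates:

```python
import random

def chaos_kill_one(nodes_by_az: dict[str, list[str]], az: str,
                   min_healthy: int, rng: random.Random) -> tuple[str, bool]:
    """Terminate one random node in a single AZ (the blast-radius limit)
    and report whether remaining capacity still meets the floor."""
    victim = rng.choice(nodes_by_az[az])
    nodes_by_az[az].remove(victim)
    remaining = sum(len(nodes) for nodes in nodes_by_az.values())
    return victim, remaining >= min_healthy
```

A failed assertion here (capacity below the floor) is exactly the finding chaos engineering is after: a gap discovered in a controlled experiment instead of a production incident.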

MVP delivers infrastructure redundancy with multiple nodes, load balancer, health checks, and manual or scripted failover. No single point of failure for compute; single region.

As you grow, you add auto-failover (LB + healthy nodes, optional orchestrator replace), multi-AZ or multi-region, deployment and rollback with SLO gates, and monitoring and alerting. You reduce MTTR and blast radius.

At advanced scale, you add chaos engineering to validate resilience, automated recovery (runbooks, replace, rollback), and capacity and placement optimization. You operate to SLOs and error budgets. This pattern is infra/SRE-focused; application-level patterns (e.g. stateless design, idempotency) are covered in other examples.

This approach gives you:

  • Start Simple — multiple nodes, LB, health checks, manual failover; ship and learn.
  • Scale Intentionally — add auto-failover and multi-AZ when availability demands it; add deploy/rollback when release velocity demands it.
  • Add Complexity Only When Required — avoid chaos and full automation until you have observability and runbooks; keep availability and detection first.