Distributed Systems Failure Modes: What Breaks and How to Detect It
In a single-process system, failure is binary — the process is running or it is not. Distributed systems fail in partial, ambiguous, and creative ways. Understanding failure modes is the first step to designing systems that tolerate them.
Taxonomy of Failures#
Failures
├── Crash failures — node stops, never recovers (or recovers later)
├── Omission failures — node drops messages (send or receive)
├── Timing failures — node responds outside expected time bounds
├── Byzantine failures — node behaves arbitrarily (including maliciously)
└── Gray failures — node appears healthy but performs incorrectly
Each class is broadly harder to handle than the one before it. Byzantine failures are the hardest to tolerate formally, while gray failures are often the hardest to detect in practice.
Crash Failures#
The simplest failure: a node stops responding. The process dies, the machine loses power, or the container gets OOM-killed.
Node A: [request] ──────> Node B: [CRASH]
Node A: ... waiting ...
Node A: ... timeout ...
Node A: "Node B is probably dead"
Detection#
- Heartbeats — periodic "I am alive" messages
- Timeout-based — no response within a deadline means presumed dead
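Both detection styles can be combined into a simple monitor. A minimal sketch (class name, interval, and miss threshold are illustrative, not tuned recommendations):

```python
import time

class HeartbeatMonitor:
    """Marks a peer as suspected dead once more than max_missed heartbeat
    intervals elapse without hearing from it."""

    def __init__(self, interval_s=1.0, max_missed=3):
        self.interval_s = interval_s
        self.max_missed = max_missed
        self.last_seen = {}  # node_id -> monotonic time of last heartbeat

    def record_heartbeat(self, node_id, now=None):
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def is_suspected(self, node_id, now=None):
        now = time.monotonic() if now is None else now
        last = self.last_seen.get(node_id)
        if last is None:
            return True  # never heard from this node at all
        # How many heartbeat intervals have passed since we last heard from it?
        return (now - last) / self.interval_s > self.max_missed
```

Note the built-in trade-off: a small `max_missed` detects crashes fast but misfires on slow nodes, which is exactly what the adaptive detectors later in this article address.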
Recovery Strategies#
- Replication — other nodes hold copies of the data
- Leader election — surviving nodes elect a new leader (Raft, Paxos)
- Restart with replay — write-ahead logs restore state after restart
Network Partitions#
A network partition splits the cluster into groups that cannot communicate with each other but are individually healthy.
[Node A] ──── [Node B]
──────────╳──────────  partition
[Node C] ──── [Node D]

Two groups, both think they are "the cluster"
The CAP Reality#
During a partition, you choose:
- CP — reject writes on the minority side (consistent but unavailable)
- AP — accept writes on both sides (available but inconsistent)
There is no option that gives you both consistency and availability during a partition. This is the CAP theorem.
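The CP choice reduces to a strict-majority check: a partition side may keep accepting writes only if it can still reach more than half of the cluster. A minimal sketch (function name and counts are illustrative):

```python
def can_accept_writes(reachable_nodes: int, cluster_size: int) -> bool:
    """CP behavior during a partition: only the side that can reach a
    strict majority of the full cluster keeps accepting writes; the
    minority side rejects them rather than risk divergence."""
    return reachable_nodes > cluster_size // 2
```

In a 5-node cluster split 3/2, the majority side (3 reachable) stays writable and the minority side (2 reachable) does not; in an even split of a 4-node cluster, neither side has a strict majority, so neither accepts writes.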
Real-World Partition Examples#
- Cloud availability zone loses connectivity to another zone
- A misconfigured firewall rule blocks traffic between services
- DNS failure makes a subset of nodes unreachable by name
- A saturated network link drops packets selectively
Split Brain#
Split brain happens when a partition causes two nodes to both believe they are the leader.
Before partition:
Leader: Node A Followers: [B, C, D, E]
After partition:
Group 1: Node A (still thinks it is leader)
Group 2: Nodes B, C, D, E (elect Node B as new leader)
Result: Two leaders accepting writes → data divergence
Prevention#
- Fencing tokens — each leader gets a monotonically increasing token; storage rejects writes from stale tokens
- Quorum-based writes — require majority acknowledgment (3 of 5 nodes)
- STONITH (Shoot The Other Node In The Head) — forcibly power off the old leader
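The fencing-token idea can be shown in a few lines. A minimal sketch (class and method names are hypothetical; in a real system the token comes from the lock or lease service, for example a ZooKeeper zxid, and the check happens on every write):

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token, so a
    deposed leader that wakes up late cannot corrupt state."""

    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token: int, key, value) -> bool:
        if token < self.highest_token:
            return False  # write from a stale leader: reject
        self.highest_token = token
        self.data[key] = value
        return True
```

Once the new leader (token 2) has written, a delayed write from the old leader (token 1) is refused, which prevents the data divergence shown above.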
Byzantine Failures#
A Byzantine node can send different values to different peers, lie about its state, or behave arbitrarily. This is the hardest failure class.
Node B (Byzantine):
→ tells Node A: "value is 42"
→ tells Node C: "value is 99"
→ tells Node D: nothing
Where Byzantine Failures Matter#
- Blockchain networks — nodes do not trust each other by design
- Aviation and space systems — hardware bit-flips cause incorrect outputs
- Multi-party computation — participants may be adversarial
Where They Usually Do Not#
Most internal distributed systems assume crash-fault tolerance, not Byzantine tolerance. The overhead of BFT consensus (e.g., PBFT requires 3f+1 nodes to tolerate f failures) is too high for typical backends.
Clock Skew#
Distributed systems have no shared clock. Each node's clock drifts independently.
Node A clock: 10:00:00.000
Node B clock: 10:00:00.347 (347ms ahead)
Node C clock: 09:59:59.812 (188ms behind)
Problems Caused by Clock Skew#
- Event ordering — "which write happened first?" has no global answer
- Lease expiration — Node A's lease expires at a different real-time than Node B expects
- Cache TTLs — entries expire at different times across nodes
- Certificate validation — TLS certificates may appear expired on skewed clocks
Solutions#
- NTP/PTP — keep clocks synchronized (but never perfectly)
- Logical clocks (Lamport timestamps) — order events by causality, not wall time
- Vector clocks — track causality across multiple nodes
- Hybrid logical clocks — combine physical time with logical counters (used in CockroachDB, YugabyteDB)
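The simplest of these, the Lamport timestamp, fits in a few lines: every local event increments a counter, every message carries it, and every receive jumps past the sender's value, so causally related events are always ordered correctly.

```python
class LamportClock:
    """Lamport timestamps: order events by causality, not wall time."""

    def __init__(self):
        self.counter = 0

    def local_event(self) -> int:
        self.counter += 1
        return self.counter

    def send(self) -> int:
        # A send is a local event; ship the counter with the message.
        return self.local_event()

    def receive(self, msg_timestamp: int) -> int:
        # On receive, jump past the sender's timestamp.
        self.counter = max(self.counter, msg_timestamp) + 1
        return self.counter
```

If node A sends at timestamp 1, node B's receive gets timestamp 2, so the send always orders before the receive regardless of how far the two wall clocks have drifted.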
Cascading Failures#
One failure triggers a chain reaction across the system.
1. Node A dies
2. Nodes B, C, D absorb Node A's traffic
3. Node B becomes overloaded → dies
4. Nodes C, D absorb even more traffic
5. Node C becomes overloaded → dies
6. Node D becomes overloaded → total outage
Prevention#
- Circuit breakers — stop sending requests to a failing service
- Bulkheads — isolate failure domains so one overloaded component does not drain shared resources
- Load shedding — reject excess requests early with 503 instead of queuing until OOM
- Backpressure — slow down producers when consumers cannot keep up
- Capacity headroom — provision enough surplus that losing a node does not overload survivors
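The first of these, the circuit breaker, is easy to sketch. This minimal version (names and threshold are illustrative) omits the half-open probing state a production breaker needs to recover automatically:

```python
class CircuitBreaker:
    """Opens after N consecutive failures; callers then fail fast instead
    of piling more requests onto a struggling service."""

    def __init__(self, failure_threshold=5):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def call(self, fn):
        if self.is_open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0  # any success resets the count
        return result
```

In the cascade above, a breaker between the load balancer and Node B would have started rejecting requests before B was driven into the ground, breaking step 3 of the chain.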
Gray Failures#
Gray failures are the most insidious. The node appears healthy — responds to heartbeats, passes health checks — but produces incorrect or degraded results.
Health check: GET /health → 200 OK
Actual behavior:
- 30% of database queries return stale data (replica lag)
- Responses take 10x longer than normal
- TLS handshakes fail intermittently
Why They Are Hard#
- Standard health checks pass
- Monitoring thresholds may not trigger
- The failure is partial — some requests succeed
- Different observers may see different behavior
Detection Strategies#
- Deep health checks — verify actual functionality, not just process liveness
- Differential observation — compare a node's behavior as seen by multiple peers
- Outlier detection — flag nodes whose latency or error rate deviates from the cohort
- End-to-end probes — synthetic transactions that exercise the full request path
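Outlier detection can start as simply as a z-score over per-node latencies. A minimal sketch (function name and threshold are illustrative; production systems favor robust statistics such as median/MAD and combine several signals, and the threshold must be low for small cohorts, where z-scores are mathematically bounded):

```python
from statistics import mean, stdev

def latency_outliers(latency_ms_by_node: dict, z_threshold: float = 1.5) -> list:
    """Flag nodes whose latency deviates far above the cohort average."""
    values = list(latency_ms_by_node.values())
    if len(values) < 3:
        return []  # too few samples to estimate spread
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # perfectly uniform cohort: nothing to flag
    return [node for node, v in latency_ms_by_node.items()
            if (v - mu) / sigma > z_threshold]
```

A node answering in 120 ms while its four peers answer in about 10 ms gets flagged even though its `/health` endpoint returns 200 OK.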
Failure Detection Mechanisms#
Simple Heartbeats#
Node A ──[heartbeat]──> Monitor
every 1s
Monitor: "3 missed heartbeats → Node A is dead"
Problems: a fixed timeout forces a trade-off. A short timeout produces false positives (a healthy but slow node is marked dead); a long timeout delays detection and failover.
Phi Accrual Failure Detector#
Instead of a binary alive/dead decision, the phi accrual detector outputs a suspicion level (phi) based on the statistical distribution of heartbeat arrival times.
phi = -log10(probability that the heartbeat is merely late)
phi = 1 → 10% chance it is just late
phi = 2 → 1% chance it is just late
phi = 5 → 0.001% chance → almost certainly dead
Each node maintains a sliding window of inter-arrival times. When a heartbeat is late, phi increases based on how unusual the delay is relative to historical behavior.
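The computation can be sketched directly from the formula above. This version assumes normally distributed inter-arrival times, as in the original Hayashibara et al. formulation (Cassandra approximates the tail with an exponential distribution instead), and the clamping constants are illustrative:

```python
from math import erf, log10, sqrt
from statistics import mean, stdev

def phi(elapsed: float, past_intervals) -> float:
    """Phi accrual suspicion level, given the time since the last
    heartbeat and a window of past inter-arrival times."""
    mu = mean(past_intervals)
    sigma = stdev(past_intervals) or 1e-3  # guard against zero variance
    z = (elapsed - mu) / sigma
    # P(heartbeat is merely late) = P(interval > elapsed) under the model
    p_later = max(0.5 * (1.0 - erf(z / sqrt(2.0))), 1e-12)
    return -log10(p_later)
```

With heartbeats historically arriving about every second, phi stays below 1 while the next one is on schedule and climbs rapidly once the silence stretches to several seconds, crossing whatever threshold you have configured.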
Advantages of Phi Accrual#
- Adaptive — adjusts to network conditions automatically
- No fixed timeout — works across fast LANs and slow WANs
- Configurable threshold — set phi threshold based on your tolerance for false positives
- Used by Akka, Cassandra, and other production systems
SWIM Protocol#
SWIM (Scalable Weakly-consistent Infection-style Membership) combines failure detection with membership dissemination.
1. Node A pings Node B directly
2. If B does not respond, A asks Nodes C and D to ping B (indirect probe)
3. If indirect probes also fail, A marks B as suspected
4. Suspicion is disseminated via gossip protocol
5. If B does not refute within a timeout, B is declared dead
SWIM scales to thousands of nodes because each node only probes a constant number of peers per round.
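Steps 1 through 3 of a probe round can be sketched as follows. Here `ping(src, dst)` is a hypothetical callable returning True on a successful probe; real SWIM piggybacks membership gossip on these messages and waits out a suspicion timeout (steps 4 and 5) before declaring death:

```python
import random

def swim_probe_round(target, peers, ping, k_indirect=3):
    """One SWIM probe round from the local node's perspective."""
    # 1. Try a direct ping first.
    if ping("self", target):
        return "alive"
    # 2. Ask k randomly chosen peers to probe the target on our behalf.
    helpers = random.sample(peers, min(k_indirect, len(peers)))
    if any(ping(helper, target) for helper in helpers):
        return "alive"
    # 3. Direct and indirect probes failed: mark suspected, gossip it.
    return "suspected"
```

The indirect probes are what make SWIM robust to a flaky link between two specific nodes: A failing to reach B is not enough to condemn B if C or D can still reach it.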
Design Principles#
- Assume everything fails — network, disks, clocks, DNS, other services
- Detect failures quickly — use adaptive detectors, not fixed timeouts
- Contain blast radius — bulkheads, circuit breakers, failure domains
- Prefer availability over consistency for user-facing systems (unless correctness is critical)
- Test failures actively — chaos engineering, fault injection, game days
- Make failures visible — structured logging, distributed tracing, anomaly detection
- Design for recovery — idempotent operations, write-ahead logs, reconciliation
Wrapping Up#
Distributed systems do not fail cleanly. They fail partially, intermittently, and ambiguously. Understanding the spectrum from crash failures to gray failures — and choosing the right detection and mitigation strategies — is what separates systems that survive production from those that wake you up at 3 AM.