Distributed Systems Failure Modes: What Breaks and How to Detect It
In a single-process system, failure is binary — the process is running or it is not. Distributed systems fail in partial, ambiguous, and creative ways. Understanding failure modes is the first step to designing systems that tolerate them.
Taxonomy of Failures#
Failures
├── Crash failures — node stops, never recovers (or recovers later)
├── Omission failures — node drops messages (send or receive)
├── Timing failures — node responds outside expected time bounds
├── Byzantine failures — node behaves arbitrarily (including maliciously)
└── Gray failures — node appears healthy but performs incorrectly
Each class is broadly harder to handle than the one before it. Byzantine failures are the hardest to tolerate formally, while gray failures are often the hardest to detect in practice.
Crash Failures#
The simplest failure: a node stops responding. The process dies, the machine loses power, or the container gets OOM-killed.
Node A: [request] ──────> Node B: [CRASH]
Node A: ... waiting ...
Node A: ... timeout ...
Node A: "Node B is probably dead"
Detection#
- Heartbeats — periodic "I am alive" messages
- Timeout-based — no response within a deadline means presumed dead
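Both detection styles can be combined into a simple monitor. A minimal sketch (class name, interval, and miss threshold are illustrative, not tuned recommendations):

```python
import time

class HeartbeatMonitor:
    """Marks a peer as suspected dead once more than max_missed heartbeat
    intervals elapse without hearing from it."""

    def __init__(self, interval_s=1.0, max_missed=3):
        self.interval_s = interval_s
        self.max_missed = max_missed
        self.last_seen = {}  # node_id -> monotonic time of last heartbeat

    def record_heartbeat(self, node_id, now=None):
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def is_suspected(self, node_id, now=None):
        now = time.monotonic() if now is None else now
        last = self.last_seen.get(node_id)
        if last is None:
            return True  # never heard from this node at all
        # How many heartbeat intervals have passed since we last heard from it?
        return (now - last) / self.interval_s > self.max_missed
```

Note the built-in trade-off: a small `max_missed` detects crashes fast but misfires on slow nodes, which is exactly what the adaptive detectors later in this article address.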
Recovery Strategies#
- Replication — other nodes hold copies of the data
- Leader election — surviving nodes elect a new leader (Raft, Paxos)
- Restart with replay — write-ahead logs restore state after restart
Network Partitions#
A network partition splits the cluster into groups that cannot communicate with each other but are individually healthy.
[Node A] ──── [Node B]
──────────╳──────────  partition
[Node C] ──── [Node D]

Two groups, both think they are "the cluster"
The CAP Reality#
During a partition, you choose:
- CP — reject writes on the minority side (consistent but unavailable)
- AP — accept writes on both sides (available but inconsistent)
There is no option that gives you both consistency and availability during a partition. This is the CAP theorem.
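The CP choice reduces to a strict-majority check: a partition side may keep accepting writes only if it can still reach more than half of the cluster. A minimal sketch (function name and counts are illustrative):

```python
def can_accept_writes(reachable_nodes: int, cluster_size: int) -> bool:
    """CP behavior during a partition: only the side that can reach a
    strict majority of the full cluster keeps accepting writes; the
    minority side rejects them rather than risk divergence."""
    return reachable_nodes > cluster_size // 2
```

In a 5-node cluster split 3/2, the majority side (3 reachable) stays writable and the minority side (2 reachable) does not; in an even split of a 4-node cluster, neither side has a strict majority, so neither accepts writes.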
Real-World Partition Examples#
- Cloud availability zone loses connectivity to another zone
- A misconfigured firewall rule blocks traffic between services
- DNS failure makes a subset of nodes unreachable by name
- A saturated network link drops packets selectively
Split Brain#
Split brain happens when a partition causes two nodes to both believe they are the leader.
Before partition:
Leader: Node A Followers: [B, C, D, E]
After partition:
Group 1: Node A (still thinks it is leader)
Group 2: Nodes B, C, D, E (elect Node B as new leader)
Result: Two leaders accepting writes → data divergence
Prevention#
- Fencing tokens — each leader gets a monotonically increasing token; storage rejects writes from stale tokens
- Quorum-based writes — require majority acknowledgment (3 of 5 nodes)
- STONITH (Shoot The Other Node In The Head) — forcibly power off the old leader
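The fencing-token idea can be shown in a few lines. A minimal sketch (class and method names are hypothetical; in a real system the token comes from the lock or lease service, for example a ZooKeeper zxid, and the check happens on every write):

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token, so a
    deposed leader that wakes up late cannot corrupt state."""

    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token: int, key, value) -> bool:
        if token < self.highest_token:
            return False  # write from a stale leader: reject
        self.highest_token = token
        self.data[key] = value
        return True
```

Once the new leader (token 2) has written, a delayed write from the old leader (token 1) is refused, which prevents the data divergence shown above.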
Byzantine Failures#
A Byzantine node can send different values to different peers, lie about its state, or behave arbitrarily. This is the hardest failure class.
Node B (Byzantine):
→ tells Node A: "value is 42"
→ tells Node C: "value is 99"
→ tells Node D: nothing
Where Byzantine Failures Matter#
- Blockchain networks — nodes do not trust each other by design
- Aviation and space systems — hardware bit-flips cause incorrect outputs
- Multi-party computation — participants may be adversarial
Where They Usually Do Not#
Most internal distributed systems assume crash-fault tolerance, not Byzantine tolerance. The overhead of BFT consensus (e.g., PBFT requires 3f+1 nodes to tolerate f failures) is too high for typical backends.
Clock Skew#
Distributed systems have no shared clock. Each node's clock drifts independently.
Node A clock: 10:00:00.000
Node B clock: 10:00:00.347 (347ms ahead)
Node C clock: 09:59:59.812 (188ms behind)
Problems Caused by Clock Skew#
- Event ordering — "which write happened first?" has no global answer
- Lease expiration — Node A's lease expires at a different real-time than Node B expects
- Cache TTLs — entries expire at different times across nodes
- Certificate validation — TLS certificates may appear expired on skewed clocks
Solutions#
- NTP/PTP — keep clocks synchronized (but never perfectly)
- Logical clocks (Lamport timestamps) — order events by causality, not wall time
- Vector clocks — track causality across multiple nodes
- Hybrid logical clocks — combine physical time with logical counters (used in CockroachDB, YugabyteDB)
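The simplest of these, the Lamport timestamp, fits in a few lines: every local event increments a counter, every message carries it, and every receive jumps past the sender's value, so causally related events are always ordered correctly.

```python
class LamportClock:
    """Lamport timestamps: order events by causality, not wall time."""

    def __init__(self):
        self.counter = 0

    def local_event(self) -> int:
        self.counter += 1
        return self.counter

    def send(self) -> int:
        # A send is a local event; ship the counter with the message.
        return self.local_event()

    def receive(self, msg_timestamp: int) -> int:
        # On receive, jump past the sender's timestamp.
        self.counter = max(self.counter, msg_timestamp) + 1
        return self.counter
```

If node A sends at timestamp 1, node B's receive gets timestamp 2, so the send always orders before the receive regardless of how far the two wall clocks have drifted.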
Cascading Failures#
One failure triggers a chain reaction across the system.
1. Node A dies
2. Nodes B, C, D absorb Node A's traffic
3. Node B becomes overloaded → dies
4. Nodes C, D absorb even more traffic
5. Node C becomes overloaded → dies
6. Node D becomes overloaded → total outage
Prevention#
- Circuit breakers — stop sending requests to a failing service
- Bulkheads — isolate failure domains so one overloaded component does not drain shared resources
- Load shedding — reject excess requests early with 503 instead of queuing until OOM
- Backpressure — slow down producers when consumers cannot keep up
- Capacity headroom — provision enough surplus that losing a node does not overload survivors
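The first of these, the circuit breaker, is easy to sketch. This minimal version (names and threshold are illustrative) omits the half-open probing state a production breaker needs to recover automatically:

```python
class CircuitBreaker:
    """Opens after N consecutive failures; callers then fail fast instead
    of piling more requests onto a struggling service."""

    def __init__(self, failure_threshold=5):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def call(self, fn):
        if self.is_open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0  # any success resets the count
        return result
```

In the cascade above, a breaker between the load balancer and Node B would have started rejecting requests before B was driven into the ground, breaking step 3 of the chain.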
Gray Failures#
Gray failures are the most insidious. The node appears healthy — responds to heartbeats, passes health checks — but produces incorrect or degraded results.
Health check: GET /health → 200 OK
Actual behavior:
- 30% of database queries return stale data (replica lag)
- Responses take 10x longer than normal
- TLS handshakes fail intermittently
Why They Are Hard#
- Standard health checks pass
- Monitoring thresholds may not trigger
- The failure is partial — some requests succeed
- Different observers may see different behavior
Detection Strategies#
- Deep health checks — verify actual functionality, not just process liveness
- Differential observation — compare a node's behavior as seen by multiple peers
- Outlier detection — flag nodes whose latency or error rate deviates from the cohort
- End-to-end probes — synthetic transactions that exercise the full request path
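Outlier detection can start as simply as a z-score over per-node latencies. A minimal sketch (function name and threshold are illustrative; production systems favor robust statistics such as median/MAD and combine several signals, and the threshold must be low for small cohorts, where z-scores are mathematically bounded):

```python
from statistics import mean, stdev

def latency_outliers(latency_ms_by_node: dict, z_threshold: float = 1.5) -> list:
    """Flag nodes whose latency deviates far above the cohort average."""
    values = list(latency_ms_by_node.values())
    if len(values) < 3:
        return []  # too few samples to estimate spread
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # perfectly uniform cohort: nothing to flag
    return [node for node, v in latency_ms_by_node.items()
            if (v - mu) / sigma > z_threshold]
```

A node answering in 120 ms while its four peers answer in about 10 ms gets flagged even though its `/health` endpoint returns 200 OK.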
Failure Detection Mechanisms#
Simple Heartbeats#
Node A ──[heartbeat]──> Monitor
every 1s
Monitor: "3 missed heartbeats → Node A is dead"
Problems: a fixed timeout forces a trade-off. A short timeout produces false positives (a healthy but slow node is marked dead); a long timeout delays detection and failover.
Phi Accrual Failure Detector#
Instead of a binary alive/dead decision, the phi accrual detector outputs a suspicion level (phi) based on the statistical distribution of heartbeat arrival times.
phi = -log10(probability that the heartbeat is merely late)
phi = 1 → 10% chance it is just late
phi = 2 → 1% chance it is just late
phi = 5 → 0.001% chance → almost certainly dead
Each node maintains a sliding window of inter-arrival times. When a heartbeat is late, phi increases based on how unusual the delay is relative to historical behavior.
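The computation can be sketched directly from the formula above. This version assumes normally distributed inter-arrival times, as in the original Hayashibara et al. formulation (Cassandra approximates the tail with an exponential distribution instead), and the clamping constants are illustrative:

```python
from math import erf, log10, sqrt
from statistics import mean, stdev

def phi(elapsed: float, past_intervals) -> float:
    """Phi accrual suspicion level, given the time since the last
    heartbeat and a window of past inter-arrival times."""
    mu = mean(past_intervals)
    sigma = stdev(past_intervals) or 1e-3  # guard against zero variance
    z = (elapsed - mu) / sigma
    # P(heartbeat is merely late) = P(interval > elapsed) under the model
    p_later = max(0.5 * (1.0 - erf(z / sqrt(2.0))), 1e-12)
    return -log10(p_later)
```

With heartbeats historically arriving about every second, phi stays below 1 while the next one is on schedule and climbs rapidly once the silence stretches to several seconds, crossing whatever threshold you have configured.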
Advantages of Phi Accrual#
- Adaptive — adjusts to network conditions automatically
- No fixed timeout — works across fast LANs and slow WANs
- Configurable threshold — set phi threshold based on your tolerance for false positives
- Used by Akka, Cassandra, and other production systems
SWIM Protocol#
SWIM (Scalable Weakly-consistent Infection-style Membership) combines failure detection with membership dissemination.
1. Node A pings Node B directly
2. If B does not respond, A asks Nodes C and D to ping B (indirect probe)
3. If indirect probes also fail, A marks B as suspected
4. Suspicion is disseminated via gossip protocol
5. If B does not refute within a timeout, B is declared dead
SWIM scales to thousands of nodes because each node only probes a constant number of peers per round.
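Steps 1 through 3 of a probe round can be sketched as follows. Here `ping(src, dst)` is a hypothetical callable returning True on a successful probe; real SWIM piggybacks membership gossip on these messages and waits out a suspicion timeout (steps 4 and 5) before declaring death:

```python
import random

def swim_probe_round(target, peers, ping, k_indirect=3):
    """One SWIM probe round from the local node's perspective."""
    # 1. Try a direct ping first.
    if ping("self", target):
        return "alive"
    # 2. Ask k randomly chosen peers to probe the target on our behalf.
    helpers = random.sample(peers, min(k_indirect, len(peers)))
    if any(ping(helper, target) for helper in helpers):
        return "alive"
    # 3. Direct and indirect probes failed: mark suspected, gossip it.
    return "suspected"
```

The indirect probes are what make SWIM robust to a flaky link between two specific nodes: A failing to reach B is not enough to condemn B if C or D can still reach it.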
Design Principles#
- Assume everything fails — network, disks, clocks, DNS, other services
- Detect failures quickly — use adaptive detectors, not fixed timeouts
- Contain blast radius — bulkheads, circuit breakers, failure domains
- Prefer availability over consistency for user-facing systems (unless correctness is critical)
- Test failures actively — chaos engineering, fault injection, game days
- Make failures visible — structured logging, distributed tracing, anomaly detection
- Design for recovery — idempotent operations, write-ahead logs, reconciliation
Wrapping Up#
Distributed systems do not fail cleanly. They fail partially, intermittently, and ambiguously. Understanding the spectrum from crash failures to gray failures — and choosing the right detection and mitigation strategies — is what separates systems that survive production from those that wake you up at 3 AM.