Distributed Consensus Explained: Raft, Paxos, and Beyond
In any distributed system with more than one node, a fundamental question arises: how do all nodes agree on the current state of the world? This is the consensus problem, and solving it correctly is what separates toy prototypes from production-grade infrastructure.
Why Consensus Matters#
Every time you write a row to CockroachDB, update a key in etcd, or elect a leader in a Kafka cluster, a consensus protocol is running underneath. Without consensus, distributed systems face:
- Data loss from conflicting writes
- Split brain where two partitions each believe they are authoritative
- Stale reads that violate user expectations
Consensus is the foundation of reliable replication, configuration management, and service discovery.
The CAP Theorem: Setting the Stage#
Before diving into algorithms, recall the CAP theorem: a distributed system can guarantee at most two of three properties simultaneously:
┌──────────────────────────────────┐
│           CAP Theorem            │
│                                  │
│         Consistency (C)          │
│             /    \               │
│            /      \              │
│   Availability    Partition      │
│       (A)        Tolerance (P)   │
│                                  │
│  CP: etcd, ZooKeeper, Consul     │
│  AP: Cassandra, DynamoDB         │
│  CA: single-node RDBMS (no P)    │
└──────────────────────────────────┘
Most consensus algorithms (Raft, Paxos, Zab) fall in the CP camp — they sacrifice availability during network partitions to preserve strong consistency.
Raft Consensus#
Raft was designed by Diego Ongaro and John Ousterhout specifically to be understandable. It decomposes consensus into three sub-problems.
1. Leader Election#
┌──────────┐  timeout   ┌───────────┐  majority vote   ┌────────┐
│ Follower │ ─────────► │ Candidate │ ───────────────► │ Leader │
└──────────┘            └───────────┘                  └────────┘
      ▲                                                     │
      └──────────── heartbeat / higher term ────────────────┘
- Every node starts as a Follower.
- If a follower receives no heartbeat within a randomized timeout, it becomes a Candidate and requests votes.
- A candidate that receives votes from a majority (quorum) becomes the Leader.
- Only one leader can exist per term.
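The election rules above can be sketched in a few lines of Python. This is a toy illustration, not any library's API; the function names are made up here, and the 150–300 ms range is the example timeout range used in the Raft paper.

```python
import random

def election_timeout_ms(low: int = 150, high: int = 300) -> int:
    """Each follower picks a random election timeout so that timeouts
    rarely expire simultaneously, which avoids repeated split votes."""
    return random.randint(low, high)

def wins_election(votes_received: int, cluster_size: int) -> bool:
    """A candidate becomes leader only with votes from a strict
    majority of the cluster (its own vote included)."""
    return votes_received > cluster_size // 2

assert wins_election(3, 5)       # quorum of 3/5: becomes leader
assert not wins_election(2, 5)   # no quorum: retry after a new timeout
```

The randomized timeout is the load-bearing trick: without it, all followers would time out together, vote for themselves, and split the vote forever.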
2. Log Replication#
Once elected, the leader accepts client writes and appends them to its log. It then replicates each entry to followers:
Client ──► Leader
              │
              ├──► Follower A  (AppendEntries RPC)
              ├──► Follower B
              └──► Follower C

Committed once a majority (3/5, 2/3, etc.) acknowledges.
An entry is committed once a majority of nodes have written it to their logs. The leader then applies it to the state machine and notifies followers.
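The commit rule reduces to arithmetic over the leader's per-follower `matchIndex` values. A minimal sketch (one caveat folded into the docstring: a real Raft leader only advances the commit index this way for entries from its own current term):

```python
def commit_index(match_index: list[int]) -> int:
    """Highest log index replicated on a majority of servers.

    match_index[i] is the highest entry known to be stored on server i
    (the leader's own log counts as one of them). Sorting descending,
    the quorum-th value is present on at least a quorum of servers.
    Note: a real Raft leader only advances the commit index this way
    for entries from its own current term.
    """
    quorum = len(match_index) // 2 + 1
    return sorted(match_index, reverse=True)[quorum - 1]

# 5 servers: three of them hold entry 4, so entry 4 is committed
# even though two lagging followers only have entry 2.
assert commit_index([5, 5, 4, 2, 2]) == 4
assert commit_index([3, 1, 1]) == 1
```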
3. Safety#
Raft guarantees that:
- Election Safety: at most one leader per term.
- Leader Append-Only: a leader never overwrites its log.
- Log Matching: if two logs contain an entry with the same index and term, all preceding entries are identical.
- State Machine Safety: once a log entry is applied, no other server applies a different entry for that index.
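The Log Matching property is what the AppendEntries consistency check enforces at runtime. A minimal sketch of that check (illustrative names; a log here is just the list of its entries' terms):

```python
def append_entries_ok(log_terms: list[int], prev_index: int,
                      prev_term: int) -> bool:
    """The AppendEntries consistency check behind Log Matching.

    log_terms[i] is the term of the follower's entry at index i, with
    a sentinel term 0 at index 0 (Raft log indices start at 1). The
    follower accepts new entries only if it holds an entry at
    prev_index whose term equals prev_term; on rejection, the leader
    retries with an earlier prev_index until the logs share a prefix.
    """
    if prev_index >= len(log_terms):
        return False  # follower's log is too short: missing entries
    return log_terms[prev_index] == prev_term

follower = [0, 1, 1, 2]                       # terms of entries 1..3
assert append_entries_ok(follower, 3, 2)      # prefix matches: accept
assert not append_entries_ok(follower, 3, 3)  # term conflict: reject
assert not append_entries_ok(follower, 5, 2)  # gap in log: reject
```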
Paxos Algorithm#
Leslie Lamport's Paxos predates Raft by decades and remains the theoretical gold standard. It defines three roles:
| Role | Responsibility |
|---|---|
| Proposer | Proposes a value with a unique proposal number |
| Acceptor | Votes on proposals; promises not to accept lower-numbered ones |
| Learner | Learns the chosen value once a quorum of acceptors agrees |
Two-Phase Protocol#
Phase 1 — Prepare
- Proposer sends `Prepare(n)` to acceptors.
- Each acceptor promises not to accept proposals numbered less than `n` and returns any previously accepted value.

Phase 2 — Accept
- Proposer sends `Accept(n, value)` to acceptors.
- If the acceptor has not promised a higher number, it accepts.
- Once a majority accepts, the value is chosen.
Proposer                 Acceptors (A1, A2, A3)
   │                             │
   │──── Prepare(n) ────────────►│
   │◄─── Promise(n) ─────────────│
   │                             │
   │──── Accept(n, v) ──────────►│
   │◄─── Accepted ───────────────│
   │                             │
   └──── Value chosen ───────────┘
Paxos is provably correct but notoriously hard to implement. Multi-Paxos extends it for a sequence of values, resembling Raft's log replication.
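The acceptor's two decisions can be sketched as a small state machine. This is a simplified single-decree, in-memory illustration (class and field names are made up here); a real acceptor must persist its promises to stable storage before replying, or a restart could let it break an earlier promise.

```python
class Acceptor:
    """Single-decree Paxos acceptor (in-memory sketch)."""

    def __init__(self) -> None:
        self.promised = -1          # highest proposal number promised
        self.accepted_n = -1        # number of highest accepted proposal
        self.accepted_value = None  # value of highest accepted proposal

    def prepare(self, n: int):
        """Phase 1: promise to ignore proposals numbered below n and
        report any previously accepted proposal back to the proposer."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", None, None)

    def accept(self, n: int, value) -> str:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_value = n, value
            return "accepted"
        return "rejected"

a = Acceptor()
assert a.prepare(1)[0] == "promise"
assert a.accept(1, "x") == "accepted"
assert a.prepare(1)[0] == "reject"       # not strictly greater than 1
assert a.accept(0, "y") == "rejected"    # below the outstanding promise
```

The safety-critical rule lives on the proposer side: if any Promise reports a previously accepted value, the proposer must propose that value instead of its own.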
PBFT: Byzantine Fault Tolerance#
Both Raft and Paxos assume nodes are honest but faulty (crash failures). Practical Byzantine Fault Tolerance (PBFT) handles nodes that lie, send conflicting messages, or act maliciously.
PBFT requires 3f + 1 nodes to tolerate f Byzantine faults and uses a three-phase protocol:
- Pre-prepare — leader assigns a sequence number
- Prepare — nodes broadcast agreement on ordering
- Commit — nodes confirm and execute
PBFT is used in permissioned blockchains and high-security systems where you cannot trust every participant.
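The sizing rule is simple arithmetic, worth internalizing (helper names here are illustrative):

```python
def pbft_cluster_size(f: int) -> int:
    """Minimum cluster size to tolerate f Byzantine nodes."""
    return 3 * f + 1

def pbft_quorum(f: int) -> int:
    """Matching replies needed per phase: 2f + 1 responses guarantee
    a majority of the responders are honest even if f of them lie."""
    return 2 * f + 1

assert pbft_cluster_size(1) == 4   # 4 nodes survive 1 traitor
assert pbft_quorum(3) == 7         # in a 10-node cluster with f = 3
```

Compare this with crash-fault protocols like Raft, which need only 2f + 1 nodes to tolerate f failures; tolerating liars roughly doubles the replication cost.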
Zab: ZooKeeper Atomic Broadcast#
Apache ZooKeeper uses Zab (ZooKeeper Atomic Broadcast), a protocol similar to Raft but designed before it. Key differences:
- Zab separates discovery, synchronization, and broadcast phases.
- The leader epoch replaces Raft's term concept.
- ZooKeeper guarantees total order of all state changes.
Zab powers the coordination layer for Hadoop, Kafka (older versions), and HBase.
Quorum-Based Reads and Writes#
Consensus systems rely on quorums — a majority subset of nodes. For a cluster of N nodes:
Write quorum (W) + Read quorum (R) > N
Example (N=5):
W = 3, R = 3 → Strong consistency
W = 5, R = 1 → Fast reads, slow writes
W = 1, R = 5 → Fast writes, slow reads
This arithmetic guarantees that any read quorum overlaps with the most recent write quorum, ensuring no stale data.
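The overlap condition is easy to check mechanically (an illustrative helper, not any particular system's API):

```python
def overlaps(n: int, w: int, r: int) -> bool:
    """True when every read quorum intersects every write quorum,
    so a read always touches at least one replica holding the most
    recent committed write."""
    return w + r > n

assert overlaps(5, 3, 3)        # W = 3, R = 3: strong consistency
assert overlaps(5, 5, 1)        # fast reads, slow writes
assert not overlaps(5, 2, 2)    # quorums can miss each other: stale reads
```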
Split Brain Prevention#
Split brain occurs when a network partition creates two sub-clusters that each elect a leader. Consensus protocols prevent this through quorum requirements:
5-node cluster, partition: [A, B] | [C, D, E]
[A, B] → 2/5 nodes → cannot form quorum → read-only / unavailable
[C, D, E] → 3/5 nodes → quorum achieved → continues serving writes
This is why production clusters use odd node counts (3, 5, 7): going from three nodes to four raises the quorum from 2 to 3 without increasing the number of failures the cluster can survive, so even counts add cost and coordination overhead but no extra fault tolerance.
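The same majority test explains both the partition behavior above and the odd-node-count advice (a toy illustration):

```python
def has_quorum(partition_size: int, cluster_size: int) -> bool:
    """A partition may elect a leader only with a strict majority."""
    return partition_size > cluster_size // 2

# 5-node cluster split [A, B] | [C, D, E]:
assert not has_quorum(2, 5)   # minority side: read-only / unavailable
assert has_quorum(3, 5)       # majority side: keeps serving writes

# 4-node cluster split 2 | 2: NEITHER side has quorum, so the whole
# cluster stalls until the partition heals.
assert not has_quorum(2, 4)
```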
Service Discovery: etcd and Consul#
Two dominant CP systems use consensus for service discovery:
etcd (Raft-based)
- Powers Kubernetes — every cluster state change goes through etcd.
- Linearizable reads and writes.
- Watch API for real-time change notifications.
Consul (Raft-based)
- Combines service discovery, health checking, and KV store.
- Supports multi-datacenter federation.
- Gossip protocol (Serf) for membership, Raft for state.
Real-World Usage#
| System | Protocol | Use Case |
|---|---|---|
| CockroachDB | Raft | Distributed SQL with serializable isolation |
| TiKV | Raft (Multi-Raft) | Distributed KV store backing TiDB |
| etcd | Raft | Kubernetes cluster state, distributed locking |
| ZooKeeper | Zab | Leader election, configuration management |
| Consul | Raft | Service mesh, health checks, KV store |
| YugabyteDB | Raft | Distributed PostgreSQL-compatible database |
CockroachDB and TiKV both use Multi-Raft — running a separate Raft group per data range — to scale consensus across terabytes of data without bottlenecking on a single leader.
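The Multi-Raft idea can be sketched as routing each key to the Raft group that owns its range. This is a toy illustration with hypothetical range boundaries; real systems keep range descriptors in replicated metadata and split ranges as they grow.

```python
import bisect

def raft_group_for(key: str, range_starts: list[str]) -> int:
    """Route a key to the Raft group owning its range.

    range_starts is a sorted list of range start keys; group i owns
    keys in [range_starts[i], range_starts[i + 1]). Each group runs
    its own independent leader election and log replication.
    """
    return bisect.bisect_right(range_starts, key) - 1

starts = ["", "g", "p"]          # three key ranges, three Raft groups
assert raft_group_for("apple", starts) == 0
assert raft_group_for("grape", starts) == 1
assert raft_group_for("zebra", starts) == 2
```

Because each range has its own leader, write load spreads across the cluster instead of funneling through one node.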
Key Takeaways#
- Consensus is non-negotiable for strongly consistent distributed systems.
- Raft is the modern default — understandable, well-tested, and widely adopted.
- Paxos is the theoretical foundation but harder to implement correctly.
- PBFT handles Byzantine faults at the cost of higher message complexity.
- Quorum math (`W + R > N`) is the universal primitive behind consistency guarantees.
- Always deploy consensus clusters with odd node counts; even counts raise the quorum without adding fault tolerance.
Understanding these protocols is essential for designing, operating, and debugging any distributed system at scale.
Ready to go deeper on distributed systems, consensus algorithms, and production architecture? Explore hands-on guides and system design breakdowns at codelit.io.