Distributed Consensus Explained: Raft, Paxos, and Beyond
In any distributed system with more than one node, a fundamental question arises: how do all nodes agree on the current state of the world? This is the consensus problem, and solving it correctly is what separates toy prototypes from production-grade infrastructure.
Why Consensus Matters#
Every time you write a row to CockroachDB, update a key in etcd, or elect a leader in a Kafka cluster, a consensus protocol is running underneath. Without consensus, distributed systems face:
- Data loss from conflicting writes
- Split brain where two partitions each believe they are authoritative
- Stale reads that violate user expectations
Consensus is the foundation of reliable replication, configuration management, and service discovery.
The CAP Theorem: Setting the Stage#
Before diving into algorithms, recall the CAP theorem: a distributed system can guarantee at most two of three properties simultaneously:
┌──────────────────────────────────┐
│           CAP Theorem            │
│                                  │
│         Consistency (C)          │
│             /    \               │
│            /      \              │
│   Availability    Partition      │
│       (A)        Tolerance (P)   │
│                                  │
│  CP: etcd, ZooKeeper, Consul     │
│  AP: Cassandra, DynamoDB         │
│  CA: single-node RDBMS (no P)    │
└──────────────────────────────────┘
Most consensus algorithms (Raft, Paxos, Zab) fall in the CP camp — they sacrifice availability during network partitions to preserve strong consistency.
Raft Consensus#
Raft was designed by Diego Ongaro and John Ousterhout specifically to be understandable. It decomposes consensus into three sub-problems.
1. Leader Election#
┌──────────┐  timeout   ┌───────────┐  majority vote   ┌────────┐
│ Follower │ ─────────► │ Candidate │ ───────────────► │ Leader │
└──────────┘            └───────────┘                  └────────┘
      ▲                                                     │
      └──────────── heartbeat / higher term ────────────────┘
- Every node starts as a Follower.
- If a follower receives no heartbeat within a randomized timeout, it becomes a Candidate and requests votes.
- A candidate that receives votes from a majority (quorum) becomes the Leader.
- Only one leader can exist per term.
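The election rules above can be sketched in a few lines of Python. This is a toy illustration, not any library's API; the function names are made up here, and the 150–300 ms range is the example timeout range used in the Raft paper.

```python
import random

def election_timeout_ms(low: int = 150, high: int = 300) -> int:
    """Each follower picks a random election timeout so that timeouts
    rarely expire simultaneously, which avoids repeated split votes."""
    return random.randint(low, high)

def wins_election(votes_received: int, cluster_size: int) -> bool:
    """A candidate becomes leader only with votes from a strict
    majority of the cluster (its own vote included)."""
    return votes_received > cluster_size // 2

assert wins_election(3, 5)       # quorum of 3/5: becomes leader
assert not wins_election(2, 5)   # no quorum: retry after a new timeout
```

The randomized timeout is the load-bearing trick: without it, all followers would time out together, vote for themselves, and split the vote forever.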
2. Log Replication#
Once elected, the leader accepts client writes and appends them to its log. It then replicates each entry to followers:
Client ──► Leader
              │
              ├──► Follower A  (AppendEntries RPC)
              ├──► Follower B
              └──► Follower C

Committed once a majority (3/5, 2/3, etc.) acknowledges.
An entry is committed once a majority of nodes have written it to their logs. The leader then applies it to the state machine and notifies followers.
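The commit rule reduces to arithmetic over the leader's per-follower `matchIndex` values. A minimal sketch (one caveat folded into the docstring: a real Raft leader only advances the commit index this way for entries from its own current term):

```python
def commit_index(match_index: list[int]) -> int:
    """Highest log index replicated on a majority of servers.

    match_index[i] is the highest entry known to be stored on server i
    (the leader's own log counts as one of them). Sorting descending,
    the quorum-th value is present on at least a quorum of servers.
    Note: a real Raft leader only advances the commit index this way
    for entries from its own current term.
    """
    quorum = len(match_index) // 2 + 1
    return sorted(match_index, reverse=True)[quorum - 1]

# 5 servers: three of them hold entry 4, so entry 4 is committed
# even though two lagging followers only have entry 2.
assert commit_index([5, 5, 4, 2, 2]) == 4
assert commit_index([3, 1, 1]) == 1
```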
3. Safety#
Raft guarantees that:
- Election Safety: at most one leader per term.
- Leader Append-Only: a leader never overwrites its log.
- Log Matching: if two logs contain an entry with the same index and term, all preceding entries are identical.
- State Machine Safety: once a log entry is applied, no other server applies a different entry for that index.
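The Log Matching property is what the AppendEntries consistency check enforces at runtime. A minimal sketch of that check (illustrative names; a log here is just the list of its entries' terms):

```python
def append_entries_ok(log_terms: list[int], prev_index: int,
                      prev_term: int) -> bool:
    """The AppendEntries consistency check behind Log Matching.

    log_terms[i] is the term of the follower's entry at index i, with
    a sentinel term 0 at index 0 (Raft log indices start at 1). The
    follower accepts new entries only if it holds an entry at
    prev_index whose term equals prev_term; on rejection, the leader
    retries with an earlier prev_index until the logs share a prefix.
    """
    if prev_index >= len(log_terms):
        return False  # follower's log is too short: missing entries
    return log_terms[prev_index] == prev_term

follower = [0, 1, 1, 2]                       # terms of entries 1..3
assert append_entries_ok(follower, 3, 2)      # prefix matches: accept
assert not append_entries_ok(follower, 3, 3)  # term conflict: reject
assert not append_entries_ok(follower, 5, 2)  # gap in log: reject
```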
Paxos Algorithm#
Leslie Lamport's Paxos predates Raft by decades and remains the theoretical gold standard. It defines three roles:
| Role | Responsibility |
|---|---|
| Proposer | Proposes a value with a unique proposal number |
| Acceptor | Votes on proposals; promises not to accept lower-numbered ones |
| Learner | Learns the chosen value once a quorum of acceptors agrees |
Two-Phase Protocol#
Phase 1 — Prepare
- Proposer sends `Prepare(n)` to acceptors.
- Each acceptor promises not to accept proposals numbered less than `n` and returns any previously accepted value.

Phase 2 — Accept
- Proposer sends `Accept(n, value)` to acceptors.
- If the acceptor has not promised a higher number, it accepts.
- Once a majority accepts, the value is chosen.
Proposer                 Acceptors (A1, A2, A3)
   │                             │
   │──── Prepare(n) ────────────►│
   │◄─── Promise(n) ─────────────│
   │                             │
   │──── Accept(n, v) ──────────►│
   │◄─── Accepted ───────────────│
   │                             │
   └──── Value chosen ───────────┘
Paxos is provably correct but notoriously hard to implement. Multi-Paxos extends it for a sequence of values, resembling Raft's log replication.
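The acceptor's two decisions can be sketched as a small state machine. This is a simplified single-decree, in-memory illustration (class and field names are made up here); a real acceptor must persist its promises to stable storage before replying, or a restart could let it break an earlier promise.

```python
class Acceptor:
    """Single-decree Paxos acceptor (in-memory sketch)."""

    def __init__(self) -> None:
        self.promised = -1          # highest proposal number promised
        self.accepted_n = -1        # number of highest accepted proposal
        self.accepted_value = None  # value of highest accepted proposal

    def prepare(self, n: int):
        """Phase 1: promise to ignore proposals numbered below n and
        report any previously accepted proposal back to the proposer."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", None, None)

    def accept(self, n: int, value) -> str:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_value = n, value
            return "accepted"
        return "rejected"

a = Acceptor()
assert a.prepare(1)[0] == "promise"
assert a.accept(1, "x") == "accepted"
assert a.prepare(1)[0] == "reject"       # not strictly greater than 1
assert a.accept(0, "y") == "rejected"    # below the outstanding promise
```

The safety-critical rule lives on the proposer side: if any Promise reports a previously accepted value, the proposer must propose that value instead of its own.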
PBFT: Byzantine Fault Tolerance#
Both Raft and Paxos assume nodes are honest but faulty (crash failures). Practical Byzantine Fault Tolerance (PBFT) handles nodes that lie, send conflicting messages, or act maliciously.
PBFT requires 3f + 1 nodes to tolerate f Byzantine faults and uses a three-phase protocol:
- Pre-prepare — leader assigns a sequence number
- Prepare — nodes broadcast agreement on ordering
- Commit — nodes confirm and execute
PBFT is used in permissioned blockchains and high-security systems where you cannot trust every participant.
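The sizing rule is simple arithmetic, worth internalizing (helper names here are illustrative):

```python
def pbft_cluster_size(f: int) -> int:
    """Minimum cluster size to tolerate f Byzantine nodes."""
    return 3 * f + 1

def pbft_quorum(f: int) -> int:
    """Matching replies needed per phase: 2f + 1 responses guarantee
    a majority of the responders are honest even if f of them lie."""
    return 2 * f + 1

assert pbft_cluster_size(1) == 4   # 4 nodes survive 1 traitor
assert pbft_quorum(3) == 7         # in a 10-node cluster with f = 3
```

Compare this with crash-fault protocols like Raft, which need only 2f + 1 nodes to tolerate f failures; tolerating liars roughly doubles the replication cost.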
Zab: ZooKeeper Atomic Broadcast#
Apache ZooKeeper uses Zab (ZooKeeper Atomic Broadcast), a protocol similar to Raft but designed before it. Key differences:
- Zab separates discovery, synchronization, and broadcast phases.
- The leader epoch replaces Raft's term concept.
- ZooKeeper guarantees total order of all state changes.
Zab powers the coordination layer for Hadoop, Kafka (older versions), and HBase.
Quorum-Based Reads and Writes#
Consensus systems rely on quorums — a majority subset of nodes. For a cluster of N nodes:
Write quorum (W) + Read quorum (R) > N
Example (N=5):
W = 3, R = 3 → Strong consistency
W = 5, R = 1 → Fast reads, slow writes
W = 1, R = 5 → Fast writes, slow reads
This arithmetic guarantees that any read quorum overlaps with the most recent write quorum, ensuring no stale data.
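The overlap condition is easy to check mechanically (an illustrative helper, not any particular system's API):

```python
def overlaps(n: int, w: int, r: int) -> bool:
    """True when every read quorum intersects every write quorum,
    so a read always touches at least one replica holding the most
    recent committed write."""
    return w + r > n

assert overlaps(5, 3, 3)        # W = 3, R = 3: strong consistency
assert overlaps(5, 5, 1)        # fast reads, slow writes
assert not overlaps(5, 2, 2)    # quorums can miss each other: stale reads
```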
Split Brain Prevention#
Split brain occurs when a network partition creates two sub-clusters that each elect a leader. Consensus protocols prevent this through quorum requirements:
5-node cluster, partition: [A, B] | [C, D, E]
[A, B] → 2/5 nodes → cannot form quorum → read-only / unavailable
[C, D, E] → 3/5 nodes → quorum achieved → continues serving writes
This is why production clusters use odd node counts (3, 5, 7): going from three nodes to four raises the quorum from 2 to 3 without increasing the number of failures the cluster can survive, so even counts add cost and coordination overhead but no extra fault tolerance.
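The same majority test explains both the partition behavior above and the odd-node-count advice (a toy illustration):

```python
def has_quorum(partition_size: int, cluster_size: int) -> bool:
    """A partition may elect a leader only with a strict majority."""
    return partition_size > cluster_size // 2

# 5-node cluster split [A, B] | [C, D, E]:
assert not has_quorum(2, 5)   # minority side: read-only / unavailable
assert has_quorum(3, 5)       # majority side: keeps serving writes

# 4-node cluster split 2 | 2: NEITHER side has quorum, so the whole
# cluster stalls until the partition heals.
assert not has_quorum(2, 4)
```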
Service Discovery: etcd and Consul#
Two dominant CP systems use consensus for service discovery:
etcd (Raft-based)
- Powers Kubernetes — every cluster state change goes through etcd.
- Linearizable reads and writes.
- Watch API for real-time change notifications.
Consul (Raft-based)
- Combines service discovery, health checking, and KV store.
- Supports multi-datacenter federation.
- Gossip protocol (Serf) for membership, Raft for state.
Real-World Usage#
| System | Protocol | Use Case |
|---|---|---|
| CockroachDB | Raft | Distributed SQL with serializable isolation |
| TiKV | Raft (Multi-Raft) | Distributed KV store backing TiDB |
| etcd | Raft | Kubernetes cluster state, distributed locking |
| ZooKeeper | Zab | Leader election, configuration management |
| Consul | Raft | Service mesh, health checks, KV store |
| YugabyteDB | Raft | Distributed PostgreSQL-compatible database |
CockroachDB and TiKV both use Multi-Raft — running a separate Raft group per data range — to scale consensus across terabytes of data without bottlenecking on a single leader.
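The Multi-Raft idea can be sketched as routing each key to the Raft group that owns its range. This is a toy illustration with hypothetical range boundaries; real systems keep range descriptors in replicated metadata and split ranges as they grow.

```python
import bisect

def raft_group_for(key: str, range_starts: list[str]) -> int:
    """Route a key to the Raft group owning its range.

    range_starts is a sorted list of range start keys; group i owns
    keys in [range_starts[i], range_starts[i + 1]). Each group runs
    its own independent leader election and log replication.
    """
    return bisect.bisect_right(range_starts, key) - 1

starts = ["", "g", "p"]          # three key ranges, three Raft groups
assert raft_group_for("apple", starts) == 0
assert raft_group_for("grape", starts) == 1
assert raft_group_for("zebra", starts) == 2
```

Because each range has its own leader, write load spreads across the cluster instead of funneling through one node.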
Key Takeaways#
- Consensus is non-negotiable for strongly consistent distributed systems.
- Raft is the modern default — understandable, well-tested, and widely adopted.
- Paxos is the theoretical foundation but harder to implement correctly.
- PBFT handles Byzantine faults at the cost of higher message complexity.
- Quorum math (`W + R > N`) is the universal primitive behind consistency guarantees.
- Always deploy consensus clusters with odd node counts; even counts raise the quorum without adding fault tolerance.
Understanding these protocols is essential for designing, operating, and debugging any distributed system at scale.
Ready to go deeper on distributed systems, consensus algorithms, and production architecture? Explore hands-on guides and system design breakdowns at codelit.io.