Consensus Algorithm Comparison: Raft vs Paxos vs PBFT vs Zab
Distributed systems need nodes to agree on state. That agreement is consensus. Pick the wrong algorithm and you get unnecessary complexity, poor throughput, or a system that cannot survive the failures you actually face.
This guide compares the four major consensus algorithms head-to-head.
Why consensus matters#
Without consensus:
- Two nodes accept conflicting writes
- Leader election never converges
- Split-brain produces divergent state
- Clients read stale or contradictory data
Consensus guarantees that a majority of nodes agree on every state transition, even when some nodes crash or the network partitions.
The four algorithms at a glance#
Paxos#
Invented by Leslie Lamport in 1989. The original consensus algorithm. Paxos defines three roles: proposers, acceptors, and learners. A value is chosen when a majority of acceptors accept the same proposal.
Key properties:
- Tolerates up to (n-1)/2 crash failures
- No single leader required (multi-proposer)
- Notoriously difficult to understand and implement correctly
- Single-decree Paxos decides one value; Multi-Paxos extends it to a log
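The proposer/acceptor exchange above can be sketched in code. This is a minimal single-decree Paxos acceptor for illustration only: class and method names are my own, and real implementations must persist the promised and accepted state to disk before replying.

```python
# Minimal sketch of a single-decree Paxos acceptor (illustrative, not production).
# An acceptor promises to ignore proposals numbered below its highest promise,
# and accepts a proposal only if its number honors that promise.

class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised
        self.accepted_n = -1      # number of the accepted proposal, if any
        self.accepted_v = None    # value of the accepted proposal, if any

    def prepare(self, n):
        """Phase 1b: promise not to accept proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            # Return any previously accepted proposal so the proposer must
            # re-propose that value -- this is what makes Paxos safe.
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", None, None)

    def accept(self, n, v):
        """Phase 2b: accept the proposal if it honors our promise."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_v = n, v
            return "accepted"
        return "rejected"
```

A value is chosen once a majority of acceptors have accepted the same numbered proposal; a proposer that later runs phase 1 learns of it from the promise replies and must carry it forward.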
Raft#
Designed by Diego Ongaro and John Ousterhout in 2014, explicitly for understandability. Raft decomposes consensus into leader election, log replication, and safety.
Key properties:
- Strong leader model — all writes go through the leader
- Tolerates up to (n-1)/2 crash failures
- Clear separation of concerns makes implementation straightforward
- Log entries committed when majority acknowledges
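The majority-acknowledgment rule can be made concrete. Below is a sketch of how a Raft leader advances its commit index, with illustrative names; real Raft additionally requires the committed entry to be from the leader's current term, which this simplification omits.

```python
# Sketch of Raft's commit rule: the leader advances its commit index to the
# highest log index replicated on a majority of the cluster (itself included).
# match_index maps each follower to the highest log entry it has acknowledged.

def commit_index(leader_last_index, match_index):
    # The leader always holds its own entries, so include it in the tally.
    indexes = sorted([leader_last_index] + list(match_index.values()),
                     reverse=True)
    # In a descending sort, the entry at position n//2 is held by at least
    # n//2 + 1 nodes -- a majority.
    return indexes[len(indexes) // 2]
```

For a 3-node cluster where the leader is at index 5 and followers have acknowledged 5 and 3, entries through index 5 are committed: two of three nodes hold them.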
PBFT (Practical Byzantine Fault Tolerance)#
Published by Miguel Castro and Barbara Liskov in 1999. PBFT handles Byzantine faults — nodes that lie, send conflicting messages, or behave arbitrarily.
Key properties:
- Tolerates up to (n-1)/3 Byzantine failures
- Three-phase protocol: pre-prepare, prepare, commit
- Requires 3f + 1 nodes to tolerate f faulty nodes
- Higher message complexity: O(n²) per round
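The sizing arithmetic is worth spelling out, since it drives the cost comparison later. A small sketch (function names are mine):

```python
# BFT sizing arithmetic: to tolerate f Byzantine nodes, PBFT needs n = 3f + 1
# replicas, and each phase waits for a quorum of 2f + 1 matching messages so
# that any two quorums intersect in at least one honest node.

def pbft_min_nodes(f):
    return 3 * f + 1

def pbft_quorum(f):
    return 2 * f + 1

def crash_min_nodes(f):
    # Crash-fault-tolerant protocols (Raft, Multi-Paxos, Zab) need only a
    # simple majority to survive f stopped nodes.
    return 2 * f + 1
```

To survive a single faulty node, PBFT needs 4 replicas where Raft needs 3; to survive three, PBFT needs 10 where Raft needs 7.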
Zab (ZooKeeper Atomic Broadcast)#
Designed specifically for Apache ZooKeeper. Zab provides total order broadcast with a primary-backup approach.
Key properties:
- Primary-backup model similar to Raft's strong leader
- Tolerates up to (n-1)/2 crash failures
- Optimized for high-throughput state changes
- Separates recovery from normal operation (discovery, synchronization, broadcast phases)
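Zab's total ordering rests on its transaction id, the zxid: a 64-bit number whose high 32 bits hold the leader epoch and whose low 32 bits hold a per-epoch counter. A sketch of that encoding (helper names are mine):

```python
# Zab stamps every state change with a 64-bit zxid: high 32 bits are the
# leader epoch, low 32 bits a per-epoch counter. Comparing zxids as plain
# integers therefore orders transactions first by epoch, then by counter.

def make_zxid(epoch, counter):
    return (epoch << 32) | counter

def epoch_of(zxid):
    return zxid >> 32

def counter_of(zxid):
    return zxid & 0xFFFFFFFF
```

Because epoch dominates the comparison, any transaction from a newer leader sorts after everything an older leader broadcast, which is what lets recovery discard stale proposals.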
Comparison table#
┌────────────────┬──────────┬──────────┬──────────┬──────────┐
│ │ Paxos │ Raft │ PBFT │ Zab │
├────────────────┼──────────┼──────────┼──────────┼──────────┤
│ Fault model │ Crash │ Crash │Byzantine │ Crash │
│ Tolerance │(n-1)/2 │(n-1)/2 │(n-1)/3 │(n-1)/2 │
│ Min nodes (f=1)│ 3 │ 3 │ 4 │ 3 │
│ Leader │ Optional │ Required │ Primary │ Primary │
│ Messages/round │ O(n) │ O(n) │ O(n²) │ O(n) │
│ Latency │ Low │ Low │ High │ Low │
│ Throughput │ High │ High │ Medium │ High │
│ Complexity │ Very High│ Low │ High │ Medium │
│ Understandable │ Hard │ Easy │ Medium │ Medium │
│ Year │ 1989 │ 2014 │ 1999 │ 2008 │
└────────────────┴──────────┴──────────┴──────────┴──────────┘
Performance characteristics#
Throughput#
Raft, Paxos, and Zab achieve similar throughput in normal operation — the leader batches and replicates log entries to followers. The bottleneck is disk fsync and network RTT, not the algorithm.
PBFT has lower throughput because every round requires O(n²) messages. Each node must communicate with every other node during the prepare and commit phases.
Latency#
Raft and Zab commit in a single leader-to-follower round trip: the leader sends the entry, followers acknowledge, and the leader commits. Paxos can match this with Multi-Paxos and a stable leader.
PBFT needs three phases. In practice, that means higher latency per operation — roughly 2-3x compared to crash-fault-tolerant algorithms.
Scalability#
All four algorithms degrade as cluster size grows:
- Raft/Paxos/Zab: Linear message growth. 3-7 nodes is typical.
- PBFT: Quadratic message growth. Practical limit is around 20 nodes.
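A back-of-the-envelope count makes the linear-vs-quadratic gap concrete. The formulas below are illustrative assumptions, not from any specific paper: a leader-based crash-tolerant round is modeled as one broadcast plus acks, and PBFT's prepare and commit phases as all-to-all broadcasts.

```python
# Rough per-round message counts. Exact counts vary by formulation (e.g.
# whether client replies are included); these capture only the asymptotics.

def crash_messages(n):
    # Leader sends to n-1 followers, collects n-1 acks: O(n).
    return 2 * (n - 1)

def pbft_messages(n):
    # Pre-prepare: leader to n-1 replicas. Prepare and commit: each of the
    # n replicas broadcasts to the other n-1 (a simplification): O(n^2).
    return (n - 1) + 2 * n * (n - 1)
```

At n = 4 that is 6 messages versus 27; at n = 20 it is 38 versus 779 — which is why PBFT clusters stay small.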
Fault tolerance compared#
Crash fault tolerance (Raft, Paxos, Zab):
A crashed node stops responding. The algorithm continues as long as a majority is alive. Simple failure model. If a node comes back, it catches up from the leader's log.
Byzantine fault tolerance (PBFT):
A Byzantine node can do anything — send different values to different peers, delay messages, forge responses. PBFT handles this, but at the cost of more nodes (3f + 1 vs 2f + 1) and more messages.
When do you need BFT?
- Blockchain and cryptocurrency networks
- Multi-party systems with untrusted participants
- Financial systems with regulatory requirements for tamper resistance
Most internal distributed systems (databases, coordination services) only need crash fault tolerance.
Implementation difficulty#
Paxos — The original paper is famously hard to understand. Real implementations (Google Chubby, Spanner) required years of engineering. Getting Multi-Paxos right — with log compaction, membership changes, and snapshotting — is a multi-year project.
Raft — Designed to be implementable. The paper includes a complete specification. Dozens of production implementations exist (etcd, HashiCorp Consul, TiKV, CockroachDB). A competent team can build a working Raft in weeks.
PBFT — Medium difficulty. The protocol is well-specified but the message handling is complex. View changes (leader replacement) are the hardest part. Most teams use existing libraries (Tendermint, HotStuff variants).
Zab — Tightly coupled to ZooKeeper. Understanding Zab means understanding ZooKeeper's data model. Fewer standalone implementations exist because Zab was designed for one system.
Use cases — which algorithm for which system#
Use Raft when#
- Building a replicated state machine (database, key-value store)
- You want battle-tested libraries (etcd, Consul)
- Team understandability is a priority
- Examples: etcd, CockroachDB, TiKV, Consul, RethinkDB
Use Paxos when#
- Building at Google scale with dedicated distributed systems teams
- You need flexible quorum configurations
- Multi-region deployments with Flexible Paxos optimizations
- Examples: Google Spanner, Google Chubby, Amazon DynamoDB (Paxos variant)
Use PBFT (or modern BFT variants) when#
- Participants do not fully trust each other
- Blockchain or decentralized networks
- Regulatory requirements demand tamper-proof consensus
- Examples: Hyperledger Fabric, Tendermint/CometBFT, Libra/Diem
Use Zab when#
- You are running Apache ZooKeeper
- Coordination service (leader election, distributed locks, config management)
- Examples: ZooKeeper, Kafka (uses ZooKeeper for metadata — though KRaft is replacing it with Raft)
Decision flowchart#
Do you need Byzantine fault tolerance?
├── Yes → PBFT or modern BFT (HotStuff, Tendermint)
└── No → Crash fault tolerance
├── Need a coordination service?
│ ├── Yes → ZooKeeper (Zab) or etcd (Raft)
│ └── No → Building a replicated database?
│ ├── Yes → Raft (best ecosystem)
│ └── No → Paxos (if you have the team)
└── Already using ZooKeeper? → Zab is built in
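The flowchart above can be encoded as a function. This is a heuristic sketch, not a rule; the parameter names and returned strings are mine and simply mirror the branches.

```python
# The decision tree, as code. Booleans correspond to the flowchart questions.

def pick_consensus(byzantine, coordination_service=False,
                   replicated_database=False, has_zookeeper=False):
    if byzantine:
        return "PBFT or modern BFT (HotStuff, Tendermint)"
    if has_zookeeper:
        return "Zab (built into ZooKeeper)"      # already paid for, use it
    if coordination_service:
        return "ZooKeeper (Zab) or etcd (Raft)"
    if replicated_database:
        return "Raft"                            # best ecosystem
    return "Paxos (if you have the team)"
```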
Modern trends#
KRaft replacing Zab in Kafka — Apache Kafka is moving from ZooKeeper (Zab) to its own Raft implementation (KRaft). This removes the ZooKeeper dependency and simplifies operations.
HotStuff replacing PBFT — Modern BFT systems use HotStuff (linear message complexity) instead of PBFT (quadratic). Adopted by Meta's Diem/Libra blockchain.
Multi-Raft for sharding — Systems like TiKV and CockroachDB run one Raft group per shard. Each shard independently replicates, enabling horizontal scaling beyond the limits of a single Raft group.
Architecture example#
Single Raft Group (etcd / Consul):
Client → Leader (Node 1)
↓ AppendEntries RPC
Follower (Node 2) — ACK
Follower (Node 3) — ACK
Leader commits when majority ACKs
Multi-Raft (CockroachDB / TiKV):
Shard A: Leader(N1), Follower(N2), Follower(N3)
Shard B: Leader(N2), Follower(N3), Follower(N1)
Shard C: Leader(N3), Follower(N1), Follower(N2)
→ Leadership distributed across nodes
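The shard layout in the diagram can be produced by a simple round-robin placement. A sketch under that assumption — real systems like CockroachDB and TiKV use load- and locality-aware placement, not a fixed rotation, and the shard and node names here are illustrative.

```python
# Sketch of multi-Raft leadership placement: each shard runs its own Raft
# group, and leaders rotate across nodes so no single node serves all writes.

def place_groups(shards, nodes, replicas=3):
    placement = {}
    for i, shard in enumerate(shards):
        # Rotate the member list one position per shard.
        members = [nodes[(i + j) % len(nodes)] for j in range(replicas)]
        placement[shard] = {"leader": members[0], "followers": members[1:]}
    return placement
```

`place_groups(["A", "B", "C"], ["N1", "N2", "N3"])` reproduces the diagram: each node leads exactly one shard and follows the other two.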
Summary#
- Raft — best default choice. Easy to understand, great ecosystem, production-proven at scale
- Paxos — theoretical foundation. Use when you need flexible quorums or have a world-class team
- PBFT — when you cannot trust participants. Higher cost in nodes and messages
- Zab — purpose-built for ZooKeeper. Solid but tightly coupled to one system
- Most systems need crash fault tolerance, not Byzantine — do not over-engineer
- 3-5 nodes is the sweet spot for crash-tolerant consensus. Beyond that, each extra pair of nodes buys one more tolerated failure but enlarges every quorum, adding latency
Article #443 in the Codelit engineering series.