Consensus Algorithm Comparison: Raft vs Paxos vs PBFT vs Zab
Distributed systems need nodes to agree on state. That agreement is consensus. Pick the wrong algorithm and you get unnecessary complexity, poor throughput, or a system that cannot survive the failures you actually face.
This guide compares the four major consensus algorithms head-to-head.
Why consensus matters#
Without consensus:
- Two nodes accept conflicting writes
- Leader election never converges
- Split-brain produces divergent state
- Clients read stale or contradictory data
Consensus guarantees that a majority of nodes agree on every state transition, even when some nodes crash or the network partitions.
The four algorithms at a glance#
Paxos#
Invented by Leslie Lamport in 1989. The original consensus algorithm. Paxos defines three roles: proposers, acceptors, and learners. A value is chosen when a majority of acceptors accept the same proposal.
Key properties:
- Tolerates up to (n-1)/2 crash failures
- No single leader required (multi-proposer)
- Notoriously difficult to understand and implement correctly
- Single-decree Paxos decides one value; Multi-Paxos extends it to a log
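The proposer/acceptor exchange above can be sketched in code. This is a minimal single-decree Paxos acceptor for illustration only: class and method names are my own, and real implementations must persist the promised and accepted state to disk before replying.

```python
# Minimal sketch of a single-decree Paxos acceptor (illustrative, not production).
# An acceptor promises to ignore proposals numbered below its highest promise,
# and accepts a proposal only if its number honors that promise.

class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised
        self.accepted_n = -1      # number of the accepted proposal, if any
        self.accepted_v = None    # value of the accepted proposal, if any

    def prepare(self, n):
        """Phase 1b: promise not to accept proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            # Return any previously accepted proposal so the proposer must
            # re-propose that value -- this is what makes Paxos safe.
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", None, None)

    def accept(self, n, v):
        """Phase 2b: accept the proposal if it honors our promise."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_v = n, v
            return "accepted"
        return "rejected"
```

A value is chosen once a majority of acceptors have accepted the same numbered proposal; a proposer that later runs phase 1 learns of it from the promise replies and must carry it forward.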
Raft#
Designed by Diego Ongaro and John Ousterhout in 2014, explicitly for understandability. Raft decomposes consensus into leader election, log replication, and safety.
Key properties:
- Strong leader model — all writes go through the leader
- Tolerates up to (n-1)/2 crash failures
- Clear separation of concerns makes implementation straightforward
- Log entries committed when majority acknowledges
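The majority-acknowledgment rule can be made concrete. Below is a sketch of how a Raft leader advances its commit index, with illustrative names; real Raft additionally requires the committed entry to be from the leader's current term, which this simplification omits.

```python
# Sketch of Raft's commit rule: the leader advances its commit index to the
# highest log index replicated on a majority of the cluster (itself included).
# match_index maps each follower to the highest log entry it has acknowledged.

def commit_index(leader_last_index, match_index):
    # The leader always holds its own entries, so include it in the tally.
    indexes = sorted([leader_last_index] + list(match_index.values()),
                     reverse=True)
    # In a descending sort, the entry at position n//2 is held by at least
    # n//2 + 1 nodes -- a majority.
    return indexes[len(indexes) // 2]
```

For a 3-node cluster where the leader is at index 5 and followers have acknowledged 5 and 3, entries through index 5 are committed: two of three nodes hold them.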
PBFT (Practical Byzantine Fault Tolerance)#
Published by Miguel Castro and Barbara Liskov in 1999. PBFT handles Byzantine faults — nodes that lie, send conflicting messages, or behave arbitrarily.
Key properties:
- Tolerates up to (n-1)/3 Byzantine failures
- Three-phase protocol: pre-prepare, prepare, commit
- Requires 3f + 1 nodes to tolerate f faulty nodes
- Higher message complexity: O(n²) per round
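The sizing arithmetic is worth spelling out, since it drives the cost comparison later. A small sketch (function names are mine):

```python
# BFT sizing arithmetic: to tolerate f Byzantine nodes, PBFT needs n = 3f + 1
# replicas, and each phase waits for a quorum of 2f + 1 matching messages so
# that any two quorums intersect in at least one honest node.

def pbft_min_nodes(f):
    return 3 * f + 1

def pbft_quorum(f):
    return 2 * f + 1

def crash_min_nodes(f):
    # Crash-fault-tolerant protocols (Raft, Multi-Paxos, Zab) need only a
    # simple majority to survive f stopped nodes.
    return 2 * f + 1
```

To survive a single faulty node, PBFT needs 4 replicas where Raft needs 3; to survive three, PBFT needs 10 where Raft needs 7.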
Zab (ZooKeeper Atomic Broadcast)#
Designed specifically for Apache ZooKeeper. Zab provides total order broadcast with a primary-backup approach.
Key properties:
- Primary-backup model similar to Raft's strong leader
- Tolerates up to (n-1)/2 crash failures
- Optimized for high-throughput state changes
- Separates recovery from normal operation (discovery, synchronization, broadcast phases)
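Zab's total ordering rests on its transaction id, the zxid: a 64-bit number whose high 32 bits hold the leader epoch and whose low 32 bits hold a per-epoch counter. A sketch of that encoding (helper names are mine):

```python
# Zab stamps every state change with a 64-bit zxid: high 32 bits are the
# leader epoch, low 32 bits a per-epoch counter. Comparing zxids as plain
# integers therefore orders transactions first by epoch, then by counter.

def make_zxid(epoch, counter):
    return (epoch << 32) | counter

def epoch_of(zxid):
    return zxid >> 32

def counter_of(zxid):
    return zxid & 0xFFFFFFFF
```

Because epoch dominates the comparison, any transaction from a newer leader sorts after everything an older leader broadcast, which is what lets recovery discard stale proposals.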
Comparison table#
┌────────────────┬──────────┬──────────┬──────────┬──────────┐
│ │ Paxos │ Raft │ PBFT │ Zab │
├────────────────┼──────────┼──────────┼──────────┼──────────┤
│ Fault model │ Crash │ Crash │Byzantine │ Crash │
│ Tolerance │(n-1)/2 │(n-1)/2 │(n-1)/3 │(n-1)/2 │
│ Min nodes (f=1)│ 3 │ 3 │ 4 │ 3 │
│ Leader │ Optional │ Required │ Primary │ Primary │
│ Messages/round │ O(n) │ O(n) │ O(n²) │ O(n) │
│ Latency │ Low │ Low │ High │ Low │
│ Throughput │ High │ High │ Medium │ High │
│ Complexity │ Very High│ Low │ High │ Medium │
│ Understandable │ Hard │ Easy │ Medium │ Medium │
│ Year │ 1989 │ 2014 │ 1999 │ 2008 │
└────────────────┴──────────┴──────────┴──────────┴──────────┘
Performance characteristics#
Throughput#
Raft, Paxos, and Zab achieve similar throughput in normal operation — the leader batches and replicates log entries to followers. The bottleneck is disk fsync and network RTT, not the algorithm.
PBFT has lower throughput because every round requires O(n²) messages. Each node must communicate with every other node during the prepare and commit phases.
Latency#
Raft and Zab commit in a single leader-to-follower round trip: the leader sends the entry, followers acknowledge, and the leader commits. Paxos can match this with Multi-Paxos and a stable leader.
PBFT needs three phases. In practice, that means higher latency per operation — roughly 2-3x compared to crash-fault-tolerant algorithms.
Scalability#
All four algorithms degrade as cluster size grows:
- Raft/Paxos/Zab: Linear message growth. 3-7 nodes is typical.
- PBFT: Quadratic message growth. Practical limit is around 20 nodes.
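A back-of-the-envelope count makes the linear-vs-quadratic gap concrete. The formulas below are illustrative assumptions, not from any specific paper: a leader-based crash-tolerant round is modeled as one broadcast plus acks, and PBFT's prepare and commit phases as all-to-all broadcasts.

```python
# Rough per-round message counts. Exact counts vary by formulation (e.g.
# whether client replies are included); these capture only the asymptotics.

def crash_messages(n):
    # Leader sends to n-1 followers, collects n-1 acks: O(n).
    return 2 * (n - 1)

def pbft_messages(n):
    # Pre-prepare: leader to n-1 replicas. Prepare and commit: each of the
    # n replicas broadcasts to the other n-1 (a simplification): O(n^2).
    return (n - 1) + 2 * n * (n - 1)
```

At n = 4 that is 6 messages versus 27; at n = 20 it is 38 versus 779 — which is why PBFT clusters stay small.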
Fault tolerance compared#
Crash fault tolerance (Raft, Paxos, Zab):
A crashed node stops responding. The algorithm continues as long as a majority is alive. Simple failure model. If a node comes back, it catches up from the leader's log.
Byzantine fault tolerance (PBFT):
A Byzantine node can do anything — send different values to different peers, delay messages, forge responses. PBFT handles this, but at the cost of more nodes (3f + 1 vs 2f + 1) and more messages.
When do you need BFT?
- Blockchain and cryptocurrency networks
- Multi-party systems with untrusted participants
- Financial systems with regulatory requirements for tamper resistance
Most internal distributed systems (databases, coordination services) only need crash fault tolerance.
Implementation difficulty#
Paxos — The original paper is famously hard to understand. Real implementations (Google Chubby, Spanner) required years of engineering. Getting Multi-Paxos right — with log compaction, membership changes, and snapshotting — is a multi-year project.
Raft — Designed to be implementable. The paper includes a complete specification. Dozens of production implementations exist (etcd, HashiCorp Consul, TiKV, CockroachDB). A competent team can build a working Raft in weeks.
PBFT — Medium difficulty. The protocol is well-specified but the message handling is complex. View changes (leader replacement) are the hardest part. Most teams use existing libraries (Tendermint, HotStuff variants).
Zab — Tightly coupled to ZooKeeper. Understanding Zab means understanding ZooKeeper's data model. Fewer standalone implementations exist because Zab was designed for one system.
Use cases — which algorithm for which system#
Use Raft when#
- Building a replicated state machine (database, key-value store)
- You want battle-tested libraries (etcd, Consul)
- Team understandability is a priority
- Examples: etcd, CockroachDB, TiKV, Consul, RethinkDB
Use Paxos when#
- Building at Google scale with dedicated distributed systems teams
- You need flexible quorum configurations
- Multi-region deployments with Flexible Paxos optimizations
- Examples: Google Spanner, Google Chubby, Amazon DynamoDB (Paxos variant)
Use PBFT (or modern BFT variants) when#
- Participants do not fully trust each other
- Blockchain or decentralized networks
- Regulatory requirements demand tamper-proof consensus
- Examples: Hyperledger Fabric, Tendermint/CometBFT, Libra/Diem
Use Zab when#
- You are running Apache ZooKeeper
- Coordination service (leader election, distributed locks, config management)
- Examples: ZooKeeper, Kafka (uses ZooKeeper for metadata — though KRaft is replacing it with Raft)
Decision flowchart#
Do you need Byzantine fault tolerance?
├── Yes → PBFT or modern BFT (HotStuff, Tendermint)
└── No → Crash fault tolerance
├── Need a coordination service?
│ ├── Yes → ZooKeeper (Zab) or etcd (Raft)
│ └── No → Building a replicated database?
│ ├── Yes → Raft (best ecosystem)
│ └── No → Paxos (if you have the team)
└── Already using ZooKeeper? → Zab is built in
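The flowchart above can be encoded as a function. This is a heuristic sketch, not a rule; the parameter names and returned strings are mine and simply mirror the branches.

```python
# The decision tree, as code. Booleans correspond to the flowchart questions.

def pick_consensus(byzantine, coordination_service=False,
                   replicated_database=False, has_zookeeper=False):
    if byzantine:
        return "PBFT or modern BFT (HotStuff, Tendermint)"
    if has_zookeeper:
        return "Zab (built into ZooKeeper)"      # already paid for, use it
    if coordination_service:
        return "ZooKeeper (Zab) or etcd (Raft)"
    if replicated_database:
        return "Raft"                            # best ecosystem
    return "Paxos (if you have the team)"
```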
Modern trends#
KRaft replacing Zab in Kafka — Apache Kafka is moving from ZooKeeper (Zab) to its own Raft implementation (KRaft). This removes the ZooKeeper dependency and simplifies operations.
HotStuff replacing PBFT — Modern BFT systems use HotStuff (linear message complexity) instead of PBFT (quadratic). Adopted by Meta's Diem/Libra blockchain.
Multi-Raft for sharding — Systems like TiKV and CockroachDB run one Raft group per shard. Each shard independently replicates, enabling horizontal scaling beyond the limits of a single Raft group.
Architecture example#
Single Raft Group (etcd / Consul):
Client → Leader (Node 1)
↓ AppendEntries RPC
Follower (Node 2) — ACK
Follower (Node 3) — ACK
Leader commits when majority ACKs
Multi-Raft (CockroachDB / TiKV):
Shard A: Leader(N1), Follower(N2), Follower(N3)
Shard B: Leader(N2), Follower(N3), Follower(N1)
Shard C: Leader(N3), Follower(N1), Follower(N2)
→ Leadership distributed across nodes
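The shard layout in the diagram can be produced by a simple round-robin placement. A sketch under that assumption — real systems like CockroachDB and TiKV use load- and locality-aware placement, not a fixed rotation, and the shard and node names here are illustrative.

```python
# Sketch of multi-Raft leadership placement: each shard runs its own Raft
# group, and leaders rotate across nodes so no single node serves all writes.

def place_groups(shards, nodes, replicas=3):
    placement = {}
    for i, shard in enumerate(shards):
        # Rotate the member list one position per shard.
        members = [nodes[(i + j) % len(nodes)] for j in range(replicas)]
        placement[shard] = {"leader": members[0], "followers": members[1:]}
    return placement
```

`place_groups(["A", "B", "C"], ["N1", "N2", "N3"])` reproduces the diagram: each node leads exactly one shard and follows the other two.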
Summary#
- Raft — best default choice. Easy to understand, great ecosystem, production-proven at scale
- Paxos — theoretical foundation. Use when you need flexible quorums or have a world-class team
- PBFT — when you cannot trust participants. Higher cost in nodes and messages
- Zab — purpose-built for ZooKeeper. Solid but tightly coupled to one system
- Most systems need crash fault tolerance, not Byzantine — do not over-engineer
- 3-5 nodes is the sweet spot for crash-tolerant consensus. Beyond that, each extra pair of nodes buys one more tolerated failure but enlarges every quorum, adding latency
Article #443 in the Codelit engineering series.