Leader Election: Algorithms, Protocols & Split-Brain Prevention
In any distributed system where multiple nodes share responsibility, exactly one node often needs to act as the coordinator. That coordinator is the leader, and the process of choosing it is leader election. Get it wrong and you face split brain — two nodes both believing they are in charge, issuing conflicting writes and corrupting state.
Why Leader Election Matters#
Leader election solves coordination problems that arise when you need a single point of decision-making:
- Write serialization — A single leader orders writes so replicas stay consistent.
- Task scheduling — One scheduler assigns work; duplicates waste resources or cause conflicts.
- Lease management — A leader holds a lease on a shared resource, preventing concurrent access.
Without a reliable election mechanism, you either accept the chaos of multiple writers or halt the system entirely when the current leader fails.
The Bully Algorithm#
The Bully algorithm, proposed by Garcia-Molina in 1982, is one of the simplest election protocols. Every node has a unique numeric ID, and the node with the highest ID wins.
How it works:
- A node that detects the leader is down sends an Election message to all nodes with higher IDs.
- If any higher-ID node responds with an Alive message, the initiator backs off and waits.
- The highest-ID node that receives no response from an even higher node declares itself leader by broadcasting a Victory message.
Node 3 detects leader failure
-> sends Election to Node 4, Node 5
Node 5 responds Alive
Node 5 sends Election (no higher nodes)
Node 5 broadcasts Victory
Trade-offs: Simple to implement but generates O(n^2) messages in the worst case. It assumes reliable failure detection, which is difficult in asynchronous networks.
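The message flow above can be sketched as a small simulation. This is a minimal model under stated assumptions, not a network implementation: the function name `bully_election`, the `all_ids` list, and the `alive` set are all hypothetical, and delivery is assumed reliable so that silence means failure.

```python
def bully_election(initiator, all_ids, alive):
    """Simulate one Bully election and return (leader_id, message_count).

    all_ids: every node ID in the cluster; alive: set of IDs that respond.
    Assumes reliable delivery, so a silent node is treated as failed.
    """
    messages = 0
    started = set()        # nodes that have already run their own election
    queue = [initiator]
    leader = None
    while queue:
        node = queue.pop(0)
        if node in started:
            continue
        started.add(node)
        higher = [n for n in all_ids if n > node]
        responders = [n for n in higher if n in alive]
        messages += len(higher)       # Election messages to every higher ID
        messages += len(responders)   # Alive replies from live higher nodes
        if responders:
            queue.extend(responders)  # each live higher node starts its own election
        else:
            leader = node             # nobody higher answered
            messages += len(all_ids) - 1  # Victory broadcast to all other nodes
    return leader, messages


# Six nodes, node 6 (the old leader) has failed, node 3 notices.
print(bully_election(3, [1, 2, 3, 4, 5, 6], {1, 2, 3, 4, 5}))  # (5, 14)
```

Starting the election from the lowest-ID node makes every live node run its own round, which is where the quadratic worst case comes from.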
The Ring Algorithm#
In the Ring algorithm, nodes are arranged in a logical ring. When a node detects the leader has failed, it sends an Election message around the ring, collecting node IDs. Once the message completes a full circuit, the node with the highest ID is declared leader.
Steps:
- The detecting node adds its own ID to the election message and forwards it to its successor.
- Each node appends its ID and forwards the message.
- When the message returns to the initiator, it picks the highest ID and sends a Coordinator message around the ring.
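This approach uses O(n) messages when a single node initiates the election (one circuit to collect IDs and one to announce the result), but it requires a well-maintained ring topology.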
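The steps above can be modeled in a few lines. This is a sketch only: `ring_election`, `ring`, and `alive` are hypothetical names, and the code assumes that a node can detect a dead successor and forward to the next live node instead.

```python
def ring_election(ring, initiator, alive):
    """Simulate one ring election and return (leader_id, message_count).

    ring: node IDs in ring order; alive: set of IDs that are up.
    Assumes a node can detect a dead successor and skip to the next live one.
    """
    n = len(ring)
    collected = [initiator]
    messages = 0
    i = (ring.index(initiator) + 1) % n
    while ring[i] != initiator:
        if ring[i] in alive:
            collected.append(ring[i])  # each live node appends its own ID
            messages += 1              # one hop into this node
        i = (i + 1) % n
    messages += 1                      # last hop back to the initiator
    leader = max(collected)
    messages += len(collected)         # Coordinator message makes a second circuit
    return leader, messages


# Node 5 (the old leader) is down; node 1 starts the election.
print(ring_election([3, 1, 5, 2, 4], 1, {1, 2, 3, 4}))  # (4, 8)
```

With four live nodes the election takes two circuits of four hops each, which matches the linear message count.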
Raft Leader Election#
Raft separates consensus into leader election, log replication, and safety. The election mechanism is designed for clarity:
- Terms — Time is divided into numbered terms. Each term has at most one leader.
- Heartbeats — The leader sends periodic heartbeats. If a follower receives no heartbeat within its election timeout, it becomes a candidate.
- Voting — The candidate increments the term, votes for itself, and requests votes from peers. A node grants at most one vote per term, to the first candidate that asks and whose log is at least as up-to-date as its own.
- Majority wins — The candidate that receives votes from a majority becomes leader.
Follower (timeout expires) -> Candidate
Candidate: RequestVote(term=5) -> all peers
Peers: grant vote if term is current and log is up-to-date
Candidate receives majority -> Leader
Leader: AppendEntries heartbeats to all followers
Raft randomizes election timeouts (e.g., 150–300 ms) to reduce the chance of split votes where no candidate achieves a majority.
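The vote-granting rule and the randomized timeout can be sketched as follows. This is an illustrative model rather than a full Raft node: the class `RaftNode`, its field names, and `random_election_timeout` are assumptions made for this example, and persistence, RPC transport, and log replication are left out.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class RaftNode:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    last_log_term: int = 0
    last_log_index: int = 0

    def handle_request_vote(self, term, candidate_id,
                            cand_last_log_index, cand_last_log_term):
        """Return True if this node grants its vote to the candidate."""
        if term < self.current_term:
            return False                      # stale candidate
        if term > self.current_term:
            self.current_term = term          # newer term: forget the old vote
            self.voted_for = None
        # Up-to-date check: higher last term wins; equal terms compare length.
        log_ok = (cand_last_log_term, cand_last_log_index) >= (
            self.last_log_term, self.last_log_index)
        if log_ok and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id     # at most one vote per term
            return True
        return False


def random_election_timeout(low_ms=150, high_ms=300):
    """Pick a per-node timeout so that candidates rarely start together."""
    return random.uniform(low_ms, high_ms)
```

A follower that has already voted in term 5 rejects a second candidate in the same term, which is what guarantees at most one leader per term.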
ZooKeeper Ephemeral Nodes#
Apache ZooKeeper provides leader election as a coordination primitive using ephemeral sequential nodes:
- Each candidate creates an ephemeral sequential znode under a designated path (e.g., /election/candidate-).
- ZooKeeper assigns a monotonically increasing sequence number.
- The candidate with the lowest sequence number is the leader.
- All other candidates set a watch on the node with the next-lower sequence number.
- When the leader's session expires or it disconnects, ZooKeeper deletes the ephemeral node, triggering the watch on the next candidate, which becomes the new leader.
This avoids the herd effect — only one node is notified per failure, not all of them.
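The decision each candidate makes after listing the election path can be sketched without a ZooKeeper client. The helper names `sequence_of` and `election_role` are hypothetical, and the znode names assume the zero-padded numeric suffix that ZooKeeper appends to sequential nodes; a real client would fetch the children and set the watch through its ZooKeeper library.

```python
def sequence_of(znode):
    """Extract the numeric suffix ZooKeeper appended to a sequential znode."""
    return int(znode.rsplit("-", 1)[1])


def election_role(my_znode, children):
    """Return ("leader", None) or ("follower", znode_to_watch).

    children: names of all znodes currently under the election path.
    """
    ordered = sorted(children, key=sequence_of)
    idx = ordered.index(my_znode)
    if idx == 0:
        return ("leader", None)            # lowest sequence number leads
    return ("follower", ordered[idx - 1])  # watch only the predecessor


children = ["candidate-0000000007", "candidate-0000000003", "candidate-0000000005"]
print(election_role("candidate-0000000007", children))
```

When a watch fires, the candidate would list the children again and re-run the same check, since the deleted znode may have belonged to a follower rather than the leader.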
Lease-Based Election#
A lease is a time-bounded lock. The leader must periodically renew its lease before it expires. If it fails to renew, other nodes can acquire the lease and become leader.
Properties:
- Bounded uncertainty — The system knows within a bounded time whether the leader is alive.
- Clock dependency — Requires reasonably synchronized clocks. Clock skew can cause two nodes to hold the lease simultaneously.
- Graceful degradation — If the leader is slow but alive, it loses the lease and a faster node takes over.
Lease-based election is used in systems like Google Chubby and etcd.
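A lease can be modeled as a holder plus an expiry time. The sketch below is a single-process illustration, not how Chubby or etcd implement leases: the `Lease` class and its methods are hypothetical, and the clock is injected so that expiry can be demonstrated without waiting.

```python
import time


class Lease:
    """In-memory lease: one holder at a time, valid until expires_at."""

    def __init__(self, duration_s, clock=time.monotonic):
        self.duration_s = duration_s
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node_id):
        """Acquire or renew the lease; return True if node_id now holds it."""
        now = self.clock()
        free = self.holder is None or now >= self.expires_at
        if free or self.holder == node_id:
            self.holder = node_id
            self.expires_at = now + self.duration_s
            return True
        return False                      # someone else holds a live lease


# Fake clock so the example runs instantly.
t = [0.0]
lease = Lease(10, clock=lambda: t[0])
lease.try_acquire("node-a")   # node-a becomes leader until t=10
t[0] = 12.0                   # node-a failed to renew in time
lease.try_acquire("node-b")   # node-b takes over
```

In a distributed setting the holder and the other candidates read different clocks, which is exactly where the clock-skew risk described above comes from.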
Split-Brain Prevention#
Split brain occurs when a network partition causes two subgroups to each elect their own leader. Prevention strategies include:
Quorum requirement — A leader must maintain acknowledgment from a majority of nodes. If it loses contact with the majority, it must step down. This is the approach used by Raft, Paxos, and ZooKeeper.
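The majority rule is easy to check by hand, and a short sketch shows why it rules out two leaders. The function names `may_lead` and `sides_allowed_to_lead` are hypothetical, and the model assumes a fixed, known cluster size.

```python
def may_lead(reachable_nodes, cluster_size):
    """True if a node that can reach this many members (itself included)
    holds a strict majority of the cluster."""
    return reachable_nodes > cluster_size // 2


def sides_allowed_to_lead(partition_sizes):
    """Given the sizes of each side of a partition, return the sizes
    of the sides that are still allowed to have a leader."""
    cluster_size = sum(partition_sizes)
    return [s for s in partition_sizes if may_lead(s, cluster_size)]


print(sides_allowed_to_lead([3, 2]))  # [3]
print(sides_allowed_to_lead([2, 2]))  # []
```

Two disjoint sides cannot both hold more than half the nodes, so at most one side keeps a leader. An even split leaves the cluster without a leader until the partition heals.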
Fencing tokens — Every time a new leader is elected, it receives a monotonically increasing fencing token. Storage systems reject requests carrying an old token. Even if a stale leader sends a write, the storage layer rejects it because its token is outdated.
Leader A: fencing token = 33
Network partition -> Leader B elected: fencing token = 34
Leader A recovers, sends write with token 33
Storage rejects token 33 (current minimum = 34)
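The storage-side check in this trace can be sketched as follows. `FencedStore` is a hypothetical in-memory stand-in for a storage service; the one assumption that matters is that the store remembers the highest token it has accepted.

```python
class FencedStore:
    """Key-value store that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        """Apply the write and return True, or return False if the token is stale."""
        if token < self.highest_token:
            return False                  # a newer leader has already written
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()
store.write(33, "config", "from leader A")   # accepted
store.write(34, "config", "from leader B")   # accepted, raises the bar to 34
store.write(33, "config", "stale write")     # rejected
```

The check lives in the storage layer on purpose: the stale leader cannot be trusted to know that it has been replaced.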
STONITH (Shoot The Other Node In The Head) — In some systems, the new leader forcibly powers off the old leader via a hardware management interface before taking over. This is common in traditional HA clusters.
Kubernetes Leader Election#
Kubernetes implements leader election using its API server and resource locking. The standard approach uses a Lease object (or historically, ConfigMap or Endpoints annotations):
- A candidate attempts to create or update a Lease resource with its identity and a renewTime.
- If successful, it becomes the leader and must renew the lease before leaseDurationSeconds expires.
- Other candidates watch the Lease. If renewTime plus leaseDurationSeconds has passed, they attempt to acquire it.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: my-app-leader
  namespace: default
spec:
  holderIdentity: "pod-abc-123"
  leaseDurationSeconds: 15
  renewTime: "2026-03-29T10:00:00Z"
The client-go library provides a leaderelection package that handles the acquire-renew-release cycle, making it straightforward to build leader-elected controllers.
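The expiry rule that candidates apply to a Lease can be illustrated in plain Python. This sketch does not call the Kubernetes API: the helpers `parse_rfc3339` and `lease_expired` are hypothetical, the `spec` dictionary simply mirrors the fields of the manifest above, and production clients typically add protections against clock skew that are omitted here.

```python
from datetime import datetime, timedelta, timezone


def parse_rfc3339(ts):
    """Parse a UTC timestamp such as "2026-03-29T10:00:00Z"."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def lease_expired(spec, now):
    """True once renewTime + leaseDurationSeconds lies in the past."""
    renew_time = parse_rfc3339(spec["renewTime"])
    deadline = renew_time + timedelta(seconds=spec["leaseDurationSeconds"])
    return now > deadline


spec = {
    "holderIdentity": "pod-abc-123",
    "leaseDurationSeconds": 15,
    "renewTime": "2026-03-29T10:00:00Z",
}
now = datetime(2026, 3, 29, 10, 0, 16, tzinfo=timezone.utc)
if lease_expired(spec, now):
    print("lease is stale; a candidate may try to take it over")
```

A candidate that sees an expired lease would then attempt an update, and the API server's optimistic concurrency control decides which of several competing candidates wins.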
Choosing the Right Approach#
| Approach | Consistency | Complexity | Use Case |
|---|---|---|---|
| Bully / Ring | Weak | Low | Small, tightly coupled clusters |
| Raft | Strong (majority) | Medium | Consensus-based replicated logs |
| ZooKeeper | Strong (ZAB) | Medium | Coordination service for large systems |
| Lease-based | Depends on clocks | Low | Simple leader tasks, cloud-native |
| Kubernetes Lease | Depends on API server | Low | Kubernetes-native workloads |
Key Takeaways#
- Leader election is the foundation for write serialization, scheduling, and resource coordination in distributed systems.
- The Bully and Ring algorithms are conceptually simple but assume reliable failure detection.
- Raft provides a well-understood, majority-based election with randomized timeouts to break ties.
- ZooKeeper uses ephemeral sequential nodes and watches to avoid the herd effect during failover.
- Lease-based election accepts a dependency on synchronized clocks in exchange for simplicity and bounded failover time.
- Fencing tokens are essential for preventing stale leaders from corrupting data after a partition heals.
- Kubernetes wraps lease-based election into a first-class API object, making it accessible for controller authors.
No single algorithm fits every scenario. Match your election strategy to your consistency requirements, failure model, and operational complexity budget.