Two-Phase Commit: The Backbone of Distributed Transactions
Two-Phase Commit (2PC)#
When a single transaction spans multiple databases or services, you need a protocol that guarantees all nodes commit or all nodes abort. Two-phase commit is the classic answer — battle-tested, widely deployed, and fundamentally important to understand.
The Core Problem#
Imagine transferring $500 from Bank A to Bank B. Two databases must update atomically:
Bank A: balance -= 500
Bank B: balance += 500
If Bank A commits but Bank B crashes before committing, the money vanishes. You need atomic commitment across independent systems.
How 2PC Works#
The protocol has two distinct phases, orchestrated by a coordinator node.
Phase 1: Prepare (Voting)#
Coordinator → Participant A: "PREPARE to commit transaction T1"
Coordinator → Participant B: "PREPARE to commit transaction T1"
Participant A: acquires locks, writes to WAL → "YES, I can commit"
Participant B: acquires locks, writes to WAL → "YES, I can commit"
Each participant does everything needed to commit except actually committing. Data is written to the write-ahead log. Locks are held. The participant promises it can commit if asked.
Phase 2: Commit (Decision)#
Coordinator receives all YES votes:
Coordinator → Participant A: "COMMIT transaction T1"
Coordinator → Participant B: "COMMIT transaction T1"
Both participants: make changes permanent, release locks → "ACK"
If any participant votes NO (or times out), the coordinator sends ABORT to everyone.
The Decision Rule#
- All vote YES → COMMIT
- Any vote NO or timeout → ABORT
This is the key invariant. Once the coordinator decides, the outcome is final.
State Machine#
Coordinator states:
INIT → WAITING → COMMITTED | ABORTED
Participant states:
INIT → PREPARED → COMMITTED | ABORTED
The critical moment is when a participant enters PREPARED state — it has promised to commit but hasn't yet. It must wait for the coordinator's decision.
The Blocking Problem#
This is 2PC's Achilles heel. If the coordinator crashes after sending PREPARE but before sending COMMIT/ABORT:
Participant A: in PREPARED state, holding locks, waiting...
Participant B: in PREPARED state, holding locks, waiting...
Coordinator: crashed
Both participants are BLOCKED — they cannot safely commit or abort
Participants hold locks indefinitely. They cannot abort (the coordinator might have decided COMMIT). They cannot commit (the coordinator might have decided ABORT). The system is stuck.
Why Participants Can't Decide Alone#
Scenario 1: Participant A votes YES, Participant B voted NO
→ Coordinator decided ABORT
→ If A commits unilaterally → inconsistency
Scenario 2: Both voted YES
→ Coordinator decided COMMIT
→ If A aborts unilaterally → inconsistency
Without hearing from the coordinator, no safe choice exists.
Recovery and Logging#
Every 2PC participant writes to a durable log at critical moments:
Coordinator log:
1. Write PREPARE record before sending PREPARE
2. Write COMMIT/ABORT record before sending decision
3. Write COMPLETE record after all ACKs received
Participant log:
1. Write YES record before voting YES
2. Write COMMIT/ABORT record after receiving decision
On recovery, nodes consult their logs. If a participant finds a YES record but no decision record, it must contact the coordinator (or other participants) to learn the outcome.
Three-Phase Commit (3PC)#
3PC adds a PRE-COMMIT phase to avoid blocking:
Phase 1: PREPARE → collect votes (same as 2PC)
Phase 2: PRE-COMMIT → tell all participants "we will commit"
Phase 3: COMMIT → finalize
The key insight: if a participant is in PRE-COMMIT state and the coordinator crashes, it knows all participants voted YES. It can safely proceed to commit after a timeout.
3PC Trade-offs#
- Solves the blocking problem under certain failure models
- Does NOT work with network partitions (a partitioned node might abort while others commit)
- Adds an extra round trip of latency
- Rarely used in practice — most systems prefer 2PC with timeout-based recovery
XA Transactions#
XA is the industry standard interface for 2PC, defined by the X/Open group:
Application (Transaction Manager)
├── xa_start() → begin transaction on resource
├── xa_end() → end work on resource
├── xa_prepare() → Phase 1
├── xa_commit() → Phase 2
└── xa_rollback() → abort
Supported by PostgreSQL, MySQL, Oracle, and most message brokers. The application server acts as coordinator, databases act as participants.
XA Limitations#
- Performance: locks held across two network round trips
- Heterogeneous systems: all participants must support XA
- Coordinator is a single point of failure
- Latency-sensitive: not suitable for cross-datacenter transactions
2PC in Distributed Databases#
Modern distributed databases use 2PC internally but hide the complexity:
Google Spanner: uses 2PC for cross-shard transactions with Paxos-replicated participants — each participant is a Paxos group, so coordinator failure doesn't block.
CockroachDB: parallel commits optimization — the coordinator writes its commit record and the participant commit records simultaneously, reducing latency from 2 round trips to 1.
TiDB: uses a Percolator-inspired protocol where the "primary key" acts as the commit point, avoiding a separate coordinator.
2PC vs Saga Pattern#
| Aspect | 2PC | Saga |
|---|---|---|
| Consistency | Strong (ACID) | Eventual |
| Isolation | Full (locks held) | None (intermediate states visible) |
| Latency | Higher (lock duration) | Lower (no global locks) |
| Availability | Lower (blocking on coordinator) | Higher (no coordinator) |
| Complexity | Protocol complexity | Compensation logic complexity |
| Best for | Short-lived DB transactions | Long-running business processes |
When to Use 2PC#
- Transactions complete in milliseconds
- Strong consistency is non-negotiable
- All participants are within the same datacenter
- You control all participating systems
When to Use Sagas#
- Transactions span multiple services or datacenters
- Operations take seconds or longer
- Availability matters more than strict consistency
- Services are independently owned/deployed
Practical Considerations#
Timeout tuning: too short and you abort healthy transactions; too long and you hold locks unnecessarily. Start with 2-5x your p99 latency.
Coordinator availability: run the coordinator as a replicated state machine (Raft/Paxos) to avoid single point of failure.
Heuristic decisions: some systems allow participants to make a "heuristic commit" or "heuristic rollback" after a long timeout — this risks inconsistency but unblocks the system.
Monitoring: track prepare-to-commit latency, abort rates, and in-doubt transaction counts. A rising in-doubt count signals coordinator issues.
Key Takeaways#
- 2PC guarantees atomic commitment across distributed nodes
- The blocking problem is the fundamental limitation
- 3PC solves blocking but fails under network partitions
- XA is the standard API for 2PC across databases
- Modern distributed databases use enhanced 2PC internally
- Choose 2PC for short, strongly-consistent transactions; sagas for long-running, cross-service workflows
This is article #225 in the Codelit system design series. Explore all articles at codelit.io.
Try it on Codelit
GitHub Integration
Paste any repo URL to generate an interactive architecture diagram from real code
Comments