Distributed Transactions and the Saga Pattern — A Practical Guide

March 23, 2026 · 4 min read · By Mo

The problem: transactions across services

In a monolith, transactions are simple. Start a transaction, do some work, commit. If anything fails, roll back. The database handles it.

In microservices, a single business operation might span multiple services, each with its own database:

1. Order Service creates the order
2. Payment Service charges the card
3. Inventory Service reserves the items
4. Shipping Service schedules delivery

If step 3 fails (out of stock), you need to undo steps 1 and 2. But they're in different databases. There's no shared transaction.

Two-phase commit (2PC): the textbook answer

A coordinator runs the protocol in two phases. In the prepare phase, it asks all participants: "Can you commit?" In the commit phase, it sends "commit" if everyone said yes, or "rollback" if anyone said no.
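To make the two phases concrete, here is a minimal in-process sketch. The `Participant` class and `two_phase_commit` function are hypothetical names for illustration; a real participant would hold database locks between `prepare` and the final decision, and the calls would cross the network.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical participant; a real one would hold locks after prepare()."""
    name: str
    can_commit: bool = True
    log: list = field(default_factory=list)

    def prepare(self) -> bool:
        # Phase 1: vote yes or no.
        self.log.append("prepare")
        return self.can_commit

    def commit(self) -> None:
        self.log.append("commit")

    def rollback(self) -> None:
        self.log.append("rollback")


def two_phase_commit(participants: list) -> bool:
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: broadcast the single global decision to everyone.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.rollback()
    return decision
```

Every participant waits between its `prepare` vote and the decision, so a coordinator that stops in between leaves all of them waiting.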

Why it's rarely used in microservices:

The coordinator is a single point of failure
All participants are blocked during the prepare phase (holding locks)
If the coordinator crashes after sending "prepare" but before "commit," everyone is stuck
Latency suffers at scale: every transaction needs at least two network round trips to every participant, and locks are held the whole time

2PC works for databases within a single data center. It is a poor fit for services that need to be independently deployable and scalable.

The saga pattern: the practical answer

A saga is a sequence of local transactions. Each service performs its own transaction and publishes an event. If a step fails, compensating transactions undo the previous steps.

Choreography-based sagas

Each service listens to events and decides what to do:

OrderCreated → Payment Service charges card
PaymentCompleted → Inventory Service reserves items
InventoryReserved → Shipping Service schedules delivery

If inventory fails:

InventoryFailed → Payment Service refunds card
PaymentRefunded → Order Service cancels order

Pros: Decoupled, no central coordinator, simple for small flows.

Cons: Hard to understand the full flow. Debugging is painful. Adding steps means modifying multiple services.
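Here is a rough sketch of the choreographed flow, using an in-process `EventBus` as a stand-in for a real message broker. The bus, `wire_services`, and `run_order` are hypothetical, and the `DeliveryScheduled` and `OrderCancelled` events are assumed names for the final step of each path.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process stand-in for a message broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []  # every event type published, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.history.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


def wire_services(bus, in_stock):
    # Payment Service: charge on OrderCreated, refund on InventoryFailed.
    bus.subscribe("OrderCreated", lambda o: bus.publish("PaymentCompleted", o))
    bus.subscribe("InventoryFailed", lambda o: bus.publish("PaymentRefunded", o))

    # Inventory Service: reserve items, or report failure when out of stock.
    def reserve(order):
        bus.publish("InventoryReserved" if in_stock else "InventoryFailed", order)

    bus.subscribe("PaymentCompleted", reserve)

    # Shipping Service: schedule delivery once items are reserved.
    bus.subscribe("InventoryReserved", lambda o: bus.publish("DeliveryScheduled", o))

    # Order Service: cancel the order once the refund is done.
    bus.subscribe("PaymentRefunded", lambda o: bus.publish("OrderCancelled", o))


def run_order(in_stock):
    bus = EventBus()
    wire_services(bus, in_stock)
    bus.publish("OrderCreated", {"order_id": "o-1"})
    return bus.history
```

Notice that no single function describes the whole flow. You only see it by reading every subscription, which is the debugging cost mentioned above.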

Orchestration-based sagas

A central orchestrator tells each service what to do:

Orchestrator → OrderService.create()
Orchestrator → PaymentService.charge()
Orchestrator → InventoryService.reserve()
Orchestrator → ShippingService.schedule()

On failure:

Orchestrator → PaymentService.refund()
Orchestrator → OrderService.cancel()

Pros: Flow is visible in one place. Easy to add steps. Better for complex workflows.

Cons: Orchestrator can become a bottleneck. Still need to handle orchestrator failures.
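A minimal orchestrator can be sketched as a loop over (name, action, compensation) triples. `run_saga` and the `Step` alias are hypothetical; real actions would be remote calls to each service, and the trace strings exist only to make the behavior visible.

```python
from typing import Callable, List, Tuple

# (step name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    trace = []
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            trace.append(f"{name}:failed")
            # Undo only what actually completed, newest first.
            for done_name, done_compensate in reversed(completed):
                done_compensate()
                trace.append(f"{done_name}:compensated")
            return trace
        trace.append(f"{name}:done")
        completed.append((name, compensate))
    return trace
```

If the inventory step raises, the trace ends with the payment compensation followed by the order compensation, matching the refund-then-cancel sequence above.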

Compensating transactions

The key insight: you can't roll back across services. Instead, you compensate: perform a new action that undoes the effect.

Original action	Compensation
Create order	Cancel order
Charge card	Refund card
Reserve inventory	Release inventory
Send email	Send cancellation email
Schedule delivery	Cancel delivery

Not every action is perfectly reversible. You can refund a payment, but you can't un-send an email. Design your saga so irreversible actions happen last.
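One way to enforce the "irreversible actions last" rule is a small check over the saga definition. This is only a sketch: `irreversible_steps_are_last` is a hypothetical helper, and the reversible flag on each step is an assumption about how you describe your steps.

```python
def irreversible_steps_are_last(steps):
    """steps is a list of (name, reversible) pairs in execution order.

    Returns False if any reversible step runs after an irreversible one.
    """
    seen_irreversible = False
    for _name, reversible in steps:
        if not reversible:
            seen_irreversible = True
        elif seen_irreversible:
            return False
    return True
```

A check like this could run in a unit test, so a reordered saga fails the build instead of failing in production.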

Idempotency: the non-negotiable requirement

In distributed systems, messages can be delivered more than once. If "charge card" is executed twice, the customer is charged twice. Every service in a saga must be idempotent.

How: Include a unique transaction ID in every request. Before processing, check if you've already handled this ID. If yes, return the cached result.
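A minimal sketch of that check, using an in-memory dictionary as the store. `PaymentService` and its fields are hypothetical. In a real service, the lookup and the write would need to be atomic (for example, via a unique constraint in the database) so that two concurrent duplicates can't both pass the check.

```python
class PaymentService:
    """Hypothetical service that deduplicates requests by transaction ID."""

    def __init__(self):
        self.results = {}  # transaction ID -> cached result
        self.charges = []  # amounts actually charged

    def charge(self, transaction_id: str, amount_cents: int) -> dict:
        # Already handled this ID: return the cached result, charge nothing.
        if transaction_id in self.results:
            return self.results[transaction_id]
        self.charges.append(amount_cents)
        result = {
            "transaction_id": transaction_id,
            "status": "charged",
            "amount_cents": amount_cents,
        }
        self.results[transaction_id] = result
        return result
```

Calling `charge` twice with the same ID returns the same result and records a single charge.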

Handling edge cases

What if the orchestrator crashes mid-saga? Store the saga state in a database. On restart, resume from where it left off. The saga state machine persists across crashes.
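Here is a rough sketch of resuming from persisted state, using Python's built-in `sqlite3` module. The `saga` table, the `STEPS` list, and the `crash_before` parameter (which only simulates a crash) are all assumptions made for illustration.

```python
import sqlite3

STEPS = ["create_order", "charge_card", "reserve_inventory", "schedule_delivery"]


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saga ("
        "id TEXT PRIMARY KEY, next_step INTEGER NOT NULL)"
    )
    return conn


def run(conn, saga_id, execute, crash_before=None):
    """Run the remaining steps, persisting progress after each one."""
    row = conn.execute(
        "SELECT next_step FROM saga WHERE id = ?", (saga_id,)
    ).fetchone()
    next_step = row[0] if row else 0
    for i in range(next_step, len(STEPS)):
        if STEPS[i] == crash_before:
            raise RuntimeError("simulated crash")
        execute(STEPS[i])
        # Record that step i finished, so a restart begins at i + 1.
        conn.execute(
            "INSERT OR REPLACE INTO saga (id, next_step) VALUES (?, ?)",
            (saga_id, i + 1),
        )
        conn.commit()
```

A crash between `execute` and the commit would make the step run again on restart, which is one more reason every step has to be idempotent.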

What if a compensation fails? Retry with exponential backoff. If it keeps failing, alert an operator. Some compensations (like refunds) are critical and must eventually succeed.
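A sketch of that retry loop. `compensate_with_retry` is a hypothetical helper, the default attempt count and delay are arbitrary, and `alert` stands in for whatever paging or ticketing hook you use.

```python
import time


def compensate_with_retry(compensate, max_attempts=5, base_delay=0.5,
                          sleep=time.sleep, alert=print):
    """Retry a compensation with exponential backoff; alert if it never succeeds."""
    for attempt in range(max_attempts):
        try:
            compensate()
            return True
        except Exception as exc:
            if attempt == max_attempts - 1:
                # Out of retries: hand the problem to a human.
                alert(f"compensation failed after {max_attempts} attempts: {exc}")
                return False
            # Delays double each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    return False
```

Passing `sleep` and `alert` as parameters keeps the function testable without real waiting. Production code would usually add jitter to the delays as well.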

What if two sagas conflict? Semantic locks: mark resources as "in-progress" so other sagas wait or fail fast. For example, inventory is "reserved" until the saga completes or times out.
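The reservation idea can be sketched as follows. The `Inventory` class, its method names, and the 60-second default timeout are assumptions, and a real implementation would keep this state in the service's database rather than in memory.

```python
import time


class Inventory:
    """Hypothetical inventory with time-limited reservations per saga."""

    def __init__(self, stock, clock=time.monotonic):
        self.stock = dict(stock)   # sku -> units available
        self.reservations = {}     # saga_id -> (sku, qty, expires_at)
        self.clock = clock

    def _expire(self):
        # Give back stock held by sagas that timed out.
        now = self.clock()
        for saga_id, (_sku, _qty, expires_at) in list(self.reservations.items()):
            if expires_at <= now:
                self.release(saga_id)

    def reserve(self, saga_id, sku, qty, ttl=60.0):
        self._expire()
        if saga_id in self.reservations:
            return True  # duplicate request: already reserved
        if self.stock.get(sku, 0) < qty:
            return False  # fail fast: another saga holds the stock
        self.stock[sku] -= qty
        self.reservations[saga_id] = (sku, qty, self.clock() + ttl)
        return True

    def release(self, saga_id):
        # Compensation path: return the reserved units to stock.
        sku, qty, _ = self.reservations.pop(saga_id)
        self.stock[sku] += qty

    def confirm(self, saga_id):
        # Saga completed: the reserved units are gone for good.
        self.reservations.pop(saga_id)
```

This version fails fast when stock is held by another saga. Making the second saga wait instead would need a queue or a retry on the caller's side.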

When to use sagas vs when to avoid them

Use sagas when:

Business operations span multiple services
You can accept eventual consistency (you don't need immediate consistency)
Each service owns its own data

Avoid sagas when:

You need strict ACID transactions — use a single database
The flow is simple enough for synchronous calls with retry
You can restructure services to keep related data together

See it in action

On Codelit, generate any e-commerce or payment system and you'll see exactly where distributed transactions happen — the edges between Order Service, Payment Service, and Inventory Service. Click any node to audit the transaction boundaries.

Explore transaction patterns: describe your system on Codelit.io and see how services coordinate across boundaries.


