Event-Driven Saga Orchestration: Managing Distributed Transactions
Microservices break the comfort of ACID transactions. When a business operation spans multiple services — each owning its own database — you cannot simply BEGIN; ... COMMIT;. The saga pattern solves this by decomposing a distributed transaction into a sequence of local transactions, each paired with a compensating action that undoes its effect if a later step fails.
Choreography vs. Orchestration
There are two ways to coordinate a saga:
Choreography — Each service listens for events and decides independently what to do next. No central coordinator exists.
Orchestration — A dedicated orchestrator service drives the saga, telling each participant what to execute and when.
Choreography                            Orchestration
OrderSvc ──event──▶ PaymentSvc          Orchestrator
PaymentSvc ─event─▶ InventorySvc             │
InventorySvc ─event▶ ShippingSvc             ├──▶ OrderSvc
                                             ├──▶ PaymentSvc
(implicit flow, hard to trace)               ├──▶ InventorySvc
                                             └──▶ ShippingSvc
                                        (explicit flow, easy to trace)
Choreography works for simple flows with few participants. Orchestration excels when:
- The flow has many steps or conditional branches.
- You need centralized visibility into saga state.
- Timeout and retry policies differ per step.
- Business stakeholders need to understand the process flow.
This article focuses on the orchestration approach.
The Orchestrator as a State Machine
An orchestrator is fundamentally a finite state machine. Each state represents a step in the saga, and transitions are triggered by the success or failure of that step.
┌───────────┐   success   ┌──────────────┐   success   ┌──────────────┐
│  Reserve  │────────────▶│    Charge    │────────────▶│  Ship Order  │
│ Inventory │             │   Payment    │             │              │
└─────┬─────┘             └──────┬───────┘             └──────┬───────┘
      │ failure                  │ failure                    │ success
      ▼                          ▼                            ▼
┌───────────┐             ┌──────────────┐             ┌──────────────┐
│  FAILED   │◀────────────│ Compensate:  │             │  COMPLETED   │
│  (done)   │             │   Release    │             │    (done)    │
└───────────┘             │  Inventory   │             └──────────────┘
                          └──────────────┘
State machine properties that matter:
- Deterministic transitions — Given a state and an event, the next state is always the same.
- Persistence — The current state is stored durably so the saga survives orchestrator restarts.
- Idempotent transitions — Replaying the same event in the same state produces no side effects.
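The three properties above can be sketched as a lookup-table state machine. This is a minimal illustration, not a production orchestrator: the state names mirror the diagram, while `TRANSITIONS`, `SagaInstance`, the event names, and the in-memory `seen_events` set are hypothetical stand-ins for whatever durable store the orchestrator actually uses.

```python
from dataclasses import dataclass, field

# Deterministic transition table: (state, event) -> next state.
# State names follow the diagram above; the event names are assumptions.
TRANSITIONS = {
    ("RESERVE_INVENTORY", "success"): "CHARGE_PAYMENT",
    ("RESERVE_INVENTORY", "failure"): "FAILED",
    ("CHARGE_PAYMENT", "success"): "SHIP_ORDER",
    ("CHARGE_PAYMENT", "failure"): "RELEASE_INVENTORY",
    ("RELEASE_INVENTORY", "success"): "FAILED",
    ("SHIP_ORDER", "success"): "COMPLETED",
}


@dataclass
class SagaInstance:
    saga_id: str
    state: str = "RESERVE_INVENTORY"
    # In a real orchestrator this set would live in durable storage.
    seen_events: set = field(default_factory=set)

    def apply(self, event_id: str, event: str) -> str:
        """Apply an event once; replays of the same event_id are ignored."""
        if event_id in self.seen_events:
            return self.state  # idempotent: a replay changes nothing
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            raise ValueError(f"no transition from {self.state} on {event!r}")
        self.seen_events.add(event_id)
        self.state = next_state
        return self.state
```

Because the table is a plain dictionary, the same `(state, event)` pair always yields the same next state, and delivering an event twice with the same `event_id` leaves the saga where it was.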
Compensation Logic
Compensation is the inverse of a forward action. It does not "undo" in the ACID rollback sense — it applies a semantic reverse:
| Forward Action | Compensation |
|---|---|
| Reserve inventory | Release inventory |
| Charge payment | Issue refund |
| Create shipping label | Cancel shipment |
| Send confirmation email | Send cancellation email |
| Allocate seat | Release seat |
Compensation rules:
- Compensations execute in reverse order — If steps A, B, C succeeded and D fails, compensate C, then B, then A.
- Compensations must be idempotent — They may be retried if the orchestrator crashes mid-compensation.
- Compensations can fail — You need a strategy for this (manual intervention queue, dead letter, retry with backoff).
- Not all actions are compensatable — Sending an email cannot be unsent. Design around this by placing non-compensatable actions last.
import logging

log = logging.getLogger(__name__)


class SagaFailed(Exception):
    """Raised after a step fails and compensation has been attempted."""

    def __init__(self, step_name, cause):
        super().__init__(f"saga failed at step {step_name!r}: {cause}")
        self.step_name = step_name
        self.cause = cause


class SagaStep:
    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action              # async callable
        self.compensation = compensation  # async callable or None


class SagaOrchestrator:
    def __init__(self, steps: list[SagaStep]):
        self.steps = steps
        self.completed = []  # steps whose forward action succeeded

    async def execute(self, context):
        for step in self.steps:
            try:
                result = await step.action(context)
                context[step.name] = result
                self.completed.append(step)
            except Exception as e:
                await self.compensate(context, e)
                raise SagaFailed(step.name, e) from e

    async def compensate(self, context, original_error):
        # Undo completed steps in reverse order
        for step in reversed(self.completed):
            if step.compensation:
                try:
                    await step.compensation(context)
                except Exception as comp_error:
                    # Log and continue: best-effort compensation
                    log.error("Compensation failed for %s", step.name, exc_info=comp_error)
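One possible way to act on the rule that compensations can fail is to retry each one with backoff and park it in a dead-letter list when retries run out. The sketch below is an assumption-laden illustration: `compensate_with_retry`, its parameters, and the plain Python list used as a dead-letter queue are all hypothetical, and a real system would write to a durable queue instead.

```python
import asyncio
import logging

log = logging.getLogger("saga")


async def compensate_with_retry(name, compensation, context, dead_letter,
                                max_attempts=3, base_delay=0.01):
    """Retry one compensation with exponential backoff.

    Returns True on success. After max_attempts failures the step is
    appended to dead_letter for manual resolution and False is returned.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            await compensation(context)
            return True
        except Exception as exc:  # retrying is only safe if the compensation is idempotent
            log.warning("compensation %s failed (attempt %d/%d): %s",
                        name, attempt, max_attempts, exc)
            if attempt < max_attempts:
                await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    # Retries exhausted: hand the step over for manual intervention.
    dead_letter.append({"step": name, "context": dict(context)})
    return False
```

A caller would invoke this once per completed step, in reverse order, and raise an alert whenever the dead-letter list is non-empty after compensation finishes.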
Timeout Handling
Calls between services can be slow or fail outright, so every saga step needs explicit timeout and retry policies:
Step-level timeouts — How long to wait for a single step to complete before declaring failure.
Saga-level timeouts — Maximum wall-clock time for the entire saga. Prevents sagas from hanging indefinitely.
Retry policies — Define per-step retry behavior:
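from datetime import timedelta

# RetryPolicy and the two error classes are illustrative names;
# substitute the equivalents from your workflow framework.
retry_policy = RetryPolicy(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=30),
    maximum_attempts=5,
    non_retryable_errors=[InsufficientFundsError, ItemNotFoundError],
)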
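To make the policy parameters concrete, the sketch below computes the wait before each retry. It assumes the common convention that the n-th retry waits `initial_interval × backoff_coefficient^n`, capped at `maximum_interval`, and that `maximum_attempts` counts the first try; `retry_delays` is a hypothetical helper, and real frameworks may add jitter or count attempts differently.

```python
from datetime import timedelta


def retry_delays(initial_interval, backoff_coefficient, maximum_interval,
                 maximum_attempts):
    """Return the wait before each retry (maximum_attempts - 1 waits)."""
    delays = []
    for retry in range(maximum_attempts - 1):
        delay = initial_interval * (backoff_coefficient ** retry)
        delays.append(min(delay, maximum_interval))
    return delays


delays = retry_delays(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=30),
    maximum_attempts=5,
)
print([d.total_seconds() for d in delays])  # [1.0, 2.0, 4.0, 8.0]
```

Under these assumptions the step spends at most 15 seconds waiting between attempts, which is worth comparing against the saga-level deadline when choosing the values.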
Deadline propagation — Pass the saga deadline to each participant so they can fail fast if the deadline has already passed:
Saga starts at T=0, deadline T=30s
Step 1 receives deadline: T=30s (completes at T=2s)
Step 2 receives deadline: T=30s (completes at T=8s)
Step 3 receives deadline: T=30s (starts at T=8s, has 22s remaining)
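Step-level timeouts and deadline propagation can be combined by giving each step the smaller of its own timeout and the time left in the saga. The following is a rough sketch using `asyncio.wait_for`; the names `run_step`, `run_saga`, and `DeadlineExceeded`, and the convention that each step accepts the deadline as a second argument, are assumptions made for illustration.

```python
import asyncio
import time


class DeadlineExceeded(Exception):
    """Raised when the saga deadline has passed before a step starts."""


async def run_step(step, context, deadline, step_timeout):
    """Run one step with min(step_timeout, time left until the saga deadline).

    `deadline` is an absolute time.monotonic() value shared by every step.
    """
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise DeadlineExceeded("saga deadline passed before step started")
    # asyncio.wait_for cancels the step and raises a timeout error on expiry.
    return await asyncio.wait_for(step(context, deadline),
                                  timeout=min(step_timeout, remaining))


async def run_saga(steps, context, saga_timeout, step_timeout):
    """Run steps in order, passing the same absolute deadline to each."""
    deadline = time.monotonic() + saga_timeout
    results = []
    for step in steps:
        results.append(await run_step(step, context, deadline, step_timeout))
    return results
```

Passing the absolute deadline to each step, rather than a fresh duration, is what lets a participant fail fast when little or no budget remains.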
Implementing Sagas with Temporal
Temporal is a durable execution platform purpose-built for orchestrating long-running workflows like sagas. It persists every state transition to a history, making workflows resilient to crashes.
import (
	"fmt"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// Temporal workflow definition in Go.
// Order, the result types, and the activity functions are assumed to be
// defined elsewhere in the same package.
func OrderSagaWorkflow(ctx workflow.Context, order Order) error {
	options := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			MaximumAttempts: 3,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, options)
	logger := workflow.GetLogger(ctx)

	// Step 1: Reserve inventory
	var reserveResult ReserveResult
	err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, &reserveResult)
	if err != nil {
		return fmt.Errorf("reserve inventory failed: %w", err)
	}

	// Step 2: Charge payment
	var chargeResult ChargeResult
	err = workflow.ExecuteActivity(ctx, ChargePayment, order).Get(ctx, &chargeResult)
	if err != nil {
		// Compensate step 1; log a compensation failure instead of dropping it
		if cerr := workflow.ExecuteActivity(ctx, ReleaseInventory, reserveResult).Get(ctx, nil); cerr != nil {
			logger.Error("release inventory compensation failed", "error", cerr)
		}
		return fmt.Errorf("charge payment failed: %w", err)
	}

	// Step 3: Create shipment
	var shipResult ShipResult
	err = workflow.ExecuteActivity(ctx, CreateShipment, order).Get(ctx, &shipResult)
	if err != nil {
		// Compensate steps 2 and 1, in reverse order
		if cerr := workflow.ExecuteActivity(ctx, RefundPayment, chargeResult).Get(ctx, nil); cerr != nil {
			logger.Error("refund payment compensation failed", "error", cerr)
		}
		if cerr := workflow.ExecuteActivity(ctx, ReleaseInventory, reserveResult).Get(ctx, nil); cerr != nil {
			logger.Error("release inventory compensation failed", "error", cerr)
		}
		return fmt.Errorf("create shipment failed: %w", err)
	}
	return nil
}
Why Temporal works well for sagas:
- Durable execution — Workflow state survives process crashes. Temporal replays the event history to reconstruct state.
- Built-in retries — Configure retry policies per activity with backoff.
- Timeouts at every level — Schedule-to-start, start-to-close, and heartbeat timeouts.
- Visibility — Temporal Web UI shows every saga's current state, history, and pending activities.
- Versioning — Safely deploy updated saga logic without breaking in-flight workflows.
Saga with a Generic Compensator in Temporal
For sagas with many steps, a generic compensator reduces boilerplate:
func OrderSagaWithCompensation(ctx workflow.Context, order Order) error {
	// Activities need a StartToClose or ScheduleToClose timeout to be scheduled.
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
	})

	var compensations []func(workflow.Context) error
	defer func() {
		if len(compensations) == 0 {
			return
		}
		// Use a disconnected context so compensations still run if the
		// workflow itself was cancelled.
		compCtx, _ := workflow.NewDisconnectedContext(ctx)
		// Execute compensations in reverse on failure
		for i := len(compensations) - 1; i >= 0; i-- {
			if err := compensations[i](compCtx); err != nil {
				workflow.GetLogger(ctx).Error("compensation failed", "error", err)
			}
		}
	}()

	// Step 1
	var inv ReserveResult
	if err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, &inv); err != nil {
		return err
	}
	compensations = append(compensations, func(ctx workflow.Context) error {
		return workflow.ExecuteActivity(ctx, ReleaseInventory, inv).Get(ctx, nil)
	})

	// Step 2
	var pay ChargeResult
	if err := workflow.ExecuteActivity(ctx, ChargePayment, order).Get(ctx, &pay); err != nil {
		return err
	}
	compensations = append(compensations, func(ctx workflow.Context) error {
		return workflow.ExecuteActivity(ctx, RefundPayment, pay).Get(ctx, nil)
	})

	// Step 3
	if err := workflow.ExecuteActivity(ctx, CreateShipment, order).Get(ctx, nil); err != nil {
		return err
	}

	// Success: clear compensations so the deferred function does nothing
	compensations = nil
	return nil
}
Observability and Debugging
Sagas span multiple services and can take minutes or hours. Observability is critical:
Correlation IDs — Assign a unique saga ID and propagate it through every step. All logs, traces, and metrics reference this ID.
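Within a single Python process, one way to attach the saga ID to every log line is a context variable read by a logging filter. This is only a sketch: `current_saga_id`, `SagaIdFilter`, `start_saga`, and the log format are hypothetical names, and propagating the ID to other services (for example in message headers) is left out.

```python
import contextvars
import logging
import uuid

# Holds the saga ID for the current task or thread; "-" means no saga.
current_saga_id = contextvars.ContextVar("current_saga_id", default="-")


class SagaIdFilter(logging.Filter):
    """Attach the current saga ID to every log record."""

    def filter(self, record):
        record.saga_id = current_saga_id.get()
        return True


def start_saga():
    """Generate a saga ID and bind it to the current context."""
    saga_id = str(uuid.uuid4())
    current_saga_id.set(saga_id)
    return saga_id


handler = logging.StreamHandler()
handler.addFilter(SagaIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s saga=%(saga_id)s %(message)s"))
log = logging.getLogger("saga.demo")
log.addHandler(handler)
log.setLevel(logging.INFO)

saga_id = start_saga()
log.info("reserving inventory")  # the emitted line carries saga=<saga_id>
```

Since the ID lives in a `ContextVar`, concurrent sagas running as separate asyncio tasks each see their own value without passing it through every function call.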
State persistence — Store saga state transitions in a queryable store:
CREATE TABLE saga_state (
    saga_id       UUID PRIMARY KEY,
    saga_type     TEXT NOT NULL,
    current_state TEXT NOT NULL,
    context       JSONB NOT NULL,
    created_at    TIMESTAMPTZ DEFAULT now(),
    updated_at    TIMESTAMPTZ DEFAULT now(),
    deadline_at   TIMESTAMPTZ
);

CREATE TABLE saga_step_log (
    id         BIGSERIAL PRIMARY KEY,
    saga_id    UUID REFERENCES saga_state(saga_id),
    step_name  TEXT NOT NULL,
    action     TEXT NOT NULL,  -- 'forward' or 'compensate'
    status     TEXT NOT NULL,  -- 'started', 'succeeded', 'failed'
    error_msg  TEXT,
    started_at TIMESTAMPTZ,
    ended_at   TIMESTAMPTZ
);
Alerts — Monitor for sagas stuck in intermediate states, compensation failures, and deadline breaches.
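As a rough illustration of the stuck-saga and deadline checks, the function below scans rows shaped like the `saga_state` table. The terminal state names, the 15-minute threshold, and the `find_alerts` helper are assumptions; compensation failures would need a separate query against `saga_step_log` and are not covered here.

```python
from datetime import datetime, timedelta, timezone

# Assumed terminal states, taken from the state machine diagram.
TERMINAL_STATES = {"COMPLETED", "FAILED"}


def find_alerts(sagas, now, stuck_after=timedelta(minutes=15)):
    """Return (saga_id, reason) pairs for sagas that need attention.

    Each saga is a dict with the saga_state columns used here:
    saga_id, current_state, updated_at, deadline_at (may be None).
    """
    alerts = []
    for saga in sagas:
        if saga["current_state"] in TERMINAL_STATES:
            continue  # finished sagas never alert
        if saga["deadline_at"] is not None and now > saga["deadline_at"]:
            alerts.append((saga["saga_id"], "deadline_breached"))
        elif now - saga["updated_at"] > stuck_after:
            alerts.append((saga["saga_id"], "stuck"))
    return alerts


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
example = [{
    "saga_id": "a",
    "current_state": "CHARGE_PAYMENT",
    "updated_at": now - timedelta(minutes=30),
    "deadline_at": None,
}]
print(find_alerts(example, now))  # [('a', 'stuck')]
```

In practice the same conditions would more likely be expressed as a scheduled SQL query, with the results fed into whatever alerting system is already in place.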
Common Pitfalls
- Forgetting idempotency — If a step is retried, it must not double-charge, double-reserve, or double-ship. Use idempotency keys.
- Compensation ordering — Always compensate in reverse. Forward ordering can leave inconsistent state.
- Ignoring compensation failures — Compensation can fail too. Have a dead-letter queue and manual resolution process.
- Unbounded sagas — Always set a saga-level deadline. Without one, a stuck saga consumes resources indefinitely.
- Mixing sync and async — If some steps are synchronous RPC and others are event-driven, ensure the orchestrator handles both consistently.
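The idempotency-key advice in the first pitfall can be sketched with a toy participant that remembers the result of each key. `PaymentService`, `step_key`, and the in-memory dictionary are hypothetical; a real service would persist the key alongside the charge in the same local transaction.

```python
class PaymentService:
    """Toy participant that deduplicates charges by idempotency key."""

    def __init__(self):
        self._results = {}     # idempotency key -> stored result
        self.charges_made = 0  # how many real charges happened

    def charge(self, idempotency_key, amount_cents):
        # A retried request with the same key returns the first result
        # instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": f"ch_{self.charges_made}",
                  "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


def step_key(saga_id, step_name):
    """Derive a stable key so every retry of a step reuses it."""
    return f"{saga_id}:{step_name}"


svc = PaymentService()
key = step_key("saga-42", "charge_payment")
first = svc.charge(key, 1999)
second = svc.charge(key, 1999)  # simulated retry after a timeout
print(first == second, svc.charges_made)  # True 1
```

Deriving the key from the saga ID and step name means the orchestrator does not have to store a separate key per attempt; any retry of the same step naturally produces the same key.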
When Not to Use Sagas
Sagas add complexity. Consider alternatives first:
- Single database — If all data fits in one database, use regular transactions.
- Outbox pattern — For reliable event publishing from a single service.
- Two-phase commit — If your infrastructure supports it and latency is acceptable (rare in microservices).
Sagas are the right choice when you have genuinely distributed data ownership across multiple services with independent databases.
That is article #387 on Codelit.