saga orchestrationdistributed transactionsevent-driven architectureTemporalstate machinescompensation patternmicroservicessystem design

Event-Driven Saga Orchestration: Managing Distributed Transactions

March 29, 2026 8 min readBy Codelit Team Discussion

Microservices break the comfort of ACID transactions. When a business operation spans multiple services — each owning its own database — you cannot simply BEGIN; ... COMMIT;. The saga pattern solves this by decomposing a distributed transaction into a sequence of local transactions, each paired with a compensating action that undoes its effect if a later step fails.

Choreography vs. Orchestration#

There are two ways to coordinate a saga:

Choreography — Each service listens for events and decides independently what to do next. No central coordinator exists.

Orchestration — A dedicated orchestrator service drives the saga, telling each participant what to execute and when.

Choreography                          Orchestration

 OrderSvc ──event──▶ PaymentSvc       Orchestrator
 PaymentSvc ─event─▶ InventorySvc        │
 InventorySvc ─event▶ ShippingSvc         ├──▶ OrderSvc
                                          ├──▶ PaymentSvc
 (implicit flow, hard to trace)           ├──▶ InventorySvc
                                          └──▶ ShippingSvc
                                       (explicit flow, easy to trace)

Choreography works for simple flows with few participants. Orchestration excels when:

The flow has many steps or conditional branches.
You need centralized visibility into saga state.
Timeout and retry policies differ per step.
Business stakeholders need to understand the process flow.

This article focuses on the orchestration approach.

The Orchestrator as a State Machine#

An orchestrator is fundamentally a finite state machine. Each state represents a step in the saga, and transitions are triggered by the success or failure of that step.

┌─────────┐   success   ┌─────────────┐   success   ┌──────────────┐
│ Reserve  │────────────▶│  Charge      │────────────▶│  Ship Order  │
│ Inventory│             │  Payment     │             │              │
└────┬─────┘             └──────┬───────┘             └──────┬───────┘
     │ failure                  │ failure                    │ success
     ▼                          ▼                            ▼
┌─────────┐             ┌──────────────┐             ┌──────────────┐
│  FAILED  │◀───────────│  Compensate: │             │  COMPLETED   │
│  (done)  │             │  Release     │             │  (done)      │
└─────────┘             │  Inventory   │             └──────────────┘
                         └──────────────┘

State machine properties that matter:

Deterministic transitions — Given a state and an event, the next state is always the same.
Persistence — The current state is stored durably so the saga survives orchestrator restarts.
Idempotent transitions — Replaying the same event in the same state produces no side effects.

Compensation Logic#

Compensation is the inverse of a forward action. It does not "undo" in the ACID rollback sense — it applies a semantic reverse:

Forward Action	Compensation
Reserve inventory	Release inventory
Charge payment	Issue refund
Create shipping label	Cancel shipment
Send confirmation email	Send cancellation email
Allocate seat	Release seat

Compensation rules:

Compensations execute in reverse order — If steps A, B, C succeeded and D fails, compensate C, then B, then A.
Compensations must be idempotent — They may be retried if the orchestrator crashes mid-compensation.
Compensations can fail — You need a strategy for this (manual intervention queue, dead letter, retry with backoff).
Not all actions are compensatable — Sending an email cannot be unsent. Design around this by placing non-compensatable actions last.

class SagaStep:
    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action          # async callable
        self.compensation = compensation  # async callable or None

class SagaOrchestrator:
    def __init__(self, steps: list[SagaStep]):
        self.steps = steps
        self.completed = []

    async def execute(self, context):
        for step in self.steps:
            try:
                result = await step.action(context)
                context[step.name] = result
                self.completed.append(step)
            except Exception as e:
                await self.compensate(context, e)
                raise SagaFailed(step.name, e)

    async def compensate(self, context, original_error):
        for step in reversed(self.completed):
            if step.compensation:
                try:
                    await step.compensation(context)
                except Exception as comp_error:
                    # Log and continue — best-effort compensation
                    log.error(f"Compensation failed for {step.name}", exc_info=comp_error)

Timeout Handling#

Distributed systems are slow and unreliable. Every saga step needs timeout policies:

Step-level timeouts — How long to wait for a single step to complete before declaring failure.

Saga-level timeouts — Maximum wall-clock time for the entire saga. Prevents sagas from hanging indefinitely.

Retry policies — Define per-step retry behavior:

retry_policy = RetryPolicy(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=30),
    maximum_attempts=5,
    non_retryable_errors=[InsufficientFundsError, ItemNotFoundError],
)

Deadline propagation — Pass the saga deadline to each participant so they can fail fast if the deadline has already passed:

Saga starts at T=0, deadline T=30s

Step 1 receives deadline: T=30s (completes at T=2s)
Step 2 receives deadline: T=30s (completes at T=8s)
Step 3 receives deadline: T=30s (starts at T=8s, has 22s remaining)

Implementing Sagas with Temporal#

Temporal is a durable execution platform purpose-built for orchestrating long-running workflows like sagas. It persists every state transition to a history, making workflows resilient to crashes.

// Temporal workflow definition in Go
func OrderSagaWorkflow(ctx workflow.Context, order Order) error {
    options := workflow.ActivityOptions{
        StartToCloseTimeout: 30 * time.Second,
        RetryPolicy: &temporal.RetryPolicy{
            MaximumAttempts: 3,
        },
    }
    ctx = workflow.WithActivityOptions(ctx, options)

    // Step 1: Reserve inventory
    var reserveResult ReserveResult
    err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, &reserveResult)
    if err != nil {
        return fmt.Errorf("reserve inventory failed: %w", err)
    }

    // Step 2: Charge payment
    var chargeResult ChargeResult
    err = workflow.ExecuteActivity(ctx, ChargePayment, order).Get(ctx, &chargeResult)
    if err != nil {
        // Compensate step 1
        _ = workflow.ExecuteActivity(ctx, ReleaseInventory, reserveResult).Get(ctx, nil)
        return fmt.Errorf("charge payment failed: %w", err)
    }

    // Step 3: Create shipment
    var shipResult ShipResult
    err = workflow.ExecuteActivity(ctx, CreateShipment, order).Get(ctx, &shipResult)
    if err != nil {
        // Compensate steps 2 and 1
        _ = workflow.ExecuteActivity(ctx, RefundPayment, chargeResult).Get(ctx, nil)
        _ = workflow.ExecuteActivity(ctx, ReleaseInventory, reserveResult).Get(ctx, nil)
        return fmt.Errorf("create shipment failed: %w", err)
    }

    return nil
}

Why Temporal works well for sagas:

Durable execution — Workflow state survives process crashes. Temporal replays the event history to reconstruct state.
Built-in retries — Configure retry policies per activity with backoff.
Timeouts at every level — Schedule-to-start, start-to-close, and heartbeat timeouts.
Visibility — Temporal Web UI shows every saga's current state, history, and pending activities.
Versioning — Safely deploy updated saga logic without breaking in-flight workflows.

Saga with a Generic Compensator in Temporal#

For sagas with many steps, a generic compensator reduces boilerplate:

func OrderSagaWithCompensation(ctx workflow.Context, order Order) error {
    var compensations []func(workflow.Context) error
    defer func() {
        if len(compensations) > 0 {
            // Execute compensations in reverse on failure
            for i := len(compensations) - 1; i >= 0; i-- {
                compensations[i](ctx)
            }
        }
    }()

    // Step 1
    var inv ReserveResult
    if err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, &inv); err != nil {
        return err
    }
    compensations = append(compensations, func(ctx workflow.Context) error {
        return workflow.ExecuteActivity(ctx, ReleaseInventory, inv).Get(ctx, nil)
    })

    // Step 2
    var pay ChargeResult
    if err := workflow.ExecuteActivity(ctx, ChargePayment, order).Get(ctx, &pay); err != nil {
        return err
    }
    compensations = append(compensations, func(ctx workflow.Context) error {
        return workflow.ExecuteActivity(ctx, RefundPayment, pay).Get(ctx, nil)
    })

    // Step 3
    if err := workflow.ExecuteActivity(ctx, CreateShipment, order).Get(ctx, nil); err != nil {
        return err
    }

    // Success — clear compensations
    compensations = nil
    return nil
}

Observability and Debugging#

Sagas span multiple services and can take minutes or hours. Observability is critical:

Correlation IDs — Assign a unique saga ID and propagate it through every step. All logs, traces, and metrics reference this ID.

State persistence — Store saga state transitions in a queryable store:

CREATE TABLE saga_state (
    saga_id       UUID PRIMARY KEY,
    saga_type     TEXT NOT NULL,
    current_state TEXT NOT NULL,
    context       JSONB NOT NULL,
    created_at    TIMESTAMPTZ DEFAULT now(),
    updated_at    TIMESTAMPTZ DEFAULT now(),
    deadline_at   TIMESTAMPTZ
);

CREATE TABLE saga_step_log (
    id         BIGSERIAL PRIMARY KEY,
    saga_id    UUID REFERENCES saga_state(saga_id),
    step_name  TEXT NOT NULL,
    action     TEXT NOT NULL,  -- 'forward' or 'compensate'
    status     TEXT NOT NULL,  -- 'started', 'succeeded', 'failed'
    error_msg  TEXT,
    started_at TIMESTAMPTZ,
    ended_at   TIMESTAMPTZ
);

Alerts — Monitor for sagas stuck in intermediate states, compensation failures, and deadline breaches.

Common Pitfalls#

Forgetting idempotency — If a step is retried, it must not double-charge, double-reserve, or double-ship. Use idempotency keys.
Compensation ordering — Always compensate in reverse. Forward ordering can leave inconsistent state.
Ignoring compensation failures — Compensation can fail too. Have a dead-letter queue and manual resolution process.
Unbounded sagas — Always set a saga-level deadline. Without one, a stuck saga consumes resources indefinitely.
Mixing sync and async — If some steps are synchronous RPC and others are event-driven, ensure the orchestrator handles both consistently.

When Not to Use Sagas#

Sagas add complexity. Consider alternatives first:

Single database — If all data fits in one database, use regular transactions.
Outbox pattern — For reliable event publishing from a single service.
Two-phase commit — If your infrastructure supports it and latency is acceptable (rare in microservices).

Sagas are the right choice when you have genuinely distributed data ownership across multiple services with independent databases.

That is article #387 on Codelit. Browse all articles or explore the platform to level up your engineering skills.

Try it on Codelit

GitHub Integration

Paste any repo URL to generate an interactive architecture diagram from real code

Build this architecture →

Comments

AI search

AI-Powered Search Architecture: Semantic Search, Hybrid Search, and RAG

8 min read

AI safety

AI Safety Guardrails Architecture: Input Validation, Output Filtering, and Human-in-the-Loop

8 min read

API design

API Backward Compatibility: Ship Changes Without Breaking Consumers

6 min read

Try these templates

Scalable SaaS Application

Modern SaaS with microservices, event-driven processing, and multi-tenant architecture.

10 components

Distributed Rate Limiter

API rate limiting with sliding window, token bucket, and per-user quotas.

7 components

Distributed Key-Value Store

Redis/DynamoDB-like distributed KV store with consistent hashing, replication, and tunable consistency.

8 components

Build this architecture

Generate an interactive architecture for Event in seconds.

Try it in Codelit →

saga orchestrationdistributed transactionsevent-driven architectureTemporalstate machinescompensation patternmicroservicessystem design

Event-Driven Saga Orchestration: Managing Distributed Transactions

March 29, 2026 8 min readBy Codelit Team Discussion

Choreography vs. Orchestration#

There are two ways to coordinate a saga:

Choreography — Each service listens for events and decides independently what to do next. No central coordinator exists.

Orchestration — A dedicated orchestrator service drives the saga, telling each participant what to execute and when.

Choreography                          Orchestration

 OrderSvc ──event──▶ PaymentSvc       Orchestrator
 PaymentSvc ─event─▶ InventorySvc        │
 InventorySvc ─event▶ ShippingSvc         ├──▶ OrderSvc
                                          ├──▶ PaymentSvc
 (implicit flow, hard to trace)           ├──▶ InventorySvc
                                          └──▶ ShippingSvc
                                       (explicit flow, easy to trace)

Choreography works for simple flows with few participants. Orchestration excels when:

The flow has many steps or conditional branches.
You need centralized visibility into saga state.
Timeout and retry policies differ per step.
Business stakeholders need to understand the process flow.

This article focuses on the orchestration approach.

The Orchestrator as a State Machine#

An orchestrator is fundamentally a finite state machine. Each state represents a step in the saga, and transitions are triggered by the success or failure of that step.

┌─────────┐   success   ┌─────────────┐   success   ┌──────────────┐
│ Reserve  │────────────▶│  Charge      │────────────▶│  Ship Order  │
│ Inventory│             │  Payment     │             │              │
└────┬─────┘             └──────┬───────┘             └──────┬───────┘
     │ failure                  │ failure                    │ success
     ▼                          ▼                            ▼
┌─────────┐             ┌──────────────┐             ┌──────────────┐
│  FAILED  │◀───────────│  Compensate: │             │  COMPLETED   │
│  (done)  │             │  Release     │             │  (done)      │
└─────────┘             │  Inventory   │             └──────────────┘
                         └──────────────┘

State machine properties that matter:

Deterministic transitions — Given a state and an event, the next state is always the same.
Persistence — The current state is stored durably so the saga survives orchestrator restarts.
Idempotent transitions — Replaying the same event in the same state produces no side effects.

Compensation Logic#

Compensation is the inverse of a forward action. It does not "undo" in the ACID rollback sense — it applies a semantic reverse:

Forward Action	Compensation
Reserve inventory	Release inventory
Charge payment	Issue refund
Create shipping label	Cancel shipment
Send confirmation email	Send cancellation email
Allocate seat	Release seat

Compensation rules:

Compensations execute in reverse order — If steps A, B, C succeeded and D fails, compensate C, then B, then A.
Compensations must be idempotent — They may be retried if the orchestrator crashes mid-compensation.
Compensations can fail — You need a strategy for this (manual intervention queue, dead letter, retry with backoff).
Not all actions are compensatable — Sending an email cannot be unsent. Design around this by placing non-compensatable actions last.

class SagaStep:
    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action          # async callable
        self.compensation = compensation  # async callable or None

class SagaOrchestrator:
    def __init__(self, steps: list[SagaStep]):
        self.steps = steps
        self.completed = []

    async def execute(self, context):
        for step in self.steps:
            try:
                result = await step.action(context)
                context[step.name] = result
                self.completed.append(step)
            except Exception as e:
                await self.compensate(context, e)
                raise SagaFailed(step.name, e)

    async def compensate(self, context, original_error):
        for step in reversed(self.completed):
            if step.compensation:
                try:
                    await step.compensation(context)
                except Exception as comp_error:
                    # Log and continue — best-effort compensation
                    log.error(f"Compensation failed for {step.name}", exc_info=comp_error)

Timeout Handling#

Distributed systems are slow and unreliable. Every saga step needs timeout policies:

Step-level timeouts — How long to wait for a single step to complete before declaring failure.

Saga-level timeouts — Maximum wall-clock time for the entire saga. Prevents sagas from hanging indefinitely.

Retry policies — Define per-step retry behavior:

retry_policy = RetryPolicy(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=30),
    maximum_attempts=5,
    non_retryable_errors=[InsufficientFundsError, ItemNotFoundError],
)

Deadline propagation — Pass the saga deadline to each participant so they can fail fast if the deadline has already passed:

Saga starts at T=0, deadline T=30s

Step 1 receives deadline: T=30s (completes at T=2s)
Step 2 receives deadline: T=30s (completes at T=8s)
Step 3 receives deadline: T=30s (starts at T=8s, has 22s remaining)

Implementing Sagas with Temporal#

Temporal is a durable execution platform purpose-built for orchestrating long-running workflows like sagas. It persists every state transition to a history, making workflows resilient to crashes.

// Temporal workflow definition in Go
func OrderSagaWorkflow(ctx workflow.Context, order Order) error {
    options := workflow.ActivityOptions{
        StartToCloseTimeout: 30 * time.Second,
        RetryPolicy: &temporal.RetryPolicy{
            MaximumAttempts: 3,
        },
    }
    ctx = workflow.WithActivityOptions(ctx, options)

    // Step 1: Reserve inventory
    var reserveResult ReserveResult
    err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, &reserveResult)
    if err != nil {
        return fmt.Errorf("reserve inventory failed: %w", err)
    }

    // Step 2: Charge payment
    var chargeResult ChargeResult
    err = workflow.ExecuteActivity(ctx, ChargePayment, order).Get(ctx, &chargeResult)
    if err != nil {
        // Compensate step 1
        _ = workflow.ExecuteActivity(ctx, ReleaseInventory, reserveResult).Get(ctx, nil)
        return fmt.Errorf("charge payment failed: %w", err)
    }

    // Step 3: Create shipment
    var shipResult ShipResult
    err = workflow.ExecuteActivity(ctx, CreateShipment, order).Get(ctx, &shipResult)
    if err != nil {
        // Compensate steps 2 and 1
        _ = workflow.ExecuteActivity(ctx, RefundPayment, chargeResult).Get(ctx, nil)
        _ = workflow.ExecuteActivity(ctx, ReleaseInventory, reserveResult).Get(ctx, nil)
        return fmt.Errorf("create shipment failed: %w", err)
    }

    return nil
}

Why Temporal works well for sagas:

Durable execution — Workflow state survives process crashes. Temporal replays the event history to reconstruct state.
Built-in retries — Configure retry policies per activity with backoff.
Timeouts at every level — Schedule-to-start, start-to-close, and heartbeat timeouts.
Visibility — Temporal Web UI shows every saga's current state, history, and pending activities.
Versioning — Safely deploy updated saga logic without breaking in-flight workflows.

Saga with a Generic Compensator in Temporal#

For sagas with many steps, a generic compensator reduces boilerplate:

func OrderSagaWithCompensation(ctx workflow.Context, order Order) error {
    var compensations []func(workflow.Context) error
    defer func() {
        if len(compensations) > 0 {
            // Execute compensations in reverse on failure
            for i := len(compensations) - 1; i >= 0; i-- {
                compensations[i](ctx)
            }
        }
    }()

    // Step 1
    var inv ReserveResult
    if err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, &inv); err != nil {
        return err
    }
    compensations = append(compensations, func(ctx workflow.Context) error {
        return workflow.ExecuteActivity(ctx, ReleaseInventory, inv).Get(ctx, nil)
    })

    // Step 2
    var pay ChargeResult
    if err := workflow.ExecuteActivity(ctx, ChargePayment, order).Get(ctx, &pay); err != nil {
        return err
    }
    compensations = append(compensations, func(ctx workflow.Context) error {
        return workflow.ExecuteActivity(ctx, RefundPayment, pay).Get(ctx, nil)
    })

    // Step 3
    if err := workflow.ExecuteActivity(ctx, CreateShipment, order).Get(ctx, nil); err != nil {
        return err
    }

    // Success — clear compensations
    compensations = nil
    return nil
}

Observability and Debugging#

Sagas span multiple services and can take minutes or hours. Observability is critical:

Correlation IDs — Assign a unique saga ID and propagate it through every step. All logs, traces, and metrics reference this ID.

State persistence — Store saga state transitions in a queryable store:

CREATE TABLE saga_state (
    saga_id       UUID PRIMARY KEY,
    saga_type     TEXT NOT NULL,
    current_state TEXT NOT NULL,
    context       JSONB NOT NULL,
    created_at    TIMESTAMPTZ DEFAULT now(),
    updated_at    TIMESTAMPTZ DEFAULT now(),
    deadline_at   TIMESTAMPTZ
);

CREATE TABLE saga_step_log (
    id         BIGSERIAL PRIMARY KEY,
    saga_id    UUID REFERENCES saga_state(saga_id),
    step_name  TEXT NOT NULL,
    action     TEXT NOT NULL,  -- 'forward' or 'compensate'
    status     TEXT NOT NULL,  -- 'started', 'succeeded', 'failed'
    error_msg  TEXT,
    started_at TIMESTAMPTZ,
    ended_at   TIMESTAMPTZ
);

Alerts — Monitor for sagas stuck in intermediate states, compensation failures, and deadline breaches.

Common Pitfalls#

Forgetting idempotency — If a step is retried, it must not double-charge, double-reserve, or double-ship. Use idempotency keys.
Compensation ordering — Always compensate in reverse. Forward ordering can leave inconsistent state.
Ignoring compensation failures — Compensation can fail too. Have a dead-letter queue and manual resolution process.
Unbounded sagas — Always set a saga-level deadline. Without one, a stuck saga consumes resources indefinitely.
Mixing sync and async — If some steps are synchronous RPC and others are event-driven, ensure the orchestrator handles both consistently.

When Not to Use Sagas#

Sagas add complexity. Consider alternatives first:

Single database — If all data fits in one database, use regular transactions.
Outbox pattern — For reliable event publishing from a single service.
Two-phase commit — If your infrastructure supports it and latency is acceptable (rare in microservices).

Sagas are the right choice when you have genuinely distributed data ownership across multiple services with independent databases.

That is article #387 on Codelit. Browse all articles or explore the platform to level up your engineering skills.

Try it on Codelit

GitHub Integration

Paste any repo URL to generate an interactive architecture diagram from real code

Build this architecture →

Comments

AI search

Build this architecture

Generate an interactive architecture for Event in seconds.

Try it in Codelit →

Event-Driven Saga Orchestration: Managing Distributed Transactions

Choreography vs. Orchestration#

The Orchestrator as a State Machine#

Compensation Logic#

Timeout Handling#

Implementing Sagas with Temporal#

Saga with a Generic Compensator in Temporal#

Observability and Debugging#

Common Pitfalls#

When Not to Use Sagas#

Comments

Related articles

AI-Powered Search Architecture: Semantic Search, Hybrid Search, and RAG

AI Safety Guardrails Architecture: Input Validation, Output Filtering, and Human-in-the-Loop

API Backward Compatibility: Ship Changes Without Breaking Consumers

Try these templates

Scalable SaaS Application

Distributed Rate Limiter

Distributed Key-Value Store

Build this architecture

Event-Driven Saga Orchestration: Managing Distributed Transactions

Choreography vs. Orchestration#

The Orchestrator as a State Machine#

Compensation Logic#

Timeout Handling#

Implementing Sagas with Temporal#

Saga with a Generic Compensator in Temporal#

Observability and Debugging#

Common Pitfalls#

When Not to Use Sagas#

Comments

Related articles

AI-Powered Search Architecture: Semantic Search, Hybrid Search, and RAG

AI Safety Guardrails Architecture: Input Validation, Output Filtering, and Human-in-the-Loop

API Backward Compatibility: Ship Changes Without Breaking Consumers

Try these templates

Scalable SaaS Application

Distributed Rate Limiter

Distributed Key-Value Store

Build this architecture