Workflow Engine Architecture: Durable Execution with Temporal
Distributed systems fail in distributed ways. A payment flow touches a billing service, a fraud check, a ledger write, and a notification — any of which can time out, crash, or return an unexpected error. Traditional approaches scatter retry logic, state tracking, and compensation across application code. Workflow engines centralize that complexity into a durable execution layer.
The Problem with Ad-Hoc Orchestration
Consider a subscription renewal flow without a workflow engine:
1. Charge the payment method.
2. If the charge fails, retry with exponential backoff.
3. If the charge succeeds, update the subscription record.
4. Send a confirmation email.
5. If step 3 fails after step 1 succeeds, issue a refund.
In practice, teams implement this with a combination of message queues, cron jobs, database flags, and retry loops. The logic scatters across services. Failure states multiply. Edge cases like "the server crashed between step 1 and step 3" become debugging nightmares because the orchestration state lives nowhere explicit.
Durable Execution
Durable execution is the core idea behind modern workflow engines. The runtime guarantees that a function will run to completion — even across process restarts, deployments, and infrastructure failures.
The mechanism:
┌─────────────────────────────────────────────────────┐
│                   Temporal Server                   │
│  ┌──────────┐  ┌──────────────┐ ┌──────────────┐    │
│  │ History  │  │  Matching    │ │    Timer     │    │
│  │ Service  │  │  Service     │ │   Service    │    │
│  │          │  │ (task queue  │ │  (durable    │    │
│  │ (event   │  │  routing)    │ │   sleep)     │    │
│  │  log)    │  │              │ │              │    │
│  └────┬─────┘  └──────┬───────┘ └──────┬───────┘    │
│       │               │                │            │
│       └───────────────┼────────────────┘            │
│                       ▼                             │
│               Persistence Layer                     │
│       (Cassandra / PostgreSQL / MySQL)              │
└─────────────────────────────────────────────────────┘
          ▲                                 ▲
          │    task dispatch / complete     │
     ┌────┴─────┐                      ┌────┴─────┐
     │ Worker   │                      │ Worker   │
     │ (runs    │                      │ (runs    │
     │ workflow │                      │ activity │
     │ code)    │                      │ code)    │
     └──────────┘                      └──────────┘
- The workflow function executes on a worker.
- Every side effect (activity call, timer, signal) is recorded as an event in the history.
- If the worker crashes, a new worker picks up the workflow and replays the event history to reconstruct state — without re-executing side effects.
- Execution resumes from exactly where it left off.
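This replay model means workflow code must be deterministic. It must not depend on unseeded randomness, the wall clock, or I/O performed outside of activities, because any of these could yield a different value on replay than in the original run. (The TypeScript SDK runs workflows in a sandbox that replaces Math.random() and Date with deterministic versions for this reason.)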
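The replay steps above can be illustrated with a toy model. This is a minimal sketch, not Temporal's implementation: ReplayContext, HistoryEntry, and the renewal function are hypothetical names, and the "activities" are plain synchronous functions so the example stays self-contained.

```typescript
// Toy model of history-based replay. Not Temporal's actual implementation.
type HistoryEntry = { seq: number; activity: string; result: unknown };

class ReplayContext {
  private cursor = 0;
  constructor(private entries: HistoryEntry[]) {}

  // Runs an activity, or returns its recorded result when replaying.
  execute<T>(activity: string, fn: () => T): T {
    const recorded = this.entries[this.cursor];
    if (recorded !== undefined) {
      if (recorded.activity !== activity) {
        // The code no longer matches the history: a determinism violation.
        throw new Error(`non-deterministic: expected ${recorded.activity}, got ${activity}`);
      }
      this.cursor++;
      return recorded.result as T; // side effect is NOT re-executed
    }
    const result = fn(); // first execution: perform the side effect
    this.entries.push({ seq: this.cursor, activity, result });
    this.cursor++;
    return result;
  }
}

// Hypothetical workflow; `crashAfterCharge` simulates a worker crash.
let chargeCalls = 0;
function renewal(ctx: ReplayContext, crashAfterCharge: boolean): string {
  const txId = ctx.execute("charge", () => {
    chargeCalls++;
    return "tx-1";
  });
  if (crashAfterCharge) throw new Error("worker crashed");
  return ctx.execute("extend", () => `extended after ${txId}`);
}

const eventLog: HistoryEntry[] = [];
try {
  renewal(new ReplayContext(eventLog), true); // first worker crashes mid-run
} catch {
  // crash swallowed for the demo
}
const outcome = renewal(new ReplayContext(eventLog), false); // second worker replays
```

The workflow function runs twice, but the charge side effect runs only once: the second run reads the recorded result from eventLog. The activity-name check shows why determinism matters, since a replay that requests a different activity than the one recorded cannot be reconciled with the history.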
Temporal and Cadence
Cadence was developed at Uber to orchestrate microservices. Temporal is a fork of Cadence, started by Cadence's original creators, who went on to found Temporal Technologies. Both implement the durable execution model. Temporal has become the more widely adopted of the two, with broader language support and an active open-source community.
Temporal supports SDKs in Go, Java, TypeScript, Python, .NET, and PHP. The programming model is the same across languages: workflows are functions, activities are functions, and the runtime handles durability.
Workflow as Code
Unlike BPMN engines or YAML-based orchestrators, Temporal uses workflow as code. The orchestration logic is written in a general-purpose programming language:
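// TypeScript workflow
import { proxyActivities } from "@temporalio/workflow";
// Assumes the activity implementations live in ./activities
import type * as activitiesImpl from "./activities";

// Proxy through which the workflow schedules activities
const activities = proxyActivities<typeof activitiesImpl>({
  startToCloseTimeout: "30s",
});

export async function subscriptionRenewal(customerId: string): Promise<void> {
  // Step 1: Charge
  const chargeResult = await activities.chargePaymentMethod(customerId);
  if (!chargeResult.success) {
    await activities.notifyPaymentFailed(customerId);
    return;
  }
  // Step 2: Update subscription
  try {
    await activities.extendSubscription(customerId, "1-month");
  } catch (err) {
    // Compensate: refund the charge
    await activities.refundCharge(chargeResult.transactionId);
    throw err;
  }
  // Step 3: Notify
  await activities.sendRenewalConfirmation(customerId);
}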
This looks like ordinary async code. The difference is that every awaited activity call is a durable checkpoint: its result is recorded in the event history. If the worker crashes after chargePaymentMethod completes but before extendSubscription starts, replay skips the charge (using the recorded result) and resumes at the subscription update.
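The try/catch compensation in the workflow above generalizes to longer flows. The following is a minimal sketch of that pattern, assuming a hypothetical runSaga helper and SagaStep type; the steps are synchronous stand-ins, whereas in a real workflow they would be awaited activity calls.

```typescript
// A step and its optional undo action. Names here are illustrative only.
type SagaStep = { name: string; run: () => void; compensate?: () => void };

// Runs steps in order. On failure, undoes the completed steps in reverse order.
function runSaga(steps: SagaStep[], log: string[]): boolean {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      log.push(`done:${step.name}`);
      completed.push(step);
    } catch {
      for (const done of completed.reverse()) {
        if (done.compensate) {
          done.compensate();
          log.push(`undone:${done.name}`);
        }
      }
      return false;
    }
  }
  return true;
}

const sagaLog: string[] = [];
const succeeded = runSaga(
  [
    { name: "charge", run: () => {}, compensate: () => {} }, // refund on undo
    {
      name: "extend",
      run: () => {
        throw new Error("subscription store unavailable");
      },
    },
    { name: "notify", run: () => {} }, // never reached in this run
  ],
  sagaLog,
);
```

In this run the extend step fails, so the charge is compensated and the notification step never executes. Keeping each compensation next to the step it undoes avoids nested try/catch blocks as the number of steps grows.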
Activities
Activities are the side-effecting building blocks of a workflow. They run on workers and can:
- Call external APIs
- Read and write databases
- Send messages
- Perform file I/O
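Activities are not re-executed during replay: once an activity completes, its result is stored in the event history and reused. An activity can still run more than once before it completes, for example when an attempt times out and is retried, so activities should be idempotent where possible. If an activity fails, the retry policy governs what happens next.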
Retry Policies
Temporal provides fine-grained retry configuration:
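import { proxyActivities } from "@temporalio/workflow";
// Assumes the activity implementations live in ./activities
import type * as activitiesImpl from "./activities";

const activities = proxyActivities<typeof activitiesImpl>({
  startToCloseTimeout: "30s", // maximum duration of a single attempt
  retry: {
    initialInterval: "1s", // delay before the first retry
    backoffCoefficient: 2.0, // multiplier applied to the delay after each retry
    maximumInterval: "60s", // upper bound on the delay between retries
    maximumAttempts: 5, // total attempts, including the first
    nonRetryableErrorTypes: ["InvalidAccountError"], // fail immediately on these
  },
});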
The runtime handles retry scheduling, backoff, and attempt counting. Application code does not need retry loops.
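To make the policy concrete, the sketch below computes the delays it implies, assuming the delay before retry n is initialInterval × backoffCoefficient^(n−1), capped at maximumInterval. The RetryPolicyMs type and retryDelaysMs function are hypothetical helpers written for illustration, not part of any SDK.

```typescript
// Policy fields mirror the configuration above, expressed in milliseconds.
interface RetryPolicyMs {
  initialIntervalMs: number;
  backoffCoefficient: number;
  maximumIntervalMs: number;
  maximumAttempts: number;
}

// Delay before each retry: initial * coefficient^(retry - 1), capped at the maximum.
function retryDelaysMs(p: RetryPolicyMs): number[] {
  const delays: number[] = [];
  // maximumAttempts counts the first try, so there are maximumAttempts - 1 retries.
  for (let retry = 1; retry < p.maximumAttempts; retry++) {
    const raw = p.initialIntervalMs * Math.pow(p.backoffCoefficient, retry - 1);
    delays.push(Math.min(raw, p.maximumIntervalMs));
  }
  return delays;
}

const delays = retryDelaysMs({
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumIntervalMs: 60000,
  maximumAttempts: 5,
});
// delays is [1000, 2000, 4000, 8000]

// With more attempts, the cap takes effect: the seventh retry would be 64s, clamped to 60s.
const longerDelays = retryDelaysMs({
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumIntervalMs: 60000,
  maximumAttempts: 8,
});
```

Under this assumption, the configuration above gives up after roughly 15 seconds of cumulative waiting, which is worth checking against how long the downstream service typically takes to recover.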
Timers and Durable Sleep
Workflows can sleep for arbitrary durations — minutes, hours, days, or months:
// Wait 30 days before checking renewal
await sleep("30 days");
await activities.checkRenewalEligibility(customerId);
This is a durable timer, not a thread sleep. The timer is persisted by the server, so the workflow does not need to stay in any worker's memory while it waits. When the timer fires, the matching service dispatches a workflow task to an available worker, replay reconstructs state if needed, and execution continues.
This makes Temporal suitable for long-running business processes that span weeks or months.
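The core idea behind a durable timer is that the deadline is stored as data instead of being held by a sleeping thread. The sketch below is a toy model under that assumption; TimerStore and TimerRecord are hypothetical names, and the in-memory array stands in for a database table.

```typescript
// A persisted timer: which workflow to wake, and when.
type TimerRecord = { workflowId: string; fireAtMs: number };

class TimerStore {
  private timers: TimerRecord[] = []; // stands in for durable storage

  // Records an absolute deadline. Nothing blocks while waiting.
  schedule(workflowId: string, nowMs: number, delayMs: number): void {
    this.timers.push({ workflowId, fireAtMs: nowMs + delayMs });
  }

  // Returns the workflows whose timers are due and removes those timers.
  popDue(nowMs: number): string[] {
    const due = this.timers.filter((t) => t.fireAtMs <= nowMs);
    this.timers = this.timers.filter((t) => t.fireAtMs > nowMs);
    return due.map((t) => t.workflowId);
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;
const timerStore = new TimerStore();
timerStore.schedule("renewal-42", 0, 30 * DAY_MS);

const dueOnDay29 = timerStore.popDue(29 * DAY_MS); // nothing due yet
const dueOnDay30 = timerStore.popDue(30 * DAY_MS); // the renewal workflow is due
```

Because only the deadline is stored, any process that can read the store can fire the timer, which is why worker restarts during the 30 days do not matter.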
Signals
Signals are asynchronous messages sent to a running workflow from the outside world. They allow external events to influence workflow execution:
import { condition, defineSignal, setHandler } from "@temporalio/workflow";
// Signal definition, shared by the workflow and its callers
export const cancelSignal = defineSignal("cancel");
// In the workflow: record the signal when it arrives
let cancelRequested = false;
setHandler(cancelSignal, () => {
  cancelRequested = true;
});
// Race: resolves true if the user cancels, false if 30 days pass first
const cancelled = await condition(() => cancelRequested, "30 days");
Common signal use cases:
- User actions (approve, cancel, escalate)
- External webhook callbacks
- Administrative overrides
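One way to reason about a signal racing a timer is that the winner is whichever event was recorded first, which keeps the outcome stable across replays. The sketch below models that idea only; RecordedEvent and resolveRace are hypothetical names, and real event histories carry far more detail.

```typescript
// The two kinds of event that can end the wait in this toy model.
type RecordedEvent =
  | { kind: "signal"; name: string; atMs: number }
  | { kind: "timerFired"; atMs: number };

// Scans events in recorded order and reports which one ended the wait.
function resolveRace(events: RecordedEvent[]): "cancelled" | "timerFired" | "waiting" {
  for (const e of events) {
    if (e.kind === "signal" && e.name === "cancel") return "cancelled";
    if (e.kind === "timerFired") return "timerFired";
  }
  return "waiting";
}

const cancelledFirst = resolveRace([
  { kind: "signal", name: "cancel", atMs: 5000 },
  { kind: "timerFired", atMs: 9000 },
]);
const timerFirst = resolveRace([
  { kind: "signal", name: "escalate", atMs: 1000 }, // unrelated signal is ignored
  { kind: "timerFired", atMs: 9000 },
]);
const stillWaiting = resolveRace([]);
```

Since the decision depends only on the recorded order, replaying the same events always yields the same branch, regardless of when the replay happens.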
Queries
Queries let external code read the current state of a running workflow without affecting its execution:
// Define a query handler inside the workflow
const getStatus = defineQuery<string>("getStatus");
setHandler(getStatus, () => currentStatus);
// External code queries the workflow through a handle
const handle = client.workflow.getHandle(workflowId);
const status = await handle.query(getStatus);
Queries are synchronous, read-only, and served from the workflow's in-memory state. They are useful for building dashboards, status pages, and admin tools.
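The difference between a query and a signal can be shown with a toy workflow object: a signal is recorded and may change state, while a query only reads. This is a sketch under that assumption; ToyWorkflow and its methods are hypothetical and do not mirror any SDK class.

```typescript
class ToyWorkflow {
  readonly recorded: string[] = []; // stands in for the event history
  private currentStatus = "pending";
  private queryHandlers = new Map<string, () => unknown>();

  constructor() {
    // Query handlers are synchronous and only read state.
    this.queryHandlers.set("getStatus", () => this.currentStatus);
  }

  // Signals are recorded and may mutate workflow state.
  signal(name: string): void {
    this.recorded.push(`signal:${name}`);
    if (name === "approve") this.currentStatus = "approved";
  }

  // Queries return a value and leave the recorded events untouched.
  query(name: string): unknown {
    const handler = this.queryHandlers.get(name);
    if (!handler) throw new Error(`unknown query: ${name}`);
    return handler();
  }
}

const wf = new ToyWorkflow();
const before = wf.query("getStatus"); // "pending"
wf.signal("approve");
const after = wf.query("getStatus"); // "approved"
```

After both queries and one signal, wf.recorded holds a single entry: only the signal left a trace, which is what makes queries safe to call from dashboards at any frequency.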
Versioning Workflows
Production workflows evolve. Temporal provides two versioning strategies:
Patching (Inline Versioning)
Use patched() to branch behavior within a single workflow definition:
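if (patched("add-fraud-check")) {
  // New path: only taken by executions that recorded the patch marker
  await activities.runFraudCheck(customerId);
}
await activities.chargePaymentMethod(customerId);
Executions whose history was recorded before the patch take the old path on replay. New executions take the new path. Once all old executions have completed, the guard can be retired in two steps: first replace patched() with deprecatePatch(), then remove the call once no running execution depends on the patch marker.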
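The branching behavior can be modeled with a marker set: a new execution records a marker when it reaches the guard, and a replay takes the new path only if the marker is present. This is a simplified sketch; PatchContext and planSteps are hypothetical, and the real SDK tracks markers per position in the history.

```typescript
class PatchContext {
  constructor(private markers: Set<string>, private replaying: boolean) {}

  patched(id: string): boolean {
    // On replay, follow whatever the original run recorded.
    if (this.replaying) return this.markers.has(id);
    // On a new execution, record the marker and take the new path.
    this.markers.add(id);
    return true;
  }
}

// Returns the activities the workflow would schedule, in order.
function planSteps(ctx: PatchContext): string[] {
  const steps: string[] = [];
  if (ctx.patched("add-fraud-check")) steps.push("runFraudCheck");
  steps.push("chargePaymentMethod");
  return steps;
}

// An execution started before the patch: its history has no marker.
const oldReplay = planSteps(new PatchContext(new Set<string>(), true));

// An execution started after the patch, then replayed later.
const newMarkers = new Set<string>();
const newRun = planSteps(new PatchContext(newMarkers, false));
const newReplay = planSteps(new PatchContext(newMarkers, true));
```

Here oldReplay contains only the charge, while newRun and newReplay both include the fraud check, so each execution replays the same sequence it originally ran.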
Worker Versioning
Temporal supports build ID-based versioning where different worker builds handle different workflow versions. The server routes tasks to the correct worker build, ensuring that a workflow always replays on a compatible codebase.
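The routing idea can be sketched as a lookup from build ID to compatible workers. This is only an illustration of the concept: VersionedTask, routeTask, and the worker names are hypothetical, and the server's actual assignment rules are more involved.

```typescript
// A task pinned to the build that started its workflow.
type VersionedTask = { workflowId: string; buildId: string };

// Picks a worker running the task's build, or fails if none is registered.
function routeTask(task: VersionedTask, workersByBuild: Map<string, string[]>): string {
  const candidates = workersByBuild.get(task.buildId) ?? [];
  const first = candidates[0];
  if (first === undefined) {
    throw new Error(`no worker registered for build ${task.buildId}`);
  }
  return first;
}

const workersByBuild = new Map<string, string[]>([
  ["build-1", ["worker-a"]],
  ["build-2", ["worker-b", "worker-c"]],
]);

const oldTarget = routeTask({ workflowId: "wf-old", buildId: "build-1" }, workersByBuild);
const newTarget = routeTask({ workflowId: "wf-new", buildId: "build-2" }, workersByBuild);
```

The practical consequence is that workers for an old build must stay deployed until the workflows pinned to it have finished.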
Task Queues
Task queues decouple workflow scheduling from worker deployment:
- Multiple queues — Route different workflow types to specialized worker pools.
- Priority — Assign priority weights to control processing order.
- Rate limiting — Throttle task dispatch to protect downstream services.
- Worker affinity — Route workflows to workers with specific capabilities (GPU, region, credentials).
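The sketch below models two of the properties listed above, multiple named queues and throttled dispatch, with an in-memory structure. ToyTaskQueues is a hypothetical class; the maxTasks argument is a stand-in for rate limiting, not how the server implements it.

```typescript
class ToyTaskQueues {
  private queues = new Map<string, string[]>();

  // Producers address a queue by name and do not know which workers exist.
  enqueue(queue: string, task: string): void {
    const pending = this.queues.get(queue) ?? [];
    pending.push(task);
    this.queues.set(queue, pending);
  }

  // A worker polls one named queue. maxTasks caps how much it takes per poll.
  poll(queue: string, maxTasks: number): string[] {
    const pending = this.queues.get(queue) ?? [];
    return pending.splice(0, maxTasks); // removes and returns the oldest tasks
  }
}

const taskQueues = new ToyTaskQueues();
taskQueues.enqueue("payments", "charge-1");
taskQueues.enqueue("payments", "charge-2");
taskQueues.enqueue("payments", "charge-3");
taskQueues.enqueue("emails", "welcome-1");

const firstBatch = taskQueues.poll("payments", 2); // throttled to two tasks
const emailBatch = taskQueues.poll("emails", 5);
const secondBatch = taskQueues.poll("payments", 2); // the remaining payment task
```

Because producers and workers meet only at the queue name, a worker pool can be scaled or replaced without changing the code that schedules work.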
Use Cases
Payment Processing
Payment flows involve multiple external services, each with different failure modes. Temporal ensures charges, refunds, and ledger updates remain consistent even when individual services fail.
Microservice Orchestration
Instead of choreography (events flying between services with no central visibility), Temporal provides orchestration with a full audit trail. Each step is recorded, queryable, and retryable.
Data Pipelines
ETL workflows that extract data, transform it, and load it into warehouses benefit from durable execution. Failed stages retry without re-running completed stages.
User Onboarding
Multi-day onboarding sequences — welcome emails, trial reminders, activation checks — map naturally to long-running workflows with durable timers.
Order Fulfillment
E-commerce order flows (payment, inventory reservation, shipping, delivery tracking) span hours or days. Temporal tracks each step and handles compensation when orders are modified or cancelled.
Observability
Temporal provides built-in observability:
- Web UI — Visualize running, completed, and failed workflows with full event history.
- Search attributes — Index custom fields for filtering workflows by business identifiers.
- Metrics — Prometheus-compatible metrics for workflow latency, failure rates, and queue depth.
- Tracing — OpenTelemetry integration for distributed trace correlation.
When to Use a Workflow Engine
A workflow engine adds value when:
- The process has multiple steps with different failure modes.
- You need each step to take effect once (through recorded results and idempotent activities) or reliable compensation.
- The process spans long durations (hours, days, months).
- You need visibility into the state of in-flight business processes.
- Multiple teams own different steps and need a shared orchestration layer.
A workflow engine is overkill for simple request-response handlers, single-service CRUD operations, or purely event-driven architectures where ordering does not matter.
Article #345 in the Codelit engineering series. Explore our full library of system design, distributed systems, and architecture guides at codelit.io.