Long-Running API Operations: Async Patterns, Polling, Webhooks, and the Google LRO Pattern
Long-Running API Operations: Beyond Request-Response#
Some operations take seconds, minutes, or hours. Video transcoding, ML model training, large data exports, and some payment flows often cannot finish within the lifetime of a single HTTP request. They need async patterns.
The Problem with Synchronous APIs#
Client → POST /api/export (1M rows)
↓
Waiting... 30s... 60s... 90s...
↓
504 Gateway Timeout
Load balancers, reverse proxies, and clients all have timeout limits. Long operations need a different approach.
Pattern 1: Async Request-Response#
Accept the request immediately, process in the background, return a reference.
Client → POST /api/exports
Server → 202 Accepted
{
"operationId": "op_abc123",
"status": "pending",
"statusUrl": "/api/operations/op_abc123"
}
(Background: worker processes the export)
Client → GET /api/operations/op_abc123
Server → 200 OK
{
"operationId": "op_abc123",
"status": "completed",
"result": { "downloadUrl": "/api/exports/op_abc123/download" }
}
Key: Return 202 Accepted (not 200 OK) to signal the request was received but not yet fulfilled.
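A minimal in-memory sketch of this flow. The `OperationStore` class and its `submit`, `complete`, and `get` methods are hypothetical names used only for illustration; a real service would persist operations and run the work on a background worker.

```python
import uuid


class OperationStore:
    """In-memory stand-in for an operations table (illustration only)."""

    def __init__(self):
        self._ops = {}

    def submit(self):
        """Register a new operation; return (HTTP status, response body)."""
        op_id = f"op_{uuid.uuid4().hex[:8]}"
        body = {
            "operationId": op_id,
            "status": "pending",
            "statusUrl": f"/api/operations/{op_id}",
        }
        self._ops[op_id] = dict(body)
        return 202, body  # 202 Accepted: received, not yet fulfilled

    def complete(self, op_id, result):
        """Called by the background worker when the job finishes."""
        self._ops[op_id].update(status="completed", result=result)

    def get(self, op_id):
        """Handle GET on the status URL; return (HTTP status, response body)."""
        op = self._ops.get(op_id)
        if op is None:
            return 404, {"error": "operation not found"}
        return 200, dict(op)
```

The client only ever sees the operation ID and status URL, so the worker implementation can change without affecting the API contract.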
Pattern 2: Polling#
The client periodically checks the operation status.
Basic Polling#
Client Server
│ │
├─ POST /api/reports ──────────────→ │ 202 { operationId: "op_1" }
│ │
│ (wait 2s) │
├─ GET /api/operations/op_1 ───────→ │ 200 { status: "running" }
│ │
│ (wait 2s) │
├─ GET /api/operations/op_1 ───────→ │ 200 { status: "running" }
│ │
│ (wait 2s) │
├─ GET /api/operations/op_1 ───────→ │ 200 { status: "completed", result: {...} }
│ │
Smart Polling with Retry-After#
Tell clients when to check back:
HTTP/1.1 200 OK
Retry-After: 5
{
"operationId": "op_abc123",
"status": "running",
"progress": 45,
"estimatedTimeRemaining": "12s"
}
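On the client side, the header has to be turned into a wait time. The sketch below assumes a hypothetical helper named `retry_after_seconds` and a plain dict of response headers; it handles both forms HTTP allows for `Retry-After` (a number of seconds or an HTTP-date) and falls back to a default when the header is missing or unparseable.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, default=2.0, now=None):
    """Return how long to wait before the next poll, in seconds."""
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "5"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```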
Exponential Backoff Polling#
Poll 1: wait 1s → GET /operations/op_1 → running
Poll 2: wait 2s → GET /operations/op_1 → running
Poll 3: wait 4s → GET /operations/op_1 → running
Poll 4: wait 8s → GET /operations/op_1 → completed
Max: wait 30s (cap the backoff)
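The schedule above could be implemented roughly as follows. `backoff_delays` and `poll_until_done` are illustrative names, `fetch_status` is an assumed callable that performs the GET and returns the parsed body, and the set of terminal statuses is taken from the states used elsewhere in this article.

```python
import time


def backoff_delays(initial=1.0, factor=2.0, cap=30.0):
    """Yield 1, 2, 4, 8, ... seconds, never exceeding the cap."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)


def poll_until_done(fetch_status, max_polls=20, sleep=time.sleep):
    """Poll fetch_status() with exponential backoff until a terminal state."""
    terminal = {"completed", "failed", "cancelled", "timed_out"}
    delays = backoff_delays()
    for _ in range(max_polls):
        sleep(next(delays))
        op = fetch_status()
        if op["status"] in terminal:
            return op
    raise TimeoutError("operation did not finish within max_polls")
```

Clients that poll in large numbers often add random jitter to each delay so they do not all hit the server at the same moment; that is omitted here to keep the schedule identical to the one shown.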
Pattern 3: Webhooks for Completion#
Instead of polling, the server notifies the client when done.
Client Server Worker
│ │ │
├─ POST /api/exports ──────────────→ │ │
│ { callbackUrl: "https://..."} │ │
│ ├─ Queue job ────────────→ │
│ ←── 202 { operationId: "op_1" } │ │
│ │ │
│ (time passes) │ │
│ │ │
│ │ ←── Job complete ────────┤
│ ←── POST to callbackUrl ──────────┤ │
│ { operationId: "op_1", │ │
│ status: "completed", │ │
│ result: {...} } │ │
Webhook Security#
Sign webhook payloads so clients can verify authenticity:
POST /webhook/exports HTTP/1.1
Content-Type: application/json
X-Signature: sha256=a1b2c3d4...
X-Timestamp: 1711616400
{ "operationId": "op_1", "status": "completed" }
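Clients verify that HMAC-SHA256 of timestamp + "." + body, keyed with the shared secret, equals the signature. Including the timestamp in the signed message lets the client reject old requests, which limits replay attacks.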
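A sketch of the verification step using Python's standard `hmac` module. The function names `sign` and `verify` and the 300-second tolerance window are assumptions made for this example, not part of any particular webhook provider's scheme.

```python
import hashlib
import hmac
import time


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    """Compute the X-Signature value for timestamp + "." + body."""
    message = timestamp.encode() + b"." + body
    return "sha256=" + hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret, timestamp, body, signature, tolerance=300, now=None):
    """Return True if the signature matches and the timestamp is recent."""
    now = time.time() if now is None else now
    try:
        age = abs(now - int(timestamp))
    except ValueError:
        return False
    if age > tolerance:
        return False  # stale timestamp: reject possible replays
    expected = sign(secret, timestamp, body)
    # compare_digest runs in constant time to avoid timing side channels
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The body must be the raw bytes received on the wire; re-serializing parsed JSON can change whitespace or key order and break the signature.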
Webhook Retry Strategy#
Attempt 1: immediate → POST callback → timeout
Attempt 2: wait 10s → POST callback → 500 error
Attempt 3: wait 30s → POST callback → 200 OK ✓
Max retries: 5
Max delay: 1 hour
Dead letter: store failed webhooks for manual retry
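One way to sketch this retry loop. The first three delays mirror the schedule above; the last two values, the reading of "max retries: 5" as five attempts in total, and the names `deliver`, `send`, and `dead_letters` are all assumptions for illustration.

```python
import time

# Delay before each attempt, in seconds. The last two values are assumed,
# with the final one set to the 1-hour maximum.
RETRY_DELAYS = [0, 10, 30, 300, 3600]


def deliver(send, payload, dead_letters, sleep=time.sleep):
    """Try send(payload) once per entry in RETRY_DELAYS.

    send returns an HTTP status code, or raises on timeout/connection error.
    Returns True on a 2xx response; otherwise the payload is appended to
    dead_letters for manual retry and False is returned.
    """
    for delay in RETRY_DELAYS:
        sleep(delay)
        try:
            status = send(payload)
        except Exception:
            continue  # treat timeouts and network errors as retryable
        if 200 <= status < 300:
            return True
    dead_letters.append(payload)
    return False
```

In a real system the waiting would normally be handled by a delayed queue rather than a sleeping process, so a slow receiver does not tie up a worker.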
Progress Tracking#
For operations that take minutes, show meaningful progress.
Progress Endpoint#
GET /api/operations/op_abc123
{
"operationId": "op_abc123",
"status": "running",
"progress": {
"percent": 67,
"currentStep": "Processing records",
"stepsCompleted": 2,
"totalSteps": 3,
"recordsProcessed": 670000,
"totalRecords": 1000000
},
"createdAt": "2026-03-29T10:00:00Z",
"estimatedCompletion": "2026-03-29T10:05:30Z"
}
Server-Sent Events for Real-Time Progress#
Client → GET /api/operations/op_1/stream
Accept: text/event-stream
Server → event: progress
data: {"percent": 25, "step": "Fetching data"}
event: progress
data: {"percent": 50, "step": "Processing records"}
event: progress
data: {"percent": 100, "step": "Generating file"}
event: complete
data: {"downloadUrl": "/exports/op_1/download"}
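On the wire, each event is a block of `event:` and `data:` lines terminated by a blank line. A small sketch of how a server might serialize the stream above; `sse_event` and `progress_stream` are hypothetical helper names, and the framework that actually writes these strings to the response is left out.

```python
import json


def sse_event(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event; the blank line terminates the event."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def progress_stream(steps, download_url):
    """Yield SSE frames for each (percent, step) pair, then a final 'complete'."""
    for percent, step in steps:
        yield sse_event("progress", {"percent": percent, "step": step})
    yield sse_event("complete", {"downloadUrl": download_url})
```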
Timeout Handling#
Operation-Level Timeouts#
Client → POST /api/exports
{ "timeout": 300 } // 5 minute max
Server → starts processing
↓
If not done in 300s:
status → "timed_out"
cleanup resources
notify client
Timeout Response#
GET /api/operations/op_abc123
{
"operationId": "op_abc123",
"status": "timed_out",
"error": {
"code": "OPERATION_TIMEOUT",
"message": "Operation exceeded maximum duration of 300s",
"retryable": true
}
}
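A sketch of the server-side check that produces this response. The `enforce_timeout` name and the `startedAt` field (epoch seconds) are assumptions; the check could run in the worker loop or in a periodic sweeper.

```python
def enforce_timeout(op, now):
    """Mark a still-active operation as timed_out once its deadline passes.

    op is a dict with 'status', 'startedAt' (epoch seconds), and
    'timeout' (seconds). Finished operations are left untouched.
    """
    active = op["status"] in ("pending", "running")
    if active and now - op["startedAt"] > op["timeout"]:
        op["status"] = "timed_out"
        op["error"] = {
            "code": "OPERATION_TIMEOUT",
            "message": f"Operation exceeded maximum duration of {op['timeout']}s",
            "retryable": True,
        }
    return op
```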
The Google LRO Pattern#
Google's Long Running Operations API (google.longrunning, described in AIP-151) is a widely used convention across Google Cloud APIs. It models operations as first-class resources.
Resource Model#
Operation {
name: "operations/export-abc123"
metadata: { type-specific progress info }
done: false
result: oneof {
error: Status { code, message }
response: { the actual result }
}
}
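The resource model can be mirrored in application code. The dataclass below is a simplified sketch, not Google's generated client type: it flattens the `result` oneof into two optional fields and enforces that neither is set while the operation is running and that they are never set together.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Operation:
    name: str
    metadata: dict = field(default_factory=dict)
    done: bool = False
    error: Optional[dict] = None      # {"code": ..., "message": ...}
    response: Optional[Any] = None    # the actual result

    def __post_init__(self):
        # 'result' is a oneof: at most one of error/response may be set.
        if self.error is not None and self.response is not None:
            raise ValueError("error and response are mutually exclusive")
        # While the operation is still running, neither field is set.
        if not self.done and (self.error is not None or self.response is not None):
            raise ValueError("result is only set when done is true")
```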
API Surface#
POST /v1/datasets/ds1:export → returns Operation
GET /v1/operations/op_abc123 → returns Operation (poll)
POST /v1/operations/op_abc123:cancel → cancels the operation
POST /v1/operations/op_abc123:wait → long-poll until done or timeout
DELETE /v1/operations/op_abc123 → delete operation record
GET /v1/operations → list operations (filter by status)
Long-Poll with :wait#
Instead of repeated short polls, the client sends one request that blocks until the operation completes or a timeout is reached:
Client → POST /v1/operations/op_1:wait
{ "timeout": "30s" }
Server holds connection open...
↓ operation completes at 12s
Server → 200 { done: true, response: {...} }
If the timeout expires before completion, the server returns the current state and the client can call :wait again.
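Inside a single process, the blocking behaviour can be sketched with `threading.Event`. `WaitableOperation` is a hypothetical class for illustration; a multi-server deployment would need a shared notification mechanism (for example pub/sub) instead of an in-process event.

```python
import threading


class WaitableOperation:
    """Operation whose completion can be awaited with a timeout."""

    def __init__(self, name):
        self.name = name
        self.response = None
        self._done = threading.Event()

    def finish(self, response):
        """Called by the worker when the job completes."""
        self.response = response
        self._done.set()

    def wait(self, timeout_seconds):
        """Block until done or the timeout expires; return the current state."""
        self._done.wait(timeout_seconds)
        state = {"name": self.name, "done": self._done.is_set()}
        if state["done"]:
            state["response"] = self.response
        return state
```

If `wait` returns with `done` false, the caller simply issues another wait, exactly as described above.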
Cancellation#
Cancel Endpoint#
Client → POST /api/operations/op_abc123/cancel
Server → 200 OK
{
"operationId": "op_abc123",
"status": "cancelling"
}
(Worker receives cancellation signal, cleans up)
Client → GET /api/operations/op_abc123
Server → 200 OK
{
"operationId": "op_abc123",
"status": "cancelled",
"cancelledAt": "2026-03-29T10:03:00Z"
}
Cancellation Is Not Instant#
Status transitions:
pending → running → cancelling → cancelled
                  → completed
                  → failed
                  → timed_out
The cancelling state gives the worker time to clean up partial results, release resources, and roll back if needed.
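Encoding the transitions as a table makes illegal moves (such as cancelling an already completed operation) easy to reject. The sketch below includes only the transitions listed in this section; whether a pending operation may be cancelled directly is not specified here, so it is left out. `TRANSITIONS` and `transition` are illustrative names.

```python
# Allowed next states for each status, taken from the transitions above.
TRANSITIONS = {
    "pending": {"running"},
    "running": {"cancelling", "completed", "failed", "timed_out"},
    "cancelling": {"cancelled"},
    # Terminal states have no outgoing transitions.
    "cancelled": set(),
    "completed": set(),
    "failed": set(),
    "timed_out": set(),
}


def transition(current, target):
    """Return target if the move is allowed; otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```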
Architecture Overview#
┌──────────────┐
Client ──→ API ──→ │ Job Queue │
│ │ (Redis/SQS) │
│ └──────┬───────┘
│ │
│ ┌──────▼───────┐
│ │ Workers │
│ │ (processing) │
│ └──────┬───────┘
│ │
│ ┌──────▼───────┐
└──GET── │ Operations │
│ Store │
│ (DB/Redis) │
└──────────────┘
Choosing the Right Pattern#
| Need | Pattern |
| --- | --- |
| Simple, infrequent ops | Polling with Retry-After |
| Real-time progress | SSE or WebSocket |
| Server-to-server | Webhooks |
| Standard API design | Google LRO pattern |
| Mix of clients | Polling + webhook option |
Best Practices#
- Always return 202 for accepted async operations, never 200
- Include a status URL in the initial response so clients know where to check
- Support both polling and webhooks when possible for flexibility
- Set operation TTLs so completed operations are cleaned up after a fixed retention period (days or weeks)
- Make operations idempotent with client-supplied idempotency keys
- Include error details when operations fail, with retry guidance
- Track operation metadata (who started it, when, parameters used) for debugging
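The idempotency-key practice can be sketched as a lookup performed before any work is queued. `IdempotentSubmitter` is a hypothetical name and the in-memory dict stands in for a durable store; the point is that a retried request with the same key returns the original operation instead of starting a second one.

```python
import uuid


class IdempotentSubmitter:
    """Creates at most one operation per client-supplied idempotency key."""

    def __init__(self):
        self._by_key = {}

    def submit(self, idempotency_key, params):
        """Return the operation for this key, creating it only on first use."""
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            return existing  # retry of an earlier request: no new job
        op = {
            "operationId": f"op_{uuid.uuid4().hex[:8]}",
            "status": "pending",
            "params": params,
        }
        self._by_key[idempotency_key] = op
        return op
```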
Article #448 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.