API design, async, long-running operations, webhooks, system design, REST

Long-Running API Operations: Async Patterns, Polling, Webhooks, and the Google LRO Pattern

March 29, 2026 · 6 min read · By Codelit Team

Long-Running API Operations: Beyond Request-Response#

Some operations take seconds, minutes, or even hours. Video transcoding, ML model training, large data exports, and some payment flows often cannot return a result within a single HTTP request before a timeout hits. For these you need asynchronous patterns.

The Problem with Synchronous APIs#

Client → POST /api/export (1M rows)
         ↓
         Waiting... 30s... 60s... 90s...
         ↓
         504 Gateway Timeout

Load balancers, reverse proxies, and clients all have timeout limits. Long operations need a different approach.

Pattern 1: Async Request-Response#

Accept the request immediately, process in the background, return a reference.

Client → POST /api/exports
Server → 202 Accepted
         {
           "operationId": "op_abc123",
           "status": "pending",
           "statusUrl": "/api/operations/op_abc123"
         }

         (Background: worker processes the export)

Client → GET /api/operations/op_abc123
Server → 200 OK
         {
           "operationId": "op_abc123",
           "status": "completed",
           "result": { "downloadUrl": "/api/exports/op_abc123/download" }
         }

Key: Return 202 Accepted (not 200 OK) to signal the request was received but not yet fulfilled.
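A minimal sketch of this flow, using an in-memory dictionary in place of a real operations store. The function names (submit_export, get_operation) and the op_ ID format are assumptions for illustration, not part of any particular framework.

```python
import uuid

# In-memory stand-in for a persistent operations store (assumption).
OPERATIONS: dict[str, dict] = {}


def submit_export(params: dict) -> tuple[int, dict]:
    """Record the operation and return (HTTP status, body) right away."""
    operation_id = f"op_{uuid.uuid4().hex[:12]}"
    OPERATIONS[operation_id] = {
        "operationId": operation_id,
        "status": "pending",
        "params": params,
        "result": None,
    }
    body = {
        "operationId": operation_id,
        "status": "pending",
        "statusUrl": f"/api/operations/{operation_id}",
    }
    # 202: the request was accepted, but the work has not been done yet.
    return 202, body


def get_operation(operation_id: str) -> tuple[int, dict]:
    """Return the current state, or 404 if the ID is unknown."""
    op = OPERATIONS.get(operation_id)
    if op is None:
        return 404, {"error": "operation not found"}
    body = {"operationId": op["operationId"], "status": op["status"]}
    if op["status"] == "completed":
        body["result"] = op["result"]
    return 200, body
```

In a real service a background worker, not the request handler, would move the stored operation from pending to completed.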

Pattern 2: Polling#

The client periodically checks the operation status.

Basic Polling#

Client                              Server
  │                                    │
  ├─ POST /api/reports ──────────────→ │ 202 { operationId: "op_1" }
  │                                    │
  │  (wait 2s)                         │
  ├─ GET /api/operations/op_1 ───────→ │ 200 { status: "running" }
  │                                    │
  │  (wait 2s)                         │
  ├─ GET /api/operations/op_1 ───────→ │ 200 { status: "running" }
  │                                    │
  │  (wait 2s)                         │
  ├─ GET /api/operations/op_1 ───────→ │ 200 { status: "completed", result: {...} }
  │                                    │

Smart Polling with Retry-After#

Tell clients when to check back by sending a Retry-After header (a delay in seconds) with each status response:

HTTP/1.1 200 OK
Retry-After: 5

{
  "operationId": "op_abc123",
  "status": "running",
  "progress": 45,
  "estimatedTimeRemaining": "12s"
}
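On the client side, honoring the header could look like the sketch below. The helper name next_poll_delay, the 2-second default, and the 60-second ceiling are assumptions chosen for illustration.

```python
def next_poll_delay(headers: dict[str, str], default: float = 2.0,
                    maximum: float = 60.0) -> float:
    """Return how many seconds to wait before the next status request."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get("retry-after")
    if raw is None:
        return default
    try:
        seconds = int(raw.strip())
    except ValueError:
        # Retry-After may also carry an HTTP date; this sketch does not
        # parse dates and falls back to the default instead.
        return default
    if seconds < 0:
        return default
    # Cap the server's hint so a bad value cannot stall the client.
    return float(min(seconds, maximum))
```

Capping the hint is a defensive choice: the client still follows the server's pacing but never waits longer than it is prepared to.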

Exponential Backoff Polling#

Poll 1:  wait 1s  → GET /operations/op_1 → running
Poll 2:  wait 2s  → GET /operations/op_1 → running
Poll 3:  wait 4s  → GET /operations/op_1 → running
Poll 4:  wait 8s  → GET /operations/op_1 → completed
Max:     wait 30s (cap the backoff)
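The schedule above can be sketched as a small generator plus a polling loop. The names backoff_delays and poll_until_done, the max_polls limit, and the injectable sleep function are assumptions; fetch_status stands in for whatever HTTP call the client makes.

```python
import time
from typing import Callable, Iterator


def backoff_delays(initial: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0) -> Iterator[float]:
    """Yield 1s, 2s, 4s, 8s, ... and never exceed the cap."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def poll_until_done(fetch_status: Callable[[], dict],
                    max_polls: int = 20,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll with exponential backoff until a terminal status is seen."""
    terminal = {"completed", "failed", "cancelled", "timed_out"}
    delays = backoff_delays()
    for _ in range(max_polls):
        sleep(next(delays))
        state = fetch_status()
        if state["status"] in terminal:
            return state
    raise TimeoutError("operation did not finish within max_polls")
```

Clients that poll in large numbers often add a small random jitter to each delay so they do not all hit the server at the same moment; that is left out here to keep the schedule easy to read.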

Pattern 3: Webhooks for Completion#

Instead of polling, the server notifies the client when done.

Client                              Server                     Worker
  │                                    │                          │
  ├─ POST /api/exports ──────────────→ │                          │
  │  { callbackUrl: "https://..."}     │                          │
  │                                    ├─ Queue job ────────────→ │
  │ ←── 202 { operationId: "op_1" }   │                          │
  │                                    │                          │
  │    (time passes)                   │                          │
  │                                    │                          │
  │                                    │ ←── Job complete ────────┤
  │ ←── POST to callbackUrl ──────────┤                          │
  │    { operationId: "op_1",          │                          │
  │      status: "completed",          │                          │
  │      result: {...} }               │                          │

Webhook Security#

Sign webhook payloads so clients can verify authenticity:

POST /webhook/exports HTTP/1.1
Content-Type: application/json
X-Signature: sha256=a1b2c3d4...
X-Timestamp: 1711616400

{ "operationId": "op_1", "status": "completed" }

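Clients verify: HMAC-SHA256(key = secret, message = timestamp + "." + body) == signature. Compare the two values in constant time, and reject requests whose timestamp is too old so a captured payload cannot be replayed later.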
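A minimal sketch of signing and verification with Python's standard library. The function names, the 300-second tolerance window, and the injectable now parameter are assumptions; the header layout follows the example request above.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    """Build the X-Signature value for a timestamp and raw request body."""
    message = timestamp.encode("utf-8") + b"." + body
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def verify(secret: bytes, timestamp: str, body: bytes, signature: str,
           tolerance_seconds: int = 300,
           now: Optional[float] = None) -> bool:
    """Check the timestamp's freshness, then the signature."""
    current = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Reject stale (or far-future) timestamps to limit replay.
    if abs(current - sent_at) > tolerance_seconds:
        return False
    expected = sign(secret, timestamp, body)
    # compare_digest runs in constant time to avoid timing leaks.
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature.encode("utf-8"))
```

The signature must be computed over the raw request bytes, before any JSON parsing, because re-serializing the body can change whitespace or key order and break the match.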

Webhook Retry Strategy#

Attempt 1:  immediate     → POST callback → timeout
Attempt 2:  wait 10s      → POST callback → 500 error
Attempt 3:  wait 30s      → POST callback → 200 OK ✓

Max retries: 5
Max delay: 1 hour
Dead letter: store failed webhooks for manual retry
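One way to sketch this strategy is shown below. It reads "max retries: 5" as five delivery attempts in total, which is an assumption. The first three delays come from the schedule above; the last two (300s and the 1-hour cap) and the names deliver_with_retries, send, and dead_letters are also assumptions.

```python
import time
from typing import Callable

# Delay in seconds before each attempt; the final value is the 1-hour cap.
RETRY_DELAYS = [0, 10, 30, 300, 3600]


def deliver_with_retries(send: Callable[[dict], int], dead_letters: list,
                         payload: dict,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to deliver a webhook; park it in dead_letters if all attempts fail."""
    for delay in RETRY_DELAYS:
        if delay:
            sleep(delay)
        try:
            status = send(payload)  # expected to return the HTTP status code
        except (TimeoutError, ConnectionError):
            status = None           # treat network failures like a bad status
        if status is not None and 200 <= status < 300:
            return True
    # Every attempt failed: keep the payload for manual retry.
    dead_letters.append({"payload": payload, "attempts": len(RETRY_DELAYS)})
    return False
```

Because a delivery can succeed on the receiver's side while the acknowledgement is lost, receivers should expect to see the same webhook more than once and deduplicate on operationId.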

Progress Tracking#

For operations that take minutes, show meaningful progress.

Progress Endpoint#

GET /api/operations/op_abc123

{
  "operationId": "op_abc123",
  "status": "running",
  "progress": {
    "percent": 67,
    "currentStep": "Processing records",
    "stepsCompleted": 2,
    "totalSteps": 3,
    "recordsProcessed": 670000,
    "totalRecords": 1000000
  },
  "createdAt": "2026-03-29T10:00:00Z",
  "estimatedCompletion": "2026-03-29T10:05:30Z"
}
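The percent and estimatedCompletion fields have to come from somewhere. A simple approach, sketched here as an assumption rather than the method any specific service uses, is linear extrapolation from the records processed so far. The helper name build_progress is hypothetical, and the datetimes are assumed to be in UTC.

```python
from datetime import datetime, timedelta
from typing import Optional


def build_progress(records_processed: int, total_records: int,
                   created_at: datetime, now: datetime) -> dict:
    """Derive percent and a linear completion estimate (UTC assumed)."""
    percent = 100 * records_processed // total_records if total_records else 0
    estimated: Optional[str] = None
    if 0 < records_processed < total_records:
        elapsed = (now - created_at).total_seconds()
        # Assume the remaining records take as long per record as the
        # ones already processed.
        remaining = elapsed * (total_records - records_processed) / records_processed
        estimated = (now + timedelta(seconds=remaining)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "percent": percent,
        "recordsProcessed": records_processed,
        "totalRecords": total_records,
        "estimatedCompletion": estimated,
    }
```

Linear estimates are rough when steps have uneven cost, so treat the value as a hint for the user interface rather than a guarantee.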

Server-Sent Events for Real-Time Progress#

Client → GET /api/operations/op_1/stream
         Accept: text/event-stream

Server → event: progress
         data: {"percent": 25, "step": "Fetching data"}

         event: progress
         data: {"percent": 50, "step": "Processing records"}

         event: progress
         data: {"percent": 100, "step": "Generating file"}

         event: complete
         data: {"downloadUrl": "/exports/op_1/download"}
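On the wire, each event is an "event:" line, a "data:" line, and a blank line. A small sketch of producing that framing follows; format_sse and progress_stream are hypothetical helper names, and a real server would write these chunks to a response with Content-Type text/event-stream.

```python
import json
from typing import Iterable, Iterator, Tuple


def format_sse(event: str, data: dict) -> str:
    """Encode one Server-Sent Event; the blank line ends the event."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def progress_stream(steps: Iterable[Tuple[int, str]],
                    download_url: str) -> Iterator[str]:
    """Yield progress events, then a final complete event."""
    for percent, step in steps:
        yield format_sse("progress", {"percent": percent, "step": step})
    yield format_sse("complete", {"downloadUrl": download_url})
```

json.dumps keeps each payload on a single line by default, which matters because a raw newline inside a data field would split it across multiple data lines.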

Timeout Handling#

Operation-Level Timeouts#

Client → POST /api/exports
         { "timeout": 300 }  // 5 minute max

Server → starts processing
         ↓
         If not done in 300s:
           status → "timed_out"
           cleanup resources
           notify client

Timeout Response#

GET /api/operations/op_abc123

{
  "operationId": "op_abc123",
  "status": "timed_out",
  "error": {
    "code": "OPERATION_TIMEOUT",
    "message": "Operation exceeded maximum duration of 300s",
    "retryable": true
  }
}
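A sketch of how a server might apply the deadline when it reads or updates an operation. The field names startedAt and timeoutSeconds and the function enforce_timeout are assumptions; the error body mirrors the response above.

```python
def enforce_timeout(operation: dict, now: float) -> dict:
    """Return the operation, marked timed_out if its deadline has passed."""
    terminal = {"completed", "failed", "cancelled", "timed_out"}
    if operation["status"] in terminal:
        return operation  # finished operations are never rewritten
    limit = operation["timeoutSeconds"]
    if now < operation["startedAt"] + limit:
        return operation
    # Return a copy so the caller decides when to persist the change.
    return {
        **operation,
        "status": "timed_out",
        "error": {
            "code": "OPERATION_TIMEOUT",
            "message": f"Operation exceeded maximum duration of {limit}s",
            "retryable": True,
        },
    }
```

Marking the record is only half the job: the worker also needs to observe the deadline and stop, otherwise it keeps consuming resources after the client has been told the operation timed out.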

The Google LRO Pattern#

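Google's Long Running Operations (LRO) API, defined as google.longrunning.Operations and documented in AIP-151, is a widely adopted convention across Google Cloud APIs. It models operations as first-class resources.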

Resource Model#

Operation {
  name: "operations/export-abc123"
  metadata: { type-specific progress info }
  done: false
  result: oneof {
    error: Status { code, message }
    response: { the actual result }
  }
}
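The same shape can be mirrored in application code. The dataclasses below are an illustrative sketch, not Google's generated client types; the validation in __post_init__ is one way to enforce the "oneof" rule that error and response never appear together.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Status:
    code: int
    message: str


@dataclass
class Operation:
    name: str
    metadata: dict = field(default_factory=dict)
    done: bool = False
    error: Optional[Status] = None
    response: Optional[Any] = None

    def __post_init__(self) -> None:
        has_error = self.error is not None
        has_response = self.response is not None
        # "oneof": at most one of error / response may be set.
        if has_error and has_response:
            raise ValueError("error and response are mutually exclusive")
        # A result only makes sense once the operation has finished.
        if not self.done and (has_error or has_response):
            raise ValueError("a result is only valid once done is true")
```

Clients then branch on done first, and only afterwards on whether error or response is present.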

API Surface#

POST   /v1/datasets/ds1:export       → returns Operation
GET    /v1/operations/op_abc123       → returns Operation (poll)
POST   /v1/operations/op_abc123:cancel → cancels the operation
POST   /v1/operations/op_abc123:wait   → long-poll until done or timeout
DELETE /v1/operations/op_abc123       → delete operation record
GET    /v1/operations                 → list operations (filter by status)

Long-Poll with :wait#

Instead of repeated short polls, the client sends one request that blocks until the operation completes or a timeout is reached:

Client → POST /v1/operations/op_1:wait
         { "timeout": "30s" }

Server holds connection open...
  ↓ operation completes at 12s
Server → 200 { done: true, response: {...} }

If the timeout expires before completion, the server returns the current state and the client can call :wait again.
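Server-side, the blocking behavior can be sketched with a threading.Event that the worker sets on completion. WaitableOperation is a hypothetical in-process class; a multi-instance deployment would need a shared notification mechanism instead.

```python
import threading
from typing import Optional


class WaitableOperation:
    def __init__(self, name: str) -> None:
        self.name = name
        self.response: Optional[dict] = None
        self._done = threading.Event()

    def complete(self, response: dict) -> None:
        """Called by the worker when the operation finishes."""
        self.response = response
        self._done.set()

    def wait(self, timeout_seconds: float) -> dict:
        """Block until done or until the timeout, then return current state."""
        self._done.wait(timeout=timeout_seconds)
        state = {"name": self.name, "done": self._done.is_set()}
        if state["done"]:
            state["response"] = self.response
        return state
```

Keep the wait timeout below the idle timeouts of any load balancer or proxy in the path, otherwise the long-poll hits the same gateway timeout the async design was meant to avoid.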

Cancellation#

Cancel Endpoint#

Client → POST /api/operations/op_abc123/cancel

Server → 200 OK
{
  "operationId": "op_abc123",
  "status": "cancelling"
}

(Worker receives cancellation signal, cleans up)

Client → GET /api/operations/op_abc123
Server → 200 OK
{
  "operationId": "op_abc123",
  "status": "cancelled",
  "cancelledAt": "2026-03-29T10:03:00Z"
}

Cancellation Is Not Instant#

Status transitions:
  pending → running → cancelling → cancelled
                   → completed
                   → failed
                   → timed_out

The cancelling state gives the worker time to clean up partial results, release resources, and roll back if needed.
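These transitions can be enforced with a small lookup table. The sketch below encodes exactly the arrows in the diagram above; whether a pending operation may be cancelled directly is not stated there, so it is not allowed here. The names ALLOWED_TRANSITIONS, transition, and is_terminal are illustrative.

```python
ALLOWED_TRANSITIONS = {
    "pending": {"running"},
    "running": {"cancelling", "completed", "failed", "timed_out"},
    "cancelling": {"cancelled"},
    # Terminal states have no outgoing transitions.
    "cancelled": set(),
    "completed": set(),
    "failed": set(),
    "timed_out": set(),
}


class InvalidTransition(Exception):
    pass


def transition(current: str, new: str) -> str:
    """Return the new status if the move is allowed, else raise."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {new} is not allowed")
    return new


def is_terminal(status: str) -> bool:
    """A known status with no outgoing transitions is terminal."""
    return status in ALLOWED_TRANSITIONS and not ALLOWED_TRANSITIONS[status]
```

Rejecting a cancel request for an already completed operation with a clear error is usually friendlier than silently accepting it.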

Architecture Overview#

                    ┌──────────────┐
Client ──→ API ──→  │  Job Queue   │
           │        │ (Redis/SQS)  │
           │        └──────┬───────┘
           │               │
           │        ┌──────▼───────┐
           │        │   Workers    │
           │        │ (processing) │
           │        └──────┬───────┘
           │               │
           │        ┌──────▼───────┐
           └──GET── │  Operations  │
                    │    Store     │
                    │ (DB/Redis)   │
                    └──────────────┘
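The diagram can be approximated in a single process to see how the pieces interact. In this sketch queue.Queue stands in for Redis/SQS and a dictionary for the operations store; the function names and the trivial "sum the rows" job are assumptions for illustration.

```python
import queue
import threading

job_queue: queue.Queue = queue.Queue()      # stand-in for Redis/SQS
operations_store: dict[str, dict] = {}      # stand-in for DB/Redis
store_lock = threading.Lock()


def api_submit(operation_id: str, rows: list[int]) -> dict:
    """API side: record the operation, enqueue the job, return at once."""
    with store_lock:
        operations_store[operation_id] = {"status": "pending", "result": None}
    job_queue.put((operation_id, rows))
    return {"operationId": operation_id, "status": "pending"}


def api_get(operation_id: str) -> dict:
    """API side: read the current state from the store."""
    with store_lock:
        return dict(operations_store[operation_id])


def worker() -> None:
    """Worker side: pull jobs and write status changes to the store."""
    while True:
        job = job_queue.get()
        if job is None:                      # sentinel: shut down
            job_queue.task_done()
            break
        operation_id, rows = job
        with store_lock:
            operations_store[operation_id]["status"] = "running"
        try:
            update = {"status": "completed", "result": {"total": sum(rows)}}
        except Exception as exc:
            update = {"status": "failed", "error": str(exc)}
        with store_lock:
            operations_store[operation_id].update(update)
        job_queue.task_done()
```

The API never talks to the worker directly: both sides only share the queue and the store, which is what lets them scale and fail independently.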

Choosing the Right Pattern#

Need                      Pattern
Simple, infrequent ops    Polling with Retry-After
Real-time progress        SSE or WebSocket
Server-to-server          Webhooks
Standard API design       Google LRO pattern
Mix of clients            Polling + webhook option

Best Practices#

Always return 202 for accepted async operations, never 200
Include a status URL in the initial response so clients know where to check
Support both polling and webhooks when possible for flexibility
Set operation TTLs so completed operations are cleaned up after days/weeks
Make operations idempotent with client-supplied idempotency keys
Include error details when operations fail, with retry guidance
Track operation metadata (who started it, when, parameters used) for debugging
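The idempotency-key practice is the least obvious of these, so here is a minimal sketch. The in-memory dictionary, the function name create_operation, and the choice of 409 for a key reused with different parameters are all assumptions; APIs differ on that last status code.

```python
import uuid

# Maps idempotency key -> the original request and the operation it created.
_operations_by_key: dict[str, dict] = {}


def create_operation(idempotency_key: str, params: dict) -> tuple[int, dict]:
    """Create an operation once per key; replays return the same operation."""
    existing = _operations_by_key.get(idempotency_key)
    if existing is not None:
        if existing["params"] != params:
            # Same key, different request: refuse rather than guess.
            return 409, {"error": "idempotency key reused with different parameters"}
        return 202, existing["operation"]
    operation = {"operationId": f"op_{uuid.uuid4().hex[:12]}", "status": "pending"}
    _operations_by_key[idempotency_key] = {"params": params, "operation": operation}
    return 202, operation
```

With this in place, a client that times out while submitting can safely retry with the same key and will not start a second export.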

Article #448 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.
