
Webhook Delivery Guarantees — At-Least-Once, Retries, HMAC & Dead Letters

March 29, 2026 · 7 min read · By Codelit Team

Why webhooks fail#

You send a POST request to your customer's endpoint. Their server returns a 500. Or it times out. Or DNS fails. Or they deployed and the endpoint is gone for 30 seconds. Webhooks fail constantly — and if you do not handle failures, your customers lose data.

Building reliable webhook delivery means accepting that every HTTP call can fail and designing around it.

Delivery semantics#

At-most-once#

Fire and forget. Send the webhook once. If it fails, the event is lost. Simple to implement, useless for anything important.

At-least-once#

Retry on failure until you get a success (2xx response). The consumer might receive the same event multiple times. This is the standard for webhook systems: Stripe and Shopify, for example, automatically retry failed deliveries.

Exactly-once#

Impossible to guarantee across a network boundary. You can approximate it with at-least-once delivery on the sender side and idempotency on the receiver side.

Always design for at-least-once. It is the only practical guarantee for webhooks.

Retry with exponential backoff#

When a delivery fails, you need to retry. But hammering a failing endpoint every second makes things worse. Exponential backoff spaces out retries:

Attempt 1: immediate
Attempt 2: 30 seconds
Attempt 3: 2 minutes
Attempt 4: 8 minutes
Attempt 5: 30 minutes
Attempt 6: 2 hours
Attempt 7: 8 hours
Attempt 8: 24 hours

Jitter#

If 10,000 webhooks fail at the same time (the consumer had a brief outage), all 10,000 retries will fire at the same backoff interval — creating a thundering herd. Add random jitter:

delay = base_delay * (2 ^ attempt) + random(0, base_delay)
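
A minimal Python sketch of that formula follows. The function name `backoff_delay`, the 30-second base, and the 24-hour cap are illustrative assumptions rather than values from any particular provider. Note that `^` in the formula means exponentiation; in Python that is `**`, because `^` is bitwise XOR.

```python
import random
from typing import Optional


def backoff_delay(attempt: int,
                  base_delay: float = 30.0,
                  max_delay: float = 24 * 3600.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return the seconds to wait before retry number `attempt` (0-based)."""
    source = rng if rng is not None else random
    exponential = base_delay * (2 ** attempt)   # 30s, 60s, 120s, ...
    jitter = source.uniform(0, base_delay)      # spreads retries apart
    return min(exponential + jitter, max_delay)  # assumed cap, not from the article
```

Passing a seeded `random.Random` makes the delays reproducible in tests; in production the module-level generator is usually enough.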

Retry budget#

Set a maximum number of retries (typically 5–8) and a maximum retry window (24–72 hours). After exhausting retries, move the event to a dead letter queue.

Retry-After header#

Respect the Retry-After response header if the consumer sends one. It tells you exactly when to try again — often more useful than your backoff schedule.
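
Retry-After can carry either a number of seconds or an HTTP date, so a sender has to handle both forms. The sketch below is one way to normalize the value; `retry_after_seconds` is a hypothetical helper name, and a real sender would probably also clamp the result to its own retry window.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After value to seconds, or None if absent or unparseable."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():                      # delta-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None                          # fall back to the normal backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```

When the function returns None, the caller falls back to its own backoff schedule.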

HMAC signature verification#

How does the consumer know the webhook actually came from you and was not tampered with? HMAC signatures.

How it works#

When the customer registers a webhook endpoint, generate a shared secret
On every delivery, compute HMAC-SHA256(secret, request_body) and include it in a header (e.g., X-Webhook-Signature)
The consumer computes the same HMAC with their copy of the secret and compares

Implementation details#

Use the raw request body for HMAC computation — not a parsed-and-reserialized version
Include a timestamp in the signed payload to prevent replay attacks: signed_payload = timestamp + "." + body
Use constant-time comparison to prevent timing attacks
Rotate secrets — provide an endpoint for customers to rotate their webhook secret without downtime (sign with both old and new secret during rotation)
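
The points above can be sketched with Python's standard `hmac` module. This is an illustration under assumptions, not any provider's exact scheme: the function names, the 300-second tolerance, and the idea of sending a list of hex signatures are all hypothetical choices.

```python
import hashlib
import hmac
import time
from typing import Iterable, List, Optional


def sign_all(secrets: Iterable[bytes], timestamp: int, body: bytes) -> List[str]:
    """Sender side: one signature per active secret (old and new during rotation)."""
    signed_payload = str(timestamp).encode() + b"." + body  # timestamp + "." + raw body
    return [hmac.new(s, signed_payload, hashlib.sha256).hexdigest() for s in secrets]


def verify(secret: bytes, timestamp: int, body: bytes, signatures: Iterable[str],
           tolerance_seconds: int = 300, now: Optional[float] = None) -> bool:
    """Receiver side: accept if any presented signature matches our secret."""
    current = time.time() if now is None else now
    if abs(current - timestamp) > tolerance_seconds:
        return False  # stale or future-dated: treat as a possible replay
    expected = sign_all([secret], timestamp, body)[0]
    for candidate in signatures:
        # compare_digest runs in constant time with respect to the contents
        if hmac.compare_digest(expected.encode(), candidate.encode()):
            return True
    return False
```

Because the sender signs with both secrets during rotation, a receiver holding either the old or the new secret keeps verifying successfully until the old one is retired.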

Common header conventions#

Provider	Signature header
Stripe	`Stripe-Signature` (includes timestamp + signature)
GitHub	`X-Hub-Signature-256`
Shopify	`X-Shopify-Hmac-Sha256`
Svix	`svix-signature` (includes timestamp)

Idempotency keys#

At-least-once means the consumer will receive duplicates. Idempotency keys let them deduplicate safely.

How it works#

Include a unique event_id (or idempotency-key header) with every webhook delivery. The consumer stores processed event IDs. Before processing, check if the event ID has been seen:

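processed_events = set()  # use Redis or a database in production

def handle(event_id, event):
    if event_id in processed_events:
        return 200  # Already handled, skip
    process(event)  # placeholder for your own business logic
    processed_events.add(event_id)
    return 200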

Best practices#

Generate the ID on the sender side — the same event must always have the same ID across retries
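Use UUIDs or deterministic hashes: SHA256(event_type + entity_id + timestamp) works well, as long as timestamp is the event's creation time (not the delivery time) so the ID stays the same across retries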
Set a TTL on the deduplication store — you do not need to remember events forever. 72 hours covers most retry windows
Store IDs in Redis or a database — in-memory sets are lost on restart
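
As a rough sketch of these practices, the Python below derives a deterministic ID and keeps a deduplication store with a TTL. The names `event_id` and `DedupStore` are hypothetical, and the dictionary is only a stand-in for Redis or a database table, which is what you would use so that entries survive a restart.

```python
import hashlib
import time
from typing import Callable, Dict


def event_id(event_type: str, entity_id: str, created_at: str) -> str:
    """Deterministic ID: the same event always hashes to the same value."""
    # A newline delimiter keeps ("ab", "c") and ("a", "bc") from colliding.
    material = "\n".join([event_type, entity_id, created_at]).encode()
    return hashlib.sha256(material).hexdigest()


class DedupStore:
    """In-memory stand-in for a persistent store with key expiry."""

    def __init__(self, ttl_seconds: float = 72 * 3600,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def mark_if_new(self, key: str) -> bool:
        """Record `key`; return True only the first time it is seen within the TTL."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate delivery
        self._expires_at[key] = now + self._ttl
        return True
```

In Redis, the same check-and-set is commonly done in one atomic step with `SET` and its `NX` and `EX` options, which also avoids a race between two workers handling the same duplicate.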

Dead letter handling#

After all retries are exhausted, the event goes to a dead letter queue (DLQ). Do not silently drop it.

What belongs in the DLQ#

The full event payload
Delivery metadata: endpoint URL, HTTP status codes from each attempt, timestamps
Error details: timeout, connection refused, 4xx vs 5xx
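
One possible shape for a dead letter record, sketched as Python dataclasses. The class and field names are assumptions for illustration; the point is only that the payload, the per-attempt metadata, and the error details are stored together.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class DeliveryAttempt:
    attempted_at: str             # ISO 8601 timestamp of the attempt
    status_code: Optional[int]    # None when no HTTP response was received
    error: Optional[str] = None   # e.g. "timeout", "connection refused"


@dataclass
class DeadLetter:
    event_id: str
    endpoint_url: str
    payload: str                  # the full event body, exactly as sent
    attempts: List[DeliveryAttempt] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize for storage in a queue or database."""
        return json.dumps(asdict(self))
```

Keeping every attempt in the record lets a dashboard show whether the endpoint failed with 4xx responses, 5xx responses, or never answered at all.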

What to do with dead letters#

Alert the customer — send an email or dashboard notification that deliveries are failing
Provide a replay endpoint — let customers manually retry failed events once they fix their endpoint
Auto-disable endpoints — after sustained failures (e.g., 3 days of failures), disable the webhook and notify the customer. Do not keep burning resources on a dead endpoint.
Provide event logs — a searchable log of all deliveries (successful and failed) with request and response details

Automatic endpoint disabling#

A reasonable policy:

After 3 consecutive days of 100% failure rate, disable the endpoint
Send a warning email after 1 day of failures
Require the customer to manually re-enable and verify the endpoint
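
The policy above could be expressed as a small decision function. This sketch assumes the sender tracks, per endpoint, the time of the first failure since the last successful delivery; `endpoint_action` and the returned strings are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Optional

WARN_AFTER = timedelta(days=1)      # send the warning email
DISABLE_AFTER = timedelta(days=3)   # disable the endpoint


def endpoint_action(failing_since: Optional[datetime], now: datetime) -> str:
    """Return "ok", "warn", or "disable" for an endpoint.

    `failing_since` is the first failure after the last success, or None if
    the most recent delivery succeeded (so any success resets the clock).
    """
    if failing_since is None:
        return "ok"
    failing_for = now - failing_since
    if failing_for >= DISABLE_AFTER:
        return "disable"
    if failing_for >= WARN_AFTER:
        return "warn"
    return "ok"
```

Re-enabling is deliberately left out of the function: per the policy, it is a manual step taken by the customer.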

Webhook infrastructure: build vs buy#

Building in-house#

You need:

A durable event queue (SQS, RabbitMQ, Kafka)
A delivery worker with retry logic and backoff
HMAC signing
A dead letter queue
A customer-facing dashboard for logs, retries, and endpoint management
Monitoring: delivery latency, success rate, DLQ depth

This is a significant amount of infrastructure to build and maintain correctly.

Svix — webhook infrastructure as a service#

Svix is an open-source webhook delivery platform that handles all of the above:

At-least-once delivery with configurable retry schedules
HMAC signatures with automatic secret rotation
Idempotency built in
Customer portal — embeddable UI for your customers to manage their endpoints
Event catalog — typed event schemas with versioning
Operational dashboard — delivery logs, success rates, latency metrics

You can self-host the open-source version or use the managed service. Other options include Hookdeck and AWS EventBridge for event routing.

Sender-side architecture#

A well-designed webhook sender looks like this:

Event producer — your application emits events (e.g., payment.completed)
Event queue — durable queue buffers events (SQS, Kafka)
Delivery workers — pull from queue, look up registered endpoints, deliver via HTTP
Retry queue — failed deliveries go back with backoff metadata
Dead letter queue — exhausted retries land here
API for consumers — register endpoints, view logs, replay events, rotate secrets
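
To make the flow between the delivery workers, the retry queue, and the dead letter queue concrete, here is a deliberately simplified in-memory sketch. `Job`, `run_worker`, and the injected `send` callable are hypothetical; a real system would use a durable queue and delay each retry by its backoff instead of re-queuing immediately.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Job:
    event_id: str
    url: str
    body: bytes
    attempt: int = 0  # number of failed attempts so far


def run_worker(queue: Deque[Job], send: Callable[[Job], int],
               max_attempts: int = 8) -> List[Job]:
    """Drain `queue`; return the jobs that exhausted their retries."""
    dead_letters: List[Job] = []
    while queue:
        job = queue.popleft()
        try:
            delivered = 200 <= send(job) < 300  # only 2xx counts as success
        except Exception:
            delivered = False  # timeouts, DNS failures, refused connections
        if delivered:
            continue
        job.attempt += 1
        if job.attempt >= max_attempts:
            dead_letters.append(job)  # never silently dropped
        else:
            queue.append(job)  # stand-in for the retry queue with backoff
    return dead_letters
```

Injecting `send` keeps the loop testable without a network; in production it would wrap an HTTP client call with a per-attempt timeout.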

Scaling considerations#

Fan-out: One event may need delivery to hundreds of endpoints (multi-tenant SaaS). Use a fan-out step between the event queue and delivery workers.
Rate limiting per endpoint: Do not overwhelm a consumer with 1,000 concurrent deliveries. Queue per endpoint and throttle.
Timeout budget: Set a 30-second timeout per delivery attempt. Slow consumers should not block your workers.
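
For per-endpoint throttling, a token bucket is one common approach; the article does not prescribe a specific algorithm, so treat this as an assumption. The class names and the default rate and burst values below are illustrative only.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows `burst` immediate sends, then refills at `rate_per_second`."""

    def __init__(self, rate_per_second: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class EndpointLimiter:
    """One bucket per endpoint URL, so a slow consumer only throttles itself."""

    def __init__(self, rate_per_second: float = 10.0, burst: int = 20,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate, self._burst, self._clock = rate_per_second, burst, clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, endpoint_url: str) -> bool:
        bucket = self._buckets.get(endpoint_url)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._burst, self._clock)
            self._buckets[endpoint_url] = bucket
        return bucket.try_acquire()
```

A worker that gets False from `allow` would put the delivery back on that endpoint's queue rather than drop it.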

Receiver-side best practices#

If you are consuming webhooks:

Return 200 immediately — do heavy processing asynchronously. Acknowledge receipt, then process in a background job.
Verify the HMAC signature — never trust unverified webhooks
Implement idempotency — deduplicate by event ID before processing
Use a queue internally — enqueue the raw payload into your own job queue for reliable processing
Log everything — store raw payloads for debugging and replay
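
Put together, a receiver might look roughly like the framework-free sketch below. The header names `X-Webhook-Timestamp` and `X-Webhook-Id` are assumptions (only `X-Webhook-Signature` appears earlier as an example), the in-memory set stands in for a persistent deduplication store, and the timestamp freshness check is omitted for brevity.

```python
import hashlib
import hmac
import queue
from typing import Mapping, Set


def handle_webhook(headers: Mapping[str, str], raw_body: bytes, secret: bytes,
                   seen_ids: Set[str], jobs: "queue.Queue[bytes]") -> int:
    """Verify, deduplicate, enqueue, and return an HTTP status code."""
    timestamp = headers.get("X-Webhook-Timestamp", "")
    signature = headers.get("X-Webhook-Signature", "")
    event_id = headers.get("X-Webhook-Id", "")

    # Sign the raw bytes exactly as received, never a re-serialized copy.
    signed_payload = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 401  # unverified: reject without processing

    if not event_id:
        return 400
    if event_id in seen_ids:
        return 200  # duplicate: acknowledge so the sender stops retrying

    jobs.put(raw_body)       # heavy processing happens in a background job
    seen_ids.add(event_id)
    return 200
```

The handler only acknowledges after the payload is on the internal queue, so a crash before that point results in a retry from the sender instead of a lost event.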

Visualize your webhook architecture#

Map out your event producers, queues, delivery workers, and dead letter handling — try Codelit to generate an interactive diagram.

Key takeaways#

At-least-once is the only practical delivery guarantee for webhooks
Exponential backoff with jitter prevents thundering herds on retry
HMAC-SHA256 signatures verify authenticity and prevent tampering
Idempotency keys let consumers safely handle duplicate deliveries
Dead letter queues catch events that exhaust all retries — never silently drop them
Svix and Hookdeck provide turnkey webhook infrastructure so you do not build from scratch
Consumers should return 200 immediately and process asynchronously

This is article #426 of the Codelit engineering blog.
