Webhook Delivery Guarantees — At-Least-Once, Retries, HMAC & Dead Letters
Why webhooks fail#
You send a POST request to your customer's endpoint. Their server returns a 500. Or it times out. Or DNS fails. Or they deployed and the endpoint is gone for 30 seconds. Webhooks fail constantly — and if you do not handle failures, your customers lose data.
Building reliable webhook delivery means accepting that every HTTP call can fail and designing around it.
Delivery semantics#
At-most-once#
Fire and forget. Send the webhook once. If it fails, the event is lost. Simple to implement, useless for anything important.
At-least-once#
Retry on failure until you get a success (2xx response). The consumer might receive the same event multiple times. This is the standard for webhook systems: Stripe and Shopify, for example, both retry failed deliveries automatically.
Exactly-once#
Impossible to guarantee across a network boundary. You can approximate it with at-least-once delivery on the sender side and idempotency on the receiver side.
Always design for at-least-once. It is the only practical guarantee for webhooks.
Retry with exponential backoff#
When a delivery fails, you need to retry. But hammering a failing endpoint every second makes things worse. Exponential backoff spaces out retries:
Attempt 1: immediate
Attempt 2: 30 seconds
Attempt 3: 2 minutes
Attempt 4: 8 minutes
Attempt 5: 30 minutes
Attempt 6: 2 hours
Attempt 7: 8 hours
Attempt 8: 24 hours
Jitter#
If 10,000 webhooks fail at the same time (the consumer had a brief outage), all 10,000 retries will fire at the same backoff interval — creating a thundering herd. Add random jitter:
delay = base_delay * (2 ^ attempt) + random(0, base_delay)
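As a rough illustration, the formula above could be written as the following Python sketch. The function name `retry_delay`, the 30-second default for `base_delay`, and the 24-hour cap are assumptions made for this example. Note that this formula doubles the delay on each attempt, so it will not reproduce the example schedule above exactly, which grows closer to fourfold per step.

```python
import random
from typing import Optional

def retry_delay(attempt: int,
                base_delay: float = 30.0,
                max_delay: float = 24 * 3600.0,
                rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0 = first retry).

    base_delay and max_delay are illustrative defaults, not recommendations.
    """
    rng = rng or random.Random()
    # Exponential part: base_delay * 2^attempt, capped so it cannot grow unbounded.
    backoff = min(base_delay * (2 ** attempt), max_delay)
    # Jitter part: a random extra wait in [0, base_delay] to spread retries out.
    jitter = rng.uniform(0, base_delay)
    return backoff + jitter
```

Passing a seeded `random.Random` as `rng` makes the delays reproducible in tests. Some systems randomize the whole delay rather than adding a small offset; that is a design choice this sketch does not take.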
Retry budget#
Set a maximum number of retries (typically 5–8) and a maximum retry window (24–72 hours). After exhausting retries, move the event to a dead letter queue.
Retry-After header#
Respect the Retry-After response header if the consumer sends one. It tells you exactly when to try again — often more useful than your backoff schedule.
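A sender might handle this header roughly as follows. Retry-After can carry either a number of seconds or an HTTP date, so the sketch accepts both. The helper names `retry_after_seconds` and `next_delay`, and the choice to cap the requested delay at 24 hours, are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay requested by Retry-After, or None if absent or unparseable."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "120".
        return float(value)
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())

def next_delay(backoff_delay: float,
               header_value: Optional[str],
               max_delay: float = 24 * 3600.0) -> float:
    """Prefer the consumer's requested delay; otherwise fall back to backoff."""
    requested = retry_after_seconds(header_value)
    if requested is None:
        return backoff_delay
    # Cap the requested delay (assumed policy) so one header cannot stall retries indefinitely.
    return min(requested, max_delay)
```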
HMAC signature verification#
How does the consumer know the webhook actually came from you and was not tampered with? HMAC signatures.
How it works#
- When the customer registers a webhook endpoint, generate a shared secret
- On every delivery, compute HMAC-SHA256(secret, request_body) and include it in a header (e.g., X-Webhook-Signature)
- The consumer computes the same HMAC with their copy of the secret and compares
Implementation details#
- Use the raw request body for HMAC computation — not a parsed-and-reserialized version
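- Include a timestamp in the signed payload to prevent replay attacks: signed_payload = timestamp + "." + body. The consumer must also reject timestamps outside a short tolerance window, otherwise an old signed request can still be replayed
- Use constant-time comparison to prevent timing attacks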
- Rotate secrets — provide an endpoint for customers to rotate their webhook secret without downtime (sign with both old and new secret during rotation)
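The points above can be sketched with Python's standard `hmac` module. The function names, the hex encoding of the signature, and the 300-second tolerance window are assumptions for this example. Real providers each define their own header format, so treat this as a sketch of the technique, not of any provider's scheme.

```python
import hashlib
import hmac
import time
from typing import Iterable, Optional

TOLERANCE_SECONDS = 300  # assumed replay window; choose your own value

def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Sender side: HMAC-SHA256 over timestamp + "." + raw body."""
    signed_payload = str(timestamp).encode() + b"." + body
    return hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()

def verify(secret: bytes, timestamp: int, body: bytes,
           signatures: Iterable[str], now: Optional[float] = None) -> bool:
    """Receiver side: accept if any provided signature matches our secret.

    During rotation the sender signs with both the old and the new secret,
    so the receiver succeeds whichever copy it currently holds.
    """
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # too old or too far in the future: possible replay
    expected = sign(secret, timestamp, body).encode()
    # compare_digest runs in constant time with respect to the contents.
    return any(hmac.compare_digest(expected, s.encode()) for s in signatures)
```

The `body` argument must be the raw bytes received on the wire. Parsing JSON and serializing it again can change whitespace or key order, which changes the HMAC.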
Common header conventions#
| Provider | Signature header |
|---|---|
| Stripe | Stripe-Signature (includes timestamp + signature) |
| GitHub | X-Hub-Signature-256 |
| Shopify | X-Shopify-Hmac-Sha256 |
| Svix | svix-signature (timestamp sent separately in svix-timestamp) |
Idempotency keys#
At-least-once means the consumer will receive duplicates. Idempotency keys let them deduplicate safely.
How it works#
Include a unique event_id (or idempotency-key header) with every webhook delivery. The consumer stores processed event IDs. Before processing, check if the event ID has been seen:
def handle(event_id, event):
    if event_id in processed_events:
        return 200  # Already handled, skip
    process(event)
    processed_events.add(event_id)  # Mark only after processing succeeds
    return 200
Best practices#
- Generate the ID on the sender side: the same event must always have the same ID across retries
- Use UUIDs or deterministic hashes: SHA256(event_type + entity_id + timestamp) works well, provided the timestamp is the event's creation time and not the delivery time
- Set a TTL on the deduplication store: you do not need to remember events forever. 72 hours covers most retry windows
- Store IDs in Redis or a database: in-memory sets are lost on restart
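A minimal sketch of these practices, assuming a deterministic hash for the ID and a TTL-based store. The `DedupStore` class here is an in-memory stand-in used only to keep the example self-contained; as noted above, production code would back it with Redis or a database. The names and the newline separator between hashed fields are assumptions.

```python
import hashlib
import time
from typing import Dict, Optional

def event_id(event_type: str, entity_id: str, created_at: str) -> str:
    """Deterministic ID: identical inputs always give the same ID across retries."""
    # A separator avoids ambiguity between e.g. ("ab", "c") and ("a", "bc").
    raw = "\n".join([event_type, entity_id, created_at]).encode()
    return hashlib.sha256(raw).hexdigest()

class DedupStore:
    """In-memory stand-in for a Redis or database table with per-key expiry."""

    def __init__(self, ttl_seconds: float = 72 * 3600):
        self.ttl = ttl_seconds
        self._expires: Dict[str, float] = {}

    def mark_if_new(self, key: str, now: Optional[float] = None) -> bool:
        """Return True the first time a key is seen within the TTL, else False."""
        now = time.time() if now is None else now
        # Drop entries whose TTL has passed.
        self._expires = {k: t for k, t in self._expires.items() if t > now}
        if key in self._expires:
            return False
        self._expires[key] = now + self.ttl
        return True
```

With a shared store, the check and the write should be a single atomic operation so that two concurrent deliveries of the same event cannot both pass the check.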
Dead letter handling#
After all retries are exhausted, the event goes to a dead letter queue (DLQ). Do not silently drop it.
What belongs in the DLQ#
- The full event payload
- Delivery metadata: endpoint URL, HTTP status codes from each attempt, timestamps
- Error details: timeout, connection refused, 4xx vs 5xx
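One possible shape for a dead letter record is sketched below. The class and field names are hypothetical; the point is only that the payload, the per-attempt metadata, and the error details travel together.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional

@dataclass
class Attempt:
    attempted_at: str            # ISO 8601 timestamp of the attempt
    status_code: Optional[int]   # None when no HTTP response was received
    error: Optional[str]         # e.g. "timeout", "connection refused"

@dataclass
class DeadLetter:
    event_id: str
    endpoint_url: str
    payload: str                 # the full event payload, stored verbatim
    attempts: List[Attempt] = field(default_factory=list)

    def failure_kind(self) -> str:
        """Classify the last attempt so dashboards can separate 4xx from 5xx."""
        if not self.attempts:
            return "never_attempted"
        code = self.attempts[-1].status_code
        if code is None:
            return "network_error"
        if 400 <= code < 500:
            return "client_error"
        if code >= 500:
            return "server_error"
        return "other"

    def to_json(self) -> str:
        """Serialize the whole record, including nested attempts."""
        return json.dumps(asdict(self))
```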
What to do with dead letters#
- Alert the customer — send an email or dashboard notification that deliveries are failing
- Provide a replay endpoint — let customers manually retry failed events once they fix their endpoint
- Auto-disable endpoints — after sustained failures (e.g., 3 days of failures), disable the webhook and notify the customer. Do not keep burning resources on a dead endpoint.
- Provide event logs — a searchable log of all deliveries (successful and failed) with request and response details
Automatic endpoint disabling#
A reasonable policy:
- After 3 consecutive days of 100% failure rate, disable the endpoint
- Send a warning email after 1 day of failures
- Require the customer to manually re-enable and verify the endpoint
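The policy above could be expressed as a small decision function. This sketch assumes you track `failing_since`, the time of the first failure after the most recent success, which stands in for "100% failure rate since then". The function name and return values are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

WARN_AFTER = timedelta(days=1)
DISABLE_AFTER = timedelta(days=3)

def endpoint_action(failing_since: Optional[datetime],
                    now: datetime,
                    already_warned: bool) -> str:
    """Return "none", "warn", or "disable" for one endpoint."""
    if failing_since is None:
        return "none"  # the last delivery succeeded
    failing_for = now - failing_since
    if failing_for >= DISABLE_AFTER:
        return "disable"
    if failing_for >= WARN_AFTER and not already_warned:
        return "warn"
    return "none"
```

Any successful delivery would reset `failing_since` to `None`. Re-enabling is left out on purpose, since the policy requires the customer to do that manually.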
Webhook infrastructure: build vs buy#
Building in-house#
You need:
- A durable event queue (SQS, RabbitMQ, Kafka)
- A delivery worker with retry logic and backoff
- HMAC signing
- A dead letter queue
- A customer-facing dashboard for logs, retries, and endpoint management
- Monitoring: delivery latency, success rate, DLQ depth
This is a significant amount of infrastructure to build and maintain correctly.
Svix — webhook infrastructure as a service#
Svix is an open-source webhook delivery platform that handles all of the above:
- At-least-once delivery with configurable retry schedules
- HMAC signatures with automatic secret rotation
- Idempotency built in
- Customer portal — embeddable UI for your customers to manage their endpoints
- Event catalog — typed event schemas with versioning
- Operational dashboard — delivery logs, success rates, latency metrics
You can self-host the open-source version or use the managed service. Other options include Hookdeck and AWS EventBridge for event routing.
Sender-side architecture#
A well-designed webhook sender looks like this:
- Event producer: your application emits events (e.g., payment.completed)
- Event queue: durable queue buffers events (SQS, Kafka)
- Delivery workers: pull from queue, look up registered endpoints, deliver via HTTP
- Retry queue: failed deliveries go back with backoff metadata
- Dead letter queue: exhausted retries land here
- API for consumers: register endpoints, view logs, replay events, rotate secrets
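To make the flow between delivery workers, the retry queue, and the dead letter queue concrete, here is a small in-memory sketch. It assumes a heap as the retry queue and an injected `send` callable in place of a real HTTP client. The names `Delivery` and `run_worker` and the constants are hypothetical. Fan-out, jitter, and persistence are left out to keep it short.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List

MAX_ATTEMPTS = 8    # assumed retry budget
BASE_DELAY = 30.0   # assumed base delay in seconds

@dataclass(order=True)
class Delivery:
    due_at: float                          # the only field used for heap ordering
    event_id: str = field(compare=False)
    url: str = field(compare=False)
    body: bytes = field(compare=False)
    attempt: int = field(default=0, compare=False)

def run_worker(queue: List[Delivery],
               send: Callable[[str, bytes], int],
               now: float) -> List[Delivery]:
    """Process every delivery due at or before `now`; return new dead letters."""
    dead: List[Delivery] = []
    while queue and queue[0].due_at <= now:
        d = heapq.heappop(queue)
        try:
            ok = 200 <= send(d.url, d.body) < 300
        except OSError:
            ok = False  # connection errors and timeouts count as failures
        if ok:
            continue
        d.attempt += 1
        if d.attempt >= MAX_ATTEMPTS:
            dead.append(d)  # budget exhausted: hand over to the DLQ
        else:
            d.due_at = now + BASE_DELAY * 2 ** d.attempt
            heapq.heappush(queue, d)  # back onto the retry queue
    return dead
```

In a real system the heap would be a durable queue with delayed visibility, and `send` would enforce a per-request timeout.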
Scaling considerations#
- Fan-out: One event may need delivery to hundreds of endpoints (multi-tenant SaaS). Use a fan-out step between the event queue and delivery workers.
- Rate limiting per endpoint: Do not overwhelm a consumer with 1,000 concurrent deliveries. Queue per endpoint and throttle.
- Timeout budget: Set a 30-second timeout per delivery attempt. Slow consumers should not block your workers.
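Per-endpoint throttling is often implemented with a token bucket, which is the technique assumed in this sketch; the article does not prescribe a specific algorithm. The class names and the default rate and burst values are illustrative.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class TokenBucket:
    rate: float        # tokens added per second
    capacity: float    # maximum burst size
    tokens: float
    updated_at: float

    def try_acquire(self, now: float) -> bool:
        """Refill based on elapsed time, then take one token if available."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

class EndpointLimiter:
    """One bucket per endpoint URL, so a slow consumer only throttles itself."""

    def __init__(self, rate: float = 10.0, burst: float = 20.0):
        self.rate = rate
        self.burst = burst
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, url: str, now: float) -> bool:
        bucket = self._buckets.get(url)
        if bucket is None:
            # New endpoints start with a full bucket.
            bucket = TokenBucket(self.rate, self.burst, self.burst, now)
            self._buckets[url] = bucket
        return bucket.try_acquire(now)
```

A worker that gets `False` from `allow` would put the delivery back on the queue with a short delay instead of dropping it.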
Receiver-side best practices#
If you are consuming webhooks:
- Return 200 quickly: acknowledge receipt once the raw payload is safely stored or enqueued, then do heavy processing in a background job
- Verify the HMAC signature: never trust unverified webhooks
- Implement idempotency: deduplicate by event ID before processing
- Use a queue internally: enqueue the raw payload into your own job queue for reliable processing
- Log everything: store raw payloads for debugging and replay
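Put together, a receiver's request handler might follow the order sketched below: verify, deduplicate, enqueue, acknowledge. The function name, the 401 status for a bad signature, and the plain HMAC over the body (no timestamp) are simplifying assumptions. The in-memory `seen` set and `queue.Queue` stand in for a durable store and job queue.

```python
import hashlib
import hmac
import queue
from typing import Set

def handle_webhook(secret: bytes, raw_body: bytes, signature: str,
                   event_id: str, seen: Set[str],
                   jobs: "queue.Queue") -> int:
    """Return the HTTP status code the receiver should respond with."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 401  # unverified: do not enqueue or process
    if event_id in seen:
        return 200  # duplicate delivery: acknowledge and skip
    # Hand the raw payload to a background worker; no heavy work happens here.
    jobs.put((event_id, raw_body))
    seen.add(event_id)
    return 200
```

Because the handler only verifies and enqueues, it can respond well within the sender's timeout even when downstream processing is slow.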
Visualize your webhook architecture#
Map out your event producers, queues, delivery workers, and dead letter handling — try Codelit to generate an interactive diagram.
Key takeaways#
- At-least-once is the only practical delivery guarantee for webhooks
- Exponential backoff with jitter prevents thundering herds on retry
- HMAC-SHA256 signatures verify authenticity and prevent tampering
- Idempotency keys let consumers safely handle duplicate deliveries
- Dead letter queues catch events that exhaust all retries — never silently drop them
- Svix and Hookdeck provide turnkey webhook infrastructure so you do not build from scratch
- Consumers should return 200 immediately and process asynchronously
This is article #426 of the Codelit engineering blog.