# Webhook Architecture: Patterns for Reliable Event-Driven Integration
Webhooks are the simplest form of event-driven integration — an HTTP POST sent to a URL when something happens. But building webhook architecture that's reliable at scale requires careful thought about delivery guarantees, security, and failure handling.
This guide covers the patterns that separate toy webhook implementations from production-grade systems.
## Webhooks vs. Polling
The alternative to webhooks is polling — repeatedly hitting an API to check for changes. Here's why webhooks win in most scenarios:
| Factor | Webhooks | Polling |
|---|---|---|
| Latency | Near real-time | Depends on interval |
| Efficiency | Event-driven, no wasted calls | Most requests return nothing |
| Complexity | Receiver must expose an endpoint | Simpler to implement initially |
| Reliability | Requires retry logic | Naturally retries on next poll |
Polling works for simple integrations where latency doesn't matter. Webhooks are the right choice when you need timely updates without burning API quota on empty responses.
## Webhook Delivery: Retry and Exponential Backoff
The fundamental challenge of webhooks is that HTTP delivery is unreliable. Receivers go down, networks fail, and deployments cause brief outages. A robust webhook sender must implement retries.
### Retry Strategy
A common pattern is exponential backoff with jitter:
- Attempt 1 — immediate
- Attempt 2 — 1 minute later
- Attempt 3 — 5 minutes later
- Attempt 4 — 30 minutes later
- Attempt 5 — 2 hours later
- Attempt 6 — 8 hours later
After exhausting retries, mark the delivery as failed and optionally notify the subscriber. Most systems cap retries between 5 and 10 attempts over a 24 to 72 hour window.
### Jitter
Without jitter, a receiver that goes down and comes back up gets hit by a thundering herd of retried webhooks. Adding random jitter (plus or minus 20% of the backoff interval) spreads the load.
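As a sketch, the schedule above can be encoded as a delay table with jitter applied on top. The table values mirror the attempt list, and the `retry_delay` helper is illustrative, not from any particular library:

```python
import random

# Illustrative delay table matching the schedule above, in seconds.
BASE_DELAYS = [0, 60, 300, 1800, 7200, 28800]  # immediate, 1m, 5m, 30m, 2h, 8h

def retry_delay(attempt: int, jitter: float = 0.2) -> float:
    """Delay before the given 1-indexed attempt, with +/-20% jitter."""
    if not 1 <= attempt <= len(BASE_DELAYS):
        raise ValueError("retries exhausted")
    base = BASE_DELAYS[attempt - 1]
    # Randomize so a recovering receiver isn't hit by every retry at once.
    return base * (1 + random.uniform(-jitter, jitter))
```

A caller would schedule the next delivery `retry_delay(attempt)` seconds out and dead-letter the event once the helper raises.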
## Idempotent Receivers
Because senders retry on failure, receivers will get duplicate deliveries. If the first attempt succeeded but the acknowledgment was lost, the sender retries and the receiver processes the same event twice.
The fix is idempotency. Every webhook payload should include a unique event ID. Receivers track which IDs they've already processed:
- Receive the webhook.
- Check if the event ID exists in your processed set.
- If yes, return 200 and skip processing.
- If no, process the event, store the ID, and return 200.
Use a database table or Redis set with a TTL for storing processed IDs. The TTL should exceed the sender's maximum retry window.
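A minimal sketch of this check, using an in-memory set as a stand-in for the Redis set or database table (the `handle_webhook` and `process_event` names are hypothetical):

```python
processed_ids: set[str] = set()  # stand-in for a Redis set with a TTL
results: list[str] = []          # stand-in for real side effects

def process_event(payload: dict) -> None:
    """Placeholder for your business logic."""
    results.append(payload["event_id"])

def handle_webhook(payload: dict) -> int:
    """Process each event ID at most once; always acknowledge with 200."""
    event_id = payload["event_id"]
    if event_id in processed_ids:
        return 200  # duplicate delivery: acknowledge without reprocessing
    process_event(payload)
    processed_ids.add(event_id)  # record only after processing succeeds
    return 200
```

Note the ordering: the ID is recorded only after processing succeeds, so a crash mid-processing leaves the event eligible for retry.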
## Signature Verification with HMAC
Webhooks arrive as HTTP requests to a public URL. Without verification, anyone who discovers your endpoint can send forged payloads. HMAC signature verification solves this.
The pattern works like this:
- The sender and receiver share a secret key during setup.
- For each webhook, the sender computes `HMAC-SHA256(secret, raw_request_body)` and includes it in a header (commonly `X-Signature` or `X-Hub-Signature-256`).
- The receiver computes the same HMAC over the raw body and compares it to the header value.
- If they match, the request is authentic.
Critical implementation details:
- Always use constant-time comparison to prevent timing attacks.
- Compute the HMAC over the raw body bytes, not a parsed-and-reserialized version.
- Rotate secrets periodically and support multiple active secrets during the rotation window.
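In Python, both sides can be sketched with the standard library's `hmac` module, which provides a constant-time `compare_digest`:

```python
import hashlib
import hmac

def sign(secret: bytes, raw_body: bytes) -> str:
    """Hex HMAC-SHA256 signature the sender attaches to the request."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()

def verify(secret: bytes, raw_body: bytes, header_signature: str) -> bool:
    """Recompute over the raw body bytes and compare in constant time."""
    expected = sign(secret, raw_body)
    return hmac.compare_digest(expected, header_signature)
```

To support secret rotation, the receiver would run `verify` against each active secret and accept if any matches.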
## Fan-Out Webhooks
Fan-out is when a single event must be delivered to multiple subscribers. For example, a payment processor notifying both the merchant's backend and their analytics service.
### Approaches
Independent delivery — treat each subscriber as a separate delivery with its own retry queue. One subscriber's failure doesn't affect others. This is the most common pattern.
Topic-based routing — subscribers register for specific event types. A payment.completed event only goes to subscribers who opted into that topic. This reduces noise and processing overhead.
Batching — instead of one HTTP request per event, accumulate events and deliver them in batches on a schedule. This reduces connection overhead but increases latency. Useful for high-volume, latency-tolerant use cases.
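Topic-based routing can be sketched as a simple subscription registry (the endpoint URLs and helper names are illustrative):

```python
from collections import defaultdict

# Topic -> subscriber endpoints (illustrative in-memory registry).
subscriptions: dict[str, list[str]] = defaultdict(list)

def subscribe(endpoint: str, topic: str) -> None:
    """Register an endpoint for one event type."""
    subscriptions[topic].append(endpoint)

def targets_for(event_type: str) -> list[str]:
    """Only subscribers who opted into this topic receive the event."""
    return subscriptions[event_type]
```

Each endpoint returned by `targets_for` would then get its own independent delivery with its own retry queue, as described above.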
## Webhook Infrastructure Tools
Building reliable webhook delivery from scratch is harder than it looks. These tools handle the hard parts:
### Svix
An open-source webhook sending service. Svix provides retry logic, signature verification, a management dashboard, and SDKs for multiple languages. You focus on generating events; Svix handles delivery. Available as a hosted service or self-hosted.
### Hookdeck
A webhook infrastructure platform focused on the receiving side. Hookdeck sits between senders and your application, providing queuing, retries, filtering, and a debugging dashboard. Useful when you consume webhooks from third-party services and need reliability guarantees.
### Amazon EventBridge
For AWS-native architectures, EventBridge can route webhook-like events with built-in filtering, transformation, and delivery to multiple targets including Lambda, SQS, and HTTP endpoints.
### Roll Your Own
If you build it yourself, the minimum viable architecture is:
- Ingestion endpoint — accepts the event and writes it to a queue.
- Queue — SQS, RabbitMQ, or Redis Streams for durability.
- Worker — dequeues events and attempts HTTP delivery.
- Retry scheduler — requeues failed deliveries with backoff.
- Dead letter queue — captures events that exhaust all retries.
This is significantly more work than using Svix or Hookdeck, but gives you full control.
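The worker and dead-letter pieces can be sketched with an in-process queue standing in for SQS or RabbitMQ and a stubbed HTTP delivery call (all names here are hypothetical):

```python
import queue

MAX_ATTEMPTS = 5

deliveries: "queue.Queue[dict]" = queue.Queue()  # stand-in for SQS/RabbitMQ
dead_letters: list[dict] = []                    # stand-in for a dead letter queue

def attempt_delivery(event: dict) -> bool:
    """Placeholder for the HTTP POST to the subscriber; True on 2xx."""
    raise NotImplementedError

def worker_step(deliver=attempt_delivery) -> None:
    """Dequeue one event; requeue or dead-letter it on failure."""
    event = deliveries.get()
    if deliver(event):
        return  # delivered successfully
    event["attempts"] = event.get("attempts", 0) + 1
    if event["attempts"] >= MAX_ATTEMPTS:
        dead_letters.append(event)  # exhausted all retries
    else:
        deliveries.put(event)  # a real scheduler would requeue with backoff
```

The sketch requeues immediately; in production the retry scheduler would delay each requeue per the backoff schedule.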
## Debugging Failed Webhooks
When webhooks fail, debugging is painful because the sender and receiver are different systems. Build observability into your architecture from the start:
- Log every delivery attempt — request body, response status, response body, latency.
- Provide a webhook event log UI — let subscribers see what was sent, when, and what the response was.
- Support manual replay — allow re-sending a specific event for debugging.
- Include a request ID — a unique ID per delivery attempt (distinct from the event ID) makes it easy to correlate logs across systems.
- Expose delivery status via API — let subscribers programmatically check if deliveries are failing.
Stripe's webhook dashboard is the gold standard here — it shows every event, every delivery attempt, the response code, and lets you manually resend.
## Scaling Webhook Consumers
As inbound webhook volume grows, a single HTTP server becomes a bottleneck. Scaling patterns include:
### Async Processing
Don't process webhooks inline. Accept the request, write to a queue, return 200 immediately. A pool of workers processes events asynchronously. This decouples reception from processing and prevents slow handlers from causing timeouts.
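A sketch of the accept-then-enqueue pattern, with `handle` standing in for your real processing and `worker` meant to run in a thread or process pool:

```python
import queue

inbound: "queue.Queue[bytes]" = queue.Queue()  # stand-in for a durable queue

def handle(raw_body: bytes) -> None:
    """Placeholder for slow event processing, run off the request path."""

def receive(raw_body: bytes) -> int:
    """Ingestion endpoint: enqueue and return 200 immediately."""
    inbound.put(raw_body)
    return 200

def worker() -> None:
    """Run several of these concurrently to drain the queue."""
    while True:
        body = inbound.get()
        handle(body)
        inbound.task_done()
```

Because `receive` does no parsing or business logic, it stays fast even when `handle` is slow, which keeps sender-side timeouts from firing.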
### Horizontal Scaling
Run multiple instances of your webhook receiver behind a load balancer. Combined with idempotent processing, this lets you scale throughput linearly.
### Rate Limiting and Backpressure
If you're the sender, respect receiver rate limits. If a receiver starts returning 429 responses, slow down delivery. If you're the receiver and can't keep up, returning 429 signals the sender to back off — assuming they implement it correctly.
## Ordering Guarantees
Webhooks are delivered over HTTP, which provides no ordering guarantees. If event order matters, include a sequence number or timestamp in the payload and have receivers reorder on their end. Alternatively, process events idempotently so that out-of-order delivery produces the same final state.
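Receiver-side reordering can be sketched as follows, assuming the sender includes a monotonically increasing `seq` field in each payload (an assumption, not a standard):

```python
class OrderedApplier:
    """Buffer out-of-order events and apply them in sequence order."""

    def __init__(self) -> None:
        self.next_seq = 1
        self.pending: dict[int, dict] = {}
        self.applied: list[dict] = []

    def receive(self, event: dict) -> None:
        self.pending[event["seq"]] = event
        # Drain every event now contiguous with what we've already applied.
        while self.next_seq in self.pending:
            self.applied.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

A production version would also bound the buffer and handle gaps left by permanently failed deliveries.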
## Key Takeaways
- Webhooks beat polling for real-time, efficient event delivery.
- Retry with exponential backoff and jitter handles transient failures gracefully.
- Idempotent receivers are non-negotiable — duplicates will happen.
- HMAC signature verification prevents forged payloads.
- Fan-out patterns let you deliver events to multiple subscribers independently.
- Tools like Svix and Hookdeck save months of engineering effort.
- Async processing with queues is essential for scaling webhook consumers.
Webhooks look deceptively simple. The architecture around them is what makes them reliable.
This is post #172 in the Codelit engineering blog series.