Webhook Architecture: Design Patterns for Reliable Event Delivery
Webhooks are HTTP callbacks — your server sends a POST request to the subscriber's URL when something happens. Simple in concept, tricky to make reliable.
How Webhooks Work
Event occurs → Your Server → POST to subscriber URL → Subscriber processes
← 200 OK (acknowledged)
The Reliability Problem
Networks fail. Subscribers crash. Without retry logic, events get lost:
Event → POST → subscriber returns 500 → event lost forever ❌
Event → POST → network timeout → event lost forever ❌
Event → POST → subscriber is down → event lost forever ❌
Patterns for Reliability
1. Retry with Exponential Backoff
Attempt 1: immediate
Attempt 2: 1 minute
Attempt 3: 5 minutes
Attempt 4: 30 minutes
Attempt 5: 2 hours
Attempt 6: 8 hours (final)
After all retries fail → mark as failed → alert subscriber.
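Provider policies vary. Stripe retries live-mode deliveries for up to 3 days with exponential backoff. GitHub does not automatically redeliver failed deliveries; they have to be redelivered manually or through its REST API.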
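A minimal sketch of how a worker might turn the six-attempt schedule above into scheduling decisions. The names `ATTEMPT_DELAYS_MINUTES`, `retryDelayMs`, and `planNextAttempt` are hypothetical, and the sketch assumes each delay is measured from the previous failed attempt, which the schedule does not state explicitly.

```javascript
// Delay before each attempt, in minutes, copied from the schedule above.
// Index 0 is the first (immediate) attempt.
// Assumption: each delay counts from the previous failed attempt.
const ATTEMPT_DELAYS_MINUTES = [0, 1, 5, 30, 120, 480];

// Hypothetical helper: given how many attempts have already failed, return the
// wait in milliseconds before the next attempt, or null when none are left.
function retryDelayMs(failedAttempts) {
  if (!Number.isInteger(failedAttempts) || failedAttempts < 0) {
    throw new RangeError('failedAttempts must be a non-negative integer');
  }
  if (failedAttempts >= ATTEMPT_DELAYS_MINUTES.length) {
    return null; // attempt 6 was the last one
  }
  return ATTEMPT_DELAYS_MINUTES[failedAttempts] * 60 * 1000;
}

// Decide what a worker should do with a delivery after a failed attempt.
function planNextAttempt(failedAttempts, nowMs) {
  const delay = retryDelayMs(failedAttempts);
  if (delay === null) {
    // Retries exhausted: mark as failed and alert the subscriber.
    return { status: 'failed', nextAttemptAt: null };
  }
  return { status: 'pending', nextAttemptAt: nowMs + delay };
}
```

Storing `nextAttemptAt` with the delivery record, rather than sleeping inside the worker process, means a worker restart does not lose the schedule.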
2. Signature Verification
Prove the webhook came from you, not an attacker:
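// Sender: sign the exact bytes that go on the wire
const crypto = require('node:crypto');
const body = JSON.stringify(payload);
headers['X-Webhook-Signature'] = crypto.createHmac('sha256', secret).update(body).digest('hex');
// Receiver: recompute over the raw request body (rawBody), not re-serialized JSON
const expected = Buffer.from(crypto.createHmac('sha256', secret).update(rawBody).digest('hex'));
const received = Buffer.from(signature); // value of the X-Webhook-Signature header
// Constant-time comparison; timingSafeEqual throws on a length mismatch, so check length first
if (received.length !== expected.length || !crypto.timingSafeEqual(received, expected)) return 401; // reject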
Always verify signatures. Without this, anyone can POST to your webhook URL.
3. Idempotency
Retries mean duplicate deliveries. Receivers must handle them:
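// Sender: attach a unique event ID and reuse it on every retry
headers['X-Webhook-ID'] = 'evt_abc123';
// Receiver (assumes `req` is the incoming request and `redis` is an ioredis-style client)
const eventId = req.headers['x-webhook-id'];
const isFirstDelivery = await redis.setnx(`webhook:${eventId}`, '1'); // 1 = key set, 0 = already existed
if (!isFirstDelivery) return 200; // already handled, skip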
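One gap in a claim-then-process receiver is what happens when processing fails after the ID has been claimed: the sender's retry would then be skipped as a duplicate. The sketch below shows one way to handle that, using an in-memory `Set` as a stand-in for Redis. `SeenEvents` and `handleWebhook` are hypothetical names, not part of any library.

```javascript
// In-memory stand-in for the Redis key space. A real receiver would use a
// shared store so every instance sees the same claims.
class SeenEvents {
  constructor() {
    this.ids = new Set();
  }

  // Mirrors SETNX: true only for the first caller to claim this ID.
  claim(eventId) {
    if (this.ids.has(eventId)) return false;
    this.ids.add(eventId);
    return true;
  }

  // Undo a claim so a later retry is processed again.
  release(eventId) {
    this.ids.delete(eventId);
  }
}

// Hypothetical receiver: returns the HTTP status code to send back.
function handleWebhook(store, event, processEvent) {
  if (!store.claim(event.id)) {
    return 200; // duplicate delivery: acknowledge and skip
  }
  try {
    processEvent(event);
    return 200;
  } catch (err) {
    store.release(event.id); // processing failed, let the retry through
    return 500;
  }
}
```

Returning 500 after releasing the claim asks the sender to retry, and the retry is then processed instead of being dropped as a duplicate.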
4. Outbox Pattern
Ensure events are published reliably:
-- In the same transaction as the business logic:
BEGIN;
INSERT INTO orders (id, amount) VALUES ('ord_123', 100);
INSERT INTO webhook_outbox (event_type, payload) VALUES ('order.created', '...');
COMMIT;
-- Background worker polls outbox → sends webhooks → marks as sent
Never publish a webhook directly in the request handler. If the transaction rolls back after the POST is sent, the webhook fires for an action that never completed; if the POST fails after the commit, the event is lost.
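The background worker's polling pass can be sketched as follows. An in-memory array stands in for the `webhook_outbox` table, and the `sentAt` and `attempts` fields are assumed columns that the SQL above does not define. `drainOutbox` and `send` are hypothetical names.

```javascript
// One polling pass over the outbox. `outbox` is an array standing in for
// rows of webhook_outbox; `send` is any async function that throws on failure.
async function drainOutbox(outbox, send) {
  let sent = 0;
  for (const row of outbox) {
    if (row.sentAt !== null) continue; // already delivered on an earlier pass
    try {
      await send(row.eventType, row.payload);
      // Equivalent of: UPDATE webhook_outbox SET sent_at = now() WHERE id = ...
      row.sentAt = new Date().toISOString();
      sent += 1;
    } catch (err) {
      // Leave the row unsent so the next pass picks it up again.
      row.attempts += 1;
    }
  }
  return sent;
}
```

If the worker crashes between sending and marking the row as sent, the next pass sends the event again, so this sketch gives at-least-once delivery and relies on receivers deduplicating by event ID.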
5. Fan-Out Queue
Don't send webhooks synchronously. Queue them:
Event → Queue (SQS/Kafka) → Webhook Worker → POST to subscriber
→ retry on failure
→ dead letter on exhaust
Benefits: Non-blocking, parallel delivery, retry isolation.
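To illustrate retry isolation, here is a minimal sketch in which an array stands in for the queue (SQS or Kafka in the diagram above). `fanOut`, `runWorker`, and the `maxAttempts` option are hypothetical, and the sketch requeues failed jobs immediately instead of applying backoff delays.

```javascript
// Create one delivery job per subscriber so each is retried independently.
function fanOut(event, subscriberUrls) {
  return subscriberUrls.map((url) => ({ url, payload: event, attempts: 0 }));
}

// Drain the queue. `send` is any async function that resolves to an HTTP
// status code or throws on a network error.
async function runWorker(queue, send, { maxAttempts = 3 } = {}) {
  const delivered = [];
  const deadLetter = [];
  while (queue.length > 0) {
    const job = queue.shift();
    job.attempts += 1;
    let ok = false;
    try {
      const status = await send(job.url, job.payload);
      ok = status >= 200 && status < 300;
    } catch (err) {
      ok = false; // network failure counts as a failed attempt
    }
    if (ok) {
      delivered.push(job);
    } else if (job.attempts < maxAttempts) {
      queue.push(job); // retry later; other subscribers are not blocked
    } else {
      deadLetter.push(job); // retries exhausted
    }
  }
  return { delivered, deadLetter };
}
```

Because each subscriber has its own job, a subscriber that keeps returning 500 ends up in the dead-letter list without delaying delivery to the healthy ones.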
Architecture
Your Service → Event occurs → Webhook Service
→ Store event in DB
→ Queue delivery jobs
→ Worker: POST + retry
→ Log: delivery status
→ Dashboard: subscriber health
Subscriber → receives POST → verifies signature → processes → returns 200
Webhook Event Format
A typical event envelope (loosely modeled on Stripe's event objects):
{
  "id": "evt_abc123",
  "type": "order.created",
  "created": "2026-03-28T14:30:00Z",
  "data": {
    "id": "ord_456",
    "amount": 9999,
    "currency": "usd",
    "customer": "cus_789"
  }
}
Include: Event ID, type, timestamp, full payload. Don't include: Secrets, tokens, or internal IDs the subscriber can't use.
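A small sketch of a builder that produces this envelope and refuses obviously sensitive fields. `buildEvent` and the `FORBIDDEN_KEYS` denylist are hypothetical, and the check only inspects top-level keys of `data`, so treat it as a safety net and not a complete filter.

```javascript
const crypto = require('node:crypto');

// Hypothetical denylist of field names that should never leave the service.
const FORBIDDEN_KEYS = ['secret', 'token', 'password'];

// Build an event envelope with an ID, type, timestamp, and payload.
function buildEvent(type, data, now = new Date()) {
  for (const key of Object.keys(data)) {
    if (FORBIDDEN_KEYS.includes(key.toLowerCase())) {
      throw new Error(`refusing to include sensitive field "${key}"`);
    }
  }
  return {
    // Generated once per event; retries must reuse the same ID.
    id: `evt_${crypto.randomUUID().replace(/-/g, '')}`,
    type,
    created: now.toISOString(),
    data,
  };
}
```

Generating the ID when the event is created, not when it is sent, is what lets subscribers deduplicate retried deliveries.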
Delivery Guarantees
| Guarantee | Meaning | Implementation |
|---|---|---|
| At-most-once | May miss events | Fire and forget (no retry) |
| At-least-once | May duplicate | Retry until acknowledged |
| Exactly-once (effective) | No miss, no duplicate side effects | At-least-once delivery + subscriber dedup by event ID |
At-least-once is the standard. Require subscribers to be idempotent.
Monitoring
Track per subscriber:
- Delivery rate — % of successful deliveries
- Latency — time from event to successful delivery
- Failure rate — % of failed attempts
- Consecutive failures — auto-disable after N failures
Subscriber health:
acme.com/webhooks: 99.2% success, avg 230ms, 0 consecutive failures ✓
corp.io/hooks: 85% success, avg 1.2s, 12 consecutive failures ⚠️ → auto-disabled
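The per-subscriber metrics above can be tracked with a small counter object. This is a sketch: the `SubscriberHealth` class and its `disableAfter` option are hypothetical, and the default threshold of 10 is an arbitrary placeholder for N.

```javascript
// Tracks delivery outcomes for a single subscriber endpoint.
class SubscriberHealth {
  constructor({ disableAfter = 10 } = {}) {
    this.disableAfter = disableAfter; // N consecutive failures before auto-disable
    this.attempts = 0;
    this.successes = 0;
    this.totalLatencyMs = 0;
    this.consecutiveFailures = 0;
    this.disabled = false;
  }

  // Record one delivery attempt. latencyMs is only used for successes.
  record({ ok, latencyMs = 0 }) {
    this.attempts += 1;
    if (ok) {
      this.successes += 1;
      this.totalLatencyMs += latencyMs;
      this.consecutiveFailures = 0;
    } else {
      this.consecutiveFailures += 1;
      if (this.consecutiveFailures >= this.disableAfter) {
        this.disabled = true; // stop sending and alert the subscriber
      }
    }
  }

  // Fraction of attempts that succeeded, or null before any attempt.
  get successRate() {
    return this.attempts === 0 ? null : this.successes / this.attempts;
  }

  // Mean latency of successful deliveries, or null if there were none.
  get avgLatencyMs() {
    return this.successes === 0 ? null : this.totalLatencyMs / this.successes;
  }
}
```

A single success resets `consecutiveFailures`, so an endpoint with occasional errors keeps receiving events while one that is fully down gets disabled.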
Best Practices
- Sign every webhook — HMAC-SHA256 with per-subscriber secret
- Include event ID — subscribers need it for dedup
- Retry with backoff: 5-6 attempts with growing delays, spread over several hours rather than seconds
- Use a queue — never send synchronously from the request handler
- Outbox pattern — ensure events match database state
- Subscriber dashboard — show delivery logs, let them resend
- Auto-disable after N consecutive failures — alert the subscriber
- Timeout at 30 seconds — don't wait forever for slow subscribers
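The 30-second timeout from the list above could be enforced along these lines. This is a sketch assuming a runtime with the standard `fetch` and `AbortController` globals (Node.js 18 or later); `postWithTimeout` and its `fetchImpl` option are hypothetical, the latter added only so the function can be exercised without a network.

```javascript
// POST a webhook body and give up after timeoutMs milliseconds.
async function postWithTimeout(url, body, { timeoutMs = 30000, fetchImpl = fetch } = {}) {
  const controller = new AbortController();
  // Abort the in-flight request when the deadline passes.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body,
      signal: controller.signal,
    });
    return { ok: res.ok, status: res.status, timedOut: false };
  } catch (err) {
    // Covers both network errors and the abort triggered by the timer.
    return { ok: false, status: null, timedOut: controller.signal.aborted };
  } finally {
    clearTimeout(timer);
  }
}
```

A timed-out request is reported the same way as any other failed attempt, so the caller can feed the result straight into its retry logic.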