Exponential Backoff and Retries — Building Resilient Distributed Systems
Why retry at all?
Distributed systems fail in transient ways constantly. A network blip, a brief CPU spike on the server, a momentary connection pool exhaustion — these failures resolve themselves within seconds. A single retry often succeeds where the first attempt failed.
Without retries, every transient failure becomes a user-visible error. With retries, most transient failures become invisible.
The naive retry — and why it fails
The simplest retry strategy is to immediately retry a fixed number of times:
attempt 1: fail → retry immediately
attempt 2: fail → retry immediately
attempt 3: fail → give up
The problem: if the server is overloaded, immediate retries add more load. Thousands of clients retrying simultaneously create a thundering herd that can turn a momentary slowdown into a full outage.
Exponential backoff
Exponential backoff spaces retries apart with geometrically increasing delays:
attempt 1: fail → wait 1s
attempt 2: fail → wait 2s
attempt 3: fail → wait 4s
attempt 4: fail → wait 8s
attempt 5: fail → give up
The general formula, with attempts numbered from zero:
delay = base_delay * (2 ^ attempt_number)
This gives the failing service time to recover. Each successive retry waits longer, reducing pressure on the struggling system.
Capping the maximum delay
Without a cap, exponential growth produces absurd wait times. After 10 attempts with a 1-second base, the delay would be 1,024 seconds (17 minutes).
delay = min(base_delay * (2 ^ attempt), max_delay)
A typical cap is 30 to 60 seconds. Beyond that, the failure is unlikely to be transient.
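The capped formula can be written as a one-line helper. A minimal sketch; the name `backoff_delay` and the default values are illustrative, with attempts numbered from zero:

```python
def backoff_delay(attempt, base_delay=1.0, max_delay=30.0):
    """Exponential backoff delay in seconds, capped at max_delay."""
    return min(base_delay * (2 ** attempt), max_delay)
```

With a 1-second base, attempt 3 waits 8 seconds, while attempt 10 is capped at 30 seconds instead of 1,024.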
Adding jitter
Even with exponential backoff, synchronized clients produce periodic spikes. If 1,000 clients all start at the same time, they all retry at 1s, then 2s, then 4s — still a thundering herd, just slower.
Jitter adds randomness to break the synchronization:
Full jitter
delay = random(0, base_delay * (2 ^ attempt))
Spreads retries uniformly across the entire window. Produces the best load distribution.
Equal jitter
half = (base_delay * (2 ^ attempt)) / 2
delay = half + random(0, half)
Guarantees a minimum wait of half the exponential delay while still randomizing.
Decorrelated jitter
delay = random(base_delay, previous_delay * 3)
Each delay is based on the previous one, not the attempt number. AWS recommends this approach for its simplicity and effectiveness.
Which jitter to use? Full jitter provides the best theoretical distribution. In practice, any jitter is dramatically better than none.
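The three variants can be sketched in a few lines of Python. The function names and default values are illustrative; each returns a delay in seconds:

```python
import random

def full_jitter(attempt, base=1.0, cap=30.0):
    # Uniform over the entire exponential window
    return random.uniform(0, min(cap, base * 2 ** attempt))

def equal_jitter(attempt, base=1.0, cap=30.0):
    # Guarantees at least half the exponential delay
    half = min(cap, base * 2 ** attempt) / 2
    return half + random.uniform(0, half)

def decorrelated_jitter(previous_delay, base=1.0, cap=30.0):
    # Based on the previous delay, not the attempt number
    return min(cap, random.uniform(base, previous_delay * 3))
```

For decorrelated jitter, seed the first call with `base` and feed each returned delay back in as `previous_delay`.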
Retry budget — preventing cascade failures
A retry budget limits the total number of retries across all requests in a given time window, rather than per-request.
Rule: no more than 10% of requests may be retries
Current traffic: 1,000 requests/second
Retry budget: 100 retries/second
Why this matters: per-request retry limits (e.g., "retry 3 times") can still overwhelm a service. If every request retries 3 times during an outage, the failing service sees 4x its normal load. A retry budget caps the total additional load.
Implementation

import time

class RetryBudget:
    def __init__(self, ratio=0.1, window_seconds=10):
        self.ratio = ratio
        self.window_seconds = window_seconds
        self.requests = []
        self.retries = []

    def can_retry(self):
        now = time.time()
        cutoff = now - self.window_seconds
        # Clean old entries
        self.requests = [t for t in self.requests if t > cutoff]
        self.retries = [t for t in self.retries if t > cutoff]
        # Check budget
        max_retries = len(self.requests) * self.ratio
        return len(self.retries) < max_retries

    def record_request(self):
        self.requests.append(time.time())

    def record_retry(self):
        self.retries.append(time.time())
Google SRE recommends a retry budget of 10% as a starting point.
Circuit breaker integration
A circuit breaker stops retries entirely when a downstream service is clearly down, preventing wasted resources and cascading failures.
States
- Closed — requests flow normally. Failures are counted.
- Open — the failure threshold is exceeded. All requests fail immediately without hitting the downstream service. A timer starts.
- Half-open — after the timer expires, a limited number of test requests are allowed through. If they succeed, the circuit closes. If they fail, it reopens.
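The state machine can be sketched as a small Python class. This is a minimal sketch: the names `CircuitBreaker`, `failure_threshold`, and `reset_timeout` are illustrative, and a production breaker would also limit how many half-open trial requests pass through concurrently:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def is_open(self):
        if self.opened_at is None:
            return False  # closed: requests flow normally
        if time.time() - self.opened_at >= self.reset_timeout:
            return False  # half-open: allow a trial request through
        return True       # open: fail fast

    def record_success(self):
        # A success closes the circuit and resets the failure count
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()  # trip the breaker
```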
Combining with exponential backoff

if circuit_breaker.is_open():
    return FAIL_FAST  # no retry, no backoff

try:
    response = make_request()
    circuit_breaker.record_success()
    return response
except TransientError:
    circuit_breaker.record_failure()
    if retry_budget.can_retry():
        delay = backoff_with_jitter(attempt)
        sleep(delay)
        retry()
    else:
        return FAIL
The circuit breaker makes the macro decision (is the service healthy?), while exponential backoff handles the micro decision (how long to wait between attempts).
Idempotency — the prerequisite for safe retries
Retries are only safe when executing the operation multiple times has the same effect as executing it once. This property is idempotency.
Idempotent operations:
- SET status = 'shipped' — same result regardless of how many times it runs
- PUT /orders/123 with a complete resource body — replaces the resource identically each time
- Reads (GET, SELECT) — inherently idempotent
Non-idempotent operations:
- INSERT INTO ledger (amount) VALUES (100) — creates a duplicate entry on retry
- POST /charges without a deduplication key — charges the customer twice
- balance = balance + 100 — increments on every retry
Idempotency keys
For non-idempotent operations, attach a unique key to each request. The server stores the key and its result. On retry, the server returns the stored result instead of re-executing.
POST /charges
Idempotency-Key: req_abc123
Content-Type: application/json
{"amount": 5000, "currency": "usd"}
The server checks: has req_abc123 been processed before? If yes, return the cached response. If no, process it and store the result keyed by req_abc123.
Key storage: use a database table or Redis with a TTL (e.g., 24 hours). The TTL prevents unbounded storage growth.
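The server-side check can be sketched with an in-memory store. This is illustrative only — the class name and shape are assumptions, and a production system would use a database table or Redis so keys survive restarts and are shared across instances:

```python
import time

class IdempotencyStore:
    def __init__(self, ttl_seconds=86400):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, response)

    def execute(self, key, operation):
        now = time.time()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # seen before: replay the stored response
        response = operation()  # first time: run the real operation
        self._store[key] = (now + self.ttl, response)
        return response
```

A retried request with the same key gets the original response back and the operation runs only once.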
Dead letter queues for failed retries
When all retries are exhausted, the message must go somewhere. A dead letter queue (DLQ) captures failed messages for later investigation and reprocessing.
Producer → Main Queue → Consumer (fails)
↓ (after max retries)
Dead Letter Queue
↓
Alert + Dashboard
↓
Manual review or automated reprocessing
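The routing step can be sketched as a consumer wrapper. The name `consume_with_dlq` is illustrative, and a real system would publish to a queue rather than append to a list, but the flow is the same:

```python
def consume_with_dlq(message, handler, max_attempts, dlq):
    """Try the handler up to max_attempts times; on exhaustion,
    route the message plus full failure context to the DLQ."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(message)
        except Exception as exc:
            errors.append(f"attempt {attempt}: {exc}")
    # Preserve context: original message, errors, attempt count
    dlq.append({"message": message, "errors": errors,
                "attempts": max_attempts})
    return None
```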
DLQ best practices
- Preserve context — include the original message, error details, attempt count, and timestamps
- Alert on DLQ depth — a growing DLQ indicates a systemic issue, not just transient failures
- Automate reprocessing — build tooling to replay DLQ messages back to the main queue after the root cause is fixed
- Set a DLQ retention policy — messages older than N days should be archived or purged
- Monitor separately — DLQ metrics (depth, age of oldest message, ingestion rate) deserve their own dashboard
Poison messages
Some messages will never succeed regardless of retries — malformed payloads, references to deleted entities, schema violations. These are poison messages.
Detect them early: if a message has failed N times with the same non-transient error, route it to a separate poison message queue and alert the engineering team. Do not let it consume retry budget.
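The detection rule can be sketched as a small classifier. This is illustrative: which exception types count as transient depends on your client library, and the threshold is an assumption:

```python
def classify_failure(error, attempt, max_attempts=3):
    """Route permanent errors straight to the poison queue;
    retry transient errors until attempts are exhausted, then DLQ."""
    TRANSIENT = (TimeoutError, ConnectionError)
    if not isinstance(error, TRANSIENT):
        return "poison"  # malformed payload, schema violation, etc.
    if attempt < max_attempts:
        return "retry"
    return "dlq"
```

Routing permanent errors out on the first failure means they never consume retry budget.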
Putting it all together
A production retry strategy combines all these components:
- Classify the error — transient (retry) or permanent (fail fast to DLQ)
- Check the circuit breaker — is the downstream service healthy?
- Check the retry budget — has the system-wide retry limit been reached?
- Calculate delay — exponential backoff with jitter, capped at max delay
- Ensure idempotency — attach an idempotency key if the operation is not naturally idempotent
- Execute the retry — make the request
- On final failure — route to the dead letter queue with full context
Explore retry patterns visually
On Codelit, generate a microservice architecture with retry and circuit breaker flows to see how exponential backoff, retry budgets, and dead letter queues protect your system during failures.
This is article #312 in the Codelit engineering blog series.
Build and explore resilient system architectures visually at codelit.io.