# Design Patterns for Distributed Systems — Retry, Circuit Breaker, and More
## Failure is not the exception — it's the norm
In a monolith, a function call either works or throws an exception. In distributed systems, a call can succeed, fail, time out, succeed but the response is lost, or succeed partially. Your system must handle all of these.
These patterns are your toolkit.
## Retry with exponential backoff
When a request fails, try again. But not immediately — wait longer between each attempt.
- Attempt 1 fails: wait 100ms
- Attempt 2 fails: wait 200ms
- Attempt 3 fails: wait 400ms
- Attempt 4 fails: give up
Add jitter: Random variation prevents all clients from retrying at the exact same time (thundering herd).
Only retry on transient failures. A 500 (server error) is retryable. A 400 (bad request) is not — your data is wrong, retrying won't help.
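The schedule above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's API: the `retry_with_backoff` helper, its parameters, and the use of `ConnectionError` as the stand-in for a transient failure are all assumptions for the example.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=0.1):
    """Retry a callable on transient failures with exponential backoff and jitter.

    ConnectionError stands in for a retryable failure (e.g. a 500);
    anything else (e.g. a 400-style ValueError) propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: give up
            # Exponential backoff: 100ms, 200ms, 400ms, ...
            delay = base_delay * (2 ** (attempt - 1))
            # Full jitter: sleep a random fraction of the delay so that
            # many clients don't retry in lockstep (thundering herd).
            time.sleep(random.uniform(0, delay))
```

Note the jitter: sleeping a uniform random fraction of the backoff window is what desynchronizes clients after a shared outage.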
## Circuit breaker
Stop calling a service that's clearly down. Like an electrical circuit breaker that trips to prevent damage.
Three states:
- Closed (normal): Requests pass through. Track failure rate.
- Open (tripped): All requests fail immediately. No calls to the broken service.
- Half-open (testing): Allow one request through. If it succeeds, close the circuit. If it fails, reopen.
Why it matters: Without a circuit breaker, a slow downstream service leaves your threads piling up as they wait for responses, until your own service exhausts its resources and goes down too.
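The three states can be sketched as a small state machine. This is a minimal illustration under assumed parameters (`failure_threshold`, `reset_timeout`), not a production implementation or a real library's API:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds before probing again
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # allow one probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe in half-open, or too many failures in closed, trips it.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()

    def _on_success(self):
        self.failures = 0
        self.state = "closed"
```

A real implementation would also track a failure *rate* over a sliding window rather than a raw count, but the state transitions are the same.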
## Bulkhead
Isolate failures so they don't cascade. Named after ship bulkheads that contain flooding.
Implementation: Separate thread pools or connection pools for different downstream services. If Service A's pool is exhausted (because A is slow), Service B's pool is unaffected.
Example: Your payment service and email service share a thread pool. Payments are critical; email is not. If the email provider is slow and consumes all threads, payments stop working. Bulkheads prevent this.
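In Python, one way to sketch this is a bounded thread pool per downstream dependency. The pool sizes and the `charge_card`/`send_receipt` stand-ins are hypothetical, chosen only to mirror the payments-vs-email example:

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream dependency: even if every email worker
# is blocked on a slow provider, payments still get threads.
payment_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="payments")
email_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="email")

def charge_card(order_id):
    return f"charged:{order_id}"   # stand-in for the real payment call

def send_receipt(order_id):
    return f"emailed:{order_id}"   # stand-in for the real email call

payment_future = payment_pool.submit(charge_card, 42)
email_future = email_pool.submit(send_receipt, 42)
```

The same idea applies to connection pools, semaphores, or separate container replicas; the point is that each dependency draws from its own bounded resource.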
## Timeout
Always set timeouts. A missing timeout turns a slow dependency into your own outage.
Guidelines:
- Set timeouts based on p99 latency, not the average
- Use different timeouts for different operations (e.g., reads: 1s, writes: 5s)
- Keep each call's timeout shorter than the circuit breaker's detection window, so hung calls register as failures
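A minimal sketch with `concurrent.futures`: the caller gives a read a hard budget and falls back when it is exceeded. The sleep and the tiny budget are illustrative stand-ins for a real slow dependency and a real p99-derived timeout:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

pool = ThreadPoolExecutor(max_workers=4)

def fetch_profile():
    time.sleep(0.5)  # stand-in for a slow downstream read
    return {"id": 1}

future = pool.submit(fetch_profile)
try:
    # Hard budget for the call (illustrative; derive the real one from p99)
    profile = future.result(timeout=0.05)
except TimeoutError:
    profile = None  # fall back: serve stale data or surface an error
```

The key property: the caller's latency is bounded by the timeout, not by the dependency.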
## Sidecar pattern
Attach a helper process alongside your main service. The sidecar handles cross-cutting concerns.
Common uses:
- Service mesh proxy (Envoy): handles mTLS, load balancing, retries
- Log collector (Fluentd): ships logs to central storage
- Config agent: watches for config changes and reloads
Why sidecars: You get consistent behavior across services without modifying each service's code. Deploy once, apply everywhere.
## Strangler fig
Gradually replace a legacy system without a big-bang rewrite.
- New requests go to the new system
- Old requests still go to the legacy system
- Gradually migrate routes from old to new
- When everything is migrated, remove the old system
Named after: Strangler fig trees that grow around a host tree, eventually replacing it entirely.
Why it works: No risky big-bang migration. You can pause, rollback, or take years. Each step is independently testable.
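The route-by-route migration can be sketched as a thin routing facade in front of both systems. The route names and the two handlers are hypothetical; the point is that migrating a route is a one-line change to the set:

```python
# Routes already moved to the new system; grows one route at a time.
MIGRATED_ROUTES = {"/orders", "/inventory"}

def handle_legacy(path):
    return f"legacy:{path}"   # stand-in for forwarding to the legacy system

def handle_new(path):
    return f"new:{path}"      # stand-in for forwarding to the new system

def route(path):
    handler = handle_new if path in MIGRATED_ROUTES else handle_legacy
    return handler(path)
```

Rolling back a route is equally cheap: remove it from the set and traffic flows to the legacy system again.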
## Ambassador pattern
A proxy that handles outbound connections from your service. Like a sidecar, but specifically for outbound traffic.
Uses: Retry logic, circuit breaking, logging, and monitoring — all without modifying the service.
## Choosing patterns
| Problem | Pattern |
|---|---|
| Transient failures | Retry with backoff + jitter |
| Downstream outage | Circuit breaker |
| Cascading failures | Bulkhead |
| Slow dependencies | Timeout |
| Cross-cutting concerns | Sidecar |
| Legacy migration | Strangler fig |
| Outbound connection management | Ambassador |
Most production systems use retry + circuit breaker + timeout as a baseline. Add bulkhead and sidecar as complexity grows.
## See patterns in your architecture
On Codelit, generate any microservices system and click the edges between services. The audit tool identifies where retries, circuit breakers, and timeouts should be applied based on the data flow patterns.
Design resilient systems: describe your architecture on Codelit.io and audit the connections between services.