Circuit Breaker Pattern & Resilience Patterns for Distributed Systems
Circuit Breaker Pattern#
In distributed systems, failures cascade. One slow service causes thread pool exhaustion in the caller, which starves other callers, which collapses the entire system. The circuit breaker pattern stops cascading failures before they start.
Why You Need Circuit Breakers#
Without protection:
Payment Service → Inventory Service (down)
→ Thread hangs for 30s timeout
→ 100 concurrent requests × 30s = thread pool exhausted
→ Payment Service stops responding
→ Checkout Service stops responding
→ Entire platform down
With a circuit breaker:
Payment Service → Circuit Breaker → Inventory Service (down)
→ 5 failures detected → circuit OPENS
→ Subsequent calls fail immediately (< 1ms)
→ Payment Service stays healthy
→ Returns fallback: "Inventory check pending"
The Three States#
A circuit breaker is a state machine with three states:
┌──────────┐   failures > threshold   ┌──────────┐
│  CLOSED  │ ───────────────────────→ │   OPEN   │
│ (normal) │                          │ (failing)│
└──────────┘                          └──────────┘
      ↑                                     │
      │ success                        wait timeout
      │         ┌───────────┐               │
      └──────── │ HALF-OPEN │ ←─────────────┘
                │ (testing) │
                └───────────┘
                      │
                      │ failure
                      └──→ back to OPEN
- Closed: Requests flow normally. Failures are counted. When failures exceed a threshold within a time window, the circuit opens.
- Open: All requests fail immediately without calling the downstream service. After a configured wait duration, the circuit moves to half-open.
- Half-Open: A limited number of trial requests are allowed through. If they succeed, the circuit closes. If they fail, it reopens.
Implementation From Scratch#
Here is a basic circuit breaker in TypeScript:
type State = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: State = "CLOSED";
  private failureCount = 0;
  private lastFailureTime = 0;
  private halfOpenCalls = 0;

  constructor(
    private failureThreshold: number = 5,
    private resetTimeoutMs: number = 30_000,
    private halfOpenMaxCalls: number = 3
  ) {}

  async call<T>(fn: () => Promise<T>, fallback?: () => T): Promise<T> {
    if (this.state === "OPEN") {
      if (Date.now() - this.lastFailureTime > this.resetTimeoutMs) {
        // Wait period elapsed: allow a limited number of trial calls.
        this.state = "HALF_OPEN";
        this.halfOpenCalls = 0;
      } else {
        if (fallback) return fallback();
        throw new Error("Circuit is OPEN");
      }
    }

    if (this.state === "HALF_OPEN") {
      if (this.halfOpenCalls >= this.halfOpenMaxCalls) {
        // Trial budget spent; fail fast until a trial settles the state.
        if (fallback) return fallback();
        throw new Error("Circuit is HALF_OPEN, trial limit reached");
      }
      this.halfOpenCalls++;
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      if (fallback) return fallback();
      throw error;
    }
  }

  private onSuccess() {
    this.failureCount = 0;
    this.state = "CLOSED";
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.failureThreshold) {
      this.state = "OPEN";
    }
  }
}
Usage:
const breaker = new CircuitBreaker(5, 30_000);

const result = await breaker.call(
  () => fetch("https://inventory-service/api/stock/item-42").then((r) => r.json()),
  () => ({ status: "unknown", message: "Inventory check pending" })
);
Bulkhead Pattern#
Circuit breakers protect against cascading failures. Bulkheads isolate failures so one bad dependency does not consume all resources.
Thread Pool A (10 threads) → Payment Service
Thread Pool B (10 threads) → Inventory Service ← this one is slow
Thread Pool C (10 threads) → Notification Service
Inventory is slow → Pool B exhausted → Pools A and C unaffected
Without bulkheads, all services share one thread pool, and one slow dependency drains everything.
Semaphore Bulkhead#
Lighter than thread pools — limits concurrency with a counter:
class Bulkhead {
  private active = 0;

  constructor(private maxConcurrent: number = 10) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      throw new Error("Bulkhead full");
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
    }
  }
}
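The bulkhead above fails fast when full. A common variant queues excess callers instead, trading latency for fewer rejections; which trade is right depends on your latency budget. A sketch (`QueuedBulkhead` is an illustrative name, not a standard API):

```typescript
// A queueing bulkhead: up to maxConcurrent calls run at once, and
// excess callers wait in FIFO order instead of failing immediately.
class QueuedBulkhead {
  private active = 0;
  private waiting: Array<() => void> = [];

  constructor(private maxConcurrent: number = 10) {}

  private async acquire(): Promise<void> {
    if (this.active < this.maxConcurrent) {
      this.active++;
      return;
    }
    // Park this caller until release() hands it the freed slot.
    await new Promise<void>((resolve) => this.waiting.push(resolve));
  }

  private release(): void {
    const next = this.waiting.shift();
    if (next) {
      next(); // pass the slot directly to the next waiter
    } else {
      this.active--;
    }
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await fn();
    } finally {
      this.release();
    }
  }
}
```

A real implementation would also cap the queue length; an unbounded queue just moves resource exhaustion from threads to memory.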
Retry with Exponential Backoff#
Retries handle transient failures. Backoff prevents retry storms:
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries: number = 3,
  baseDelayMs: number = 1000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries) throw error;
      const jitter = Math.random() * 500;
      const delay = baseDelayMs * Math.pow(2, attempt) + jitter;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw new Error("Unreachable");
}
Key principles:
- Exponential backoff: 1s, 2s, 4s, 8s between retries
- Jitter: Random offset prevents thundering herd
- Max retries: Always cap retries to prevent infinite loops
- Idempotency: Only retry operations that are safe to repeat
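One way to enforce the idempotency rule is to let the caller supply a predicate that decides whether a failure is safe to retry. This is a sketch; `retryIf` and `isRetryable` are illustrative names, not part of the helper above:

```typescript
// Retry only when the caller's predicate says the failure is safe to
// retry (e.g. a network error on an idempotent GET, never a POST that
// may already have been applied downstream).
async function retryIf<T>(
  fn: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  maxRetries: number = 3,
  baseDelayMs: number = 1000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Give up immediately on non-retryable failures (e.g. a 4xx
      // response) and on the final attempt.
      if (attempt === maxRetries || !isRetryable(error)) throw error;
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 500;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw new Error("Unreachable");
}
```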
Timeout Pattern#
Every external call needs a timeout. No exceptions.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timeout after ${ms}ms`)), ms);
  });
  // Clear the timer either way so it cannot outlive the race.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage
const data = await withTimeout(fetch("/api/inventory"), 5000);
Set timeouts based on P99 latency, not average. If P99 is 800ms, a 2s timeout is reasonable. A 30s timeout is almost never correct.
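One caveat: `Promise.race` only abandons the result, while the underlying operation keeps running. For `fetch`, passing an `AbortSignal` cancels the request itself. The sketch below assumes a runtime with the standard `AbortController` global; `withAbortTimeout` is an illustrative name:

```typescript
// Cancel the underlying operation when the timeout fires, rather than
// merely ignoring its eventual result.
async function withAbortTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms: number
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer); // avoid a stray timer on the success path
  }
}

// Usage with fetch:
// const res = await withAbortTimeout((signal) => fetch(url, { signal }), 5000);
```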
Fallback Strategies#
When a service fails, what do you return?
| Strategy | Description | Example |
|---|---|---|
| Cached value | Return last known good response | Product price from cache |
| Default value | Return a sensible static value | "Inventory: check back later" |
| Degraded response | Return partial data | Show product without reviews |
| Queue for later | Accept the request, process async | Place order, verify inventory later |
| Redirect | Route to a backup service | Failover to secondary region |
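The first row of the table can be sketched as a small wrapper that remembers the last good response. The single in-memory slot here is an illustrative simplification; a real cache would have a TTL and likely shared storage:

```typescript
// Serve the last known good value when the live call fails.
class CachedFallback<T> {
  private lastGood?: T;

  async call(fn: () => Promise<T>, defaultValue: T): Promise<T> {
    try {
      const value = await fn();
      this.lastGood = value; // remember the latest success
      return value;
    } catch {
      // Prefer stale data over an error; use the static default only
      // when nothing has ever succeeded.
      return this.lastGood ?? defaultValue;
    }
  }
}
```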
Tools and Libraries#
Resilience4j (Java/Kotlin)#
The modern standard for JVM resilience. Modular, lightweight, functional.
CircuitBreaker breaker = CircuitBreaker.ofDefaults("inventory");
Retry retry = Retry.ofDefaults("inventory");
Bulkhead bulkhead = Bulkhead.ofDefaults("inventory");

Supplier<String> decorated = Decorators.ofSupplier(() -> inventoryService.check())
    .withCircuitBreaker(breaker)
    .withBulkhead(bulkhead)
    .withRetry(retry)
    .withFallback(List.of(CallNotPermittedException.class),
        e -> "Inventory unavailable")
    .decorate();
Polly (.NET)#
The go-to for .NET resilience. Fluent API, policy composition.
var policy = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30)
    );

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt =>
        TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var combined = Policy.WrapAsync(retryPolicy, policy);
Hystrix (Legacy)#
Netflix Hystrix pioneered circuit breakers in microservices but is now in maintenance mode. Migrate to Resilience4j for new projects.
Chaos Engineering#
Resilience patterns are only as good as your testing. Chaos engineering verifies that your circuit breakers, bulkheads, and fallbacks actually work in production.
Principles:
- Define steady state (normal response times, error rates)
- Hypothesize that steady state holds during failure
- Inject real-world failures (network latency, service crashes, disk full)
- Observe the difference
Tools:
- Chaos Monkey (Netflix) — randomly terminates instances
- Litmus (Kubernetes-native) — pod, network, and node chaos
- Gremlin — managed chaos-as-a-service
- Toxiproxy — simulate network conditions locally
Start small: inject 100ms latency on one service in staging. If your circuit breaker does not trip and your timeouts do not fire, you have a problem to fix before production.
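For a first experiment you do not even need a proxy: a sketch like the following injects latency in process, which is enough to check locally that a timeout or breaker actually trips (`injectLatency` is an illustrative helper, not a chaos tool):

```typescript
// Wrap any async call with artificial delay so you can verify that
// timeouts fire and circuit breakers trip before trying it for real.
function injectLatency<T>(
  fn: () => Promise<T>,
  delayMs: number
): () => Promise<T> {
  return async () => {
    await new Promise((r) => setTimeout(r, delayMs));
    return fn();
  };
}
```

Wrapping a client call with `injectLatency(call, 100)` in staging should make a 50 ms timeout reject every time; if it does not, the timeout is not wired where you think it is.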
Combining Patterns#
In practice, you layer these patterns:
Request
→ Timeout (5s)
→ Retry (3 attempts, exponential backoff)
→ Circuit Breaker (opens after 5 failures)
→ Bulkhead (max 10 concurrent)
→ Actual HTTP call
→ Fallback on any failure
Order matters. The timeout wraps everything. Retries happen inside the timeout budget. The circuit breaker tracks failures across retries. The bulkhead limits concurrency to the downstream service.
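The layering can be sketched as function decorators, shown here with just the timeout and retry layers for brevity. `makeResilient`, `timeoutLayer`, and `retryLayer` are illustrative names; the circuit breaker and bulkhead from earlier sections would wrap the raw call the same way:

```typescript
type AsyncFn<T> = () => Promise<T>;

// Outermost layer: a total time budget for the whole attempt sequence.
function timeoutLayer<T>(fn: AsyncFn<T>, ms: number): AsyncFn<T> {
  return () => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error(`Timeout after ${ms}ms`)), ms);
    });
    return Promise.race([fn(), timeout]).finally(() => clearTimeout(timer));
  };
}

// Inner layer: retries with exponential backoff inside the budget.
function retryLayer<T>(fn: AsyncFn<T>, maxRetries: number): AsyncFn<T> {
  return async () => {
    for (let attempt = 0; ; attempt++) {
      try {
        return await fn();
      } catch (error) {
        if (attempt === maxRetries) throw error;
        await new Promise((r) => setTimeout(r, 2 ** attempt * 100));
      }
    }
  };
}

// Layered as in the diagram: timeout(retry(call)). The circuit breaker
// and bulkhead would slot in between the retry layer and the HTTP call.
function makeResilient<T>(
  rawCall: AsyncFn<T>,
  timeoutMs: number = 5000,
  maxRetries: number = 3
): AsyncFn<T> {
  return timeoutLayer(retryLayer(rawCall, maxRetries), timeoutMs);
}
```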
Key Takeaways#
- Circuit breakers prevent cascading failures by failing fast
- The three states (closed, open, half-open) provide automatic recovery
- Combine with bulkheads, retries, timeouts, and fallbacks for full resilience
- Use Resilience4j (JVM), Polly (.NET), or build your own for other stacks
- Chaos engineering validates that your resilience patterns actually work
- Always set timeouts based on P99 latency, not averages
Design resilient architectures with codelit.io — the all-in-one workspace for engineering teams.
Article 162 on the Codelit engineering blog.