Circuit Breaker Pattern & Resilience Patterns for Distributed Systems
Circuit Breaker Pattern#
In distributed systems, failures cascade. One slow service causes thread pool exhaustion in the caller, which starves other callers, which collapses the entire system. The circuit breaker pattern stops cascading failures before they start.
Why You Need Circuit Breakers#
Without protection:
Payment Service → Inventory Service (down)
→ Thread hangs for 30s timeout
→ 100 concurrent requests × 30s = thread pool exhausted
→ Payment Service stops responding
→ Checkout Service stops responding
→ Entire platform down
With a circuit breaker:
Payment Service → Circuit Breaker → Inventory Service (down)
→ 5 failures detected → circuit OPENS
→ Subsequent calls fail immediately (< 1ms)
→ Payment Service stays healthy
→ Returns fallback: "Inventory check pending"
The Three States#
A circuit breaker is a state machine with three states:
┌──────────┐   failures > threshold   ┌──────────┐
│  CLOSED  │ ───────────────────────→ │   OPEN   │
│ (normal) │                          │ (failing)│
└──────────┘                          └──────────┘
      ↑                                     │
      │ success                        wait timeout
      │         ┌───────────┐               │
      └──────── │ HALF-OPEN │ ←─────────────┘
                │ (testing) │
                └───────────┘
                      │
                      │ failure
                      └──→ back to OPEN
- Closed: Requests flow normally. Failures are counted. When failures exceed a threshold within a time window, the circuit opens.
- Open: All requests fail immediately without calling the downstream service. After a configured wait duration, the circuit moves to half-open.
- Half-Open: A limited number of trial requests are allowed through. If they succeed, the circuit closes. If they fail, it reopens.
Implementation From Scratch#
Here is a basic circuit breaker in TypeScript:
type State = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: State = "CLOSED";
  private failureCount = 0;
  private lastFailureTime = 0;
  private halfOpenCalls = 0;

  constructor(
    private failureThreshold: number = 5,
    private resetTimeoutMs: number = 30_000,
    private halfOpenMaxCalls: number = 3
  ) {}

  async call<T>(fn: () => Promise<T>, fallback?: () => T): Promise<T> {
    if (this.state === "OPEN") {
      if (Date.now() - this.lastFailureTime > this.resetTimeoutMs) {
        // Wait period elapsed: allow a limited number of trial calls.
        this.state = "HALF_OPEN";
        this.halfOpenCalls = 0;
      } else {
        if (fallback) return fallback();
        throw new Error("Circuit is OPEN");
      }
    }

    if (this.state === "HALF_OPEN") {
      if (this.halfOpenCalls >= this.halfOpenMaxCalls) {
        // Trial budget spent; fail fast until a trial settles the state.
        if (fallback) return fallback();
        throw new Error("Circuit is HALF_OPEN, trial limit reached");
      }
      this.halfOpenCalls++;
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      if (fallback) return fallback();
      throw error;
    }
  }

  private onSuccess() {
    this.failureCount = 0;
    this.state = "CLOSED";
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.failureThreshold) {
      this.state = "OPEN";
    }
  }
}
Usage:
const breaker = new CircuitBreaker(5, 30_000);

const result = await breaker.call(
  () => fetch("https://inventory-service/api/stock/item-42").then((r) => r.json()),
  () => ({ status: "unknown", message: "Inventory check pending" })
);
Bulkhead Pattern#
Circuit breakers protect against cascading failures. Bulkheads isolate failures so one bad dependency does not consume all resources.
Thread Pool A (10 threads) → Payment Service
Thread Pool B (10 threads) → Inventory Service ← this one is slow
Thread Pool C (10 threads) → Notification Service
Inventory is slow → Pool B exhausted → Pools A and C unaffected
Without bulkheads, all services share one thread pool, and one slow dependency drains everything.
Semaphore Bulkhead#
Lighter than thread pools — limits concurrency with a counter:
class Bulkhead {
  private active = 0;

  constructor(private maxConcurrent: number = 10) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      throw new Error("Bulkhead full");
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
    }
  }
}
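The bulkhead above fails fast when full. A common variant queues excess callers instead, trading latency for fewer rejections; which trade is right depends on your latency budget. A sketch (`QueuedBulkhead` is an illustrative name, not a standard API):

```typescript
// A queueing bulkhead: up to maxConcurrent calls run at once, and
// excess callers wait in FIFO order instead of failing immediately.
class QueuedBulkhead {
  private active = 0;
  private waiting: Array<() => void> = [];

  constructor(private maxConcurrent: number = 10) {}

  private async acquire(): Promise<void> {
    if (this.active < this.maxConcurrent) {
      this.active++;
      return;
    }
    // Park this caller until release() hands it the freed slot.
    await new Promise<void>((resolve) => this.waiting.push(resolve));
  }

  private release(): void {
    const next = this.waiting.shift();
    if (next) {
      next(); // pass the slot directly to the next waiter
    } else {
      this.active--;
    }
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await fn();
    } finally {
      this.release();
    }
  }
}
```

A real implementation would also cap the queue length; an unbounded queue just moves resource exhaustion from threads to memory.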
Retry with Exponential Backoff#
Retries handle transient failures. Backoff prevents retry storms:
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries: number = 3,
  baseDelayMs: number = 1000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries) throw error;
      const jitter = Math.random() * 500;
      const delay = baseDelayMs * Math.pow(2, attempt) + jitter;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw new Error("Unreachable");
}
Key principles:
- Exponential backoff: 1s, 2s, 4s, 8s between retries
- Jitter: Random offset prevents thundering herd
- Max retries: Always cap retries to prevent infinite loops
- Idempotency: Only retry operations that are safe to repeat
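One way to enforce the idempotency rule is to let the caller supply a predicate that decides whether a failure is safe to retry. This is a sketch; `retryIf` and `isRetryable` are illustrative names, not part of the helper above:

```typescript
// Retry only when the caller's predicate says the failure is safe to
// retry (e.g. a network error on an idempotent GET, never a POST that
// may already have been applied downstream).
async function retryIf<T>(
  fn: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  maxRetries: number = 3,
  baseDelayMs: number = 1000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Give up immediately on non-retryable failures (e.g. a 4xx
      // response) and on the final attempt.
      if (attempt === maxRetries || !isRetryable(error)) throw error;
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 500;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw new Error("Unreachable");
}
```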
Timeout Pattern#
Every external call needs a timeout. No exceptions.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timeout after ${ms}ms`)), ms);
  });
  // Clear the timer either way so it cannot outlive the race.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage
const data = await withTimeout(fetch("/api/inventory"), 5000);
Set timeouts based on P99 latency, not average. If P99 is 800ms, a 2s timeout is reasonable. A 30s timeout is almost never correct.
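One caveat: `Promise.race` only abandons the result, while the underlying operation keeps running. For `fetch`, passing an `AbortSignal` cancels the request itself. The sketch below assumes a runtime with the standard `AbortController` global; `withAbortTimeout` is an illustrative name:

```typescript
// Cancel the underlying operation when the timeout fires, rather than
// merely ignoring its eventual result.
async function withAbortTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms: number
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer); // avoid a stray timer on the success path
  }
}

// Usage with fetch:
// const res = await withAbortTimeout((signal) => fetch(url, { signal }), 5000);
```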
Fallback Strategies#
When a service fails, what do you return?
| Strategy | Description | Example |
|---|---|---|
| Cached value | Return last known good response | Product price from cache |
| Default value | Return a sensible static value | "Inventory: check back later" |
| Degraded response | Return partial data | Show product without reviews |
| Queue for later | Accept the request, process async | Place order, verify inventory later |
| Redirect | Route to a backup service | Failover to secondary region |
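The first row of the table can be sketched as a small wrapper that remembers the last good response. The single in-memory slot here is an illustrative simplification; a real cache would have a TTL and likely shared storage:

```typescript
// Serve the last known good value when the live call fails.
class CachedFallback<T> {
  private lastGood?: T;

  async call(fn: () => Promise<T>, defaultValue: T): Promise<T> {
    try {
      const value = await fn();
      this.lastGood = value; // remember the latest success
      return value;
    } catch {
      // Prefer stale data over an error; use the static default only
      // when nothing has ever succeeded.
      return this.lastGood ?? defaultValue;
    }
  }
}
```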
Tools and Libraries#
Resilience4j (Java/Kotlin)#
The modern standard for JVM resilience. Modular, lightweight, functional.
CircuitBreaker breaker = CircuitBreaker.ofDefaults("inventory");
Retry retry = Retry.ofDefaults("inventory");
Bulkhead bulkhead = Bulkhead.ofDefaults("inventory");

Supplier<String> decorated = Decorators.ofSupplier(() -> inventoryService.check())
    .withCircuitBreaker(breaker)
    .withBulkhead(bulkhead)
    .withRetry(retry)
    .withFallback(List.of(CallNotPermittedException.class),
        e -> "Inventory unavailable")
    .decorate();
Polly (.NET)#
The go-to for .NET resilience. Fluent API, policy composition.
var policy = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30)
    );

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt =>
        TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var combined = Policy.WrapAsync(retryPolicy, policy);
Hystrix (Legacy)#
Netflix Hystrix pioneered circuit breakers in microservices but is now in maintenance mode. Migrate to Resilience4j for new projects.
Chaos Engineering#
Resilience patterns are only as good as your testing. Chaos engineering verifies that your circuit breakers, bulkheads, and fallbacks actually work in production.
Principles:
- Define steady state (normal response times, error rates)
- Hypothesize that steady state holds during failure
- Inject real-world failures (network latency, service crashes, disk full)
- Observe the difference
Tools:
- Chaos Monkey (Netflix) — randomly terminates instances
- Litmus (Kubernetes-native) — pod, network, and node chaos
- Gremlin — managed chaos-as-a-service
- Toxiproxy — simulate network conditions locally
Start small: inject 100ms latency on one service in staging. If your circuit breaker does not trip and your timeouts do not fire, you have a problem to fix before production.
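For a first experiment you do not even need a proxy: a sketch like the following injects latency in process, which is enough to check locally that a timeout or breaker actually trips (`injectLatency` is an illustrative helper, not a chaos tool):

```typescript
// Wrap any async call with artificial delay so you can verify that
// timeouts fire and circuit breakers trip before trying it for real.
function injectLatency<T>(
  fn: () => Promise<T>,
  delayMs: number
): () => Promise<T> {
  return async () => {
    await new Promise((r) => setTimeout(r, delayMs));
    return fn();
  };
}
```

Wrapping a client call with `injectLatency(call, 100)` in staging should make a 50 ms timeout reject every time; if it does not, the timeout is not wired where you think it is.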
Combining Patterns#
In practice, you layer these patterns:
Request
→ Timeout (5s)
→ Retry (3 attempts, exponential backoff)
→ Circuit Breaker (opens after 5 failures)
→ Bulkhead (max 10 concurrent)
→ Actual HTTP call
→ Fallback on any failure
Order matters. The timeout wraps everything. Retries happen inside the timeout budget. The circuit breaker tracks failures across retries. The bulkhead limits concurrency to the downstream service.
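The layering can be sketched as function decorators, shown here with just the timeout and retry layers for brevity. `makeResilient`, `timeoutLayer`, and `retryLayer` are illustrative names; the circuit breaker and bulkhead from earlier sections would wrap the raw call the same way:

```typescript
type AsyncFn<T> = () => Promise<T>;

// Outermost layer: a total time budget for the whole attempt sequence.
function timeoutLayer<T>(fn: AsyncFn<T>, ms: number): AsyncFn<T> {
  return () => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error(`Timeout after ${ms}ms`)), ms);
    });
    return Promise.race([fn(), timeout]).finally(() => clearTimeout(timer));
  };
}

// Inner layer: retries with exponential backoff inside the budget.
function retryLayer<T>(fn: AsyncFn<T>, maxRetries: number): AsyncFn<T> {
  return async () => {
    for (let attempt = 0; ; attempt++) {
      try {
        return await fn();
      } catch (error) {
        if (attempt === maxRetries) throw error;
        await new Promise((r) => setTimeout(r, 2 ** attempt * 100));
      }
    }
  };
}

// Layered as in the diagram: timeout(retry(call)). The circuit breaker
// and bulkhead would slot in between the retry layer and the HTTP call.
function makeResilient<T>(
  rawCall: AsyncFn<T>,
  timeoutMs: number = 5000,
  maxRetries: number = 3
): AsyncFn<T> {
  return timeoutLayer(retryLayer(rawCall, maxRetries), timeoutMs);
}
```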
Key Takeaways#
- Circuit breakers prevent cascading failures by failing fast
- The three states (closed, open, half-open) provide automatic recovery
- Combine with bulkheads, retries, timeouts, and fallbacks for full resilience
- Use Resilience4j (JVM), Polly (.NET), or build your own for other stacks
- Chaos engineering validates that your resilience patterns actually work
- Always set timeouts based on P99 latency, not averages
Design resilient architectures with codelit.io — the all-in-one workspace for engineering teams.
Article 162 on the Codelit engineering blog.