Bulkhead Pattern: Isolate Failures Before They Spread
Ships have bulkheads — watertight compartments that prevent a single hull breach from sinking the entire vessel. The bulkhead pattern applies the same principle to software: isolate components so one failure cannot consume all resources.
The Problem Without Bulkheads#
API Gateway (shared thread pool: 200 threads)
→ /checkout (calls Payment Service)
→ /search (calls Search Service)
→ /profile (calls User Service)
Payment Service goes down:
→ /checkout requests hang, each holding a thread
→ 200 threads exhausted waiting on Payment Service
→ /search and /profile also fail (no threads available)
→ Complete system outage from one dependency failure
One slow dependency starves the entire system of resources. This is cascading failure.
The Solution: Resource Isolation#
Partition resources so each dependency gets its own isolated pool:
API Gateway
┌─────────────────────────────────────────┐
│ Checkout pool (80 threads) │
│ → Payment Service (down) │
│ → 80 threads blocked... but contained │
├─────────────────────────────────────────┤
│ Search pool (80 threads) │
│ → Search Service (healthy) │
│ → Serving requests normally ✓ │
├─────────────────────────────────────────┤
│ Profile pool (40 threads) │
│ → User Service (healthy) │
│ → Serving requests normally ✓ │
└─────────────────────────────────────────┘
Payment Service is down, but search and profile continue working. The blast radius is contained to the checkout pool.
Isolation Types#
Thread Pool Isolation#
Each dependency gets a dedicated thread pool. Requests execute on pool threads, not the caller's thread.
Caller thread → submits task to pool → pool thread executes
→ if pool full, reject immediately (fail fast)
Config:
pool-checkout: maxThreads=80, queueSize=20
pool-search: maxThreads=80, queueSize=50
pool-profile: maxThreads=40, queueSize=10
Pros: Full isolation, per-pool timeouts, queue overflow protection
Cons: Thread context-switching overhead, higher memory usage
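The config above can be sketched with plain `java.util.concurrent` executors (a minimal illustration; the class and pool names are not from any specific framework). A bounded queue plus `AbortPolicy` gives the fail-fast rejection described above:

```java
import java.util.concurrent.*;

public class ThreadPoolBulkhead {
    // One bounded, fixed-size pool per dependency. When both the pool and its
    // queue are full, submit() throws RejectedExecutionException immediately
    // instead of blocking the caller (fail fast).
    static ThreadPoolExecutor boundedPool(int maxThreads, int queueSize) {
        return new ThreadPoolExecutor(
                maxThreads, maxThreads,               // fixed-size pool
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),  // bounded queue
                new ThreadPoolExecutor.AbortPolicy()  // reject when saturated
        );
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor checkoutPool = boundedPool(80, 20);
        ThreadPoolExecutor searchPool = boundedPool(80, 50);

        // A hanging Payment Service can only tie up checkoutPool's threads;
        // searchPool keeps serving independently.
        Future<String> result = searchPool.submit(() -> "search ok");
        System.out.println(result.get());

        checkoutPool.shutdown();
        searchPool.shutdown();
    }
}
```

Because each pool is a separate object with its own queue, saturating one cannot consume capacity from the others.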
Semaphore Isolation#
Limits concurrent calls using a counter. The caller's own thread executes the call — no thread pool overhead.
Caller thread → acquire semaphore (if permits available)
→ execute call on same thread
→ release semaphore
→ if no permits → reject immediately
Config:
semaphore-checkout: maxConcurrent=80
semaphore-search: maxConcurrent=80
semaphore-profile: maxConcurrent=40
Pros: Lower overhead (no thread switching), simpler
Cons: No timeout enforcement (the caller's thread blocks), no queuing
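The flow above maps directly onto `java.util.concurrent.Semaphore` (a minimal sketch; the class name and fallback handling are illustrative, not from a specific library):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class SemaphoreBulkhead {
    private final Semaphore permits;

    public SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the task on the caller's own thread. tryAcquire() returns false
    // immediately when all permits are in use, so there is no queuing and
    // no extra threads — the rejected caller gets the fallback value.
    public <T> T call(Supplier<T> task, T rejectedFallback) {
        if (!permits.tryAcquire()) {
            return rejectedFallback;   // no permits: fail fast
        }
        try {
            return task.get();         // executes on the caller's thread
        } finally {
            permits.release();         // always return the permit
        }
    }
}
```

Note the trade-off in the code: if `task.get()` hangs, the caller's thread hangs with it — the semaphore only caps how many callers can be stuck at once.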
Process Isolation#
Run each component in a separate process, container, or VM. The operating system enforces isolation.
Container A: Checkout service (CPU: 2 cores, RAM: 4GB)
Container B: Search service (CPU: 4 cores, RAM: 8GB)
Container C: Profile service (CPU: 1 core, RAM: 2GB)
Container A crashes → B and C unaffected
Container A uses 100% CPU → B and C have their own CPU allocation
This is the strongest form of bulkhead — resource limits enforced by the OS/container runtime.
Resource Partitioning#
Bulkheads apply beyond threads. Partition any shared resource:
Database connections:
Checkout pool: max 20 connections
Search pool: max 30 connections
Profile pool: max 10 connections
Total DB pool: 60 connections
API rate limits:
Tenant A: 1000 req/s
Tenant B: 500 req/s
Tenant C: 200 req/s
(One tenant cannot starve others)
Message queue consumers:
Priority queue: 10 consumers
Standard queue: 5 consumers
Bulk queue: 2 consumers
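As a sketch, the per-tenant rate-limit partition above can be modeled with a fixed-window counter per tenant (the class name is illustrative; production limiters typically use token buckets or sliding windows):

```java
import java.util.HashMap;
import java.util.Map;

public class TenantRateLimiter {
    private final Map<String, Integer> limits;                 // req/s per tenant
    private final Map<String, Integer> counts = new HashMap<>();

    public TenantRateLimiter(Map<String, Integer> limits) {
        this.limits = limits;
    }

    // Called once per request within the current one-second window.
    // A tenant over its limit is rejected; other tenants are unaffected.
    public synchronized boolean tryAcquire(String tenant) {
        int used = counts.getOrDefault(tenant, 0);
        if (used >= limits.getOrDefault(tenant, 0)) return false;
        counts.put(tenant, used + 1);
        return true;
    }

    // Invoked by a timer at the start of each one-second window.
    public synchronized void resetWindow() {
        counts.clear();
    }
}
```

The bulkhead property comes from the per-tenant keys: Tenant A exhausting its 1000 req/s budget has no effect on the counters for B or C.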
Blast Radius Control#
The goal is to minimize the impact radius of any failure:
No bulkheads:
1 failure → entire system down
Blast radius: 100%
Service-level bulkheads:
1 service failure → that service degraded
Blast radius: ~20%
Cell-based architecture:
1 cell failure → only that cell's users affected
Blast radius: ~5%
Swim Lane Architecture#
Swim lanes are end-to-end vertical partitions of your infrastructure. Each swim lane contains all the services, databases, and queues needed for a set of requests.
Swim Lane A (Users 1-1M) Swim Lane B (Users 1M-2M)
┌─────────────────────┐ ┌─────────────────────┐
│ API Gateway A │ │ API Gateway B │
│ Order Service A │ │ Order Service B │
│ Payment Service A │ │ Payment Service B │
│ Database A │ │ Database B │
│ Kafka Cluster A │ │ Kafka Cluster B │
└─────────────────────┘ └─────────────────────┘
Swim Lane A failure → only Users 1-1M affected
Swim Lane B continues serving normally
No shared dependencies between lanes. This is the most aggressive form of bulkhead isolation.
Cell-Based Architecture#
Cells are the modern evolution of swim lanes, popularized by AWS. Each cell is a self-contained, independently deployable unit.
Router (thin, stateless)
│
├── Cell 1 (Region: us-east-1a)
│ All services + data for partition 1
│
├── Cell 2 (Region: us-east-1b)
│ All services + data for partition 2
│
└── Cell 3 (Region: us-west-2a)
All services + data for partition 3
Cell sizing: small enough that losing one is acceptable
Routing: hash(customer_id) → cell assignment
Properties:
- Independent deployment — deploy to one cell, canary test, then roll out
- Independent failure — one cell down affects only its partition
- Independent scaling — scale hot cells without scaling cold ones
- Blast radius — bounded to 1/N of total traffic
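The `hash(customer_id) → cell` assignment above can be sketched in a few lines (illustrative; a real router would also handle cell migration, weighting, and health checks):

```java
public class CellRouter {
    private final int cellCount;

    public CellRouter(int cellCount) {
        this.cellCount = cellCount;
    }

    // A stable hash of the customer id picks one of N cells, so the same
    // customer always lands in the same cell (and its data stays there).
    public int cellFor(String customerId) {
        // floorMod keeps the result non-negative even for negative hashCodes
        return Math.floorMod(customerId.hashCode(), cellCount);
    }
}
```

The router stays thin and stateless because the assignment is pure computation; no lookup table needs to survive a cell failure.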
Combined with Circuit Breaker#
Bulkheads and circuit breakers are complementary:
Request flow:
1. Bulkhead: "Is there capacity in this pool?" → Yes/No
2. Circuit breaker: "Is this dependency healthy?" → Open/Closed
3. Execute call (if both allow)
Bulkhead alone:
→ Limits concurrent calls but keeps sending to broken dependency
→ Threads still blocked until timeout
Circuit breaker alone:
→ Stops calling broken dependency but doesn't limit concurrency
→ Healthy-but-slow dependency can still exhaust resources
Both together:
→ Bulkhead limits concurrent calls
→ Circuit breaker fast-fails when dependency is down
→ Minimal resource waste
Tools and Implementation#
Resilience4j (Java)#
// Bulkhead configuration: at most 80 concurrent calls; callers wait up to
// 500 ms for a permit before being rejected
BulkheadConfig config = BulkheadConfig.custom()
    .maxConcurrentCalls(80)
    .maxWaitDuration(Duration.ofMillis(500))
    .build();
Bulkhead bulkhead = Bulkhead.of("checkout", config);

// Combined with a circuit breaker (circuitBreaker configured elsewhere).
// The fallback covers both rejection paths: BulkheadFullException from the
// bulkhead and CallNotPermittedException from an open breaker.
// PaymentResponse is an illustrative return type for paymentService.charge.
Supplier<PaymentResponse> decorated =
    Decorators.ofSupplier(() -> paymentService.charge(order))
        .withBulkhead(bulkhead)
        .withCircuitBreaker(circuitBreaker)
        .withFallback(
            List.of(BulkheadFullException.class, CallNotPermittedException.class),
            e -> fallbackResponse())
        .decorate();
Polly (.NET)#
// Bulkhead policy: 80 concurrent executions, 20 queued beyond that
var bulkhead = Policy.BulkheadAsync(
    maxParallelization: 80,
    maxQueuingActions: 20,
    onBulkheadRejectedAsync: context =>
    {
        logger.LogWarning("Bulkhead rejected request");
        return Task.CompletedTask;
    });

// Wrap with circuit breaker and retry (outermost policy first)
var policy = Policy.WrapAsync(bulkhead, circuitBreaker, retry);
await policy.ExecuteAsync(() => httpClient.GetAsync("/api/payment"));
Kubernetes Resource Limits#
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: checkout
    resources:
      requests:        # guaranteed minimum for scheduling
        cpu: "500m"
        memory: "512Mi"
      limits:          # hard ceiling enforced by the runtime
        cpu: "2000m"
        memory: "2Gi"
Sizing Bulkheads#
Formula:
Pool size = (requests/sec) x (avg latency in seconds) x (safety factor)
Example:
Checkout: 100 req/s x 0.5s avg latency x 1.5 = 75 threads
Search: 200 req/s x 0.2s avg latency x 1.5 = 60 threads
Monitor and adjust:
→ Pool utilization consistently above 80%? Increase size.
→ Pool utilization below 20%? Decrease size — you're wasting memory.
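The sizing formula can be captured as a one-line helper (a sketch; `ceil` rounds up so a fractional result never undersizes the pool):

```java
public class BulkheadSizing {
    // Pool size = (requests/sec) x (avg latency in seconds) x (safety factor)
    public static int poolSize(double reqPerSec, double avgLatencySec,
                               double safetyFactor) {
        return (int) Math.ceil(reqPerSec * avgLatencySec * safetyFactor);
    }
}
```

For the checkout example: `poolSize(100, 0.5, 1.5)` yields the 75 threads computed above.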
Anti-Patterns#
- Bulkheads too large — a pool of 500 threads provides no real isolation
- Bulkheads without timeouts — threads block forever, pool still exhausted
- Shared databases across lanes — defeats the purpose of swim lane isolation
- No monitoring on rejection rate — you need alerts when bulkheads start rejecting
Summary#
- Bulkheads isolate failures — one slow dependency cannot exhaust all resources
- Thread pool isolation provides the strongest in-process protection
- Semaphore isolation is lighter weight but offers no timeout enforcement
- Process/container isolation uses OS-level resource limits
- Swim lanes and cells extend bulkheads to entire infrastructure partitions
- Combine with circuit breakers — bulkheads limit concurrency, breakers stop futile calls
- Resilience4j and Polly provide production-ready bulkhead implementations