Circuit Breaker Implementation — State Machine, Failure Counting, Fallbacks, and Resilience4j
Why you need a circuit breaker#
Service A calls Service B. Service B's database is overloaded, so every request takes 30 seconds before timing out. Service A's thread pool fills up waiting. Service C calls Service A. Same thing happens. Within minutes, one slow database has cascaded into a full system outage.
A circuit breaker detects the failure pattern and stops making calls to the broken dependency. Instead of waiting 30 seconds per request, it fails immediately in milliseconds. The broken service gets breathing room to recover.
The state machine#
A circuit breaker has three states, like an electrical breaker that trips when current exceeds safe levels.
Closed (normal operation)#
Requests flow through to the downstream service. The breaker monitors every call and tracks failures. This is the initial state.
Open (tripped)#
The failure rate exceeded the threshold. The breaker rejects all requests immediately without calling the downstream service, and callers get a fallback response or a known exception instead. A wait timer starts.
Half-Open (testing recovery)#
The timer expired. The breaker allows a limited number of test requests through. If they succeed, the breaker transitions back to Closed. If they fail, it transitions back to Open and resets the timer.
[Closed] --failure threshold exceeded--> [Open]
[Open] --wait duration expires--> [Half-Open]
[Half-Open] --test requests succeed--> [Closed]
[Half-Open] --test requests fail--> [Open]
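The four transitions above can be sketched as a small event-driven enum. This is an illustrative model only: the names `BreakerState`, `BreakerEvent`, and `on` are hypothetical, and the sketch leaves out failure counting and timers.

```java
// Hypothetical event names, one per arrow in the transition list above.
enum BreakerEvent { THRESHOLD_EXCEEDED, WAIT_EXPIRED, TRIAL_SUCCEEDED, TRIAL_FAILED }

enum BreakerState {
    CLOSED, OPEN, HALF_OPEN;

    // Returns the next state; events that do not apply to the current state are ignored.
    BreakerState on(BreakerEvent event) {
        switch (this) {
            case CLOSED:
                return event == BreakerEvent.THRESHOLD_EXCEEDED ? OPEN : this;
            case OPEN:
                return event == BreakerEvent.WAIT_EXPIRED ? HALF_OPEN : this;
            case HALF_OPEN:
                if (event == BreakerEvent.TRIAL_SUCCEEDED) return CLOSED;
                if (event == BreakerEvent.TRIAL_FAILED) return OPEN;
                return this;
            default:
                throw new AssertionError(this);
        }
    }
}
```

Note that there is no direct path from Open to Closed: recovery always has to pass through Half-Open.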
Failure counting strategies#
Count-based sliding window#
Track the last N calls. If the failure rate in that window exceeds the threshold, trip the breaker.
Window size: 10 calls
Failure threshold: 50%
Calls: [ok, ok, fail, fail, ok, fail, fail, fail, ok, fail]
Failures: 6 / 10 = 60% → breaker OPENS
Pros: Simple to implement. Predictable behavior. Cons: Reaction time depends on traffic. The same 10 calls may span milliseconds under heavy load or minutes under light load, so at low volume the window can hold stale results.
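A count-based window is usually a ring buffer of the last N outcomes. The following is a minimal sketch under that assumption; `CountWindow` and its methods are hypothetical names, and the class is not thread-safe.

```java
// Hypothetical class: fixed-size ring buffer holding the outcome of the last N calls.
final class CountWindow {
    private final boolean[] failed;   // outcome per slot, true = failure
    private int next = 0;             // slot the next outcome will overwrite
    private int recorded = 0;         // number of filled slots, at most failed.length
    private int failures = 0;         // running failure count inside the window

    CountWindow(int size) {
        if (size <= 0) throw new IllegalArgumentException("size must be positive");
        failed = new boolean[size];
    }

    void recordCall(boolean isFailure) {
        if (recorded == failed.length) {
            if (failed[next]) failures--;   // evict the oldest outcome
        } else {
            recorded++;
        }
        failed[next] = isFailure;
        if (isFailure) failures++;
        next = (next + 1) % failed.length;
    }

    double failureRatePercent() {
        return recorded == 0 ? 0.0 : 100.0 * failures / recorded;
    }
}
```

Feeding it the ten calls from the example above gives a 60% failure rate. Three further successes evict two old successes and one old failure, which brings the rate down to 50%.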
Time-based sliding window#
Track all calls within the last N seconds. If the failure rate in that window exceeds the threshold, trip the breaker.
Window: last 60 seconds
Failure threshold: 50%
Calls in window: 247 total, 158 failed
Failure rate: 64% → breaker OPENS
Pros: Reacts consistently regardless of traffic volume. Cons: High-traffic services may accumulate many calls in the window. Low-traffic services may not have enough data to trigger.
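A time-based window can be sketched as a queue of timestamped outcomes that drops entries once they age out. `TimeWindow` is a hypothetical, single-threaded illustration. It stores one entry per call, which is exactly the memory cost mentioned above; Resilience4j avoids this by aggregating calls into per-second buckets.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical class: keeps every call made within the last windowMillis milliseconds.
final class TimeWindow {
    private static final class Outcome {
        final long atMillis;
        final boolean failed;
        Outcome(long atMillis, boolean failed) {
            this.atMillis = atMillis;
            this.failed = failed;
        }
    }

    private final long windowMillis;
    private final Deque<Outcome> outcomes = new ArrayDeque<>();
    private int failures = 0;

    TimeWindow(long windowMillis) {
        if (windowMillis <= 0) throw new IllegalArgumentException("window must be positive");
        this.windowMillis = windowMillis;
    }

    // The caller passes the current time so the sketch is easy to test.
    void recordCall(long nowMillis, boolean isFailure) {
        evict(nowMillis);
        outcomes.addLast(new Outcome(nowMillis, isFailure));
        if (isFailure) failures++;
    }

    int callCount(long nowMillis) {
        evict(nowMillis);
        return outcomes.size();
    }

    double failureRatePercent(long nowMillis) {
        evict(nowMillis);
        return outcomes.isEmpty() ? 0.0 : 100.0 * failures / outcomes.size();
    }

    // Drops outcomes that are windowMillis old or older.
    private void evict(long nowMillis) {
        long cutoff = nowMillis - windowMillis;
        while (!outcomes.isEmpty() && outcomes.peekFirst().atMillis <= cutoff) {
            if (outcomes.pollFirst().failed) failures--;
        }
    }
}
```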
Minimum call threshold#
Both strategies need a minimum call count before evaluating. Without it, 1 failure out of 1 call gives a 100% failure rate and immediately trips the breaker.
Minimum calls: 5
Window: 10 calls
Calls so far: [fail, fail] → only 2 calls, minimum not met → STAY CLOSED
Calls so far: [fail, fail, fail, ok, fail] → 5 calls, 80% failure → OPEN
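The minimum-call guard is a single extra check before the rate comparison. A minimal sketch, with `TripDecision.shouldTrip` as a hypothetical helper; it assumes the breaker trips when the rate is equal to or greater than the threshold, which is how Resilience4j documents its own comparison.

```java
// Hypothetical helper: decides whether the current window should trip the breaker.
final class TripDecision {
    private TripDecision() {}

    static boolean shouldTrip(int totalCalls, int failedCalls,
                              int minimumCalls, double thresholdPercent) {
        if (totalCalls < minimumCalls) {
            return false;   // not enough data yet, stay closed
        }
        double failureRate = 100.0 * failedCalls / totalCalls;
        return failureRate >= thresholdPercent;   // assumption: "equal or greater" trips
    }
}
```

With a minimum of 5, two failures out of two calls leave the breaker closed, while four failures out of five calls (80%) open it.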
Timeout configuration#
Wait duration in Open state#
How long the breaker stays open before transitioning to Half-Open:
- Too short (5s): The downstream service hasn't recovered yet. Test requests fail. Breaker reopens. Pointless.
- Too long (5min): The downstream service recovered 4 minutes ago. You're rejecting requests unnecessarily.
- Reasonable default: 30-60 seconds. Adjust based on your dependency's typical recovery time.
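The wait duration boils down to comparing the current time against the moment the breaker opened. A small sketch, assuming the caller supplies the clock reading; `OpenTimer` and its methods are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical class: tracks when an open breaker may move to Half-Open.
final class OpenTimer {
    private final Duration waitDuration;
    private Instant openedAt;   // null while the breaker has never opened

    OpenTimer(Duration waitDuration) {
        this.waitDuration = waitDuration;
    }

    // Call when the breaker trips, and again when a Half-Open trial fails.
    void markOpened(Instant now) {
        openedAt = now;
    }

    boolean readyForHalfOpen(Instant now) {
        return openedAt != null && !now.isBefore(openedAt.plus(waitDuration));
    }
}
```

Calling `markOpened` again after a failed trial restarts the full wait, which matches the "resets the timer" behavior described for Half-Open.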
Slow call handling#
A call that succeeds but takes too long should count as a failure. A 29-second response that technically returns 200 OK is still a problem.
Slow call threshold: 3 seconds
Slow call rate threshold: 50%
If more than 50% of calls take longer than 3 seconds → OPEN
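The slow call rate is computed the same way as the failure rate, just with a duration check instead of an error check. A sketch with a hypothetical `SlowCallRate` helper; it assumes a call counts as slow only when it takes strictly longer than the threshold.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical helper: share of calls in a window that exceeded the slow-call threshold.
final class SlowCallRate {
    private SlowCallRate() {}

    static double slowRatePercent(List<Duration> callDurations, Duration slowThreshold) {
        if (callDurations.isEmpty()) {
            return 0.0;
        }
        long slow = callDurations.stream()
                .filter(d -> d.compareTo(slowThreshold) > 0)   // strictly longer than the threshold
                .count();
        return 100.0 * slow / callDurations.size();
    }
}
```

For calls taking 1s, 4s, 5s, and 2s against a 3-second threshold, two of four are slow, so the rate is 50%.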
Automatic transition timing#
Open state:
wait duration = 30 seconds
Half-Open state:
permitted calls = 5
If 3 out of 5 succeed → CLOSE
If 3 out of 5 fail → reopen, restart wait timer
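The Half-Open decision above can be sketched as a counter that waits for all permitted calls to finish and then applies the failure threshold. `HalfOpenTrial` is a hypothetical name, and the 50% threshold used in the note below is an assumption carried over from the earlier examples.

```java
// Hypothetical class: collects the results of the permitted Half-Open calls.
final class HalfOpenTrial {
    enum Verdict { PENDING, CLOSE, REOPEN }

    private final int permittedCalls;
    private final double thresholdPercent;
    private int completed = 0;
    private int failures = 0;

    HalfOpenTrial(int permittedCalls, double thresholdPercent) {
        if (permittedCalls <= 0) throw new IllegalArgumentException("permittedCalls must be positive");
        this.permittedCalls = permittedCalls;
        this.thresholdPercent = thresholdPercent;
    }

    // Records one trial result; the verdict stays PENDING until every permitted call has finished.
    Verdict recordResult(boolean isFailure) {
        if (completed == permittedCalls) {
            throw new IllegalStateException("trial already complete");
        }
        completed++;
        if (isFailure) failures++;
        if (completed < permittedCalls) {
            return Verdict.PENDING;
        }
        double failureRate = 100.0 * failures / completed;
        return failureRate >= thresholdPercent ? Verdict.REOPEN : Verdict.CLOSE;
    }
}
```

With 5 permitted calls and a 50% threshold, 3 successes (a 40% failure rate) close the breaker and 3 failures (60%) reopen it.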
Fallback strategies#
When the breaker is open, you need to return something useful instead of an error.
Cached response#
Return the last known good response. Works for data that's relatively stable — product catalogs, user profiles, configuration.
public Product getProduct(String id) {
    try {
        return circuitBreaker.executeSupplier(
            () -> productService.getById(id)
        );
    } catch (CallNotPermittedException e) {
        return productCache.get(id); // stale but available
    }
}
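A cached fallback only helps if the cache is filled while the dependency is healthy. The sketch below shows one way to do that: remember each successful result and serve it when a later call fails. `LastGoodCache` is a hypothetical class, and it catches any `RuntimeException` as a stand-in for `CallNotPermittedException` so that it runs without library dependencies.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical class: remembers the last successful value per key.
final class LastGoodCache<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();

    // Runs the loader. On success the value is stored and returned;
    // on failure the last stored value is returned, if there is one.
    // Assumption: the loader never returns null.
    Optional<V> load(K key, Function<K, V> loader) {
        try {
            V fresh = loader.apply(key);
            lastGood.put(key, fresh);
            return Optional.of(fresh);
        } catch (RuntimeException e) {
            return Optional.ofNullable(lastGood.get(key));
        }
    }
}
```

An empty result means the key was never loaded successfully, so the caller still needs a plan for a cold cache.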
Default value#
Return a sensible default. Works when partial data is better than no data.
public List<Recommendation> getRecommendations(String userId) {
    try {
        return circuitBreaker.executeSupplier(
            () -> recommendationService.forUser(userId)
        );
    } catch (CallNotPermittedException e) {
        return DEFAULT_RECOMMENDATIONS; // popular items
    }
}
Graceful degradation#
Skip the non-critical step and keep the core flow working with reduced functionality. Tell the user when the difference is visible to them.
public CheckoutResponse checkout(Cart cart) {
    try {
        FraudScore score = circuitBreaker.executeSupplier(
            () -> fraudService.evaluate(cart)
        );
        return processWithFraudCheck(cart, score);
    } catch (CallNotPermittedException e) {
        // Skip fraud check, flag for manual review later
        return processWithManualReviewFlag(cart);
    }
}
Queue for retry#
Accept the request and queue it for later processing when the dependency recovers.
public void sendNotification(Notification notification) {
    try {
        circuitBreaker.executeRunnable(
            () -> notificationService.send(notification)
        );
    } catch (CallNotPermittedException e) {
        retryQueue.enqueue(notification); // process when breaker closes
    }
}
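The `retryQueue` in the example above needs something to drain it once the dependency is back. One possible shape, as a sketch only: a FIFO queue whose `drain` method stops at the first item that still fails, so ordering is preserved. `RetryQueue` and the `trySend` predicate are hypothetical; a production queue would usually be durable and bounded.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Predicate;

// Hypothetical class: in-memory FIFO of work that could not be sent.
final class RetryQueue<T> {
    private final Queue<T> pending = new ArrayDeque<>();

    synchronized void enqueue(T item) {
        pending.add(item);
    }

    synchronized int size() {
        return pending.size();
    }

    // Re-sends items in order. trySend returns true when the send succeeded.
    // Stops at the first failure and leaves that item at the head of the queue.
    synchronized int drain(Predicate<T> trySend) {
        int sent = 0;
        while (!pending.isEmpty() && trySend.test(pending.peek())) {
            pending.remove();
            sent++;
        }
        return sent;
    }
}
```

A natural trigger for `drain` is the breaker's transition back to Closed.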
Resilience4j implementation#
Resilience4j is a widely used circuit breaker library for Java and Kotlin. It is the usual replacement for Netflix Hystrix, which is in maintenance mode and no longer actively developed.
Dependency#
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-circuitbreaker</artifactId>
    <version>2.2.0</version>
</dependency>
Configuration#
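import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    // Sliding window
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(10)
    .minimumNumberOfCalls(5)
    // Failure thresholds (percentages)
    .failureRateThreshold(50)
    .slowCallRateThreshold(50)
    .slowCallDurationThreshold(Duration.ofSeconds(3))
    // Open state
    .waitDurationInOpenState(Duration.ofSeconds(30))
    // Half-Open state
    .permittedNumberOfCallsInHalfOpenState(5)
    .automaticTransitionFromOpenToHalfOpenEnabled(true)
    // What counts as failure (BusinessException is an application class)
    .recordExceptions(IOException.class, TimeoutException.class)
    .ignoreExceptions(BusinessException.class)
    .build();

CircuitBreaker breaker = CircuitBreaker.of("payment-service", config);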
Using the breaker#
// Wrap a supplier
Supplier<Payment> decorated = CircuitBreaker
    .decorateSupplier(breaker, () -> paymentService.process(order));

// Try comes from Vavr (io.vavr:vavr). Resilience4j 2.x no longer pulls it in
// transitively, so add it as a separate dependency.
Try<Payment> result = Try.ofSupplier(decorated)
    .recover(CallNotPermittedException.class,
        e -> fallbackPayment(order))
    .recover(IOException.class,
        e -> fallbackPayment(order));
Spring Boot integration#
# application.yml
resilience4j:
  circuitbreaker:
    instances:
      payment-service:
        sliding-window-size: 10
        minimum-number-of-calls: 5
        failure-rate-threshold: 50
        wait-duration-in-open-state: 30s
        permitted-number-of-calls-in-half-open-state: 5
        automatic-transition-from-open-to-half-open-enabled: true
        slow-call-rate-threshold: 50
        slow-call-duration-threshold: 3s
@Service
public class PaymentGateway {
    @CircuitBreaker(name = "payment-service", fallbackMethod = "fallback")
    public Payment processPayment(Order order) {
        return paymentClient.charge(order);
    }

    private Payment fallback(Order order, Exception e) {
        return Payment.pending(order, "Payment service unavailable");
    }
}
Monitoring breaker state#
A circuit breaker opening is an alert-worthy event. It means a dependency is failing.
Metrics to expose#
CircuitBreaker breaker = CircuitBreaker.of("payment-service", config);

// Register event handlers
breaker.getEventPublisher()
    .onStateTransition(event ->
        log.warn("Circuit breaker {} transitioned: {} -> {}",
            event.getCircuitBreakerName(),
            event.getStateTransition().getFromState(),
            event.getStateTransition().getToState()))
    .onCallNotPermitted(event ->
        metrics.increment("circuit_breaker.rejected",
            "name", event.getCircuitBreakerName()));
Prometheus metrics (via Micrometer)#
Resilience4j integrates with Micrometer to export metrics automatically:
# Names as exported by recent Resilience4j versions; confirm against your own Prometheus endpoint
# Current state: one gauge per state, 1 for the active state and 0 for the others
resilience4j_circuitbreaker_state{name="payment-service", state="open"}
# Failure rate
resilience4j_circuitbreaker_failure_rate{name="payment-service"}
# Calls by outcome
resilience4j_circuitbreaker_calls_seconds_count{name="payment-service", kind="successful"}
resilience4j_circuitbreaker_calls_seconds_count{name="payment-service", kind="failed"}
# Calls rejected while the breaker is open (a separate counter)
resilience4j_circuitbreaker_not_permitted_calls_total{name="payment-service", kind="not_permitted"}
Alerting rules#
# Alert when a breaker opens
- alert: CircuitBreakerOpen
  expr: resilience4j_circuitbreaker_state{state="open"} == 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "Circuit breaker {{ $labels.name }} is OPEN"
# Alert on high rejection rate
- alert: CircuitBreakerRejecting
  expr: |
    rate(resilience4j_circuitbreaker_not_permitted_calls_total[5m]) > 0
  for: 1m
  labels:
    severity: warning
Dashboard essentials#
Build a Grafana dashboard showing:
- Breaker state timeline — when did each breaker open and close?
- Failure rate trend — is the rate approaching the threshold before it trips?
- Rejected calls count — how many requests are being short-circuited?
- Recovery time — how long does the breaker stay open before successfully closing?
- Latency distribution — are slow calls increasing before the breaker trips?
The practical takeaway#
A circuit breaker is a state machine that prevents cascading failures by failing fast when a dependency is down. The implementation checklist:
- Choose count-based or time-based sliding window — count-based is simpler, time-based reacts more consistently
- Set minimum call threshold — prevent false positives from low traffic
- Configure slow call detection — a 30-second 200 OK is still a failure
- Always implement a fallback — cached data, defaults, queue for retry, or graceful degradation
- Monitor breaker state transitions — an open breaker is an incident
- Start with 50% failure rate, 30s open duration, 5 test calls — tune from there based on your SLOs
Article #454 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.