This is a follow-up to my previous article —
Our Spring Boot API Froze Under Load
[https://hashnode.com/edit/cmny3jv0u00bg2dlndzza08zy]
A reader asked me a question I couldn't stop thinking about.
In my last article I wrote about how our Spring Boot API froze under load because of thread pool exhaustion. We fixed it with two things: adding timeouts on external calls and isolating slow dependencies into their own thread pool using @Async.

A reader left a comment that stopped me mid-scroll: "After the fix, did you keep monitoring thread pool metrics as a guardrail — or add any alerts around it? And did you consider a circuit breaker?"

We had added monitoring. But the circuit breaker part? Honestly, we had talked about it but never implemented it. That comment pushed me to finally do it properly. In this article I want to share what I learned about Resilience4j circuit breakers, why they are the missing layer in most Spring Boot microservices, and how to implement them step by step.
The problem with just timeouts
Why timeouts alone are not enough
After our fix, here is what our system looked like:

— External service calls had a 5-second timeout
— Slow calls ran on an isolated thread pool
— CloudWatch alerted us when active threads crossed 70% of max

This was much better than before. But there was still a problem.

Imagine the external service we were calling started failing — not slowly, but completely. Every call we made was timing out after 5 seconds and returning an error.

With just timeouts, here is what happens: every single request to our API still tries to call that external service, waits 5 seconds, times out, and then returns an error. We are calling a service we already know is down — over and over — wasting threads, wasting time, and making recovery slower.

What we needed was something that would say: "This service has been failing repeatedly — stop calling it for a while. Let it recover. Then try again carefully."

That is exactly what a circuit breaker does.
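To see how fast this eats a thread pool, a quick back-of-the-envelope calculation helps. The numbers below are illustrative assumptions, not the real figures from our incident: a hypothetical pool of 200 Tomcat worker threads and the 5-second timeout from the fix.

```java
// Back-of-the-envelope: how quickly timeouts alone exhaust a thread pool.
// Pool size is a hypothetical example, not our real configuration.
class TimeoutMath {
    public static void main(String[] args) {
        int poolSize = 200;        // hypothetical Tomcat max worker threads
        double timeoutSeconds = 5; // timeout on the external call

        // Little's law: concurrency = arrival rate x time in system.
        // Every call to the dead service holds a thread for the full
        // timeout, so the arrival rate that pins ALL threads is:
        double saturatingRate = poolSize / timeoutSeconds;

        System.out.printf(
                "Just %.0f failing requests/sec keep all %d threads busy%n",
                saturatingRate, poolSize);
    }
}
```

In other words, under these assumptions a mere 40 failing requests per second is enough to keep every worker thread stuck waiting on a service you already know is down.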
What is a circuit breaker
Circuit breaker — the concept in plain English
The name comes from electrical engineering. A circuit breaker in your home trips when there is too much current — it breaks the circuit to prevent damage. You reset it manually when the problem is fixed.

In software it works the same way, with three states:

CLOSED — Everything is working normally. Requests go through. The circuit breaker monitors the failure rate in the background.

OPEN — Too many failures have happened. The circuit breaker trips. Instead of trying the actual call, it immediately returns a fallback response. No waiting, no timeouts — fast failure.

HALF-OPEN — After a waiting period, the circuit breaker allows a few test requests through. If they succeed, it goes back to CLOSED. If they fail, it goes back to OPEN.

The result: when a dependency is down, you fail fast instead of failing slow. Your threads are not stuck waiting. Your system stays responsive. And the failing service gets breathing room to recover.
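To make the three states concrete, here is a minimal plain-Java sketch of the state machine. This is not Resilience4j's implementation — it simplifies by counting consecutive failures instead of tracking a sliding-window failure rate — but the transitions are the same idea.

```java
import java.time.Duration;
import java.time.Instant;

// Minimal circuit-breaker sketch to illustrate CLOSED / OPEN / HALF_OPEN.
// NOT Resilience4j's implementation -- just the concept.
class MiniCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    private final int failureThreshold;  // failures before tripping
    private final Duration waitInOpen;   // how long to stay OPEN

    MiniCircuitBreaker(int failureThreshold, Duration waitInOpen) {
        this.failureThreshold = failureThreshold;
        this.waitInOpen = waitInOpen;
    }

    // Should this call be attempted, or should we fail fast?
    synchronized boolean allowRequest(Instant now) {
        if (state == State.OPEN
                && Duration.between(openedAt, now).compareTo(waitInOpen) >= 0) {
            state = State.HALF_OPEN;  // wait elapsed: let a probe call through
        }
        return state != State.OPEN;   // OPEN means fail fast, no real call
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;         // probe succeeded: back to normal
    }

    synchronized void recordFailure(Instant now) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;       // trip: start failing fast
            openedAt = now;
        }
    }

    synchronized State state() { return state; }
}
```

Three failures trip it, `allowRequest` returns false (fail fast) until the wait elapses, then one successful probe closes it again. Resilience4j does all of this for you, with a proper sliding window, thread safety, and metrics.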
Circuit breakers don't prevent failures. They contain them — stopping one failing dependency from bringing down your entire system.
Add Resilience4j
Step 1 — Add the dependency
Resilience4j is the most widely used resilience library for Java. It integrates cleanly with Spring Boot and is lightweight compared to alternatives like Hystrix (which is now in maintenance mode). Add this to your pom.xml:
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.1.0</version>
</dependency>

<!-- Required for the @CircuitBreaker annotation, which is AOP-based -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>
Use resilience4j-spring-boot3 for Spring Boot 3.x. For Spring Boot 2.x use resilience4j-spring-boot2.
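If your project uses Gradle instead of Maven, the equivalent declarations would look like this (same coordinates as the pom.xml snippet above; adjust the version to your build):

```groovy
// build.gradle -- Gradle equivalent of the Maven dependencies above
implementation 'io.github.resilience4j:resilience4j-spring-boot3:2.1.0'
implementation 'org.springframework.boot:spring-boot-starter-aop'
```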
Configure it
Step 2 — Configure the circuit breaker
Add this to your application.properties or application.yml:
#application.properties
#Name your circuit breaker — use the same name in your code
resilience4j.circuitbreaker.instances.externalService.sliding-window-size=10
resilience4j.circuitbreaker.instances.externalService.failure-rate-threshold=50
resilience4j.circuitbreaker.instances.externalService.wait-duration-in-open-state=10s
resilience4j.circuitbreaker.instances.externalService.permitted-number-of-calls-in-half-open-state=3
resilience4j.circuitbreaker.instances.externalService.automatic-transition-from-open-to-half-open-enabled=true
resilience4j.circuitbreaker.instances.externalService.register-health-indicator=true
Let me explain what each setting means in plain English:
sliding-window-size=10
// Look at the last 10 calls to decide the failure rate
failure-rate-threshold=50
// If 50% or more of those 10 calls failed — trip the circuit (go OPEN)
wait-duration-in-open-state=10s
// Stay OPEN for 10 seconds before trying again (HALF-OPEN)
permitted-number-of-calls-in-half-open-state=3
// In HALF-OPEN, allow 3 test calls — if they pass, go back to CLOSED
automatic-transition-from-open-to-half-open-enabled=true
// Automatically move to HALF-OPEN after wait time — no manual reset needed.
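If you prefer application.yml over application.properties, the same configuration would look roughly like this:

```yaml
# application.yml equivalent of the properties above
resilience4j:
  circuitbreaker:
    instances:
      externalService:
        sliding-window-size: 10
        failure-rate-threshold: 50
        wait-duration-in-open-state: 10s
        permitted-number-of-calls-in-half-open-state: 3
        automatic-transition-from-open-to-half-open-enabled: true
        register-health-indicator: true
```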
Use it in code
Step 3 — Add the annotation and fallback
Now wrap the method that calls your external service with the @CircuitBreaker annotation:
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;

@Service
public class ExternalDataService {

    private static final Logger log =
            LoggerFactory.getLogger(ExternalDataService.class);

    private final ExternalApiClient externalApiClient;

    public ExternalDataService(ExternalApiClient externalApiClient) {
        this.externalApiClient = externalApiClient;
    }

    // Circuit breaker name matches what you set in properties
    @CircuitBreaker(name = "externalService", fallbackMethod = "getDataFallback")
    public EnrichmentData getExternalData(String userId) {
        // This is the call that could fail or be slow
        return externalApiClient.fetchEnrichmentData(userId);
    }

    // Fallback — called when the circuit is OPEN, and also when the call throws
    // Must have the same parameters plus a Throwable parameter
    public EnrichmentData getDataFallback(String userId, Throwable ex) {
        // Return a safe default instead of failing the whole request
        // Options: cached data, empty object, default values
        log.warn("Circuit breaker fallback for externalService. "
                + "Returning default for userId: {}. Reason: {}",
                userId, ex.getMessage());
        return EnrichmentData.defaultData();
    }
}
The fallback method signature must match the original method exactly — same parameters — plus one extra Throwable parameter at the end (a more specific exception type also works). If the signature doesn't match, Resilience4j cannot find the fallback and the call fails at runtime when the fallback is needed, so make sure to test this path. Also note the fallback runs for any exception the method throws, not only when the circuit is OPEN.
Combine with timeout
Step 4 — Combine circuit breaker with timeout
A circuit breaker works best when combined with a timeout. The timeout prevents threads from waiting too long. The circuit breaker prevents calling a service that is already known to be failing. Resilience4j has a built-in TimeLimiter for this:
#Add timeout configuration alongside circuit breaker
resilience4j.timelimiter.instances.externalService.timeout-duration=3s
resilience4j.timelimiter.instances.externalService.cancel-running-future=true
// Use both annotations together.
// @TimeLimiter requires the method to return a CompletableFuture.
@CircuitBreaker(name = "externalService", fallbackMethod = "getDataFallback")
@TimeLimiter(name = "externalService")
public CompletableFuture<EnrichmentData> getExternalData(String userId) {
    return CompletableFuture.supplyAsync(() ->
            externalApiClient.fetchEnrichmentData(userId)
    );
}

// Fallback for combined circuit breaker + time limiter
public CompletableFuture<EnrichmentData> getDataFallback(
        String userId, Throwable ex) {
    log.warn("Fallback triggered for userId: {}. Reason: {}",
            userId, ex.getMessage());
    return CompletableFuture.completedFuture(
            EnrichmentData.defaultData()
    );
}
Monitor it
Step 5 — Monitor circuit breaker state with Actuator
Once Resilience4j is set up with register-health-indicator=true, you can see the circuit breaker state through Spring Boot Actuator:
#Enable circuit breaker metrics in Actuator
management.endpoints.web.exposure.include=health,metrics,circuitbreakers
management.health.circuitbreakers.enabled=true
#Check health endpoint
GET /actuator/health
#Response shows the circuit breaker state under the health details, e.g.:
#"externalService": { "status": "UP", "details": { "state": "CLOSED", ... } }
#Check detailed metrics
GET /actuator/metrics/resilience4j.circuitbreaker.state
GET /actuator/metrics/resilience4j.circuitbreaker.failure.rate
Add a CloudWatch alarm on the circuit breaker state metric — if it goes OPEN, you want to know immediately. A circuit breaker going OPEN is a signal that something in your system needs attention.
Full picture
The complete resilience stack — how it all fits together
After everything we implemented, here is the full picture of how an external call is protected in our service:

1. Request comes in to our API
2. Main Tomcat thread handles it normally
3. External call is handed off to an isolated async thread pool (@Async)
4. TimeLimiter ensures the call times out after 3 seconds maximum
5. Circuit breaker monitors the failure rate across the last 10 calls
6. If the failure rate exceeds 50% — the circuit opens and the fallback returns immediately
7. After 10 seconds — 3 test calls go through in HALF-OPEN state
8. If the tests pass — the circuit closes and normal operation resumes
9. CloudWatch monitors active thread count AND circuit breaker state

Each layer handles a different failure scenario. Together they mean one slow or failing external dependency cannot bring down our service.
This is the Bulkhead + Circuit Breaker combination — two of the core resilience patterns in microservices architecture, documented in Microsoft's Azure Architecture Center and on Netflix's engineering blog.
What I learned from this whole journey
The honest truth is that we should have had a circuit breaker from day one. Timeouts and thread isolation were necessary but they were always incomplete without it. What I found interesting about implementing Resilience4j is how much it forces you to think about failure modes upfront. What should the fallback return? What is an acceptable failure rate? How long should we wait before retrying? These are questions that reveal a lot about how well you understand your own system's dependencies. If I were starting a new microservice project today, these three things would go in before the first endpoint is written:
— Timeouts on every external call
— Isolated thread pools for external dependencies
— Circuit breaker with a sensible fallback

Not after the first production incident. Before.

---

This article was directly inspired by a comment on my previous post about thread pool exhaustion. If you haven't read that one, it gives context for why we ended up here.

If you found this useful or have questions about Resilience4j configuration — drop a comment.
And if you are using a different resilience pattern in your Spring Boot services, I would genuinely love to hear how you approached it.
Follow me here for more practical Java backend content — no theory-only posts, just things that come from real production systems.