This article was originally published on my Hashnode blog: https://gaurikatara.hashnode.dev
The API didn't throw an error. It just... stopped responding.
No exceptions in the logs. No 500 errors. No obvious reason. Just requests piling up, response times climbing from milliseconds to 30 seconds, and then: timeouts. It was strange. Traffic had picked up slightly, nothing unusual for that time of day, and suddenly our backend service was barely responding. Users were seeing blank screens. This is the story, from one of my typical workdays, of how we diagnosed and fixed the issue in our Spring Boot microservice, and what I learned that I hadn't found clearly explained in one place anywhere online.
I began investigating all the potential causes:
Network issues
Database metrics
CPU and memory usage
Thread dumps
Load balancer logs
What I discovered was surprising: the problem was in the threads.
The issue stemmed from thread pool exhaustion in production.
WHAT IS THREAD POOL EXHAUSTION — and why is it so sneaky?
Spring Boot uses an embedded Tomcat server by default. Tomcat handles incoming HTTP requests using a fixed pool of threads. By default, this pool has a maximum of 200 threads. Here is how it works normally: a request comes in, Tomcat assigns it a thread, the thread does its work, returns a response, and the thread goes back to the pool. Fast, clean, repeatable.
Now here is the problem: what if a thread gets stuck waiting?
Maybe it's calling an external API that is slow.
Maybe it's waiting on a database query that is taking too long.
Maybe it's doing something blocking that it shouldn't be.
If enough threads get stuck waiting at the same time, the pool runs out. New requests come in but there are no threads available to handle them. So they wait in a queue. The queue fills up. New requests start getting rejected or timing out. Your API is not down. It is not throwing errors. It is just frozen. And the worst part — from the outside it looks exactly like a network issue or a deployment problem. Most developers waste hours looking in the wrong place.
Thread pool exhaustion does not always look like an error. It often looks like extreme slowness or total unresponsiveness — with perfectly clean logs.
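The failure mode above can be reproduced in miniature with a plain JDK thread pool. This is a hypothetical sketch, not our production code: a "Tomcat" of two worker threads with a one-slot queue, where every "request" blocks on a slow dependency.

```java
import java.util.concurrent.*;

public class ExhaustionDemo {
    public static void main(String[] args) throws Exception {
        // Tiny "Tomcat": 2 worker threads, room for 1 waiting request
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1));

        CountDownLatch stuck = new CountDownLatch(1);
        // Stand-in for a blocking call to a slow external service
        Runnable slowCall = () -> {
            try { stuck.await(); } catch (InterruptedException ignored) {}
        };

        pool.submit(slowCall); // occupies worker 1
        pool.submit(slowCall); // occupies worker 2
        pool.submit(slowCall); // sits in the queue

        boolean rejected = false;
        try {
            pool.submit(slowCall); // no free thread, no queue slot
        } catch (RejectedExecutionException e) {
            rejected = true;
        }
        System.out.println("4th request rejected: " + rejected);

        stuck.countDown(); // the "external service" recovers
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Nothing here throws until the pool and queue are both full, which is exactly why the symptom is silence rather than stack traces.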
HOW WE DIAGNOSED IT
Step 1 — Check the thread pool metrics
The first thing I did after ruling out obvious causes — no deployment had happened, no database was down — was check our application metrics on AWS CloudWatch. We had Spring Boot Actuator enabled, which exposes metrics including thread pool stats. I looked at the active thread count and it was sitting at exactly 200, the Tomcat default maximum. That was the first real clue.
If your team has Actuator set up, you can check this yourself at:
/actuator/metrics/tomcat.threads.busy
If that number is at or near your maximum thread count while the API is slow, thread pool exhaustion is almost certainly your problem.
Step 2 — Take a thread dump
To confirm, I took a thread dump while the service was under load. A thread dump shows you exactly what every thread in your JVM is doing at that moment. You can trigger one using:
Using jstack (get the PID first with: ps aux | grep java):
jstack <pid> > thread_dump.txt
Or via Spring Boot Actuator, if enabled, with a GET request (e.g. from curl or Postman):
GET http://localhost:8080/actuator/threaddump
When I opened the thread dump, I saw the same pattern repeating across most of the 200 threads. They were all stuck in a WAITING or TIMED_WAITING state, blocked on an HTTP call to an external third-party service we were calling to enrich our response data. That external service had started responding slowly — averaging 25 seconds per call instead of the usual 200ms. Our code was calling it synchronously, blocking the thread the entire time. With enough concurrent requests, every thread in the pool was stuck waiting for that external service. New requests had nowhere to go.
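If you want to capture the same information from inside the JVM (for example, behind an internal admin endpoint), the JDK's ThreadMXBean exposes the data jstack prints. A minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSnippet {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // true, true = include locked monitors and synchronizers, like jstack
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.printf("%s  state=%s%n",
                    info.getThreadName(), info.getThreadState());
        }
    }
}
```

In an exhausted pool you would see row after row of worker threads in WAITING or TIMED_WAITING, all parked at the same stack frame.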
THE FIX
Step 3 — The fix had three parts
We fixed it in three stages. The first was immediate: add a timeout to the external HTTP call so threads don't wait forever. The second was a stopgap: enlarge the thread pool to buy breathing room. The third was the proper fix: make the call non-blocking.
Part 1 — Add connection and read timeouts immediately
This was a quick fix we could deploy right away. We configured our RestTemplate with explicit timeouts:
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate() {
        HttpComponentsClientHttpRequestFactory factory =
                new HttpComponentsClientHttpRequestFactory();
        // Don't wait more than 3 seconds to connect
        factory.setConnectTimeout(3000);
        // Don't wait more than 5 seconds for a response
        factory.setReadTimeout(5000);
        return new RestTemplate(factory);
    }
}
Always set timeouts on any external HTTP call. No timeout means a thread can be stuck waiting forever. This is one of the most common mistakes in Java backend services.
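The same two knobs exist outside Spring as well. If you are on plain JDK 11+, the built-in HttpClient takes a connect timeout on the client and an overall request timeout per request; a sketch (the URL below is a placeholder):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class TimeoutSketch {
    public static void main(String[] args) {
        // Connect timeout is configured once, on the client...
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(3))
                .build();

        // ...while the overall deadline (covering the read) is per request.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/enrich"))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();

        System.out.println("connect timeout: " + client.connectTimeout().orElseThrow());
        System.out.println("request timeout: " + request.timeout().orElseThrow());
    }
}
```

Whichever client you use, the principle is identical: no external call gets to hold a thread indefinitely.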
Part 2 — Increase thread pool size as a short-term buffer
While we worked on the proper async solution, we also increased the Tomcat thread pool size to give us more breathing room under load. This goes in your application.properties:
# Maximum number of threads to handle requests
server.tomcat.threads.max=400
# Minimum threads always kept alive
server.tomcat.threads.min-spare=20
# Max requests that can wait when all threads are busy
server.tomcat.accept-count=100
Increasing thread count is not a real fix; it just delays the problem. If your threads are blocking, adding more threads means more threads will block. Always fix the root cause.
Part 3 — The proper fix: make the external call async
The real solution was to stop blocking a Tomcat thread while waiting for the external service. We used Spring's @Async with a dedicated thread pool for external calls, isolating them from the main request-handling threads:
import java.util.concurrent.Executor;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncConfig {

    @Bean(name = "externalCallExecutor")
    public Executor externalCallExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);
        executor.setMaxPoolSize(50);
        executor.setQueueCapacity(100);
        executor.setThreadNamePrefix("external-call-");
        executor.initialize();
        return executor;
    }
}
Now the external call runs on its own isolated thread pool. Even if that external service slows down completely, our main Tomcat threads keep handling requests normally. The two concerns are fully separated.
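The bulkhead idea itself is independent of Spring. A plain-JDK sketch, with a made-up stand-in for the enrichment service: the call runs on its own fixed pool and carries a hard deadline, so a slow dependency can only ever exhaust its own threads.

```java
import java.util.concurrent.*;

public class BulkheadSketch {
    // Dedicated pool for external calls; request-handling threads stay untouched
    private static final ExecutorService externalPool =
            Executors.newFixedThreadPool(10);

    // Hypothetical stand-in for the slow third-party enrichment call
    static String callEnrichmentService() {
        try { Thread.sleep(100); } catch (InterruptedException ignored) {}
        return "enriched";
    }

    public static void main(String[] args) throws Exception {
        String result = CompletableFuture
                .supplyAsync(BulkheadSketch::callEnrichmentService, externalPool)
                // Hard deadline: fall back instead of waiting forever
                .completeOnTimeout("fallback", 5, TimeUnit.SECONDS)
                .get();
        System.out.println(result);
        externalPool.shutdown();
    }
}
```

Spring's @Async on a named executor gives you the same shape with less plumbing, but the design choice is the same: one pool per risky dependency, each with its own deadline.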
What happened after the fix
After deploying the timeout fix, the immediate crisis was resolved. Threads no longer got stuck indefinitely: after 5 seconds they would time out, return an error or fallback response, and free up for the next request. After the async refactor, the problem disappeared entirely. Active thread count during peak traffic dropped from 200 (maxed out) to consistently under 40. Response times returned to normal. The external service could be slow or even temporarily down, and our API kept working.
What I learned from this
Always set timeouts on external calls. Every HTTP call your service makes to another service or API must have a connect timeout and a read timeout. No exceptions. This is the single most common cause of thread pool exhaustion.
Thread pool exhaustion looks like slowness, not errors. If your API is suddenly unresponsive with no exceptions in the logs, check your active thread count before you do anything else.
Increase thread pool size only as a temporary measure. It buys you time but does not fix the root cause. The real fix is to stop blocking threads in the first place.
Isolate slow dependencies. Any external call that could be slow or unreliable should run on its own thread pool, completely separate from your main request-handling threads. This is called bulkhead pattern — and it is one of the most important resilience patterns in microservices.
Actuator is your best friend in production. If you are running Spring Boot and you do not have Actuator enabled with metrics exposed, you are flying blind. Enable it. It will save you hours the next time something goes wrong.
I hope this walkthrough helped.
Thread pool exhaustion may seem complex, but it is easy to fix once you know the signs. If you've faced this issue or are dealing with it now, share your approach in the comments. And if you found this useful, follow me for insights on Java backend engineering, Spring Boot, Kafka, AWS, and real-world production challenges.
Top comments (4)
This is a great breakdown. The “no errors, just slow” part is exactly what makes thread pool exhaustion so painful to debug.
One thing I’ve seen in similar cases, even after adding timeouts, teams still run into cascading slowdowns because everything shares the same request flow.
Isolating external calls into a separate pool like you did is key. Without that, one slow dependency can quietly take down the whole system.
Out of curiosity, did you consider using WebClient or reactive flows instead of @async, or was keeping it simple with thread isolation the goal?
Thank you! You've hit on exactly the right point — the "no errors,
just slow" part is what makes it dangerous. By the time someone
notices, the damage is already done.
And yes, that cascading effect is real. Even after adding timeouts,
if everything shares the same thread pool, one slow dependency
creates a queue that backs up everything else. Isolation is the
only proper fix.
On your question about WebClient and reactive flows — we did
discuss it briefly. The honest reason we went with @async and a
separate thread pool instead was pragmatic: our existing codebase
was fully synchronous Spring MVC. Introducing reactive programming
mid-project would have required significant refactoring across
multiple layers — not just the one service. The risk and effort
didn't justify it at that point.
That said, WebClient is absolutely the cleaner long-term solution
if you're starting fresh or already on a reactive stack. It removes
the thread-blocking problem at the root rather than managing around
it with thread pools.
It's one of those decisions where "right answer" depends heavily on
what you're working with. In a greenfield project I'd go reactive
from day one.
Have you used WebClient in production? Curious how your team
handled the learning curve on reactive flows.
Yeah, makes sense. Retrofitting reactive into a synchronous codebase is usually not worth the disruption.
I’ve used WebClient in production, mostly in setups where we designed for it early. The biggest challenge wasn’t the API itself, but getting the team comfortable with debugging and reasoning about reactive flows.
In most existing Spring MVC apps, I’ve seen teams get 80% of the benefit just by doing what you did: timeouts + isolating slow dependencies.
Curious, after the fix, did you keep monitoring thread pool metrics as a guardrail or add any alerts around it?
That's a really honest take on reactive — the debugging and
reasoning part is what most articles skip. It's one thing to
write reactive code, it's another to sit with a thread dump at
2 AM trying to understand a reactive chain that's gone wrong.
The mental model shift is real.
And completely agree on the 80% point. For most existing Spring
MVC apps, clean isolation and proper timeouts gets you most of
the way there without the overhead of a full reactive migration.
On your question about monitoring after the fix — yes, we kept
it in place and actually tightened it up.
We already had Spring Boot Actuator running so we added a
CloudWatch alarm specifically on tomcat.threads.busy. The
threshold we set was 70% of the max pool size: if active
threads crossed 140 out of 200, we'd get an email alert
before it became a crisis, not after.
The key thing we learned is that thread pool metrics should be
a continuous guardrail, not something you check only when
something breaks. By the time it looks bad on the dashboard
it's usually already impacting users.
We also added logging around the async executor we created —
tracking queue size and rejection count. Rejected tasks in that
pool became a leading indicator that something upstream was
slowing down again.
One thing I'd add in hindsight — a circuit breaker like
Resilience4j around that external dependency would have been
the complete solution. Timeouts + isolation handles the
symptom. A circuit breaker stops calling a dependency that's
already failing, which is the next layer of protection.
That's actually something I'm planning to cover in a follow-up
article — Resilience4j patterns in Spring Boot microservices.
Will that be helpful for you?