Sudheer Kondeti

Diagnosing Cross-Region Latency & Thread Pool Exhaustion in Java + Azure Cosmos DB

In a recent production incident, we hit a tricky issue involving Java workloads running on AKS and Azure Cosmos DB. The symptoms were subtle at first: occasional pod restarts and sudden throttling of throughput, measured in transactions per minute (TPM). After a systematic investigation, the root cause turned out to be a combination of architectural gaps and a code-level bug.

This post captures the full diagnostic journey, technical insights, and architectural lessons learned. My goal is to help engineers and architects avoid similar pitfalls—especially when building globally distributed, low‑latency cloud systems.

🗺️ Architecture Flow Diagram

[Image: architecture flow diagram]

🚩 The Initial Symptoms

  • AKS application pods were randomly restarting.

  • We observed unexpected TPM throttling during otherwise normal traffic (~2–3k TPM).

  • The symptoms reproduced consistently only when the application and Cosmos DB were deployed in different regions.

This pointed toward a deeper cross‑region interaction issue—but we needed more data.

🧪 Reproducing the Problem in a Controlled Setup

We deployed:

  • The Java application in Region A

  • Cosmos DB in Region B

This reliably reproduced the high-latency events.

To inspect the JVM behavior during these spikes, we aimed to capture thread dumps. But then came the next hurdle…

🛠️ Running Distroless Containers: No Debugging Tools

Our application runs in distroless containers, which have:

  • No shell

  • No jstack

  • No standard directory structure

  • No way to exec into the container for diagnostics

Solution: AKS Sidecar for Debugging

We introduced a lightweight debugging sidecar with:

  • Process namespace sharing (shareProcessNamespace: true in the pod spec), so the sidecar can see and inspect the main JVM process

  • A shared emptyDir volume to store thread dumps and logs

This allowed us to capture thread dumps exactly when latency spiked.

🧵 Thread Dump Analysis: Thread Pool Exhaustion

The thread dumps revealed:

  • An unusually high number of Cosmos client threads stuck in WAITING state

  • Most were waiting on async operations

  • The Cosmos client thread pool was fully saturated, despite low load

This immediately raised a red flag: Cosmos DB is built for single-digit-millisecond operations, and 2–3k TPM (roughly 33 to 50 requests per second) is nowhere near enough to stress the SDK.
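A minimal sketch of how a captured dump could be summarized, assuming jstack-style text output. The class name, the "cosmos-worker-" thread-name prefix, and the sample dump are hypothetical placeholders, not taken from the incident; real dumps should be checked for the actual pool names.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreadDumpSummary {
    // Header line of a thread entry in jstack-style output: "thread-name" #id ...
    private static final Pattern NAME = Pattern.compile("^\"([^\"]+)\"");
    // State line that follows the header line.
    private static final Pattern STATE =
        Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    /** Counts threads per state, for threads whose name starts with namePrefix. */
    public static Map<String, Integer> countStates(String dump, String namePrefix) {
        Map<String, Integer> counts = new TreeMap<>();
        String current = null;
        for (String line : dump.split("\\R")) {
            Matcher name = NAME.matcher(line);
            if (name.find()) {
                current = name.group(1); // remember which thread we are inside
                continue;
            }
            Matcher state = STATE.matcher(line);
            if (current != null && state.find()) {
                if (current.startsWith(namePrefix)) {
                    counts.merge(state.group(1), 1, Integer::sum);
                }
                current = null; // only the first state line per thread counts
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample; thread names are illustrative only.
        String sample = String.join("\n",
            "\"cosmos-worker-1\" #21 daemon prio=5",
            "   java.lang.Thread.State: WAITING (parking)",
            "\"cosmos-worker-2\" #22 daemon prio=5",
            "   java.lang.Thread.State: WAITING (parking)",
            "\"http-nio-8080-exec-1\" #30 daemon prio=5",
            "   java.lang.Thread.State: RUNNABLE");
        System.out.println(countStates(sample, "cosmos-"));
    }
}
```

Comparing these counts across several dumps taken a few seconds apart helps distinguish a pool that is briefly busy from one that stays saturated.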

So we dug deeper...

👻 The Hidden Culprit: Ghost Connections

We eventually traced the root cause to a bug in our async flow, causing:

  • Improperly managed async calls

  • “Ghost connections” (orphaned/stuck network operations)

  • Increasing pressure on I/O and SDK connection pools

Under low latency, these ghost connections were less noticeable, but when cross‑region latency was added to the mix, they accumulated fast enough to exhaust the thread pool.

Fixing this async handling bug eliminated the ghost connections.
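The post does not show the faulty code, so the following is only an illustration of one class of "improperly managed async call": an operation with no deadline that stays pending forever. It is a sketch using plain CompletableFuture (Java 9+), and stuckCall and withDeadline are hypothetical names. The Cosmos Java SDK's async API is Reactor-based, where the analogous guard would be a timeout operator on the Mono.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;

class BoundedAsyncCall {
    /** Stand-in for a remote call whose response never arrives. */
    public static CompletableFuture<String> stuckCall() {
        return new CompletableFuture<>(); // nobody ever completes this future
    }

    /** Fails the call after timeoutMs instead of letting it pend forever. */
    public static CompletableFuture<String> withDeadline(
            CompletableFuture<String> call, long timeoutMs) {
        return call
            .orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
            .exceptionally(err -> {
                // Unwrap in case the failure arrives wrapped by an earlier stage.
                Throwable cause =
                    (err instanceof CompletionException && err.getCause() != null)
                        ? err.getCause() : err;
                return "fallback after " + cause.getClass().getSimpleName();
            });
    }

    public static void main(String[] args) {
        // Without the deadline, join() on stuckCall() would block indefinitely.
        System.out.println(withDeadline(stuckCall(), 200).join());
    }
}
```

The point of the design is that every async operation has an explicit upper bound on how long it can hold a thread or connection, regardless of network latency.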

But one question still remained:

Why was the cross-region case amplifying the issue so drastically?

🌍 Architecture Gap: Missing Preferred Region Configuration

When Cosmos DB and the application ran in the same region:

  • Average latency per request was 2–5 ms.

But cross‑region:

  • Latency spiked to 40–60 ms per DB query.

We realized:

❗ We never configured preferred regions in the Cosmos SDK.

So even though we had read replicas in multiple regions from day one, the SDK always routed reads to the primary (write) region.

This meant:

  • Every read call incurred cross‑region latency

  • Ghost connections took much longer to time out

  • Thread pool exhaustion accelerated
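The amplification can be approximated with Little's Law (average in-flight work = arrival rate × time each item stays in the system). The sketch below plugs in the traffic and latency figures quoted above; the leaked-operation rate and the 30-second hold time in the last line are hypothetical values chosen only to show the shape of the effect.

```java
class InFlightEstimate {
    /** Little's Law: average in-flight requests = arrival rate * time in system. */
    public static double inFlight(double requestsPerMinute, double latencyMillis) {
        double requestsPerSecond = requestsPerMinute / 60.0;
        return requestsPerSecond * (latencyMillis / 1000.0);
    }

    public static void main(String[] args) {
        double tpm = 3000; // upper end of the traffic quoted in the post

        // Healthy requests: same-region vs cross-region latency.
        System.out.println("same region, 5 ms:   " + inFlight(tpm, 5));
        System.out.println("cross region, 60 ms: " + inFlight(tpm, 60));

        // Hypothetical: 30 orphaned operations per minute (1% of traffic),
        // each holding a thread for 30 s before it times out.
        System.out.println("leaked, 30 s hold:   " + inFlight(30, 30_000));
    }
}
```

Under these assumptions, healthy traffic keeps well under a handful of requests in flight even across regions, while a small fraction of long-lived orphaned operations occupies far more threads than all healthy requests combined. Anything that lengthens how long an orphaned operation lingers scales that number directly.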

After configuring preferred regions (in the v4 Java SDK, the preferredRegions setting on CosmosClientBuilder), Cosmos could serve reads from the nearest replica, which reduced latency and connection churn.

🎯 Final Call to Action (CTA)

If you found this useful, follow me for more:

  • Real-world Azure architecture breakdowns

  • AKS, Cosmos DB, and JVM deep dives

  • Production incident postmortems

  • Cloud design patterns for architects

I publish high-quality, experience-backed content to help engineers design scalable, resilient systems.

Stay tuned: I'm publishing more Java, Spring Boot, Kafka, Azure, AKS, and cloud architecture deep dives soon!
