In one of our recent production incidents, we encountered a tricky issue involving Java workloads running on AKS and Azure Cosmos DB. The symptoms were subtle at first: occasional pod restarts and unexpected request throttling at an otherwise normal TPM (Transactions Per Minute). After systematic investigation, the root cause turned out to be a combination of architectural gaps and a code-level bug.
This post captures the full diagnostic journey, technical insights, and architectural lessons learned. My goal is to help engineers and architects avoid similar pitfalls—especially when building globally distributed, low‑latency cloud systems.
🗺️ Architecture Flow Diagram
🚩 The Initial Symptoms
- AKS application pods were randomly restarting.
- We observed unexpected TPM throttling during otherwise normal traffic (~2–3k TPM).
- Symptoms consistently reproduced only when the application and Cosmos DB were deployed in different regions.
This pointed toward a deeper cross‑region interaction issue—but we needed more data.
🧪 Reproducing the Problem in a Controlled Setup
We deployed:
- The Java application in Region A
- Cosmos DB in Region B
This reliably reproduced the high-latency events.
To inspect the JVM behavior during these spikes, we aimed to capture thread dumps. But then came the next hurdle…
🛠️ Running Distroless Containers: No Debugging Tools
Our application runs in distroless containers, which means:
- No shell
- No jstack
- No standard directory structure
- No way to exec into the container for diagnostics
Solution: AKS Sidecar for Debugging
We introduced a lightweight debugging sidecar with:
- Process namespace sharing (to inspect the main JVM process)
- A shared emptyDir volume to store thread dumps and logs
This allowed us to capture thread dumps exactly when latency spiked.
🧵 Thread Dump Analysis: Thread Pool Exhaustion
The thread dumps revealed:
- An unusually high number of Cosmos client threads stuck in the WAITING state
- Most were waiting on async operations
- The Cosmos client thread pool was fully saturated, despite low load
This immediately raised a red flag—Cosmos is designed to be extremely fast, and 2–3k TPM is nowhere near a threshold that would stress the SDK.
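We read the dumps as plain text from the shared volume, but if you want to automate this kind of triage, the JVM exposes the same information programmatically through ThreadMXBean. The sketch below is illustrative only: it counts parked threads whose names contain a given fragment, and the "cosmos"/"reactor" fragments are assumptions for illustration, not guaranteed SDK thread names.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;

public class ThreadTriage {

    // Counts live threads in WAITING / TIMED_WAITING whose names contain the fragment.
    // Name fragments like "cosmos" or "reactor" are illustrative; check your own dumps
    // for the actual thread naming used by the SDK version you run.
    public static long countWaiting(String nameFragment) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo[] threads = mx.dumpAllThreads(false, false);
        return Arrays.stream(threads)
                .filter(t -> t != null && t.getThreadName().toLowerCase().contains(nameFragment))
                .filter(t -> t.getThreadState() == Thread.State.WAITING
                          || t.getThreadState() == Thread.State.TIMED_WAITING)
                .count();
    }

    public static void main(String[] args) {
        System.out.println("Waiting 'cosmos' threads:  " + countWaiting("cosmos"));
        System.out.println("Waiting 'reactor' threads: " + countWaiting("reactor"));
    }
}
```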
So we dug deeper...
👻 The Hidden Culprit: Ghost Connections
We eventually traced the root cause to a bug in our async flow, causing:
- Improperly managed async calls
- “Ghost connections” (orphaned/stuck network operations)
- Increasing pressure on I/O and SDK connection pools
Under low latency, these ghost connections were less noticeable, but when cross‑region latency was added to the mix, they accumulated fast enough to exhaust the thread pool.
Fixing this async handling bug eliminated the ghost connections.
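I can't share the exact production code, but the shape of the bug and of the fix looked roughly like the sketch below, built on the Cosmos async Java SDK (v4) and Project Reactor. The repository class, item type, and timeout value are illustrative assumptions, not our real code.

```java
import com.azure.cosmos.CosmosAsyncContainer;
import com.azure.cosmos.models.PartitionKey;
import reactor.core.publisher.Mono;

import java.time.Duration;

// Illustrative sketch only: the shape of the bug and of the fix, not our production code.
public class OrderRepository {

    private final CosmosAsyncContainer container;

    public OrderRepository(CosmosAsyncContainer container) {
        this.container = container;
    }

    // BEFORE (buggy pattern): fire-and-forget with no timeout and no error handling.
    // Slow or failed cross-region calls are never observed, cancelled, or timed out,
    // so connections and SDK threads quietly pile up.
    public void readOrderBuggy(String id) {
        container.readItem(id, new PartitionKey(id), Order.class)
                 .subscribe(); // result and errors are silently dropped
    }

    // AFTER (fixed pattern): bounded by a timeout and returned to the caller,
    // so the caller controls subscription, retries, and error handling.
    public Mono<Order> readOrder(String id) {
        return container.readItem(id, new PartitionKey(id), Order.class)
                        .map(response -> response.getItem())
                        .timeout(Duration.ofSeconds(2)); // fail fast instead of hanging
    }

    // Minimal item type for the sketch.
    public static class Order {
        public String id;
    }
}
```

The key change is that slow cross-region calls now fail fast and surface their errors, instead of silently holding a connection and an SDK thread.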
But one question still remained:
Why was the cross-region case amplifying the issue so drastically?
🌍 Architecture Gap: Missing Preferred Region Configuration
When Cosmos DB and the application ran in the same region:
- Average latency per request was 2–5 ms.
But cross‑region:
- Latency spiked to 40–60 ms per DB query.
We realized:
❗ We never configured preferred regions in the Cosmos SDK.
So even though we had read replicas in multiple regions from day one, the SDK always routed reads to the primary region.
This meant:
- Every read call incurred cross‑region latency
- Ghost connections took much longer to time out
- Thread pool exhaustion accelerated
After configuring preferred regions, Cosmos could serve reads from the nearest replica, reducing latency and stabilizing connection churn.
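For reference, here is roughly what the missing configuration looks like with the Cosmos Java SDK v4. The endpoint, key handling, region names, and consistency level below are placeholders, not our production values; list the regions your account actually replicates to, starting with the region the application runs in.

```java
import com.azure.cosmos.ConsistencyLevel;
import com.azure.cosmos.CosmosAsyncClient;
import com.azure.cosmos.CosmosClientBuilder;

import java.util.List;

public class CosmosClientFactory {

    // Placeholders: use your own endpoint and pull the key from a secret store.
    private static final String ENDPOINT = "https://<your-account>.documents.azure.com:443/";
    private static final String KEY = System.getenv("COSMOS_KEY");

    public static CosmosAsyncClient create() {
        return new CosmosClientBuilder()
                .endpoint(ENDPOINT)
                .key(KEY)
                // Route reads to the closest replica first; order matters.
                // List the regions your account is replicated to, starting with
                // the region this application runs in.
                .preferredRegions(List.of("East US 2", "West Europe"))
                .consistencyLevel(ConsistencyLevel.SESSION)
                .directMode()
                .buildAsyncClient();
    }
}
```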
🎯 Final Call to Action
If you found this useful, follow me for more:
- Real-world Azure architecture breakdowns
- AKS, Cosmos DB, and JVM deep dives
- Production incident postmortems
- Cloud design patterns for architects
I publish high-quality, experience-backed content to help engineers design scalable, resilient systems.
Stay tuned: more deep dives on Java, Spring Boot, Kafka, Azure, AKS, and cloud architecture are coming soon!
