A few days ago we had one of those annoying infra issues where nothing looked fully broken, but everything felt off.
There were no deploys, no config changes, and no traffic spikes, yet pods started failing readiness checks, some requests were timing out, and a few services turned flaky.
The confusing part was that the usual metrics looked fine.
From the outside, the cluster looked healthy enough, but it wasn't.
What actually happened
We run workloads across multiple availability zones.
Normally, latency between the affected nodes sits around 3–5ms, but that day our network provider had a routing issue, and node-to-node latency jumped to around 80ms.
That was enough to start causing real problems.
Why 80ms was such a big deal
Because inside a Kubernetes cluster, a request is rarely just one request.
It might hit ingress, then a service, then another internal service, then Redis or MySQL, and then work its way back.
When node-to-node latency is 3–5ms, you barely think about it.
When it becomes 80ms, all of those internal hops start getting expensive very quickly.
Nothing is technically down, but normal request paths get slow enough to hit timeouts, probes start failing, and the whole system begins to feel random.
That is exactly what we saw.
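The compounding effect is easy to see with back-of-the-envelope math. The hop count, timeout, and app-work numbers below are illustrative assumptions, not measurements from our incident:

```python
# Rough model of how per-hop network latency compounds inside a cluster.
# Hop count, timeout, and app-work time are assumed values for illustration.

HOPS = 6          # e.g. ingress -> service -> service -> Redis -> MySQL -> back
TIMEOUT_MS = 500  # a typical client or probe timeout
APP_WORK_MS = 100 # time spent doing actual application work

for per_hop_ms in (3, 5, 80):
    total = APP_WORK_MS + HOPS * per_hop_ms
    status = "ok" if total < TIMEOUT_MS else "TIMEOUT"
    print(f"{per_hop_ms:>3}ms/hop -> {total}ms total ({status})")
```

At 3–5ms per hop the network is a rounding error; at 80ms the same request path blows straight past a 500ms timeout without any single component being "down".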
The readiness failures were the clue
The pods were not crashing.
They were just becoming slow enough that readiness checks started failing.
And once that happens, Kubernetes does what it is supposed to do: it removes those pods from service.
Then they recover.
Then they fail again.
Then recover again.
From the outside, it looks like instability.
But really, the application is just getting dragged down by a slower network path underneath it.
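The flapping pattern falls out of how probe thresholds work. Here is a minimal sketch of that loop; the threshold values mirror the Kubernetes readiness-probe defaults (timeoutSeconds=1, failureThreshold=3, successThreshold=1), and the response times are invented to show the effect:

```python
# Sketch of readiness-probe flapping when responses hover around the probe
# timeout. Thresholds mirror Kubernetes defaults; response times are made up.

TIMEOUT_MS = 1000      # probe timeoutSeconds=1
FAILURE_THRESHOLD = 3  # consecutive failures before marking NotReady
SUCCESS_THRESHOLD = 1  # consecutive successes before marking Ready again

def simulate(response_times_ms):
    ready, fails, successes = True, 0, 0
    states = []
    for ms in response_times_ms:
        if ms < TIMEOUT_MS:            # probe answered in time
            successes, fails = successes + 1, 0
            if not ready and successes >= SUCCESS_THRESHOLD:
                ready = True           # pod goes back into Service endpoints
        else:                          # probe timed out
            fails, successes = fails + 1, 0
            if ready and fails >= FAILURE_THRESHOLD:
                ready = False          # pod pulled from Service endpoints
        states.append(ready)
    return states

# Responses slow down (extra network latency), recover briefly, slow again.
print(simulate([200, 1200, 1300, 1400, 300, 1100, 1500, 1600, 250]))
# -> [True, True, True, False, True, True, True, False, True]
```

That oscillating True/False is exactly the in-and-out-of-service behavior we were watching, driven entirely by response times straddling the probe timeout.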
What fixed it
We temporarily moved traffic off the primary provider onto a backup provider, which dropped the latency to around 7ms, and the cluster started behaving normally almost immediately.