Katz Sakai
GKE's Noisy Neighbor Problem Can Be Invisible in Metrics Explorer

Google Cloud's Metrics Explorer has plenty of metrics, and for most monitoring needs, it's more than enough.

However, the sampling interval of those metrics can hide real problems. I once ran into a situation where an API server on Google Kubernetes Engine (GKE) had intermittent response time spikes, yet Metrics Explorer showed nothing abnormal. The root cause turned out to be short-lived batch jobs on the same Node eating up all the CPU, a classic Noisy Neighbor problem.

Here's how I fell into that trap.

An API server that was mysteriously slow from time to time

I had a development API server running on GKE that would occasionally slow down for no obvious reason.

A request that normally completed in around 200 ms would sometimes take about 4 seconds, even under the same conditions. The slowdown was intermittent, and I could not find a clear pattern in when it happened.

When the issue occurred, CPU usage for the two GKE Nodes looked like this in Metrics Explorer:

(Screenshot: CPU usage for the two GKE Nodes)

CPU utilization was sitting around 35%. Nothing suggested the CPUs were being saturated.

I then checked the Load Average (1m) for the same Nodes:

(Screenshot: Load Average (1m) for the same Nodes)

There were some spikes, but on a 2-core Node a load average hovering around 1.5 to 2 is not alarming, and combined with the CPU utilization graph it was hard to conclude that the CPU was saturated.

(In hindsight, the Load Average spikes might have been a clue that something was happening in short bursts. But at the time, I couldn't connect the dots.)

What was actually happening on the API server node

Metrics Explorer wasn't giving me any clues. The database wasn't overloaded, and there were no notable error logs from the API server. I was stuck. Since this was a development environment, I let it sit for a while.

One day, I finally decided to log in to the affected node over SSH and watch it in real time with htop [1].

That was the turning point.

At some moments, both CPU cores were pinned at 100% for around 20 to 30 seconds, and the load average reached 7.44.

(Screenshot: htop on the affected node)

The process list showed multiple Rails batch tasks running at the same time. These batch jobs were consuming all the CPU, starving the API Pod running on the same Node.

That was the noisy neighbor.

When the batch jobs were not running, CPU usage dropped back down to around 6% to 11%.

So why didn't Metrics Explorer show this?

Because the CPU-hungry batch jobs were short-lived. Each run finished in around 20 to 30 seconds. The VM CPU utilization metric in Metrics Explorer is sampled at 60-second intervals [2]. If CPU is fully saturated for only 20 to 30 seconds out of a 60-second window, the result can still look like only about 33% to 50% average utilization.
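To make the arithmetic concrete, here is a small sketch of how a 60-second sampling window averages away a short saturation burst. The 25-second burst length and the 10% baseline are illustrative numbers, not measurements from my cluster:

```python
# How a short full-CPU burst looks after being averaged over a
# 60-second sampling window (illustrative numbers, not real data).
def windowed_average(burst_seconds: float, window: float = 60.0,
                     burst_util: float = 1.0, baseline_util: float = 0.10) -> float:
    """Average utilization over one window: a burst at burst_util,
    the remaining time at baseline_util."""
    quiet_seconds = window - burst_seconds
    return (burst_seconds * burst_util + quiet_seconds * baseline_util) / window

# A 25-second burst at 100% with a 10% baseline averages to roughly 47.5%,
# even though the node was completely saturated while the burst ran.
print(f"{windowed_average(25):.1%}")
```

That averaged number is exactly what the graph showed: a node sitting in the 30-50% range while requests on it were stalling.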

That was exactly the trap: the node really was getting hammered, but only briefly, and the 1-minute metric smoothed it into something that looked unremarkable.

(Side note) Why the batch jobs were eating all the CPU

The batch pod had no CPU limits configured, so there was no upper bound on how much CPU it could use.

As a result, when multiple batch jobs ran at the same time, they were able to consume most of the CPU available on the node and interfere with other pods running there.

After I added CPU limits to the batch pod, the API response time became stable.
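For reference, the fix amounts to setting CPU resources on the batch workload's container spec. A minimal sketch, where the names, image, and the specific CPU values are illustrative rather than my actual manifest:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker        # illustrative name
spec:
  containers:
    - name: rails-batch
      image: example/rails-batch:latest   # placeholder image
      resources:
        requests:
          cpu: "500m"       # scheduler reserves half a core for this pod
        limits:
          cpu: "1"          # hard cap: this container can never pin both cores
```

CPU limits are enforced through CFS quota throttling, so the batch job just runs slower instead of starving its neighbors. Whether a hard limit or generous requests plus more nodes is the right trade-off depends on how latency-sensitive the batch work itself is.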

What I learned

Metrics Explorer is a powerful tool, but you need to be aware of its sampling intervals. Short-lived CPU spikes can get averaged out and won't show up clearly on the graphs.

In this case, I had already noticed the symptom from the application side: “the dev API feels slow sometimes.” But the infrastructure metrics alone did not reveal the cause. I only found the real problem after looking at the node directly with htop.

I took two lessons from this.

First, monitor application latency itself, especially p95 and p99. Infrastructure metrics do not always tell you that users are already feeling pain.
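As a minimal illustration of why percentiles matter here (hypothetical latency numbers, not data from this incident), the tail that an average hides is exactly what p95/p99 surface:

```python
import statistics

# Hypothetical request latencies in ms: mostly ~200 ms, plus a few
# 4-second outliers like the ones the noisy neighbor caused.
latencies = [200] * 95 + [4000] * 5

mean = statistics.mean(latencies)
# statistics.quantiles with n=100 returns the 99 percentile cut points;
# index 94 is p95 and index 98 is p99.
p95 = statistics.quantiles(latencies, n=100)[94]
p99 = statistics.quantiles(latencies, n=100)[98]

print(f"mean={mean:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
# The mean looks tolerable; p95/p99 expose the multi-second spikes.
```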

Second, know the sampling resolution of the metrics you rely on. If a problem is short-lived enough, aggregated infrastructure graphs may hide it. In those cases, direct inspection of the machine can still be the fastest way to understand what is happening.

If you need a good starting point for that kind of live investigation, Netflix’s Linux Performance Analysis in 60,000 Milliseconds is still a useful reference.


[1] For how to run htop on a GKE Node, see Debugging node issues using the toolbox.

[2] https://docs.cloud.google.com/monitoring/api/metrics_gcp_c?hl=en#gcp-compute
