We recently started migrating Kafka clusters from EC2 to EKS using Strimzi. The goal was not to chase new features, but to reduce the operational overhead of running large stateful clusters by hand. Upgrades, configuration changes, instance-family replacements, and failure recovery all required too much manual coordination on EC2.
We wanted a model that gave us:
- Declarative configuration.
- Simpler upgrades.
- Easier infrastructure changes.
- Better self-healing.
- Less day-to-day operational toil.
That part worked. What did not work was the performance profile after the migration. As soon as we moved the first cluster, we saw persistent disk reads across the brokers and higher latency than we expected on comparable hardware.
Why That Was a Problem
For Kafka, disk reads are not just a storage detail. In normal operation, when consumers stay near the head of the log and the brokers have enough memory, hot data should usually be served from page cache instead of disk.
Once reads start falling through to storage, the symptoms become hard to ignore:
- Latency increases.
- Throughput becomes less predictable.
- Storage does more work than it should.
- The cluster behaves differently from the mental model you rely on.
That is why this stood out immediately. We were not looking at a harmless metric difference. We were looking at a performance path that should not have been active so often.
What You’ll Learn
This post explains how we investigated the issue and what we found. Although Kafka exposed the problem, the root cause was broader: an interaction between Kubernetes, cgroup v2, Linux reclaim behavior, and disk-backed workloads.
By the end of this article, you should have a clearer picture of:
- Why some data services start reading from disk more than expected on Kubernetes.
- Which signals help distinguish a Kafka problem from a kernel or memory-management problem.
- What to inspect before changing application-level settings.
- Which kernel and reclaim-related knobs are worth understanding.
- How to think about tuning other stateful services that behave strangely on Kubernetes.
The problem: unexpected disk reads
After migrating our first Kafka cluster to Kubernetes with Strimzi, we noticed something unusual right away: the brokers were doing consistent, non-trivial disk reads. For our workload, that was a red flag, because these clusters handle very high throughput and small latency regressions show up quickly in broker performance.
This mattered because it was not an isolated spike or a recovery event. It was steady read activity under normal operation, on hardware that was supposed to behave similarly to our EC2 deployment.
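If you want to confirm reads like this without a metrics dashboard, sampling /proc/diskstats directly on the node is enough. The device name below is an assumption; substitute your Kafka data volume.

```shell
# Sample total sectors read from /proc/diskstats twice and diff them.
# Field 3 is the device name, field 6 is sectors read (512 bytes each).
dev=nvme1n1   # assumed device name; substitute your Kafka data volume
read_sectors() {
  awk -v d="$dev" '$3 == d { print $6; found = 1 } END { if (!found) print 0 }' /proc/diskstats
}
r1=$(read_sectors)
sleep 5
r2=$(read_sectors)
echo "bytes read in 5s: $(( (r2 - r1) * 512 ))"
```

With a healthy page cache and consumers near the head of the log, this number should sit near zero in steady state.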
Why disk reads mattered
Kafka relies heavily on the operating system page cache rather than implementing its own buffering layer. In a healthy cluster, when consumers are reading near the head of the log and the node has enough available memory, most hot reads should come from memory instead of disk.
That is why these reads got our attention immediately:
- Memory hits are much faster than disk reads.
- Falling through to storage adds latency.
- Sustained reads often mean page cache is being reclaimed too aggressively.
The key point was simple: for this workload, disk reads were not just a storage metric. They were a latency signal.
First hypothesis: consumer lag
Our first suspicion was consumer lag. That would have been the simplest explanation: if consumers were reading older offsets, the relevant data might no longer be in page cache, forcing the kernel to fetch it from disk.
We checked consumer lag using our Kafka lag exporters and monitoring dashboards and found no meaningful lag. Consumers were reading close to the head of the log, so lag alone could not explain the persistent reads.
Takeaway: the reads were real, but they were not caused by consumers falling behind.
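If you do not have a lag exporter wired up, the stock Kafka CLI gives you the same answer. The group name and bootstrap address below are placeholders; the awk helper just sums the LAG column of the describe output (GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID ...).

```shell
# Sum the LAG column from `kafka-consumer-groups.sh --describe` output.
kafka_total_lag() {
  awk 'NR > 1 && $6 ~ /^[0-9]+$/ { lag += $6 } END { print lag + 0 }'
}

# Normally you would pipe the real command:
#   kafka-consumer-groups.sh --bootstrap-server broker:9092 \
#     --describe --group my-group | kafka_total_lag
# Demo with canned describe output:
printf '%s\n' \
  'GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID' \
  'g     t     0         100            100            0   c1' \
  'g     t     1         90             100            10  c1' | kafka_total_lag
# prints: 10
```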
What actually changed
Once we ruled out lag, the next question was straightforward: what changed between the old and new environments?
We compared the obvious candidates:
- Kafka configuration, including topic settings, compression, and broker configs.
- Linux sysctl tuning.
- Instance sizing, including CPU and memory.
Those were effectively unchanged. The meaningful differences were lower in the stack:
- We moved from Ubuntu 20.04 to Amazon Linux 2023.
- We moved from cgroup v1 to cgroup v2.
That narrowed the investigation to operating-system memory behavior rather than Kafka itself.
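Checking which cgroup version a node is running takes one command: the filesystem type mounted at /sys/fs/cgroup is cgroup2fs on the unified (v2) hierarchy and tmpfs on the v1 split hierarchy.

```shell
# Identify the cgroup version on a node from the /sys/fs/cgroup mount.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null)
case "$fstype" in
  cgroup2fs) echo "cgroup v2" ;;
  tmpfs)     echo "cgroup v1" ;;
  *)         echo "unknown: $fstype" ;;
esac
```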
Measuring kernel behavior
To see what the kernel was doing, we used writeback.bt, a bpftrace script that shows why pages are being written back. This was useful because it distinguishes between ordinary background writeback and reclaim-driven writeback caused by memory pressure.
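The full writeback.bt ships with the bpftrace repository; a stripped-down sketch of the idea looks like this. It counts written-back pages keyed by the kernel's writeback reason code (an enum index; WB_REASON_VMSCAN is one of its values), while the real script also resolves the reason to a string and reports device and latency.

```
#!/usr/bin/env bpftrace
// Minimal sketch of the idea behind writeback.bt: count pages written
// back, keyed by the kernel's writeback reason code. bpftrace prints
// the @pages map automatically on exit.
tracepoint:writeback:writeback_written
{
    @pages[args->reason] = sum(args->nr_pages);
}
```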
On the new machine, many writeback events were tagged as vmscan:
```
bpftrace ./writeback.bt
Attaching 4 probes…
Tracing writeback… Hit Ctrl-C to end.
TIME      DEVICE  PAGES  REASON    ms
13:06:59  259:0   2385   vmscan    0.006
13:06:59  259:0   2385   vmscan    0.000
13:06:59  259:0   26476  periodic  0.000
13:06:59  259:0   38518  vmscan    0.002
13:06:59  259:0   2397   vmscan    0.000
13:06:59  259:0   2397   vmscan    0.000
```
On the old machine, writeback was dominated by background and periodic events instead:
```
bpftrace ./writeback.bt
Attaching 4 probes…
Tracing writeback… Hit Ctrl-C to end.
TIME      DEVICE  PAGES  REASON      ms
13:07:59  259:0   2945   periodic    0.006
13:07:59  259:0   25613  periodic    0.000
13:07:59  259:0   26476  background  0.000
13:07:59  259:0   38518  background  0.000
13:07:59  259:0   2107   background  0.000
13:07:59  259:0   2645   periodic    0.000
```
That difference was the first strong kernel-level signal. The new setup was doing much more reclaim-driven writeback, which meant page cache was under pressure.
What vmscan told us
vmscan is part of the kernel reclaim path. When it shows up in writeback traces, it usually means the kernel is actively reclaiming memory rather than performing routine background flushing.
In practice, that meant the system was paying a reclaim penalty we did not expect on equivalent hardware. At that point, the question was no longer “why is Kafka reading from disk?” but “why is the kernel reclaiming page cache so aggressively in this environment?”
Root cause: reclaim pressure under cgroup v2
This led us to cgroup v2 memory behavior, and specifically to memory.high. Under cgroup v2, once a workload crosses the high memory threshold, the kernel can start applying reclaim pressure inside that cgroup even if the node still has free memory available.
That is a poor fit for Kafka and similar disk-backed systems:
- They benefit from using available memory for page cache.
- High-throughput traffic can push them into reclaim pressure quickly.
- Once reclaim starts inside the cgroup, hot pages are evicted sooner.
- More reads then fall through to disk, increasing latency.
In other words, the issue was not that the node had too little RAM in absolute terms. The issue was that the memory-reclaim behavior changed under Kubernetes with cgroup v2.
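You can inspect these thresholds directly on the node. The helper below prints the cgroup v2 memory control files for a given cgroup directory; the kubepods path at the bottom is an assumption that depends on your kubelet cgroup driver (this is the systemd-driver layout).

```shell
# Print the cgroup v2 memory thresholds for a cgroup directory.
# "max" means unlimited; a byte value in memory.high means the kernel
# may start reclaiming inside the cgroup once usage crosses it.
show_mem_thresholds() {
  cg="$1"
  for f in memory.high memory.max memory.current; do
    [ -r "$cg/$f" ] && printf '%s = %s\n' "$f" "$(cat "$cg/$f")"
  done
  return 0
}
# Path is an assumption (kubelet with the systemd cgroup driver):
show_mem_thresholds /sys/fs/cgroup/kubepods.slice
```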
Why dirty ratios were not enough
At first, we tried the obvious Kafka-style tuning knobs such as vm.dirty_ratio and vm.dirty_background_ratio. Those settings influence how much dirty data the kernel allows before forcing writeback.
They helped control writeback behavior, but they did not solve the real problem. Reclaim-driven writeback remained, and disk reads still stayed above the old baseline.
Takeaway: dirty page tuning was not enough, because the main issue was reclaim pressure rather than ordinary flushing.
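For reference, these are the knobs we tried first. Reading them through /proc avoids needing the sysctl binary; the write commands are commented out, and the values shown are illustrative, not recommendations.

```shell
# Current writeback thresholds, as a percentage of reclaimable memory:
cat /proc/sys/vm/dirty_background_ratio   # background flusher starts here
cat /proc/sys/vm/dirty_ratio              # writers are throttled here
# To change them at runtime (illustrative values, run as root):
#   sysctl -w vm.dirty_background_ratio=5
#   sysctl -w vm.dirty_ratio=20
```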
Fix part 1: remove pod memory limits
The first meaningful fix was to stop setting memory limits for Kafka pods and rely on requests plus dedicated nodes instead. That avoided triggering pod-level reclaim pressure through cgroup memory controls while the host still had memory available.
This worked for our setup because:
- Kafka ran on dedicated nodes.
- Only minimal system and DaemonSet workloads shared those nodes.
- Capacity was managed at the node level rather than through strict pod memory caps.
That change gave Kafka more freedom to benefit from host page cache instead of being boxed into an artificial reclaim boundary.
Warning: Do not treat this as a general Kubernetes best practice. Removing pod memory limits on shared nodes can push the node into memory pressure, which may trigger pod eviction and interrupt the workload. We only used this approach because Kafka was isolated on dedicated nodes and capacity was controlled at the node level.
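In Strimzi terms, this means setting resources.requests without resources.limits on the Kafka spec. A sketch of the relevant part of the custom resource, with a placeholder cluster name and illustrative sizes:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster          # placeholder name
spec:
  kafka:
    replicas: 3
    resources:
      requests:             # drives scheduling onto the dedicated nodes
        cpu: "12"           # illustrative sizes
        memory: 48Gi
      # deliberately no "limits:" section, so the pod cgroup does not
      # impose a memory ceiling and page cache can use host memory
    # listeners, storage, and the rest of the spec are unchanged
```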
Fix part 2: tune vm.min_free_kbytes
The second fix was to tune vm.min_free_kbytes. This setting influences the kernel watermarks that determine when kswapd wakes up and starts reclaiming memory.
On our nodes, the default value was:
```
sysctl -a | grep min_free_kbytes
vm.min_free_kbytes = 67584
```
We increased it gradually and observed:
- Earlier background reclaim by kswapd.
- Fewer direct reclaim events.
- Less vmscan-driven writeback under the same workload.
For our hardware, the best result came from:
```
vm.min_free_kbytes = 2548576
```
That number is workload-specific, so I would not present it as a universal recommendation. The important lesson is the mechanism: raising the watermarks helped the kernel reclaim earlier and more smoothly, instead of falling into harsher reclaim behavior later.
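Mechanically, the change is a single sysctl. The value in the comments is the one that worked for our hardware, not a recommendation; derive your own from testing.

```shell
# Read the current watermark floor (in KiB):
cat /proc/sys/vm/min_free_kbytes
# Raise it at runtime (our value; run as root):
#   sysctl -w vm.min_free_kbytes=2548576
# Persist it across reboots (file name is arbitrary):
#   echo 'vm.min_free_kbytes = 2548576' > /etc/sysctl.d/99-reclaim.conf
```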
Results
The fix came down to two changes:
- Removing Kafka pod memory limits.
- Tuning vm.min_free_kbytes on the node.
After applying both, the difference was obvious: under the same workload, disk read bytes dropped close to zero and normalized load average decreased by about 50%.
Practical checklist
If you are troubleshooting similar behavior in Kafka or another disk-backed service on Kubernetes, I’d start here:
- Run the workload on dedicated nodes.
- Avoid pod memory limits if the service depends heavily on page cache.
- Compare cgroup behavior, not just Kafka configs.
- Use bpftrace or similar tools to distinguish normal writeback from reclaim-driven writeback.
- Inspect reclaim-related tuning such as vm.min_free_kbytes, not only dirty page ratios.
Further Reading
If you want to dig deeper into the ideas behind this investigation, these references are worth reading:
Kafka and the filesystem
Linux memory, page cache, and reclaim
- Linux page cache for SREs
- Understanding the Linux page frame reclamation algorithm
- Oracle Linux blog: Anticipating your memory needs
cgroup v2 and tracing tools
Continue the discussion
If you’ve run into similar page-cache, reclaim, or cgroup v2 behavior in Kafka or other stateful workloads on Kubernetes, I’d be interested to compare notes.