kartikay dubey

Originally published at dubeykartikay.com

How 3 Lines of Code Caused a 10x Kafka Throughput Drop

In August 2025, a user reported that upgrading to Apache Kafka v3.9.0 dropped consumer throughput by 10x. Other users reproduced it. The culprit was a configuration called min.insync.replicas, interacting with a change that was just three lines of code.

The report

Sharad Garg opened a ticket titled "Consumer throughput drops by 10 times with Kafka v3.9.0 in ZK mode." Ritvik Gupta ran controlled tests and traced the issue to min.insync.replicas. Raising it from 1 to 2 cut throughput to less than a third:

Test                      Message Rate    Configuration
1 Producer, 1 Consumer    89.21           min.insync.replicas = 2
1 Producer, 1 Consumer    298.99          min.insync.replicas = 1

Another user reported throughput falling from 147 MB/s on Kafka 3.4 to 58 MB/s on Kafka 3.9 with the same setting.

The root cause

Chia-Ping Tsai, a long-time Kafka contributor, identified the issue. It traced back to KAFKA-15583, titled "High watermark can only advance if ISR size is larger than min ISR."

The high watermark (HW) is the offset of the latest message copied to all in-sync replicas. Consumers are only allowed to read up to the HW. This guarantees that consumed data will not disappear if a broker crashes.
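
To make that concrete, here is a toy sketch of how a leader derives the HW as the minimum log-end offset (LEO) across the ISR. This is not Kafka's actual code; the replica names and offsets are invented:

// Toy model: the HW is the smallest log-end offset among the in-sync
// replicas, so every offset at or below it exists on all ISR members
// and is safe to hand to consumers.
object HighWatermarkSketch extends App {
  // Hypothetical log-end offsets for a leader and two followers.
  val logEndOffsets = Map(
    "leader"    -> 120L,
    "follower1" -> 118L,
    "follower2" -> 115L
  )
  val isr = Set("leader", "follower1", "follower2")

  // HW = min LEO across the ISR; consumers may read up to this offset.
  val highWatermark = isr.map(logEndOffsets).min
  println(s"high watermark = $highWatermark") // 115

  // If the slow follower drops out of the ISR, the HW can jump ahead
  // of data that follower is still missing.
  val shrunkIsr = isr - "follower2"
  println(s"HW after ISR shrink = ${shrunkIsr.map(logEndOffsets).min}") // 118
}

That last case, where a lagging follower leaves the ISR and the HW jumps forward, is exactly the scenario the v3.9.0 change guards against.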

The change added this check inside the leader's watermark advancement logic:

// Inside the leader's high-watermark advancement logic:
if (isUnderMinIsr) {
  trace(s"Not increasing HWM because partition is under min ISR")
  return false
}

Before v3.9.0, min.insync.replicas only affected producers using acks=all. It dictated how many replicas had to acknowledge a write before the producer considered it successful. It had nothing to do with consumers.
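
For reference, here is a minimal sketch of that producer-side contract. The broker address and the topic name "events" are hypothetical:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object AcksAllProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // hypothetical broker
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)
  // acks=all: the broker acknowledges only once all in-sync replicas
  // have the write, and rejects it with NotEnoughReplicas if the ISR
  // is smaller than min.insync.replicas.
  props.put("acks", "all")

  val producer = new KafkaProducer[String, String](props)
  producer.send(new ProducerRecord("events", "key", "value")).get() // blocks until acknowledged
  producer.close()
}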

After v3.9.0, the same setting also blocks consumer reads. If a follower is slow and drops out of the ISR, the leader stops advancing the high watermark until that follower catches up. Consumers stall until the watermark moves again.
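
A toy model of the new behavior (again, invented offsets, not Kafka's code) makes the stall visible:

object UnderMinIsrSketch extends App {
  val minInsyncReplicas = 2

  // currentHw: the leader's current high watermark.
  // isrLeos: log-end offsets of the current ISR members.
  def maybeAdvanceHw(currentHw: Long, isrLeos: Seq[Long]): Long = {
    val isUnderMinIsr = isrLeos.size < minInsyncReplicas
    if (isUnderMinIsr) currentHw              // the new check: HW frozen
    else math.max(currentHw, isrLeos.min)     // pre-3.9.0 behavior
  }

  // A follower fell out of the ISR: only the leader (LEO 200) remains.
  // The leader keeps appending, but consumers stay stuck at offset 150.
  println(maybeAdvanceHw(150L, Seq(200L)))       // 150
  // The follower rejoins at LEO 180: the HW moves and consumers resume.
  println(maybeAdvanceHw(150L, Seq(200L, 180L))) // 180
}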

Why this is a feature, not a bug

Kafka prioritizes durability over throughput. Blocking reads until min.insync.replicas are healthy prevents consumers from reading data that has not been sufficiently replicated. If the leader crashes after a consumer reads an under-replicated message, that message is gone, and the consumer has already processed it.

The trade-off is real. The change arguably deserved a major version bump, because a 10x throughput drop in a minor release can break production pipelines.

The fix

If you hit this, your options are straightforward:

  • Lower min.insync.replicas if your durability requirements allow it (see the AdminClient sketch after this list).
  • Ensure followers have enough resources to keep up with the leader, so they stay in the ISR.
  • Monitor ISR size and follower lag as critical metrics.
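
Here is a minimal sketch covering the first and third options with Kafka's AdminClient. The broker address and the topic name "events" are hypothetical, and whether lowering the value is acceptable depends on your durability requirements:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{Admin, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource
import scala.jdk.CollectionConverters._

object MinIsrTooling extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // hypothetical broker

  val admin = Admin.create(props)
  val topic = "events"

  // Monitor: print the ISR size for each partition. An ISR smaller
  // than min.insync.replicas now also stalls consumers.
  val description = admin.describeTopics(Collections.singleton(topic))
    .allTopicNames().get().get(topic)
  description.partitions().asScala.foreach { p =>
    println(s"partition ${p.partition()}: ISR size = ${p.isr().size()}")
  }

  // Relax: lower min.insync.replicas to 1 at the topic level.
  val resource = new ConfigResource(ConfigResource.Type.TOPIC, topic)
  val ops: java.util.Collection[AlterConfigOp] = Collections.singleton(
    new AlterConfigOp(new ConfigEntry("min.insync.replicas", "1"), AlterConfigOp.OpType.SET))
  admin.incrementalAlterConfigs(Map(resource -> ops).asJava).all().get()

  admin.close()
}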

Three lines of code. A massive performance impact. A reminder that distributed systems are full of sharp edges.

For the full timeline, the mailing list discussion, and the exact PR diff, see the original post: How a Minor Release Caused a 10x Throughput Drop in Kafka.
