DEV Community

kartikay dubey


Why Your Kafka Consumers Are Suddenly 10x Slower in v3.9.0

TL;DR

In Apache Kafka v3.9.0, a minor release, consumer throughput dropped by up to 10x for some users.

The change behind it was deliberate: it prioritizes durability over throughput.

The main star of the story is the min.insync.replicas topic configuration. It isn't new, but it took on a new role.

In versions before v3.9.0, it only controlled when the broker accepted writes from producers using acks=all.
Now, it also dictates when a message becomes visible to consumers.
This subtle change caused a 10x throughput drop for some Kafka users.

Keep reading to find out why this change was made and how to deal with it, so that your production systems don't get throttled.

It All Begins

In August 2025, a user named Sharad Garg raised an issue on the Kafka issue tracker.
It was titled "Consumer throughput drops by 10 times with Kafka v3.9.0 in ZK mode".
Other users validated his claim and shared reproduction steps.

Notably, Ritvik Gupta ran tests that pointed the blame at the min.insync.replicas configuration.
His tests showed that consumer throughput dropped significantly when min.insync.replicas was changed from 1 to 2.

[Table: consumer throughput before and after changing min.insync.replicas]

Other users chimed in on the issue, reporting the same problem.

"We were able to reproduce this issue in our own environment too.
Throughput drops from: 147.5842 MB/sec (Kafka 3.4) to 58.6748 MB/sec (Kafka 3.9) with min.insync.replicas=2"

-Bertalan Kondrat

It Gets Escalated

Eventually the issue went stale, getting no attention from the maintainers, until Marcus Page escalated it to the Kafka dev mailing list.

This is how I learned about the issue.

The escalation got the attention of Chia-Ping Tsai, a long-time contributor to the project.
A day later, he replied to that email saying he had identified the root cause and posted it on the ticket. I obviously rushed to check this root cause, and it left me even more confused.

The Root Cause

The root cause is related to KAFKA-15583 - "High watermark can only advance if ISR size is larger than min ISR". The title says it all. The consumer can't read more data due to the HW, which can't be advanced due to the slow followers dropping the partition below the min ISR

-Chia-Ping Tsai

What Is High Watermark?

The High Watermark is the offset of the latest message successfully copied to all brokers currently in the In-Sync Replicas (ISR) list. It acts as a strict safety boundary so consumers only read fully committed data that won't disappear if a broker suddenly crashes.
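The rule can be sketched in a few lines of Python. This is a toy model with hypothetical names, not Kafka's actual implementation: the HW is simply the smallest log-end offset among the replicas currently in the ISR.

```python
# Toy model of the High Watermark (hypothetical names, not Kafka's code):
# the HW is the smallest log-end offset among the replicas in the ISR,
# so consumers only ever read messages that every ISR member has copied.

def high_watermark(isr_log_end_offsets: list[int]) -> int:
    """Each entry is the log-end offset of one replica in the ISR."""
    return min(isr_log_end_offsets)

# The leader has 100 messages, but the slowest ISR member has only 95,
# so consumers can read up to offset 95.
print(high_watermark([100, 98, 95]))  # -> 95
```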

Since I didn't know what the High Watermark was, I was confused; after learning about it, I was even more confused. Why would min.insync.replicas dictate the HW?

If you don't know what min.insync.replicas does: it lets you enforce stronger durability guarantees at the producer level. With acks=all, the producer's write fails with an exception if the message is not acknowledged by at least min.insync.replicas replicas.

Notice how this description does not mention consumers.
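To make the producer-side behavior concrete, here is a minimal sketch of the broker-side check that backs acks=all. The names are hypothetical, not Kafka's actual API:

```python
# Toy model of the broker-side durability check behind acks=all
# (hypothetical names, not Kafka's actual API).

class NotEnoughReplicasError(Exception):
    """Stand-in for the error the producer surfaces as an exception."""

def accept_write(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    # With acks=all, the broker refuses the write when too few
    # replicas are in sync; the producer sees this as an exception.
    if acks == "all" and isr_size < min_insync_replicas:
        raise NotEnoughReplicasError(
            f"ISR size {isr_size} < min.insync.replicas {min_insync_replicas}"
        )
    return True

accept_write("all", isr_size=2, min_insync_replicas=2)    # write accepted
# accept_write("all", isr_size=1, min_insync_replicas=2)  # would raise
```

Again, nothing here mentions consumers; the check only gates producer writes.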

Next, I looked into the related KAFKA-15583 issue.

It has no description. 🥲

It links to two PRs, and it was there that I finally understood the whole picture.

It's Not a Bug, It's a Feature

After looking at one of the linked PRs, I found the three lines of code that caused consumer throughput to drop by a factor of 10.

```diff
   private def maybeIncrementLeaderHW(leaderLog: UnifiedLog, currentTimeMs: Long = time.milliseconds): Boolean = {
+    if (isUnderMinIsr) {
+      trace(s"Not increasing HWM because partition is under min ISR(ISR=${partitionState.isr})")
+      return false
+    }
```

These three lines block the high watermark from advancing whenever the partition's in-sync replica set is smaller than min.insync.replicas.
So if a follower is slow and gets kicked out of the ISR, the leader has to wait for it to catch up and rejoin before newly produced messages become visible to consumers.
This is a trade-off between reliability and performance: even though it hurts consumer throughput, Kafka accepts the loss in favor of stronger guarantees against data loss.
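To see the trade-off in action, here is a toy simulation comparing HW advancement with and without the new guard. The names are hypothetical; the real logic lives in Kafka's Partition.scala:

```python
# Toy simulation of maybeIncrementLeaderHW with and without the guard
# (hypothetical names, not Kafka's actual code). With the guard enabled,
# the HW freezes while the partition is under min ISR.

def maybe_increment_leader_hw(current_hw, isr_log_end_offsets, min_isr,
                              enforce_min_isr):
    if enforce_min_isr and len(isr_log_end_offsets) < min_isr:
        return current_hw  # v3.9.0 behavior: HW frozen, consumers stall
    # Otherwise the HW advances to the slowest ISR member's log-end offset.
    return max(current_hw, min(isr_log_end_offsets))

# A slow follower was dropped from the ISR; only the leader (offset 100) remains.
before = maybe_increment_leader_hw(90, [100], min_isr=2, enforce_min_isr=False)
after = maybe_increment_leader_hw(90, [100], min_isr=2, enforce_min_isr=True)
print(before, after)  # -> 100 90
```

With the guard off, consumers keep reading up to the leader's offset 100; with it on, they are stuck at offset 90 until the follower rejoins the ISR.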

In my opinion, this change may have warranted a major version bump instead of a minor one, since the throughput change could impact a lot of production systems.
But version bumping is already a controversial topic; I'll talk about it another day.

So that's it: a three-line change that causes a massive throughput drop.

I half expected it to be an obscure JVM bug or a CPU architecture issue, but as with 99.99999999% of other bugs, it came down to new code.

Thanks for reading through to the end. I have joined the kafka-dev mailing list and am actively trying to become a contributor.

Follow this blog for more quirks and insider information about Kafka.
