
Fathma Siddique

Dealing with Long-Running Kafka Consumers and Message Backlogs

For a while I genuinely could not figure out what was wrong.
Nothing was throwing errors. The service was running. But messages were piling up, some were being processed twice, and the lag just kept climbing. I kept waiting for it to sort itself out. It did not.
Eventually I had to sit down and actually trace through what was happening.

The Problem
Our consumer was doing too much. Each message triggered external API calls, heavy business logic, and blocking operations. Individually, a message might take a couple of minutes to get through. That feels manageable until you remember that Kafka's default max.poll.records is 500. Pull a batch with even a handful of slow messages, and the cumulative processing time blows past Kafka's default max.poll.interval.ms of 5 minutes without much effort.
When that happens, Kafka assumes the consumer has died. It triggers a rebalance, reassigns the partitions, and those same messages get picked up and processed all over again.
That was our loop. Consumer pulls a batch, gets bogged down processing it, Kafka loses patience, rebalance happens, repeat.
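The mismatch is easy to see with a bit of arithmetic. A minimal sketch (the defaults are Kafka's Java-client defaults; the two-minutes-per-message figure is this post's estimate):

```python
# Kafka consumer defaults (Java client)
MAX_POLL_RECORDS = 500          # records returned per poll()
MAX_POLL_INTERVAL_MS = 300_000  # 5 minutes allowed between polls before
                                # the broker assumes the consumer is dead

def worst_case_batch_ms(records: int, per_record_ms: int) -> int:
    """Time to process one poll() batch sequentially."""
    return records * per_record_ms

# Even three slow messages in one batch overrun the 5-minute deadline:
assert worst_case_batch_ms(3, 120_000) > MAX_POLL_INTERVAL_MS

# A full default batch at ~2 min/message is roughly 16.7 hours of work
# against a 5-minute deadline -- a guaranteed rebalance.
assert worst_case_batch_ms(MAX_POLL_RECORDS, 120_000) == 60_000_000
```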

What We Did
The first thing was just to stop the bleeding. We bumped max.poll.interval.ms up to 8 minutes to give the consumer a bit more breathing room. Rebalances stopped almost immediately. That was a relief, but it was a band-aid, not a fix.

Next we set max.poll.records = 1. One message at a time. With each message taking a couple of minutes, pulling any larger batch was just asking for trouble. Throughput dropped considerably, but at least the system was stable and we could reason about it.
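Put together, the stabilising settings look something like this. A sketch in kafka-python naming (the Java client spells these max.poll.interval.ms and max.poll.records); the broker address, group id, and topic are placeholders, not the real ones:

```python
# Consumer settings after the emergency fixes.
consumer_config = {
    "bootstrap_servers": "localhost:9092",  # placeholder
    "group_id": "slow-processor",           # placeholder
    "max_poll_interval_ms": 8 * 60 * 1000,  # raised from the 5-minute default
    "max_poll_records": 1,                  # one message per poll()
    "enable_auto_commit": False,            # commit manually after processing
}

# With a real broker, this would be wired up as:
# from kafka import KafkaConsumer
# consumer = KafkaConsumer("my-topic", **consumer_config)
```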

We also dropped auto-commit and switched to manual offset commits. Honestly, we should have done this from the start. Auto-commit quietly marks messages as done on a timer, whether processing actually succeeded or not. Manual commits meant we knew exactly what had been handled and what had not.
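The commit-after-success loop can be sketched like this. It is modeled on kafka-python's poll(), which returns a dict of partition to records; handle is a hypothetical processing callback, and the consumer object is assumed to expose poll() and commit():

```python
def process_one(consumer, handle):
    """Poll a batch, process each record, and commit only on success.

    If handle() raises, we never reach commit(), so the offset is not
    advanced and the message will be redelivered -- which is exactly
    what auto-commit's timer could not guarantee.
    """
    records = consumer.poll(timeout_ms=1000)  # {TopicPartition: [records]}
    for batch in records.values():
        for record in batch:
            handle(record)     # raises on failure -> no commit
            consumer.commit()  # synchronous commit after success
```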

Kafka consumers are not meant to do heavy, long-running work.
After things stabilised, we redesigned the flow so that the consumer became lightweight. It would validate the message and quickly hand the heavy work off to background workers. Kafka went back to doing what it is good at: moving data fast. And our system stopped fighting it.
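The shape of that split, sketched with an in-process queue standing in for whatever hands work to the background workers (in practice this might be a second topic, a task queue, or a thread pool; validate and handle are hypothetical callbacks):

```python
import queue
import threading

work_q: "queue.Queue" = queue.Queue()

def consume_message(msg, validate):
    """The consumer's whole job now: validate, enqueue, return fast."""
    if validate(msg):
        work_q.put(msg)  # heavy work happens elsewhere

def worker(handle):
    """Background worker: drains the queue on its own clock,
    with no poll-interval deadline hanging over it."""
    while True:
        msg = work_q.get()
        if msg is None:  # sentinel to shut the worker down
            break
        handle(msg)      # slow API calls, business logic, etc.
        work_q.task_done()
```

The consumer's poll loop now finishes in milliseconds regardless of how slow the downstream work is, so max.poll.interval.ms stops being a constraint at all.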

We also added retries and a dead letter queue so one broken message could not drag everything else down with it.
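A minimal retry-then-park sketch; send_to_dlq is a hypothetical callback that would publish to a dead letter topic:

```python
def process_with_retry(msg, handle, send_to_dlq, max_attempts=3):
    """Try a message a few times; if it keeps failing, park it on the
    dead letter queue so it stops blocking the rest of the partition."""
    for _attempt in range(max_attempts):
        try:
            handle(msg)
            return True
        except Exception as exc:
            last_error = exc  # always bound: the loop runs at least once
    send_to_dlq(msg, last_error)
    return False
```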

What Stuck With Me
I think I was treating Kafka like a job queue because that is what felt familiar. But it is not that. It is a streaming system that expects you to keep up with it. The moment you do slow, heavy work inside the consumer, you are borrowing time you do not have.
Once we aligned with how Kafka actually works, everything got simpler. The lag cleared. The rebalances stopped. The system finally felt like it was running the way it was supposed to.
Sometimes the fix is technical. But sometimes you just have to admit the design was wrong.
