turboline-ai

Posted on Jun 29

Kafka is not a queue — and treating it like one will wreck your system

#streaming #datapipeline #backend

Stop treating Kafka like a queue!

A story about silent lag, vanishing messages, and the mental model shift nobody warns you about.

You integrated Kafka. Messages are flowing. Everything looks fine in staging. Then you ship to production, traffic spikes, and suddenly messages are "disappearing," consumers are falling hours behind, and your on-call engineer is staring at a dashboard that makes no sense.

This is not a Kafka bug. This is a mental model problem.

The core misunderstanding

Most teams come to Kafka from RabbitMQ, SQS, or some other traditional message queue. In those systems, the broker owns delivery. You enqueue a message, a consumer picks it up, acknowledges it, and it's gone. The queue shrinks. That's the contract.

Kafka does not work this way.

Kafka is a distributed, append-only log. Messages are written to partitions and retained for a configured period, regardless of whether anyone consumed them. Consumers track their own position in that log using offsets. The broker does not care if you've read something. It does not remove messages after consumption. The log moves forward. Consumers choose where they sit in it.
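To make the log model concrete, here is a toy in-memory sketch. It is not Kafka and uses no Kafka client; `PartitionLog` and the consumer names are made up for illustration. The point is only that reading never shrinks the log and that each reader carries its own offset.

```python
class PartitionLog:
    """Toy in-memory stand-in for one Kafka partition (illustrative only)."""

    def __init__(self):
        self._records = []  # append-only; reads never remove anything

    def append(self, value):
        """Append a record and return the offset it was written at."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset, without deleting them."""
        return self._records[offset:offset + max_records]

    def end_offset(self):
        """Offset the next appended record will receive."""
        return len(self._records)


log = PartitionLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

# Each consumer keeps its own position; the log itself is untouched by reads.
positions = {"billing": 0, "analytics": 0}

batch = log.read(positions["billing"], max_records=2)
positions["billing"] += len(batch)         # billing has read 2 records

batch_all = log.read(positions["analytics"])
positions["analytics"] += len(batch_all)   # analytics has read all 3

remaining = log.end_offset()  # still 3: neither reader removed anything
```

In a queue, `remaining` would have dropped to zero. Here the only thing that changed is each consumer's own position.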

This distinction sounds academic until it isn't.

What "vanishing messages" actually means

When engineers say messages are vanishing in Kafka, they almost always mean one of two things:

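The consumer offset moved past the message. A consumer that crashes before committing simply resumes from its last committed offset and reprocesses. Skipping happens the other way around: the offset was committed before processing finished (auto-commit is the usual culprit), or a group with no committed offset started with auto.offset.reset=latest. Either way, the message was never lost. The consumer skipped it.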

The message hit the retention window. By default Kafka retains log segments for 7 days (log.retention.hours=168), and a size-based limit (log.retention.bytes) can delete data sooner if it is configured. If your consumer falls far enough behind, the log segment gets deleted before the consumer gets there. The message is gone, but only because your consumption logic couldn't keep up.

Neither of these is Kafka misbehaving. Both are symptoms of teams applying queue semantics to a log-based system.

Silent lag and why it's so dangerous

In a traditional queue, lag is visible almost immediately. The queue depth grows and alarms go off. With Kafka, consumer lag can build slowly and quietly. The producer keeps writing. The consumer keeps reading. But if the consumer is processing slower than messages arrive, the offset gap widens.

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group

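Run that command and look at the LAG column, which is the log end offset minus the group's committed offset for each partition. How large is too large depends on your throughput: a lag of a few hundred thousand is seconds of work on a busy topic and hours on a quiet one. A number that keeps growing between runs means the consumer is not catching up. Most teams don't check this until something breaks.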
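If you would rather compute the same figure in code and alert on the trend, a minimal sketch looks like this. The function names and sample numbers are hypothetical. With kafka-python, the two input dicts could plausibly be built from `KafkaConsumer.end_offsets()` and `KafkaConsumer.committed()`, but this sketch deliberately takes plain dicts so it runs without a broker.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: log end offset minus the group's committed offset.

    A partition with no committed offset (None or missing) is treated as
    position 0 here; that is a simplifying assumption of this sketch.
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition) or 0
        lag[partition] = max(end - committed, 0)
    return lag


def is_falling_behind(lag_samples):
    """True if total lag rose across every consecutive pair of samples."""
    return len(lag_samples) >= 2 and all(
        later > earlier for earlier, later in zip(lag_samples, lag_samples[1:])
    )


# Hypothetical numbers for a three-partition topic.
end_offsets = {0: 1_500, 1: 1_200, 2: 900}
committed = {0: 1_450, 1: 700, 2: None}

lag = partition_lag(end_offsets, committed)
total = sum(lag.values())
```

Sampling `total` on a schedule and feeding the recent values to `is_falling_behind` is one simple way to alert on the trend rather than on a single threshold.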

The dangerous part is that everything looks fine from the outside until the retention window catches your lag. Then messages start disappearing and it looks like a Kafka failure.
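You can estimate how long you have before that happens. The sketch below assumes constant produce and consume rates and purely time-based retention; the function name and the load figures are made up. Real deletion works on whole log segments, so treat the result as a rough early warning rather than an exact deadline.

```python
def seconds_until_retention_loss(lag_messages, produce_rate, consume_rate,
                                 retention_seconds):
    """Estimate seconds until the consumer's next record ages out of retention.

    Assumes constant rates (messages/second) and time-based retention only.
    Returns 0.0 if the consumer is already past the window, and None if the
    consumer keeps up (consume_rate >= produce_rate), so the gap never grows.
    """
    if produce_rate <= 0:
        raise ValueError("produce_rate must be positive")
    # Age of the oldest unread record, in seconds of production time.
    current_age = lag_messages / produce_rate
    if current_age >= retention_seconds:
        return 0.0
    if consume_rate >= produce_rate:
        return None
    # Each wall-clock second, the oldest unread record gets this much older.
    drift_per_second = 1 - consume_rate / produce_rate
    return (retention_seconds - current_age) / drift_per_second


SEVEN_DAYS = 7 * 24 * 3600  # matches the default log.retention.hours=168

# Hypothetical load: 1,000 msg/s produced, 800 msg/s consumed, 3.6M behind.
eta = seconds_until_retention_loss(3_600_000, 1_000, 800, SEVEN_DAYS)
eta_days = eta / 86_400
```

With these numbers the consumer is only an hour behind today, yet it is on course to start losing data in roughly five weeks without a single error along the way.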

Backpressure doesn't work the way you think

In a queue-based system, backpressure is often implicit. The queue fills up, the producer slows down, or you get an error. Kafka does not apply that kind of backpressure by default. Producers keep writing. The log keeps growing. Your slow consumer has no mechanism to tell the producer to wait.
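Since the broker will not throttle anyone, the slowdown has to be decided on the consumer side. One common pattern is a bounded in-process buffer with high and low watermarks. The sketch below is pure Python with a hypothetical `BoundedBuffer` class. In a real consumer the `paused` flag would drive the client's pause and resume calls for the assigned partitions (both the Java consumer and kafka-python expose these) while the loop keeps polling so the consumer stays in the group.

```python
from collections import deque


class BoundedBuffer:
    """In-process buffer that tells the poll loop when to pause fetching."""

    def __init__(self, high_watermark, low_watermark):
        if not 0 <= low_watermark < high_watermark:
            raise ValueError("need 0 <= low_watermark < high_watermark")
        self.high = high_watermark
        self.low = low_watermark
        self.items = deque()
        self.paused = False

    def offer(self, records):
        """Add fetched records; pause once the buffer reaches the high mark."""
        self.items.extend(records)
        if len(self.items) >= self.high:
            self.paused = True
        return self.paused

    def take(self, n):
        """Hand up to n records to workers; resume once drained to the low mark."""
        taken = [self.items.popleft() for _ in range(min(n, len(self.items)))]
        if self.paused and len(self.items) <= self.low:
            self.paused = False
        return taken


buf = BoundedBuffer(high_watermark=5, low_watermark=2)
buf.offer([1, 2, 3])            # 3 buffered: keep fetching
paused_after_first = buf.paused
buf.offer([4, 5, 6])            # 6 buffered: at or above 5, so pause
paused_after_second = buf.paused
buf.take(2)                     # 4 left: still above the low mark
paused_mid_drain = buf.paused
buf.take(2)                     # 2 left: at the low mark, so resume
paused_after_drain = buf.paused
```

The gap between the two watermarks keeps the consumer from flapping between paused and resumed on every record.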

This means you need to build backpressure into your consumption logic explicitly. If your consumer is doing heavy processing per message, you need to think carefully about:

Batch sizing and how many records you pull per poll cycle
Processing time relative to max.poll.interval.ms (exceed this and Kafka thinks your consumer is dead and triggers a rebalance)
Whether you're committing offsets before or after processing completes
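The first two points reduce to simple arithmetic: records per poll multiplied by worst-case time per record has to fit inside max.poll.interval.ms, with headroom. Here is a rough sketch. The function name and the 50% safety factor are assumptions; the 300,000 ms default matches the standard consumer default for max.poll.interval.ms.

```python
def max_safe_batch_size(per_record_ms, max_poll_interval_ms=300_000,
                        safety_factor=0.5):
    """Largest per-poll batch that fits the poll interval with headroom.

    per_record_ms should be a measured worst case, not an average.
    safety_factor is the share of max.poll.interval.ms we allow ourselves.
    Always returns at least 1, even if a single record blows the budget.
    """
    if per_record_ms <= 0:
        raise ValueError("per_record_ms must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    budget_ms = max_poll_interval_ms * safety_factor
    return max(int(budget_ms // per_record_ms), 1)


fast_handler = max_safe_batch_size(per_record_ms=400)     # 400 ms per record
slow_handler = max_safe_batch_size(per_record_ms=2_000)   # 2 s per record
```

If the result is smaller than your current max.poll.records, either lower the batch size, raise the interval, or move the heavy work off the polling thread.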

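That last one matters more than most teams realize. With auto-commit enabled (enable.auto.commit=true), offsets are committed on a timer, not when your processing finishes, so they can advance past records your code is still working on, for example when work is handed off to another thread. Consumer crashes mid-batch? Those messages are marked consumed even if your logic never ran.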
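The difference between the two commit orders is easy to see in a small simulation. This is a pure-Python model with a hypothetical `run_until_crash` helper, not a Kafka client. "Commit first" stands in for any setup where offsets can advance ahead of processing; the exact timing of auto-commit varies by client.

```python
def run_until_crash(num_records, crash_at, commit_first):
    """Simulate one batch of offsets 0..num_records-1 with a crash mid-batch.

    Returns (processed, committed): the offsets whose handler finished and
    the offset the group would resume from after a restart.
    """
    committed = 0
    processed = []
    if commit_first:
        committed = num_records          # offsets advance before any work runs
    for offset in range(num_records):
        if offset == crash_at:
            return processed, committed  # crash: this record and later ones never ran
        processed.append(offset)
    committed = num_records              # commit only after the whole batch succeeded
    return processed, committed


# Commit first, crash at offset 2: restart resumes at 5, so 2, 3, 4 never run.
done, resume_at = run_until_crash(5, crash_at=2, commit_first=True)
skipped = [o for o in range(resume_at) if o not in done]

# Commit last, crash at offset 2: restart resumes at 0, so 0 and 1 run again.
done_late, resume_late = run_until_crash(5, crash_at=2, commit_first=False)
reprocessed = [o for o in done_late if o >= resume_late]
```

Committing after processing trades skipped messages for duplicates, which is usually the better failure mode as long as your handlers are idempotent.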

Replay is a feature, not an accident

Here is where the log model actually gives you something queues cannot. Because Kafka retains messages independently of consumption, you can replay. You can reset a consumer group offset to an earlier point and reprocess. You can run multiple independent consumers reading the same topic at their own pace without interfering with each other.

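# Assumes kafka-python; `partition` is a TopicPartition already assigned to this consumer
consumer.seek_to_beginning(partition)  # Rewind to the earliest retained offset (offset 0 may already be deleted)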

This is genuinely powerful for audit trails, event sourcing, and rebuilding derived state. But only if you design for it. Teams that treat Kafka like a queue often overwrite consumer offsets carelessly or rely on auto-commit in ways that make replay unpredictable.
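As a small illustration of rebuilding derived state, the sketch below folds a retained event list into account balances. The event shape and `rebuild_state` are hypothetical, and a plain Python list stands in for the partition.

```python
def rebuild_state(log, from_offset=0, state=None):
    """Fold events from `from_offset` onward into an account-balance dict."""
    balances = dict(state or {})
    for account, delta in log[from_offset:]:
        balances[account] = balances.get(account, 0) + delta
    return balances


# Hypothetical retained events: (account, amount change).
events = [("alice", 100), ("bob", 50), ("alice", -30), ("bob", 20)]

full = rebuild_state(events)                     # replay from the beginning
snapshot = rebuild_state(events[:2])             # state as of offset 2
resumed = rebuild_state(events, from_offset=2, state=snapshot)
```

Replaying from offset 0 and resuming from a snapshot at offset 2 give the same balances. Note what designing for replay implies here: replaying on top of state that already includes those events would double count, so a replay either starts from a matching snapshot or uses idempotent updates.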

The mental model shift

Getting Kafka right means rebuilding your intuitions around a few core ideas:

The broker does not manage delivery. Your consumer owns its position in the log.
Lag is your responsibility to monitor. Nothing will warn you automatically unless you instrument it.
Message retention is a clock, not a safety net. A slow consumer combined with a short retention window is a data loss scenario.
Backpressure must be designed in. Kafka will not slow the producer down for you.

Teams that internalize this stop fighting Kafka and start building systems that use its actual strengths. At Turboline, our real-time streaming infrastructure is designed around these semantics from the ground up, so teams inherit patterns that handle offset management, lag monitoring, and replay correctly without having to learn the hard way in production.

The concrete takeaway

If you are using Kafka and you have not explicitly thought through offset commit strategy, consumer lag monitoring, and what happens when a consumer falls behind your retention window, your system has silent risk. None of it will be obvious until traffic is high enough and timing is bad enough for everything to surface at once.

Check your consumer group lag today. Understand where your offsets are committed relative to where your processing happens. Know your retention window and whether your slowest consumer can realistically stay inside it under load.

Kafka is not trying to be difficult. It is just not a queue, and it was never designed to be.
