Maksym

Posted on Jun 29

Kafka is not a queue — and treating it like one will wreck your system

#kafka #distributedsystems

A story about silent lag, vanishing messages, and the mental model shift nobody warns you about.

We had a working system. Messages went in, messages came out. Consumers processed them, the UI updated, everyone was happy.

Then we scaled.

Within a week of going from ~200 to ~40,000 events per hour, things started going wrong in ways that were hard to explain. Some events were processed twice. Others seemed to disappear. Our "queue" was growing, but consumers looked idle. On-call became a guessing game.

The problem wasn't Kafka. The problem was that we were using Kafka like it was RabbitMQ — and it absolutely is not.

The mental model that kills you

When most developers first hear "Kafka is a message broker," they picture something like this:

Producer puts a message in.
Consumer picks it up.
Message is gone.

That's a queue. That's RabbitMQ. That's Celery. That mental model is burned into our brains because it's how we've handled async work for years.

Kafka is something different: it's a distributed, append-only log.

Messages are not consumed; they are read. They don't disappear after processing. They sit in the log for as long as your retention policy says they should, whether that's 7 days, 30 days, or indefinitely. Every consumer group keeps its own pointer for each partition, called an offset, that tracks where in the log it's currently reading.

This one distinction breaks almost every assumption you carry in from the queue world.
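A toy in-memory model makes the log-plus-bookmark idea concrete. This is only a sketch: `ToyLog`, `append`, and `read` are hypothetical names, not part of any Kafka client, and a real topic is split into partitions with an offset tracked per partition.

```python
from collections import defaultdict


class ToyLog:
    """Single-partition, in-memory stand-in for a Kafka topic (illustration only)."""

    def __init__(self):
        self.records = []                # append-only; reads never remove anything
        self.offsets = defaultdict(int)  # consumer group -> next offset to read

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1     # offset assigned to the new record

    def read(self, group, max_records=10):
        """Return the next records for `group` and advance only that group's offset."""
        start = self.offsets[group]
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch


log = ToyLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

billing_batch = log.read("billing")      # billing reads all three events
analytics_batch = log.read("analytics")  # a second group sees the same three
billing_again = log.read("billing")      # nothing new for billing yet
```

Reading moved two bookmarks, and the log itself still holds all three records.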

Lesson 1: "Deleting" a message doesn't exist

In RabbitMQ, ACKing a message removes it. In Celery, a task is popped off the queue when a worker picks it up. Gone. Done.

In Kafka, "processing" a message just advances your offset. The message is still there. Every other consumer group can read it independently. New consumer groups you add tomorrow can replay the entire history from the start.

This is incredibly powerful — and also the first thing that bites you.

We had a bug in a consumer that silently crashed mid-processing. The offset was never committed, so each time the consumer restarted it resumed from the last committed position and Kafka happily served the same batch of events again. No error. No dead letter queue. Just an infinitely retrying consumer and a pile of duplicate side effects in our database.

The lesson: failed processing ≠ message removed. You need to handle failure deliberately: commit the offset only after processing succeeds, and build dead-letter logic yourself for messages that keep failing. Kafka will not do this for you.
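One way to picture "commit after success, dead-letter the rest" is the loop below. It is a simplified sketch with no Kafka client involved: `consume`, `handler`, and the in-memory dead-letter list are hypothetical, and the per-record "commit" stands in for whatever commit strategy your client offers.

```python
def consume(records, committed, handler, max_attempts=3):
    """Process records starting at `committed`; return (new_committed, dead_letters).

    The offset moves past a record only after the handler succeeds or the
    record has been parked in the dead-letter list.
    """
    dead_letters = []
    offset = committed
    while offset < len(records):
        record = records[offset]
        for attempt in range(1, max_attempts + 1):
            try:
                handler(record)
                break  # success: stop retrying
            except Exception as exc:
                if attempt == max_attempts:
                    # Out of retries: park the record instead of blocking forever.
                    dead_letters.append((offset, record, repr(exc)))
        offset += 1  # "commit": the record either succeeded or was dead-lettered
    return offset, dead_letters


processed = []


def handler(record):
    if record == "bad":
        raise ValueError("cannot parse record")
    processed.append(record)


new_offset, dlq = consume(["a", "bad", "b"], committed=0, handler=handler)
```

The poison record ends up in `dlq` with its offset and error, and the records behind it still get processed instead of waiting on an endless retry.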

Lesson 2: Consumer lag is a silent killer

In a queue, "how backed up are we?" is obvious — it's the queue depth. In Kafka, the equivalent metric is consumer lag: the difference between the latest offset in a partition and the offset your consumer group is currently at.
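As a rough illustration of that definition, lag can be computed per partition from two maps of offsets. The numbers and the `consumer_lag` helper are made up for this sketch; in practice both offset maps come from your client or monitoring tooling.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset.

    A partition with no committed offset is treated as offset 0 in this sketch.
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


# Hypothetical numbers for a three-partition topic.
end_offsets = {0: 1_500, 1: 1_480, 2: 1_510}
committed = {0: 1_500, 1: 900, 2: 1_495}

lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())
```

Here partition 1 carries almost all of the lag, which a single total would hide. That is a reason to alert per partition as well as on the sum.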

We had lag. A lot of it. But because our consumers were running (just slowly), nothing was crashing, no alerts fired. The UI was just... stale. Events from 20 minutes ago were being processed as if they were live.

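The problem was our consumer group configuration. We had left session.timeout.ms at the default, but our processing was occasionally taking longer than that. (On clients that send heartbeats from a background thread, the limit that slow processing actually trips is max.poll.interval.ms, the maximum time allowed between polls.) Kafka's coordinator assumed the consumer was dead, kicked it out of the group, and triggered a rebalance, redistributing partitions among the remaining consumers. During a rebalance with the classic eager protocol, consumption stops for the whole group.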

Slow consumer → rebalance → consumption pause → more lag → slower consumer → rebalance. A death spiral.

The lesson: monitor consumer lag as a first-class metric (Kafka's own tooling, or something like Prometheus + kafka-exporter). Tune session.timeout.ms, max.poll.interval.ms, and max.poll.records to match your actual processing time — not the defaults.
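A back-of-the-envelope check helps when tuning these. The sketch below assumes the Java client's defaults (max.poll.interval.ms of 300000 and max.poll.records of 500) and a hypothetical worst case of 2 seconds per record. The `max_safe_poll_records` helper and its 50% headroom are illustrative choices, not a Kafka rule.

```python
import math


def max_safe_poll_records(max_poll_interval_ms, worst_case_ms_per_record, headroom=0.5):
    """Largest batch whose worst-case processing time uses at most
    `headroom` of the allowed time between polls."""
    budget_ms = max_poll_interval_ms * headroom
    return max(1, math.floor(budget_ms / worst_case_ms_per_record))


MAX_POLL_INTERVAL_MS = 300_000   # assumed default: 5 minutes between polls
DEFAULT_MAX_POLL_RECORDS = 500   # assumed default batch size
WORST_CASE_MS_PER_RECORD = 2_000 # hypothetical: measure this in your own system

# With the defaults, one slow batch can take far longer than the poll interval.
worst_case_batch_ms = DEFAULT_MAX_POLL_RECORDS * WORST_CASE_MS_PER_RECORD
fits_in_interval = worst_case_batch_ms < MAX_POLL_INTERVAL_MS

safe_records = max_safe_poll_records(MAX_POLL_INTERVAL_MS, WORST_CASE_MS_PER_RECORD)
```

Under these assumptions a full default batch could take about 1,000 seconds, more than three times the allowed interval, while a batch of 75 stays within half of it. The alternative is to raise max.poll.interval.ms, at the cost of slower detection of a consumer that is truly stuck.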

Lesson 3: Retention is not infinite, and compaction is not deletion

Two settings that look like cleanup but behave very differently:

Time-based retention (retention.ms) — messages older than this are deleted from the log. Set it too low and a consumer group that falls behind loses events forever. We had a topic with 24-hour retention and a consumer that got stuck over a weekend. Monday morning: 2 days of events, gone.

Log compaction (cleanup.policy=compact) — Kafka keeps only the latest message per key. This is great for state snapshots (think: user profile updates where you only care about the current state). But if you have multiple event types sharing a key, compaction will silently throw away intermediate states you might have needed.

Neither of these is "deleting messages" in the queue sense. They're policies that shape what the log looks like over time, and they interact with consumer lag in ways that can permanently destroy data if you're not paying attention.
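The effect of compaction on a key shared by several event types can be simulated in a few lines. This is a toy model: the `compact` function is hypothetical, and real compaction runs in the background on older log segments, so superseded records can linger for a while before they are removed.

```python
def compact(records):
    """Keep only the last record for each key, preserving the order of survivors."""
    last_index = {key: i for i, (key, _) in enumerate(records)}
    return [rec for i, rec in enumerate(records) if last_index[rec[0]] == i]


log = [
    ("user-42", "profile_updated"),
    ("user-7", "profile_updated"),
    ("user-42", "email_changed"),
    ("user-42", "account_deleted"),
]

compacted = compact(log)
```

After compaction only `account_deleted` survives for `user-42`. A consumer that starts reading later never sees the profile update or the email change.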

The lesson: set retention based on your worst-case recovery window, not convenience. If a consumer could be down for 3 days, your retention should cover at least that.
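The arithmetic is simple enough to write down. In this sketch the `retention_ms` helper and the 2x safety factor are assumptions of mine, not a Kafka recommendation; the result is the kind of value you would put in a topic's retention.ms.

```python
from datetime import timedelta


def retention_ms(worst_case_outage, safety_factor=2.0):
    """Retention in milliseconds covering the worst-case outage with some margin."""
    return int(worst_case_outage.total_seconds() * 1000 * safety_factor)


def to_ms(delta):
    return int(delta.total_seconds() * 1000)


recommended = retention_ms(timedelta(days=3))  # 3-day outage, doubled for margin

# The weekend incident: 24-hour retention versus a consumer stuck for about 2 days.
weekend_retention = to_ms(timedelta(hours=24))
weekend_outage = to_ms(timedelta(days=2))
events_were_lost = weekend_outage > weekend_retention
```

With these inputs `recommended` comes out at 518,400,000 ms, or six days.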

Lesson 4: Partitions are your unit of parallelism — and ordering

Want to scale consumption? Add partitions. More partitions means more consumers in a group can run in parallel: each partition is read by at most one consumer in the group, so consumers beyond the partition count sit idle.

But here's the catch: ordering is only guaranteed within a partition.

We had an event stream where the order of events for a given user mattered — "user updated profile" before "user deleted account." By default, events from the same user could land on different partitions. The delete might process before the update. The result was a ghost account that kept reappearing.

The fix is using a message key — Kafka routes messages with the same key to the same partition, preserving order for that entity. User ID as the key, in our case.

The lesson: if ordering matters for a logical entity, always set a meaningful message key. Without a key, the producer spreads records across partitions (round-robin or sticky batching, depending on the client version), and per-entity ordering guarantees vanish.
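Key-based routing boils down to "hash the key, take it modulo the partition count". The sketch below uses `zlib.crc32` purely as a stand-in so it runs on the standard library alone; Kafka's Java producer uses a murmur2 hash, so the actual partition numbers will differ. `partition_for` is a hypothetical helper.

```python
import zlib


def partition_for(key, num_partitions):
    """Stand-in partitioner: stable hash of the key modulo the partition count.

    NOTE: crc32 is an assumption for illustration; it is not Kafka's hash.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("user-42", "profile_updated"),
    ("user-7", "login"),
    ("user-42", "account_deleted"),
]

# Route each event to a partition; each partition is an ordered list.
partitions = {}
for key, event in events:
    partitions.setdefault(partition_for(key, 6), []).append((key, event))

user_42_partition = partition_for("user-42", 6)
user_42_events = [e for k, e in partitions[user_42_partition] if k == "user-42"]
```

Both `user-42` events land on the same partition in the order they were produced. Because the mapping depends on the partition count, adding partitions to an existing topic can send a key's new events to a different partition than its old ones, so plan the count before ordering matters.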

Lesson 5: Replay is a feature, not an accident

The flip side of "messages don't disappear" is that you can intentionally re-read history. Reset a consumer group's offset to the beginning and replay everything. This is something you simply can't do with a traditional queue.

We used this when we shipped a bug in our event processor. After the fix, we reset the offset, replayed 3 days of events through the corrected logic, and rebuilt our derived data from scratch. No manual patches. No support tickets apologizing for lost data.

This is Kafka's killer feature for event-driven architectures — but only if you've designed for it. Idempotent consumers are non-negotiable: replaying the same event twice must produce the same result. If your consumer does a blind INSERT instead of an INSERT ... ON CONFLICT DO UPDATE, replay turns into a data corruption exercise.

The lesson: build idempotency in from day one. Treat every message as something that might be delivered more than once.
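Here is a minimal sketch of an idempotent write, using an in-memory SQLite database so it runs anywhere (the upsert syntax needs SQLite 3.24 or newer). The `profiles` table, the `apply_events` helper, and the use of a list index as the offset are all stand-ins for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profiles (user_id TEXT PRIMARY KEY, email TEXT, last_offset INTEGER)"
)

# Upsert keyed on user_id; the WHERE guard ignores events older than what is stored,
# so a replay cannot roll a row back to a stale value.
UPSERT = """
INSERT INTO profiles (user_id, email, last_offset) VALUES (?, ?, ?)
ON CONFLICT(user_id) DO UPDATE SET
    email = excluded.email,
    last_offset = excluded.last_offset
WHERE excluded.last_offset >= profiles.last_offset
"""


def apply_events(events):
    """Apply (user_id, email) events; the list index plays the role of the offset."""
    for offset, (user_id, email) in enumerate(events):
        conn.execute(UPSERT, (user_id, email, offset))
    conn.commit()


def snapshot():
    return conn.execute("SELECT user_id, email FROM profiles ORDER BY user_id").fetchall()


events = [
    ("user-42", "old@example.com"),
    ("user-7", "seven@example.com"),
    ("user-42", "new@example.com"),
]

apply_events(events)
first_pass = snapshot()
apply_events(events)  # simulate a full replay of the same events
second_pass = snapshot()
```

The replay leaves the table exactly as it was after the first pass. A blind INSERT would have failed on the primary key, or duplicated rows in a table without one.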

The mental shift in one sentence

A queue is a pipe that empties itself. Kafka is a log that you move a bookmark through.

Once that clicks, all the rest follows: why lag matters more than depth, why you need to think about retention before you need it, why ordering requires intentional key design, and why replay is a superpower rather than a footgun.

We didn't throw Kafka out after our rough patch. We got better at using it correctly. The system that broke us at 40k events/hour now handles 10x that without incident.

It just took unlearning everything we knew about queues first.

Have a Kafka war story of your own? Drop it in the comments — the best lessons always come from production.

Top comments (2)

Yunetzi • Jun 29

Kafka isn't a queue, so what mental model actually guarantees delivery?

Maksym • Jun 29

The mental model: stop asking "does Kafka guarantee delivery?" and start asking "where in my pipeline can a message be lost, and what's my safeguard at each point?"