A survival guide for when everything goes wrong in production.
There's a moment every engineer who works with Kafka experiences. You check the producer. Messages are sending. You check the consumer. Nothing. The consumer group shows zero lag because there's nothing to lag behind — as far as the consumer knows, the topic is empty.
But it's not empty. The messages are there. Somewhere. In some partition, at some offset, behind some configuration you set six months ago and forgot about.
Kafka doesn't lose messages. But it's very good at hiding them from you.
Consumer Lag: The Number Everyone Watches Wrong
Consumer lag is the difference between the latest offset in a partition and the offset your consumer group has committed. Simple concept. Dangerous in practice.
The mistake: treating lag as a single number. Lag is per-partition. If you have 30 partitions and one consumer is stuck on partition 17 while the others are healthy, the total lag looks manageable. But partition 17's data is hours behind, and whatever downstream system depends on that data is serving stale results.
Monitor lag per partition. Tools like Burrow, Kafka Exporter for Prometheus, or even kafka-consumer-groups.sh --describe break it down. If one partition's lag is growing while others are stable, you have a stuck consumer, a hot partition, or a poison message.
A poison message is a record your consumer can't process — malformed data, unexpected schema, null where it shouldn't be null. The consumer throws an exception, the offset doesn't commit, and it retries the same message forever. Lag grows. The consumer looks "alive" because it's processing — just not making progress.
The fix: dead letter queues. After N retries, move the message to a separate topic, commit the offset, and move on. Alert on the dead letter topic. Investigate later. Don't let one bad record block millions of good ones.
Rebalance Storms: The Silent Killer
Consumer rebalancing is Kafka's mechanism for redistributing partitions across consumers in a group. When a consumer joins or leaves, Kafka reassigns partitions. During rebalance, all consumers in the group stop processing. For a few seconds, nobody's doing anything.
This is fine. Unless it happens every 30 seconds.
Rebalance storms happen when Kafka thinks a consumer is dead, removes it from the group, triggers a rebalance, then the consumer comes back, joins the group, triggers another rebalance, and the cycle repeats.
Three timeout settings control this:
-
session.timeout.ms: how long Kafka waits for a heartbeat before declaring the consumer dead. Default: 45 seconds. -
heartbeat.interval.ms: how often the consumer sends heartbeats. Default: 3 seconds. -
max.poll.interval.ms: how long between two poll() calls before Kafka kicks the consumer out. Default: 5 minutes.
The most common cause of rebalance storms: max.poll.interval.ms is too short for your processing time. Your consumer polls 500 records, spends 6 minutes processing them, and by the time it polls again, Kafka has already declared it dead and rebalanced.
Fixes:
- Increase
max.poll.interval.msto match your worst-case processing time. - Decrease
max.poll.recordsso each batch processes faster. - Use
static.group.instance.id— this enables static membership, which means Kafka won't immediately rebalance when a consumer temporarily disconnects. It waits forsession.timeout.msto expire first. - Use cooperative rebalancing (
partition.assignment.strategy = CooperativeStickyAssignor) — instead of stopping all consumers during rebalance, it only reassigns the affected partitions.
One team I worked with had a 12-consumer group processing payment events. Every few minutes, all processing stopped for 10-15 seconds during rebalance. Twelve times an hour. That's 2 minutes of downtime every hour in a payment pipeline. The fix was adding static group instance IDs and switching to cooperative rebalancing. Total rebalance disruption dropped from 2 minutes per hour to near zero.
Exactly-Once: The Myth and the Reality
Kafka advertises exactly-once semantics. Here's what that actually means.
Idempotent producer (enable.idempotence = true): Kafka deduplicates messages from the same producer session. If a network retry causes the producer to send the same message twice, the broker detects the duplicate and discards it. This prevents duplicates within a single producer session. If the producer restarts, it gets a new session, and deduplication doesn't cross sessions.
Transactional producer + consumer: For true exactly-once across produce-and-consume workflows, you need transactions.
producer.beginTransaction();
producer.send(outputTopic, processedRecord);
producer.sendOffsetsToTransaction(consumerOffsets, consumerGroupId);
producer.commitTransaction();
This atomically writes the output record AND commits the consumer offset. Either both happen or neither does. If the transaction fails, the consumer re-reads the input, reprocesses it, and tries again.
The reality check:
- Exactly-once works within Kafka. The moment your consumer writes to an external database, you're back to at-least-once unless you implement idempotency on the database side.
- Transactions add latency. Each transaction involves coordination between the producer, the transaction coordinator, and the brokers hosting the output partitions.
- Most systems don't need exactly-once. If your consumer can handle duplicates (idempotent writes, upserts, deduplication at the application layer), at-least-once is simpler and faster.
Don't reach for exactly-once because it sounds correct. Reach for it when duplicate processing would cause real damage — financial transactions, inventory counts, billing events. For analytics, logging, and notifications, at-least-once with deduplication is the pragmatic choice.
Partition Strategies: The Decision That Haunts You
Once you choose a partition key, changing it later means reprocessing everything. Choose carefully.
Key-based partitioning (default when you set a key): all messages with the same key go to the same partition. This guarantees ordering per key. If you're processing events per user, partition by user ID and every event for a given user arrives in order.
The trap: hot partitions. If one user generates 1,000x more events than average, their partition becomes the bottleneck. The consumer assigned to that partition falls behind while others are idle.
Round-robin (no key): messages distribute evenly across partitions. Maximum throughput, zero ordering guarantees. Use this for stateless processing where order doesn't matter — log aggregation, metrics collection, fan-out work queues.
Custom partitioner: when you need ordering within a logical group but want to control distribution. For example, partition by tenant_id % num_partitions to ensure per-tenant ordering while distributing large tenants across multiple partitions.
The question to ask: does your consumer need to see related messages in order? If yes, use a key that groups related messages. If no, use round-robin for maximum throughput.
Compaction: Powerful and Dangerous
Log compaction keeps only the latest value for each key. Instead of retaining messages by time or size, Kafka retains the last message per key indefinitely.
Use case: a topic that represents current state. User profile updates: you only care about the latest profile, not the history. Config changes: you want the current config, not every version.
The danger: if your producer accidentally sends a message with a null value (a tombstone), compaction deletes the key permanently. One bug in a producer can wipe state for thousands of keys, and because compaction runs in the background, you might not notice until downstream consumers can't find the data they expect.
Guard rails:
- Monitor tombstone rate. A sudden spike in null-value messages is a red flag.
- Separate compacted topics from retention-based topics. Don't compact your event stream.
- Test your producer's null-handling thoroughly. A missing field serialized as null can become a tombstone.
The 4 Problems Disguised as "My Messages Are Disappearing"
When someone says their Kafka messages are disappearing, it's almost never data loss. Here's what's actually happening:
1. Retention expired. The topic's retention.ms is set to 7 days. Your consumer was down for 8 days. The messages were deleted before the consumer came back. This is not a bug. It's configuration.
2. Consumer offset reset. Your consumer group's committed offsets expired (controlled by offsets.retention.minutes, default 7 days in older Kafka versions). When the consumer restarts, it doesn't know where it left off and uses auto.offset.reset — which defaults to latest, meaning it skips everything produced while it was offline. Set it to earliest if you want to reprocess, or better yet, don't let your consumer stay down longer than your offset retention.
3. Wrong topic or partition. The producer is writing to orders-v2 and the consumer is reading from orders. Or the producer changed its partition key, so messages that used to go to partition 5 now go to partition 12, and the consumer assigned to partition 5 sees nothing.
4. Serialization mismatch. The producer is writing Avro, the consumer expects JSON. The consumer "reads" the message but can't deserialize it, throws an exception, and depending on error handling, either crashes or silently skips it. The message is there — the consumer just can't understand it.
The 2M Messages Per Second Story
Payment processing platform. 6 brokers, 3 racks, 120 partitions across 4 topics. Target: sustain 2 million messages per second with P99 produce latency under 10ms.
Initial state: 800K messages per second, P99 at 45ms. Rebalances every few minutes. Consumer lag growing during peak hours.
What we changed:
Broker side:
- Increased
num.io.threadsandnum.network.threadsto match the core count. Default values are conservative. - Set
log.flush.interval.messagesto a higher value. Letting the OS page cache handle flushing is almost always faster than forcing Kafka to fsync. - Moved log directories to separate NVMe drives per mount point. Kafka is I/O bound. Spreading partitions across drives parallelizes writes.
Producer side:
- Batch size from 16KB to 256KB. Larger batches mean fewer network round trips.
-
linger.msfrom 0 to 5. Instead of sending immediately, the producer waits 5ms to fill the batch. Throughput jumps significantly for a tiny latency increase. - Compression:
lz4. Reduces network bandwidth and disk usage. lz4 is fast enough that compression time is negligible compared to network savings.
Consumer side:
-
fetch.min.bytesfrom 1 to 64KB. Don't make a network round trip for a single message. -
max.poll.recordstuned to match processing capacity. Too high means long processing between polls and rebalance risk. Too low means excessive poll overhead. - Static group instance IDs + cooperative rebalancing. Eliminated rebalance storms.
Partition count: Increased from 120 to 360. More partitions = more parallelism, up to the point where metadata overhead becomes a problem (usually in the thousands).
Result: sustained 2.1M messages per second, P99 produce latency at 7ms, zero rebalances during the 4-hour peak window. The infrastructure didn't change. Six brokers, same hardware. Just configuration.
Key Takeaways
Kafka is a log. Everything flows from that mental model. Producers append to a log. Consumers read from a log at their own pace. Offsets are just positions in that log.
When "messages disappear," trace the offset. Where did the consumer last commit? What's the earliest available offset in the partition? The gap between those two numbers tells you everything.
Rebalances are the biggest operational pain in Kafka. Static membership and cooperative rebalancing aren't just nice-to-haves — they're the difference between a stable pipeline and one that hiccups every few minutes.
And if someone tells you they need exactly-once semantics, ask them what happens if their consumer processes a message twice. If the answer is "nothing, we have upserts," they don't need exactly-once. They need at-least-once with idempotency. Which is simpler, faster, and what most production systems actually run.
Over to You
Have you ever lost messages in Kafka — or thought you did? What was the actual root cause? I'd love to hear your rebalance storm stories.
If you enjoyed this, I write about production engineering, AI systems, and the messy reality of building software at scale.
Follow me:
This is part of the **Great Stack to Doesn't Work* series — a survival guide for when everything goes wrong in production. Follow the series to catch every episode.*
Top comments (0)