What nobody tells you about event-driven architecture until it’s 3 AM and your database is corrupted
A war story about distributed systems, offset commits, and the most expensive lesson of my engineering career.
Everyone said Kafka would fix our feed latency.
They were right.
For exactly 31 days.
Then PagerDuty fired at 3:17 AM, consumer lag hit 2.4 million messages, and we discovered something that no architecture diagram had ever shown us — Kafka doesn’t care about your database state. It just delivers. Faithfully. Mercilessly.
This is that story.
The Setup — How We Got Here
Q3. Product team screaming for faster feeds. Our engineering lead had just come back from a conference. Someone in the architecture meeting said the word Kafka.
The room lit up. Senior engineers nodded. Someone opened a laptop and pulled up the Confluent documentation.
We had a simple Redis-based event queue that was “too slow.” Our P99 latency was 800ms on feed generation. Product wanted 200ms. Kafka promised sub-100ms. The decision felt obvious.
Three weeks later, we shipped it. Latency dropped to 80ms. The product team celebrated. Engineering celebrated. We had graphs going the right direction.
We had no idea what we’d just invited into our system.
What Kafka Actually Is
Before we get to the 3 AM incident, let’s talk about what Kafka actually is — because most teams get this wrong before they write a single line of code.
Kafka is not a message queue.
This is the most dangerous misconception in distributed systems today. When engineers hear “message queue,” they think: send a message, someone receives it, it’s gone. Like an email inbox.
Kafka is a distributed commit log.
Every event is written to a log. The log is partitioned across brokers. Each partition is an ordered, immutable sequence of records. Consumers read from the log at their own pace, tracked by an offset — a pointer to their position in the log.
Partition 0: [event1] [event2] [event3] [event4] [event5]
                                  ↑
                           Consumer offset
                  (Consumer has read up to here)
The crucial difference from a traditional queue: Kafka doesn’t delete messages after consumption. Messages are retained for a configured retention period (default 7 days). Consumers can seek to any offset and replay any portion of the log.
This is both Kafka’s superpower and its most dangerous property.
Superpower: Replay events for new consumers, rebuild state, debug issues by replaying history.
Danger: If your consumers are not idempotent and your system state has mutated, replay doesn’t recover your system. It corrupts it further.
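To make the log-plus-offset model concrete, here is a minimal sketch of a single partition as an append-only list with a per-consumer bookmark. This is plain Java with illustrative names (no Kafka client) — the point is that polling advances a pointer rather than deleting anything, so seeking backwards replays history:

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    // A single partition: an ordered, append-only sequence of records
    static List<String> partition = new ArrayList<>();
    // Each consumer tracks its own position; the log is never mutated by reads
    static int consumerOffset = 0;

    static String poll() {
        if (consumerOffset >= partition.size()) return null; // caught up
        return partition.get(consumerOffset++);              // advance the bookmark
    }

    static void seek(int offset) {
        consumerOffset = offset; // rewind (or skip ahead) for replay
    }

    public static void main(String[] args) {
        partition.add("event1");
        partition.add("event2");
        partition.add("event3");
        System.out.println(poll()); // event1
        System.out.println(poll()); // event2
        seek(0);                    // replay from the beginning
        System.out.println(poll()); // event1 again -- the log still has it
    }
}
```

Notice that nothing stops the replayed `event1` from being handled a second time — that responsibility sits entirely with the consumer.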
The Architecture We Built
Our feed generation system looked elegant on the whiteboard:
User Action
     ↓
[Producer] → [Kafka Topic: user-events]
     ↓
[Consumer Group A] → Updates user feed cache
[Consumer Group B] → Updates recommendation engine
[Consumer Group C] → Updates activity timeline
Three consumer groups. Each consuming the same events for different purposes. Decoupled. Independent. Horizontally scalable.
The architecture review went smoothly. “What about consumer failures?” someone asked.
“Kafka handles that,” we said. “It retains messages. We can replay.”
We were right. And catastrophically wrong.
Month 2 — The Nightmare Begins
3:17 AM
PagerDuty fires.
ALERT: Consumer lag critical
Topic: order-events
Consumer Group: order-processor
Lag: 2,400,000 messages
2.4 million messages behind. Not a small spike — the order-processor consumer group had been silently falling behind for 6 hours. A memory leak in one of the consumer instances had caused it to slow down. Kubernetes hadn’t restarted it because it was still running — just slowly.
By the time the alert fired, we had 2.4 million unprocessed order events.
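Consumer lag itself is just arithmetic over offsets: for each partition, the distance between the log-end offset and the group's last committed offset, summed across the topic. A sketch with illustrative names (this is the same number the alert above reports):

```java
import java.util.Map;

public class Main {
    // Lag for one partition: how far the committed offset trails the log end
    static long partitionLag(long logEndOffset, long committedOffset) {
        return Math.max(0, logEndOffset - committedOffset);
    }

    // Total lag for a consumer group: sum over all partitions of the topic
    static long totalLag(Map<Integer, Long> logEnd, Map<Integer, Long> committed) {
        long lag = 0;
        for (Map.Entry<Integer, Long> e : logEnd.entrySet()) {
            lag += partitionLag(e.getValue(), committed.getOrDefault(e.getKey(), 0L));
        }
        return lag;
    }

    public static void main(String[] args) {
        // Three partitions; the slow consumer owns partition 2
        Map<Integer, Long> logEnd = Map.of(0, 1_000_000L, 1, 900_000L, 2, 1_100_000L);
        Map<Integer, Long> committed = Map.of(0, 999_500L, 1, 899_800L, 2, 300_000L);
        // Total lag is dominated by the one partition that fell behind
        System.out.println(totalLag(logEnd, committed));
    }
}
```

One slow instance on one partition is enough to blow up the group-wide number — which is exactly why the alert looked sudden when the decay was gradual.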
The Decision That Made Everything Worse
“No problem,” the on-call engineer said. “We’ll just restart the consumers and let them catch up.”
This is where the real disaster began.
The consumers restarted. They picked up from their last committed offset. They began processing 2.4 million events.
Thirty minutes later, our database was in a state of chaos:
∙ Orders were being marked as “processing” that had already been delivered
∙ Inventory counters were going negative
∙ Users were being charged twice for orders they’d already received
∙ Recommendation scores were being recalculated with stale data, overwriting fresh data
We stopped the consumers. But the damage was done.
What Actually Happened
Let me reconstruct the failure precisely, because understanding the exact mechanism is everything.
Step 1 — The initial processing:
Event: OrderPlaced {orderId: 12345, amount: 599, userId: user456}

Consumer processes event:
  → Creates order record in DB
  → Charges payment gateway
  → Updates inventory: item_count -= 1
  → Commits offset
This worked correctly for months.
Step 2 — The slow consumer:
Event: OrderPlaced {orderId: 67890, amount: 1299, userId: user789}

Consumer receives event
Consumer starts processing...
  → Creates order record in DB ✅
  → Charges payment gateway ✅
  → Updates inventory: item_count -= 1 ✅
Consumer crashes before committing offset ❌
The offset for this batch was never committed. From Kafka’s perspective, these events were never successfully consumed.
Step 3 — The restart:
When the consumer restarted, it read from the last committed offset — before the crash. It received the same events again.
Event: OrderPlaced {orderId: 67890, amount: 1299, userId: user789}

Consumer processes event AGAIN:
  → Creates order record in DB         ← DUPLICATE
  → Charges payment gateway            ← DOUBLE CHARGE
  → Updates inventory: item_count -= 1 ← WRONG AGAIN
  → Commits offset ✅
Step 4 — The 2.4 million event replay:
Multiply this across 2.4 million events, many of which had been partially processed or fully processed but whose offsets hadn’t been committed. Some events were processed once, some twice, some partially.
The database was now in an indeterminate state. We couldn’t tell which operations had run once and which had run twice.
Kafka didn’t fail. It worked exactly as designed.
We had built a perfectly reliable pipeline for delivering chaos at scale.
The Root Cause Analysis
Our postmortem identified three fundamental mistakes.
Mistake 1: No Idempotency
Idempotency means that running the same operation multiple times produces the same result as running it once.
Our consumers were not idempotent. Processing the same OrderPlaced event twice created two orders, two charges, two inventory decrements.
The fix:
Every event handler must check: “Have I already processed this event?”
@KafkaListener(topics = "order-events")
public void handleOrderEvent(OrderEvent event) {
    // Idempotency check FIRST
    if (eventProcessingRepository.exists(event.getEventId())) {
        log.info("Event {} already processed, skipping", event.getEventId());
        return;
    }
    try {
        // Process the event
        orderService.processOrder(event);
        // Mark as processed ATOMICALLY with the business operation
        // Use a DB transaction to ensure both happen or neither happens
        eventProcessingRepository.markProcessed(event.getEventId());
    } catch (Exception e) {
        // Don't mark as processed — allow retry
        throw e;
    }
}
CREATE TABLE processed_events (
    event_id VARCHAR(255) PRIMARY KEY,
    processed_at TIMESTAMP DEFAULT NOW(),
    consumer_group VARCHAR(100)
);

-- With an index on processed_at so old rows can be purged, avoiding unbounded growth
CREATE INDEX idx_processed_events_time ON processed_events(processed_at);
Every unique event ID is stored. Before processing, check if it exists. If yes — skip. If no — process and insert atomically.
The idempotency key design matters:
Your event IDs must be stable and unique across retries. Use a combination of business identifiers:
// Good: stable, business-meaningful
String eventId = "order-placed-" + orderId + "-" + userId;
// Bad: changes on retry
String eventId = UUID.randomUUID().toString();
Mistake 2: Auto-Commit Was a Lie
Kafka consumers have two offset commit strategies:
Auto-commit (the default):
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "5000");
Every 5 seconds, the consumer client automatically commits the offsets returned by the most recent poll, regardless of whether your application has finished processing those records. If your consumer crashes after the auto-commit but before finishing processing — the message is considered consumed even though your application never handled it. Silent data loss.
Alternatively, if your consumer crashes after processing but before the next auto-commit — the message replays on restart. Duplicate processing.
Auto-commit gives you the worst of both worlds: potential data loss AND potential duplicates, with no control over which failure mode you experience.
Manual commit (the correct approach):
props.put("enable.auto.commit", "false");
@KafkaListener(topics = "order-events")
public void handleOrderEvent(
        OrderEvent event,
        Acknowledgment acknowledgment) {
    try {
        // Process the event fully
        orderService.processOrder(event);
        // Only commit offset AFTER successful processing
        acknowledgment.acknowledge();
    } catch (Exception e) {
        // Don't acknowledge — message will be redelivered
        // Make sure your handler is idempotent!
        log.error("Failed to process event {}", event.getEventId(), e);
        throw e;
    }
}
With manual commit, you control exactly when an offset is committed. The offset only advances when you’ve confirmed successful processing.
At-least-once vs exactly-once:
Manual commit gives you at-least-once delivery — messages are never lost, but may be delivered more than once. Combined with idempotency, this is safe and practical.
Exactly-once delivery is possible with Kafka transactions but comes with significant complexity and performance overhead. For most systems, at-least-once + idempotency is the right trade-off.
// Exactly-once with Kafka transactions (complex, high overhead)
public void processWithExactlyOnce(OrderEvent event,
        Map<TopicPartition, OffsetAndMetadata> offsets,
        ConsumerGroupMetadata groupMetadata) {
    // The Kafka transaction spans both the producer write and the
    // consumer offset commit — atomic
    kafkaTemplate.executeInTransaction(operations -> {
        orderService.processOrder(event);
        operations.send("order-processed", event.getOrderId());
        // Commit the consumed offsets inside the same transaction
        operations.sendOffsetsToTransaction(offsets, groupMetadata);
        return null;
    });
}
Mistake 3: No Replay Strategy
When we decided to replay 2.4 million events, we hadn’t asked a fundamental question:
Is the current state of our database compatible with replaying these events?
The answer was no. The database had moved forward. Events that assumed “inventory = 100” were being replayed against a database where “inventory = 47.”
The correct replay strategy:
Before replaying events, you must answer:
1. What is the current state of the system? Take a snapshot before replay begins.
2. Are the events you’re replaying compatible with the current state? If an event says “deduct 1 from inventory” and inventory is already at the post-event value — replaying it will corrupt state.
3. Can you replay to a shadow system first? Replay events against a read replica or a staging environment to validate the outcome before applying to production.
4. Do you have a compensation mechanism? If replay causes inconsistency, can you detect and correct it?
public class SafeReplayService {

    public void replayEvents(String topic, long fromOffset, long toOffset) {
        // Step 1: Take DB snapshot for rollback capability
        String snapshotId = snapshotService.createSnapshot();

        // Step 2: Enable replay mode (idempotency is critical here)
        replayModeFlag.set(true);

        // Step 3: Replay in small batches with validation
        for (long offset = fromOffset; offset < toOffset; offset += BATCH_SIZE) {
            List<ConsumerRecord<String, String>> batch = fetchBatch(topic, offset, BATCH_SIZE);

            // Validate state compatibility before processing
            if (!stateCompatibilityChecker.isCompatible(batch)) {
                log.error("State incompatibility detected at offset {}", offset);
                rollbackService.rollback(snapshotId);
                throw new ReplayException("Cannot safely replay at offset " + offset);
            }

            processBatch(batch);
            validateBatchOutcome(batch);
        }

        replayModeFlag.set(false);
    }
}
The Kafka Failure Modes Nobody Talks About
Consumer Group Rebalancing
When a consumer joins or leaves a consumer group, Kafka triggers a rebalance — reassigning partitions across consumers.
During rebalance, all consumers in the group stop processing. For a group of 10 consumers handling a high-throughput topic, a rebalance can pause processing for 30-60 seconds.
Causes of unexpected rebalances:
∙ Consumer takes longer than max.poll.interval.ms to process a batch
∙ Consumer fails to send heartbeat within session.timeout.ms
∙ Deployment rolling update adds/removes consumer instances
Mitigation:
// Increase poll interval for slow processors
props.put("max.poll.interval.ms", "600000"); // 10 minutes
// Reduce batch size to ensure processing within interval
props.put("max.poll.records", "100"); // Process 100 records at a time
// Use static membership to reduce rebalances during restarts
props.put("group.instance.id", "consumer-instance-1");
Log Compaction Surprises
Kafka supports log compaction on certain topics — retaining only the latest message for each key. This is useful for event sourcing and change data capture.
The surprise: if you’re consuming a compacted topic and your consumer falls behind, some of the events you missed may have been compacted away. You’ll never see intermediate states.
Before compaction:
[key:user1, value:name=Alice] [key:user1, value:name=AliceB] [key:user1, value:name=Carol]
After compaction:
[key:user1, value:name=Carol]
Your consumer that missed the first two events will only see “Carol.” If your system expected to process every name change — you’ve silently lost data.
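The effect is easy to simulate: compaction keeps only the latest value per key. A plain-Java sketch of the outcome (not Kafka's actual log cleaner, which runs on segments in the background):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Main {
    // Model compaction: for each key, only the most recent value survives
    static Map<String, String> compact(List<Map.Entry<String, String>> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Map.Entry<String, String> record : log) {
            latest.put(record.getKey(), record.getValue()); // later writes win
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> log = List.of(
                new SimpleEntry<>("user1", "name=Alice"),
                new SimpleEntry<>("user1", "name=AliceB"),
                new SimpleEntry<>("user1", "name=Carol"));
        // A consumer that falls behind sees only the compacted result
        System.out.println(compact(log)); // {user1=name=Carol}
    }
}
```

If your consumer's logic depends on observing every intermediate state, a compacted topic is the wrong storage for it.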
Partition Hot Spots
Kafka distributes messages across partitions using a partitioning key. If your partitioning key has low cardinality — for example, partitioning order events by country in a primarily US-based app — one partition receives 90% of the traffic.
The consumers reading that partition are overloaded. Others are idle. Horizontal scaling doesn’t help because you can’t have more consumers than partitions for a given topic.
// Bad: low cardinality key
producer.send(new ProducerRecord<>("orders", order.getCountry(), order));
// Good: high cardinality key
producer.send(new ProducerRecord<>("orders", order.getOrderId(), order));
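You can see the skew directly by routing keys the way a keyed partitioner does: hash the key, mod by the partition count. This sketch uses `String.hashCode` purely for illustration (Kafka's default partitioner uses murmur2), with made-up traffic numbers:

```java
public class Main {
    // Simplified keyed partitioner: hash the key, mod by partition count
    // (illustrative only -- Kafka's default partitioner uses murmur2)
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 6;
        int[] counts = new int[numPartitions];
        // Low-cardinality key: 90% of orders share one country code
        for (int i = 0; i < 1000; i++) {
            String country = (i % 10 == 0) ? "MX" : "US";
            counts[partitionFor(country, numPartitions)]++;
        }
        for (int p = 0; p < numPartitions; p++) {
            System.out.println("partition " + p + ": " + counts[p] + " messages");
        }
        // The dominant key's partition receives 90% of the traffic;
        // most of the other partitions sit completely idle
    }
}
```

With a high-cardinality key like `orderId`, the same loop spreads messages roughly evenly across all six partitions.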
The Poison Pill
A single malformed message that causes your consumer to crash on every processing attempt. Because the consumer dies before the offset is committed — manually, or before the next auto-commit interval fires — the offset never advances. Your consumer restarts, fetches the same message, crashes again. Infinite loop.
@KafkaListener(topics = "order-events")
public void handleOrderEvent(OrderEvent event, Acknowledgment acknowledgment) {
    try {
        orderService.processOrder(event);
        acknowledgment.acknowledge();
    } catch (PoisonPillException e) {
        // Dead letter queue for messages that can't be processed
        deadLetterProducer.send("order-events-dlq", event);
        acknowledgment.acknowledge(); // Move past the poison pill
        log.error("Moved poison pill to DLQ: {}", event.getEventId());
    } catch (RetryableException e) {
        // Don't acknowledge — allow retry
        throw e;
    }
}
Always implement a dead letter queue (DLQ) for messages that consistently fail processing. Without it, a single bad message can halt your entire consumer group.
The Three Questions You Must Answer Before Adding Kafka
After our incident, we created a pre-Kafka checklist. Every team considering Kafka must answer these three questions before writing a single line of producer code.
Question 1: Are Your Consumers Idempotent?
Can the same event be processed twice without corrupting state?
If you can’t answer yes with confidence — don’t add Kafka yet. Build idempotency first.
Test this explicitly:
@Test
public void processingSameEventTwiceProducesSameResult() {
    OrderEvent event = createTestOrderEvent();

    orderConsumer.handle(event);
    orderConsumer.handle(event); // Process twice

    // State should be identical to processing once
    Order order = orderRepository.findById(event.getOrderId());
    assertEquals(1, orderRepository.countByUserId(event.getUserId()));
    assertEquals(OrderStatus.PLACED, order.getStatus());

    // Payment should only be captured once
    assertEquals(1, paymentRepository.countByOrderId(event.getOrderId()));
}
Question 2: Is Your Offset Commit Strategy Deliberate?
Have you explicitly chosen between at-least-once, at-most-once, and exactly-once?
Have you disabled auto-commit and implemented manual acknowledgment?
Have you tested what happens when your consumer crashes mid-batch?
Question 3: Do You Have a Replay Strategy That Accounts for Current DB State?
If you need to replay 3 days of events tomorrow, can you do it safely?
Do you have the tooling to:
∙ Check state compatibility before replay?
∙ Replay to a shadow environment for validation?
∙ Roll back if replay causes inconsistency?
If you can’t answer yes to all three — you have a time bomb, not a pipeline.
When Kafka IS the Right Answer
After all of this, I want to be clear: Kafka is extraordinary for the right problems.
Use Kafka when:
∙ You need event replay. Building a new analytics service that needs to process 6 months of historical events? Kafka’s retention makes this trivial. A traditional queue can’t do this.
∙ Multiple consumers need the same events. Order placed → update inventory, send email, update recommendations, charge payment. Each consumer group processes independently at their own pace.
∙ You need high throughput with durability. Kafka handles millions of messages per second with persistence guarantees. Traditional queues struggle here.
∙ You’re building event sourcing. Kafka’s log is a natural fit for storing the complete history of state changes.
∙ You need decoupling between services. Producers don’t know about consumers. Services can be added without modifying existing code.
Don’t use Kafka when:
∙ You just need a simple task queue. Redis queues, RabbitMQ, or AWS SQS are simpler and sufficient.
∙ Your team doesn’t understand distributed systems fundamentals. Kafka amplifies your architecture’s weaknesses.
∙ You need simple request-response patterns. Kafka’s async nature adds latency and complexity for synchronous workflows.
∙ You’re a startup with 100 users. Your feed latency problem is probably a missing database index, not a missing message broker.
The Checklist We Wish We Had
Pre-Kafka Production Checklist:
Consumer Design
□ Idempotency implemented and tested
□ Auto-commit disabled
□ Manual offset commit with acknowledgment pattern
□ Dead letter queue configured
□ Poison pill handling implemented
□ Consumer lag alerting configured
Operations
□ Replay strategy documented
□ State compatibility validation tooling built
□ Shadow replay environment available
□ Kafka cluster monitoring configured:
    □ Consumer lag per group/topic
    □ Broker disk usage
    □ Under-replicated partitions
    □ Controller election rate
Architecture
□ Partition count matches consumer scaling requirements
□ Partitioning key has high cardinality
□ Retention period matches replay requirements
□ Log compaction behavior understood for topic
□ Rebalance frequency monitored
Testing
□ Consumer crash mid-batch tested
□ Double-processing tested (idempotency verification)
□ Consumer lag recovery tested
□ Replay tested against production-like data volume
What Distributed Systems Actually Teach You
The 3 AM incident taught us something that no architecture talk had ever communicated clearly:
Distributed systems don’t punish bad architecture immediately.
They let you deploy. They let you celebrate the latency improvements. They let you present the graphs to product. They let you write the blog post about how you scaled.
Then, weeks or months later, under exactly the right (wrong) conditions — a slow consumer, an unexpected traffic spike, a network partition — they collect.
The bill is always paid at 3 AM.
The engineers who truly understand distributed systems aren’t the ones who avoided incidents. They’re the ones who’ve been humbled by them, understood exactly why they happened, and built systems that fail gracefully instead of catastrophically.
Kafka isn’t a silver bullet. It’s a distributed consistency nightmare dressed in a hoodie that says “low latency.”
Respect it, or it will humble you.
What’s Next
Next post: Why SQL quietly beats NoSQL for 90% of startups — despite everything the hype machine told you. We’ll look at the actual data, the benchmark lies, and the cases where PostgreSQL outperforms MongoDB at scale.
This one is going to make people uncomfortable.
Follow to stay updated. 🔔
If this saved you from a 3 AM incident, share it with your team. The best time to learn this lesson is before it’s your production database.
Tags: Kafka, Distributed Systems, Backend Engineering, System Design, Software Architecture, Java, Event-Driven Architecture, Microservices, Interview Prep
Top comments (2)
the "Kafka is not a message queue" framing is so important and so rarely communicated clearly upfront. the commit log mental model changes everything -- once you think of consumers as readers with a bookmark rather than recipients who consume-and-delete, the idempotency requirement becomes obvious.
your point about the architecture review going smoothly because "Kafka handles that" is painfully relatable. i've been working on a tool to help teams diagram these kinds of event flows more explicitly -- and it's striking how often the failure modes only become visible when you actually draw out the producer/consumer/offset relationships instead of just the happy path. whiteboard diagrams that skip the offset commit lifecycle miss exactly the failure modes you describe.
one addition to your checklist: test your DLQ consumer too. teams implement the DLQ but never test what happens when the DLQ consumer itself falls behind or fails. that's a second time bomb.
that is a solid point, andre. testing the dlq consumer's failure modes is exactly the kind of edge case that separates a working system from a resilient one. thanks for the addition.