At 2 a.m., our Kafka consumers looked healthy, but the dashboards told a different story.
Consumer lag was climbing rapidly. Throughput had dropped. Messages that had already been processed started showing up again. A few pods had restarted, and suddenly the entire consumer group seemed unstable.
The confusing part was that nothing looked obviously wrong. We had six consumer pods running in Kubernetes, six partitions, and enough resources allocated to handle the workload.
The breakthrough came when we stopped looking at individual consumers and started looking at the consumer group itself.
Like many engineers, I understood the basics of Kafka: producers write to partitions, consumers read from them, and rebalancing happens when a consumer joins or leaves a group.
What I didn't understand was what actually happens during a rebalance, why consumers can suddenly stop processing, how offset commits interact with partition ownership changes, and why a single pod restart can create a visible lag spike across an entire system.
This article is the deep dive I wish I had while debugging those incidents.
We'll unpack what happens inside a Kafka consumer group during a rebalance, explore why rebalances are expensive, compare eager and cooperative protocols, and walk through the production patterns that significantly reduce lag spikes, duplicate processing, and deployment-related churn.
The Consumer Group Mental Model Most Engineers Never Build
Most explanations of Kafka consumer groups stop at a simple statement: a consumer group is a set of consumers that share the work of reading partitions from a topic.
That's true, but it doesn't explain why an entire consumer group can stop processing because a single pod restarted.
To understand rebalancing, you need to think of a consumer group as a distributed coordination protocol rather than a collection of consumers.
Every consumer group has three key actors: a Group Coordinator, a Group Leader, and the group members themselves.
The Group Coordinator is a Kafka broker responsible for managing the lifecycle of a consumer group. When a consumer starts, it first asks the cluster a simple question: "Which broker manages my group?" Once it gets the answer, every group-related operation flows through that coordinator.
The consumers themselves are the group members. Each consumer joins the group by sending a JoinGroup request and receives a temporary member.id. Unless you explicitly configure static membership, this identifier changes every time the consumer restarts.
The third actor is the one most engineers never hear about: the Group Leader.
Despite the name, the coordinator does not decide which consumer gets which partitions. Instead, the coordinator elects one consumer as the Group Leader during every rebalance. The leader receives the list of active members and their topic subscriptions, runs the configured partition assignment strategy, and sends the final assignment back to the coordinator.
The coordinator acts more like a traffic controller than a decision-maker. The assignment logic lives inside the clients.
This distinction matters more than it might seem. It explains why Kafka can support multiple partition assignment strategies such as Range, RoundRobin, Sticky, and CooperativeSticky without requiring broker changes. It also means that the behavior of a rebalance is heavily influenced by client-side configuration. A single change to the assignor can dramatically alter partition movement, consumer downtime, and lag characteristics across the entire group.
This design allows Kafka to evolve assignment strategies independently of the broker, giving teams the flexibility to optimize consumer behavior without touching the cluster itself.
Consumer groups also move through a well-defined state machine:
Stable → PreparingRebalance → CompletingRebalance → Stable
During the Stable state, consumers process records normally. When a rebalance is triggered, the group enters PreparingRebalance and the coordinator waits for members to rejoin; under the eager protocol, consumers revoke their partitions and stop fetching new records at this point. In CompletingRebalance, the coordinator waits for the leader's assignment. Once it has distributed that assignment to every member, the group returns to Stable.
That transition window is where lag spikes are born.
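To make the state machine concrete, here is a minimal sketch in Python. It models only the three states named above; the real coordinator also tracks states such as Empty and Dead, and the names `GroupState`, `advance`, and `run_rebalance` are my own for illustration.

```python
from enum import Enum


class GroupState(Enum):
    STABLE = "Stable"
    PREPARING_REBALANCE = "PreparingRebalance"
    COMPLETING_REBALANCE = "CompletingRebalance"


# Allowed transitions in this simplified model. A member joining or leaving
# mid-rebalance sends the group back to PreparingRebalance.
TRANSITIONS = {
    GroupState.STABLE: {GroupState.PREPARING_REBALANCE},
    GroupState.PREPARING_REBALANCE: {GroupState.COMPLETING_REBALANCE},
    GroupState.COMPLETING_REBALANCE: {
        GroupState.STABLE,
        GroupState.PREPARING_REBALANCE,
    },
}


def advance(state, next_state):
    """Move to next_state, rejecting transitions the model does not allow."""
    if next_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state.value} -> {next_state.value}")
    return next_state


def run_rebalance(state):
    """Walk one full rebalance cycle and return every state visited."""
    visited = [state]
    for nxt in (
        GroupState.PREPARING_REBALANCE,
        GroupState.COMPLETING_REBALANCE,
        GroupState.STABLE,
    ):
        state = advance(state, nxt)
        visited.append(state)
    return visited
```

Records are only guaranteed to flow in the first and last entries of the list returned by `run_rebalance`; everything in between is the transition window.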
One final detail matters more than most teams realize: Kafka tracks consumer health using heartbeats, but it tracks consumer progress using poll().
These are separate mechanisms.
A consumer can continue sending heartbeats successfully while spending too much time processing records between poll() calls. If the gap between poll() calls exceeds max.poll.interval.ms, the consumer is considered stuck: in the Java client, the background heartbeat thread stops heartbeating and proactively leaves the group, which triggers a rebalance.
The consumer wasn't dead. It was just slow.
Many unexpected rebalances in production start with that distinction.
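A small model helps keep the two checks apart. This is a sketch, not client code: `liveness` is a hypothetical helper, and the default values are the ones I understand recent Java clients to ship with (45 seconds for session.timeout.ms, 5 minutes for max.poll.interval.ms).

```python
def liveness(now_ms, last_heartbeat_ms, last_poll_ms,
             session_timeout_ms=45_000, max_poll_interval_ms=300_000):
    """Classify a consumer using the two independent timers.

    Returns "session_timeout" if heartbeats stopped arriving,
    "poll_timeout" if heartbeats are fine but poll() was not called in time,
    and "ok" otherwise.
    """
    if now_ms - last_heartbeat_ms > session_timeout_ms:
        return "session_timeout"
    if now_ms - last_poll_ms > max_poll_interval_ms:
        return "poll_timeout"
    return "ok"


# A consumer that sent a heartbeat one second ago but has been stuck in a
# slow batch for 350 seconds still drops out of the group.
slow_but_alive = liveness(now_ms=400_000, last_heartbeat_ms=399_000, last_poll_ms=50_000)
```

The second branch is the "alive but slow" case: nothing crashed, yet the group rebalances anyway.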
What Actually Triggers a Rebalance?
Most Kafka articles explain rebalancing with a single sentence: a rebalance happens when a consumer joins or leaves the group.
That's only partially true.
In production, rebalances often happen when nobody intentionally adds or removes consumers. A pod restart, a slow downstream dependency, a topic configuration change, or even a broker failure can trigger the same sequence of events.
Understanding these triggers is the difference between reacting to lag spikes and preventing them.
The first trigger is straightforward: a new consumer joins the group. This happens during scale-ups, rolling deployments, or when a failed consumer comes back online. The coordinator moves the group out of the Stable state and starts a rebalance so partitions can be redistributed.
The second trigger is a clean consumer shutdown. When an application calls consumer.close(), the client sends a LeaveGroup request to the coordinator, which immediately initiates a rebalance. This is why graceful shutdowns matter. A clean exit starts the rebalance instantly instead of waiting for a timeout.
The third trigger is an unclean consumer failure. If a pod crashes, the JVM exits unexpectedly, or a network partition prevents heartbeats from reaching the coordinator, Kafka waits until session.timeout.ms expires before declaring the consumer dead.
During that entire period, the failed consumer's partitions sit idle while producers continue writing messages.
The fourth trigger surprises most teams because the consumer is still alive.
Kafka uses heartbeats to determine whether a consumer exists, but it uses poll() to determine whether the consumer is making progress.
If record processing takes longer than max.poll.interval.ms, the consumer is treated as stuck and leaves the group, even though its heartbeats were succeeding in the background up to that point.
This commonly happens when consumers perform expensive work inside the polling loop, such as synchronous HTTP calls, large database transactions, or heavy batch processing.
The remaining triggers are less common but still important.
Kafka also initiates a rebalance when topic metadata changes - for example, when partitions are added or when a new topic matches a subscription pattern. Rebalances can also occur during group coordinator failover, when the broker responsible for managing the consumer group becomes unavailable and a new coordinator takes over.
Regardless of the trigger, the sequence that follows is always the same.
The group transitions from Stable to PreparingRebalance. Consumers send new JoinGroup requests (after giving up their partitions, under the eager protocol), the leader computes the next partition assignment, and the coordinator distributes the results.
Only then does the group return to Stable.
That brief transition window is the source of the lag spikes, throughput drops, and duplicate processing patterns many teams see in production.
And depending on which rebalance protocol you're using, that window can range from barely noticeable to a full stop-the-world event.
The Original Rebalance Protocol: Why One Consumer Restart Can Pause an Entire Group
For years, Kafka used a rebalance protocol that prioritized correctness over availability.
The rule was simple:
Before Kafka assigns partitions again, every consumer must give up every partition it currently owns.
This approach is known as eager rebalancing.
It guarantees that no partition is ever processed by two consumers simultaneously, but that safety comes at a cost.
Whenever a rebalance begins, Kafka effectively presses the pause button on the entire consumer group.
Imagine a consumer group with six consumers processing six partitions.
Everything is running normally until one pod restarts during a deployment.
Intuitively, you might expect Kafka to move only the affected partition to another consumer.
That's not what happens.
The coordinator notifies every consumer that a rebalance is in progress. Each consumer stops fetching new records, commits its current offsets, revokes all assigned partitions, and sends a fresh JoinGroup request.
At this point, no consumer owns any partition.
Meanwhile, producers continue writing messages exactly as before.
Once every consumer rejoins, the coordinator elects a Group Leader. The leader calculates a new partition assignment and sends the results back through a SyncGroup request. Only after every consumer receives its new assignment does processing resume.
The sequence looks like this:
Stop consuming → Revoke all partitions → JoinGroup → Compute assignments → SyncGroup → Resume consuming
That entire window is effectively a processing blackout.
Even consumers that ultimately keep the same partitions must still release and reacquire them.
If Consumer 1 owns Partition 0 before the rebalance and owns Partition 0 after the rebalance, it still stops processing during the transition.
This behavior explains the characteristic lag pattern many teams observe in production.
At the moment the rebalance starts, consumption drops to zero across the entire group while producers continue publishing new messages. Lag climbs rapidly until the rebalance completes. Once consumers resume, the group enters a catch-up phase where lag gradually returns to normal.
The larger the consumer group, the more expensive this process becomes.
Adding a single consumer to a group of fifty consumers can temporarily pause all fifty consumers.
A rebalance that lasts only ten seconds in a system processing 20,000 messages per second creates a backlog of 200,000 messages before consumers even begin catching up.
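The arithmetic is worth writing down, because the pause is only half the cost. In this sketch, `backlog_after_pause` reproduces the number above, while `catch_up_seconds` adds an assumption of my own: that consumers can drain at a fixed rate above the produce rate (25,000 messages per second here is made up for illustration).

```python
def backlog_after_pause(produce_rate, pause_seconds):
    """Messages that pile up while the whole group is paused."""
    return produce_rate * pause_seconds


def catch_up_seconds(backlog, consume_rate, produce_rate):
    """Time to drain the backlog while producers keep writing.

    Only the headroom (consume_rate - produce_rate) reduces lag.
    """
    headroom = consume_rate - produce_rate
    if headroom <= 0:
        return float("inf")  # the group never catches up
    return backlog / headroom


backlog = backlog_after_pause(produce_rate=20_000, pause_seconds=10)
recovery = catch_up_seconds(backlog, consume_rate=25_000, produce_rate=20_000)
```

Under those assumed rates, a ten-second pause takes roughly four times as long to recover from as it did to cause, which matches the slow downward slope on the lag graph after the spike.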
Offset management introduces another challenge.
Imagine a consumer fetches records at offsets 1,000 through 1,500 and has processed up to offset 1,200 when a rebalance begins.
The consumer now faces a difficult trade-off.
It can commit its progress through offset 1,200 immediately and revoke the partition. That keeps the rebalance fast, and the next owner starts at offset 1,201, so the rest of the fetched batch is discarded and read again from the broker.
Alternatively, it can finish processing the entire batch before committing, which avoids throwing away fetched work but delays the rebalance for every consumer in the group.
Duplicates appear whenever that final commit does not happen. If the commit fails, or the consumer has already been removed from the group, the next owner resumes from the last successful commit and reprocesses every record handled since then.
This is why duplicate processing during rebalances is not an edge case. It's an expected behavior that consumer applications must be designed to handle safely.
If your monitoring dashboards show a sudden drop to zero consumption across all partitions followed by a sharp lag spike and gradual recovery, you've likely experienced an eager rebalance.
The surprising part isn't that lag increased.
It's that the entire consumer group stopped to move a small number of partitions.
How Kafka Fixed the Problem: Cooperative Rebalancing
The biggest problem with eager rebalancing isn't that it pauses consumers.
It's that it pauses consumers unnecessarily.
If one consumer leaves a group of six consumers, only a handful of partitions actually need to move. Yet eager rebalancing forces every consumer to revoke every partition, even when most assignments remain unchanged.
Kafka addressed this limitation in version 2.4 by introducing cooperative rebalancing.
The core idea is deceptively simple:
Only move the partitions that need to move.
Instead of revoking all partitions at once, consumers keep processing their existing partitions while Kafka incrementally transfers ownership of only the affected partitions.
Let's work through a smaller example.
Imagine three consumers processing six partitions:
Consumer 1 owns Partitions 0 and 1
Consumer 2 owns Partitions 2 and 3
Consumer 3 owns Partitions 4 and 5
Now Consumer 2 crashes.
With eager rebalancing, Consumers 1 and 3 must revoke all their partitions before Kafka can compute a new assignment. Processing stops completely across the group.
With cooperative rebalancing, Consumers 1 and 3 continue processing their existing partitions while Kafka redistributes only Partitions 2 and 3.
The unaffected partitions never stop consuming.
This dramatically reduces the blast radius of a rebalance.
The trade-off is that cooperative rebalancing happens in multiple rounds.
During the first round, Kafka identifies which partitions must move and asks current owners to revoke only those specific partitions.
During the second round, Kafka assigns those newly available partitions to their new owners.
The extra coordination step preserves Kafka's most important guarantee: a partition is never owned by two consumers at the same time.
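The two rounds can be sketched as a diff between the current ownership and the target assignment. This is a simplified model, not the client implementation; `cooperative_rounds` and its return values are names I made up for illustration. The example uses a new consumer joining, since that is the case where live owners have to give something up.

```python
def cooperative_rounds(current, target):
    """Split an ownership change into the two cooperative rounds.

    current, target: dict mapping consumer -> set of partition numbers.
    Returns (revoke, assign_later, uninterrupted).
    """
    owner = {p: c for c, parts in current.items() for p in parts}

    # Round 1: each current owner gives up only the partitions it is losing.
    revoke = {}
    for consumer, parts in current.items():
        losing = parts - target.get(consumer, set())
        if losing:
            revoke[consumer] = losing

    # Round 2: a partition that had a different live owner can only be handed
    # over after that owner has revoked it.
    assign_later = {}
    for consumer, parts in target.items():
        waiting = {p for p in parts if p in owner and owner[p] != consumer}
        if waiting:
            assign_later[consumer] = waiting

    # Partitions whose owner did not change are consumed throughout.
    uninterrupted = {p for p, c in owner.items() if p in target.get(c, set())}
    return revoke, assign_later, uninterrupted


# Two consumers own three partitions each, then a third consumer joins.
current = {"c1": {0, 1, 2}, "c2": {3, 4, 5}}
target = {"c1": {0, 1}, "c2": {3, 4}, "c3": {2, 5}}
revoke, assign_later, uninterrupted = cooperative_rounds(current, target)
```

Only two of the six partitions pause in this example. Under the eager protocol, all six would have been revoked before the new assignment was computed.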
Cooperative rebalancing doesn't eliminate pauses entirely.
If a partition changes ownership, that partition still experiences a brief interruption.
What changes is the scope of the interruption.
A rebalance triggered by one consumer no longer pauses the entire group. It affects only the partitions involved in the change.
This difference becomes significant as consumer groups grow.
In a group with fifty consumers and two hundred partitions, adding a new consumer with eager rebalancing can temporarily pause processing for all two hundred partitions.
With cooperative rebalancing, only the partitions that need redistribution are affected.
Everything else continues processing normally.
The result is lower lag spikes, shorter recovery times, and fewer downstream incidents during deployments and consumer failures.
To enable cooperative rebalancing, configure the consumer to use the CooperativeStickyAssignor.
In Java:
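import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

// props is the Properties object later passed to the KafkaConsumer constructor
props.put(
    ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    CooperativeStickyAssignor.class.getName()
);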
In Python clients that support cooperative assignment:
{
    "partition.assignment.strategy": "cooperative-sticky"
}
One important caveat: every consumer in the same group must use a compatible assignor.
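If a consumer that lists only a cooperative assignor tries to join a group whose other members list only eager assignors, the group has no assignor in common and the join is rejected.
The documented migration path is two rolling restarts: first add CooperativeStickyAssignor alongside the existing assignor, then remove the old assignor once every member has been restarted. The group keeps using the eager protocol until that second pass completes, and the end state should be consistent across the entire consumer group.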
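To see why a half-migrated group behaves the way it does, here is a rough model of how a group settles on one assignor: only assignors that every member lists are candidates, and each member votes for its most preferred candidate. `select_assignor` is a hypothetical helper, and the alphabetical tie-break is my own assumption to keep the sketch deterministic.

```python
from collections import Counter


def select_assignor(members):
    """members: dict mapping member -> assignor names in order of preference.

    Returns the assignor the whole group can agree on, or None when the
    members have no assignor in common.
    """
    if not members:
        return None
    preference_lists = list(members.values())
    candidates = set(preference_lists[0])
    for prefs in preference_lists[1:]:
        candidates &= set(prefs)
    if not candidates:
        return None
    # Each member votes for the first entry in its list that everyone supports.
    votes = Counter(
        next(a for a in prefs if a in candidates) for prefs in preference_lists
    )
    # Most votes wins; ties are broken alphabetically (an assumption).
    return min(votes, key=lambda a: (-votes[a], a))


mid_upgrade = {
    "pod-0": ["cooperative-sticky", "range"],  # already restarted
    "pod-1": ["range"],
    "pod-2": ["range"],
}
after_first_pass = {pod: ["cooperative-sticky", "range"] for pod in mid_upgrade}
incompatible = {"pod-0": ["cooperative-sticky"], "pod-1": ["range"]}
```

Mid-upgrade, the only common assignor is range, so the group stays eager. As far as I understand the Java client, it also stays on the eager protocol for as long as any eager-only assignor remains in its own list, which is why the old assignor has to be removed in a separate pass.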
Partition Assignors: How Kafka Decides What Moves
Once a rebalance begins, Kafka faces a deceptively difficult problem:
How should partitions be distributed across consumers?
At first glance, the answer seems simple: divide partitions evenly.
In practice, the assignor must optimize for three competing goals simultaneously: balance load fairly across consumers, minimize partition movement during rebalances, and avoid unnecessary disruption to consumers that are already processing efficiently.
These goals often conflict.
Moving partitions aggressively improves load distribution but forces consumers to rebuild caches, reinitialize local state, and replay uncommitted messages. Preserving existing assignments reduces disruption but can leave the group slightly imbalanced.
Kafka addresses this trade-off through pluggable partition assignors.
RangeAssignor
The Range assignor works independently for each topic. It sorts consumers and partitions, then assigns contiguous ranges of partitions to each consumer.
For single-topic consumer groups with evenly distributed partitions, this approach works well.
Problems appear when a group subscribes to multiple topics. Because Range operates on each topic independently, the same consumers often receive the extra partitions across multiple topics, creating systematic imbalance over time.
Range is simple and predictable, but it can produce uneven workloads in multi-topic consumer groups.
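The multi-topic imbalance is easy to reproduce with a small model of the range strategy. This sketch assumes every consumer subscribes to every topic; `range_assign` and the topic names are illustrative.

```python
def range_assign(consumers, topics):
    """Assign contiguous partition ranges per topic.

    consumers: iterable of consumer names.
    topics: dict mapping topic -> partition count.
    Returns dict mapping consumer -> list of (topic, partition).
    """
    members = sorted(consumers)
    assignment = {c: [] for c in members}
    for topic in sorted(topics):
        per_consumer, extra = divmod(topics[topic], len(members))
        start = 0
        for i, consumer in enumerate(members):
            # The first `extra` consumers in sorted order get one more partition.
            count = per_consumer + (1 if i < extra else 0)
            assignment[consumer].extend((topic, p) for p in range(start, start + count))
            start += count
    return assignment


result = range_assign(["c1", "c2", "c3"], {"orders": 4, "payments": 4})
load = {consumer: len(parts) for consumer, parts in result.items()}
```

With four partitions per topic and three consumers, the leftover partition lands on the same consumer for both topics, so c1 carries twice the load of c2 or c3.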
RoundRobinAssignor
The RoundRobin assignor combines partitions from all subscribed topics into a single list and distributes them evenly across consumers.
This generally creates better balance than Range.
The downside is instability.
During a rebalance, RoundRobin tends to reshuffle partitions aggressively. Consumers frequently lose partitions they previously owned, even when no movement is strictly necessary.
That additional movement increases lag, invalidates local caches, and amplifies the cost of rebalancing.
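A toy version of round-robin assignment shows where the extra movement comes from. This is a simplified sketch for a single list of partition numbers; `round_robin_assign` and `moved_between_survivors` are names I chose for illustration.

```python
def round_robin_assign(consumers, partitions):
    """Deal sorted partitions out to sorted consumers, one at a time."""
    members = sorted(consumers)
    assignment = {c: [] for c in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


def moved_between_survivors(before, after):
    """Partitions that changed hands although their old owner is still alive."""
    old_owner = {p: c for c, parts in before.items() for p in parts}
    return sorted(
        p
        for consumer, parts in after.items()
        for p in parts
        if old_owner[p] in after and old_owner[p] != consumer
    )


before = round_robin_assign(["c1", "c2", "c3"], range(6))
after = round_robin_assign(["c1", "c3"], range(6))  # c2 has left the group
unnecessary = moved_between_survivors(before, after)
```

Only the two partitions owned by c2 needed a new home, yet two more partitions swap between c1 and c3 because the whole deal is redone from scratch.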
StickyAssignor
The Sticky assignor introduced a different optimization strategy.
Instead of focusing exclusively on balance, it also tries to preserve existing assignments.
The assignor first calculates an ideal distribution, then minimizes partition movement by keeping as many existing assignments intact as possible.
For stateful consumers, this distinction matters.
If a consumer maintains in-memory aggregations, recently accessed data, or downstream connections tied to specific partitions, unnecessary movement creates avoidable work.
The Sticky assignor minimizes that disruption.
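The idea can be illustrated with a deliberately simplified model. This is not Kafka's actual sticky algorithm: it only handles the case where a consumer leaves, and `sticky_reassign` is a hypothetical name.

```python
def sticky_reassign(before, live_consumers):
    """Survivors keep what they own; orphaned partitions go to the least loaded.

    before: dict mapping consumer -> list of partitions.
    live_consumers: consumers still in the group.
    """
    assignment = {c: list(before.get(c, [])) for c in sorted(live_consumers)}
    orphaned = sorted(
        p for consumer, parts in before.items() if consumer not in assignment for p in parts
    )
    for partition in orphaned:
        # Pick the consumer with the fewest partitions; break ties by name.
        target = min(assignment, key=lambda c: (len(assignment[c]), c))
        assignment[target].append(partition)
    return assignment


before = {"c1": [0, 3], "c2": [1, 4], "c3": [2, 5]}
after = sticky_reassign(before, ["c1", "c3"])  # c2 has left the group
```

Both survivors keep their original partitions and each picks up one orphan, so any state tied to partitions 0, 2, 3, and 5 stays valid.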
CooperativeStickyAssignor
CooperativeSticky combines two independent improvements.
The Sticky assignor minimizes partition movement.
The Cooperative rebalance protocol minimizes consumer disruption.
Together, they ensure that only the partitions that actually need to move are reassigned while unaffected consumers continue processing.
For most production workloads running Kafka 2.4 or later, this should be the default choice.
It delivers balanced workloads, stable partition ownership, and significantly smaller lag spikes during deployments and consumer failures.
Choosing an assignor isn't just a configuration decision.
It's a decision about how much disruption your system experiences every time the consumer group changes.
If your consumers maintain local state, perform expensive initialization, or operate under strict latency requirements, minimizing partition movement is often more valuable than achieving perfectly even distribution.
Unless you have a specific reason not to, CooperativeStickyAssignor should be your default choice.
Why Rebalances Cause Duplicate Processing
Consumer lag is usually the first symptom teams notice during a rebalance.
Duplicate processing is the second.
And unlike lag spikes, duplicates don't always announce themselves with a dashboard alert. They often show up later as duplicate database records, repeated API calls, incorrect aggregations, or customers receiving the same notification twice.
The root cause lies in the gap between processing a message and committing its offset.
Kafka does not track whether your application successfully processed a record.
It only tracks the last offset your consumer committed.
This distinction is critical.
Imagine a consumer fetches records from offsets 1,000 to 1,500.
By the time a rebalance starts, it has successfully processed records up to offset 1,200.
The remaining records are still in memory, waiting to be processed.
At this point, the consumer has two options.
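It can immediately commit its progress through offset 1,200 and give up ownership of the partition. This speeds up the rebalance, and whichever consumer receives the partition next starts at offset 1,201.
Alternatively, it can finish processing the entire batch before committing its progress through offset 1,500.
This avoids discarding fetched records but delays the rebalance for every consumer in the group.
Neither option produces duplicates as long as the commit succeeds. The trouble starts when it doesn't: if the consumer crashes, the commit fails, or the consumer has already been removed from the group, the next owner resumes from the last committed offset and reprocesses everything handled since then.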
There is no perfect answer because Kafka prioritizes availability and fault tolerance over exactly-once consumption.
Duplicate delivery during rebalances is expected behavior.
This is also why enable.auto.commit=true often creates problems in production.
With auto-commit enabled, the consumer commits offsets automatically as part of its poll() calls, at an interval set by auto.commit.interval.ms (five seconds by default).
Your application loses control over when offsets are persisted.
A consumer can crash, or be removed from the group, immediately after a record is processed but before the next automatic commit happens.
When another consumer takes ownership of that partition, it resumes from the last committed offset, not the last processed record.
The result is duplicate processing.
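The size of the duplicate window is simply the gap between what was processed and what was committed. One detail to keep in mind: Kafka stores a committed offset as the position of the next record to read, not the last record handled. `reprocessed_offsets` is a hypothetical helper, and the numbers are illustrative.

```python
def reprocessed_offsets(committed_position, last_processed):
    """Offsets the next owner will process a second time.

    committed_position: the committed offset, i.e. the next offset to read.
    last_processed: the highest offset the previous owner already handled.
    """
    return range(committed_position, last_processed + 1)


# The last commit recorded position 1,000, and the consumer had processed
# through offset 1,200 when it lost the partition.
duplicates = reprocessed_offsets(committed_position=1_000, last_processed=1_200)

# If the consumer had committed position 1,201 first, nothing would repeat.
none_repeated = reprocessed_offsets(committed_position=1_201, last_processed=1_200)
```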
Disabling auto-commit gives applications explicit control over this boundary.
Instead of committing offsets on a timer, consumers commit offsets only after records have been processed successfully.
More importantly, they commit one final time when partitions are about to be revoked.
Kafka provides a dedicated hook for this purpose: ConsumerRebalanceListener.
The onPartitionsRevoked() callback executes before ownership transfers to another consumer.
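During a normal rebalance, this is the last opportunity to commit offsets and clean up any partition-specific state. If the consumer has already been removed from the group (after a session timeout, for example), the client calls onPartitionsLost() instead, and commits for those partitions can no longer succeed.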
The onPartitionsAssigned() callback executes after new partitions arrive, allowing consumers to rebuild caches, initialize local state, or restore processing context.
These callbacks turn rebalancing from an unpredictable event into a manageable lifecycle.
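Here is a minimal sketch of the bookkeeping such a listener needs, written as a plain Python class so it runs without a broker. `RevocationCommitter` is my own name, and the `commit` argument is a hypothetical stand-in for a synchronous commit call on a real client.

```python
class RevocationCommitter:
    """Tracks progress per partition and commits it when partitions are revoked.

    commit: a callable (hypothetical stand-in for a synchronous commit) that
    receives a dict mapping partition -> next offset to read.
    """

    def __init__(self, commit):
        self._commit = commit
        self._next_offset = {}

    def mark_processed(self, partition, offset):
        # The committed value must be the NEXT offset to read.
        self._next_offset[partition] = offset + 1

    def on_partitions_revoked(self, partitions):
        to_commit = {
            p: self._next_offset[p] for p in partitions if p in self._next_offset
        }
        if to_commit:
            self._commit(to_commit)
        for p in partitions:
            self._next_offset.pop(p, None)  # drop partition-specific state

    def on_partitions_assigned(self, partitions):
        for p in partitions:
            self._next_offset.pop(p, None)  # start clean for new partitions


commits = []
listener = RevocationCommitter(commits.append)
listener.mark_processed(("orders", 0), 1_200)
listener.mark_processed(("orders", 1), 77)
listener.on_partitions_revoked([("orders", 0)])
```

In a real consumer, `mark_processed` would be called after each record is handled, and the two callback methods would be wired to the client's rebalance hooks.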
Even with careful offset management, duplicate delivery remains possible.
A consumer can crash after processing a record but before committing its offset. Network failures can interrupt commits. Coordinator failovers can introduce retries.
The safest approach is to assume duplicates will happen.
Design consumers to be idempotent.
If processing the same message twice changes the outcome, rebalancing will eventually expose that weakness.
Idempotency keys, deduplication tables, transactional writes, and upsert operations transform duplicate processing from a production incident into a harmless retry.
The question isn't whether your consumers will receive duplicate messages.
The question is whether your system is designed to tolerate them.
Production Patterns That Minimize Rebalances and Their Impact
Rebalancing is not a failure.
It's a fundamental part of how Kafka maintains fault tolerance and distributes work across consumers.
The goal isn't to eliminate rebalances completely. The goal is to make them infrequent, predictable, and inexpensive.
Over time, we found that most consumer group instability came from a small set of recurring problems: consumers restarting during deployments, slow processing causing poll timeouts, unnecessary partition movement, and imprecise offset management.
Each problem has a corresponding mitigation.
Use CooperativeStickyAssignor by Default
If you're still using eager rebalancing, every consumer change becomes a group-wide event.
Switching to CooperativeStickyAssignor dramatically reduces disruption by limiting partition movement and allowing unaffected consumers to continue processing during a rebalance.
For Kafka 2.4 and later, this should be the default choice for most workloads.
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
Disable Auto-Commit
Automatic offset commits optimize for convenience, not correctness.
When offsets are committed on a timer, the consumer loses control over the relationship between processing completion and offset persistence.
Disable auto-commit and commit offsets explicitly after successful processing.
enable.auto.commit=false
Pair this with ConsumerRebalanceListener and commit offsets in onPartitionsRevoked() before ownership transfers to another consumer.
Introduce Static Membership
Every time a consumer restarts, Kafka treats it as a new member unless configured otherwise.
This creates unnecessary rebalances during rolling deployments.
Static membership, available since Kafka 2.3, gives each consumer a stable identity through group.instance.id.
As long as the consumer rejoins before session.timeout.ms expires, Kafka preserves its existing partition assignments.
group.instance.id=payment-processor-pod-1
For Kubernetes workloads, use a StatefulSet with ordinal-based naming (payment-processor-0, payment-processor-1) or inject a stable identifier through environment variables.
Avoid relying on default pod names, which change during restarts and defeat the purpose of static membership.
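One way to wire this up is to derive the identifier from the environment at startup. This sketch assumes the pod runs in a StatefulSet, where HOSTNAME carries the stable pod name; `KAFKA_GROUP_INSTANCE_ID` is a hypothetical override variable, and `static_member_config` is a helper name I made up.

```python
import os


def static_member_config(group_id, env=os.environ):
    """Build consumer settings with a stable group.instance.id.

    Prefers an explicit KAFKA_GROUP_INSTANCE_ID (hypothetical variable) and
    falls back to HOSTNAME, which is the pod name inside Kubernetes.
    """
    instance_id = env.get("KAFKA_GROUP_INSTANCE_ID") or env.get("HOSTNAME")
    if not instance_id:
        raise RuntimeError("no stable identifier available for group.instance.id")
    return {"group.id": group_id, "group.instance.id": instance_id}


config = static_member_config("payment-processor", env={"HOSTNAME": "payment-processor-0"})
```

Failing fast when no identifier is available is a deliberate choice here: silently falling back to a random ID would bring back the churn that static membership is meant to remove.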
Tune Session Timeouts for Reality, Not Averages
A common mistake is setting session.timeout.ms based on average restart time.
Tune for worst-case restart latency instead.
If image pulls, JVM startup, readiness probes, and dependency initialization can occasionally take 60 seconds, a 45-second session timeout guarantees unnecessary rebalances.
A practical rule of thumb is:
session.timeout.ms = worst_case_restart_time × 1.5
Longer timeouts reduce deployment churn but delay failure detection.
Shorter timeouts improve responsiveness but increase the risk of false positives during transient pauses.
Choose intentionally.
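The rule of thumb translates into a small calculation. One extra consideration in this sketch is that the broker only accepts session timeouts between group.min.session.timeout.ms and group.max.session.timeout.ms; the bounds used below are the defaults as I understand them, so check your own broker configuration. `session_timeout_ms` is a hypothetical helper.

```python
def session_timeout_ms(worst_case_restart_s, safety_factor=1.5,
                       broker_min_ms=6_000, broker_max_ms=1_800_000):
    """Suggest a session.timeout.ms from the worst-case restart time.

    The result is clamped to the range the broker is assumed to allow.
    """
    proposed = int(worst_case_restart_s * 1000 * safety_factor)
    return max(broker_min_ms, min(proposed, broker_max_ms))


# A pod that can occasionally take 60 seconds to come back.
suggested = session_timeout_ms(worst_case_restart_s=60)
```

For the 60-second restart mentioned above, this suggests 90 seconds, which comfortably covers the restart at the cost of slower detection when a pod is genuinely gone.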
Keep the Poll Loop Fast
One of the most common causes of unexpected rebalances is exceeding max.poll.interval.ms.
This usually happens because consumers perform expensive work directly inside the polling loop.
Large database transactions, synchronous API calls, and oversized batches can all prevent the consumer from calling poll() frequently enough.
Reducing max.poll.records lowers the amount of work performed per polling cycle and helps maintain steady progress.
max.poll.records=100
The correct solution is almost never increasing max.poll.interval.ms.
That only hides the problem while slowing failure detection.
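A rough way to size max.poll.records is to work backwards from the slowest record you expect. This is a sketch under my own assumptions: `max_poll_records` is a hypothetical helper, and the idea of spending only half of max.poll.interval.ms on a batch is a safety margin I picked, not a Kafka rule.

```python
def max_poll_records(worst_case_ms_per_record, max_poll_interval_ms=300_000,
                     budget_fraction=0.5):
    """Largest batch that still finishes well inside max.poll.interval.ms.

    budget_fraction leaves headroom for GC pauses and slow dependencies.
    """
    budget_ms = max_poll_interval_ms * budget_fraction
    return max(1, int(budget_ms // worst_case_ms_per_record))


# Records that can take up to 1.5 seconds each in the worst case.
batch_size = max_poll_records(worst_case_ms_per_record=1_500)
```

With the default five-minute poll interval and a 1.5-second worst case per record, the batch size comes out at 100 records.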
Separate Consumption from Processing
Fetching records and processing records are different concerns.
The consumer thread should focus on polling Kafka consistently and handing work to a dedicated processing layer.
This architecture prevents slow downstream systems from destabilizing the consumer group.
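When backpressure occurs, pause consumption with consumer.pause() instead of allowing the poll loop to stall.
A paused consumer keeps calling poll(), which returns no records for the paused partitions, so it stays within max.poll.interval.ms, keeps sending heartbeats, and retains partition ownership. Call consumer.resume() once the backlog drains.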
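The pattern looks roughly like this. The sketch is duck-typed: `poll_once` only assumes the consumer object offers poll(), assignment(), pause(), and resume(), and `FakeConsumer` is a stand-in I wrote so the example runs without a broker. The watermark values are arbitrary.

```python
import queue


def poll_once(consumer, work_queue, state, high_water=1000, low_water=200):
    """One iteration of the polling thread with backpressure."""
    backlog = work_queue.qsize()
    if not state["paused"] and backlog >= high_water:
        consumer.pause(consumer.assignment())
        state["paused"] = True
    elif state["paused"] and backlog <= low_water:
        consumer.resume(consumer.assignment())
        state["paused"] = False
    # Always poll: a paused consumer gets no records but stays in the group.
    for record in consumer.poll(0.1):
        work_queue.put(record)


class FakeConsumer:
    """Stand-in for a real client, used only to exercise the loop."""

    def __init__(self, batches):
        self._batches = list(batches)
        self.is_paused = False
        self.polls = 0

    def assignment(self):
        return {("orders", 0)}

    def pause(self, partitions):
        self.is_paused = True

    def resume(self, partitions):
        self.is_paused = False

    def poll(self, timeout):
        self.polls += 1
        if self.is_paused or not self._batches:
            return []
        return self._batches.pop(0)
```

A separate worker pool would drain `work_queue`; the polling thread never blocks on downstream work, so a slow dependency shows up as a paused consumer rather than a rebalance.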
Design for Idempotency
No configuration eliminates duplicate delivery completely.
Consumers can still fail after processing records but before committing offsets.
Network partitions, coordinator failovers, and application crashes will eventually happen.
The final line of defense is idempotency.
Deduplication keys, upserts, transactional writes, and idempotent downstream APIs ensure that reprocessing the same message produces the same result.
A resilient Kafka consumer assumes duplicates are inevitable and makes them harmless.
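As one possible shape for this, here is a deduplication-table sketch using SQLite from the Python standard library. The table layout, the balance example, and the function names are all made up for illustration; the point is that the dedup marker and the side effect commit in the same transaction.

```python
import sqlite3


def make_store():
    """Create an in-memory store with a dedup table and a sample business table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE processed ("
        "topic TEXT, kafka_partition INTEGER, kafka_offset INTEGER, "
        "PRIMARY KEY (topic, kafka_partition, kafka_offset))"
    )
    db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
    return db


def apply_once(db, topic, partition, offset, account, delta):
    """Apply a balance change at most once per (topic, partition, offset)."""
    with db:  # one transaction: marker and side effect commit together
        cur = db.execute(
            "INSERT OR IGNORE INTO processed VALUES (?, ?, ?)",
            (topic, partition, offset),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery, already applied
        db.execute("INSERT OR IGNORE INTO balances VALUES (?, 0)", (account,))
        db.execute(
            "UPDATE balances SET amount = amount + ? WHERE account = ?",
            (delta, account),
        )
        return True


db = make_store()
first = apply_once(db, "payments", 0, 1_200, "acct-1", 50)
second = apply_once(db, "payments", 0, 1_200, "acct-1", 50)  # redelivered
balance = db.execute("SELECT amount FROM balances WHERE account = 'acct-1'").fetchone()[0]
```

The redelivered record is recognized by its coordinates and skipped, so the balance is credited once no matter how many times the message arrives.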
The most effective Kafka systems don't avoid rebalances.
They assume rebalances will happen and are designed to absorb them gracefully.
Conclusion: Rebalances Aren't the Problem. Unpredictable Rebalances Are.
When our consumer group started showing lag spikes at 2 a.m., we blamed the usual suspects.
Kubernetes. Resource limits. Batch sizes. Application code.
What we didn't realize was that the real problem wasn't inside our consumers. It was in the coordination layer we rarely thought about.
A single consumer restart wasn't just restarting one pod. It was triggering a distributed protocol involving heartbeats, group coordinators, partition ownership changes, offset commits, and assignment strategies.
Once we understood that, the symptoms suddenly made sense.
The lag spikes weren't random. They were the result of stop-the-world rebalances.
Duplicate processing wasn't a bug. It was the natural consequence of at-least-once delivery and uncommitted offsets.
Deployment-related instability wasn't caused by Kubernetes. It was caused by consumers repeatedly leaving and rejoining the group without stable identities.
The most important lesson was this:
Kafka rebalances are inevitable.
Consumer crashes happen. Deployments happen. Brokers fail. Topics evolve.
The teams that build resilient Kafka systems don't try to avoid rebalances altogether. They design their consumers to absorb them gracefully.
That means using CooperativeStickyAssignor to reduce unnecessary partition movement. It means disabling auto-commit and taking explicit control of offset management. It means introducing static membership to minimize deployment churn and keeping the poll loop fast enough to avoid accidental rebalances.
Most importantly, it means assuming that duplicate processing will happen eventually and making your consumers idempotent by design.
The next time you see a sudden lag spike, don't start by increasing CPU limits or scaling your deployment.
Ask a different question:
What triggered the rebalance?
Because once you understand how Kafka consumer groups coordinate, the dashboards stop looking random.
They start telling a story.
And if you've made it this far, you'll know exactly how to read it.
📖 Blog by Naresh B. A.
👨💻 Backend & AI Systems Engineer | Distributed Systems · Production ML
Thanks for spending your precious time reading this. It's my personal take on a tech topic, and I really appreciate you being here. ❤️





