<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rajeev</title>
    <description>The latest articles on DEV Community by Rajeev (@rajeev_a954661bb78eb9797f).</description>
    <link>https://dev.to/rajeev_a954661bb78eb9797f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3766845%2F2963a715-a2d0-453c-bad4-29aa3288e2cf.png</url>
      <title>DEV Community: Rajeev</title>
      <link>https://dev.to/rajeev_a954661bb78eb9797f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rajeev_a954661bb78eb9797f"/>
    <language>en</language>
    <item>
      <title>Kafka Safe Producer Defaults and Version Compatibility Explained</title>
      <dc:creator>Rajeev</dc:creator>
      <pubDate>Wed, 01 Apr 2026 04:37:14 +0000</pubDate>
      <link>https://dev.to/rajeev_a954661bb78eb9797f/kafka-safe-producer-defaults-and-version-compatibility-explained-5457</link>
      <guid>https://dev.to/rajeev_a954661bb78eb9797f/kafka-safe-producer-defaults-and-version-compatibility-explained-5457</guid>
      <description>&lt;p&gt;In the previous article &lt;a href="https://rajeevranjan.dev/blog/kafka-retries-idempotent-producer/" rel="noopener noreferrer"&gt;Kafka Retries and Idempotent Producers Explained&lt;/a&gt;, we discussed how idempotent producers prevent duplicate messages in Kafka even with retries.&lt;/p&gt;

&lt;p&gt;In this article, we will explore &lt;strong&gt;Kafka safe producer defaults&lt;/strong&gt;, what they mean, and how &lt;strong&gt;version compatibility between brokers and clients&lt;/strong&gt; affects them.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Does “Safe Producer” Mean in Kafka?
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;safe producer&lt;/strong&gt; ensures that messages are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Written &lt;strong&gt;without duplicates&lt;/strong&gt; (idempotence)&lt;/li&gt;
&lt;li&gt;Preserved in &lt;strong&gt;correct order per partition&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Retried safely if transient failures occur&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka achieves this with the following producer settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;enable.idempotence&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;acks&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;
&lt;span class="py"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Integer.MAX_VALUE&lt;/span&gt;
&lt;span class="py"&gt;max.in.flight.requests.per.connection&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;
&lt;span class="py"&gt;delivery.timeout.ms&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;120000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
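&lt;p&gt;As a sketch, the same settings can be assembled in Java before constructing a producer. The bootstrap address and serializer classes below are illustrative assumptions; the property keys are the standard producer config names:&lt;/p&gt;

```java
import java.util.Properties;

public class SafeProducerConfig {
    // Builds the safe-producer settings shown above as a Properties object.
    // Keys are plain strings so the sketch needs no kafka-clients dependency;
    // in real code a KafkaProducer would be constructed from these props.
    public static Properties safeProducerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("enable.idempotence", "true");
        props.put("acks", "all");
        props.put("retries", Integer.toString(Integer.MAX_VALUE));
        props.put("max.in.flight.requests.per.connection", "5");
        props.put("delivery.timeout.ms", "120000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(safeProducerProps().getProperty("acks")); // prints all
    }
}
```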



&lt;blockquote&gt;
&lt;p&gt;Note: &lt;code&gt;min.insync.replicas&lt;/code&gt; is a &lt;strong&gt;broker/topic-level setting&lt;/strong&gt; and must be configured for full durability.&lt;/p&gt;
&lt;/blockquote&gt;
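&lt;p&gt;For example, a broker-wide default can be set in &lt;code&gt;server.properties&lt;/code&gt; (topic-level overrides are also possible at topic creation time). The value 2 here is a common choice for a replication factor of 3, not a universal recommendation:&lt;/p&gt;

```properties
# server.properties (broker): minimum number of ISR members that must
# acknowledge an acks=all write; can be overridden per topic.
min.insync.replicas=2
```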




&lt;h2&gt;
  
  
  Kafka Version ≥ 3.0 — What It Covers
&lt;/h2&gt;

&lt;p&gt;When we say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Kafka ≥ 3.0 has safe producer enabled by default”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;we are talking about the &lt;strong&gt;producer client behavior&lt;/strong&gt;: these defaults were introduced in client version 3.0 (KIP-679) and require a broker that supports idempotent writes.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is enabled automatically?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;enable.idempotence=true&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;acks=all&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;retries=Integer.MAX_VALUE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max.in.flight.requests.per.connection=5&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; &lt;code&gt;delivery.timeout.ms=120000&lt;/code&gt; is also a default, but it is independent of idempotence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What this does NOT cover
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consumer behavior&lt;/strong&gt; → consumers must still handle duplicates themselves where the application requires it
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Other Kafka components&lt;/strong&gt; → Streams, Connect, etc., are unaffected
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broker settings&lt;/strong&gt; → durability depends on replication and &lt;code&gt;min.insync.replicas&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
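&lt;p&gt;Because consumer behavior is out of scope, applications that need exactly-once effects often deduplicate on the consumer side by a business key. A minimal sketch, assuming each record carries a unique event ID (in practice the seen-set would be bounded or persisted):&lt;/p&gt;

```java
import java.util.HashSet;

public class ConsumerDedup {
    // Hypothetical dedup helper: remembers event IDs already processed.
    // A raw HashSet is used so the sketch stays free of angle brackets;
    // in real code you would use a generic HashSet of String.
    private final HashSet seen = new HashSet();

    /** Returns true if the event is new and should be processed. */
    public boolean shouldProcess(String eventId) {
        return seen.add(eventId); // add() returns false for duplicates
    }

    public static void main(String[] args) {
        ConsumerDedup dedup = new ConsumerDedup();
        System.out.println(dedup.shouldProcess("evt-1")); // true
        System.out.println(dedup.shouldProcess("evt-1")); // false: duplicate skipped
    }
}
```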




&lt;h2&gt;
  
  
  Version Compatibility: Broker vs Producer
&lt;/h2&gt;

&lt;p&gt;Kafka &lt;strong&gt;broker version&lt;/strong&gt; and &lt;strong&gt;producer client version&lt;/strong&gt; are separate:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Version dependency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Broker&lt;/td&gt;
&lt;td&gt;Kafka server/cluster&lt;/td&gt;
&lt;td&gt;Defines feature support (≥3.0 enables safe defaults)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Producer&lt;/td&gt;
&lt;td&gt;Kafka client&lt;/td&gt;
&lt;td&gt;Implements safe producer defaults; must match features with broker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consumer&lt;/td&gt;
&lt;td&gt;Kafka client&lt;/td&gt;
&lt;td&gt;Reads messages; independent of producer defaults&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  What Happens with Mixed Versions?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Broker Version&lt;/th&gt;
&lt;th&gt;Producer Version&lt;/th&gt;
&lt;th&gt;Safe Producer Defaults Applied?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;≥ 3.0&lt;/td&gt;
&lt;td&gt;≥ 3.0&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;≥ 3.0&lt;/td&gt;
&lt;td&gt;&amp;lt; 3.0 (e.g., 2.1)&lt;/td&gt;
&lt;td&gt;Must manually enable &lt;code&gt;enable.idempotence&lt;/code&gt;, &lt;code&gt;acks=all&lt;/code&gt;, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 3.0&lt;/td&gt;
&lt;td&gt;any&lt;/td&gt;
&lt;td&gt;Client defaults may not take effect; configure explicitly and verify the broker supports idempotent writes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Even if your &lt;strong&gt;broker is ≥3.0&lt;/strong&gt;, using an &lt;strong&gt;older producer client&lt;/strong&gt; will not automatically enable safe producer defaults.&lt;/p&gt;


&lt;h2&gt;
  
  
  When Should You Explicitly Configure Safe Producer Settings?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Legacy systems (Kafka ≤ 2.8)&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always configure &lt;code&gt;enable.idempotence=true&lt;/code&gt;, &lt;code&gt;acks=all&lt;/code&gt;, etc. manually.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Mixed-version clusters&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explicit config ensures consistent behavior across old and new clients.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Critical systems&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For payments, order processing, or inventory management, explicit configs prevent duplicates and maintain ordering.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Upgrades&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When migrating brokers or clients, explicit settings help maintain predictable behavior.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Recommended Safe Producer Configuration
&lt;/h2&gt;

&lt;p&gt;Even with Kafka ≥ 3.0, explicitly setting configs can improve clarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;acks&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;
&lt;span class="py"&gt;enable.idempotence&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Integer.MAX_VALUE&lt;/span&gt;
&lt;span class="py"&gt;max.in.flight.requests.per.connection&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;
&lt;span class="py"&gt;delivery.timeout.ms&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;120000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;This ensures high reliability, correct ordering, and duplicate-free message delivery.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Safe producer&lt;/strong&gt; = producer client behavior, not broker or consumer
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broker ≥3.0&lt;/strong&gt; supports safe defaults, but older clients must be configured manually
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit configuration&lt;/strong&gt; is still recommended for critical systems or mixed-version clusters
&lt;/li&gt;
&lt;li&gt;Understanding &lt;strong&gt;producer vs broker vs consumer roles&lt;/strong&gt; avoids common pitfalls in Kafka message delivery&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kafka safe producer guarantees &lt;strong&gt;idempotent writes and correct ordering per partition&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Defaults are automatic in &lt;strong&gt;broker ≥3.0 with modern clients&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;older clients or mixed clusters&lt;/strong&gt;, safe producer configs must be &lt;strong&gt;explicitly set&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Proper broker settings (&lt;code&gt;min.insync.replicas&lt;/code&gt;) are still required for full durability&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Ensuring safe producer behavior is essential for reliable Kafka pipelines, especially in distributed, event-driven systems.&lt;/p&gt;




&lt;p&gt;If you found this useful, feel free to leave a comment. I always appreciate feedback and different perspectives.&lt;/p&gt;




&lt;p&gt;Originally published on my personal blog:&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://rajeevranjan.dev/blog/kafka/kafka-safe-producer-defaults-compatibility/" rel="noopener noreferrer"&gt;https://rajeevranjan.dev/blog/kafka/kafka-safe-producer-defaults-compatibility/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>distributedsystems</category>
      <category>java</category>
      <category>backend</category>
    </item>
    <item>
      <title>Kafka Retries and Idempotent Producers Explained: Avoid Duplicates and Ensure Reliable Delivery</title>
      <dc:creator>Rajeev</dc:creator>
      <pubDate>Sun, 22 Mar 2026 13:01:56 +0000</pubDate>
      <link>https://dev.to/rajeev_a954661bb78eb9797f/kafka-retries-and-idempotent-producers-explained-avoid-duplicates-and-ensure-reliable-delivery-gj7</link>
      <guid>https://dev.to/rajeev_a954661bb78eb9797f/kafka-retries-and-idempotent-producers-explained-avoid-duplicates-and-ensure-reliable-delivery-gj7</guid>
      <description>&lt;p&gt;In the previous article &lt;a href="https://rajeevranjan.dev/blog/kafka-producer-acks-explained/" rel="noopener noreferrer"&gt;Kafka Producer Acks Explained: Replicas, ISR, and Write Guarantees&lt;/a&gt;, we discussed when a producer considers a write successful and how acknowledgments impact durability and availability.&lt;/p&gt;

&lt;p&gt;But even with correct acknowledgment settings, one important problem still remains:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens when a write fails in Kafka?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Or even more interesting:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens when Kafka &lt;em&gt;thinks&lt;/em&gt; a write failed, but it actually succeeded?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;Kafka retries&lt;/strong&gt; and &lt;strong&gt;idempotent producers&lt;/strong&gt; become critical.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem: Uncertain Failures in Distributed Systems
&lt;/h2&gt;

&lt;p&gt;In distributed systems, failures are not always clear.&lt;/p&gt;

&lt;p&gt;Consider this scenario:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Producer sends a message to the leader.&lt;/li&gt;
&lt;li&gt;Leader writes the message successfully.&lt;/li&gt;
&lt;li&gt;Leader sends acknowledgment.&lt;/li&gt;
&lt;li&gt;Network issue occurs → acknowledgment is lost.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From Kafka’s perspective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Broker: Write &lt;strong&gt;succeeded&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Producer: Write &lt;strong&gt;failed&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now the producer retries.&lt;/p&gt;

&lt;p&gt;👉 The same message gets written again.&lt;/p&gt;

&lt;p&gt;This leads to &lt;strong&gt;duplicate messages in Kafka&lt;/strong&gt;, even though the system behaved correctly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Kafka Retries Explained
&lt;/h2&gt;

&lt;p&gt;Kafka producers support automatic retries to handle transient failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kafka Retry Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;
&lt;span class="py"&gt;retry.backoff.ms&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How Kafka Retries Work
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Producer sends a record.&lt;/li&gt;
&lt;li&gt;If it receives an error (or timeout), it retries.&lt;/li&gt;
&lt;li&gt;This continues until:

&lt;ul&gt;
&lt;li&gt;Retry count is exhausted, or&lt;/li&gt;
&lt;li&gt;The send succeeds&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
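&lt;p&gt;The retry loop above can be sketched in plain Java. The failing operation here is a stand-in that simulates two transient errors before success; it is not the real client API:&lt;/p&gt;

```java
public class RetryWithBackoff {
    static int attempts = 0;

    // Stand-in for a send that hits a transient error twice, then succeeds.
    static boolean trySend() {
        attempts++;
        return attempts > 2;
    }

    // Sketch of the retry loop: on failure, wait backoffMs and try again
    // until the retry budget is exhausted (mirrors retries/retry.backoff.ms).
    public static boolean sendWithRetries(int retries, long backoffMs)
            throws InterruptedException {
        int remaining = retries;
        while (true) {
            if (trySend()) {
                return true; // acknowledged
            }
            if (remaining == 0) {
                return false; // retry budget exhausted
            }
            remaining--;
            Thread.sleep(backoffMs); // retry.backoff.ms
        }
    }

    public static void main(String[] args) throws InterruptedException {
        boolean ok = sendWithRetries(3, 10);
        System.out.println(ok + " after " + attempts + " attempts"); // true after 3 attempts
    }
}
```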




&lt;h2&gt;
  
  
  When Do Kafka Retries Trigger?
&lt;/h2&gt;

&lt;p&gt;Retries typically happen in scenarios like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temporary network failures&lt;/li&gt;
&lt;li&gt;Leader broker not available&lt;/li&gt;
&lt;li&gt;&lt;code&gt;NOT_ENOUGH_REPLICAS&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;REQUEST_TIMED_OUT&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are &lt;strong&gt;recoverable errors&lt;/strong&gt;, making retries useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problem with Kafka Retries: Duplicate Messages
&lt;/h2&gt;

&lt;p&gt;Retries improve reliability but introduce a major issue:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Duplicate message production&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because the producer cannot always distinguish between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A failed write&lt;/li&gt;
&lt;li&gt;A successful write with lost acknowledgment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So retrying can result in:&lt;/p&gt;

&lt;p&gt;Message A → written&lt;br&gt;&lt;br&gt;
Retry Message A → written again  &lt;/p&gt;


&lt;h2&gt;
  
  
  Message Ordering Issues with Retries
&lt;/h2&gt;

&lt;p&gt;Retries can also impact ordering.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message A is sent&lt;/li&gt;
&lt;li&gt;Message B is sent&lt;/li&gt;
&lt;li&gt;A fails and is retried later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now B might be written &lt;strong&gt;before&lt;/strong&gt; A retry.&lt;/p&gt;

&lt;p&gt;👉 This can break ordering guarantees.&lt;/p&gt;

&lt;p&gt;Kafka controls this using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;max.in.flight.requests.per.connection&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But retries alone cannot guarantee correctness.&lt;/p&gt;




&lt;h2&gt;
  
  
  Kafka Idempotent Producer
&lt;/h2&gt;

&lt;p&gt;To solve duplicate messages in Kafka, we use:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotent Producer&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Idempotence in Kafka?
&lt;/h2&gt;

&lt;p&gt;Idempotence means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sending the same message multiple times results in it being written only once.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Kafka:&lt;/p&gt;

&lt;p&gt;👉 Even if retries happen, duplicate messages are &lt;strong&gt;not stored&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Kafka Idempotent Producer Works
&lt;/h2&gt;

&lt;p&gt;Kafka ensures idempotency using:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Producer ID (PID)
&lt;/h3&gt;

&lt;p&gt;Each producer gets a unique identifier from the broker.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Sequence Numbers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each message has a sequence number per partition&lt;/li&gt;
&lt;li&gt;Broker tracks the latest sequence number&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Duplicate Detection
&lt;/h3&gt;

&lt;p&gt;On retry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same sequence number is sent&lt;/li&gt;
&lt;li&gt;Broker detects duplicate&lt;/li&gt;
&lt;li&gt;Duplicate message is discarded&lt;/li&gt;
&lt;/ul&gt;
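&lt;p&gt;The duplicate-detection idea can be sketched in plain Java. This is a simplified model with hypothetical names; a real broker also validates that sequence numbers advance in order per producer ID and partition:&lt;/p&gt;

```java
import java.util.HashMap;

public class BrokerDedupSketch {
    // Per producer-and-partition key, remember the last sequence number
    // and discard retries that resend it. Raw HashMap keeps the sketch
    // free of angle brackets; values are Integer sequence numbers.
    private final HashMap lastSeq = new HashMap();

    /** Returns true if the record is appended, false if discarded as duplicate. */
    public boolean append(long producerId, int partition, int sequence) {
        String key = producerId + "-" + partition;
        Integer last = (Integer) lastSeq.get(key);
        if (last != null) {
            if (sequence == last.intValue()) {
                return false; // retry of an already-written record: discard
            }
        }
        lastSeq.put(key, Integer.valueOf(sequence));
        return true;
    }

    public static void main(String[] args) {
        BrokerDedupSketch broker = new BrokerDedupSketch();
        System.out.println(broker.append(42L, 0, 0)); // true: first write
        System.out.println(broker.append(42L, 0, 0)); // false: duplicate retry
        System.out.println(broker.append(42L, 0, 1)); // true: next sequence
    }
}
```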




&lt;h2&gt;
  
  
  Enable Idempotent Producer in Kafka
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;enable.idempotence&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is enough to enable duplicate protection.&lt;/p&gt;




&lt;h2&gt;
  
  
  Important Kafka Config Changes with Idempotence
&lt;/h2&gt;

&lt;p&gt;When idempotence is enabled, Kafka automatically enforces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;acks&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;
&lt;span class="py"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Integer.MAX_VALUE&lt;/span&gt;
&lt;span class="py"&gt;max.in.flight.requests.per.connection&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;
&lt;span class="c"&gt;# Note: For idempotent producers, this number should be ≤5 to preserve ordering and ensure no duplicates
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why These Settings Matter
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;acks=all&lt;/code&gt; → ensures durability
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;retries=Integer.MAX_VALUE&lt;/code&gt; → safe retry mechanism
&lt;/li&gt;
&lt;li&gt;limited in-flight requests (≤5) → preserves ordering
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Scope of Idempotent Producer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What It Guarantees
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No duplicate messages per partition&lt;/li&gt;
&lt;li&gt;Safe retries&lt;/li&gt;
&lt;li&gt;Ordering guarantees (with correct config)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  What It Does Not Guarantee
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No duplicates across producers&lt;/li&gt;
&lt;li&gt;No duplicates across restarts&lt;/li&gt;
&lt;li&gt;End-to-end exactly-once processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For that, Kafka provides transactions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Kafka Retries vs Idempotent Producer
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Without Idempotence&lt;/th&gt;
&lt;th&gt;With Idempotence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Retries&lt;/td&gt;
&lt;td&gt;Can create duplicates&lt;/td&gt;
&lt;td&gt;Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reliability&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ordering&lt;/td&gt;
&lt;td&gt;Can break&lt;/td&gt;
&lt;td&gt;Preserved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Recommended Kafka Producer Configuration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;acks&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;
&lt;span class="py"&gt;enable.idempotence&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Integer.MAX_VALUE&lt;/span&gt;
&lt;span class="py"&gt;retry.backoff.ms&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This setup ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High reliability&lt;/li&gt;
&lt;li&gt;No duplicate messages&lt;/li&gt;
&lt;li&gt;Strong durability guarantees&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When to Use Idempotent Producer
&lt;/h2&gt;

&lt;p&gt;Use idempotent producers in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Payment systems&lt;/li&gt;
&lt;li&gt;Order processing&lt;/li&gt;
&lt;li&gt;Inventory management&lt;/li&gt;
&lt;li&gt;Critical event-driven systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In modern Kafka setups:&lt;/p&gt;

&lt;p&gt;👉 It should almost always be enabled.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Kafka retries are essential for handling transient failures, but they introduce the risk of duplicate messages.&lt;/p&gt;

&lt;p&gt;Idempotent producers eliminate this risk by making retries safe.&lt;/p&gt;

&lt;p&gt;Together, they ensure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reliable message delivery&lt;/li&gt;
&lt;li&gt;No duplication&lt;/li&gt;
&lt;li&gt;Strong consistency at the producer level&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Kafka retries help recover from failures but can cause duplicate messages.&lt;/p&gt;

&lt;p&gt;Idempotent producers solve this by ensuring each message is written only once per partition (within a single producer session).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retries improve fault tolerance&lt;/li&gt;
&lt;li&gt;Idempotence ensures correctness&lt;/li&gt;
&lt;li&gt;Together they enable reliable Kafka pipelines&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you found this useful, feel free to leave a comment. I always appreciate feedback and different perspectives.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>distributedsystems</category>
      <category>java</category>
      <category>backend</category>
    </item>
    <item>
      <title>Kafka Producer Acks Explained: Replicas, ISR, and Write Guarantees</title>
      <dc:creator>Rajeev</dc:creator>
      <pubDate>Thu, 05 Mar 2026 16:55:13 +0000</pubDate>
      <link>https://dev.to/rajeev_a954661bb78eb9797f/kafka-producer-acks-explained-replicas-isr-and-write-guarantees-496p</link>
      <guid>https://dev.to/rajeev_a954661bb78eb9797f/kafka-producer-acks-explained-replicas-isr-and-write-guarantees-496p</guid>
      <description>&lt;p&gt;In the previous article &lt;a href="https://rajeevranjan.dev/blog/kafka-consumer-graceful-shutdown/" rel="noopener noreferrer"&gt;Kafka Consumer Graceful Shutdown Explained&lt;/a&gt;, we discussed how Kafka consumers should shut down gracefully to avoid duplicate processing and offset inconsistencies.&lt;/p&gt;

&lt;p&gt;Now let’s move to the producer side of Kafka.&lt;/p&gt;

&lt;p&gt;When a producer sends a message, an important question arises:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When should the producer consider the write successful?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kafka answers this through a configuration called &lt;strong&gt;producer acknowledgments&lt;/strong&gt;, commonly referred to as &lt;strong&gt;acks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But before we understand acks, we need to clarify two important Kafka concepts that directly influence reliability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replicas&lt;/li&gt;
&lt;li&gt;ISR (In-Sync Replicas)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These concepts define how Kafka maintains durability and availability in a distributed cluster.&lt;/p&gt;




&lt;h2&gt;
  
  
  Replicas vs ISR in Kafka
&lt;/h2&gt;

&lt;p&gt;When a Kafka topic is created, each partition is replicated across multiple brokers. This replication is controlled by the &lt;strong&gt;replication factor&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;replication.factor = 3&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This means a partition has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;1 Leader replica&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;2 Follower replicas&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together they form the &lt;strong&gt;replica set&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Replicas
&lt;/h3&gt;

&lt;p&gt;Replicas are &lt;strong&gt;all copies of a partition&lt;/strong&gt; across brokers. They are fixed when the topic is created.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Partition-0&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leader: Broker 1&lt;/li&gt;
&lt;li&gt;Follower: Broker 2&lt;/li&gt;
&lt;li&gt;Follower: Broker 3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three together form the replica set.&lt;/p&gt;

&lt;h3&gt;
  
  
  ISR (In-Sync Replicas)
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;ISR&lt;/strong&gt; is a &lt;strong&gt;subset of replicas&lt;/strong&gt; that are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alive&lt;/li&gt;
&lt;li&gt;Fully caught up with the leader&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike replicas, &lt;strong&gt;ISR is dynamic&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If a follower falls behind or becomes unavailable, Kafka removes it from the ISR.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replicas: Broker1, Broker2, Broker3&lt;/li&gt;
&lt;li&gt;ISR: Broker1, Broker2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka uses &lt;strong&gt;ISR&lt;/strong&gt; — not all replicas — when determining if a write is successful.&lt;/p&gt;

&lt;p&gt;This becomes especially important when &lt;strong&gt;acks=all&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Kafka Producer Acknowledgments
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;acks&lt;/code&gt; configuration controls &lt;strong&gt;how many brokers must confirm a write before the producer considers it successful&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There are three main settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;acks=0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;acks=1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;acks=all&lt;/code&gt; (or &lt;code&gt;acks=-1&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each option provides a different balance between &lt;strong&gt;throughput, durability, and latency&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  acks = 0
&lt;/h2&gt;

&lt;p&gt;With this configuration, the producer &lt;strong&gt;does not wait for any acknowledgment from the broker&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The producer simply sends the record and moves on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Characteristics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No confirmation from the broker&lt;/li&gt;
&lt;li&gt;Highest throughput&lt;/li&gt;
&lt;li&gt;Minimal network overhead&lt;/li&gt;
&lt;li&gt;Possible data loss&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Happens Internally
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Producer sends the message.&lt;/li&gt;
&lt;li&gt;Producer immediately considers the write successful.&lt;/li&gt;
&lt;li&gt;Broker response is not waited for.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the leader broker is unavailable or crashes, the producer will not know.&lt;/p&gt;

&lt;h3&gt;
  
  
  Important Note
&lt;/h3&gt;

&lt;p&gt;Retries are ineffective here because the producer never receives failure feedback, so it never knows a retry is needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Typical Use Cases
&lt;/h3&gt;

&lt;p&gt;This configuration is generally used when &lt;strong&gt;data loss is acceptable&lt;/strong&gt;, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs&lt;/li&gt;
&lt;li&gt;Metrics&lt;/li&gt;
&lt;li&gt;Non-critical telemetry data&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  acks = 1
&lt;/h2&gt;

&lt;p&gt;Here the producer waits for &lt;strong&gt;acknowledgment from the leader broker only&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The leader writes the message to its local log and then sends the acknowledgment back to the producer.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Happens Internally
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Producer sends record to leader.&lt;/li&gt;
&lt;li&gt;Leader writes record to its log.&lt;/li&gt;
&lt;li&gt;Leader sends acknowledgment.&lt;/li&gt;
&lt;li&gt;Replication to followers happens &lt;strong&gt;asynchronously&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Risk Scenario
&lt;/h3&gt;

&lt;p&gt;If the leader crashes &lt;strong&gt;before followers replicate the message&lt;/strong&gt;, the data may be lost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Characteristics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Good throughput&lt;/li&gt;
&lt;li&gt;Moderate durability&lt;/li&gt;
&lt;li&gt;Lower latency compared to acks=all&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Historically, &lt;strong&gt;acks=1 was the default configuration in Kafka up to version 2.x&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  acks = all (or acks = -1)
&lt;/h2&gt;

&lt;p&gt;This is the &lt;strong&gt;strongest durability guarantee Kafka offers&lt;/strong&gt; for producers.&lt;/p&gt;

&lt;p&gt;Here the leader waits for &lt;strong&gt;all replicas in the ISR to acknowledge the write&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;However, this works together with another configuration:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;min.insync.replicas&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  min.insync.replicas
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;min.insync.replicas&lt;/code&gt; defines the &lt;strong&gt;minimum number of ISR replicas that must acknowledge a write&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Default value:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;min.insync.replicas = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If the ISR size drops below this value, the broker rejects writes sent with &lt;code&gt;acks=all&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Write Flow with acks=all
&lt;/h2&gt;

&lt;p&gt;When a producer sends a message:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Leader appends the record to its log.&lt;/li&gt;
&lt;li&gt;Leader replicates the record to all ISR members.&lt;/li&gt;
&lt;li&gt;Leader waits for acknowledgments.&lt;/li&gt;
&lt;li&gt;Write succeeds only if:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;ISR size &amp;gt;= min.insync.replicas&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If this condition fails, the broker returns errors such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;NOT_ENOUGH_REPLICAS&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;NOT_ENOUGH_REPLICAS_AFTER_APPEND&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The producer then receives an exception.&lt;/p&gt;
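&lt;p&gt;The acceptance condition can be sketched as a small function. This is a simplified model of the broker-side check, not actual broker code:&lt;/p&gt;

```java
public class AcksAllCheck {
    // Sketch of the broker's check for an acks=all write: the write is
    // accepted only when the current ISR size meets min.insync.replicas.
    public static String handleWrite(int isrSize, int minInsyncReplicas) {
        if (isrSize >= minInsyncReplicas) {
            return "ACK";
        }
        return "NOT_ENOUGH_REPLICAS"; // surfaced to the producer as an exception
    }

    public static void main(String[] args) {
        System.out.println(handleWrite(2, 2)); // ACK
        System.out.println(handleWrite(1, 2)); // NOT_ENOUGH_REPLICAS
    }
}
```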




&lt;h2&gt;
  
  
  Kafka 3.x Default Behavior
&lt;/h2&gt;

&lt;p&gt;This is a detail that often confuses people.&lt;/p&gt;

&lt;p&gt;Starting from Kafka 3.0, the producer defaults changed (KIP-679):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;acks=all&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;enable.idempotence=true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In Kafka 2.x and earlier, the default was &lt;code&gt;acks=1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Idempotent producers &lt;strong&gt;require&lt;/strong&gt; &lt;code&gt;acks=all&lt;/code&gt;; explicitly setting &lt;code&gt;acks=1&lt;/code&gt; while idempotence is enabled is rejected as an invalid configuration. So modern producers effectively run with &lt;code&gt;acks=all&lt;/code&gt; unless idempotence is disabled.&lt;/p&gt;




&lt;h2&gt;
  
  
  Write Availability vs Durability
&lt;/h2&gt;

&lt;p&gt;Now let's connect these settings to &lt;strong&gt;cluster availability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Assume:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;replication.factor = 3&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  With acks=0 or acks=1
&lt;/h3&gt;

&lt;p&gt;As long as the &lt;strong&gt;leader broker is available&lt;/strong&gt;, writes can succeed.&lt;/p&gt;

&lt;p&gt;Even if follower replicas are down, the producer can still write.&lt;/p&gt;




&lt;h3&gt;
  
  
  With acks=all
&lt;/h3&gt;

&lt;p&gt;Write availability now depends on &lt;strong&gt;min.insync.replicas&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Case 1: min.insync.replicas = 1
&lt;/h4&gt;

&lt;p&gt;Writes succeed as long as the &lt;strong&gt;leader is alive&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This means &lt;strong&gt;two brokers can fail&lt;/strong&gt; and writes will still be accepted.&lt;/p&gt;




&lt;h4&gt;
  
  
  Case 2: min.insync.replicas = 2
&lt;/h4&gt;

&lt;p&gt;Now at least:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leader&lt;/li&gt;
&lt;li&gt;One follower&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;must be present in ISR.&lt;/p&gt;

&lt;p&gt;This means &lt;strong&gt;one broker failure can be tolerated&lt;/strong&gt;.&lt;/p&gt;




&lt;h4&gt;
  
  
  Case 3: min.insync.replicas = 3
&lt;/h4&gt;

&lt;p&gt;All replicas must acknowledge the write.&lt;/p&gt;

&lt;p&gt;This means &lt;strong&gt;no broker failure can be tolerated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While technically possible, this setup is rarely used because Kafka systems are designed to tolerate node failures.&lt;/p&gt;

&lt;p&gt;However, it may be used in scenarios where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely high durability is required&lt;/li&gt;
&lt;li&gt;Write unavailability is preferred over potential data loss&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  General Rule for Failure Tolerance
&lt;/h2&gt;

&lt;p&gt;If:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;replication.factor = N&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;min.insync.replicas = M&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then the system can tolerate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;N - M brokers failing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While still accepting writes when using &lt;code&gt;acks=all&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The ISR size can be &lt;strong&gt;at most equal to the replication factor&lt;/strong&gt;, but may shrink dynamically if followers fall behind.&lt;/p&gt;
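&lt;p&gt;The rule above can be captured in a few lines (a hypothetical helper for illustration, not part of any Kafka API):&lt;/p&gt;

```java
public class FailureTolerance {
    // With replication.factor = N and min.insync.replicas = M,
    // acks=all writes keep succeeding while at most N - M brokers are down.
    public static int tolerableFailures(int replicationFactor, int minInsyncReplicas) {
        return replicationFactor - minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(tolerableFailures(3, 1)); // 2: the leader alone is enough
        System.out.println(tolerableFailures(3, 2)); // 1: the common production setup
        System.out.println(tolerableFailures(3, 3)); // 0: any failure blocks writes
    }
}
```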




&lt;h2&gt;
  
  
  Choosing the Right Acknowledgment Strategy
&lt;/h2&gt;

&lt;p&gt;A quick rule of thumb used in many production systems:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Durability&lt;/th&gt;
&lt;th&gt;Typical Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;acks=0&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Logs, metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;acks=1&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;General workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;acks=all&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Financial or critical data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most modern Kafka deployments prefer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;acks=all&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;enable.idempotence=true&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;min.insync.replicas &amp;gt;= 2&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This combination provides strong guarantees without sacrificing too much availability.&lt;/p&gt;
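&lt;p&gt;As a sketch, those preferred settings translate into producer configuration like this. Only &lt;code&gt;java.util.Properties&lt;/code&gt; is used here; in a real application the same object would be passed to &lt;code&gt;new KafkaProducer&amp;lt;&amp;gt;(props)&amp;lt;/code&gt;, and the bootstrap address is a placeholder:&lt;/p&gt;

```java
import java.util.Properties;

public class SafeProducerConfig {
    public static Properties safeDefaults() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address

        // Durability-first settings discussed above:
        props.put("acks", "all");                // wait for all ISR members
        props.put("enable.idempotence", "true"); // default since Kafka 3.0

        // Note: min.insync.replicas >= 2 is a broker/topic setting,
        // not a producer property, so it is not set here.
        return props;
    }
}
```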




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Producer acknowledgments directly impact &lt;strong&gt;data durability and system availability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Understanding the relationship between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replication factor&lt;/li&gt;
&lt;li&gt;ISR&lt;/li&gt;
&lt;li&gt;min.insync.replicas&lt;/li&gt;
&lt;li&gt;Producer acks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;is essential when designing reliable Kafka pipelines.&lt;/p&gt;

&lt;p&gt;Misconfiguring these settings can lead to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data loss&lt;/li&gt;
&lt;li&gt;Unnecessary write failures&lt;/li&gt;
&lt;li&gt;Reduced throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A balanced configuration ensures both &lt;strong&gt;reliability and performance&lt;/strong&gt; in production systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Kafka producer acknowledgments determine when a write is considered successful.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;acks=0 prioritizes throughput but risks data loss&lt;/li&gt;
&lt;li&gt;acks=1 waits for leader acknowledgment&lt;/li&gt;
&lt;li&gt;acks=all ensures replication across ISR members&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combined with replication.factor and min.insync.replicas,&lt;br&gt;
these settings control the balance between durability and availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;Now that we understand acknowledgments and replication, the next step is exploring the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retries&lt;/li&gt;
&lt;li&gt;Idempotent producers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These features build on top of the acknowledgment mechanism and are critical for building robust event-driven systems.&lt;/p&gt;




&lt;p&gt;Originally published on my personal blog:&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://rajeevranjan.dev/blog/kafka-producer-acks-explained/" rel="noopener noreferrer"&gt;https://rajeevranjan.dev/blog/kafka-producer-acks-explained/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>distributedsystems</category>
      <category>java</category>
      <category>backend</category>
    </item>
    <item>
      <title>Kafka Consumer Graceful Shutdown: Handle WakeupException and Commit Offsets Safely</title>
      <dc:creator>Rajeev</dc:creator>
      <pubDate>Thu, 26 Feb 2026 13:07:29 +0000</pubDate>
      <link>https://dev.to/rajeev_a954661bb78eb9797f/kafka-consumer-graceful-shutdown-handle-wakeupexception-and-commit-offsets-safely-2a93</link>
      <guid>https://dev.to/rajeev_a954661bb78eb9797f/kafka-consumer-graceful-shutdown-handle-wakeupexception-and-commit-offsets-safely-2a93</guid>
      <description>&lt;p&gt;In the previous article, we looked at how rebalancing works and why consumers pause during deployments.&lt;/p&gt;

&lt;p&gt;Now let’s look at something that causes even more real production issues:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you gracefully shut down a Kafka consumer without losing work?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In real systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pods restart
&lt;/li&gt;
&lt;li&gt;Deployments happen
&lt;/li&gt;
&lt;li&gt;Servers receive SIGTERM
&lt;/li&gt;
&lt;li&gt;Auto-scaling removes instances
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If shutdown is not handled properly, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reprocess messages
&lt;/li&gt;
&lt;li&gt;Lose uncommitted offsets
&lt;/li&gt;
&lt;li&gt;Cause inconsistent state
&lt;/li&gt;
&lt;li&gt;Trigger unnecessary rebalances
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka does not handle this automatically for you.&lt;br&gt;&lt;br&gt;
You have to shut down the consumer correctly.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Graceful Shutdown Matters
&lt;/h2&gt;

&lt;p&gt;A typical Kafka consumer runs inside a loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;ConsumerRecords&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;poll&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofMillis&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now imagine the application is terminated while:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Records are already processed
&lt;/li&gt;
&lt;li&gt;Offsets are not committed yet
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the consumer restarts, Kafka will deliver those records again.&lt;/p&gt;

&lt;p&gt;That may be acceptable.&lt;/p&gt;

&lt;p&gt;But in many systems, it creates duplicate writes, duplicate payments, or inconsistent updates.&lt;/p&gt;

&lt;p&gt;A proper Kafka consumer graceful shutdown should:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stop polling new records
&lt;/li&gt;
&lt;li&gt;Finish processing current records
&lt;/li&gt;
&lt;li&gt;Commit offsets safely
&lt;/li&gt;
&lt;li&gt;Close the consumer cleanly
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Main Problem: &lt;code&gt;poll()&lt;/code&gt; Blocks
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;poll()&lt;/code&gt; method blocks while waiting for records.&lt;/p&gt;

&lt;p&gt;If your application receives a shutdown signal while the thread is inside &lt;code&gt;poll()&lt;/code&gt;, how do you stop it?&lt;/p&gt;

&lt;p&gt;You should not kill the thread.&lt;/p&gt;

&lt;p&gt;Kafka provides a safe mechanism for this.&lt;/p&gt;




&lt;h2&gt;
  
  
  WakeupException in Kafka
&lt;/h2&gt;

&lt;p&gt;Kafka provides this method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;wakeup&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When called from another thread, it interrupts the blocking &lt;code&gt;poll()&lt;/code&gt; call.&lt;/p&gt;

&lt;p&gt;Inside the consumer thread, &lt;code&gt;poll()&lt;/code&gt; throws:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WakeupException
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the official, safe way to interrupt a Kafka consumer.&lt;/p&gt;

&lt;p&gt;Avoid ad-hoc alternatives such as interrupting or killing the consumer thread.&lt;/p&gt;




&lt;h2&gt;
  
  
  Correct Kafka Consumer Graceful Shutdown Pattern
&lt;/h2&gt;

&lt;p&gt;Let’s build this properly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Add a running flag
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;volatile&lt;/span&gt; &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Consumer implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GracefulKafkaConsumer&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;Runnable&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;volatile&lt;/span&gt; &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;GracefulKafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;subscribe&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my-topic"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;running&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;ConsumerRecords&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
                    &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;poll&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofMillis&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;

                &lt;span class="c1"&gt;// Commit after processing&lt;/span&gt;
                &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;commitSync&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;WakeupException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Expected during shutdown&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;running&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;commitSync&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// final safety commit&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// leave group cleanly&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;wakeup&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Processing: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What Is Happening Here?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;running&lt;/code&gt; controls the loop.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;consumer.wakeup()&lt;/code&gt; interrupts &lt;code&gt;poll()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WakeupException&lt;/code&gt; is expected during shutdown.&lt;/li&gt;
&lt;li&gt;Offsets are committed after processing.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;consumer.close()&lt;/code&gt; triggers proper group leave.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures offsets are committed safely before exit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Not Just Kill the Thread?
&lt;/h2&gt;

&lt;p&gt;If you forcefully stop the thread:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Offsets may not be committed
&lt;/li&gt;
&lt;li&gt;The consumer may not leave the group cleanly
&lt;/li&gt;
&lt;li&gt;Rebalances may be delayed
&lt;/li&gt;
&lt;li&gt;Duplicate processing may increase
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Always use &lt;code&gt;consumer.wakeup()&lt;/code&gt; for graceful shutdown.&lt;/p&gt;




&lt;h2&gt;
  
  
  Manual Commit vs Auto Commit
&lt;/h2&gt;

&lt;p&gt;If you are using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;enable.auto.commit=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Offsets are committed periodically.&lt;/p&gt;

&lt;p&gt;But auto commit does not guarantee that processing is finished before commit.&lt;/p&gt;

&lt;p&gt;For production systems, it is usually safer to use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;enable.auto.commit=false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And commit offsets only after successful processing.&lt;/p&gt;
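&lt;p&gt;In configuration terms, that looks like the following sketch. Only &lt;code&gt;java.util.Properties&lt;/code&gt; is used; in a real application the object would be passed to the &lt;code&gt;KafkaConsumer&lt;/code&gt; constructor, and the bootstrap address and group id are placeholders:&lt;/p&gt;

```java
import java.util.Properties;

public class ManualCommitConfig {
    public static Properties manualCommit() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "my-group");                // placeholder

        // Disable periodic background commits; the application calls
        // commitSync() itself, only after records are fully processed.
        props.put("enable.auto.commit", "false");
        return props;
    }
}
```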

&lt;p&gt;If you want a deeper explanation of offset commits, I covered it here:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://rajeevranjan.dev/blog/kafka-consumer-auto-commit-misunderstood/" rel="noopener noreferrer"&gt;Kafka Auto Commit Explained (At-Least-Once Processing)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Offset management and graceful shutdown are closely connected.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Quick Note on Long Processing
&lt;/h2&gt;

&lt;p&gt;If your message processing takes a long time, Kafka may trigger a rebalance even if the consumer is still running.&lt;/p&gt;

&lt;p&gt;This usually relates to how frequently &lt;code&gt;poll()&lt;/code&gt; is called and certain consumer configurations.&lt;/p&gt;

&lt;p&gt;This topic deserves a separate discussion because it directly impacts stability in high-throughput systems.&lt;/p&gt;

&lt;p&gt;We’ll explore it properly in a dedicated article.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connecting Shutdown to JVM Exit
&lt;/h2&gt;

&lt;p&gt;You can connect shutdown logic like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;GracefulKafkaConsumer&lt;/span&gt; &lt;span class="n"&gt;consumerRunnable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;GracefulKafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="nc"&gt;Thread&lt;/span&gt; &lt;span class="n"&gt;consumerThread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;consumerRunnable&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;consumerThread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;Thread&lt;/span&gt; &lt;span class="n"&gt;mainThread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="nc"&gt;Runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getRuntime&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;addShutdownHook&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;consumerRunnable&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;shutdown&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;mainThread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;InterruptedException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;interrupt&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the application receives a termination signal, the consumer exits safely.&lt;/p&gt;




&lt;h2&gt;
  
  
  How This Relates to Rebalancing
&lt;/h2&gt;

&lt;p&gt;In the previous article(👉 &lt;a href="https://rajeevranjan.dev/blog/kafka-eager-vs-cooperative-rebalancing/" rel="noopener noreferrer"&gt;Read: Kafka Eager vs Cooperative Rebalancing Explained&lt;/a&gt;), we discussed eager vs cooperative rebalancing.&lt;/p&gt;

&lt;p&gt;Improper shutdown can trigger unnecessary rebalances.&lt;/p&gt;

&lt;p&gt;A clean shutdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduces duplicate processing
&lt;/li&gt;
&lt;li&gt;Leaves the consumer group smoothly
&lt;/li&gt;
&lt;li&gt;Minimizes disruption during deployments
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Graceful shutdown is not optional in production systems.&lt;/p&gt;

&lt;p&gt;It is part of building reliable Kafka consumers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Kafka consumer graceful shutdown is simple in concept, but easy to get wrong.&lt;/p&gt;

&lt;p&gt;The correct approach is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stop polling
&lt;/li&gt;
&lt;li&gt;Finish processing
&lt;/li&gt;
&lt;li&gt;Commit offsets safely
&lt;/li&gt;
&lt;li&gt;Close the consumer properly
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you skip any of these steps, you increase the chance of duplicate processing or inconsistent state.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;Next, we’ll explore Kafka producer internals — specifically how &lt;code&gt;acks&lt;/code&gt;, retries, and idempotent producers impact reliability and delivery guarantees in production systems.&lt;/p&gt;




&lt;p&gt;Originally published on my personal blog:&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://rajeevranjan.dev/blog/kafka-consumer-graceful-shutdown/" rel="noopener noreferrer"&gt;https://rajeevranjan.dev/blog/kafka-consumer-graceful-shutdown/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>distributedsystems</category>
      <category>java</category>
      <category>backend</category>
    </item>
    <item>
      <title>Kafka Eager vs Cooperative Rebalancing Explained (Why Consumers Pause During Deployments)</title>
      <dc:creator>Rajeev</dc:creator>
      <pubDate>Thu, 19 Feb 2026 17:30:18 +0000</pubDate>
      <link>https://dev.to/rajeev_a954661bb78eb9797f/kafka-eager-vs-cooperative-rebalancing-explained-why-consumers-pause-during-deployments-4dn9</link>
      <guid>https://dev.to/rajeev_a954661bb78eb9797f/kafka-eager-vs-cooperative-rebalancing-explained-why-consumers-pause-during-deployments-4dn9</guid>
      <description>&lt;p&gt;If you search about Kafka rebalancing, you’ll often find something like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Kafka automatically redistributes partitions when consumers join or leave.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That statement is correct.&lt;br&gt;&lt;br&gt;
But it hides what really happens during that redistribution.&lt;/p&gt;

&lt;p&gt;In real systems, rebalancing can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pause message processing
&lt;/li&gt;
&lt;li&gt;Increase latency
&lt;/li&gt;
&lt;li&gt;Trigger duplicate processing
&lt;/li&gt;
&lt;li&gt;Disrupt rolling deployments
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most developers only notice it when their consumers suddenly stop for a few seconds during a deployment.&lt;/p&gt;

&lt;p&gt;In this article, we’ll look at what actually happens during a rebalance, how eager rebalancing works, how cooperative rebalancing improves it, and why the difference matters in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Does Rebalancing Happen?
&lt;/h2&gt;

&lt;p&gt;Rebalancing is triggered whenever consumer group membership changes.&lt;/p&gt;

&lt;p&gt;Common triggers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A new consumer instance starts
&lt;/li&gt;
&lt;li&gt;A consumer crashes
&lt;/li&gt;
&lt;li&gt;A deployment restarts pods
&lt;/li&gt;
&lt;li&gt;Topic partitions are increased
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka must redistribute partitions among active consumers.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 consumers
&lt;/li&gt;
&lt;li&gt;6 partitions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each gets 2 partitions.&lt;/p&gt;

&lt;p&gt;If one consumer goes down, Kafka must reassign its partitions to the remaining consumers.&lt;/p&gt;

&lt;p&gt;That reassignment process is the rebalance.&lt;/p&gt;

&lt;p&gt;What matters is &lt;strong&gt;how&lt;/strong&gt; that reassignment happens.&lt;/p&gt;
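&lt;p&gt;To make the redistribution concrete, here is a small stdlib-only simulation of round-robin assignment (illustrative only; this is not how Kafka's assignors are implemented):&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class AssignmentDemo {
    // Distribute partitions 0..partitions-1 over consumers round-robin.
    public static TreeMap<String, List<Integer>> assign(List<String> consumers, int partitions) {
        TreeMap<String, List<Integer>> out = new TreeMap<>();
        consumers.forEach(c -> out.put(c, new ArrayList<>()));
        for (int p = 0; p < partitions; p++) {
            out.get(consumers.get(p % consumers.size())).add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        // 3 consumers, 6 partitions: each consumer gets 2 partitions.
        System.out.println(assign(List.of("c1", "c2", "c3"), 6));
        // One consumer leaves: the same 6 partitions are reassigned, 3 each.
        System.out.println(assign(List.of("c1", "c2"), 6));
    }
}
```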




&lt;h2&gt;
  
  
  Eager Rebalancing: Stop Everything
&lt;/h2&gt;

&lt;p&gt;In the traditional model (used for years), Kafka performs what is called eager rebalancing.&lt;/p&gt;

&lt;p&gt;When a rebalance is triggered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All consumers stop fetching records
&lt;/li&gt;
&lt;li&gt;All partitions are revoked from all consumers
&lt;/li&gt;
&lt;li&gt;Kafka computes a new assignment
&lt;/li&gt;
&lt;li&gt;Partitions are reassigned
&lt;/li&gt;
&lt;li&gt;Consumers resume processing
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every consumer pauses.&lt;/p&gt;

&lt;p&gt;Even partitions that did not need to move are temporarily revoked.&lt;/p&gt;

&lt;p&gt;This is why Kafka consumers pause during a rebalance.&lt;/p&gt;

&lt;p&gt;It is effectively a full stop across the entire consumer group.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Becomes Visible in Production
&lt;/h2&gt;

&lt;p&gt;In static environments, this may not feel significant.&lt;/p&gt;

&lt;p&gt;But in dynamic systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-scaling
&lt;/li&gt;
&lt;li&gt;Kubernetes rolling updates
&lt;/li&gt;
&lt;li&gt;Frequent deployments
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rebalances can happen often.&lt;/p&gt;

&lt;p&gt;Suppose you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 consumers
&lt;/li&gt;
&lt;li&gt;12 partitions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During a rolling deployment, one new consumer joins.&lt;/p&gt;

&lt;p&gt;With eager rebalancing, all 4 consumers pause — even though only a few partitions actually need reassignment.&lt;/p&gt;

&lt;p&gt;With cooperative rebalancing, only the affected partitions move. The rest continue processing.&lt;/p&gt;

&lt;p&gt;With eager rebalancing, each event creates a visible pause in processing.&lt;/p&gt;

&lt;p&gt;If offset handling is not carefully managed, this can also increase duplicate processing.&lt;/p&gt;

&lt;p&gt;The system is correct — but not smooth.&lt;/p&gt;
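
&lt;p&gt;The 12-partition example above can be made concrete with a toy calculation. This is plain Java, not Kafka's actual assignor logic; the counts assume an even, sticky assignment:&lt;/p&gt;

```java
public class RebalanceSketch {
    // Hypothetical model: N partitions, a group growing from oldConsumers
    // to newConsumers members. Not Kafka's real assignor code.

    // Eager protocol: every partition is revoked, even ones that end up
    // back on the same consumer afterwards.
    static int eagerRevoked(int partitions) {
        return partitions;
    }

    // Cooperative/sticky protocol: only enough partitions move to give the
    // joining members their fair share.
    static int cooperativeMoved(int partitions, int oldConsumers, int newConsumers) {
        int perMemberFloor = partitions / newConsumers;
        int joined = newConsumers - oldConsumers;
        return perMemberFloor * joined;
    }

    public static void main(String[] args) {
        // 12 partitions, 4 consumers, 1 joins during a rolling deployment:
        System.out.println("eager revokes: " + eagerRevoked(12));               // 12
        System.out.println("cooperative moves: " + cooperativeMoved(12, 4, 5)); // 2
    }
}
```

&lt;p&gt;Under the eager protocol all 12 partitions pause, even though 10 of them end up exactly where they were; the cooperative protocol revokes only the 2 partitions the new member actually needs.&lt;/p&gt;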




&lt;h2&gt;
  
  
  Cooperative Rebalancing: Move Only What’s Necessary
&lt;/h2&gt;

&lt;p&gt;To reduce disruption, Kafka introduced cooperative (incremental) rebalancing in Apache Kafka 2.4.&lt;/p&gt;

&lt;p&gt;The key difference:&lt;/p&gt;

&lt;p&gt;Instead of revoking all partitions from all consumers, Kafka only revokes the partitions that actually need to move.&lt;/p&gt;

&lt;p&gt;Other partitions continue processing.&lt;/p&gt;

&lt;p&gt;The rebalance happens incrementally rather than all at once.&lt;/p&gt;

&lt;p&gt;The practical result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No full stop
&lt;/li&gt;
&lt;li&gt;Reduced latency spikes
&lt;/li&gt;
&lt;li&gt;Smoother scaling
&lt;/li&gt;
&lt;li&gt;Less disruption during deployments
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system still rebalances — but without unnecessary interruption.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Enables Cooperative Rebalancing?
&lt;/h2&gt;

&lt;p&gt;Rebalancing behavior depends on the partition assignment strategy.&lt;/p&gt;

&lt;p&gt;Older strategies, which use the eager protocol:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;RangeAssignor&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RoundRobinAssignor&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For cooperative rebalancing, you use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That configuration enables incremental rebalancing.&lt;/p&gt;

&lt;p&gt;The change is small in configuration, but significant in behavior.&lt;/p&gt;
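
&lt;p&gt;In code, the switch is a single consumer property. A minimal sketch; the broker address and group id are placeholders for illustration:&lt;/p&gt;

```java
import java.util.Properties;

public class CooperativeConfig {
    // Minimal consumer configuration sketch. Only the assignment strategy
    // line is the point here; the rest is standard consumer setup.
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "orders-service");          // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // The one line that switches the group to incremental rebalancing:
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        return props;
    }
}
```

&lt;p&gt;Note that all members of a consumer group must support the cooperative protocol at the same time; the Kafka upgrade notes describe a safe rolling procedure for switching an existing group away from an eager assignor.&lt;/p&gt;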




&lt;h2&gt;
  
  
  Why This Connects to Offset Management
&lt;/h2&gt;

&lt;p&gt;Rebalancing and offset commits are closely related.&lt;/p&gt;

&lt;p&gt;When partitions are revoked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You may still be processing records
&lt;/li&gt;
&lt;li&gt;Offsets may or may not be committed
&lt;/li&gt;
&lt;li&gt;Improper handling can lead to duplicates &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you haven't explored how offset commits work, I explained it in detail here:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://rajeevranjan.dev/blog/kafka-consumer-auto-commit-misunderstood" rel="noopener noreferrer"&gt;Kafka Auto Commit Explained (At-Least-Once Processing)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even cooperative rebalancing does not eliminate this responsibility.&lt;/p&gt;

&lt;p&gt;It only reduces disruption.&lt;/p&gt;

&lt;p&gt;Correct offset handling is still critical.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Should You Prefer Cooperative Rebalancing?
&lt;/h2&gt;

&lt;p&gt;You should strongly consider it if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You deploy frequently
&lt;/li&gt;
&lt;li&gt;You auto-scale consumers
&lt;/li&gt;
&lt;li&gt;You run multiple instances
&lt;/li&gt;
&lt;li&gt;You process high-throughput topics
&lt;/li&gt;
&lt;li&gt;You care about latency stability
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In modern distributed systems, these are common conditions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Rebalancing is often treated as an internal Kafka detail.&lt;/p&gt;

&lt;p&gt;In practice, it directly affects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Throughput
&lt;/li&gt;
&lt;li&gt;Latency
&lt;/li&gt;
&lt;li&gt;Duplicate processing
&lt;/li&gt;
&lt;li&gt;Deployment stability
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding how rebalancing works — and which strategy you are using — is part of building production-ready Kafka consumers.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;If you're following along, next we’ll cover graceful shutdown of a Kafka consumer — including how to handle &lt;code&gt;WakeupException&lt;/code&gt;, commit offsets safely, and close the consumer without losing work.&lt;/p&gt;

&lt;p&gt;That’s where many real-world production issues actually happen.&lt;/p&gt;




&lt;p&gt;Originally published on my personal blog:&lt;br&gt;&lt;br&gt;
🔗 &lt;a href="https://rajeevranjan.dev/blog/kafka-eager-vs-cooperative-rebalancing/" rel="noopener noreferrer"&gt;https://rajeevranjan.dev/blog/kafka-eager-vs-cooperative-rebalancing/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>distributedsystems</category>
      <category>java</category>
      <category>backend</category>
    </item>
    <item>
      <title>Kafka Consumer Auto-Commit: Why 'At-Least-Once' Is Often Misunderstood</title>
      <dc:creator>Rajeev</dc:creator>
      <pubDate>Wed, 11 Feb 2026 17:23:39 +0000</pubDate>
      <link>https://dev.to/rajeev_a954661bb78eb9797f/kafka-consumer-auto-commit-why-at-least-once-is-often-misunderstood-15hn</link>
      <guid>https://dev.to/rajeev_a954661bb78eb9797f/kafka-consumer-auto-commit-why-at-least-once-is-often-misunderstood-15hn</guid>
      <description>&lt;p&gt;If you search online, you’ll often see statements like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Kafka consumers provide at-least-once delivery when auto-commit is enabled.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While this statement is not entirely wrong, it is dangerously incomplete. Many production issues happen because developers take this at face value without understanding &lt;em&gt;when&lt;/em&gt; offsets are committed and &lt;em&gt;what exactly&lt;/em&gt; “at-least-once” means in practice.&lt;/p&gt;

&lt;p&gt;In this article, we’ll look at what actually happens inside a Kafka consumer when auto-commit is enabled, why failures can still cause data loss, and when auto-commit is truly safe to use.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Auto-Commit Feels Safe
&lt;/h2&gt;

&lt;p&gt;By default, the Kafka Java consumer has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;enable.auto.commit = true&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;auto.commit.interval.ms = 5000&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives a comforting impression:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka periodically commits offsets
&lt;/li&gt;
&lt;li&gt;If a consumer crashes, it restarts
&lt;/li&gt;
&lt;li&gt;Messages should be reprocessed
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So it feels like at-least-once delivery is guaranteed.&lt;/p&gt;

&lt;p&gt;But the real question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;At least once relative to what? Polling? Processing? Business logic?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Kafka only tracks polling. It does not track what your application does after that.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Kafka Actually Commits
&lt;/h2&gt;

&lt;p&gt;Kafka commits &lt;strong&gt;offsets&lt;/strong&gt;, not messages.&lt;/p&gt;

&lt;p&gt;An offset simply means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The next record the consumer should read.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When auto-commit is enabled, the consumer periodically commits the &lt;strong&gt;latest offsets returned by &lt;code&gt;poll()&lt;/code&gt;&lt;/strong&gt;, regardless of whether your application has finished processing those records.&lt;/p&gt;

&lt;p&gt;Kafka does not know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether you processed the record
&lt;/li&gt;
&lt;li&gt;Whether processing succeeded or failed
&lt;/li&gt;
&lt;li&gt;Whether your database write completed
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From Kafka’s perspective, once &lt;code&gt;poll()&lt;/code&gt; returns records, those offsets are eligible for commit.&lt;/p&gt;
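
&lt;p&gt;One detail worth pinning down: the committed value is a position, always "last polled offset + 1". When people informally say "offset 120 is committed", the stored commit position is 121. A one-line sketch of that arithmetic:&lt;/p&gt;

```java
public class OffsetSemantics {
    // Kafka stores "the next offset to read", not "the last offset processed".
    static long commitPosition(long lastPolledOffset) {
        return lastPolledOffset + 1;
    }

    public static void main(String[] args) {
        // After poll() returns records 100..120, the commit position is 121,
        // so a restarted consumer resumes at record 121.
        System.out.println(commitPosition(120)); // prints 121
    }
}
```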




&lt;h2&gt;
  
  
  The Actual Timeline (Critical to Understand)
&lt;/h2&gt;

&lt;p&gt;Let’s walk through a realistic scenario:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Consumer calls &lt;code&gt;poll()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Kafka returns records with offsets &lt;code&gt;100–120&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Auto-commit timer fires&lt;/li&gt;
&lt;li&gt;Offset &lt;code&gt;120&lt;/code&gt; is committed&lt;/li&gt;
&lt;li&gt;Your application is still processing records&lt;/li&gt;
&lt;li&gt;The consumer crashes (OOM, JVM kill, container restart)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What happens next?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka sees offset &lt;code&gt;120&lt;/code&gt; as committed
&lt;/li&gt;
&lt;li&gt;On restart, the consumer resumes from &lt;code&gt;121&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Records &lt;code&gt;100–120&lt;/code&gt; are never re-read
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From your application’s point of view, those messages are effectively lost.&lt;/p&gt;
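
&lt;p&gt;The timeline above can be simulated without a broker. This is a toy model in plain Java with hypothetical offsets, not the real consumer internals:&lt;/p&gt;

```java
import java.util.HashSet;
import java.util.Set;

public class AutoCommitLossSketch {
    // Toy model of the crash timeline: poll() returned firstPolled..lastPolled,
    // the auto-commit timer fired (commit position = lastPolled + 1), and the
    // consumer crashed after processing only the offsets in `processed`.
    // Returns how many polled-but-unprocessed records are never re-read.
    static long lostRecords(long firstPolled, long lastPolled, Set<Long> processed) {
        long resumeAt = lastPolled + 1; // the committed position survives the crash
        long lost = 0;
        for (long o = firstPolled; o < resumeAt; o++) {
            if (!processed.contains(o)) lost++; // restart resumes past it, so it is skipped forever
        }
        return lost;
    }

    public static void main(String[] args) {
        Set<Long> processed = new HashSet<>();
        processed.add(100L);
        processed.add(101L); // the crash hit before the rest of 100..120 finished

        System.out.println(lostRecords(100, 120, processed)); // prints 19
    }
}
```

&lt;p&gt;Nineteen records were polled, counted as committed, and never processed — exactly the silent loss described above.&lt;/p&gt;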




&lt;h2&gt;
  
  
  Why Is This Still Called “At-Least-Once”?
&lt;/h2&gt;

&lt;p&gt;Because Kafka’s guarantee is scoped narrowly.&lt;/p&gt;

&lt;p&gt;Kafka guarantees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Records returned by &lt;code&gt;poll()&lt;/code&gt; will be delivered at least once to the consumer
&lt;/li&gt;
&lt;li&gt;Because committed offsets never run ahead of what &lt;code&gt;poll()&lt;/code&gt; has already returned
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka does &lt;strong&gt;not&lt;/strong&gt; guarantee:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At-least-once processing
&lt;/li&gt;
&lt;li&gt;At-least-once database writes
&lt;/li&gt;
&lt;li&gt;At-least-once business side effects
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This distinction is often overlooked.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real-World Problem
&lt;/h2&gt;

&lt;p&gt;In real systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing may be asynchronous
&lt;/li&gt;
&lt;li&gt;There are database writes, API calls, retries
&lt;/li&gt;
&lt;li&gt;Processing can take seconds or even minutes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Auto-commit is time-based, not processing-based.&lt;/p&gt;

&lt;p&gt;So the larger the gap between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Polling the record
&lt;/li&gt;
&lt;li&gt;Finishing business processing
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;the higher the risk of data loss if the consumer crashes.&lt;/p&gt;

&lt;p&gt;This is why teams sometimes observe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing records
&lt;/li&gt;
&lt;li&gt;Inconsistent aggregates
&lt;/li&gt;
&lt;li&gt;Silent data loss after restarts
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka usually behaves correctly — the misunderstanding is in how the guarantees are interpreted.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Auto-Commit Is Actually Safe
&lt;/h2&gt;

&lt;p&gt;Auto-commit can be acceptable when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Processing is very fast
&lt;/li&gt;
&lt;li&gt;Processing is idempotent
&lt;/li&gt;
&lt;li&gt;Losing a small number of records is acceptable
&lt;/li&gt;
&lt;li&gt;The consumer does not maintain critical state
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Typical examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics collection
&lt;/li&gt;
&lt;li&gt;Log aggregation
&lt;/li&gt;
&lt;li&gt;Monitoring events
&lt;/li&gt;
&lt;li&gt;Best-effort analytics
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these cases, simplicity may outweigh strict correctness.&lt;/p&gt;




&lt;h2&gt;
  
  
  When You Should Avoid Auto-Commit
&lt;/h2&gt;

&lt;p&gt;Avoid auto-commit when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You write to a database
&lt;/li&gt;
&lt;li&gt;You update business state
&lt;/li&gt;
&lt;li&gt;You perform non-idempotent operations
&lt;/li&gt;
&lt;li&gt;You require strong delivery guarantees
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these situations, manual offset management provides better control:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Process the record
&lt;/li&gt;
&lt;li&gt;Ensure processing succeeds
&lt;/li&gt;
&lt;li&gt;Commit the offset explicitly
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It adds complexity, but it aligns offset commits with business success.&lt;/p&gt;
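
&lt;p&gt;The three steps above can be sketched as a commit-after-process loop. This is a broker-free simulation of the pattern, not real &lt;code&gt;KafkaConsumer&lt;/code&gt; code (there you would call &lt;code&gt;commitSync()&lt;/code&gt; after processing succeeds):&lt;/p&gt;

```java
import java.util.function.LongPredicate;

public class ManualCommitSketch {
    // Commit-after-process pattern, simulated without a broker.
    // Processes offsets firstOffset..lastOffset with `handler` (which may
    // fail, returning false) and returns the resulting commit position.
    static long processBatch(long firstOffset, long lastOffset,
                             long committed, LongPredicate handler) {
        for (long o = firstOffset; o <= lastOffset; o++) {
            if (!handler.test(o)) {
                return committed; // failure: do NOT advance the commit
            }
            committed = o + 1;    // success: commit position moves past this record
        }
        return committed;
    }

    public static void main(String[] args) {
        // Processing fails at offset 110: the commit stays at 110, so
        // offsets 110..120 will be re-delivered after a restart.
        long committed = processBatch(100, 120, 100, o -> o != 110);
        System.out.println(committed); // prints 110
    }
}
```

&lt;p&gt;Because the commit never moves past a failed record, a restart re-reads offset 110 onward: possible duplicates, but no silent loss. That is the at-least-once trade-off.&lt;/p&gt;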




&lt;h2&gt;
  
  
  Manual Commit Is Not Magic Either
&lt;/h2&gt;

&lt;p&gt;Even with manual commits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicates can still happen
&lt;/li&gt;
&lt;li&gt;Rebalances can interrupt processing
&lt;/li&gt;
&lt;li&gt;Commits can fail or be delayed
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka gives delivery guarantees, not business correctness guarantees.&lt;/p&gt;

&lt;p&gt;Production systems should still be designed with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Idempotent processing
&lt;/li&gt;
&lt;li&gt;Clear retry strategies
&lt;/li&gt;
&lt;li&gt;Proper failure handling
&lt;/li&gt;
&lt;/ul&gt;
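
&lt;p&gt;Idempotent processing is the piece that makes redelivered duplicates harmless. A minimal sketch, assuming each record carries a stable id; in a real system the dedup check would live in the datastore itself (for example a unique key constraint), not an in-memory set:&lt;/p&gt;

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentHandlerSketch {
    private final Set<String> seen = new HashSet<>(); // stand-in for a durable dedup store
    private int applied = 0;

    // Apply the side effect only once per record id, so records redelivered
    // after a rebalance or a failed commit do no harm.
    boolean handle(String recordId) {
        if (!seen.add(recordId)) {
            return false; // duplicate delivery: skip
        }
        applied++;        // the real side effect (DB write, API call) goes here
        return true;
    }

    int appliedCount() {
        return applied;
    }
}
```

&lt;p&gt;With this in place, "at-least-once delivery" from Kafka becomes "exactly-once effect" in your system, regardless of which commit strategy you use.&lt;/p&gt;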




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kafka commits offsets, not processing results
&lt;/li&gt;
&lt;li&gt;Auto-commit is tied to &lt;code&gt;poll()&lt;/code&gt;, not your business logic
&lt;/li&gt;
&lt;li&gt;“At-least-once” does not mean “processed at least once”
&lt;/li&gt;
&lt;li&gt;Auto-commit is fine for best-effort use cases
&lt;/li&gt;
&lt;li&gt;For critical systems, explicit offset control is safer
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding this early can prevent subtle production bugs later.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;In the next article, we’ll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The difference between eager and cooperative rebalancing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Read next: &lt;a href="https://rajeevranjan.dev/blog/kafka-eager-vs-cooperative-rebalancing" rel="noopener noreferrer"&gt;Kafka Eager vs Cooperative Rebalancing Explained&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published on my personal blog:&lt;br&gt;&lt;br&gt;
🔗 &lt;a href="https://rajeevranjan.dev/blog/kafka-consumer-auto-commit-misunderstood" rel="noopener noreferrer"&gt;https://rajeevranjan.dev/blog/kafka-consumer-auto-commit-misunderstood&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>java</category>
      <category>distributedsystems</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
