TL;DR
- Kafka guarantees delivery, not uniqueness. Retries are expected.
- Acknowledgements can fail even when a write succeeds, leading to duplicates.
- Producer idempotency prevents duplicate writes to Kafka, not duplicate business effects.
- Application-level idempotency gives each event a stable identity.
- Consumers must assume every message can be a duplicate.
- Databases (and sometimes caches) enforce idempotency at the point of side effects.
- "Exactly-once" is a practical goal, not a perfect guarantee.
- Idempotency doesn't eliminate retries; it makes retries safe.
If you've worked with Kafka long enough, you've probably seen this happen, or you will soon.
A producer sends a message.
The consumer processes it.
The database write succeeds.
And then… something goes wrong.
The acknowledgement doesn't come back.
The network hiccups.
The consumer restarts.
Kafka does what it's designed to do: it retries.
Suddenly, the same message shows up again.
Now you're left staring at duplicated rows, repeated updates, or inconsistent state, wondering:
"But didn't Kafka already process this?"
Here's the uncomfortable truth:
Kafka guarantees delivery, not uniqueness.
Kafka is excellent at making sure messages are not lost. But when failures occur, and they always do in distributed systems, Kafka will retry. And retries mean duplicates, unless your system is designed to handle them.
This is where many systems quietly break.
Not because Kafka failed.
But because the system assumed acknowledgements were reliable.
Understanding the Duplicate Message Scenario
Let's walk through this duplicate-delivery scenario step by step.
- The producer sends a message (Y) to the Kafka broker.
- The broker successfully appends this message to the topic partition. So far, everything is working as expected.
- However, when the broker sends the acknowledgement back to the producer, that acknowledgement fails to reach the producer, perhaps because of a temporary network issue or a timeout. From the producer's point of view, it has no way of knowing whether the message was actually written or not.
- So the producer does the only safe thing it can do: it retries and sends the same message (Y) again.
- The broker receives this retry and, without additional safeguards, appends the message again to the same partition. Now the partition contains two identical messages, even though the producer intended to send only one.
This is an important realization:
The duplication happened not because Kafka is broken, but because the producer could not trust the acknowledgement.
Kafka chose reliability over guessing. It preferred possibly duplicating a message rather than risking data loss. And that trade-off is intentional.
This is exactly why retries are a fundamental part of Kafka and why idempotency becomes essential when building real-world systems on top of it.
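To make that concrete, here's a minimal sketch with the plain Java producer client. The broker address, topic name, key, and value are purely illustrative; the point is that the client keeps re-sending a record it never saw acknowledged, and the callback only fires once the record is acknowledged or the client gives up retrying.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetryingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // How long the client keeps retrying a record it has not seen acknowledged
        // (set explicitly here; two minutes is the default).
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-42", "Y"), (metadata, exception) -> {
                // If the acknowledgement was lost after a successful write, the broker may
                // already hold the record even though the client retried or reported an error.
                if (exception != null) {
                    System.err.println("Send failed after retries: " + exception.getMessage());
                }
            });
        }
    }
}
```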
A Simple Analogy
Think of Kafka like a courier service.
You send a package and wait for a confirmation.
If the confirmation doesn't arrive, you send the package again just to be safe.
From the courier's point of view, that's the correct behavior.
From the receiver's point of view, they may now have two identical packages.
Kafka behaves the same way.
Retries are not a bug. They are a feature.
The question is: can your system safely handle receiving the same message more than once?
Enter Idempotency
This is where idempotency comes in.
At a high level, an operation is idempotent if doing it multiple times produces the same final result as doing it once.
In practical terms:
- Processing the same event twice should not corrupt your data.
- Writing the same record again should not create duplicates.
- Retrying should be safe, not dangerous.
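A tiny, self-contained illustration in plain Java (no Kafka involved, and the names are made up): writing an absolute value is idempotent, while applying a delta is not.

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotencyExample {
    public static void main(String[] args) {
        Map<String, Integer> balances = new HashMap<>();

        // Idempotent: writing an absolute value. Applying it twice ends in the same state.
        balances.put("account-1", 100);
        balances.put("account-1", 100);   // still 100

        // Not idempotent: applying a delta. A retry changes the final state.
        balances.merge("account-2", 100, Integer::sum);
        balances.merge("account-2", 100, Integer::sum);   // now 200

        System.out.println(balances); // account-1 is unchanged; account-2 has doubled
    }
}
```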
Kafka provides some idempotency guarantees at the producer level, which help prevent duplicate messages from being written to Kafka itself during retries. That's important, but it's only part of the story.
Because even with an idempotent producer:
- consumers can retry
- acknowledgements can fail
- databases can be written to more than once
Which means true idempotency is not a single setting.
It's a system-wide design choice that spans:
- producers
- Kafka
- consumers
- and the database itself
In this article, we'll walk through how idempotency actually works in real Kafka systems: what Kafka protects you from, what it doesn't, and how to design your pipeline so that retries don't turn into production incidents.
No framework-specific code.
No marketing promises.
Just practical, production-oriented thinking.
Producer-Side Idempotency: Preventing Duplicates at the Source
Let's start at the very beginning of the pipeline: the producer.
When a producer sends a message to Kafka, it expects an acknowledgement in return. If that acknowledgement doesn't arrive, perhaps due to a network glitch or a temporary broker issue, the producer assumes the message was not delivered and sends it again.
From the producer's perspective, this is the safest possible behavior.
But without protection, this retry can result in duplicate messages being written to Kafka, even though the original message may have already been stored successfully.
To handle this, Kafka provides producer-side idempotency.
What Kafka's Idempotent Producer Actually Does
When producer idempotency is enabled, Kafka ensures that retries from the same producer do not result in duplicate records being written to a partition.
Internally, Kafka does this by tracking:
- a unique identity for the producer session, and
- a sequence number for each message sent to a given partition
If the producer retries a message because it didn't receive an acknowledgement, Kafka can recognize that this is a retry of a previously sent message, not a new one, and it avoids writing it again.
The result is simple and powerful:
Even if the producer retries, Kafka will store the message only once.
This gives us a strong guarantee at the Kafka log level.
Why Acknowledgements Matter (acks=all)
Producer idempotency works correctly only when Kafka is allowed to fully confirm writes.
That's why it's typically paired with waiting for acknowledgements from all in-sync replicas.
Why does this matter?
Because a partial acknowledgement can lie.
If the producer receives an acknowledgement before the message is safely replicated, and a failure happens immediately after, Kafka might accept the retry, and now you're back to duplicates or lost data.
Waiting for full acknowledgements ensures that:
- Kafka has durably stored the message.
- retries are handled safely.
- producer idempotency can actually do its job.
In short:
- Fast acknowledgements optimize latency.
- Strong acknowledgements protect correctness.
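With the plain Java client, both of these are ordinary producer settings. A minimal sketch follows; the broker address and serializers are illustrative, and recent client versions (3.0+) already default to these values, but being explicit documents the intent.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerFactory {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // The broker deduplicates retries from this producer session, per partition.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        // Acknowledge only after all in-sync replicas have the record.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        return new KafkaProducer<>(props);
    }
}
```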
The Critical Limitation (This Is Where Many Teams Stop Too Early)
At this point, it's tempting to think:
"Great producer idempotency is enabled. We're safe."
Not quite.
Producer idempotency only guarantees that Kafka won't store duplicate records due to producer retries.
It does not guarantee:
- uniqueness across different producers
- uniqueness across restarts
- uniqueness at the consumer or database level
- business-level correctness
If multiple producers send logically identical events, or if a consumer processes the same message twice, Kafka will not stop that.
This is an important distinction:
Kafka-level idempotency protects delivery. It does not protect business state.
And that's why real-world systems need more than just producer idempotency.
Application-Level Idempotency: Making Duplicates Detectable
Once you accept that Kafka alone cannot guarantee uniqueness, the next question becomes:
How does the rest of the system recognize a duplicate when it sees one?
The answer is application-level idempotency.
At this layer, we stop relying on Kafka to "do the right thing" and instead give our system the ability to identify whether an event has already been processed, regardless of how many times it shows up.
The Core Idea: Stable Event Identity
Application-level idempotency starts with a simple but powerful concept:
Every logical event must have a stable, unique identity.
This identity is not generated by Kafka.
It's generated by the application and travels with the event, end to end.
Think of it like a receipt number.
If you see the same receipt number twice, you immediately know:
- this isn't a new action
- it's a retry or a duplicate
- processing it again would be incorrect
In Kafka systems, this typically means attaching an event ID to every message: something that uniquely represents what happened, not when it was sent.
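Here's one way that can look with the plain Java client. Everything here, the topic, the header name, and the way the ID is derived from an order ID and a state transition, is an illustrative assumption; the point is that the ID comes from business facts, so a retry of the same logical event carries the same identity.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventIdExample {
    static void sendPaymentConfirmed(KafkaProducer<String, String> producer, String orderId) {
        // Derived deterministically from business facts, not timestamps or random values,
        // so a retry of the same logical event produces the same ID.
        String eventId = UUID.nameUUIDFromBytes(
                (orderId + ":PAYMENT_CONFIRMED").getBytes(StandardCharsets.UTF_8)).toString();

        ProducerRecord<String, String> record =
                new ProducerRecord<>("orders", orderId, "{\"status\":\"PAYMENT_CONFIRMED\"}");
        // The ID travels with the message, so every downstream component sees the same identity.
        record.headers().add("event-id", eventId.getBytes(StandardCharsets.UTF_8));

        producer.send(record);
    }
}
```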
Why This Matters Even with Idempotent Producers
Producer idempotency prevents Kafka from writing the same send attempt twice.
But it cannot answer questions like:
- Did another producer emit the same logical event?
- Did this consumer restart and reprocess the message?
- Did a downstream write succeed even though the ack failed?
Only the application can answer those questions and it can only do so if events are identifiable.
That's why application-level idempotency is about business correctness, not messaging mechanics.
What Happens Without Stable Event IDs
Without a stable identifier, the system has no memory.
When a duplicate message arrives, the consumer has no way to know:
- whether this event is new
- whether it was already applied
- whether processing it again would cause harm
So the system does the only thing it can do: process it again.
This is how duplicates silently turn into:
- double inserts
- incorrect counters
- repeated state transitions
- corrupted aggregates
And by the time you notice, the damage is already done.
With Application-Level Idempotency
When every event carries a stable ID, the system can make an informed decision.
At the consumer side, the flow becomes:
- Receive event.
- Check whether this event ID was already seen.
- If yes → skip or safely ignore.
- If no → process and record the ID.
Now retries stop being dangerous.
They become harmless repetitions.
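As a sketch of that flow: the "already seen" store here is a plain in-memory set, purely for illustration. It does not survive restarts; the durable options are covered later in this article.

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;

public class DeduplicatingHandler {
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();

    void handle(ConsumerRecord<String, String> record) {
        Header header = record.headers().lastHeader("event-id");
        if (header == null) {
            return; // no identity attached, so nothing safe to deduplicate on
        }
        String eventId = new String(header.value(), StandardCharsets.UTF_8);

        // add() returns false if the ID was already present: a duplicate, so skip it.
        if (!processedEventIds.add(eventId)) {
            return;
        }
        applyBusinessLogic(record); // hypothetical side effect
    }

    private void applyBusinessLogic(ConsumerRecord<String, String> record) {
        // ... write to the database, update state, etc.
    }
}
```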
A Key Mindset Shift
This is the mental shift many teams miss:
Retries are inevitable. Duplicates are optional, if your system can recognize them.
Kafka will retry.
Networks will fail.
Consumers will restart.
Application-level idempotency is how you design a system that remains correct anyway.
A Reality Check: "Exactly Once" Is a Goal, Not a Guarantee
Before we talk about consumer-side idempotency, it's important to set expectations.
In distributed systems, guaranteeing exactly-once behavior across every component and every failure mode is effectively impossible.
This isn't a limitation of Kafka.
It's a property of distributed systems themselves.
When you have:
- independent processes
- network partitions
- retries
- crashes
- and multiple sources of truth
there will always be edge cases where the system cannot know, with absolute certainty, whether an operation has already happened.
So when we talk about "exactly-once" behavior in Kafka-based systems, what we really mean is:
Practically exactly-once under well-defined failure scenarios.
The goal is not perfection.
The goal is controlled correctness.
Why This Matters
Many teams approach idempotency expecting a magic switch: a configuration that eliminates duplicates forever.
That switch does not exist.
Instead, what Kafka and good system design give you is:
- deterministic behavior
- bounded failure modes
- safe retries
- and recoverable state
Idempotency is about minimizing harm, not eliminating retries.
Kafka's Philosophy Aligns with This Reality
Kafka intentionally chooses:
- at-least-once delivery
- explicit retries
- clear failure semantics
Because losing data is usually worse than processing it twice.
This means Kafka pushes the final responsibility for correctness up to the application.
That's not a weakness.
It's a design decision.
Consumer-Side Idempotency: The Final Line of Defense
With that reality in mind, we now arrive at the most critical part of the system: the consumer.
Even with:
- idempotent producers
- stable event IDs
- careful message design
Consumers will still:
- restart
- reprocess messages
- see the same event more than once
Which means the consumer must assume:
"Every message I receive could be a duplicate."
Consumer-side idempotency is where this assumption is enforced.
What the Consumer Must Do
At a high level, the consumer's job is simple:
- Receive an event.
- Check whether this event ID has already been processed.
- Decide whether to:
- apply the change
- skip it
- or safely update existing state
This check typically happens before any irreversible side effects, especially database writes.
If the consumer does not perform this check, all previous idempotency efforts can still collapse at the last step.
Why the Consumer Is So Important
The consumer is the only component that:
- sees the final event
- performs the side effect
- mutates durable state
That makes it the last opportunity to prevent duplicates from becoming permanent.
If duplicates reach the database unchecked, the system has already lost.
How Consumers Enforce Idempotency in Practice
At the consumer layer, idempotency stops being a theory and becomes a decision-making process.
The consumer receives a message and must answer one question before doing anything else:
Have I already processed this event?
Everything else flows from that.
The Two Common Deduplication Strategies
In practice, consumers enforce idempotency using one of two mechanisms, and sometimes both.
1. Database-Based Deduplication (Most Reliable)
In this approach, the database itself becomes the source of truth for idempotency.
The idea is simple:
- every event has a stable event ID
- the database enforces uniqueness for that ID
- duplicate writes are either ignored or treated as no-ops
This works well because:
- databases are durable
- uniqueness constraints are enforced atomically
- retries become safe by design
From the consumer's point of view:
- if the write succeeds → the event was new
- if the write fails due to duplication → the event was already processed
The key benefit here is correctness under crashes.
Even if:
- the consumer restarts
- the same message is processed again
- the acknowledgement failed previously
…the database prevents corruption.
That's why database-level idempotency is often the strongest safety net in Kafka systems.
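Here's a minimal sketch of that pattern, assuming PostgreSQL, plain JDBC, and a hypothetical deduplication table created as CREATE TABLE processed_events (event_id TEXT PRIMARY KEY). The table, the business write, and the ON CONFLICT clause are assumptions; the pattern, a unique constraint plus a single transaction, is the point.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DatabaseDeduplication {
    /** Returns true if the event was new and applied, false if it was a duplicate. */
    static boolean processOnce(Connection conn, String eventId, String orderId) throws SQLException {
        conn.setAutoCommit(false);
        try {
            // The PRIMARY KEY on event_id enforces uniqueness atomically.
            try (PreparedStatement dedup = conn.prepareStatement(
                    "INSERT INTO processed_events (event_id) VALUES (?) ON CONFLICT (event_id) DO NOTHING")) {
                dedup.setString(1, eventId);
                if (dedup.executeUpdate() == 0) {
                    conn.rollback();   // already processed earlier: treat as a no-op
                    return false;
                }
            }
            // New event: apply the business write in the same transaction, so the
            // deduplication record and the side effect commit or fail together.
            try (PreparedStatement apply = conn.prepareStatement(
                    "UPDATE orders SET status = 'PAYMENT_CONFIRMED' WHERE id = ?")) {
                apply.setString(1, orderId);
                apply.executeUpdate();
            }
            conn.commit();
            return true;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```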
2. Cache-Based Deduplication (Fast but Weaker)
Some systems use an in-memory cache or distributed cache to track processed event IDs.
This approach is typically chosen for:
- very high throughput
- extremely low latency
- short-lived deduplication windows
The flow looks like this:
- consumer checks cache for event ID
- if present → skip
- if not → process and store ID in cache
This can work well, but it comes with trade-offs:
- cache entries expire
- cache can be evicted
- cache can be lost on failure
Which means:
Cache-based deduplication improves performance, but cannot be the only line of defense if correctness is critical.
Many production systems use cache as an optimization, with the database still acting as the final authority.
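A sketch of that short-lived window, using a local in-memory map with a TTL purely for illustration. In production this role is usually played by a shared cache (for example, Redis keys with an expiry) so multiple consumer instances share the same window, but the trade-offs above apply either way.

```java
import java.time.Duration;
import java.util.concurrent.ConcurrentHashMap;

public class RecentEventCache {
    private final ConcurrentHashMap<String, Long> seenAt = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public RecentEventCache(Duration ttl) {
        this.ttlMillis = ttl.toMillis();
    }

    /** Returns true if the event ID has not been seen within the TTL window. */
    public boolean markIfNew(String eventId) {
        long now = System.currentTimeMillis();
        // Lazily drop expired entries; expiry and eviction are exactly why this
        // cannot be the only line of defense when correctness is critical.
        seenAt.values().removeIf(ts -> now - ts > ttlMillis);
        return seenAt.putIfAbsent(eventId, now) == null;
    }
}
```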
Choosing Between Them (or Combining Them)
There is no universal right answer.
The choice depends on:
- how harmful duplicates are
- how long duplicates can appear
- how much latency you can tolerate
- how much complexity you're willing to manage
A common pattern is:
- cache for fast, short-term duplicate filtering
- database for long-term correctness
This balances performance and safety.
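Sketched as one layered check, reusing the two hypothetical helpers from the earlier sketches: the cache gives a cheap early exit for recent duplicates, while the database insert stays the final authority.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.time.Duration;

public class LayeredDeduplication {
    private final RecentEventCache recentEventCache = new RecentEventCache(Duration.ofMinutes(10));

    /** Cache first for speed; the unique constraint in the database decides for real. */
    boolean processOnce(Connection conn, String eventId, String orderId) throws SQLException {
        if (!recentEventCache.markIfNew(eventId)) {
            return false; // seen very recently: skip without touching the database
        }
        return DatabaseDeduplication.processOnce(conn, eventId, orderId);
    }
}
```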
The Important Ordering Rule
One subtle but critical rule applies regardless of strategy:
Deduplication must happen before side effects.
Once the consumer performs an irreversible action, such as writing to a database or triggering an external call, it's already too late to ask whether the event was a duplicate.
This is why idempotency checks are placed at the very start of message processing.
Why This Still Doesn't Mean "Perfect Idempotency"
Even with all of this in place, edge cases still exist.
There are moments where:
- a write succeeds
- the consumer crashes
- the acknowledgement never happens
- the message is retried
At that point, the system relies entirely on idempotency to remain correct.
And this brings us back to the earlier reality check:
Idempotency doesn't eliminate retries. It makes retries safe.
That's the real objective.
Conclusion: Idempotency Is a System Property, Not a Feature
Kafka is very good at one thing: making sure data is not lost.
It is intentionally not responsible for ensuring that data is processed only once everywhere. That responsibility belongs to the system built on top of Kafka, not Kafka itself.
The approaches discussed in this article represent one practical way to achieve idempotency in real-world systems but they are not the only way.
Depending on the ecosystem and tooling you use, there may be other mechanisms available:
- language-level abstractions
- framework-provided annotations
- transactional helpers
- or platform-specific guarantees
These can simplify implementation, but they do not change the underlying requirement:
the system must still be designed to tolerate retries and detect duplicates.
Frameworks can help. They cannot replace sound system design.
And that's the key takeaway.
Idempotency is not:
- a Kafka configuration
- a producer setting
- a consumer option
- or a database trick
It is a system-wide design decision.
What We've Learned
Let's zoom out and connect the dots.
- Kafka retries are expected, not exceptional.
- Acknowledgements can fail even when writes succeed.
- Producer idempotency prevents duplicate writes to Kafka, not duplicate business effects.
- Application-level event identity makes duplicates detectable.
- Consumer-side idempotency is the final line of defense.
- Databases and caches enforce correctness when retries happen.
- "Exactly-once" is a practical goal, not a mathematical guarantee.
Or put simply:
Kafka guarantees delivery. Your system must guarantee correctness.
Why This Matters in Production
In small systems, duplicates might look harmless.
In large systems, with high throughput, retries, restarts, and partial failures, duplicates silently accumulate and eventually surface as:
- incorrect data
- broken invariants
- painful backfills
- and long debugging sessions
Idempotency is cheaper than recovery.
Systems that are designed to tolerate retries age far better than systems that assume they won't happen.
The Right Mental Model Going Forward
When designing Kafka-based systems, ask these questions early:
- What uniquely identifies a business event?
- What happens if this message is processed twice?
- Where is duplication detected?
- Where is correctness enforced?
- What happens when acknowledgements lie?
If you can answer those clearly, your system is already ahead of most.
Kafka will retry. Failures will happen. Duplicates will appear.
Idempotency is how you make all of that safe.
🔗 Connect with Me
📖 Blog by Naresh B. A.
👨‍💻 Building AI & ML Systems | Backend-Focused Full Stack
🌐 Portfolio: Naresh B A
📫 Let's connect on LinkedIn | GitHub: Naresh B A
Thanks for spending your precious time reading this; it's a personal little corner of my thoughts, and I really appreciate you being here. ❤️

