- What exactly-once actually guarantees — and the practical caveats
- Mastering the Kafka primitives: idempotent producers and transactions
- Stateful stream processing patterns that deliver EOS in practice
- Sinks and external systems: how to make writes idempotent or transactional
- Practical checklist: implement exactly-once with Kafka (steps and config)
Exactly-once in Kafka is not a single toggle — it’s an architectural contract between producers, brokers, and consumers that makes a read → process → write sequence appear atomic from the business perspective. When implemented correctly, duplicates from producer retries are removed and a group of writes and offset commits can be made atomic, but those guarantees are bounded by what participates in the transaction.
You see the problem in production as two recurring symptoms: invisible duplicates slipping into downstream stores and occasional partial commits that leave aggregates or external databases inconsistent. Teams treat Kafka as a silver bullet and then find that retries, rebalances, or non-transactional sinks still produce inconsistent business state — the result is long outage postmortems, labor-intensive reconciliations, and brittle compensating logic.
What exactly-once actually guarantees — and the practical caveats
Exactly-once in the Kafka ecosystem means: from the viewpoint of a read → process → write flow that is implemented using Kafka’s transaction APIs, each input record’s observable side-effects on Kafka topics (and other log-backed state) are visible exactly once. This is achieved by combining idempotent producers (broker-side de‑dup) and transactions (atomic commit of produced records + consumer offsets).
Important practical caveats you must accept up front:
- Cluster-local: Kafka transactions only span Kafka topics and the cluster’s internal transactional state; they do not extend to arbitrary external systems (databases, HTTP APIs) by default. Achieving exactly-once to external systems requires additional design (outbox, idempotent writes, or two-phase commit patterns).
-
Session bounds for idempotency: an idempotent producer guarantees de-duplication within a single producer session (a PID/epoch pair). To preserve stronger semantics across restarts you must use
transactional.idand the transaction recovery fencing that comes with it. - Observable behavior vs. hidden work: processing may happen multiple times internally (retries, task failover); the guarantee is that the final observable effects (topic writes, state-store updates backed by changelogs) reflect each input once. That distinction matters when you reason about side-effects outside Kafka.
Mastering the Kafka primitives: idempotent producers and transactions
Two primitives form the mechanical foundation.
-
Idempotent producers: when you enable
enable.idempotence=true, the client acquires a Producer ID (PID) and appends a per-partition sequence number to batches; the broker uses PID+sequence to deduplicate retries so the log receives each record once for that PID/session. The client enforcesacks=all,retriesdefaults, and appropriate inflight limits for correctness. -
Transactional producers: set a unique
transactional.id, callinitTransactions(), then usebeginTransaction() / send(...) / sendOffsetsToTransaction(...) / commitTransaction()to atomically tie produced records and consumer offsets together. This is the standard pattern when you implement consume-transform-produce without using Kafka Streams.
Practical config and Java snippet (illustrative):
Properties props = new Properties();
props.put("bootstrap.servers", "kafka:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("enable.idempotence", "true"); // idempotent producer
props.put("transactional.id", "orders-validator-1"); // stable per logical producer
KafkaProducer<String,String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
producer.beginTransaction();
producer.send(new ProducerRecord<>("validated-orders", key, value));
// sendOffsetsToTransaction requires ConsumerGroupMetadata gathered from your consumer
producer.sendOffsetsToTransaction(offsetsMap, consumerGroupMetadata);
producer.commitTransaction();
} catch (Exception e) {
producer.abortTransaction();
}
Notes you must operationalize:
- Use
isolation.level=read_committedon consumers that must not see uncommitted transactional writes. That prevents consumers from reading in-flight transactional messages and protects downstream state. - The transaction coordinator uses an internal transaction log topic; that topic should be durable (replication factor ≥ 3 in production) and its availability matters for transaction recovery.
Stateful stream processing patterns that deliver EOS in practice
If you use Kafka Streams (or libraries built on top of it), a lot of the plumbing comes for free — but you still must choose the right mode and structure.
-
EOS modes in Streams: Kafka Streams historically provided
exactly_once(v1) and, since 2.5, an improvedexactly_once_v2(a.k.a. EOS v2) that reduces resource usage and scales better via a thread-producer model. Useprocessing.guarantee=exactly_once_v2once your brokers meet the minimum version requirements. -
State stores are first-class:
RocksDB-backed local state stores are backed by changelog topics; Streams ties state-store updates, changelog writes, and output topic writes to transactions so the materialized view is consistent with output. Rely on changelogs for recovery and size your RocksDB/configs accordingly. -
Dedup / idempotency pattern (stateful): a common pattern is to keep a
KeyValueStore<eventId, timestamp>or windowed store to detect duplicates. On process:- Lookup
eventIdin store. - If absent, process and store
eventIdwith TTL. - If present and within TTL, skip processing. Because the store is changelog-backed, this deduplication survives failover and works with EOS transaction commits.
- Lookup
Example sketch (Streams Processor API):
public class DedupProcessor implements Processor<String, Event, String, Event> {
private KeyValueStore<String, Long> dedupStore;
public void init(ProcessorContext ctx) {
dedupStore = ctx.getStateStore("dedup-store");
}
public void process(Record<String, Event> r) {
if (dedupStore.get(r.value().id) == null) {
// do work & forward
dedupStore.put(r.value().id, ctx.timestamp());
context.forward(r);
} // otherwise, drop duplicate
}
}
- Transactional state stores: the Streams roadmap includes/introduced transactional state store behavior so state updates can be treated transactionally with outputs; check your Streams version and enable transactional state store options where supported. This reduces edge-cases where state and outputs diverge during crashes.
Sinks and external systems: how to make writes idempotent or transactional
This is where projects fail most often: Kafka’s transactions don’t magically make arbitrary sinks transactional.
Important: Kafka transactions cover only Kafka; to guarantee exactly-once into external systems you must either make the external writes idempotent or employ an architectural pattern that provides atomicity (for example, the outbox pattern or connector-level transactional writes).
Patterns you can use:
- Outbox pattern: write business state and an outbox row in the same DB transaction; a CDC or Connect source reads the outbox and writes to Kafka. This makes the DB the single source of truth for the DB write and the emitted event. Many organizations use Debezium + a small consumer to publish outbox rows to Kafka.
-
Idempotent sinks / upserts: where possible, write sinks that can
UPSERTby primary key or accept an idempotency token. For example, many JDBC sinks offer upsert modes; Flink exposesexactlyOnceJDBC sink builder options that rely on transactional/durable sinks or XA-like semantics. If the sink supports idempotent upserts, you can achieve effective end-to-end exactly-once. - Kafka Connect exactly-once mode: Connect has KIP work to enable exactly-once semantics for source connectors and to coordinate offsets in transactions; use connectors that explicitly support EOS and read KIP-618 guidance when enabling exactly-once in Connect clusters.
-
Two-phase commit / XA (rare): some stream engines and connectors implement 2PC for external stores (e.g., via
XADataSource) but these are expensive and operationally complex. Prefer idempotent upserts or outbox when possible.
Practical example choices:
- If your DB can do idempotent upserts, use connector upsert mode and include the primary key in the Kafka key.
- If your external system cannot be idempotent, implement the outbox in the source DB and publish via a transactional source connector.
Operational trade-offs, observability, and key metrics
Exactly-once is powerful but not free — expect measurable trade-offs and new operational surface area.
- Latency vs. throughput: short transaction/commit intervals reduce failover window but increase synchronous work during commits; Streams’ commit interval tuning directly impacts throughput and end-to-end latency. Confluent’s measurements show modest producer overhead for transactions but Streams commit intervals can produce a noticeable throughput delta at short commit intervals. Plan benchmarks on your message sizes and workload.
-
Broker resources and transaction state: transactions use a transaction log topic and transaction coordinator; those internal topics require adequate replication factor, partitions, and healthy ISRs. Long-running or stalled transactions can withhold the Last Stable Offset (LSO) and affect consumers set to
read_committed. -
Failure modes you must monitor for:
ProducerFencedExceptionor unrecoverable transactional errors on producers, inflight transaction timeouts, aborted transactions, and long-running transactions that blockread_committedconsumers. Monitor broker request metrics for transaction requests (InitProducerId, AddPartitionsToTxn, EndTxn) and producer transaction timing metrics (txn-commit-time, txn-begin-time). -
Key metrics / signals to export:
- Broker: request rates and latencies for transaction RPCs,
transaction.state.log.*health. - Producer:
txn-init-time-ns-total,txn-commit-time-ns-total,record-error-rate. - Connect: transaction size and commit rates per task (if you’re using exactly-once support).
- Streams: task-level commit rate, state-store restoration times, and changelog lag.
- Broker: request rates and latencies for transaction RPCs,
Short table comparing common processing guarantees
| Guarantee | Mechanism | What it gives you | Operational cost |
|---|---|---|---|
| At-least-once | default produce + consumer offset commit | No lost messages, duplicates possible | Lowest |
| Idempotent producer |
enable.idempotence=true (PID + seq) |
Dedup for retries within session | Minimal |
| Kafka transactions |
transactional.id + API |
Atomic writes across partitions + atomic offsets | Broker transaction state; commit coordination |
| End-to-end EOS | Streams/transactions + read_committed
|
Observed effect of each input exactly once for Kafka-backed state | Highest (config, monitoring, potential latency) |
Practical checklist: implement exactly-once with Kafka (steps and config)
This checklist is a pragmatic rollout plan you can follow.
- Inventory and constraints
- Identify all inputs, outputs, and external side-effects. Mark sinks that can support idempotent upsert or transactional writes. Mark external systems that cannot. (This drives whether you use outbox or idempotent sinks.)
- Broker and client compatibility
- Ensure brokers support the EOS mode you want (
exactly_once_v2needs brokers ≥ 2.5+ / Streams 2.5+). Plan rolling upgrades for brokers and clients as needed.
- Ensure brokers support the EOS mode you want (
- Producer & consumer configuration
- For transactional producers:
enable.idempotence=true,transactional.id=<unique-per-logical-producer>. CallinitTransactions()once at startup. - Consumers that must not see in-flight transactions: set
isolation.level=read_committed.
- For transactional producers:
- Stream vs. manual transactions
- If your processing is purely stream-in/stream-out and uses state stores, prefer Kafka Streams with
processing.guarantee=exactly_once_v2(or the appropriate config for your Streams version) to reduce complexity. - If you’re implementing
consume-transform-produceby hand, implementbeginTransaction()/sendOffsetsToTransaction()/commitTransaction()carefully and handleProducerFencedException/TimeoutExceptionand abort logic.
- If your processing is purely stream-in/stream-out and uses state stores, prefer Kafka Streams with
- Sinks & external systems
- Prefer outbox + CDC or idempotent upserts. If using Connect, validate the connector’s EOS support and follow KIP-618 migration steps for source connectors.
- Testing and failure injection
- Automate fault injection: broker restarts, producer/client hard kill, network partitions, rebalance storms. Verify output topics and downstream stores show no duplicates or partial commits. Use end-to-end verification tests with deterministic input and assertions.
- Observability & runbook
- Export the transaction-related producer metrics (
txn-*), broker request metrics forInitProducerId/EndTxn, Connect transaction metrics, Streams commit and restore times. Establish alerts for high aborted-transaction ratios, long commit times, or persistentProducerFencedException.
- Export the transaction-related producer metrics (
- Migration and rollbacks
- When switching EOS modes (e.g., v1 → v2), follow Streams upgrade guidance and do rolling restarts; keep state store cleanup/restore procedures documented because offset/state mismatches require careful remediation.
- Document invariants and TTLs
- For stateful dedup stores use TTLs to bound storage. Document expected commit intervals and tail latencies so on-call teams can reason about transactional fences or blocked consumers.
Operational tip: before flipping EOS in production, run a realistic load test with the same message size distribution and commit interval you plan to use in production; measure end-to-end latency and throughput, then tune
commit.interval.msand transaction timeout settings until you find an acceptable balance.
You have the primitives — enable.idempotence, transactional.id, sendOffsetsToTransaction, isolation.level=read_committed, and the Streams processing.guarantee. Use them deliberately: keep transactions short, prefer idempotent sinks or outbox when external systems are involved, and instrument the transaction metrics and changelog lag so you detect EOS breakage quickly. The implementation details matter: name transactional.ids deterministically, size RocksDB/changelog properly, and practice failover scenarios in staging to verify your assumptions.
Sources:
KIP-98 - Exactly Once Delivery and Transactional Messaging - Design and guarantees for idempotent producers, PIDs, sequence numbers, and the transactional producer API.
KafkaProducer Javadoc (Apache Kafka) - Producer configuration defaults, enable.idempotence, transactional.id behavior and API notes.
Exactly-once Semantics is Possible: Here's How Apache Kafka Does it (Confluent Blog) - Implementation notes, performance observations, and trade-offs for EOS.
Kafka Streams Upgrade Guide — exactly_once_v2 / KIP-447 (Confluent Docs) - EOS v2 background, migration guidance, and KIP references.
Consumer Configuration: isolation.level (Apache Kafka Documentation) - read_committed semantics and impact on consumers.
KIP-618: Exactly-Once Support for Source Connectors (Apache Kafka) - How Connect handles exactly-once for source connectors and worker-level considerations.
Building Systems Using Transactions in Apache Kafka (Confluent Developer) - Practical examples of beginTransaction() / sendOffsetsToTransaction() / commitTransaction() and limitations regarding external systems.
How to tune RocksDB / Kafka Streams state stores (Confluent Blog) - State store/changelog behavior and tuning for Streams.
Apache Kafka — Common monitoring metrics (Documentation) - Producer, consumer, Streams, and broker metrics relevant to monitoring transactions.
Exactly-once semantics with Kafka transactions (Strimzi Blog) - Practical considerations, monitoring pointers, and transactional behavior notes.
Flink JdbcSink (exactlyOnceSink) — API reference (Apache Flink) - Example of exactly-once-capable JDBC sinks and XA-like options for sinks.
Top comments (0)