Exactly-once is a lie: why your Spark stream is actually at-least-once

#databricks #streaming #architecture #spark

If you think your Spark Structured Streaming pipeline is actually achieving end-to-end exactly-once processing, you are likely just lucky that your infrastructure hasn't had a truly bad day yet.

Why I chose this topic: I’ve spent the last six years cleaning up "perfect" pipelines that bloated their databases with duplicate records the moment a Kafka partition rebalanced during a checkpoint commit. We treat the word "exactly-once" as a religious tenet, but in the trenches of financial ledger reconciliation, it’s a leaky abstraction that hides the brutal reality of distributed systems.

Why the common approach falls short

The industry loves the marketing slide that says Spark Structured Streaming is "exactly-once." In reality, what Spark provides is exactly-once processing within the Spark engine itself, not end-to-end.

The mechanism relies on checkpointing and deterministic replay of inputs. Before each micro-batch runs, Spark records the offsets it plans to read in the checkpoint on HDFS or S3; after the sink finishes, it records the batch as committed. If the job fails in between, Spark restarts from the last committed batch and re-processes the same offsets. That sounds clean, right? But the moment you write that data to an external sink, you are at the mercy of the sink's idempotency. If your sink is a generic JDBC connector or a legacy database that doesn't support transactional writes keyed to the Spark batch ID, you are not doing exactly-once. You are doing at-least-once, and you are quietly praying that your deduplication logic catches the debris.

I once debugged a PII-scrubbing pipeline where a node failure during a foreachBatch sink operation caused a partial write. Because the sink wasn't atomic and the batch hadn't yet been recorded as committed under checkpointLocation, the retry wrote the entire batch again. We ended up with duplicate sensitive records in our downstream warehouse. The Spark logs looked "successful," but the data integrity was trash.
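Here is a toy model of that failure in plain Python, with no Spark involved. The names run_stream, append_sink and CrashAfterSinkWrite are made up for this sketch, and the in-memory set only stands in for the commit log in a real checkpoint. Treat it as an illustration of the ordering problem, not of Spark's internals.

```python
class CrashAfterSinkWrite(Exception):
    """Simulated node failure: the sink write happened, the commit did not."""


def run_stream(records, batch_size, sink, crash_on_batch=None):
    """Toy micro-batch loop: write each batch to the sink, then mark it committed.

    After a crash the loop restarts and replays every batch that was
    never marked as committed.
    """
    batches = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
    committed = set()                      # stands in for the checkpoint's commit log
    crash_pending = crash_on_batch is not None
    while len(committed) < len(batches):
        try:
            for batch_id, rows in enumerate(batches):
                if batch_id in committed:
                    continue               # already acknowledged, never replayed
                sink(batch_id, rows)       # side effect on the external system
                if crash_pending and batch_id == crash_on_batch:
                    crash_pending = False  # crash only once
                    raise CrashAfterSinkWrite(batch_id)
                committed.add(batch_id)    # the commit comes after the write
        except CrashAfterSinkWrite:
            pass                           # "restart" from the checkpoint


table = []                                 # a sink that only knows how to append


def append_sink(batch_id, rows):
    table.extend(rows)


run_stream(["e0", "e1", "e2", "e3", "e4", "e5"], batch_size=2,
           sink=append_sink, crash_on_batch=1)
duplicates = len(table) - len(set(table))
print(table, duplicates)
```

The engine did everything right here: it replayed exactly the batch it could not prove was finished. The duplicates come entirely from a sink that cannot tell a replay from new data.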


The state of the sink

To achieve true exactly-once, your sink must be able to handle the same batch ID twice without side effects. On the source side, spark-sql-kafka gives you a head start: Spark tracks the Kafka offsets in its own checkpoint, so a replayed batch reads exactly the same records. But a deterministic replay is not a safe write. The second that data leaves Spark for an external sink, you are in the wild west.
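One way to make a sink tolerate the same batch ID twice is to store the batch ID in the same transaction as the rows. The sketch below uses SQLite from the Python standard library as a stand-in for your real database. The table names ledger and committed_batches and the helper write_batch are assumptions of mine, not anything Spark provides; in a real job this logic would sit inside your foreachBatch function.

```python
import sqlite3


def init_sink(conn):
    """Create the data table plus a bookkeeping table of applied batch IDs."""
    conn.execute("CREATE TABLE ledger (event_id TEXT NOT NULL, amount INTEGER NOT NULL)")
    conn.execute("CREATE TABLE committed_batches (batch_id INTEGER PRIMARY KEY)")
    conn.commit()


def write_batch(conn, batch_id, rows):
    """Apply a batch at most once. Returns False when the batch was already applied."""
    with conn:  # one transaction: commits on success, rolls back on any exception
        seen = conn.execute(
            "SELECT 1 FROM committed_batches WHERE batch_id = ?", (batch_id,)
        ).fetchone()
        if seen:
            return False  # a replay of a batch that already landed
        conn.executemany("INSERT INTO ledger (event_id, amount) VALUES (?, ?)", rows)
        # The marker commits together with the rows, or not at all.
        conn.execute("INSERT INTO committed_batches (batch_id) VALUES (?)", (batch_id,))
    return True


conn = sqlite3.connect(":memory:")
init_sink(conn)
batch = [("e1", 100), ("e2", 250)]
first = write_batch(conn, 7, batch)    # first delivery
replay = write_batch(conn, 7, batch)   # Spark retries the same batch ID
row_count = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
print(first, replay, row_count)
```

The important design choice is that the rows and the batch marker share one transaction. If the marker lived anywhere else, you would be back to two writes that can fail independently.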

Consider the Delta sink. When you use Delta Lake, the transaction log acts as the source of truth for the batch ID. Spark writes the data files, then commits the transaction by adding an entry to the log. If a crash happens midway, the data files exist but no log entry references them. Upon restart, Spark replays the batch: readers ignore the orphaned files because they were never committed, and a batch that did commit before the crash is recognized by its ID in the log and skipped. This works because Delta supports atomic commits.
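To make the "write files first, commit second" idea concrete, here is a deliberately tiny imitation of a log-based table on the local filesystem. It is not Delta's implementation or file format. The _log directory, the JSON layout and the functions write_batch, read_log and read_table are all invented for this sketch.

```python
import json
import os
import tempfile
import uuid


def read_log(table_dir):
    """Return committed log entries in version order."""
    log_dir = os.path.join(table_dir, "_log")
    if not os.path.isdir(log_dir):
        return []
    entries = []
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            entries.append(json.load(f))
    return entries


def read_table(table_dir):
    """Readers only see data files that a log entry points to."""
    rows = []
    for entry in read_log(table_dir):
        with open(os.path.join(table_dir, entry["data_file"])) as f:
            rows.extend(json.load(f))
    return rows


def write_batch(table_dir, query_id, batch_id, rows, crash_before_commit=False):
    committed = read_log(table_dir)
    if any(e["query_id"] == query_id and e["batch_id"] == batch_id for e in committed):
        return False                                  # replay of a committed batch
    data_file = f"part-{uuid.uuid4().hex}.json"
    with open(os.path.join(table_dir, data_file), "w") as f:
        json.dump(rows, f)                            # step 1: data, invisible so far
    if crash_before_commit:
        raise RuntimeError("simulated crash between data write and commit")
    log_dir = os.path.join(table_dir, "_log")
    os.makedirs(log_dir, exist_ok=True)
    entry = {"query_id": query_id, "batch_id": batch_id, "data_file": data_file}
    # step 2: the commit. Mode "x" refuses to overwrite an existing version file;
    # a real table format needs a stronger atomic put-if-absent than this.
    with open(os.path.join(log_dir, f"{len(committed):020d}.json"), "x") as f:
        json.dump(entry, f)
    return True


table_dir = tempfile.mkdtemp()
try:
    write_batch(table_dir, "q1", 0, ["a", "b"], crash_before_commit=True)
except RuntimeError:
    pass
after_crash = read_table(table_dir)                   # the orphaned file is not visible
first = write_batch(table_dir, "q1", 0, ["a", "b"])   # the retry commits
replay = write_batch(table_dir, "q1", 0, ["a", "b"])  # a second replay is skipped
print(after_crash, first, replay, read_table(table_dir))
```

The crashed attempt leaves a stray part file on disk, which costs storage but not correctness. That is the trade a log-based sink makes, and it is why such tables need periodic cleanup of unreferenced files.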

Compare that to a standard mode("append") write to a legacy SQL database. There is no atomic commit across the whole batch here, and no "batch ID" metadata stored in the target table. If your executor dies after 50% of the rows have been committed but before the batch finishes, those 50% remain. The retry writes them again. You are now leaking duplicates. Unless you are manually implementing a MERGE INTO against a unique constraint or a primary key (which carries a heavy performance tax), you aren't doing exactly-once. You are doing "at-least-once with a post-hoc cleanup script."
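The difference is easy to reproduce. This sketch runs the same "half the batch lands, then the retry resends everything" scenario against two SQLite tables, one plain and one with a primary key on the business key. SQLite's INSERT OR IGNORE stands in for whatever MERGE or upsert syntax your database offers, and the table and helper names are placeholders of mine.

```python
import sqlite3

PLAIN_INSERT = "INSERT INTO payments_plain (payment_id, amount) VALUES (?, ?)"
KEYED_INSERT = "INSERT OR IGNORE INTO payments_keyed (payment_id, amount) VALUES (?, ?)"


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments_plain (payment_id TEXT, amount INTEGER)")
    conn.execute("CREATE TABLE payments_keyed (payment_id TEXT PRIMARY KEY, amount INTEGER)")
    return conn


def write_rows(conn, insert_sql, rows, die_after=None):
    """Commit row by row; stop after `die_after` rows to mimic a dead executor."""
    for i, row in enumerate(rows):
        if die_after is not None and i == die_after:
            return                      # the remaining rows never arrive
        conn.execute(insert_sql, row)
        conn.commit()


def count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = open_db()
batch = [("p1", 100), ("p2", 250), ("p3", 75), ("p4", 10)]
for sql in (PLAIN_INSERT, KEYED_INSERT):
    write_rows(conn, sql, batch, die_after=2)   # first attempt: half the rows land
    write_rows(conn, sql, batch)                # retry: the whole batch again
plain_rows = count(conn, "payments_plain")
keyed_rows = count(conn, "payments_keyed")
print(plain_rows, keyed_rows)
```

The keyed table ends up correct, but only because every insert now pays for a uniqueness check. That is the performance tax mentioned above, and it grows with the size of the table.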

The reality of checkpointing failure

We treat checkpointLocation as a holy object. We assume that if we store it on S3, we are safe. We aren't.

In production environments using Spark 3.x, I’ve seen consistent-hashing issues and S3 eventual-consistency bugs lead to corrupted checkpoints (the latter are largely mitigated now that S3 offers strong read-after-write consistency, and by newer S3A committers). When the offsets or commits directory in your checkpoint path gets corrupted, your streaming job enters a death spiral.
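A cheap sanity check is to compare the newest batch ID in offsets with the newest one in commits. The sketch below assumes the layout I have seen in practice, where each directory holds one file per batch named after the batch ID. That layout is an observation, not a documented contract, and the code assumes the checkpoint is reachable as a local or mounted path. The function names and status strings are mine.

```python
import os
import tempfile


def latest_batch_id(directory):
    """Highest numeric file name in a checkpoint subdirectory, or None."""
    if not os.path.isdir(directory):
        return None
    ids = [int(name) for name in os.listdir(directory) if name.isdigit()]
    return max(ids) if ids else None


def checkpoint_status(checkpoint_path):
    planned = latest_batch_id(os.path.join(checkpoint_path, "offsets"))
    committed = latest_batch_id(os.path.join(checkpoint_path, "commits"))
    if planned is None:
        return "empty"
    if committed is None:
        # Only batch 0 may legitimately be planned with nothing committed yet.
        return "replay-pending" if planned == 0 else "inconsistent"
    if committed == planned:
        return "clean"
    if committed == planned - 1:
        return "replay-pending"     # the newest batch will be re-run on restart
    return "inconsistent"           # a gap or reversal: investigate before restarting


# Build a fake checkpoint: batches 0-2 planned, only 0-1 committed.
checkpoint = tempfile.mkdtemp()
for sub, batch_ids in (("offsets", [0, 1, 2]), ("commits", [0, 1])):
    os.makedirs(os.path.join(checkpoint, sub))
    for batch_id in batch_ids:
        open(os.path.join(checkpoint, sub, str(batch_id)), "w").close()
status = checkpoint_status(checkpoint)
print(status)
```

A "replay-pending" result is normal after a crash and tells you exactly which batch your sink is about to see twice. Anything reported as "inconsistent" is worth a human look before the job restarts.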

You can’t just "fix" a corrupted checkpoint. You are forced to choose: lose the state and reset the source offset (creating a gap in data), or force-start from a previous checkpoint and deal with the inevitable re-processing of data you’ve already sunk. Neither of these options is "exactly-once." They are "emergency recovery procedures." If you aren't logging your source offsets in a separate, immutable metadata store, you are flying blind during these failures.
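A minimal version of that separate offset record could look like this. It takes a dict shaped like the JSON that Structured Streaming reports for query progress (batchId, timestamp, and sources with description and endOffset) and appends one line per batch to a file. The local file is only a stand-in for a real immutable store, the sample values are invented, and record_offsets and end_offsets_for are hypothetical helpers.

```python
import json
import os
import tempfile


def record_offsets(log_path, progress):
    """Append one JSON line per micro-batch; existing lines are never rewritten."""
    entry = {
        "batch_id": progress["batchId"],
        "timestamp": progress["timestamp"],
        "end_offsets": {s["description"]: s["endOffset"] for s in progress["sources"]},
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry, sort_keys=True) + "\n")


def end_offsets_for(log_path, batch_id):
    """Return the recorded end offsets for a batch, or None if it was never logged."""
    found = None
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            if entry["batch_id"] == batch_id:
                found = entry["end_offsets"]   # a replayed batch may appear twice
    return found


audit_log = os.path.join(tempfile.mkdtemp(), "offsets_audit.jsonl")
# Sample progress payloads; the topic name and offsets are made up.
record_offsets(audit_log, {
    "batchId": 41,
    "timestamp": "2024-01-01T00:00:00.000Z",
    "sources": [{"description": "kafka-payments", "endOffset": {"payments": {"0": 120}}}],
})
record_offsets(audit_log, {
    "batchId": 42,
    "timestamp": "2024-01-01T00:00:10.000Z",
    "sources": [{"description": "kafka-payments", "endOffset": {"payments": {"0": 180}}}],
})
recovered = end_offsets_for(audit_log, 41)
print(recovered)
```

With a record like this, a checkpoint reset stops being guesswork: you can look up the last batch you trust and restart the source from its end offsets, knowing exactly which range will be re-processed.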


The objections (and my answers)

"But the documentation says it’s exactly-once!"

The documentation is correct about the internal state management. If you are doing aggregations in memory using mapGroupsWithState and only writing to a Delta table, the engine guarantees that the state store and the transaction log stay in sync. My objection isn't to the engine's internal math; it’s to the delusion that the engine lives in a vacuum. Your system includes your sink, your network, and your storage provider. If any of those don't support atomic, idempotent writes, the guarantee breaks.

"Just use Kafka as a sink and it's fine."

Kafka is a great buffer, but it’s not an analytical store, and Spark's own Kafka sink is documented as at-least-once, so even that hop can duplicate records on a retry. If your pipeline is feeding a BI tool, you eventually have to land that data somewhere else. The moment you use a custom sink or a non-transactional database, the "exactly-once" promise evaporates. You are then responsible for the write-ahead-log pattern yourself.

"We use a primary key to deduplicate, so it's effectively exactly-once."

That is an operational workaround, not a semantic guarantee. If your database performance degrades because you’re running UPSERT logic on every incoming stream to handle the duplicates that Spark created, you haven't solved the problem; you've just shifted the cost from the storage layer to the compute layer.

Conclusion

Exactly-once is a goal, but in Spark Structured Streaming, it is never a default. It is a configuration of your entire stack.

If you want to get closer to the truth, stop trusting the framework to handle everything. Use transactional sinks like Delta Lake or Apache Hudi. If you are forced to use a legacy sink, build idempotency into your data model using unique business keys. Monitor your checkpointLocation as if it were your production database, because that’s exactly what it is.

Stop telling stakeholders you have "exactly-once" semantics. Tell them you have "idempotent processing pipelines with a defined recovery point." It sounds less like a marketing brochure and more like the actual engineering work you're doing.

