Stop waking up at 3 AM: Why your data pipelines must be idempotent

#dataengineering #pipelines #architecture #data

Why I chose this topic: In my first year as a junior engineer, I pushed a non-idempotent job that double-counted $2M in revenue because I didn't wrap a delete statement in a transaction. I haven't slept through a production alert since, and I’m tired of seeing junior engineers make the same career-limiting mistake.

You ship the job. It passes CI. It runs perfectly in staging. Then, at 3:14 AM, the Airflow scheduler hiccups, a network partition hits your Redshift cluster, or your upstream API returns a 500. The job fails halfway through. You wake up to a PagerDuty alert, a half-loaded partition in S3, and a duplicate record set in your main reporting table. You have two choices: manually clean up the mess for three hours, or click "Clear" on the task and pray it doesn't make things worse.

If you have to "pray" when you hit re-run, your pipeline is garbage.

Idempotency—the property where an operation produces the same result regardless of how many times it’s executed—is the only thing standing between you and a mid-life crisis.

The path you’re choosing

You are currently deciding between two fundamental ways to build data movement: "Append-Only" versus "Replace-by-Key."

Most engineers start with "Append-Only." It’s intuitive. You pull data from a source and write it to the sink. It’s fast. It’s cheap. It’s also a ticking time bomb. The "Replace-by-Key" approach is the boring, professional way to handle data. It requires more compute cycles and a bit of SQL gymnastics, but it guarantees that if you run the job ten times, the state of your database is identical to running it once.

Photo by ACatInABox on Unsplash

The contenders

In the Append-Only corner, we have the "Insert-Append" strategy. You fetch SELECT * from your source and INSERT INTO destination without checking for existing IDs.

In the Replace-by-Key corner, we have "Delete-then-Insert." You identify the window of data you are processing, wipe it from the destination table, and insert the new payload within a single atomic transaction.

The cost of doing business

Append-only pipelines are cheap on paper. You aren't burning CPU cycles checking for pre-existing keys or performing deletes. However, the "hidden cost" is the engineer’s salary. When your append-only job fails at 3 AM, you aren't just paying for the compute; you are paying for the two hours of your life you spend writing ad-hoc DELETE scripts to deduplicate the table.

With a Replace-by-Key strategy, your compute bill might tick up by 5-10%. You are running a DELETE FROM table WHERE processing_date = '2023-10-27' before your INSERT. In a system like Snowflake or BigQuery, this is negligible. If you are running on an old-school Postgres instance with a massive table, yes, you’ll need to partition the table by date to keep the DELETE operation performant. But that’s a design choice, not a failure.

Managing the ops burden

The operational burden of append-only is a nightmare of state management. You have to maintain an "audit" table or a complex WHERE NOT EXISTS clause that slows down as the table grows. I’ve seen developers use ON CONFLICT DO NOTHING in Postgres, which is a start, but it fails to account for updates. What if the source data changed? Now your database record is stale, and your pipeline is "idempotent" but factually wrong.

Replace-by-Key shifts the burden to the schema. You need to ensure your tables are partitioned. If you are using dbt (data build tool), this is the default behavior. A dbt run --select my_model with the --full-refresh flag or the default incremental strategy using delete+insert handles the atomicity for you. If you aren't using a tool that abstracts this, you are effectively building a custom database engine on top of your pipeline. Don't do that.

Photo by Growtika on Unsplash

Failure modes and recovery

Consider the "Partial Write" failure mode. Your job hits a timeout after inserting 50% of the rows.

If you used Append-Only, you now have a partial state. If you re-run it, you have 150% of the data. You now have to manually delete the rows that snuck in during the first failed attempt. It’s a guessing game of WHERE created_at > ....

With Replace-by-Key, the failure mode is clean. Because the operation is wrapped in a BEGIN; DELETE ...; INSERT ...; COMMIT; block, the failure means the database rolls back to the state before the job started. You can hit "Retry" in Airflow or Dagster a thousand times, and the end result remains exactly the same. The "atomicity" of the transaction is your insurance policy.

What I'd pick, and why

I choose the Replace-by-Key strategy every single time.

The caveat is that you must have a deterministic way to define the "key." If your source data doesn't have a natural key or a reliable updated_at timestamp, you’re in trouble. In those cases, I enforce a processing_batch_id or a partition date as the key. If I can't guarantee a clean wipe and replace, I don't build the pipeline.

Don't listen to the people who say "this is over-engineering." They are the same people who are currently fixing a production data quality issue on a Saturday morning.

My advice: Use dbt to manage your incremental models. If you’re writing raw Python or Spark, force yourself to use the DELETE ... WHERE partition = X pattern. If your database doesn't support transactions (looking at you, some early S3-based data lakes), use a "staging-to-production" swap. Write to a temporary table, then use an atomic rename to swap it with the production table.

Sleep is worth more than the few extra dollars in compute costs. Make your jobs boring, make them transactional, and stop waking up in the middle of the night.

Cover photo by Tyler on Unsplash.