137Foundry

Posted on Jun 21

How to Add Idempotency Keys to an Existing Integration Without Breaking Live Traffic

#api #productivity #programming

Most idempotency retrofit work is on integrations that have been running in production for months or years, processing live traffic, with downstream consumers that depend on the existing message format. The challenge is not the cryptography or the design pattern. It is the deployment sequencing that lets you add keys, change behavior, and clean up edge cases without breaking anything in flight.

This guide walks through a six-step deployment sequence that works for most production retrofits. The whole sequence takes one to three weeks of calendar time per integration, depending on how many consumers you have and how aggressive your retry windows are.


Step 1: Add the Key Column Without Using It

The first deployment is a no-op from a behavioral standpoint. You add a UUID column to the outgoing event table, populate it for new rows, and leave existing rows with NULL.

The sender does not change yet. The receiver does not change yet. The only difference is that new rows now have a UUID that future code can use.

This step takes a few hours to ship and prove. You verify by inspecting the database that new rows are getting UUIDs and old rows still have NULL. If anything is wrong, you roll back the column add (or just leave it; an unused column is harmless).

This step is important because it lets you backfill the new column for existing rows during the next quiet window without time pressure. The Wikipedia overview of idempotence covers the property you are ultimately enforcing, but the deployment work is what makes the enforcement reliable in production.
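As a minimal sketch of this step, assuming a SQLite-backed table named `outgoing_events` and a column named `idempotency_key` (both names are hypothetical, chosen only for illustration), the migration and the verification query could look like this:

```python
import sqlite3
import uuid


def add_key_column(conn: sqlite3.Connection) -> None:
    """Step 1 migration: add a nullable column; existing rows keep NULL."""
    conn.execute("ALTER TABLE outgoing_events ADD COLUMN idempotency_key TEXT")


def insert_event(conn: sqlite3.Connection, payload: str) -> str:
    """Write path after the migration: each new row gets a random UUID."""
    key = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO outgoing_events (payload, idempotency_key) VALUES (?, ?)",
        (payload, key),
    )
    return key


def count_rows(conn: sqlite3.Connection) -> tuple:
    """Verification: (rows with a key, rows still NULL)."""
    return conn.execute(
        "SELECT COUNT(idempotency_key), COUNT(*) - COUNT(idempotency_key) "
        "FROM outgoing_events"
    ).fetchone()


# Demo: two legacy rows exist before the migration, one row is written after it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outgoing_events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO outgoing_events (payload) VALUES ('legacy-1')")
conn.execute("INSERT INTO outgoing_events (payload) VALUES ('legacy-2')")
add_key_column(conn)
insert_event(conn, "new-1")
print(count_rows(conn))  # (1, 2)
```

Nothing reads the column yet, which is what makes the deploy a behavioral no-op.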

Step 2: Backfill UUIDs for Existing Rows

For rows that already exist with NULL UUIDs, generate UUIDs and populate the column. This is a one-time backfill, usually run as a batch job.

The UUIDs for backfilled rows do not need to be deterministic, since these rows correspond to events that have already been processed. The only requirement is that future code can rely on every row having a non-NULL UUID.

The backfill is usually fast (a single UPDATE statement for small tables, a chunked job for large ones). Verify that no rows remain with NULL UUIDs before moving to the next step. Wikipedia's overview of extract, transform, load processes covers the broader pattern of backfilling state without disrupting live processing.
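A chunked backfill might be sketched as follows. The table and column names (`outgoing_events`, `idempotency_key`) and the chunk size are assumptions for illustration, and SQLite stands in for whatever database the integration actually uses:

```python
import sqlite3
import uuid


def backfill_keys(conn: sqlite3.Connection, chunk_size: int = 1000) -> int:
    """Fill NULL keys chunk by chunk; returns how many rows were selected."""
    total = 0
    while True:
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM outgoing_events "
                "WHERE idempotency_key IS NULL LIMIT ?",
                (chunk_size,),
            )
        ]
        if not ids:
            return total
        # The IS NULL guard keeps the job from overwriting a key that
        # another writer set in the meantime.
        conn.executemany(
            "UPDATE outgoing_events SET idempotency_key = ? "
            "WHERE id = ? AND idempotency_key IS NULL",
            [(str(uuid.uuid4()), row_id) for row_id in ids],
        )
        conn.commit()  # commit per chunk so locks are held only briefly
        total += len(ids)


def remaining_nulls(conn: sqlite3.Connection) -> int:
    """Verification: must be 0 before moving to step 3."""
    return conn.execute(
        "SELECT COUNT(*) FROM outgoing_events WHERE idempotency_key IS NULL"
    ).fetchone()[0]


# Demo: five legacy rows without keys, backfilled two at a time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outgoing_events "
    "(id INTEGER PRIMARY KEY, payload TEXT, idempotency_key TEXT)"
)
conn.executemany(
    "INSERT INTO outgoing_events (payload) VALUES (?)",
    [(f"legacy-{i}",) for i in range(5)],
)
print(backfill_keys(conn, chunk_size=2), remaining_nulls(conn))  # 5 0
```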

Step 3: Modify the Sender to Include the UUID

The sender now reads the UUID from the row and includes it in the outgoing message (header or body, depending on the message format).

The receiver still does not use the UUID. It simply receives it and ignores it. This step is also behaviorally a no-op, but it gets the data flowing through the pipeline so that downstream changes can rely on it being present.

You verify by tailing the receiver's request log and confirming that every incoming request includes a UUID.
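For an HTTP-style integration, the sender change and the log check could be sketched like this. The header name `Idempotency-Key`, the row shape, and the helper names are assumptions for illustration; a body field would work the same way:

```python
import json

KEY_HEADER = "Idempotency-Key"  # assumed header name, not mandated by the guide


def build_message(row: dict) -> tuple:
    """Sender side: copy the row's stored UUID into the outgoing headers."""
    headers = {
        "Content-Type": "application/json",
        KEY_HEADER: row["idempotency_key"],
    }
    body = json.dumps({"payload": row["payload"]})
    return headers, body


def requests_missing_key(logged_headers: list) -> int:
    """Verification: count logged requests that arrived without a key."""
    return sum(1 for headers in logged_headers if not headers.get(KEY_HEADER))


# Demo: one request built by the new sender, one from a sender not yet updated.
new_headers, _ = build_message(
    {"payload": "order-42", "idempotency_key": "key-123"}
)
old_headers = {"Content-Type": "application/json"}
print(requests_missing_key([new_headers, old_headers]))  # 1
```

The count should reach zero and stay there before the receiver starts relying on the key.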

Step 4: Add Receiver Dedup Without Enforcing It

The receiver now records every incoming UUID in a dedup table but does not yet refuse duplicates. If a duplicate UUID arrives, the receiver logs the event but processes the request as normal.

This step is critical. It lets you observe what the actual duplicate rate looks like in production before flipping the enforcement switch. The expected hit rate is between 0.1 and 1 percent. Wikipedia's entry on the saga pattern and writing from practitioners like Martin Fowler at martinfowler.com both reinforce why observation before enforcement is the safer sequencing.

Two outcomes are possible:

The hit rate is in the expected range. Move to step 5.
The hit rate is 0 percent or 10+ percent. There is a problem worth understanding before enforcement. A rate of 0 percent suggests the keys are not stable across retries (investigate sender behavior). A rate of 10 percent or more suggests the keys are colliding across distinct operations (investigate key generation logic).

Spend at least one full week in this step before proceeding. Production traffic patterns vary day to day, and a week of data is the minimum to be confident the hit rate is representative.
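The observe-only receiver can be sketched as below. This is a minimal illustration, assuming a SQLite `dedup` table and an in-process counter for the hit rate (a real deployment would more likely emit a metric); all names are hypothetical:

```python
import logging
import sqlite3

logger = logging.getLogger("receiver")


class ObservingReceiver:
    """Records every key, counts duplicates, but still processes everything."""

    def __init__(self, conn: sqlite3.Connection, process):
        self.conn = conn
        self.process = process  # the existing, unchanged request handler
        self.total = 0
        self.duplicates = 0
        conn.execute(
            "CREATE TABLE IF NOT EXISTS dedup (idempotency_key TEXT PRIMARY KEY)"
        )

    def handle(self, key: str, payload: str):
        self.total += 1
        try:
            self.conn.execute(
                "INSERT INTO dedup (idempotency_key) VALUES (?)", (key,)
            )
        except sqlite3.IntegrityError:
            # Key already recorded: log it, but do not refuse the request yet.
            self.duplicates += 1
            logger.info("duplicate idempotency key observed: %s", key)
        return self.process(payload)

    def hit_rate_percent(self) -> float:
        return 100.0 * self.duplicates / self.total if self.total else 0.0


# Demo: four requests, one of which repeats an earlier key.
processed = []
receiver = ObservingReceiver(sqlite3.connect(":memory:"), processed.append)
for key, payload in [("a", "p1"), ("b", "p2"), ("a", "p1"), ("c", "p3")]:
    receiver.handle(key, payload)
print(receiver.hit_rate_percent(), len(processed))  # 25.0 4
```

The duplicate is counted and still processed, so this deploy changes what you can see without changing what the receiver does.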

Step 5: Flip the Enforcement Switch

Once you are confident that the dedup mechanism is recognizing duplicates correctly, change the receiver to actually return the original response (without re-applying side effects) on duplicate UUIDs.

This is the moment when the integration becomes truly idempotent. From this deployment forward, retries are safe to perform.

The change is small and behaviorally observable: duplicate-side-effect incidents should drop to zero from this point on. The dedup hit rate stays roughly stable; the difference is that the hits now actually prevent the duplicate work.
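A sketch of the enforced path, under the assumption that the receiver stores the original response as text next to the key (the table layout and function names are hypothetical). It deliberately ignores concurrent requests carrying the same key, which a production version would need to handle with a transaction or a unique-constraint check:

```python
import sqlite3


def make_dedup_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dedup "
        "(idempotency_key TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )


def handle_enforced(conn: sqlite3.Connection, key: str, payload: str, process):
    """Return the stored response for a known key; otherwise process once."""
    row = conn.execute(
        "SELECT response FROM dedup WHERE idempotency_key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # duplicate: replay the original response, no side effect
    response = process(payload)
    conn.execute(
        "INSERT INTO dedup (idempotency_key, response) VALUES (?, ?)",
        (key, response),
    )
    conn.commit()
    return response


# Demo: the same key arrives twice; the side effect runs only once.
conn = sqlite3.connect(":memory:")
make_dedup_table(conn)
charges = []


def charge(payload: str) -> str:
    charges.append(payload)  # stands in for the real side effect
    return f"charged:{payload}:attempt-{len(charges)}"


first = handle_enforced(conn, "key-1", "order-42", charge)
second = handle_enforced(conn, "key-1", "order-42", charge)
print(first == second, len(charges))  # True 1
```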

Step 6: Tighten Up the Sender Retry Logic

Once enforcement is on, the sender can be more aggressive about retries because the receiver is now safe under duplicates. Increase the retry count, shorten the backoff interval, or both, depending on the operational characteristics you want.

This is also the step where you switch the sender's retry logic from "create new row" to "update existing row" if it was not already. Even with enforcement on at the receiver, a retry that creates a new row gets a different UUID, so the receiver treats it as a new operation and applies a fresh side effect. The retry has to reuse the existing row's UUID.
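A retry loop that reuses the row's key might look like the sketch below. The attempt count, the backoff values, and the choice of `ConnectionError` as the retryable failure are illustrative assumptions, not recommendations:

```python
import time


def send_with_retries(row, send, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry the same row; every attempt carries the row's existing key."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            # The key comes from the stored row and is never regenerated here.
            return send(row["idempotency_key"], row["payload"])
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise last_error


# Demo: a transport that fails twice before succeeding.
attempt_keys = []


def flaky_send(key, payload):
    attempt_keys.append(key)
    if len(attempt_keys) < 3:
        raise ConnectionError("broker unavailable")
    return "ok"


row = {"idempotency_key": "key-123", "payload": "order-42"}
result = send_with_retries(row, flaky_send, sleep=lambda seconds: None)
print(result, attempt_keys)  # ok ['key-123', 'key-123', 'key-123']
```

All three attempts present the same key, so the receiver's dedup table can recognize the later ones as duplicates.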

After this step, the integration is fully retrofitted. You should observe:

A non-zero, roughly stable dedup hit rate on the receiver (proof that retries are working).
Zero new duplicate-side-effect incidents in the receiver's data.
Faster recovery from broker outages or network instability, because the sender can retry more aggressively without risking corruption.

What to Verify Before Each Step

A few specific checks save a lot of grief:

Before step 1: Confirm you have permission to alter the outgoing events table schema. Some legacy integrations have this table managed by a separate team.

Before step 3: Confirm that the message format allows adding a new field without breaking existing consumers. For some legacy formats (fixed-length records, protocol buffer consumers with strict schema validation), this requires a separate compat-layer change first.

Before step 4: Confirm you have a place to put the receiver's dedup table and have provisioned enough storage for at least the retention window times the expected message rate.

Before step 5: Confirm the receiver's response cache returns identical results for duplicate UUIDs. If the response is rebuilt on each request rather than replayed as stored, a timestamp or other "now" field will differ between duplicates even though the side effect was skipped, which can confuse downstream consumers that check response equality.
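The storage check before step 4 is simple arithmetic: message rate times retention window times row size. The figures below (50 messages per second, a 7-day window, roughly 100 bytes per row including index overhead) are made-up inputs for illustration only; substitute your own measurements:

```python
SECONDS_PER_DAY = 86_400


def dedup_storage_bytes(messages_per_second, retention_days, bytes_per_row):
    """Rough lower bound on dedup table size for a given retention window."""
    rows = messages_per_second * SECONDS_PER_DAY * retention_days
    return rows * bytes_per_row


# Hypothetical inputs: 50 msg/s, 7 days of retention, about 100 bytes per row.
estimate = dedup_storage_bytes(50, 7, 100)
print(f"{estimate / 1e9:.2f} GB")  # 3.02 GB
```

If the receiver also stores the original response for replay, the per-row size grows by the response size, so the estimate should be redone with that included.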

Common Retrofit Failure Modes

A few patterns that reliably cause problems:

Skipping the observation phase. Going from "no UUIDs" to "enforce on receiver" in two deploys means you find out about key generation bugs only after enforcement turns them into visible errors. The observation phase exists to find these bugs in a safe window.

Generating UUIDs at send time instead of event time. The retrofit hooks the UUID into the wrong place, and the dedup hit rate stays at zero even after step 4. The fix is to move generation upstream to the source-of-truth event. The 137Foundry data integration practice treats event-time generation as the default for any new integration we ship, because the retrofit cost when this is wrong is meaningfully larger than the design cost to get it right from the start.

Backfilling UUIDs for already-processed rows in a way that collides with future generation. Make sure the backfill uses a UUID scheme that cannot accidentally produce the same value as future event-time generation. Standard random UUIDs handle this; sequential or time-based schemes can collide if not handled carefully.

Forgetting to clean up the dedup table. Without TTL or partitioning, the dedup table grows unbounded and eventually becomes the bottleneck. Add cleanup as part of step 4, not as a follow-up.
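A periodic cleanup job for the dedup table could be sketched as follows, assuming each row stores a `first_seen` Unix timestamp (the column name and the retention value are hypothetical). The retention window should be longer than the sender's longest possible retry span, otherwise a late retry would be treated as a new operation:

```python
import sqlite3
import time


def purge_expired(conn: sqlite3.Connection, retention_seconds: float, now=None) -> int:
    """Delete dedup rows older than the retention window; returns rows removed."""
    now = time.time() if now is None else now
    cursor = conn.execute(
        "DELETE FROM dedup WHERE first_seen < ?", (now - retention_seconds,)
    )
    conn.commit()
    return cursor.rowcount


# Demo with fixed timestamps so the result does not depend on the clock.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dedup (idempotency_key TEXT PRIMARY KEY, first_seen REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO dedup (idempotency_key, first_seen) VALUES (?, ?)",
    [("old-1", 100.0), ("old-2", 200.0), ("recent", 900.0)],
)
removed = purge_expired(conn, retention_seconds=500, now=1000.0)
print(removed)  # 2
```

An index on `first_seen`, or time-based partitioning on databases that support it, keeps the delete itself from becoming expensive as the table grows.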

Why the Sequencing Matters

The six-step sequence works because each step is independently shippable and observably safe. Any single step can be rolled back without affecting the others. The pipeline can sit indefinitely in step 4 (recording but not enforcing) if the team needs more time to validate.

The alternative (a big-bang retrofit that adds UUIDs, enforces dedup, and changes retry behavior in one deploy) is much more dangerous because the failure modes interact in ways that are hard to debug in production.

For more depth on the underlying patterns, the longer reference on how to handle idempotency in data integration pipelines covers the design patterns the retrofit is aiming to deliver, and the 137Foundry services overview covers how this work fits into broader integration engineering.

The Honest Framing

A retrofit done right is invisible to upstream and downstream consumers. The pipeline just becomes more resilient over the course of a few weeks, without any breaking changes or visible behavior shifts.

A retrofit done in a rush usually breaks something in the middle and produces a postmortem that lasts longer than the deployment itself. The slow path is the fast path.

The six-step sequence is not the only valid approach, but it has worked across enough production retrofits to recommend as a default. Variations on timing or step ordering are fine. Skipping the observation phase is not.
