How to Design Idempotency Keys That Survive Upstream Event Format Changes

#dataengineering #integration #reliability #api

A retry-safe data integration is built on idempotency keys. The first retry is easy: the integration sees an event it already processed, looks up the existing downstream record, and skips the write. The hundredth retry, six months after the upstream system silently changed its event payload format, is where the design either holds up or quietly produces duplicates.

This piece is about the second case: what makes an idempotency key strategy resilient to the changes that happen to upstream event payloads over time, and what to avoid in the initial design so that the future format change does not turn into a duplicate-record incident.

What an idempotency key actually has to do

The job of an idempotency key is to give the integration a stable way to ask "have I already processed this event?" The answer has to be correct on every retry, every replay, and across every code deploy between the first attempt and the last.

There are two common shapes for the key:

A field provided by the upstream system. The most common version is an event ID that the upstream system stamps on every event before delivering it. Stripe events, Shopify webhooks, GitHub webhooks, and most modern SaaS event systems all provide one of these. If the upstream system commits to never reusing the ID, the key is the ID, and the integration's job is to remember which IDs it has seen.
A hash computed by the integration over the event payload. Used when the upstream system does not provide a stable ID. The integration takes the relevant fields of the payload, normalizes them, and hashes them. The hash becomes the key.

The first shape is strictly better when it is available, because the key is a contract the upstream system has agreed to honor. The second shape is necessary when the upstream system does not provide one, and the entire reliability story rests on whether the hash inputs and the normalization are stable across format changes.
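When the upstream system does provide an event ID, "remember which IDs it has seen" can be as small as one table with a uniqueness constraint. The sketch below is a minimal illustration, not a reference implementation: it uses Python's built-in sqlite3 module so it runs on its own, and the table name `processed_events` and the function names are assumptions made for the example.

```python
import sqlite3


def open_seen_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the table that remembers which upstream event IDs were handled."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events ("
        " event_id TEXT PRIMARY KEY,"
        " processed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Return True the first time event_id is seen, False on any retry.

    The PRIMARY KEY constraint turns "check, then record" into one
    statement, so two deliveries of the same ID cannot both claim it.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
    return cur.rowcount == 1


conn = open_seen_store()
first = claim_event(conn, "evt_123")   # first delivery
retry = claim_event(conn, "evt_123")   # upstream retry of the same event
```

In a real integration the claim and the downstream write have to be ordered or wrapped in a transaction so that a crash between the two does not lose the event. That part is outside this sketch.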

Why payload-hash keys silently break

A payload-hash key computed over the whole payload works perfectly until the upstream system adds a new field to the event payload. The hash inputs change. The hash changes. The key is different. The "have I seen this before" check returns false, and the integration processes what is actually the same event a second time, producing a duplicate downstream record.

This failure mode is invisible. There is no error, no warning in the logs, no alert. The integration processes the second copy as if it were a new event, the downstream record is created, and the upstream system gets the success response it expected. Three months later someone notices the duplicates in the report.

The root cause is that the hash was computed over fields that the upstream system was free to change. The fix is to be deliberate about which fields go into the hash and to treat that set as a contract with the upstream system, even when the upstream system has not formally agreed to it.
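The failure is easy to reproduce. The sketch below hashes the same event twice, once before and once after a hypothetical `currency` field shows up, using both a whole-payload key and a key restricted to two fields. The event shape and field names are invented for illustration.

```python
import hashlib
import json


def whole_payload_key(event: dict) -> str:
    """Fragile: hashes every field, including ones upstream may add later."""
    blob = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def selected_fields_key(event: dict) -> str:
    """Hashes only the fields treated as the contract (assumed names)."""
    subset = {name: event[name] for name in ("event_type", "order_id")}
    blob = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


original = {"event_type": "order.paid", "order_id": "ord_42", "amount": 1999}
# The same event, redelivered after upstream started sending a new field.
redelivered = {**original, "currency": "USD"}

whole_changed = whole_payload_key(original) != whole_payload_key(redelivered)
selected_changed = selected_fields_key(original) != selected_fields_key(redelivered)
```

Here `whole_changed` is true and `selected_changed` is false: the whole-payload key treats the redelivery as a new event, while the restricted key still recognizes it.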

For general background on idempotent operations, see the Wikipedia article on idempotence. The engineering practices around idempotency keys for HTTP APIs are well covered in the Stripe API documentation, which is the closest thing the industry has to a canonical reference implementation. The choice of hash function (SHA-256, BLAKE2, etc.) is covered in the Wikipedia article on cryptographic hash functions, and the data store most of these key-tracking tables sit on top of is PostgreSQL.

The four rules for a stable payload-hash key

If you have to compute the key yourself, these four rules will save you the duplicate-record incident.

Rule 1: Include only fields that the upstream system has committed to keeping stable, meaning it will not rename them, remove them, or change their type. Usually this is the entity identifiers (customer ID, order ID, line item ID) and the event type. It is not "the whole payload."

Rule 2: Normalize aggressively. Trim whitespace, lowercase strings that should be case-insensitive, sort lists into a canonical order, omit null fields entirely instead of including them as nulls. Two events that should produce the same key should produce the same key regardless of cosmetic differences in serialization.
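As a rough sketch of what Rule 2 can look like in code, the function below trims strings, lowercases the fields the caller declares case-insensitive, sorts lists, and drops nulls. The function name and the `case_insensitive` parameter are made up for this example, and the list handling assumes list items are strings whose order carries no meaning.

```python
from typing import Any


def normalize_fields(
    fields: dict[str, Any],
    case_insensitive: frozenset[str] = frozenset(),
) -> dict[str, Any]:
    """Return a copy of fields with cosmetic differences removed."""
    normalized: dict[str, Any] = {}
    for name, value in fields.items():
        if value is None:
            continue  # omit null fields entirely instead of hashing a null
        if isinstance(value, str):
            value = value.strip()
            if name in case_insensitive:
                value = value.lower()
        elif isinstance(value, (list, tuple)):
            # Assumption: items are strings and their order is not meaningful.
            value = sorted(item.strip() for item in value)
        normalized[name] = value
    return normalized


a = normalize_fields(
    {"email": "  Ada@Example.com ", "tags": ["b", "a"], "note": None},
    case_insensitive=frozenset({"email"}),
)
b = normalize_fields(
    {"tags": ["a", "b "], "email": "ada@example.com"},
    case_insensitive=frozenset({"email"}),
)
```

Both calls produce the same dictionary, so a hash computed over either one yields the same key.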

Rule 3: Pin the hash algorithm and the input format in writing. The hash algorithm (SHA-256, blake2b, whatever) and the exact wire format the fields are serialized into before hashing (sorted JSON with no whitespace, or a specific concatenation order) are part of the contract. They cannot change without coordinated downstream cleanup, because changing them invalidates every existing key.

Rule 4: Stamp the schema version into the key itself. The key is not hash(fields); it is v1:hash(fields). When the upstream system makes a backwards-incompatible change and you have to change the input set, the new key is v2:hash(new_fields). Both keys can coexist in the log without colliding, and the migration can happen gradually.

These four rules are the difference between an idempotency key strategy that holds up for years and one that produces a duplicate incident at the next upstream system change.
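Put together, the four rules fit in one small function. This is a minimal sketch under stated assumptions: the field names in `KEY_FIELDS` are hypothetical, SHA-256 over compact sorted JSON stands in for whatever algorithm and wire format you pin, and the normalization is reduced to trimming so the example stays short.

```python
import hashlib
import json

KEY_VERSION = "v1"
# Hypothetical contract fields; use whatever upstream has committed to keep.
KEY_FIELDS = ("event_type", "customer_id", "order_id")


def idempotency_key(event: dict) -> str:
    """Build a versioned key from the contract fields only."""
    # Rule 1: read only the contract fields. A missing field raises KeyError,
    # which is louder than silently hashing an incomplete input.
    # Rule 2 (reduced): coerce to trimmed strings, so 42 and "42 " match.
    subset = {name: str(event[name]).strip() for name in KEY_FIELDS}
    # Rule 3: the wire format is pinned: sorted keys, no whitespace, UTF-8.
    blob = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(blob.encode("utf-8")).hexdigest()
    # Rule 4: the schema version is part of the key itself.
    return f"{KEY_VERSION}:{digest}"


key_before = idempotency_key(
    {"event_type": "order.paid", "customer_id": "cus_9", "order_id": "ord_42"}
)
# Same event after upstream added a field and reordered the payload.
key_after = idempotency_key(
    {"order_id": "ord_42", "currency": "USD",
     "customer_id": "cus_9", "event_type": "order.paid"}
)
```

The two keys are identical, because neither the extra field nor the field order reaches the hash.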

How to handle the inevitable upstream change

Eventually the upstream system will add a field, change a field's type, or rename a field. The integration's idempotency key cannot quietly absorb the change; it has to handle it explicitly.

The recommended procedure:

Detect the change. A schema drift detector that watches the incoming event payloads and alerts when a new field appears or an existing field's type changes is the first defense. Most teams find out about upstream changes from downstream report anomalies. The schema drift detector finds out at the moment the first event with the new format arrives.
Decide whether the new field is hash-relevant. A new optional field that does not affect the downstream record is not hash-relevant; the existing key continues to work, and the new field is ignored at the hash step. A new required field that changes the meaning of the event is hash-relevant; the key needs to incorporate it.
If hash-relevant, mint a new key version. The integration starts emitting v2:hash(...) keys for new events. Existing v1:... records in the log are unaffected.
Backfill the new key onto existing records, gradually. A background job walks the log, recomputes the new-version key for each existing event, and writes it into a secondary index. Events that predate the format change will not carry the new field, so the backfill needs a fixed, documented default for it. The check-then-act lookup becomes "look up by either v1 or v2 key; if either matches, the event has been processed."
Eventually retire v1. Once the backfill is complete and the v2 key is the authoritative one, the v1 key can be deleted. This step is usually months away from the format change, and that is fine. There is no rush.
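The "either v1 or v2" lookup from the procedure above can be sketched as follows. The field sets, the in-memory `seen` set standing in for the key log, and the function names are all assumptions made for the example. A real integration would run the same check against its key table.

```python
import hashlib
import json
from typing import Optional

# Hypothetical field sets: v2 adds a field that upstream introduced later.
KEY_FIELDS = {
    "v1": ("event_type", "order_id"),
    "v2": ("event_type", "order_id", "line_item_id"),
}
CURRENT_VERSION = "v2"


def make_key(version: str, event: dict) -> Optional[str]:
    """Return the versioned key, or None if the event lacks a required field."""
    fields = KEY_FIELDS[version]
    if any(name not in event for name in fields):
        return None
    subset = {name: str(event[name]).strip() for name in fields}
    blob = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(blob.encode("utf-8")).hexdigest()
    return f"{version}:{digest}"


def already_processed(event: dict, seen_keys: set) -> bool:
    """True if any key version computed for this event is in the log."""
    keys = (make_key(version, event) for version in KEY_FIELDS)
    return any(key in seen_keys for key in keys if key is not None)


def mark_processed(event: dict, seen_keys: set) -> None:
    """Record new events under the current key version only."""
    key = make_key(CURRENT_VERSION, event)
    if key is None:
        raise ValueError("event is missing a field required by the current key")
    seen_keys.add(key)


seen: set = set()
old_event = {"event_type": "order.paid", "order_id": "ord_42"}
seen.add(make_key("v1", old_event))  # processed before the format change

replay = {**old_event, "line_item_id": "li_7"}  # same event, new format
replay_seen = already_processed(replay, seen)

fresh = {"event_type": "order.paid", "order_id": "ord_43", "line_item_id": "li_1"}
fresh_seen_before = already_processed(fresh, seen)
mark_processed(fresh, seen)
fresh_seen_after = already_processed(fresh, seen)
```

The replayed event is recognized through its v1 key even though it now arrives in the new format, while the fresh event is recorded under v2 only.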

This is more work than "rebuild the integration from scratch with the new format," but it is the only path that preserves the existing downstream records and the existing idempotency guarantees.

When a fresh integration is the right answer

Sometimes the upstream system has changed so substantially that the existing idempotency keys are not salvageable. The event format is unrecognizable, the entity IDs have changed, the semantics have shifted. In those cases, the right move is to build a new integration alongside the old one, run them in parallel for a window, and cut over when the new one has caught up.

The old integration keeps its keys and continues to handle events from before the cutover. The new integration has fresh keys and handles everything after. The two never need to talk to each other. The risk of a duplicate is bounded by the cutover window, not by the entire history of the integration.

This pattern is more common than people expect. The "we will migrate the keys in place" approach is usually possible, but the engineering cost is higher than just running both integrations in parallel for a quarter. If the upstream change is large enough to require thinking, it is large enough to consider the parallel-integration option.

The defensive checks that catch leakage

Even with the four rules and the migration procedure, the integration needs a few defensive checks that fire when the idempotency story breaks down:

Daily duplicate-detection report. A SQL query, run nightly, that finds downstream records created within the same minute with the same business identifier. If it returns any rows, an alert fires.
Key-collision metric. Count the number of events whose key matched an existing log row in the last hour. The number should be small (retries, replays). A sudden spike that does not line up with an upstream retry storm or a deliberate replay is a sign that the key is too lax and unrelated events are colliding.
New-format detection. A check that compares the field set of today's events against the field set of last week's events. Any new field triggers a review of whether it is hash-relevant.
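Two of these checks, the duplicate report and the new-format comparison, can be sketched in a few lines. The table name `downstream_records`, its columns, and the function names are assumptions for the example, and sqlite3 is used only so the sketch runs on its own. The query buckets by calendar minute, which is a simplification: two records a few seconds apart on either side of a minute boundary would not be grouped together.

```python
import sqlite3

DUPLICATE_QUERY = """
SELECT business_id,
       strftime('%Y-%m-%d %H:%M', created_at) AS minute_bucket,
       COUNT(*) AS copies
FROM downstream_records
GROUP BY business_id, minute_bucket
HAVING COUNT(*) > 1
ORDER BY business_id, minute_bucket
"""


def find_duplicates(conn: sqlite3.Connection) -> list:
    """Rows of (business_id, minute, copies) for suspected duplicates."""
    return conn.execute(DUPLICATE_QUERY).fetchall()


def new_fields(last_week_events: list, today_events: list) -> set:
    """Top-level field names seen today that never appeared last week."""
    known = set().union(*(event.keys() for event in last_week_events))
    current = set().union(*(event.keys() for event in today_events))
    return current - known


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE downstream_records (business_id TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO downstream_records VALUES (?, ?)",
    [
        ("ord_42", "2024-05-01 10:15:03"),
        ("ord_42", "2024-05-01 10:15:41"),  # second copy, same minute
        ("ord_43", "2024-05-01 10:16:00"),
    ],
)
duplicates = find_duplicates(conn)

added = new_fields(
    [{"event_type": "order.paid", "order_id": "ord_42"}],
    [{"event_type": "order.paid", "order_id": "ord_43", "currency": "USD"}],
)
```

Any row in `duplicates` or any name in `added` is a prompt for a human to look, not proof of a problem. The field comparison only inspects top-level keys, so nested changes would need a recursive walk.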

Together, these three checks catch most of the failure modes that even a careful key design fails to prevent in advance. None of them is expensive to build, which makes them cheap insurance.

For the broader design of the replay system that depends on stable idempotency keys (the event log schema, the dry-run mode, the rate limits, the time-window scoping), the 137Foundry guide on data integration replay covers each piece. The 137Foundry data integration service page has the surrounding context for the integration design as a whole, and the 137Foundry homepage lists the other engineering services we offer.