How to Write a Replay Tool for Failed Integration Records

#api #productivity #programming

Every team running a data integration eventually needs a replay tool. A record landed in the error queue. The underlying cause was fixed (the schema was updated, the credentials were rotated, the downstream came back up). Now you need to push the record through the integration again.

If you don't have a tool for this, someone will write ad-hoc shell scripts to copy payloads from the queue and POST them by hand. This works until it doesn't, and "until it doesn't" usually involves accidentally double-sending records, replaying with stale credentials, or processing 4,000 items in a tight loop that takes down the downstream service.

This walkthrough shows how to build a replay tool that handles the common cases safely, scales to thousands of items, and doesn't require the operator to remember any operational lore.


What "replay" actually means

A replay tool reads items from the error queue (or from any persisted record of failed events) and re-sends them through the integration. The tool's job is to make that operation safe, observable, and auditable.

The hard parts are:

Idempotency. If the original record actually did succeed on the downstream but the response was lost, replaying creates a duplicate. The downstream system or the replay tool needs to handle this.
Rate limiting. Replaying 5,000 items in a tight loop will either get you rate-limited or overwhelm the downstream service. The tool needs to throttle.
Selective replay. You usually don't want to replay everything; you want to replay items matching a filter (older than X, error class Y, integration Z).
Dry run mode. Before replaying anything, you want to see what would happen without actually doing it.
Audit trail. Every replay needs a record of what was replayed, by whom, with what result.

A tool that nails these five is one most teams will trust. A tool that ignores any of them is one operators will be afraid to run during incidents.

The interface

The tool should be a single command-line entry point with a small set of flags. A reasonable shape:

replay <integration_name> [--filter <expr>] [--dry-run | --no-dry-run] [--limit N] [--rate R] [--reason "..."]

integration_name is the queue or namespace to replay from.
--filter is an expression for which items to include (e.g., error_class=schema_validation AND created_at>2026-05-01).
--dry-run shows what would happen without doing it. Defaults to true; the operator must explicitly pass --no-dry-run (or equivalent) to actually replay.
--limit caps the number of items processed in this invocation. Defaults to something small like 50 so operators don't accidentally drain the queue.
--rate controls items per second. Defaults to a conservative number like 2/sec.
--reason is required for non-dry-run invocations. Captured in the audit log.

Making --dry-run the default and requiring an explicit opt-out is one of the most underrated design choices for any destructive tool. It costs operators one extra flag during real work and saves them from accidentally replaying against production.
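One way to sketch this interface is with Python's argparse. This is a minimal illustration, not a finished CLI: it assumes integration_name is a required positional argument, and the parse_args wrapper is a hypothetical helper that enforces the "--reason is required for real runs" rule.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="replay")
    parser.add_argument("integration_name", help="queue or namespace to replay from")
    parser.add_argument("--filter", default=None, help="which items to include")
    # BooleanOptionalAction (Python 3.9+) generates both --dry-run and
    # --no-dry-run; the default stays on the safe side.
    parser.add_argument("--dry-run", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--limit", type=int, default=50, help="max items per invocation")
    parser.add_argument("--rate", type=float, default=2.0, help="items per second")
    parser.add_argument("--reason", default=None, help="recorded in the audit log")
    return parser


def parse_args(argv):
    """Parse flags and refuse a real replay that has no reason attached."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.dry_run and not args.reason:
        parser.error("--reason is required with --no-dry-run")
    return args
```

Running with only an integration name gives a dry run capped at 50 items at 2 per second; a real replay needs both --no-dry-run and --reason.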

Idempotency: the critical piece

If the downstream system supports idempotency keys (Stripe, for example, has one of the cleanest implementations), use them. Generate a deterministic key per record (a hash of the payload plus a stable timestamp, such as the record's original creation time, works) and pass the same key on every replay attempt. The downstream deduplicates.
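A deterministic key could be derived along these lines. The function name, the choice of the record's creation timestamp as the stable timestamp, and the sha256: prefix are assumptions for illustration; the important property is that the same record always produces the same key.

```python
import hashlib
import json


def idempotency_key(payload: dict, created_at: str) -> str:
    """Derive a stable key from the payload and the record's creation time."""
    # Canonical JSON: sorted keys and fixed separators, so that dict
    # ordering and whitespace cannot change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{created_at}|{canonical}".encode("utf-8")).hexdigest()
    return f"sha256:{digest}"
```

Avoid mixing in anything that changes between attempts, such as the current time or an attempt counter, or the downstream will see every replay as a new request.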

If the downstream doesn't support idempotency keys, the tool needs to handle it locally. The pattern:

Before replaying, mark the record as "replay in progress" in your replay-state store.
Call the downstream with the payload.
If the call succeeds, mark the record as "replayed successfully" with a timestamp.
If the call fails, mark the record as "replay failed" with the error.
If the process crashes between steps, the next invocation sees the "replay in progress" mark and either retries (if you trust the downstream to deduplicate by some other means) or skips with a warning.

The state store is just a small table or key-value collection. At a few thousand rows, a Postgres table is fine. The PostgreSQL documentation for INSERT ... ON CONFLICT (upserts) is the cleanest reference for the pattern.
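The mark-call-mark pattern can be sketched as follows. SQLite stands in for the Postgres table here so the example is self-contained; the replay_state table, its columns, and the replay_once function are hypothetical names. The sketch takes the conservative branch (skip anything already in progress or succeeded) and assumes a single operator process, so it does not guard against two concurrent invocations.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS replay_state (
    record_id  TEXT PRIMARY KEY,
    status     TEXT NOT NULL,
    detail     TEXT,
    updated_at TEXT NOT NULL
)
"""


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def replay_once(conn, record_id, send):
    """Replay one record; `send` is a callable that performs the downstream call.

    Returns "succeeded", "failed", or "skipped".
    """
    with conn:  # commits the "in progress" mark before the downstream call
        row = conn.execute(
            "SELECT status FROM replay_state WHERE record_id = ?", (record_id,)
        ).fetchone()
        if row and row[0] in ("in_progress", "succeeded"):
            # Either already replayed, or a previous run crashed mid-flight.
            return "skipped"
        conn.execute(
            "INSERT OR REPLACE INTO replay_state VALUES (?, 'in_progress', NULL, ?)",
            (record_id, _now()),
        )
    try:
        send()
        status, detail = "succeeded", None
    except Exception as exc:  # record any failure rather than crash the batch
        status, detail = "failed", str(exc)
    with conn:
        conn.execute(
            "UPDATE replay_state SET status = ?, detail = ?, updated_at = ? "
            "WHERE record_id = ?",
            (status, detail, _now(), record_id),
        )
    return status
```

Records previously marked "failed" are eligible for another attempt, while a stale "in_progress" mark is skipped so that a human can decide whether the downstream call actually went through.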

For some downstream systems, you can derive idempotency from the payload's natural unique key. A Salesforce contact has an external ID; replaying the same external ID twice updates the same record. This is the easiest version when it's available.

Rate limiting

The replay tool should never overwhelm the downstream. Two layers of protection:

Token bucket per integration. The tool maintains a token bucket sized to the integration's known safe rate (e.g., 5 requests per second). Each replay attempt waits for a token. The bucket refills at the configured rate.

Adaptive backoff on 429s. If the downstream returns rate-limit errors (HTTP 429), the tool backs off multiplicatively: starting from the delay implied by the current rate, it doubles the delay between requests on each 429, capped at one request per 10 seconds, until the 429s stop. Then it slowly ramps back up.

The combination handles both the static case (you know the downstream's limit) and the dynamic case (the downstream is under load from other sources and is throttling more aggressively than usual).
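A rough sketch of the two layers working together is below. The class names, the injectable clock and sleep parameters, and the 0.9 recovery factor are all assumptions made for illustration; the doubling and the 10-second cap follow the description above.

```python
import time


class TokenBucket:
    """Blocks callers so they proceed at most `rate` times per second."""

    def __init__(self, rate, capacity=1.0, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def acquire(self):
        while True:
            now = self.clock()
            # Refill in proportion to elapsed time, never above capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            self.sleep((1.0 - self.tokens) / self.rate)


class AdaptiveBackoff:
    """Doubles the per-request delay on 429s and eases back on other responses."""

    def __init__(self, bucket, max_delay=10.0, recovery=0.9):
        self.bucket = bucket
        self.base_delay = 1.0 / bucket.rate  # the configured safe rate
        self.delay = self.base_delay
        self.max_delay = max_delay
        self.recovery = recovery

    def record(self, status_code):
        if status_code == 429:
            self.delay = min(self.max_delay, self.delay * 2)
        else:
            # Ramp back up slowly, never faster than the configured rate.
            self.delay = max(self.base_delay, self.delay * self.recovery)
        self.bucket.rate = 1.0 / self.delay
```

In the replay loop, each attempt would call bucket.acquire() before sending and backoff.record(response_status) afterwards. A capacity of 1 means no bursts, which is the cautious choice for a tool that runs during incidents.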

The Stripe API documentation has a good public writeup of how their clients handle rate limiting that's worth borrowing patterns from.

Selective replay

The filter expression is the part that operators use most. Common queries:

"All items in the schema-validation error class from the last 24 hours"
"All items for integration X older than 7 days"
"All items for customer Y" (when a customer-specific upstream fix is rolled out)
"All items where attempt_count < 3" (only items that haven't been deeply retried)

The filter expression doesn't need to be a fancy DSL. A small set of comparison operators (=, !=, >, <, IN) over a fixed set of fields is enough. If you find yourself needing JOINs, the tool has outgrown its scope; that's a different system.
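A filter language this small fits in a few dozen lines. The sketch below is one possible shape, assuming clauses are joined only by AND, items are plain dicts, and the field whitelist shown is hypothetical. Values that look numeric compare as numbers; everything else compares as strings, which works for ISO-8601 timestamps because they sort lexicographically.

```python
import operator
import re

# Hypothetical whitelist; a real tool would list its own queue fields.
ALLOWED_FIELDS = {"error_class", "created_at", "integration", "customer_id", "attempt_count"}
OPS = {"=": operator.eq, "!=": operator.ne, ">": operator.gt, "<": operator.lt}
CLAUSE = re.compile(r"^\s*(\w+)\s*(!=|=|>|<|IN\b)\s*(.+?)\s*$")
NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")


def _coerce(value):
    """Numbers compare numerically; everything else compares as text."""
    text = str(value).strip()
    return float(text) if NUMERIC.match(text) else text


def _compare(fn, item, field, value):
    if field not in item:
        return False
    left = _coerce(item[field])
    # Mixed number/text comparisons never match instead of raising.
    return type(left) is type(value) and fn(left, value)


def parse_filter(expr):
    """Compile 'a=b AND c>d' into a predicate over dict items."""
    checks = []
    for clause in re.split(r"\s+AND\s+", expr.strip()):
        match = CLAUSE.match(clause)
        if not match or match.group(1) not in ALLOWED_FIELDS:
            raise ValueError(f"bad filter clause: {clause!r}")
        field, op, raw = match.groups()
        if op == "IN":
            choices = {_coerce(v) for v in raw.strip("()").split(",")}
            checks.append(lambda item, f=field, c=choices: f in item and _coerce(item[f]) in c)
        else:
            checks.append(
                lambda item, f=field, fn=OPS[op], v=_coerce(raw): _compare(fn, item, f, v)
            )
    return lambda item: all(check(item) for check in checks)
```

Rejecting unknown fields up front matters more than the operator set: a typo in a filter should fail loudly rather than silently match nothing or everything.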

The audit trail

Every replay invocation should produce a record:

Who ran it (the operator's identity, captured from the auth context)
When it ran
What filter and limit were used
How many items were processed, succeeded, failed
The reason string

This record goes to a log table or a structured logging sink, queryable later. The point isn't just compliance; it's incident review. When something goes wrong with the integration two weeks after a replay, "did anyone run a replay recently?" is a question you want answered quickly.
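As a sketch, the audit record can be a small dataclass serialized as one JSON line per invocation. The field names are assumptions, and the operator identity is passed in rather than looked up, since how it is captured depends on your auth setup.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ReplayAudit:
    operator: str      # identity from the auth context
    integration: str
    filter_expr: str
    limit: int
    reason: str
    processed: int = 0
    succeeded: int = 0
    failed: int = 0
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def record(self, ok: bool) -> None:
        """Count one replay attempt and its outcome."""
        self.processed += 1
        if ok:
            self.succeeded += 1
        else:
            self.failed += 1

    def to_json_line(self) -> str:
        """One line per invocation, ready for a log table or logging sink."""
        return json.dumps(asdict(self), sort_keys=True)
```

Writing the line in a finally block around the replay loop means a crashed run still leaves a record of what it processed before it died.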

The OWASP Logging Cheat Sheet covers what a good audit log looks like in detail; the same principles apply here.

Dry run output

The dry run should be readable. Not just a count of items; a sample of the actual payloads, the destinations, and what the tool would do.

A reasonable dry-run output:

Would replay 247 items matching filter: error_class=schema_validation AND created_at>2026-05-01

Sample of 2 items:
  Item abc123 (created 2026-05-04 14:22): payload {id: 4421, name: "Acme Corp"}
    -> POST https://api.upstream.example/contacts
    -> idempotency key: sha256:b8f3...

  Item abc124 (created 2026-05-04 14:23): payload {id: 4422, name: "Beta Inc"}
    -> POST https://api.upstream.example/contacts
    -> idempotency key: sha256:c1d2...

  ... 245 more items not shown ...

Estimated time at 2 req/sec rate: 2m 3s
Estimated cost: $0.49 (based on integration cost per call)

The cost line is optional but appreciated. The estimated time helps operators decide whether to walk away or wait.
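The two estimate lines are simple arithmetic over the item count. A minimal sketch, with hypothetical function names and an assumed per-call cost of $0.002 (the figure implied by the sample output, not a real price):

```python
def format_duration(seconds: float) -> str:
    """Render whole seconds as '2m 3s' (or '45s' when under a minute)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m {secs}s" if minutes else f"{secs}s"


def dry_run_footer(item_count: int, rate_per_sec: float, cost_per_call=None) -> str:
    """Build the estimate lines shown at the end of a dry run."""
    eta = format_duration(item_count / rate_per_sec)
    lines = [f"Estimated time at {rate_per_sec:g} req/sec rate: {eta}"]
    if cost_per_call is not None:  # the cost line is optional
        cost = item_count * cost_per_call
        lines.append(f"Estimated cost: ${cost:.2f} (based on integration cost per call)")
    return "\n".join(lines)


print(dry_run_footer(247, 2.0, cost_per_call=0.002))
```

The time estimate assumes the configured rate holds for the whole run; if the tool backs off on 429s, the real duration will be longer.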

Putting it all together

A complete replay tool, with the design above, is around 500-800 lines of code in any reasonable language. It is one of the highest-leverage tools an integration team can build: it converts incidents from "two-engineer cleanup sprints" into "one-engineer ten-minute commands."

The longer walkthrough of the broader integration error handling design (queue shape, ownership, alerting) is at How to Design Integration Error Queues Your Team Will Actually Drain.

137Foundry's data integration team routinely builds replay tools as part of integration projects. For the broader engineering work we cover, see 137Foundry.

The biggest risk with a replay tool is building one that's so cautious (or so dangerous) that operators avoid using it. The right tool is one that operators reach for during incidents without thinking, because it's safer to run than to do the replay manually. That's the bar to design for.