How to Retrofit a Dry-Run Mode Into an Existing Data Integration

#dataengineering #integration #testing #devops

Most production data integrations have a "process the event" code path and nothing else. There is no separate "what would happen if I processed the event" path. When the operations team needs to plan a replay, a backfill, or any other bulk operation, the only way to answer "how many records will this change" is to actually run the code and see.

This is fine until the bulk operation is large, irreversible, or expensive. Then "let's just run it and see" is the wrong default, and the integration needs a dry-run mode. Adding one to an existing integration is more work than building it in from the start, but it is well-bounded work, and the payoff is the difference between confident operations and incident-driven operations.

What a dry-run mode actually does

A dry-run mode walks the same code path as a real run, with one exception: the side-effecting calls are logged rather than executed. The integration loads the event, looks up the existing downstream record, decides whether to create or update, and reports what it would do. It does not actually create or update.

The output is usually a structured summary: how many events would be processed, how many would be skipped because they were already processed, what the first few of each category look like, and what the aggregate downstream impact would be (number of records created, number updated, total cost if cost is relevant).

The key point is that the dry-run goes through the entire decision tree of the real run. A dry-run that only checks "would this be processed" without going through the field-by-field logic is half-useful. A dry-run that walks the full code path and only stops at the side-effecting call is what makes the operator confident enough to run the real version.

The architecture: a single boolean is not enough

The naive implementation of dry-run is to thread a dry_run=True boolean through every function that does I/O, and to have each function check the boolean before its side-effecting call. This works for small integrations and is the right place to start.

For larger integrations, the boolean approach scales poorly. Every new function has to remember to honor the flag. Every refactor has to thread the flag correctly. A single missed check creates a "dry-run mode produced real writes" incident, which is the worst category of bug because it betrays the user's trust in the safety mechanism.
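A minimal sketch of that failure mode, assuming a hypothetical in-memory dict as the downstream system and made-up function names (create_record, tag_record). One function honors the flag, the other accepts it and forgets to check it:

```python
# Hypothetical stand-in for the downstream system; not a real client.
downstream: dict[str, dict] = {}


def create_record(record_id: str, data: dict, dry_run: bool = False) -> None:
    # This function honors the flag correctly.
    if dry_run:
        print(f"[dry-run] would create {record_id}")
        return
    downstream[record_id] = dict(data)


def tag_record(record_id: str, tag: str, dry_run: bool = False) -> None:
    # Bug: the flag is accepted but never checked, so a dry run still writes.
    downstream.setdefault(record_id, {})["tag"] = tag


def process_event(event: dict, dry_run: bool = False) -> None:
    create_record(event["id"], event["data"], dry_run=dry_run)
    tag_record(event["id"], "imported", dry_run=dry_run)


process_event({"id": "cust-1", "data": {"name": "Ada"}}, dry_run=True)
print(downstream)  # {'cust-1': {'tag': 'imported'}}
```

The dry run left a half-written record behind. Nothing in the signature of tag_record hints at the problem, which is why this class of bug tends to survive review.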

The architectural fix is to extract the side-effecting calls into a separate layer (a "writer" or "client" object) and to have the dry-run mode swap in a no-op implementation of that layer. The processing code path does not know whether it is in dry-run mode; it calls writer.create_record(...) and writer.update_record(...) the same way regardless. The dry-run writer logs what it would do; the real writer does it.

This is the same pattern as a test double or a mock, applied to production. The processing logic gets exercised exactly as it would in a real run; only the boundary changes. The discipline becomes: every new side-effecting call goes through the writer layer, and code review catches any new call that bypasses it. There is no flag left to thread, so the missed-check bug disappears by construction; the remaining risk is a write that bypasses the layer entirely.
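As an illustration, here is one way the writer layer might look. The names (Writer, RealWriter, DryRunWriter, process_event) and the dict used as the downstream store are assumptions for the sketch, not a prescribed design:

```python
from typing import Protocol


class Writer(Protocol):
    def create_record(self, key: str, data: dict) -> None: ...
    def update_record(self, key: str, data: dict) -> None: ...


class RealWriter:
    """Performs the writes. A dict stands in for the downstream system."""

    def __init__(self, store: dict[str, dict]) -> None:
        self.store = store

    def create_record(self, key: str, data: dict) -> None:
        self.store[key] = dict(data)

    def update_record(self, key: str, data: dict) -> None:
        self.store[key].update(data)


class DryRunWriter:
    """Records what would be written and touches nothing."""

    def __init__(self) -> None:
        self.planned: list[tuple[str, str, dict]] = []

    def create_record(self, key: str, data: dict) -> None:
        self.planned.append(("create", key, dict(data)))

    def update_record(self, key: str, data: dict) -> None:
        self.planned.append(("update", key, dict(data)))


def process_event(event: dict, store: dict[str, dict], writer: Writer) -> str:
    # Reads are allowed in both modes; only writes go through the writer.
    existing = store.get(event["key"])
    if existing is None:
        writer.create_record(event["key"], event["data"])
        return "create"
    if all(existing.get(k) == v for k, v in event["data"].items()):
        return "skip"  # already processed, no change
    writer.update_record(event["key"], event["data"])
    return "update"


store = {"cust-1": {"name": "Ada"}}
events = [
    {"key": "cust-1", "data": {"name": "Ada"}},
    {"key": "cust-1", "data": {"name": "Ada L."}},
    {"key": "cust-2", "data": {"name": "Grace"}},
]
dry = DryRunWriter()
decisions = [process_event(e, store, dry) for e in events]
print(decisions)  # ['skip', 'update', 'create']
print(store)      # {'cust-1': {'name': 'Ada'}}
```

process_event has no dry-run branch at all. Swapping DryRunWriter for RealWriter(store) is the only change needed to turn the plan into real writes.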

A clean treatment of the broader pattern (separating decisions from I/O, also called "functional core, imperative shell") is in widely shared software architecture writing; for the data-integration-specific application, the Apache Kafka documentation around consumer group semantics is a fair reference for how the production case is structured, and the design pattern itself is well-discussed across the PostgreSQL ecosystem for transaction-safe migrations. The no-op writer is safe for a simple reason: it performs no writes. Idempotence is the separate property that makes the eventual real replay safe, because it allows the processing code to run the same logic multiple times during testing or replay without producing different downstream state.

Retrofitting an existing integration: the step-by-step

If your integration was not built with a writer layer, retrofitting one is a multi-week project for a small integration and a multi-month project for a large one. The work is well-defined, and it does not require rewriting any business logic.

Step 1: Inventory the side-effecting calls. Walk the codebase and list every call that writes to the downstream system. This includes database writes, HTTP POST/PUT/DELETE calls, message queue publishes, file writes, and any other operation that changes state. The list is usually shorter than people expect, often fewer than twenty distinct call sites in a non-trivial integration.

Step 2: Define the writer interface. For each distinct kind of write, define a method on the writer object. The interface should be at the right level of abstraction: not "make an HTTP call to URL X with payload Y" (too low) but "create a customer record with this data" (right). The interface is what the processing code will call, and it should read like a description of the business operation, not a description of the wire protocol.

Step 3: Build the real writer. Implement the interface against the downstream system. The real writer does what the existing code does, just behind the interface. This step is mostly mechanical, and the most common source of bugs is missing edge cases that the existing code handled inline.

Step 4: Build the dry-run writer. Implement the same interface but with no-op side effects. The dry-run writer records what it would do (the method name, the arguments, a structured summary) and returns a fake result that the processing code can use to continue. The fake result has to be plausible (a real-looking ID, a real-looking timestamp) so that the processing code does not crash on it.

Step 5: Migrate the processing code to use the writer. Replace every direct side-effecting call in the processing code with a call to the writer. This is the longest step; do it gradually, one call site at a time, with tests that confirm the behavior did not change.

Step 6: Add the dry-run entry point. Add a command-line flag or a configuration setting that controls which writer the integration uses. The default is the real writer. The dry-run mode is opt-in.

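Step 7: Verify on production data. Run the dry-run mode in production against a small batch of events, confirm that the downstream system did not change, then run the same batch for real and confirm that the dry-run output matches what the real run produced. The first check is the one that catches a missed call site that bypassed the writer layer; the second catches a dry-run writer whose fake results sent the processing code down a different branch.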
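Steps 4 and 6 can be sketched roughly as follows. The class names, the create_customer method, the --dry-run flag name, and the dict used as the downstream store are all illustrative assumptions; a real integration would substitute its own interface and client:

```python
import argparse
import uuid
from datetime import datetime, timezone


class RealCustomerWriter:
    """Step 3: the real writer. A dict stands in for the downstream system."""

    def __init__(self, store: dict[str, dict]) -> None:
        self.store = store

    def create_customer(self, data: dict) -> dict:
        record = {
            "id": str(uuid.uuid4()),
            "created_at": datetime.now(timezone.utc).isoformat(),
            **data,
        }
        self.store[record["id"]] = record
        return record


class DryRunCustomerWriter:
    """Step 4: same interface, no side effects, plausible fake results."""

    def __init__(self) -> None:
        self.calls: list[dict] = []

    def create_customer(self, data: dict) -> dict:
        self.calls.append({"method": "create_customer", "args": dict(data)})
        # Same shape as the real result so the processing code can continue.
        return {
            "id": str(uuid.uuid4()),
            "created_at": datetime.now(timezone.utc).isoformat(),
            **data,
        }


def build_parser() -> argparse.ArgumentParser:
    # Step 6: the real writer is the default; dry-run is opt-in.
    parser = argparse.ArgumentParser(description="Replay events downstream.")
    parser.add_argument("--dry-run", action="store_true",
                        help="report planned writes without performing them")
    return parser


def make_writer(args: argparse.Namespace, store: dict[str, dict]):
    return DryRunCustomerWriter() if args.dry_run else RealCustomerWriter(store)


store: dict[str, dict] = {}
writer = make_writer(build_parser().parse_args(["--dry-run"]), store)
result = writer.create_customer({"name": "Ada"})
print(writer.calls)  # [{'method': 'create_customer', 'args': {'name': 'Ada'}}]
print(store)         # {}
```

The empty store after the call is a small-scale version of the Step 7 check: a dry run that leaves the downstream state untouched.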

The output format matters more than the mechanism

A dry-run that produces a wall of debug log lines is technically a dry-run, and it is operationally useless. The operator cannot read it and cannot summarize it.

A useful dry-run produces a structured summary at the top, a sample of the first few events in each category in the middle, and a path to detailed logs at the bottom. The summary is what the operator reads first; the sample is what they read to confirm the summary makes sense; the detailed logs are what they reach for only if something looks wrong.

A concrete example of what a summary should look like:

Dry run summary
Window: 2026-06-30 09:00 UTC to 2026-06-30 17:00 UTC
Events evaluated: 4,283
Would create new records: 17
Would update existing records: 1,124
Would skip (already processed, no change): 3,142
Estimated downstream API calls: 1,141
Estimated cost: $0.46

Seven lines under the title that tell the operator everything they need to decide whether to run the real version. The sample below those seven lines shows the first three records of each category. The detailed log is one click away if needed.
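A summary like this can be rendered from a handful of counters. The sketch below assumes a hypothetical DryRunSummary class, one downstream API call per create or update, and a flat per-call cost of $0.0004; the per-call figure is chosen only so that the example reproduces the numbers above:

```python
from dataclasses import dataclass


@dataclass
class DryRunSummary:
    window: str
    would_create: int
    would_update: int
    would_skip: int
    cost_per_call: float  # assumption: flat price per downstream API call

    @property
    def events_evaluated(self) -> int:
        return self.would_create + self.would_update + self.would_skip

    @property
    def api_calls(self) -> int:
        # Assumption: each create or update costs exactly one API call.
        return self.would_create + self.would_update

    def render(self) -> str:
        return "\n".join([
            "Dry run summary",
            f"Window: {self.window}",
            f"Events evaluated: {self.events_evaluated:,}",
            f"Would create new records: {self.would_create:,}",
            f"Would update existing records: {self.would_update:,}",
            f"Would skip (already processed, no change): {self.would_skip:,}",
            f"Estimated downstream API calls: {self.api_calls:,}",
            f"Estimated cost: ${self.api_calls * self.cost_per_call:.2f}",
        ])


summary = DryRunSummary(
    window="2026-06-30 09:00 UTC to 2026-06-30 17:00 UTC",
    would_create=17,
    would_update=1124,
    would_skip=3142,
    cost_per_call=0.0004,
)
text = summary.render()
print(text)
```

Deriving the totals from the three category counters, instead of counting them separately, keeps the lines of the summary from disagreeing with one another.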

When the dry-run lies

The most common way a dry-run mode misleads the operator is when a write would have fed a later decision that the dry-run cannot replay. For example: the integration creates customer A; the downstream system assigns customer A an internal ID; subsequent events for customer A include that ID. The dry-run does not actually create customer A, so it cannot evaluate those subsequent events correctly.

For most integrations this is a corner case and the dry-run summary is right within a small error margin. For some integrations (especially ones with auto-assigned IDs that affect downstream routing), the dry-run mode has to be more sophisticated: it has to maintain a "would have created" registry and consult it during evaluation of subsequent events in the same window.
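A minimal sketch of such a registry, assuming events carry a natural key (here a hypothetical customer_key field) and that a placeholder ID is enough for the later decisions. DryRunRegistry and evaluate are illustrative names:

```python
import itertools
from typing import Optional


class DryRunRegistry:
    """Remembers records the dry run would have created, by natural key."""

    def __init__(self) -> None:
        self._ids: dict[str, str] = {}
        self._counter = itertools.count(1)

    def register(self, key: str) -> str:
        # Placeholder ID; the real one would be assigned downstream.
        placeholder = f"pending-{next(self._counter)}"
        self._ids[key] = placeholder
        return placeholder

    def lookup(self, key: str) -> Optional[str]:
        return self._ids.get(key)


def evaluate(events: list[dict], existing_ids: dict[str, str],
             registry: DryRunRegistry) -> list[str]:
    decisions = []
    for event in events:
        key = event["customer_key"]
        # Consult real downstream state first, then the would-have-created set.
        known_id = existing_ids.get(key) or registry.lookup(key)
        if known_id is None:
            registry.register(key)
            decisions.append("create")
        else:
            decisions.append("update")
    return decisions


events = [
    {"customer_key": "A"},
    {"customer_key": "A"},  # depends on the create above
    {"customer_key": "B"},
]
decisions = evaluate(events, existing_ids={"B": "id-42"}, registry=DryRunRegistry())
print(decisions)  # ['create', 'update', 'update']
```

Without the registry, the second event for customer A would be counted as another create, and the summary would overstate new records by one for every such follow-up event.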

The honest framing is to document what the dry-run captures and what it does not, and to surface the limitations in the output. A dry-run that says "this is accurate within ±10 records of the real run" is more useful than one that pretends to be exactly accurate and silently is not.

For more on the broader replay design that the dry-run mode plugs into (the event log schema, the idempotency keys, the rate limits, the time-window scoping), the full guide on data integration replay at https://137foundry.com covers the surrounding architecture. The 137Foundry data integration service page has the architectural context, and the rest of the 137Foundry articles cover related reliability patterns like schema drift, error queues, and idempotency for retry-prone webhooks.