
Daniel Glover

Originally published at danieljamesglover.com

Dry-Run Engineering: The Simple Practice That Prevents Production Disasters

There is a post trending on Hacker News today about the --dry-run flag. Henrik Warne writes about adding it to a reporting application early in development and being surprised by how useful it became. I have been nodding along because this matches my experience exactly.

The --dry-run pattern is one of those deceptively simple engineering practices that punches well above its weight. If you have ever run rsync --dry-run before committing to a massive file sync, or used terraform plan before terraform apply, you already know the value.

What dry-run actually means

A dry-run flag tells your script to show what it would do without actually doing it. Print the files that would be deleted. Log the API calls that would be made. Display the database rows that would be updated. Then exit without changing anything.

The key principle: make it safe to run without thinking.

When a colleague asks "what will this script do?", you should be able to run it with --dry-run and show them. No risk. No cleanup needed afterwards.

Where this matters most

Database migrations

Before running a migration that modifies production data, a dry-run should output:

  • How many rows will be affected
  • Sample of the changes (first 10 rows, perhaps)
  • Any constraints that might fail
  • Estimated execution time
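The first two items on that checklist can be sketched for a SQL database. This is a minimal illustration using sqlite3; the table name and predicate are hypothetical, and constraint checks and timing estimates are left out:

```python
import sqlite3

def migration_dry_run(conn, table, where_clause):
    """Report what a destructive migration would touch, without changing rows.

    Hypothetical helper: `table` and `where_clause` are illustrative names.
    """
    cur = conn.cursor()
    # How many rows would be affected
    cur.execute(f"SELECT COUNT(*) FROM {table} WHERE {where_clause}")
    count = cur.fetchone()[0]
    # Sample of the changes (first 10 rows)
    cur.execute(f"SELECT * FROM {table} WHERE {where_clause} LIMIT 10")
    sample = cur.fetchall()
    print(f"Would update {count} rows in {table}")
    for row in sample:
        print(f"  sample: {row}")
    return count, sample
```

The same `SELECT` that powers the preview doubles as a sanity check: if the row count surprises you, the `WHERE` clause is wrong and you found out before touching anything.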

File operations

Scripts that move, rename, or delete files should preview the operations. I once watched a junior engineer accidentally delete a week of customer uploads because a cleanup script had no preview mode. That script has a --dry-run flag now.

API integrations

When your script calls external services - sending emails, posting to Slack, updating CRM records - a dry-run should log what would be sent without actually sending it. This is invaluable for testing integrations without spamming real systems.
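A minimal sketch of that gating for a webhook call, assuming a Slack-style JSON payload (the URL and payload shape are illustrative, not a real integration):

```python
import json
import urllib.request

def post_to_slack(webhook_url, message, dry_run=False):
    """Post a message to a webhook, or log the payload in dry-run mode.

    Hypothetical helper: the webhook URL and payload shape are illustrative.
    """
    payload = json.dumps({"text": message}).encode()
    if dry_run:
        # Log exactly what would be sent, then stop before any network I/O
        print(f"[dry-run] POST {webhook_url} body={payload.decode()}")
        return False
    req = urllib.request.Request(
        webhook_url, data=payload,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
    return True
```

Because the dry-run branch builds the exact payload before bailing out, it catches serialisation and templating bugs too, not just "did we mean to send this".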

Infrastructure changes

Terraform popularised plan before apply. Ansible has --check mode. Kubernetes has --dry-run=client. These tools understood that showing the diff before making changes reduces incidents significantly.

Implementation patterns

The simplest approach is a global flag that gates all side effects:

import os

def delete_old_files(directory, dry_run=False):
    # find_files_older_than is assumed to be defined elsewhere
    files = find_files_older_than(directory, days=30)

    for file in files:
        if dry_run:
            print(f"Would delete: {file}")
        else:
            os.remove(file)
            print(f"Deleted: {file}")

    verb = "would be deleted" if dry_run else "deleted"
    print(f"Total: {len(files)} files {verb}")

For more complex scripts, consider a transaction-style approach where you collect all intended actions, display them, then execute only if not in dry-run mode:

class ActionPlan:
    def __init__(self):
        self.actions = []

    def add(self, description, execute_fn):
        self.actions.append((description, execute_fn))

    def preview(self):
        for desc, _ in self.actions:
            print(f"  - {desc}")

    def execute(self, dry_run=False):
        if dry_run:
            print("Dry run - the following actions would be taken:")
            self.preview()
            return

        for desc, fn in self.actions:
            print(f"Executing: {desc}")
            fn()
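Using it might look like this. The class is repeated here so the snippet is self-contained, and the two actions are stand-ins for real work:

```python
class ActionPlan:
    def __init__(self):
        self.actions = []

    def add(self, description, execute_fn):
        self.actions.append((description, execute_fn))

    def preview(self):
        for desc, _ in self.actions:
            print(f"  - {desc}")

    def execute(self, dry_run=False):
        if dry_run:
            print("Dry run - the following actions would be taken:")
            self.preview()
            return
        for desc, fn in self.actions:
            print(f"Executing: {desc}")
            fn()

executed = []
plan = ActionPlan()
plan.add("Archive logs older than 30 days", lambda: executed.append("archive"))
plan.add("Notify the ops channel", lambda: executed.append("notify"))

plan.execute(dry_run=True)   # prints the plan; nothing runs
plan.execute(dry_run=False)  # actually runs both actions
```

A nice property of this shape is that the plan is built identically in both modes, so the preview cannot drift out of sync with what actually executes.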

The hidden benefit: better logging

Adding dry-run support forces you to think about what your script actually does. You cannot preview an action without first describing it clearly. This naturally improves your logging, error messages, and overall observability.

Scripts with good dry-run output tend to have good production logging too. The same descriptions you write for preview mode become your audit trail.

Common objections

"It adds complexity"

Yes, but minimal complexity. A single boolean flag and some conditional prints. The alternative - running scripts blind and hoping for the best - creates far more complexity when things go wrong.

"Our scripts are simple enough"

Until they are not. Adding dry-run early is trivial. Retrofitting it after an incident is embarrassing and often incomplete.

"We have staging environments"

Staging helps, but it is not the same as previewing against production data. A dry-run against your actual database shows you what will really happen, not what would happen to synthetic test data.

Making it the default

I have started making --dry-run the default for destructive scripts. You have to explicitly pass --execute or --no-dry-run to make changes. This inverts the safety model - accidents require extra effort.

# Shows what would happen (safe default)
./cleanup-old-data.py

# Actually does it (requires explicit flag)
./cleanup-old-data.py --execute
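Wiring that default is a few lines with argparse. A minimal sketch, assuming the `--execute` flag name from the example above:

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Clean up old data")
    # Destructive mode must be requested explicitly; dry-run is the default
    parser.add_argument("--execute", action="store_true",
                        help="actually perform the deletions")
    args = parser.parse_args(argv)
    args.dry_run = not args.execute
    return args
```

Forgetting the flag now costs you a log file instead of your data.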

This is particularly valuable for scripts that run via cron or automation. A misconfigured job that runs in dry-run mode by default produces logs instead of damage.

The small investment that pays dividends

Henrik Warne added --dry-run on a whim and found himself using it daily. That matches my experience. Once you have it, you use it constantly - before deployments, while debugging, when demonstrating to stakeholders, during incident response.

The pattern is old. Subversion had it. rsync has had it for decades. But it remains underused in custom scripts and internal tools. Every automation you write that modifies state should have this escape hatch.

Add the flag. Your future self will thank you.

If you are building automation that touches production systems, dry-run is just one layer of defence. Pair it with proper governance controls and solid engineering practices to keep technical debt from creeping in.

Inspired by Henrik Warne's post, which is worth reading in full.
