Most AI agent builders test in production.
Not because they want to — but because there's no obvious staging environment for agents. Unlike a web app, an agent's "production" is a mix of API calls, file writes, external messages, and scheduled loops. Hard to sandbox cleanly.
So agents ship to prod. And prod is where the config drift gets discovered.
Here's a pattern that fixes it.
The Dry-Run Flag
Before any consequential action — a write, a send, a delete, an API call with side effects — check a flag:
```python
import os

DRY_RUN = os.getenv("DRY_RUN", "false").lower() == "true"

def send_tweet(text):
    if DRY_RUN:
        print(f"[DRY RUN] Would post: {text[:50]}...")
        return {"status": "dry_run"}
    return twitter_client.post(text)
```
Simple. But the discipline matters: every destructive or external action checks this flag.
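One way to keep that discipline from depending on memory is to centralize the check. As a sketch (the `guarded` decorator and the `delete_file` example below are hypothetical, not from any particular framework), a decorator can wrap every side-effecting function so no one forgets the flag:

```python
import functools
import os

def guarded(action_name):
    """Wrap a side-effecting function so it no-ops when DRY_RUN is set."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Check the env var at call time, so a long-running agent
            # picks up the flag without a restart.
            if os.getenv("DRY_RUN", "false").lower() == "true":
                print(f"[DRY RUN] Would run: {action_name}")
                return {"status": "dry_run", "action": action_name}
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@guarded("delete_file")
def delete_file(path):
    os.remove(path)
    return {"status": "deleted", "path": path}
```

New destructive tools then get the guard by adding one line above their definition, instead of repeating the `if DRY_RUN:` branch in every function body.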
The Morning Audit Loop
Once you have the flag, run a dry-run loop every morning before agents go live:
```
# cron: 0 6 * * * DRY_RUN=true python3 agent_loop.py > /logs/morning-audit.log 2>&1
```
Your agents run their full logic, generate their full output — but nothing actually fires. Review the log. Look for:
- Actions that shouldn't be happening (config drift)
- Unexpected tool calls (new code paths)
- Repeated attempts at the same task (loop bugs)
- Outputs that look wrong (prompt regression)
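Reviewing by eye works, but a small helper can pre-flag the repeated-attempt case. A minimal sketch, assuming dry-run lines start with `[DRY RUN]` as in the examples above (`audit_report` and its threshold are hypothetical names, not part of any existing tool):

```python
from collections import Counter

def audit_report(log_path, repeat_threshold=3):
    """Count dry-run actions; return any that repeat enough to suggest a loop bug."""
    with open(log_path) as f:
        actions = [line.strip() for line in f if line.startswith("[DRY RUN]")]
    counts = Counter(actions)
    return {action: n for action, n in counts.items() if n >= repeat_threshold}
```

Pointing it at `/logs/morning-audit.log` turns "read the whole log" into "explain every line this prints."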
What You'll Catch
In practice, a morning dry-run catches three categories of problems:
1. Config drift — A SOUL.md edit you made last night altered how the agent reasons about what to escalate. The dry-run shows it escalating things it never used to.
2. Loop bugs — An agent retries a failed task indefinitely because error state wasn't cleared. In production this costs $40 in API calls. In dry-run it's a five-line log entry.
3. Prompt regression — A model update changed output format. Your downstream parser breaks silently. The dry-run output looks wrong immediately.
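For the prompt-regression case, diffing today's dry-run output against a known-good baseline makes format changes jump out instead of failing silently downstream. A hypothetical sketch (the function name and two-file layout are assumptions, not from the original):

```python
import difflib

def diff_against_baseline(baseline_path, today_path):
    """Unified diff of today's dry-run log against a known-good baseline log."""
    with open(baseline_path) as a, open(today_path) as b:
        return list(difflib.unified_diff(
            a.readlines(), b.readlines(),
            fromfile="baseline", tofile="today",
        ))
```

An empty diff means the output shape is unchanged; a non-empty one is worth a look before agents go live.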
Building It Into Your Agent
A minimal implementation for an OpenClaw-style agent:
```python
# In your agent's tool wrapper:
class AgentTools:
    def __init__(self, dry_run=False):
        self.dry_run = dry_run
        self.action_log = []

    def execute(self, action, fn, *args, **kwargs):
        if self.dry_run:
            self.action_log.append({"action": action, "args": str(args)[:100]})
            return {"status": "dry_run", "would_have": action}
        return fn(*args, **kwargs)
```
Now `tools.execute("post_tweet", twitter.post, text)` does the right thing in both modes.
The Broader Principle
AI agents need the same operational hygiene as servers. You wouldn't deploy a new server config without a test environment. You shouldn't deploy an updated agent without a dry-run.
Config drift in agents is subtle — a one-line SOUL.md change can alter behavior across a dozen downstream actions. The dry-run pattern makes that visible before it hits users.
Building a multi-agent system and want the full config patterns? The Library at askpatrick.co has 76+ battle-tested agent configs, updated nightly.