Most AI agent builders test in production.
Not because they want to — but because there's no obvious staging environment for agents. Unlike a web app, an agent's "production" is a mix of API calls, file writes, external messages, and scheduled loops. Hard to sandbox cleanly.
So agents ship to prod. And prod is where the config drift gets discovered.
Here's a pattern that fixes it.
The Dry-Run Flag
Before any consequential action — a write, a send, a delete, an API call with side effects — check a flag:
```python
import os

DRY_RUN = os.getenv("DRY_RUN", "false").lower() == "true"

def send_tweet(text):
    if DRY_RUN:
        print(f"[DRY RUN] Would post: {text[:50]}...")
        return {"status": "dry_run"}
    return twitter_client.post(text)
```
Simple. But the discipline matters: every destructive or external action checks this flag.
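One way to keep that discipline from depending on memory is to centralize the check. As a sketch (the `guarded` decorator and the `delete_file` example below are hypothetical, not from any particular framework), a decorator can wrap every side-effecting function so no one forgets the flag:

```python
import functools
import os

def guarded(action_name):
    """Wrap a side-effecting function so it no-ops when DRY_RUN is set."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Check the env var at call time, so a long-running agent
            # picks up the flag without a restart.
            if os.getenv("DRY_RUN", "false").lower() == "true":
                print(f"[DRY RUN] Would run: {action_name}")
                return {"status": "dry_run", "action": action_name}
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@guarded("delete_file")
def delete_file(path):
    os.remove(path)
    return {"status": "deleted", "path": path}
```

New destructive tools then get the guard by adding one line above their definition, instead of repeating the `if DRY_RUN:` branch in every function body.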
The Morning Audit Loop
Once you have the flag, run a dry-run loop every morning before agents go live:
```
# cron: 0 6 * * * DRY_RUN=true python3 agent_loop.py > /logs/morning-audit.log 2>&1
```
Your agents run their full logic, generate their full output — but nothing actually fires. Review the log. Look for:
- Actions that shouldn't be happening (config drift)
- Unexpected tool calls (new code paths)
- Repeated attempts at the same task (loop bugs)
- Outputs that look wrong (prompt regression)
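Reviewing by eye works, but a small helper can pre-flag the repeated-attempt case. A minimal sketch, assuming dry-run lines start with `[DRY RUN]` as in the examples above (`audit_report` and its threshold are hypothetical names, not part of any existing tool):

```python
from collections import Counter

def audit_report(log_path, repeat_threshold=3):
    """Count dry-run actions; return any that repeat enough to suggest a loop bug."""
    with open(log_path) as f:
        actions = [line.strip() for line in f if line.startswith("[DRY RUN]")]
    counts = Counter(actions)
    return {action: n for action, n in counts.items() if n >= repeat_threshold}
```

Pointing it at `/logs/morning-audit.log` turns "read the whole log" into "explain every line this prints."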
What You'll Catch
In practice, a morning dry-run catches three categories of problems:
1. Config drift — A SOUL.md edit you made last night altered how the agent reasons about what to escalate. The dry-run shows it escalating things it never used to.
2. Loop bugs — An agent retries a failed task indefinitely because error state wasn't cleared. In production this costs $40 in API calls. In dry-run it's a five-line log entry.
3. Prompt regression — A model update changed output format. Your downstream parser breaks silently. The dry-run output looks wrong immediately.
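For the prompt-regression case, diffing today's dry-run output against a known-good baseline makes format changes jump out instead of failing silently downstream. A hypothetical sketch (the function name and two-file layout are assumptions, not from the original):

```python
import difflib

def diff_against_baseline(baseline_path, today_path):
    """Unified diff of today's dry-run log against a known-good baseline log."""
    with open(baseline_path) as a, open(today_path) as b:
        return list(difflib.unified_diff(
            a.readlines(), b.readlines(),
            fromfile="baseline", tofile="today",
        ))
```

An empty diff means the output shape is unchanged; a non-empty one is worth a look before agents go live.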
Building It Into Your Agent
A minimal implementation for an OpenClaw-style agent:
```python
# In your agent's tool wrapper:
class AgentTools:
    def __init__(self, dry_run=False):
        self.dry_run = dry_run
        self.action_log = []

    def execute(self, action, fn, *args, **kwargs):
        if self.dry_run:
            self.action_log.append({"action": action, "args": str(args)[:100]})
            return {"status": "dry_run", "would_have": action}
        return fn(*args, **kwargs)
```
Now `tools.execute("post_tweet", twitter.post, text)` does the right thing in both modes.
The Broader Principle
AI agents need the same operational hygiene as servers. You wouldn't deploy a new server config without a test environment. You shouldn't deploy an updated agent without a dry-run.
Config drift in agents is subtle — a one-line SOUL.md change can alter behavior across a dozen downstream actions. The dry-run pattern makes that visible before it hits users.
Building a multi-agent system and want the full config patterns? The Library at askpatrick.co has 76+ battle-tested agent configs, updated nightly.