Run Your Agent in Shadow Mode Before You Trust It With Production

#hermeschallenge #ai #python #agents

The new agent version looked good in staging. The test suite passed. The team was ready to ship.

Then someone asked: "What exactly would it do on the last 30 days of real traffic?"

Nobody knew. Staging uses synthetic data. Real traffic is messier. The new version might handle it fine. Or it might try to delete records, send emails, or call APIs in ways nobody anticipated.

Shadow mode gives you the answer without the consequences.

The Shape of the Fix

from agent_shadow_mode import ShadowAgent

# Your real tools
tools = {
    "create_ticket": create_ticket,
    "update_record": update_record,
    "send_email": send_email,
}

# Wrap in shadow mode
agent = ShadowAgent(tools=tools, audit_path="./shadow-audit.jsonl")
agent.enable()  # shadow mode ON

# Run against real traffic
response = run_agent_with_tools(agent.tools, user_message)

# Check what it would have done
for entry in agent.read_audit():
    print(f"{entry['tool']}: {entry['args']}")

# Output: 
# create_ticket: {'title': 'Login issue', 'priority': 'high'}
# send_email: {'to': 'support@co.com', 'subject': 'Ticket created'}

The agent thinks the tools ran. The tools did not. The audit log captures every intended action. You review, confirm the behavior looks correct, then flip shadow mode off.

What It Does NOT Do

agent-shadow-mode does not simulate tool responses. Tools in shadow mode return a default value (configurable, default None). If your agent makes decisions based on tool return values, behavior in shadow mode may differ from behavior with real tools.

It does not integrate with your existing logging stack automatically. The audit log is a local JSONL file. If you want audit entries in your centralized log, you read the JSONL and forward entries yourself.
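One way to do that forwarding is sketched below. It assumes only what is described above, a JSONL file with one audit entry per line; `forward_audit` is a hypothetical helper written for this example, not part of agent-shadow-mode.

```python
import json
import logging
from pathlib import Path


def forward_audit(audit_path: str, logger: logging.Logger) -> int:
    """Forward each JSONL audit entry to `logger`; return how many were sent.

    Hypothetical helper for illustration, not part of agent-shadow-mode.
    """
    path = Path(audit_path)
    if not path.exists():
        return 0
    forwarded = 0
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            # Pass the entry as a format argument: stdlib logging does not
            # accept arbitrary keyword arguments such as tool= or args=.
            logger.info("shadow_action %s", json.dumps(entry, sort_keys=True))
            forwarded += 1
    return forwarded
```

Attach whatever handler ships records to your central log to the logger you pass in; the helper itself only reads the file and emits one record per entry.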

It does not validate that the agent's intended actions are correct. It records them. Correctness review is a human step.

Inside the Library

ShadowAgent wraps each tool in a closure that either calls the real function or records to the audit log, depending on the current mode:

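import json
import threading
import time
from typing import Callable


class ShadowAgent:
    def __init__(self, tools: dict, audit_path: str, fake_return=None):
        self._real_tools = tools
        self._audit_path = audit_path
        self._fake_return = fake_return
        self._shadow = False  # real execution until enable() is called
        self._lock = threading.Lock()  # guards appends to the audit file
        self.tools = self._wrap_tools()

    def _wrap_tools(self) -> dict:
        # Assumed implementation: the excerpt calls this helper without showing it.
        return {name: self._wrap_tool(name, fn) for name, fn in self._real_tools.items()}

    def _wrap_tool(self, name: str, fn: Callable) -> Callable:
        def wrapped(*args, **kwargs):
            if self._shadow:
                # Shadow mode: record the intended call instead of running it.
                entry = {
                    "ts": time.time(),
                    "tool": name,
                    # Keyword arguments if any were passed, else the positional tuple.
                    "args": kwargs or args,
                    "shadow_mode": True,
                }
                with self._lock:
                    with open(self._audit_path, "a") as f:
                        f.write(json.dumps(entry) + "\n")
                return self._fake_return
            # Shadow mode off: pass straight through to the real tool.
            return fn(*args, **kwargs)
        return wrapped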

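enable() and disable() flip self._shadow. The wrapper reads the flag without a lock, which is safe in CPython because reading a bool attribute is atomic, and it serializes appends to the audit file with self._lock. A call already in flight when the flag flips completes in whichever mode it observed.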
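The locking pattern can be exercised on its own. The sketch below is a stand-in, not the library's test suite: `make_shadow_tool` and `run_concurrent_calls` are hypothetical names, and the stub tool only does the locked append shown in the wrapper above.

```python
import json
import os
import tempfile
import threading


def make_shadow_tool(name, audit_path, lock):
    """Return a stand-in tool that only appends its call to the audit file."""
    def wrapped(**kwargs):
        entry = {"tool": name, "args": kwargs}
        with lock:  # serialize appends so JSON lines never interleave
            with open(audit_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(entry) + "\n")
        return None
    return wrapped


def run_concurrent_calls(n_threads: int = 10) -> list:
    """Fire n_threads shadowed calls at once and return the parsed entries."""
    lock = threading.Lock()
    with tempfile.TemporaryDirectory() as tmp:
        audit_path = os.path.join(tmp, "shadow-audit.jsonl")
        tool = make_shadow_tool("send_email", audit_path, lock)
        threads = [
            threading.Thread(target=tool, kwargs={"to": f"user{i}@example.com"})
            for i in range(n_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        with open(audit_path, encoding="utf-8") as f:
            # Every line must parse as JSON; a torn write would raise here.
            return [json.loads(line) for line in f]


entries = run_concurrent_calls()
```

If every line parses and the entry count equals the thread count, the appends did not interleave. Note that the lock only protects writers inside one process; several processes appending to the same file would need a different mechanism.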

The fake return is configurable: ShadowAgent(tools=tools, audit_path="./shadow-audit.jsonl", fake_return={"status": "ok"}) returns that dict for every shadowed tool call. If your agent branches on tool returns, choose a fake return that lets the agent keep executing rather than stop early.
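To see why the fake return matters, here is a toy illustration that does not use the library at all. `make_stubs` and `agent_step` are hypothetical: the stubs stand in for shadowed tools, and `agent_step` stands in for agent logic that treats a falsy return as a failure.

```python
from typing import Any, Callable, Dict, List


def make_stubs(
    names: List[str], fake_return: Any, calls: List[str]
) -> Dict[str, Callable[..., Any]]:
    """Build stand-in tools that record their name and return fake_return."""
    def make_stub(name: str) -> Callable[..., Any]:
        def stub(**kwargs: Any) -> Any:
            calls.append(name)
            return fake_return
        return stub
    return {name: make_stub(name) for name in names}


def agent_step(tools: Dict[str, Callable[..., Any]]) -> None:
    """Toy agent logic that branches on a tool's return value."""
    ticket = tools["create_ticket"](title="Login issue", priority="high")
    if not ticket:
        return  # a None return looks like a failure, so the run stops here
    tools["send_email"](to="support@co.com", subject=f"Ticket {ticket['id']} created")


names = ["create_ticket", "send_email"]

calls_with_none: List[str] = []
agent_step(make_stubs(names, None, calls_with_none))

calls_with_fake: List[str] = []
agent_step(make_stubs(names, {"status": "ok", "id": "shadow-00000"}, calls_with_fake))

print(calls_with_none)  # ['create_ticket']
print(calls_with_fake)  # ['create_ticket', 'send_email']
```

With a None return the audit would never show the email the agent intended to send, so the shadow run would under-report the new version's behavior.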

The 16 tests cover the shadow mode on/off toggle, the audit log format, correct passthrough when shadow mode is off, the configurable fake return, thread safety (10 concurrent tool calls in shadow mode), and the read_audit() helper.

When to Use It

Use it before deploying a new agent version to production. Run it against a replay of real traffic (or against a percentage of live requests) and inspect the audit log. If the intended actions look right, enable real execution.

Use it for agents that perform irreversible actions: write operations, email sending, payment processing, record deletion. Shadow mode lets you validate the agent's decision-making without the consequences.

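Use it for A/B testing. Run the new agent in shadow mode alongside the old agent and compare the new version's audit log against the actions the old version actually took. Pick a match threshold up front (for example, 95% of cases) and review the mismatches; if the threshold is met and the remaining differences are intended, the new version is ready.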
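A rough way to compute that match rate is sketched below. It assumes every entry in both logs carries a `request_id` field so actions can be paired per request; the audit format shown in this post has no such field, so you would have to add it yourself. `group_by_request` and `match_rate` are hypothetical helpers.

```python
import json
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Entry = Dict[str, Any]


def group_by_request(entries: Iterable[Entry]) -> Dict[str, List[str]]:
    """Map request_id -> ordered list of canonical 'tool:args' strings."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for e in entries:
        key = f"{e['tool']}:{json.dumps(e['args'], sort_keys=True)}"
        grouped[e["request_id"]].append(key)
    return dict(grouped)


def match_rate(shadow: Iterable[Entry], actual: Iterable[Entry]) -> Tuple[float, List[str]]:
    """Return (fraction of requests whose actions match, diverging request ids)."""
    new = group_by_request(shadow)
    old = group_by_request(actual)
    request_ids = sorted(set(new) | set(old))
    if not request_ids:
        return 1.0, []
    # A request matches only if both versions took the same actions in the same order.
    diverged = [rid for rid in request_ids if new.get(rid) != old.get(rid)]
    return 1 - len(diverged) / len(request_ids), diverged
```

This comparison is strict: same tools, same arguments, same order. Arguments that legitimately vary between runs (timestamps, generated text) would need normalizing before the comparison is meaningful.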

Skip it for read-only agents. If your agent only calls get_data() and search_documents(), there is nothing to shadow.

Install

pip install git+https://github.com/MukundaKatta/agent-shadow-mode

import logging
import os

from agent_shadow_mode import ShadowAgent

logger = logging.getLogger(__name__)

# Production setup: shadow mode controlled by environment variable
tools = {
    "create_order": create_order,
    "send_confirmation": send_confirmation,
    "update_inventory": update_inventory,
}

agent = ShadowAgent(
    tools=tools,
    audit_path="/var/log/agent/shadow-audit.jsonl",
    fake_return={"status": "ok", "id": "shadow-00000"},
)

if os.getenv("AGENT_SHADOW_MODE") == "1":
    agent.enable()
    print("Running in shadow mode: no real actions will be taken")

result = run_agent(agent.tools, user_request)

if agent.is_shadow():
    # Log shadow decisions for review. stdlib logging rejects arbitrary
    # keyword arguments, so the entry is passed as a format argument.
    for entry in agent.read_audit():
        logger.info("shadow_action %s", entry)

Sibling Libraries

| Library | What it solves |
| --- | --- |
| `prompt-replay` | Replay recorded prompts against a new model for comparison |
| `agenttap` | Wire-level capture of all LLM requests and responses |
| `agent-decision-log` | Structured WHY-layer decision log |
| `agentsnap` | Snapshot agent behavior for regression testing |
| `llm-fixture-replay` | VCR-style record/replay for unit tests |

The full pre-deployment review workflow: prompt-replay to test the new model on historical prompts, agent-shadow-mode to audit intended actions on live traffic, agentsnap to catch response regressions in CI.

What's Next

Percentage rollout would be a natural feature. Instead of all-or-nothing shadow mode, roll out real execution to 10% of requests and shadow the other 90%. This gives you real-world feedback while limiting exposure.
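The library does not offer this yet, but the routing decision could be sketched as follows. `use_real_execution` is a hypothetical helper, and the sketch assumes each request has a stable string id to hash.

```python
import hashlib


def use_real_execution(request_id: str, real_percent: int) -> bool:
    """Deterministically decide whether a request gets real execution.

    Hypothetical helper: hashes the request id into a bucket from 0 to 99
    and sends the lowest `real_percent` buckets to real execution.
    """
    if not 0 <= real_percent <= 100:
        raise ValueError("real_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < real_percent
```

Hashing instead of calling random() keeps the decision stable, so a retried request lands in the same bucket. Because the shadow flag lives on the ShadowAgent instance rather than on the request, one way to apply this would be to keep two instances, one enabled and one not, and pick between them per request.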

Comparison mode: run real execution and shadow execution side-by-side for the same input and log divergences. This tells you not just "what would the new agent do" but "where does it differ from the current agent."

Integration with the agent decision log: when a tool is intercepted in shadow mode, automatically log the decision context (what the agent was trying to accomplish, what tool it chose, what arguments it passed) so the audit log is self-explanatory without needing to cross-reference the conversation log.

Built as part of the agent-stack family: composable Python primitives for production LLM agents.