agent-shadow-mode: Run Your Agent in Record-Not-Execute Mode Before You Trust It

#hermeschallenge #ai #python #agents

1. The first time I trusted an agent too fast

I had an agent that could send emails, write to a database, and call an external API that billed per request. I tested it in a sandbox. It worked great. I ran it in production.

It sent 140 emails in 12 minutes.

The core logic was correct, but the loop's termination condition was off by one. I had tested the loop in isolation but not in the full pipeline with real data. By the time I caught it, the emails had already gone out.

After that, every new agentic workflow I built got a shadow mode pass first. Run against real data. Record every tool call. Review the audit log. Then flip the flag and go live.

agent-shadow-mode packages that pattern as a library. Wrap your tools. Set shadow=True. Run the agent. Read the JSONL audit log. Nothing executes. You see exactly what it would have done.

2. Shape of the fix

from agent_shadow_mode import ShadowAgent

agent = ShadowAgent(shadow=True, audit_path="/logs/shadow.jsonl")

@agent.tool
def send_email(to: str, subject: str, body: str) -> bool:
    # Real SMTP call goes here; smtp stands in for your own mail client
    smtp.send(to, subject, body)
    return True

@agent.tool
def write_to_db(table: str, data: dict) -> dict:
    # db stands in for your own database client
    return db.insert(table, data)

# In shadow mode: records the call, returns a shadow response, does not execute
result = agent.call("send_email",
    to="alice@example.com",
    subject="Q2 report",
    body="Please review..."
)
print(result)  # {"shadow": True, "tool": "send_email", "args": {...}}

After the dry run, open the audit log:

import json

with open("/logs/shadow.jsonl") as f:
    for line in f:
        event = json.loads(line)
        print(event["tool"], event["args"], event["timestamp"])

When you are satisfied:

# Go live: flip the flag
agent = ShadowAgent(shadow=False, audit_path="/logs/live.jsonl")
# Same tool registrations, same calls
# Now the functions actually execute

The audit log is written in both shadow and live mode. You get a JSONL record of every call regardless.
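
Because every call lands in the log, a per-tool count is a quick first check before reading individual events. The sketch below is one way to do that; `summarize_audit` is a hypothetical helper, not part of the library, and it assumes only the `tool` field shown in the reader loop above.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_audit(path):
    """Count recorded calls per tool in a JSONL audit log."""
    counts = Counter()
    with Path(path).open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            event = json.loads(line)
            counts[event["tool"]] += 1
    return counts
```

Calling `summarize_audit("/logs/shadow.jsonl").most_common()` lists the busiest tools first. A send_email count far above what the input should produce is the kind of signal that would have flagged the runaway loop from section 1.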

3. What it does NOT do

It does not simulate return values from your real tools. In shadow mode, agent.call() returns a stub response. The shape is {"shadow": True, "tool": name, "args": args}. It does not call the function, so it cannot know what the function would have returned. If your agent uses the tool's return value to make subsequent decisions, shadow mode will not replicate that branching accurately.

It does not intercept calls made outside the registry. If your agent calls send_email() directly without going through agent.call(), shadow mode will not catch it. All tool calls must go through the registry.

It does not isolate database reads from writes. If you have a tool that reads and writes, shadow mode will not split those. The whole tool is either shadowed or not. Tag your tools with side_effects metadata and filter before registering if you want read-through in shadow mode.
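
A minimal sketch of that tag-and-filter step, under stated assumptions: the `side_effects` decorator, the "read"/"write" labels, and both example tools are hypothetical and not part of agent-shadow-mode or tool-side-effects-tag. Untagged tools are treated as writes so that the default is the safe one.

```python
def side_effects(kind):
    """Hypothetical decorator: attach a side-effect label to a tool function."""
    def mark(fn):
        fn.side_effects = kind
        return fn
    return mark


@side_effects("read")
def lookup_customer(customer_id: str) -> dict:
    # Stands in for a real read-only query
    return {"id": customer_id, "plan": "pro"}


@side_effects("write")
def send_invoice(customer_id: str, amount: float) -> bool:
    raise RuntimeError("real side effect; must not run in a dry run")


def partition_tools(tools):
    """Split tools into (safe to run for real, must be shadowed)."""
    read_only = [t for t in tools if getattr(t, "side_effects", "write") == "read"]
    shadowed = [t for t in tools if t not in read_only]
    return read_only, shadowed


read_only, shadowed = partition_tools([lookup_customer, send_invoice])
```

Only the functions in `shadowed` would then be registered with the shadow agent, while the ones in `read_only` are called directly so the agent sees real data.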

It does not give you a full execution trace. It logs tool calls. It does not log LLM inputs, LLM outputs, intermediate reasoning steps, or anything your orchestration layer does between tool calls. For a full trace, use agent-step-log or agenttrace alongside.

4. Inside the library

The repo is at MukundaKatta/agent-shadow-mode. There are 16 tests.

Core types:

ShadowAgent: main class. Constructor takes shadow: bool, audit_path: str or None, mock_response: callable or None.
@agent.tool: decorator that registers a function. Can also call agent.register(fn) directly.
agent.call(name, **kwargs): dispatches the call. In shadow mode, records to audit log and returns stub. In live mode, records to audit log and calls the real function.
AuditEvent: dataclass written to the JSONL log. Fields: tool, args, timestamp, shadow, result (None in shadow mode), error (None if no exception).
ShadowResponse: what agent.call() returns in shadow mode. Has tool, args, and shadow=True. Do not treat its truthiness as a success signal; no real call was made.

The mock_response parameter lets you override the stub. Pass a callable that takes (tool_name, kwargs) and returns whatever you want. Useful when your agent branches on tool return values and you want to test specific paths in shadow mode.
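
As an illustration, a callable with that (tool_name, kwargs) shape might look like the sketch below. The canned return values are invented for the example, and falling back to the default stub shape for unknown tools is a choice made here, not documented library behavior.

```python
def mock_response(tool_name, kwargs):
    """Return a canned result per tool so the agent can follow a chosen branch."""
    canned = {
        "send_email": True,
        "write_to_db": {"id": 1, "table": kwargs.get("table")},
    }
    if tool_name not in canned:
        # Unknown tool: mimic the default stub shape
        return {"shadow": True, "tool": tool_name, "args": kwargs}
    return canned[tool_name]
```

Going by the constructor description above, it would be passed as `ShadowAgent(shadow=True, audit_path=..., mock_response=mock_response)`.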

Audit log format: one JSON object per line. ISO 8601 timestamp. All kwargs serialized with json.dumps. If an arg is not JSON-serializable, it is replaced with "<unserializable>" and a warning is logged. The file is opened in append mode; running the agent multiple times adds to the same log.

In live mode, if the tool raises an exception, the exception is caught, logged to the audit file with the error field set, and re-raised. You get the audit record AND the exception propagates normally.
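
The dispatch behavior described in this section can be sketched in a few dozen lines. This is an illustrative stand-in, not the library's source: `MiniShadowRegistry` is a hypothetical name, storing the error as `repr(exc)` is an assumption, and the warning on unserializable args is omitted.

```python
import json
from datetime import datetime, timezone


class MiniShadowRegistry:
    """Illustrative stand-in for the record-not-execute dispatch described above."""

    def __init__(self, shadow, audit_path):
        self.shadow = shadow
        self.audit_path = audit_path
        self.tools = {}

    def tool(self, fn):
        # Register under the function's own name and return it unchanged
        self.tools[fn.__name__] = fn
        return fn

    def _record(self, tool, args, result=None, error=None):
        event = {
            "tool": tool,
            "args": args,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "shadow": self.shadow,
            "result": result,
            "error": error,
        }
        # default= swaps in a marker for values json cannot encode
        line = json.dumps(event, default=lambda _: "<unserializable>")
        with open(self.audit_path, "a") as f:  # append, never truncate
            f.write(line + "\n")

    def call(self, name, **kwargs):
        if self.shadow:
            # Record the intent and return a stub; the tool never runs
            self._record(name, kwargs)
            return {"shadow": True, "tool": name, "args": kwargs}
        try:
            result = self.tools[name](**kwargs)
        except Exception as exc:
            # Log the failure, then let it propagate to the caller
            self._record(name, kwargs, error=repr(exc))
            raise
        self._record(name, kwargs, result=result)
        return result
```

The shadow branch returns before the tool lookup, which is why nothing registered can fire by accident in a dry run.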

5. When this is useful, when it is not

Useful when:

You are deploying a new agentic workflow and want a dry run against real data before going live. Especially when the tools have side effects: email, billing, database writes.
You are debugging an agent that has already done something wrong in production. Run it in shadow mode on the same input to reproduce what it would have done.
You are testing in CI and want to assert that the agent would call specific tools with specific args, without actually executing them.
You want an audit trail of all tool calls in production, not just in shadow mode. The audit_path writes in both modes.
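
The CI case above can be sketched as a small helper that reads the log back and filters by tool name. `calls_to` is a hypothetical helper, not part of the library; it relies only on the `tool` and `args` fields of each logged event.

```python
import json


def calls_to(audit_path, tool_name):
    """Return the recorded args of every call to tool_name, in log order."""
    matches = []
    with open(audit_path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            if event["tool"] == tool_name:
                matches.append(event["args"])
    return matches
```

In a test, run the agent in shadow mode and then assert on the result, for example `len(calls_to(path, "send_email")) == 1`. Point audit_path at a fresh temporary file per test, since the log is opened in append mode and earlier runs would otherwise inflate the counts.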

Not useful when:

Your agent's behavior depends heavily on tool return values and branches differently based on them. Shadow mode returns stubs by default, so the agent's downstream decisions will not match what would happen in live mode. mock_response can cover a few specific paths, but beyond that you would need a more sophisticated simulation layer.
Your tools are already idempotent and safe to call in test environments. If you can hit a real test API, do that instead. Shadow mode is for tools that have real costs or irreversible effects.
You need full execution replay. Shadow mode captures what the agent would call. It does not capture why. Use agent-decision-log or agent-step-log for the reasoning layer.

6. Install

The package is pending PyPI publication.

# PyPI (pending):
pip install agent-shadow-mode

# From source:
git clone https://github.com/MukundaKatta/agent-shadow-mode
cd agent-shadow-mode
pip install -e .

No runtime dependencies. Python 3.9+.

# Run the tests:
pytest tests/ -v
# 16 tests, all passing

7. Siblings in the stack

Library	What it does
`agentsnap`	Snapshot agent state at a point in time
`agent-replay-trace`	Load and step through JSONL agent traces
`agent-decision-log`	Structured WHY-layer log alongside tool calls
`tool-side-effects-tag`	Tag tools READ/WRITE/IDEMPOTENT/DESTRUCTIVE
`agenttrace`	Cost and latency per agent run

The workflow that makes sense: register your tools in agent-fn-registry with side effect tags. Feed those registrations to ShadowAgent. Run in shadow mode first. Review the audit log with agent-replay-trace. Go live.

8. What comes next

The three things I most want to add:

First, read-through mode. You mark some tools as READ in the side effects tags. In shadow mode, READ tools execute for real (so the agent gets real data to reason with) and WRITE tools are shadowed. Right now it is all-or-nothing.

Second, assertion helpers for tests. agent.assert_called(tool_name, times=1) and agent.assert_called_with(tool_name, **kwargs). These let you use shadow mode in pytest without manually parsing the JSONL file.

Third, a replay() method. You pass a JSONL audit log from a previous shadow run. The agent replays the same sequence of tool calls in live mode, using the recorded args. Bridges the gap between dry-run review and live execution without re-running the LLM.
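
Since replay() does not exist yet, the following is only a sketch of how it might work. It is written as a standalone function over a plain dict of real tool functions; the name, signature, and the choice to skip events that were not shadowed are all assumptions.

```python
import json


def replay(audit_path, tools):
    """Re-issue recorded shadow calls against real functions, in log order."""
    results = []
    with open(audit_path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            if not event.get("shadow"):
                continue  # skip anything that already ran live
            fn = tools[event["tool"]]
            results.append(fn(**event["args"]))
    return results
```

One caveat follows from the log format: any arg that was written as "<unserializable>" cannot be replayed faithfully, so a real implementation would need to detect that marker and refuse.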

The flag flip from shadow to live is the core value. Everything else builds around making that flip safer and more inspectable.

Source: github.com/MukundaKatta/agent-shadow-mode