Mukunda Rao Katta

Posted on May 25

I thought shadow mode was on. It wasn't. 400 emails later I built agent-shadow-mode.

#hermeschallenge #ai #python #agents

It was a Tuesday. I had a new email-dispatch agent I wanted to stress-test against real production traffic before giving it the keys.

The plan was simple: point it at the live queue, run it in shadow mode, check the JSONL log to see what it would have sent. No emails go out. No risk. Just a dry run.

I forgot to set the env var.

Four hundred emails went out in eleven minutes. Some were duplicates. Some were to the wrong segment. One was a re-engagement email to a user who had unsubscribed two months earlier.

The agent worked exactly as designed. I just hadn't shadowed it.

That afternoon I built agent-shadow-mode. The point is to make "is this in shadow mode" an explicit, auditable, per-tool decision, not an afterthought you remember after the damage is done.

The shape of the fix

The library gives you a shadow decorator. Wrap a tool, give it a stub response, and toggle shadow mode with an env var or a programmatic flag.

from agent_shadow_mode import shadow, ShadowConfig, is_shadow_mode

config = ShadowConfig(audit_file="shadow_audit.jsonl", shadow_mode=True)

@shadow(config=config, stub_response="Email queued (shadow mode)")
def send_email(to: str, subject: str, body: str) -> str:
    # Real implementation. smtp_client stands in for your own mail client
    # and is only reached when shadow mode is off.
    return smtp_client.send(to, subject, body)

In shadow mode, send_email never calls smtp_client.send. It logs the call to shadow_audit.jsonl and returns the stub. In live mode, it runs normally.

You can also toggle via environment variable:

SHADOW_MODE=1 python run_agent.py

The wrapper checks os.environ.get("SHADOW_MODE") at call time, so you can flip the flag between runs without changing code.
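To make the call-time check concrete, here is a minimal sketch of how a decorator like this could work. It is not the library's actual implementation: shadow_sketch, the exact "1" comparison, and the in-memory sent list are assumptions made for illustration. The log field names follow the entries shown in this post.

```python
import functools
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def shadow_sketch(audit_file, stub_response=None):
    """Hypothetical stand-in for a shadow decorator (illustration only)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Read the flag on every call, not at import or decoration time.
            if os.environ.get("SHADOW_MODE") == "1":
                entry = {
                    "tool": func.__name__,
                    "args": kwargs,  # this sketch only logs keyword arguments
                    "stub_response": stub_response,
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                }
                with open(audit_file, "a", encoding="utf-8") as fh:
                    fh.write(json.dumps(entry) + "\n")
                return stub_response  # the real function is never called
            return func(*args, **kwargs)
        return wrapper
    return decorator


sent = []  # stands in for the real SMTP side effect
log_path = Path(tempfile.mkdtemp()) / "shadow_audit.jsonl"


@shadow_sketch(log_path, stub_response="Email queued (shadow mode)")
def send_email(to, subject, body):
    sent.append(to)
    return "Email queued"


os.environ["SHADOW_MODE"] = "1"
shadow_result = send_email(to="alice@example.com", subject="Hi", body="...")

os.environ["SHADOW_MODE"] = "0"
live_result = send_email(to="bob@example.com", subject="Hi", body="...")
```

Only the second call reaches the function body, and only the first call lands in the audit file. Because the flag is read inside the wrapper, the same decorated function changes behavior as soon as the environment changes.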

Here is how to read the log back, followed by what the printed entries look like:

import json
from pathlib import Path

for line in Path("shadow_audit.jsonl").read_text().splitlines():
    entry = json.loads(line)
    print(entry["tool"], entry["args"], entry["stub_response"], entry["timestamp"])

send_email {'to': 'alice@example.com', 'subject': 'Re-engagement', 'body': '...'} Email queued (shadow mode) 2026-05-24T14:02:11Z
send_email {'to': 'bob@example.com', 'subject': 'Re-engagement', 'body': '...'} Email queued (shadow mode) 2026-05-24T14:02:12Z

The agent loop receives "Email queued (shadow mode)" and keeps reasoning from there. It does not know it is in shadow mode. It just sees a return value and continues.

What it does NOT do

It does not verify tool behavior. It records calls, not correctness.
It does not replay the captured calls later. For replay, see prompt-replay in the siblings table.
It does not block network access at the OS level. If your tool has side effects outside the Python call (spawning subprocesses, writing to shared state), shadow mode will not catch those.
It does not diff shadow output against live output automatically. You read the JSONL and do that yourself.

Inside the lib: per-tool stub response design

This is the part I thought hardest about.

The naive approach is to have one global stub: return None or "" for every shadowed tool. That is fast to implement and wrong in practice.

Here is why it breaks. Suppose your agent loop looks like this:

result = send_email(to=addr, subject=subj, body=body)
if "queued" in result.lower():
    mark_as_sent(addr)
else:
    retry_queue.append(addr)

If send_email returns "" in shadow mode, the agent branches into retry_queue. If it returns "Email queued (shadow mode)", the agent branches into mark_as_sent. Only the second path reflects what the agent would actually do in production.

Per-tool stubs let you mirror the real control flow without executing the real side effect. You are testing the agent's reasoning, not just its tendency to call tools in the right order.
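The branching difference is easy to reproduce in isolation. The sketch below pulls the agent-loop branch into a hypothetical route function (the name and the two lists are mine, not part of the library) and feeds it a global empty stub and a per-tool stub.

```python
def route(result, addr, marked_sent, retry_queue):
    # Same branch as the agent loop above, with the collections passed in.
    if "queued" in result.lower():
        marked_sent.append(addr)
    else:
        retry_queue.append(addr)


marked_sent, retry_queue = [], []

# Global empty-string stub: the agent believes the send failed.
route("", "alice@example.com", marked_sent, retry_queue)

# Per-tool stub shaped like the real response: the agent takes the success path.
route("Email queued (shadow mode)", "bob@example.com", marked_sent, retry_queue)
```

After both calls, alice sits in retry_queue and bob in marked_sent. A global stub of None would be worse still: result.lower() would raise AttributeError and the shadow run would crash on a path production never hits.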

The stub is set at decoration time:

@shadow(config=config, stub_response="Payment processed")
def charge_card(amount: float, token: str) -> str:
    return payment_gateway.charge(amount, token)

@shadow(config=config, stub_response='{"status": "created", "id": "shadow-123"}')
def create_crm_record(data: dict) -> str:
    return crm_api.post("/records", data)

Each tool gets a stub that matches the shape of its real response. The agent loop reasons through the full scenario. You get a log of every would-be action with no side effects.

You can also set stub_response at runtime if the response needs to vary:

@shadow(config=config)
def lookup_user(user_id: str) -> dict:
    return db.get_user(user_id)

# override stub per call
result = lookup_user(user_id="u_42", _shadow_stub={"id": "u_42", "plan": "pro"})

The _shadow_stub kwarg is stripped before the real function sees it, so the signature stays clean.
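One way a wrapper could strip the override is to pop it from kwargs before doing anything else. This is a sketch under my own assumptions, not the library's code: shadow_with_override, the shadow_on callable, and the sentinel are all hypothetical.

```python
import functools

_MISSING = object()  # sentinel so that None can still be a valid stub


def shadow_with_override(shadow_on, stub_response=None):
    """Hypothetical decorator showing per-call stub overrides."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Pop the override first so the real function never receives it,
            # whether or not shadow mode is on.
            override = kwargs.pop("_shadow_stub", _MISSING)
            if shadow_on():
                return stub_response if override is _MISSING else override
            return func(*args, **kwargs)
        return wrapper
    return decorator


mode = {"shadow": True}


@shadow_with_override(lambda: mode["shadow"], stub_response={"id": "unknown"})
def lookup_user(user_id):
    return {"id": user_id, "plan": "free", "source": "db"}


stubbed = lookup_user(user_id="u_42", _shadow_stub={"id": "u_42", "plan": "pro"})
default_stub = lookup_user(user_id="u_42")

mode["shadow"] = False
live = lookup_user(user_id="u_42", _shadow_stub={"id": "u_42", "plan": "pro"})
```

The last call matters most: in live mode the extra kwarg is silently dropped instead of raising a TypeError inside lookup_user, so call sites do not need to change when the flag flips.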

When this is useful

Staging against live traffic. You point at production data with writes shadowed. The agent processes real input and logs what it would have done. You inspect the log before flipping to live.

Canary deployments. You deploy a new agent version with shadow mode on, run it in parallel with the old version, and compare the JSONL logs. Same inputs, different decisions, no side effects from either.

Debugging production issues. Something went wrong yesterday. You replay the traffic with shadow mode on (the library does not replay for you; prompt-replay in the siblings table covers that), reproduce the call sequence, and see what the agent would call given the same inputs today.

Onboarding new tools. You add a tool to an existing agent but are not confident about its call patterns. Shadow it for a few hundred runs. Read the log. Then enable it.

Compliance audits. Some environments require a record of every action an agent attempted, not just the ones that actually executed. The JSONL file gives you that record for every shadowed call.
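For the canary case above, comparing two logs can be a few lines of Python. This is a rough sketch, assuming both logs use the tool and args fields shown earlier; load_calls and diff_calls are hypothetical helpers, and the sample entries are made up for the example.

```python
import json


def load_calls(jsonl_text):
    """Parse JSONL text into (tool, canonical-args) pairs, ignoring timestamps."""
    calls = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        calls.append((entry["tool"], json.dumps(entry["args"], sort_keys=True)))
    return calls


def diff_calls(old_calls, new_calls):
    """Return (position, old, new) wherever the two call sequences disagree."""
    diffs = []
    for i in range(max(len(old_calls), len(new_calls))):
        old = old_calls[i] if i < len(old_calls) else None
        new = new_calls[i] if i < len(new_calls) else None
        if old != new:
            diffs.append((i, old, new))
    return diffs


# Made-up sample logs: the new agent version picks a different subject for bob.
old_log = "\n".join(json.dumps(e) for e in [
    {"tool": "send_email", "args": {"to": "alice@example.com", "subject": "Re-engagement"},
     "timestamp": "2026-05-24T14:02:11Z"},
    {"tool": "send_email", "args": {"to": "bob@example.com", "subject": "Re-engagement"},
     "timestamp": "2026-05-24T14:02:12Z"},
])
new_log = "\n".join(json.dumps(e) for e in [
    {"tool": "send_email", "args": {"to": "alice@example.com", "subject": "Re-engagement"},
     "timestamp": "2026-05-25T09:00:01Z"},
    {"tool": "send_email", "args": {"to": "bob@example.com", "subject": "Welcome back"},
     "timestamp": "2026-05-25T09:00:02Z"},
])

differences = diff_calls(load_calls(old_log), load_calls(new_log))
```

The comparison is positional, so it suits runs that process the same inputs in the same order. If the two versions can interleave calls differently, grouping by an input identifier before comparing would be a better fit.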

When NOT to use it

Do not use shadow mode as a permanent production feature. It is a testing and staging tool. Leaving it on in production means your agent is not actually doing anything, which defeats the purpose of having an agent.

Do not use it to test tools that are purely read-only. If lookup_user has no side effects, shadowing it adds log noise without reducing risk. Use tool-side-effects-tag (in the siblings table) to mark which tools are DESTRUCTIVE or WRITE, and shadow only those.

Do not treat the shadow log as a correctness proof. The log tells you what the agent called, not whether the real execution would have succeeded. A network error, a validation failure, or a rate limit in the real call will not appear in the shadow log.
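The advice above to shadow only WRITE and DESTRUCTIVE tools can be sketched as a simple filter. The SideEffect enum and tools_to_shadow helper here are hypothetical and do not reflect the actual tool-side-effects-tag API; only the four tag names come from the siblings table.

```python
from enum import Enum


class SideEffect(Enum):
    # Hypothetical tags mirroring the four categories named in this post.
    READ = "read"
    WRITE = "write"
    IDEMPOTENT = "idempotent"
    DESTRUCTIVE = "destructive"


# Assumption for this sketch: only these two categories are worth shadowing.
SHADOW_WORTHY = {SideEffect.WRITE, SideEffect.DESTRUCTIVE}


def tools_to_shadow(tagged_tools):
    """Return the sorted names of tools whose tag calls for shadowing."""
    return sorted(name for name, tag in tagged_tools.items() if tag in SHADOW_WORTHY)


tagged = {
    "lookup_user": SideEffect.READ,
    "send_email": SideEffect.WRITE,
    "delete_account": SideEffect.DESTRUCTIVE,
    "set_feature_flag": SideEffect.IDEMPOTENT,
}

selected = tools_to_shadow(tagged)
```

Whether IDEMPOTENT tools belong in the shadowed set is a judgment call. An idempotent write still changes state the first time it runs, so some stacks may want to include it.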

Install

pip install agent-shadow-mode

Zero dependencies. Python 3.9+.

from agent_shadow_mode import shadow, ShadowConfig, is_shadow_mode, ShadowAuditEntry

Full API: shadow decorator, ShadowConfig dataclass, is_shadow_mode() check, ShadowAuditEntry typed log record.

The config accepts:

config = ShadowConfig(
    audit_file="shadow_audit.jsonl",   # path to JSONL log
    shadow_mode=True,                  # or set SHADOW_MODE env var
    append=True,                       # append to existing log vs overwrite
)

16 tests, no deps, MIT license.

GitHub: MukundaKatta/agent-shadow-mode

Siblings

These four libraries compose with agent-shadow-mode in the same agent stack.

Lib	Boundary	Repo
agentsnap	Snapshot tests for what tools were called and in what order	MukundaKatta/agentsnap
agent-decision-log	WHY layer, logs the options considered and the rationale behind each tool choice	MukundaKatta/agent-decision-log
tool-side-effects-tag	Tags tools as READ, WRITE, IDEMPOTENT, or DESTRUCTIVE so you know which ones to shadow	MukundaKatta/tool-side-effects-tag
prompt-replay	Replay recorded prompts against a new config or model, pairs with shadow logs as input	MukundaKatta/prompt-replay

A common pattern: use tool-side-effects-tag to identify DESTRUCTIVE tools, wrap them with agent-shadow-mode, record the run with agentsnap, log the decisions with agent-decision-log, then use prompt-replay to re-run the same traffic after you make changes.

What's next

A few things I want to add:

Shadow diff mode. Run the tool in shadow mode and also call a lightweight read-only probe to capture what the real result would have been, without the side effect. Compare the stub to the probe output. Flag divergence. Useful for slow-rolling config changes where you want to know if the stub is still realistic.

Per-call shadow toggle. Right now shadow is all-or-nothing per config instance. A shadow_if predicate would let you shadow only certain inputs: for example, shadow all calls where amount > 1000 and let smaller amounts through live.

Structured stub schemas. Today the stub is a fixed value, usually a raw string. If your tool returns a typed dict, you have to manually keep the stub shape in sync with the return type. A stub_factory callable that receives the args and returns a typed stub would be cleaner.
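Neither shadow_if nor stub_factory exists in the library today, so the following is only a sketch of how the two ideas might combine. The decorator name, its parameters, and the in-memory audit list are all assumptions.

```python
import functools


def shadow_sketch(shadow_if, stub_factory, audit):
    """Hypothetical decorator: shadow a call only when shadow_if(**kwargs) is true."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            if shadow_if(**kwargs):
                # Build the stub from the call's own arguments so its shape
                # tracks the real return value.
                stub = stub_factory(**kwargs)
                audit.append({"tool": func.__name__, "args": kwargs, "stub_response": stub})
                return stub
            return func(**kwargs)
        return wrapper
    return decorator


audit = []    # in-memory stand-in for the JSONL audit file
charged = []  # stands in for the real payment side effect


@shadow_sketch(
    shadow_if=lambda amount, token: amount > 1000,
    stub_factory=lambda amount, token: {
        "status": "processed", "amount": amount, "id": "shadow-charge",
    },
    audit=audit,
)
def charge_card(amount, token):
    charged.append(amount)
    return {"status": "processed", "amount": amount, "id": "live-charge"}


small = charge_card(amount=25.0, token="tok_a")
large = charge_card(amount=5000.0, token="tok_b")
```

The small charge runs live and the large one is shadowed, with a stub that echoes the requested amount. The sketch accepts keyword arguments only, to keep the predicate and factory signatures simple.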

The core design is stable. The add-ons are nice-to-haves once the testing pattern proves out in your stack.

The main lesson from the 400-email incident: shadow mode only works if it is the default, not an opt-in you remember to set. Build it into the decorator. Make live mode the thing you explicitly enable. Audit files are cheap. Emails to unsubscribed users are not.
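One way to read that lesson in code: treat anything other than an explicit opt-in as shadow. The AGENT_LIVE_MODE variable and both helpers below are hypothetical names for illustration, not something the library defines.

```python
import os


def is_live(environ=None):
    """Live only on an exact, explicit opt-in; everything else means shadow."""
    env = os.environ if environ is None else environ
    return env.get("AGENT_LIVE_MODE") == "1"


def dispatch(real_call, stub, environ=None):
    """Run the real call only when live mode was explicitly enabled."""
    return real_call() if is_live(environ) else stub


outbox = []


def really_send():
    outbox.append("alice@example.com")
    return "Email queued"


forgot_env_var = dispatch(really_send, "Email queued (shadow mode)", environ={})
typo_value = dispatch(really_send, "Email queued (shadow mode)",
                      environ={"AGENT_LIVE_MODE": "true"})
explicit_live = dispatch(really_send, "Email queued (shadow mode)",
                         environ={"AGENT_LIVE_MODE": "1"})
```

With this polarity, forgetting the variable or mistyping its value costs you a wasted dry run rather than four hundred emails.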