Snapshot tests caught a regression in my agent that the unit tests missed

#hermesagent #ai #llm #testing

I shipped a small agent that books meetings. It calls three tools in order: search_calendar, find_slot, create_event.

The unit tests all passed for a year. Then I bumped the model from claude-3-5-sonnet to claude-3-7-sonnet and the booking flow silently broke. Bookings still happened. They just went to the wrong day half the time.

The root cause: the new model started calling find_slot(duration_minutes=30, attendees=[...]) instead of find_slot(attendees=[...], duration_minutes=30). My tool wrapper passed the args positionally under the hood, in whatever order the model emitted them, and the first positional arg was the date. Whoops.
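Here is a minimal sketch of how that kind of bug can arise, assuming a dispatcher that unpacks the model's arguments by position. The find_slot signature, the date key, the example date, and both dispatch helpers are hypothetical stand-ins for illustration, not the real tool code:

```python
def find_slot(date, attendees=None, duration_minutes=None):
    # Hypothetical tool: the first positional parameter is the date.
    return {"date": date, "attendees": attendees, "duration_minutes": duration_minutes}


def dispatch_positional(tool, args):
    # Fragile: relies on the order in which the model emitted the keys.
    return tool(*args.values())


def dispatch_by_name(tool, args):
    # Safer: binds each value to the parameter with the same name.
    return tool(**args)


# Same keys and values, different key order (made-up example data).
old_call = {"date": "2025-03-04", "attendees": ["Sam"], "duration_minutes": 30}
new_call = {"duration_minutes": 30, "attendees": ["Sam"], "date": "2025-03-04"}

good = dispatch_positional(find_slot, old_call)
bad = dispatch_positional(find_slot, new_call)   # date silently becomes 30
fixed = dispatch_by_name(find_slot, new_call)    # same result as `good`
```

Dispatching by name makes the key order irrelevant, which is probably the fix you want regardless of what the tests look like.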

Unit tests passed because I mocked find_slot and asserted on the final reply text. The reply still said "Booked for Tuesday at 3pm". Just for the wrong Tuesday.
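To make the blind spot concrete, here is a small sketch using Python's unittest.mock. The run_booking function is a hypothetical stand-in for one agent step, and the wrong key ("duration") is just an example of a bad call. A reply-text assertion passes, while an assertion on the call itself does not:

```python
from unittest.mock import Mock


def run_booking(find_slot, args):
    # Hypothetical agent step: call the tool, then produce the reply text.
    find_slot(**args)
    return "Booked for Tuesday at 3pm"


find_slot = Mock(return_value={"start": "Tuesday 3pm"})
reply = run_booking(find_slot, {"duration": 30, "attendees": ["Sam"]})

# Asserting only on the reply text: passes even though the arg key is wrong.
reply_check_passes = "Booked for Tuesday" in reply

# Asserting on the call itself notices the difference.
try:
    find_slot.assert_called_once_with(attendees=["Sam"], duration_minutes=30)
    call_check_passes = True
except AssertionError:
    call_check_passes = False
```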

I wanted a test that pinned the actual sequence of tool calls and their arg shape. That is what AgentSnap does.

The shape of a snapshot

A snapshot is a JSON file with one entry per turn: the model decision, the tools it called, the args, and the tool result shape. Not the full free-text response. Just the trace.

import { snapshot } from "@mukundakatta/agentsnap";

test("books a 30 minute coffee", async () => {
  const trace = await runAgent({
    user: "book a 30 min coffee with Sam next Tuesday",
  });
  snapshot(trace, "book_coffee.snap");
});

The first run writes the file. Future runs compare against it. A diff is a hard fail unless you re-record with AGENTSNAP_UPDATE=1.
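The record-then-compare behaviour is simple enough to sketch. This is not AgentSnap's actual implementation, just a rough illustration of the idea under the assumption that the trace is plain JSON-serializable data. The check_snapshot name and its return values are made up; only the AGENTSNAP_UPDATE variable comes from the library's behaviour described above:

```python
import json
import os
from pathlib import Path


def check_snapshot(trace, path):
    """Record on first run, compare afterwards.

    Returns "written", "updated", or "matched"; raises AssertionError on a diff.
    """
    snap = Path(path)
    update = os.environ.get("AGENTSNAP_UPDATE") == "1"
    if not snap.exists() or update:
        status = "updated" if snap.exists() else "written"
        snap.write_text(json.dumps(trace, indent=2))
        return status
    recorded = json.loads(snap.read_text())
    if recorded != trace:
        # A mismatch is a hard failure unless the update flag is set.
        raise AssertionError(f"trace differs from {snap.name}")
    return "matched"
```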

The snap file looks like this:

[
  { "tool": "search_calendar", "args": { "user": "Sam" } },
  { "tool": "find_slot", "args": { "attendees": ["Sam"], "duration_minutes": 30 } },
  { "tool": "create_event", "args": { "title": "Coffee", "duration_minutes": 30 } }
]

When the model started calling find_slot with reordered args, the diff was loud:

- "args": { "attendees": ["Sam"], "duration_minutes": 30 }
+ "args": { "duration_minutes": 30, "attendees": ["Sam"] }

OK, key order is not actually a problem there. But when the model started passing {"duration": 30} instead of {"duration_minutes": 30} (yes, that happened too), the diff showed me in one second what would otherwise have taken an hour of bisecting.
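The difference between those two cases is easy to check with the standard library. This sketch assumes the comparison is done on JSON with sorted keys, which is my assumption here and not a statement about how AgentSnap diffs internally. The canonical and diff_lines helpers are hypothetical:

```python
import difflib
import json

recorded = {"attendees": ["Sam"], "duration_minutes": 30}
reordered = {"duration_minutes": 30, "attendees": ["Sam"]}
renamed = {"attendees": ["Sam"], "duration": 30}


def canonical(args):
    # Sorting keys makes the serialized form independent of key order.
    return json.dumps(args, sort_keys=True, indent=2)


def diff_lines(old, new):
    # Keep only the added/removed lines, dropping the diff headers.
    lines = difflib.unified_diff(
        canonical(old).splitlines(), canonical(new).splitlines(), lineterm=""
    )
    return [
        line for line in lines
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


reorder_diff = diff_lines(recorded, reordered)   # empty: same keys, same values
rename_diff = diff_lines(recorded, renamed)      # the renamed key shows up
```

With sorted keys, a pure reorder produces no diff at all, while the rename shows up as one removed line and one added line.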

Handling non-determinism

LLMs are noisy. If you snapshot the whole reply, the test will fail on almost every run. Two things help:

One: snapshot the trace, not the reply. Tool names and arg keys are stable across runs in a way that prose is not.

Two: redact volatile fields. AgentSnap lets you pass a redactor.

from agentsnap import snapshot

snapshot(
    trace,
    "book_coffee.snap",
    redact=["args.timestamp", "args.request_id"],
)

Redacted fields become <redacted> in the snapshot so the diff still tells you they were there, but their value does not break the test.
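For intuition, redaction by dotted path can be sketched in a few lines. This is an illustration of the idea and not AgentSnap's code: the redact function, its placeholder parameter, and the sample request_id value are all assumptions, and this version only handles nested dicts on a single trace entry:

```python
import copy


def redact(entry, paths, placeholder="<redacted>"):
    """Return a copy of one trace entry with the dotted paths masked."""
    out = copy.deepcopy(entry)
    for path in paths:
        *parents, leaf = path.split(".")
        node = out
        for key in parents:
            # Walk down one level; stop if the path does not exist.
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict) and leaf in node:
            node[leaf] = placeholder  # keep the key, hide the value
    return out


# Made-up entry: request_id is present, timestamp is not.
entry = {"tool": "create_event", "args": {"title": "Coffee", "request_id": "r-123"}}
masked = redact(entry, ["args.timestamp", "args.request_id"])
```

Paths that are absent from the entry are skipped, so the snapshot still records which volatile fields were actually present.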

Where it pays off

I run snapshot tests on every PR that touches a tool definition or a prompt. The CI matrix runs each snapshot against two models: the current production model and the next candidate. If the candidate produces a different trace, I see it before I roll out.
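The two-model comparison boils down to running the same prompt on both and checking whether the traces match. A rough sketch, assuming a run_agent(prompt, model=...) callable that returns the trace as plain data. That signature, the model labels, and the fake agent below are all hypothetical:

```python
def candidates_that_differ(run_agent, prompt, production, candidates):
    """Return the candidate models whose trace differs from production's."""
    baseline = run_agent(prompt, model=production)
    return [m for m in candidates if run_agent(prompt, model=m) != baseline]


def fake_run_agent(prompt, model):
    # Stand-in for a real agent run; returns a canned trace per model.
    key = "duration" if model == "candidate-model" else "duration_minutes"
    return [{"tool": "find_slot", "args": {"attendees": ["Sam"], key: 30}}]


changed = candidates_that_differ(
    fake_run_agent,
    "book a 30 min coffee with Sam next Tuesday",
    production="production-model",
    candidates=["candidate-model"],
)
```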

In the four months since I started using this, it has caught:

The arg-reorder bug above.
A prompt edit that made the model skip search_calendar entirely on shorter requests.
A tool description change that made one model start calling create_event twice.

Each of those was a behaviour change that unit tests would never have caught because the final reply text still looked fine.

Limitations

Snapshots are great at saying "something changed". They are bad at saying whether the change is good or bad. You still have to read the diff and make a call. For a small agent with maybe ten golden traces, this is fine. For a large agent with hundreds, you will need a way to bulk-approve obviously-fine diffs.

Also: snapshots assume the agent is deterministic enough that the same input gives the same trace. I run with temperature=0 for tests. With temperature above zero you will get false positives: the trace differs between runs even though nothing in your code or prompt changed.
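One way to find out whether a given prompt is stable enough to snapshot is to run it a few times and count the distinct traces. A minimal sketch, assuming run_agent(prompt) returns JSON-serializable data. The trace_variants helper and both fake agents are hypothetical:

```python
import itertools
import json
from collections import Counter


def trace_variants(run_agent, prompt, runs=5):
    """Run the agent several times and count each distinct trace."""
    seen = Counter()
    for _ in range(runs):
        trace = run_agent(prompt)
        seen[json.dumps(trace, sort_keys=True)] += 1
    return seen


def stable_agent(prompt):
    # Always returns the same trace.
    return [{"tool": "search_calendar", "args": {"user": "Sam"}}]


_flip = itertools.cycle([1, 2])


def flaky_agent(prompt):
    # Alternates between calling create_event once and twice.
    return [{"tool": "create_event"}] * next(_flip)


stable_count = len(trace_variants(stable_agent, "book a coffee"))
flaky_count = len(trace_variants(flaky_agent, "book a coffee", runs=4))
```

More than one variant for the same prompt means the snapshot will flake, and the prompt or the sampling settings need attention before the trace is worth pinning.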