Mukunda Rao Katta

Posted on May 25

agentsnap: Jest-Style Snapshot Tests for AI Agent Tool Calls

#hermeschallenge #ai #javascript #agents

The bug that took three hours to spot

I had an agent that summarized search results and stored them in a database. It worked in testing. After a deploy, users started seeing stale data. Logs showed the agent was running fine. No errors in the console. The monitoring dashboard was green.

The actual problem: a refactor had flipped the argument order on one tool call. The agent was passing the query to the wrong parameter. The call still type-checked, so TypeScript raised no error. The agent still returned a response, so no exception was thrown. The data was just quietly wrong in a way that looked plausible enough to ship.

I found it by hand, comparing tool call logs from before and after the deploy. That meant pulling two runs' worth of JSON logs, diffing them manually in a text editor, and eventually noticing the swapped arguments buried in the middle of a long trace. That took three hours. Meanwhile users were looking at bad data. There had to be a better way.

That is what agentsnap is for. It captures the exact sequence of tool calls your agent makes, including names, arguments, and results, and saves that as a snapshot file. On the next run, it diffs the current trace against the saved file. Any change in how the agent uses its tools shows up immediately as a test failure, before it reaches production. The diff output tells you exactly what changed, which field, and which call in the sequence.

Shape of the fix

import { AgentSnap } from "@mukundakatta/agentsnap";

const snap = new AgentSnap("my-agent-test");

// Record each tool call as it happens:
snap.record({ tool: "search_web", args: { query: "AI news" }, result: [...] });
snap.record({ tool: "summarize", args: { text: "..." }, result: "..." });

// Assert against the saved snapshot (creates one if new):
await snap.assert();

On first run with a new snapshot name, assert() writes the snapshot file and passes. On subsequent runs, it diffs the recorded calls against the saved file. If anything changed, it throws with a readable diff that shows you exactly which tool call changed, which field changed, and what the old and new values were. No parsing logs by hand, no manual comparison.

Set AGENTSNAP_UPDATE=1 in your environment and rerun to accept the new behavior as the baseline. This mirrors Jest's --updateSnapshot flag: you are making an explicit decision to adopt the new behavior, not silently accepting a drift you did not notice.
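To make the write-then-compare behavior concrete, here is a minimal sketch of the logic. It is not agentsnap's source. The assertSnapshot function, its update option (standing in for AGENTSNAP_UPDATE=1), and the in-memory Map (standing in for snapshot files on disk) are all assumptions for illustration. It also compares serialized strings, which is cruder than a structural diff.

```javascript
// Hypothetical model of the snapshot workflow; not agentsnap's implementation.
// A Map stands in for the snapshot files on disk.
const store = new Map();

function assertSnapshot(name, calls, { update = false } = {}) {
  const serialized = JSON.stringify(calls);
  if (!store.has(name) || update) {
    // First run, or an explicit update: save the trace as the baseline.
    store.set(name, serialized);
    return "written";
  }
  if (store.get(name) !== serialized) {
    throw new Error(`Snapshot "${name}" does not match the recorded calls`);
  }
  return "matched";
}

const baseline = [{ tool: "search_web", args: { query: "AI news" }, result: ["a"] }];
const drifted = [{ tool: "search_web", args: { q: "AI news" }, result: ["a"] }];

assertSnapshot("my-agent-test", baseline); // first run: writes the baseline
assertSnapshot("my-agent-test", baseline); // second run: matches
// assertSnapshot("my-agent-test", drifted) would throw at this point.
assertSnapshot("my-agent-test", drifted, { update: true }); // explicit acceptance
```

The point of the update branch is that a changed trace never becomes the baseline by accident. Someone has to ask for it.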

The diff output shows the full path to the changed value: calls[1].args.text changed from "..." to "...". When the snapshot grows to dozens of calls, you still land on the exact change immediately.
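For a feel of how a path like calls[1].args.text can be produced, here is a small recursive sketch. The diffPaths function and the sample traces are hypothetical, and the library's real algorithm and message format may differ.

```javascript
// Hypothetical path-reporting diff; agentsnap's real algorithm may differ.
function diffPaths(expected, actual, path = "calls") {
  const bothContainers =
    expected !== null && actual !== null &&
    typeof expected === "object" && typeof actual === "object" &&
    Array.isArray(expected) === Array.isArray(actual);

  if (!bothContainers) {
    // Leaf values (or mismatched shapes): report only when they differ.
    return Object.is(expected, actual)
      ? []
      : [`${path} changed from ${JSON.stringify(expected)} to ${JSON.stringify(actual)}`];
  }

  // Visit every key or index present on either side.
  const keys = new Set([...Object.keys(expected), ...Object.keys(actual)]);
  const changes = [];
  for (const key of keys) {
    const child = Array.isArray(expected) ? `${path}[${key}]` : `${path}.${key}`;
    changes.push(...diffPaths(expected[key], actual[key], child));
  }
  return changes;
}

const saved = [
  { tool: "search_web", args: { query: "AI news" }, result: ["headline"] },
  { tool: "summarize", args: { text: "headline" }, result: "short" },
];
const current = [
  { tool: "search_web", args: { query: "AI news" }, result: ["headline"] },
  { tool: "summarize", args: { text: "AI news" }, result: "short" },
];

console.log(diffPaths(saved, current));
// [ 'calls[1].args.text changed from "headline" to "AI news"' ]
```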

What it does NOT do

agentsnap does not run your agent for you. It does not mock your LLM client or any of your tool implementations. It does not know whether a changed trace is correct or wrong. That judgment is yours. The library just makes the change visible and forces you to acknowledge it before anything ships.

If you update a snapshot without reviewing what changed, the library has no way to protect you from that. The AGENTSNAP_UPDATE=1 flag is a manual step for a reason. Like Jest's snapshot update flag, it trusts that you looked at the diff before accepting it.

agentsnap also does not verify that the tool calls produced the correct business outcome. It verifies that the tool calls looked the same as they did when you last reviewed them. Correctness testing is a separate concern.

Inside the lib

The snapshot format is plain JSON. Each entry is an object with tool, args, and result. The file lives next to your test file by default, named <snapshot-name>.agentsnap.json. You can change the directory via a constructor option. The files are meant to be committed to source control, the same way Jest snapshots are. That way reviewers can see the diff in a PR and understand how agent behavior changed without running the agent themselves.
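As a rough picture of what ends up on disk, the sketch below serializes a trace into pretty-printed JSON. The helper names are hypothetical, and the top-level calls key is an assumption suggested by the calls[1].args.text path in the diff output. Check a generated file for the exact layout.

```javascript
// Hypothetical illustration of the snapshot shape; the exact layout is an assumption.
function snapshotFileName(snapshotName) {
  return `${snapshotName}.agentsnap.json`;
}

function serializeSnapshot(calls) {
  // Keep only the three documented fields per entry, pretty-printed for PR review.
  const entries = calls.map(({ tool, args, result }) => ({ tool, args, result }));
  return JSON.stringify({ calls: entries }, null, 2);
}

console.log(snapshotFileName("my-agent-test")); // my-agent-test.agentsnap.json
console.log(
  serializeSnapshot([
    { tool: "search_web", args: { query: "AI news" }, result: ["headline"] },
  ])
);
```

Pretty-printing matters here because the file is reviewed as a line-based diff in a PR. One field per line keeps a changed argument on its own line.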

The diff algorithm compares sequences structurally. It does not care about object key ordering inside args or result. It does care about call ordering in the sequence. If your agent calls search_web before summarize, that order is part of the snapshot. Reordering calls is reported as a diff, because call order often matters for correctness in agent pipelines. An agent that reads before it writes behaves differently from one that writes before it reads, even if the individual call signatures are identical.
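One way to get "key order does not matter, call order does" is to sort object keys recursively while leaving arrays alone. The sketch below does that. The canonicalize and sameTrace functions and the read_row / write_row tool names are hypothetical, and agentsnap's own comparison may work differently.

```javascript
// Hypothetical canonicalization; agentsnap's own comparison may work differently.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize); // array order is preserved
  if (value !== null && typeof value === "object") {
    const sorted = {};
    for (const key of Object.keys(value).sort()) sorted[key] = canonicalize(value[key]);
    return sorted;
  }
  return value;
}

function sameTrace(a, b) {
  return JSON.stringify(canonicalize(a)) === JSON.stringify(canonicalize(b));
}

const read = { tool: "read_row", args: { id: 1, table: "notes" }, result: "old" };
const readReordered = { tool: "read_row", args: { table: "notes", id: 1 }, result: "old" };
const write = { tool: "write_row", args: { id: 1, table: "notes" }, result: "ok" };

console.log(sameTrace([read, write], [readReordered, write])); // true
console.log(sameTrace([read, write], [write, read])); // false
```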

One deliberate design choice: there is no auto-ignore for timestamps or IDs. If your tool results include a timestamp, that timestamp is part of the snapshot. This was a carefully considered tradeoff. Auto-ignoring fields requires configuration. Configuration requires decisions about which fields to ignore and why. Those decisions are easy to get wrong, and a field you thought was safe to ignore might turn out to matter. The library takes the position that you should normalize your results before recording if you want to exclude volatile fields. Do that in your test setup, explicitly, with intent. The library records what you pass it, exactly.
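Here is one way that normalization step could look in test setup. The normalize helper, the VOLATILE_KEYS set, and the sample result are hypothetical and not part of agentsnap. Which fields count as volatile is your call.

```javascript
// Hypothetical test-setup helper; not part of agentsnap.
const VOLATILE_KEYS = new Set(["timestamp", "id"]); // assumed field names

function normalize(value) {
  if (Array.isArray(value)) return value.map(normalize);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) =>
        // Replace volatile values with a stable placeholder; recurse into the rest.
        VOLATILE_KEYS.has(key) ? [key, `<${key}>`] : [key, normalize(inner)]
      )
    );
  }
  return value;
}

const rawResult = {
  id: "row_8f3a",
  timestamp: "2024-05-25T10:00:00Z",
  rows: [{ id: "r1", title: "AI news" }],
};

// id and timestamp become "<id>" and "<timestamp>" at every depth; title is untouched.
console.log(normalize(rawResult));
```

The normalized value is what you would pass as the result field when recording the call. Replacing values with a placeholder, instead of deleting the keys, means the snapshot still fails if a tool stops returning the field at all.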

When useful

You refactored agent logic and want to confirm tool call behavior did not change
You are reviewing a PR that touches agent code and want a diff of how tool usage changed between the old and new version
You want a regression baseline before shipping a prompt change, so you can see what the prompt change affected in terms of tool behavior
You are debugging an agent that was working and now is not, and you want to compare traces across deploys or code versions

When not useful

Your tool results are inherently non-deterministic, such as calls that return random IDs or live timestamps that change on every run
You want to test the correctness of results, not the structure of calls
You are building an agent that intentionally varies its tool strategy based on randomness, where snapshot stability would require freezing the random seed
You want a full E2E integration test harness that spins up services and runs the agent end to end

Install

npm install @mukundakatta/agentsnap
# or
yarn add @mukundakatta/agentsnap

Requires Node 18+. Zero runtime dependencies.

Siblings

Library	What it does	Registry
@mukundakatta/agentvet	Validates tool call signatures against a schema before execution	npm
@mukundakatta/agentguard	Allowlist-based egress control for agent HTTP calls	npm
@mukundakatta/agentcast	Repair-validate-retry loop for LLM structured output	npm
@mukundakatta/agenttrace	Per-run cost and latency tracking across LLM calls	GitHub
agentsnap-rs	Rust port of the same snapshot concept	crates.io

What is next

The main thing missing is a VS Code extension that opens the snapshot file in a side-by-side diff view when a test fails. Right now you get a terminal diff, which is readable but not the fastest to review when the snapshot is large. A visual diff in the editor would make reviewing snapshot updates much faster.

A GitHub Actions annotation mode is also on the list. When a snapshot test fails in CI, the action would post the diff as an inline review comment on the PR, so the reviewer can see exactly what changed in tool call behavior without checking out the branch and running tests locally.

If you build agents and you have ever spent time hunting down a silent behavior change in tool calls, agentsnap is worth a few minutes to try. The snapshot files are small, human-readable, and straightforward to review in a PR. The workflow is the same one Jest users already know.
