Mukunda Rao Katta

Posted on May 25

agentsnap-rs: Snapshot Testing for Rust Agent Tool-Call Traces

#hermeschallenge #ai #rust #agents

The bug that hid behind a passing test

A few months ago I changed a single line in a system prompt. One word, really. The output quality test still passed. The unit tests still passed. So the change shipped.

Three days later someone on the team noticed the agent was calling tools in a different order. The lookup happened before the validation step instead of after. The final answer was still correct in most cases, so no hard assertion broke. But the behavior had changed in a meaningful way. The intermediate work was different. The cost was different. In edge cases, the ordering mattered.

The problem was that the tests only checked the final output. Nobody was watching the trace.

That is the gap agentsnap-rs fills.

The shape of the fix

The idea is simple: record the tool-call trace on the first run, store it as deterministic JSON, and diff it on every run after that. If the calls change, the test fails.

# Cargo.toml
[dev-dependencies]
agentsnap-rs = "0.1"

use agentsnap_rs::{Snapshot, ToolCall};

#[test]
fn lookup_then_validate_order() {
    let calls = vec![
        ToolCall::new("lookup_user", serde_json::json!({ "id": "u42" })),
        ToolCall::new("validate_account", serde_json::json!({ "user_id": "u42" })),
        ToolCall::new("send_reply", serde_json::json!({ "text": "Account confirmed." })),
    ];

    Snapshot::assert("lookup_then_validate_order", &calls);
}

First run: no snapshot file exists yet, so the library writes one to .snapshots/lookup_then_validate_order.json and the test passes.

Second run: the library reads the stored snapshot, serializes the current trace to the same deterministic format, and compares them byte by byte. If anything changed, the test fails with a diff.

To update a snapshot intentionally:

AGENTSNAP_UPDATE=1 cargo test

That is the whole API surface for the basic case.
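For a sense of what that first-run and second-run behavior amounts to, here is a minimal sketch using only the standard library. It is not the crate's source: `check_snapshot`, `Outcome`, and `update_requested` are hypothetical names, and treating only the value `1` as "update" is an assumption on my part.

```rust
use std::{env, fs, io, path::Path};

/// Outcome of one snapshot check (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum Outcome {
    Written,  // no snapshot existed, or an update was forced, so one was stored
    Matched,  // stored bytes equal the current bytes
    Mismatch, // stored bytes differ; a real test would fail and print a diff
}

/// `current` is the already-serialized trace. Writes it on the first run,
/// compares it against the stored file on later runs.
fn check_snapshot(
    dir: &Path,
    name: &str,
    current: &str,
    force_update: bool,
) -> io::Result<Outcome> {
    let path = dir.join(format!("{name}.json"));
    if force_update || !path.exists() {
        fs::create_dir_all(dir)?;
        fs::write(&path, current)?;
        return Ok(Outcome::Written);
    }
    let stored = fs::read_to_string(&path)?;
    Ok(if stored == current {
        Outcome::Matched
    } else {
        Outcome::Mismatch
    })
}

/// Reads the update switch from the environment, mirroring
/// `AGENTSNAP_UPDATE=1 cargo test`. Assumption: only "1" counts as set.
fn update_requested() -> bool {
    env::var("AGENTSNAP_UPDATE").map(|v| v == "1").unwrap_or(false)
}
```

A caller would pass `update_requested()` as `force_update` and turn `Outcome::Mismatch` into a test failure.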

A slightly more realistic example

Real agents do not hand you a clean Vec<ToolCall>. You collect calls during a run.

use agentsnap_rs::{Recorder, Snapshot};

fn run_agent(query: &str) -> Vec<serde_json::Value> {
    let mut recorder = Recorder::new();

    // your agent loop here
    recorder.record("search", serde_json::json!({ "q": query }));
    recorder.record("fetch_page", serde_json::json!({ "url": "https://example.com/1" }));
    recorder.record("summarize", serde_json::json!({ "text": "..." }));

    recorder.calls()
}

#[test]
fn agent_trace_matches_snapshot() {
    let calls = run_agent("what is the weather in Austin");
    Snapshot::assert("weather_query_trace", &calls);
}

Recorder is just a thin wrapper that collects calls and hands them back as a Vec when the run is done. No async, no trait objects, nothing clever.

What it does NOT do

Before you reach for this library, know what it is not:

It does not run your agent. You bring the trace. The library only records and compares it.
It does not validate argument schemas. That is agentvet-rs. Snapshots catch drift; agentvet catches malformed args.
It does not track cost or latency. That is agenttrace-rs. Snapshots are structural, not metric.
It does not tell you why a call was made. That is agent-decision-log. Snapshots record what happened, not the reasoning behind it.

Inside the lib: deterministic JSON snapshots in git

The design decision I care most about is the snapshot format.

Every ToolCall is serialized with sorted keys. The library uses a custom serializer that walks the JSON value tree and sorts object keys at every level before writing. That means the same data always produces the same bytes regardless of insertion order in the source map.
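The sorted-key walk can be sketched without any dependencies. This is an illustration of the technique, not the library's serializer: the `Json` enum is a hypothetical stand-in for a JSON value, the function names are made up, and the output here is compact rather than pretty-printed.

```rust
/// Minimal stand-in for a JSON value (hypothetical, for illustration).
/// Objects keep insertion order so the sort below has something to do.
enum Json {
    Null,
    Bool(bool),
    Num(i64),
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Serializes `value`, sorting object keys at every nesting level.
fn to_sorted_string(value: &Json) -> String {
    let mut out = String::new();
    write_sorted(value, &mut out);
    out
}

fn write_sorted(value: &Json, out: &mut String) {
    match value {
        Json::Null => out.push_str("null"),
        Json::Bool(b) => out.push_str(if *b { "true" } else { "false" }),
        Json::Num(n) => out.push_str(&n.to_string()),
        Json::Str(s) => write_string(s, out),
        Json::Array(items) => {
            // Arrays keep their order: call order is the thing being tested.
            out.push('[');
            for (i, item) in items.iter().enumerate() {
                if i > 0 {
                    out.push(',');
                }
                write_sorted(item, out);
            }
            out.push(']');
        }
        Json::Object(fields) => {
            // Sort by key so insertion order never reaches the output.
            let mut sorted: Vec<&(String, Json)> = fields.iter().collect();
            sorted.sort_by(|a, b| a.0.cmp(&b.0));
            out.push('{');
            for (i, (key, val)) in sorted.into_iter().enumerate() {
                if i > 0 {
                    out.push(',');
                }
                write_string(key, out);
                out.push(':');
                write_sorted(val, out);
            }
            out.push('}');
        }
    }
}

/// Writes a JSON string literal with the required escapes.
fn write_string(s: &str, out: &mut String) {
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
}
```

Two objects built with `name` first and `args` first serialize to the same string, which is the property the byte-by-byte comparison relies on.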

A snapshot file looks like this:

[
  {
    "args": { "id": "u42" },
    "name": "lookup_user"
  },
  {
    "args": { "user_id": "u42" },
    "name": "validate_account"
  },
  {
    "args": { "text": "Account confirmed." },
    "name": "send_reply"
  }
]

That is it. Plain JSON, one object per call, keys sorted, stored in .snapshots/ next to your tests.

This matters for two reasons.

First, git diffs are readable. When someone changes the agent and the snapshot needs updating, the PR diff shows exactly which tool calls changed and how the arguments shifted. Reviewers can catch unintended behavioral changes during code review, not after deploy.

Second, there are no binary blobs. Some snapshot frameworks store compressed or binary artifacts. Those are opaque. You cannot review them. You cannot audit them. You just hope the tool is doing the right thing. Sorted JSON eliminates that problem.

The .snapshots/ directory is meant to be committed. The files are small, stable, and human-readable. They are part of the test contract.

When this is useful

Use agentsnap-rs when:

Your agent calls external tools and you want to catch unexpected changes to call order or arguments.
You are refactoring prompt logic and want confidence that the trace is structurally the same after the refactor.
You are doing prompt A/B testing and want to compare the behavioral trace between versions, not just the final answer.
You want reviewers to see exactly what changed in agent behavior when a PR touches the system prompt or tool routing logic.
You are debugging a regression and want to pin the last known good trace.

When NOT to use it

Skip agentsnap-rs when:

Your agent produces non-deterministic traces by design (random sampling, live data, time-based branching). A snapshot is only meaningful when the same input produces the same trace. Consider seeding randomness or using a deterministic replay harness first.
You only care about the final answer. If the intermediate tool calls do not matter for correctness, snapshot testing adds noise without value.
Your tool arguments contain volatile data like timestamps or request IDs. You would be updating the snapshot constantly. Either strip those fields before recording, or use a different validation strategy for those fields.
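Stripping volatile fields before recording can be done by hand today. The sketch below assumes a simplified trace shape with flat string arguments; `Call` and `strip_volatile` are hypothetical names, not part of agentsnap-rs, and real arguments are usually nested JSON.

```rust
use std::collections::BTreeMap;

/// Hypothetical trace entry: tool name plus flat string arguments.
#[derive(Debug, Clone, PartialEq)]
struct Call {
    name: String,
    args: BTreeMap<String, String>,
}

/// Returns a copy of the trace with the named argument fields removed.
/// The input trace is left untouched.
fn strip_volatile(trace: &[Call], volatile: &[&str]) -> Vec<Call> {
    trace
        .iter()
        .map(|call| {
            let mut cleaned = call.clone();
            // Keep only the keys that are not listed as volatile.
            cleaned
                .args
                .retain(|key, _| !volatile.contains(&key.as_str()));
            cleaned
        })
        .collect()
}
```

Running the trace through something like `strip_volatile(&trace, &["timestamp", "request_id"])` before snapshotting keeps the stored file stable across runs.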

Install

# Cargo.toml
[dev-dependencies]
agentsnap-rs = "0.1"

GitHub: MukundaKatta/agentsnap-rs

Siblings

Lib	Boundary	Repo
agentsnap	Python original, same semantics	MukundaKatta/agentsnap
agenttrace-rs	Cost and latency per run, complements snapshots	MukundaKatta/agenttrace-rs
agentvet-rs	Validates tool args before they appear in the trace	MukundaKatta/agentvet-rs
agent-decision-log	WHY layer, records reasoning alongside the WHAT snapshot	MukundaKatta/agent-decision-log

The four libraries compose well. agentvet-rs catches malformed args before they run. agentsnap-rs records what ran and flags structural drift. agenttrace-rs tells you what each run cost. agent-decision-log records why each branch was taken. Together they give you a reasonable audit trail for an agent run.

What is next

A few things on the list:

Field masking. A way to mark specific argument fields as volatile so they are excluded from the diff. Useful for timestamps and generated IDs.
Partial matching. Assert that certain calls appear without pinning the full trace. Useful when you care about one tool call but not the full sequence around it.
CLI diff tool. A small binary that takes two snapshot directories and prints a human-readable diff. Useful for comparing snapshots across branches without running the tests.
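Until partial matching exists, an ordering check on tool names alone can be written in a few lines. This is a hand-rolled sketch, not the planned API; `contains_in_order` is a hypothetical helper and it ignores arguments entirely.

```rust
/// Returns true if every name in `expected` appears in `trace` in the same
/// relative order. Other calls may sit in between.
fn contains_in_order(trace: &[&str], expected: &[&str]) -> bool {
    let mut remaining = trace.iter();
    // `any` consumes the iterator up to the first match, so each expected
    // name must be found after the previous one.
    expected
        .iter()
        .all(|want| remaining.any(|got| got == want))
}
```

With a trace of `lookup_user`, `validate_account`, `send_reply`, asking for `lookup_user` then `send_reply` passes, while asking for `validate_account` then `lookup_user` fails.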

The core loop is stable. Version 0.1.0 shipped on 2026-05-10. If you build something with it, open an issue. I want to know what the field-masking API should look like before I commit to a shape.

The original bug I described at the top? Snapshot testing would have caught it. The tool order changed. The snapshot diff would have shown exactly that, and the PR adding the prompt change would have failed CI. That is the use case. That is why this exists.
