Daniel Castillo
TracePact: Catch AI agent tool-call regressions before production

You changed a prompt. The output still looks fine. But your agent stopped reading the config before deploying and switched from running tests to running builds.

Nobody noticed until production broke.

The problem

Most agent failures aren't bad text — they're bad behavior. The agent calls the wrong tools, in the wrong order, with the wrong arguments. Output evals don't catch this because the final response
still looks plausible.

Teams try to catch it manually:

  • reviewing traces in agent UIs
  • parsing raw session logs
  • comparing old vs new runs by hand
  • debugging regressions only after users report them

What TracePact does

TracePact is a behavioral testing framework for AI agents. It works at the tool-call level, not the text level.

1. Write behavior contracts:

```typescript
import { TraceBuilder } from '@tracepact/vitest';

const trace = new TraceBuilder()
  .addCall('read_file', { path: 'src/service.ts' }, '...')
  .addCall('write_file', { path: 'src/service.ts', content: '...' })
  .addCall('run_tests', {}, 'PASS')
  .build();

// Did it read before writing?
expect(trace).toHaveCalledToolsInOrder([
  'read_file', 'write_file', 'run_tests'
]);

// Did it avoid shell?
expect(trace).toNotHaveCalledTool('bash');
```

No API calls. No tokens. Runs in milliseconds.
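Under the hood, an ordered-tools assertion amounts to a subsequence check over the recorded call list. Here's a minimal sketch of that idea — the `ToolCall` type and `calledToolsInOrder` function are illustrative, not TracePact's actual internals:

```typescript
type ToolCall = { tool: string; args: Record<string, unknown> };

// Returns true if `expected` appears as an in-order subsequence of the
// recorded calls (other calls may be interleaved between them).
function calledToolsInOrder(trace: ToolCall[], expected: string[]): boolean {
  let i = 0;
  for (const call of trace) {
    if (i < expected.length && call.tool === expected[i]) i++;
  }
  return i === expected.length;
}

const recorded: ToolCall[] = [
  { tool: 'read_file', args: { path: 'src/service.ts' } },
  { tool: 'write_file', args: { path: 'src/service.ts' } },
  { tool: 'run_tests', args: {} },
];

console.log(calledToolsInOrder(recorded, ['read_file', 'write_file'])); // true
console.log(calledToolsInOrder(recorded, ['write_file', 'read_file'])); // false
```

Because it's a subsequence (not exact-match) check, the agent is free to make extra calls in between — the assertion only cares that the required ordering holds.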

2. Record & replay:

```shell
# Record a baseline (one-time, live)
npx tracepact run --live --record

# Replay without API calls (instant, deterministic)
npx tracepact run --replay ./cassettes
```
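Conceptually, a cassette is a lookup table from (tool, arguments) to the result captured during the live run. A rough sketch of that mechanism — the `Cassette` class here is hypothetical, not TracePact's real storage format:

```typescript
type RecordedCall = { tool: string; args: Record<string, unknown>; result: string };

// Maps (tool, args) to the result recorded during the live run,
// so replays are deterministic and never touch the network.
class Cassette {
  private entries = new Map<string, string>();

  record(call: RecordedCall): void {
    this.entries.set(this.key(call.tool, call.args), call.result);
  }

  // Replay: return the recorded result instead of calling the real tool.
  replay(tool: string, args: Record<string, unknown>): string {
    const result = this.entries.get(this.key(tool, args));
    if (result === undefined) throw new Error(`No recording for ${tool}`);
    return result;
  }

  // A real implementation would canonicalize argument key order;
  // plain JSON.stringify is enough for a sketch.
  private key(tool: string, args: Record<string, unknown>): string {
    return tool + ':' + JSON.stringify(args);
  }
}
```

This is why replays are instant and free: every tool response is served from the recorded map rather than from a live model or API.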

3. Diff runs to catch drift:

```shell
npx tracepact diff baseline.json latest.json --fail-on warn
```

```
3 changes detected:

- read_file (seq 1) (removed)
+ write_file (seq 3) (added)
~ bash.cmd: "npm test" -> "npm run build"

Summary: 1 removed, 1 added, 1 arg changed [BLOCK]
```

Filter noisy args and irrelevant tools:

```shell
npx tracepact diff baseline.json latest.json \
  --ignore-keys timestamp,requestId \
  --ignore-tools read_file
```

Severity levels: `none` (identical), `warn` (args changed), `block` (tools added/removed). Use `--fail-on` in CI to gate deployments.
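The severity logic boils down to a simple classification over the two traces. A minimal sketch (the `classifyDrift` function is illustrative, not TracePact's actual diff engine):

```typescript
type Call = { tool: string; args: Record<string, unknown> };
type Severity = 'none' | 'warn' | 'block';

// block: the tool sequence itself changed (calls added/removed/reordered)
// warn:  same tools in the same order, but some arguments changed
// none:  traces are identical
function classifyDrift(baseline: Call[], latest: Call[]): Severity {
  const tools = (t: Call[]) => t.map((c) => c.tool).join(',');
  if (tools(baseline) !== tools(latest)) return 'block';
  const argsChanged = baseline.some(
    (c, i) => JSON.stringify(c.args) !== JSON.stringify(latest[i].args)
  );
  return argsChanged ? 'warn' : 'none';
}

const base: Call[] = [{ tool: 'bash', args: { cmd: 'npm test' } }];
const next: Call[] = [{ tool: 'bash', args: { cmd: 'npm run build' } }];
console.log(classifyDrift(base, next)); // 'warn'
```

So a prompt change that flips `npm test` to `npm run build` surfaces as `warn`, while dropping the `read_file` call entirely escalates to `block` — exactly the distinction `--fail-on` lets you gate on.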

Good fit

  • Coding agents — read before write, run tests before finishing, never edit restricted files
  • Ops agents — inspect before restarting, check evidence before acting
  • Workflow agents — validate before mutation, avoid duplicate side effects
  • Internal assistants — use the correct system for each task

Less useful for

Pure chatbots, style evaluation, creative tasks, or systems where only text output matters. TracePact is for behavioral guarantees, not response quality.

MCP server for IDEs

TracePact ships an MCP server that works with Claude Code, Cursor, and Windsurf:

```json
{
  "mcpServers": {
    "tracepact": {
      "command": "npx",
      "args": ["@tracepact/mcp-server"]
    }
  }
}
```

Tools: tracepact_audit, tracepact_run, tracepact_capture, tracepact_replay, tracepact_diff, tracepact_list_tests.

Get started

```shell
npm install @tracepact/core @tracepact/vitest @tracepact/cli
npx tracepact init
npx tracepact
```

GitHub: https://github.com/dcdeve/tracepact


We built this because we kept running into the same problem: prompt or model changes that silently break agent behavior while the output still looks fine. If you're testing AI agents, I'd love to hear
how you're handling tool-call regressions today.

Top comments (1)

Hamza KONTE

This is exactly the right layer to test. Most teams test LLM output quality (does the text look good?) but skip tool-call behavior (did the agent actually do the right things?). Those are fundamentally different failure modes and require different testing infrastructure.

The prompt-change detection angle is particularly sharp — a tiny wording change can silently flip which tool the agent calls, with no surface-level output change. That's the dangerous class of regression.

One thing I've noticed: when agents have well-structured prompts (explicit blocks for role, constraints, tool usage rules), tool-call behavior becomes more stable and therefore more testable. Vague system prompts make tool selection probabilistic in ways that TracePact would struggle to pin down because the model is essentially guessing.

If you're looking for a way to produce more deterministic prompts before feeding them to TracePact's test harness, I built flompt (flompt.dev) — a visual prompt builder that compiles structured XML prompts designed to minimize that kind of drift. Might be an interesting pairing for tightening the signal. Open-source at github.com/Nyrok/flompt.