Testing AI agents is hard.
You can’t just assert on text; you need to verify tool calls, retries, and trajectories, ideally in CI, without touching your agent’s code.
That’s why I built Agentest: a Vitest-style test runner for AI agents in Node.js/TypeScript.
## The problem
I’ve been building AI-agent products for a few years now, and one pattern keeps coming up:
- An agent talks to 3–5 tools (`book_appointment`, `check_availability`, `send_email`, etc.).
- The “happy path” is easy to test once.
- The “edge cases” (error injection, tool-sequence changes, partial failures) are a mess.
- The only “real” tests are:
  - manual QA, or
  - fragile snapshot-style tests that break on tiny wording changes.
On top of that, most agent-evaluation tools are observability platforms or hosted dashboards, not something that lives in your repo like Jest or Vitest.
## What Agentest is
Agentest is an embedded agent-simulation and evaluation framework for Node.js/TypeScript.
- It lives in your project like Playwright.
- You run it with `npx agentest run`.
- It spins up LLM-powered simulated users, sends them to your agent, mocks your tool calls, and evaluates every turn with LLM-as-judge metrics, all without touching your agent’s code.
You can think of it as “Vitest for AI agents”:
- scenario-style tests,
- deterministic mocks,
- trajectory assertions,
- LLM-as-judge metrics,
- and CI-ready CLI exits.
## Quick start (in 100% TypeScript)
Here’s a minimal example with a booking-style agent.
- Install it:

```bash
npm install agentest --save-dev
```

- Create a config:

```typescript
// agentest.config.ts
import { defineConfig } from 'agentest'

export default defineConfig({
  agent: {
    name: 'booking-agent',
    endpoint: 'http://localhost:3000/api/chat',
  },
})
```
- Write a scenario:

```typescript
// tests/booking.sim.ts
import { scenario, sequence } from 'agentest'

scenario('user books a morning slot', {
  profile: 'Busy professional who prefers mornings.',
  goal: 'Book a haircut for next Tuesday morning.',
  knowledge: [
    { content: 'The salon is open Tuesday 08:00–18:00.' },
    { content: 'Standard haircut takes 45 minutes.' },
  ],
  mocks: {
    tools: {
      check_availability: (args) => ({
        available: true,
        slots: ['09:00', '09:45', '10:30'],
        date: args.date,
      }),
      create_booking: sequence([
        { success: true, bookingId: 'BK-001', confirmationSent: true },
      ]),
    },
  },
  assertions: {
    toolCalls: {
      matchMode: 'contains',
      expected: [
        { name: 'check_availability', argMatchMode: 'ignore' },
        { name: 'create_booking', argMatchMode: 'ignore' },
      ],
    },
  },
})
```

- Run it:

```bash
npx agentest run
```
You’ll see something like:

```
Agentest running 1 scenario(s)

✓ user books a morning slot → 2/2 conversations passed
  ✓ conv-1-a3f8b2c1: goal met, trajectory matched, helpfulness: 4.5, coherence: 5.0
  ✓ conv-2-d9e1f4a7: goal met, trajectory matched, helpfulness: 5.0, coherence: 5.0

1/1 scenarios passed
```
## How it works under the hood
Agentest owns the tool-call loop:

1. The LLM-powered simulated user sends a message to your agent (via HTTP).
2. If the agent responds with `tool_calls`, Agentest resolves each call through your mocks, injects the results back, and POSTs them again (no changes to your agent code).
3. This repeats until the agent emits a final text response.
4. The simulated user responds, and the loop continues until the goal is met or `maxTurns` is hit.
5. At the end, every turn is evaluated in parallel with LLM-as-judge metrics (helpfulness, coherence, relevance, faithfulness, goal completion, etc.).
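As a rough sketch of the mock-resolution step in that loop (the types and function here are illustrative, not Agentest’s actual internals):

```typescript
// Illustrative sketch only: these names are hypothetical, not Agentest's API.
type ToolCall = { id: string; name: string; args: Record<string, unknown> };
type MockFn = (args: Record<string, unknown>) => unknown;

// Resolve each tool call from the agent through the scenario's mock table,
// producing the tool-result messages that would be POSTed back to the agent.
function resolveToolCalls(calls: ToolCall[], mocks: Record<string, MockFn>) {
  return calls.map((call) => {
    const mock = mocks[call.name];
    if (!mock) throw new Error(`Unmocked tool: ${call.name}`);
    return {
      role: 'tool' as const,
      tool_call_id: call.id,
      content: JSON.stringify(mock(call.args)),
    };
  });
}

// Example: the booking agent asks to check availability.
const results = resolveToolCalls(
  [{ id: 'call_1', name: 'check_availability', args: { date: '2025-06-10' } }],
  { check_availability: (args) => ({ available: true, date: args.date }) },
);
console.log(results[0].content); // → {"available":true,"date":"2025-06-10"}
```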
You can configure:
- number of conversations per scenario,
- maximum turns,
- which metrics to run,
- thresholds (e.g., “helpfulness must be ≥ 3.5 on average”),
- mock behavior (sequences, errors, passthrough, etc.),
- and reporters (console, JSON, GitHub Actions, etc.).
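Put together in config form, those knobs might look like the sketch below. Only `maxTurns` and the metric names appear in the docs above; the other key names (`conversations`, `metrics`, `reporters`) are assumptions for illustration:

```typescript
// agentest.config.ts — illustrative only; several key names here are
// assumed, not verified against Agentest's documented schema.
import { defineConfig } from 'agentest'

export default defineConfig({
  agent: {
    name: 'booking-agent',
    endpoint: 'http://localhost:3000/api/chat',
  },
  conversations: 3, // conversations per scenario (assumed key name)
  maxTurns: 10,     // stop the simulated-user loop after 10 turns
  metrics: ['helpfulness', 'coherence', 'goal_completion'], // assumed key name
  reporters: ['console', 'json'],                           // assumed key name
})
```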
## Works with any AI agent framework
Agentest is framework-agnostic: it doesn’t care how your agent is built, as long as it exposes either:
- an OpenAI-compatible HTTP endpoint, or
- a custom handler function (e.g., LangChain, Mastra, Vercel AI SDK, OpenLLMetry, AutoGen, CrewAI, or your own in-process agent).
Because Agentest only talks to your agent via requests or a handler, you can test any agent framework without changing its code: just point Agentest at your endpoint or adapter function and run `npx agentest run`. This makes it easy to plug Agentest into existing LangChain, Mastra, or OpenLLMetry codebases, or keep your own agent runtime while still getting robust, CI-ready tests.
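As a hedged sketch of what the custom-handler route could look like (the `handler` option and its signature are assumptions here, not Agentest’s documented API):

```typescript
// agentest.config.ts — hypothetical custom-handler wiring; the `handler`
// option and its (messages) => response signature are assumptions.
import { defineConfig } from 'agentest'
import { myAgent } from './src/agent' // your own in-process agent (hypothetical)

export default defineConfig({
  agent: {
    name: 'in-process-agent',
    // Instead of an HTTP endpoint, hand Agentest a function that receives
    // the conversation so far and returns the agent's next message.
    handler: async (messages) => myAgent.respond(messages),
  },
})
```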
## Why this is different
Agentest is not:
- Another agent framework (like LangChain, Mastra, OpenAI Agents SDK, CrewAI, etc.).
- Another hosted observability dashboard.
- Another “prompt-only” evaluation tool.
It is:
- A test runner that lives in your repo, just like Vitest or Jest.
- A mock-heavy layer that intercepts your agent’s tool calls and lets you assert on trajectories, not just final text.
- A CI-native CLI that can fail the build if:
- the agent calls the wrong tools,
- it misses steps,
- or LLM-as-judge metrics fall below your thresholds.
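To illustrate what a trajectory assertion checks, here is a sketch of the idea behind a `contains`-style match (a conceptual illustration, not Agentest’s real matcher): the expected tool names must appear in the actual call sequence, in order, with gaps allowed.

```typescript
// Sketch of 'contains'-style trajectory matching: expected tool names must
// appear in the actual call sequence, in order, with other calls allowed
// in between. Illustrative only — not Agentest's implementation.
function containsInOrder(actual: string[], expected: string[]): boolean {
  let i = 0;
  for (const name of actual) {
    if (i < expected.length && name === expected[i]) i++;
  }
  return i === expected.length;
}

const calls = ['check_availability', 'send_email', 'create_booking'];
console.log(containsInOrder(calls, ['check_availability', 'create_booking'])); // → true
console.log(containsInOrder(calls, ['create_booking', 'check_availability'])); // → false
```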
You can mix it with existing tools:
- Use LangSmith / Braintrust / DeepEval for observability and long-term evals.
- Use Agentest for regression-style, code-first tests that run in every PR.
## Scenarios, mocks, and assertions
Agentest’s core model is:

- **Scenario**
  - Who the simulated user is (`profile`)
  - What they want (`goal`)
  - What they know (`knowledge`)
  - How tools behave (`mocks`)
- **Mocks**
  - Function mocks,
  - sequence mocks,
  - error simulation,
  - and an “unmocked tools” policy (`error` vs `passthrough`).
- **Assertions**
  - Trajectory-based assertions on tool-call sequences (`strict`, `unordered`, `contains`, `within` modes),
  - argument matching (`ignore`, `partial`, `exact`),
  - quantitative and qualitative metrics (helpfulness, coherence, relevance, faithfulness, goal completion, failure labels, etc.).
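To make the mock side concrete, here is a hedged sketch combining a sequence mock with a simulated error. The error-object shape and the `unmockedTools` key name are assumptions for illustration; only `sequence` and the `error`/`passthrough` policy values come from the docs above:

```typescript
// Illustrative mock table: the first call to create_booking fails, the
// retry succeeds. The { error: ... } shape is assumed for this sketch.
import { sequence } from 'agentest'

const mocks = {
  tools: {
    create_booking: sequence([
      { error: 'SLOT_ALREADY_TAKEN' },        // simulated tool failure
      { success: true, bookingId: 'BK-002' }, // retry succeeds
    ]),
  },
  // Policy for tools without a mock: fail the call, or pass it through
  // to the real tool. Key name assumed; values per the list above.
  unmockedTools: 'error',
}
```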
## LLM-as-judge metrics
At the end of the run, each turn is evaluated in parallel using LLM-as-judge prompts. You get metrics like:
- Helpfulness (1–5)
- Coherence (1–5)
- Relevance (1–5)
- Faithfulness (1–5)
- Verbosity (1–5)
- Goal completion (binary, per-conversation)
- Agent behavior failures (e.g., repetition, refusal to clarify, hallucination).
You can set thresholds (e.g., `helpfulness: 3.5`, `goal_completion: 0.8`) so the run fails if the average slips below them.
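In config form, that might look like the following sketch (the `thresholds` key name and placement are assumptions; the metric names and values come from the example above):

```typescript
// Hypothetical thresholds block — the run fails when averages slip below.
import { defineConfig } from 'agentest'

export default defineConfig({
  agent: {
    name: 'booking-agent',
    endpoint: 'http://localhost:3000/api/chat',
  },
  thresholds: {
    helpfulness: 3.5,     // average 1–5 judge score must stay ≥ 3.5
    goal_completion: 0.8, // ≥ 80% of conversations must reach the goal
  },
})
```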
## Comparison mode (for experimentation)
Agentest also supports comparison mode, where the same scenarios run against multiple models or agent variants side by side.
```typescript
// agentest.config.ts
import { defineConfig } from 'agentest'

export default defineConfig({
  agent: {
    name: 'gpt-4o',
    endpoint: 'http://localhost:3000/api/chat',
    body: { model: 'gpt-4o', temperature: 0.7 },
  },
  compare: [
    { name: 'gpt-4o-mini', body: { model: 'gpt-4o-mini' } },
    { name: 'claude-sonnet', body: { model: 'claude-sonnet-4-20250514' } },
  ],
})
```
After each run, you get a per-metric breakdown:

```
user books a morning slot
  ✓ gpt-4o:      5/5 scenarios passed
  ✓ gpt-4o-mini: 4/5 scenarios passed

  ── comparison ──
  helpfulness:  gpt-4o: 4.5  |  gpt-4o-mini: 3.8
  coherence:    gpt-4o: 5.0  |  gpt-4o-mini: 4.2
```
This is great for A/B testing model versions, prompt changes, or different agent implementations.
## Local LLMs and CI
Agentest can run its own LLM-driven simulated users and evaluators on:

- Anthropic,
- OpenAI,
- Ollama,
- any `openai-compatible`-style endpoint (vLLM, LM Studio, etc.).

Meanwhile, your agent lives anywhere:

- via an HTTP endpoint (OpenAI-compatible, Azure, LiteLLM, etc.),
- or via a `custom` handler function (LangChain, Anthropic SDK, Vercel AI SDK, Mastra, OpenLLMetry, etc.).
The CLI exits:

- `0` if all scenarios pass,
- `1` if any scenario fails or no scenarios are found,

so you can drop `npx agentest run` into your GitHub Actions, CircleCI, or any other CI pipeline.
## Why this is needed now
AI agents are moving fast:
- Teams are already building agents that:
- handle GitHub issues,
- book appointments,
- route support tickets,
- generate and test code.
- They crave real regression tests, not just “prompt-playground” demos.
- Existing tools are either:
- observability dashboards,
- agent frameworks,
- or one-off scripts.
Agentest fills the gap of “Vitest-style testing for AI agents”: deterministic, mock-heavy, and built into your CI/CD, just like the rest of your tests.
## Get started
- GitHub: https://github.com/r-prem/agentest
- Install: `npm install agentest --save-dev`
- Run: `npx agentest run`
- Docs: https://r-prem.github.io/agentest/
Agentest is MIT-licensed and Node-/TS-first, with frameworks like LangChain, Mastra, OpenLLMetry, and many others already compatible out of the box.
If you find it useful, I’d love to see what you’re testing with it: share a scenario, a repo, or a GitHub Actions setup, and I’ll happily add it to the docs!