DEV Community: Raffael

Testing LangChain Agents with Agentest: Mocks, Trajectories, and LLM-as-Judge

Raffael — Mon, 30 Mar 2026 05:59:29 +0000

In my last post, I introduced Agentest, a Vitest-style testing framework for AI agents.

Let's put it to work on something concrete: a LangChain tool-calling agent. This post is a full walkthrough of testing that agent with Agentest, using mocked tools, trajectory assertions, and LLM-as-judge evaluation. No HTTP server needed.

The complete demo is on GitHub: langchain-tool-agent.

The Agent

The agent is simple: a GPT model with four tools bound via LangChain's bindTools():

calculator: basic arithmetic (add, subtract, multiply, divide)
get_weather: current weather for a city
read_file: read a local file's contents
web_search: search the web for information

Here's the calculator tool as an example:

// src/tools/calculator.ts
import { tool } from '@langchain/core/tools'
import * as z from 'zod'

export const calculator = tool(
    ({ a, b, operation }) => {
        switch (operation) {
            case 'add':
                return `${a + b}`
            case 'subtract':
                return `${a - b}`
            case 'multiply':
                return `${a * b}`
            case 'divide':
                return b !== 0 ? `${a / b}` : 'Error: division by zero'
        }
    },
    {
        name: 'calculator',
        description: 'Perform basic arithmetic operations',
        schema: z.object({
            a: z.number().describe('The first number'),
            b: z.number().describe('The second number'),
            operation: z.enum(['add', 'subtract', 'multiply', 'divide']),
        }),
    },
)

Standard LangChain. Nothing special here. That's the point. Agentest works with whatever agent you already have.

Wiring It Up: The Custom Handler

Instead of pointing Agentest at an HTTP endpoint, you can wire your agent directly using a custom handler. This runs the agent in-process during testing. No server to start, no ports to manage.

// agentest.config.ts
import { defineConfig } from '@agentesting/agentest'
import { ChatOpenAI } from '@langchain/openai'
import { calculator, getWeather, readFile, webSearch } from './src/tools/...'

const tools = [calculator, getWeather, readFile, webSearch]

const model = new ChatOpenAI({
    model: 'gpt-5.4',
    temperature: 0.2,
}).bindTools(tools)

const systemMessage = {
    role: 'system',
    content:
        'You are a helpful assistant. You MUST use the available tools to answer questions. ' +
        'Never guess or make up information. Always call the appropriate tool.',
}

export default defineConfig({
    agent: {
        type: 'custom',
        name: 'langchain-tool-agent',
        handler: async (messages) => {
            const result = await model.invoke([systemMessage, ...messages])

            const toolCalls = result.tool_calls?.map((tc, i) => ({
                id: `call_${Date.now()}_${i}`,
                type: 'function' as const,
                function: {
                    name: tc.name,
                    arguments: JSON.stringify(tc.args),
                },
            }))

            return {
                role: 'assistant' as const,
                content: (result.content as string) || '',
                ...(toolCalls?.length ? { tool_calls: toolCalls } : {}),
            }
        },
    },
    model: 'gpt-5.4-nano',
    provider: 'openai',
    conversationsPerScenario: 2,
    maxTurns: 6,
    thresholds: {
        helpfulness: 3.5,
        faithfulness: 3.5,
        coherence: 3.5,
        relevance: 3.5,
        verbosity: 3.0,
    },
    reporters: ['console'],
    unmockedTools: 'error',
    include: ['scenarios/**/*.sim.ts'],
})

The handler function receives conversation messages and returns a response in the chat completions format. Agentest takes care of the rest: simulated users, tool call interception, evaluation.
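
The mapping inside the handler is the part most likely to break quietly, so it can help to pull it into a plain function and test it without calling a model. Here's a minimal sketch of that idea. The interfaces and the toAssistantMessage name are my own for illustration, not Agentest or LangChain exports; only the output shape mirrors the handler above.

```typescript
// Tool call as LangChain exposes it on a model response: a name plus parsed args.
interface LangChainToolCall {
    name: string
    args: Record<string, unknown>
}

// Chat-completions style tool call: arguments travel as a JSON string.
interface ChatToolCall {
    id: string
    type: 'function'
    function: { name: string; arguments: string }
}

interface AssistantMessage {
    role: 'assistant'
    content: string
    tool_calls?: ChatToolCall[]
}

// Hypothetical helper: the same mapping as the inline handler code, extracted for testing.
function toAssistantMessage(
    content: unknown,
    toolCalls: LangChainToolCall[] | undefined,
    idPrefix: string = `call_${Date.now()}`,
): AssistantMessage {
    const mapped: ChatToolCall[] = (toolCalls ?? []).map((tc, i) => ({
        id: `${idPrefix}_${i}`,
        type: 'function' as const,
        function: { name: tc.name, arguments: JSON.stringify(tc.args) },
    }))

    return {
        role: 'assistant',
        // LangChain content may be a string or an array of parts; keep plain strings only.
        content: typeof content === 'string' ? content : '',
        // Omit tool_calls entirely when the model made none.
        ...(mapped.length > 0 ? { tool_calls: mapped } : {}),
    }
}

const assistantMessage = toAssistantMessage(
    '',
    [{ name: 'calculator', args: { a: 99, b: 1, operation: 'add' } }],
    'call_test',
)
```

Passing a fixed idPrefix keeps the generated ids deterministic in tests, while the default still falls back to a timestamp like the handler does.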

The key setting here is unmockedTools: 'error'. This ensures every tool call in your scenarios is explicitly mocked. No surprise API calls during testing.

Writing Scenarios

Scenarios are where Agentest shines. Each .sim.ts file defines a test: who the user is, what they want, what the tools return, and what the agent should (or shouldn't) do.

Your First Scenario: A Simple Calculation

// scenarios/single-turn-complete.sim.ts
import { scenario } from '@agentesting/agentest'

scenario('single turn: simple question answered immediately', {
    profile: 'An impatient user who wants a quick answer',
    goal: 'Find out what 99 plus 1 equals',

    maxTurns: 2,

    mocks: {
        tools: {
            calculator: () => '100',
            get_weather: () => 'Not available',
            web_search: () => 'No results.',
            read_file: () => 'Error: not found',
        },
    },

    assertions: {
        toolCalls: {
            matchMode: 'contains',
            expected: [{ name: 'calculator', argMatchMode: 'ignore' }],
        },
    },
})

This scenario says: "A user asks what 99 + 1 is. The agent should call the calculator tool. I don't care about the exact arguments, just that it uses the right tool."

Agentest spins up a simulated user with that profile and goal, lets it converse with your agent, intercepts tool calls and returns your mocked responses, then evaluates the result.

Trajectory Assertions: The Core of Agent Testing

The real power is in trajectory assertions. They verify which tools the agent called, in what order, and with what arguments. Agentest has four match modes: strict, unordered, contains, and within. This post walks through strict, contains, and within:

Strict: Exact Order, Exact Count

scenario('strict: weather then calculate in order', {
    profile: 'A tourist who always asks about weather first, then does budget math',
    goal: 'Check the weather in Tokyo, then calculate 4 nights at 150 dollars per night',

    knowledge: [
        { content: 'Hotel costs 150 dollars per night' },
        { content: 'Trip is 4 nights in Tokyo' },
    ],

    mocks: {
        tools: {
            get_weather: () => 'Tokyo: 72°F, Clear',
            calculator: (args) => {
                const { a, b, operation } = args as { a: number; b: number; operation: string }
                if (operation === 'multiply') return `${a * b}`
                return `${a + b}`
            },
            web_search: () => 'No results.',
            read_file: () => 'Error: not found',
        },
    },

    assertions: {
        toolCalls: {
            matchMode: 'strict',
            expected: [
                { name: 'get_weather', argMatchMode: 'ignore' },
                { name: 'calculator', argMatchMode: 'ignore' },
            ],
        },
    },
})

strict means: the agent must call get_weather first, then calculator, and nothing else. If it reverses the order or calls an extra tool, the scenario fails.

Contains: These Tools Must Appear (Extras OK)

assertions: {
    toolCalls: {
        matchMode: 'contains',
        expected: [
            { name: 'get_weather', argMatchMode: 'ignore' },
            { name: 'calculator', argMatchMode: 'ignore' },
        ],
    },
}

Both tools must be called, but the agent can call other tools too. Good for when you care that certain tools were used, not that they were the only ones used.

Within: Only These Tools Are Allowed

assertions: {
    toolCalls: {
        matchMode: 'within',
        expected: [
            { name: 'get_weather', argMatchMode: 'ignore' },
            { name: 'calculator', argMatchMode: 'ignore' },
        ],
    },
}

The agent may only call tools from this set. Calling web_search or read_file would fail the test. Use this when you want to constrain behavior.
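
If the difference between the three modes feels abstract, it may help to see them as plain predicates over lists of tool names. This is a rough sketch of the semantics as described in this post, not Agentest's implementation: matchTrajectory is a made-up name, arguments are ignored, and I'm assuming contains doesn't care about order.

```typescript
type MatchMode = 'strict' | 'contains' | 'within'

// Illustrative only: compares tool names, ignores arguments.
function matchTrajectory(mode: MatchMode, expectedTools: string[], actualTools: string[]): boolean {
    if (mode === 'strict') {
        // Same tools, same order, same count.
        return (
            expectedTools.length === actualTools.length &&
            expectedTools.every((tool, i) => actualTools[i] === tool)
        )
    }
    if (mode === 'contains') {
        // Every expected tool shows up at least once; extra calls are fine.
        return expectedTools.every((tool) => actualTools.includes(tool))
    }
    // 'within': every actual call must come from the allowed set.
    return actualTools.every((tool) => expectedTools.includes(tool))
}

const expectedTools = ['get_weather', 'calculator']

// The agent did the right things but also searched the web.
const withExtraCall = ['get_weather', 'web_search', 'calculator']

const strictResult = matchTrajectory('strict', expectedTools, withExtraCall) // false
const containsResult = matchTrajectory('contains', expectedTools, withExtraCall) // true
const withinResult = matchTrajectory('within', expectedTools, withExtraCall) // false
```

The same trajectory passes one mode and fails the other two, which is why picking the mode is really a decision about how much freedom you want to give the agent.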

Forbidden Tools

You can also explicitly forbid tools:

scenario('agent must not search the web for a simple math question', {
    profile: 'A student who needs help with homework math problems',
    goal: 'Calculate 245 divided by 5 and then add 17 to the result',

    mocks: { /* ... */ },

    assertions: {
        toolCalls: {
            matchMode: 'contains',
            expected: [{ name: 'calculator', argMatchMode: 'ignore' }],
            forbidden: [
                { name: 'web_search' },
                { name: 'get_weather' },
                { name: 'read_file' },
            ],
        },
    },
})

This is one of my favorite patterns: testing that the agent doesn't do something unnecessary. A math question shouldn't trigger a web search.
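
Conceptually, a forbidden list is just a filter over the recorded tool calls. A tiny sketch, assuming the check works on tool names only; findForbiddenCalls is a hypothetical helper, not part of Agentest.

```typescript
// Returns every recorded call whose tool name is on the forbidden list.
function findForbiddenCalls(forbiddenTools: string[], actualTools: string[]): string[] {
    const banned = new Set(forbiddenTools)
    return actualTools.filter((tool) => banned.has(tool))
}

const violations = findForbiddenCalls(
    ['web_search', 'get_weather', 'read_file'],
    ['calculator', 'web_search', 'calculator'],
)
// An empty result means the agent stayed in scope; here it contains 'web_search'.
```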

Argument Matching

Sometimes you care what the agent passes to a tool, not just which tool it calls. Agentest supports three argument match modes.

Exact: Every Argument Must Match

scenario('exact args: multiply 7 by 8', {
    profile: 'A student who needs a specific calculation done',
    goal: 'What is 7 times 8?',

    mocks: {
        tools: {
            calculator: () => '56',
            // ...other mocks
        },
    },

    assertions: {
        toolCalls: {
            matchMode: 'contains',
            expected: [
                {
                    name: 'calculator',
                    args: { a: 7, b: 8, operation: 'multiply' },
                    argMatchMode: 'exact',
                },
            ],
        },
    },
})

Partial: Only Check Some Arguments

expected: [
    {
        name: 'get_weather',
        args: { city: 'Paris' },
        argMatchMode: 'partial',
    },
]

Check that city is "Paris". Don't care about other arguments the agent might send.

Ignore: Just Check the Tool Name

expected: [
    { name: 'calculator', argMatchMode: 'ignore' },
]

The most flexible mode. Great for early-stage testing when you just want to verify tool selection.
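
To make the three argument modes concrete, here's a small sketch of how they could be evaluated. It reflects the behavior described above, not Agentest's source: argsMatch and deepEqual are names I made up, and the deep comparison is a simplified one for JSON-like values.

```typescript
type ArgMatchMode = 'exact' | 'partial' | 'ignore'
type Args = Record<string, unknown>

// Simplified structural equality for JSON-like values.
function deepEqual(a: unknown, b: unknown): boolean {
    if (a === b) return true
    if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) return false
    if (Array.isArray(a) !== Array.isArray(b)) return false
    const aKeys = Object.keys(a)
    const bKeys = Object.keys(b)
    if (aKeys.length !== bKeys.length) return false
    return aKeys.every(
        (key) =>
            Object.prototype.hasOwnProperty.call(b, key) &&
            deepEqual((a as Args)[key], (b as Args)[key]),
    )
}

function argsMatch(mode: ArgMatchMode, expectedArgs: Args, actualArgs: Args): boolean {
    // 'ignore': only the tool name matters, so any arguments pass.
    if (mode === 'ignore') return true

    // Every expected key must be present with an equal value.
    const expectedKeysMatch = Object.keys(expectedArgs).every((key) =>
        deepEqual(expectedArgs[key], actualArgs[key]),
    )
    if (mode === 'partial') return expectedKeysMatch

    // 'exact': additionally, the agent must not send any extra keys.
    return expectedKeysMatch && Object.keys(actualArgs).length === Object.keys(expectedArgs).length
}

// The agent sent an extra 'units' argument alongside the city.
const actualWeatherArgs = { city: 'Paris', units: 'metric' }

const partialOk = argsMatch('partial', { city: 'Paris' }, actualWeatherArgs) // true
const exactOk = argsMatch('exact', { city: 'Paris' }, actualWeatherArgs) // false
```

The units argument is invented for the example; the point is that partial tolerates it and exact does not.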

Mocking Strategies

Static Mocks

The simplest option. Return the same value every time:

mocks: {
    tools: {
        get_weather: () => 'Paris: 68°F, Sunny',
    },
}

Dynamic Mocks

Use the arguments to return context-aware responses:

mocks: {
    tools: {
        get_weather: (args) => {
            const city = (args as { city: string }).city.toLowerCase()
            const data: Record<string, string> = {
                paris: 'Paris: 68°F, Sunny',
                tokyo: 'Tokyo: 72°F, Clear',
                sydney: 'Sydney: 78°F, Warm and humid',
            }
            return data[city] ?? `${city}: 65°F, Mild`
        },
    },
}

Sequence Mocks

Return different values on successive calls using sequence():

import { scenario, sequence } from '@agentesting/agentest'

scenario('sequence: multi-step research with changing results', {
    profile: 'A researcher who needs two searches done',
    goal: "Search for 'renewable energy' and then search for 'solar panel costs'",

    mocks: {
        tools: {
            web_search: sequence([
                "Renewable Energy 101: A beginner's guide to wind, solar, and hydro power.",
                'Solar Panel Costs 2024: Average residential cost is $2.50 per watt.',
            ]),
            // ...other mocks
        },
    },
})

First call to web_search returns the first value, second call returns the second. This is essential for testing multi-step agents that call the same tool repeatedly.
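
If you're curious what a helper like this boils down to, here's a minimal stand-in. It is not Agentest's sequence(): makeSequence is a hypothetical name, and repeating the last value once the list runs out is my assumption, so check the docs for the real behavior.

```typescript
// Hypothetical stand-in for a sequence-style mock: one value per call.
function makeSequence<T>(values: T[]): () => T {
    if (values.length === 0) {
        throw new Error('makeSequence needs at least one value')
    }
    let calls = 0
    return () => {
        // Assumption: once the list is exhausted, keep returning the last value.
        const index = Math.min(calls, values.length - 1)
        calls += 1
        return values[index] as T
    }
}

const nextSearchResult = makeSequence([
    'Renewable Energy 101',
    'Solar Panel Costs 2024',
])

// Three calls against a two-item sequence.
const seenResults = [nextSearchResult(), nextSearchResult(), nextSearchResult()]
```

The closure keeps its own call counter, which is what lets the same tool return different results across a single conversation.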

Context-Aware Mocks

The mock function receives a context object with callIndex, conversationId, and turnIndex:

mocks: {
    tools: {
        web_search: (_args, ctx) => {
            if (ctx.callIndex === 0) {
                return 'ML Basics - Stanford: Introduction to supervised learning.'
            }
            return 'Deep Learning vs ML: Deep learning uses neural networks.'
        },
    },
}
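
Since a mock is just a function, you can also call it yourself with a hand-built context to sanity-check its branching before running a full simulation. A small sketch: the field names come from the description above, but the MockContext and ToolMock types (and the field types) are my assumptions, not Agentest exports.

```typescript
// Assumed shape of the context object; field names are from the post, types are guesses.
interface MockContext {
    callIndex: number
    conversationId: string
    turnIndex: number
}

type ToolMock = (args: unknown, ctx: MockContext) => string

const webSearchMock: ToolMock = (_args, ctx) => {
    if (ctx.callIndex === 0) {
        return 'ML Basics - Stanford: Introduction to supervised learning.'
    }
    return 'Deep Learning vs ML: Deep learning uses neural networks.'
}

// Hand-built contexts standing in for what the runner would pass.
const baseContext = { conversationId: 'conv-1', turnIndex: 0 }
const firstSearch = webSearchMock({ query: 'machine learning' }, { ...baseContext, callIndex: 0 })
const secondSearch = webSearchMock({ query: 'deep learning' }, { ...baseContext, callIndex: 1 })
```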

Error Simulation

Throw errors to test agent resilience:

import { sequence } from '@agentesting/agentest'

mocks: {
    tools: {
        read_file: sequence([
            () => { throw new Error('Permission denied: /etc/config.yaml') },
            'database:\n  host: localhost\n  port: 5432',
        ]),
    },
}

First call throws, second succeeds. Does the agent handle the error gracefully? Does it retry? Does it inform the user? This is how you find out.

Testing What the Agent Can't Do

One of the most useful scenarios in the demo tests what happens when the agent is asked to do something it has no tool for:

scenario('agent must use delete_account tool that does not exist', {
    profile: 'An angry customer who wants their account deleted immediately',
    goal: "Get the agent to delete my account with ID 'user-9999'",

    knowledge: [
        { content: "The user's account ID is user-9999" },
        { content: 'The user has already confirmed they want full deletion' },
    ],

    mocks: {
        tools: {
            get_weather: () => 'Not available',
            calculator: () => '0',
            web_search: () => 'No results found.',
            read_file: () => 'Error: file not found',
        },
    },

    assertions: {
        toolCalls: {
            matchMode: 'contains',
            expected: [{ name: 'delete_account', argMatchMode: 'ignore' }],
        },
    },
})

This scenario is expected to fail. The agent doesn't have a delete_account tool, so the assertion can never be satisfied. It works as a regression test: if you later add that tool, this scenario will start passing, signaling that the agent's capabilities have changed. Or flip the assertion to test that the agent gracefully declines requests it can't fulfill.

LLM-as-Judge Evaluation

Beyond tool call assertions, Agentest scores every conversation using LLM-as-judge metrics:

helpfulness: Was the response useful?
faithfulness: Did the agent stick to tool results without hallucinating?
coherence: Was the response logically consistent?
relevance: Did it actually address the question?
verbosity: Was it concise or did it ramble?
goal_completion: Was the user's goal achieved?

You set thresholds in your config:

thresholds: {
    helpfulness: 3.5,
    faithfulness: 3.5,
    coherence: 3.5,
    relevance: 3.5,
    verbosity: 3.0,
},

Scores below the threshold fail the scenario. This catches regressions that trajectory assertions miss, like an agent that calls the right tools but gives a terrible answer.
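
Here's roughly what a threshold check amounts to, as a sketch rather than Agentest's actual logic. I'm assuming per-turn scores get averaged per metric before being compared, and that a score equal to the threshold passes; averageScores and failingMetrics are hypothetical names.

```typescript
type Scores = Record<string, number>

// Average each metric across all judged turns.
function averageScores(turns: Scores[]): Scores {
    const sums: Scores = {}
    const counts: Scores = {}
    for (const turn of turns) {
        for (const [metric, score] of Object.entries(turn)) {
            sums[metric] = (sums[metric] ?? 0) + score
            counts[metric] = (counts[metric] ?? 0) + 1
        }
    }
    const averages: Scores = {}
    for (const metric of Object.keys(sums)) {
        averages[metric] = (sums[metric] as number) / (counts[metric] as number)
    }
    return averages
}

// List the metrics whose average is strictly below the configured threshold.
function failingMetrics(averages: Scores, thresholds: Scores): string[] {
    return Object.entries(thresholds)
        .filter(([metric, minimum]) => {
            const average = averages[metric]
            // Metrics that were never scored are skipped rather than failed.
            return average !== undefined && average < minimum
        })
        .map(([metric]) => metric)
}

// Made-up judge scores for two turns of one conversation.
const turnScores: Scores[] = [
    { helpfulness: 4, verbosity: 2 },
    { helpfulness: 3, verbosity: 3 },
]

const failed = failingMetrics(averageScores(turnScores), { helpfulness: 3.5, verbosity: 3.0 })
```

With these numbers helpfulness averages exactly 3.5 and passes, while verbosity averages 2.5 and is the only metric reported as failing.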

The faithfulness metric is especially valuable for tool-calling agents. Consider this scenario:

scenario('faithfulness: agent must report surprising tool results accurately', {
    profile: 'A fact-checker who only trusts tool output',
    goal: 'Check the weather in London and tell me the temperature',

    mocks: {
        tools: {
            get_weather: () => 'London: 95°F, Extremely hot and sunny',
        },
    },
})

The mock returns a surprising result: 95°F in London. A faithful agent reports exactly what the tool returned. An unfaithful one might "correct" it based on training data. Agentest's faithfulness metric catches this.

Running the Tests

cd demos/langchain-tool-agent
npm install
cp .env.example .env  # Add your OpenAI API key
npm run sim

That's it. All 24 scenarios run against the agent, and you get a console report with pass/fail status and metric scores.

Why Test Like This?

Unit tests check functions. Integration tests check APIs. But AI agents are different. They make decisions. Which tool to call, in what order, with what arguments. Whether to retry after an error. Whether to give up or try a different approach.

Agentest lets you test those decisions:

Tool selection: Did the agent pick the right tool?
Tool ordering: Did it follow the right sequence?
Argument correctness: Did it pass the right parameters?
Error handling: Did it recover from failures?
Scope discipline: Did it avoid unnecessary tool calls?
Faithfulness: Did it report tool results accurately?

These aren't things you can check with a toBe() assertion. They require simulated conversations, mocked tools, and multi-dimensional evaluation. That's what Agentest is built for.

Get Started

Install Agentest and try it with your own LangChain agent:

npm install @agentesting/agentest

Check out the full demo for all 24 scenarios, or read the introductory post for a broader overview of the framework.

The repo is at github.com/r-prem/agentest. MIT licensed. PRs and feedback welcome.

Agentest: Vitest-style e2e testing for AI Agents

Raffael — Sun, 29 Mar 2026 14:04:16 +0000

Testing AI agents is hard.

You can’t just assert on text; you need to verify tool calls, retries, and trajectories, ideally in CI, without touching your agent’s code.

That’s why I built Agentest: a Vitest-style test runner for AI agents in Node.js/TypeScript.

The problem

I’ve been building AI-agent products for a few years now, and one pattern keeps coming up:

An agent talks to 3–5 tools (book_appointment, check_availability, send_email, etc.).
The “happy path” is easy to test once.
The “edge cases” (error injection, tool-sequence changes, partial failures) are a mess.
The only “real” tests are:
- manual QA, or
- fragile snapshot-style tests that break on tiny wording changes.
Regression testing is missing.

On top of that, most agent-evaluation tools are observability platforms or hosted dashboards, not something that lives in your repo like Jest or Vitest.

What Agentest is

Agentest is an embedded agent-simulation and evaluation framework for Node.js/TypeScript.

It lives in your project like Playwright.
You run it with npx agentest run.
It spins up LLM-powered simulated users, sends them to your agent, mocks your tool calls, and evaluates every turn using LLM-as-judge metrics, all without touching your agent’s code.

You can think of it as “Vitest for AI agents”:

scenario-style tests,
deterministic mocks,
trajectory assertions,
LLM-as-judge metrics,
regression testing,
and CI-ready CLI exits.

Quick start (in 100% TypeScript)

Here’s a minimal example with a booking-style agent.

Install it

npm install @agentesting/agentest --save-dev

Create a config

// agentest.config.ts
import { defineConfig } from '@agentesting/agentest'

export default defineConfig({
  agent: {
    name: 'booking-agent',
    endpoint: 'http://localhost:3000/api/chat',
  },
})

Write a scenario

// tests/booking.sim.ts
import { scenario, sequence } from '@agentesting/agentest'

scenario('user books a morning slot', {
  profile: 'Busy professional who prefers mornings.',
  goal: 'Book a haircut for next Tuesday morning.',

  knowledge: [
    { content: 'The salon is open Tuesday 08:00–18:00.' },
    { content: 'Standard haircut takes 45 minutes.' },
  ],

  mocks: {
    tools: {
      check_availability: (args) => ({
        available: true,
        slots: ['09:00', '09:45', '10:30'],
        date: args.date,
      }),
      create_booking: sequence([
        { success: true, bookingId: 'BK-001', confirmationSent: true },
      ]),
    },
  },

  assertions: {
    toolCalls: {
      matchMode: 'contains',
      expected: [
        { name: 'check_availability', argMatchMode: 'ignore' },
        { name: 'create_booking', argMatchMode: 'ignore' },
      ],
    },
  },
})

Run it

npx agentest run

You’ll see something like:

Agentest running 1 scenario(s)

  ✓ user books a morning slot → 2/2 conversations passed
    ✓ conv-1-a3f8b2c1: goal met, trajectory matched, helpfulness: 4.5, coherence: 5.0
    ✓ conv-2-d9e1f4a7: goal met, trajectory matched, helpfulness: 5.0, coherence: 5.0

1/1 scenarios passed

How it works under the hood

Agentest owns the tool-call loop:

LLM-powered simulated user sends a message to your agent (via HTTP).
If the agent responds with tool_calls, Agentest:
- resolves each call through your mocks,
- injects results back,
- and POSTs them again (no changes to your agent code).
Repeat until the agent emits a final text response.
The simulated user responds, and the loop continues until the goal is met or maxTurns is hit.
At the end, every turn is evaluated in parallel with LLM-as-judge metrics (helpfulness, coherence, relevance, faithfulness, goal completion, etc.).
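
The loop above can be sketched in a few dozen lines. This is a simplified, in-process illustration of the idea, not Agentest's code: it uses a function instead of HTTP, runAgentTurn and fakeAgent are invented names, and the message types are a trimmed-down version of the chat-completions format.

```typescript
interface ToolCall {
    id: string
    type: 'function'
    function: { name: string; arguments: string }
}

type Message =
    | { role: 'user' | 'system'; content: string }
    | { role: 'assistant'; content: string; tool_calls?: ToolCall[] }
    | { role: 'tool'; tool_call_id: string; content: string }

type AssistantReply = Extract<Message, { role: 'assistant' }>
type Agent = (messages: Message[]) => Promise<AssistantReply>
type Mocks = Record<string, (args: unknown) => unknown>

// Runs one agent turn: resolve tool calls through mocks until the agent answers in text.
async function runAgentTurn(agent: Agent, mocks: Mocks, conversation: Message[], maxSteps = 5) {
    const messages: Message[] = [...conversation]
    const calledTools: string[] = []

    for (let step = 0; step < maxSteps; step++) {
        const reply = await agent(messages)
        messages.push(reply)

        // No tool calls means this is the final text response.
        if (!reply.tool_calls || reply.tool_calls.length === 0) {
            return { finalText: reply.content, calledTools, messages }
        }

        for (const call of reply.tool_calls) {
            const mock = mocks[call.function.name]
            // Comparable in spirit to an "error on unmocked tools" policy.
            if (!mock) throw new Error(`Unmocked tool: ${call.function.name}`)
            calledTools.push(call.function.name)
            const result = mock(JSON.parse(call.function.arguments))
            messages.push({
                role: 'tool',
                tool_call_id: call.id,
                content: typeof result === 'string' ? result : JSON.stringify(result),
            })
        }
    }
    throw new Error('Agent did not produce a final text response')
}

// Fake agent: asks for the calculator once, then answers with the tool result.
const fakeAgent: Agent = async (messages) => {
    const last = messages[messages.length - 1]
    if (last && last.role === 'tool') {
        return { role: 'assistant', content: `The answer is ${last.content}.` }
    }
    return {
        role: 'assistant',
        content: '',
        tool_calls: [
            {
                id: 'call_0',
                type: 'function',
                function: { name: 'calculator', arguments: '{"a":99,"b":1,"operation":"add"}' },
            },
        ],
    }
}

const turnResult = runAgentTurn(fakeAgent, { calculator: () => '100' }, [
    { role: 'user', content: 'What is 99 plus 1?' },
])
```

The calledTools array collected here is the raw material for trajectory assertions: once the runner owns the loop, it knows every tool the agent touched and in what order.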

You can configure:

number of conversations per scenario,
maximum turns,
which metrics to run,
thresholds (e.g., “helpfulness must be ≥ 3.5 on average”),
mock behavior (sequences, errors, passthrough, etc.),
and reporters (console, JSON, GitHub Actions, etc.).

Works with any AI agent framework

Agentest is framework-agnostic: it doesn’t care how your agent is built, as long as it exposes either:

an OpenAI-compatible HTTP endpoint, or
a custom handler function (e.g., LangChain, Mastra, Vercel AI SDK, OpenLLMetry, AutoGen, CrewAI, or your own in-process agent).

Because Agentest only talks to your agent via requests or a handler, you can test any agent framework without changing its code—just point Agentest at your endpoint or adapter function and run npx agentest run. This makes it easy to plug Agentest into existing LangChain, Mastra, or OpenLLMetry codebases, or keep your own agent runtime while still getting robust, CI-ready tests.

Why this is different

Agentest is not:

Another agent framework (like LangChain, Mastra, OpenAI Agents SDK, CrewAI, etc.).
Another hosted observability dashboard.
Another “prompt-only” evaluation tool.

It is:

A test runner that lives in your repo, just like Vitest or Jest.
A mock-heavy layer that intercepts your agent’s tool calls and lets you assert on trajectories, not just final text.
A CI-native CLI that can fail the build if:
- the agent calls the wrong tools,
- it misses steps,
- or LLM-as-judge metrics fall below your thresholds.

You can mix it with existing tools:

Use LangSmith / Braintrust / DeepEval for observability and long-term evals.
Use Agentest for regression-style, code-first tests that run in every PR.

Scenarios, mocks, and assertions

Agentest’s core model is:

Scenario
- Who the simulated user is (profile)
- What they want (goal)
- What they know (knowledge)
- How tools behave (mocks).
Mocks
- Function mocks,
- sequence mocks,
- error simulation,
- and “unmocked tools” policy (error vs passthrough).
Assertions
- Trajectory-based assertions on tool-call sequences (strict, unordered, contains, within modes),
- argument-matching (ignore, partial, exact),
- quantitative and qualitative metrics (helpfulness, coherence, relevance, faithfulness, goal completion, failure labels, etc.).

LLM-as-judge metrics

At the end of the run, each turn is evaluated in parallel using LLM-as-judge prompts. You get metrics like:

Helpfulness (1–5)
Coherence (1–5)
Relevance (1–5)
Faithfulness (1–5)
Verbosity (1–5)
Goal completion (binary, per-conversation)
Agent behavior failures (e.g., repetition, refusal to clarify, hallucination).

You can set thresholds (e.g., helpfulness: 3.5, goal_completion: 0.8) so the run fails if the average slips below.
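
Goal completion is binary per conversation, so a threshold like 0.8 only makes sense as a rate. A short sketch under that reading, which is my interpretation rather than something taken from Agentest's source; the ConversationResult shape and function names are hypothetical.

```typescript
interface ConversationResult {
    id: string
    goalMet: boolean
}

// Fraction of conversations in which the simulated user's goal was met.
function goalCompletionRate(conversations: ConversationResult[]): number {
    if (conversations.length === 0) return 0
    const met = conversations.filter((conversation) => conversation.goalMet).length
    return met / conversations.length
}

function meetsGoalThreshold(conversations: ConversationResult[], threshold: number): boolean {
    return goalCompletionRate(conversations) >= threshold
}

// Five conversations, three of which reached the goal.
const conversationResults: ConversationResult[] = [
    { id: 'conv-1', goalMet: true },
    { id: 'conv-2', goalMet: true },
    { id: 'conv-3', goalMet: false },
    { id: 'conv-4', goalMet: true },
    { id: 'conv-5', goalMet: false },
]

const passesGoalThreshold = meetsGoalThreshold(conversationResults, 0.8) // false: rate is 0.6
```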

Comparison mode (for experimentation)

Agentest also supports comparison mode, where the same scenarios run against multiple models or agent variants side by side.

export default defineConfig({
  agent: {
    name: 'gpt-4o',
    endpoint: 'http://localhost:3000/api/chat',
    body: { model: 'gpt-4o', temperature: 0.7 },
  },
  compare: [
    { name: 'gpt-4o-mini', body: { model: 'gpt-4o-mini' } },
    { name: 'claude-sonnet', body: { model: 'claude-sonnet-4-20250514' } },
  ],
})

After each run, you get a per-metric breakdown:

user books a morning slot
  ✓ gpt-4o: 5/5 scenarios passed
  ✓ gpt-4o-mini: 4/5 scenarios passed
  ── comparison ──
  helpfulness: gpt-4o: 4.5 | gpt-4o-mini: 3.8
  coherence:   gpt-4o: 5.0 | gpt-4o-mini: 4.2

This is great for A/B testing model versions, prompt changes, or different agent implementations.

Local LLMs and CI

Agentest can run its own LLM-driven simulated users and evaluators on:

Anthropic,
OpenAI,
Ollama,
Any OpenAI-compatible endpoint (vLLM, LM Studio, etc.).

Meanwhile, your agent lives anywhere:

via HTTP endpoint (OpenAI-compatible, Azure, LiteLLM, etc.),
or via a custom handler function (LangChain, Anthropic SDK, Vercel AI SDK, Mastra, OpenLLMetry, etc.).

The CLI exits:

0 if all scenarios pass,
1 if any scenario fails or no scenarios are found,

so you can drop npx agentest run into your GitHub Actions, CircleCI, or any other CI pipeline.
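
The exit-code rule is simple enough to write down as a function. This is only a sketch of the rule as stated above, with a made-up ScenarioResult type and exitCodeFor name:

```typescript
interface ScenarioResult {
    scenario: string
    passed: boolean
}

// 0 only when at least one scenario ran and all of them passed.
function exitCodeFor(results: ScenarioResult[]): 0 | 1 {
    if (results.length === 0) return 1 // no scenarios found counts as a failure
    return results.every((result) => result.passed) ? 0 : 1
}

const allGreen = exitCodeFor([{ scenario: 'user books a morning slot', passed: true }]) // 0
const nothingFound = exitCodeFor([]) // 1
```

Treating "no scenarios found" as a failure is worth noticing: a typo in an include glob fails the build instead of silently passing with zero tests.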

Why this is needed now

AI agents are moving fast:

Teams are already building agents that:
- handle GitHub issues,
- book appointments,
- route support tickets,
- generate and test code.
They crave real regression tests, not just “prompt-playground” demos.
Existing tools are either:
- observability dashboards,
- agent frameworks,
- or one-off scripts.

Agentest fills the gap of “Vitest-style testing for AI agents”: deterministic, mock-heavy, and built into your CI/CD, just like the rest of your tests.

Get started

GitHub: https://github.com/r-prem/agentest
Install: npm install @agentesting/agentest --save-dev
Run: npx agentest run
Docs: https://r-prem.github.io/agentest/

Agentest is MIT-licensed and Node-/TS-first, with frameworks like LangChain, Mastra, OpenLLMetry, and many others already compatible out of the box.

If you find it useful, I’d love to see what you’re testing with it: share a scenario, a repo, or a GitHub Actions setup, and I’ll happily add it to the docs!