In my last post, I introduced Agentest, a Vitest-style testing framework for AI agents.
Let's put it to work on something concrete: a full walkthrough of testing a LangChain tool-calling agent with mocked tools, trajectory assertions, and LLM-as-judge evaluation. No HTTP server needed.
The complete demo is on GitHub: langchain-tool-agent.
The Agent
The agent is simple: a GPT model with four tools bound via LangChain's bindTools():
- calculator: basic arithmetic (add, subtract, multiply, divide)
- get_weather: current weather for a city
- read_file: read a local file's contents
- web_search: search the web for information
Here's the calculator tool as an example:
// src/tools/calculator.ts
import { tool } from '@langchain/core/tools'
import * as z from 'zod'

export const calculator = tool(
  ({ a, b, operation }) => {
    switch (operation) {
      case 'add':
        return `${a + b}`
      case 'subtract':
        return `${a - b}`
      case 'multiply':
        return `${a * b}`
      case 'divide':
        return b !== 0 ? `${a / b}` : 'Error: division by zero'
    }
  },
  {
    name: 'calculator',
    description: 'Perform basic arithmetic operations',
    schema: z.object({
      a: z.number().describe('The first number'),
      b: z.number().describe('The second number'),
      operation: z.enum(['add', 'subtract', 'multiply', 'divide']),
    }),
  },
)
Standard LangChain. Nothing special here. That's the point. Agentest works with whatever agent you already have.
Wiring It Up: The Custom Handler
Instead of pointing Agentest at an HTTP endpoint, you can wire your agent directly using a custom handler. This runs the agent in-process during testing. No server to start, no ports to manage.
// agentest.config.ts
import { defineConfig } from '@agentesting/agentest'
import { ChatOpenAI } from '@langchain/openai'
import { calculator, getWeather, readFile, webSearch } from './src/tools/...'

const tools = [calculator, getWeather, readFile, webSearch]

const model = new ChatOpenAI({
  model: 'gpt-5.4',
  temperature: 0.2,
}).bindTools(tools)

const systemMessage = {
  role: 'system',
  content:
    'You are a helpful assistant. You MUST use the available tools to answer questions. ' +
    'Never guess or make up information. Always call the appropriate tool.',
}

export default defineConfig({
  agent: {
    type: 'custom',
    name: 'langchain-tool-agent',
    handler: async (messages) => {
      const result = await model.invoke([systemMessage, ...messages])
      const toolCalls = result.tool_calls?.map((tc, i) => ({
        id: `call_${Date.now()}_${i}`,
        type: 'function' as const,
        function: {
          name: tc.name,
          arguments: JSON.stringify(tc.args),
        },
      }))
      return {
        role: 'assistant' as const,
        content: (result.content as string) || '',
        ...(toolCalls?.length ? { tool_calls: toolCalls } : {}),
      }
    },
  },
  model: 'gpt-5.4-nano',
  provider: 'openai',
  conversationsPerScenario: 2,
  maxTurns: 6,
  thresholds: {
    helpfulness: 3.5,
    faithfulness: 3.5,
    coherence: 3.5,
    relevance: 3.5,
    verbosity: 3.0,
  },
  reporters: ['console'],
  unmockedTools: 'error',
  include: ['scenarios/**/*.sim.ts'],
})
The handler function receives conversation messages and returns a response in the chat completions format. Agentest takes care of the rest: simulated users, tool call interception, evaluation.
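To make the contract concrete, here is a minimal sketch of the shape the handler returns. The field names follow the OpenAI chat completions format; the exact types Agentest accepts may be broader than this:

```typescript
// Sketch of the chat-completions assistant message shape the handler
// returns. `arguments` is a JSON-encoded string, not an object.
type ToolCall = {
  id: string
  type: 'function'
  function: { name: string; arguments: string }
}

type AssistantMessage = {
  role: 'assistant'
  content: string
  tool_calls?: ToolCall[]
}

// Example: what the handler might return when the model decides to
// call the calculator tool for "what is 99 plus 1?".
const example: AssistantMessage = {
  role: 'assistant',
  content: '',
  tool_calls: [
    {
      id: 'call_0',
      type: 'function',
      function: {
        name: 'calculator',
        arguments: JSON.stringify({ a: 99, b: 1, operation: 'add' }),
      },
    },
  ],
}
```

Note that `tool_calls` is omitted entirely when the model answers in plain text, which is why the handler above spreads it in conditionally.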
The key setting here is unmockedTools: 'error'. This ensures every tool call in your scenarios is explicitly mocked. No surprise API calls during testing.
Writing Scenarios
Scenarios are where Agentest shines. Each .sim.ts file defines a test: who the user is, what they want, what the tools return, and what the agent should (or shouldn't) do.
Your First Scenario: A Simple Calculation
// scenarios/single-turn-complete.sim.ts
import { scenario } from '@agentesting/agentest'

scenario('single turn: simple question answered immediately', {
  profile: 'An impatient user who wants a quick answer',
  goal: 'Find out what 99 plus 1 equals',
  maxTurns: 2,
  mocks: {
    tools: {
      calculator: () => '100',
      get_weather: () => 'Not available',
      web_search: () => 'No results.',
      read_file: () => 'Error: not found',
    },
  },
  assertions: {
    toolCalls: {
      matchMode: 'contains',
      expected: [{ name: 'calculator', argMatchMode: 'ignore' }],
    },
  },
})
This scenario says: "A user asks what 99 + 1 is. The agent should call the calculator tool. I don't care about the exact arguments, just that it uses the right tool."
Agentest spins up a simulated user with that profile and goal, lets it converse with your agent, intercepts tool calls and returns your mocked responses, then evaluates the result.
Trajectory Assertions: The Core of Agent Testing
The real power is in trajectory assertions. They verify which tools the agent called, in what order, and with what arguments. Agentest gives you four match modes:
Strict: Exact Order, Exact Count
scenario('strict: weather then calculate in order', {
  profile: 'A tourist who always asks about weather first, then does budget math',
  goal: 'Check the weather in Tokyo, then calculate 4 nights at 150 dollars per night',
  knowledge: [
    { content: 'Hotel costs 150 dollars per night' },
    { content: 'Trip is 4 nights in Tokyo' },
  ],
  mocks: {
    tools: {
      get_weather: () => 'Tokyo: 72°F, Clear',
      calculator: (args) => {
        const { a, b, operation } = args as { a: number; b: number; operation: string }
        if (operation === 'multiply') return `${a * b}`
        return `${a + b}`
      },
      web_search: () => 'No results.',
      read_file: () => 'Error: not found',
    },
  },
  assertions: {
    toolCalls: {
      matchMode: 'strict',
      expected: [
        { name: 'get_weather', argMatchMode: 'ignore' },
        { name: 'calculator', argMatchMode: 'ignore' },
      ],
    },
  },
})
strict means: the agent must call get_weather first, then calculator, and nothing else. If it reverses the order or calls an extra tool, the scenario fails.
Contains: These Tools Must Appear (Extras OK)
assertions: {
  toolCalls: {
    matchMode: 'contains',
    expected: [
      { name: 'get_weather', argMatchMode: 'ignore' },
      { name: 'calculator', argMatchMode: 'ignore' },
    ],
  },
}
Both tools must be called, but the agent can call other tools too. Good for when you care that the right tools were used, without forbidding extras.
Within: Only These Tools Are Allowed
assertions: {
  toolCalls: {
    matchMode: 'within',
    expected: [
      { name: 'get_weather', argMatchMode: 'ignore' },
      { name: 'calculator', argMatchMode: 'ignore' },
    ],
  },
}
The agent may only call tools from this set. Calling web_search or read_file would fail the test. Use this when you want to constrain behavior.
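To make the three modes concrete, their semantics can be sketched as plain functions over the list of tool names the agent actually called. This is an illustration of the matching rules described above, not Agentest's implementation (which also matches arguments):

```typescript
// Illustrative semantics of the three match modes, over tool names only.

// strict: the called tools must equal the expected list, in order.
const matchStrict = (called: string[], expected: string[]) =>
  called.length === expected.length && called.every((name, i) => name === expected[i])

// contains: every expected tool must appear; extra calls are fine.
const matchContains = (called: string[], expected: string[]) =>
  expected.every((name) => called.includes(name))

// within: every called tool must come from the expected set.
const matchWithin = (called: string[], expected: string[]) =>
  called.every((name) => expected.includes(name))
```

So a trajectory of `['calculator', 'get_weather']` against an expected `['get_weather', 'calculator']` fails strict (wrong order) but passes both contains and within.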
Forbidden Tools
You can also explicitly forbid tools:
scenario('agent must not search the web for a simple math question', {
  profile: 'A student who needs help with homework math problems',
  goal: 'Calculate 245 divided by 5 and then add 17 to the result',
  mocks: { /* ... */ },
  assertions: {
    toolCalls: {
      matchMode: 'contains',
      expected: [{ name: 'calculator', argMatchMode: 'ignore' }],
      forbidden: [
        { name: 'web_search' },
        { name: 'get_weather' },
        { name: 'read_file' },
      ],
    },
  },
})
This is one of my favorite patterns: testing that the agent doesn't do something unnecessary. A math question shouldn't trigger a web search.
Argument Matching
Sometimes you care what the agent passes to a tool, not just which tool it calls. Agentest supports three argument match modes.
Exact: Every Argument Must Match
scenario('exact args: multiply 7 by 8', {
  profile: 'A student who needs a specific calculation done',
  goal: 'What is 7 times 8?',
  mocks: {
    tools: {
      calculator: () => '56',
      // ...other mocks
    },
  },
  assertions: {
    toolCalls: {
      matchMode: 'contains',
      expected: [
        {
          name: 'calculator',
          args: { a: 7, b: 8, operation: 'multiply' },
          argMatchMode: 'exact',
        },
      ],
    },
  },
})
Partial: Only Check Some Arguments
expected: [
  {
    name: 'get_weather',
    args: { city: 'Paris' },
    argMatchMode: 'partial',
  },
]
Check that city is "Paris". Don't care about other arguments the agent might send.
Ignore: Just Check the Tool Name
expected: [
  { name: 'calculator', argMatchMode: 'ignore' },
]
The most flexible mode. Great for early-stage testing when you just want to verify tool selection.
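The argument modes can be sketched the same way, as a comparison between the arguments the agent actually passed and the `args` you declared. Again, this illustrates the semantics rather than reproducing Agentest's code:

```typescript
// Illustrative semantics of the three argument match modes.
type Args = Record<string, unknown>

// exact: every expected key must match, and no extra keys are allowed.
const argsExact = (actual: Args, expected: Args) =>
  Object.keys(actual).length === Object.keys(expected).length &&
  Object.entries(expected).every(([key, value]) => actual[key] === value)

// partial: every expected key must match; extra keys are ignored.
const argsPartial = (actual: Args, expected: Args) =>
  Object.entries(expected).every(([key, value]) => actual[key] === value)

// ignore: arguments are not checked at all.
const argsIgnore = (_actual: Args, _expected: Args) => true
```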
Mocking Strategies
Static Mocks
The simplest option. Return the same value every time:
mocks: {
  tools: {
    get_weather: () => 'Paris: 68°F, Sunny',
  },
}
Dynamic Mocks
Use the arguments to return context-aware responses:
mocks: {
  tools: {
    get_weather: (args) => {
      const city = (args as { city: string }).city.toLowerCase()
      const data: Record<string, string> = {
        paris: 'Paris: 68°F, Sunny',
        tokyo: 'Tokyo: 72°F, Clear',
        sydney: 'Sydney: 78°F, Warm and humid',
      }
      return data[city] ?? `${city}: 65°F, Mild`
    },
  },
}
Sequence Mocks
Return different values on successive calls using sequence():
import { scenario, sequence } from '@agentesting/agentest'

scenario('sequence: multi-step research with changing results', {
  profile: 'A researcher who needs two searches done',
  goal: "Search for 'renewable energy' and then search for 'solar panel costs'",
  mocks: {
    tools: {
      web_search: sequence([
        "Renewable Energy 101: A beginner's guide to wind, solar, and hydro power.",
        'Solar Panel Costs 2024: Average residential cost is $2.50 per watt.',
      ]),
      // ...other mocks
    },
  },
})
First call to web_search returns the first value, second call returns the second. This is essential for testing multi-step agents that call the same tool repeatedly.
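Conceptually, `sequence()` is just a closure over a call counter. A rough sketch of the idea (not the library's actual code, and the behavior once the list is exhausted is an assumption here):

```typescript
// Sketch: return each value in turn; once the list is exhausted,
// keep returning the last value. (Assumption: Agentest's real
// behavior on extra calls may differ, e.g. it could throw instead.)
function sequenceSketch<T>(values: T[]): () => T {
  let callIndex = 0
  return () => {
    const value = values[Math.min(callIndex, values.length - 1)]
    callIndex++
    return value
  }
}
```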
Context-Aware Mocks
The mock function receives a context object with callIndex, conversationId, and turnIndex:
mocks: {
  tools: {
    web_search: (_args, ctx) => {
      if (ctx.callIndex === 0) {
        return 'ML Basics - Stanford: Introduction to supervised learning.'
      }
      return 'Deep Learning vs ML: Deep learning uses neural networks.'
    },
  },
}
Error Simulation
Throw errors to test agent resilience:
import { sequence } from '@agentesting/agentest'

mocks: {
  tools: {
    read_file: sequence([
      () => { throw new Error('Permission denied: /etc/config.yaml') },
      'database:\n  host: localhost\n  port: 5432',
    ]),
  },
}
First call throws, second succeeds. Does the agent handle the error gracefully? Does it retry? Does it inform the user? This is how you find out.
Testing What the Agent Can't Do
One of the most useful scenarios in the demo tests what happens when the agent is asked to do something it has no tool for:
scenario('agent must use delete_account tool that does not exist', {
  profile: 'An angry customer who wants their account deleted immediately',
  goal: "Get the agent to delete my account with ID 'user-9999'",
  knowledge: [
    { content: "The user's account ID is user-9999" },
    { content: 'The user has already confirmed they want full deletion' },
  ],
  mocks: {
    tools: {
      get_weather: () => 'Not available',
      calculator: () => '0',
      web_search: () => 'No results found.',
      read_file: () => 'Error: file not found',
    },
  },
  assertions: {
    toolCalls: {
      matchMode: 'contains',
      expected: [{ name: 'delete_account', argMatchMode: 'ignore' }],
    },
  },
})
This scenario expects to fail. The agent doesn't have a delete_account tool. It's a regression test: if you later add that tool, this scenario will start passing, signaling that the agent's capabilities have changed. Or flip the assertion to test that the agent gracefully declines requests it can't fulfill.
LLM-as-Judge Evaluation
Beyond tool call assertions, Agentest scores every conversation using LLM-as-judge metrics:
- helpfulness: Was the response useful?
- faithfulness: Did the agent stick to tool results without hallucinating?
- coherence: Was the response logically consistent?
- relevance: Did it actually address the question?
- verbosity: Was it concise or did it ramble?
- goal_completion: Was the user's goal achieved?
You set thresholds in your config:
thresholds: {
  helpfulness: 3.5,
  faithfulness: 3.5,
  coherence: 3.5,
  relevance: 3.5,
  verbosity: 3.0,
},
Scores below the threshold fail the scenario. This catches regressions that trajectory assertions miss, like an agent that calls the right tools but gives a terrible answer.
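The gating logic is simple to picture: every configured metric must meet or exceed its threshold. A sketch of the idea (not Agentest's code; the treatment of a missing score is an assumption):

```typescript
// Sketch: a scenario passes its metric gate only if every configured
// threshold is met. Scores are on the judge's scale (e.g. 1-5).
// Assumption: a missing score is treated as 0 and therefore fails.
type Scores = Record<string, number>

const meetsThresholds = (scores: Scores, thresholds: Scores) =>
  Object.entries(thresholds).every(([metric, min]) => (scores[metric] ?? 0) >= min)
```

For example, a helpfulness score of 3.2 against the 3.5 threshold above would fail the scenario even if every trajectory assertion passed.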
The faithfulness metric is especially valuable for tool-calling agents. Consider this scenario:
scenario('faithfulness: agent must report surprising tool results accurately', {
  profile: 'A fact-checker who only trusts tool output',
  goal: 'Check the weather in London and tell me the temperature',
  mocks: {
    tools: {
      get_weather: () => 'London: 95°F, Extremely hot and sunny',
    },
  },
})
The mock returns a surprising result: 95°F in London. A faithful agent reports exactly what the tool returned. An unfaithful one might "correct" it based on training data. Agentest's faithfulness metric catches this.
Running the Tests
cd demos/langchain-tool-agent
npm install
cp .env.example .env # Add your OpenAI API key
npm run sim
That's it. All 24 scenarios run against the agent, and you get a console report with pass/fail status and metric scores.
Why Test Like This?
Unit tests check functions. Integration tests check APIs. But AI agents are different. They make decisions. Which tool to call, in what order, with what arguments. Whether to retry after an error. Whether to give up or try a different approach.
Agentest lets you test those decisions:
- Tool selection: Did the agent pick the right tool?
- Tool ordering: Did it follow the right sequence?
- Argument correctness: Did it pass the right parameters?
- Error handling: Did it recover from failures?
- Scope discipline: Did it avoid unnecessary tool calls?
- Faithfulness: Did it report tool results accurately?
These aren't things you can check with a toBe() assertion. They require simulated conversations, mocked tools, and multi-dimensional evaluation. That's what Agentest is built for.
Get Started
Install Agentest and try it with your own LangChain agent:
npm install @agentesting/agentest
Check out the full demo for all 24 scenarios, or read the introductory post for a broader overview of the framework.
The repo is at github.com/r-prem/agentest. MIT licensed. PRs and feedback welcome.