I tested my own AI agent and it worked 25% of the time. So I open-sourced the tool that caught it.

#ai #opensource #agenteval #llmevaluation

If you build AI agents, you have lived this: it works when you test it, then breaks in production. You demo it, it nails the task, everyone is happy. A week later it quietly fails on the same input.

"It ran successfully once" is not the same as "it works." And most of the tools we use to evaluate agents quietly assume the first thing.

So I built AgentEval, an open-source tool that measures whether an agent is actually reliable, not just whether it succeeded once. This post is the why and the how, including the moment I pointed it at my own agent and watched it score 25%.

The problem: single-run evals lie

Most eval setups score a single answer to a single prompt. That is fine for a pure text model. It falls apart for an agent, because an agent that searches, plans, calls tools, and drives a browser is nondeterministic. Run the same task five times and you can get five different paths and outcomes.

A single manual check can easily land on a run that happened to work, and then you declare victory. For an agent that succeeds 60% of the time, that happens more often than not. That is exactly the failure mode that ships broken agents.
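To put rough numbers on that, here is a minimal sketch. It assumes every run is an independent trial with the same success probability, which is a simplification: real runs share portals, rate limits, and caches. The function name is mine, not part of AgentEval.

```typescript
// Assumption: each run is an independent trial with a fixed success rate.
// Real agent runs are rarely fully independent, so treat this as a rough estimate.

/** Probability that at least one of `runs` repeated runs exposes a failure. */
function chanceOfSeeingAFailure(successRate: number, runs: number): number {
  return 1 - Math.pow(successRate, runs);
}

for (const rate of [0.25, 0.6, 0.9]) {
  // A single spot check looks green with probability equal to the success rate.
  const looksGreen = (rate * 100).toFixed(0);
  const caught = (chanceOfSeeingAFailure(rate, 5) * 100).toFixed(1);
  console.log(
    `success rate ${looksGreen}%: one check looks green ${looksGreen}% of the time, ` +
      `five runs expose a failure ${caught}% of the time`,
  );
}
```

Under that assumption, even an agent that succeeds 90% of the time shows at least one failure in roughly 41% of five-run batches, while a single check reports green nine times out of ten.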

What you actually want to know is:

Determinism: given the same input, how often does it succeed? (the flakiness a single check never sees)
Grounding: when it makes a factual or regulatory claim, is that claim backed by a citation that actually resolves?
A record you can keep: a report you can review, diff over time, and attach to your own QA or compliance trail.

What AgentEval does

AgentEval is a TypeScript library + CLI. You wrap any agent, define what "good" looks like, and it runs each task N times to produce a scorecard, a determinism score, and a report.

The only integration point is an adapter: given an input, return a trace.

import { defineAdapter } from 'agenteval-core';

const adapter = defineAdapter({
  async run(input) {
    const result = await myAgent.invoke(input.user_message);
    return {
      input,
      finalText: result.text,
      toolCalls: result.toolCalls ?? [],
      citations: result.citations, // optional, enables grounding checks
    };
  },
});

That AgentTrace shape is the whole contract. It does not assume a particular framework, a tool-calling loop, or a domain, so it works with LangGraph, a raw Anthropic/OpenAI loop, an HTTP endpoint, whatever you have.
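For orientation, the trace returned above could be typed roughly as follows. This is a sketch inferred from the fields in the adapter snippet, not the library's actual type definitions: `ToolCall`, `Citation`, `ScenarioInput`, `AgentResult`, and their fields are assumptions for illustration, and `stubAgent` is a stand-in so the example runs without a real agent.

```typescript
// Hypothetical types, inferred from the adapter snippet above.
// The real agenteval-core definitions may differ.
interface ToolCall {
  name: string;
  args?: Record<string, unknown>;
}

interface Citation {
  sourceId: string;
  quote?: string;
}

interface ScenarioInput {
  user_message: string;
}

interface AgentTrace {
  input: ScenarioInput;
  finalText: string;
  toolCalls: ToolCall[];
  citations?: Citation[]; // optional, enables grounding checks
}

// What the stand-in agent returns; a real agent will have its own shape.
interface AgentResult {
  text: string;
  toolCalls?: ToolCall[];
  citations?: Citation[];
}

// Stand-in for a real agent so the sketch runs on its own.
const stubAgent = {
  async invoke(message: string): Promise<AgentResult> {
    return {
      text: 'Refunds are accepted within 30 days.',
      toolCalls: [{ name: 'search_kb', args: { query: message } }],
    };
  },
};

// Same mapping as the adapter's run(), written as a plain function.
async function runOnce(input: ScenarioInput): Promise<AgentTrace> {
  const result = await stubAgent.invoke(input.user_message);
  return {
    input,
    finalText: result.text,
    toolCalls: result.toolCalls ?? [],
    citations: result.citations,
  };
}
```

The point of keeping the mapping this thin is that swapping `stubAgent` for a LangGraph graph or an HTTP call changes only the first line of the function.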

Then you describe scenarios, in YAML or in code:

# scenarios/refund.yaml
id: refund-window
input:
  user_message: "Can I get a refund?"
asserts:
  - kind: tool_called
    name: search_kb
  - kind: text_contains_one_of
    options: ["30 days", "30-day"]
  - kind: every_claim_has_citation
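To make the assertion kinds concrete, here is a rough sketch of how the first two could be evaluated against a trace. The kind names mirror the YAML above, but the logic (including case-insensitive matching) is my assumption, not agenteval-core's implementation, and every_claim_has_citation is left out because it depends on the grounding layer.

```typescript
// Hypothetical evaluator for two of the assertion kinds in the scenario file.
type Assertion =
  | { kind: 'tool_called'; name: string }
  | { kind: 'text_contains_one_of'; options: string[] };

interface MiniTrace {
  finalText: string;
  toolCalls: { name: string }[];
}

function checkAssertion(trace: MiniTrace, assertion: Assertion): boolean {
  switch (assertion.kind) {
    case 'tool_called':
      // Passes if any recorded tool call has the expected name.
      return trace.toolCalls.some((call) => call.name === assertion.name);
    case 'text_contains_one_of': {
      // Assumption: matching is case-insensitive substring search.
      const text = trace.finalText.toLowerCase();
      return assertion.options.some((option) => text.includes(option.toLowerCase()));
    }
  }
}

// A run passes only if every assertion holds.
function runPasses(trace: MiniTrace, assertions: Assertion[]): boolean {
  return assertions.every((assertion) => checkAssertion(trace, assertion));
}

const refundAsserts: Assertion[] = [
  { kind: 'tool_called', name: 'search_kb' },
  { kind: 'text_contains_one_of', options: ['30 days', '30-day'] },
];

const goodRun: MiniTrace = {
  finalText: 'You can request a refund within 30 days of purchase.',
  toolCalls: [{ name: 'search_kb' }],
};

const badRun: MiniTrace = {
  finalText: 'Refunds are handled case by case.',
  toolCalls: [],
};

console.log(runPasses(goodRun, refundAsserts), runPasses(badRun, refundAsserts));
```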

And run them N times:

import { runSuite, loadScenarios, renderConsole } from 'agenteval-core';

const scenarios = loadScenarios('./scenarios');
const report = await runSuite(adapter, scenarios, { runs: 5 });

console.log(renderConsole(report));

[PASS] refund-window       (determinism 100%, 5/5 runs)
[FAIL] coverage-question   (determinism 60%, 3/5 runs)   <- flaky: same input, different answer
[FAIL] Summary: 1/2 scenarios passed | overall determinism 80.0%

That coverage-question line is the point. It passed three times. It also failed twice, on the identical input. A one-shot check had a 60% chance of calling it green.
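The numbers in that scorecard are consistent with a simple reading: a scenario's determinism is the share of its runs that passed, a scenario passes only when every run passed, and the overall figure is the mean across scenarios. The sketch below assumes exactly that; the sample output does not distinguish a mean of per-scenario scores from a pooled 8/10, since both give 80%, so treat the aggregation as a guess rather than AgentEval's definition.

```typescript
// Assumption: determinism = passed runs / total runs, and a scenario
// passes only at 100%. Names and shapes here are illustrative.
interface ScenarioRuns {
  id: string;
  runPassed: boolean[];
}

function determinism(runPassed: boolean[]): number {
  if (runPassed.length === 0) return 0;
  return runPassed.filter(Boolean).length / runPassed.length;
}

function summarize(scenarios: ScenarioRuns[]) {
  const rows = scenarios.map((scenario) => {
    const score = determinism(scenario.runPassed);
    return { id: scenario.id, score, passed: score === 1 };
  });
  // Assumption: overall determinism is the unweighted mean of scenario scores.
  const overall =
    rows.length === 0 ? 0 : rows.reduce((sum, row) => sum + row.score, 0) / rows.length;
  return { rows, overall, scenariosPassed: rows.filter((row) => row.passed).length };
}

const summary = summarize([
  { id: 'refund-window', runPassed: [true, true, true, true, true] },
  { id: 'coverage-question', runPassed: [true, false, true, true, false] },
]);

for (const row of summary.rows) {
  console.log(`${row.passed ? 'PASS' : 'FAIL'} ${row.id} ${(row.score * 100).toFixed(0)}%`);
}
console.log(`overall ${(summary.overall * 100).toFixed(1)}%`);
```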

The moment it earned its keep

I have an autonomous web agent that does real errands: it searches for the right portal, plans a path, drives a browser, and extracts a result. I had four recorded runs of it doing the same task: retrieve a property-tax payment receipt from a municipal portal.

I ingested those four runs as traces and let AgentEval score them:

[FAIL] property-tax-receipt  (determinism 25%, 1/4 runs)
[FAIL] Summary: 0/1 scenarios passed | overall determinism 25.0%

25%. It succeeded once. Twice the portal stopped responding mid-run; once it landed on the wrong page and returned homepage content instead of the receipt. The three failures were not even the same failure.

If I had spot-checked it the day it worked, I would have shipped a one-in-four agent and called it done. That number, sitting there in red, is the entire argument for measuring determinism.

(The full case study, with the redacted traces and the script, is in the repo under case-studies/.)

Grounding: the part most evals skip

Reliability is not only "did it finish." For anything high-stakes, it is also "did it tell the truth, and can I check." AgentEval ships a grounding layer that flags:

Uncited claims: a sentence that asserts a fact or a rule with no citation attached.
Unresolved citations: references that do not point at a real source.
Quote mismatches: a quote that does not actually appear in its cited source.

import { checkGrounding, REGULATED_PRESET } from 'agenteval-core';

const result = checkGrounding(trace, { config: REGULATED_PRESET, knownSources });
// -> { uncitedClaims, unresolvedCitations, quoteMismatches }
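As a rough illustration of the last two checks, here is a sketch that resolves citations against a map of known sources and verifies quotes. Every name and shape in it is an assumption, since the real checkGrounding input and config are not shown here. It also swaps in exact substring matching after whitespace and case normalization where AgentEval uses similarity, and it skips uncited-claim detection entirely.

```typescript
// Hypothetical shapes for illustration only.
interface CitedClaim {
  sourceId: string;
  quote?: string;
}

interface GroundingIssues {
  unresolvedCitations: CitedClaim[];
  quoteMismatches: CitedClaim[];
}

// Collapse whitespace and case so line wrapping does not cause false mismatches.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, ' ').trim();
}

function checkCitations(
  citations: CitedClaim[],
  sources: Map<string, string>, // source id -> full source text
): GroundingIssues {
  const issues: GroundingIssues = { unresolvedCitations: [], quoteMismatches: [] };
  for (const citation of citations) {
    const sourceText = sources.get(citation.sourceId);
    if (sourceText === undefined) {
      // The citation points at a source we do not know about.
      issues.unresolvedCitations.push(citation);
    } else if (citation.quote && !normalize(sourceText).includes(normalize(citation.quote))) {
      // The source exists, but the quoted text is not in it.
      issues.quoteMismatches.push(citation);
    }
  }
  return issues;
}

const sampleSources = new Map([
  ['policy-1', 'Refunds are accepted within\n30 days of purchase.'],
]);

const sampleIssues = checkCitations(
  [
    { sourceId: 'policy-1', quote: 'within 30 days of purchase' },
    { sourceId: 'policy-1', quote: 'within 90 days' },
    { sourceId: 'policy-9', quote: 'anything' },
  ],
  sampleSources,
);

console.log(sampleIssues.unresolvedCitations.length, sampleIssues.quoteMismatches.length);
```

Exact matching is stricter than a similarity threshold, so a sketch like this would flag paraphrased quotes that a tuned similarity check might accept.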

It ships a generic preset for any assistant and a regulated preset (CFR/ISO/IEC/MDR/IVDR/USC) for compliance-flavored agents, and the patterns are configurable for your own domain.

It plugs into what you already have

Two things I cared about, so that adoption is not a chore:

Ingest existing traces. If you already collect OpenTelemetry or LangSmith traces, you can evaluate them without changing your agent: otelToTrace(...), langsmithToTrace(...).
An MCP server. Since AgentEval evaluates agents, it ships an MCP server so a coding agent (Claude, Codex, Cursor) can call it directly: evaluate_agent, check_grounding, get_report.

And for CI, there is a baseline workflow: agenteval baseline snapshots a known-good state, and agenteval check fails the build if reliability has regressed. You cannot fix what you do not measure, and you cannot keep it fixed without a gate.
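The idea behind such a gate can be sketched in a few lines. This is not how agenteval check is implemented; the score format, the tolerance parameter, and the function name are all assumptions for illustration.

```typescript
// Hypothetical regression gate: compares per-scenario determinism against
// a stored baseline. Data shape and tolerance are assumptions.
type Scores = Record<string, number>; // scenario id -> determinism in [0, 1]

function findRegressions(baseline: Scores, current: Scores, tolerance = 0): string[] {
  const regressions: string[] = [];
  for (const [id, baselineScore] of Object.entries(baseline)) {
    const currentScore = current[id];
    if (currentScore === undefined) {
      // A scenario that disappeared is treated as a regression, not a pass.
      regressions.push(`${id}: missing from current run`);
    } else if (currentScore < baselineScore - tolerance) {
      regressions.push(`${id}: ${baselineScore} -> ${currentScore}`);
    }
  }
  return regressions;
}

const gateResult = findRegressions(
  { 'refund-window': 1, 'coverage-question': 0.6 },
  { 'refund-window': 1, 'coverage-question': 0.4 },
);

// In CI, a non-empty list would map to a non-zero exit code.
console.log(gateResult.length === 0 ? 'no regressions' : gateResult.join('\n'));
```

A small tolerance is worth considering for flaky scenarios, since a 3/5 scenario can score 2/5 on the next batch without anything having changed.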

Where it came from, and what it is not

AgentEval grew out of the evaluation layer of Deminn, a multi-agent system I built for regulated quality and compliance (CAPA, FDA/ISO) workflows. The reliability and grounding ideas were proven there on a real, messy domain, then generalized so they work on any agent.

Being honest about the boundaries, because I would want to know:

It is v0.1. Useful, tested (160+ tests, CI green), but young.
Grounding is heuristic (regex + similarity), so expect to tune the patterns for your domain. It is a strong signal, not an oracle.
The HTML output is an audit-ready report, not a certified compliance artifact. It is something a reviewer can read and keep, not a stamp.

Try it

npm install agenteval-core
npx agenteval init   # scaffolds a config + an example scenario
npx agenteval run    # prints the scorecard

GitHub: https://github.com/lokesh75-kank/agenteval
MIT licensed, TypeScript, Node 20+.

If you build agents, point it at one of yours and run it a few times. I would bet you find something. And if you do, or if you have a sharper way to measure agent reliability, open an issue or a PR. I would love the feedback.

What is your agent's real success rate? Most of us have never actually measured it.