<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lokesh Kank</title>
    <description>The latest articles on DEV Community by Lokesh Kank (@lokesh75kank).</description>
    <link>https://dev.to/lokesh75kank</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3994964%2F77ecb3c5-eed9-432b-ac16-e0860a48a4bd.jpg</url>
      <title>DEV Community: Lokesh Kank</title>
      <link>https://dev.to/lokesh75kank</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lokesh75kank"/>
    <language>en</language>
    <item>
      <title>I tested my own AI agent and it worked 25% of the time. So I open-sourced the tool that caught it.</title>
      <dc:creator>Lokesh Kank</dc:creator>
      <pubDate>Tue, 23 Jun 2026 02:57:56 +0000</pubDate>
      <link>https://dev.to/lokesh75kank/i-tested-my-own-ai-agent-and-it-worked-25-of-the-time-so-i-open-sourced-the-tool-that-caught-it-4c02</link>
      <guid>https://dev.to/lokesh75kank/i-tested-my-own-ai-agent-and-it-worked-25-of-the-time-so-i-open-sourced-the-tool-that-caught-it-4c02</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8rixq1mos5ua1mqt44jr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8rixq1mos5ua1mqt44jr.png" alt=" " width="800" height="141"&gt;&lt;/a&gt;&lt;br&gt;
If you build AI agents, you have lived this: it works when you test it, then breaks in production. You demo it, it nails the task, everyone is happy. A week later it quietly fails on the same input.&lt;/p&gt;

&lt;p&gt;"It ran successfully once" is not the same as "it works." And most of the tools we use to evaluate agents quietly treat the first as if it were the second.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;AgentEval&lt;/strong&gt;, an open-source tool that measures whether an agent is actually &lt;em&gt;reliable&lt;/em&gt;, not just whether it succeeded once. This post is the why and the how, including the moment I pointed it at my own agent and watched it score 25%.&lt;/p&gt;
&lt;h2&gt;The problem: single-run evals lie&lt;/h2&gt;

&lt;p&gt;Most eval setups score a single answer to a single prompt. That is fine for a pure text model. It falls apart for an &lt;strong&gt;agent&lt;/strong&gt;, because an agent that searches, plans, calls tools, and drives a browser is &lt;strong&gt;nondeterministic&lt;/strong&gt;. Run the same task five times and you can get five different paths and outcomes.&lt;/p&gt;

&lt;p&gt;A single manual check cannot tell a reliable agent from a flaky one: if the agent succeeds on only some of its runs, one check can easily land on a run that happened to work, and you declare victory. That is exactly the failure mode that ships broken agents.&lt;/p&gt;
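
&lt;p&gt;To put a rough number on that, here is a minimal sketch. It is not part of AgentEval: the function name is made up for illustration, and it assumes every run succeeds independently with the same probability, which real agents only approximate.&lt;/p&gt;

```typescript
// Hypothetical helper, for illustration only (not an AgentEval API).
// Assumes every run succeeds independently with the same probability.
function chanceAllChecksPass(successRate: number, checks: number): number {
  if (!(successRate >= 0) || successRate > 1) {
    throw new RangeError('successRate must be between 0 and 1');
  }
  if (!Number.isInteger(checks) || !(checks >= 1)) {
    throw new RangeError('checks must be a positive integer');
  }
  // Every one of the checks has to hit a passing run.
  return Math.pow(successRate, checks);
}

// An agent that works 1 time in 4 still passes a single spot check 25% of the time.
console.log(chanceAllChecksPass(0.25, 1)); // 0.25

// An agent that works 3 times in 5 passes two spot checks in a row roughly 36% of the time.
console.log(chanceAllChecksPass(0.6, 2));
```

&lt;p&gt;Under those assumptions, a couple of green spot checks say very little about a flaky agent, which is the case for repeating each task several times.&lt;/p&gt;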

&lt;p&gt;What you actually want to know is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Determinism:&lt;/strong&gt; given the same input, how &lt;em&gt;often&lt;/em&gt; does it succeed? (the flakiness a single check never sees)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grounding:&lt;/strong&gt; when it makes a factual or regulatory claim, is that claim backed by a citation that actually resolves?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A record you can keep:&lt;/strong&gt; a report you can review, diff over time, and attach to your own QA or compliance trail.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;What AgentEval does&lt;/h2&gt;

&lt;p&gt;AgentEval is a TypeScript library + CLI. You wrap any agent, define what "good" looks like, and it runs each task &lt;strong&gt;N times&lt;/strong&gt; to produce a scorecard, a determinism score, and a report.&lt;/p&gt;

&lt;p&gt;The only integration point is an &lt;strong&gt;adapter&lt;/strong&gt;: given an input, return a trace.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;defineAdapter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agenteval-core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defineAdapter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;myAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;finalText&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;toolCalls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toolCalls&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
      &lt;span class="na"&gt;citations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;citations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// optional, enables grounding checks&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The object the adapter returns, an &lt;code&gt;AgentTrace&lt;/code&gt;, is the whole contract. It does not assume a particular framework, a tool-calling loop, or a domain, so it works with LangGraph, a raw Anthropic/OpenAI loop, an HTTP endpoint, whatever you have.&lt;/p&gt;
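
&lt;p&gt;For orientation, the trace shape can be pictured roughly as below. This is a sketch inferred from the adapter example only: the type names and the fields inside tool calls and citations are assumptions, and the library's real definitions may differ.&lt;/p&gt;

```typescript
// Hypothetical types inferred from the adapter example above; the library's
// real definitions may differ.
interface ToolCallSketch {
  name: string;   // e.g. "search_kb"
  args?: unknown; // assumption: whatever arguments the tool received
}

interface CitationSketch {
  source: string; // assumption: an identifier or URL for the cited source
  quote?: string; // assumption: the quoted passage, if any
}

interface AgentTraceSketch {
  input: { user_message: string };
  finalText: string;
  toolCalls: ToolCallSketch[];
  citations?: CitationSketch[]; // optional, as in the adapter above
}

// A hand-written trace, e.g. for trying out assertions without calling an agent.
const exampleTrace: AgentTraceSketch = {
  input: { user_message: 'Can I get a refund?' },
  finalText: 'Yes, within 30 days of purchase.',
  toolCalls: [{ name: 'search_kb' }],
};

console.log(exampleTrace.toolCalls.map((call) => call.name)); // [ 'search_kb' ]
```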

&lt;p&gt;Then you describe scenarios, in YAML or in code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scenarios/refund.yaml&lt;/span&gt;
&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;refund-window&lt;/span&gt;
&lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;user_message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Can&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;get&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;refund?"&lt;/span&gt;
&lt;span class="na"&gt;asserts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tool_called&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;search_kb&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;text_contains_one_of&lt;/span&gt;
    &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;days"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30-day"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;every_claim_has_citation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
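
&lt;p&gt;To make the assertion kinds concrete, here is a minimal sketch of how &lt;code&gt;tool_called&lt;/code&gt; and &lt;code&gt;text_contains_one_of&lt;/code&gt; could be checked against a single trace. It is an illustration, not AgentEval's implementation: the type and function names are hypothetical, and it assumes a run passes only when every assertion holds.&lt;/p&gt;

```typescript
// Hypothetical sketch, not AgentEval's implementation.
type AssertSketch =
  | { kind: 'tool_called'; name: string }
  | { kind: 'text_contains_one_of'; options: string[] };

interface MiniTrace {
  finalText: string;
  toolCalls: { name: string }[];
}

function checkAssert(trace: MiniTrace, assertion: AssertSketch): boolean {
  switch (assertion.kind) {
    case 'tool_called': {
      // Passes if any tool call in the trace has the expected name.
      const wanted = assertion.name;
      return trace.toolCalls.some((call) => call.name === wanted);
    }
    case 'text_contains_one_of': {
      // Passes if the final answer contains at least one of the options.
      const options = assertion.options;
      return options.some((option) => trace.finalText.includes(option));
    }
  }
}

// Assumption: a run passes only if every assertion holds.
function runPasses(trace: MiniTrace, asserts: AssertSketch[]): boolean {
  return asserts.every((assertion) => checkAssert(trace, assertion));
}

const refundAsserts: AssertSketch[] = [
  { kind: 'tool_called', name: 'search_kb' },
  { kind: 'text_contains_one_of', options: ['30 days', '30-day'] },
];

const goodRun: MiniTrace = {
  finalText: 'You can request a refund within 30 days.',
  toolCalls: [{ name: 'search_kb' }],
};
const badRun: MiniTrace = { finalText: 'Refunds are always available.', toolCalls: [] };

console.log(runPasses(goodRun, refundAsserts)); // true
console.log(runPasses(badRun, refundAsserts)); // false
```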



&lt;p&gt;And run them N times:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;runSuite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;loadScenarios&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;renderConsole&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agenteval-core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scenarios&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;loadScenarios&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./scenarios&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;runSuite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;adapter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;scenarios&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;renderConsole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;report&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[PASS] refund-window       (determinism 100%, 5/5 runs)
[FAIL] coverage-question   (determinism 60%, 3/5 runs)   &amp;lt;- flaky: same input, different answer
[FAIL] Summary: 1/2 scenarios passed | overall determinism 80.0%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;coverage-question&lt;/code&gt; line is the point. It passed three times. It also failed, twice, on the identical input. A one-shot check would more likely than not have called it green.&lt;/p&gt;
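
&lt;p&gt;The numbers in that scorecard are consistent with a simple calculation, sketched below. This is one reading that reproduces the figures shown, not AgentEval's actual scoring code: it assumes determinism is passed runs divided by total runs, that a scenario is green only when every run passed, and that the overall figure is the mean of the per-scenario scores. All names are hypothetical.&lt;/p&gt;

```typescript
// Hypothetical sketch: one reading of the scorecard that is consistent with
// the numbers shown above, not AgentEval's actual scoring code.
interface ScenarioRuns {
  id: string;
  passedRuns: boolean[]; // one entry per repeated run of the same input
}

function determinism(runs: boolean[]): number {
  if (runs.length === 0) {
    throw new RangeError('at least one run is required');
  }
  const passed = runs.filter((ok) => ok).length;
  return passed / runs.length;
}

const suite: ScenarioRuns[] = [
  { id: 'refund-window', passedRuns: [true, true, true, true, true] },
  { id: 'coverage-question', passedRuns: [true, false, true, true, false] },
];

for (const scenario of suite) {
  const score = determinism(scenario.passedRuns);
  // Assumption: a scenario is green only when every run passed.
  const status = score === 1 ? 'PASS' : 'FAIL';
  console.log(`[${status}] ${scenario.id} (determinism ${Math.round(score * 100)}%)`);
}

// Assumption: the overall figure is the mean of the per-scenario scores.
const scores = suite.map((scenario) => determinism(scenario.passedRuns));
const overall = scores.reduce((sum, score) => sum + score, 0) / scores.length;
console.log(`overall determinism ${(overall * 100).toFixed(1)}%`); // overall determinism 80.0%
```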

&lt;h2&gt;The moment it earned its keep&lt;/h2&gt;

&lt;p&gt;I have an autonomous web agent that does real errands: it searches for the right portal, plans a path, drives a browser, and extracts a result. I had four recorded runs of it doing the &lt;em&gt;same&lt;/em&gt; task: retrieve a property-tax payment receipt from a municipal portal.&lt;/p&gt;

&lt;p&gt;I ingested those four runs as traces and let AgentEval score them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[FAIL] property-tax-receipt  (determinism 25%, 1/4 runs)
[FAIL] Summary: 0/1 scenarios passed | overall determinism 25.0%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;25%.&lt;/strong&gt; It succeeded once. Twice the portal stopped responding mid-run; once it landed on the wrong page and returned homepage content instead of the receipt. The three failures were not even the same failure.&lt;/p&gt;

&lt;p&gt;If I had spot-checked it the day it worked, I would have shipped a one-in-four agent and called it done. That number, sitting there in red, is the entire argument for measuring determinism.&lt;/p&gt;

&lt;p&gt;(The full case study, with the redacted traces and the script, is in the repo under &lt;code&gt;case-studies/&lt;/code&gt;.)&lt;/p&gt;

&lt;h2&gt;Grounding: the part most evals skip&lt;/h2&gt;

&lt;p&gt;Reliability is not only "did it finish." For anything high-stakes, it is also "did it tell the truth, and can I check." AgentEval ships a grounding layer that flags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uncited claims:&lt;/strong&gt; a sentence that asserts a fact or a rule with no citation attached.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unresolved citations:&lt;/strong&gt; references that do not point at a real source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quote mismatches:&lt;/strong&gt; a quote that does not actually appear in its cited source.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;checkGrounding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;REGULATED_PRESET&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agenteval-core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;checkGrounding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;REGULATED_PRESET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;knownSources&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="c1"&gt;// -&amp;gt; { uncitedClaims, unresolvedCitations, quoteMismatches }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It ships a generic preset for any assistant and a regulated preset (CFR/ISO/IEC/MDR/IVDR/USC) for compliance-flavored agents, and the patterns are configurable for your own domain.&lt;/p&gt;
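
&lt;p&gt;To give a feel for what a pattern-based grounding check involves, here is a minimal sketch. The patterns, the &lt;code&gt;[1]&lt;/code&gt;-style citation marker, and the function names are all illustrative assumptions, not AgentEval's presets or API.&lt;/p&gt;

```typescript
// Hypothetical sketch of a regex-based grounding check. The patterns and names
// are illustrative assumptions, not AgentEval's presets.
const CLAIM_PATTERN = /\b(must|shall|required|within \d+ days)\b/i;
const CITATION_PATTERN = /\[\d+\]/; // assumption: citations look like [1], [2], ...

// Naive sentence splitter: cuts after runs of '.', '!' or '?'.
function splitSentences(text: string): string[] {
  const parts = text.match(/[^.!?]+[.!?]*/g) ?? [];
  return parts.map((part) => part.trim()).filter((part) => part.length > 0);
}

// Uncited claims: sentences that look like a rule or fact but carry no citation marker.
function findUncitedClaims(text: string): string[] {
  return splitSentences(text)
    .filter((sentence) => CLAIM_PATTERN.test(sentence))
    .filter((sentence) => !CITATION_PATTERN.test(sentence));
}

// Quote mismatch: the quote should appear in its source, ignoring case and spacing.
function quoteAppearsInSource(quote: string, sourceText: string): boolean {
  const normalize = (value: string) => value.toLowerCase().replace(/\s+/g, ' ').trim();
  return normalize(sourceText).includes(normalize(quote));
}

const sampleAnswer =
  'Refunds must be requested within 30 days [1]. Records shall be retained for two years. Thanks for asking!';

console.log(findUncitedClaims(sampleAnswer)); // [ 'Records shall be retained for two years.' ]
console.log(quoteAppearsInSource('within 30 days', 'Refunds are accepted within 30   days of purchase.')); // true
```

&lt;p&gt;A splitter this naive also breaks on references that contain periods, such as section numbers, which is one concrete reason the patterns need tuning per domain.&lt;/p&gt;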

&lt;h2&gt;It plugs into what you already have&lt;/h2&gt;

&lt;p&gt;Two things I cared about, so that adoption is not a chore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingest existing traces.&lt;/strong&gt; If you already collect OpenTelemetry or LangSmith traces, you can evaluate them without changing your agent: &lt;code&gt;otelToTrace(...)&lt;/code&gt;, &lt;code&gt;langsmithToTrace(...)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An MCP server.&lt;/strong&gt; Since AgentEval evaluates agents, it ships an MCP server so a coding agent (Claude, Codex, Cursor) can call it directly: &lt;code&gt;evaluate_agent&lt;/code&gt;, &lt;code&gt;check_grounding&lt;/code&gt;, &lt;code&gt;get_report&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And for CI, there is a baseline workflow: &lt;code&gt;agenteval baseline&lt;/code&gt; snapshots a known-good state, and &lt;code&gt;agenteval check&lt;/code&gt; fails the build if reliability has regressed. You cannot fix what you do not measure, and you cannot keep it fixed without a gate.&lt;/p&gt;
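
&lt;p&gt;The idea behind such a gate can be sketched in a few lines. This is an illustration of the concept only: the score shape, the tolerance parameter, and the function name are assumptions, not the format the CLI stores or the logic it runs.&lt;/p&gt;

```typescript
// Hypothetical sketch of a regression gate; the score shape and tolerance are
// assumptions, not the format the agenteval CLI writes to disk.
interface Scores {
  [scenarioId: string]: number; // determinism per scenario, from 0 to 1
}

// Returns the ids of scenarios whose score dropped by more than the tolerance.
function findRegressions(baseline: Scores, current: Scores, tolerance = 0): string[] {
  return Object.keys(baseline).filter((id) => {
    const before = baseline[id] ?? 0;
    const after = current[id] ?? 0; // a scenario missing from the new run counts as 0
    return before - after > tolerance;
  });
}

const baselineScores: Scores = { 'refund-window': 1, 'coverage-question': 0.6 };
const currentScores: Scores = { 'refund-window': 0.8, 'coverage-question': 0.6 };

const regressed = findRegressions(baselineScores, currentScores);
// In CI, a non-zero exit code is what fails the job.
const exitCode = regressed.length > 0 ? 1 : 0;
console.log(regressed, exitCode); // [ 'refund-window' ] 1
```

&lt;p&gt;A small tolerance is a design choice worth considering: with only a handful of runs per scenario, one unlucky run moves the score a lot, so a strict zero-tolerance gate may flap.&lt;/p&gt;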

&lt;h2&gt;Where it came from, and what it is not&lt;/h2&gt;

&lt;p&gt;AgentEval grew out of the evaluation layer of &lt;strong&gt;Deminn&lt;/strong&gt;, a multi-agent system I built for regulated quality and compliance (CAPA, FDA/ISO) workflows. The reliability and grounding ideas were proven there on a real, messy domain, then generalized so they work on any agent.&lt;/p&gt;

&lt;p&gt;Being honest about the boundaries, because I would want to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is &lt;strong&gt;v0.1&lt;/strong&gt;. Useful, tested (160+ tests, CI green), but young.&lt;/li&gt;
&lt;li&gt;Grounding is &lt;strong&gt;heuristic&lt;/strong&gt; (regex + similarity), so expect to tune the patterns for your domain. It is a strong signal, not an oracle.&lt;/li&gt;
&lt;li&gt;The HTML output is an &lt;strong&gt;audit-ready report&lt;/strong&gt;, not a certified compliance artifact. It is something a reviewer can read and keep, not a stamp.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Try it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;agenteval-core
npx agenteval init   &lt;span class="c"&gt;# scaffolds a config + an example scenario&lt;/span&gt;
npx agenteval run    &lt;span class="c"&gt;# prints the scorecard&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/lokesh75-kank/agenteval" rel="noopener noreferrer"&gt;https://github.com/lokesh75-kank/agenteval&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MIT licensed, TypeScript, Node 20+.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you build agents, point it at one of yours and run it a few times. I would bet you find something. And if you do, or if you have a sharper way to measure agent reliability, open an issue or a PR. I would love the feedback.&lt;/p&gt;

&lt;p&gt;What is your agent's &lt;em&gt;real&lt;/em&gt; success rate? Most of us have never actually measured it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agenteval</category>
      <category>llmevaluation</category>
    </item>
  </channel>
</rss>
