<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Devbrat Anand</title>
    <description>The latest articles on DEV Community by Devbrat Anand (@devbrat_anand).</description>
    <link>https://dev.to/devbrat_anand</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3870620%2F76cca83f-1e21-4cf0-b664-0bca6862c188.jpg</url>
      <title>DEV Community: Devbrat Anand</title>
      <link>https://dev.to/devbrat_anand</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devbrat_anand"/>
    <language>en</language>
    <item>
      <title>Your AI agent just leaked an SSN, cost surged and your tests passed. Here's why.</title>
      <dc:creator>Devbrat Anand</dc:creator>
      <pubDate>Thu, 09 Apr 2026 22:25:28 +0000</pubDate>
      <link>https://dev.to/devbrat_anand/your-ai-agent-just-leaked-an-ssn-cost-surged-and-your-tests-passed-heres-why-249c</link>
      <guid>https://dev.to/devbrat_anand/your-ai-agent-just-leaked-an-ssn-cost-surged-and-your-tests-passed-heres-why-249c</guid>
      <description>&lt;p&gt;Your agent tests pass. Your monitoring says "green."&lt;/p&gt;

&lt;p&gt;Meanwhile, your agent just hallucinated a refund policy, leaked a customer's SSN, and burned $2,847 in a token spiral.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;AI agents fail silently. Your HTTP monitoring sees 200s. Your latency metrics look normal. Your error rate is zero.&lt;/p&gt;

&lt;p&gt;But your agent is failing. Hard.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What Monitoring Sees&lt;/th&gt;
&lt;th&gt;What Actually Happened&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HTTP 200, normal latency&lt;/td&gt;
&lt;td&gt;500 → 4M tokens, $2,847 over 4 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTTP 200, fast response&lt;/td&gt;
&lt;td&gt;Confident, completely wrong answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Successful response&lt;/td&gt;
&lt;td&gt;Customer SSN in the output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool call succeeded&lt;/td&gt;
&lt;td&gt;Called &lt;code&gt;delete_order&lt;/code&gt; instead of &lt;code&gt;lookup_order&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No change in metrics&lt;/td&gt;
&lt;td&gt;Model update degraded quality by 30%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can't &lt;code&gt;curl&lt;/code&gt; your way out of this. You can't &lt;code&gt;grep&lt;/code&gt; logs for hallucinations. You need &lt;strong&gt;agent-aware testing&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What agenteval Does
&lt;/h2&gt;

&lt;p&gt;Write agent tests like regular Python tests. Run them in CI. Catch failures before production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_agent_no_hallucination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eval_model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is our refund policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hallucination_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;eval_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_cost_budget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Complex multi-step task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_cost_usd&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;5.00&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_loops&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_repeats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_security&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Look up customer John Smith&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_pii_leaked&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_prompt_injection&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_correct_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the status of order #ORD-1234?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_called&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lookup_order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_not_called&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;initiate_refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Install, init, run:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"agenteval-ai[all]"&lt;/span&gt;
agenteval init
pytest tests/agent_evals/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The "Aha" Examples
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. The Token Spiral
&lt;/h3&gt;

&lt;p&gt;Your agent loops. It calls the same tool 47 times. You don't notice until the AWS bill arrives.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_agent_no_token_spiral&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Complex task requiring multiple steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_loops&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_repeats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_cost_usd&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;5.00&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Deterministic. No eval model needed. Catches it instantly.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  2. The Hallucination
&lt;/h3&gt;

&lt;p&gt;Your agent invents a refund policy. Customer is furious. Your support team finds out when the complaint escalates.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_agent_grounds_responses_in_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eval_model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is our refund policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hallucination_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;eval_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Uses LLM-as-judge to verify the response is grounded in retrieved context.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  3. The PII Leak
&lt;/h3&gt;

&lt;p&gt;Your agent returns "Order #1234 for customer John Smith, SSN: 123-45-6789, shipped to..."&lt;/p&gt;

&lt;p&gt;Your security team finds out when the breach is reported.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_agent_no_pii_leaked&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Look up customer John Smith&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_pii_leaked&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Deterministic. No eval model needed. Scans output for SSNs, credit cards, emails, phone numbers.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  4. The Wrong Tool
&lt;/h3&gt;

&lt;p&gt;Customer: "What's the status of my order?"&lt;br&gt;
Agent: &lt;em&gt;calls &lt;code&gt;delete_order&lt;/code&gt; instead of &lt;code&gt;lookup_order&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Your customer's order is gone.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_agent_calls_correct_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the status of order #ORD-1234?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_called&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lookup_order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_not_called&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delete_order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Deterministic. No eval model needed. Verifies tool call sequence.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  $0 Local Evals with Ollama
&lt;/h2&gt;

&lt;p&gt;You don't need OpenAI API keys to run LLM-as-judge evals. Run them &lt;strong&gt;entirely locally&lt;/strong&gt; with Ollama:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull llama3.2
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"agenteval-ai[all]"&lt;/span&gt;
pytest tests/agent_evals/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;agenteval auto-detects Ollama and uses it as the eval model (judge). Zero cost. No API keys. No data leaves your machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;13 built-in evaluators:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7 deterministic&lt;/strong&gt; (cost, latency, tool calls, loops, output structure, security, regression) — instant, zero cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6 LLM-as-judge&lt;/strong&gt; (hallucination, similarity, guardrails, convergence, context utilization, custom judge) — works with Ollama (free), OpenAI, or Bedrock&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  How It Works: Protocol-Level Interception
&lt;/h2&gt;

&lt;p&gt;agenteval intercepts your agent's LLM calls &lt;strong&gt;at the protocol level&lt;/strong&gt; — no code changes, no SDK wrappers, no decorators.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent SDK&lt;/th&gt;
&lt;th&gt;Hook Mechanism&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;httpx transport&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Bedrock&lt;/td&gt;
&lt;td&gt;botocore events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;SDK patching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;OpenAI-compatible&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Wire up your agent in &lt;code&gt;conftest.py&lt;/code&gt; and agenteval captures every LLM call, tool call, and message — then runs evaluators on the trace.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Not DeepEval / TruLens / RAGAS / LangSmith?
&lt;/h2&gt;

&lt;p&gt;
  Click to see how agenteval compares to DeepEval, LangSmith, and others
  &lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;agenteval&lt;/th&gt;
&lt;th&gt;DeepEval&lt;/th&gt;
&lt;th&gt;TruLens&lt;/th&gt;
&lt;th&gt;RAGAS&lt;/th&gt;
&lt;th&gt;LangSmith&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-step agent trajectories&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Framework-agnostic&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol-level interception&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pytest native&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0 local evals (Ollama)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Action with PR bot&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP server&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open source (MIT)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;/p&gt;


&lt;h2&gt;
  
  
  Try It Now
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"agenteval-ai[all]"&lt;/span&gt;
agenteval init
pytest tests/agent_evals/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/devbrat-anand" rel="noopener noreferrer"&gt;
        devbrat-anand
      &lt;/a&gt; / &lt;a href="https://github.com/devbrat-anand/agenteval" rel="noopener noreferrer"&gt;
        agenteval
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      pytest for AI agents — catch failures before production
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;agenteval&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;pytest for AI agents — catch failures before production&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/devbrat-anand/agenteval/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/devbrat-anand/agenteval/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/agenteval-ai/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e68d5ace252c162c0bf1349d9d8cb9591d553625405adc6251f433f242e4f04c/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f6167656e746576616c2d61692e737667" alt="PyPI"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/agenteval-ai/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/1e0700b36c8940b607ad0abf45f97272e708b78d838984b6ea98b066153d762c/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f6167656e746576616c2d61692e737667" alt="Python"&gt;&lt;/a&gt;
&lt;a href="https://opensource.org/licenses/MIT" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/784362b26e4b3546254f1893e778ba64616e362bd6ac791991d2c9e880a3a64e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d677265656e2e737667" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/agenteval-ai/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/5bb3793319a843d9c6dfd11b5911f48f9a83315bc584eb8b77d5192af000b921/68747470733a2f2f696d672e736869656c64732e696f2f707970692f646d2f6167656e746576616c2d61692e737667" alt="Downloads"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Your agent tests pass. Your monitoring says "green."&lt;br&gt;
Meanwhile, your agent just hallucinated a refund policy, leaked a customer's SSN, and burned $2,847 in a token spiral.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;agenteval catches these failures in CI, before production.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/devbrat-anand/agenteval#quickstart" rel="noopener noreferrer"&gt;Quickstart&lt;/a&gt; · &lt;a href="https://github.com/devbrat-anand/agenteval#13-built-in-evaluators" rel="noopener noreferrer"&gt;Evaluators&lt;/a&gt; · &lt;a href="https://github.com/devbrat-anand/agenteval#supported-agent-sdks" rel="noopener noreferrer"&gt;Agent SDKs&lt;/a&gt; · &lt;a href="https://github.com/devbrat-anand/agenteval" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;pip install &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;agenteval-ai[all]&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;amp;&amp;amp;&lt;/span&gt; agenteval init &lt;span class="pl-k"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pytest tests/agent_evals/ -v&lt;/pre&gt;

&lt;/div&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/devbrat-anand/agenteval/docs/demo.gif"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fdevbrat-anand%2Fagenteval%2FHEAD%2Fdocs%2Fdemo.gif" alt="agenteval demo — running tests and seeing results" width="800"&gt;&lt;/a&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The Problem&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;AI agents fail silently. Traditional monitoring can't catch:&lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Mode&lt;/th&gt;
&lt;th&gt;What Monitoring Sees&lt;/th&gt;
&lt;th&gt;What Actually Happened&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Token spiral&lt;/td&gt;
&lt;td&gt;HTTP 200, normal latency&lt;/td&gt;
&lt;td&gt;500 → 4M tokens, $2,847 over 4 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination&lt;/td&gt;
&lt;td&gt;HTTP 200, fast response&lt;/td&gt;
&lt;td&gt;Confident, completely wrong answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PII leakage&lt;/td&gt;
&lt;td&gt;Successful response&lt;/td&gt;
&lt;td&gt;Customer SSN in the output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wrong tool&lt;/td&gt;
&lt;td&gt;Tool call succeeded&lt;/td&gt;
&lt;td&gt;Called &lt;code&gt;delete_order&lt;/code&gt; instead of &lt;code&gt;lookup_order&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Silent regression&lt;/td&gt;
&lt;td&gt;No change in metrics&lt;/td&gt;
&lt;td&gt;Model update degraded quality by 30%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The Solution&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Write agent tests like regular Python tests. Run them in CI.&lt;/p&gt;
&lt;div class="highlight highlight-source-python notranslate position-relative overflow-auto js-code-highlight"&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/devbrat-anand/agenteval" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://github.com/devbrat-anand/agenteval" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Star agenteval on GitHub ⭐&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/agenteval-ai/" rel="noopener noreferrer"&gt;https://pypi.org/project/agenteval-ai/&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>llm</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
