<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bhavin Kotak</title>
    <description>The latest articles on DEV Community by Bhavin Kotak (@bhavinkotak).</description>
    <link>https://dev.to/bhavinkotak</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3957606%2F6890993c-a56a-4ec9-b265-68a6cb5ce1ea.jpeg</url>
      <title>DEV Community: Bhavin Kotak</title>
      <link>https://dev.to/bhavinkotak</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bhavinkotak"/>
    <language>en</language>
    <item>
      <title>Building a Self-Improving AI Agent Evaluation Platform in Rust</title>
      <dc:creator>Bhavin Kotak</dc:creator>
      <pubDate>Fri, 29 May 2026 03:20:19 +0000</pubDate>
      <link>https://dev.to/bhavinkotak/building-a-self-improving-ai-agent-evaluation-platform-in-rust-1im4</link>
      <guid>https://dev.to/bhavinkotak/building-a-self-improving-ai-agent-evaluation-platform-in-rust-1im4</guid>
      <description>&lt;p&gt;When you're building AI agents, evaluation is only half the problem. The harder half is closing the improvement loop: taking what you learned from failing evals and automatically making the agent better.&lt;/p&gt;

&lt;p&gt;That's what AgentForge is: an open-source platform that runs the full cycle of evaluate → cluster failures → patch prompts → re-evaluate → gate promotion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
AgentForge ships as 16 focused Rust crates. The table lists eight of them:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Crate&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agentforge-runner&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Parallel scenario execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agentforge-scorer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multi-dimensional LLM-as-judge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agentforge-optimizer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Automatic prompt patching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agentforge-redteam&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Adversarial probing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agentforge-gatekeeper&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Promotion gate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agentforge-observability&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OTLP tracing + cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agentforge-multiagent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent orchestration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agentforge-finetune&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fine-tune dataset exporter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Optimizer Loop&lt;/strong&gt;&lt;br&gt;
The optimizer reads failure clusters from the scorer, calls an LLM to generate a prompt patch, reruns just the failing scenarios against the patched agent, and writes the result back to the eval database. If the patched agent clears the threshold, the gatekeeper promotes it.&lt;/p&gt;

&lt;p&gt;This runs entirely in-process with no external orchestration needed.&lt;/p&gt;
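
&lt;p&gt;As a rough illustration of that loop, here is a minimal sketch of a single iteration. It is not AgentForge's actual API: &lt;code&gt;propose_patch&lt;/code&gt;, &lt;code&gt;rerun_failing&lt;/code&gt;, and &lt;code&gt;clears_threshold&lt;/code&gt; are hypothetical names, the LLM call and the agent runs are stubbed with fixed values, and gating on the mean score is an assumption about how the threshold could be applied.&lt;/p&gt;

&lt;pre&gt;
```rust
// Hypothetical sketch of one optimizer iteration; none of these names come from AgentForge.
const FAILING: usize = 4; // failing scenarios in one cluster (illustrative)

// Stand-in for the LLM call that proposes a prompt patch for a failure cluster.
fn propose_patch(prompt: String, cluster_label: String) -> String {
    format!("{prompt}\nAdditional rule: avoid failures of type '{cluster_label}'.")
}

// Stand-in for rerunning only the failing scenarios; returns one score per scenario.
fn rerun_failing(prompt: String) -> [f64; FAILING] {
    // A real runner would call the agent; here a patched prompt simply scores higher.
    if prompt.contains("Additional rule") {
        [0.9, 0.88, 0.92, 0.86]
    } else {
        [0.4, 0.55, 0.5, 0.6]
    }
}

// Mean score across the rerun scenarios.
fn mean(scores: [f64; FAILING]) -> f64 {
    let total: f64 = scores.iter().sum();
    total / FAILING as f64
}

// Gate: promote only if the mean score reaches the threshold.
fn clears_threshold(scores: [f64; FAILING], threshold: f64) -> bool {
    mean(scores) >= threshold
}

fn main() {
    let threshold = 0.85;
    let base = String::from("You are a support agent.");
    let patched = propose_patch(base.clone(), String::from("missing_citation"));
    let before = rerun_failing(base);
    let after = rerun_failing(patched);
    println!("before: {:.2}, promote: {}", mean(before), clears_threshold(before, threshold));
    println!("after: {:.2}, promote: {}", mean(after), clears_threshold(after, threshold));
}
```
&lt;/pre&gt;

&lt;p&gt;In this sketch the patch is judged only on the scenarios that failed, which keeps the rerun cheap. A stricter gate could also rerun the passing scenarios to catch regressions.&lt;/p&gt;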

&lt;p&gt;&lt;strong&gt;Shadow Runs&lt;/strong&gt;&lt;br&gt;
Before promoting, you can run a shadow comparison: the current champion and the challenger handle the same traffic, scores are diffed, and only a statistically significant improvement triggers promotion.&lt;/p&gt;
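
&lt;p&gt;One way to make "statistically significant" concrete is a one-sided paired t-test on per-request scores. The sketch below assumes that test. The post does not say which test AgentForge uses, and &lt;code&gt;paired_t&lt;/code&gt;, &lt;code&gt;should_promote&lt;/code&gt;, the sample size, and the scores are all hypothetical.&lt;/p&gt;

&lt;pre&gt;
```rust
// Hypothetical shadow-run check; the paired t-test is an assumption, not AgentForge's documented method.
const N: usize = 8; // requests scored for both agents (illustrative)
const T_CRIT: f64 = 1.895; // one-sided critical t value, alpha = 0.05, 7 degrees of freedom

// Paired t statistic over per-request score differences (challenger minus champion).
fn paired_t(champion: [f64; N], challenger: [f64; N]) -> f64 {
    let mut diffs = [0.0_f64; N];
    for i in 0..N {
        diffs[i] = challenger[i] - champion[i];
    }
    let total: f64 = diffs.iter().sum();
    let mean = total / N as f64;
    // Sum of squared deviations, used for the sample standard deviation.
    let ss: f64 = diffs.iter().map(|d| (d - mean).powi(2)).sum();
    let std_err = (ss / (N as f64 - 1.0)).sqrt() / (N as f64).sqrt();
    if std_err == 0.0 {
        // No variation in the differences: treat as no evidence either way.
        return 0.0;
    }
    mean / std_err
}

// Promote only when the challenger's improvement exceeds the critical value.
fn should_promote(champion: [f64; N], challenger: [f64; N]) -> bool {
    paired_t(champion, challenger) > T_CRIT
}

fn main() {
    let champion = [0.80, 0.82, 0.78, 0.85, 0.81, 0.79, 0.83, 0.80];
    let challenger = [0.86, 0.85, 0.84, 0.88, 0.86, 0.83, 0.87, 0.85];
    println!("t = {:.2}", paired_t(champion, challenger));
    println!("promote: {}", should_promote(champion, challenger));
}
```
&lt;/pre&gt;

&lt;p&gt;Pairing the scores by request removes variation between requests from the comparison, so fewer samples are needed to detect a real difference than with two independent groups.&lt;/p&gt;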

&lt;p&gt;&lt;strong&gt;Why Rust?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low-overhead async via Tokio for high-concurrency scenario runs&lt;/li&gt;
&lt;li&gt;Compile-time-checked queries against Postgres via sqlx&lt;/li&gt;
&lt;li&gt;Single static binary → ships as a GitHub Action with no runtime deps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Get started&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;cargo install agentforge-cli&lt;br&gt;
agentforge run --agent my-agent.yaml --scenarios 50 --threshold 0.85&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Or drop it into CI:&lt;br&gt;
&lt;code&gt;- uses: bhavinkotak/agentforge@v0.1.10&lt;br&gt;
  with:&lt;br&gt;
    agent_file: fixtures/my-agent.yaml&lt;br&gt;
    threshold: '0.85'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/bhavinkotak/agentforge" rel="noopener noreferrer"&gt;https://github.com/bhavinkotak/agentforge&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
