<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aayush kumarsingh</title>
    <description>The latest articles on DEV Community by Aayush kumarsingh (@aayush_kumarsingh_6ee1ffe).</description>
    <link>https://dev.to/aayush_kumarsingh_6ee1ffe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869731%2F3626c00e-9846-420a-aa24-7ef35e7ed749.png</url>
      <title>DEV Community: Aayush kumarsingh</title>
      <link>https://dev.to/aayush_kumarsingh_6ee1ffe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aayush_kumarsingh_6ee1ffe"/>
    <language>en</language>
    <item>
      <title>I built an open-source LLM eval platform with a ReAct agent that diagnoses quality regressions</title>
      <dc:creator>Aayush kumarsingh</dc:creator>
      <pubDate>Thu, 09 Apr 2026 11:45:30 +0000</pubDate>
      <link>https://dev.to/aayush_kumarsingh_6ee1ffe/i-built-an-open-source-llm-eval-platform-with-a-react-agent-that-diagnoses-quality-regressions-3a26</link>
      <guid>https://dev.to/aayush_kumarsingh_6ee1ffe/i-built-an-open-source-llm-eval-platform-with-a-react-agent-that-diagnoses-quality-regressions-3a26</guid>
      <description>&lt;h2&gt;
  
  
  The problem that made me build this
&lt;/h2&gt;

&lt;p&gt;I was building a multi-agent orchestration system. It worked great &lt;br&gt;
in testing. I deployed it. Three days later I changed a system prompt. &lt;br&gt;
Quality dropped from 84% to 52%. I found out 11 days later when a &lt;br&gt;
user complained.&lt;/p&gt;

&lt;p&gt;This is the most common failure mode in LLM applications. Unlike &lt;br&gt;
traditional software where a bug throws an exception, bad LLM outputs &lt;br&gt;
look like valid responses. They just happen to be wrong, unhelpful, &lt;br&gt;
or unsafe. You need systematic measurement to catch this.&lt;/p&gt;

&lt;p&gt;I looked for existing tools. Langfuse is good but expensive at scale for self-hosted teams. &lt;br&gt;
Braintrust doesn't have a free self-hosted option. Helicone doesn't do &lt;br&gt;
evals. I built TraceMind.&lt;/p&gt;
&lt;h2&gt;
  
  
  What TraceMind does
&lt;/h2&gt;

&lt;p&gt;Three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Automatic quality scoring&lt;/strong&gt;&lt;br&gt;
Every LLM response is scored 1-10 by another LLM acting as judge &lt;br&gt;
(LLM-as-judge pattern). I use Groq's free tier — llama-3.1-8b-instant &lt;br&gt;
for fast scoring, llama-3.3-70b for deep analysis. The score runs in &lt;br&gt;
the background, never blocking your application.&lt;/p&gt;
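
&lt;p&gt;For illustration, here's a minimal sketch of what an LLM-as-judge call can look like with the Groq client. The prompt and the &lt;code&gt;judge_score&lt;/code&gt; helper are simplified stand-ins, not TraceMind's internal scorer.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal LLM-as-judge sketch (illustrative, not TraceMind's internals)
from groq import Groq

client = Groq(api_key="...")

JUDGE_PROMPT = """Rate the assistant's answer from 1 to 10 for correctness,
helpfulness, and safety. Reply with a single integer.

User input: {inp}
Assistant answer: {out}"""

def judge_score(inp, out):
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",   # the fast judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(inp=inp, out=out)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;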

&lt;p&gt;&lt;strong&gt;2. Golden dataset evals&lt;/strong&gt;&lt;br&gt;
You define expected behaviors once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I want a refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acknowledge and ask for order number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pass rate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pass_rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Pass rate: 87%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. AI agent that diagnoses regressions&lt;/strong&gt;&lt;br&gt;
This is the part I'm most proud of. You can ask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Why did quality drop yesterday?"
"What are the most common failure patterns?"
"Generate test cases for billing question failures"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent implements the ReAct pattern with 6 tools and 4 memory types.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture decisions that matter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Parallel eval execution with asyncio.Semaphore
&lt;/h3&gt;

&lt;p&gt;The naive approach runs LLM judge calls sequentially. &lt;br&gt;
For 100 test cases at 500ms each, that's 50 seconds.&lt;/p&gt;

&lt;p&gt;I use &lt;code&gt;asyncio.Semaphore(3)&lt;/code&gt; to run 3 evaluations concurrently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;semaphore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_concurrent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;run_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semaphore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;coro&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;coro&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running 100 cases now takes ~17 seconds. The semaphore limit exists because &lt;br&gt;
Groq's free tier has rate limits — I tuned it to stay under the threshold.&lt;/p&gt;
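
&lt;p&gt;To make that concrete, here's roughly what a &lt;code&gt;run_case&lt;/code&gt; coroutine can look like; the semaphore gates each case so only &lt;code&gt;max_concurrent&lt;/code&gt; are in flight at once. &lt;code&gt;system_fn&lt;/code&gt; and &lt;code&gt;judge_score_async&lt;/code&gt; are placeholder names, not the exact TraceMind signatures.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a run_case coroutine gated by the shared semaphore (names are illustrative)
import asyncio

async def run_case(example, system_fn, criteria, semaphore):
    async with semaphore:                 # at most max_concurrent cases run at a time
        output = await system_fn(example.input)                  # the system under test
        score = await judge_score_async(example.input, output, criteria)  # async judge call
        return {"input": example.input, "score": score, "passed": score &amp;gt;= 7}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
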
&lt;h3&gt;
  
  
  The ReAct agent with semantic memory
&lt;/h3&gt;

&lt;p&gt;The agent has 4 memory types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In-context&lt;/strong&gt;: conversation history within the session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External KV&lt;/strong&gt;: project config from database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic&lt;/strong&gt;: past failures in ChromaDB with sentence-transformers embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Episodic&lt;/strong&gt;: past agent run results in SQLite&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you ask "why did quality drop?", the agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Searches ChromaDB semantically for similar past failures (sketched below)&lt;/li&gt;
&lt;li&gt;Fetches recent low-scoring traces from the database&lt;/li&gt;
&lt;li&gt;Runs a targeted eval on the failure category&lt;/li&gt;
&lt;li&gt;Uses an Opus-equivalent model to analyze the root cause&lt;/li&gt;
&lt;li&gt;Generates new test cases to prevent future recurrence&lt;/li&gt;
&lt;/ol&gt;
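
&lt;p&gt;Step 1 is just a vector lookup. Here's a minimal sketch of storing and querying past failures with ChromaDB; the collection name and metadata fields are illustrative, not TraceMind's schema.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the semantic-memory lookup over past failures (step 1); names are illustrative
import chromadb

client = chromadb.PersistentClient(path="./memory")
failures = client.get_or_create_collection("past_failures")

# record a failure when a low-scoring trace is seen
failures.add(
    ids=["trace-123"],
    documents=["User asked about billing; the agent quoted the wrong plan."],
    metadatas=[{"score": 3, "category": "billing"}],
)

# later: retrieve the most similar past failures for the current question
similar = failures.query(query_texts=["why did quality drop yesterday?"], n_results=5)
print(similar["documents"][0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;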

&lt;p&gt;I intentionally avoided LangChain. The ReAct loop is 80 lines of &lt;br&gt;
readable Python. When something breaks at 3am, you want to read &lt;br&gt;
your own code.&lt;/p&gt;
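
&lt;p&gt;The loop itself is nothing exotic. A stripped-down version of the idea looks like this; the real agent's prompts, tool registry, and memory wiring are more involved.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Stripped-down ReAct-style loop for illustration; not the actual TraceMind agent code
import json

def react_loop(question, tools, llm, max_steps=8):
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = llm(history)             # model returns {"thought", "action", "input"} as JSON
        step = json.loads(reply)
        if step["action"] == "final_answer":
            return step["input"]
        observation = tools[step["action"]](step["input"])   # run the chosen tool
        history.append({"role": "assistant", "content": reply})
        history.append({"role": "user", "content": f"Observation: {observation}"})
    return "Stopped after max_steps without a final answer."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
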
&lt;h3&gt;
  
  
  Background worker for async scoring
&lt;/h3&gt;

&lt;p&gt;The HTTP ingestion endpoint returns in &amp;lt;10ms regardless of batch size. &lt;br&gt;
Scoring runs in a background worker that polls every 10 seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_score_unscored_spans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;spans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_unscored&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_score_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;save_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The worst thing an observability tool can do is slow down the system &lt;br&gt;
it's monitoring. Scoring is completely decoupled from ingestion.&lt;/p&gt;
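
&lt;p&gt;The surrounding worker is just a polling loop around that method. Roughly, with the interval and error handling simplified:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the polling loop around _score_unscored_spans (simplified)
import asyncio
import logging

log = logging.getLogger(__name__)

async def run(self, interval=10):
    while True:
        try:
            await self._score_unscored_spans()    # scores up to 20 spans per tick
        except Exception as exc:
            log.warning("scoring tick failed: %s", exc)   # a bad tick never kills the worker
        await asyncio.sleep(interval)             # ingestion is never blocked by this loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
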
&lt;h3&gt;
  
  
  Local embeddings — no OpenAI dependency
&lt;/h3&gt;

&lt;p&gt;I use sentence-transformers &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; for ChromaDB embeddings. &lt;br&gt;
It runs locally, downloads once (~90MB), works offline, zero API cost. &lt;br&gt;
This was a deliberate choice — I wanted the tool to work completely &lt;br&gt;
free with no external dependencies beyond Groq for LLM calls.&lt;/p&gt;
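
&lt;p&gt;Wiring the local model into ChromaDB is a few lines, assuming the standard sentence-transformers helper that ships with chromadb:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Registering all-MiniLM-L6-v2 as the ChromaDB embedding function (runs fully locally)
import chromadb
from chromadb.utils import embedding_functions

embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"       # downloads once (~90MB), then works offline
)
client = chromadb.PersistentClient(path="./memory")
failures = client.get_or_create_collection("past_failures", embedding_function=embed_fn)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
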
&lt;h2&gt;
  
  
  What I'd do differently in production
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy&lt;/strong&gt;: Row-level security instead of project-level isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celery + Redis&lt;/strong&gt; instead of asyncio background worker for horizontal scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming eval results&lt;/strong&gt; via WebSocket — see case-by-case progress in real time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alembic migrations&lt;/strong&gt; from day one (I added these later)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Live demo: &lt;a href="https://tracemind.vercel.app" rel="noopener noreferrer"&gt;https://tracemind.vercel.app&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/Aayush-engineer/tracemind" rel="noopener noreferrer"&gt;https://github.com/Aayush-engineer/tracemind&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3-line setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceMind&lt;/span&gt;
&lt;span class="n"&gt;tm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TraceMind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
               &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tracemind.onrender.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tm.trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;your_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# your code unchanged
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxmjhby73l7wsn7laek2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxmjhby73l7wsn7laek2.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
If you're building with LLMs and want to know if they're actually &lt;br&gt;
working — I'd love feedback.&lt;/p&gt;

</description>
      <category>python</category>
      <category>rag</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
