Aayush kumarsingh


I built an open-source LLM eval platform with a ReAct agent that diagnoses quality regressions

The problem that made me build this

I was building a multi-agent orchestration system. It worked great
in testing. I deployed it. Three days later I changed a system prompt.
Quality dropped from 84% to 52%. I found out 11 days later when a
user complained.

This is the most common failure mode in LLM applications. Unlike
traditional software where a bug throws an exception, bad LLM outputs
look like valid responses. They just happen to be wrong, unhelpful,
or unsafe. You need systematic measurement to catch this.

I looked for existing tools. Langfuse is good but gets expensive at scale
for self-hosted teams. Braintrust doesn't have a free self-hosted option.
Helicone doesn't do evals. So I built TraceMind.

What TraceMind does

Three things:

1. Automatic quality scoring
Every LLM response is scored 1-10 by another LLM acting as judge
(LLM-as-judge pattern). I use Groq's free tier — llama-3.1-8b-instant
for fast scoring, llama-3.3-70b for deep analysis. The score runs in
the background, never blocking your application.
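The post doesn't show the judge prompt itself, so here's a minimal sketch of the LLM-as-judge pattern. The prompt wording and the helper names (judge_prompt, parse_judge_score) are my own illustration, not TraceMind's actual code; any Groq or OpenAI-compatible chat call would supply the judge's reply:

```python
import re

def judge_prompt(user_input: str, response: str) -> str:
    # Hypothetical prompt template -- the real TraceMind prompt isn't in the post.
    return (
        "You are a strict quality judge. Rate the assistant's response "
        "from 1 (unusable) to 10 (excellent).\n"
        f"User input: {user_input}\n"
        f"Assistant response: {response}\n"
        "Reply with 'Score: <n>' followed by one sentence of reasoning."
    )

def parse_judge_score(judge_reply: str) -> int:
    # Pull the first integer in the 1-10 range out of the judge's free-text reply.
    m = re.search(r"\b(10|[1-9])\b", judge_reply)
    if m is None:
        raise ValueError(f"no score found in: {judge_reply!r}")
    return int(m.group(1))
```

The regex fallback matters in practice: small judge models don't always follow the output format exactly, so parsing has to tolerate surrounding prose.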

2. Golden dataset evals
You define expected behaviors once:

ds = tm.dataset("support-v1")
ds.add("I want a refund", expected="acknowledge and ask for order number")
ds.push()

result = tm.run_eval("support-v1", function=your_agent.run)
result.wait()
print(f"Pass rate: {result.pass_rate:.0%}")  # Pass rate: 87%

3. AI agent that diagnoses regressions
This is the part I'm most proud of. You can ask:

"Why did quality drop yesterday?"
"What are the most common failure patterns?"
"Generate test cases for billing question failures"

The agent implements the ReAct pattern with 6 tools and 4 memory types.

The architecture decisions that matter

Parallel eval execution with asyncio.Semaphore

The naive approach runs LLM judge calls sequentially: 100 test cases
at 500ms each means 50 seconds per eval run.

I use asyncio.Semaphore(3) to run 3 evaluations concurrently:

semaphore = asyncio.Semaphore(max_concurrent)
# each run_case acquires the semaphore, capping concurrent judge calls
tasks = [run_case(ex, system_fn, criteria, semaphore) for ex in examples]
for coro in asyncio.as_completed(tasks):
    result = await coro  # results arrive in completion order, not input order

The same 100 cases now finish in ~17 seconds. The concurrency cap exists
because Groq's free tier has rate limits; I tuned it to stay under the threshold.
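run_case isn't shown in the post; presumably each task acquires the semaphore before touching the API. A runnable sketch of that shape, simplified to drop the criteria argument, with fake_judge standing in for a real judge call:

```python
import asyncio

async def run_case(example, system_fn, semaphore):
    # Only max_concurrent tasks get past this line at once,
    # keeping judge calls under the provider's rate limit.
    async with semaphore:
        return await system_fn(example)

async def run_all(examples, system_fn, max_concurrent=3):
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [run_case(ex, system_fn, semaphore) for ex in examples]
    results = []
    for coro in asyncio.as_completed(tasks):
        results.append(await coro)  # completion order, not input order
    return results

async def fake_judge(example):
    await asyncio.sleep(0.01)  # stand-in for a ~500ms judge call
    return example * 2
```

Usage: `asyncio.run(run_all(range(100), fake_judge))`. Note that `as_completed` yields results out of order, so each result needs to carry its own case ID if you aggregate per-case scores.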

The ReAct agent with semantic memory

The agent has 4 memory types:

  • In-context: conversation history within the session
  • External KV: project config from database
  • Semantic: past failures in ChromaDB with sentence-transformers embeddings
  • Episodic: past agent run results in SQLite

When you ask "why did quality drop?", the agent:

  1. Searches ChromaDB semantically for similar past failures
  2. Fetches recent low-scoring traces from the database
  3. Runs a targeted eval on the failure category
  4. Uses an Opus-equivalent model to analyze the root cause
  5. Generates new test cases to prevent future recurrence

I intentionally avoided LangChain. The ReAct loop is 80 lines of
readable Python. When something breaks at 3am, you want to read
your own code.
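For reference, the core of a ReAct loop really does fit in a screenful. A dependency-free sketch of the pattern; the prompt conventions and parse_action helper are illustrative, not TraceMind's actual code:

```python
import re

def parse_action(reply: str):
    # Expects a line like "Action: tool_name[argument]".
    m = re.search(r"Action:\s*(\w+)\[(.*?)\]", reply)
    return m.group(1), m.group(2)

def react_loop(llm, tools, question, max_steps=6):
    # Alternate Thought -> Action -> Observation until the model answers.
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += "\n" + reply
        if "Answer:" in reply:
            return reply.split("Answer:", 1)[1].strip()
        if "Action:" in reply:
            name, arg = parse_action(reply)
            transcript += f"\nObservation: {tools[name](arg)}"
    return None  # give up after max_steps to avoid infinite tool loops
```

The tools dict is just name-to-callable; the loop neither knows nor cares whether a tool hits ChromaDB, SQLite, or an eval runner. That's what keeps it readable at 3am.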

Background worker for async scoring

The HTTP ingestion endpoint returns in <10ms regardless of batch size.
Scoring runs in a background worker that polls every 10 seconds:

async def _score_unscored_spans(self):
    spans = fetch_unscored(limit=20)
    for span in spans:
        score = await self._score_span(span.input, span.output)
        save_score(span.id, score)

The worst thing an observability tool can do is slow down the system
it's monitoring. Scoring is completely decoupled from ingestion.
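The post shows a single polling pass; the surrounding loop is presumably just sleep-and-repeat. A sketch of that wrapper, where worker_loop and the stop event are my additions so the loop can shut down cleanly in tests and deploys:

```python
import asyncio

async def worker_loop(score_pass, poll_interval=10.0, stop=None):
    # Run one scoring pass, sleep, repeat -- entirely off the request path,
    # so slow judge calls never delay ingestion responses.
    while stop is None or not stop.is_set():
        await score_pass()
        await asyncio.sleep(poll_interval)
```

A pull-based poller like this also degrades gracefully: if the judge API is down, unscored spans simply accumulate and get picked up on a later pass.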

Local embeddings — no OpenAI dependency

I use sentence-transformers all-MiniLM-L6-v2 for ChromaDB embeddings.
It runs locally, downloads once (~90MB), works offline, zero API cost.
This was a deliberate choice — I wanted the tool to work completely
free with no external dependencies beyond Groq for LLM calls.
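Under the hood, semantic retrieval is just embedding plus nearest-neighbor search by cosine similarity, which ChromaDB handles at scale. A dependency-free illustration of the ranking step, with toy 3-d vectors standing in for MiniLM's 384-d embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, corpus, k=3):
    # corpus: list of (doc_id, vector) pairs, e.g. embedded past-failure notes.
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

This is why "why did quality drop?" can surface a refund-handling failure from weeks ago even when no keyword matches: nearby vectors, not shared words, drive retrieval.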

What I'd do differently in production

  1. Multi-tenancy: Row-level security instead of project-level isolation
  2. Celery + Redis instead of asyncio background worker for horizontal scaling
  3. Streaming eval results via WebSocket — see case-by-case progress in real time
  4. Alembic migrations from day one (I added these later)

Try it

Live demo: https://tracemind.vercel.app
GitHub: https://github.com/Aayush-engineer/tracemind

3-line setup:

pip install tracemind

from tracemind import TraceMind

tm = TraceMind(api_key="...", project="my-app",
               base_url="https://tracemind.onrender.com")

@tm.trace("llm_call")
def your_function(msg): ...  # your code unchanged


If you're building with LLMs and want to know whether they're actually
working, I'd love your feedback.
