Aayush kumarsingh


I built an open-source LLM eval platform with a ReAct agent that diagnoses quality regressions

The problem that made me build this

I was building a multi-agent orchestration system. It worked great
in testing. I deployed it. Three days later I changed a system prompt.
Quality dropped from 84% to 52%. I found out 11 days later when a
user complained.

This is the most common failure mode in LLM applications. Unlike
traditional software where a bug throws an exception, bad LLM outputs
look like valid responses. They just happen to be wrong, unhelpful,
or unsafe. You need systematic measurement to catch this.

I looked for existing tools. Langfuse is good but gets expensive at scale
for self-hosted teams. Braintrust doesn't have a free self-hosted option.
Helicone doesn't do evals. So I built TraceMind.

What TraceMind does

Three things:

1. Automatic quality scoring
Every LLM response is scored 1-10 by another LLM acting as judge
(LLM-as-judge pattern). I use Groq's free tier — llama-3.1-8b-instant
for fast scoring, llama-3.3-70b for deep analysis. The score runs in
the background, never blocking your application.
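The post doesn't show the judge prompt itself, so here's a minimal sketch of the LLM-as-judge pattern. The prompt wording and the helper names (judge_prompt, parse_judge_score) are my own illustration, not TraceMind's actual code; any Groq or OpenAI-compatible chat call would supply the judge's reply:

```python
import re

def judge_prompt(user_input: str, response: str) -> str:
    # Hypothetical prompt template -- the real TraceMind prompt isn't in the post.
    return (
        "You are a strict quality judge. Rate the assistant's response "
        "from 1 (unusable) to 10 (excellent).\n"
        f"User input: {user_input}\n"
        f"Assistant response: {response}\n"
        "Reply with 'Score: <n>' followed by one sentence of reasoning."
    )

def parse_judge_score(judge_reply: str) -> int:
    # Pull the first integer in the 1-10 range out of the judge's free-text reply.
    m = re.search(r"\b(10|[1-9])\b", judge_reply)
    if m is None:
        raise ValueError(f"no score found in: {judge_reply!r}")
    return int(m.group(1))
```

The regex fallback matters in practice: small judge models don't always follow the output format exactly, so parsing has to tolerate surrounding prose.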

2. Golden dataset evals
You define expected behaviors once:

ds = tm.dataset("support-v1")
ds.add("I want a refund", expected="acknowledge and ask for order number")
ds.push()

result = tm.run_eval("support-v1", function=your_agent.run)
result.wait()
print(f"Pass rate: {result.pass_rate:.0%}")  # Pass rate: 87%

3. AI agent that diagnoses regressions
This is the part I'm most proud of. You can ask:

"Why did quality drop yesterday?"
"What are the most common failure patterns?"
"Generate test cases for billing question failures"

The agent implements the ReAct pattern with 6 tools and 4 memory types.

The architecture decisions that matter

Parallel eval execution with asyncio.Semaphore

The naive approach runs LLM judge calls sequentially: 100 test cases
at 500ms each means 50 seconds per eval run.

I use asyncio.Semaphore(3) to run 3 evaluations concurrently:

semaphore = asyncio.Semaphore(max_concurrent)
# each run_case acquires the semaphore, capping concurrent judge calls
tasks = [run_case(ex, system_fn, criteria, semaphore) for ex in examples]
for coro in asyncio.as_completed(tasks):
    result = await coro  # results arrive in completion order, not input order

The same 100 cases now finish in ~17 seconds. The concurrency cap exists
because Groq's free tier has rate limits; I tuned it to stay under the threshold.
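run_case isn't shown in the post; presumably each task acquires the semaphore before touching the API. A runnable sketch of that shape, simplified to drop the criteria argument, with fake_judge standing in for a real judge call:

```python
import asyncio

async def run_case(example, system_fn, semaphore):
    # Only max_concurrent tasks get past this line at once,
    # keeping judge calls under the provider's rate limit.
    async with semaphore:
        return await system_fn(example)

async def run_all(examples, system_fn, max_concurrent=3):
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [run_case(ex, system_fn, semaphore) for ex in examples]
    results = []
    for coro in asyncio.as_completed(tasks):
        results.append(await coro)  # completion order, not input order
    return results

async def fake_judge(example):
    await asyncio.sleep(0.01)  # stand-in for a ~500ms judge call
    return example * 2
```

Usage: `asyncio.run(run_all(range(100), fake_judge))`. Note that `as_completed` yields results out of order, so each result needs to carry its own case ID if you aggregate per-case scores.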

The ReAct agent with semantic memory

The agent has 4 memory types:

  • In-context: conversation history within the session
  • External KV: project config from database
  • Semantic: past failures in ChromaDB with sentence-transformers embeddings
  • Episodic: past agent run results in SQLite

When you ask "why did quality drop?", the agent:

  1. Searches ChromaDB semantically for similar past failures
  2. Fetches recent low-scoring traces from the database
  3. Runs a targeted eval on the failure category
  4. Uses an Opus-equivalent model to analyze the root cause
  5. Generates new test cases to prevent future recurrence

I intentionally avoided LangChain. The ReAct loop is 80 lines of
readable Python. When something breaks at 3am, you want to read
your own code.
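For reference, the core of a ReAct loop really does fit in a screenful. A dependency-free sketch of the pattern; the prompt conventions and parse_action helper are illustrative, not TraceMind's actual code:

```python
import re

def parse_action(reply: str):
    # Expects a line like "Action: tool_name[argument]".
    m = re.search(r"Action:\s*(\w+)\[(.*?)\]", reply)
    return m.group(1), m.group(2)

def react_loop(llm, tools, question, max_steps=6):
    # Alternate Thought -> Action -> Observation until the model answers.
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += "\n" + reply
        if "Answer:" in reply:
            return reply.split("Answer:", 1)[1].strip()
        if "Action:" in reply:
            name, arg = parse_action(reply)
            transcript += f"\nObservation: {tools[name](arg)}"
    return None  # give up after max_steps to avoid infinite tool loops
```

The tools dict is just name-to-callable; the loop neither knows nor cares whether a tool hits ChromaDB, SQLite, or an eval runner. That's what keeps it readable at 3am.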

Background worker for async scoring

The HTTP ingestion endpoint returns in <10ms regardless of batch size.
Scoring runs in a background worker that polls every 10 seconds:

async def _score_unscored_spans(self):
    spans = fetch_unscored(limit=20)
    for span in spans:
        score = await self._score_span(span.input, span.output)
        save_score(span.id, score)

The worst thing an observability tool can do is slow down the system
it's monitoring. Scoring is completely decoupled from ingestion.
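The post shows a single polling pass; the surrounding loop is presumably just sleep-and-repeat. A sketch of that wrapper, where worker_loop and the stop event are my additions so the loop can shut down cleanly in tests and deploys:

```python
import asyncio

async def worker_loop(score_pass, poll_interval=10.0, stop=None):
    # Run one scoring pass, sleep, repeat -- entirely off the request path,
    # so slow judge calls never delay ingestion responses.
    while stop is None or not stop.is_set():
        await score_pass()
        await asyncio.sleep(poll_interval)
```

A pull-based poller like this also degrades gracefully: if the judge API is down, unscored spans simply accumulate and get picked up on a later pass.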

Local embeddings — no OpenAI dependency

I use sentence-transformers all-MiniLM-L6-v2 for ChromaDB embeddings.
It runs locally, downloads once (~90MB), works offline, zero API cost.
This was a deliberate choice — I wanted the tool to work completely
free with no external dependencies beyond Groq for LLM calls.
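Under the hood, semantic retrieval is just embedding plus nearest-neighbor search by cosine similarity, which ChromaDB handles at scale. A dependency-free illustration of the ranking step, with toy 3-d vectors standing in for MiniLM's 384-d embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, corpus, k=3):
    # corpus: list of (doc_id, vector) pairs, e.g. embedded past-failure notes.
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

This is why "why did quality drop?" can surface a refund-handling failure from weeks ago even when no keyword matches: nearby vectors, not shared words, drive retrieval.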

What I'd do differently in production

  1. Multi-tenancy: Row-level security instead of project-level isolation
  2. Celery + Redis instead of asyncio background worker for horizontal scaling
  3. Streaming eval results via WebSocket — see case-by-case progress in real time
  4. Alembic migrations from day one (I added these later)

Try it

Live demo: https://tracemind.vercel.app
GitHub: https://github.com/Aayush-engineer/tracemind

3-line setup:

pip install tracemind

from tracemind import TraceMind

tm = TraceMind(api_key="...", project="my-app",
               base_url="https://tracemind.onrender.com")

@tm.trace("llm_call")
def your_function(msg): ...  # your code unchanged


If you're building with LLMs and want to know whether they're actually
working, I'd love your feedback.
