DEV Community: Aayush kumarsingh

Why comparing average scores is the wrong way to evaluate LLM prompts (and what to do instead)

Aayush kumarsingh — Fri, 08 May 2026 10:20:59 +0000

Most teams compare prompts like this:

Prompt A average score: 6.8
Prompt B average score: 7.4

"B is better, ship it."

I used to do this too. Then I ran the numbers properly and realized I'd been making deployment decisions on statistical noise.

Here's what I learned about evaluating LLM prompts correctly, and the specific implementation I built.

The problem with averages on small datasets

LLM eval datasets are usually small: a golden set of 10-30 test cases is typical. That's not enough data for a difference in averages to be reliable on its own.

Here's why. Imagine you score both prompts on 10 cases. Prompt B scores 0.6 points higher on average. Sounds like a win.

But with n=10, a difference of 0.6 points could easily happen by random chance — the model had a slightly better day, the test cases happened to favor B's phrasing, one outlier case pulled the average. You have no way to know without actually computing the probability.

This is the core problem: a difference is not the same as a statistically significant difference.
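
To get a feel for how often a gap like this shows up by pure chance, here's a minimal simulation sketch. It is not part of TraceMind; the function name `chance_of_gap` and the pool of scores are assumptions for illustration. Both "prompts" are drawn from the same pool, so any gap between the two averages is noise by construction.

```python
import random
import statistics

def chance_of_gap(pool: list[float], gap: float, n: int = 10,
                  trials: int = 5000, seed: int = 0) -> float:
    """Fraction of trials where two samples from the SAME pool differ in mean by >= gap."""
    rng = random.Random(seed)  # seeded so repeated runs agree
    hits = 0
    for _ in range(trials):
        sample_a = [rng.choice(pool) for _ in range(n)]
        sample_b = [rng.choice(pool) for _ in range(n)]
        if abs(statistics.mean(sample_a) - statistics.mean(sample_b)) >= gap:
            hits += 1
    return hits / trials

# Hypothetical judge scores: mostly good, a few failures
pool = [9, 9, 8, 9, 3, 8, 9, 7, 9, 2]
print(chance_of_gap(pool, gap=0.6))
```

If the printed fraction is large, a 0.6-point gap between two real prompts is weak evidence on its own.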

Why the t-test is the wrong fix

The standard answer to "I need statistical significance" is the t-test. But the t-test has an assumption that most people skip over: it assumes your scores are roughly normally distributed, and that assumption matters most when the sample is small.

LLM evaluation scores don't. They look more like this:

Scores: [9, 9, 8, 9, 3, 8, 9, 7, 9, 2]

Bimodal: most responses are good, a few completely fail. The distribution has a long left tail. A t-test on data like this can give misleading p-values because the normality assumption is violated.

Mann-Whitney U: the right tool

Mann-Whitney U is a non-parametric test: it does not assume your scores follow a normal (or any other specific) distribution. Instead of comparing means, it compares ranks.

For every pair of scores (one from prompt A, one from prompt B), it asks: which one is higher? The test statistic counts how often A beats B and how often B beats A. From this it computes a p-value.

Pure Python implementation (no scipy, no numpy):

import math

def mann_whitney_u(scores_a: list[float], scores_b: list[float]) -> float:
    """
    Returns p-value for the null hypothesis that A and B are equal.
    p < 0.05 means the difference is statistically significant.
    """
    n1, n2 = len(scores_a), len(scores_b)
    if n1 == 0 or n2 == 0:
        return 1.0

    # Count how often A beats B
    u1 = sum(
        1 if x > y else 0.5 if x == y else 0
        for x in scores_a
        for y in scores_b
    )

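    # Normal approximation. Note: no tie correction or continuity correction,
    # so p-values are approximate when many scores are tied.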
    mu    = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

    if sigma == 0:
        return 1.0

    z       = (u1 - mu) / sigma
    p_value = 2 * (1 - _normal_cdf(abs(z)))
    return round(max(0.001, min(1.0, p_value)), 4)

def _normal_cdf(x: float) -> float:
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

I validated this implementation against scipy's version on 20 different test vectors. Matches to 3 decimal places.
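
One caveat worth sketching: judge scores are usually integers, so ties are common, and the `sigma` line above uses the formula for untied data. A tie-corrected standard deviation is a standard adjustment. The helper below is a sketch under that assumption; the name `tie_corrected_sigma` is hypothetical and it is not the code TraceMind runs.

```python
import math
from collections import Counter

def tie_corrected_sigma(scores_a: list[float], scores_b: list[float]) -> float:
    """Standard deviation of U under the null hypothesis, adjusted for tied scores."""
    n1, n2 = len(scores_a), len(scores_b)
    n = n1 + n2
    if n < 2:
        return 0.0
    # t = size of each group of identical values in the combined sample
    tie_term = sum(t**3 - t for t in Counter(scores_a + scores_b).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    return math.sqrt(max(variance, 0.0))
```

With no ties the tie term is zero and this reduces to the same `sigma` as above. With ties it is smaller, which makes the resulting p-value smaller too.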

Statistical significance alone is not enough

Here's the trap people fall into after discovering p-values: a statistically significant result is not necessarily a meaningful result.

With enough data, even a 0.1 point improvement becomes statistically significant. But is a 0.1 point difference worth changing your prompt over? Probably not.

This is where effect size comes in. Cohen's d expresses the difference between the two means in units of their pooled standard deviation, so it tells you how large the difference is, not just whether it's real.

import statistics

def cohens_d(scores_a: list[float], scores_b: list[float]) -> float:
    """
    Effect size. Interpretation:
    d < 0.2  → negligible (not worth acting on)
    d < 0.5  → small
    d < 0.8  → medium
    d >= 0.8 → large
    """
    if len(scores_a) < 2 or len(scores_b) < 2:
        return 0.0

    mean_a = statistics.mean(scores_a)
    mean_b = statistics.mean(scores_b)
    var_a  = statistics.variance(scores_a)
    var_b  = statistics.variance(scores_b)
    pooled = math.sqrt((var_a + var_b) / 2)

    return round(abs(mean_b - mean_a) / pooled, 3) if pooled > 0 else 0.0

The complete decision requires both:

p < 0.05 → the difference is statistically real
Cohen's d >= 0.5 → the difference is practically meaningful

Both conditions. Not just one.

Bootstrap confidence intervals: showing uncertainty honestly

Even with significance and effect size, a point estimate like "74% pass rate" hides how uncertain you are. 74% from 10 cases is much less reliable than 74% from 100 cases.

Bootstrap confidence intervals make the uncertainty visible:

import random
import statistics

def bootstrap_ci(
    values:     list[float],
    n_samples:  int   = 2000,
    confidence: float = 0.95,
) -> tuple[float, float]:
    """
    Confidence interval for the mean (95% by default).
    Uses percentile method — no distribution assumptions.
    Deterministic: same input always gives same output.
    """
    if len(values) < 2:
        m = statistics.mean(values) if values else 0.0
        return (m, m)

    rng        = random.Random(42)  # deterministic
    boot_means = []

    for _ in range(n_samples):
        sample = [rng.choice(values) for _ in range(len(values))]
        boot_means.append(statistics.mean(sample))

    boot_means.sort()
    alpha     = 1 - confidence
    lower_idx = int(alpha / 2 * n_samples)
    upper_idx = min(int((1 - alpha / 2) * n_samples), n_samples - 1)  # stay in range

    return (
        round(boot_means[lower_idx], 4),
        round(boot_means[upper_idx], 4),
    )

Now instead of reporting "Prompt B: 74% pass rate" you can report:

Prompt B: 74% ± 8% pass rate (95% CI)

The ± 8% is honest. It tells the person reading the result exactly how confident they should be. With a wide CI, the right answer is "get more test cases before deciding."

Putting it all together

Here's how I combine these three techniques into a complete A/B test result:

def run_ab_test(
    scores_a:  list[float],
    scores_b:  list[float],
    threshold: float = 7.0,
) -> dict:
    """
    Complete A/B test with significance, effect size, and confidence intervals.
    All standard library — no external dependencies.
    """
    avg_a  = statistics.mean(scores_a)
    avg_b  = statistics.mean(scores_b)
    p_val  = mann_whitney_u(scores_a, scores_b)
    d      = cohens_d(scores_a, scores_b)
    ci_a   = bootstrap_ci(scores_a)
    ci_b   = bootstrap_ci(scores_b)

    # Pass rate (proportion scoring at or above threshold)
    pr_a = sum(1 for s in scores_a if s >= threshold) / len(scores_a)
    pr_b = sum(1 for s in scores_b if s >= threshold) / len(scores_b)

    # CI for B's pass rate: bootstrap the 0/1 pass indicators, not the raw scores
    ci_pr_b = bootstrap_ci([1.0 if s >= threshold else 0.0 for s in scores_b])

    significant  = p_val < 0.05
    meaningful   = d >= 0.5
    small_sample = min(len(scores_a), len(scores_b)) < 20

    # Decision logic
    if not significant:
        recommendation = f"No significant difference (p={p_val:.3f}). Keep prompt A."
    elif not meaningful:
        recommendation = f"Significant but negligible effect (d={d:.2f}). Not worth switching."
    else:
        winner = "B" if avg_b > avg_a else "A"
        recommendation = f"Prompt {winner} is better (p={p_val:.3f}, d={d:.2f}). Safe to deploy."

    return {
        "prompt_a": {
            "avg_score": round(avg_a, 2),
            "pass_rate": round(pr_a, 3),
            "ci_95":     ci_a,
            "n":         len(scores_a),
        },
        "prompt_b": {
            "avg_score":    round(avg_b, 2),
            "pass_rate":    round(pr_b, 3),
            "ci_95":        ci_b,
            "pass_rate_fmt": f"{pr_b:.0%} ± {round((ci_pr_b[1] - ci_pr_b[0]) / 2 * 100)}%",
            "n":            len(scores_b),
        },
        "p_value":        p_val,
        "is_significant": significant,
        "effect_size":    d,
        "small_sample":   small_sample,
        "recommendation": recommendation,
    }

Example output:

scores_a = [6, 7, 8, 6, 9, 7, 6, 8, 7, 6]
scores_b = [8, 9, 8, 9, 7, 9, 8, 9, 8, 9]

result = run_ab_test(scores_a, scores_b)

# Abridged: CI fields are omitted and the remaining values are rounded.
# {
#   "prompt_a": {"avg_score": 7.0, "pass_rate": 0.6, "n": 10},
#   "prompt_b": {"avg_score": 8.4, "pass_rate": 1.0, "n": 10},
#   "p_value": 0.008,
#   "is_significant": True,
#   "effect_size": 1.57,
#   "small_sample": True,
#   "recommendation": "Prompt B is better (...). Safe to deploy."
# }

When to use each technique

Just starting out, < 10 test cases:
Don't run statistical tests yet. Collect more cases. Report raw scores with a note that sample size is too small for conclusions.

10-20 test cases:
Run Mann-Whitney U + Cohen's d. Show confidence intervals but warn that they're wide. The result is directional, not definitive.

20+ test cases:
Full analysis. If p < 0.05 and d >= 0.5, you have a real result you can act on.

A key principle: a wide confidence interval is useful information, not a failure. It tells you that you need more data before deciding.
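
For a rough sense of how much more data, here's a back-of-the-envelope sketch. It assumes the CI half-width shrinks in proportion to 1/sqrt(n), which is a rule of thumb and not a guarantee for small or skewed samples. The function name `cases_needed` is an assumption for illustration.

```python
import math

def cases_needed(current_n: int, current_half_width: float, target_half_width: float) -> int:
    """Rough number of test cases needed to shrink a CI half-width to the target."""
    if current_n <= 0 or current_half_width <= 0 or target_half_width <= 0:
        raise ValueError("all arguments must be positive")
    # Halving the width takes roughly four times the data
    scale = (current_half_width / target_half_width) ** 2
    return math.ceil(current_n * scale)

print(cases_needed(current_n=10, current_half_width=0.8, target_half_width=0.4))  # 40
```

Treat the result as an order-of-magnitude estimate, then re-run the bootstrap once the new cases are in.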

Where I use this in practice

I built this into TraceMind — an open source LLM monitoring tool I've been working on. The A/B testing endpoint runs this exact implementation against your golden dataset and returns the full statistical picture.

The whole thing is pure Python stdlib — math, statistics, random. No scipy, no numpy. It runs anywhere Python runs and I can validate it against reference implementations easily.

If you want to use any of these functions, they're MIT licensed and self-contained. Copy them directly.

Summary

Three techniques, all standard library, all together:

Mann-Whitney U instead of t-test: handles non-normal LLM score distributions correctly
Cohen's d alongside p-value: separates statistical significance from practical significance
Bootstrap CI: shows uncertainty honestly so you know when to collect more data

The common mistake is optimizing the wrong thing: making the p-value small when you should be asking whether the difference is worth acting on. Both questions matter.

What does your current prompt evaluation process look like? Curious what other people are using.

TraceMind v3 — I built an AI agent that diagnoses why your LLM quality dropped

Aayush kumarsingh — Tue, 05 May 2026 12:25:42 +0000

Previous posts: v2 — hallucination detection + A/B testing

The most common question I got after v2 was this:

"The hallucination score spiked. Now what?"

TraceMind told you that something broke. It didn't tell you why. And it definitely didn't help you fix it.

That gap is what v3 closes.

If TraceMind is useful to you, a ⭐ on GitHub helps others find it.
GitHub: https://github.com/Aayush-engineer/TraceMind

What's new

Three things shipped in v3:

EvalAgent — a ReAct agent that diagnoses quality regressions
Response Control Hooks — block or retry hallucinated responses automatically
Prompt Version Registry — track which prompt is deployed where

The EvalAgent

This is the main feature. When quality drops, instead of staring at a dashboard, you ask the agent:

"Why is quality dropping on the support dataset?"

The agent runs a loop:

THINK → What do I need to know?
ACT   → Use a tool to get it
OBSERVE → What did the tool show?
REPEAT until I have enough to answer

It has 6 tools: fetch recent traces, run targeted evals, search past failures (semantic search via ChromaDB), generate new test cases, analyze failure patterns, and send alerts.

A real session looks like this:

Step 1: search_similar_failures
→ Found 3 similar past failures (82% match). Last seen 4 days ago.

Step 2: fetch_recent_traces
→ 14 low-quality traces in last 24h. Lowest score: 3.2.

Step 3: analyze_failure_pattern
→ Pattern: multi-step refund questions with policy constraints
  Root cause: prompt doesn't specify what to do when policy is ambiguous
  Fix: add explicit fallback instruction for edge cases

Step 4: generate_test_cases
→ Generated 5 adversarial cases covering this failure mode

ANSWER: Quality dropped because the prompt has no fallback for ambiguous
policy questions. Generated 5 test cases to cover this. Recommended fix:
add "If policy is unclear, say: I'll check and follow up" to your prompt.

That's the complete investigation — 4 tool calls, 45 seconds, specific root cause, specific fix, new test cases already added to the dataset.

The architecture decision: text-based ReAct, not native tool calling

I had two options for the agent loop.

Option A — Anthropic/OpenAI native tool calling: cleaner, more reliable JSON, the model calls tools directly.

Option B — Text-based ReAct: model outputs TOOL: name\nINPUT: {...}, I parse it.

I went with Option B because I'm running on Groq's free tier (llama-3.1-8b-instant), and native tool calling on smaller open models is unreliable — the model frequently hallucinates tool names or produces malformed schemas. Text-based ReAct is more forgiving and easier to debug when something goes wrong.

The tradeoff: I have to parse the output myself, and occasionally the model produces text that doesn't match the TOOL: / ANSWER: pattern. I handle that with a fallback that appends the raw response to context and retries.
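
To make the parsing side concrete, here's a minimal sketch of what a parser for the TOOL: / INPUT: / ANSWER: format could look like. The `Step` class, the regexes, and `parse_step` are assumptions for illustration, not TraceMind's actual parser.

```python
import json
import re
from dataclasses import dataclass

@dataclass
class Step:
    kind: str               # "tool", "answer", or "unparsed"
    name: str = ""          # tool name when kind == "tool"
    payload: object = None  # parsed JSON input, answer text, or raw text

TOOL_RE = re.compile(r"TOOL:\s*(\w+)\s*\nINPUT:\s*(\{.*\})", re.DOTALL)
ANSWER_RE = re.compile(r"ANSWER:\s*(.+)", re.DOTALL)

def parse_step(text: str) -> Step:
    """Classify one model turn as a tool call, a final answer, or unparsed text."""
    tool = TOOL_RE.search(text)
    if tool:
        try:
            return Step("tool", tool.group(1), json.loads(tool.group(2)))
        except json.JSONDecodeError:
            return Step("unparsed", payload=text)  # malformed JSON: let the caller retry
    answer = ANSWER_RE.search(text)
    if answer:
        return Step("answer", payload=answer.group(1).strip())
    return Step("unparsed", payload=text)
```

An "unparsed" step is the cue for the fallback: append the raw text to the context and ask the model again.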

Memory: 4 types

The agent isn't stateless. Between runs it maintains:

Semantic memory — ChromaDB stores embeddings of every past failure. When a new failure arrives, the agent searches for similar past failures and their resolutions. If this exact problem was solved 3 weeks ago, the agent finds it.

Episodic memory — The last 5 agent runs for each project are stored in Postgres. New runs start with context from previous investigations.

Project context — Loaded at agent init. The agent knows what kind of system it's investigating.

In-context working memory — The scratchpad of tool results that accumulates during a single run.

Most agents only have the last one. The semantic + episodic layers are what make investigations get faster over time.

Response Control Hooks

This closes the loop on hallucination detection.

Before v3: TraceMind detected a high-risk response. You logged it. Nothing happened.

Now:

from tracemind import TraceMind, HallucinationPolicy

tm = TraceMind(api_key="...", project="my-app")

# Built-in policies — safe defaults out of the box
tm.response_control.set_policy("critical", HallucinationPolicy.BLOCK)
tm.response_control.set_policy("high",     HallucinationPolicy.BLOCK)
tm.response_control.set_policy("medium",   HallucinationPolicy.FLAG)

# Or custom callback for your specific logic
@tm.response_control.on("critical")
def handle_critical(event):
    alert_oncall(f"Critical hallucination in {event.span_name}")
    return "I'm not confident in this answer. Please contact support."

# Your existing code, unchanged
@tm.trace("support_handler")
def handle_ticket(ticket: str) -> str:
    return your_llm.complete(ticket)
# If response is critical-risk → HallucinationBlocked raised automatically

The design principle here came from a comment on my v2 post from @sunychoudhary: teams that get full flexibility usually implement no policy at all. So the defaults ship with something safe, and you override what you need.

Prompt Version Registry

Every deployed prompt is now versioned:

POST /api/prompts/{prompt_name}/versions
{
  "content": "You are a professional support agent. Be empathetic and precise.",
  "tags": ["production", "v2.3"]
}
# → { "version_id": "support:v3" }

When quality drops, you can correlate it with which prompt version was deployed at that timestamp. This answers "did the regression start when we changed the prompt?" without manually digging through git history.
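
The correlation itself is a simple lookup once you have deployment timestamps. Here's a sketch of the idea. The `version_at` helper and the deployment history are hypothetical, and it doesn't use TraceMind's registry API.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

def version_at(deployments: List[Tuple[datetime, str]], when: datetime) -> Optional[str]:
    """Return the version deployed most recently at or before `when`.

    `deployments` must be sorted by timestamp, oldest first.
    """
    timestamps = [ts for ts, _ in deployments]
    idx = bisect_right(timestamps, when)
    return deployments[idx - 1][1] if idx > 0 else None

# Hypothetical deployment history
history = [
    (datetime(2026, 4, 1, 9, 0), "support:v1"),
    (datetime(2026, 4, 20, 14, 30), "support:v2"),
    (datetime(2026, 5, 2, 8, 15), "support:v3"),
]
print(version_at(history, datetime(2026, 4, 25)))  # support:v2
```

Run this against the timestamp of the first low-scoring trace and you know which prompt version to diff.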

What I got wrong in v2 (and fixed)

The inputs["project_id"] bug — The agent would call fetch_recent_traces but the LLM sometimes omitted project_id from the tool input JSON. The function did inputs["project_id"] — hard key access — so it crashed with a KeyError instead of falling back to the agent's own project ID.

The fix: pid = inputs.get("project_id") or project_id and pass project_id through the call chain. Obvious in hindsight. The pattern for all tool inputs is now .get() with fallbacks throughout.

The float parse crash — The worker that auto-scores spans sent max_tokens=5 to get a single number back. Sometimes the model returned "3\n\nThe response is...". The code did float(result.strip()) and crashed.

The fix: float(result.strip().split()[0].rstrip('.')) — take only the first token.
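
A slightly more defensive variant is sketched below, for replies like "Score: 7" or an empty string, where taking the first token would still fail. The name `parse_score` and the 1-10 range check are assumptions for illustration, not the code in the worker.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def parse_score(raw: str, lo: float = 1.0, hi: float = 10.0) -> Optional[float]:
    """Return the first number in the judge's reply, or None if there is none in range."""
    match = _NUMBER.search(raw)
    if match is None:
        return None
    score = float(match.group())
    return score if lo <= score <= hi else None
```

Returning None instead of raising lets the worker skip the span and try again on the next poll.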

Both bugs were caught by the verify suite (verify_all.py) before I noticed them in logs.

Numbers

44/44 verification checks passing
76 unit tests
8 iterations average per agent run
~45 seconds for a complete investigation
<1ms SDK overhead (batched, non-blocking)
$0 — runs entirely on Groq free tier

Try it

git clone https://github.com/Aayush-engineer/tracemind
cd tracemind && cp .env.example .env
# Add GROQ_API_KEY (free at console.groq.com)
docker-compose up

Or hit the hosted demo: tracemind.onrender.com/docs (free tier, ~30s cold start)

pip install tracemind-sdk

from tracemind import TraceMind
tm = TraceMind(
    api_key  = "ef_live_...",
    project  = "my-app",
    base_url = "https://tracemind.onrender.com"
)

@tm.trace("llm_call")
def your_function(msg):
    return your_llm.complete(msg)  # unchanged

What I'd still do differently

The agent uses text-based ReAct which occasionally misfires on smaller models. Native tool calling with a model that supports it reliably (Llama 3.3 70B, Mixtral) would be more robust — but that's beyond Groq's free tier limits for my use case.

The semantic memory searches all past failures globally across projects. It should be scoped per project first. On a shared instance with many projects, cross-project signal is mostly noise.

What's next

Ollama integration — run entirely local, no API key
Hosted cloud version — 1 project, 1000 spans/month free
LlamaIndex callback

If you're building with LLMs and something breaks in a way that doesn't show up in your error logs — that's exactly the problem TraceMind is for. Would genuinely value feedback on whether the agent investigations are useful in practice, or just interesting in theory.

The gap between detecting hallucinations and handling them

Aayush kumarsingh — Wed, 15 Apr 2026 13:43:15 +0000

After posting about TraceMind's hallucination detection, someone left
a comment that stopped me.

Suny Choudhary wrote: "the harder issue is what happens after
detection. Whether the system can handle that uncertainty correctly —
retry, validate, or block actions."

He's right. And it exposed a gap I hadn't thought through.

Right now TraceMind detects hallucinations. You get this back:

{
  "has_hallucinations": True,
  "overall_risk": "high",
  "claims": [{
    "claim": "We offer 60-day refunds",
    "type": "factual_contradiction",
    "evidence": "context says 30-day refunds only"
  }]
}

And then... nothing. You have to decide what to do with it.

The problem is "what to do" is completely application-specific.

A customer support bot should probably retry with a more
conservative prompt. The user is waiting for an answer.

A legal document analyzer should block and escalate to a human.
A wrong answer has real consequences.

A coding assistant might just flag it with low confidence. The
developer will review the code anyway.

You can't hardcode the right behavior at the detection layer because
it depends on context the detection layer doesn't have.

My current thinking for v3: opinionated defaults with override hooks.

Three built-in policies:

block — don't return the response
retry — re-run the LLM call with a safer prompt
flag — return the response with a warning attached

Override any of them:

@tm.on_hallucination(risk="high")
def my_policy(claim, context):
    if context.domain == "legal":
        return Policy.BLOCK
    return Policy.FLAG

Teams get safe defaults on day one. Teams with specific workflows
customize exactly what they need.

This isn't shipped yet. It's a design I'm thinking through based on
real feedback.
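
To make the idea concrete, here's one way defaults plus overrides could be wired up. This is a design sketch only: `PolicyRegistry`, `DEFAULTS`, and `decide` are hypothetical names, and the default mapping is an assumption.

```python
from enum import Enum
from typing import Callable, Dict, Optional

class Policy(Enum):
    BLOCK = "block"
    RETRY = "retry"
    FLAG = "flag"

# Assumed defaults: stricter handling for higher risk
DEFAULTS: Dict[str, Policy] = {"high": Policy.BLOCK, "medium": Policy.FLAG, "low": Policy.FLAG}

class PolicyRegistry:
    def __init__(self) -> None:
        self._overrides: Dict[str, Callable[[dict], Optional[Policy]]] = {}

    def on_hallucination(self, risk: str):
        """Decorator that registers a custom handler for one risk level."""
        def register(fn):
            self._overrides[risk] = fn
            return fn
        return register

    def decide(self, risk: str, context: dict) -> Policy:
        """Use the override if it returns a policy, otherwise the default."""
        handler = self._overrides.get(risk)
        chosen = handler(context) if handler else None
        return chosen if chosen is not None else DEFAULTS.get(risk, Policy.FLAG)

registry = PolicyRegistry()

@registry.on_hallucination(risk="high")
def legal_only(context: dict) -> Optional[Policy]:
    return Policy.BLOCK if context.get("domain") == "legal" else Policy.FLAG
```

A team that registers nothing still gets the defaults, which is the whole point.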

If you're building with LLMs and have dealt with this problem — what
did you actually do when your AI hallucinated in production?

GitHub: github.com/Aayush-engineer/tracemind

TraceMind v2 — I added hallucination detection and A/B testing to my open-source LLM eval platform

Aayush kumarsingh — Tue, 14 Apr 2026 05:39:42 +0000

What changed since v1

When I posted the first version of TraceMind, I got one clear piece of feedback: "this is useful but I need to know if my AI is making things up, not just scoring low."

So I built hallucination detection. Then while building it I realized I needed a way to compare prompts systematically. So I built A/B testing too.

Here's what's new and how I built it.

The original problem (unchanged)

I was building a multi-agent orchestration system. Three days after deploying, I changed a system prompt. Quality dropped from 84% to 52%. I found out 11 days later from a user complaint.

TraceMind was built to catch this on day zero.

What's new in v2

Hallucination detection

The endpoint takes a question, the AI's response, and optional ground truth context. It extracts individual claims from the response, checks each one against the context, and returns a structured result:

{
  "has_hallucinations": True,
  "overall_risk": "high",
  "claims": [
    {
      "claim": "We offer 60-day refunds",
      "verdict": "hallucination",
      "reason": "Context says 30-day refunds only"
    }
  ]
}

The key architectural decision: claim extraction and verification are separate LLM calls. The first call extracts atomic claims. The second verifies each claim against ground truth. This is more reliable than asking one model to do both.
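
Here's a minimal sketch of that two-stage shape. The two LLM calls are passed in as plain functions so the example runs without a model. `check_response`, `fake_extract`, and `fake_verify` are hypothetical stand-ins, and the `overall_risk` field is left out because how it is computed isn't covered here.

```python
from typing import Callable, List, Optional

def check_response(
    response: str,
    context: str,
    extract_claims: Callable[[str], List[str]],
    verify_claim: Callable[[str, str], Optional[str]],
) -> dict:
    """Stage 1 extracts atomic claims; stage 2 checks each claim against the context.

    `verify_claim` returns a reason string when the claim is not supported,
    or None when it is.
    """
    findings = []
    for claim in extract_claims(response):       # first LLM call
        reason = verify_claim(claim, context)    # second LLM call, once per claim
        if reason is not None:
            findings.append({"claim": claim, "verdict": "hallucination", "reason": reason})
    return {"has_hallucinations": bool(findings), "claims": findings}

# Stand-in functions so the sketch runs without an LLM
def fake_extract(response: str) -> List[str]:
    return [s.strip() for s in response.split(".") if s.strip()]

def fake_verify(claim: str, context: str) -> Optional[str]:
    return None if claim in context else "Not supported by context"

result = check_response(
    "We offer 60-day refunds. Shipping is free.",
    "Shipping is free. We offer 30-day refunds.",
    fake_extract,
    fake_verify,
)
print(result["has_hallucinations"], len(result["claims"]))  # True 1
```

Keeping the stages separate also means each one can be tested and swapped on its own.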

Prompt A/B testing

You give it two system prompts and a dataset. It runs both prompts against every test case and compares results.

The interesting part is the statistical layer. A naive implementation would just compare average scores. But with small datasets (5-20 cases), average score differences are often noise. I added a Mann-Whitney U test and Cohen's d to give a confidence score on whether prompt B is actually better or just randomly different.

{
  "prompt_a_score": 6.2,
  "prompt_b_score": 8.1,
  "winner": "B",
  "confidence": "high",
  "cohen_d": 1.4,
  "p_value": 0.03
}

Verification suite

I built a 44-test verification script covering all 11 feature areas. Running python verify_all.py hits every endpoint end-to-end against a real running server and reports pass/fail. This was more useful than unit tests for catching integration issues.

What I'd still do differently

The same things from v1, plus one new one: the hallucination detection is synchronous. For production use it should be a background job like span scoring. A user with 1000 traces would need to wait for each one — that doesn't scale.

Try it

GitHub: https://github.com/Aayush-engineer/tracemind

pip install tracemind
from tracemind import TraceMind
tm = TraceMind(api_key="...", project="my-app",
               base_url="https://tracemind.onrender.com")

@tm.trace("llm_call")
def your_function(msg): ...  # unchanged

Self-hosted, free, no vendor lock-in.

If you're building with LLMs — I'd genuinely love to know
what breaks when you try it.

I built an open-source LLM eval platform with a ReAct agent that diagnoses quality regressions

Aayush kumarsingh — Thu, 09 Apr 2026 11:45:30 +0000

The problem that made me build this

I was building a multi-agent orchestration system. It worked great
in testing. I deployed it. Three days later I changed a system prompt.
Quality dropped from 84% to 52%. I found out 11 days later when a
user complained.

This is the most common failure mode in LLM applications. Unlike
traditional software where a bug throws an exception, bad LLM outputs
look like valid responses. They just happen to be wrong, unhelpful,
or unsafe. You need systematic measurement to catch this.

I looked for existing tools. Langfuse is good but expensive at scale for self-hosted teams.
Braintrust doesn't have a free self-hosted option. Helicone doesn't do
evals. I built TraceMind.

What TraceMind does

Three things:

1. Automatic quality scoring
Every LLM response is scored 1-10 by another LLM acting as judge
(LLM-as-judge pattern). I use Groq's free tier — llama-3.1-8b-instant
for fast scoring, llama-3.3-70b for deep analysis. The score runs in
the background, never blocking your application.

2. Golden dataset evals
You define expected behaviors once:

ds = tm.dataset("support-v1")
ds.add("I want a refund", expected="acknowledge and ask for order number")
ds.push()

result = tm.run_eval("support-v1", function=your_agent.run)
result.wait()
print(f"Pass rate: {result.pass_rate:.0%}")  # Pass rate: 87%

3. AI agent that diagnoses regressions
This is the part I'm most proud of. You can ask:

"Why did quality drop yesterday?"
"What are the most common failure patterns?"
"Generate test cases for billing question failures"

The agent implements the ReAct pattern with 6 tools and 4 memory types.

The architecture decisions that matter

Parallel eval execution with asyncio.Semaphore

The naive approach runs LLM judge calls sequentially.
At 500ms each, 100 test cases take 50 seconds.

I use asyncio.Semaphore(3) to run 3 evaluations concurrently:

semaphore = asyncio.Semaphore(max_concurrent)
tasks = [run_case(ex, system_fn, criteria, semaphore) for ex in examples]
for coro in asyncio.as_completed(tasks):
    result = await coro

100 cases now take ~17 seconds. The semaphore limit exists because
Groq's free tier has rate limits, so I tuned it to stay under the threshold.
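
Here's a self-contained sketch of the same pattern that you can run without an LLM. `run_case` and `run_all` here are simplified stand-ins, with `asyncio.sleep` in place of the judge call; they are not TraceMind's actual functions.

```python
import asyncio

async def run_case(case_id: int, semaphore: asyncio.Semaphore, delay: float = 0.01) -> int:
    """Pretend to score one test case; the semaphore caps how many run at once."""
    async with semaphore:
        await asyncio.sleep(delay)  # stands in for the LLM judge call
        return case_id

async def run_all(n_cases: int, max_concurrent: int = 3) -> list[int]:
    semaphore = asyncio.Semaphore(max_concurrent)  # created inside the running loop
    tasks = [run_case(i, semaphore) for i in range(n_cases)]
    results = []
    for coro in asyncio.as_completed(tasks):
        results.append(await coro)  # arrives in completion order, not submission order
    return results

results = asyncio.run(run_all(9))
print(sorted(results))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Raising `max_concurrent` speeds things up until you hit the provider's rate limit, so it's worth treating as a tunable.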

The ReAct agent with semantic memory

The agent has 4 memory types:

In-context: conversation history within the session
External KV: project config from database
Semantic: past failures in ChromaDB with sentence-transformers embeddings
Episodic: past agent run results in SQLite

When you ask "why did quality drop?", the agent:

Searches ChromaDB semantically for similar past failures
Fetches recent low-scoring traces from the database
Runs a targeted eval on the failure category
Uses Opus-equivalent model to analyze root cause
Generates new test cases to prevent future recurrence

I intentionally avoided LangChain. The ReAct loop is 80 lines of
readable Python. When something breaks at 3am, you want to read
your own code.

Background worker for async scoring

The HTTP ingestion endpoint returns in <10ms regardless of batch size.
Scoring runs in a background worker that polls every 10 seconds:

async def _score_unscored_spans(self):
    spans = fetch_unscored(limit=20)
    for span in spans:
        score = await self._score_span(span.input, span.output)
        save_score(span.id, score)

The worst thing an observability tool can do is slow down the system
it's monitoring. Scoring is completely decoupled from ingestion.

Local embeddings — no OpenAI dependency

I use sentence-transformers all-MiniLM-L6-v2 for ChromaDB embeddings.
It runs locally, downloads once (~90MB), works offline, zero API cost.
This was a deliberate choice — I wanted the tool to work completely
free with no external dependencies beyond Groq for LLM calls.

What I'd do differently in production

Multi-tenancy: Row-level security instead of project-level isolation
Celery + Redis instead of asyncio background worker for horizontal scaling
Streaming eval results via WebSocket — see case-by-case progress in real time
Alembic migrations from day one (I added these later)

Try it

Live demo: https://tracemind.vercel.app
GitHub: https://github.com/Aayush-engineer/tracemind

3-line setup:

pip install tracemind
from tracemind import TraceMind
tm = TraceMind(api_key="...", project="my-app", 
               base_url="https://tracemind.onrender.com")

@tm.trace("llm_call")
def your_function(msg): ...  # your code unchanged

If you're building with LLMs and want to know if they're actually
working — I'd love feedback.