Aayush kumarsingh

TraceMind v2 — I added hallucination detection and A/B testing to my open-source LLM eval platform

What changed since v1

When I posted the first version of TraceMind, I got one clear piece of feedback: "this is useful but I need to know if my AI is making things up, not just scoring low."

So I built hallucination detection. Then while building it I realized I needed a way to compare prompts systematically. So I built A/B testing too.

Here's what's new and how I built it.


The original problem (unchanged)

I was building a multi-agent orchestration system. Three days after deploying, I changed a system prompt. Quality dropped from 84% to 52%. I found out 11 days later from a user complaint.

TraceMind was built to catch this on day zero.


What's new in v2

Hallucination detection

The endpoint takes a question, the AI's response, and optional ground truth context. It extracts individual claims from the response, checks each one against the context, and returns a structured result:

{
  "has_hallucinations": true,
  "overall_risk": "high",
  "claims": [
    {
      "claim": "We offer 60-day refunds",
      "verdict": "hallucination",
      "reason": "Context says 30-day refunds only"
    }
  ]
}

The key architectural decision: claim extraction and verification are separate LLM calls. The first call extracts atomic claims. The second verifies each claim against ground truth. This is more reliable than asking one model to do both.
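That two-call split can be sketched like this. The function names, prompts, and the pluggable `call_llm` client are illustrative, not TraceMind's actual internals:

```python
import json

def extract_claims(call_llm, response_text):
    """Call 1: break the response into atomic, independently checkable claims."""
    prompt = (
        "List every factual claim in the following text as a JSON array "
        f"of strings.\n\nText: {response_text}"
    )
    return json.loads(call_llm(prompt))

def verify_claims(call_llm, claims, context):
    """Call 2: check each claim against the ground-truth context."""
    results = []
    for claim in claims:
        prompt = (
            f"Context: {context}\nClaim: {claim}\n"
            'Answer "supported" or "hallucination", then a reason, separated by "|".'
        )
        verdict, _, reason = call_llm(prompt).partition("|")
        results.append({
            "claim": claim,
            "verdict": verdict.strip(),
            "reason": reason.strip(),
        })
    return results

def detect_hallucinations(call_llm, response_text, context):
    claims = verify_claims(call_llm, extract_claims(call_llm, response_text), context)
    flagged = [c for c in claims if c["verdict"] == "hallucination"]
    return {"has_hallucinations": bool(flagged), "claims": claims}
```

Keeping `call_llm` as a parameter also makes the pipeline trivially testable with a stubbed model.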

Prompt A/B testing

You give it two system prompts and a dataset. It runs both prompts against every test case and compares results.

The interesting part is the statistical layer. A naive implementation would just compare average scores, but with small datasets (5-20 cases), average score differences are often noise. I added a Mann-Whitney U test and Cohen's d to give a confidence score on whether prompt B is actually better or just randomly different.

{
  "prompt_a_score": 6.2,
  "prompt_b_score": 8.1,
  "winner": "B",
  "confidence": "high",
  "cohen_d": 1.4,
  "p_value": 0.03
}
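Both statistics are cheap to compute. Here's a pure-Python sketch (normal-approximation p-value, no tie correction), not TraceMind's exact implementation:

```python
import math

def cohens_d(a, b):
    """Effect size: standardized mean difference using the pooled std dev."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (mb - ma) / pooled

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U with a normal approximation for the p-value.
    Good enough as a sketch; very small samples deserve an exact test."""
    combined = sorted((v, i) for i, v in enumerate(a + b))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank for tied values (1-based)
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg
        i = j + 1
    na, nb = len(a), len(b)
    r_a = sum(ranks[:na])              # a's elements come first in a + b
    u_a = r_a - na * (na + 1) / 2
    u = min(u_a, na * nb - u_a)
    mu = na * nb / 2
    sigma = math.sqrt(na * nb * (na + nb + 1) / 12)
    z = (u - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u, p
```

In production you'd reach for `scipy.stats.mannwhitneyu`, which handles ties and exact p-values properly.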

Verification suite

I built a 44-test verification script covering all 11 feature areas. Running python verify_all.py hits every endpoint end-to-end against a real running server and reports pass/fail. This was more useful than unit tests for catching integration issues.
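A stripped-down version of that pattern looks like this. The base URL, route paths, and payloads here are placeholders, not TraceMind's real API surface:

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # assumed local server address

def check(name, method, path, payload=None):
    """Hit one endpoint and record pass/fail instead of raising."""
    try:
        data = json.dumps(payload).encode() if payload is not None else None
        req = urllib.request.Request(
            BASE + path, data=data, method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            return (name, resp.status == 200)
    except Exception:
        return (name, False)

def summarize(results):
    passed = sum(1 for _, ok in results if ok)
    return f"{passed}/{len(results)} checks passed"

def run_suite():
    results = [
        check("health", "GET", "/health"),
        check("hallucination", "POST", "/hallucination/check",
              {"question": "Refund policy?", "response": "We offer 60-day refunds",
               "context": "30-day refunds only"}),
    ]
    print(summarize(results))
```

The payoff over unit tests is that every check exercises the full stack: routing, serialization, and the database, not just one function.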


What I'd still do differently

The same things from v1, plus one new one: the hallucination detection is synchronous. For production use it should be a background job like span scoring. A user with 1000 traces would need to wait for each one — that doesn't scale.
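One way to move it off the request path without new infrastructure is a worker thread draining a queue. A minimal sketch, where `detect` stands in for the actual hallucination check:

```python
import queue
import threading

def start_worker(detect, results):
    """Run the (slow) per-trace check in the background instead of per-request."""
    jobs = queue.Queue()

    def worker():
        while True:
            trace_id = jobs.get()
            if trace_id is None:        # sentinel: shut down cleanly
                jobs.task_done()
                break
            results[trace_id] = detect(trace_id)
            jobs.task_done()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return jobs, t
```

With this shape the API endpoint just enqueues a trace ID and returns immediately; a real deployment would swap the in-process queue for something durable like Celery or a database-backed job table.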


Try it



GitHub: https://github.com/Aayush-engineer/tracemind

pip install tracemind
from tracemind import TraceMind
tm = TraceMind(api_key="...", project="my-app",
               base_url="https://tracemind.onrender.com")

@tm.trace("llm_call")
def your_function(msg): ...  # unchanged

Self-hosted, free, no vendor lock-in.

If you're building with LLMs, I'd genuinely love to know what breaks when you try it.

Top comments (1)

Suny Choudhary

Nice update.

Hallucination detection is useful, but I’ve found the harder issue is what happens after detection. In real workflows, the question becomes whether the system can handle that uncertainty correctly: retry, validate, or block actions.

Curious if you’re planning anything around response control or just focusing on evaluation for now.