What changed since v1
When I posted the first version of TraceMind, I got one clear piece of feedback: "this is useful but I need to know if my AI is making things up, not just scoring low."
So I built hallucination detection. Then while building it I realized I needed a way to compare prompts systematically. So I built A/B testing too.
Here's what's new and how I built it.
The original problem (unchanged)
I was building a multi-agent orchestration system. Three days after deploying, I changed a system prompt. Quality dropped from 84% to 52%. I found out 11 days later from a user complaint.
TraceMind was built to catch this on day zero.
What's new in v2
Hallucination detection
The endpoint takes a question, the AI's response, and optional ground truth context. It extracts individual claims from the response, checks each one against the context, and returns a structured result:
{
  "has_hallucinations": true,
  "overall_risk": "high",
  "claims": [
    {
      "claim": "We offer 60-day refunds",
      "verdict": "hallucination",
      "reason": "Context says 30-day refunds only"
    }
  ]
}
The key architectural decision: claim extraction and verification are separate LLM calls. The first call extracts atomic claims. The second verifies each claim against ground truth. This is more reliable than asking one model to do both.
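That two-call split can be sketched as a small pipeline. This is a minimal illustration with a pluggable `llm` callable and hypothetical prompt wording — not TraceMind's actual prompts or internals:

```python
def detect_hallucinations(question, response, context, llm):
    """Two-pass check: extract atomic claims, then verify each one.

    `llm` is any callable mapping a prompt string to a text completion;
    swap in your own model client. Prompts here are illustrative.
    """
    # Pass 1: extract one atomic claim per line from the response
    extraction = llm(
        f"List each factual claim in this response, one per line:\n{response}"
    )
    claims = [line.strip() for line in extraction.splitlines() if line.strip()]

    # Pass 2: verify each claim independently against ground truth
    results = []
    for claim in claims:
        verdict = llm(
            f"Context:\n{context}\n\nClaim: {claim}\n"
            "Answer 'supported' or 'hallucination'."
        ).strip().lower()
        results.append({"claim": claim, "verdict": verdict})

    return {
        "has_hallucinations": any(
            r["verdict"] == "hallucination" for r in results
        ),
        "claims": results,
    }
```

Keeping extraction and verification separate means each verification prompt carries exactly one claim and the ground truth, so the model can't blur claims together.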
Prompt A/B testing
You give it two system prompts and a dataset. It runs both prompts against every test case and compares results.
The interesting part is the statistical layer. A naive implementation would just compare average scores. But with small datasets (5-20 cases), average score differences are often noise. I added a Mann-Whitney U test and Cohen's d to give a confidence score on whether prompt B is actually better or just randomly different.
{
  "prompt_a_score": 6.2,
  "prompt_b_score": 8.1,
  "winner": "B",
  "confidence": "high",
  "cohen_d": 1.4,
  "p_value": 0.03
}
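A minimal sketch of that comparison layer, using only the standard library. The confidence thresholds here are my own illustrative choices, and the p-value uses the normal approximation to the U distribution — not necessarily what TraceMind does internally:

```python
import math
from statistics import mean, stdev

def cohens_d(a, b):
    # Effect size: mean difference scaled by the pooled standard deviation
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    )
    return (mean(b) - mean(a)) / pooled

def mann_whitney_u(a, b):
    # U statistic: count of (a, b) pairs where b wins; ties count half
    return sum(
        1.0 if y > x else 0.5 if y == x else 0.0 for x in a for y in b
    )

def compare_prompts(scores_a, scores_b):
    na, nb = len(scores_a), len(scores_b)
    u = mann_whitney_u(scores_a, scores_b)
    # Two-sided p-value via the normal approximation of U
    mu = na * nb / 2
    sigma = math.sqrt(na * nb * (na + nb + 1) / 12)
    z = (u - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    d = cohens_d(scores_a, scores_b)
    winner = "B" if mean(scores_b) > mean(scores_a) else "A"
    # Illustrative thresholds: significant p AND a large effect size
    confidence = (
        "high" if p < 0.05 and abs(d) >= 0.8
        else "medium" if p < 0.1
        else "low"
    )
    return {
        "winner": winner,
        "cohen_d": round(d, 2),
        "p_value": round(p, 3),
        "confidence": confidence,
    }
```

Requiring both a small p-value and a large effect size before reporting "high" confidence is what keeps a lucky 0.3-point average gap on a 10-case dataset from being declared a winner.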
Verification suite
I built a 44-test verification script covering all 11 feature areas. Running python verify_all.py hits every endpoint end-to-end against a real running server and reports pass/fail. This was more useful than unit tests for catching integration issues.
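The skeleton of such a runner is simple — a name-to-check map and a pass/fail tally. This is an illustrative sketch of the pattern, not the contents of verify_all.py (which hits live HTTP endpoints):

```python
def run_suite(checks):
    """Run named checks and report PASS/FAIL for each.

    `checks` maps a test name to a zero-arg callable that raises on
    failure -- e.g. a function that POSTs to an endpoint and asserts
    on the response.
    """
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "PASS"
        except Exception as exc:
            results[name] = f"FAIL: {exc}"
    passed = sum(1 for v in results.values() if v == "PASS")
    print(f"{passed}/{len(results)} checks passed")
    return results
```

Because every check exercises a real server end-to-end, a broken route, serializer, or migration shows up as a FAIL line immediately, which is exactly the class of bug unit tests tend to miss.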
What I'd still do differently
The same things from v1, plus one new one: the hallucination detection is synchronous. For production use it should be a background job like span scoring. A user with 1000 traces would need to wait for each one — that doesn't scale.
Try it


GitHub: https://github.com/Aayush-engineer/tracemind
pip install tracemind

from tracemind import TraceMind

tm = TraceMind(
    api_key="...",
    project="my-app",
    base_url="https://tracemind.onrender.com",
)

@tm.trace("llm_call")
def your_function(msg): ...  # unchanged
Self-hosted, free, no vendor lock-in.
If you're building with LLMs, I'd genuinely love to know what breaks when you try it.