Previous posts: v2 — hallucination detection + A/B testing
The most common question I got after v2 was this:
"The hallucination score spiked. Now what?"
TraceMind told you that something broke. It didn't tell you why. And it definitely didn't help you fix it.
That gap is what v3 closes.
If TraceMind is useful to you, a ⭐ on GitHub helps others find it.
GitHub: https://github.com/Aayush-engineer/TraceMind
What's new
Three things shipped in v3:
- EvalAgent — a ReAct agent that diagnoses quality regressions
- Response Control Hooks — block or retry hallucinated responses automatically
- Prompt Version Registry — track which prompt is deployed where
The EvalAgent
This is the main feature. When quality drops, instead of staring at a dashboard, you ask the agent:
"Why is quality dropping on the support dataset?"
The agent runs a loop:
THINK → What do I need to know?
ACT → Use a tool to get it
OBSERVE → What did the tool show?
REPEAT until I have enough to answer
It has 6 tools: fetch recent traces, run targeted evals, search past failures (semantic search via ChromaDB), generate new test cases, analyze failure patterns, and send alerts.
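Under the hood the loop is small. Here's a minimal sketch of the THINK → ACT → OBSERVE cycle, not the actual TraceMind implementation (llm is any callable that maps a prompt to text, tools maps names to callables, and parse_tool_call is sketched further down):

# Minimal sketch of the ReAct loop (not the actual TraceMind code).
def investigate(question, llm, tools, max_steps=10):
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        output = llm(scratchpad)                    # THINK: model picks the next step
        if output.startswith("ANSWER:"):
            return output                           # enough evidence gathered
        action = parse_tool_call(output)            # parse TOOL: / INPUT: lines (see below)
        if action is None:
            scratchpad += f"{output}\n(Not a valid tool call, try again.)\n"
            continue                                # fallback: keep raw output in context, retry
        name, args = action
        observation = tools[name](args)             # ACT: run the tool
        scratchpad += f"{output}\nOBSERVATION: {observation}\n"   # OBSERVE
    return "ANSWER: ran out of steps without a conclusion"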
A real session looks like this:
Step 1: search_similar_failures
→ Found 3 similar past failures (82% match). Last seen 4 days ago.
Step 2: fetch_recent_traces
→ 14 low-quality traces in last 24h. Lowest score: 3.2.
Step 3: analyze_failure_pattern
→ Pattern: multi-step refund questions with policy constraints
Root cause: prompt doesn't specify what to do when policy is ambiguous
Fix: add explicit fallback instruction for edge cases
Step 4: generate_test_cases
→ Generated 5 adversarial cases covering this failure mode
ANSWER: Quality dropped because the prompt has no fallback for ambiguous
policy questions. Generated 5 test cases to cover this. Recommended fix:
add "If policy is unclear, say: I'll check and follow up" to your prompt.
That's the complete investigation — 4 tool calls, 45 seconds, specific root cause, specific fix, new test cases already added to the dataset.
The architecture decision: text-based ReAct, not native tool calling
I had two options for the agent loop.
Option A — Anthropic/OpenAI native tool calling: cleaner, more reliable JSON, the model calls tools directly.
Option B — Text-based ReAct: the model outputs TOOL: name and INPUT: {...} lines, and I parse them.
I went with Option B because I'm running on Groq's free tier (llama-3.1-8b-instant), and native tool calling on smaller open models is unreliable — the model frequently hallucinates tool names or produces malformed schemas. Text-based ReAct is more forgiving and easier to debug when something goes wrong.
The tradeoff: I have to parse the output myself, and occasionally the model produces text that doesn't match the TOOL: / ANSWER: pattern. I handle that with a fallback that appends the raw response to context and retries.
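Concretely, the parsing side looks roughly like this. It's a sketch, not the exact code; the regex and the fallback behavior are simplified:

import json
import re

# Sketch of the TOOL: / INPUT: parser used in the loop above (simplified).
TOOL_RE = re.compile(r"TOOL:\s*(\w+)\s*INPUT:\s*(\{.*\})", re.DOTALL)

def parse_tool_call(text):
    match = TOOL_RE.search(text)
    if match is None:
        return None               # no TOOL:/INPUT: pair; caller appends raw text and retries
    try:
        args = json.loads(match.group(2))
    except json.JSONDecodeError:
        args = {}                 # malformed JSON: run the tool with defaults rather than crash
    return match.group(1), args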
Memory: 4 types
The agent isn't stateless. Between runs it maintains:
Semantic memory — ChromaDB stores embeddings of every past failure. When a new failure arrives, the agent searches for similar past failures and their resolutions. If this exact problem was solved 3 weeks ago, the agent finds it.
Episodic memory — The last 5 agent runs for each project are stored in Postgres. New runs start with context from previous investigations.
Project context — Loaded at agent init. The agent knows what kind of system it's investigating.
In-context working memory — The scratchpad of tool results that accumulates during a single run.
Most agents only have the last one. The semantic + episodic layers are what make investigations get faster over time.
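As a rough sketch of the semantic layer, assuming a Chroma collection of past failures with the resolution stored in metadata (collection and field names here are illustrative, not the actual schema):

import chromadb

client = chromadb.PersistentClient(path="./memory")
failures = client.get_or_create_collection("past_failures")

# Store a resolved failure so future investigations can find it
failures.add(
    ids=["failure-123"],
    documents=["multi-step refund questions with ambiguous policy"],
    metadatas=[{"project_id": "support-bot", "resolution": "added fallback instruction"}],
)

# On a new failure, look up the most similar past ones
hits = failures.query(query_texts=["refund request with unclear policy"], n_results=3)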
Response Control Hooks
This closes the loop on hallucination detection.
Before v3: TraceMind detected a high-risk response. You logged it. Nothing happened.
Now:
from tracemind import TraceMind, HallucinationPolicy

tm = TraceMind(api_key="...", project="my-app")

# Built-in policies — safe defaults out of the box
tm.response_control.set_policy("critical", HallucinationPolicy.BLOCK)
tm.response_control.set_policy("high", HallucinationPolicy.BLOCK)
tm.response_control.set_policy("medium", HallucinationPolicy.FLAG)

# Or custom callback for your specific logic
@tm.response_control.on("critical")
def handle_critical(event):
    alert_oncall(f"Critical hallucination in {event.span_name}")
    return "I'm not confident in this answer. Please contact support."

# Your existing code, unchanged
@tm.trace("support_handler")
def handle_ticket(ticket: str) -> str:
    return your_llm.complete(ticket)

# If response is critical-risk → HallucinationBlocked raised automatically
The design principle here came from a comment on my v2 post from @sunychoudhary: teams that get full flexibility usually implement no policy at all. So the defaults ship with something safe, and you override what you need.
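In BLOCK mode the caller still decides what to ship when a response is stopped. Something like this, wrapping handle_ticket from the snippet above and assuming HallucinationBlocked is importable from the SDK (the fallback message is just an example):

from tracemind import HallucinationBlocked

def safe_handle_ticket(ticket: str) -> str:
    try:
        return handle_ticket(ticket)
    except HallucinationBlocked:
        # Blocked by policy: return a safe fallback instead of the risky answer
        return "I'm not sure about this one. A human agent will follow up."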
Prompt Version Registry
Every deployed prompt is now versioned:
POST /api/prompts/{prompt_name}/versions
{
  "content": "You are a professional support agent. Be empathetic and precise.",
  "tags": ["production", "v2.3"]
}
# → { "version_id": "support:v3" }
When quality drops, you can correlate it with which prompt version was deployed at that timestamp. This answers "did the regression start when we changed the prompt?" without manually digging through git history.
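From a deploy script, registering a version is one call against that endpoint. A sketch using requests; the base URL matches the hosted demo below, and the bearer-token auth header is an assumption:

import requests

resp = requests.post(
    "https://tracemind.onrender.com/api/prompts/support/versions",
    headers={"Authorization": "Bearer ef_live_..."},   # auth scheme assumed
    json={
        "content": "You are a professional support agent. Be empathetic and precise.",
        "tags": ["production", "v2.3"],
    },
)
print(resp.json())  # e.g. {"version_id": "support:v3"}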
What I got wrong in v2 (and fixed)
The inputs["project_id"] bug — The agent would call fetch_recent_traces but the LLM sometimes omitted project_id from the tool input JSON. The function did inputs["project_id"] — hard key access — so it crashed with a KeyError instead of falling back to the agent's own project ID.
The fix: pid = inputs.get("project_id") or project_id, plus passing project_id through the call chain. Obvious in hindsight. All tool inputs now use .get() with fallbacks throughout.
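The pattern, roughly (the tool signature and the query_traces helper are simplified for illustration, not the actual code):

def fetch_recent_traces(inputs: dict, project_id: str) -> list:
    # Fall back to the agent's own project if the LLM omitted it from the tool input
    pid = inputs.get("project_id") or project_id
    hours = inputs.get("hours", 24)
    return query_traces(project_id=pid, since_hours=hours)  # query_traces is illustrative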
The float parse crash — The worker that auto-scores spans sent max_tokens=5 to get a single number back. Sometimes the model returned "3\n\nThe response is...". The code did float(result.strip()) and crashed.
The fix: float(result.strip().split()[0].rstrip('.')) — take only the first token.
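Wrapped in a small helper with a guard for empty output (a sketch; parse_score isn't the actual function name):

def parse_score(result: str):
    # The model sometimes returns "3\n\nThe response is...", so keep only the first token
    tokens = result.strip().split()
    if not tokens:
        return None
    try:
        return float(tokens[0].rstrip("."))
    except ValueError:
        return None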
Both bugs were caught by the verify suite (verify_all.py) before I noticed them in logs.
Numbers
- 44/44 verification checks passing
- 76 unit tests
- 8 iterations average per agent run
- ~45 seconds for a complete investigation
- <1ms SDK overhead (batched, non-blocking)
- $0 — runs entirely on Groq free tier
Try it
git clone https://github.com/Aayush-engineer/tracemind
cd tracemind && cp .env.example .env
# Add GROQ_API_KEY (free at console.groq.com)
docker-compose up
Or hit the hosted demo: tracemind.onrender.com/docs (free tier, ~30s cold start)
pip install tracemind-sdk
from tracemind import TraceMind

tm = TraceMind(
    api_key="ef_live_...",
    project="my-app",
    base_url="https://tracemind.onrender.com"
)

@tm.trace("llm_call")
def your_function(msg):
    return your_llm.complete(msg)  # unchanged
What I'd still do differently
The agent uses text-based ReAct, which occasionally misfires on smaller models. Native tool calling with a model that supports it reliably (Llama 3.3 70B, Mixtral) would be more robust — but that's beyond Groq's free tier limits for my use case.
The semantic memory searches all past failures globally across projects. It should be scoped per project first. On a shared instance with many projects, cross-project signal is mostly noise.
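With Chroma that scoping is a one-line change: filter by metadata at query time, assuming a project_id field is stored on each failure as in the memory sketch above:

hits = failures.query(
    query_texts=["refund request with unclear policy"],
    n_results=3,
    where={"project_id": "support-bot"},  # scope the search to one project
)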
What's next
- Ollama integration — run entirely local, no API key
- Hosted cloud version — 1 project, 1000 spans/month free
- LlamaIndex callback
If you're building with LLMs and something breaks in a way that doesn't show up in your error logs — that's exactly the problem TraceMind is for. Would genuinely value feedback on whether the agent investigations are useful in practice, or just interesting in theory.

