rishabh jain

Posted on Jun 15

How I debug RAG failures with deterministic signals

#ai #machinelearning #python #devtools

When building LLM apps, one frustrating problem is that a response can be wrong for many different reasons.

The model may have hallucinated. The retriever may have pulled the wrong chunks. The answer may not be grounded in the provided context. A required tool call may be missing. The output may have failed JSON schema validation. Or the prompt may simply be brittle.

I started experimenting with a more structured way to debug these failures.

Instead of only asking another LLM "is this answer good?", I wanted to compute signals from the request and response themselves.

The inputs I care about are usually:

prompt
model output
retrieved chunks
similarity scores
tool calls
expected tools
response schema
latency
model settings

From these, I try to classify the failure into categories like:

retrieval failure
RAG hallucination
schema violation
missing tool call
citation failure
prompt brittleness
ambiguous prompt
instruction-following issue

For example, if the retriever returns chunks with low similarity and the output answers confidently anyway, that points toward a retrieval or grounding problem.

If the chunks are relevant but the output introduces unsupported claims, that looks more like hallucination.

If a tool was expected but no tool call happened, that is a different failure mode entirely.
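A minimal sketch of how rules like these could be written as plain code. This is not DebugAI's actual implementation: the Trace fields, the 0.5 similarity cutoff, the rule order, and the category strings are all assumptions for illustration, and "answers confidently" is simplified to "returned a non-empty answer".

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical record of one LLM call; field names are illustrative."""
    output: str
    similarity_scores: list[float] = field(default_factory=list)
    unsupported_claims: list[str] = field(default_factory=list)
    tool_calls: list[str] = field(default_factory=list)
    expected_tools: list[str] = field(default_factory=list)


# Assumed cutoff; a real threshold depends on the embedding model and the data.
LOW_SIMILARITY = 0.5


def classify(trace: Trace) -> str:
    """Return the first failure category whose rule fires."""
    # A tool was expected but never called.
    if set(trace.expected_tools) - set(trace.tool_calls):
        return "missing_tool_call"
    # The retriever returned only weak matches, yet the model still answered.
    scores = trace.similarity_scores
    if scores and max(scores) < LOW_SIMILARITY and trace.output.strip():
        return "retrieval_failure"
    # Retrieval looks fine, but some claims are not supported by the chunks.
    if trace.unsupported_claims:
        return "rag_hallucination"
    return "no_failure_detected"
```

Because every rule is an ordinary comparison, the same trace always gets the same label, and the rule that fired is the explanation.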

A simple example:

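from debugai import analyze

# The chunk says 30 days, but the output claims 90 days,
# even though the similarity score suggests retrieval was relevant.
result = analyze(
    prompt="What is the refund policy?",
    output="You can get a full refund after 90 days.",
    chunks=["Electronics can be returned within 30 days with receipt."],
    similarity_scores=[0.82],
)

# Print the primary diagnosed failure category.
print(result["primary"])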
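To illustrate what one deterministic signal for the example above might look like, here is a rough sketch that flags numbers in the output that appear in none of the retrieved chunks. The function name unsupported_numbers is hypothetical and is not part of the DebugAI API; it uses only the standard library and is far cruder than a real grounding check.

```python
import re


def unsupported_numbers(output: str, chunks: list[str]) -> list[str]:
    """Return numbers that appear in the output but in none of the chunks."""
    number = re.compile(r"\d+(?:\.\d+)?")
    supported = set()
    for chunk in chunks:
        supported.update(number.findall(chunk))
    # Keep order of first appearance and drop duplicates.
    flagged = []
    for n in number.findall(output):
        if n not in supported and n not in flagged:
            flagged.append(n)
    return flagged


missing = unsupported_numbers(
    "You can get a full refund after 90 days.",
    ["Electronics can be returned within 30 days with receipt."],
)
print(missing)  # ['90']
```

Here the retrieval score is high, but "90" has no support in the chunk, which points toward hallucination rather than a retrieval problem.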

I’m building this into a small Python SDK called DebugAI.

Install:

pip install debugerai

The SDK can run locally without logging into a website. I also have a hosted dashboard for saving traces and diagnoses, but local usage is the main thing I’m testing right now.

The goal is not to magically solve LLM reliability. It is to make failures easier to name and inspect.

I’m still early, so I’m mainly looking for feedback from people building LLM/RAG apps:

Are these the right failure categories?
What signals would you want to see?
Do you debug this another way?
Should tools like this be local-first by default?

Link: https://debugai-5lb2.onrender.com

Top comments (4)

Alex Shev • Jun 15

Deterministic signals are the right framing. "The answer was bad" is too vague to debug; you need to know whether retrieval, grounding, schema, tool execution, or prompt routing failed.

I like treating RAG debugging as a pipeline trace problem. Once each step has a simple pass/fail reason, the fixes become much less mystical.

rishabh jain • Jun 17

Exactly. “Bad answer” is almost never the useful debugging unit.

The useful unit is: which part of the LLM pipeline failed, and what signal proves it? Retrieval may have returned irrelevant chunks, grounding may have failed despite decent retrieval, schema validation may have broken, or the model may have skipped a required tool call.

That’s the direction I’m taking DebugAI: treat each LLM call like a pipeline trace with deterministic signals and a named failure type, so the next action is obvious instead of vibes-based. Curious if you’ve seen this most often in RAG retrieval, tool calls, or output validation?

Alex Shev • Jun 18

I see it most often in output validation and tool-call preconditions.

Retrieval failures are easier to notice because the answer cites weak context. The subtler failures are when the model had enough context, but skipped a required tool call, returned the wrong shape, or made a claim without the artifact that should prove it.

rishabh jain • Jun 18

Yes, this matches what I have seen too. Retrieval failures are noisy and easier to inspect because you can compare the answer against chunks.

The harder failures are contract failures: the model had enough information, but did not satisfy the surrounding system requirements. It skipped a required tool, returned JSON that was almost right but not actually valid for the next service, or made a claim without producing the evidence artifact the workflow expected.

That is why I think LLM debugging needs to look beyond prompt/response traces. The interesting question is often not "was the answer plausible?" but "which contract did this step violate?"

For agents, I have found it useful to treat tool calls, schemas, citations, approvals, and validation outputs as first-class debug surfaces, not just metadata attached to the final answer.
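As a rough sketch of what checking contracts, rather than plausibility, could look like: the function below is hypothetical, uses only the standard library, and assumes a contract is just a list of required tool names plus a mapping of required JSON fields to Python types. A real system would more likely use a full JSON Schema validator.

```python
import json


def check_contracts(raw_output: str, required_fields: dict[str, type],
                    tool_calls: list[str], required_tools: list[str]) -> list[str]:
    """Return the violated contracts; an empty list means every check passed."""
    violations = []
    # Contract 1: every required tool must have been called.
    for tool in required_tools:
        if tool not in tool_calls:
            violations.append(f"missing_tool_call:{tool}")
    # Contract 2: the output must parse as a JSON object.
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        violations.append("schema_violation:invalid_json")
        return violations
    if not isinstance(payload, dict):
        violations.append("schema_violation:not_an_object")
        return violations
    # Contract 3: required fields must be present with the expected type.
    for name, expected_type in required_fields.items():
        if name not in payload:
            violations.append(f"schema_violation:missing_field:{name}")
        elif not isinstance(payload[name], expected_type):
            violations.append(f"schema_violation:wrong_type:{name}")
    return violations


violations = check_contracts(
    raw_output='{"order_id": "A17", "refund_days": "30"}',
    required_fields={"order_id": str, "refund_days": int},
    tool_calls=[],
    required_tools=["lookup_order"],
)
print(violations)
# ['missing_tool_call:lookup_order', 'schema_violation:wrong_type:refund_days']
```

The output in this example is valid JSON and reads as plausible, yet it violates two contracts: the required tool was never called, and refund_days is a string where the next service expects an integer.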