DEV Community

rishabh jain

How I debug RAG failures with deterministic signals

When building LLM apps, one frustrating problem is that a response can be wrong for many different reasons.

The model may have hallucinated. The retriever may have pulled the wrong chunks. The answer may not be grounded in the provided context. A tool call may be missing. A JSON schema may have failed. Or the prompt may simply be brittle.

I started experimenting with a more structured way to debug these failures.

Instead of only asking another LLM "is this answer good?", I wanted to compute signals from the request and response themselves.

The inputs I care about are usually:

  • prompt
  • model output
  • retrieved chunks
  • similarity scores
  • tool calls
  • expected tools
  • response schema
  • latency
  • model settings
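
As a rough sketch, these inputs can travel together in one record. The `Trace` class and its field names below are hypothetical, chosen only to mirror the list above; they are not DebugAI's actual types.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Trace:
    """Hypothetical container for one request/response pair and its context."""

    prompt: str
    output: str
    chunks: list[str] = field(default_factory=list)
    similarity_scores: list[float] = field(default_factory=list)
    tool_calls: list[str] = field(default_factory=list)      # tools the model actually called
    expected_tools: list[str] = field(default_factory=list)  # tools the task required
    response_schema: Optional[dict[str, Any]] = None
    latency_ms: Optional[float] = None
    model_settings: dict[str, Any] = field(default_factory=dict)


trace = Trace(
    prompt="What is the refund policy?",
    output="You can get a full refund after 90 days.",
    chunks=["Electronics can be returned within 30 days with receipt."],
    similarity_scores=[0.82],
)
```

Giving every optional field an empty default means a check can run on a partial trace, for example one with no tool calls at all.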

From these, I try to classify the failure into categories like:

  • retrieval failure
  • RAG hallucination
  • schema violation
  • missing tool call
  • citation failure
  • prompt brittleness
  • ambiguous prompt
  • instruction-following issue
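
A minimal way to keep these labels consistent in code is an enum. The `FailureCategory` name and the string values are assumptions made for illustration, not identifiers taken from the SDK.

```python
from enum import Enum


class FailureCategory(str, Enum):
    """Hypothetical labels mirroring the categories listed above."""

    RETRIEVAL_FAILURE = "retrieval_failure"
    RAG_HALLUCINATION = "rag_hallucination"
    SCHEMA_VIOLATION = "schema_violation"
    MISSING_TOOL_CALL = "missing_tool_call"
    CITATION_FAILURE = "citation_failure"
    PROMPT_BRITTLENESS = "prompt_brittleness"
    AMBIGUOUS_PROMPT = "ambiguous_prompt"
    INSTRUCTION_FOLLOWING = "instruction_following"
```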

For example, if the retriever returns chunks with low similarity and the output answers confidently anyway, that points toward a retrieval or grounding problem.
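
Here is one way that signal could be computed without calling another model. It is a sketch under stated assumptions: the `0.5` threshold, the hedge phrases, and both function names are illustrative, and a real threshold depends on the embedding model and the distance metric.

```python
# Phrases that suggest the model declined to answer (illustrative, not exhaustive).
HEDGE_MARKERS = (
    "i don't know",
    "i'm not sure",
    "not enough information",
    "cannot find",
)


def answers_confidently(output: str) -> bool:
    """Return True when the output contains none of the hedge phrases."""
    text = output.lower()
    return not any(marker in text for marker in HEDGE_MARKERS)


def weak_retrieval_but_confident(
    output: str,
    similarity_scores: list[float],
    min_score: float = 0.5,  # assumed threshold; tune per embedding model
) -> bool:
    """Flag a confident answer whose best retrieved chunk scores below min_score."""
    best = max(similarity_scores, default=0.0)  # no chunks counts as weak retrieval
    return best < min_score and answers_confidently(output)
```

The check is deliberately crude: it only says that retrieval looked weak and the model answered anyway, which is enough to route the trace toward the retriever instead of the prompt.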

If the chunks are relevant but the output introduces unsupported claims, that looks more like hallucination.
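
One cheap deterministic proxy for "unsupported claims" is to compare the numbers in the output against the numbers in the retrieved chunks. This is only a sketch of that narrow idea: `unsupported_numbers` is a hypothetical helper, and it misses claims that contain no digits.

```python
import re

# Matches integers and simple decimals such as "30" or "4.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(output: str, chunks: list[str]) -> list[str]:
    """Return numbers that appear in the output but in none of the chunks."""
    supported = set()
    for chunk in chunks:
        supported.update(NUMBER.findall(chunk))
    return [n for n in NUMBER.findall(output) if n not in supported]


missing = unsupported_numbers(
    "You can get a full refund after 90 days.",
    ["Electronics can be returned within 30 days with receipt."],
)
print(missing)  # ['90']
```

A non-empty result does not prove hallucination, but it points at the exact token to inspect.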

If a tool was expected but no tool call happened, that is a different failure mode entirely.
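
This check reduces to a set difference. A minimal sketch, assuming tool calls and expected tools are both available as lists of tool names (the function name and the tool names in the usage lines are made up):

```python
def missing_tool_calls(expected_tools: list[str], tool_calls: list[str]) -> list[str]:
    """Return the expected tools that were never called, in their original order."""
    called = set(tool_calls)
    return [tool for tool in expected_tools if tool not in called]


# Hypothetical tool names, for illustration only.
print(missing_tool_calls(["lookup_order"], []))                # ['lookup_order']
print(missing_tool_calls(["lookup_order"], ["lookup_order"]))  # []
```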

A simple example:

from debugai import analyze

result = analyze(
    prompt="What is the refund policy?",
    # The output claims 90 days, but the only retrieved chunk says 30 days.
    output="You can get a full refund after 90 days.",
    chunks=["Electronics can be returned within 30 days with receipt."],
    similarity_scores=[0.82],
)

print(result["primary"])

I’m building this into a small Python SDK called DebugAI.

Install:

pip install debugerai

The SDK can run locally without logging into a website. I also have a hosted dashboard for saving traces and diagnoses, but local usage is the main thing I’m testing right now.

The goal is not to magically solve LLM reliability. It is to make failures easier to name and inspect.

I’m still early, so I’m mainly looking for feedback from people building LLM/RAG apps:

  • Are these the right failure categories?
  • What signals would you want to see?
  • Do you debug this another way?
  • Should tools like this be local-first by default?

Link: https://debugai-5lb2.onrender.com

Top comments (1)

Alex Shev

Deterministic signals are the right framing. "The answer was bad" is too vague to debug; you need to know whether retrieval, grounding, schema, tool execution, or prompt routing failed.

I like treating RAG debugging as a pipeline trace problem. Once each step has a simple pass/fail reason, the fixes become much less mystical.