The hardest LLM bugs are contract failures, not hallucinations

#ai #llm #testing #softwareengineering

When people talk about LLM failures, the default word is usually "hallucination."

But after building and testing LLM apps, I think many production bugs are better described as contract failures.

A hallucination is when the model makes something up. That matters, but it is not the only failure mode.

The subtler bugs happen when the model had enough context, but violated the surrounding system contract.

Examples:

It skipped a required tool call.
It returned JSON that looked valid but broke the downstream schema.
It made a claim without the evidence artifact the workflow expected.
It ignored a tool result.
It answered before satisfying a validation step.
It followed one instruction while violating another.

Retrieval failures are usually easier to notice. You can compare the answer against the retrieved chunks and see that the grounding is weak.

Contract failures are harder because the answer may look plausible. The text might even be correct in isolation. But the system still failed because the model did not do the thing the application required.

For example, in a support agent, the problem may not be that the model gave a bad refund answer. The problem may be that it answered without first calling the refund eligibility tool.

In a data extraction app, the problem may not be that the model misunderstood the document. The problem may be that it returned almost-correct JSON that fails validation in the next service.

In a RAG workflow, the problem may not be missing context. The problem may be that the answer made a claim without attaching the citation or artifact that proves it.

This changed how I think about LLM debugging.

Prompt and response logs are useful, but they are not enough. A debugger should also inspect the contracts around the model call:

Was the required tool called?
Were the tool arguments valid?
Was the tool result used?
Did the output match the schema?
Was each claim supported by evidence?
Did the model satisfy the approval or safety preconditions?
Is this failure reproducible as a regression case?

This is the direction I am taking with DebugAI, a Python SDK I have been building.

The goal is to take a bad LLM response and return a structured debug artifact:

failure type
evidence
likely root cause
suggested fix
regression-testable case

For example, instead of only saying "bad answer," the diagnosis should say something like:

{
"failure": "tool_call_failure",
"evidence": [
"Expected refund_order tool before answering",
"No tool call was made"
],
"fix": "Require tool execution before final answer and reject responses without tool evidence."
}

I think this framing is more useful than treating every LLM bug as a hallucination.

Some bugs are grounding failures.
Some are retrieval failures.
Some are instruction failures.
Some are output validation failures.
Some are tool contract failures.

The more specific the failure class, the easier it is to fix and test.

Repo is public here if useful: https://github.com/civicRJ/DebugAI