Alan West

Debugging confidently wrong answers from LLM-powered features

The bug that took two weeks to surface

A few months back I shipped a feature that used a language model to summarize support tickets and suggest responses. Internal QA loved it. The demo went great. Two weeks after launch, our support lead pinged me on Slack: "Are these summaries... making things up?"

They were. Not always. Maybe one in fifty. But the ones that were wrong looked exactly as confident as the correct ones — same tone, same structure, same plausible-looking detail. A ticket about a failed payment got summarized as "user wants to cancel subscription." A complaint about slow load times got rephrased as "user reports outage in EU region."

If you've shipped anything LLM-backed in production, this story is probably familiar. The model isn't broken. The benchmark scores look great. But the tail is full of confidently wrong answers, and your users are the ones finding them.

Here's what I learned debugging this, and the layered approach that finally got our hallucination rate down to something I could live with.

Why this happens (and why it's hard to catch)

The first thing to internalize: a language model produces fluent text whether or not the underlying reasoning is sound. There's no "I'm not sure" signal you can read off the surface output. The model that confidently invents a detail and the model that confidently states a true fact look identical from your application's perspective.

Worse, evaluation suites usually skew toward typical inputs. Your eval probably hits the median case. Production traffic hits the tail — weird formatting, unusual entities, contradictory context, ambiguous pronouns, multi-language messages. Tail behavior is where hallucinations live.

In our case, the model was misreading tickets where the customer mentioned multiple unrelated topics. The summarizer would latch onto whichever topic appeared first or had the strongest sentiment, and confidently summarize that as the whole ticket.

Step 1: Constrain the output structure

Free-form prose gives the model room to confabulate smoothly. Constraining the output forces it to commit to specific claims you can verify.

Instead of asking for a summary, I asked for a structured object:

# Bad: free-form prose, hard to validate
prompt = f"Summarize this ticket:\n{ticket}"

# Better: structured claims we can check one by one
schema = {
    "primary_issue": "str",        # one short phrase, must appear in source
    "customer_intent": "enum[refund, cancel, technical_help, billing, other]",
    "mentioned_order_ids": "list[str]",
    "sentiment": "enum[neutral, frustrated, angry]",
    "requires_human": "bool",
}

JSON Schema modes or function-calling features, which most providers now support, work even better, since they constrain the output at the decoding layer. The point is: you want discrete claims, not paragraphs. Claims you can check. Prose you cannot.
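To make that concrete, here's a minimal validation sketch assuming Pydantic v2 (the model and helper names are hypothetical; the fields mirror the illustrative schema above):

from enum import Enum

from pydantic import BaseModel, ValidationError

class Intent(str, Enum):
    refund = "refund"
    cancel = "cancel"
    technical_help = "technical_help"
    billing = "billing"
    other = "other"

class Sentiment(str, Enum):
    neutral = "neutral"
    frustrated = "frustrated"
    angry = "angry"

class TicketSummary(BaseModel):
    primary_issue: str
    customer_intent: Intent
    mentioned_order_ids: list[str]
    sentiment: Sentiment
    requires_human: bool

def parse_summary(raw_json: str) -> TicketSummary | None:
    # Reject anything malformed or schema-violating before it
    # reaches the verifier, the guards, or the user.
    try:
        return TicketSummary.model_validate_json(raw_json)
    except ValidationError:
        return None

Anything that fails to parse never gets shown, which already filters out one class of malformed output before the verifier runs.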

Step 2: Add a verifier pass

This was the change that actually moved the needle. Run the output through a second model call whose only job is to check whether each claim is supported by the source.

# call_llm is our thin wrapper around the model provider's API;
# it returns the completion text.
def verify_claim(source_text: str, claim: str) -> str:
    prompt = (
        "You are a strict fact-checker. Given the SOURCE and the CLAIM,\n"
        "answer with exactly one word: YES, NO, or UNCERTAIN.\n"
        "YES means the claim is explicitly supported by the source.\n"
        f"SOURCE:\n{source_text}\n\nCLAIM:\n{claim}\n\nANSWER:"
    )
    return call_llm(prompt, max_tokens=4).strip().upper()

def accept(source, output) -> bool:
    # Treat UNCERTAIN as failure on high-stakes paths.
    for claim in output.claims():
        if verify_claim(source, claim) != "YES":
            return False
    return True
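The output.claims() call above glosses over a real step: turning structured fields into assertions you can check. Here's one way to do it, a sketch building on the hypothetical TicketSummary from Step 1:

def claims_from_summary(summary: TicketSummary) -> list[str]:
    # Render each structured field as a short natural-language
    # assertion the verifier can check against the source ticket.
    claims = [
        f"The customer's primary issue is: {summary.primary_issue}",
        f"The customer's intent is: {summary.customer_intent.value}",
        f"The customer's sentiment is: {summary.sentiment.value}",
    ]
    claims += [
        f"The ticket mentions order ID {oid}"
        for oid in summary.mentioned_order_ids
    ]
    return claims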

A few things matter for the verifier:

  • Use a different prompt structure than the generator's. You don't want correlated failure modes.
  • Force a discrete answer (YES/NO/UNCERTAIN). No prose, no chain-of-thought leaking into the output.
  • Treat UNCERTAIN as failure for high-stakes outputs. Cheap, conservative, surprisingly effective.

Yes, you're paying for an extra call. In our case, the cost per request roughly doubled, and that was fine — the alternative was customer-visible mistakes.

Step 3: Deterministic guards on the things you can actually check

LLMs don't need to be involved in checking facts that have a definite answer. If your output mentions an order ID, regex-check the format and look it up in your database. Numbers, dates, IDs, enum values, email addresses — all deterministic.

I added a small guard layer that runs after the verifier:

import re

ORDER_ID = re.compile(r"^ORD-\d{8}$")

def guard(output, ticket) -> bool:
    for oid in output.mentioned_order_ids:
        if not ORDER_ID.match(oid):
            return False  # malformed ID
        if oid not in ticket.body:
            return False  # model invented an order ID
        if not db.orders.exists(oid):
            return False  # ID doesn't resolve
    return True

If any deterministic check fails, we don't show the response at all. Instead we fall back to a templated message: "We received your ticket; an agent will respond shortly." Boring, correct, never wrong.
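Wired together, the whole path looks roughly like this (a sketch: build_summary_prompt and render_summary are hypothetical helpers, and parse_summary and claims_from_summary come from the earlier sketches):

FALLBACK = "We received your ticket; an agent will respond shortly."

def summarize_ticket(ticket) -> str:
    raw = call_llm(build_summary_prompt(ticket.body))  # generator pass
    summary = parse_summary(raw)                       # schema gate (Step 1)
    if summary is None:
        return FALLBACK
    for claim in claims_from_summary(summary):         # verifier pass (Step 2)
        if verify_claim(ticket.body, claim) != "YES":
            return FALLBACK
    if not guard(summary, ticket):                     # deterministic guards (Step 3)
        return FALLBACK
    return render_summary(summary)                     # safe to show

Every failure path lands on the same boring template, which is exactly the point.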

Step 4: Log the disagreements

For every request, log the generator output, the verifier verdict, and the guard outcome. Then build a dashboard of disagreements. Within a week you'll see patterns — specific input shapes that trigger more verifier failures, specific claim types that get confabulated.
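Here's roughly the shape of the record we log per request (a sketch; the field names are illustrative):

import json
import logging
import time

logger = logging.getLogger("llm_summaries")

def log_decision(ticket_id: str, raw_output: str,
                 verdicts: dict[str, str], guard_ok: bool, shown: bool) -> None:
    # One structured record per request. A dashboard over these
    # surfaces which input shapes trigger verifier failures.
    logger.info(json.dumps({
        "ts": time.time(),
        "ticket_id": ticket_id,
        "generator_output": raw_output,
        "verifier_verdicts": verdicts,   # claim -> YES / NO / UNCERTAIN
        "guard_passed": guard_ok,
        "shown_to_user": shown,
    }))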

This is where you get the data to improve your prompt, swap models, or fine-tune. Without it you're guessing.

Prevention tips for next time

A few things I'd do from day one on the next project:

  • Decide on output structure before you write the prompt. Pick the schema first, then write a prompt that produces it. Don't bolt structure on later.
  • Build evals from production logs, not synthetic examples. Synthetic examples test what you imagined. Logs test what users actually do.
  • Treat the model as one component, not the whole system. Validators, guards, retrieval, deterministic checks — these aren't workarounds, they're the architecture. A good LLM feature is mostly not the LLM.
  • Keep a templated fallback for every code path. When the model is uncertain, users should get a boring correct response — not a creative wrong one.
  • Sample and review. Set up a review queue, look at 50 outputs a week, write down what you find. There's no substitute.

The bigger lesson

The thing I keep coming back to is that fluency is not correctness. A model that produces beautiful, well-structured, confident text saying the wrong thing is in some sense more dangerous than one that produces obvious garbage. Garbage gets caught. Confident wrongness gets shipped.

Build the verifier. Add the guards. Log everything. Then sleep slightly better.
