
Six months ago I shipped a feature that used an LLM to answer customer support questions. It worked great in testing. In production, it told a user that our refund window was 30 days. It's 14.
Nobody caught it for three weeks.
That wasn't a model problem. That was a me problem. I handed the model a job it was never designed to do — recall specific policy facts reliably — and assumed confidence meant correctness. It doesn't. It never did.
Here's what's actually happening under the hood. LLMs don't look things up. They compress patterns from training data into weights, then at inference time they predict what token comes next. That's it. There's no database query, no fact check, no "wait, am I sure about this?" The output that sounds most plausible wins. Always.
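To see "most plausible wins" in miniature, here's a toy next-token predictor built from bigram counts. It's an illustration of the mechanism, not how a transformer actually works: it emits whichever word most often followed the previous one in its tiny "training data", with no notion of truth anywhere in the loop.

```python
from collections import Counter, defaultdict

# Toy "training data": two near-identical sentences.
corpus = ("the cup final was won by the home side . "
          "the cup final was won by the visiting side .").split()

# Count which token follows each token.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_token(prev: str) -> str:
    # The most frequent continuation wins -- no lookup, no fact check.
    return bigrams[prev].most_common(1)[0][0]

def generate(start: str, n: int) -> str:
    out = [start]
    for _ in range(n):
        out.append(next_token(out[-1]))
    return " ".join(out)

print(generate("the", 6))  # the cup final was won by the
```

The output is perfectly fluent and says nothing true or false; it's just the statistically likeliest continuation. Scale that up a few billion parameters and you get sports journalism about 2031.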
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": "Who won the 2031 FIFA World Cup?"
    }]
)
print(response.content[0].text)
# Will give you a confident, fluent, completely made-up answer
```
The model doesn't know 2031 hasn't happened. It just knows how to write sentences that sound like sports journalism.
## RAG: give it something real to work with
The fix isn't prompting harder. It's grounding the model in actual source material before it generates anything.
```python
def rag_query(question: str, docs: list[str]) -> str:
    context = "\n".join(docs)
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        system=f"""Answer using only the context below.
If the answer isn't there, say so — don't guess.

Context:
{context}""",
        messages=[{"role": "user", "content": question}]
    )
    return response.content[0].text

policy_docs = [
    "Refund requests must be submitted within 14 days of purchase.",
    "Digital products are non-refundable after download."
]
print(rag_query("What's your refund policy?", policy_docs))
# Now it's reading from your actual policy, not inventing one
```
RAG doesn't make the model smarter. It makes it less dependent on memory it doesn't reliably have.
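One detail the example above glosses over: with a real document set you can't stuff everything into the context, so a retrieval step picks the most relevant docs first. Here's a deliberately minimal sketch using word overlap as the relevance score; production retrievers use embeddings, but the shape of the pipeline is the same.

```python
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank docs by shared words with the question -- a toy stand-in
    # for embedding similarity in a real retriever.
    q = _tokens(question)
    ranked = sorted(docs, key=lambda d: len(q & _tokens(d)), reverse=True)
    return ranked[:k]

docs = [
    "Refund requests must be submitted within 14 days of purchase.",
    "Digital products are non-refundable after download.",
    "Our office is closed on public holidays.",
]
top = retrieve("How many days do I have to request a refund?", docs)
print(top[0])  # the 14-day refund doc ranks first
```

You'd then pass `retrieve(question, docs)` into `rag_query` instead of the full corpus. The weak scoring function is the point: even crude retrieval beats asking the model to remember.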
## Make the model tell you when it's guessing
The other thing I changed: I stopped letting the model just produce answers. I made it produce answers with an explicit confidence label attached.
```python
import json

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    system="""Respond only in this JSON format:
{
  "answer": "...",
  "confidence": "high | medium | low",
  "caveat": "..."
}
Be honest. Low confidence is fine. Silence isn't.""",
    messages=[{
        "role": "user",
        "content": "What's the capital of the Moon?"
    }]
)

result = json.loads(response.content[0].text)
print(result["confidence"])  # low
print(result["caveat"])      # The Moon has no capital or inhabited settlements.
```
Sounds obvious. But most production systems I've seen don't do this. They pipe the output straight to the user and pray.
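One practical gotcha with this pattern: `json.loads` throws if the model wraps the JSON in prose or a markdown fence, which happens despite instructions. A defensive parser (my own convention, not anything from the SDK) that treats unparseable output as low confidence might look like this:

```python
import json
import re

def parse_confidence(raw: str) -> dict:
    # Pull the first {...} span out of the response; models sometimes
    # wrap the JSON in prose or ```json fences despite instructions.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            result = json.loads(match.group(0))
            if {"answer", "confidence"} <= result.keys():
                return result
        except json.JSONDecodeError:
            pass
    # Fail closed: output we can't parse is treated as a guess.
    return {"answer": raw, "confidence": "low",
            "caveat": "Model output was not valid JSON."}

print(parse_confidence('```json\n{"answer": "14 days", '
                       '"confidence": "high", "caveat": ""}\n```'))
print(parse_confidence("I'm not sure."))
```

The key design choice is the fallback: a formatting failure downgrades to low confidence instead of crashing or, worse, shipping raw text to the user as if it were vetted.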
## For anything that matters: a human in the loop
There's a class of applications — medical, legal, financial — where you genuinely cannot let the model be the last word. Not because it's bad. Because the cost of a single wrong answer is too high.
```python
def checked_response(question: str, docs: list[str]) -> str:
    answer = rag_query(question, docs)
    verdict = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=32,
        system="Reply SAFE or NEEDS_REVIEW. Nothing else.",
        messages=[{
            "role": "user",
            "content": f"Should this answer go directly to a patient?\n\n{answer}"
        }]
    ).content[0].text.strip()
    if verdict != "SAFE":
        # Fail closed: anything that isn't an explicit SAFE goes to a human.
        queue_for_human_review(answer)  # your ticketing/escalation hook
        return "A specialist will follow up shortly."
    return answer
```
This isn't pessimistic. It's just honest about what the technology is and isn't.
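The `queue_for_human_review` call is whatever escalation path you already have (a ticketing system, a Slack channel, a case queue). For local testing, an in-memory stand-in is enough to exercise the flow; the names here are mine, not from any library:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewQueue:
    # In production this would create a ticket in your support system;
    # an in-memory list is enough to test the escalation path.
    pending: list[str] = field(default_factory=list)

    def queue_for_human_review(self, answer: str) -> None:
        self.pending.append(answer)

queue = ReviewQueue()
queue.queue_for_human_review("Take 400mg every 4 hours.")
print(len(queue.pending))  # 1
```

The useful part isn't the queue itself; it's that the escalation path exists and is tested, so "NEEDS_REVIEW" actually goes somewhere instead of silently falling through.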
Three things I now treat as non-negotiable before shipping anything LLM-powered:
- Does the model have access to the actual source of truth, or is it working from memory?
- Can the system express uncertainty, or does it always sound equally confident?
- For high-stakes outputs, is there a human somewhere in the loop?
Get those three right and hallucination goes from "catastrophic failure mode" to "manageable edge case."
The model isn't broken. The system around it usually is.
What's bitten you in production? I'm curious whether people are hitting the same class of problems or completely different ones.