- Book: Prompt Engineering Pocket Guide
- Also by me: LLM Observability Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
On March 25, 2026, the Oregon Court of Appeals fined Salem civil attorney William Ghiorso for a brief he had filed in the underlying appeal. The brief contained 15 case citations the court could not find anywhere in the law, and nine quotations the panel described as "contrived from thin air." The headline number is $10,000, per Salem Reporter. The full bill, once adverse costs and the appellees' fees are added in, lands at roughly $109,700, believed to be the largest aggregate penalty ever tied to a single attorney's AI-related misconduct, per ComplexDiscovery's Q1 round-up.
That filing is one of more than a thousand. As of April 2026, Damien Charlotin's hallucination database catalogs 1,352 cases globally where generative AI produced fake citations that ended up in front of a judge, with over a thousand entries added in roughly the last year. Q1 2026 alone produced over $145,000 in U.S. court sanctions for fabricated citations.
This post is not about lawyers. It is about a defect that any team shipping LLM output into a high-stakes channel (legal, medical, financial, compliance) is one missing check away from. The pattern shows up in software too. The fix is the same.
What actually went wrong
Reconstructing the Ghiorso brief from the court's order is straightforward. He asked an AI tool to produce an opening brief on a civil matter. It produced one. The brief read like a competent first draft. It cited cases by name, with reporter, volume, and page number. It quoted those cases. The tone matched the genre.
None of those citations existed. The AI had assembled plausible-looking case captions (right circuit, right era, right name structure) and attached them to legal propositions the brief needed to support. The quotations were paraphrases of what such an opinion might have said, in the voice such an opinion would have used.
Ghiorso filed it. Opposing counsel ran the citations. The court ran the citations. The citations did not run.
This is not the LLM being broken. This is the LLM doing exactly what next-token prediction does, applied to a domain where "most likely" and "true" diverge. A case citation is a five-token pattern. The model has seen a million of them. Generating one that fits the surrounding sentence is a layup. Generating one that exists requires a database lookup the model is not equipped to perform.
The hallucination is not the defect. The missing verification step is.
The pattern under the failure
Pull back from the courtroom and the pattern is everywhere.
A finance team uses an LLM to draft a regulatory filing. The model invents an SEC release number that almost matches one. A medical chatbot cites a study; the DOI resolves to a different paper on a different topic. A SOC 2 narrative references a control ID that does not exist in your control matrix. A pull request from an AI agent imports requests-async, a package that did not exist on PyPI until a squatter registered it last week.
In every case the structure is identical:
- The model produces output that is shaped like verified reference material.
- The downstream consumer treats "shaped like" as "is."
- No mechanical check sits between (1) and (2).
You cannot prompt your way out of this with "do not hallucinate." Models trained on next-token prediction will continue to produce next tokens. What you can do is refuse to consume any output that has not been independently checked against a source of truth.
A prompt pattern that catches most of it
The first line of defense costs nothing. Make the model's output structurally checkable, and force it to mark every fact it claims with the source it was retrieved from.
You are drafting a legal memo. For every case citation,
quoted passage, statute reference, or factual claim about
an opinion, you MUST output a JSON record with:
- claim_text: the sentence as it appears in the memo
- claim_type: one of [case_cite, quote, statute, fact]
- source_id: the exact identifier you retrieved this from
- source_quote: the verbatim passage that supports it
If you cannot supply source_id and source_quote from the
retrieval context I gave you, output:
{"claim_type": "unverified", "claim_text": "..."}
Do not invent source_ids. Do not paraphrase source_quote.
The win here is not that the model stops hallucinating. It is that every claim now lives in a record you can mechanically iterate over. unverified claims are flagged before the document is shown to a human. Claims with a source_id can be cross-checked against the actual source. Claims with a source_quote can be string-matched against the source text.
This is the same idea as a typed API contract. You moved the truthfulness question from "did the LLM lie?" (unanswerable) to "does this string appear in this document?" (a grep).
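A minimal sketch of that iteration, assuming the model's records have already been parsed into a list of dicts and that load_source_text is a placeholder for whatever returns the full text behind a source_id in your corpus:

import re

def normalize(s: str) -> str:
    # Collapse whitespace so line breaks in the source text do not break the match.
    return re.sub(r"\s+", " ", s).strip().lower()

def triage_claims(claims: list[dict], load_source_text) -> dict:
    unverified, unsupported, ok = [], [], []
    for c in claims:
        if c["claim_type"] == "unverified":
            unverified.append(c)      # the model admitted it had no source
            continue
        source = load_source_text(c["source_id"])    # placeholder lookup
        if source is None or normalize(c["source_quote"]) not in normalize(source):
            unsupported.append(c)     # quote does not appear verbatim in the source
        else:
            ok.append(c)
    return {"unverified": unverified, "unsupported": unsupported, "ok": ok}

Everything in the first two buckets gets routed to a human before the document does.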
A 12-line citation existence checker
Once you have a source_id, you need to verify it exists. For case law there is a public API, CourtListener, that takes a citation and tells you whether it resolves to a real opinion.
import os
import requests

API = "https://www.courtlistener.com/api/rest/v4/"
TOK = os.environ["COURTLISTENER_TOKEN"]

def citation_exists(cite: str) -> bool:
    r = requests.get(
        API + "search/",
        params={"type": "o", "citation": cite},
        headers={"Authorization": f"Token {TOK}"},
        timeout=10,
    )
    r.raise_for_status()
    return r.json().get("count", 0) > 0
That is the whole thing. Twelve lines. Run it over every claim_type == "case_cite" record before the brief leaves your pipeline:
def verify_brief(claims: list[dict]) -> list[dict]:
    failures = []
    for c in claims:
        if c["claim_type"] != "case_cite":
            continue
        if not citation_exists(c["source_id"]):
            failures.append(c)
    return failures
If failures is non-empty, the brief does not get filed. The lawyer reads the failure list, deletes the fake citations, and either re-prompts with retrieval or writes those passages by hand. The brief that cost Ghiorso $109,700 would have been caught by this loop in under a second.
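Wired into a pipeline, that gate is a few more lines. This is a sketch: it assumes verify_brief from above is in scope, and the exception type and the way the claims arrive are placeholders for whatever your pipeline already uses:

import json

def gate_before_filing(claims_json: str) -> list[dict]:
    claims = json.loads(claims_json)   # the records the prompt schema above produces
    failures = verify_brief(claims)
    if failures:
        # Nothing leaves the pipeline until every citation resolves.
        raise ValueError(
            f"{len(failures)} citation(s) failed verification: "
            + ", ".join(c["source_id"] for c in failures)
        )
    return claims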
The same shape works for any domain with a lookup. PyPI for package imports. PubMed for medical studies. ClinicalTrials.gov for trial IDs. Your internal control registry for compliance IDs. The cost of running the check is rounding error compared to the cost of one bad submission.
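The PyPI version of the same check, for the ghost-package case above, is a sketch against PyPI's public JSON endpoint, where a 404 means the name is not registered:

import requests

def package_exists(name: str) -> bool:
    # PyPI answers 200 with metadata for a registered name, 404 otherwise.
    r = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
    return r.status_code == 200

Existence alone does not catch a freshly squatted name, so in practice you would pair it with an allowlist or an age and download threshold.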
Retrieval-grounded citations, when you can swing it
The stronger version is not "check after the fact" but "only let the model cite what you handed it." Wire the legal-research step in front of the drafting step. Pull the candidate cases via a search API. Stuff the relevant passages into the model's context. Tell the model: "you may cite only the cases in <context>. If a proposition needs a case that is not in <context>, write [CITATION NEEDED] and I will run another retrieval round."
def draft_with_grounding(question: str) -> str:
    cases = courtlistener_search(question, k=12)
    context = format_cases_for_prompt(cases)
    prompt = f"""
{context}

Draft a memo answering: {question}

You may cite only cases that appear above. For any
proposition that needs an authority not above, write
[CITATION NEEDED]. Output JSON-tagged claims as
specified in the schema.
"""
    return llm.complete(prompt)
This trades latency for trust. The drafting prompt is now a two-step pipeline with a real retrieval gate. You will still get the occasional misquote, which is what the string-match step catches. The combination of "only retrieve real things" plus "verify what was retrieved survived the draft" is what stops the Ghiorso failure mode.
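One way to close the loop on [CITATION NEEDED] markers is a bounded retry around the hypothetical draft_with_grounding above; treat this as a sketch of the control flow, not of the retrieval strategy:

def draft_until_grounded(question: str, max_rounds: int = 3) -> str:
    draft = draft_with_grounding(question)
    for _ in range(max_rounds):
        if "[CITATION NEEDED]" not in draft:
            return draft
        # Re-run retrieval with the unresolved draft appended so the search
        # can target the propositions still missing authority.
        draft = draft_with_grounding(question + "\n\nUnresolved draft:\n" + draft)
    # Out of rounds: escalate to a human with the gaps still visibly marked.
    return draft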
The observability piece
Even with both gates, things will slip. Your retrieval will miss a case. Your verifier will accept a citation that exists but does not stand for the proposition. A clever paraphrase will pass the string-match check while saying the opposite of the source. You need to see this happen in production before it ends up in a court filing.
The minimum trace per generation:
generation.id uuid
generation.input.tokens int
generation.output.tokens int
retrieval.k int
retrieval.hit_ids [str]
verification.claims_total int
verification.claims_failed int
verification.failures [{claim_id, reason}]
human.reviewed bool
human.edits_count int
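In practice that can be a single dict assembled per generation and handed to whatever tracer you run; the field names mirror the schema above, and how the counts reach this function is up to your pipeline:

import uuid

def build_trace(in_tokens: int, out_tokens: int, k: int, hit_ids: list[str],
                claims_total: int, failures: list[dict],
                reviewed: bool = False, edits: int = 0) -> dict:
    # One record per generation, mirroring the schema above.
    return {
        "generation.id": str(uuid.uuid4()),
        "generation.input.tokens": in_tokens,
        "generation.output.tokens": out_tokens,
        "retrieval.k": k,
        "retrieval.hit_ids": hit_ids,
        "verification.claims_total": claims_total,
        "verification.claims_failed": len(failures),
        "verification.failures": [
            {"claim_id": f.get("claim_id"), "reason": f.get("reason", "not_found")}
            for f in failures
        ],
        "human.reviewed": reviewed,
        "human.edits_count": edits,
    }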
Two metrics carry most of the signal. The first is verification failure rate: the fraction of generations that contained at least one failed claim. If it drifts up after a model swap or a prompt change, the new setup is more confident and less correct, which is the worst possible direction. The second is human edit rate on verified claims: when a reviewer edits a claim that passed verification, your verifier missed something, and those edits are the gold seed for your next round of eval cases.
You can run all of this on whatever tracing tool you already have (OpenTelemetry, Langfuse, Arize, your own Postgres). The shape is what matters: every generation carries the receipts for every claim, and you can answer "did this output get checked, and what did the check find?" in one query.
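Over a batch of those records, both metrics reduce to one pass each; a sketch assuming the traces can be read back as plain dicts:

def verification_failure_rate(traces: list[dict]) -> float:
    # Fraction of generations with at least one claim that failed verification.
    failed = sum(1 for t in traces if t["verification.claims_failed"] > 0)
    return failed / len(traces) if traces else 0.0

def edit_rate_on_verified(traces: list[dict]) -> float:
    # Of the reviewed generations that passed verification, how many still got edited.
    clean = [t for t in traces
             if t["human.reviewed"] and t["verification.claims_failed"] == 0]
    edited = sum(1 for t in clean if t["human.edits_count"] > 0)
    return edited / len(clean) if clean else 0.0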
Where this goes next
The next failure will not look like Ghiorso's. The pipelines that learned the citation lesson now have a verifier wired in front of the file button; the ones still exposed are the ones where the verifier is wired in but its blind spots have never been audited. A clever paraphrase that survives a string match. A retrieval that returned the wrong jurisdiction. A control ID that exists but is scoped to a different system. The verifier passes, the human signs off, and the bad output ships through a green pipeline.
That is the failure mode the next published opinion is going to be about. The teams that catch it early are the ones treating their verifier as code under test: a corpus of known-bad generations that should fail the check, a corpus of known-good ones that should pass, and a CI step that runs both on every prompt or model change. Same loop you would build for any other production component that is allowed to say no.
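A minimal version of that CI step, assuming verify_brief is importable, the two fixture files are corpora you curate by hand, and citation_exists is stubbed with recorded responses in CI rather than hitting the live API:

import json

with open("fixtures/known_bad_claims.json") as f:
    KNOWN_BAD = json.load(f)    # generations with fabricated citations: must fail
with open("fixtures/known_good_claims.json") as f:
    KNOWN_GOOD = json.load(f)   # hand-verified generations: must pass

def test_known_bad_claims_are_caught():
    bad_cites = [c for c in KNOWN_BAD if c["claim_type"] == "case_cite"]
    assert len(verify_brief(KNOWN_BAD)) == len(bad_cites)

def test_known_good_claims_pass_clean():
    assert verify_brief(KNOWN_GOOD) == []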
The courtroom version of this story comes with a fine and a published opinion. The production-incident version comes with a postmortem and an outage. The defense is the same in both.
If this was useful
The prompt pattern in this post (schema-tagged claims, source-grounded retrieval, refusal-on-unverified) is the kind of thing the Prompt Engineering Pocket Guide walks through end to end. The verification, tracing, and metric layer is the LLM Observability Pocket Guide. If your team is shipping LLM output into a place where being confidently wrong has a price tag, those two together cover most of the path.

