Akshat Soni

Posted on Jun 2

We Had LLMs Hallucinating Legal URLs in Production — Here's What We Tried

#rag #ai #llm #python

The Production Bug

We were building a RAG application over legal documents.
Users ask legal questions, the system retrieves relevant law sections and
court decisions, and the LLM synthesizes an answer with citations.

The citations are the whole point. A legal answer without a working
source link is useless — worse than useless, because it looks credible.

Then users started reporting broken links.

When I dug in, I found two distinct failure patterns:

Pattern 1 — Invented URLs. The LLM would cite a section that
simply didn't exist. Not a wrong document — a wrong section within
a real document. Something like referencing #paragraf-99 in a law
that only goes up to #paragraf-67. Confident, plausible, wrong.

Pattern 2 — Half-broken URLs. The source URL was real and present
in the context. The LLM had read it. But what it returned was subtly
mutated:

# What we gave the LLM:
4552013#paragraf-31.odsek-2.pismeno-a

# What the LLM gave back:
4552013#paragraf-3.odsek-2.pismeno-a   ← dropped a digit
4552013#paragraf-31.odsek-2.pismeno-b  ← flipped last letter
4552013#paragraf-31.odsek-2            ← truncated the path

Every mutation was plausible. None of them existed. And the links
rendered fine in the UI — you only found out they were dead when
you clicked.
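One way to surface these mutations before a user clicks is to compare every link in the response against the set of URLs that were actually in the context. The sketch below is illustrative only: `find_unknown_links` and `LINK_RE` are hypothetical names, and it assumes citations appear as markdown-style `[label](target)` links.

```python
import re

# Matches markdown links of the form [label](target).
# Assumption: targets contain no whitespace and no ")".
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def find_unknown_links(response: str, context_urls: set[str]) -> list[str]:
    """Return link targets in the response that never appeared in the context."""
    return [
        target
        for _label, target in LINK_RE.findall(response)
        if target not in context_urls
    ]


# The URL we gave the model, taken from the example above.
context_urls = {"4552013#paragraf-31.odsek-2.pismeno-a"}

# A response with one faithful link and one that dropped a digit.
response = (
    "See [§ 31](4552013#paragraf-31.odsek-2.pismeno-a) and "
    "[§ 3](4552013#paragraf-3.odsek-2.pismeno-a)."
)

unknown = find_unknown_links(response, context_urls)
print(unknown)
```

A check like this only detects the mutation. It does not tell you which real URL the model meant.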

What We Tried First

Prompt engineering. We tried few-shot prompting: giving the model
explicit examples of what a correct citation looks like and what an
incorrect one looks like. Here's the accepted format, here's what to
avoid. The model would echo the examples back correctly during
testing. But in real responses with real legal URLs it still mutated
fragments. The examples helped at the edges, but the core problem
remained. The model isn't ignoring your examples; as far as we could
tell, it cannot reliably distinguish the real URL from its mutated
version, because both look equally valid to it.

Reducing context. We cut down the number of documents we injected
per prompt. Fewer URLs competing for attention meant fewer mutations.
This actually helped noticeably — going from 15 documents to 8
cut the broken link rate roughly in half. But the underlying problem
remained, and 8 documents wasn't always enough for complex legal queries.

More batches. Instead of one big LLM call with everything, we split
into smaller focused calls. Again, helped but didn't fix. And it added
latency and cost.

None of these got us down to an acceptable error rate. Broken legal
citations were still making it to users.

The "What If" Moment

I was staring at one of these mutations trying to understand the
pattern when something clicked.

These URLs are long, noisy, and nearly identical to each other.
My working theory: from a token perspective they look like
high-frequency boilerplate. The model isn't reading them as precise
addresses. It's reading them as variations on a pattern and then
reconstructing them from memory.

What if we just... didn't give it the real URLs at all?

The idea: replace every URL with a short numeric code before
sending to the LLM. Keep a lookup table on our side. After the
response, swap the codes back.

# Before sending to LLM:
[§ 31](4552013#paragraf-31.odsek-2.pismeno-a) → [§ 31](=1#1)
[§ 65a](4552013#paragraf-65a)                  → [§ 65a](=1#2)
[Article 5](/SK/ZZ/1964/40#article-5)          → [Article 5](=2#1)

# LLM works with short codes only.

# After LLM responds:
[§ 31](=1#1)      → [§ 31](4552013#paragraf-31.odsek-2.pismeno-a) ✅
[§ 65a](=1#2)     → [§ 65a](4552013#paragraf-65a)                 ✅
[Article 5](=2#1) → [Article 5](/SK/ZZ/1964/40#article-5)         ✅

The = prefix makes codes visually distinct, so the model seems to
treat them as opaque tokens rather than values to reason about. A code
like =1#1 is short and carries no meaning, so there is far less for
the model to mutate.

Implementation is a bijective map — base URIs get integer IDs,
fragments within each base get sub-indices:

from uri_shortener import UriShortener

shortener = UriShortener()

# Encode before LLM
encoded_context = shortener.encode_text(raw_context)

# Send encoded_context to LLM, get llm_response back

# Check for hallucinated codes before decoding
hallucinated = shortener.find_hallucinated_codes(llm_response)
# e.g. {"=1#99"} — LLM invented fragment 99, which was never encoded

# Decode back to real URLs
final_response = shortener.decode_text(llm_response)
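For readers who want to see the idea without the library, here is a minimal sketch of how such a map could work. `TinyShortener` and its internals are hypothetical and are not the `uri_shortener` implementation. It assumes URLs appear as markdown-style `[label](url)` links and that codes have the shape `=<base>#<index>`.

```python
import re

LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")  # [label](url)
CODE_RE = re.compile(r"=\d+#\d+")                   # =1#2


class TinyShortener:
    """Illustrative stand-in for the idea, not the real library."""

    def __init__(self) -> None:
        self.base_ids: dict[str, int] = {}               # base URI -> integer ID
        self.frag_ids: dict[tuple[int, str], int] = {}   # (base ID, fragment) -> sub-index
        self.code_to_url: dict[str, str] = {}            # "=1#2" -> original URL

    def _code_for(self, url: str) -> str:
        base, _, fragment = url.partition("#")
        base_id = self.base_ids.setdefault(base, len(self.base_ids) + 1)
        key = (base_id, fragment)
        if key not in self.frag_ids:
            # Sub-indices count up separately within each base.
            seen = sum(1 for b, _f in self.frag_ids if b == base_id)
            self.frag_ids[key] = seen + 1
        code = f"={base_id}#{self.frag_ids[key]}"
        self.code_to_url[code] = url
        return code

    def encode_text(self, text: str) -> str:
        """Replace every link target with its short code."""
        return LINK_RE.sub(
            lambda m: f"[{m.group(1)}]({self._code_for(m.group(2))})", text
        )

    def decode_text(self, text: str) -> str:
        """Swap known codes back; unknown codes are left untouched."""
        return CODE_RE.sub(
            lambda m: self.code_to_url.get(m.group(0), m.group(0)), text
        )


tiny = TinyShortener()
raw = (
    "[§ 31](4552013#paragraf-31.odsek-2.pismeno-a) "
    "[§ 65a](4552013#paragraf-65a) "
    "[Article 5](/SK/ZZ/1964/40#article-5)"
)
encoded = tiny.encode_text(raw)
print(encoded)
print(tiny.decode_text(encoded) == raw)
```

Because each URL gets exactly one code and each code maps back to exactly one URL, decoding an unmodified encoded text returns the original.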

We also got a nice side effect: token savings. Those long legal
URLs are expensive. Replacing them with =1#1 style codes
meaningfully cut context token usage — the exact savings depend
on how many URLs your context has and how long they are, but
the more fragment-heavy your documents are, the more you gain.
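A rough way to estimate the savings on your own data is to compare the raw and encoded context sizes. The sketch below counts characters, which is only a proxy; real token counts depend on the model's tokenizer. `char_savings` is a hypothetical helper, and the sample strings reuse the URLs from the example above.

```python
def char_savings(raw: str, encoded: str) -> tuple[int, float]:
    """Return (characters saved, fraction of the raw length saved)."""
    saved = len(raw) - len(encoded)
    return saved, (saved / len(raw) if raw else 0.0)


raw_context = (
    "[§ 31](4552013#paragraf-31.odsek-2.pismeno-a) "
    "[§ 65a](4552013#paragraf-65a)"
)
encoded_context = "[§ 31](=1#1) [§ 65a](=1#2)"

chars_saved, fraction_saved = char_savings(raw_context, encoded_context)
print(f"saved {chars_saved} characters ({fraction_saved:.0%} of the raw context)")
```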

What It Fixed

The invented URLs problem: mostly solved. If the LLM invents
a code like =1#99 and we only encoded fragments =1#1 through
=1#3, we can detect it immediately with find_hallucinated_codes()
before it ever reaches the user. This was a significant win.
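The check itself can be very small. The following is a minimal sketch of the idea, not the library's `find_hallucinated_codes()`: `find_unissued_codes` is a hypothetical function that assumes codes have the shape `=<base>#<index>` and that the set of issued codes is known at decode time.

```python
import re

CODE_RE = re.compile(r"=\d+#\d+")  # assumed code shape: =1#2


def find_unissued_codes(response: str, issued: set[str]) -> set[str]:
    """Return codes that appear in the response but were never handed out."""
    return set(CODE_RE.findall(response)) - issued


issued = {"=1#1", "=1#2", "=1#3"}
response = "See [§ 31](=1#1) and [§ 99](=1#99)."

unissued = find_unissued_codes(response, issued)
print(unissued)
```

Note what a set difference like this cannot catch: if the model emits =1#2 where it should have emitted =1#1, both codes were issued, so the link resolves to a real but wrong section.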

The Gemini Parallel

While researching this problem, I noticed something interesting
in how Google Gemini handles citation links. In Gemini's responses,
the links in the answer don't appear to be fully resolved during
streaming; they seem to get resolved after the stream finishes. We
followed the same pattern: encode before the LLM call, stream the
response with short codes, decode after the stream completes.
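The stream-then-decode flow can be sketched as a generator that forwards chunks as they arrive and emits one decoded text at the end. Everything here is an assumption for illustration: `stream_then_decode`, the `("delta", ...)` and `("final", ...)` event shapes, and the tiny lookup table are hypothetical, not our production code or Gemini's.

```python
import re
from typing import Callable, Iterable, Iterator

CODE_RE = re.compile(r"=\d+#\d+")  # assumed code shape: =1#2


def stream_then_decode(
    chunks: Iterable[str], decode: Callable[[str], str]
) -> Iterator[tuple[str, str]]:
    """Yield ("delta", chunk) per chunk, then one ("final", decoded_text)."""
    parts: list[str] = []
    for chunk in chunks:
        parts.append(chunk)
        yield ("delta", chunk)  # the client sees short codes while streaming
    yield ("final", decode("".join(parts)))  # real URLs only at the end


# Hypothetical lookup table built at encode time.
table = {"=1#1": "4552013#paragraf-31.odsek-2.pismeno-a"}


def decode(text: str) -> str:
    # Unknown codes are left as-is so they can be flagged elsewhere.
    return CODE_RE.sub(lambda m: table.get(m.group(0), m.group(0)), text)


# The code "=1#1" is deliberately split across two chunks.
events = list(stream_then_decode(["See [§ 31](=1", "#1)."], decode))
print(events[-1])
```

Decoding the joined text rather than each chunk sidesteps the case where a code is split across chunk boundaries.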

Seeing a production system at that scale doing something
conceptually similar gave us confidence we were on the right track.

We Haven't Fully Solved It

The encoding approach took us from "broken links regularly reaching
users" to "most links work." That's a meaningful improvement in a
legal context where every citation matters. But the half-broken
URL problem — where the LLM generates something close to a valid
code but not quite — is still open.

If you've hit something similar in production — hallucinated URLs,
mutated identifiers, broken citations in any domain — I'd genuinely
like to know what worked for you. Fine-tuning? Validation layers?
Something completely different?

The full demo is here:
👉 github.com/AkshatSoni26/promptlinks

Drop what worked for you in the comments.

Top comments (2)

Abdullah Shahin • Jun 3

The opaque-code remapping is a clever trick because it converts a generation problem (don't mutate this 200-char URL) into a classification problem (emit one of N short tokens), which models are much better at. One extension that's worth trying: make the code a structured short identifier with a built-in checksum digit, so find_hallucinated_codes() can reject mutations before even hitting the lookup table — useful when the model invents a code that happens to collide with a real one. The "halved the broken link rate by going from 15 to 8 documents" observation is also consistent with what I've seen on long-context citation tasks; past roughly 6–8 sources the model starts blending fields between adjacent documents, so even legitimate URLs come back with the wrong anchor. Constrained decoding via a regex/grammar over the code alphabet is the heavier-weight version of the same idea if the provider supports it (OpenAI structured outputs, llama.cpp grammars, Outlines), and it makes the hallucination literally impossible rather than detectable after the fact.

Akshat Soni • Jun 4

Really precise framing — "generation problem → classification problem" is exactly how I should have been thinking about it.

On the checksum: in our setup the lookup table is always available at decode time, so find_hallucinated_codes() already rejects anything that wasn't encoded. The checksum would only matter if validation needed to happen somewhere without access to the lookup table, which isn't our case. I also thought about it more: a checksum doesn't fully eliminate hallucination either, since the LLM can still invent a code that happens to have a valid checksum. The table catches both mutations and invented codes, so it's actually the stronger guarantee.

On constrained decoding — this is exactly what we thought about too, but we're streaming responses to the client and constrained decoding doesn't really play well with that. The grammar needs to buffer multiple tokens to validate a complete code like =1#1 before it can pass it forward, which breaks the streaming flow. So encode before, decode after the stream finishes felt like the right tradeoff for our architecture. Curious if you've seen teams solve constrained decoding + streaming together cleanly, because that would genuinely be the proper fix.