The Production Bug
We were building a RAG application over legal documents.
Users ask legal questions, the system retrieves relevant law sections and
court decisions, the LLM synthesizes an answer with citations.
The citations are the whole point. A legal answer without a working
source link is useless — worse than useless, because it looks credible.
Then users started reporting broken links.
When I dug in, I found two distinct failure patterns:
Pattern 1 — Invented URLs. The LLM would cite a section that
simply didn't exist. Not a wrong document — a wrong section within
a real document. Something like referencing #paragraf-99 in a law
that only goes up to #paragraf-67. Confident, plausible, wrong.
Pattern 2 — Half-broken URLs. The source URL was real and present
in the context. The LLM had read it. But what it returned was subtly
mutated:
# What we gave the LLM:
4552013#paragraf-31.odsek-2.pismeno-a
# What the LLM gave back:
4552013#paragraf-3.odsek-2.pismeno-a ← dropped a digit
4552013#paragraf-31.odsek-2.pismeno-b ← flipped last letter
4552013#paragraf-31.odsek-2 ← truncated the path
Every mutation was plausible. None of them existed. And the links
rendered fine in the UI — you only found out they were dead when
you clicked.
What We Tried First
Prompt engineering. We tried few-shot prompting — giving the model
explicit examples of what a correct citation looks like and what an
incorrect one looks like. Here's the accepted format, here's what to
avoid. The model understood the examples perfectly. It would even
echo them back correctly during testing. But in real responses with
real legal URLs it still mutated fragments. The examples helped at
the edges but the core problem remained — the model isn't
ignoring your examples, it genuinely cannot tell the difference
between the real URL and its mutated version because both look
equally valid to it.
Reducing context. We cut down the number of documents we injected
per prompt. Fewer URLs competing for attention meant fewer mutations.
This actually helped noticeably — going from 15 documents to 8
cut the broken link rate roughly in half. But the underlying problem
remained, and 8 documents wasn't always enough for complex legal queries.
More batches. Instead of one big LLM call with everything, we split
into smaller focused calls. Again, helped but didn't fix. And it added
latency and cost.
None of these got us below an acceptable error rate. Broken legal
citations were still making it to users.
The "What If" Moment
I was staring at one of these mutations trying to understand the
pattern when something clicked.
These URLs are long, noisy, and nearly identical to each other.
From a token perspective they look like high-frequency boilerplate.
The model isn't reading them as precise addresses — it's reading
them as variations on a pattern and then reconstructing them
from memory.
What if we just... didn't give it the real URLs at all?
The idea: replace every URL with a short numeric code before
sending to the LLM. Keep a lookup table on our side. After the
response, swap the codes back.
# Before sending to LLM:
[§ 31](4552013#paragraf-31.odsek-2.pismeno-a) → [§ 31](=1#1)
[§ 65a](4552013#paragraf-65a) → [§ 65a](=1#2)
[Article 5](/SK/ZZ/1964/40#article-5) → [Article 5](=2#1)
# LLM works with short codes only.
# After LLM responds:
[§ 31](=1#1) → [§ 31](4552013#paragraf-31.odsek-2.pismeno-a) ✅
[§ 65a](=1#2) → [§ 65a](4552013#paragraf-65a) ✅
[Article 5](=2#1) → [Article 5](/SK/ZZ/1964/40#article-5) ✅
The = prefix makes codes visually distinct — the model treats
them as opaque tokens rather than values to reason about. It can't
mutate =1#1 because there's nothing meaningful to mutate it into.
Implementation is a bijective map — base URIs get integer IDs,
fragments within each base get sub-indices:
from uri_shortener import UriShortener
shortener = UriShortener()
# Encode before LLM
encoded_context = shortener.encode_text(raw_context)
# Send encoded_context to LLM, get llm_response back
# Check for hallucinated codes before decoding
hallucinated = shortener.find_hallucinated_codes(llm_response)
# e.g. {"=1#99"} — LLM invented fragment 99, which was never encoded
# Decode back to real URLs
final_response = shortener.decode_text(llm_response)
We also got a nice side effect: token savings. Those long legal
URLs are expensive. Replacing them with =1#1 style codes
meaningfully cut context token usage — the exact savings depend
on how many URLs your context has and how long they are, but
the more fragment-heavy your documents are, the more you gain.
What It Fixed
The invented URLs problem: mostly solved. If the LLM invents
a code like =1#99 and we only encoded fragments =1#1 through
=1#3, we can detect it immediately with find_hallucinated_codes()
before it ever reaches the user. This was a significant win.
The Gemini Parallel
While researching this problem, I noticed something interesting
in how Google Gemini handles citation links. In Gemini's responses,
the links in the answer aren't fully resolved during streaming —
they get resolved after the stream finishes. We followed the same
pattern: encode before the LLM call, stream the response with short
codes, decode after the stream completes.
Seeing a production system at that scale doing something
conceptually similar gave us confidence we were on the right track.
We Haven't Fully Solved It
The encoding approach took us from "broken links regularly reaching
users" to "most links work." That's a meaningful improvement in a
legal context where every citation matters. But the half-broken
URL problem — where the LLM generates something close to a valid
code but not quite — is still open.
If you've hit something similar in production — hallucinated URLs,
mutated identifiers, broken citations in any domain — I'd genuinely
like to know what worked for you. Fine-tuning? Validation layers?
Something completely different?
The full demo is here:
👉 github.com/AkshatSoni26/promptlinks
Drop what worked for you in the comments.
Top comments (0)