- Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Your compliance lead asked "where did the model get that?" You opened the chat log. The answer was right. The citation was missing. The deal stalled for two weeks while you bolted on spans.
This is the most common citation failure I see in regulated RAG: the answer is correct, the user is happy, and then someone in legal asks a question the system can't answer about itself. The [1][2] anchors look citation-shaped but they point at a chunk that no longer exists, or a chunk that does exist but doesn't contain the sentence the model used.
The fix isn't "add citations." You already added citations. The fix is realising that citation is three different patterns serving three different consumers, and most teams ship one of the three and call it done.
Three consumers, three patterns
Think about who actually reads a citation.
The user wants to click a number and see the source. They scan, they trust, they move on. They never read the JSON.
The API client (your own frontend, a downstream service, a partner) wants structured fields it can render or forward. It can't parse English prose. It needs { "chunk_id": "...", "score": 0.84 }.
The auditor (compliance, a regulator, a customer's security team) wants to verify the cited sentence exists in the cited document at the cited offset. They don't trust your chunk IDs. They trust character positions and page coordinates inside the original PDF.
Each consumer needs a different shape. Inline anchors are useless to an auditor. Character offsets are useless to a user. A JSON sidecar with no spans is a citation that the model can fabricate confidently.
The mistake most teams make is picking one pattern and assuming it covers the other two. It doesn't. You need all three, layered, with the inner layers verifiable by the outer ones.
Pattern 1: Inline [1] anchors
This is the UX pattern. It looks like Wikipedia. The model emits prose with bracketed numbers, and you render those numbers as clickable footnotes that scroll the user to a sources panel.
The 2024 SEC rule [1] requires disclosure within
four business days. Materiality is judged from
the registrant's perspective [2].
You build this with a system prompt instruction and a numbered list of retrieved chunks fed into the context:
SYSTEM = """Answer using only the SOURCES below.
After each claim, cite the source by its number
in square brackets, e.g. [1] or [1][3].
Do not cite a source unless it directly supports
the claim."""
def build_prompt(question, chunks):
sources = "\n\n".join(
f"[{i+1}] {c.text}" for i, c in enumerate(chunks)
)
return f"{SYSTEM}\n\nSOURCES:\n{sources}\n\nQ: {question}"
It works. Users read it. Click-through is decent. Ship it for a consumer-facing demo and you'll get compliments.
Now ship it to a bank's legal team. Three things break.
First, the model fabricates anchors. It will cite [4] when you only sent three chunks. Or it will cite [2] for a claim that's actually in chunk [3], because the citation is just another token the model is sampling.
Second, the anchor points at a chunk, not a sentence. The chunk is 800 tokens. The auditor wants to know which sentence in those 800 tokens supports the claim. The anchor can't tell them.
Third, the anchor is positional. If you reindex tomorrow and the same retrieval returns chunks in a different order, [1] now points somewhere else. Old chat logs lie.
Inline anchors are necessary. They're not sufficient.
Pattern 2: Structured citation block
The API pattern. Alongside the answer text, the model emits (or your application attaches) a structured sidecar that names each citation by stable ID and includes enough metadata for any client to render it.
I prefer to keep the sidecar in a separate response field, not embedded in the prose. The model writes prose, your retrieval layer writes the sidecar:
{
"answer": "The 2024 SEC rule [1] requires disclosure within four business days. Materiality is judged from the registrant's perspective [2].",
"citations": [
{
"anchor": 1,
"chunk_id": "sec-rule-2024-08-15:chunk-42",
"doc_id": "sec-rule-2024-08-15",
"doc_title": "SEC Final Rule, Cybersecurity Disclosure",
"doc_url": "https://www.sec.gov/rules/final/...",
"retrieval_score": 0.84,
"retrieval_method": "hybrid:bm25+dense",
"embedding_model": "voyage-3-large@2025-02-01",
"index_version": "prod-2026-05-12"
},
{
"anchor": 2,
"chunk_id": "sec-rule-2024-08-15:chunk-58",
"doc_id": "sec-rule-2024-08-15",
"doc_title": "SEC Final Rule, Cybersecurity Disclosure",
"doc_url": "https://www.sec.gov/rules/final/...",
"retrieval_score": 0.79,
"retrieval_method": "hybrid:bm25+dense",
"embedding_model": "voyage-3-large@2025-02-01",
"index_version": "prod-2026-05-12"
}
]
}
The anchor field maps to the [N] in the prose. The chunk_id is stable across re-renders and survives reindex if you key it on doc+position rather than insertion order. The index_version and embedding_model are the breadcrumbs that let you reproduce the retrieval six months later when someone asks.
Notice what's still missing: where exactly inside the document the cited content lives. If chunk-42 is 800 tokens of regulatory prose, you've narrowed the haystack but you haven't found the needle.
There's another problem: the sidecar can lie about whether the chunk actually supports the claim. Nothing in this JSON proves that the cited sentence is in the cited chunk. The model could be hallucinating the answer, and your retrieval could have returned an irrelevant chunk, and the citation block would still validate against any reasonable schema.
For an API contract this is fine. For an auditor it's not.
Pattern 3: Span-grounded refs
The audit-grade pattern. Each citation carries the exact character offsets in the source document, and for PDFs, the page number plus a bounding box. The point isn't to render the offsets to the user. The point is that any third party can verify the cited span contains the cited claim without trusting your stack at all.
{
"anchor": 1,
"chunk_id": "sec-rule-2024-08-15:chunk-42",
"doc_id": "sec-rule-2024-08-15",
"span": {
"char_start": 18432,
"char_end": 18601,
"text": "Registrants must disclose any cybersecurity incident determined to be material within four business days of that determination."
},
"pdf_locator": {
"page": 14,
"bbox": [72.0, 312.5, 540.0, 348.2]
},
"doc_hash": "sha256:7a3e...91c2",
"extraction_pipeline": "unstructured@0.15.4+layout-v2"
}
A few things to notice.
char_start and char_end index into the canonical text of the document. You store that canonical text alongside the original PDF and you never edit it. If you re-OCR, you bump extraction_pipeline and the offsets get a new version.
doc_hash is a SHA-256 of the source bytes. An auditor with the original PDF can verify they're looking at the same document you cited.
pdf_locator is what lets you highlight inside a PDF viewer. The bounding box is in PDF user-space points, top-left origin flipped to match the viewer convention you've standardised on (PyMuPDF and pdfminer disagree here; pick one and document it).
span.text is the literal substring. It's redundant given the offsets, but it's the thing your verifier actually checks. If the substring isn't in the document at that offset, something has gone wrong upstream.
Getting span-grounded refs requires that you preserve offsets through the entire ingestion pipeline. That's the part most teams skip. You chunk, you embed, you forget where the chunk came from. Then six months later compliance asks and you can't reconstruct it.
Two failure modes nobody warns you about
The first is hallucinated citations. The model will cite a source that supports a claim the source doesn't actually make. It's not lying on purpose. It's pattern-matching: the chunk contained the right keywords, so the model assumed it contained the right claim. The wider your chunks, the worse this gets.
A team I talked to last year shipped a legal-research RAG with inline anchors and a clean sidecar. The model was citing real cases, real paragraphs, real pinpoint references. Six weeks in, a lawyer noticed an opinion was cited for the opposite of what it held. The chunk contained the phrase "the court rejected the argument that..." and the model latched onto the argument, not the rejection. The citation was confident, structured, and wrong.
The second is citation drift after reindex. You reindex with a new chunker, or a new embedding model, or you fix a PDF extraction bug. Old chat logs still contain chunk_id references. Those IDs no longer exist. Or, worse, they do exist but point at different text.
You guard against both with the same tool.
Verifying citations at runtime
Build a verifier that takes a generated answer and its citation block, and for every claim+citation pair, checks two things: the cited span exists in the cited document at the cited offset, and the claim is entailed by the span.
The first check is mechanical. The second needs an LLM call, but a small, cheap one.
import hashlib
import re
from dataclasses import dataclass
@dataclass
class Citation:
anchor: int
chunk_id: str
doc_id: str
span_text: str
char_start: int
char_end: int
doc_hash: str
@dataclass
class VerifyResult:
anchor: int
span_present: bool
span_supports_claim: bool | None
notes: str
def load_doc_text(doc_id: str) -> tuple[str, str]:
# Returns (canonical_text, sha256_of_source_bytes).
# Reads from your durable doc store, NOT the chunk cache.
path = f"/var/rag/docs/{doc_id}.txt"
with open(path, "rb") as f:
raw = f.read()
return raw.decode("utf-8"), hashlib.sha256(raw).hexdigest()
def extract_claim(answer: str, anchor: int) -> str | None:
# Find the sentence in `answer` that ends with [anchor].
# Cheap heuristic; good enough for verification.
pattern = re.compile(
rf"([^.!?]*?\[{anchor}\][^.!?]*[.!?])"
)
m = pattern.search(answer)
return m.group(1).strip() if m else None
def verify_citation(
answer: str,
cit: Citation,
entailment_check, # callable: (claim, evidence) -> bool
) -> VerifyResult:
doc_text, current_hash = load_doc_text(cit.doc_id)
if not current_hash.startswith(cit.doc_hash.split(":")[-1][:8]):
return VerifyResult(
cit.anchor, False, None,
"doc hash mismatch — source changed since citation"
)
actual = doc_text[cit.char_start:cit.char_end]
if actual.strip() != cit.span_text.strip():
return VerifyResult(
cit.anchor, False, None,
"span text does not match offsets"
)
claim = extract_claim(answer, cit.anchor)
if claim is None:
return VerifyResult(
cit.anchor, True, None,
"anchor present in citations but not in prose"
)
supports = entailment_check(claim, cit.span_text)
return VerifyResult(
cit.anchor, True, supports,
"ok" if supports else "claim not entailed by span"
)
entailment_check is a one-shot LLM call. Cheap model, short prompt: "Does the evidence directly support the claim? Answer YES or NO." You're not asking it to write. You're asking it to compare. Haiku-class models do this fine for under a tenth of a cent per check.
Run the verifier on every response in your eval set. Run it on a sampled fraction of production traffic. Alert when span_supports_claim drops. That's your hallucinated-citation early warning.
The doc-hash check is what saves you on reindex. If the document changed, the citation is stale by definition and you flag it before the user sees it.
When to layer all three: reference schema
For a regulated RAG the answer is "always." The three patterns nest:
- Inline anchors live in the prose. They're indices into the citation array.
- The structured citation block is the array. Each entry carries its anchor index and the metadata your API and frontend need.
- The span-grounded ref is a sub-object on each citation entry:
span,pdf_locator,doc_hash,extraction_pipeline.
{
"answer": "...prose with [1][2] anchors...",
"citations": [
{
"anchor": 1,
"chunk_id": "...",
"doc_id": "...",
"doc_title": "...",
"doc_url": "...",
"retrieval_score": 0.84,
"retrieval_method": "hybrid:bm25+dense",
"embedding_model": "voyage-3-large@2025-02-01",
"index_version": "prod-2026-05-12",
"span": {
"char_start": 18432,
"char_end": 18601,
"text": "..."
},
"pdf_locator": {
"page": 14,
"bbox": [72.0, 312.5, 540.0, 348.2]
},
"doc_hash": "sha256:7a3e...91c2",
"extraction_pipeline": "unstructured@0.15.4+layout-v2"
}
],
"verification": {
"verifier_version": "v3",
"all_spans_present": true,
"all_claims_entailed": true
}
}
The verification block is the part auditors actually look at first. If your verifier ran and said both checks passed, that's the contract. The rest is supporting evidence.
UI patterns that make citations clickable
A citation users don't click is a citation that doesn't build trust. Three patterns work, in roughly increasing investment.
The cheapest is hoverable anchors. The [1] is a small interactive element. Hover shows the chunk text, the source title, the score. Click opens a side panel with the full source. This costs you a few hours and lifts perceived trust noticeably. It's the right default.
The middle option is the side-by-side panel. The answer renders on the left, sources on the right. Clicking [1] highlights the source card and scrolls it into view. The card itself shows the chunk text with the cited span highlighted. Lawyers like this layout because it mirrors how they read briefs.
The premium option is document deep-link. Clicking [1] opens the original PDF in an embedded viewer, scrolls to page 14, draws a yellow box around the cited bbox. This is what makes finance and legal users believe you're not bluffing. The plumbing is non-trivial: you need PDF.js or a commercial viewer, you need the bboxes to be accurate, and you need the rendered offsets in your pdf_locator to match what the viewer expects. Budget a week. It's worth it for the deal.
A small detail that matters more than it should: never render a citation that failed verification with the same styling as one that passed. Greyed out, struck through, badged "unverified", any of those work. What doesn't work is hiding the failure. Auditors notice when citations silently disappear between responses.
The two-week deal stall at the start of this post happened because the team had inline anchors and nothing else. They could have shipped a sidecar in an afternoon. They could have added spans in a sprint. What they couldn't do, after the fact, was prove the answer was grounded, because they hadn't kept the offsets. Keep the offsets.
What's the worst citation drift you've debugged in production? Drop it in the comments.
If this was useful
The retrieval chapter in the RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production goes deeper on chunking strategies that preserve span offsets through ingestion, plus the eval patterns you need to catch hallucinated citations before they reach production. If you're shipping RAG into a context where someone is going to ask "where did that come from?", it'll save you the two-week scramble.

Top comments (0)