Citation hallucination has four distinct failure modes — fabricated URLs, retrieve-then-misquote, URL substitution, and anchor-text drift. They look the same in the response but they have different causes and different fixes. A field report on measuring citation faithfulness in production.
The first time I caught it, I assumed I was misreading the response. The user query was about a specific competitor's pricing. The model produced a confident, well-structured answer with three inline citations. I clicked the first citation. The URL was real. The page existed. The page was about the competitor's product. The page did not contain the price the model had stated. The model had retrieved a real document and emitted a sourced-looking claim that the document did not actually support. Confident, well-cited, and wrong.

This is citation hallucination, and once you know to look for it, you cannot unsee it. The shape of the problem is not "the model made up the URL" (though that happens too); it is more often "the model retrieved a real URL and wrote a claim that the URL does not actually justify." The user has no way to tell. A citation looks the same whether the claim is supported or fabricated. The link is blue. The user trusts it. The model has effectively produced a falsehood with an authoritative-looking footnote.
I built the tooling to measure how often this happens because no operator I had talked to could give me a number. Vendors do not publish faithfulness metrics. Public benchmarks for citation correctness are partial and synthetic. The only way to know the rate for your specific use case was to measure it. So I did, on a sample of about a thousand grounded queries across three providers' search-tool APIs, on a mix of factual and product-shaped questions. The numbers are not comforting. The mitigations are real but partial. The post is about both.
This post is the field report. The four distinct classes of citation hallucination I have characterised, with their mechanisms and their detection signatures. The measurement methodology: what to capture, how to compare, what to alert on. The mitigation patterns, ranked by how much they actually move the rate. And the engineering pattern for shipping a citation-faithfulness check as a layer between the model output and the user. The post is technical because the failure is technical: there is no UX-side band-aid for "the model is confidently wrong with a real-looking source." The fix is in the pipeline.
Four Classes Of Citation Hallucination
The four classes I now treat as distinct in audit work:
Class 1: Fabricated URL. The model emits a URL that does not exist, or that exists at a different domain than the model claims. The mechanism is parametric: the model has learned that URLs of a certain pattern (https://[brand].com/about, https://docs.[product].com/api, https://blog.[company].com/2024/...) are plausible, and at generation time it samples a plausible-looking URL even when no actual retrieval happened. When the user clicks the link, they get a 404 or a different domain.
This class is the easiest to detect. If the model is using a tool-call API correctly, the only valid citations are URLs returned by the search tool. Any URL in the response that is not in the search tool's return values is by definition fabricated. The detection is set-difference. The remediation is hard-blocking the response if it contains URLs not in the retrieved set.
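A minimal sketch of that check. The helper names are mine, and the normalisation step is an implementation convenience to avoid false positives from trailing slashes and tracking parameters, not part of the set-difference logic itself:

    from urllib.parse import urlsplit

    def normalise(url):
        # Compare host + path only, so "https://x.com/a/" and "http://www.x.com/a?utm=1" match.
        parts = urlsplit(url)
        return parts.netloc.lower().removeprefix("www.") + parts.path.rstrip("/")

    def fabricated_citations(cited_urls, retrieved_urls):
        # Any cited URL the search tool never returned is, by definition, fabricated.
        retrieved = {normalise(u) for u in retrieved_urls}
        return {u for u in cited_urls if normalise(u) not in retrieved}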
Class 2: Retrieve-then-misquote. The model retrieves a real URL with real content, and the URL is included in the response. The claim adjacent to the citation is not actually supported by the cited page. This is the most common class in the data I have collected, by a wide margin. The mechanism is the model summarising or paraphrasing across multiple retrieved passages and attributing a claim to a source that contains adjacent but not identical content.
A concrete example. The retrieved set contained two pages: page A said "feature X is available in the Pro plan"; page B said "the Pro plan starts at fifty dollars per month." The model produced "Feature X is available starting at fifty dollars per month [page A]." Page A does not say anything about pricing. Page B does not specifically tie pricing to feature X. The claim is true of the union of A and B, but the citation is the wrong one.
This class is harder to detect than Class 1 because the URL is real and is in the retrieved set. Detection requires checking whether the claim appears in (or is logically supported by) the cited page's text. Text-matching catches the easy cases; semantic checking catches the harder cases; neither catches the cases where the claim is genuinely a synthesis across pages.
Class 3: URL substitution. The model retrieves URL X, formulates the claim from X's content, and then emits URL Y in the citation: typically a more "authoritative-sounding" or canonical URL the model has in training memory for the topic. The claim is correct, the cited URL contains roughly the same information, but the URL the model emitted is not the URL that actually informed the model's output.
This is the subtle one and it is more common than I expected. The mechanism is the model's bias toward "good" citations. If the model retrieved a Reddit thread that accurately stated something, but it has training memory of an official documentation URL on the same topic, it will sometimes substitute the documentation URL as the citation. The user gets a valid citation, the substance is correct, but the chain of evidence is not what the model represents it to be. For most use cases this is a softer failure than Class 2; for use cases where audit trails matter (legal, medical, journalistic), it is a hard failure.
A note on detection. When the substituted URL Y is from training memory and not in the retrieved set, the set-difference check from Class 1 already catches it (it shows up as a fabricated URL whose page happens to be real). The Class 3 detection in the methodology below is therefore scoped to the harder case: cited URL Y is in the retrieved set, but the claim is actually supported by a different retrieved URL X. Cases that fall into both buckets get counted under Class 1.
Class 4: Anchor-text drift. The cited URL is real, the page contains the claim, but the surrounding sentence misrepresents what the page says. The model has done semantic compression that subtly changes the meaning. "The product supports OAuth" becomes "the product is OAuth-compliant," which is similar-shaped but is a stronger claim. "The CEO said X" becomes "the company stated X," which generalises in a way the source does not.
This class is the hardest to detect because the citation is technically valid (the URL exists, the page is relevant, the topic matches) but the precise claim has drifted. Detection requires fine-grained semantic comparison between the cited sentence and the source's content. Most automated tools miss it. Human review catches it.
Why Each Class Happens
The four classes have different mechanisms, and the mechanisms inform the fix.
Class 1 (fabricated URL) happens when the model is generating tokens at a position where it has learned URLs typically appear, and the retrieved evidence in the context is weak or absent. The model fills in a URL by sampling from its parametric distribution over plausible URL patterns. If the model is operating in a mode where it has not retrieved anything (failure mode A from the previous post on grounding decisions), the entire citation is fabricated. If retrieval has happened but is sparse, the model may still fabricate when it wants to cite something the retrieval didn't cover.
Class 2 (retrieve-then-misquote) happens during the synthesis step. The model has multiple retrieved passages in context and is generating a coherent response. The natural language compression that produces fluent prose also compresses across sources. The model is good at producing summaries; summaries are by nature less precise than their inputs; the citation gets attached at sentence-level granularity to a sentence that has been compressed across sources.
Class 3 (URL substitution) is a citation-quality bias from post-training. Models have been trained to prefer "high-quality" citations: official documentation, primary sources, canonical references. When the model has a choice between citing a less-authoritative retrieved URL and a more-authoritative remembered URL, the post-training prior pushes toward the latter. The substitution feels right to the model and is invisible to the user.
Class 4 (anchor-text drift) is a manifestation of the same compression problem as Class 2 but at finer grain. Even when the citation is correctly attached to the sentence the source supports, the sentence itself has often drifted from the source's exact phrasing in a way that subtly changes meaning. The drift is fluency-driven: the model wants to produce smooth prose, and smooth prose generalises.
The four mechanisms are different and the mitigations should be different. Lumping them under "the model hallucinated" obscures the structure.
Measurement Methodology

The measurement loop I now use, abstracted:
    def measure_citation_faithfulness(query, response):
        retrieved_urls = extract_retrieved_urls(response)
        cited_urls = extract_cited_urls(response)
        cited_claims = extract_claim_url_pairs(response)

        # Class 1: fabricated URLs
        fabricated = cited_urls - retrieved_urls

        # Classes 2-4: checked per (claim, cited URL) pair
        misquote_count = 0
        substitution_count = 0
        drift_count = 0
        for claim, url in cited_claims:
            if url not in retrieved_urls:
                continue  # already counted as Class 1
            page_text = fetch_and_extract(url)
            if not text_supports_claim(page_text, claim):
                # Check if the claim is supported by some other retrieved URL
                other_urls = retrieved_urls - {url}
                if any(text_supports_claim(fetch_and_extract(u), claim) for u in other_urls):
                    substitution_count += 1  # Class 3
                else:
                    misquote_count += 1  # Class 2
            elif not exact_phrasing_match(page_text, claim, threshold=0.85):
                drift_count += 1  # Class 4

        return {
            "fabricated": len(fabricated),
            "misquote": misquote_count,
            "substitution": substitution_count,
            "drift": drift_count,
            "total_citations": len(cited_claims),
        }
The non-trivial primitive is text_supports_claim. There are three implementation choices:
Exact-substring match. Look for the literal claim text in the page. This catches very few legitimate citations because models paraphrase routinely. Useful as a "definitely supported" upper-bound check.
Embedding similarity. Embed the claim and the page passages, take the maximum cosine similarity, threshold at something like 0.7. Catches paraphrase, misses subtle inversions ("does not support" embeds close to "supports").
NLI / entailment classifier. Use a natural-language-inference model (off-the-shelf or fine-tuned) to check whether the page entails the claim, contradicts it, or is neutral. The most accurate option, also the most expensive.
For production faithfulness checking, a hybrid is what I have ended up running: embedding similarity for the bulk of the work, with NLI escalation for the borderline cases (similarity between 0.6 and 0.8). For sampled audit work, full NLI on every claim is fine because the volume is bounded.
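A sketch of that hybrid, assuming sentence-transformers for the embedding pass; the model name and thresholds are illustrative, and nli_entails is a hypothetical helper standing in for the entailment check described later:

    from sentence_transformers import SentenceTransformer, util

    _embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence embedder works here

    def text_supports_claim(page_text, claim, low=0.6, high=0.8):
        # Split the page into passages and take the best-matching one.
        passages = [p for p in page_text.split("\n") if p.strip()]
        if not passages:
            return False
        claim_emb = _embedder.encode(claim, convert_to_tensor=True)
        passage_embs = _embedder.encode(passages, convert_to_tensor=True)
        best_sim = util.cos_sim(claim_emb, passage_embs).max().item()

        if best_sim >= high:
            return True   # paraphrase-level support, accept without escalation
        if best_sim < low:
            return False  # clearly unrelated
        # Borderline band: escalate to NLI.
        # nli_entails: hypothetical helper, True iff the NLI model labels page -> claim as entailment.
        return nli_entails(premise=page_text, hypothesis=claim)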
The "extract_cited_urls" and "extract_claim_url_pairs" steps depend on the API being used. Anthropic's Messages API with citations returns structured citation objects tied to spans in the response. OpenAI's Responses API includes annotations with URL citations. Gemini's groundingMetadata returns a list of cited URIs. Each provider's structure is different, but each provider exposes structured citation data; the parsing is per-provider plumbing.
Once the measurement is in place, the metrics to track over time:
- Hallucinated citation rate = fabricated / total_citations. The Class 1 rate. Should be near zero for any well-configured tool API.
- Misquote rate = misquote / total_citations. The Class 2 rate. The ugly one. In my measurements, this clusters in the single-digit-percent range across providers, with significant variance by query class.
- Substitution rate = substitution / total_citations. The Class 3 rate. Low single-digit percent.
- Drift rate = drift / total_citations. The Class 4 rate. Hardest to measure; my measurements suggest it is more common than the others combined.
These are aggregate rates. The interesting analysis is conditional rates. Misquote rates are higher on multi-source synthesis queries than single-source factual queries. Drift rates are higher on long responses than short ones. Substitution rates are higher when the retrieval included low-authority sources (Reddit, forums) and the model had high-authority training memory available for the topic.
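A minimal aggregation sketch for those conditional rates, assuming each result from the measurement loop has been tagged with a query_class label upstream (the tagging itself is your own classification step, not something the loop produces):

    from collections import defaultdict

    def conditional_rates(results):
        # results: dicts from measure_citation_faithfulness, each with a "query_class" tag
        # such as "single_source_factual" or "multi_source_synthesis".
        totals = defaultdict(lambda: defaultdict(int))
        for r in results:
            qc = r["query_class"]
            for key in ("fabricated", "misquote", "substitution", "drift", "total_citations"):
                totals[qc][key] += r[key]
        return {
            qc: {k: t[k] / t["total_citations"]
                 for k in ("fabricated", "misquote", "substitution", "drift")}
            for qc, t in totals.items() if t["total_citations"]
        }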
Mitigations, Ranked
What you can actually do about it. Ranked by impact in my experience.
1. Hard-block Class 1 in production. The simplest, highest-leverage mitigation. Every cited URL must be in the retrieved set. If a URL appears in the response that is not in the search tool's return, drop the entire response and either retry or fall back to a non-grounded answer with an "I could not verify a source for this claim" caveat. This is a few lines of code in your response post-processing layer (a sketch appears just after this list). It eliminates Class 1 at the cost of a small fraction of responses being suppressed. The trade-off is overwhelmingly favourable for any application where citation correctness matters.
2. Sentence-level citation faithfulness check before display. For Class 2 and Class 4, run an embedding or NLI check on each cited sentence before showing it to the user. If the cited URL does not entail the sentence, either suppress the citation (showing the sentence without a source) or flag it for human review. This adds latency and cost, and for high-throughput systems it has to be sampled rather than full-coverage. For lower-throughput, high-stakes applications, full-coverage is realistic.
3. Tool-prompt nudges toward higher-quality retrieval. A common driver of Class 2 is that the retrieval did not return the actually-relevant URL; it returned adjacent URLs, and the model synthesised across them. Better retrieval reduces this. Tool descriptions that emphasise specificity ("Search for the exact claim being made, with the brand name and the metric") often produce better retrieved sets. This is a soft lever; impact varies.
4. Discourage URL substitution in the system prompt. A system instruction like "When citing, cite only URLs that were returned by the search tool; do not cite URLs from your training memory" reduces Class 3 substantially. It does not eliminate it because the model's prior on citation quality is strong. Combined with mitigation 1, you get reasonable coverage.
5. Verbosity controls. Class 4 (drift) is partly fluency-driven. Constraining the model to shorter, more direct phrasing reduces drift at the cost of less natural-sounding prose. For audit-grade outputs, this is the right trade. For consumer-facing chat, it is too restrictive.
6. Show the user the cited passage. Rather than just a URL, show the user the actual quoted sentence from the source page. This shifts the trust burden from the citation-as-link to the citation-as-quote. The user can see whether the quote supports the claim. This is the UX-side mitigation that most improves user-perceived trust, even though it does not change the underlying rate.
7. Diversify citation sources. Models are more likely to substitute when there is a single "obvious" canonical URL for a topic. When the retrieval returns five different sources, the model is biased toward citing the actually-retrieved sources rather than substituting. Forcing retrieval breadth in the search tool's parameters reduces Class 3.
The mitigations are real but each one has a trade-off. There is no zero-cost fix.
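The hard-block from mitigation 1, sketched as a post-processing step. parse_response_with_citations is the same parser used in the middleware below; regenerate and fallback_without_citations are placeholders for whatever retry and fallback paths your pipeline already has:

    def postprocess_grounded_response(response, max_retries=1):
        parsed = parse_response_with_citations(response)
        fabricated = {c.url for c in parsed.citations} - set(parsed.retrieved_urls)
        if not fabricated:
            return parsed
        if max_retries > 0:
            # A fabricated citation makes the whole response untrustworthy; regenerate it.
            return postprocess_grounded_response(regenerate(response.request), max_retries - 1)
        # Out of retries: fall back to a non-grounded answer with an explicit caveat.
        return fallback_without_citations(
            response.request,
            caveat="I could not verify a source for this claim.",
        )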
A Production-Ready Citation Check

What I have been deploying for clients is a lightweight middleware layer between the model response and the user output. The rough shape:
    async def render_with_citation_check(model_response):
        parsed = parse_response_with_citations(model_response)
        for citation in parsed.citations:
            if citation.url not in parsed.retrieved_urls:
                citation.status = "FABRICATED"
                citation.display = None  # suppress the link entirely
                continue
            page_text = await fetch_with_cache(citation.url)
            entailment = await check_entailment(citation.claim, page_text)
            if entailment.label == "entailed":
                citation.status = "VERIFIED"
                citation.display = citation
            elif entailment.label == "contradicted":
                citation.status = "CONTRADICTED"
                citation.display = None
                log_warning(citation)
            else:  # neutral
                citation.status = "UNVERIFIED"
                citation.display = citation_with_warning_icon(citation)
        return render(parsed)
The entailment check uses a small NLI model (the off-the-shelf roberta-large-mnli or one of the modern instruction-tuned classifiers) and runs in a few hundred milliseconds per claim. For high-throughput systems, the check is sampled; for lower-throughput high-stakes systems, every citation gets checked.
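A sketch of the synchronous core of that check using the roberta-large-mnli checkpoint from Hugging Face. Long pages would need chunking, which is elided here, and the async check_entailment in the middleware would run this off the event loop (for example via run_in_executor):

    import torch
    from collections import namedtuple
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    Entailment = namedtuple("Entailment", "label score")
    _tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
    _nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

    def check_entailment_sync(claim, page_text):
        # Premise = source page (truncated to the model window), hypothesis = cited claim.
        inputs = _tok(page_text, claim, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            probs = _nli(**inputs).logits.softmax(-1).squeeze()
        idx = int(probs.argmax())
        raw = _nli.config.id2label[idx].lower()  # contradiction / neutral / entailment
        label = {"entailment": "entailed", "contradiction": "contradicted"}.get(raw, "neutral")
        return Entailment(label=label, score=float(probs[idx]))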
The trade-off is latency. A citation-faithfulness layer adds ~500ms per citation in the hot path. For a response with three citations, that's an extra 1.5 seconds. For some applications this is unacceptable; for others it is the price of a trustworthy product. The architectural decision is whether you can afford the latency for the trust improvement.
A pattern that has worked well for moderate-throughput products: do the citation check asynchronously, ship the response immediately with citations marked "verifying...", and update the UI when the check completes. The user sees an immediate response and a slightly delayed verification badge. The latency cost is decoupled from time-to-first-byte.
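A sketch of that decoupled pattern, assuming an asyncio service; send_to_user and push_citation_update are stand-ins for whatever your response and UI-update transports are (e.g. a websocket), and the fetch, entailment, parse, and render helpers are the same ones used in the middleware above:

    import asyncio

    async def respond_then_verify(model_response, push_citation_update):
        parsed = parse_response_with_citations(model_response)
        for citation in parsed.citations:
            citation.status = "VERIFYING"   # rendered immediately as a pending badge
        send_to_user(render(parsed))        # time-to-first-byte is unaffected

        async def verify(citation):
            page_text = await fetch_with_cache(citation.url)
            entailment = await check_entailment(citation.claim, page_text)
            citation.status = "VERIFIED" if entailment.label == "entailed" else "UNVERIFIED"
            await push_citation_update(citation)  # UI swaps the badge when this lands

        await asyncio.gather(*(verify(c) for c in parsed.citations))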
An End-to-End Worked Example
The middleware I described in the previous section is abstract. A worked example clarifies what shipping it actually looks like.
The system: a B2B SaaS Q&A bot that answers customer questions about pricing, features, and integrations. Built on Anthropic's Claude with the web search tool enabled, deployed in production for six months before we instrumented faithfulness checking. The customer-facing surface is a chat interface. Each model response can include cited URLs.
The first measurement run was a sample of 500 grounded responses, processed through the four-class detector. The breakdown:
- Class 1 (fabricated URL): 4 cases out of ~1200 total citations. Sub-1% rate. Low because the API surface makes fabrication hard when used correctly.
- Class 2 (retrieve-then-misquote): 67 cases. Roughly 5–6% of citations. The dominant failure mode.
- Class 3 (URL substitution): 14 cases. Roughly 1% of citations. Lower than expected.
- Class 4 (anchor-text drift): hard to measure with confidence; the manual-review subset suggested it was the most common class but the automated detection was noisy. Conservatively 8–12% of citations had some degree of phrasing drift.
The mitigations we shipped:
- Hard-block on Class 1: any response with a citation outside the retrieved set was suppressed and re-generated. Eliminated Class 1 from the user-visible surface entirely. Added ~2% retry rate on responses, no perceptible latency impact for users.
- Sentence-level entailment check on Class 2: every cited claim was checked against the cited page using an off-the-shelf NLI classifier. Claims that scored "neutral" or "contradicted" were flagged. The flagged citations were not shown as links; the underlying claim was kept in the response with a "(unverified)" annotation. Reduced user-trust complaints substantially.
- Tool-prompt nudge for retrieval breadth: the search tool description was changed to encourage diverse queries rather than a single focused query. This raised the diversity of the retrieved set and reduced Class 3 by giving the model more legitimately-retrieved options to cite.
- Async citation verification UI pattern: for latency-sensitive queries, the response was shipped with citations marked "verifying," and the verification status updated in the UI when the entailment check completed (typically 500–1500ms after first response).
Four months after deployment: customer-reported "AI lied to me" tickets dropped meaningfully. The rate of unverified-citation flags became a tracked metric. The team had a dashboard showing per-week trends in each of the four classes.
The cost: a few weeks of engineering work to build the middleware, ongoing API spend for the entailment classifier (modest: NLI inference is cheap compared to the LLM call it is verifying), and the discipline to keep the dashboard maintained. The benefit: a product that can be defended in front of a customer who asks "how do you ensure the AI's citations are accurate." Before the middleware, the answer was "we trust the model." After, it is "we verify every citation against the cited source before showing it to you."
This is the engineering pattern. It is not a research project. It is a few-hundred-line middleware layer plus an off-the-shelf classifier plus a dashboard. The novel work is the four-class taxonomy that lets you scope the problem; once that is in place, the implementation is conventional.
What This Looks Like Across Providers
A few notes on what I have observed at the provider level. These are observations, not benchmarks; the rates shift over time as providers update their retrieval and post-training.
Anthropic's Claude with web search. Citation faithfulness on the high end of what I have measured. The structured citation API surface (each cited text span carries a web_search_result_location block with the source url, title, and a cited_text excerpt) makes the measurement infrastructure easier to build than for other providers. Class 1 rates near zero in my data; Class 2 and Class 4 are the failure modes worth watching.
OpenAI's Responses API web search. Citation faithfulness comparable to Claude in my data, with somewhat different failure-mode distribution. The structured citation annotations let you do similar measurement; the hallucinated-URL rate is low; the misquote rate is similar.
Gemini's grounding. Grounding metadata is structured and rich. The citation-to-claim mapping is sometimes coarser than the other two providers (citations attach to longer spans), which affects how cleanly you can measure misquote at the sentence level.
Perplexity. Search-first, always grounded for default models. The citation rate is high (almost every claim has a citation), and the URLs are typically real. The Class 2 rate (retrieve-then-misquote) is non-trivial because the high citation rate means more citations to be wrong about.
Open-source RAG (LangChain, LlamaIndex, custom). Citation faithfulness depends entirely on how the pipeline was built. There is no provider-side post-training to lean on; the engineering is yours.
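For the custom-pipeline case, the prerequisite is simply that every chunk you put in front of the model keeps its source URL, so the same four-class loop applies. A minimal sketch, with the retriever and its chunk attributes (search, source_url, text) assumed rather than tied to any particular framework:

    def build_grounded_context(query, retriever, top_k=5):
        # Keep (source_url, chunk_text) pairs end to end so citations can be verified later.
        chunks = retriever.search(query, top_k=top_k)  # your vector / keyword retriever
        retrieved = [(c.source_url, c.text) for c in chunks]
        context = "\n\n".join(f"[{url}]\n{text}" for url, text in retrieved)
        retrieved_urls = {url for url, _ in retrieved}
        return context, retrieved_urls  # retrieved_urls feeds measure_citation_faithfulness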
The Synthesis
Citation hallucination is not a single failure mode. It is four failure modes that produce visually identical output and have different mechanisms. Treating them as one obscures the engineering response. Treating them as four lets you build the measurement that catches each one and the mitigation that reduces each one.
The single sentence: AI citation failures cluster into fabricated URLs, retrieve-then-misquote, URL substitution, and anchor-text drift, and each class has a distinct measurement and a distinct mitigation that operators who care about citation correctness should ship as a verification layer between the model and the user.
Three things to internalise from this post, in order:
- Citation hallucination has four classes, not one. Fabricated URL (Class 1), retrieve-then-misquote (Class 2), URL substitution (Class 3), anchor-text drift (Class 4). Treating them separately is what unlocks the right measurement and the right fix per class.
- Class 1 is cheap to eliminate. Classes 2–4 require infrastructure. Hard-blocking responses with non-retrieved URLs is a few lines of code. Sentence-level entailment checking is a real engineering investment but is what turns "AI with citations" into "AI with trustworthy citations."
- The faithfulness layer is not optional for high-stakes use cases. Legal, medical, journalistic, regulatory any domain where being confidently wrong with an authoritative-looking source is harmful. For those domains, the citation-check middleware is part of the product, not optional polish.
The honest assessment is that we are in the early-internet phase of AI-cited information. Search engines spent two decades building trust signals around results (favicons, snippets, breadcrumbs, knowledge panels) that let users assess whether to click. AI assistants are emitting confident citations without the equivalent trust scaffold, and the failure modes are visible to anyone who looks. The operators who build the verification layer first will have a meaningful trust advantage in the products they ship. The operators who treat the citation as a black box and trust the model to get it right will keep producing the kind of confidently-wrong outputs that erode user trust.
The model is good at sounding right. The model is okay at being right. Closing the gap is the engineering work of the next two years.
The four-class taxonomy and the rate observations are based on running the measurement methodology described above on a sample of approximately one thousand grounded queries across Anthropic's web search tool, OpenAI's Responses API web search, and Gemini's grounding. The exact rates I have observed are not reported as quantitative benchmarks because the rates depend heavily on query mix, query class, and provider release; the qualitative claims (Class 2 is most common, Class 4 is hardest to measure, Class 1 is near zero with proper tool use) replicate across the providers I have tested. The measurement methodology is mechanistic and will apply to any provider that returns structured citation data alongside retrieved URLs. Provider behaviour shifts; verify against current docs and your own measurements before committing to a faithfulness-rate target.
Published by Cihangir Bozdogan