Łukasz Blania

Posted on Jun 2

Stop AI articles from hallucinating: retrieve, then write

#ai #webdev #tutorial

Last year a single fabricated statistic in a published article cost me a client's trust. The model wrote "73% of small businesses report X" with total confidence. The number did not exist. Nobody had ever measured it. The client found out before I did.

That one sentence taught me more than any prompt engineering thread ever did. I have since shipped 63,000 articles through a production pipeline, and the thing that moved quality the most was not a better model or a cleverer prompt. It was deciding that the model is not allowed to know anything it cannot cite.

This post is the pattern I use to get there. It works with any LLM and any framework.

TL;DR

A language model writing from memory will invent facts that sound plausible. The fix is to retrieve real sources first, extract the facts you trust, and then force the model to write only against those facts. You build a small "fact context" per article and treat anything outside it as off-limits. I will show the three stages with code and the validation pass that catches what slips through.

The actual problem

People frame hallucination as a model defect. It is, partly. But in a content pipeline it is mostly an architecture choice.

When you send a prompt like "Write a 1,500-word article about peptide bioavailability," you are asking the model to generate fluent text on a topic. Fluency is what it optimizes for. Truth is a side effect, drawn from training data that may be old, averaged, or simply wrong for this specific claim.

The model has no signal telling it "you do not actually know this number, so do not state it." So it states it anyway, in the same confident voice it uses for things it does know. That confidence is the dangerous part. A reader cannot tell a real stat from an invented one by tone alone.

So the goal is not "make the model smarter." The goal is to remove its permission to free-associate facts.

Stage 1: Retrieve before you write

Before generating anything, gather real material about the topic. I pull from three source types because they fail in different ways:

A live web search for current facts and recent events
An encyclopedia source for stable entities and definitions
A synthesis source for harder questions that need reasoning across pages

Here is a trimmed version of the retrieval step.

async function gatherSources(topic, keywords) {
  const queries = [topic, ...keywords].slice(0, 5);

  const [web, encyclopedia, synthesis] = await Promise.all([
    braveSearch(queries.join(" ")),      // current facts
    wikipediaLookup(topic),              // stable entities
    perplexityAsk(`Key facts about ${topic} with sources`), // reasoning
  ]);

  return normalizeSources([...web, encyclopedia, ...synthesis]);
}

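// Keep only usable sources and trim each one to a fixed size.
function normalizeSources(raw) {
  return raw
    .filter((s) => s && s.text && s.url) // a failed lookup may return null
    .map((s) => ({
      url: s.url,
      title: s.title || "",
      text: s.text.slice(0, 4000), // keep token budget sane
    }));
}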

The output is a list of source objects, each with a URL and a chunk of real text. Nothing here is generated. It is all pulled from somewhere a human could go and verify.
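
One caveat with the retrieval step: Promise.all rejects as soon as any single call fails, so one provider outage aborts the whole article. Below is a minimal sketch of a more tolerant variant, assuming you would rather continue with whatever sources did come back. The names collectFulfilled and gatherTolerant are hypothetical helpers for illustration, not part of the pipeline described here; Promise.allSettled is standard JavaScript.

```javascript
// Hypothetical helper: flatten Promise.allSettled results into one list,
// skipping calls that rejected or returned nothing.
function collectFulfilled(settled) {
  const out = [];
  for (const r of settled) {
    if (r.status !== "fulfilled" || r.value == null) continue;
    // A lookup may return a single source or an array of them.
    out.push(...(Array.isArray(r.value) ? r.value : [r.value]));
  }
  return out;
}

// Assumption: `fetchers` is an array of zero-argument async functions,
// e.g. () => braveSearch(query). Any of them may reject.
async function gatherTolerant(fetchers) {
  const settled = await Promise.allSettled(fetchers.map((f) => f()));
  return collectFulfilled(settled);
}
```

The result can be passed to normalizeSources unchanged. If every call fails you get an empty list, which is a good moment to stop rather than let the writer proceed with no facts.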

Stage 2: Extract the facts you trust

Raw sources are noisy. The next step turns them into a short list of clean, attributed claims that the writer is allowed to use. I run this through the model too, but with a tight job: pull claims out, do not add any.

const EXTRACT_PROMPT = `
You are a fact extractor. From the SOURCES below, list atomic factual
claims relevant to the topic "{{topic}}".

Rules:
- Every claim must be supported by the source text. Do not infer.
- Attach the source url to each claim.
- If a claim includes a number, the number must appear in the source.
- Do not add background knowledge. Only what is in the sources.

Return JSON: [{ "claim": string, "url": string }]
`;

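async function extractFacts(topic, sources) {
  const sourceBlock = sources
    .map((s, i) => `SOURCE ${i + 1} (${s.url}):\n${s.text}`)
    .join("\n\n");

  // Function replacer: any "$" sequence in the topic is inserted literally
  const prompt = EXTRACT_PROMPT.replace("{{topic}}", () => topic);
  const res = await llm.json(prompt + "\n\nSOURCES:\n" + sourceBlock);

  // drop anything the model failed to attribute, or attributed to a URL
  // that was never retrieved
  const knownUrls = new Set(sources.map((s) => s.url));
  return (Array.isArray(res) ? res : []).filter(
    (f) => f && f.claim && knownUrls.has(f.url)
  );
}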

Now you have a fact sheet. Each item is a claim plus a URL. This is the only knowledge the writer gets to use. If a fact is not on this sheet, it does not go in the article.

The reason this works: extraction is a much easier task than generation. Asking "is this claim in the text in front of you" is close to a lookup. Asking "write a true article about X" is open-ended. The narrower the task, the less room the model has to invent.
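
The prompt rule "the number must appear in the source" can also be checked in code instead of trusted. Here is a rough sketch, assuming numbers are written as digits in both the claim and the source. The helpers numbersIn, numbersGrounded, and dropUngroundedNumbers are hypothetical names, and spelled-out numbers ("three quarters") would slip past this check.

```javascript
// Pull digit-based numbers out of text and strip thousands separators,
// so "4,000" and "4000" compare equal.
function numbersIn(text) {
  const matches = text.match(/\d[\d,]*(?:\.\d+)?/g) || [];
  return matches.map((m) => m.replace(/,/g, ""));
}

// True if every number in the claim also occurs in the source text.
// A claim with no numbers passes trivially.
function numbersGrounded(claim, sourceText) {
  const inSource = new Set(numbersIn(sourceText));
  return numbersIn(claim).every((n) => inSource.has(n));
}

// Keep only facts whose numbers appear in the source they cite.
// Assumes facts are { claim, url } and sources are { url, text }.
function dropUngroundedNumbers(facts, sources) {
  const textByUrl = new Map(sources.map((s) => [s.url, s.text]));
  return facts.filter((f) =>
    numbersGrounded(f.claim, textByUrl.get(f.url) || "")
  );
}
```

This only proves the number exists somewhere in the cited source, not that it is attached to the same subject, so it complements the extraction prompt rather than replacing it.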

Stage 3: Write against the facts, not from memory

Now generation. The prompt changes shape completely. Instead of "write about the topic," it becomes "write using only these approved facts."

const WRITE_PROMPT = `
Write a section titled "{{heading}}" for an article about "{{topic}}".

You may ONLY use facts from the APPROVED FACTS list below.
- Do not introduce statistics, dates, or named studies that are not listed.
- If you need a fact that is not approved, write around it without inventing one.
- Keep claims attributable. Prefer specific over vague.

APPROVED FACTS:
{{facts}}
`;

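async function writeSection(topic, heading, facts) {
  const factList = facts
    .map((f) => `- ${f.claim} [${f.url}]`)
    .join("\n");

  // Function replacers keep "$" sequences (e.g. "$&" in a claim or URL)
  // from being treated as replacement patterns
  return llm.text(WRITE_PROMPT
    .replace("{{topic}}", () => topic)
    .replace("{{heading}}", () => heading)
    .replace("{{facts}}", () => factList));
}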

The model still writes fluent prose. It still controls phrasing, flow, and structure. What it lost is the ability to reach outside the fact sheet for a number that "feels right."

When the writer needs a statistic that is not approved, the instruction tells it to write around the gap instead of filling it with a guess. In practice this produces sentences like "peptide absorption varies with delivery method" instead of "peptide absorption improves by 4x with method Y," where the 4x was never real.

The validation pass that catches the rest

Constraints reduce hallucination. They do not zero it out. So I run a cheap check after generation: pull every number and named claim out of the draft and confirm each one traces back to a source.

async function validateDraft(draft, facts) {
  const claims = await extractClaimsFromDraft(draft); // numbers + named facts
  const approved = facts.map((f) => f.claim.toLowerCase());

  const unsupported = [];
  for (const c of claims) {
    const grounded = approved.some((a) => overlaps(a, c.text.toLowerCase()));
    if (!grounded) unsupported.push(c);
  }

  return {
    clean: unsupported.length === 0,
    unsupported, // flag these for a rewrite or human review
  };
}

Anything in unsupported is a claim the writer produced that the fact sheet does not back. You can route those to a rewrite of just that paragraph, or to a human. Either way, the fabricated stat never reaches publish without someone seeing it first.
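
Routing to "just that paragraph" means knowing which paragraph a flagged claim sits in. Here is a small sketch of that lookup, assuming paragraphs are separated by blank lines and each unsupported claim carries the exact text it was pulled from. paragraphsToRewrite is a hypothetical helper name.

```javascript
// Split a draft into paragraphs and return the indexes of those that
// contain at least one unsupported claim.
// Assumes each claim is { text } and that text appears verbatim in the draft.
function paragraphsToRewrite(draft, unsupported) {
  const paragraphs = draft.split(/\n\s*\n/);
  const flagged = [];
  paragraphs.forEach((p, i) => {
    // Skip empty claim text, which would otherwise match every paragraph.
    if (unsupported.some((c) => c.text && p.includes(c.text))) flagged.push(i);
  });
  return { paragraphs, flagged };
}
```

Only the flagged indexes need a second generation call; the rest of the draft stays untouched.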

This pass is where the client incident from the opening would have been caught. The fake "73%" would land in unsupported, because no source ever contained it.
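
The validateDraft snippet leaves extractClaimsFromDraft and overlaps undefined. The following is one possible way to fill them in, as a rough sketch and not the production versions. It covers numeric claims only: extractNumericClaims is a regex-based stand-in for the claim extractor, and this overlaps is an assumed heuristic (matching numbers plus a couple of shared words) that expects lowercased input, as in validateDraft.

```javascript
// Stand-in for extractClaimsFromDraft, numbers only: returns every
// sentence that contains a digit, as { text }.
function extractNumericClaims(draft) {
  return draft
    .split(/(?<=[.!?])\s+/) // split after sentence-ending punctuation
    .filter((s) => /\d/.test(s))
    .map((s) => ({ text: s.trim() }));
}

// Assumed heuristic: a draft claim is grounded in an approved fact only if
// every number in the claim appears in the fact AND the two share at least
// `minShared` words of four or more letters. Inputs must be lowercase.
function overlaps(approved, claim, minShared = 2) {
  const nums = (t) => t.match(/\d+(?:\.\d+)?/g) || [];
  const words = (t) => new Set(t.match(/[a-z]{4,}/g) || []);

  const approvedNums = new Set(nums(approved));
  if (!nums(claim).every((n) => approvedNums.has(n))) return false;

  const approvedWords = words(approved);
  let shared = 0;
  for (const w of words(claim)) if (approvedWords.has(w)) shared++;
  return shared >= minShared;
}
```

With this heuristic a sentence containing "73%" cannot match any approved fact unless some fact also contains 73. Named studies and dates written in words need a different extractor, which is where a model-based one fits.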

A few things that made it better

Cap source text per chunk. Long sources blow your token budget and bury the useful lines. 4000 characters per source has been a fine ceiling for me.
Extract facts once, reuse across sections. Running extraction per section wastes calls and produces inconsistent fact sheets. One sheet per article keeps the whole piece consistent.
Keep URLs attached the whole way through. Even if you never render citations, carrying the URL lets validation and human review trace any claim back fast.
Prefer "write around the gap" over "omit the fact." Telling the model to omit makes it drop whole sentences. Telling it to write around the gap keeps the prose flowing while staying honest.
Log the unsupported claims over time. The patterns tell you which topics your sources are too thin on, which is a retrieval problem, not a writing one.
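
The logging tip is easiest to act on as a per-topic tally. Here is a minimal sketch, assuming each validation run is stored as an entry shaped like { topic, unsupported }. That log shape and the name tallyUnsupported are assumptions for illustration.

```javascript
// Count unsupported claims per topic across validation log entries.
// Assumed entry shape (hypothetical): { topic: string, unsupported: Array }.
function tallyUnsupported(entries) {
  const counts = new Map();
  for (const e of entries) {
    counts.set(e.topic, (counts.get(e.topic) || 0) + e.unsupported.length);
  }
  // Highest counts first: these topics most likely need better retrieval.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

The topics at the top of the list are the ones where the writer keeps reaching for facts the sources did not supply.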

Reality check: when this is overkill

This pattern has real cost. Three retrieval calls, an extraction call, and a validation call sit on top of generation. That is latency and money per article. It is not free.

Skip it when the content is not factual. Personal essays, opinion pieces, brand storytelling, and creative copy do not have facts to ground. Forcing a fact sheet onto a founder's opinion post just makes it stiff.

Skip it when a human reviews every word anyway. If an editor fact-checks each article before it ships, the validation pass is redundant. The grounding still helps the first draft, but the full machinery earns its keep mostly at volume, where no human reads every line.

And be honest about the ceiling. Grounding stops the model from inventing facts. It does not check whether your sources are correct. Garbage sources produce confidently grounded garbage. The retrieval quality is now your real bottleneck, which is a better problem to have but still a problem.

Stack notes for the curious

The version I run in production uses Brave Search for live web results, Wikipedia for entities, and Perplexity for synthesis questions. The extraction and writing both run on a frontier model with JSON mode for the structured steps. Sections generate in parallel, then stitch, which is a separate pattern worth its own post. The fact sheet is the spine that keeps the parallel sections from contradicting each other.

I built this into Articfly (articfly.com) because at 63,000 articles across 9 retainer clients, one hallucinated stat per few hundred posts is still dozens of trust-breaking errors a year. The grounding layer is the difference between a tool a client can rely on and a toy that writes pretty lies.

Back to that client

The article that cost me trust was fluent, well-structured, and wrong in exactly one sentence. That is the trap. Fluency hides the error. The reader trusts the tone.

Retrieve-then-write does not make the model smarter. It changes what the model is allowed to claim. Once a fact has to exist in a real source before it can appear in the draft, the most damaging failure mode mostly goes away.

If you are shipping AI content at any volume, what is your actual process for catching a fabricated fact before it goes live, or are you still trusting the tone?
