GAUTAM MANAK

Posted on Jun 3

Why better documentation won't fix AI hallucinations

#ai #mcp #runnerhchallenge #automation

A friend of mine spent six hours debugging a Stripe Connect integration last week.

He was using Cursor. He was using Claude 3.5 Sonnet. He was using one of the better-documented APIs in the world.

For six hours, the model kept inserting a webhook header called X-Stripe-Connect-Signature into his verification logic. He copied it. He tested it. He read the docs. He read them again. He asked the model to "double-check the official documentation."

The model kept doing it.

The header does not exist.

Stripe verifies Connect webhooks with Stripe-Signature, the same header it uses for every other webhook. The "Connect" version was something the model had quietly invented somewhere along the way, most likely during training, and now every agent in his stack was confidently reproducing it on demand.
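For reference, here is a minimal standard-library sketch of how a Stripe-Signature value is checked, following Stripe's documented scheme: HMAC-SHA256 over the timestamp, a dot, and the raw request body. The function name, the `now` parameter, and the test secret are illustrative assumptions, not Stripe's code. In production, Stripe's official library (`stripe.Webhook.construct_event` in Python) is the safer choice.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(payload: bytes, sig_header: str, secret: str,
                            tolerance: int = 300,
                            now: Optional[float] = None) -> bool:
    """Check a Stripe-Signature header value against the raw request body."""
    # The header looks like "t=1700000000,v1=<hex digest>[,v1=...]".
    timestamp = None
    candidates = []
    for part in sig_header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False

    # Reject timestamps too far from the current time to limit replays.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False

    # The signed payload is "<timestamp>.<raw body>", hashed with HMAC-SHA256.
    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)


# Illustrative values only: a made-up secret and event body.
demo_secret = "whsec_test_secret"
demo_body = b'{"id": "evt_123"}'
demo_ts = "1700000000"
demo_sig = hmac.new(demo_secret.encode(), demo_ts.encode() + b"." + demo_body,
                    hashlib.sha256).hexdigest()
demo_header = f"t={demo_ts},v1={demo_sig}"
```

The same function applies to Connect and non-Connect webhooks alike. Only the endpoint secret differs, never the header name.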

He is not a junior engineer. He has shipped six startups. He just made the same mistake that almost every team using AI coding assistants is making right now: he assumed the documentation was the problem.

It wasn't. It almost never is.

The Diagnosis Everyone Gets Wrong

Walk into any AI engineering Slack right now and you will hear the same conversation on repeat:

"Our agents keep hallucinating our API. We need to improve the docs."

So teams improve the docs. They rewrite endpoints. They add code samples. They commission better Markdown. They migrate to fancier docs frameworks. They publish a Notion workspace. They publish an OpenAPI spec. They publish an llms.txt.

And the hallucinations keep happening.

Because the diagnosis is wrong.

Hallucinated APIs are not a writing problem. They are not a tone problem. They are not even a "the docs are incomplete" problem.

They are a structure problem. The way documentation is shaped on the web is fundamentally hostile to the way models actually read.

What Hallucination Actually Looks Like in Production

Open the network tab the next time you use any AI coding assistant against a real product's docs.

You will see one of three things:

The agent crawled the docs root, grabbed the first 12 KB of rendered HTML, and called it a day.
The agent retrieved three or four chunks from a vector index — usually the wrong chunks, because the embedding model has no idea your "Authentication" page is the canonical source for header behavior.
The agent retrieved nothing at all, because the docs are a JavaScript single-page app and the crawler couldn't see past the loading spinner.

In every case, the model is being asked to answer a precise, schema-level question — "what header do I use to verify this webhook?" — using prose written for humans who are expected to read the page top to bottom and understand context from layout, sidebar, and tone.

The model doesn't get layout. It doesn't get sidebar. It doesn't get tone. It gets a flattened bag of paragraphs with most of the structural signal stripped out.
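To make the flattening concrete, here is a small sketch of what a naive text extractor does to a docs table. The HTML fragment and the `TextOnly` class are hypothetical, and real crawlers vary, but the effect is similar: the row and column relationships vanish and only a run of words remains.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text and discard every tag, as a naive crawler might."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


# Hypothetical docs fragment: a table tying each event source to a header.
DOCS_HTML = """
<h2>Webhook signatures</h2>
<table>
  <tr><th>Event source</th><th>Header</th></tr>
  <tr><td>Account events</td><td>Stripe-Signature</td></tr>
  <tr><td>Connect events</td><td>Stripe-Signature</td></tr>
</table>
"""


def flatten(html: str) -> str:
    """Return the page as one string of words, with all structure removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


flat = flatten(DOCS_HTML)
print(flat)
```

Nothing in the resulting string marks where a heading ends, which words were column labels, or which cell belonged to which row. That is the signal the model has to guess at.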

So it fills in the blanks. It does what language models always do: it produces something that sounds like a Stripe header, because everything in its training data says headers exist and are named in a certain way.

That is hallucination. It is not a creativity failure. It is a structure failure.

Why "Better Docs" Doesn't Move the Needle

Imagine you wanted to teach a brand new junior engineer how to call your API.

You'd give them:

✅ The OpenAPI spec
✅ A Postman collection
✅ The SDK source
✅ A runnable example

What you wouldn't do is tell them: "Read these 400 pages of Markdown and reason about which one is canonical."

That second option is exactly what we are doing with AI agents today.

When we say "improve the docs," we usually mean: write better prose. Add a clearer intro paragraph. Move the warnings up. Add another code sample.

None of that helps the model. The model already had your prose. It generated a header that didn't exist while it had your prose.

What the model is missing is structure the agent can index, version, and call with precision:

A canonical list of endpoints, not an HTML page of endpoints
A canonical list of parameter names and their types — not paragraphs about them
A canonical list of headers, codes, and constraints — not "see the section above"
A way for the agent to ask "what's the latest version of this endpoint?" instead of being trapped in whatever HTML it crawled six minutes ago
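As a rough sketch of what "a canonical list, not paragraphs" could mean in practice, the snippet below keeps headers in a typed registry and answers lookups with either a spec or nothing. The `HeaderSpec` type, the registry contents, and `lookup_header` are all assumptions for illustration. A real registry would be generated from the API's source of truth, such as an OpenAPI document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HeaderSpec:
    name: str
    direction: str    # "request" or "response"
    description: str


# Hypothetical canonical registry, keyed by lowercase header name
# because HTTP header names are case-insensitive.
CANONICAL_HEADERS = {
    spec.name.lower(): spec
    for spec in [
        HeaderSpec("Stripe-Signature", "request",
                   "Signature sent with every webhook delivery"),
        HeaderSpec("Idempotency-Key", "request",
                   "Makes a POST request safe to retry"),
    ]
}


def lookup_header(name: str) -> Optional[HeaderSpec]:
    """Return the canonical spec, or None if the header is not defined."""
    return CANONICAL_HEADERS.get(name.lower())
```

With a surface like this, `lookup_header("X-Stripe-Connect-Signature")` returns `None` instead of a plausible-sounding paragraph, which gives the agent an unambiguous "this does not exist" signal.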

This is not a writing exercise. It is an infrastructure exercise.

Markdown Was Built for Humans. Agents Need Something Else.

The honest version of the problem: we built the entire documentation web for a reader who is a human with patience, a Ctrl-F box, and good judgment.

Agents have none of those things.

An agent is closer to a programmable API consumer than a human reader. It needs:

Need	What it means
A typed surface	"What endpoints exist? What does this one return?"
Versioning	"I am working against v2. Don't show me v1 examples."
Tool-shaped retrieval	"When the user asks about Connect webhooks, hand me only the canonical signature-verification section."
Live freshness	"These docs changed three hours ago. Re-index."
Workflow context	"This call requires that call. Don't suggest one without the other."

None of these are properties of a Markdown file. They are properties of an interface.

Until we stop pretending docs are a flat blob of prose and start treating them as an interface that agents call, we will keep getting confidently invented headers, deprecated endpoints, and integrations that look right in chat and break in production.

What an AI-Native Documentation Layer Looks Like

The shape of the fix is already showing up in production stacks.

It is called the Model Context Protocol — MCP for short. It is a small, simple protocol that lets an AI agent talk to a documentation source the way it would talk to any other tool.

The MCP-shaped version of "docs" looks like this:

Tools, not pages. Instead of crawling a 600-URL site, the agent calls search_docs, get_endpoint, list_versions, get_example.
Schemas, not screenshots. Each tool has a typed contract. The agent knows what comes back before it asks.
Versioning is first-class. v1, v2, beta — explicit, queryable, never mixed.
Retrieval is workflow-aware. Asking "how do I verify a Connect webhook" returns the exact verification section, not a vector-search soup.
Freshness is part of the serving path, not an afterthought. Docs update → MCP server updates → agent sees the change on its next call.
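The four tool names above can be sketched as plain Python functions over a versioned index. This is a minimal illustration under stated assumptions: the `Endpoint` type, the widget API, and the example.com URLs are invented, and the step of registering these functions with an actual MCP server SDK is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Endpoint:
    version: str
    method: str
    path: str
    summary: str
    example: str


# Hypothetical in-memory index keyed by (version, operation name).
_INDEX = {
    ("v1", "create_widget"): Endpoint(
        "v1", "POST", "/v1/widgets", "Create a widget",
        "curl -X POST https://api.example.com/v1/widgets"),
    ("v2", "create_widget"): Endpoint(
        "v2", "POST", "/v2/widgets", "Create a widget (v2 payload)",
        "curl -X POST https://api.example.com/v2/widgets"),
    ("v2", "get_widget"): Endpoint(
        "v2", "GET", "/v2/widgets/{id}", "Fetch a widget by ID",
        "curl https://api.example.com/v2/widgets/w_1"),
}


def list_versions() -> List[str]:
    """Every documented API version, sorted."""
    return sorted({version for version, _ in _INDEX})


def get_endpoint(version: str, name: str) -> Endpoint:
    """Exact lookup; unknown endpoints raise instead of returning a guess."""
    try:
        return _INDEX[(version, name)]
    except KeyError:
        raise LookupError(f"no endpoint {name!r} in {version}") from None


def get_example(version: str, name: str) -> str:
    """The canonical example for one endpoint in one version."""
    return get_endpoint(version, name).example


def search_docs(query: str, version: str) -> List[Endpoint]:
    """Keyword search that never returns results from another version."""
    terms = query.lower().split()
    return [
        ep for (v, _), ep in sorted(_INDEX.items())
        if v == version and any(t in ep.summary.lower() for t in terms)
    ]
```

The design choice worth noting is that `version` is a required argument everywhere, so v1 and v2 results cannot mix, and a missing endpoint is an explicit error the agent can act on.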

This is what you actually want sitting between your documentation and any AI assistant your team uses.

You do not want every agent re-crawling your docs and re-inventing your headers. You want a single canonical context layer the agents read from.

The New Stack

Here is the rough shape of where the AI engineering stack is heading:

Layer	Used to be	Becoming
Models	The product	A commodity
Retrieval	A vector DB, bolted on	A first-class context protocol
Tools	Hand-rolled per stack	Standardized via MCP
Docs	Marketing surface	Programmable infrastructure
Agents	One-off Copilot demos	Long-running workflows that need precise context

In every layer, the same thing is happening: ad-hoc artifacts are being replaced by structured interfaces.

Models that win in 2027 will not be the ones with the most parameters. They will be the ones whose teams gave them the cleanest, most structured surface to act against.

If your documentation is still a flat web of HTML pages, your AI strategy has a hole in it. You can't out-prompt a structural problem.

What I Would Do This Week

If I led a developer-tools team, here's my five-step plan — zero docs rewrites required:

1. Open the network tab. Watch your agent try to read your docs in real time. Notice how thin the signal is.
2. Pick the top three questions a developer asks an AI assistant about your product. Try them against your live docs in Cursor or Claude. Count the hallucinations.
3. Decide those three questions should be answered by a tool, not by retrieval. Stand up an MCP layer in front of them.
4. Treat your docs as the source of truth, but stop expecting agents to read them like humans do. Generate a structured layer on top.
5. Measure agent accuracy as a product metric, the same way you measure pageviews. It is now a leading indicator for whether developers will pick your product over your competitor's.
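One rough way to turn "count the hallucinations" into a number is to scan agent answers for header-like tokens and flag any that are not on a known list. The allow-list, the regular expression, and the sample answers below are assumptions for illustration. The pattern is a crude heuristic that only catches hyphenated, capitalized names and will produce some false positives.

```python
import re
from typing import List, Set

# Hypothetical allow-list; in practice, load it from the canonical spec.
KNOWN_HEADERS = {"stripe-signature", "idempotency-key", "content-type"}

# Matches hyphenated, capitalized tokens such as "Stripe-Signature".
HEADER_TOKEN = re.compile(r"\b[A-Z][A-Za-z0-9]*(?:-[A-Z][A-Za-z0-9]*)+\b")


def invented_headers(answer: str) -> Set[str]:
    """Header-like tokens in the answer that are not on the allow-list."""
    found = {match.group(0) for match in HEADER_TOKEN.finditer(answer)}
    return {h for h in found if h.lower() not in KNOWN_HEADERS}


def accuracy(answers: List[str]) -> float:
    """Fraction of answers that contain no invented header."""
    if not answers:
        return 1.0
    clean = sum(1 for answer in answers if not invented_headers(answer))
    return clean / len(answers)


# Made-up sample answers, one correct and one hallucinated.
sample_answers = [
    "Read the Stripe-Signature header from the incoming request.",
    "Verify the X-Stripe-Connect-Signature header before parsing.",
]
score = accuracy(sample_answers)
```

Tracked over time against a fixed question set, a score like this behaves like any other product metric: it can be charted, alerted on, and compared before and after a change to the context layer.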

You can do all five steps without rewriting a single line of documentation prose.

A Closing Thought

The most underrated trend in AI right now:

The next moat is not the model. It is the structured context the model is allowed to see.

Companies that figure this out first will look, from the outside, like they have smarter agents. They won't. They will just be the ones who turned their documentation, their APIs, and their internal knowledge into infrastructure that agents can actually use.

If you are running any kind of agent stack today — Cursor, Claude, OpenAI Agents, internal copilots, customer-facing AI — and you have ever shipped a fix to "improve the docs for the LLM," I'd love to hear what worked and what didn't.

Drop a comment with one example of an AI agent confidently inventing something against your product. I want to read every one of them.

The agents will read it. They just need it to be the right shape.

DEV Community