David Hurley

Posted on • Originally published at dbhurley.com

When AI Reads the Web Wrong

In February 2024, the Civil Resolution Tribunal of British Columbia ordered Air Canada to pay a passenger named Jake Moffatt a partial refund. The reason: Air Canada's AI chatbot had told Moffatt he could book a full-fare flight and apply for a bereavement discount afterward. That was wrong. The airline's actual policy required the discount to be requested before booking. The chatbot read the airline's own website and got the policy backward.

Air Canada tried to argue that the chatbot was "a separate legal entity that is responsible for its own actions." The tribunal was not persuaded.

This story gets cited as a cautionary tale about AI hallucination. But I think it illustrates something more specific and more fixable. The chatbot did not hallucinate out of thin air. It read a real web page containing real policy information and then misinterpreted what it found. The page almost certainly had the correct policy buried somewhere in the HTML, surrounded by navigation menus, footer links, promotional banners, cookie consent dialogs, and whatever else airlines pack into their web pages these days.

The chatbot's error was not a failure of intelligence. It was a failure of comprehension, caused by the format of the input.



The format problem nobody talks about

When an AI agent "reads" a web page, it does not see what you see. You see a clean layout with headings, paragraphs, buttons, and images, all arranged in a visual hierarchy that your brain processes in milliseconds. The AI agent sees something very different: tens of thousands of tokens of raw HTML markup, most of which has nothing to do with the actual content of the page.

I have been measuring this for the past several months as part of building Plasmate. Across 50 real websites, the average web page contains about 33,000 tokens of HTML. Of those, roughly 25% is actual content: the text, the headings, the links, the things you would want an AI to understand. The remaining 75% is presentation markup: CSS class names, inline styles, JavaScript, tracking pixels, layout containers, ad slots, data attributes, SVG icons, and structural dividers that exist only to make the page look right in a browser.

The AI agent processes all of it. It pays for all of it (literally, token by token). And it has to figure out, on its own, which parts matter and which parts are noise.
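To make that content-to-markup ratio concrete, here is a minimal sketch of the kind of measurement involved, using only the Python standard library. The sample page and the character-based ratio are illustrative stand-ins; real measurements would use full pages and tokenizer counts rather than character counts.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the visible text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

# A toy page: a little content wrapped in styling, chrome, and tracking.
page = """<html><head><style>.x{color:red}</style></head>
<body><div class="wrapper"><nav>Home | About</nav>
<main><h1>Bereavement fares</h1><p>Request the discount before booking.</p></main>
<script>track({id: 42});</script></div></body></html>"""

extractor = TextExtractor()
extractor.feed(page)
content = " ".join(" ".join(extractor.parts).split())

# Rough proxy for the content share of the raw page (characters, not tokens)
ratio = len(content) / len(page)
print(f"content: {content!r}")
print(f"content share of raw HTML: {ratio:.0%}")
```

Even in this tiny example, most of the bytes are markup the reader never sees; on real pages, with their ad slots and consent dialogs, the imbalance is far worse.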

Imagine handing someone a 400-page book and telling them that 300 of those pages are blank or contain random numbers, but the 100 pages with actual content are scattered throughout and not marked in any way. Then ask them to summarize the book accurately. They might get it right most of the time. But sometimes they will latch onto a random number from a blank page and present it as a fact. That is more or less what we are doing to AI agents every time they browse the web.

Four ways agents get web pages wrong

After spending months studying how AI agents interact with web content, I have started to see patterns in their errors. Not all mistakes are the same. They fall into distinct categories, and the category tells you something about what caused the mistake.

Structural errors. The agent invents page elements that do not exist. "Click the Subscribe button in the sidebar" when there is no such button. This happens most often when the agent is working from a text-only version of the page (like markdown) that strips out all structural information. The agent knows buttons probably exist on most pages, so it guesses. Sometimes it guesses wrong.

Content errors. The agent reports facts that are not on the page, or reports the wrong version of a fact that is. A price of $49.99 when the page shows $59.99. This is the classic hallucination, and it can be triggered by noisy input. When the agent is processing 33,000 tokens and trying to find a single number, the probability of grabbing the wrong one goes up with the amount of noise in the input.

Attribution errors. The agent finds the right information but attributes it to the wrong part of the page. It reports a statistic from the sidebar as if it were from the main article. It confuses a promotional offer with the actual product price. This happens because the agent cannot distinguish between page regions. Everything arrives as one flat stream of text, and the agent has to guess where the main content ends and the sidebar begins.

Inference errors. The agent draws conclusions that the page does not support. A product is described as "popular" and the agent reports it as "the best-selling item." This type of error is less about the input format and more about the model's tendency to fill in gaps. But noisier, more confusing input makes inference errors more likely, because the agent has less confidence in what it actually found and more temptation to extrapolate.

Which of these error types caused the Air Canada chatbot to get the bereavement policy wrong? Almost certainly a combination of content and attribution errors. The correct policy was on the page, but surrounded by enough other content that the chatbot either found the wrong paragraph or misattributed a condition from one section to another.

Why markdown is not the answer

The most common response to the "HTML is too noisy" problem is to convert web pages to markdown before feeding them to an AI agent. Strip out the HTML tags, keep the text. This is what most AI agent frameworks do by default: LangChain, LlamaIndex, CrewAI, and others all convert web pages to plain text or markdown before processing.

Markdown does solve the noise problem. A page that was 33,000 tokens of HTML becomes about 4,500 tokens of markdown. That is a huge improvement in efficiency. But it creates a new problem: markdown throws away all the structural information along with the noise.

A markdown representation of a web page cannot tell you which text is a button and which is a heading. It cannot tell you which elements are interactive and which are static. It cannot distinguish between the main content area and the sidebar. It cannot tell you what form fields are available or what options are in a dropdown menu.
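A tiny illustration of that loss, using naive tag stripping as a stand-in for the lossiest form of text conversion (the helper and snippets are hypothetical, not any framework's actual converter):

```python
import re

def strip_tags(html: str) -> list[str]:
    """Naive text extraction, standing in for an HTML-to-text pass."""
    return re.sub(r"<[^>]+>", " ", html).split()

button = '<button class="cta">Subscribe</button>'
heading = "<h2>Subscribe</h2>"

# After stripping, the interactive button and the static heading are
# indistinguishable: both collapse to the same bare word.
print(strip_tags(button), strip_tags(heading))
```

An agent reading the stripped output has no way to know that one "Subscribe" can be clicked and the other cannot, which is exactly the information it needs for anything beyond passive reading.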

For simple reading tasks, this does not matter. If you just need to extract the text of an article, markdown is great. But for anything that requires understanding the structure of the page, knowing what you can click, figuring out how to fill a form, or navigating a multi-step workflow, markdown is blind.

This creates an awkward situation for AI agent developers. They use one format (markdown) when they need to read pages, and a completely different format (raw HTML with DOM selectors) when they need to interact with pages. Two systems, two sets of failure modes, no unified understanding of the page.

What if the page told the agent what it needed to know?

This is the question I have been working on. Not "how do we make AI smarter at reading HTML," but "how do we give AI a format that is actually designed for it?"

The idea behind the Semantic Object Model (SOM) is straightforward: take a web page and compile it into a structured representation that preserves what an agent needs (the content, the element types, the interactive affordances, the page regions) while discarding what it does not (the CSS, the scripts, the tracking, the layout containers, the visual presentation).

The output is a JSON document that organizes the page into typed regions (navigation, main content, sidebar, footer) containing typed elements (headings, paragraphs, links, buttons, form fields) with explicit declarations of what actions are available (click, type, select, toggle). An agent reading a SOM document knows exactly what is on the page, where it is, what type of element it is, and what it can do with it.
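As a sketch of what such a document might look like (the field names, region types, element ids, and actions vocabulary here are hypothetical, not Plasmate's published schema), built as a plain Python dict:

```python
import json

# Hypothetical SOM-style document for a tiny product page.
som = {
    "url": "https://example.com/product/42",
    "regions": [
        {"type": "navigation", "elements": [
            {"id": "e_01", "type": "link", "text": "Home", "actions": ["click"]},
            {"id": "e_02", "type": "link", "text": "Store", "actions": ["click"]},
        ]},
        {"type": "main", "elements": [
            {"id": "e_03", "type": "heading", "level": 1, "text": "Walnut desk"},
            {"id": "e_04", "type": "paragraph", "text": "Solid walnut, ships flat."},
            {"id": "e_05", "type": "text", "text": "$59.99"},
            {"id": "e_06", "type": "button", "text": "Add to cart", "actions": ["click"]},
        ]},
        {"type": "sidebar", "elements": [
            {"id": "e_07", "type": "text", "text": "Customers also bought: $49.99 lamp"},
        ]},
    ],
}

# An agent (or a verifier) can now ask structural questions directly,
# instead of inferring structure from a flat stream of text:
interactive = [
    el["id"]
    for region in som["regions"]
    for el in region["elements"]
    if el.get("actions")
]
print(json.dumps(interactive))  # ids of every actionable element
```

Notice that the main-content price and the sidebar price live in differently typed regions, so "what does this product cost?" has an unambiguous answer.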

This is not a summary or an extraction. It is a compiled representation of the full page, preserving all the semantic information while eliminating the visual presentation layer. Think of it as the difference between giving someone a blueprint of a building versus giving them a photograph. The photograph is richer in visual detail, but the blueprint tells you what every room is for, where the doors are, and which ones are locked.

The numbers

We have been running benchmarks across 50 real websites with two different AI models (GPT-4o and Claude Sonnet 4). The efficiency numbers are dramatic:

SOM uses about 8,300 tokens per page compared to 33,000 for raw HTML. That is a 4x reduction. For navigation-heavy pages, the ratio reaches 5.4x. For pages with heavy advertising, 6x.

But efficiency is not the point of this post. The point is correctness.

In our latest research, we set up 150 tasks across six categories: extracting specific facts, comparing information across page sections, identifying navigation structure, summarizing content, handling noisy pages with lots of ads, and identifying interactive elements. We tested each task with all three formats (HTML, markdown, SOM) across four different AI models.

The results are still being finalized, but the patterns are clear. For tasks that require understanding page structure (navigation, interactive elements, adversarial pages with lots of noise), structured representations produce measurably better results. The agent makes fewer structural errors because it does not have to guess about page structure. It makes fewer attribution errors because regions are explicitly labeled. It makes fewer content errors on noisy pages because the noise has been filtered out at compile time rather than at inference time.

Perhaps the most interesting finding is about speed. On Claude, SOM is faster than markdown despite using nearly twice as many tokens.

Claude Sonnet 4 with HTML input: 16.2 seconds per task. With SOM input: 8.5 seconds. Nearly 2x faster, despite SOM having more tokens than markdown.


Our interpretation is that structured input reduces the amount of work the model has to do to understand the page. When the structure is explicit, the model spends less time reasoning about "what is this element?" and "where does the sidebar end?" and more time reasoning about the actual question.

The part that surprised me

One thing we built into SOM that I initially considered a nice-to-have turned out to be one of the most important features: provenance tracking. Every element in a SOM document has a stable identifier. When an agent extracts a fact from a SOM page, the system can record which specific element that fact came from.

This means you can programmatically verify an agent's claims. If the agent says "the product costs $49.99," you can check whether element e_a3f2b1 in the main region actually contains that price. If it does, the claim is verified. If it does not, you know the agent made an error, and you know it before the claim reaches a user.

Compare this to verifying a claim against raw HTML or markdown. The best you can do is search for the string "$49.99" somewhere in the document. If it appears in an ad or in a "customers also bought" section rather than the product listing, you cannot tell the difference. The claim looks verified even though the agent found the number in the wrong place.
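A minimal sketch of that verification step, assuming a hypothetical SOM shape and made-up element ids (not Plasmate's actual API):

```python
def verify_claim(som, claim_value, element_id, expected_region="main"):
    """Return True only if the cited element exists in the expected
    region and actually contains the claimed value."""
    for region in som["regions"]:
        for el in region["elements"]:
            if el["id"] == element_id:
                return (
                    region["type"] == expected_region
                    and claim_value in el.get("text", "")
                )
    return False

som = {
    "regions": [
        {"type": "main", "elements": [
            {"id": "e_a3f2b1", "type": "text", "text": "Price: $59.99"},
        ]},
        {"type": "sidebar", "elements": [
            {"id": "e_9c44d0", "type": "text", "text": "Also bought: $49.99 lamp"},
        ]},
    ],
}

# The agent claims "$49.99" and cites the sidebar element: the string exists
# on the page, but not where a product-price claim should come from.
print(verify_claim(som, "$49.99", "e_9c44d0"))  # region mismatch -> False
print(verify_claim(som, "$59.99", "e_a3f2b1"))  # verified -> True
```

A plain string search over HTML or markdown would have "verified" the $49.99 claim; the provenance check catches it because the match came from the wrong region.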

For high-stakes applications (medical information, financial advice, legal policy lookup), this is the difference between an AI assistant you can trust and one you cannot.

Back to Air Canada

Would structured representations have prevented the Air Canada chatbot error? I think they would have helped significantly. The bereavement policy was on the page. The issue was that the chatbot could not reliably distinguish the policy text from the surrounding content, or it confused conditions from adjacent sections.

A SOM representation of that page would have placed the bereavement policy in a clearly labeled content region, with each policy condition as a distinct paragraph element. The chatbot would have received a clean, structured view of the policy rather than a wall of HTML that happened to contain the policy somewhere within it.

Would that guarantee a correct answer? No. AI agents can still make inference errors regardless of input format. But it would eliminate the structural and attribution errors that are caused by noisy, ambiguous input. And for a chatbot handling customer-facing policy questions, eliminating those error categories is worth a lot.

What this means going forward

The web was built for human eyes. Every page on the internet is designed to be rendered as pixels on a screen. That made perfect sense when humans were the only consumers of web content.

They are not anymore. AI agents now account for a growing share of web traffic. Cloudflare reported that crawler traffic grew 18% in a single year, with some AI crawlers growing over 300%. These agents are browsing billions of pages, and every one of those pages is served in a format designed for someone who will never use it.

The fix is not to make agents better at reading HTML. It is to give agents a format designed for them, the same way we gave search engines sitemaps and gave applications APIs. Each new class of web consumer has eventually gotten infrastructure designed for its consumption model. AI agents are the next consumer class, and they need the same.

I published a detailed research paper on information fidelity that goes deep on the methodology, the hallucination taxonomy, and the experimental framework. If you are building agent systems or evaluating web content representations, the paper has the technical depth. This post is the accessible version: AI agents read the web wrong because we are feeding them the wrong format, and there is a better way.

David Hurley is the founder of Plasmate Labs. Previously, he founded Mautic, the world's first open source marketing automation platform. He writes at dbhurley.com/blog and publishes research at dbhurley.com/papers.
