<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marcelo Santos</title>
    <description>The latest articles on DEV Community by Marcelo Santos (@marcelo_aingestor).</description>
    <link>https://dev.to/marcelo_aingestor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3940774%2F479d1773-2ef1-4a5a-b6ce-57b77c20aaa2.jpg</url>
      <title>DEV Community: Marcelo Santos</title>
      <link>https://dev.to/marcelo_aingestor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/marcelo_aingestor"/>
    <language>en</language>
    <item>
      <title>HTML vs Markdown for LLMs: Why Clean Structure Beats Raw Pages</title>
      <dc:creator>Marcelo Santos</dc:creator>
      <pubDate>Tue, 19 May 2026 16:59:36 +0000</pubDate>
      <link>https://dev.to/marcelo_aingestor/html-vs-markdown-for-llms-why-clean-structure-beats-raw-pages-2f35</link>
      <guid>https://dev.to/marcelo_aingestor/html-vs-markdown-for-llms-why-clean-structure-beats-raw-pages-2f35</guid>
      <description>&lt;h1&gt;
  
  
  HTML vs Markdown for LLMs: Why Clean Structure Beats Raw Pages
&lt;/h1&gt;

&lt;p&gt;When people build RAG pipelines or AI agents for the first time, they often focus on embeddings, vector databases, chunking strategies, and prompt engineering.&lt;/p&gt;

&lt;p&gt;But there’s another problem hiding underneath all of that:&lt;/p&gt;

&lt;p&gt;Your model is probably ingesting terrible input.&lt;/p&gt;

&lt;p&gt;A lot of AI pipelines still feed raw HTML directly into LLM workflows. Technically, it works. But in practice, it creates noisy context windows, inflated token usage, poor chunking quality, and retrieval results that feel strangely inconsistent.&lt;/p&gt;

&lt;p&gt;After spending time testing retrieval pipelines across different websites and document structures, one pattern became very clear:&lt;/p&gt;

&lt;p&gt;LLMs understand clean Markdown dramatically better than noisy HTML.&lt;/p&gt;

&lt;p&gt;This post explains why.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Hidden Problem With Raw HTML
&lt;/h1&gt;

&lt;p&gt;Most modern web pages contain far more than the actual content you care about.&lt;/p&gt;

&lt;p&gt;A typical page includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;navigation menus&lt;/li&gt;
&lt;li&gt;sidebars&lt;/li&gt;
&lt;li&gt;cookie banners&lt;/li&gt;
&lt;li&gt;repeated headers&lt;/li&gt;
&lt;li&gt;tracking elements&lt;/li&gt;
&lt;li&gt;hidden accessibility labels&lt;/li&gt;
&lt;li&gt;styling wrappers&lt;/li&gt;
&lt;li&gt;scripts&lt;/li&gt;
&lt;li&gt;unrelated links&lt;/li&gt;
&lt;li&gt;footer blocks&lt;/li&gt;
&lt;li&gt;promotional sections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Humans naturally ignore most of this.&lt;/p&gt;

&lt;p&gt;LLMs do not.&lt;/p&gt;

&lt;p&gt;To a language model, all of that becomes part of the context window unless your ingestion layer removes it first.&lt;/p&gt;

&lt;p&gt;That means your retrieval pipeline may spend thousands of tokens on boilerplate before the model even reaches the useful information.&lt;/p&gt;
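
&lt;p&gt;As a rough illustration of what an ingestion layer can do, here is a minimal sketch that uses only Python’s standard &lt;code&gt;html.parser&lt;/code&gt; module. The &lt;code&gt;SKIP_TAGS&lt;/code&gt; set and the &lt;code&gt;extract_main_text&lt;/code&gt; helper are assumptions made for this example, not a standard API, and real pages usually need more than a tag-name filter.&lt;/p&gt;

```python
from html.parser import HTMLParser

# Assumed set of tags treated as boilerplate; tune this per site.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class ContentExtractor(HTMLParser):
    """Collect visible text while ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # number of boilerplate tags we are currently inside
        self.parts = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html):
    """Return the text outside boilerplate tags, one fragment per line."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

&lt;p&gt;Calling &lt;code&gt;extract_main_text(page_html)&lt;/code&gt; on a fetched page returns only the text that sits outside the skipped tags. Cookie banners and promotional blocks are usually plain &lt;code&gt;div&lt;/code&gt; elements, so catching them would take extra rules based on attributes or text density.&lt;/p&gt;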




&lt;h1&gt;
  
  
  Why This Matters for RAG Systems
&lt;/h1&gt;

&lt;p&gt;RAG systems depend on relevance.&lt;/p&gt;

&lt;p&gt;The cleaner the source material is, the easier it becomes to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generate meaningful embeddings&lt;/li&gt;
&lt;li&gt;chunk content semantically&lt;/li&gt;
&lt;li&gt;retrieve useful passages&lt;/li&gt;
&lt;li&gt;reduce hallucinations&lt;/li&gt;
&lt;li&gt;improve answer consistency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Raw HTML introduces a lot of structural noise.&lt;/p&gt;

&lt;p&gt;For example, imagine a documentation page where the actual article is surrounded by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;200 navigation links&lt;/li&gt;
&lt;li&gt;repeated UI labels&lt;/li&gt;
&lt;li&gt;footer menus&lt;/li&gt;
&lt;li&gt;account controls&lt;/li&gt;
&lt;li&gt;search widgets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A chunking system may split the article in the wrong places, or embed those elements in the same chunk as the article text.&lt;/p&gt;

&lt;p&gt;The result?&lt;/p&gt;

&lt;p&gt;Lower signal quality.&lt;/p&gt;




&lt;h1&gt;
  
  
  Markdown Preserves Meaning Better
&lt;/h1&gt;

&lt;p&gt;Markdown is interesting because it balances three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;human readability&lt;/li&gt;
&lt;li&gt;machine readability&lt;/li&gt;
&lt;li&gt;structural simplicity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Headings remain headings.&lt;/p&gt;

&lt;p&gt;Lists remain lists.&lt;/p&gt;

&lt;p&gt;Tables remain tables.&lt;/p&gt;

&lt;p&gt;Code blocks remain code blocks.&lt;/p&gt;

&lt;p&gt;But all the visual and layout noise disappears.&lt;/p&gt;

&lt;p&gt;That dramatically improves how LLM pipelines interpret the information hierarchy of a document.&lt;/p&gt;

&lt;p&gt;In many cases, semantic chunking becomes easier almost automatically because the structure is already clean.&lt;/p&gt;
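
&lt;p&gt;To make that concrete, here is a minimal sketch of heading-based chunking over Markdown. The &lt;code&gt;chunk_by_headings&lt;/code&gt; function and its output shape are assumptions for this example. It only recognizes &lt;code&gt;#&lt;/code&gt;-style headings and does not special-case fenced code blocks, so a line starting with &lt;code&gt;#&lt;/code&gt; inside a code sample would be misread as a heading.&lt;/p&gt;

```python
import re

# One to six "#" characters, whitespace, then the heading title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown):
    """Split Markdown into one chunk per heading, keeping the heading path."""
    chunks = []
    path = []  # stack of (level, title) for the headings currently open
    body = []  # lines collected under the current heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

&lt;p&gt;Each chunk carries its heading path, for example &lt;code&gt;["Guide", "Install"]&lt;/code&gt;, which can be stored as metadata next to the embedding so that a retrieved passage keeps its place in the document hierarchy.&lt;/p&gt;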




&lt;h1&gt;
  
  
  Token Reduction Is Bigger Than Most People Expect
&lt;/h1&gt;

&lt;p&gt;One of the most surprising findings when comparing raw HTML against cleaned Markdown is how much token usage drops.&lt;/p&gt;

&lt;p&gt;Some pages lose 50%, 60%, or even 70–80% of their tokens after boilerplate removal and structural cleanup.&lt;/p&gt;

&lt;p&gt;That matters a lot when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;using large context windows&lt;/li&gt;
&lt;li&gt;processing thousands of pages&lt;/li&gt;
&lt;li&gt;running browser-agent systems&lt;/li&gt;
&lt;li&gt;scaling ingestion pipelines&lt;/li&gt;
&lt;li&gt;optimizing inference costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A noisy page that costs 12,000 tokens to ingest might become a 3,500-token document after cleanup.&lt;/p&gt;
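
&lt;p&gt;The arithmetic behind that example is simple enough to check by hand. The &lt;code&gt;token_savings&lt;/code&gt; helper below is a hypothetical name used only for this illustration, and the token counts are the illustrative figures from the paragraph above, not measurements.&lt;/p&gt;

```python
def token_savings(raw_tokens, clean_tokens):
    """Return (tokens saved, fraction of the raw count that was removed)."""
    saved = raw_tokens - clean_tokens
    return saved, saved / raw_tokens


saved, fraction = token_savings(12_000, 3_500)
print(f"{saved} tokens saved ({fraction:.0%} reduction)")
# prints: 8500 tokens saved (71% reduction)
```

&lt;p&gt;At that rate, a crawl of 1,000 similar pages would avoid roughly 8.5 million tokens of input.&lt;/p&gt;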

&lt;p&gt;And often the answers built on that smaller document become &lt;em&gt;more accurate&lt;/em&gt;, not less.&lt;/p&gt;




&lt;h1&gt;
  
  
  Browser Agents Suffer From This Too
&lt;/h1&gt;

&lt;p&gt;This problem is not limited to RAG.&lt;/p&gt;

&lt;p&gt;AI agents that read live websites also struggle with noisy page structures.&lt;/p&gt;

&lt;p&gt;If an agent receives raw page content filled with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;navigation&lt;/li&gt;
&lt;li&gt;ads&lt;/li&gt;
&lt;li&gt;repetitive UI labels&lt;/li&gt;
&lt;li&gt;modal overlays&lt;/li&gt;
&lt;li&gt;irrelevant links&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;its reasoning quality drops.&lt;/p&gt;

&lt;p&gt;You can think of it like forcing a human to read an article while someone constantly interrupts every paragraph with menus and advertisements.&lt;/p&gt;

&lt;p&gt;Cleaner structured input improves reasoning reliability.&lt;/p&gt;




&lt;h1&gt;
  
  
  Semantic Chunking Starts Before Chunking
&lt;/h1&gt;

&lt;p&gt;A lot of developers talk about semantic chunking as if it begins during segmentation.&lt;/p&gt;

&lt;p&gt;In reality, it starts much earlier.&lt;/p&gt;

&lt;p&gt;It starts with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clean extraction&lt;/li&gt;
&lt;li&gt;preserving hierarchy&lt;/li&gt;
&lt;li&gt;removing layout noise&lt;/li&gt;
&lt;li&gt;keeping structural meaning intact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even a perfect chunking algorithm cannot fully compensate for bad source material.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why Markdown Is Becoming the Default AI-Friendly Format
&lt;/h1&gt;

&lt;p&gt;There’s a reason modern AI ingestion pipelines increasingly standardize on Markdown.&lt;/p&gt;

&lt;p&gt;It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lightweight structure&lt;/li&gt;
&lt;li&gt;semantic clarity&lt;/li&gt;
&lt;li&gt;low token overhead&lt;/li&gt;
&lt;li&gt;portability&lt;/li&gt;
&lt;li&gt;easy preprocessing&lt;/li&gt;
&lt;li&gt;easy debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For AI systems, Markdown behaves almost like a simplified intermediate language between the web and language models.&lt;/p&gt;




&lt;h1&gt;
  
  
  A Practical Approach
&lt;/h1&gt;

&lt;p&gt;The workflow that tends to work best looks something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch the page&lt;/li&gt;
&lt;li&gt;Remove boilerplate and layout noise&lt;/li&gt;
&lt;li&gt;Preserve semantic structure&lt;/li&gt;
&lt;li&gt;Convert into clean Markdown&lt;/li&gt;
&lt;li&gt;Chunk semantically&lt;/li&gt;
&lt;li&gt;Generate embeddings&lt;/li&gt;
&lt;li&gt;Store and retrieve&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The quality gains compound at every step afterward.&lt;/p&gt;
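
&lt;p&gt;The steps above can be sketched as one function that wires the stages together. Every name here (&lt;code&gt;ingest&lt;/code&gt; and the six callables passed to it) is a placeholder for whatever fetcher, cleaner, converter, chunker, embedding model, and vector store you actually use. The sketch only shows the order of operations.&lt;/p&gt;

```python
def ingest(url, fetch, clean, to_markdown, chunk, embed, store):
    """Run one page through the ingestion steps and return the chunk count."""
    html = fetch(url)                  # 1. fetch the page
    content = clean(html)              # 2-3. drop boilerplate, keep semantic structure
    markdown = to_markdown(content)    # 4. convert into clean Markdown
    chunks = chunk(markdown)           # 5. chunk semantically
    for text in chunks:
        store(url, text, embed(text))  # 6-7. generate the embedding, then store it
    return len(chunks)
```

&lt;p&gt;Passing the stages in as arguments keeps each one replaceable, so you can swap the cleaner or the chunker and compare retrieval quality without touching the rest of the pipeline.&lt;/p&gt;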




&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;A lot of AI engineering conversations focus on advanced retrieval techniques, larger context windows, or smarter prompts.&lt;/p&gt;

&lt;p&gt;But cleaner inputs often produce bigger improvements than expected.&lt;/p&gt;

&lt;p&gt;In many pipelines, the problem is not the model.&lt;/p&gt;

&lt;p&gt;It’s the ingestion layer.&lt;/p&gt;

&lt;p&gt;If your RAG system, browser agent, or retrieval workflow still consumes raw HTML directly, it’s worth experimenting with structured Markdown preprocessing first.&lt;/p&gt;

&lt;p&gt;You may end up improving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retrieval quality&lt;/li&gt;
&lt;li&gt;chunk consistency&lt;/li&gt;
&lt;li&gt;token efficiency&lt;/li&gt;
&lt;li&gt;reasoning stability&lt;/li&gt;
&lt;li&gt;operating costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;all at the same time.&lt;/p&gt;

&lt;p&gt;For developers interested in experimenting further, this article goes deeper into the topic of &lt;a href="https://aingestor.com/blog/html-vs-markdown-for-llms" rel="noopener noreferrer"&gt;&lt;code&gt;HTML vs Markdown for LLMs&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There’s also a public &lt;a href="https://aingestor.com/url-to-markdown-api" rel="noopener noreferrer"&gt;&lt;code&gt;URL to Markdown API&lt;/code&gt;&lt;/a&gt; that lets you compare noisy HTML against cleaned AI-ready Markdown in real time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>html</category>
      <category>llm</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
