Markdown Is Not The Future of LLM Data Infrastructure
There's a fundamental mismatch in how we're feeding data to AI systems.
OpenAI, Anthropic, Google: every major AI lab spends billions training models to reason, plan, and take action. Then we feed those models web data converted to markdown and hope for the best.
Markdown was designed for humans to write formatted documents. It was never meant to be a data interchange format for machine intelligence. Yet somehow it became the default output of every "LLM-ready" web scraping API.
This is a problem. And it's getting worse as AI systems move from chat interfaces to autonomous agents that need to act on data, not just summarize it.
What "LLM-Ready" Markdown Actually Looks Like
I scraped the Wikipedia article for "Association football" using a popular markdown-based scraping API. Here's what came back (abbreviated):
```markdown
[Jump to content](https://en.wikipedia.org/wiki/Association_football#bodyContent)
Main menu
Main menu
move to sidebarhide
Navigation
- [Main page](https://en.wikipedia.org/wiki/Main_Page "Visit the main page [alt-z]")
- [Contents](https://en.wikipedia.org/wiki/Wikipedia:Contents)
- [Current events](https://en.wikipedia.org/wiki/Portal:Current_events)
- [Random article](https://en.wikipedia.org/wiki/Special:Random)
Contribute
- [Help](https://en.wikipedia.org/wiki/Help:Contents)
- [Learn to edit](https://en.wikipedia.org/wiki/Help:Introduction)
- [Community portal](https://en.wikipedia.org/wiki/Wikipedia:Community_portal)
## Contents
move to sidebarhide
- [(Top)](https://en.wikipedia.org/wiki/Association_football#)
- [1Name](https://en.wikipedia.org/wiki/Association_football#Name)
- [2History](https://en.wikipedia.org/wiki/Association_football#History)Toggle History subsection
# Association football
246 languages
- [Acèh](https://ace.wikipedia.org/wiki/Sipak_bhan "Sipak bhan – Acehnese")
- [Afrikaans](https://af.wikipedia.org/wiki/Sokker "Sokker – Afrikaans")
- [አማርኛ](https://am.wikipedia.org/wiki/እግር_ኳስ "እግር ኳስ – Amharic")
- [العربية](https://ar.wikipedia.org/wiki/كرة_القدم "كرة القدم – Arabic")
... (242 more language links)
[Edit links](https://www.wikidata.org/wiki/Special:EntityPage/Q2736)
- [Article](https://en.wikipedia.org/wiki/Association_football)
- [Talk](https://en.wikipedia.org/wiki/Talk:Association_football)
English
- [Read](https://en.wikipedia.org/wiki/Association_football)
- [View source](https://en.wikipedia.org/w/index.php?title=Association_football&action=edit)
Tools
Actions
- [Read](https://en.wikipedia.org/wiki/Association_football)
- [View source](https://en.wikipedia.org/w/index.php?title=Association_football&action=edit)
General
- [What links here](https://en.wikipedia.org/wiki/Special:WhatLinksHere/Association_football)
- [Related changes](https://en.wikipedia.org/wiki/Special:RecentChangesLinked/Association_football)
Appearance
move to sidebarhide
Text
- Small
- Standard
- Large
Width
- Standard
- Wide
Color (beta)
- Automatic
- Light
- Dark
From Wikipedia, the free encyclopedia
Team sport played with a ball
**Association football**, commonly shortened as **football**...
```
That's 373KB of output. The actual article content doesn't start until line 465.
Before that? Navigation menus. Table of contents. 246 language selector links. UI toggle instructions. Appearance settings. Action buttons. Metadata panels.
This is what gets fed to LLMs when you use markdown-based scraping.
The Same Page Through Structured Extraction
Same URL. AlterLab's output:
```json
{
  "@context": "https://alterlab.io/context/combined",
  "@type": "Article",
  "headline": "Association football",
  "author": {
    "@type": "Organization",
    "name": "Contributors to Wikimedia projects"
  },
  "datePublished": "2001-11-16T00:52:59Z",
  "dateModified": "2026-01-29T21:03:22Z",
  "image": "https://upload.wikimedia.org/wikipedia/commons/4/42/Football_in_Bloomington%2C_Indiana%2C_1995.jpg",
  "publisher": {
    "@type": "Organization",
    "name": "Wikimedia Foundation, Inc.",
    "logo": {
      "@type": "ImageObject",
      "url": "https://www.wikimedia.org/static/images/wmf-hor-googpub.png"
    }
  },
  "url": "https://en.wikipedia.org/wiki/Association_football",
  "_meta": {
    "extractionMethod": "schema_org",
    "extractionTimestamp": "2026-02-05T12:23:30.036979Z",
    "sourceUrl": "https://en.wikipedia.org/wiki/Association_football",
    "language": "en"
  },
  "content": {
    "paragraphs": [
      "Association football, commonly shortened as football and named soccer in some locations, is a team sport played between two teams of 11 players who almost exclusively use their feet to propel a ball around a rectangular field called a pitch.",
      "The objective of the game is to score more goals than the opposing team by moving the ball beyond the goal line into a rectangular-framed goal defended by the opponent...",
      "Association football is played in accordance with the Laws of the Game, a set of rules that has been in effect since 1863..."
    ],
    "headings": [
      {"level": 2, "text": "Name"},
      {"level": 2, "text": "History"},
      {"level": 3, "text": "Ancient precursors"},
      {"level": 3, "text": "Medieval precursors"},
      {"level": 2, "text": "Gameplay"},
      {"level": 2, "text": "Laws"}
    ],
    "links": [
      {"text": "team sport", "url": "https://en.wikipedia.org/wiki/Team_sport"},
      {"text": "FIFA", "url": "https://en.wikipedia.org/wiki/FIFA"},
      {"text": "Laws of the Game", "url": "https://en.wikipedia.org/wiki/Laws_of_the_Game_(association_football)"}
    ],
    "images": [
      {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/42/Football_in_Bloomington%2C_Indiana%2C_1995.jpg/330px-Football_in_Bloomington%2C_Indiana%2C_1995.jpg",
        "alt": "The attacking player attempts to kick the ball into the net"
      }
    ]
  }
}
```
No navigation. No language selectors. No UI chrome. No "move to sidebarhide."
Just the article. Structured. Typed. Ready to use.
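To make "ready to use" concrete, here's a minimal Python sketch of consuming that output. It assumes the JSON above has been saved to article.json; the field names are exactly the ones in the example.

```python
import json

# Load the structured extraction shown above (assumed saved to article.json).
with open("article.json") as f:
    article = json.load(f)

# Typed fields are direct lookups, not text searches.
print(article["headline"])                       # "Association football"
print(article["dateModified"])                   # ISO 8601, ready to parse
print(article["content"]["paragraphs"][0][:80])  # Start of the lead paragraph

# The outline is data, not formatting hints.
for h in article["content"]["headings"]:
    print("  " * (h["level"] - 2) + h["text"])
```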
Why This Matters for AI Infrastructure
GPT-5.2 can reason across complex domains. Claude Opus 4.5 handles nuanced multi-step analysis. Gemini 3 processes multimodal inputs natively.
But none of that matters if you're feeding them 373KB of garbage to find 15KB of content.
Context Windows Are Expensive
Every token in a context window costs money and attention. When 95% of your input is navigation menus and language selectors, you're:
- Paying for tokens that add zero value
- Diluting the signal with noise
- Forcing the model to separate content from chrome
A 128K context window sounds huge until you realize that just two pages scraped this way would overflow it, mostly with Wikipedia's UI.
Agents Need Clean Data for Actions
Consider an LLM agent trying to answer "When was association football first played?"
With markdown:
- Parse through navigation menus
- Skip 246 language links
- Find the infobox (which is mangled into markdown)
- Hope "First played | December, 1863" survived the conversion
- Extract "December, 1863" from the surrounding noise
With structured extraction:
- Look up the typed "First played" field in the structured metadata
- Done
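In code, that difference is a dictionary lookup versus a parsing project. A minimal sketch; the infobox key here is hypothetical, since the exact shape depends on the page and the extractor:

```python
import re

# Hypothetical inputs: "article" is structured output with an "infobox" key
# (the exact key depends on the extractor); "markdown_blob" is the raw dump.
article = {"infobox": {"First played": "December, 1863"}}
markdown_blob = "...370KB of chrome...\nFirst played | December, 1863\n..."

# Structured: a key lookup.
structured_answer = article["infobox"].get("First played")

# Markdown: a regex over the whole blob, hoping the table row survived.
m = re.search(r"First played\s*\|\s*(.+)", markdown_blob)
markdown_answer = m.group(1).strip() if m else None

print(structured_answer)  # December, 1863
print(markdown_answer)    # December, 1863 (only if the row survived)
```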
Error probability compounds with complexity. When your agent needs to cross-reference five Wikipedia articles, markdown parsing errors multiply. Structured data stays clean.
Training Data Quality Determines Model Quality
If you're building training datasets from web content, the extraction format determines downstream quality.
Markdown-converted training data contains:
- Navigation patterns the model shouldn't learn
- UI text that pollutes language understanding
- Structural artifacts that create noise
Structured training data contains:
- Actual content
- Proper semantic relationships
- Clean boundaries between data types
Models learn what they see. Show them navigation menus, they learn navigation menus.
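If you've ever tried to clean markdown dumps after the fact, you know how fragile that gets. A deliberately naive filtering heuristic, to illustrate the patchwork this forces on training pipelines; the marker list is illustrative, not exhaustive:

```python
# Post-hoc cleaning of markdown dumps: drop lines that look like navigation.
# Real pipelines need far more than this, which is exactly the problem with
# cleaning after extraction instead of extracting cleanly in the first place.
NAV_MARKERS = ("Main menu", "Jump to content", "move to sidebar",
               "Edit links", "Toggle", "[Read](", "[View source](")

def strip_obvious_chrome(markdown: str) -> str:
    kept = []
    for line in markdown.splitlines():
        if any(marker in line for marker in NAV_MARKERS):
            continue  # Drops known chrome; misses everything not listed.
        kept.append(line)
    return "\n".join(kept)
```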
What Data Infrastructure Consumers Actually Need
OpenAI's browsing, Anthropic's computer use, and Google's grounding all need web data. Here's what would make that data actually useful:
1. Content Separation
The article is not the page. A page contains navigation, ads, related content, footers, cookie banners. The article is the thing you came for.
Structured extraction separates these. Markdown mixes them.
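As a rough illustration of the separation problem, here's a naive main-content heuristic using BeautifulSoup. It leans on semantic HTML tags, which many pages don't use cleanly; production extraction needs much more than this:

```python
from bs4 import BeautifulSoup

def naive_main_content(html: str) -> str:
    """Very rough content/chrome separation via semantic tags.

    Assumes the page uses <article> or <main>; many pages don't,
    which is why real extraction needs smarter heuristics."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()  # Drop obvious chrome wholesale.
    main = soup.find("article") or soup.find("main") or soup.body
    return main.get_text(separator="\n", strip=True) if main else ""
```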
2. Semantic Typing
"December, 1863" as a string is ambiguous. Is it a date? A title? A caption?
"datePublished": "1863-12-01T00:00:00Z" is unambiguous. It's a date. It's typed. It can be compared, sorted, filtered.
3. Hierarchical Structure
Article content has structure: headings, paragraphs, lists, tables. This hierarchy carries meaning.
Markdown flattens hierarchy into text with formatting hints. JSON preserves it:
```json
{
  "headings": [
    {"level": 2, "text": "History"},
    {"level": 3, "text": "Ancient precursors"},
    {"level": 3, "text": "Medieval precursors"}
  ]
}
```
An agent can now navigate the document programmatically: "Find the section about medieval history" becomes a data operation, not a text search.
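For example, over the headings shape shown above:

```python
# "Find the section about medieval history" as a data operation over the
# headings from the example above, rather than a text search over markdown.
headings = [
    {"level": 2, "text": "History"},
    {"level": 3, "text": "Ancient precursors"},
    {"level": 3, "text": "Medieval precursors"},
]

medieval = [h for h in headings if "medieval" in h["text"].lower()]
print(medieval)  # [{'level': 3, 'text': 'Medieval precursors'}]
```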
4. Relationship Preservation
Links aren't just URLs. They're relationships. "Association football is governed by FIFA" contains a relationship: football → governed_by → FIFA.
Structured extraction can preserve these:
```json
{
  "links": [
    {"text": "FIFA", "url": "https://en.wikipedia.org/wiki/FIFA", "context": "governing body"}
  ]
}
```
Markdown gives you `[FIFA](https://en.wikipedia.org/wiki/FIFA)`. The relationship is implied, not stated.
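Once the relationship is explicit, it can feed a knowledge graph directly. A sketch, assuming the illustrative "context" field shown above:

```python
# Turn contextual links into knowledge-graph triples. The "context" field
# follows the illustrative example above, not a guaranteed output shape.
links = [{"text": "FIFA",
          "url": "https://en.wikipedia.org/wiki/FIFA",
          "context": "governing body"}]

triples = [("Association football", link["context"], link["text"])
           for link in links]
print(triples)  # [('Association football', 'governing body', 'FIFA')]
```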
The Economics of Clean Data
Here's the math that should matter to anyone building AI infrastructure:
Markdown approach:
- 373KB input per article
- ~93K tokens at ~4 chars/token
- At $0.01/1K tokens (input), that's $0.93 per article just to read it
- 95% of that cost is noise
Structured approach:
- ~15KB input per article
- ~3.7K tokens
- $0.037 per article
- Near 100% signal
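The arithmetic, spelled out under the same assumptions (~4 characters per token, $0.01 per 1K input tokens, treating 1KB as 1,000 characters):

```python
# Back-of-the-envelope token costs under the stated assumptions.
CHARS_PER_TOKEN = 4
PRICE_PER_1K_TOKENS = 0.01

def cost_per_article(size_kb: float) -> float:
    tokens = size_kb * 1000 / CHARS_PER_TOKEN
    return tokens / 1000 * PRICE_PER_1K_TOKENS

print(f"markdown:   ${cost_per_article(373):.3f} per article")  # ~$0.93
print(f"structured: ${cost_per_article(15):.3f} per article")   # ~$0.037
saved = (cost_per_article(373) - cost_per_article(15)) * 1_000_000
print(f"saved per million pages: ${saved:,.0f}")                # ~$895,000
```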
Scale to millions of pages and the difference is millions of dollars.
But cost isn't even the main issue. The main issue is that noisy inputs produce noisy outputs. Garbage in, garbage out scales with your ambitions.
AlterLab's Approach
We built AlterLab around a principle: extract what matters, in the format that matters.
Schema.org extraction: When pages have structured data, we use it. The Wikipedia example above was extracted from embedded schema.org markup, giving us typed fields automatically.
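Schema.org data ships as JSON-LD in a script tag, so pulling it out is straightforward. A minimal sketch of the general technique (not AlterLab's internals):

```python
import json
from bs4 import BeautifulSoup

def extract_json_ld(html: str) -> list[dict]:
    """Pull embedded schema.org JSON-LD blocks out of a page."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            blocks.append(json.loads(script.string or ""))
        except json.JSONDecodeError:
            continue  # Malformed JSON-LD is common in the wild; skip it.
    return blocks
```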
Intelligent content separation: We identify and isolate the main content from navigation, ads, and chrome. The article is the article, not the page.
Hierarchical preservation: Document structure survives extraction. Headings, paragraphs, lists, tables retain their relationships.
Flexible output: Need markdown for a specific use case? We provide it. But we also provide the structured data so you can choose based on your needs, not our assumptions.
For AI Labs and Data Infrastructure Teams
If you're building systems that consume web data at scale:
For training pipelines: Structured extraction produces cleaner datasets. Less noise in training means less noise in outputs.
For RAG systems: Structured chunks with metadata outperform markdown blobs. Retrieval improves when content boundaries are clear.
For agents: Actions need typed data. When your agent calls a function, it needs parameters, not markdown to parse.
For evaluations: Test your models on clean inputs. Separate parsing ability from reasoning ability.
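For the RAG case specifically, here's a sketch of why structured input chunks cleanly: each paragraph arrives with its own boundaries and attachable metadata, so chunking is a loop, not a heuristic. Field names follow the example output above.

```python
# Chunking structured output for RAG: each paragraph is already a clean
# unit. Field names follow the example extraction shown earlier.
def to_rag_chunks(article: dict) -> list[dict]:
    meta = {
        "source": article["url"],
        "headline": article["headline"],
        "modified": article["dateModified"],
    }
    return [
        {"text": para, "metadata": {**meta, "position": i}}
        for i, para in enumerate(article["content"]["paragraphs"])
    ]
```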
The Future Is Typed, Not Formatted
Markdown was a reasonable v1 approach when LLMs were primarily chat interfaces summarizing text. We're past that.
Agents need typed data for tool use. Training pipelines need clean structured inputs. Evaluation needs isolated, verifiable data points.
The scraping APIs that survive will be the ones that understand: AI systems don't need documents. They need data.
Try It
AlterLab is a web scraping API built for AI applications. Structured JSON output, schema-based extraction, typed fields by default.
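For a feel of the shape, a hypothetical request sketch; the endpoint, parameters, and auth scheme here are illustrative, so check the actual API docs:

```python
import requests

# Hypothetical request shape for illustration only; the real endpoint,
# parameters, and auth scheme live in the API docs.
resp = requests.post(
    "https://api.alterlab.io/v1/extract",  # illustrative URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"url": "https://en.wikipedia.org/wiki/Association_football"},
)
article = resp.json()
print(article.get("headline"))
```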
No credit card required. Free tier available.
If you're at an AI company evaluating data infrastructure, reach out. We'd love to show you what structured web extraction looks like at scale.
The markdown-to-LLM pipeline was a reasonable first attempt. It's time for v2.
Yash Dubey
CEO, RapierCraft