Markdown Is Not The Future of LLM Data Infrastructure
There's a fundamental mismatch in how we're feeding data to AI systems.
OpenAI, Anthropic, Google: every major AI lab spends billions training models to reason, plan, and take action. Then we feed those models web data converted to markdown and hope for the best.
Markdown was designed for humans to write formatted documents. It was never meant to be a data interchange format for machine intelligence. Yet somehow it became the default output of every "LLM-ready" web scraping API.
This is a problem. And it's getting worse as AI systems move from chat interfaces to autonomous agents that need to act on data, not just summarize it.
What "LLM-Ready" Markdown Actually Looks Like
I scraped the Wikipedia article for "Association football" using a popular markdown-based scraping API. Here's what came back (abbreviated):
```markdown
[Jump to content](https://en.wikipedia.org/wiki/Association_football#bodyContent)
Main menu
Main menu
move to sidebarhide
Navigation
- [Main page](https://en.wikipedia.org/wiki/Main_Page "Visit the main page [alt-z]")
- [Contents](https://en.wikipedia.org/wiki/Wikipedia:Contents)
- [Current events](https://en.wikipedia.org/wiki/Portal:Current_events)
- [Random article](https://en.wikipedia.org/wiki/Special:Random)
Contribute
- [Help](https://en.wikipedia.org/wiki/Help:Contents)
- [Learn to edit](https://en.wikipedia.org/wiki/Help:Introduction)
- [Community portal](https://en.wikipedia.org/wiki/Wikipedia:Community_portal)
## Contents
move to sidebarhide
- [(Top)](https://en.wikipedia.org/wiki/Association_football#)
- [1Name](https://en.wikipedia.org/wiki/Association_football#Name)
- [2History](https://en.wikipedia.org/wiki/Association_football#History)Toggle History subsection
# Association football
246 languages
- [Acèh](https://ace.wikipedia.org/wiki/Sipak_bhan "Sipak bhan – Acehnese")
- [Afrikaans](https://af.wikipedia.org/wiki/Sokker "Sokker – Afrikaans")
- [አማርኛ](https://am.wikipedia.org/wiki/እግር_ኳስ "እግር ኳስ – Amharic")
- [العربية](https://ar.wikipedia.org/wiki/كرة_القدم "كرة القدم – Arabic")
... (242 more language links)
[Edit links](https://www.wikidata.org/wiki/Special:EntityPage/Q2736)
- [Article](https://en.wikipedia.org/wiki/Association_football)
- [Talk](https://en.wikipedia.org/wiki/Talk:Association_football)
English
- [Read](https://en.wikipedia.org/wiki/Association_football)
- [View source](https://en.wikipedia.org/w/index.php?title=Association_football&action=edit)
Tools
Actions
- [Read](https://en.wikipedia.org/wiki/Association_football)
- [View source](https://en.wikipedia.org/w/index.php?title=Association_football&action=edit)
General
- [What links here](https://en.wikipedia.org/wiki/Special:WhatLinksHere/Association_football)
- [Related changes](https://en.wikipedia.org/wiki/Special:RecentChangesLinked/Association_football)
Appearance
move to sidebarhide
Text
- Small
- Standard
- Large
Width
- Standard
- Wide
Color (beta)
- Automatic
- Light
- Dark
From Wikipedia, the free encyclopedia
Team sport played with a ball
**Association football**, commonly shortened as **football**...
```
That's 373KB of output. The actual article content doesn't start until line 465.
Before that? Navigation menus. Table of contents. 246 language selector links. UI toggle instructions. Appearance settings. Action buttons. Metadata panels.
This is what gets fed to LLMs when you use markdown-based scraping.
The Same Page Through Structured Extraction
Same URL. AlterLab's output:
```json
{
  "@context": "https://alterlab.io/context/combined",
  "@type": "Article",
  "headline": "Association football",
  "author": {
    "@type": "Organization",
    "name": "Contributors to Wikimedia projects"
  },
  "datePublished": "2001-11-16T00:52:59Z",
  "dateModified": "2026-01-29T21:03:22Z",
  "image": "https://upload.wikimedia.org/wikipedia/commons/4/42/Football_in_Bloomington%2C_Indiana%2C_1995.jpg",
  "publisher": {
    "@type": "Organization",
    "name": "Wikimedia Foundation, Inc.",
    "logo": {
      "@type": "ImageObject",
      "url": "https://www.wikimedia.org/static/images/wmf-hor-googpub.png"
    }
  },
  "url": "https://en.wikipedia.org/wiki/Association_football",
  "_meta": {
    "extractionMethod": "schema_org",
    "extractionTimestamp": "2026-02-05T12:23:30.036979Z",
    "sourceUrl": "https://en.wikipedia.org/wiki/Association_football",
    "language": "en"
  },
  "content": {
    "paragraphs": [
      "Association football, commonly shortened as football and named soccer in some locations, is a team sport played between two teams of 11 players who almost exclusively use their feet to propel a ball around a rectangular field called a pitch.",
      "The objective of the game is to score more goals than the opposing team by moving the ball beyond the goal line into a rectangular-framed goal defended by the opponent...",
      "Association football is played in accordance with the Laws of the Game, a set of rules that has been in effect since 1863..."
    ],
    "headings": [
      {"level": 2, "text": "Name"},
      {"level": 2, "text": "History"},
      {"level": 3, "text": "Ancient precursors"},
      {"level": 3, "text": "Medieval precursors"},
      {"level": 2, "text": "Gameplay"},
      {"level": 2, "text": "Laws"}
    ],
    "links": [
      {"text": "team sport", "url": "https://en.wikipedia.org/wiki/Team_sport"},
      {"text": "FIFA", "url": "https://en.wikipedia.org/wiki/FIFA"},
      {"text": "Laws of the Game", "url": "https://en.wikipedia.org/wiki/Laws_of_the_Game_(association_football)"}
    ],
    "images": [
      {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/42/Football_in_Bloomington%2C_Indiana%2C_1995.jpg/330px-Football_in_Bloomington%2C_Indiana%2C_1995.jpg",
        "alt": "The attacking player attempts to kick the ball into the net"
      }
    ]
  }
}
```
No navigation. No language selectors. No UI chrome. No "move to sidebarhide."
Just the article. Structured. Typed. Ready to use.
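To make "ready to use" concrete, here's a minimal Python sketch of consuming that output. It assumes the JSON above has been saved to article.json; the field names are exactly the ones in the example.

```python
import json

# Load the structured extraction shown above (assumed saved to article.json).
with open("article.json") as f:
    article = json.load(f)

# Typed fields are direct lookups, not text searches.
print(article["headline"])                       # "Association football"
print(article["dateModified"])                   # ISO 8601, ready to parse
print(article["content"]["paragraphs"][0][:80])  # Start of the lead paragraph

# The outline is data, not formatting hints.
for h in article["content"]["headings"]:
    print("  " * (h["level"] - 2) + h["text"])
```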
Why This Matters for AI Infrastructure
GPT-5.2 can reason across complex domains. Claude Opus 4.5 handles nuanced multi-step analysis. Gemini 3 processes multimodal inputs natively.
But none of that matters if you're feeding them 373KB of garbage to find 15KB of content.
Context Windows Are Expensive
Every token in a context window costs money and attention. When 95% of your input is navigation menus and language selectors, you're:
- Paying for tokens that add zero value
- Diluting the signal with noise
- Forcing the model to separate content from chrome
A 128K context window sounds huge until you realize that just two pages scraped this way would overflow it, mostly with Wikipedia's UI.
Agents Need Clean Data for Actions
Consider an LLM agent trying to answer "When was association football first played?"
With markdown:
- Parse through navigation menus
- Skip 246 language links
- Find the infobox (which is mangled into markdown)
- Hope "First played | December, 1863" survived the conversion
- Extract "December, 1863" from the surrounding noise
With structured extraction:
- Look up the typed "First played" field in the structured metadata
- Done
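In code, that difference is a dictionary lookup versus a parsing project. A minimal sketch; the infobox key here is hypothetical, since the exact shape depends on the page and the extractor:

```python
import re

# Hypothetical inputs: "article" is structured output with an "infobox" key
# (the exact key depends on the extractor); "markdown_blob" is the raw dump.
article = {"infobox": {"First played": "December, 1863"}}
markdown_blob = "...370KB of chrome...\nFirst played | December, 1863\n..."

# Structured: a key lookup.
structured_answer = article["infobox"].get("First played")

# Markdown: a regex over the whole blob, hoping the table row survived.
m = re.search(r"First played\s*\|\s*(.+)", markdown_blob)
markdown_answer = m.group(1).strip() if m else None

print(structured_answer)  # December, 1863
print(markdown_answer)    # December, 1863 (only if the row survived)
```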
Error probability compounds with complexity. When your agent needs to cross-reference five Wikipedia articles, markdown parsing errors multiply. Structured data stays clean.
Training Data Quality Determines Model Quality
If you're building training datasets from web content, the extraction format determines downstream quality.
Markdown-converted training data contains:
- Navigation patterns the model shouldn't learn
- UI text that pollutes language understanding
- Structural artifacts that create noise
Structured training data contains:
- Actual content
- Proper semantic relationships
- Clean boundaries between data types
Models learn what they see. Show them navigation menus, they learn navigation menus.
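If you've ever tried to clean markdown dumps after the fact, you know how fragile that gets. A deliberately naive filtering heuristic, to illustrate the patchwork this forces on training pipelines; the marker list is illustrative, not exhaustive:

```python
# Post-hoc cleaning of markdown dumps: drop lines that look like navigation.
# Real pipelines need far more than this, which is exactly the problem with
# cleaning after extraction instead of extracting cleanly in the first place.
NAV_MARKERS = ("Main menu", "Jump to content", "move to sidebar",
               "Edit links", "Toggle", "[Read](", "[View source](")

def strip_obvious_chrome(markdown: str) -> str:
    kept = []
    for line in markdown.splitlines():
        if any(marker in line for marker in NAV_MARKERS):
            continue  # Drops known chrome; misses everything not listed.
        kept.append(line)
    return "\n".join(kept)
```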
What Data Infrastructure Consumers Actually Need
OpenAI's browsing, Anthropic's computer use, and Google's grounding all need web data. Here's what would make that data actually useful:
1. Content Separation
The article is not the page. A page contains navigation, ads, related content, footers, cookie banners. The article is the thing you came for.
Structured extraction separates these. Markdown mixes them.
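As a rough illustration of the separation problem, here's a naive main-content heuristic using BeautifulSoup. It leans on semantic HTML tags, which many pages don't use cleanly; production extraction needs much more than this:

```python
from bs4 import BeautifulSoup

def naive_main_content(html: str) -> str:
    """Very rough content/chrome separation via semantic tags.

    Assumes the page uses <article> or <main>; many pages don't,
    which is why real extraction needs smarter heuristics."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()  # Drop obvious chrome wholesale.
    main = soup.find("article") or soup.find("main") or soup.body
    return main.get_text(separator="\n", strip=True) if main else ""
```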
2. Semantic Typing
"December, 1863" as a string is ambiguous. Is it a date? A title? A caption?
"datePublished": "1863-12-01T00:00:00Z" is unambiguous. It's a date. It's typed. It can be compared, sorted, filtered.
3. Hierarchical Structure
Article content has structure: headings, paragraphs, lists, tables. This hierarchy carries meaning.
Markdown flattens hierarchy into text with formatting hints. JSON preserves it:
```json
{
  "headings": [
    {"level": 2, "text": "History"},
    {"level": 3, "text": "Ancient precursors"},
    {"level": 3, "text": "Medieval precursors"}
  ]
}
```
An agent can now navigate the document programmatically: "Find the section about medieval history" becomes a data operation, not a text search.
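For example, over the headings shape shown above:

```python
# "Find the section about medieval history" as a data operation over the
# headings from the example above, rather than a text search over markdown.
headings = [
    {"level": 2, "text": "History"},
    {"level": 3, "text": "Ancient precursors"},
    {"level": 3, "text": "Medieval precursors"},
]

medieval = [h for h in headings if "medieval" in h["text"].lower()]
print(medieval)  # [{'level': 3, 'text': 'Medieval precursors'}]
```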
4. Relationship Preservation
Links aren't just URLs. They're relationships. "Association football is governed by FIFA" contains a relationship: football → governed_by → FIFA.
Structured extraction can preserve these:
```json
{
  "links": [
    {"text": "FIFA", "url": "https://en.wikipedia.org/wiki/FIFA", "context": "governing body"}
  ]
}
```
Markdown gives you `[FIFA](https://en.wikipedia.org/wiki/FIFA)`. The relationship is implied, not stated.
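Once the relationship is explicit, it can feed a knowledge graph directly. A sketch, assuming the illustrative "context" field shown above:

```python
# Turn contextual links into knowledge-graph triples. The "context" field
# follows the illustrative example above, not a guaranteed output shape.
links = [{"text": "FIFA",
          "url": "https://en.wikipedia.org/wiki/FIFA",
          "context": "governing body"}]

triples = [("Association football", link["context"], link["text"])
           for link in links]
print(triples)  # [('Association football', 'governing body', 'FIFA')]
```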
The Economics of Clean Data
Here's the math that should matter to anyone building AI infrastructure:
Markdown approach:
- 373KB input per article
- ~93K tokens at ~4 chars/token
- At $0.01/1K tokens (input), that's $0.93 per article just to read it
- 95% of that cost is noise
Structured approach:
- ~15KB input per article
- ~3.7K tokens
- $0.037 per article
- Near 100% signal
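The arithmetic, spelled out under the same assumptions (~4 characters per token, $0.01 per 1K input tokens, treating 1KB as 1,000 characters):

```python
# Back-of-the-envelope token costs under the stated assumptions.
CHARS_PER_TOKEN = 4
PRICE_PER_1K_TOKENS = 0.01

def cost_per_article(size_kb: float) -> float:
    tokens = size_kb * 1000 / CHARS_PER_TOKEN
    return tokens / 1000 * PRICE_PER_1K_TOKENS

print(f"markdown:   ${cost_per_article(373):.3f} per article")  # ~$0.93
print(f"structured: ${cost_per_article(15):.3f} per article")   # ~$0.037
saved = (cost_per_article(373) - cost_per_article(15)) * 1_000_000
print(f"saved per million pages: ${saved:,.0f}")                # ~$895,000
```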
Scale to millions of pages and the difference is millions of dollars.
But cost isn't even the main issue. The main issue is that noisy inputs produce noisy outputs. Garbage in, garbage out scales with your ambitions.
AlterLab's Approach
We built AlterLab around a principle: extract what matters, in the format that matters.
Schema.org extraction: When pages have structured data, we use it. The Wikipedia example above was extracted from embedded schema.org markup, giving us typed fields automatically.
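Schema.org data ships as JSON-LD in a script tag, so pulling it out is straightforward. A minimal sketch of the general technique (not AlterLab's internals):

```python
import json
from bs4 import BeautifulSoup

def extract_json_ld(html: str) -> list[dict]:
    """Pull embedded schema.org JSON-LD blocks out of a page."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            blocks.append(json.loads(script.string or ""))
        except json.JSONDecodeError:
            continue  # Malformed JSON-LD is common in the wild; skip it.
    return blocks
```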
Intelligent content separation: We identify and isolate the main content from navigation, ads, and chrome. The article is the article, not the page.
Hierarchical preservation: Document structure survives extraction. Headings, paragraphs, lists, tables retain their relationships.
Flexible output: Need markdown for a specific use case? We provide it. But we also provide the structured data so you can choose based on your needs, not our assumptions.
For AI Labs and Data Infrastructure Teams
If you're building systems that consume web data at scale:
For training pipelines: Structured extraction produces cleaner datasets. Less noise in training means less noise in outputs.
For RAG systems: Structured chunks with metadata outperform markdown blobs. Retrieval improves when content boundaries are clear.
For agents: Actions need typed data. When your agent calls a function, it needs parameters, not markdown to parse.
For evaluations: Test your models on clean inputs. Separate parsing ability from reasoning ability.
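For the RAG case specifically, here's a sketch of why structured input chunks cleanly: each paragraph arrives with its own boundaries and attachable metadata, so chunking is a loop, not a heuristic. Field names follow the example output above.

```python
# Chunking structured output for RAG: each paragraph is already a clean
# unit. Field names follow the example extraction shown earlier.
def to_rag_chunks(article: dict) -> list[dict]:
    meta = {
        "source": article["url"],
        "headline": article["headline"],
        "modified": article["dateModified"],
    }
    return [
        {"text": para, "metadata": {**meta, "position": i}}
        for i, para in enumerate(article["content"]["paragraphs"])
    ]
```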
The Future Is Typed, Not Formatted
Markdown was a reasonable v1 approach when LLMs were primarily chat interfaces summarizing text. We're past that.
Agents need typed data for tool use. Training pipelines need clean structured inputs. Evaluation needs isolated, verifiable data points.
The scraping APIs that survive will be the ones that understand: AI systems don't need documents. They need data.
Try It
AlterLab is a web scraping API built for AI applications. Structured JSON output, schema-based extraction, typed fields by default.
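For a feel of the shape, a hypothetical request sketch; the endpoint, parameters, and auth scheme here are illustrative, so check the actual API docs:

```python
import requests

# Hypothetical request shape for illustration only; the real endpoint,
# parameters, and auth scheme live in the API docs.
resp = requests.post(
    "https://api.alterlab.io/v1/extract",  # illustrative URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"url": "https://en.wikipedia.org/wiki/Association_football"},
)
article = resp.json()
print(article.get("headline"))
```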
No credit card required. Free tier available.
If you're at an AI company evaluating data infrastructure, reach out. We'd love to show you what structured web extraction looks like at scale.
The markdown-to-LLM pipeline was a reasonable first attempt. It's time for v2.
Yash Dubey
CEO, RapierCraft