Your LLM Pipeline Is Choking on Raw HTML. Here's the Fix.

#llm #rag #webscraping #python

I've been building LLM-powered data pipelines for a while now, and there's a mistake I see repeated constantly — teams throwing raw HTML into their context windows and wondering why their models produce garbage output.

It's not the model's fault. It's the data format.

Here's the thing: language models are extraordinarily good at reasoning over structured text. They're decent at extracting meaning from messy prose. But raw HTML? That's a third category entirely — and it's a terrible one.

Let's talk about what's actually going on, and what the better path looks like.

Why HTML is a terrible input format for LLMs

Think about what raw HTML actually looks like when you paste it into a prompt:

<div class="product-card" data-id="1234">
  <span class="price">$29.99</span>
  <div class="rating" aria-label="4.2 out of 5">
    <!-- rating stars -->
  </div>
  <script>trackView('1234')</script>
</div>

Now think about what you actually want the model to reason about: a product that costs $29.99 with a 4.2/5 rating. That's it. Everything else is noise.

But HTML doesn't just add noise. It degrades results in ways that are hard to pin on any single cause. Tokens burned on tag boilerplate shrink your effective context budget. Tracking scripts and ad markup give the model plausible-looking distractors to extract from. Inline JavaScript is especially costly: minified code and long identifiers tend to tokenize into many short fragments, so a script the user never sees can take up more context than the content around it.
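To put a rough number on the noise, here's a minimal sketch using only Python's standard library. It runs the product-card snippet through `html.parser`, keeps text nodes and `aria-label` values, and skips script and style content. Two caveats: character counts are only a stand-in for tokens (real counts depend on your model's tokenizer), and `VisibleTextExtractor` is a name I made up for illustration, not something from a library.

```python
from html.parser import HTMLParser

RAW_HTML = """<div class="product-card" data-id="1234">
  <span class="price">$29.99</span>
  <div class="rating" aria-label="4.2 out of 5">
    <!-- rating stars -->
  </div>
  <script>trackView('1234')</script>
</div>"""


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes and aria-label values, skipping <script>/<style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
            return
        # aria-label often carries data that has no visible text node.
        label = dict(attrs).get("aria-label")
        if label:
            self.parts.append(label)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only nodes and anything inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " | ".join(parser.parts)


signal = visible_text(RAW_HTML)
print(signal)
print(f"{len(signal)} of {len(RAW_HTML)} characters carry the data")
```

On this snippet the two facts you care about come out as `$29.99 | 4.2 out of 5`. Everything else in the markup was overhead.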

Worse, different websites structure the same data completely differently. A price tag on Amazon looks nothing like a price tag on Shopify, which looks nothing like a price tag on some random WooCommerce store. If you're feeding raw HTML to an LLM and asking it to find the price, you're essentially asking the model to reverse-engineer each site's markup conventions on the fly. It can do that, sort of, but unreliably and expensively.

The Markdown detour doesn't actually help

A lot of teams try to sidestep the HTML problem by converting to Markdown first. There are tools for this, and at first glance it seems reasonable: strip the tags, keep the text, get something clean.

Except Markdown conversion of real web pages is, in practice, kind of a mess.

You end up with things like:

[Skip to content](#main)
[☰](#nav)
**Price:** ~~$49.99~~ $29.99
[Add to cart](#)
[Share on Facebook](#) [Tweet this](#)

That's typical of what you get when you run a real product page through an HTML-to-Markdown converter. The navigation garbage is still there. The skip-link noise is still there. You're burning tokens on header navigation, footer links, cookie banners, and sidebar widgets.

Even when you strip all of that, you still have plain text where you wanted structured data. You got "Price: $29.99" — a string — instead of a price field with a numeric value and a currency. That distinction sounds pedantic until you're trying to do math with it downstream, or sort by it, or compare it across 10,000 products.
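Here's roughly what that downstream cleanup ends up looking like. This is a sketch under loud assumptions: `$` is treated as USD (not safe in general), struck-through text is assumed to be an old price, and `Price` and `parse_current_price` are hypothetical names I picked for the example.

```python
import re
from decimal import Decimal
from typing import NamedTuple, Optional

# Assumption: a bare "$" means USD. Real pages need locale information.
CURRENCY_BY_SYMBOL = {"$": "USD", "€": "EUR", "£": "GBP"}

# A currency symbol followed by digits, optional thousands separators,
# and an optional decimal part.
PRICE_PATTERN = re.compile(r"([$€£])\s*(\d+(?:,\d{3})*(?:\.\d+)?)")


class Price(NamedTuple):
    amount: Decimal
    currency: str


def parse_current_price(markdown_line: str) -> Optional[Price]:
    """Turn a Markdown price line into a typed value, or None if no price is found."""
    # Drop struck-through text such as ~~$49.99~~ so the old price is ignored.
    without_struck = re.sub(r"~~.*?~~", "", markdown_line)
    match = PRICE_PATTERN.search(without_struck)
    if match is None:
        return None
    symbol, digits = match.groups()
    return Price(Decimal(digits.replace(",", "")), CURRENCY_BY_SYMBOL[symbol])


price = parse_current_price("**Price:** ~~$49.99~~ $29.99")
print(price)
```

That's one field on one page layout. Every new site, currency format, or sale-price convention adds another special case to code like this.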

What you actually want: typed JSON straight from the URL

The cleanest solution is to skip both HTML and Markdown entirely and go directly to typed, structured output.

Here's what I mean. Instead of:

Fetch URL → get HTML
HTML → strip to Markdown
Markdown → feed to LLM
LLM → extract structured data
Structured data → clean/validate/coerce types

You do:

Fetch URL with a schema → get typed JSON

The schema-first approach looks like this conceptually. You describe what you want:

{
  "fields": [
    {"name": "title", "type": "string", "example": "Blue Widget Pro"},
    {"name": "price", "type": "number", "example": 29.99},
    {"name": "in_stock", "type": "boolean", "example": true},
    {"name": "rating", "type": "number", "example": 4.2}
  ]
}

And you get back:


{
  "title": "Blue Widget Pro",
  "price": 29.99,
  "in_stock": true,
  "rating": 4.2
}

Not "price": "$29.99". Not "in_stock": "Yes". Actual typed values your code can use directly.
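If you want to enforce that distinction in your own code, a small check against the schema is enough. The sketch below reuses the field list from above; the `type_errors` helper and the mapping from schema type names to Python types are my own assumptions for illustration, not part of any particular API.

```python
import json

SCHEMA = {
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "price", "type": "number"},
        {"name": "in_stock", "type": "boolean"},
        {"name": "rating", "type": "number"},
    ]
}

# Assumed mapping from schema type names to Python types.
PYTHON_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def type_errors(record: dict, schema: dict) -> list:
    """Return one message per field that is missing or has the wrong type."""
    errors = []
    for field in schema["fields"]:
        name, expected = field["name"], field["type"]
        if name not in record:
            errors.append(f"{name}: missing")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly for numbers.
        is_bool = isinstance(value, bool)
        ok = isinstance(value, PYTHON_TYPES[expected]) and (
            expected == "boolean" or not is_bool
        )
        if not ok:
            errors.append(f"{name}: expected {expected}, got {type(value).__name__}")
    return errors


typed = json.loads(
    '{"title": "Blue Widget Pro", "price": 29.99, "in_stock": true, "rating": 4.2}'
)
stringly = json.loads(
    '{"title": "Blue Widget Pro", "price": "$29.99", "in_stock": "Yes", "rating": 4.2}'
)

print(type_errors(typed, SCHEMA))
print(type_errors(stringly, SCHEMA))
```

The typed record passes with no errors. The string-valued one fails on `price` and `in_stock`, which is exactly the kind of breakage you'd rather catch at the boundary than in a sort or a sum later on.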

This is the approach Runo takes. You define your schema, point it at a URL, and get back structured JSON with properly typed fields. No parsing step, no post-processing, no regex to clean currency strings. The API handles the extraction and type coercion as a single operation.