zhongqiyue

Posted on Jun 5

When HTML parsing fails: using LLMs to extract messy web data

#webdev #python #ai #tutorial

I’ve been scraping websites for years. BeautifulSoup, Scrapy, Playwright — I’ve used them all. But last month I hit a wall.

A client needed me to extract product details from a dozen e-commerce sites. Most were straightforward: find the right CSS selectors, handle pagination, done. But one particular site was a nightmare. The HTML was a mess of nested divs, inline styles, and data scattered across attributes, text nodes, and even JavaScript variables. The layout changed every week. My carefully crafted selectors broke constantly.

I spent two days fixing and refactoring. Every time I thought I had it, the site updated and my pipeline broke again. I was about to tell the client it wasn’t feasible.

Then a colleague said: “Why not just give the raw HTML to an LLM and ask it to extract what you need?”

At first I laughed. LLMs hallucinate, they’re slow, expensive — right? But I was desperate. I decided to prototype it.

What I tried that didn’t work

Before going down the AI route, I exhausted traditional approaches:

Regex and string parsing: The data had no consistent pattern. Prices appeared with and without currency symbols, sometimes in data-price attributes, sometimes in nested <span>s.
BeautifulSoup with CSS selectors: Fragile as hell. One site changed a class name from product-price to price-info and my whole script died.
Playwright for dynamic content: Helped with JavaScript-rendered pages, but the parsing logic was still brittle. I still had to write XPath or CSS selectors that broke.
Scrapy with item loaders: Overkill for this one site, and still required knowing the exact HTML structure.

The root problem: the HTML structure was unpredictable. A human can look at a page and say “that’s the price”. A CSS selector cannot — it relies on structure.

What eventually worked: LLM-based extraction

I built a small script that takes raw HTML, sends it to an LLM (I used OpenAI’s GPT-4o, but you can use any model that can handle long contexts), and asks it to return a JSON object according to a schema I define.

The key insight: instead of teaching the computer where the data is, I teach it what the data looks like. I provide a description and let the LLM figure out the mapping.

Here’s a simplified version:

import json

import openai
import requests
from bs4 import BeautifulSoup

# Fetch the page (requests is enough for static pages; you might need Playwright for dynamic sites)
page = requests.get("https://example.com/product-page", timeout=30)
page.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error page
raw_html = page.text

# Clean up HTML a bit - remove scripts, styles, reduce noise
# (note: dropping <script> also drops any data that lives in inline JavaScript)
soup = BeautifulSoup(raw_html, "html.parser")
for tag in soup(["script", "style", "meta", "link", "svg"]):
    tag.decompose()
cleaned_html = str(soup)[:12000]  # crude character cap to limit context size; can cut off data

# Define the schema
schema = {
    "product_name": "string",
    "price": "string (e.g., '$19.99')",
    "availability": "string ('In Stock' or 'Out of Stock')",
    "description": "string",
    "rating": "string (e.g., '4.5 out of 5')"
}

# Call the LLM
prompt = f"""
Extract the following fields from this HTML and return a valid JSON object.
Fields: {json.dumps(schema, indent=2)}

HTML:
{cleaned_html}

Return ONLY the JSON object, no explanation.
"""

# Better: omit api_key and set OPENAI_API_KEY in the environment instead of hardcoding it
client = openai.OpenAI(api_key="sk-...")
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
    response_format={"type": "json_object"},  # ask for a bare JSON object, not prose
)

try:
    result = json.loads(completion.choices[0].message.content)
    print(result)
except json.JSONDecodeError:
    # This simplified version only reports the failure; it does not retry
    print("LLM did not return valid JSON. Retry or flag this page for review.")
That’s it. For the problematic site, this worked on the first try. No selectors, no XPath, no regular expressions. I just described what I wanted and the LLM figured out the rest.
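The simplified script only reports a JSON failure and moves on. One way to actually retry could look like the sketch below. The names `extract_with_retry` and `make_flaky_model` are hypothetical, and the model call is passed in as a plain function so the retry logic can be tried without an API key. In a real pipeline that function would wrap the `client.chat.completions.create` call.

```python
import json
from typing import Callable, Optional


def extract_with_retry(
    call_llm: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Optional[dict]:
    """Call the model until it returns a parseable JSON object, or give up."""
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON at all: try again
        if isinstance(parsed, dict):
            return parsed  # valid JSON object: done
    return None  # every attempt failed; the caller decides what to do


def make_flaky_model() -> Callable[[str], str]:
    """Stand-in for a real model: chatty first reply, valid JSON second."""
    replies = iter([
        "Sure! Here is the data you asked for:",
        '{"product_name": "Widget", "price": "$19.99"}',
    ])
    return lambda prompt: next(replies)


result = extract_with_retry(make_flaky_model(), "Extract the product fields as JSON.")
print(result)
```

Returning `None` instead of raising keeps the decision about a failed page (skip it, log it, escalate to another model) with the caller.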

The trade-offs (and there are many)

I’ve used this approach for a few weeks now, and here’s what I’ve learned:

Pros

Resilience: When the site changes its layout, the LLM often still extracts correctly because it keys on what the content means rather than where it sits in the markup.
Rapid prototyping: I can set up extraction for a new site in minutes instead of hours.
Handles messy HTML: The LLM ignores noise better than I expected. It’s almost like having a human curator.

Cons

Cost: LLM API calls cost money. For high-volume scraping, this adds up fast. I estimate $0.01–$0.03 per page with GPT-4o, so 100,000 pages would run roughly $1,000 to $3,000. Cheaper models like GPT-4o-mini can reduce cost but may be less accurate.
Latency: Each extraction takes 1–3 seconds. Run one after another, 500 pages take roughly 8 to 25 minutes, while a regex or CSS selector handles a page in milliseconds.
Hallucination: Sometimes the LLM invents data if it can’t find it in the HTML. You must validate outputs carefully.
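Context length: If the HTML is huge, you need to truncate or chunk it, which can lose data. The 12,000-character cap in the script above is exactly this kind of truncation: anything past the cutoff is invisible to the model.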
Dependency on third-party API: If OpenAI is down, your scraper stops. Consider fallback models or local LLMs (e.g., Llama 3 via Ollama).
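For the hallucination problem, one cheap first check is to confirm that every extracted value actually appears somewhere in the HTML that was sent. The sketch below assumes values are copied verbatim from the page. The function name `ungrounded_fields` and the sample HTML are made up for illustration. It searches the raw markup rather than the visible text, so values stored in attributes such as data-price still count.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(result: dict, html: str) -> list:
    """Return the names of fields whose values cannot be found in the HTML."""
    haystack = _normalize(html)
    missing = []
    for field, value in result.items():
        if value is None or value == "":
            continue  # an empty field is not a hallucination
        if _normalize(str(value)) not in haystack:
            missing.append(field)
    return missing


sample_html = (
    '<div class="p"><h1>Acme  Kettle</h1>'
    '<span data-price="19.99">$19.99</span></div>'
)
sample_result = {
    "product_name": "Acme Kettle",
    "price": "$19.99",
    "rating": "4.5 out of 5",  # not present anywhere in the HTML
}
print(ungrounded_fields(sample_result, sample_html))  # → ['rating']
```

This is deliberately strict. A value that the model reformats, that is split across tags, or that contains HTML entities will be flagged even when it is correct, so flagged fields are candidates for review rather than proof of invention.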

When NOT to use this approach

You need to extract millions of pages cheaply – stick with traditional parsing.
The data is strictly formatted and stable – CSS selectors are faster and free.
You can’t send HTML to an external API (e.g., sensitive data) – then use a local LLM or avoid AI altogether.

What I’d do differently next time

I’d combine approaches. Use traditional parsers for stable sites, and fall back to LLM only for tricky ones. Also, I’d implement a validation step: check that extracted prices look like prices, ratings are within range, etc. If validation fails, re-run with a different prompt or a more powerful model.
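A rough sketch of that combined pipeline: try extractors from cheapest to most expensive and accept the first result that passes validation. Everything here is illustrative. The regexes, the `validate` and `extract_validated` helpers, and the three stand-in extractors are assumptions, and the checks follow the schema from the script above (a price string, an "X out of Y" rating, two allowed availability values).

```python
import re
from typing import Optional

# Optional currency symbol, digits with optional thousands commas, optional cents
PRICE_RE = re.compile(r"^[^\d\s]{0,3}\s?\d[\d,]*(\.\d{1,2})?$")
RATING_RE = re.compile(r"^(\d+(?:\.\d+)?) out of (\d+)$")


def validate(result: dict) -> list:
    """Return a list of problems; an empty list means the result looks sane."""
    problems = []
    price = str(result.get("price", "")).strip()
    if not PRICE_RE.match(price):
        problems.append(f"price does not look like a price: {price!r}")
    match = RATING_RE.match(str(result.get("rating", "")).strip())
    if match is None:
        problems.append("rating is not in 'X out of Y' form")
    elif not 0 <= float(match.group(1)) <= float(match.group(2)):
        problems.append("rating is outside its own scale")
    if result.get("availability") not in ("In Stock", "Out of Stock"):
        problems.append("availability is not one of the two allowed values")
    return problems


def extract_validated(html: str, extractors: list) -> Optional[dict]:
    """Run extractors in order and return the first result that validates."""
    for extract in extractors:
        result = extract(html)
        if result is not None and not validate(result):
            return result
    return None


# Stand-ins for a CSS-selector parser, a cheap model, and a stronger model
def selector_extractor(html: str) -> Optional[dict]:
    return None  # pretend the selectors broke after a layout change


def cheap_model(html: str) -> dict:
    return {"price": "$19.99", "rating": "7 out of 5", "availability": "In Stock"}


def strong_model(html: str) -> dict:
    return {"price": "$19.99", "rating": "4.5 out of 5", "availability": "In Stock"}


best = extract_validated(
    "<html>...</html>", [selector_extractor, cheap_model, strong_model]
)
print(best)
```

Ordering the list by cost means the expensive model only runs on pages where everything cheaper failed or produced something implausible.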

Another improvement: provide the LLM with a few examples (few-shot prompting) to improve accuracy on ambiguous fields.
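With a chat-style API, few-shot prompting usually means placing worked examples in the message list as user/assistant pairs before the real input. A minimal sketch, assuming the same role/content message format as the script above; the helper name `build_few_shot_messages` and the example snippets are invented for illustration:

```python
import json


def build_few_shot_messages(examples: list, html: str, schema: dict) -> list:
    """Build a chat message list: instructions, worked examples, then the real page."""
    messages = [{
        "role": "system",
        "content": (
            "Extract the requested fields from the HTML and reply with a JSON "
            "object only. Fields: " + json.dumps(schema)
        ),
    }]
    for example_html, expected in examples:
        # Each example is shown as a user turn followed by the ideal assistant reply
        messages.append({"role": "user", "content": example_html})
        messages.append({"role": "assistant", "content": json.dumps(expected)})
    messages.append({"role": "user", "content": html})
    return messages


few_shot_schema = {"price": "string (e.g., '$19.99')"}
few_shot_examples = [
    ('<span data-price="5.00">Now only 5.00 USD</span>', {"price": "$5.00"}),
    ('<div class="cost"><b>$12</b>.50</div>', {"price": "$12.50"}),
]
messages = build_few_shot_messages(
    few_shot_examples, '<p class="x1">Price: <i>$19.99</i></p>', few_shot_schema
)
print([m["role"] for m in messages])
```

The resulting list can be passed as the `messages` argument of the chat completion call. Examples that cover the ambiguous cases (price in an attribute, price split across tags) tend to be more useful than easy ones, at the cost of extra input tokens on every request.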

Tools that helped

I’m not the only one doing this. There are now services that wrap this idea into nice APIs. For example, I came across InterWest AI, which offers a similar extraction API. I haven’t used it extensively, but it’s interesting to see this pattern being productized.

Final thoughts

LLM-based extraction isn’t a silver bullet. It’s expensive and slow. But for the 10% of cases where traditional parsing fails — changing layouts, inconsistent HTML, or just pure laziness — it’s a lifesaver.

I’m still torn. Part of me feels like I’m cheating by throwing AI at a problem that used to require elegant code. But then again, the site’s layout changes every week, and I have better things to do than update selectors.

What’s your experience with extracting data from messy websites? Have you tried AI-based parsing, or do you still prefer the precision of XPath and CSS? I’d love to hear how others handle this.

DEV Community