zhongqiyue

Posted on Jun 19

Taming the Chaos: Parsing Messy HTML with LLMs

#python #webdev #ai #tutorial

Last month, I hit a wall. I was building a price comparison tool for hobby electronics, and I needed to pull product names, prices, and stock status from about 30 different vendor sites. Easy, right? Just scrape them.

Wrong.

Every site had a unique layout. Some used tables, others used nested divs with class names like product-detail-block__3f2a. One site literally returned a different HTML structure every other request because they A/B tested their design. My BeautifulSoup selector chains looked like spaghetti, and every time a site updated, my script broke. I spent more time fixing scrapers than analyzing data.

I tried the obvious dead ends first.

What Didn’t Work

Regex on raw HTML – Only works if you enjoy pain and hate yourself.
CSS selectors – Brittle the moment a developer renames a class.
Headless browser automation – Selenium and Playwright solved dynamic content, but they were slow, resource-heavy, and still required me to update selectors.
Manual annotation – I could train a model, but that meant labeling hundreds of pages. No thanks.

I needed something that could understand the HTML rather than blindly match patterns. Something that could handle "sku" vs "product-sku" vs "data-sku" automatically.

The Lightbulb Moment: LLMs Reading HTML

A colleague mentioned they had been experimenting with GPT-4 to convert messy HTML into clean JSON. At first I laughed – "You want to throw an LLM at HTML? That’s like using a flamethrower to toast bread."

But I was desperate. So I tried it.

The idea is simple: feed the raw HTML (or a cleaned version) to an LLM along with a schema, and ask it to extract the data. The model’s language understanding handles the structural variations.

I wrote a small Python prototype using OpenAI’s structured output feature (you can also use local models like Llama 3 if you don’t want API costs). Here’s the core function:

from openai import OpenAI
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: str
    in_stock: bool
    sku: str | None

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

MAX_HTML_CHARS = 5000  # a character limit, not a token limit

def extract_product(html: str) -> Product:
    # Truncate before building the prompt so no stray comment leaks into it
    snippet = html[:MAX_HTML_CHARS]
    prompt = f"""
You are a data extraction assistant. Given the HTML of a product page, extract the product information.
Return a JSON object with these fields:
- name: product name
- price: the current price as a string (e.g. "$49.99")
- in_stock: boolean, whether the product is available
- sku: stock keeping unit if present, else null

Only output the JSON, no extra text.

HTML:
{snippet}
"""
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format=Product
    )
    product = response.choices[0].message.parsed
    if product is None:  # e.g. the model refused to answer
        raise ValueError("model returned no parsed product")
    return product

I also built a simple cache layer to avoid re‑calling the API for the same page.
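
The cache itself is not shown in this post. A minimal sketch of what such a layer can look like, assuming a SHA-256 hash of the HTML as the key and one JSON file per page on disk. The `cached_extract` name, the `.cache` directory, and the `extract` callable are all illustrative, not the exact code from my pipeline:

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

CACHE_DIR = Path(".cache")  # hypothetical location for cached results


def cached_extract(html: str, extract: Callable[[str], dict],
                   cache_dir: Path = CACHE_DIR) -> dict:
    """Return the stored result for this exact HTML; call extract() only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(html.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = extract(html)  # the expensive call, e.g. the LLM request
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

With the function above it could be called as `cached_extract(html, lambda h: extract_product(h).model_dump())`. Keying on the content hash means a page whose markup changes on every request will always miss, so a URL-plus-timestamp key may suit some sites better.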

The results were… surprisingly good. For 8 out of 10 sites, the first extraction was perfect. For the other 2, I had to tweak the schema or add examples to the prompt.

Where It Shines

Ever-changing layouts – The LLM adapts; you don’t rewrite selectors.
One schema fits many sites – I defined Product once and used it across all 30 stores.
Handles weirdness – Some sites had prices inside a <span> inside a <div> inside a table. The LLM figured it out.

The Ugly Trade-offs

Let’s not pretend this is a silver bullet.

Cost: Every extraction costs a fraction of a cent with gpt-4o-mini. For 1000 products, that’s maybe $0.50 – acceptable for my side project, but not for real-time scraping at scale.

Latency: 2–5 seconds per page. If you need speed, traditional selectors win.

Hallucination: Sometimes the LLM invents a price. I added a validation step that sanity‑checks the output (e.g., price matches \d+\.\d{2}).

HTML size limits: I cut the HTML down to the product area and cap it at 5000 characters (characters, not tokens). You can’t throw the whole page (maybe 100k tokens) at a cheap model.

Prompt engineering fragility: A small change in prompt wording can break extraction. I ended up with a prompt versioning system.
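
For the hallucination check mentioned above, a rough sketch of a validator. The regex is the one from the text; the second check (the extracted number must literally appear in the source HTML) is an extra assumption added here for illustration, and `is_plausible` is a hypothetical name:

```python
import re

PRICE_RE = re.compile(r"\d+\.\d{2}")


def is_plausible(product: dict, html: str) -> bool:
    """Reject extractions whose price is malformed or absent from the source HTML."""
    price = product.get("price") or ""
    match = PRICE_RE.search(price)
    if match is None:
        return False
    # A price the model invented will usually not appear anywhere in the page.
    return match.group(0) in html
```

Anything that fails the check can be retried or logged for manual review rather than written to the dataset.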

When NOT to Use This Approach

Static, well-structured pages – use BeautifulSoup. It’s faster and free.
Extremely high volume (millions of pages) – the API cost will hurt.
Pages with massive HTML (like whole documentation sites) – trim aggressively or use a headless renderer first.
Data that requires pixel-perfect precision (e.g., exact currency symbol from a rendered page) – LLMs are fuzzy.

How I Blend Both Worlds

Now my pipeline tries cheap pattern matching first. If the regex or BeautifulSoup fails (or returns None), it falls back to the LLM. That way I keep cost low but still have a safety net.

For example:

from bs4 import BeautifulSoup

def try_selectors(soup: BeautifulSoup) -> dict | None:
    # Cheap first pass: try known price selectors
    price_el = soup.select_one(".price, .product-price, [data-price]")
    if price_el:
        return {"price": price_el.get_text(strip=True)}
    return None

# In the main loop (html holds the raw page source):
soup = BeautifulSoup(html, "html.parser")
result = try_selectors(soup)
if not result:
    # Selectors found nothing, so fall back to the LLM
    result = extract_product(str(soup))

This hybrid approach turned my weekend nightmare into a maintainable script. I even found a tool called Interwest AI that does something conceptually similar (embeddings + LLM for structured extraction), but I stuck with my own pipeline because I needed fine-grained control over caching and fallbacks.

Lessons Learned

Always trim your HTML input – you’re paying for tokens, not for elegance.
Validate the output – a simple regex or type check saves you from hallucinated data.
Version your prompts – I store prompts as files in Git, because changing one word can ruin everything.
Monitor your API costs – I set a daily budget alert in the OpenAI dashboard.
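
The prompt-versioning lesson can be sketched roughly like this, assuming one text file per prompt version with a `$html` placeholder. The `prompts` directory, the `<name>.<version>.txt` naming scheme, and both function names are hypothetical:

```python
from pathlib import Path
from string import Template

PROMPT_DIR = Path("prompts")  # hypothetical directory tracked in Git


def load_prompt(name: str, version: str, prompt_dir: Path = PROMPT_DIR) -> Template:
    """Load <prompt_dir>/<name>.<version>.txt as a Template with a $html placeholder."""
    path = prompt_dir / f"{name}.{version}.txt"
    return Template(path.read_text(encoding="utf-8"))


def render_prompt(template: Template, html: str) -> str:
    # safe_substitute leaves a stray "$" in the prompt text (e.g. "$49.99") untouched
    return template.safe_substitute(html=html)
```

Pinning the version string in code means a prompt edit shows up as a new file in the Git history, and rolling back is a one-line change.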

What I’d Do Differently Next Time

I would start with a local small language model (like Llama 3.1 8B) for the extraction, because it’s free after setup, even if slightly less accurate. Also, I’d pre‑process the HTML more aggressively: strip <script>, <style>, and inline CSS to reduce noise.
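
A minimal sketch of that pre-processing step, written against the standard library's `html.parser` so it runs without extra dependencies (with BeautifulSoup, calling `decompose()` on each `<script>` and `<style>` tag would be the rough equivalent). The `strip_noise` name is illustrative, and this version also drops comments and the doctype as a side effect:

```python
from html import escape
from html.parser import HTMLParser


class _Stripper(HTMLParser):
    """Re-emit HTML without <script>/<style> blocks or inline style attributes."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=False)
        self.out: list[str] = []
        self._skip_depth = 0

    def _emit_tag(self, tag, attrs, close=""):
        # Rebuild the tag, dropping style="..." and re-escaping attribute values
        kept = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
            if k != "style"
        )
        self.out.append(f"<{tag}{kept}{close}>")

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        if self._skip_depth == 0 and tag not in self.SKIP:
            self._emit_tag(tag, attrs, close=" /")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self._skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self._skip_depth == 0:
            self.out.append(f"&#{name};")


def strip_noise(html: str) -> str:
    parser = _Stripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Running this before the character cap leaves more of the budget for markup that actually carries product data.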

Your Turn

This approach changed how I think about scraping. Instead of fighting the DOM, I’m now teaching the machine to read. It’s not perfect, but it gets me 90% of the way.

Have you tried using LLMs for data extraction? What does your setup look like? I’d love to hear how you handle the chaos.

DEV Community