zhongqiyue

Posted on Jun 12

When Regex Fails: LLMs for Messy HTML Data

#python #webdev #ai #tutorial

Last month I inherited a project that needed to extract product information from a legacy e‑commerce site. The HTML was a nightmare—no semantic classes, inconsistent attribute names, and the occasional blob of inline JavaScript. I thought I could just write a few regular expressions and be done in an hour. Six hours later I was staring at a wall of conditional logic that broke every time the page changed.

I needed a better way, and I ended up using a large language model (LLM) to handle the fuzzy extraction. Here’s what I learned—dead ends included—and a working approach you can copy‑paste today.

The Problem

The site had product cards like this:

<div id="prod_123">
  <span class="name">Widget Alpha</span>
  <span>Price: <b>$29.99</b></span>
  <p>SKU: WID-001</p>
  <div class="desc">A handy gadget<br>with extra features</div>
  <span>In Stock</span>
</div>

But other cards would swap <span> for <div>, omit the SKU entirely, or use inline styles. A few pages even dumped the price into a data-* attribute inside a script tag.

Parsing this with BeautifulSoup and CSS selectors worked on 80% of the pages, but that last 20% caused silent failures. I spent days writing custom parsers that became unmaintainable.

What Didn't Work

1. Regex

I tried patterns like /(Price:)\s*<[^>]+>([^<]+)<\/b>/i. It worked on one page but broke on another where the <b> was nested differently. Regex is brittle for HTML—we all know this, but sometimes we pretend we don't.
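To make the brittleness concrete, here is a small sketch that ports the pattern above to Python's re module. The `nested` snippet is a made-up variation (an extra <i> inside the <b>), not markup copied from the site, and `regex_price` is just an illustrative helper name.

```python
import re

# The same pattern as above, translated to Python syntax.
PRICE_RE = re.compile(r"(Price:)\s*<[^>]+>([^<]+)</b>", re.IGNORECASE)


def regex_price(html):
    """Return the captured price text, or None if the pattern does not match."""
    match = PRICE_RE.search(html)
    return match.group(2) if match else None


flat = "<span>Price: <b>$29.99</b></span>"
# Hypothetical variation: one extra tag between <b> and the price text.
nested = "<span>Price: <b><i>$29.99</i></b></span>"

print(regex_price(flat))    # $29.99
print(regex_price(nested))  # None
```

The second call returns None without raising anything, which is exactly the kind of silent failure that is easy to miss in a batch run.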

2. CSS Selectors + Manual Rules

I wrote a set of rules: “if .name exists, use that; else try [itemprop="name"]; else fallback to first <h3>.” Every new page meant new rules. The rule count exploded, and I still missed edge cases.

3. A Full‑Blown GPT‑4 API Call for Every Item

I fed entire HTML blocks to GPT-4 with a prompt like “extract name, price, SKU, description, stock status.” It worked beautifully, but it cost $0.03 per product, which comes to $300 for 10,000 products, and latency was 2–3 seconds per call. That was more money and waiting than I wanted to spend on a one-time migration.
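For a sense of scale, here is the back-of-the-envelope arithmetic as a tiny script. The per-call cost, product count, and latency range are the figures quoted above; running the calls one after another (no batching or parallelism) is my assumption.

```python
PRODUCTS = 10_000
COST_CENTS_PER_CALL = 3        # $0.03 per product, kept in cents to avoid float noise
LATENCY_RANGE_S = (2, 3)       # seconds per call, low and high estimate

total_cost_usd = PRODUCTS * COST_CENTS_PER_CALL / 100
# Assumption: calls run sequentially, one product at a time.
hours = tuple(PRODUCTS * seconds / 3600 for seconds in LATENCY_RANGE_S)

print(f"API cost: ${total_cost_usd:,.2f}")
print(f"Sequential runtime: {hours[0]:.1f} to {hours[1]:.1f} hours")
# API cost: $300.00
# Sequential runtime: 5.6 to 8.3 hours
```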

What Finally Worked: Small LLM + Structured Schema

I used a smaller, cheaper model (like Llama 3.1 8B via Ollama, or a service that wraps similar models) and asked it to output JSON according to a predefined schema. The trick was to show it the schema and only ask for the fields I needed, with clear instructions on how to handle missing data.

Here’s the core idea:

1. Grab the raw HTML of the product card.
2. Build a prompt that includes the JSON schema and a few examples.
3. Use a local or cost-effective LLM to generate the JSON.
4. Parse the JSON and validate it programmatically (if the LLM returns nonsense, retry once).
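The validation step in that list can be sketched roughly as follows. This is a minimal version under my own assumptions: the function name `validation_errors` and the exact rules (non-empty name, non-negative numeric price, nullable optional fields) are illustrative, based on the five fields used in this post.

```python
from typing import Any, List


def validation_errors(record: Any) -> List[str]:
    """Return a list of problems; an empty list means the record is usable."""
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]
    errors = []

    name = record.get("name")
    if not isinstance(name, str) or not name.strip():
        errors.append("name must be a non-empty string")

    price = record.get("price")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        errors.append("price must be a number")
    elif price < 0:
        errors.append("price must not be negative")

    # Optional fields may be null, but must have the right type when present.
    for field, expected in (("sku", str), ("description", str), ("in_stock", bool)):
        value = record.get(field)
        if value is not None and not isinstance(value, expected):
            errors.append(f"{field} must be {expected.__name__} or null")
    return errors


print(validation_errors({"name": "Widget", "price": "$10"}))
# ['price must be a number']
```

Returning a list of messages instead of a bare True/False makes it easier to log why a record was rejected before retrying.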

The Code

I wrote a Python script using requests, json, and re. For the LLM, I used Ollama with llama3.1:8b running locally, but you can swap in any API that takes a text prompt and returns a completion.

import requests
import json
import re
from typing import Optional, Dict

LLM_URL = "http://localhost:11434/api/generate"  # Ollama endpoint
MODEL = "llama3.1:8b"

def extract_product(html: str) -> Optional[Dict]:
    schema = {
        "name": "string (required)",
        "price": "float (required, in USD)",
        "sku": "string (optional)",
        "description": "string (optional)",
        "in_stock": "boolean (optional)"
    }
    prompt = f"""You are an HTML extraction expert. Given a product card's HTML, return a JSON object with these fields:
{json.dumps(schema, indent=2)}

Return ONLY valid JSON. If a field is missing, use null.

Examples:
HTML: <div><span class="name">Widget</span><span>Price: <b>$10.00</b></span></div>
JSON: {{"name": "Widget", "price": 10.00, "sku": null, "description": null, "in_stock": null}}

HTML: {html}
JSON:"""
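    response = requests.post(
        LLM_URL,
        json={
            "model": MODEL,
            "prompt": prompt,
            "stream": False,
            # Ollama expects sampling parameters under "options";
            # a top-level "temperature" key is not part of the request schema.
            "options": {"temperature": 0.1},
        },
        timeout=120,  # avoid hanging forever if the model stalls
    )
    response.raise_for_status()
    text = response.json()["response"]
    # Keep only the outermost {...} span; this drops markdown fences or commentary around the JSON
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            return None
    return None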

# Test with our HTML
html_sample = """<div id="prod_123">
  <span class="name">Widget Alpha</span>
  <span>Price: <b>$29.99</b></span>
  <p>SKU: WID-001</p>
  <div class="desc">A handy gadget<br>with extra features</div>
  <span>In Stock</span>
</div>"""

result = extract_product(html_sample)
print(result)
# Example output (will vary by model and run):
# {'name': 'Widget Alpha', 'price': 29.99, 'sku': 'WID-001', 'description': 'A handy gadget with extra features', 'in_stock': True}

Retry Logic

If the result is None or fails a quick sanity check (e.g., price is negative), I retry once with temperature=0.3. That’s usually enough to fix formatting issues.
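Here is a rough sketch of that retry loop. It assumes a variant of extract_product that takes the temperature as a second argument (the version above hardcodes it), and the names `looks_sane` and `extract_with_retry` are my own. The stub extractor exists only so the snippet runs without a model.

```python
from typing import Callable, Dict, Optional, Sequence

# Assumed signature: (html, temperature) -> parsed record, or None on failure.
Extractor = Callable[[str, float], Optional[Dict]]


def looks_sane(record: Optional[Dict]) -> bool:
    """Quick sanity check: a dict with a name and a non-negative numeric price."""
    if not isinstance(record, dict):
        return False
    price = record.get("price")
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        return False
    return bool(record.get("name")) and price >= 0


def extract_with_retry(
    html: str,
    extract: Extractor,
    temperatures: Sequence[float] = (0.1, 0.3),
) -> Optional[Dict]:
    """Try each temperature in order and return the first sane record."""
    for temperature in temperatures:
        record = extract(html, temperature)
        if looks_sane(record):
            return record
    return None


def stub_extract(html: str, temperature: float) -> Optional[Dict]:
    # Stand-in for the real model call: fails on the first, colder attempt.
    return None if temperature < 0.3 else {"name": "Widget", "price": 10.0}


print(extract_with_retry("<div>...</div>", stub_extract))
# {'name': 'Widget', 'price': 10.0}
```

Passing the temperatures as a sequence keeps the "retry once" policy in one place, so adding a third attempt is a one-line change.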

Lessons Learned

LLMs are not magic for perfectly structured data – if the HTML is consistent, use a parser. LLMs shine when the structure is unpredictable.
Low temperature is critical – you want output that is as close to deterministic as possible. I started with temp=0.7 and got weird field names; the script above uses 0.1.
Keep the context small – feeding the entire page works but is slower and more expensive. Extract just the product card area.
Schema matters – be explicit about types (float, boolean). LLMs can guess wrong.
Cost trade-off – a local 8B model takes ~3 seconds per product on a decent GPU. If you have thousands of items, a cheaper cloud API (like GPT-3.5-turbo or a purpose-built service) might be faster.
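For the "keep the context small" point, one way to cut a full page down to individual product cards using only the standard library is sketched below. It assumes every card is a <div> whose id starts with "prod_", as in the sample card earlier; `split_cards`, the second card (prod_124, "Widget Beta"), and the surrounding page are made up for illustration.

```python
from html.parser import HTMLParser
from typing import List


class _CardFinder(HTMLParser):
    """Records the source slice of every <div> whose id starts with a prefix."""

    def __init__(self, source: str, id_prefix: str) -> None:
        super().__init__()
        self.source = source
        self.id_prefix = id_prefix
        self.cards: List[str] = []
        self._start = None  # absolute offset of the currently open card, if any
        self._depth = 0     # nesting depth of <div> tags inside that card
        # Absolute offset of each line start, to convert getpos() results.
        self._line_starts = [0]
        for line in source.split("\n"):
            self._line_starts.append(self._line_starts[-1] + len(line) + 1)

    def _abs_pos(self) -> int:
        line, column = self.getpos()  # line is 1-based, column is 0-based
        return self._line_starts[line - 1] + column

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._start is not None:
            self._depth += 1  # a nested <div> inside the open card
        elif (dict(attrs).get("id") or "").startswith(self.id_prefix):
            self._start, self._depth = self._abs_pos(), 1

    def handle_endtag(self, tag):
        if tag != "div" or self._start is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Slice from the card's opening "<" through the ">" of its closing tag.
            end = self.source.index(">", self._abs_pos()) + 1
            self.cards.append(self.source[self._start:end])
            self._start = None


def split_cards(page_html: str, id_prefix: str = "prod_") -> List[str]:
    finder = _CardFinder(page_html, id_prefix)
    finder.feed(page_html)
    finder.close()
    return finder.cards


page = """<body>
<div id="nav"><div>Menu</div></div>
<div id="prod_123">
  <span class="name">Widget Alpha</span>
  <div class="desc">A handy gadget<br>with extra features</div>
</div>
<div id="prod_124"><span class="name">Widget Beta</span></div>
</body>"""

cards = split_cards(page)
print(len(cards))  # 2
```

Each element of `cards` is the untouched source text of one card, so it can be passed straight to the extraction prompt. If the real site marks cards differently, only the id check in handle_starttag needs to change.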

One service I tested that abstracts this exact pattern is InterwestInfo AI. It provides a prompt‑based API with built‑in JSON validation, so you don’t have to write the retry logic yourself. But the technique is the same regardless of the endpoint.

When NOT to Use This

You have clean, well‑structured HTML – use BeautifulSoup, parsel, or lxml.
You need real‑time extraction on every page load – LLM latency is too high.
You’re extracting highly sensitive data – sending HTML to an external API may violate compliance.

What I'd Do Differently Next Time

I’d start with a small local model and measure accuracy on a hand-labelled sample of 100 pages. If it’s above 95%, done. If not, I’d add a few-shot examples for the tricky cases instead of building a rule-based fallback. Also, I’d cache the LLM responses keyed on the card’s HTML, so re-runs and duplicate cards skip the model call entirely.
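The accuracy check could look something like this sketch. It counts a sample as correct only when every field matches a hand-labelled record; that strict definition, the `record_accuracy` name, and the stub extractor are my assumptions, and you would pass the real extraction function instead of the stub.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

FIELDS = ("name", "price", "sku", "description", "in_stock")


def record_accuracy(
    samples: Sequence[Tuple[str, Dict]],
    extract: Callable[[str], Optional[Dict]],
) -> float:
    """Fraction of samples where every field matches the hand-labelled record."""
    if not samples:
        raise ValueError("need at least one labelled sample")
    correct = 0
    for html, expected in samples:
        got = extract(html) or {}  # treat a failed extraction as an empty record
        if all(got.get(field) == expected.get(field) for field in FIELDS):
            correct += 1
    return correct / len(samples)


# Tiny illustration with a stub in place of the model.
labelled = [
    ("<div>A</div>", {"name": "A", "price": 1.0}),
    ("<div>B</div>", {"name": "B", "price": 2.0}),
]
stub_answers = {
    "<div>A</div>": {"name": "A", "price": 1.0},
    "<div>B</div>": {"name": "B", "price": 2.5},  # wrong price
}
accuracy = record_accuracy(labelled, stub_answers.get)
print(f"{accuracy:.0%}")  # 50%
```

With 100 labelled pages, the 95% bar mentioned above means at most five records may differ from the labels in any field.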

This approach saved me from writing fragile parsing code that would have needed constant updates. It’s not perfect, but for messy, real‑world HTML, it’s the most maintainable solution I’ve found.

What’s your go‑to when traditional scraping fails? Do you reach for an LLM or something else?

DEV Community