zhongqiyue

Posted on Jun 10

I Spent a Weekend Fighting HTML Parsing. Here's What Finally Worked

#webdev #python #ai #tutorial

Last month, I needed to extract product specifications from a dozen e-commerce sites for a price comparison project. Simple, right? Just scrape the HTML, grab the <table> or <dl>, and parse it into JSON.

Two days later, I was ready to throw my laptop out the window. Every site had different markup. Some used <div> soup, others hid data in JavaScript objects, and a few served the specs as an image of a table. Regex and BeautifulSoup got me maybe 40% of the way before everything fell apart.

What I Tried That Didn't Work

1. CSS selectors and XPath

I started with soup.select('table.specs tr'). Worked great on site A. Site B used ul.list. Site C had a nested <dl> inside a shadow DOM. I ended up with a 200-line function full of fallback logic that still missed half the fields.

2. Regex on raw HTML

Desperate, I tried re.search(r'RAM.*?(\d+ GB)', html). It caught a few values but broke when the text was split across lines or contained extra whitespace. Plus, maintaining regex patterns for 12 sites was a nightmare.
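To make that failure concrete, here is a small sketch. The sample HTML string is made up for illustration. By default `.` does not match a newline, and the literal space in `\d+ GB` only accepts exactly one space, so the pattern misses as soon as the label and value land on different lines.

```python
import re

# Made-up snippet: label and value sit on separate lines, with two spaces before "GB".
html = "<td>RAM</td>\n<td>16  GB</td>"

# The pattern from above: '.' stops at the newline, so there is no match.
naive = re.search(r"RAM.*?(\d+ GB)", html)

# A more tolerant variant: DOTALL lets '.' cross line breaks and \s* accepts any spacing.
tolerant = re.search(r"RAM.*?(\d+)\s*GB", html, flags=re.DOTALL)

print(naive)              # no match object
print(tolerant.group(1))  # the digits captured before "GB"
```

The tolerant version fixes this one case, but it is still one pattern per field per site, which is the maintenance problem described above.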

3. Hit the API directly

I found one site had a hidden JSON endpoint—but it required authentication tokens I couldn't get. Another site blocked scraping with Cloudflare. I was stuck.

At this point, I had three options: write a custom parser per site (would take weeks), pay for a structured data service (expensive), or find a smarter way to extract what I needed from the messy text.

What Eventually Worked: Let an LLM Parse the Unstructured Text

I realized all those sites, no matter how messy the HTML, rendered the same information: product name, brand, RAM, storage, screen size, etc. The format varied, but the semantics were consistent.

Instead of fighting HTML structure, I started feeding the LLM the raw text content of the page (stripped of tags and scripts) and asking it to return a JSON object with specific fields. This is where structured outputs in OpenAI's API (a close relative of function calling, or tool use) came to the rescue.

The Approach

1. Fetch the page and pull out the body text with readability, or with a simple requests + BeautifulSoup script.
2. Define a JSON schema for the data I wanted.
3. Call an LLM (I used GPT-4o-mini because it's cheap) with the text and the schema, asking it to extract the values.
4. Parse the returned JSON.
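The first step, turning HTML into plain text, can be sketched roughly like this. To keep the example dependency-free it swaps in Python's built-in `html.parser` instead of BeautifulSoup. The class and function names (`_TextExtractor`, `html_to_text`) are made up for illustration. With BeautifulSoup, the equivalent would be removing the script and style tags and calling `soup.get_text()`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>, <style>, and <noscript> content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><script>var x = 1;</script><style>p{}</style></head>"
    "<body><table><tr><td>RAM</td><td>16 GB</td></tr></table></body></html>"
)
print(html_to_text(sample))
```

The HTML itself would come from something like `requests.get(url).text`. Specs that are rendered by JavaScript or served as an image will not show up in this text at all.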

Here's the core code (uses openai and pydantic for schema definition):

from openai import OpenAI
from pydantic import BaseModel
from typing import Optional

# Define what we want to extract (adjust fields as needed)
class ProductSpecs(BaseModel):
    brand: Optional[str] = None
    model: Optional[str] = None
    screen_size: Optional[str] = None
    ram: Optional[str] = None
    storage: Optional[str] = None
    processor: Optional[str] = None
    price: Optional[str] = None

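# With no arguments, the client reads the key from the OPENAI_API_KEY environment variable,
# which keeps the secret out of source control
client = OpenAI()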

def extract_specs(page_text: str) -> ProductSpecs:
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a data extraction assistant. Extract the product specifications from the provided text. Return a JSON object matching the given schema. If a field is not present in the text, set it to null. Do not guess."
            },
            {
                "role": "user",
                "content": page_text
            }
        ],
        response_format=ProductSpecs,
    )
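    # parse() validates the reply against ProductSpecs and exposes it as .parsed
    message = response.choices[0].message
    # .parsed is None when the model refuses, so fail loudly instead of returning None
    if message.parsed is None:
        raise ValueError(f"Extraction failed: {message.refusal}")
    return message.parsed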

# Example usage:
with open("page_text.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

result = extract_specs(raw_text)
print(result.model_dump_json(indent=2))

Note: I used beta.chat.completions.parse, which is OpenAI's structured output feature. If your SDK version predates it, you can get similar results with function calling via the tools parameter (the older functions and function_call parameters are deprecated).

Why This Works

Format agnostic: Doesn't care if data is in a table, list, or narrative paragraph.
Resistant to markup changes: A site redesign usually changes the markup and layout, not the text the page displays.
Handles formatting variations and abbreviations: The LLM can normalize both "8GB DDR4" and "8 GB RAM" to "8 GB".

Trade-offs and Lessons Learned

Accuracy isn't 100%

I tested on 100 pages from 10 different sites. The LLM got about 85% of fields exactly right. Another 10% had slight errors (e.g., a missing unit). The remaining 5% were wrong or hallucinated. This is fine for my use case (price comparison, not inventory), but critical applications would need additional validation.

Cost

Each extraction costs roughly $0.001–$0.005 depending on page length. For 10,000 products, that's $10–$50. Cheaper than a human annotator, but more expensive than regex (which doesn't work anyway).
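For a rough sanity check on numbers like these, a back-of-the-envelope estimate from token counts looks something like the sketch below. Everything in the example call is an assumption for illustration: the token counts are guesses for a typical product page, and the per-million-token prices are placeholders, so check the provider's current pricing before relying on them. `extraction_cost` is a made-up helper name.

```python
def extraction_cost(input_tokens: int, output_tokens: int,
                    usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost in USD of one call, given token counts and per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


# Assumed for illustration only: a ~6,000-token page, a ~200-token JSON reply,
# and placeholder prices of $0.15 (input) and $0.60 (output) per million tokens.
per_page = extraction_cost(6_000, 200, usd_per_m_input=0.15, usd_per_m_output=0.60)
batch = per_page * 10_000

print(f"per page: ${per_page:.5f}, 10,000 pages: ${batch:.2f}")
```

Under those assumptions the per-page figure lands near the low end of the range above. Longer pages push it up roughly in proportion to their input tokens.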

Latency

About 2–5 seconds per page. If you need real-time extraction on every page load, this won't work. For batch jobs, it's fine.
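For batch jobs, most of that time is spent waiting on the network, so running several requests at once helps. A minimal sketch using a thread pool from the standard library follows. `extract_many` is a hypothetical helper, and `extract_fn` stands in for whatever function makes the actual LLM call (the stand-in lambda here just uppercases its input).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


def extract_many(texts: list[str], extract_fn: Callable[[str], T],
                 max_workers: int = 8) -> list[T]:
    """Run extract_fn over all texts concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order and re-raises any worker exception.
        return list(pool.map(extract_fn, texts))


# Stand-in for the real extraction call, just to show the shape of the API.
results = extract_many(["page one", "page two", "page three"], lambda t: t.upper(), max_workers=2)
print(results)
```

Keep `max_workers` modest: the API's rate limits, not your CPU, are the likely bottleneck, and a failed call raises when its result is collected.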

Privacy & Data

You're sending page text to a third-party API. If the data is sensitive (e.g., internal pricing), you'd want a local model (Llama 3, Mistral) with the same technique. The approach is model-agnostic.

When NOT to Use This

If the data is already in a clean, consistent format (like a well-maintained API).
If you need 99.99% accuracy for financial/medical data.
If the text content is too large for the model's context window, at least not without chunking the page intelligently first.
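For the oversized-page case, one simple way to chunk is to split on blank lines under a character budget, extract from each chunk, and keep the first non-null value per field. The sketch below assumes that approach. `chunk_text`, `merge_specs`, and the 12,000-character default are illustrative choices, not anything prescribed by a library.

```python
def chunk_text(text: str, max_chars: int = 12_000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on blank lines where possible."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Hard-split any single paragraph that is longer than the budget on its own.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def merge_specs(partials: list[dict]) -> dict:
    """Combine per-chunk results, keeping the first non-null value seen for each field."""
    merged: dict = {}
    for partial in partials:
        for field, value in partial.items():
            if value is not None and merged.get(field) is None:
                merged[field] = value
    return merged


chunks = chunk_text("aaaa\n\nbbbb\n\ncccc", max_chars=10)
merged = merge_specs([{"brand": "Acme", "ram": None}, {"brand": "Other", "ram": "8 GB"}])
print(chunks, merged)
```

"First non-null wins" is a crude merge rule. If related products further down the page carry their own specs, it would be safer to send only the chunk that contains the main product.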

What I'd Do Differently Next Time

Caching: I'd cache the LLM results keyed by the URL plus a hash of the page text, to avoid re-extracting unchanged pages.
Prompt engineering: I'd experiment more with few-shot examples for unusual fields (like "refresh rate" vs "screen Hz").
Hybrid approach: Use regex for simple, consistent patterns (like prices) and fall back to LLM for complex fields.
Validation: Add a second pass to verify extracted values against the original text. A plain string check already catches a lot (e.g., does the price string actually appear in the text?), and a small model could handle the fuzzier cases.
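Three of these ideas (caching, the regex-first hybrid, and the string-level validation check) are small enough to sketch together. All names here are hypothetical, the price pattern only covers a few currency symbols, and `extract_fn` stands in for the LLM call, returning a plain dict such as `result.model_dump()`.

```python
import hashlib
import json
import re
from pathlib import Path
from typing import Callable, Optional

# Currency symbol, digits with optional thousands separators, optional cents.
PRICE_RE = re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d{2})?")


def regex_price(page_text: str) -> Optional[str]:
    """Hybrid step: first currency-looking string, or None to defer to the LLM."""
    match = PRICE_RE.search(page_text)
    return match.group(0) if match else None


def unsupported_fields(specs: dict, page_text: str) -> list[str]:
    """Validation step: fields whose value does not appear in the text (ignoring case and spacing)."""
    haystack = re.sub(r"\s+", "", page_text).lower()
    missing = []
    for field, value in specs.items():
        if value is None:
            continue
        needle = re.sub(r"\s+", "", str(value)).lower()
        if needle not in haystack:
            missing.append(field)
    return missing


def cached_extract(url: str, page_text: str,
                   extract_fn: Callable[[str], dict], cache_dir: Path) -> dict:
    """Cache keyed on URL plus page text, so a changed page is extracted again."""
    key = hashlib.sha256(f"{url}\n{page_text}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = extract_fn(page_text)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


text = "Acme Laptop with 8GB DDR4 and 512 GB SSD. Now only $1,299.00 with free shipping."
print(regex_price(text))
print(unsupported_fields({"brand": "Acme", "ram": "8 GB", "storage": "1 TB", "price": None}, text))
```

The substring check is deliberately loose: it flags values with no support in the text, but it can also pass a wrong value that happens to appear elsewhere on the page (for example, "8 GB" inside "128 GB"), so treat it as a filter for obvious hallucinations rather than proof of correctness.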

I also discovered a tool called Interwest Info (https://ai.interwestinfo.com/) that does something similar — you give it a URL and a schema, and it returns structured JSON. I haven't used it myself, but it's worth a look if you don't want to run your own pipeline.

Let's Talk

I'm curious how other developers handle messy data extraction. Do you rely on LLMs entirely, or do you have a hybrid pipeline? What's your setup look like?

DEV Community