When CSS Selectors Aren't Enough: Using LLMs for Data Extraction

#tutorial #python #webdev #ai

A few months ago, I took on a side project that sounded simple: scrape product details from a handful of e-commerce sites and build a price comparison tool. I'd done web scraping before—BeautifulSoup, Scrapy, the usual suspects. How hard could it be?

Turns out, really hard. Each site had its own HTML structure. Some used JavaScript rendering. Others changed their class names every week. My carefully crafted CSS selectors broke constantly. I spent more time debugging selectors than actually extracting data.

I tried regex. I tried XPath. I even tried headless browsers with Puppeteer. Nothing stuck. The problem wasn't the tools—it was that the data was buried in unstructured, human-readable pages. I needed a way to understand the meaning of the content, not just its position in the DOM.

The Breaking Point

One site had product prices hidden inside a <span> with no class, nested three levels deep in a table layout. Another used a custom font that rendered prices as images. I was about to give up when I thought: what if I treated the entire page as a document and asked an LLM to extract the fields I needed?

I'd been using GPT-4 for code generation, but never for extraction. The idea felt like overkill—but so was spending three hours per site.

The Approach: Structured Extraction from Unstructured HTML

The core idea is simple: feed the raw HTML (or rendered text) into a language model with a prompt that describes the schema you want. The model returns JSON. No selectors, no regex, no brittle parsing.

Here's a minimal example using Python and OpenAI's API:

import os

from bs4 import BeautifulSoup
from openai import OpenAI

# Read the key from the environment instead of hardcoding it in the script
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def extract_product_info(html_content):
    # Clean the HTML to reduce tokens
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove scripts, styles, and page chrome that carry no product data
    for tag in soup(['script', 'style', 'nav', 'footer']):
        tag.decompose()
    # Truncate to 3,000 characters (characters, not tokens) to cap prompt size
    clean_text = soup.get_text(separator=' ', strip=True)[:3000]

    prompt = f"""
Extract product information from the following web page text.
Return a JSON object with these fields:
- name (string)
- price (number, in USD)
- availability (boolean)
- description (string, max 100 words)

If a field is not found, set it to null.

Page text:
{clean_text}
"""

    # openai>=1.0 client call; the older openai.ChatCompletion.create was removed in 1.0
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a data extraction assistant. Always return valid JSON."},
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # keep the output as repeatable as possible
        max_tokens=500
    )

    # The reply is a JSON string; parse it with json.loads() before using it
    return response.choices[0].message.content

This worked surprisingly well. For a typical product page, I'd get back something like:

{
  "name": "Wireless Bluetooth Headphones",
  "price": 49.99,
  "availability": true,
  "description": "High-quality wireless headphones with noise cancellation and 20-hour battery life."
}

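Even when the price was buried in a paragraph, the LLM often extracted it correctly. Prices that exist only in an image's alt text are a different case: get_text() drops attributes, so for those pages you have to pass the raw HTML (or copy the alt text into the page text) before the model can see them.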
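Since the function above returns a string, the reply still has to be parsed and checked before you can trust it. Here's a minimal sketch of that step using only the standard library. The helper name parse_product_json and the EXPECTED_FIELDS table are my own illustrations, mirroring the four fields in the prompt; the fence-stripping is there on the assumption that the model sometimes wraps its JSON in a Markdown code fence.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker

# Fields requested in the prompt and the Python types each may take.
# None is always allowed, because the prompt says "set it to null".
EXPECTED_FIELDS = {
    "name": (str,),
    "price": (int, float),
    "availability": (bool,),
    "description": (str,),
}


def parse_product_json(raw: str) -> dict:
    """Parse the model's reply and check it against the prompt's schema."""
    text = raw.strip()
    # Drop a surrounding code fence (and optional "json" tag) if present.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    # json.JSONDecodeError is a subclass of ValueError, so callers can
    # catch ValueError for both malformed JSON and schema problems.
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, types in EXPECTED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        if value is None:
            continue
        # bool is a subclass of int, so reject True/False for numeric fields.
        if isinstance(value, bool) and bool not in types:
            raise ValueError(f"wrong type for {field}: {value!r}")
        if not isinstance(value, types):
            raise ValueError(f"wrong type for {field}: {value!r}")
    return data
```

A caller would wrap this in try/except ValueError and treat a failure as "extraction failed" rather than storing whatever came back.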

Real-World Trade-offs

This approach isn't magic. Here's what I learned:

Pros:

Resilient to HTML changes. If the site redesigns its layout, the LLM still understands the content.
Works with JavaScript-rendered pages if you first extract the text (via Puppeteer or Playwright).
Handles multiple languages reasonably well.

Cons:

Cost: GPT-4 cost me about $0.03 per request with the page text capped at 3,000 characters. For a few hundred pages, that's acceptable. For millions, it's not.
Latency: 2-5 seconds per request. Not suitable for real-time scraping.
Hallucination: The model might invent data if the prompt is vague. Always validate the output against a schema, and have a fallback for when validation fails.
Token limits: Long pages need truncation or chunking. You might lose context.
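For the token-limit problem, one option is to split the page text into overlapping chunks, run the extraction on each, and merge the results. The sketch below is one way to do that, not something from my original script: chunk_text and merge_results are hypothetical helpers, the sizes are in characters to match the [:3000] cut in the example above, and "first non-null value wins" is just one possible merge policy.

```python
def chunk_text(text: str, max_chars: int = 3000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars, overlapping by `overlap`."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    chunks = []
    step = max_chars - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


def merge_results(results: list[dict]) -> dict:
    """Merge per-chunk results, keeping the first non-null value per field."""
    merged = {}
    for result in results:
        for key, value in result.items():
            if merged.get(key) is None:
                merged[key] = value
    return merged
```

The overlap is there so that a field straddling a chunk boundary (say, a price split from its currency symbol) still appears whole in at least one chunk. The trade-off is one API call per chunk, so cost and latency scale with page length.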

When NOT to Use This

If you're scraping a single well-structured site, CSS selectors are faster and cheaper.
If you need real-time extraction (e.g., live prices), this is too slow.
If you have millions of pages, the cost adds up quickly.

I now use a hybrid approach: try a simple selector first, and fall back to the LLM when the selector fails. That keeps costs low while maintaining resilience.
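In code, the hybrid flow looks roughly like the sketch below. To keep it self-contained, the two extractors are passed in as plain functions: selector_extract stands in for whatever CSS-selector logic you have for a site, and llm_extract for a wrapper around the LLM call. Both names, the REQUIRED_FIELDS tuple, and the "missing required field counts as a failure" rule are assumptions made for illustration.

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[dict]]

# Fields that must be present (and non-null) for a result to count as usable.
REQUIRED_FIELDS = ("name", "price")


def is_complete(result: Optional[dict], required=REQUIRED_FIELDS) -> bool:
    """True if the result exists and every required field has a value."""
    return result is not None and all(result.get(f) is not None for f in required)


def extract_with_fallback(
    html: str,
    selector_extract: Extractor,
    llm_extract: Extractor,
) -> tuple[Optional[dict], str]:
    """Try the cheap selector path first; fall back to the LLM if it fails.

    Returns (result, source), where source is "selector" or "llm".
    """
    try:
        result = selector_extract(html)
    except Exception:
        # A broken selector (changed markup, missing node) is treated as a miss.
        result = None
    if is_complete(result):
        return result, "selector"
    return llm_extract(html), "llm"
```

Returning the source alongside the result makes it easy to log how often the fallback fires. A rising share of "llm" results for one site is a good hint that its selectors need updating.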

The Tool That Inspired This

While researching, I stumbled across a service called Interwest AI that does exactly this—extracts structured data from web pages using LLMs. I didn't end up using it because I wanted full control, but it confirmed that the approach was viable. Their documentation gave me ideas for prompt engineering and schema design.

What I'd Do Differently

Better prompt engineering: I'd add few-shot examples for tricky fields like dates or ratings.
Caching: Cache extracted results by URL hash to avoid re-processing unchanged pages.
Async: Use asyncio to parallelize requests and reduce total time.
Validation: Use Pydantic to parse the LLM output and catch malformed JSON.
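The caching and async items fit together naturally, so here's a rough sketch of both in one place. Everything in it is illustrative: cache_key and extract_many are hypothetical names, the cache is a plain dict (swap in a file or database for real use), and extract is assumed to be any blocking function that takes page HTML and returns a result, such as a wrapper around the earlier example. One deliberate deviation from "hash the URL": the key hashes the URL together with the page content, so a changed page misses the cache and gets re-processed.

```python
import asyncio
import hashlib


def cache_key(url: str, html: str) -> str:
    """Hash URL and content together so changed pages get a new key."""
    return hashlib.sha256(f"{url}\n{html}".encode("utf-8")).hexdigest()


async def extract_many(pages: dict, extract, cache: dict, max_concurrency: int = 5) -> dict:
    """Run `extract` over {url: html} concurrently, skipping cached pages."""
    # The semaphore caps how many extraction calls run at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url: str, html: str):
        key = cache_key(url, html)
        if key in cache:
            return cache[key]
        async with semaphore:
            # Run the blocking extractor in a thread so the event loop stays free.
            result = await asyncio.to_thread(extract, html)
        cache[key] = result
        return result

    results = await asyncio.gather(*(worker(u, h) for u, h in pages.items()))
    return dict(zip(pages, results))
```

You would call it with asyncio.run(extract_many(pages, extract, cache)). The concurrency cap matters in practice because API rate limits will otherwise turn a burst of parallel requests into a burst of errors.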

Lessons Learned

The best tool depends on the problem. For dynamic, messy data, LLMs are a game-changer (sorry, I said I wouldn't use that word—let's say "very useful").
Don't over-engineer. Start with a simple prompt and iterate.
Always measure cost and latency before scaling.
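Measuring doesn't need much tooling. A small sketch of what I mean, with the caveat that the per-1K-token prices are parameters you have to fill in from the provider's current price list (the numbers in the usage note are placeholders, not quotes), and that estimate_cost and timed are hypothetical helper names:

```python
import time


def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    prompt_price_per_1k: float,
    completion_price_per_1k: float,
) -> float:
    """Estimate the cost of one request from token counts and per-1K prices."""
    return (
        prompt_tokens / 1000 * prompt_price_per_1k
        + completion_tokens / 1000 * completion_price_per_1k
    )


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, estimate_cost(750, 100, 0.03, 0.06) models a 750-token prompt and a 100-token reply at placeholder prices. With the OpenAI client, the real token counts for a request are available on the response as response.usage.prompt_tokens and response.usage.completion_tokens, so you can log actual numbers for a sample of pages and multiply out before committing to a full crawl.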

Your Turn

Have you used LLMs for data extraction? What does your setup look like? I'm curious whether anyone has tried fine-tuning a smaller model for this task. It seems like a cheaper alternative for high-volume scraping.