My web scraping nightmare ended when I let an LLM read the HTML

#webdev #python #ai #scraping

I've been building scrapers for years. Usually it's a simple script: requests.get, BeautifulSoup, a few CSS selectors, and you're done. But last month I hit a wall that made me question everything I knew.

I needed to extract product information from a dozen different e-commerce sites. Each one had a slightly different layout: some used <div class="price">, others had <span itemprop="price">, and one particularly evil site was rendering prices inside a deeply nested React tree with randomized class names. My selector-based approach turned into a fragile mess of fallback patterns, try/except blocks, and eventually a 300-line function that still failed on every fourth page.

I tried headless browsers. That worked — until I needed to scale. Running 20 browsers concurrently ate memory, and each new site required tweaking wait conditions and timeouts. I spent more time maintaining scrapers than using the data.

Then I had a stupid idea: what if I just fed the raw HTML to an AI and asked it to find the data? I wasn't looking for a "no-code" solution; I wanted a technique that could handle structural chaos without me writing custom rules for every domain.

The approach: LLM-powered extraction

The core idea is simple: instead of writing selectors, write a prompt that describes what you want. Give the LLM the HTML and ask it to extract structured fields. The model doesn't care whether the price is in a <span> or a <div> — it understands semantics.

Here's the basic implementation using Python and OpenAI (though you could use any LLM API, including local models via Ollama):

import json

import requests
from bs4 import BeautifulSoup
from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def extract_product_data(url):
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, "html.parser")
    # Limit HTML to avoid huge prompts and extra cost.
    # This is a crude cut: on long pages it can drop the product block entirely.
    html_snippet = str(soup)[:8000]  # Truncate after first 8000 chars

    prompt = f"""
    You are a data extraction assistant. From the HTML below, extract the following fields:
    - product_name (string)
    - price (string, including currency symbol if present)
    - availability (string like 'in stock' or 'out of stock')
    - description (string, max 100 words)

    Return ONLY a JSON object with these keys. If a field is not found, set it to null.

    HTML:
    {html_snippet}
    """

    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # cheaper model works fine
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,
        response_format={"type": "json_object"},  # ask the API for a JSON object
    )
    # Parse so callers get a dict; raises json.JSONDecodeError on malformed output
    return json.loads(completion.choices[0].message.content)

product = extract_product_data("https://example.com/product/123")
print(product)
# Example output: {'product_name': 'Widget 3000', 'price': '$49.99', ...}

I ran this on a handful of pages from different sites. It worked. Not perfectly — sometimes the description was too long, or the model hallucinated a price when none was visible. But the success rate was around 80% without any site-specific tweaks.

Making it production-ready

The naive snippet above has problems: cost, latency, and token limits. Here's what I improved:

Caching: Store results by URL hash so you don't re-extract on every run.
HTML preprocessing: Remove <script>, <style>, and <svg> tags. Strip excessive whitespace. This cuts tokens by 30-50%.
Retry & validation: Check that the output is valid JSON, and if it's missing expected fields, retry with a more specific prompt.
Batch processing: Send the HTML of multiple pages in one call if the total fits in context (useful for listing pages).
Fallback chain: Try the cheap model (GPT-4o-mini) first, and if confidence is low (for example, required fields come back null), re-run with a stronger model.
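
A minimal sketch of the HTML preprocessing step from the list above. It uses only the standard library's html.parser so it runs without extra installs; with BeautifulSoup you could get the same effect by calling decompose() on the unwanted tags. The names clean_html and SKIP_TAGS are made up for this example, and the sketch also drops HTML comments, which the list above doesn't mention.

```python
import re
from html.parser import HTMLParser

# Tags whose whole subtree is dropped (hypothetical name, matches the list above)
SKIP_TAGS = {"script", "style", "svg"}


class _Cleaner(HTMLParser):
    """Re-emits the document while skipping everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, no separate end tag
        if self.skip_depth == 0 and tag not in SKIP_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def clean_html(html):
    """Strip script/style/svg subtrees and comments, then collapse whitespace."""
    cleaner = _Cleaner()
    cleaner.feed(html)
    cleaner.close()
    return re.sub(r"\s+", " ", "".join(cleaner.parts)).strip()
```

Attributes like itemprop and class are kept on purpose, since they often carry the hints the model needs. Character references are decoded along the way, so the output is meant as prompt input, not as HTML to re-render.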

import json
import hashlib

def extract_batch(urls):
    html_batch = []
    for url in urls:
        # fetch_and_clean (not shown) downloads the page and applies the preprocessing step
        html = fetch_and_clean(url)[:2000]  # shorter per URL
        html_batch.append(f'<page url="{url}">\n{html}\n</page>')
    combined = "\n---\n".join(html_batch)

    prompt = f"""
    Extract product data from each page below. Return a JSON array of objects, each with fields:
    url, product_name, price, availability.

    {combined}
    """
    # ... call LLM

I also built a simple result cache keyed by URL: if I query the same URL again, I serve the cached result instead of paying for another call. This is critical for development.
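
A URL-keyed cache like that can be sketched in a few lines. This is one possible shape, not the exact tool described here: cached_extract, the extract_fn parameter, and the scrape_cache directory are names invented for the example, and it assumes the extractor returns something JSON-serializable.

```python
import hashlib
import json
from pathlib import Path


def cached_extract(url, extract_fn, cache_dir="scrape_cache"):
    """Return the cached result for url, calling extract_fn(url) only on a miss."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)

    # Hash the URL so it is safe to use as a file name
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    entry = cache_path / f"{key}.json"

    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))

    result = extract_fn(url)
    entry.write_text(json.dumps(result), encoding="utf-8")
    return result
```

There is no expiry here, so a page whose price changes will keep returning the old value until the cache file is deleted. For development that is usually what you want; for production you would likely add a timestamp check.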

When this approach shines

Variable HTML structures: E-commerce sites that change layout often (Black Friday redesigns? No problem.)
Small-scale scraping: 100–1000 pages per day. GPT-4o-mini is priced at $0.15 per million input tokens, so a page trimmed to a few thousand tokens costs well under $0.01, but at scale those fractions of a cent add up.
Rapid prototyping: You can go from zero to working extractor in 10 minutes.
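
The per-page cost depends mostly on how much HTML you send, so it is worth running the numbers for your own pages. The sketch below is a back-of-the-envelope estimate: the $0.15 per million input tokens comes from the figure above, while the $0.60 per million output tokens, the 200 output tokens per response, and the 4-characters-per-token ratio are assumptions (HTML often tokenizes less efficiently than plain English). Check current pricing before relying on it.

```python
def estimate_page_cost(
    html_chars,
    output_tokens=200,        # assumed size of the JSON answer
    input_price_per_m=0.15,   # USD per million input tokens (figure from the article)
    output_price_per_m=0.60,  # USD per million output tokens (assumption)
    chars_per_token=4,        # rough rule of thumb, not a tokenizer
):
    """Rough USD cost of one extraction call."""
    input_tokens = html_chars / chars_per_token
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


for chars in (8_000, 50_000):
    cost = estimate_page_cost(chars)
    print(f"{chars:>6} chars of HTML: ~${cost:.5f} per page, ~${cost * 1000:.2f} per 1000 pages")
```

Retries and fallback calls to a stronger model are not included, and those can easily dominate the bill.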

When NOT to use it

Fixed, well-documented APIs or structured HTML: If the site has consistent selectors, a hand-crafted CSS path is faster and cheaper.
Huge volume: Scraping millions of pages? LLM cost will dwarf compute. Stick with traditional methods.
Real-time requirements: Each call takes 1–3 seconds. If you need sub-second extraction, use regexes and heuristics instead.
Sensitive data: Sending full HTML to a third-party API may violate terms or privacy policies. Consider local models (Llama 3, Mistral) via Ollama.

Trade-offs and lessons learned

Prompt engineering matters more than model size. I spent hours tuning prompts and got better results than switching to GPT-4.
Token limits are your biggest enemy. A single product page can easily be 50 KB of HTML. Truncate intelligently: keep the <body> and drop everything above the first <h1>, rather than cutting at a fixed character count.
Hallucinations are real. I had the model invent a "discount" field that didn't exist. Always validate against a known ground truth.
Cost vs. accuracy trade-off: On a test set of 100 pages, GPT-4o-mini achieved 82% accuracy (exact field match). GPT-4 achieved 93%. For my use case, the cheaper model was fine, but your mileage may vary.
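
One cheap guard against hallucinated values is to check the model's answer against the page itself. The sketch below assumes the four fields from the prompt earlier in the post; validate_extraction and EXPECTED_FIELDS are hypothetical names. It checks that the output parses, that the keys match, and that the literal fields (name and price) actually appear in the HTML. It doesn't replace a labelled test set, and it can raise false alarms when the page uses HTML entities or unusual whitespace.

```python
import json

EXPECTED_FIELDS = ("product_name", "price", "availability", "description")


def validate_extraction(raw, html):
    """Return (data, problems); an empty problems list means the result passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["expected a JSON object"]

    problems = []
    missing = [f for f in EXPECTED_FIELDS if f not in data]
    if missing:
        problems.append(f"missing keys: {missing}")
    extra = sorted(set(data) - set(EXPECTED_FIELDS))
    if extra:
        # Catches invented fields such as "discount"
        problems.append(f"unexpected keys: {extra}")

    # Grounding check: these values should be copied verbatim from the page.
    # availability and description are normalized or summarized, so skip them.
    for field in ("product_name", "price"):
        value = data.get(field)
        if isinstance(value, str) and value not in html:
            problems.append(f"{field} {value!r} not found in HTML")
    return data, problems
```

A non-empty problems list is a natural trigger for a retry with a more specific prompt, or for the stronger model.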

I ended up building a small internal tool that combines both approaches: fast CSS selectors for known patterns, and falls back to the LLM when selectors return null. That hybrid gave me the best of both worlds.
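
The control flow of such a hybrid is simple enough to sketch. This is an illustration, not the internal tool itself: extract_hybrid and its parameters are invented names, and both extractors are passed in as plain functions so you can plug in BeautifulSoup selectors and whatever LLM call you use.

```python
def extract_hybrid(html, selector_extract, llm_extract, required=("product_name", "price")):
    """Try selectors first; call the LLM only if a required field is missing.

    Both extractors take the HTML string and return a dict (or None).
    Returns (data, source) where source is "selectors" or "llm".
    """
    data = selector_extract(html) or {}
    if all(data.get(field) is not None for field in required):
        return data, "selectors"

    llm_data = llm_extract(html) or {}
    # Keep whatever the selectors did find and fill only the gaps from the LLM
    found = {key: value for key, value in data.items() if value is not None}
    return {**llm_data, **found}, "llm"
```

Returning the source alongside the data makes it easy to log how often the fallback fires, which tells you when a site's selectors have gone stale.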

There are also services that abstract this further. I tried one at https://ai.interwestinfo.com/ that handles the caching and model selection — but honestly, the DIY approach taught me more about prompt design and cost management. Your choice depends on whether you want to own the stack or outsource the complexity.

Where to go from here

If you're interested in this technique, try it on your own pain points — scraping job postings, news headlines, or even PDF text extraction. The principle is the same: instead of writing rules, you describe the data and let the model figure out the structure.

Start with a small batch, measure accuracy, and only scale up once you're happy with the prompt. And always keep a backup scraper — I learned that the hard way when an API rate limit kicked in.

What's your setup look like? Have you tried using LLMs to replace fragile selectors, or do you stick with traditional scraping? I'd love to hear your war stories.