Why I ditched regex scrapers for an LLM parser (and when you shouldn't)

#webdev #python #ai #scraping

Last month I needed to scrape product details from 30 different e-commerce sites. Each site used its own HTML structure, class names changed weekly, and some were just plain inconsistent. I had two options: write a mountain of brittle CSS selectors or try something I’d been avoiding—letting an LLM figure out the extraction.

Here’s what I learned the hard way, including the code that actually worked and the cases where I should have just stuck with BeautifulSoup.

The problem that broke my scraper

I was building a price comparison tool for niche outdoor gear. The data I needed was simple: product name, price, availability, and a few specs. But the sources ranged from massive marketplaces to small family-run shops. Every time a site pushed a new template, my carefully built selectors and regex fallbacks broke. I spent more time maintaining scrapers than actually using the data.

A typical selector for a price field looked like this:

import re
import requests
from bs4 import BeautifulSoup

# Set a timeout so one slow site can't hang the whole run
response = requests.get('https://example.com/product/123', timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# This selector changed three times in two weeks
price_element = soup.select_one('span.price--current > span.value')
if not price_element:
    # Fallback: any div whose class contains "price"
    price_element = soup.find('div', class_=re.compile(r'price'))

I was debugging selectors more than I was analyzing prices. Something had to change.

What I tried that didn’t work

First I tried using XPath with fuzzy matching. That helped a little, but still required per-site rules. Then I reached for machine learning—training a small model on HTML structure. Overkill for a side project, and I didn’t have labeled data for each site.

I looked at commercial scraping services, but they were either too expensive or required sending my data through their pipelines, which felt like over-sharing for a small personal tool.

Then I heard about people using LLMs to parse unstructured data directly from raw HTML or even just the visible text. I was skeptical—LLMs are slow, expensive, and hallucinate. But the pain was real, so I gave it a shot.

The approach that eventually worked

Instead of writing selectors per site, I started sending the raw HTML (or a trimmed version) to an LLM with a simple instruction: “Extract the product name, price, and availability status. Return JSON.”

Here’s the core function I ended up with:

import json
from openai import OpenAI
import requests

client = OpenAI()

def extract_product_data(html_snippet: str) -> dict:
    prompt = f"""You are a data extraction assistant. From the following HTML, extract:
- product_name (string)
- price (string, include currency symbol if present)
- in_stock (boolean)

Return only valid JSON with no extra text.

HTML: {html_snippet[:4000]}"""  # Capped at 4000 characters (not tokens) to limit cost

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper and fast enough
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)

To use it, I just fetch the page and pass a cleaned snippet (removing scripts, styles, and navigation elements to keep token count low).

import re

def clean_html(raw_html: str) -> str:
    # Drop <script>, <style>, and <nav> blocks: they cost tokens but hold no product data
    flags = re.DOTALL | re.IGNORECASE
    cleaned = raw_html
    for tag in ('script', 'style', 'nav'):
        cleaned = re.sub(rf'<{tag}\b[^>]*>.*?</{tag}>', '', cleaned, flags=flags)
    # Keep the first 5000 characters (extract_product_data trims again to 4000)
    return cleaned[:5000]

Then I called:

raw = requests.get('https://example.com/product/123', timeout=10).text
snippet = clean_html(raw)
data = extract_product_data(snippet)
print(data)
# {'product_name': 'Trail Pro Jacket', 'price': '$89.99', 'in_stock': True}

It worked surprisingly well—on maybe 80% of the pages. The LLM could find the price even when it was buried in a table or formatted with weird spans. No regex, no per-site logic.

One of the services I evaluated for this approach was Interwest AI, which offers a similar extraction API. I ended up rolling my own with OpenAI because I wanted full control, but the technique is the same.

Lessons learned and trade-offs

Speed: Each extraction takes 1-3 seconds. That’s fine for a hundred products, but not for millions. Caching helps.
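A minimal sketch of what that caching could look like, assuming results are keyed by a hash of the cleaned snippet and stored as JSON files. The `cached_extract` helper and the cache directory are hypothetical names, not part of my original script; `extract` stands in for any function with the same shape as `extract_product_data`.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_extract(snippet: str, extract: Callable[[str], dict], cache_dir: Path) -> dict:
    """Return a stored result for this snippet, calling `extract` only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Identical snippets hash to the same key, so unchanged pages cost nothing
    key = hashlib.sha256(snippet.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = extract(snippet)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Usage would look like `cached_extract(snippet, extract_product_data, Path(".llm_cache"))`. Keying on content rather than URL means a page is re-extracted only when its cleaned HTML actually changes, though any rotating token or timestamp in the markup would defeat the cache.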

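Cost: GPT-4o-mini is cheap (about $0.15 per million input tokens). A page that comes to 4K input tokens costs roughly $0.0006 in input, so $0.001 per extraction is a safe upper estimate once output tokens are included. For my 30 sites with 50 products each (1,500 pages), that’s about $1.50 at most, which is acceptable for a hobby project.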
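The arithmetic behind that estimate, as a quick back-of-the-envelope script. It counts input tokens only and assumes every page really uses the full 4K tokens, so treat it as a rough bound rather than a bill; the constant names are mine.

```python
INPUT_PRICE_PER_MILLION = 0.15  # USD per million input tokens, the rate quoted above
TOKENS_PER_PAGE = 4_000         # assumed input size per extraction
PAGES = 30 * 50                 # 30 sites x 50 products


def input_cost(tokens: int, price_per_million: float = INPUT_PRICE_PER_MILLION) -> float:
    """Input-token cost in USD; output tokens are ignored here."""
    return tokens * price_per_million / 1_000_000


per_page = input_cost(TOKENS_PER_PAGE)
total = per_page * PAGES
print(f"per page: ${per_page:.4f}, total: ${total:.2f}")
# per page: $0.0006, total: $0.90
```

Input tokens alone come to about $0.0006 per page, so rounding up to $0.001 per extraction leaves headroom for output tokens and retries.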

Accuracy: The LLM sometimes missed the price when it was inside a JavaScript-rendered component (like a React app), because requests only returns the initial HTML and the price isn’t in it yet. For those, I had to fall back to browser automation or use an API like ScrapingBee. The LLM can also hallucinate: it once returned a price that looked plausible but was actually the shipping cost. I added a validation step that checks that the price contains a currency symbol and a numeric value, which catches malformed output but not a real number pulled from the wrong field.
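A sketch of what such a validation step might look like. The set of currency symbols and the exact pattern are assumptions for illustration, not the check from my script, and a format check like this would not have caught the shipping-cost mix-up, since that value was a perfectly well-formed price.

```python
import re

# Assumed format: a currency symbol before or after a number, e.g. "$89.99" or "89,99 €"
PRICE_PATTERN = re.compile(r"^\s*(?:[$€£¥]\s*\d[\d.,]*|\d[\d.,]*\s*[$€£¥])\s*$")


def is_valid_price(price: object) -> bool:
    """True if `price` is a string containing a currency symbol and a numeric value."""
    return isinstance(price, str) and PRICE_PATTERN.match(price) is not None
```

With this in place, `is_valid_price(data.get("price"))` can gate whether a result is stored or sent back for a retry.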

When NOT to use this approach:

If you need real-time or high-volume extraction (millions of pages per day), use traditional scraping.
If the data is already structured (like JSON-LD embedded in the page), parse that instead.
If you can’t afford the occasional hallucination (e.g., for financial data), don’t rely solely on an LLM.
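For the JSON-LD case in the list above, here is a rough sketch using only the standard library. It assumes the page embeds a schema.org `Product` object with a single `offers` entry; real pages often wrap things in `@graph` or use a list for `@type`, which this does not handle. `JsonLdCollector` and `find_product` are hypothetical names.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class JsonLdCollector(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of failing the whole page


def find_product(html: str) -> Optional[dict]:
    """Return name, price, currency, and stock status from the first Product block."""
    parser = JsonLdCollector()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Product":
                offer = item.get("offers") or {}
                if isinstance(offer, list):
                    offer = offer[0] if offer else {}
                if not isinstance(offer, dict):
                    offer = {}
                availability = str(offer.get("availability", ""))
                return {
                    "product_name": item.get("name"),
                    "price": offer.get("price"),
                    "currency": offer.get("priceCurrency"),
                    "in_stock": availability.endswith("InStock"),
                }
    return None
```

Calling `find_product(raw)` before anything else costs nothing, and a `None` result is the signal to move on to the LLM.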

What I’d do differently next time

I’d combine both worlds: use an LLM as a fallback for sites that change often, but keep a simple CSS selector cache for stable pages. I’d also try fine-tuning a smaller model (like a Llama variant) for cheaper on-premise extraction, especially if I needed to process thousands of pages.
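One way the hybrid could be wired up, as a sketch rather than something I have run in production. Both extractors are passed in as plain functions: `selector_extract` is a hypothetical per-site CSS selector routine, and `llm_extract` is anything shaped like `extract_product_data`. The `source` key is an addition of mine so you can see how often the fallback fires.

```python
from typing import Callable, Optional, Sequence

Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(
    html: str,
    selector_extract: Extractor,
    llm_extract: Extractor,
    required: Sequence[str] = ("product_name", "price"),
) -> dict:
    """Try the cheap selector path first; call the LLM only if it fails."""
    try:
        data = selector_extract(html)
    except Exception:
        data = None  # A broken selector is treated the same as a miss
    # Accept the selector result only if every required field is non-empty
    if data and all(data.get(field) for field in required):
        return {**data, "source": "selector"}
    return {**(llm_extract(html) or {}), "source": "llm"}
```

Logging the `source` value per site would show which selectors have gone stale, which is exactly the maintenance signal I was missing before.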

Another improvement: instead of sending raw HTML, I could extract only visible text blocks using a library like trafilatura or readability-lxml. That reduces tokens and improves accuracy because the LLM doesn’t get distracted by markup noise.
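To make the idea concrete without pulling in a dependency, here is a standard-library stand-in. It is not trafilatura or readability-lxml and does none of their boilerplate detection; it only drops tags and skips a hand-picked set of non-visible elements, which is an assumption on my part.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping elements that never render as page content."""

    SKIP = {"script", "style", "nav", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the page's text blocks, one per line, with markup removed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Passing `visible_text(raw)` to the extraction prompt instead of the cleaned HTML means a character cap covers far more of the actual page content, since none of the budget goes to tags and attributes.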

Your turn

LLM-powered scraping isn't a silver bullet, but for messy, semi-structured data, it saved me weekends of frustration. Have you tried letting an AI parse your scraped pages? What worked—or didn’t—for you?