DEV Community

zhongqiyue


Why I ditched regex scrapers for an LLM parser (and when you shouldn't)

Last month I needed to scrape product details from 30 different e-commerce sites. Each site used its own HTML structure, class names changed weekly, and some were just plain inconsistent. I had two options: write a mountain of brittle CSS selectors or try something I’d been avoiding—letting an LLM figure out the extraction.

Here’s what I learned the hard way, including the code that actually worked and the cases where I should have just stuck with BeautifulSoup.

The problem that broke my scraper

I was building a price comparison tool for niche outdoor gear. The data I needed was simple: product name, price, availability, and a few specs. But the sources ranged from massive marketplaces to small family-run shops. Every time a site pushed a new template, my carefully built regex broke. I spent more time maintaining scrapers than actually using the data.

A typical selector for a price field looked like this:

import re
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com/product/123')
soup = BeautifulSoup(response.text, 'html.parser')

# This selector changed three times in two weeks
price_element = soup.select_one('span.price--current > span.value')
if not price_element:
    price_element = soup.find('div', class_=re.compile(r'price.*'))

I was debugging selectors more than I was analyzing prices. Something had to change.

What I tried that didn’t work

First I tried using XPath with fuzzy matching. That helped a little, but still required per-site rules. Then I reached for machine learning—training a small model on HTML structure. Overkill for a side project, and I didn’t have labeled data for each site.

I looked at commercial scraping services, but they were either too expensive or required sending my data through their pipelines, which felt like over-sharing for a small personal tool.

Then I heard about people using LLMs to parse unstructured data directly from raw HTML or even just the visible text. I was skeptical—LLMs are slow, expensive, and hallucinate. But the pain was real, so I gave it a shot.

The approach that eventually worked

Instead of writing selectors per site, I started sending the raw HTML (or a trimmed version) to an LLM with a simple instruction: “Extract the product name, price, and availability status. Return JSON.”

Here’s the core function I ended up with:

import json
from openai import OpenAI
import requests

client = OpenAI()

def extract_product_data(html_snippet: str) -> dict:
    prompt = f"""You are a data extraction assistant. From the following HTML, extract:
- product_name (string)
- price (string, include currency symbol if present)
- in_stock (boolean)

Return only valid JSON with no extra text.

HTML: {html_snippet[:4000]}"""  # Cap at 4,000 characters (not tokens) to limit input size

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper and fast enough
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)

To use it, I just fetch the page and pass a cleaned snippet (removing scripts, styles, and navigation elements to keep token count low).

import re

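def clean_html(raw_html: str) -> str:
    # Remove script, style, and nav blocks; tags may be upper- or lowercase
    flags = re.DOTALL | re.IGNORECASE
    cleaned = re.sub(r'<script\b[^>]*>.*?</script>', '', raw_html, flags=flags)
    cleaned = re.sub(r'<style\b[^>]*>.*?</style>', '', cleaned, flags=flags)
    cleaned = re.sub(r'<nav\b[^>]*>.*?</nav>', '', cleaned, flags=flags)
    # Keep the first 5000 chars; extract_product_data trims again to 4000
    return cleaned[:5000]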

Then I called:

raw = requests.get('https://example.com/product/123').text
snippet = clean_html(raw)
data = extract_product_data(snippet)
print(data)
# {'product_name': 'Trail Pro Jacket', 'price': '$89.99', 'in_stock': True}

It worked surprisingly well—on maybe 80% of the pages. The LLM could find the price even when it was buried in a table or formatted with weird spans. No regex, no per-site logic.

One of the services I evaluated for this approach was Interwest AI, which offers a similar extraction API. I ended up rolling my own with OpenAI because I wanted full control, but the technique is the same.

Lessons learned and trade-offs

Speed: Each extraction takes 1-3 seconds. That’s fine for a hundred products, but not for millions. Caching results for pages that haven’t changed helps.
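Caching can be as simple as keying results on a hash of the cleaned snippet, so an unchanged page never triggers a second API call. Here is a minimal sketch, assuming an in-memory dict is enough for a small project. `cached_extract` and its `extract` parameter are illustrative names, with `extract` standing in for a function like `extract_product_data`.

```python
import hashlib
from typing import Callable

# In-memory cache: SHA-256 of the snippet -> extracted dict.
# Hypothetical helper; swap the dict for a file or database to persist it.
_cache: dict[str, dict] = {}


def cached_extract(snippet: str, extract: Callable[[str], dict]) -> dict:
    """Return the cached result for this snippet, calling extract only on a miss."""
    key = hashlib.sha256(snippet.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = extract(snippet)  # only pay for snippets not seen before
    return _cache[key]
```

Hashing the cleaned snippet rather than the URL means a page is re-extracted only when its content actually changes.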

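Cost: GPT-4o-mini is cheap (~$0.15 per million input tokens). A 4K-token page works out to about $0.0006 of input (4,000 × $0.15 / 1,000,000), so call it $0.001 per extraction as a round upper figure. For my 30 sites with 50 products each, that’s 1,500 extractions, or about $1.50 total, which is acceptable for a hobby project.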

Accuracy: The LLM sometimes missed the price when it was rendered client-side by JavaScript (like a React app), because the HTML that requests returns doesn’t contain it yet. For those, I had to fall back to browser automation or use an API like ScrapingBee. Also, the LLM can hallucinate: it once returned a price that looked plausible but was actually the shipping cost. I added a validation step that checks if the price contains a currency symbol and numeric value.
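A validation step along those lines might look like the sketch below. The set of currency symbols and ISO codes is an assumption chosen for illustration, and `looks_like_price` is a hypothetical name, not the check from the original project.

```python
import re

# Assumed currency markers: a few common symbols and ISO codes.
_CURRENCY_RE = re.compile(r"[$€£¥]|\b(?:USD|EUR|GBP)\b")
# A run of digits, optionally followed by a decimal part.
_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")


def looks_like_price(value) -> bool:
    """True if value is a string containing both a currency marker and a number."""
    if not isinstance(value, str):
        return False
    return bool(_CURRENCY_RE.search(value) and _NUMBER_RE.search(value))
```

A check like this catches empty or garbled values. It cannot catch a well-formed but wrong number such as a shipping cost, so comparing against a previously seen price for the same product is a reasonable extra guard.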

When NOT to use this approach:

  • If you need real-time or high-volume extraction (millions of pages per day), use traditional scraping.
  • If the data is already structured (like JSON-LD embedded in the page), parse that instead.
  • If you can’t afford the occasional hallucination (e.g., for financial data), don’t rely solely on an LLM.
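For the structured-data case in the list above, checking for JSON-LD before calling an LLM can be done with the standard library alone. This is a sketch, assuming the page embeds a schema.org `Product` object directly or in a top-level list. It does not handle `@graph` wrappers or list-valued `@type`, and `find_jsonld_products` is an illustrative name.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_jsonld_products(html: str) -> list[dict]:
    """Return every JSON-LD object on the page whose @type is "Product"."""
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()
    products = []
    for raw in collector.blocks:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than fail the whole page
        # A block may hold a single object or a list of objects
        items = parsed if isinstance(parsed, list) else [parsed]
        products += [i for i in items if isinstance(i, dict) and i.get("@type") == "Product"]
    return products
```

If this returns a non-empty list, the name and price can be read from it directly and the LLM call skipped for that page.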

What I’d do differently next time

I’d combine both worlds: use an LLM as a fallback for sites that change often, but keep a simple CSS selector cache for stable pages. I’d also try fine-tuning a smaller model (like a Llama variant) for cheaper on-premise extraction, especially if I needed to process thousands of pages.
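The selector-first, LLM-fallback idea could be wired up roughly as follows. This is a sketch under assumptions: `extract_with_fallback` is a hypothetical name, `selector_extract` stands for whatever per-site selector logic is cached, and `llm_extract` stands for a function like `extract_product_data`. Both are passed in so the routing stays independent of any one library.

```python
from typing import Callable, Optional

# Fields the extraction is expected to produce (taken from the prompt in this post)
REQUIRED_FIELDS = ("product_name", "price", "in_stock")


def extract_with_fallback(
    html: str,
    selector_extract: Callable[[str], Optional[dict]],
    llm_extract: Callable[[str], dict],
) -> tuple[dict, str]:
    """Try the cheap selector path first; call the LLM only when it comes up short.

    Returns the extracted data plus a label saying which path produced it.
    """
    data = selector_extract(html)
    if data and all(data.get(field) is not None for field in REQUIRED_FIELDS):
        return data, "selector"
    return llm_extract(html), "llm"
```

Logging the returned label per site shows which selectors have gone stale, which is a cheap signal for when a cached selector needs replacing.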

Another improvement: instead of sending raw HTML, I could extract only visible text blocks using a library like trafilatura or readability-lxml. That reduces tokens and improves accuracy because the LLM doesn’t get distracted by markup noise.
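To show the idea without extra dependencies, here is a standard-library stand-in for visible-text extraction. It is not trafilatura or readability-lxml and does none of their boilerplate detection. It only drops markup and the contents of non-visible tags, and `visible_text` is an illustrative name.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, ignoring anything inside non-visible tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the page's text nodes, one per line, with markup removed."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Passing the result of `visible_text(raw)` to the prompt instead of the raw HTML keeps the same 4,000-character budget but spends it on content rather than tags and attributes.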

Your turn

LLM-powered scraping isn't a silver bullet, but for messy, semi-structured data, it saved me weekends of frustration. Have you tried letting an AI parse your scraped pages? What worked—or didn’t—for you?
