DEV Community

zhongqiyue


Regex broke my scraper: Using LLMs for robust data extraction

I've been building scrapers for years. I know the drill: find the CSS selector, write a regex, test it, deploy, and hope the website doesn't change its markup next week. But last month, I hit a wall. I was tasked with extracting product prices and availability from over 200 supplier websites. Each site had its own layout, some rendered with JavaScript, others were plain HTML. My initial approach was the same old combo of BeautifulSoup, XPath, and regex. It worked... for about a week.

Then one of the biggest suppliers rolled out a redesign. My carefully crafted selector .price--current vanished. The price was now inside a nested <span> with a dynamic class name like _3xj0a _2v9j3. Regex? Forget it. I spent two days patching it, only to have another site change the next week. I was fighting a losing battle.

What I tried (and why it didn't work)

First, I tried more sophisticated CSS selectors using :contains() and nth-of-type. That worked until the site added a banner ad before the price. I tried matching patterns like \$\d+\.\d{2}, but some prices were in EUR, some had discounts, and some were hidden in JavaScript-rendered content. I even used a headless browser (Playwright) to wait for elements, but that slowed things down and still broke when the DOM structure shifted.
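To make the regex problem concrete, here is a small sketch that runs the dollar pattern from the paragraph above against a few price strings. The sample strings are made up for illustration; they are not taken from the supplier sites.

```python
import re

# The US-style pattern mentioned above: a dollar sign, digits, a dot, two decimals.
USD_PATTERN = re.compile(r"\$\d+\.\d{2}")

# Hypothetical price strings of the kind different suppliers might display.
samples = ["$19.99", "€39,99", "39.99 EUR", "Was $25.00, now $19.99"]


def find_prices(text):
    """Return every substring of `text` that matches the dollar pattern."""
    return USD_PATTERN.findall(text)


for sample in samples:
    print(f"{sample!r:28} -> {find_prices(sample)}")
```

The euro strings produce no match at all, and the discounted listing produces two matches with nothing to say which one is the current price.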

I looked into visual testing tools and machine learning approaches like object detection on screenshots — too heavy and unreliable for text extraction.

The shift: Letting an LLM do the heavy lifting

A colleague mentioned they used GPT to parse unstructured data from PDFs. I thought: why not try it on raw HTML? The idea was simple: instead of guessing selectors, send the relevant HTML snippet (or even the text content) to an LLM with a clear instruction to extract structured fields.

I knew that sending entire pages would be expensive and slow, so I first stripped down the HTML using BeautifulSoup: remove scripts, styles, and navigation, then flatten the body into clean text while preserving some structural markers (like h1, table). Then I'd prompt the LLM to return a JSON object with fields like name, price, availability, sku.
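The "structural markers" idea can be sketched roughly like this. It is only an illustration under my own assumptions: the function name html_to_marked_text, the [H1] prefix, and the " | " cell separator are made up, and a real pipeline may mark structure differently.

```python
from bs4 import BeautifulSoup

NOISE_TAGS = ["script", "style", "nav", "footer", "header", "aside"]


def html_to_marked_text(html, max_chars=3000):
    """Flatten HTML to text, keeping light hints for headings and table rows."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(NOISE_TAGS):
        tag.decompose()
    # Prefix headings so the model can still tell a title from body text.
    for level in ("h1", "h2", "h3"):
        for heading in soup.find_all(level):
            label = heading.get_text(" ", strip=True)
            heading.string = f"[{level.upper()}] {label}"
    # Collapse each table row to one line so label/value pairs stay together.
    for row in soup.find_all("tr"):
        cells = [c.get_text(" ", strip=True) for c in row.find_all(["th", "td"])]
        row.string = " | ".join(cells)
    root = soup.body or soup  # pages without <body> fall back to the whole document
    text = root.get_text(separator="\n", strip=True)
    return text[:max_chars]


sample_html = (
    "<html><body><nav>Home</nav><h1>Blue Widget</h1>"
    "<table><tr><th>Price</th><td>€39,99</td></tr></table>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_marked_text(sample_html))
```

Keeping a row on one line matters because a label such as "Price" and its value otherwise end up on separate lines with nothing tying them together.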

Here's the core of what I built:

import requests
from bs4 import BeautifulSoup
import json

# This function cleans the page and sends a targeted prompt to the LLM
def extract_product_info(url, llm_api_endpoint, api_key):
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail early instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    # Remove noise
    for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        tag.decompose()
    # Fall back to the whole document if the page has no <body>
    body = soup.body or soup
    # Get visible text, one line per text node
    text = body.get_text(separator='\n', strip=True)
    # Truncate to avoid token limits (e.g., 3000 chars)
    text = text[:3000]

    prompt = f"""
You are a data extraction assistant. Extract product information from the following web page text.
Return a JSON object with these fields:
- product_name
- price (as a string with currency symbol if present)
- availability (e.g., 'In stock', 'Out of stock')
- sku (if found)
If a field is missing, set it to null.

Page text:
{text}
"""

    headers = {
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    payload = {
        'model': 'gpt-4o-mini',  # cheaper, faster
        'messages': [{'role': 'user', 'content': prompt}],
        'temperature': 0.1,  # low for consistent output
        'response_format': { 'type': 'json_object' }
    }
    # This could be any OpenAI-compatible endpoint; I used a custom one like https://ai.interwestinfo.com/v1
    r = requests.post(llm_api_endpoint, json=payload, headers=headers, timeout=30)
    r.raise_for_status()
    data = r.json()
    return json.loads(data['choices'][0]['message']['content'])

# Example usage
result = extract_product_info('https://example-store.com/product/123',
                              'https://api.openai.com/v1/chat/completions',
                              'sk-your-key-here')
print(result)

I tested this on 10 different e‑commerce pages. It worked on all of them — even on the site that had changed its classes. The LLM understood context: it recognized "€39,99" as a price even without a dollar sign, and it knew that "ships in 3-5 days" meant availability was not "In stock".

Where this approach shines — and where it doesn't

Pros:

  • Robust to layout changes – I haven't touched the code in weeks, even as sites updated.
  • Adaptable – Want to extract rating or description? Just change the prompt.
  • No more regex hell – The LLM handles currency formats, abbreviations, and missing data gracefully.

Cons:

  • Cost – Each request costs a fraction of a cent, but at scale it adds up. For high‑volume scraping (thousands of pages/hour), this isn't viable without caching or local models.
  • Latency – LLM calls take 1-3 seconds. Compared to a quick regex match, that's slow.
  • Hallucinations – Sometimes the LLM invents a SKU or misreads a discount as the main price. Always validate with a secondary rule (e.g., check that the price contains a number matching \d+[.,]\d{2}, so that both "39.99" and "39,99" pass).
  • Prompt engineering – You need to tune the prompt to avoid over‑extraction. I spent a day tweaking the wording and examples.
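On the cost point, one possible shape for the caching mentioned above is to key results on a hash of the cleaned page text, so an unchanged page never triggers a second LLM call. This is a sketch under my own assumptions: ExtractionCache and extract_fn are hypothetical names, and the cache lives in memory only.

```python
import hashlib


class ExtractionCache:
    """In-memory cache of extraction results, keyed by a hash of the page text."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(page_text):
        return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

    def get_or_extract(self, page_text, extract_fn):
        """Return the cached result, or call extract_fn(page_text) once and store it."""
        key = self.key_for(page_text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = extract_fn(page_text)
        self._store[key] = result
        return result


# Stand-in for the real LLM call, so the sketch runs offline.
def fake_extract(page_text):
    return {"product_name": page_text.splitlines()[0], "price": None}


cache = ExtractionCache()
first = cache.get_or_extract("Blue Widget\n€39,99", fake_extract)
second = cache.get_or_extract("Blue Widget\n€39,99", fake_extract)
print(first, cache.hits, cache.misses)
```

Hashing the cleaned text rather than the raw HTML means cosmetic markup changes (new class names, a moved banner) still hit the cache as long as the visible text is the same.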

What I'd do differently next time

First, I'd pre‑filter pages more aggressively. Some pages had thousands of lines; sending the whole thing wasted tokens. Once I had identified common patterns, I used BeautifulSoup to keep only the main content area. Second, I'd add a validation step: after getting the JSON from the LLM, I'd check the price field with a regex to make sure it looks like a valid number. If it doesn't, I'd re‑prompt or flag the page for manual review.
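A validation step along those lines could look like the sketch below. Beyond the regex check, it adds one assumption of mine: that a real price or SKU should appear verbatim in the page text that was sent to the model. The function name and the problem messages are hypothetical.

```python
import re

# Digits with optional thousands groups and an optional 1-2 digit decimal part.
PRICE_SHAPE = re.compile(r"\d+(?:[.,]\d{3})*(?:[.,]\d{1,2})?")


def validate_extraction(record, page_text):
    """Return a list of problems; an empty list means the record looks plausible."""
    problems = []
    price = record.get("price")
    if price is not None:
        match = PRICE_SHAPE.search(price)
        if match is None:
            problems.append("price has no numeric part")
        elif match.group(0) not in page_text:
            # The number never occurs on the page, so it was probably invented.
            problems.append("price not found in page text")
    sku = record.get("sku")
    if sku is not None and sku not in page_text:
        problems.append("sku not found in page text")
    return problems


page = "Blue Widget\n€39,99\nSKU: BW-100\nIn stock"
print(validate_extraction({"price": "€39,99", "sku": "BW-100"}, page))
print(validate_extraction({"price": "€49,99", "sku": "ZZ-999"}, page))
```

A non-empty result is the signal to re-prompt or to queue the page for manual review. The verbatim check will also flag prices the model reformatted (say, "39.99" for "39,99"), so it errs on the side of a false alarm.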

I also explored using a small local model (like Phi-3) for the first pass, falling back to a cloud model only when confidence was low. This reduced cost by 80%.
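The local-first routing can be sketched as below. How confidence was actually measured isn't covered here, so this example assumes a simple stand-in: the share of required fields the local model filled in. The names field_coverage, local_extract, and cloud_extract are placeholders for whatever wraps each model.

```python
REQUIRED_FIELDS = ("product_name", "price", "availability")


def field_coverage(record):
    """Hypothetical confidence proxy: fraction of required fields that are filled."""
    filled = sum(1 for field in REQUIRED_FIELDS if record.get(field))
    return filled / len(REQUIRED_FIELDS)


def extract_with_fallback(page_text, local_extract, cloud_extract, min_confidence=1.0):
    """Try the local model first; call the cloud model only if coverage is too low."""
    record = local_extract(page_text)
    if field_coverage(record) >= min_confidence:
        return record, "local"
    return cloud_extract(page_text), "cloud"


# Stand-ins for the two models, so the sketch runs offline.
def fake_local(page_text):
    if "stock" in page_text:
        return {"product_name": "Blue Widget", "price": "€39,99", "availability": "In stock"}
    return {"product_name": "Blue Widget", "price": None, "availability": None}


def fake_cloud(page_text):
    return {"product_name": "Blue Widget", "price": "€39,99", "availability": "Out of stock"}


print(extract_with_fallback("Blue Widget €39,99 In stock", fake_local, fake_cloud))
print(extract_with_fallback("Blue Widget", fake_local, fake_cloud))
```

Field coverage is a crude signal, since a model can fill every field and still be wrong. Pairing it with a grounding check on the returned values would make the routing decision more trustworthy.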

Finally, I realized LLMs are terrible at counting rows in tables or precise numeric extraction (like "5 items" vs "5,00"). For those, I still use regex on the snippet identified by the LLM.
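For the numeric part, a deterministic parser on the LLM-selected snippet might look like this sketch. It assumes that a separator followed by exactly two trailing digits is the decimal mark and that every other dot or comma is a thousands separator; that heuristic will misread values with one or three decimal places. The name parse_amount is hypothetical.

```python
import re
from decimal import Decimal

# A run of digits, dots, and commas that starts and ends with a digit.
NUMBER_RE = re.compile(r"\d[\d.,]*\d|\d")


def parse_amount(snippet):
    """Parse the first number in `snippet` into a Decimal, or return None."""
    match = NUMBER_RE.search(snippet)
    if match is None:
        return None
    raw = match.group(0)
    sep_pos = max(raw.rfind("."), raw.rfind(","))
    if sep_pos != -1 and len(raw) - sep_pos - 1 == 2:
        # Exactly two digits after the last separator: treat it as the decimal mark.
        integer_part = re.sub(r"[.,]", "", raw[:sep_pos])
        return Decimal(f"{integer_part}.{raw[sep_pos + 1:]}")
    # Otherwise treat every separator as a thousands separator.
    return Decimal(re.sub(r"[.,]", "", raw))


for snippet in ["€39,99", "1.299,00 EUR", "$1,299.00", "5 items", "5,00"]:
    print(f"{snippet!r:16} -> {parse_amount(snippet)}")
```

Returning a Decimal rather than a float keeps prices exact, which matters once the values are compared or summed downstream.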

When NOT to use this

If you're scraping a single site with a stable API or a clear sitemap, skip the LLM. If you need real‑time data (sub‑second), this is too slow. If your data is purely numerical and structured (like log files), regex is faster and cheaper. And if you're on a tight budget, even gpt-4o-mini adds up — a local model might be better.

But for my use case — extracting semi‑structured data from a chaotic set of web pages — this approach has been a lifesaver. I haven't touched my scraping code in three weeks, even though sites have changed several times. The LLM abstracts away the fragile DOM.

Now I'm curious: How do you handle website changes in your scrapers? Are you still wrestling with selectors, or have you tried a language model approach? What does your setup look like?
