DEV Community

zhongqiyue

I stopped writing regex for web scraping — here's what I do instead

I’ve been scraping the web for years. It’s always the same cycle: find a site, write a few CSS selectors, get the data, then two weeks later the site redesigns and my scraper is dead. I used to spend hours tweaking regex patterns and XPath expressions. It felt like I was fighting the web itself.

Then I started wondering: what if I just told a computer what I wanted in plain English? That’s when I began experimenting with LLMs for data extraction.

The problem: fragile selectors

Last month I needed to pull product information from a dozen different e-commerce sites. Each one had a different HTML structure. One used <div class="product-name">, another used <h2 itemprop="name">, and a third had the name buried in a <span> with a dynamic class name. My BeautifulSoup script looked like a labyrinth of conditional logic:

import requests
from bs4 import BeautifulSoup

def extract_name(soup):
    # try first pattern
    name = soup.select_one('.product-name')
    if name:
        return name.text.strip()
    # try second pattern
    name = soup.select_one('[itemprop="name"]')
    if name:
        return name.text.strip()
    # fallback: find all h2 and guess
    for h2 in soup.find_all('h2'):
        if 'price' not in h2.text.lower():
            return h2.text.strip()
    return None

This worked for a while, but maintaining it was a nightmare. Every site update meant rewriting the fallback chain. I needed a different approach.

What I tried (and hated)

First, I tried using more sophisticated scraping frameworks like Scrapy with middlewares. Still the same selectors. Then I looked into visual scraping tools like Octoparse, but they required a GUI and didn’t scale well in code. I even attempted to train a small ML model to recognize product fields — that was overkill and required labeled data.

I was ready to give up when a friend said, “Why not just feed the raw HTML into GPT and ask it to extract what you need?” My first reaction: “That’s insane — too slow and expensive.” But I gave it a shot.

What actually worked: LLM-powered extraction

The idea is simple: instead of writing brittle selectors, you send a small snippet of HTML (or even the whole page) to an LLM with a prompt describing the data you want back as JSON. The LLM figures out the patterns.

Here’s a minimal example using the OpenAI API:

import json

from bs4 import BeautifulSoup
from openai import OpenAI  # requires openai>=1.0

client = OpenAI(api_key="your-key-here")

def extract_product_info(html, fields):
    """
    Given raw HTML and a list of fields to extract (e.g., ['name', 'price', 'description']),
    return a dict with those fields.
    """
    # Clean up HTML to reduce token usage
    soup = BeautifulSoup(html, 'html.parser')
    # Remove tags that rarely contain product data
    for tag in soup(['script', 'style', 'nav', 'footer']):
        tag.decompose()
    text = soup.get_text(separator=' ', strip=True)[:3000]  # keep the first 3,000 characters

    prompt = f"""Extract the following fields from the text below: {', '.join(fields)}.
Return a JSON object with those fields. If a field is not found, set it to null.

Text:
{text}
"""

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    try:
        return json.loads(response.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        # The model returned something that is not valid JSON (or no content at all)
        return {"error": "Failed to parse LLM response"}

# Usage
with open('product_page.html') as f:
    html = f.read()

result = extract_product_info(html, ['name', 'price', 'availability'])
print(result)

This code works surprisingly well. I tested it on 10 different product pages and it correctly extracted the fields about 80% of the time. The failures were usually due to very long pages being truncated or ambiguous field names (e.g., "price" could be the sale price vs. original).
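One way to reduce the ambiguous-field failures might be to describe each field in the prompt instead of passing bare names. The sketch below is only an illustration: `build_prompt`, the `FIELDS` dict, and the descriptions in it are hypothetical and not part of the script above.

```python
def build_prompt(fields: dict[str, str], text: str) -> str:
    """Build an extraction prompt in which every field carries a short description."""
    field_lines = "\n".join(f"- {name}: {desc}" for name, desc in fields.items())
    return (
        "Extract the following fields from the text below.\n"
        f"{field_lines}\n"
        "Return a JSON object with exactly these keys. "
        "If a field is not found, set it to null.\n\n"
        f"Text:\n{text}\n"
    )


# Hypothetical descriptions that separate the sale price from the original price
FIELDS = {
    "price": "the current selling price the customer pays, with currency symbol",
    "original_price": "the crossed-out price before any discount, if shown",
}

prompt = build_prompt(FIELDS, "Was $59.99 Now $39.99")
print(prompt)
```

The returned string can be dropped into the `messages` list in place of the simpler prompt. Whether it actually improves accuracy is something to measure on your own sample pages.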

Dealing with the trade-offs

LLM extraction isn’t a silver bullet. Here are the issues I hit:

  • Cost: At ~$0.002 per request for gpt-3.5-turbo, scraping 10,000 pages would cost $20. That’s fine for a one-off job, but not for continuous scraping of millions of pages.
  • Latency: Each request takes 1-3 seconds. For high-volume scraping, you’d need batching or async calls.
  • Hallucinations: The LLM might invent data (e.g., guess a price when none exists). Always validate the output.
  • Privacy: Sending entire page content to a third-party API may violate terms of service or data protection laws. For sensitive data, you’d want a local model.
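For the hallucination point, a cheap first check is to confirm that every extracted value actually appears in the page text. This is a minimal sketch under that assumption; `normalize` and `validate_extraction` are hypothetical helper names, and the sample data is made up.

```python
import re


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences don't matter."""
    return re.sub(r"\s+", " ", s).strip().lower()


def validate_extraction(result: dict, source_text: str) -> dict:
    """Keep a value only if it appears in the source text after normalization.

    Anything else is replaced with None and treated as a possible hallucination.
    """
    haystack = normalize(source_text)
    checked = {}
    for field, value in result.items():
        if value is None:
            checked[field] = None
            continue
        needle = normalize(str(value))
        checked[field] = value if needle and needle in haystack else None
    return checked


# Made-up page text and model output for illustration
page_text = "Acme Kettle  1.7L   $39.99   In stock"
llm_output = {
    "name": "Acme Kettle 1.7L",
    "price": "$39.99",
    "availability": "In stock",
    "rating": "4.8 stars",  # not present in the page text
}
print(validate_extraction(llm_output, page_text))
```

The check is deliberately strict: it will also reject correct values that the model rephrased or reformatted, so rejected fields are better sent to review than silently dropped.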

I started looking for alternatives to calling the OpenAI API directly. That’s when I found hosted services that wrap LLMs with a focus on structured extraction. One such option is Interwest Info, which offers a similar API but with built-in validation and retries. I used it for a side project and it handled the extraction reliably. But the approach is the same: describe what you want, get JSON back.

Lessons learned

  • Start simple. Before writing any extraction logic, try sending the page text to an LLM. You might be surprised how far it gets.
  • Use it for the tricky parts. I now combine traditional selectors for stable fields (like URLs or IDs) and fall back to LLM for messy text fields.
  • Cache aggressively. Store results for pages that haven’t changed to avoid unnecessary API calls.
  • Set a budget. Even at low cost, runaway requests can add up. Put a cap on spending.
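The caching and budget lessons can be combined in one small wrapper. The sketch below keys the cache on a hash of the page text and refuses to make a call that would exceed a spending cap. `CachedExtractor`, `BudgetExceeded`, and `fake_extract` are hypothetical names, and the per-call cost is only the rough estimate quoted above.

```python
import hashlib
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when another paid call would push spending past the cap."""


class CachedExtractor:
    """Wraps an extraction function with a content-hash cache and a spending cap."""

    def __init__(self, extract: Callable[[str], dict], cost_per_call: float, budget: float):
        self.extract = extract              # the function that actually calls the LLM
        self.cost_per_call = cost_per_call  # estimated cost of one call, in dollars
        self.budget = budget                # maximum total spend, in dollars
        self.spent = 0.0
        self.cache: dict[str, dict] = {}

    def __call__(self, page_text: str) -> dict:
        # Identical page text always maps to the same key, so unchanged pages are free
        key = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]
        if self.spent + self.cost_per_call > self.budget:
            raise BudgetExceeded(f"budget of {self.budget} dollars would be exceeded")
        result = self.extract(page_text)
        self.spent += self.cost_per_call
        self.cache[key] = result
        return result


# Stand-in for a real LLM call so the sketch runs without network access
calls = []

def fake_extract(page_text: str) -> dict:
    calls.append(page_text)
    return {"name": page_text.split()[0]}

extractor = CachedExtractor(fake_extract, cost_per_call=0.002, budget=0.005)
first = extractor("Kettle $39.99")
second = extractor("Kettle $39.99")  # same content, so no second call is made
```

The cache here lives in memory only. For a long-running scraper you would likely persist it to disk or a database.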

What I’d do differently next time

Next time I need to scrape many different sites, I’ll build a simple pipeline: first check the cache, then call an LLM extraction endpoint (whether OpenAI or a hosted service like Interwest Info), and finally fall back to manual review for edge cases. I’ll also pre-chunk large pages to avoid truncation issues.
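Pre-chunking could look roughly like this: split the page text into overlapping windows, run extraction on each, and keep the first non-null value per field. This is a sketch, not a tested pipeline; `chunk_text` and `extract_from_chunks` are hypothetical names, and `extract_chunk` stands in for any function with the same `(text, fields)` shape as `extract_product_info`.

```python
from typing import Callable


def chunk_text(text: str, size: int = 3000, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def extract_from_chunks(
    text: str,
    fields: list[str],
    extract_chunk: Callable[[str, list[str]], dict],
    size: int = 3000,
    overlap: int = 200,
) -> dict:
    """Run extraction chunk by chunk and keep the first non-null value per field."""
    merged = {field: None for field in fields}
    for chunk in chunk_text(text, size, overlap):
        partial = extract_chunk(chunk, fields)
        for field in fields:
            if merged[field] is None and partial.get(field) is not None:
                merged[field] = partial[field]
        # Stop early once every field is filled to avoid paying for extra calls
        if all(value is not None for value in merged.values()):
            break
    return merged


# Stand-in extractor so the sketch runs offline
def fake_extract(chunk: str, fields: list[str]) -> dict:
    return {
        "name": "Kettle" if "Kettle" in chunk else None,
        "price": "$39.99" if "$39.99" in chunk else None,
    }

long_text = "Kettle " + "x" * 50 + " $39.99"
result = extract_from_chunks(long_text, ["name", "price"], fake_extract, size=20, overlap=5)
print(result)
```

The overlap lowers the chance that a value is cut in half at a chunk boundary, but it does not remove it. Taking the first non-null value is also a simplification, since a later chunk may contain the better match.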

Wrapping up

Regex and CSS selectors still have their place — they’re fast, predictable, and free. But when you’re dealing with heterogeneous web data, telling a computer what you want in English is surprisingly effective. It’s not perfect, but it saved me from losing my mind over changing HTML structures.

Give it a try on your next scraping project. Start with a small sample and see if it works for you.

What’s your secret weapon for dealing with messy web data?
