zhongqiyue

Posted on Jun 5

I stopped writing regex for web scraping — here's what I do instead

#webdev #python #ai #tutorial

I’ve been scraping the web for years. It’s always the same cycle: find a site, write a few CSS selectors, get the data, then two weeks later the site redesigns and my scraper is dead. I used to spend hours tweaking regex patterns and XPath expressions. It felt like I was fighting the web itself.

Then I started wondering: what if I just told a computer what I wanted in plain English? That’s when I began experimenting with LLMs for data extraction.

The problem: fragile selectors

Last month I needed to pull product information from a dozen different e-commerce sites. Each one had a different HTML structure. One used <div class="product-name">, another used <h2 itemprop="name">, and a third had the name buried in a <span> with a dynamic class name. My BeautifulSoup script looked like a labyrinth of conditional logic:

from bs4 import BeautifulSoup

def extract_name(soup):
    # try first pattern
    name = soup.select_one('.product-name')
    if name:
        return name.text.strip()
    # try second pattern
    name = soup.select_one('[itemprop="name"]')
    if name:
        return name.text.strip()
    # fallback: find all h2 and guess
    for h2 in soup.find_all('h2'):
        if 'price' not in h2.text.lower():
            return h2.text.strip()
    return None

This worked for a while, but maintaining it was a nightmare. Every site update meant rewriting the fallback chain. I needed a different approach.

What I tried (and hated)

First, I tried more sophisticated scraping frameworks like Scrapy with middlewares. Still the same selectors. Then I looked into visual scraping tools like Octoparse, but they required a GUI and were hard to automate from code. I even attempted to train a small ML model to recognize product fields, but that was overkill and required labeled data.

I was ready to give up when a friend said, “Why not just feed the raw HTML into GPT and ask it to extract what you need?” My first reaction: “That’s insane — too slow and expensive.” But I gave it a shot.

What actually worked: LLM-powered extraction

The idea is simple: instead of writing brittle selectors, you send the page content (a small HTML snippet, the whole page, or just its visible text) to an LLM with a prompt describing the data you want back as JSON. The LLM figures out the structure for you.

Here’s a minimal example using the OpenAI API:

import json

from bs4 import BeautifulSoup
from openai import OpenAI

# Requires openai>=1.0; the older openai.ChatCompletion interface was removed in that release
client = OpenAI(api_key="your-key-here")

def extract_product_info(html, fields):
    """
    Given raw HTML and a list of fields to extract (e.g., ['name', 'price', 'description']),
    return a dict with those fields.
    """
    # Clean up HTML to reduce token usage
    soup = BeautifulSoup(html, 'html.parser')
    # Remove tags that rarely contain product data
    for tag in soup(['script', 'style', 'nav', 'footer']):
        tag.decompose()
    text = soup.get_text(separator=' ', strip=True)[:3000]  # limit length

    prompt = f"""Extract the following fields from the text below: {', '.join(fields)}.
Return a JSON object with those fields. If a field is not found, set it to null.

Text:
{text}
"""

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    try:
        return json.loads(response.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        # The model returned something that is not valid JSON (or no content at all)
        return {"error": "Failed to parse LLM response"}

# Usage
with open('product_page.html') as f:
    html = f.read()

result = extract_product_info(html, ['name', 'price', 'availability'])
print(result)

This code works surprisingly well. I tested it on 10 different product pages and it correctly extracted the fields about 80% of the time. The failures were usually due to very long pages being truncated or ambiguous field names (e.g., "price" could be the sale price vs. original).
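One way to reduce the ambiguous-field failures is to describe each field in the prompt instead of passing a bare name. The sketch below is an assumption on my part, not something I benchmarked: `build_prompt`, `FIELD_SPECS`, and the field descriptions are all hypothetical names made up for illustration.

```python
def build_prompt(text, field_specs):
    """Build an extraction prompt from a {field_name: description} mapping."""
    field_lines = "\n".join(
        f"- {name}: {description}" for name, description in field_specs.items()
    )
    return (
        "Extract the following fields from the text below.\n"
        f"{field_lines}\n"
        "Return a JSON object with exactly these keys. "
        "If a field is not found, set it to null.\n\n"
        f"Text:\n{text}"
    )

# Hypothetical descriptions that separate the two kinds of price
FIELD_SPECS = {
    "name": "the product title",
    "sale_price": "the price the customer pays now, including the currency symbol",
    "original_price": "the crossed-out price before any discount",
}

prompt = build_prompt("Acme Kettle Was $40.00 Now $29.99 In stock", FIELD_SPECS)
print(prompt)
```

The resulting string can be dropped into the same chat completion call as before; only the prompt changes.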

Dealing with the trade-offs

LLM extraction isn’t a silver bullet. Here are the issues I hit:

Cost: At ~$0.002 per request for gpt-3.5-turbo, scraping 10,000 pages would cost $20. That’s fine for a one-off job, but not for continuous scraping of millions of pages.
Latency: Each request takes 1-3 seconds. For high-volume scraping, you’d need batching or async calls.
Hallucinations: The LLM might invent data (e.g., guess a price when none exists). Always validate the output.
Privacy: Sending entire page content to a third-party API may violate terms of service or data protection laws. For sensitive data, you’d want a local model.
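For the hallucination problem, a cheap first check is to confirm that every value the LLM returned actually appears in the text you sent. This is a minimal sketch under that assumption; `find_unsupported_fields` is a hypothetical helper, and a verbatim match will also flag values the model legitimately reformatted, so treat a hit as "needs review" rather than "wrong".

```python
import re

def _normalize(value):
    """Lowercase and collapse whitespace so minor formatting differences don't matter."""
    return re.sub(r"\s+", " ", str(value)).strip().lower()

def find_unsupported_fields(result, source_text):
    """Return the names of fields whose non-null value does not appear in the source text."""
    haystack = _normalize(source_text)
    unsupported = []
    for field, value in result.items():
        if value is None:
            continue  # null means "not found", which is what we asked for
        if _normalize(value) not in haystack:
            unsupported.append(field)
    return unsupported

page_text = "Acme Kettle   $29.99  In stock"
llm_result = {"name": "Acme Kettle", "price": "$19.99", "availability": None}
print(find_unsupported_fields(llm_result, page_text))  # ['price']
```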

I started looking for alternatives that handle some of this for me. That’s when I found hosted services that wrap LLMs with a focus on structured extraction. One such option is Interwest Info, which offers a similar API but with built-in validation and retries. I used it for a side project and it handled the extraction reliably. The approach is the same either way: describe what you want, get JSON back.

Lessons learned

Start simple. Before writing any extraction logic, try sending the page text to an LLM. You might be surprised how far it gets.
Use it for the tricky parts. I now combine traditional selectors for stable fields (like URLs or IDs) and fall back to LLM for messy text fields.
Cache aggressively. Store results for pages that haven’t changed to avoid unnecessary API calls.
Set a budget. Even at low cost, runaway requests can add up. Put a cap on spending.
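The caching and budget ideas fit in one small wrapper. Here is a rough sketch: `CachedExtractor` and `BudgetExceeded` are hypothetical names, the in-memory dict stands in for whatever store you actually use, and the per-call cost is just the rough $0.002 figure from above rather than a measured number.

```python
import hashlib

class BudgetExceeded(Exception):
    """Raised when the next request would go past the spending cap."""

class CachedExtractor:
    def __init__(self, extract_fn, cost_per_call=0.002, budget=1.00):
        self.extract_fn = extract_fn        # any callable: text -> dict
        self.cost_per_call = cost_per_call  # assumed flat cost per request
        self.budget = budget
        self.spent = 0.0
        self.cache = {}                     # content hash -> extraction result

    def extract(self, text):
        # Key on the page content, so an unchanged page never triggers a new call
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]
        if self.spent + self.cost_per_call > self.budget:
            raise BudgetExceeded(f"spent {self.spent:.3f} of {self.budget:.3f}")
        result = self.extract_fn(text)
        self.spent += self.cost_per_call
        self.cache[key] = result
        return result

def fake_extract(text):
    # Stand-in for a real LLM call so the sketch runs offline
    return {"length": len(text)}

extractor = CachedExtractor(fake_extract, cost_per_call=0.002, budget=0.003)
first = extractor.extract("page A")
again = extractor.extract("page A")  # served from the cache, costs nothing
print(first, again, extractor.spent)
```

In real use you would pass a function that calls your LLM endpoint in place of `fake_extract`.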

What I’d do differently next time

Next time I need to scrape many different sites, I’ll build a simple pipeline: first check the cache, then call an LLM extraction endpoint (whether OpenAI or a hosted service like Interwest Info), and finally fall back to manual review for edge cases. I’ll also pre-chunk large pages to avoid truncation issues.
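Pre-chunking could look something like the sketch below. It assumes overlapping character windows (the 3000-character default mirrors the truncation limit used earlier) and a "first non-null value wins" merge; both `chunk_text` and `merge_results` are hypothetical helpers, and a smarter merge may be needed when chunks disagree.

```python
def chunk_text(text, chunk_size=3000, overlap=200):
    """Split text into overlapping windows so no field is cut off at a hard boundary."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks

def merge_results(results, fields):
    """Combine per-chunk results, keeping the first non-null value for each field."""
    merged = {field: None for field in fields}
    for result in results:
        for field in fields:
            if merged[field] is None and result.get(field) is not None:
                merged[field] = result[field]
    return merged

print(chunk_text("0123456789", chunk_size=4, overlap=1))
print(merge_results(
    [{"name": None, "price": "$5"}, {"name": "Kettle", "price": "$6"}],
    ["name", "price"],
))
```

Each chunk would go through the extraction call separately, and the per-chunk dicts would then be passed to `merge_results`.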

Wrapping up

Regex and CSS selectors still have their place — they’re fast, predictable, and free. But when you’re dealing with heterogeneous web data, telling a computer what you want in English is surprisingly effective. It’s not perfect, but it saved me from losing my mind over changing HTML structures.

Give it a try on your next scraping project. Start with a small sample and see if it works for you.

What’s your secret weapon for dealing with messy web data?
