zhongqiyue

Posted on May 28

Why I Stopped Writing CSS Selectors for Web Scraping

#webdev #python #ai #scraping

I’ve been scraping websites for years. Mostly e-commerce product data, job listings, that sort of thing. And for a long time, my workflow looked like this:

Open the page in Chrome DevTools.
Hunt for a unique CSS selector or XPath that would reliably grab the price, title, description.
Write a BeautifulSoup or Selenium script.
Run it once. Works great.
Run it a week later. Everything breaks because the site redesigned their product cards.

I got tired of playing whack-a-mole with HTML structures. So I started experimenting with a different approach: using large language models (LLMs) to extract structured data directly from raw HTML.

The Breaking Point

Last year I needed to scrape product data from 50+ different e-commerce sites for a price comparison tool. Each site had its own HTML layout. Some used JavaScript rendering, some had anti-bot measures. I spent two weeks writing custom selectors for each site, and then another two weeks fixing them when the sites updated.

I tried headless browsers with Selenium. That worked, but it was slow and still fragile – one class name change and my script would return None for the price. I tried regex on the raw HTML. That was a nightmare.

I needed something that understood the semantics of the page, not just the syntax.

The Lightbulb Moment: LLMs as a Universal Parser

I had been playing with GPT-3.5 for text generation, but then I wondered: what if I feed it the raw HTML and ask it to extract specific fields? The LLM doesn't care about class names – it understands that "$19.99" next to "Price:" is the price.

I tested it on a messy product page. I sent the HTML as a string with a prompt like:

Extract the following fields from this HTML:
- product_name
- price (as a number)
- description
- availability (in stock / out of stock)

Return as JSON.

HTML:
{html}

It worked. Not perfectly, but surprisingly well. The key insight: LLMs are trained on enormous amounts of web content, so they've seen a huge variety of HTML structures. They don't need a selector – they just need the raw markup and a clear schema.

The Code

Here's a Python function I wrote to do this. It uses OpenAI's API, but you could swap in any LLM provider.

import os
import json
from openai import OpenAI
from bs4 import BeautifulSoup

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def extract_product_data(html: str) -> dict:
    # Clean the HTML a bit to reduce tokens
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style tags
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    cleaned_html = str(soup)[:10000]  # limit to 10k chars for cost

    prompt = f"""
You are a data extraction assistant. Given the HTML of a product page, extract the following fields and return a JSON object:
- product_name: string
- price: number (remove currency symbols)
- description: string (first 200 chars)
- availability: string ("in_stock", "out_of_stock", "unknown")

If a field is not found, use null.

HTML:
{cleaned_html}

JSON:
"""

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You extract structured data from HTML."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.0,
        max_tokens=500
    )

    result_text = response.choices[0].message.content
    # Parse the JSON from the response (handle markdown code blocks)
    if "```json" in result_text:
        result_text = result_text.split("```json")[1].split("```")[0].strip()
    elif "```" in result_text:
        result_text = result_text.split("```")[1].split("```")[0].strip()
    return json.loads(result_text)

# Example usage
with open("product_page.html", "r", encoding="utf-8") as f:
    html = f.read()
data = extract_product_data(html)
print(data)

This is a simplified version. In production, you'd want to handle retries, validation, and cost limits.
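To make the retry part concrete, here's a minimal sketch that wraps any extraction function in exponential backoff. The name `call_with_retries` and the choice of retryable exceptions are assumptions for illustration; real provider SDKs raise their own error types, so you'd likely adjust that tuple.

```python
import json
import random
import time
from typing import Callable

# Assumed set of "worth retrying" errors: malformed JSON from the model
# plus generic network failures. Adjust for your provider's exceptions.
RETRYABLE = (json.JSONDecodeError, ConnectionError, TimeoutError)


def call_with_retries(
    extract: Callable[[str], dict],
    html: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call extract(html), retrying retryable errors with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return extract(html)
        except RETRYABLE:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # With base_delay=1.0 this waits about 1s, 2s, 4s, ... plus jitter
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))
```

Usage would be something like `call_with_retries(extract_product_data, html)`. The `sleep` parameter exists only so the delay can be swapped out in tests.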

What I Learned

Pros:

Resilience: The same prompt works across different sites without modification. A class name change? Doesn't matter. The LLM still finds the price.
Speed of development: I can add a new site in minutes instead of hours.
Handles JavaScript: If you feed it the rendered HTML (from Playwright or similar), it works just as well.
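For the JavaScript case, getting the rendered HTML with Playwright's sync API might look like the sketch below. The function name `fetch_rendered_html` and the `networkidle` wait condition are assumptions for illustration; some sites need a different wait strategy, such as waiting for a specific element.

```python
def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load a page in headless Chromium and return the DOM after scripts have run."""
    # Imported lazily so this module still loads where Playwright isn't installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles so client-rendered content is present
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

The returned string can be passed straight to `extract_product_data`.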

Cons:

Cost: Each API call costs money. For a few hundred pages it's fine, but for millions you'll need to optimize (caching, cheaper models, batching).
Latency: LLM calls take 1-3 seconds. Not great for real-time scraping.
Accuracy: It's not 100%. Sometimes it hallucinates a price or misses a field. You need a validation layer.
Token limits: HTML is verbose. You may need to chunk the page or use a model with larger context.
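A validation layer for the accuracy problem could be as simple as the following sketch. It checks types and allowed values, and as a rough guard against hallucinated prices it requires the extracted number to appear somewhere in the source HTML. The function name `validate_product` is hypothetical, and the number matching assumes US-style formatting (`1,299.00`), so a price written as `19,99` would be flagged.

```python
import re

ALLOWED_AVAILABILITY = {"in_stock", "out_of_stock", "unknown"}


def validate_product(data: dict, html: str) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []

    name = data.get("product_name")
    if not isinstance(name, str) or not name.strip():
        problems.append("product_name missing or empty")

    price = data.get("price")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        problems.append("price is not a number")
    elif price <= 0:
        problems.append("price is not positive")
    else:
        # Hallucination check: the price should match some number in the page
        page_numbers = {
            float(match.replace(",", ""))
            for match in re.findall(r"\d[\d,]*(?:\.\d+)?", html)
        }
        if float(price) not in page_numbers:
            problems.append("price does not appear in the source HTML")

    if data.get("availability") not in ALLOWED_AVAILABILITY:
        problems.append("availability has an unexpected value")

    return problems
```

Records with a non-empty problem list can be retried, sent to a stronger model, or queued for manual review.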

When NOT to Use This

If you're scraping a single well-structured site, just write a selector. It's faster and free.
If you need real-time results (e.g., live price updates), LLMs are too slow.
If your data is sensitive, sending it to a third-party API might be a no-go.

But for messy, heterogeneous sites? This approach is a lifesaver.

What I'd Do Differently Next Time

I'd start with a hybrid approach: try a simple CSS selector first, and fall back to the LLM if it fails. That way you get the speed of selectors for easy pages and the resilience of AI for the hard ones.
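Here's a rough sketch of what that hybrid could look like. Everything in it is hypothetical scaffolding: `fast` stands for whatever selector-based function you already have, `fallback` for the LLM call, and `REQUIRED_FIELDS` for the fields you can't live without.

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[dict]]

# Assumed minimum for a usable record; adjust to your own schema
REQUIRED_FIELDS = ("product_name", "price")


def is_complete(data: Optional[dict]) -> bool:
    """A result counts only if every required field is present and non-null."""
    return data is not None and all(
        data.get(field) is not None for field in REQUIRED_FIELDS
    )


def extract_hybrid(
    html: str, fast: Extractor, fallback: Extractor
) -> tuple[Optional[dict], str]:
    """Try the cheap selector-based extractor first; use the LLM only if it fails.

    Returns (data, source), where source records which path produced the data.
    """
    try:
        data = fast(html)
    except Exception:
        data = None  # a selector that crashes is treated like one that found nothing
    if is_complete(data):
        return data, "selector"
    return fallback(html), "llm"
```

Logging the `source` value also tells you how often the selectors are failing, which is a useful signal that a site has changed.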

Also, I'd use a local model like Llama 3 via Ollama to avoid API costs and keep data private. For extraction tasks the quality should be close to GPT-3.5, though that's worth checking against your own pages before committing.
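As a sketch of the local option, the snippet below talks to Ollama's chat endpoint using only the standard library. It assumes an Ollama server running on its default port and a model already pulled under the tag `llama3`; the function names and the shortened prompt are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_ollama_payload(cleaned_html: str, model: str = "llama3") -> dict:
    """Build a non-streaming chat request that asks Ollama for JSON output."""
    prompt = (
        "Extract product_name, price (number), description and availability "
        '("in_stock", "out_of_stock", "unknown") from this HTML. '
        "Use null for missing fields. Return only a JSON object.\n\nHTML:\n"
        + cleaned_html
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "format": "json",  # ask the server to constrain output to valid JSON
        "stream": False,   # one complete response instead of chunks
        "options": {"temperature": 0},
    }


def extract_with_ollama(cleaned_html: str, timeout: float = 120.0) -> dict:
    """Send the request to a locally running Ollama server and parse the reply."""
    body = json.dumps(build_ollama_payload(cleaned_html)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return json.loads(reply["message"]["content"])
```

Because the request asks for JSON directly, the markdown-fence stripping from the OpenAI version shouldn't be needed here, but validating the result is still a good idea.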

The Tool That Inspired This

While building this, I stumbled upon a service called Interwest AI (https://ai.interwestinfo.com/) that does exactly this kind of LLM-based extraction as a managed API. I haven't used it in production, but it confirmed that others were thinking the same way.

Over to You

I'm still refining this approach. How do you handle scraping sites that change their HTML every week? Have you tried using LLMs for extraction, or do you stick with traditional selectors? I'd love to hear what's worked for you.

This article is based on my personal experience. The tool mentioned is one example of the approach, not an endorsement.

DEV Community