Why My Web Scrapers Keep Breaking and How I Finally Fixed Them

#webdev #python #ai #tutorial

Last month, I had to scrape product prices from 50 different e-commerce sites. My usual toolkit—BeautifulSoup, CSS selectors, and a bit of XPath—had always worked fine for one-off tasks. But this project required covering dozens of stores with wildly different HTML structures.

I started coding. Within a day, I had a scraper that worked beautifully for three sites. By day two, I was chasing broken selectors on sites four and five. By the end of the week, I had a 400-line Python script that was a mess of try/except blocks and regex hacks. Every time a site changed its layout, my scraper silently returned None for half the fields.

It felt like playing Whac‑A‑Mole with HTML.

What I Tried That Didn’t Work

First, I doubled down on better selectors. I used lxml for speed, wrote XPath expressions that were “robust” (they weren’t), and added fallback patterns for each field. For a site with 10 product pages, it worked. For 50? No chance.

Then I tried sitemap scraping – grab the sitemap, parse it, and hope the content is consistent. That just shifted the problem: sitemaps are often missing or broken.

Next, I attempted visual regression – take screenshots and OCR the prices. It worked in demos but failed in production because of dynamic layouts and inconsistent fonts.

I even dabbled in machine learning classifiers to identify price elements (using features like CSS class names, font size, position). I trained a random forest on 1000 annotated elements. It achieved 85% accuracy on my test set, but the remaining 15% meant I had to manually fix hundreds of products. Not acceptable.

What Eventually Worked (the Approach)

After staring at yet another AttributeError at 2am, I had a thought: what if I stopped trying to parse the HTML structure altogether and instead let a language model read the raw HTML like a human would?

The idea is simple: instead of CSS selectors, you dump the inner HTML of a container element (say a product card) into an LLM prompt and ask it to extract the fields you need.

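from openai import OpenAI

# For demonstration – replace with your own API key
client = OpenAI(api_key="sk-...")

MAX_HTML_CHARS = 3000  # cap the snippet size to limit token usage

def extract_product_data(html_snippet):
    # Truncate outside the f-string: a "# comment" inside it would be sent as prompt text
    snippet = html_snippet[:MAX_HTML_CHARS]
    prompt = f"""
You are a data extraction assistant. Given the HTML below (from a product listing page), extract the following fields in JSON format:
- title
- price
- availability (in stock / out of stock)
- image_url

Return only valid JSON, no extra text.

HTML:
{snippet}
"""
    # openai>=1.0 client interface (openai.ChatCompletion.create was removed in 1.0)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Returns the raw JSON string produced by the model
    return response.choices[0].message.content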

Of course, you can’t throw 50KB of HTML at an LLM every time (cost + latency). The trick is to pre‑segment the page. Use a simple heuristic like finding the main content area (e.g., <div> with a class containing “product”, “item”, or “card”) and pass only that snippet.

I built a small pipeline:

1. Fetch the page with requests.
2. Parse with BeautifulSoup – but only to find candidate containers by generic selectors like [class*="product"] or [id*="product"].
3. For each candidate container, pass its inner HTML (up to ~3000 characters) to the LLM.
4. The LLM returns structured JSON. Merge results.

Real Code That Works

Here’s a more realistic version that handles multiple products on one page:

import json

import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # replace with your own API key

def extract_products_from_page(url):
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, "html.parser")

    # Naive candidate selection: tune for your target sites.
    # Note that this also matches wrappers (e.g. a "product-list" div around the cards).
    containers = soup.find_all(
        lambda tag: tag.name in ["div", "li", "article"] and
        any("product" in cls.lower() for cls in tag.get("class", []))
    )

    products = []
    for container in containers[:5]:  # limit to first 5 to save tokens
        html = str(container)[:3000]
        prompt = f"""
Extract product data from this HTML snippet.
Return a JSON object with keys: title, price, in_stock, image_url.
If a value is missing, use null.

HTML:
{html}
"""
        try:
            # openai>=1.0 client interface (openai.ChatCompletion.create was removed in 1.0)
            msg = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )
            # json.loads raises if the model returns anything other than valid JSON
            data = json.loads(msg.choices[0].message.content)
            products.append(data)
        except Exception as e:
            print(f"Failed for container: {e}")
    return products

This isn’t production-ready: it still needs retries, rate limiting, and validation of the model’s output. But it demonstrates the core idea.
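For the retry and rate-limiting part, something like the wrapper below is roughly what I have in mind. Treat it as a sketch: call_with_retries, its parameters, and the delay values are names and defaults I made up for illustration, not part of any library.

```python
import random
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure, wait with exponential backoff and try again.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x ... the base delay, plus some jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

You would wrap the API call in a lambda, e.g. call_with_retries(lambda: extract_product_data(html)). Catching every Exception is deliberately crude; in practice you would probably retry only on rate-limit and timeout errors.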

Lessons Learned & Trade‑offs

Cost: Each extraction costs roughly $0.01–$0.05 per product (depending on model and token count). For 1000 products, that’s $10–$50. Cheaper than manual data entry, but not free.

Latency: GPT‑4 takes 2–5 seconds per call. You can batch containers into one prompt (e.g., “Extract all products from the following 5 HTML snippets…”) to reduce calls.
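Batching could look something like this. It is a sketch under my own assumptions: build_batch_prompt and parse_batch_response are hypothetical helpers, and the prompt wording is just one option that I have not tuned.

```python
import json


def build_batch_prompt(snippets, max_chars=3000):
    """Combine several HTML snippets into one numbered extraction prompt."""
    parts = [
        f"Extract product data from each of the following {len(snippets)} HTML snippets.",
        "Return a JSON array with one object per snippet, in the same order,",
        "each with keys: title, price, in_stock, image_url. Use null for missing values.",
        "",
    ]
    for index, html in enumerate(snippets, start=1):
        parts.append(f"--- Snippet {index} ---")
        parts.append(html[:max_chars])  # same per-snippet cap as before
    return "\n".join(parts)


def parse_batch_response(raw, expected_count):
    """Parse the model's reply and check it has one entry per snippet."""
    data = json.loads(raw)
    if not isinstance(data, list) or len(data) != expected_count:
        raise ValueError(f"expected a list of {expected_count} items")
    return data
```

The length check matters: with batched prompts the model can skip or merge a snippet, and without the check the results would silently line up with the wrong products.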

Reliability: LLMs hallucinate. I’ve seen it invent prices that don’t exist or return malformed JSON. Always validate the output (e.g., check price matches a regex).
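Here is a minimal sketch of that validation step. The function name, the key set, and the price regex are my own assumptions (the regex only covers simple formats like "$19.99" or "19,99 €"), so adapt them to your data.

```python
import json
import re

# Assumed schema: the keys requested in the multi-product prompt
REQUIRED_KEYS = {"title", "price", "in_stock", "image_url"}

# Optional currency symbol or 3-letter code, then digits with , or . separators
PRICE_PATTERN = re.compile(
    r"^\s*(?:[$€£]|[A-Z]{3})?\s*\d[\d,.]*\s*(?:[$€£]|[A-Z]{3})?\s*$"
)


def validate_extraction(raw_json, source_html):
    """Return (data, problems); problems is empty when every check passes."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, [f"malformed JSON: {exc}"]
    if not isinstance(data, dict):
        return None, ["expected a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - set(data)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    price = data.get("price")
    if price is not None:
        price_text = str(price)
        if not PRICE_PATTERN.match(price_text):
            problems.append(f"does not look like a price: {price_text!r}")
        else:
            # The digits should appear verbatim in the HTML we sent
            number = re.search(r"\d[\d,.]*", price_text).group()
            if number not in source_html:
                problems.append("price not found in source HTML (possible hallucination)")
    return data, problems
```

The "is it in the source HTML" check is the one that catches invented prices. It can produce false positives if the model reformats the number, so I would treat a failure as "needs review" rather than "definitely wrong".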

Token limits: I capped each snippet at ~3000 characters to keep the prompt small. For a full product page DOM, you’d need to chunk or summarise the HTML first. Tools like html2text can help.
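If you go the chunking route, a crude splitter is enough to get started. This is a sketch, not what a proper HTML-aware splitter would do: chunk_html is a hypothetical helper that simply tries to cut right after a ">" so tags are not sliced in half.

```python
def chunk_html(html, max_chars=3000):
    """Split html into pieces of at most max_chars, cutting after a '>' when possible."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    chunks = []
    start = 0
    while start < len(html):
        end = min(start + max_chars, len(html))
        if end < len(html):
            # Prefer to cut just after the last '>' inside the window
            cut = html.rfind(">", start, end)
            if cut > start:
                end = cut + 1
        chunks.append(html[start:end])
        start = end
    return chunks
```

A chunk boundary can still land in the middle of a product card, so this works best as a fallback when you cannot isolate containers first.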

When NOT to use this approach:

You’re scraping one site with stable HTML → traditional selectors are faster and cheaper.
You need real‑time (sub‑second) scraping → the LLM adds 2+ seconds.
You’re dealing with thousands of pages → cost may blow up.

What I’d Do Differently Next Time

First, I’d start with a hybrid approach: use traditional selectors for sites with known, stable structures, and fall back to the LLM for the messy minority. That would cut costs and latency.
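In code, the hybrid idea is little more than a fallback. A minimal sketch, assuming you pass in two extractor functions of your own (extract_hybrid and its arguments are hypothetical names):

```python
def extract_hybrid(html, selector_extractor, llm_extractor, required=("title", "price")):
    """Try the cheap selector-based extractor first; fall back to the LLM.

    Returns (data, source) where source is "selectors" or "llm".
    """
    try:
        data = selector_extractor(html)
    except Exception:
        data = None  # a broken selector is treated the same as a missing result
    if data and all(data.get(key) is not None for key in required):
        return data, "selectors"
    return llm_extractor(html), "llm"
```

Returning the source alongside the data makes it easy to log how often each site falls through to the LLM, which tells you where a selector is worth repairing.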

Second, I’d cache the LLM responses. If the same HTML snippet appears again (e.g., same product on different visits), reuse the extracted JSON.
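A cache keyed on a hash of the snippet would be enough for this. The sketch below uses a plain dict; snippet_key and cached_extract are hypothetical names, and for anything long-running you would want to persist the cache to disk.

```python
import hashlib


def snippet_key(html_snippet):
    """Hash the snippet, ignoring differences in whitespace."""
    normalised = " ".join(html_snippet.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def cached_extract(html_snippet, extract, cache):
    """Return the cached result for this snippet, calling extract() only on a miss."""
    key = snippet_key(html_snippet)
    if key not in cache:
        cache[key] = extract(html_snippet)
    return cache[key]
```

Because the key is derived from the HTML itself, a changed price produces a different key and triggers a fresh extraction, so you do not serve stale prices from the cache.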

Third, I’d explore smaller, specialised models (e.g., Phi-3 or other local LLMs via Ollama) for extraction. They can be slower, depending on your hardware, but they have no per-call API cost and can run offline.

Finally, I’d invest time in building a test harness that validates the extracted data against known ground truth. Without that, you’re flying blind.
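The simplest version of such a harness is a per-field accuracy score over a handful of hand-labelled products. A sketch, assuming exact-match comparison and the same keys as the extraction prompt (field_accuracy is a hypothetical helper):

```python
def field_accuracy(extracted, ground_truth,
                   fields=("title", "price", "in_stock", "image_url")):
    """Return {field: share of records whose extracted value matches exactly}."""
    if len(extracted) != len(ground_truth):
        raise ValueError("extracted and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("no records to compare")
    correct = {field: 0 for field in fields}
    for got, want in zip(extracted, ground_truth):
        for field in fields:
            # A failed extraction (None) counts as wrong for every field
            if (got or {}).get(field) == want.get(field):
                correct[field] += 1
    return {field: count / len(ground_truth) for field, count in correct.items()}
```

Exact matching is strict (for example, "$5" and "5.00" count as a mismatch), so you may want to normalise values before comparing. Running this after every prompt or model change shows quickly whether a tweak helped or hurt.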

The approach saved me from rewriting scrapers every week. It’s not a silver bullet, but it’s a pragmatic shift from brittle selectors to flexible language understanding. Tools like #ai.interwestinfo.com/ provide similar extraction APIs, but the core idea remains: let the model parse the structure, not the developer.

What’s your take? Have you tried using LLMs for data extraction, or do you still prefer the old‑school selector approach? I’m curious what works for your use cases.