When web scraping breaks: using AI to extract messy data

#webdev #python #ai #scraping

I spent three days building a web scraper for a client. Three days of carefully crafting CSS selectors, testing edge cases, and patching broken parsers. On day four, the target website redesigned their product pages. Everything broke.

I sat there staring at a wall of None values and AttributeError messages. The old selectors were useless. The new HTML structure was inconsistent across different product categories. Some pages used <div class="price">, others used <span class="cost">, and a few had the price embedded in a JSON-LD script tag. My beautiful, fragile parser was dead.

This is the story of how I gave up on perfect selectors and started using AI to extract data from raw text — and why I'll never go back to pure CSS/XPath scraping for unstructured pages.

The real problem

I was tasked with scraping product information (name, price, description, availability) from a large e-commerce site. The site had thousands of products across dozens of categories, each with slightly different page templates. Some used Bootstrap, others used custom CSS. Some loaded content dynamically via JavaScript. The HTML was a mess.

Initially, I thought I could handle it with a combination of requests, BeautifulSoup, and a few fallback selectors. I wrote a function like this:

import re

import requests
from bs4 import BeautifulSoup

def extract_price(soup):
    # Try common selectors first
    price = soup.select_one('.price, .product-price, [itemprop="price"]')
    if price:
        return price.get_text(strip=True)
    # Fall back to a regex over the visible page text
    match = re.search(r'\$?\d+\.\d{2}', soup.get_text())
    return match.group(0) if match else None

It worked for about 60% of pages. The other 40% required manual tweaks. Every time the site updated, I had to update selectors. It was a maintenance nightmare.

What I tried that didn't work

I went down several rabbit holes:

More CSS selectors: I added every possible class name I could find. The list grew to 20+ selectors per field. It was ugly and still fragile.
XPath with contains(): I tried //*[contains(@class, 'price') or contains(@class, 'cost')]. Better, but still broke when class names changed.
Machine learning classification: I attempted to train a model to identify price elements based on HTML features (tag name, class length, parent structure). This required labeled data and was overkill for a single project.
Visual scraping tools: I tried browser-based tools that record clicks. They worked for a few pages but didn't scale to thousands of different layouts.

None of these solved the core issue: the HTML structure was unpredictable. I needed a way to extract data based on meaning, not markup.

What eventually worked

I realized that if I could get the visible text content of a page (or a section), I could treat the extraction as a natural language understanding problem. Instead of asking "find the element with class 'price'", I could ask "find the price in this text".

Large Language Models (LLMs) are surprisingly good at this. You give them raw text and a schema, and they return structured JSON. No selectors, no XPath, no regex hell.

Here's the approach I settled on:

Fetch the HTML with requests or Selenium (if JS-rendered).
Extract the main content area (to reduce noise).
Pass the raw text to an LLM with a prompt describing the fields I want.
Parse the JSON response.

The key insight: the LLM doesn't care about HTML tags. It reads the text like a human would. If the price is written as "$19.99" somewhere in the text, it can find it.

Code example

Let me show you a concrete implementation. I'll use OpenAI's API, but you can swap in any LLM (local models like Llama work too for sensitive data).

import requests
from bs4 import BeautifulSoup
import json
from openai import OpenAI

# Initialize client (set your API key)
client = OpenAI(api_key="your-api-key")

def extract_product_info(url):
    # Fetch page
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Remove scripts, styles, and page chrome (nav, header, footer) to reduce noise
    for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
        tag.decompose()

    # Get main content - you might need to refine this selector
    main = soup.select_one('main, .content, #content, article')
    if not main:
        main = soup.body

    page_text = main.get_text(separator=' ', strip=True)
    # Truncate if too long (LLM context limits)
    page_text = page_text[:8000]

    # Build prompt
    prompt = f"""Extract the following fields from the product page text below.
Return a JSON object with keys: name, price, description, availability.
If a field is not found, set it to null.

Product page text:
{page_text}

JSON output:
"""

    # Call LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap and fast
        messages=[
            {"role": "system", "content": "You are a data extraction assistant. Return only valid JSON."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.0
    )

    # Parse response
    result = response.choices[0].message.content.strip()
    # Sometimes the model wraps the JSON in a ```json ... ``` fence
    if result.startswith('```'):
        result = result.split('```')[1]
        if result.startswith('json'):
            result = result[4:]
    try:
        return json.loads(result)
    except json.JSONDecodeError:
        return {"error": "Failed to parse LLM output", "raw": result}

# Example usage
info = extract_product_info("https://example.com/product/123")
print(info)

This code is surprisingly robust. I tested it on 50 different product pages from various sites, and it extracted the correct price 90% of the time. Failures were usually due to ambiguous text (e.g., multiple prices on the page) or very long descriptions that got truncated.
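
One thing to watch in the function above: it removes every script tag, so a price that exists only in JSON-LD (one of the cases from the redesign) never reaches the model. Here is a minimal sketch of reading it first, using only the standard library. The helper name `jsonld_price` is hypothetical, and the code assumes schema.org-style `Product` markup with an `offers.price` field, which a given site may not follow.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data


def jsonld_price(html):
    """Return offers.price from the first Product block found, or None."""
    collector = _JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD happens in the wild; skip it
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            if "Product" not in (types if isinstance(types, list) else [types]):
                continue
            offers = item.get("offers")
            if isinstance(offers, list):
                offers = offers[0] if offers else None
            if isinstance(offers, dict) and offers.get("price") is not None:
                return str(offers["price"])
    return None
```

If this returns a value, it could be used directly or passed to the model as an extra hint. If it returns None, nothing is lost and the text-based extraction runs as before.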

Lessons learned and trade-offs

Pros:

No more selector maintenance. If the site redesigns, the LLM still works.
Works on dynamic pages if you pre-render with Selenium/Playwright.
Easy to add new fields: just modify the prompt.
Can handle messy, inconsistent HTML.

Cons:

Cost: API calls cost money. For high-volume scraping, this can add up. A local model (like Llama 3.1 8B) has no per-call fee but requires a GPU.
Latency: Each extraction takes 1-3 seconds. Not suitable for real-time scraping of thousands of pages per minute.
Context limits: Long pages need truncation, which may lose data.
Hallucination: The LLM might invent a price if it can't find one. You need to validate outputs.
Privacy: Sending page text to an external API may not be acceptable for sensitive data.
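
For the hallucination risk in particular, one cheap guard is to check that the extracted value actually occurs in the text that was sent to the model. The sketch below is an illustration under assumptions: the function name `ungrounded_fields` is hypothetical, and it only checks `name` and `price` by default, since a model may legitimately paraphrase a description or availability status.

```python
import re


def _normalize(text):
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(extracted, page_text, fields=("name", "price")):
    """Return the fields whose extracted value does not appear verbatim in page_text."""
    haystack = _normalize(page_text)
    missing = []
    for field in fields:
        value = extracted.get(field)
        if value is None:
            continue  # the model reported "not found"; nothing to verify
        if _normalize(str(value)) not in haystack:
            missing.append(field)
    return missing
```

An empty list means every checked value was found in the source text. A non-empty list flags fields worth rejecting or re-extracting.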

When NOT to use this approach:

You need to scrape millions of pages daily (cost and speed become prohibitive).
The HTML is perfectly structured and stable (classic selectors are faster and cheaper).
You can't send data to third-party APIs due to compliance.

What I'd do differently next time

I'd start with a hybrid approach: use CSS selectors for the 80% of pages that are consistent, and fall back to AI for the messy 20%. That way you get speed where possible and flexibility where needed.
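
A rough sketch of that hybrid flow, assuming two callables you supply yourself: `selector_extract` (the fast CSS path) and `llm_extract` (the AI path). Both names, and the `required` tuple, are placeholders for illustration.

```python
def extract_with_fallback(html, selector_extract, llm_extract, required=("name", "price")):
    """Try the cheap selector path first; call the LLM only if required fields are missing."""
    result = selector_extract(html) or {}
    if all(result.get(field) for field in required):
        return result, "selectors"
    llm_result = llm_extract(html) or {}
    # Keep whatever the selectors did find; let the LLM fill the gaps
    merged = {**llm_result, **{k: v for k, v in result.items() if v}}
    return merged, "llm"
```

Returning the source label alongside the data makes it easy to log how often the fallback fires, which is also a decent signal that the selectors need attention.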

Also, I'd invest more time in cleaning the input text. Removing navigation, ads, and repeated boilerplate improves accuracy and reduces token costs. A simple heuristic like "keep only the element with the most text content" works well.
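
Here is one possible reading of that heuristic, sketched with the standard library only. It credits each piece of text to the innermost open container and keeps the container with the most text of its own. The names `densest_block_text`, `CONTAINERS`, and `SKIP` are made up for this example, and the code assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser

CONTAINERS = {"div", "section", "article", "main"}
SKIP = {"script", "style", "nav", "footer", "header"}


class _DensestBlock(HTMLParser):
    def __init__(self):
        super().__init__()
        self._stack = []      # indexes into self.blocks for currently open containers
        self._skip_depth = 0  # greater than 0 while inside a SKIP element
        self.blocks = []      # one list of text pieces per container seen

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in CONTAINERS:
            self.blocks.append([])
            self._stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in CONTAINERS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Credit text to the innermost open container, ignoring page chrome
        if text and not self._skip_depth and self._stack:
            self.blocks[self._stack[-1]].append(text)


def densest_block_text(html):
    """Return the text of the container holding the most text of its own."""
    parser = _DensestBlock()
    parser.feed(html)
    parser.close()
    texts = [" ".join(pieces) for pieces in parser.blocks]
    return max(texts, key=len, default="")
```

The obvious limitation: if the price sits in a sibling container next to a long description, this drops it. It is worth checking the output against a handful of sample pages before relying on it.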

Finally, I'd set up a validation step: after extraction, check that the price looks like a price (contains digits and a currency symbol), the name isn't empty, etc. If validation fails, you can retry with a different prompt or fall back to a different model.
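
A minimal version of that validation step might look like the following. The function name `validate_product` and the exact rules are assumptions for illustration; the currency pattern is deliberately loose (a few common symbols or any three-letter uppercase code) and would need tuning for real data.

```python
import re

_DIGITS_RE = re.compile(r"\d")
# A common currency symbol, or something that looks like an ISO code (USD, EUR, ...)
_CURRENCY_RE = re.compile(r"[$€£¥]|\b[A-Z]{3}\b")


def validate_product(info):
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []

    name = info.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("name is empty")

    price = info.get("price")
    price_text = "" if price is None else str(price)
    if not _DIGITS_RE.search(price_text):
        problems.append("price has no digits")
    elif not _CURRENCY_RE.search(price_text):
        problems.append("price has no currency symbol or code")

    return problems
```

A non-empty result is the trigger for the retry: call the extractor again with a different prompt or model, and give up after a fixed number of attempts rather than looping forever.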

The tool I used

For this project, I ended up using an AI-powered extraction service (you can find one at https://ai.interwestinfo.com/) that wraps this exact approach. But the technique itself is what matters: treat extraction as a language understanding task, not a DOM traversal problem.

I'm now much more relaxed about web scraping. When a client says "the site changed", I don't panic. I just re-run the scraper with the same code. The LLM adapts automatically.

What's your approach when the HTML structure is unpredictable? Have you tried using LLMs for data extraction, or do you stick with traditional methods? I'd love to hear your war stories.