When Traditional Web Scraping Fails: A Practical AI Approach

#ai #python #webdev #tutorial

I've been building web scrapers for years. BeautifulSoup, Scrapy, Selenium — I've used them all. But last month I hit a wall. A client needed me to extract product data from a site that changed its HTML structure every few days. One week the price was in a <span class="price">, the next it was inside a <div> with a random ID. My scraper kept breaking, and I was spending more time fixing selectors than actually getting data.

The Problem

The site was a dynamic e-commerce platform. It used JavaScript to render content, and the developers seemed to enjoy shuffling class names. I tried the usual suspects:

BeautifulSoup + requests: Failed because the content was loaded via JS.
Selenium: Worked, but was slow and brittle. Every layout change required updating XPaths.
Playwright: Same brittleness as Selenium, just faster.

I needed something that could understand the meaning of the data, not just its position in the DOM. That's when I thought: why not use an AI model to read the page like a human would?

The Idea

Instead of writing CSS selectors, I'd feed the raw HTML (or even a screenshot) to a language model and ask it to extract structured data. The model doesn't care about class names — it understands context. "Find the price" becomes a natural language instruction.

I decided to test this with OpenAI's GPT-4, but the same approach works with any capable LLM (Claude, local models via Ollama, or specialized APIs like the one at https://ai.interwestinfo.com/).

The Code

Here's a simple Python script that extracts product info from a webpage using GPT-4. You'll need an OpenAI API key.

import requests
from bs4 import BeautifulSoup
import openai
import json

# 1. Fetch the page (use a headless browser if JS-heavy)
url = "https://example.com/product-page"
page = requests.get(url, timeout=30)
page.raise_for_status()  # stop early on HTTP errors instead of parsing an error page
soup = BeautifulSoup(page.text, 'html.parser')

# 2. Clean the HTML to reduce tokens
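# Remove scripts, styles, navigation, and footers
for tag in soup(['script', 'style', 'nav', 'footer']):
    tag.decompose()
# str() instead of prettify(): prettify adds indentation that wastes tokens.
# The hard cut at 5000 chars can drop data that appears later in the page.
clean_html = str(soup)[:5000]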

# 3. Prompt the model
prompt = f"""
Extract the following fields from this HTML and return them as JSON:
- product_name
- price (as a number, without currency symbol)
- availability (in stock / out of stock)
- description (first 100 characters)

HTML:
{clean_html}

Return ONLY valid JSON, no extra text.
"""

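# Requires openai>=1.0; the older openai.ChatCompletion interface was removed in that release.
# In real code, load the key from an environment variable instead of hardcoding it.
client = openai.OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0
)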

# 4. Parse the JSON response
try:
    data = json.loads(response.choices[0].message.content)
    print(data)
except json.JSONDecodeError:
    print("Failed to parse response:", response.choices[0].message.content)

This is a minimal example. In production, you'd want to handle pagination, retries, and rate limiting.
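
For the retry part, here is a minimal sketch of exponential backoff. It is not the exact code from the client project: `call_with_retries` and its parameters are names made up for illustration, and the broad `except Exception` is an assumption you'd narrow to whatever your HTTP or API client raises for timeouts and rate limits.

```python
import random
import time


def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry with exponential backoff plus jitter on failure.

    fn is any zero-argument callable, e.g. a lambda wrapping the model request.
    sleep is injectable so the helper can be tested without real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:  # assumption: narrow this to your client's transient errors
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            # Delays grow as 1x, 2x, 4x ... base_delay, plus up to 25% random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.25))
```

You would wrap the model call in a lambda and pass it in. The jitter keeps a batch of workers from all retrying at the same instant after a rate-limit response.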

What I Learned

It works — but it's not magic.

Accuracy: For straightforward pages, the model gets it right ~90% of the time. But if the page is cluttered with ads or the product info is ambiguous, it can hallucinate.
Cost: GPT-4 is expensive. Each request costs a few cents, so this approach is only viable for low-volume scraping (hundreds of pages, not millions).
Latency: Expect 2-5 seconds per page. Not great for real-time, but fine for batch jobs.
Token limits: Large pages need trimming. I often had to split the HTML into chunks and merge results.
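
The chunk-and-merge step, plus a crude guard against hallucinated values, could look something like the sketch below. All three helper names are hypothetical, the "first non-empty value wins" merge rule is just one possible policy, and the grounding check is a plain substring test, so it will reject a correct price of 20.0 when the page says "20.00".

```python
def chunk_text(text, size=5000, overlap=500):
    """Split text into windows of at most `size` chars that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def merge_results(results):
    """Merge per-chunk dicts, keeping the first non-empty value seen for each field."""
    merged = {}
    for result in results:
        for key, value in result.items():
            if merged.get(key) in (None, "") and value not in (None, ""):
                merged[key] = value
    return merged


def is_grounded(value, source_text):
    """Crude hallucination check: does the extracted value appear verbatim in the source?"""
    return str(value) in source_text
```

The overlap is there so a product block that straddles a chunk boundary still shows up whole in at least one chunk. The grounding check only makes sense for fields copied verbatim, like the name or price, not for normalized ones like availability.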

Trade-offs and Alternatives

Approach	Pros	Cons
Traditional scraping (CSS/XPath)	Fast, cheap, predictable	Brittle, requires constant maintenance
AI-based extraction	Robust to layout changes, understands context	Slow, expensive, can hallucinate
Hybrid	Best of both worlds	More complex to implement

For my client, I ended up using a hybrid: traditional selectors for stable parts (like the product title), and AI fallback when selectors fail. That reduced costs while keeping reliability high.
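
A rough sketch of that hybrid flow is below. It is an illustration, not the client code: `selector_extract`, `hybrid_extract`, and `REQUIRED_FIELDS` are made-up names, regexes stand in for real CSS selectors to keep the example dependency-free, and `llm_extract` is any function you supply that takes HTML and returns a dict (for example, the GPT-4 call from earlier).

```python
import re

REQUIRED_FIELDS = ("product_name", "price")


def selector_extract(html):
    """Cheap first pass. Regexes stand in for real CSS selectors here."""
    name = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.S)
    price = re.search(r'class="price"[^>]*>\s*\$?(\d+(?:\.\d+)?)', html)
    return {
        "product_name": name.group(1).strip() if name else None,
        "price": float(price.group(1)) if price else None,
    }


def hybrid_extract(html, llm_extract):
    """Use selector results when complete; otherwise ask the model to fill the gaps."""
    data = selector_extract(html)
    missing = [field for field in REQUIRED_FIELDS if data.get(field) is None]
    if not missing:
        return data, "selectors"
    llm_data = llm_extract(html)  # the expensive path, only taken when selectors fail
    for field in missing:
        data[field] = llm_data.get(field)
    return data, "llm_fallback"
```

Returning the source label alongside the data makes it easy to log how often the fallback fires, which is a decent signal that the selectors need updating.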

When NOT to Use This

High-volume scraping (millions of pages) — cost will kill you.
Real-time APIs — latency is too high.
Pages with sensitive data — sending HTML to third-party APIs might violate terms of service.

What I'd Do Differently Next Time

Use a local model like Llama 3 or Mistral via Ollama to avoid API costs. The accuracy might be lower, but there's no per-request fee; you pay in hardware and inference time instead.
Fine-tune a small model on the specific site's HTML patterns — overkill for most projects, but could be fun.
Cache aggressively — don't re-ask the model for the same page.
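
For the caching point, one simple option is to key results on a hash of the cleaned HTML and store them as JSON files. This is a minimal sketch under that assumption: `cached_extract` and the `llm_cache` directory are hypothetical names, and `extract` is whatever function makes the model call and returns a JSON-serializable dict.

```python
import hashlib
import json
from pathlib import Path


def cached_extract(clean_html, extract, cache_dir="llm_cache"):
    """Return the cached result for this exact HTML, calling extract() only on a miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Identical HTML always hashes to the same key, so repeat pages cost nothing.
    key = hashlib.sha256(clean_html.encode("utf-8")).hexdigest()
    path = cache / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = extract(clean_html)  # the paid model call happens only here
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Hash the cleaned HTML rather than the raw page. Scripts, tracking tokens, and timestamps change on every load and would turn every request into a cache miss.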

Final Thoughts

AI won't replace traditional scraping entirely, but it's a powerful tool for those annoying edge cases where selectors break. The technique I showed here is just one example — you could also use vision models on screenshots, or structured extraction APIs.

Have you tried using LLMs for data extraction? What does your setup look like?
