DEV Community

zhongqiyue


When Traditional Web Scraping Fails: A Practical AI Approach

I've been building web scrapers for years. BeautifulSoup, Scrapy, Selenium — I've used them all. But last month I hit a wall. A client needed me to extract product data from a site that changed its HTML structure every few days. One week the price was in a <span class="price">, the next it was inside a <div> with a random ID. My scraper kept breaking, and I was spending more time fixing selectors than actually getting data.

The Problem

The site was a dynamic e-commerce platform. It used JavaScript to render content, and the developers seemed to enjoy shuffling class names. I tried the usual suspects:

  • BeautifulSoup + requests: Failed because the content was loaded via JS.
  • Selenium: Worked, but was slow and brittle. Every layout change required updating XPaths.
  • Playwright: Same story, just faster.
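To make the brittleness concrete, here is a minimal sketch using only the standard library. The two HTML snippets are hypothetical stand-ins for the same product page a week apart, and `PriceByClass` is an illustrative name, not code from the client project. The extractor is tied to `<span class="price">`, so it finds the price in the first version and nothing in the second.

```python
from html.parser import HTMLParser


class PriceByClass(HTMLParser):
    """Collects the text of the first <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        if self._inside and self.price is None:
            self.price = data.strip()


def extract_price(html):
    parser = PriceByClass()
    parser.feed(html)
    return parser.price


# Two hypothetical versions of the same product markup.
week_one = '<div><h1>Widget</h1><span class="price">$19.99</span></div>'
week_two = '<div><h1>Widget</h1><div id="x7f3a">$19.99</div></div>'

print(extract_price(week_one))  # $19.99
print(extract_price(week_two))  # None
```

The price text is identical in both versions. Only the wrapper changed, and that is enough to break the extractor.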

I needed something that could understand the meaning of the data, not just its position in the DOM. That's when I thought: why not use an AI model to read the page like a human would?

The Idea

Instead of writing CSS selectors, I'd feed the raw HTML (or even a screenshot) to a language model and ask it to extract structured data. The model doesn't care about class names — it understands context. "Find the price" becomes a natural language instruction.

I decided to test this with OpenAI's GPT-4, but the same approach works with any capable LLM (Claude, local models via Ollama, or specialized APIs like the one at https://ai.interwestinfo.com/).

The Code

Here's a simple Python script that extracts product info from a webpage using GPT-4. You'll need an OpenAI API key.

import requests
from bs4 import BeautifulSoup
import openai
import json

# 1. Fetch the page (use a headless browser if JS-heavy)
url = "https://example.com/product-page"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')

# 2. Clean the HTML to reduce tokens
# Remove scripts, styles, navigation, and footers
for tag in soup(['script', 'style', 'nav', 'footer']):
    tag.decompose()
# Keep only the first 5000 chars; anything further down the page is dropped
clean_html = soup.prettify()[:5000]

# 3. Prompt the model
prompt = f"""
Extract the following fields from this HTML and return them as JSON:
- product_name
- price (as a number, without currency symbol)
- availability (in stock / out of stock)
- description (first 100 characters)

HTML:
{clean_html}

Return ONLY valid JSON, no extra text.
"""

# Requires openai>=1.0; the older openai.ChatCompletion interface was removed
client = openai.OpenAI(api_key="sk-...")
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0
)

# 4. Parse the JSON response
content = completion.choices[0].message.content
try:
    data = json.loads(content)
    print(data)
except json.JSONDecodeError:
    print("Failed to parse response:", content)
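Parsing the JSON is only half the job, because a syntactically valid answer can still be wrong. Below is a rough sketch of a validation step, assuming the four field names from the prompt above. `validate_product` is a hypothetical helper, and the "name must appear in the page" test is a crude grounding heuristic (it will reject names that differ from the source only by HTML entities or whitespace), not a guarantee against hallucination.

```python
import json


def validate_product(raw_json, source_html):
    """Return the parsed record, or raise ValueError describing the problem."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    required = ("product_name", "price", "availability", "description")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    price = data["price"]
    # bool is a subclass of int, so reject it explicitly
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        raise ValueError("price is not a number")

    # Crude grounding check: the name should appear verbatim in the page
    if str(data["product_name"]) not in source_html:
        raise ValueError("product_name not found in source HTML")
    return data
```

Records that fail validation can be logged and retried, or queued for a manual look, instead of flowing silently into the dataset.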

This is a minimal example. In production, you'd want to handle pagination, retries, and rate limiting.
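As one example of that hardening, retries with exponential backoff can be sketched in a few lines. `with_retries` is a hypothetical helper, and catching every `Exception` is an assumption made to keep the sketch short. Real code should retry only on transient failures such as timeouts or rate-limit responses.

```python
import random
import time


def with_retries(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: 1s, 2s, 4s, ... plus up to 50% jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

Both the page fetch and the model call can be wrapped this way, for example `with_retries(lambda: requests.get(url, timeout=30))`. The `sleep` parameter exists so the waiting can be swapped out in tests.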

What I Learned

It works — but it's not magic.

  • Accuracy: For straightforward pages, the model gets it right ~90% of the time. But if the page is cluttered with ads or the product info is ambiguous, it can hallucinate.
  • Cost: GPT-4 is expensive. Each request costs a few cents, so this approach is only viable for low-volume scraping (hundreds of pages, not millions).
  • Latency: Expect 2-5 seconds per page. Not great for real-time, but fine for batch jobs.
  • Token limits: Large pages need trimming. I often had to split the HTML into chunks and merge results.
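The chunk-and-merge step can be sketched roughly as follows. Both function names are hypothetical, and the merge rule (keep the first non-empty value per field) is one simple choice among several. It assumes the prompt tells the model to return null for any field it cannot find in a given chunk.

```python
def split_into_chunks(text, chunk_size=5000, overlap=200):
    """Split text into overlapping chunks of at most chunk_size characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk reached the end of the text
        start += step
    return chunks


def merge_results(partials, fields):
    """Keep the first non-empty value seen for each field."""
    merged = {field: None for field in fields}
    for partial in partials:
        for field in fields:
            value = partial.get(field)
            if merged[field] is None and value not in (None, ""):
                merged[field] = value
    return merged
```

Splitting on character offsets can cut through the middle of a tag. The overlap reduces the chance that a field straddles a boundary, but it does not eliminate it.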

Trade-offs and Alternatives

| Approach | Pros | Cons |
|---|---|---|
| Traditional scraping (CSS/XPath) | Fast, cheap, predictable | Brittle, requires constant maintenance |
| AI-based extraction | Robust to layout changes, understands context | Slow, expensive, can hallucinate |
| Hybrid | Best of both worlds | More complex to implement |

For my client, I ended up using a hybrid: traditional selectors for stable parts (like the product title), and AI fallback when selectors fail. That reduced costs while keeping reliability high.
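The control flow of such a hybrid is simple. Here is a minimal sketch, not the client implementation: `extract_with_fallback` and `price_by_selector` are hypothetical names, the regex stands in for a real CSS selector, and `fake_llm` is a stub where the model call would go.

```python
import re

# Stand-in for a CSS selector: matches <span class="price">$19.99</span>
PRICE_SPAN = re.compile(r'<span class="price">\s*\$?(\d+(?:\.\d+)?)\s*</span>')


def price_by_selector(html):
    """Cheap path: return the price as a float, or None if the markup changed."""
    match = PRICE_SPAN.search(html)
    return float(match.group(1)) if match else None


def extract_with_fallback(html, selector_extract, llm_extract):
    """Try the selector first; call the model only when it finds nothing."""
    value = selector_extract(html)
    if value is not None:
        return value, "selector"
    return llm_extract(html), "llm"


def fake_llm(html):
    # Stub for illustration only; a real version would send html to the model
    return 19.99


stable = '<span class="price">$19.99</span>'
changed = '<div id="x7f3a">$19.99</div>'
print(extract_with_fallback(stable, price_by_selector, fake_llm))   # (19.99, 'selector')
print(extract_with_fallback(changed, price_by_selector, fake_llm))  # (19.99, 'llm')
```

Returning the source alongside the value makes it easy to track how often the fallback fires, which is a useful signal that a selector needs updating.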

When NOT to Use This

  • High-volume scraping (millions of pages) — cost will kill you.
  • Real-time APIs — latency is too high.
  • Pages with sensitive data — sending HTML to third-party APIs might violate terms of service.
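A back-of-the-envelope calculation shows why volume and latency matter. The defaults below are assumptions picked from the ranges mentioned earlier ("a few cents" per request, 2-5 seconds per page), not measured prices, and `estimate_batch` is a hypothetical helper.

```python
def estimate_batch(pages, cost_per_page=0.03, seconds_per_page=3.5):
    """Rough cost (USD) and sequential duration (hours) for a batch of pages."""
    cost = pages * cost_per_page
    hours = pages * seconds_per_page / 3600
    return round(cost, 2), round(hours, 1)


for pages in (500, 1_000_000):
    cost, hours = estimate_batch(pages)
    print(f"{pages:>9,} pages: about ${cost:,.2f} and {hours:,.1f} hours if run one at a time")
```

Under these assumptions a few hundred pages costs pocket change, while a million pages runs to tens of thousands of dollars and weeks of sequential processing.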

What I'd Do Differently Next Time

  1. Use a local model like Llama 3 or Mistral via Ollama to avoid API costs. The accuracy might be lower, but there are no per-request fees (you still pay in hardware and setup time).
  2. Fine-tune a small model on the specific site's HTML patterns — overkill for most projects, but could be fun.
  3. Cache aggressively — don't re-ask the model for the same page.
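The caching idea can be sketched with a small on-disk cache keyed by a hash of the cleaned HTML. `cached_extract` and the `llm_cache` directory are hypothetical names, and `extract` stands for whatever function sends the HTML to the model and returns a JSON-serializable result.

```python
import hashlib
import json
from pathlib import Path


def cached_extract(clean_html, extract, cache_dir="llm_cache"):
    """Return the cached result for this exact HTML, calling extract() on a miss."""
    key = hashlib.sha256(clean_html.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = extract(clean_html)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Keying on content instead of URL means an unchanged page is never sent to the model twice, while any change to the cleaned HTML triggers a fresh extraction.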

Final Thoughts

AI won't replace traditional scraping entirely, but it's a powerful tool for those annoying edge cases where selectors break. The technique I showed here is just one example — you could also use vision models on screenshots, or structured extraction APIs.

Have you tried using LLMs for data extraction? What does your setup look like?
