I've been building web scrapers for years. BeautifulSoup, Scrapy, Selenium — I've used them all. But last month I hit a wall. A client needed me to extract product data from a site that changed its HTML structure every few days. One week the price was in a <span class="price">, the next it was inside a <div> with a random ID. My scraper kept breaking, and I was spending more time fixing selectors than actually getting data.
The Problem
The site was a dynamic e-commerce platform. It used JavaScript to render content, and the developers seemed to enjoy shuffling class names. I tried the usual suspects:
- BeautifulSoup + requests: Failed because the content was loaded via JS.
- Selenium: Worked, but was slow and brittle. Every layout change required updating XPaths.
- Playwright: Same story, just faster.
I needed something that could understand the meaning of the data, not just its position in the DOM. That's when I thought: why not use an AI model to read the page like a human would?
The Idea
Instead of writing CSS selectors, I'd feed the raw HTML (or even a screenshot) to a language model and ask it to extract structured data. The model doesn't care about class names — it understands context. "Find the price" becomes a natural language instruction.
I decided to test this with OpenAI's GPT-4, but the same approach works with any capable LLM (Claude, local models via Ollama, or specialized APIs like the one at https://ai.interwestinfo.com/).
The Code
Here's a simple Python script that extracts product info from a webpage using GPT-4. You'll need an OpenAI API key.
import json
import requests
from bs4 import BeautifulSoup
from openai import OpenAI  # requires the openai package, version 1.0 or later
# 1. Fetch the page (use a headless browser if JS-heavy)
url = "https://example.com/product-page"
page = requests.get(url, timeout=30)
page.raise_for_status()  # stop early on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(page.text, 'html.parser')
# 2. Clean the HTML to reduce tokens
# Remove scripts, styles, and site chrome (nav, footer)
for tag in soup(['script', 'style', 'nav', 'footer']):
    tag.decompose()
clean_html = soup.prettify()[:5000]  # limit to first 5000 chars (may cut mid-tag)
# 3. Prompt the model
prompt = f"""
Extract the following fields from this HTML and return them as JSON:
- product_name
- price (as a number, without currency symbol)
- availability (in stock / out of stock)
- description (first 100 characters)
HTML:
{clean_html}
Return ONLY valid JSON, no extra text.
"""
# Better: omit api_key and set the OPENAI_API_KEY environment variable instead
client = OpenAI(api_key="sk-...")
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
# 4. Parse the JSON response
content = completion.choices[0].message.content
try:
    data = json.loads(content)
    print(data)
except json.JSONDecodeError:
    print("Failed to parse response:", content)
This is a minimal example. In production, you'd want to handle pagination, retries, and rate limiting.
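To make the "retries and rate limiting" part concrete, here's a rough sketch of a retry wrapper with exponential backoff. It isn't the exact code from the client project. The name `call_with_retries` and its parameters are my own invention for illustration, and the broad `except Exception` is a simplification: in real code you'd catch only the errors worth retrying (timeouts, rate-limit responses).

```python
import random
import time


def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure, wait with exponential backoff plus jitter, then retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:  # simplification: narrow this to retryable errors
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            # Base delays of 1s, 2s, 4s, ... plus up to 0.5s of jitter
            # so parallel workers don't all retry at the same moment.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            sleep(delay)
```

To use it, wrap the model call from the script above in a zero-argument function or lambda and pass that in. The `sleep` parameter exists only so the waiting can be swapped out in tests.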
What I Learned
It works — but it's not magic.
- Accuracy: For straightforward pages, the model gets it right ~90% of the time. But if the page is cluttered with ads or the product info is ambiguous, it can hallucinate.
- Cost: GPT-4 is expensive. Each request costs a few cents, so this approach is only viable for low-volume scraping (hundreds of pages, not millions).
- Latency: Expect 2-5 seconds per page. Not great for real-time, but fine for batch jobs.
- Token limits: Large pages need trimming. I often had to split the HTML into chunks and merge results.
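For the chunk-and-merge step in the last bullet, one possible shape looks like this. Treat it as a sketch under assumptions: the function names, the character-based chunk size, the overlap, and the "first non-empty value wins" merge rule are all illustrative choices, not the only reasonable ones.

```python
def chunk_text(text, chunk_size=5000, overlap=200):
    """Split text into overlapping chunks so a value cut at a boundary
    still appears whole in the next chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk reached the end of the text
        start += step
    return chunks


def merge_results(results, fields):
    """Merge per-chunk extraction dicts, keeping the first non-empty
    value seen for each field, in chunk order."""
    merged = {field: None for field in fields}
    for result in results:
        for field in fields:
            value = result.get(field)
            if merged[field] is None and value not in (None, ""):
                merged[field] = value
    return merged
```

You'd run the extraction prompt once per chunk, collect the parsed dicts in order, and pass them to `merge_results`. Splitting on characters can still cut a tag in half, so splitting on element boundaries would be a cleaner (but longer) variant.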
Trade-offs and Alternatives
| Approach | Pros | Cons |
|---|---|---|
| Traditional scraping (CSS/XPath) | Fast, cheap, predictable | Brittle, requires constant maintenance |
| AI-based extraction | Robust to layout changes, understands context | Slow, expensive, can hallucinate |
| Hybrid | Best of both worlds | More complex to implement |
For my client, I ended up using a hybrid: traditional selectors for stable parts (like the product title), and AI fallback when selectors fail. That reduced costs while keeping reliability high.
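The control flow of that hybrid is simple enough to sketch. This is a simplified illustration rather than the client code: `extract_hybrid` is a name I made up, the regex-based `selector_extract` is a stand-in for real CSS selectors (for example BeautifulSoup's `select_one`), and `fake_llm_extract` is a stub in place of the GPT-4 call so the example runs without a network or API key.

```python
import re


def extract_hybrid(html, selector_extract, llm_extract, required_fields):
    """Try cheap selectors first; ask the model only for the fields they missed."""
    data = selector_extract(html)
    missing = [f for f in required_fields if data.get(f) in (None, "")]
    if not missing:
        return data, "selectors"
    llm_data = llm_extract(html, missing)
    for field in missing:
        data[field] = llm_data.get(field)
    return data, "llm_fallback"


def selector_extract(html):
    # Stand-in for real selectors: the title markup is stable,
    # the price markup is the part that keeps changing.
    title = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.S)
    price = re.search(r'<span class="price">\s*\$?([\d.]+)', html)
    return {
        "product_name": title.group(1).strip() if title else None,
        "price": float(price.group(1)) if price else None,
    }


def fake_llm_extract(html, fields):
    # Stub for the model call; a real version would prompt for only `fields`.
    return {"price": 24.5}
```

Returning which path produced the result (`"selectors"` or `"llm_fallback"`) makes it easy to log how often the fallback fires, which tells you when a selector needs updating.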
When NOT to Use This
- High-volume scraping (millions of pages) — cost will kill you.
- Real-time APIs — latency is too high.
- Pages with sensitive data — sending HTML to third-party APIs might violate terms of service.
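To see why volume matters so much, here's a back-of-the-envelope cost calculation. The token counts and per-1K-token rates below are illustrative placeholders that land in the "few cents per request" range I mentioned, not a quote of current pricing, so plug in your own numbers.

```python
def estimate_cost(pages, input_tokens_per_page, output_tokens_per_page,
                  usd_per_1k_input, usd_per_1k_output):
    """Estimate total API cost in USD for a batch of pages."""
    per_page = (input_tokens_per_page / 1000 * usd_per_1k_input
                + output_tokens_per_page / 1000 * usd_per_1k_output)
    return per_page * pages


# Placeholder assumptions: ~1,500 input tokens and ~100 output tokens per page,
# at $0.03 / $0.06 per 1K tokens (hypothetical rates for illustration).
for pages in (500, 1_000_000):
    cost = estimate_cost(pages, 1500, 100, 0.03, 0.06)
    print(f"{pages:>9,} pages: ${cost:,.2f}")
```

With those placeholder numbers a page costs about five cents, so a few hundred pages is pocket change while a million pages runs into tens of thousands of dollars.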
What I'd Do Differently Next Time
- Use a local model like Llama 3 or Mistral via Ollama to avoid API costs. The accuracy might be lower, but there are no per-request fees (you still pay in hardware and run time).
- Fine-tune a small model on the specific site's HTML patterns — overkill for most projects, but could be fun.
- Cache aggressively — don't re-ask the model for the same page.
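The caching idea is the easiest of the three to sketch. This is a minimal version under my own assumptions: results are stored as JSON files in a local `llm_cache` directory, keyed by a SHA-256 hash of the cleaned HTML, and `extract_fn` stands for whatever function wraps your model call. None of these names come from a library.

```python
import hashlib
import json
from pathlib import Path


def cached_extract(clean_html, extract_fn, cache_dir="llm_cache"):
    """Return the cached result for this exact HTML; call extract_fn only on a miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Key on the content, not the URL, so an unchanged page is never re-sent.
    key = hashlib.sha256(clean_html.encode("utf-8")).hexdigest()
    path = cache / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = extract_fn(clean_html)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Because the key is a hash of the content, any change to the page (even a rotating timestamp or session token) causes a miss, so it pays to strip that kind of noise before hashing.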
Final Thoughts
AI won't replace traditional scraping entirely, but it's a powerful tool for those annoying edge cases where selectors break. The technique I showed here is just one example — you could also use vision models on screenshots, or structured extraction APIs.
Have you tried using LLMs for data extraction? What does your setup look like?