I’ve been scraping the web for years. It’s always the same cycle: find a site, write a few CSS selectors, get the data, then two weeks later the site redesigns and my scraper is dead. I used to spend hours tweaking regex patterns and XPath expressions. It felt like I was fighting the web itself.
Then I started wondering: what if I just told a computer what I wanted in plain English? That’s when I began experimenting with LLMs for data extraction.
The problem: fragile selectors
Last month I needed to pull product information from a dozen different e-commerce sites. Each one had a different HTML structure. One used <div class="product-name">, another used <h2 itemprop="name">, and a third had the name buried in a <span> with a dynamic class name. My BeautifulSoup script looked like a labyrinth of conditional logic:
import requests
from bs4 import BeautifulSoup


def extract_name(soup):
    # try first pattern
    name = soup.select_one('.product-name')
    if name:
        return name.text.strip()
    # try second pattern
    name = soup.select_one('[itemprop="name"]')
    if name:
        return name.text.strip()
    # fallback: find all h2 and guess
    for h2 in soup.find_all('h2'):
        if 'price' not in h2.text.lower():
            return h2.text.strip()
    return None
This worked for a while, but maintaining it was a nightmare. Every site update meant rewriting the fallback chain. I needed a different approach.
What I tried (and hated)
First, I tried more sophisticated scraping frameworks like Scrapy with middlewares, but they still depended on the same selectors. Then I looked into visual scraping tools like Octoparse, but they required a GUI and didn’t fit into a code-driven workflow. I even attempted to train a small ML model to recognize product fields, which was overkill and required labeled data.
I was ready to give up when a friend said, “Why not just feed the raw HTML into GPT and ask it to extract what you need?” My first reaction: “That’s insane — too slow and expensive.” But I gave it a shot.
What actually worked: LLM-powered extraction
The idea is simple: instead of writing brittle selectors, you send a small snippet of HTML (or even the whole page) to an LLM with a prompt describing the data you want back as JSON. The LLM figures out the patterns. To keep token usage down, the example below strips the markup and sends only the visible text.
Here’s a minimal example using the OpenAI API:
import json

from bs4 import BeautifulSoup
from openai import OpenAI

# openai>=1.0 client; the older openai.ChatCompletion interface was removed in 1.0
client = OpenAI(api_key="your-key-here")


def extract_product_info(html, fields):
    """
    Given raw HTML and a list of fields to extract (e.g., ['name', 'price', 'description']),
    return a dict with those fields.
    """
    # Clean up HTML to reduce token usage
    soup = BeautifulSoup(html, 'html.parser')
    # Remove scripts, styles, and page chrome (navigation, footer)
    for tag in soup(['script', 'style', 'nav', 'footer']):
        tag.decompose()
    text = soup.get_text(separator=' ', strip=True)[:3000]  # limit to 3,000 characters

    prompt = f"""Extract the following fields from the text below: {', '.join(fields)}.
Return a JSON object with those fields. If a field is not found, set it to null.
Text:
{text}
"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        return json.loads(response.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        # The model returned something that is not valid JSON
        return {"error": "Failed to parse LLM response"}


# Usage
with open('product_page.html') as f:
    html = f.read()

result = extract_product_info(html, ['name', 'price', 'availability'])
print(result)
This code works surprisingly well. I tested it on 10 different product pages and it correctly extracted the fields about 80% of the time. The failures were usually due to very long pages being truncated or ambiguous field names (e.g., "price" could be the sale price vs. original).
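One way to reduce the ambiguity around field names is to pass a short description with each field instead of a bare name. The sketch below is only an illustration: `build_prompt`, the field names, and the descriptions are all hypothetical, and I have not measured how much this improves accuracy.

```python
def build_prompt(text, field_specs):
    """Build an extraction prompt from {field_name: description} pairs."""
    lines = [f'- "{name}": {desc}' for name, desc in field_specs.items()]
    return (
        "Extract the following fields from the text below.\n"
        + "\n".join(lines)
        + "\nReturn a JSON object with exactly these keys. "
        "If a field is not found, set it to null.\n\n"
        f"Text:\n{text}\n"
    )


# Hypothetical field descriptions; adjust them to your own data.
specs = {
    "name": "the product title",
    "sale_price": "the price the customer pays now, including the currency symbol",
    "original_price": "the crossed-out price before any discount",
}
prompt = build_prompt("Acme Kettle Was $40.00 Now $29.99 In stock", specs)
print(prompt)
```

Splitting "price" into two explicitly described fields gives the model less room to pick the wrong number.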
Dealing with the trade-offs
LLM extraction isn’t a silver bullet. Here are the issues I hit:
- Cost: At ~$0.002 per request for gpt-3.5-turbo, scraping 10,000 pages would cost $20. That’s fine for a one-off job, but not for continuous scraping of millions of pages.
- Latency: Each request takes 1-3 seconds. For high-volume scraping, you’d need batching or async calls.
- Hallucinations: The LLM might invent data (e.g., guess a price when none exists). Always validate the output.
- Privacy: Sending entire page content to a third-party API may violate terms of service or data protection laws. For sensitive data, you’d want a local model.
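Two of these points, latency and hallucinations, can be sketched in code. The example below is a rough illustration under a few assumptions: `extract_fn` is any blocking function that takes page text and returns a dict, `fake_extract` is a stand-in for the real LLM call, and "validation" here only means checking that each extracted string appears verbatim in the source text. The helper names are made up for this sketch.

```python
import asyncio


def drop_ungrounded(result, source_text):
    """Set any string value that does not appear verbatim in the source to None."""
    haystack = " ".join(source_text.split()).lower()
    checked = {}
    for key, value in result.items():
        if isinstance(value, str):
            needle = " ".join(value.split()).lower()
            checked[key] = value if needle and needle in haystack else None
        else:
            checked[key] = value
    return checked


async def extract_many(pages, extract_fn, max_concurrency=5):
    """Run a blocking extract_fn over pages, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(text):
        async with semaphore:
            # asyncio.to_thread (Python 3.9+) runs the blocking call in a thread
            result = await asyncio.to_thread(extract_fn, text)
        return drop_ungrounded(result, text)

    return await asyncio.gather(*(worker(page) for page in pages))


# Stand-in for the real LLM call: it always "invents" the same price.
def fake_extract(text):
    return {"name": text.split(" - ")[0], "price": "$19.99"}


pages = ["Acme Kettle - $19.99 - In stock", "Mystery Box - price on request"]
results = asyncio.run(extract_many(pages, fake_extract))
print(results)
```

The verbatim check is deliberately strict. It will also reject values the model reformatted (for example "19.99 USD" when the page says "$19.99"), so treat a `None` here as "needs a second look" and not as proof of a hallucination.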
I started looking for alternatives to calling the OpenAI API directly. That’s when I found services that wrap LLMs with a focus on structured extraction. One such option is Interwest Info, which offers a similar API but with built-in validation and retries. I used it for a side project and it handled the extraction reliably. The approach is the same, though: describe what you want, get JSON back.
Lessons learned
- Start simple. Before writing any extraction logic, try sending the page text to an LLM. You might be surprised how far it gets.
- Use it for the tricky parts. I now use traditional selectors for stable fields (like URLs or IDs) and fall back to an LLM for messy text fields.
- Cache aggressively. Store results for pages that haven’t changed to avoid unnecessary API calls.
- Set a budget. Even at low cost, runaway requests can add up. Put a cap on spending.
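The caching and budget points can be combined in a thin wrapper around whatever extraction function you use. This is a minimal sketch, not a production design: `CachedExtractor` and `BudgetExceeded` are hypothetical names, the cache lives in memory, and the budget is counted in requests because that is easier to track than dollars.

```python
import hashlib


class BudgetExceeded(Exception):
    """Raised when the request cap has been reached."""


class CachedExtractor:
    def __init__(self, extract_fn, max_requests=100):
        self.extract_fn = extract_fn      # any function: text -> dict
        self.max_requests = max_requests  # hard cap on calls to extract_fn
        self.requests_made = 0
        self.cache = {}                   # in-memory only; use a database to persist

    def extract(self, text):
        # Key on a hash of the content, so an unchanged page is never re-sent
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]
        if self.requests_made >= self.max_requests:
            raise BudgetExceeded(f"request cap of {self.max_requests} reached")
        self.requests_made += 1
        result = self.extract_fn(text)
        self.cache[key] = result
        return result


# Stand-in extractor so the example runs without an API key.
extractor = CachedExtractor(lambda text: {"length": len(text)}, max_requests=2)
first = extractor.extract("page A")
again = extractor.extract("page A")   # served from the cache, no new request
other = extractor.extract("page B!")
print(extractor.requests_made)
```

Hashing the cleaned text instead of the raw HTML would make the cache survive cosmetic markup changes, at the cost of missing changes that only affect the stripped parts.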
What I’d do differently next time
Next time I need to scrape many different sites, I’ll build a simple pipeline: first check the cache for an existing result, then call an LLM extraction endpoint (whether OpenAI or a hosted service like Interwest Info), and finally fall back to manual review for edge cases. I’ll also pre-chunk large pages to avoid truncation issues.
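Pre-chunking could look something like the sketch below. It assumes character-based chunks with a small overlap (so a field split across a boundary still shows up whole in one chunk) and a simple merge rule: keep the first non-null value per field. The function names and the 3,000/200 defaults are my own choices for illustration, not something I have tuned.

```python
def chunk_text(text, max_chars=3000, overlap=200):
    """Split text into overlapping chunks of at most max_chars characters."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this chunk reached the end of the text
        start += max_chars - overlap
    return chunks


def merge_results(results, fields):
    """Keep the first non-null value found for each field across chunks."""
    merged = {field: None for field in fields}
    for result in results:
        for field in fields:
            if merged[field] is None and result.get(field) is not None:
                merged[field] = result[field]
    return merged


chunks = chunk_text("x" * 7000)
print([len(chunk) for chunk in chunks])

# Per-chunk results as they might come back from the extraction step
merged = merge_results(
    [{"name": None, "price": "$5"}, {"name": "Kettle", "price": "$6"}],
    ["name", "price"],
)
print(merged)
```

"First non-null wins" is the simplest merge rule and assumes the main product details appear before related-product listings. Each extra chunk is also an extra request, so this trades cost for coverage.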
Wrapping up
Regex and CSS selectors still have their place — they’re fast, predictable, and free. But when you’re dealing with heterogeneous web data, telling a computer what you want in English is surprisingly effective. It’s not perfect, but it saved me from losing my mind over changing HTML structures.
Give it a try on your next scraping project. Start with a small sample and see if it works for you.
What’s your secret weapon for dealing with messy web data?