I’ve been scraping the web for years. It’s a love-hate relationship: the thrill of finally pulling the data you need, followed by the despair when the site redesigns and everything breaks. Last month, I hit a wall. I needed to extract product specs from dozens of e-commerce pages. Each page had the same data (name, price, description, dimensions) but the HTML structure varied wildly. Some used <dl>, some <table>, some just <div> soup with inline CSS. My trusty regex and BeautifulSoup pipeline turned into a nightmare of conditional branches.
The regex abyss
I started optimistically. Write a few patterns, test, repeat. But soon my code looked like this:
import re
from bs4 import BeautifulSoup

def extract_price(html):
    patterns = [
        r'\$?\d+[\.\,]?\d*',
        r'price[\s:]*(\$?\d+[\.\,]?\d*)',
        r'<span class="price">(.*?)<\/span>',
        # ... more patterns
    ]
    for pat in patterns:
        match = re.search(pat, html, re.I)
        if match:
            return match.group(1) if match.lastindex else match.group()
    return None
It worked… for about three pages. Then a new site used € instead of $, or had the price embedded in a JavaScript object. I’d add more patterns. Then another site used an image of the price. I cried a little.
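To see how quickly this kind of pattern list breaks, here is a small sketch that runs only the first, catch-all pattern from the list above against two made-up strings (the inputs are hypothetical, not taken from a real site, and `first_number` is just an illustrative helper). Because that pattern matches any run of digits, it fires on the first number it meets, and it silently drops a currency symbol it doesn't know.

```python
import re

# The first pattern from the list above: optional "$", digits, optional separator, digits.
CATCH_ALL = r'\$?\d+[\.\,]?\d*'

def first_number(text):
    """Return whatever the catch-all pattern matches first, or None."""
    match = re.search(CATCH_ALL, text, re.I)
    return match.group() if match else None

# Hypothetical inputs, for illustration only.
print(first_number("Model 3000 Widget, price: $29.99"))  # 3000 (the model number, not the price)
print(first_number("Preis: €29,99"))                      # 29,99 (currency symbol lost)
```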
The BeautifulSoup maze
I tried to be smarter: parse the HTML structure. But every site had a unique layout. I wrote a function that tried all common selectors:
def find_price(soup):
    for selector in [
        '.price', '.product-price', '#price',
        '[itemprop="price"]', 'meta[name="price"]',
    ]:
        el = soup.select_one(selector)
        if el:
            return el.get('content') or el.text.strip()
    return None
Better, but still brittle. One site used .prc; another put a price inside an <s> tag, and it turned out to be the old, struck-through price. The false positives mounted. I needed a different approach.
The lightbulb: treat it like a natural language understanding problem
I realized that what I really wanted was to read the page like a human: ignore the markup and just understand the semantic meaning. That’s exactly what large language models (LLMs) are good at. Why fight the HTML when I could ask an AI to extract the data?
The idea: feed the raw HTML (or a cleaned text version) to an LLM with a prompt that says "Give me the price, name, and description in JSON." The model can handle variations because it understands context.
The approach: structured extraction with LLMs
I chose to use LangChain with OpenAI’s GPT-4 (but later found cheaper alternatives). Here’s the core idea:
- Fetch the HTML.
- Strip script/style tags and reduce noise (optional, but helps with cost).
- Send the text + a prompt to the LLM, requesting a JSON response.
- Parse the JSON.
Example code
import json
import re

import requests
from bs4 import BeautifulSoup
# Import paths for older LangChain releases; newer ones expose these as
# langchain_openai.ChatOpenAI and langchain_core.messages.
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

# ChatOpenAI reads the key from the OPENAI_API_KEY environment variable
# (or swap in another provider's chat model)
llm = ChatOpenAI(model="gpt-4", temperature=0)

def extract_product_info(url):
    # Fetch and clean HTML
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Remove script/style tags plus navigation and footer boilerplate
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    # Crude token limit: keep only the first 4000 characters
    text = soup.get_text(separator=" ", strip=True)[:4000]

    system_prompt = (
        "You extract product information from unstructured text. "
        "Respond only with a JSON object containing: "
        "name, price, description, dimensions (if present)."
    )
    user_prompt = f"Extract data from this text:\n\n{text}"
    response = llm([
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt),
    ])

    # Try to parse the reply as JSON; fall back to a Markdown code fence
    try:
        return json.loads(response.content)
    except json.JSONDecodeError:
        match = re.search(r'```(?:json)?\s*([\s\S]*?)```', response.content)
        if match:
            return json.loads(match.group(1))
        raise

# Example usage
url = "https://example.com/product/123"
info = extract_product_info(url)
print(info)
# Example output:
# {'name': 'Widget Pro', 'price': '$29.99', 'description': 'A durable widget...', 'dimensions': '10x5x3 cm'}
Cost and speed trade-offs
This approach isn’t free. A request to GPT-4 costs around $0.03–$0.10 depending on input size, so a hundred products comes to $3–$10. It’s also slow (2–5 seconds per page). I mitigated both by:
- Using GPT-3.5-turbo for simpler pages (much cheaper, about $0.001 per call).
- Reducing input size: only send the visible text around the product area (use XPath or CSS to extract main content).
- Batching: if multiple items are on one page, ask for all in one call.
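The arithmetic behind those numbers is worth keeping in a small helper when deciding which model to use. The sketch below plugs in the rough per-call figures quoted above ($0.03–$0.10 for GPT-4, about $0.001 for GPT-3.5-turbo). These are the ballpark numbers from this post, not current list prices, and `estimate_cost` is just an illustrative name.

```python
def estimate_cost(pages, cost_per_call_low, cost_per_call_high=None):
    """Return (low, high) total cost in dollars, assuming one LLM call per page."""
    if cost_per_call_high is None:
        cost_per_call_high = cost_per_call_low
    return (
        round(pages * cost_per_call_low, 2),
        round(pages * cost_per_call_high, 2),
    )

# Ballpark per-call figures from this post (assumptions, not current pricing).
GPT4_PER_CALL = (0.03, 0.10)
GPT35_PER_CALL = (0.001,)

print(estimate_cost(100, *GPT4_PER_CALL))   # (3.0, 10.0)
print(estimate_cost(100, *GPT35_PER_CALL))  # (0.1, 0.1)
```

Batching changes the `pages` argument rather than the per-call price: if one call covers several items, count calls, not products.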
When it fails
LLMs aren’t perfect. I’ve seen hallucinations: inventing a price when none exists, or mixing up name and description. To guard against this, I always validate the output against expected types and ranges (e.g., price should match \d+\.\d{2}). I also set temperature=0 to reduce randomness.
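A validation step along those lines might look like the sketch below. The field names match the JSON keys requested in the prompt, and the price pattern is the \d+\.\d{2} check with an optional leading currency symbol. The price ceiling and the "price must appear in the page text" check are assumptions to tune for your own data, and `validate_product` is a hypothetical helper name.

```python
import re

# Assumption: prices look like "29.99", optionally with a currency symbol in front.
PRICE_RE = re.compile(r'^[$€£]?\s?(\d+\.\d{2})$')
MAX_PRICE = 10_000  # hypothetical sanity ceiling; adjust per catalog

def validate_product(record, source_text=""):
    """Return a list of problems found in an extracted record (empty list = OK)."""
    problems = []

    # Required text fields must be non-empty strings.
    for field in ("name", "description"):
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field}: missing or not a string")

    # Price must have the expected shape and fall in a plausible range.
    price = record.get("price")
    match = PRICE_RE.match(price.strip()) if isinstance(price, str) else None
    if match is None:
        problems.append("price: unexpected format")
    else:
        if not 0 < float(match.group(1)) <= MAX_PRICE:
            problems.append("price: out of range")
        # Cheap hallucination check: the number should appear in the page text.
        if source_text and match.group(1) not in source_text:
            problems.append("price: not found in source text")

    return problems

page_text = "Widget Pro $29.99 A durable widget"
good = {"name": "Widget Pro", "price": "$29.99", "description": "A durable widget"}
bad = {"name": "Widget Pro", "price": "$19.99", "description": ""}

print(validate_product(good, page_text))  # []
print(validate_product(bad, page_text))
# ['description: missing or not a string', 'price: not found in source text']
```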
Another limitation: if the page is mostly JavaScript-rendered, you need a headless browser first. That adds complexity.
Alternatives I considered
- Commercial APIs like ai.interwestinfo.com (I haven't used it personally, but it offers a similar service). The advantage is no need to manage API keys or prompts; the downside is vendor lock-in and potentially higher per-request costs.
- Local models (LLaMA, Mistral) via Ollama: free but slower and less accurate for extraction.
- Fine-tuning: overkill for a one-off project, but could be worth it for a recurring domain.
What I learned
- Don’t fight the format. If the data is unstructured, use a model that understands language, not markup.
- LLM extraction is a complement, not a replacement. For well-structured pages, traditional parsing is faster and cheaper.
- Prompt engineering matters a lot. A poorly written prompt returns garbage. Experiment and iterate.
What I’d do differently next time
I’d start with a small test suite of 5–10 representative pages and evaluate accuracy before scaling. I’d also use a more structured output format: LangChain offers PydanticOutputParser, which parses the response against a schema. That would catch malformed or missing fields early, though a plausible-looking invented value still needs the type and range checks.
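A test suite like that doesn't need much machinery. The sketch below scores any extractor function against hand-labelled expected values, field by field. `evaluate`, the case format, and the stub extractor are all hypothetical; in practice the extractor would be the LLM-backed function and the cases would be saved pages paired with their known-correct values.

```python
from collections import defaultdict

FIELDS = ("name", "price", "description", "dimensions")

def evaluate(extract, cases, fields=FIELDS):
    """Return per-field accuracy of `extract` over (page_input, expected_dict) cases."""
    if not cases:
        raise ValueError("need at least one test case")
    correct = defaultdict(int)
    for page_input, expected in cases:
        try:
            got = extract(page_input) or {}
        except Exception:
            got = {}  # a crash counts as every field being wrong
        for field in fields:
            if got.get(field) == expected.get(field):
                correct[field] += 1
    return {field: correct[field] / len(cases) for field in fields}

# Stub extractor standing in for the real LLM call (illustration only).
def fake_extract(page_input):
    return {"name": page_input.title(), "price": "$9.99"}

cases = [
    ("widget pro", {"name": "Widget Pro", "price": "$9.99"}),
    ("gadget max", {"name": "Gadget Max", "price": "$19.99"}),
]
print(evaluate(fake_extract, cases, fields=("name", "price")))
# {'name': 1.0, 'price': 0.5}
```

Exact string equality is a strict scoring rule; for free-text fields like description a looser comparison may be more useful.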
Closing thoughts
Regex and BeautifulSoup are still my go-tos for stable APIs or consistent HTML. But when the chaos level goes beyond 7/10, I now reach for an AI model. It’s like having a junior developer who can read any page—just a bit slower and more expensive.
What’s your approach for dealing with wildly variable web pages? Do you stick with pattern matching or have you tried AI extraction?