I’ve been scraping websites for years. For me, it started as a way to collect football match statistics from a site that had no API. Then it became a side project for tracking product prices. And then it became a nightmare.
Last month, I tried to extract structured data from a set of e-commerce product pages. The HTML was a mess — inconsistent class names, nested div soup, and the occasional server-rendered component that shifted the layout depending on product category. My initial approach? CSS selectors, XPath, and a prayer that nothing changes overnight.
It broke. Again. And again. And I had that sinking feeling: I was spending more time maintaining the scraper than actually using the data.
What I tried that didn’t work
I started with BeautifulSoup and lxml. For a while, a simple soup.select('div.product-price') worked. Then the retailer redesigned their product page. The price was now inside a <span> with a dynamic class like price_4f3d. I wrote a new selector using partial matches. Three days later, they changed the layout for a specific category, and the script threw NoneType errors.
Next, I moved to attribute-based extraction: find('div', attrs={'data-testid': 'price'}). That worked for a few weeks until the frontend team dropped data-testid entirely.
I even tried building a visual scraper with Playwright — capturing screenshots and using OCR. That was fragile, slow, and inaccurate for numbers with decimal points.
Regex felt like a crutch that turned into a hammer. I’d write patterns like r"Price: \$([0-9.]+)", but the site would sometimes put the price in a single <script> tag as a JSON blob instead. So I’d search the page source for the JSON.parse call and pull the data out of its argument, until they started obfuscating the JSON keys.
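To make the embedded-JSON case concrete, here is a minimal standard-library sketch of pulling JSON blobs out of <script> tags. It assumes the tag body is pure JSON (not a JSON.parse("...") call wrapped in JavaScript), and the sample markup and the "price" key are made up for illustration.

```python
import json
from html.parser import HTMLParser


class ScriptJSONCollector(HTMLParser):
    """Collect the parsed contents of every <script> tag whose body is valid JSON."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self._buffer = []
        self.blobs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self._in_script = False
            text = "".join(self._buffer).strip()
            try:
                self.blobs.append(json.loads(text))
            except json.JSONDecodeError:
                pass  # ordinary JavaScript (or an empty tag), not a JSON blob


def find_json_blobs(markup):
    collector = ScriptJSONCollector()
    collector.feed(markup)
    collector.close()
    return collector.blobs


# Hypothetical page fragment: one plain JS script, one JSON blob.
sample = (
    "<html><script>var x = 1;</script>"
    '<script type="application/json">{"price": 19.99}</script></html>'
)
blobs = find_json_blobs(sample)
print(blobs)
```

This only finds the blob. Once the keys are obfuscated, as described above, you still have to guess which key holds the price, which is exactly where this approach ran out of road.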
At this point, I had an 800-line script with multiple fallback strategies. It was unmaintainable. And every time a stakeholder asked, “Can you also scrape the product description and the shipping info?” I wanted to cry.
What eventually worked
I stepped back and asked: “What am I really trying to do?” I want to extract specific fields (name, price, availability, description) from arbitrary HTML. The HTML structure is the problem. The semantics are what I care about.
So I started using a Large Language Model (LLM) to do the extraction. Instead of writing brittle rules, I pass the raw HTML to an AI model and ask it to return a JSON object with the fields I need. The model understands natural language instructions like “extract the product price as a number” and doesn’t depend on any particular HTML structure.
Here’s the rough idea in Python (using OpenAI’s API as an example):
```python
import json

import requests
from bs4 import BeautifulSoup
from openai import OpenAI  # openai>=1.0 client interface

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def get_page_html(url):
    # Fetch page content (swap in Playwright, etc. for JS-heavy pages)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


url = "https://example.com/product/123"  # placeholder URL
html_content = get_page_html(url)
soup = BeautifulSoup(html_content, 'html.parser')

# Reduce HTML size by removing scripts, styles, etc.
for tag in soup(['script', 'style', 'meta', 'noscript']):
    tag.decompose()
clean_html = str(soup)[:3000]  # limit input length for cost

prompt = f"""
From the following HTML, extract the product information and return a JSON object with fields:
- product_name (string)
- price (number, strip currency symbol)
- availability (boolean, true if in stock)
- description (string)
HTML:
{clean_html}
Return only JSON.
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # or any cheaper model
    messages=[
        {"role": "system", "content": "You are a data extraction assistant. Return only valid JSON."},
        {"role": "user", "content": prompt},
    ],
    temperature=0.0,
)
result_text = response.choices[0].message.content

# Often the model wraps JSON in a markdown ```json ... ``` fence
result_text = result_text.strip().removeprefix("```json").removesuffix("```").strip()

try:
    data = json.loads(result_text)
    print(data)
except json.JSONDecodeError:
    print("Failed to parse AI response")
```
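Stripping a fixed prefix and suffix covers the common case, but replies sometimes arrive with a different fence label or a sentence of preamble. A slightly more forgiving sketch, assuming the reply contains exactly one JSON object, is to slice from the first "{" to the last "}" before parsing. The function name parse_model_json is just illustrative.

```python
import json


def parse_model_json(reply):
    """Return the JSON object embedded in a model reply, or None if parsing fails."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        return None  # no object-shaped text at all
    try:
        return json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None


fenced = "```json\n{\"price\": 19.99}\n```"
chatty = 'Here is the data: {"price": 19.99} Let me know if you need more.'
print(parse_model_json(fenced))
print(parse_model_json(chatty))
print(parse_model_json("Sorry, I could not find a price."))
```

Returning None instead of raising keeps the calling code simple: a None result can be treated the same way as a failed validation and retried.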
I know, I know — “Throwing AI at it” sounds like hype. But hear me out.
Why this works (and where it doesn’t)
- Resilience to markup changes: The LLM doesn’t care if the price is in a <div> or a <span>. It reads the visible text and the semantic clues like “Price:” nearby.
- Easy to extend: Want to extract shipping cost? Just add it to the JSON schema and mention it in the prompt. No new selectors.
- Works on varied sources: I’ve used the same prompt on ten different retailer sites with minimal adjustments.
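The "easy to extend" point is easiest to see if the field list lives in data rather than in the prompt text. The sketch below is one possible arrangement, not the script above: FIELDS, build_prompt and the shipping_cost field are illustrative names.

```python
# Field name -> type hint shown to the model. Adding a field is a one-line change.
FIELDS = {
    "product_name": "string",
    "price": "number, strip currency symbol",
    "availability": "boolean, true if in stock",
    "description": "string",
}


def build_prompt(fields, markup, max_chars=3000):
    """Render the extraction prompt from a field dictionary and (truncated) HTML."""
    field_lines = "\n".join(f"- {name} ({hint})" for name, hint in fields.items())
    return (
        "From the following HTML, extract the product information "
        "and return a JSON object with fields:\n"
        f"{field_lines}\n"
        "HTML:\n"
        f"{markup[:max_chars]}\n"
        "Return only JSON."
    )


# Extending the schema with a hypothetical shipping_cost field:
extended = {**FIELDS, "shipping_cost": "number, strip currency symbol"}
prompt = build_prompt(extended, "<div>Shipping: $4.99</div>")
print(prompt)
```

The same dictionary can then drive output validation, so the prompt and the checks never drift apart.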
But there are real trade-offs:
- Cost: Each API call costs a fraction of a cent (for gpt-4o-mini, pennies per thousand pages). For low-volume scraping, it’s fine. For millions of pages, it adds up.
- Latency: Even the fastest models take 1–3 seconds per call, so scraping dozens of pages in real time isn’t practical.
- Hallucinations: The model might invent a price that looks plausible but isn’t there. I’ve seen it return “$19.99” when the actual price was hidden in an image. You need validation.
- HTML size: LLMs have context limits. You often can’t pass the entire page (especially with heavy JavaScript). I pre-process to remove noise and truncate.
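One cheap guard against the hallucinated-price problem is a grounding check: accept a price only if the same number actually appears in the page text. This is a rough standard-library sketch, assuming prices use "." as the decimal separator and "," only as a thousands separator. The helper names are illustrative.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes only; tag names and attribute values are ignored."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def page_text(markup):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(collector.parts)


def price_is_grounded(price, markup):
    """Return True if the extracted price occurs as a number in the page text."""
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        return False
    text = page_text(markup).replace(",", "")  # "1,299.00" -> "1299.00"
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    return any(abs(float(n) - price) < 0.005 for n in numbers)


print(price_is_grounded(19.99, "<span>Price: $19.99</span>"))
print(price_is_grounded(19.99, '<img src="price.png" alt="price">'))
```

In the image-only case the check fails, which is the desired outcome: the record gets flagged for review instead of silently storing a made-up number.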
Lessons learned
- Start with simple rules if the site is stable. AI is a last resort, not a first choice.
- Always validate the output against expected types (e.g., price should be a float within a reasonable range). I use Pydantic models and re-query the model if validation fails.
- Cache results aggressively — don’t call the API for the same page twice.
- Consider open-source models (like Llama) if you need privacy and want to avoid per‑token costs. I tried running a quantized version of Llama 3 locally — it’s slower but free.
- The product I used to simplify deployment was from ai.interwestinfo.com — their API abstracts the model selection and HTML cleaning. But honestly, you can build the same with a few lines of Python.
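The validate, re-query and cache lessons fit together in a few lines. The sketch below uses plain Python checks as a stand-in for Pydantic so it runs without dependencies. extract_fn is a placeholder for whatever function calls the model, and the max_price bound is an arbitrary assumption.

```python
import hashlib


def validate_product(data, max_price=100_000):
    """Return a list of problems; an empty list means the record looks sane."""
    if not isinstance(data, dict):
        return ["result is not a JSON object"]
    problems = []
    name = data.get("product_name")
    if not isinstance(name, str) or not name.strip():
        problems.append("product_name must be a non-empty string")
    price = data.get("price")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        problems.append("price must be a number")
    elif not 0 < price <= max_price:
        problems.append("price out of range")
    if not isinstance(data.get("availability"), bool):
        problems.append("availability must be a boolean")
    return problems


_cache = {}  # in-memory only; a real scraper would persist this


def extract_validated(markup, extract_fn, max_attempts=2):
    """Call extract_fn until its output validates; cache successes by content hash."""
    key = hashlib.sha256(markup.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]
    for _ in range(max_attempts):
        data = extract_fn(markup)
        if not validate_product(data):
            _cache[key] = data
            return data
    return None  # every attempt failed validation
```

Keying the cache on a hash of the HTML rather than the URL means an unchanged page costs nothing on the next run, while a changed page is re-extracted automatically.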
What I’d do differently next time
I’d still use AI for extraction, but I’d implement a hybrid approach: first try lightweight CSS selectors or regexes (for sites that consistently use semantic classes), then fall back to AI only if those rules fail. This saves cost, and each fallback is a clear signal that the site has changed.
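A rough sketch of that hybrid flow, with the rule-based extractor and the model call passed in as plain functions. The regex, the function names and the on_fallback hook are all illustrative, and llm_fn stands in for the real API call.

```python
import re


def extract_price_with_rules(markup):
    """Cheap first pass: match a literal 'Price: $12.34' pattern, else None."""
    match = re.search(r"Price:\s*\$([0-9]+(?:\.[0-9]+)?)", markup)
    return {"price": float(match.group(1))} if match else None


def extract_hybrid(markup, rule_fn, llm_fn, on_fallback=None):
    """Try the rules first; call the (expensive) model only when they find nothing."""
    result = rule_fn(markup)
    if result is not None:
        return result, "rules"
    if on_fallback is not None:
        on_fallback(markup)  # e.g. log it: the layout has probably changed
    return llm_fn(markup), "llm"


def fake_llm(markup):
    """Stand-in for the model call so the example runs offline."""
    return {"price": 5.0}


changed_pages = []
print(extract_hybrid("<b>Price: $12.34</b>", extract_price_with_rules, fake_llm, changed_pages.append))
print(extract_hybrid("<b>Now only 5 dollars</b>", extract_price_with_rules, fake_llm, changed_pages.append))
```

Returning the source ("rules" or "llm") alongside the data makes it easy to track how often the fallback fires, which doubles as a redesign alarm.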
Also, I’d invest more effort in HTML sanitization before feeding it to the model. Removing redundant tags and whitespace reduces token count and improves accuracy.
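As an illustration of what that sanitization could look like, here is a standard-library sketch that drops non-content tags, keeps only a small allowlist of attributes, and collapses whitespace. The SKIP_TAGS and KEEP_ATTRS sets are assumptions to tune per site, not a recommendation.

```python
import re
from html import escape
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript", "svg"}  # dropped with their contents
KEEP_ATTRS = {"class", "id", "itemprop"}            # attributes that may carry meaning


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth or tag == "meta":
            return
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if not self._skip_depth and tag != "meta":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = re.sub(r"\s+", " ", data)  # collapse runs of whitespace
        if text.strip():
            self.out.append(text)


def sanitize(markup):
    parser = Sanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


sample = (
    '<div class="p" style="color:red">\n  <script>var a=1;</script>\n'
    "  <span>Price:   $19.99</span>\n</div>"
)
print(sanitize(sample))
```

Because the output keeps tag names and class attributes, the model still sees hints like class="price", while inline styles, scripts and indentation no longer eat into the token budget.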
Wrapping up
I no longer dread website redesigns. My scraper now survives layout changes because it understands the content rather than the structure. It’s not perfect, but it’s the best balance of maintainability and accuracy I’ve found.
Have you tried using LLMs for data extraction? What’s your setup look like — do you go full AI, or stick with selectors until they break?