I’ve been building scrapers for years. CSS selectors, XPath, regex — I’ve written thousands of lines of code just to pull product names and prices from e-commerce sites. Every new site meant a new set of selectors. Sometimes the HTML would change slightly and my entire script would break. It was exhausting.
A few months ago, I needed to monitor prices across 30 different online stores. Each one had a completely different DOM structure. I spent two days writing custom selectors for the first five sites and realized I’d be at this for weeks. There had to be a better way.
What I Tried First (and What Failed)
First, I tried the old reliable: BeautifulSoup with a mix of find_all and select.
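import requests
from bs4 import BeautifulSoup

# Fetch the page; a timeout keeps a stalled server from hanging the script
res = requests.get('https://some-store.com/product/123', timeout=10)
soup = BeautifulSoup(res.text, 'html.parser')
# Site-specific selector: select_one returns None if the class name changes
price = soup.select_one('.price-tag').text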
This works fine until the site changes its class names, or uses dynamic rendering (hello, React SPAs). Then I moved to Selenium to handle JavaScript. That worked, but it was slow and I still had to write site-specific selectors.
I even tried heuristic approaches: look for elements containing a dollar sign, or items with the highest numeric value in a certain container. These worked about 60% of the time — not good enough for a production system.
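For illustration, a dollar-sign heuristic of this kind might look like the sketch below. The names `PRICE_RE` and `guess_price` are hypothetical, and the pattern assumes US-style formatting such as $1,299.00. It is not the code from the original scraper, only an example of the idea and of how it goes wrong.

```python
import re

# Assumed format: a dollar sign, digits with optional thousands commas,
# and an optional cents part (for example $19.99 or $1,299.00).
PRICE_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{1,2})?)")


def guess_price(text):
    """Return the first dollar amount found in text as a float, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# The heuristic has no notion of context, so on a sale page it picks
# the crossed-out price because that one appears first.
print(guess_price("Was $49.99 Now $39.99"))
```

Taking the first match is what makes this fragile: nothing in the pattern distinguishes the current price from an old price, a shipping fee, or a related product.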
The Idea That Changed Everything
I was at a meetup and someone mentioned they used GPT to extract structured data from customer emails. A light bulb went off. Why not feed the raw HTML to an LLM and ask it to return exactly what I need? I know LLMs are great at understanding natural language instructions — maybe they could understand HTML too?
I wrote a quick prototype using OpenAI’s API. The results were shocking. With a good prompt, GPT-4 could extract product name, price, and availability from an entire page of HTML — without any selectors.
How I Made It Work
Here’s the core approach. Instead of trying to parse the HTML, I send a snippet of it (less than 8k tokens) to an LLM with a system prompt that explains the schema I want back.
import openai
import json

# Note: openai.ChatCompletion is the interface of openai-python versions before 1.0
openai.api_key = "sk-..."

def extract_product_data(html_snippet):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": """
You are a data extraction assistant. Given a snippet of HTML from a product page, extract the following fields and return them as valid JSON:
- name (string)
- price (float, remove currency symbols)
- availability (boolean, true if 'in stock' or 'add to cart' is present)
- currency (string, e.g. 'USD', 'EUR')
If a field cannot be found, set it to null.
Only return the JSON object, no extra text.
"""},
            {"role": "user", "content": html_snippet}
        ],
        temperature=0  # minimize randomness so repeated runs agree
    )
    # The prompt asks for bare JSON, so the reply is parsed directly;
    # json.loads raises ValueError if the model adds any extra text
    content = response.choices[0].message.content
    return json.loads(content)
I then call this on a small cleaned-up version of the page HTML (I strip scripts, styles, and long inline text to reduce token count). The results are surprisingly consistent. For a test set of 10 different product pages, it got the price right 9 out of 10 times — the one failure was a page with multiple prices (strike-through vs actual).
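The cleanup step can be sketched with the standard library alone. This is a minimal example under stated assumptions, not the exact cleaning code behind the results above: `clean_html`, `_Stripper`, `SKIPPED_TAGS`, and the `max_text_len` cutoff of 200 characters are all hypothetical names and values chosen for illustration.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style"}  # subtrees dropped entirely


class _Stripper(HTMLParser):
    """Re-emits HTML while dropping skipped subtrees, comments, and long text."""

    def __init__(self, max_text_len):
        super().__init__()
        self.max_text_len = max_text_len
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, with no end tag.
        if self.skip_depth == 0 and tag not in SKIPPED_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Whitespace-only runs are dropped; long runs are truncated.
        # Character references arrive already decoded (parser default).
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.parts.append(text[: self.max_text_len])


def clean_html(html, max_text_len=200):
    """Return html without script/style subtrees, comments, or long text runs."""
    parser = _Stripper(max_text_len)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Tags and their attributes are kept on purpose, since class names like `price` give the model useful hints. Comments and the doctype disappear because the parser's default handlers for them do nothing.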
The Pitfalls I Discovered
This approach isn’t magic. Here’s what I ran into:
- Token limits: Full pages can be huge. I had to trim the HTML aggressively. I now extract only the <body> and remove <script>, <style>, and <svg> tags before sending. Even then, some pages exceed 8k tokens.
- Cost: Each request costs ~$0.02–0.05. For 30 sites scraped once an hour, that’s 720 requests and up to about $36 per day at the high end, which is not cheap. I switched to GPT-3.5-turbo for most calls, which is 10x cheaper but slightly less accurate.
- Hallucination: I’ve seen the LLM invent a price if none is present, or guess a name from a menu item. I added validation: if the price is more than 3 standard deviations from the historical mean, flag it for manual review.
- Latency: API calls take 2–5 seconds per page. If you need hundreds of pages, this won’t scale. I use async batching and limit concurrency.
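Two of the mitigations in this list, the standard-deviation check and the concurrency limit, can be sketched as follows. This is an illustrative example rather than the production code: `is_price_outlier`, `extract_many`, and the default of 5 concurrent calls are hypothetical, and `extract` stands for any blocking extraction function that takes one HTML snippet. It needs Python 3.9 or later for `asyncio.to_thread`.

```python
import asyncio
import statistics


def is_price_outlier(price, history, threshold=3.0):
    """True if price is more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return price != mean  # history is constant, so any change stands out
    return abs(price - mean) > threshold * stdev


async def extract_many(snippets, extract, max_concurrency=5):
    """Run the blocking `extract` over snippets, max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(snippet):
        async with semaphore:
            # Run the blocking call in a thread so the event loop stays free.
            return await asyncio.to_thread(extract, snippet)

    # gather returns results in the same order as the input snippets.
    return await asyncio.gather(*(worker(s) for s in snippets))
```

A call might look like `results = asyncio.run(extract_many(snippets, extract_product_data))`. The semaphore caps how many requests are in flight at once, which helps stay under API rate limits while still overlapping the 2–5 second waits.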
When This Approach Works (and When It Doesn’t)
This technique is excellent for:
- Pages with highly variable structure (different e-commerce platforms)
- When you only need a handful of fields
- Prototyping or small-scale projects
It’s not great for:
- High-volume scraping (thousands of pages per hour)
- When you need perfect accuracy (the LLM will sometimes fail)
- Scraping behind login or CAPTCHAs (you still need to handle that separately)
What I’d Do Differently Next Time
If I were starting over, I’d:
- First try a simple regex-based extraction on the HTML (e.g., look for "price": patterns) before calling the LLM. This catches 80% of cases instantly.
- Use a cheaper model like GPT-3.5-turbo-instruct for simple extractions.
- Implement a caching layer so the same page isn’t re-processed if the HTML hasn’t changed.
- Build a small validation pipeline that compares LLM output against known patterns (e.g., numeric price, non-empty name).
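The regex pre-pass, the cache, and the validation step from this list could be sketched as below. Everything here is a hypothetical illustration: `PRICE_KEY_RE`, `regex_price`, `is_valid`, and `CachedExtractor` are invented names, the pattern assumes the page embeds a JSON-style `"price"` key, and the cache lives in memory only.

```python
import hashlib
import re

# Assumed shape: "price": 19.99 or "price": "19.99" in embedded product data.
PRICE_KEY_RE = re.compile(r'"price"\s*:\s*"?(\d+(?:\.\d+)?)"?')


def regex_price(html):
    """Cheap first pass: the first embedded "price" value as a float, or None."""
    match = PRICE_KEY_RE.search(html)
    return float(match.group(1)) if match else None


def is_valid(record):
    """Structural checks: non-empty string name and a positive numeric price."""
    name = record.get("name")
    price = record.get("price")
    if not isinstance(name, str) or not name.strip():
        return False
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        return False
    return price > 0


class CachedExtractor:
    """Wraps an extraction callable and reuses results for unchanged HTML."""

    def __init__(self, extract):
        self._extract = extract
        self._cache = {}  # maps SHA-256 of the HTML to the extracted record

    def __call__(self, html):
        key = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._extract(html)
        return self._cache[key]
```

One design caveat: a hash of the raw page changes whenever any byte changes, including timestamps or tracking tokens, so hashing the cleaned HTML is likely to give more cache hits than hashing the full response.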
I also came across services like InterWest AI that offer pre-built extraction APIs. If I needed a production-ready solution without managing my own prompt pipelines, I’d evaluate those — but for my side project, the manual approach taught me a ton.
Final Thoughts
Using an LLM to extract data from HTML felt like cheating at first. But it turns out that for messy, semi-structured content, natural language understanding is often more robust than rigid selectors. I still use traditional parsing for well-behaved sites. But for those chaotic e-commerce pages? I’ll take the GPT route any day.
What’s your go-to method for extracting data from wildly different HTML structures? I’d love to hear how others handle this.