Last month, I hit a wall. I was building a price comparison tool for hobby electronics, and I needed to pull product names, prices, and stock status from about 30 different vendor sites. Easy, right? Just scrape them.
Wrong.
Every site had a unique layout. Some used tables, others used nested divs with class names like product-detail-block__3f2a. One site literally returned a different HTML structure every other request because they A/B tested their design. My BeautifulSoup selector chains looked like spaghetti, and every time a site updated, my script broke. I spent more time fixing scrapers than analyzing data.
I tried the obvious dead ends first.
What Didn’t Work
- Regex on raw HTML – Only works if you enjoy pain and hate yourself.
- CSS selectors – Brittle the moment a developer renames a class.
- Headless browser automation – Selenium and Playwright solved dynamic content, but they were slow, resource-heavy, and still required me to update selectors.
- Manual annotation – I could train a model, but that meant labeling hundreds of pages. No thanks.
I needed something that could understand the HTML rather than blindly match patterns. Something that could handle "sku" vs "product-sku" vs "data-sku" automatically.
The Lightbulb Moment: LLMs Reading HTML
A colleague mentioned they had been experimenting with GPT-4 to convert messy HTML into clean JSON. At first I laughed – "You want to throw an LLM at HTML? That’s like using a flamethrower to toast bread."
But I was desperate. So I tried it.
The idea is simple: feed the raw HTML (or a cleaned version) to an LLM along with a schema, and ask it to extract the data. The model’s language understanding handles the structural variations.
I wrote a small Python prototype using OpenAI’s structured output feature (you can also use local models like Llama 3 if you don’t want API costs). Here’s the core function:
import openai
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: str
    in_stock: bool
    sku: str | None

# Make sure to set your API key
openai.api_key = "sk-..."

def extract_product(html: str) -> Product:
    # Limit the input to the first 5000 characters (characters, not tokens)
    snippet = html[:5000]
    prompt = f"""
You are a data extraction assistant. Given the HTML of a product page, extract the product information.
Return a JSON object with these fields:
- name: product name
- price: the current price as a string (e.g. "$49.99")
- in_stock: boolean, whether the product is available
- sku: stock keeping unit if present, else null
Only output the JSON, no extra text.
HTML:
{snippet}
"""
    response = openai.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format=Product,
    )
    # parsed holds a Product instance built from the model's JSON
    return response.choices[0].message.parsed
I also built a simple cache layer to avoid re‑calling the API for the same page.
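The post does not show the cache layer, so here is a minimal sketch of what one might look like. It assumes results are plain dicts, keys the cache on a SHA-256 hash of the HTML, and stores one JSON file per page; the `.extract_cache` directory and the `cached_extract` name are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

CACHE_DIR = Path(".extract_cache")  # hypothetical location, pick your own


def cached_extract(
    html: str,
    extract: Callable[[str], dict],
    cache_dir: Path = CACHE_DIR,
) -> dict:
    """Return the stored result for this exact HTML; call `extract` only on a miss."""
    key = hashlib.sha256(html.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        # Cache hit: no API call needed
        return json.loads(path.read_text(encoding="utf-8"))
    result = extract(html)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

With the `Product` model above, the extractor could be passed as `lambda h: extract_product(h).model_dump()`. Hashing the content means a page whose HTML changes between requests will miss the cache; keying on the URL instead is a reasonable alternative if stale results are acceptable.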
The results were… surprisingly good. For 8 out of 10 sites, the first extraction was perfect. For the other 2, I had to tweak the schema or add examples to the prompt.
Where It Shines
- Ever-changing layouts – The LLM adapts; you don’t rewrite selectors.
- One schema fits many sites – I defined Product once and used it across all 30 stores.
- Handles weirdness – Some sites had prices inside a <span> inside a <div> inside a table. The LLM figured it out.
The Ugly Trade-offs
Let’s not pretend this is a silver bullet.
Cost: Every extraction costs a fraction of a cent with gpt-4o-mini. For 1000 products, that’s maybe $0.50 – acceptable for my side project, but not for real-time scraping at scale.
Latency: 2–5 seconds per page. If you need speed, traditional selectors win.
Hallucination: Sometimes the LLM invents a price. I added a validation step that sanity‑checks the output (e.g., price matches \d+\.\d{2}).
HTML size limits: I trim the HTML to the first 5000 characters that contain the product area. You can’t throw the whole page (maybe 100k tokens) at a cheap model.
Prompt engineering fragility: A small change in prompt wording can break extraction. I ended up with a prompt versioning system.
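The validation step mentioned under Hallucination could be sketched roughly as follows. The regex is the one from the post; the extra check that the extracted number appears verbatim in the source HTML is an added assumption, not something the post describes, and `is_plausible` is a hypothetical name.

```python
import re

PRICE_RE = re.compile(r"\d+\.\d{2}")


def is_plausible(product: dict, html: str) -> bool:
    """Cheap sanity check on an extracted product dict."""
    price = product.get("price") or ""
    match = PRICE_RE.search(price)
    if match is None:
        # No number shaped like 49.99 in the price field
        return False
    # Assumption: a real price should appear verbatim somewhere in the page.
    # An invented price usually will not.
    return match.group(0) in html
```

This is a substring check, so it only catches prices that are absent from the page entirely. It will not notice the model picking the wrong one of two prices that are both present.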
When NOT to Use This Approach
- Static, well-structured pages – use BeautifulSoup. It’s faster and free.
- Extremely high volume (millions of pages) – the API cost will hurt.
- Pages with massive HTML (like whole documentation sites) – trim aggressively or use a headless renderer first.
- Data that requires pixel-perfect precision (e.g., exact currency symbol from a rendered page) – LLMs are fuzzy.
How I Blend Both Worlds
Now my pipeline tries cheap pattern matching first. If the regex or BeautifulSoup fails (or returns None), it falls back to the LLM. That way I keep cost low but still have a safety net.
For example:
from bs4 import BeautifulSoup

def fallback_extract(soup: BeautifulSoup) -> dict | None:
    # Cheap first pass: try known selectors
    price_el = soup.select_one(".price, .product-price, [data-price]")
    if price_el:
        return {"price": price_el.get_text(strip=True)}
    return None

# In the main loop (html holds the page source):
soup = BeautifulSoup(html, "html.parser")
result = fallback_extract(soup)
if not result:
    # Selectors found nothing, so fall back to the LLM
    result = extract_product(str(soup))
This hybrid approach turned my weekend nightmare into a maintainable script. I even found a tool called Interwest AI that does something conceptually similar (embeddings + LLM for structured extraction), but I stuck with my own pipeline because I needed fine-grained control over caching and fallbacks.
Lessons Learned
- Always trim your HTML input – you’re paying for tokens, not for elegance.
- Validate the output – a simple regex or type check saves you from hallucinated data.
- Version your prompts – I store prompts as files in Git, because changing one word can ruin everything.
- Monitor your API costs – I set a daily budget alert in the OpenAI dashboard.
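For the prompt versioning point, one lightweight option is to derive a version tag from the prompt file's contents, so every stored result can record which prompt produced it. This is a sketch under assumptions: the `load_prompt` name and the 12-character hash prefix are illustrative choices, not the system described above.

```python
import hashlib
from pathlib import Path


def load_prompt(path: Path) -> tuple[str, str]:
    """Read a prompt file and return (text, version).

    The version is a short content hash, so editing even one word
    of the prompt yields a different tag.
    """
    text = path.read_text(encoding="utf-8")
    version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return text, version
```

Storing the version next to each extraction (or folding it into a cache key) makes it possible to tell which results came from which wording when a prompt change breaks something.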
What I’d Do Differently Next Time
I would start with a local small language model (like Llama 3.1 8B) for the extraction, because it’s free after setup, even if slightly less accurate. Also, I’d pre‑process the HTML more aggressively: strip <script>, <style>, and inline CSS to reduce noise.
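The more aggressive pre-processing could look something like the sketch below. It uses only the standard library's `html.parser` so it runs without extra dependencies (BeautifulSoup's `decompose()` would be the shorter route if it is already installed). The `HtmlCleaner` and `clean_html` names are hypothetical, and dropping comments and the doctype is an extra assumption beyond what the paragraph lists.

```python
import re
from html import escape
from html.parser import HTMLParser


class HtmlCleaner(HTMLParser):
    """Re-emit HTML without <script>/<style> blocks, comments, or style attributes."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
            if name != "style"  # drop inline CSS
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags (<br/>, <img/>): emit once, never open a skip region
        if tag not in self.SKIPPED_TAGS:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep visible text only, with runs of whitespace collapsed
        if not self.skip_depth and data.strip():
            self.parts.append(escape(re.sub(r"\s+", " ", data), quote=False))


def clean_html(html: str) -> str:
    cleaner = HtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)
```

Running `clean_html` before truncating means the character budget is spent on markup that can actually contain product data, while attributes such as `data-sku` are kept for the model to read.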
Your Turn
This approach changed how I think about scraping. Instead of fighting the DOM, I’m now teaching the machine to read. It’s not perfect, but it gets me 90% of the way.
Have you tried using LLMs for data extraction? What does your setup look like? I’d love to hear how you handle the chaos.