Why I Gave Up on Perfect Selectors and Asked GPT to Extract My Data

#ai #webdev #python #api

I’ve been building scrapers for years. CSS selectors, XPath, regex — I’ve written thousands of lines of code just to pull product names and prices from e-commerce sites. Every new site meant a new set of selectors. Sometimes the HTML would change slightly and my entire script would break. It was exhausting.

A few months ago, I needed to monitor prices across 30 different online stores. Each one had a completely different DOM structure. I spent two days writing custom selectors for the first five sites and realized I’d be at this for weeks. There had to be a better way.

What I Tried First (and What Failed)

First, I tried the old reliable: BeautifulSoup with a mix of find_all and select.

from bs4 import BeautifulSoup
import requests

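res = requests.get('https://some-store.com/product/123', timeout=10)
res.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(res.text, 'html.parser')
tag = soup.select_one('.price-tag')  # returns None if the class name ever changes
price = tag.get_text(strip=True) if tag else None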

This works fine until the site changes its class names, or uses dynamic rendering (hello, React SPAs). Then I moved to Selenium to handle JavaScript. That worked, but it was slow and I still had to write site-specific selectors.

I even tried heuristic approaches: look for elements containing a dollar sign, or items with the highest numeric value in a certain container. These worked about 60% of the time — not good enough for a production system.
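
To make the dollar-sign heuristic concrete, here is a minimal sketch of that kind of rule. The function name and the regex are illustrative assumptions, and it only handles USD-style prices.

```python
import re

# Matches amounts like "$1,299.99", "$ 19" or "$5.5".
# USD-style only; this pattern is an assumption made for the sketch.
PRICE_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{1,2})?)")

def guess_price(text):
    """Return the first dollar amount found in text as a float, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group(1).replace(",", ""))
```

The weakness shows up quickly: on text like "Was $24.99 now $19.99" this returns the old price, because it has no idea which number is the one that matters.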

The Idea That Changed Everything

I was at a meetup and someone mentioned they used GPT to extract structured data from customer emails. A light bulb went off. Why not feed the raw HTML to an LLM and ask it to return exactly what I need? I know LLMs are great at understanding natural language instructions — maybe they could understand HTML too?

I wrote a quick prototype using OpenAI’s API, and the results surprised me. With a good prompt, GPT-4 could extract the product name, price, and availability from a page’s HTML without any selectors.

How I Made It Work

Here’s the core approach. Instead of trying to parse the HTML, I send a snippet of it (less than 8k tokens) to an LLM with a system prompt that explains the schema I want back.

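import json
from openai import OpenAI

# Requires openai>=1.0; the older openai.ChatCompletion interface was removed in that release.
client = OpenAI(api_key="sk-...")

def extract_product_data(html_snippet):
    response = client.chat.completions.create(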
        model="gpt-4",
        messages=[
            {"role": "system", "content": """
You are a data extraction assistant. Given a snippet of HTML from a product page, extract the following fields and return them as valid JSON:
- name (string)
- price (float, remove currency symbols)
- availability (boolean, true if 'in stock' or 'add to cart' is present)
- currency (string, e.g. 'USD', 'EUR')
If a field cannot be found, set it to null.
Only return the JSON object, no extra text.
"""},
            {"role": "user", "content": html_snippet}
        ],
        temperature=0
    )
    content = response.choices[0].message.content
    return json.loads(content)

I then call this on a small cleaned-up version of the page HTML (I strip scripts, styles, and long blocks of inline text to reduce the token count). The results are surprisingly consistent. On a test set of 10 different product pages, it got the price right on 9; the one failure was a page that showed two prices (a strike-through price next to the actual one).
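
A minimal sketch of that cleanup step, using only the standard library so it runs without extra installs. The names `clean_html` and `SKIP_TAGS` are made up for this example, and dropping `<head>` is my stand-in for "keep only the body"; a BeautifulSoup version would work just as well.

```python
import html
import re
from html.parser import HTMLParser

# Subtrees dropped entirely. "head" approximates keeping only <body>.
SKIP_TAGS = {"head", "script", "style", "svg"}

class _Stripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the skip depth.
        if tag not in SKIP_TAGS and self._skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            # Collapse whitespace runs and re-escape the decoded text.
            text = re.sub(r"\s+", " ", data)
            self.parts.append(html.escape(text, quote=False))

def clean_html(html_text):
    """Return html_text without head/script/style/svg subtrees or comments."""
    parser = _Stripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

Comments and the doctype disappear as a side effect, since the parser's default handlers for them do nothing. The output of `clean_html(page)` is what would go into `extract_product_data`.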

The Pitfalls I Discovered

This approach isn’t magic. Here’s what I ran into:

Token limits: Full pages can be huge. I had to trim the HTML aggressively. I now extract only the <body> and remove <script>, <style>, and <svg> tags before sending. Even then, some pages exceed 8k tokens.
Cost: Each request costs ~$0.02–0.05. For 30 sites scraped once an hour, that’s 720 requests and roughly $14 to $36 per day, which is not cheap. I switched to GPT-3.5-turbo for most calls, which is 10x cheaper but slightly less accurate.
Hallucination: I’ve seen the LLM invent a price if none is present, or guess a name from a menu item. I added validation: if the price is more than 3 standard deviations from historical data, flag it for manual review.
Latency: API calls take 2–5 seconds per page. Run one after another, a few hundred pages take many minutes, so this won’t scale on its own. I use async batching and limit concurrency.
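
Two of these mitigations are easy to sketch: the three-standard-deviation check and the concurrency cap. Both functions below are illustrative assumptions rather than a reference implementation. `extract_one` stands for any async wrapper around the LLM call, and treating a missing price or a short history as "needs review" is a choice made for this sketch.

```python
import asyncio
import statistics

def is_price_outlier(price, history, threshold=3.0):
    """True if price lies more than `threshold` sample standard
    deviations from the mean of past prices for the same product."""
    if price is None or len(history) < 2:
        return True  # nothing to compare against: flag for manual review
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return price != mean
    return abs(price - mean) > threshold * stdev

async def extract_many(pages, extract_one, max_concurrency=5):
    """Call extract_one(html) for every page, with at most
    max_concurrency calls in flight. Results keep the input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page_html):
        async with semaphore:
            return await extract_one(page_html)

    return await asyncio.gather(*(bounded(p) for p in pages))
```

The semaphore keeps the number of simultaneous requests under the cap, which helps with rate limits as much as with latency.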

When This Approach Works (and When It Doesn’t)

This technique is excellent for:

Pages with highly variable structure (different e-commerce platforms)
Cases where you only need a handful of fields
Prototyping or small-scale projects

It’s not great for:

High-volume scraping (thousands of pages per hour)
Cases where you need perfect accuracy (the LLM will sometimes fail)
Scraping behind login or CAPTCHAs (you still need to handle that separately)

What I’d Do Differently Next Time

If I were starting over, I’d:

First try a simple regex-based extraction on the HTML (e.g., look for "price": patterns) before calling the LLM. This catches 80% of cases instantly.
Use a cheaper model like GPT-3.5-turbo-instruct for simple extractions.
Implement a caching layer so the same page isn’t re-processed if the HTML hasn’t changed.
Build a small validation pipeline that compares LLM output against known patterns (e.g., numeric price, non-empty name).
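
Put together, the regex-first check, the cache, and the validation step could look like the sketch below. Everything here is an assumption for illustration: the `"price":` pattern, the in-memory dict used as a cache, and `llm_extract`, which stands for any callable that returns a dict shaped like the output of `extract_product_data`.

```python
import hashlib
import re

# Matches JSON-style fields such as "price": "19.99" or "price": 19.99.
PRICE_FIELD_RE = re.compile(r'"price"\s*:\s*"?(\d+(?:\.\d+)?)"?')

def quick_price(raw_html):
    """Cheap first pass: read a price straight from embedded JSON."""
    match = PRICE_FIELD_RE.search(raw_html)
    return float(match.group(1)) if match else None

def is_valid(record):
    """Sanity checks on LLM output: non-empty name, positive numeric price."""
    name = record.get("name")
    price = record.get("price")
    return (
        isinstance(name, str) and bool(name.strip())
        and isinstance(price, (int, float)) and not isinstance(price, bool)
        and price > 0
    )

def get_price(raw_html, llm_extract, cache):
    """Return a price for the page, calling the LLM only when needed."""
    key = hashlib.sha256(raw_html.encode("utf-8")).hexdigest()
    if key in cache:  # unchanged HTML: reuse the earlier result
        return cache[key]
    price = quick_price(raw_html)
    if price is None:
        record = llm_extract(raw_html)
        price = record["price"] if is_valid(record) else None
    if price is not None:  # failures are not cached, so they get retried
        cache[key] = price
    return price
```

One detail worth noting: the regex pass should run on the raw HTML, because embedded JSON usually lives inside script tags, and those are exactly what gets stripped before the LLM call.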

I also came across services like InterWest AI that offer pre-built extraction APIs. If I needed a production-ready solution without managing my own prompt pipelines, I’d evaluate those — but for my side project, the manual approach taught me a ton.

Final Thoughts

Using an LLM to extract data from HTML felt like cheating at first. But it turns out that for messy, semi-structured content, natural language understanding is often more robust than rigid selectors. I still use traditional parsing for well-behaved sites. But for those chaotic e-commerce pages? I’ll take the GPT route any day.

What’s your go-to method for extracting data from wildly different HTML structures? I’d love to hear how others handle this.