DEV Community

zhongqiyue

Why I Gave Up on Perfect Selectors and Asked GPT to Extract My Data

I’ve been building scrapers for years. CSS selectors, XPath, regex — I’ve written thousands of lines of code just to pull product names and prices from e-commerce sites. Every new site meant a new set of selectors. Sometimes the HTML would change slightly and my entire script would break. It was exhausting.

A few months ago, I needed to monitor prices across 30 different online stores. Each one had a completely different DOM structure. I spent two days writing custom selectors for the first five sites and realized I’d be at this for weeks. There had to be a better way.

What I Tried First (and What Failed)

First, I tried the old reliable: BeautifulSoup with a mix of find_all and select.

from bs4 import BeautifulSoup
import requests

res = requests.get('https://some-store.com/product/123')
soup = BeautifulSoup(res.text, 'html.parser')
price = soup.select_one('.price-tag').text

This works fine until the site changes its class names, or uses dynamic rendering (hello, React SPAs). Then I moved to Selenium to handle JavaScript. That worked, but it was slow and I still had to write site-specific selectors.

I even tried heuristic approaches: look for elements containing a dollar sign, or items with the highest numeric value in a certain container. These worked about 60% of the time — not good enough for a production system.
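
A heuristic of this kind might look roughly like the sketch below. The function name `guess_price` and the pattern are assumptions made for illustration, and only dollar amounts are handled.

```python
import re

# Matches dollar amounts such as $19.99 or $1,299.00 (dollar sign only, by assumption).
PRICE_RE = re.compile(r"\$\s*(\d+(?:,\d{3})*(?:\.\d+)?)")


def guess_price(text):
    """Return the largest dollar amount found in `text`, or None if there is none."""
    amounts = [float(m.replace(",", "")) for m in PRICE_RE.findall(text)]
    return max(amounts) if amounts else None


if __name__ == "__main__":
    # "Highest number wins" picks the struck-through price, not the sale price.
    print(guess_price("Was $24.99, now $19.99"))
```

The sample call shows why rules like this are fragile: on a page that lists both an old and a current price, taking the highest value returns the wrong one.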

The Idea That Changed Everything

I was at a meetup and someone mentioned they used GPT to extract structured data from customer emails. A light bulb went off. Why not feed the raw HTML to an LLM and ask it to return exactly what I need? I know LLMs are great at understanding natural language instructions — maybe they could understand HTML too?

I wrote a quick prototype using OpenAI’s API. The results were shocking. With a good prompt, GPT-4 could extract product name, price, and availability from an entire page of HTML — without any selectors.

How I Made It Work

Here’s the core approach. Instead of trying to parse the HTML, I send a trimmed snippet of it (under 8k tokens, so that the prompt and the reply fit in the base GPT-4 model’s 8,192-token context window) to an LLM with a system prompt that explains the schema I want back.

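import json

import openai  # this example uses the pre-1.0 SDK interface (openai.ChatCompletion)

openai.api_key = "sk-..."  # in real code, read the key from an environment variable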

def extract_product_data(html_snippet):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": """
You are a data extraction assistant. Given a snippet of HTML from a product page, extract the following fields and return them as valid JSON:
- name (string)
- price (float, remove currency symbols)
- availability (boolean, true if 'in stock' or 'add to cart' is present)
- currency (string, e.g. 'USD', 'EUR')
If a field cannot be found, set it to null.
Only return the JSON object, no extra text.
"""},
            {"role": "user", "content": html_snippet}
        ],
        temperature=0
    )
    content = response.choices[0].message.content
    # json.loads raises ValueError if the model returns anything but bare JSON
    return json.loads(content)

I then call this on a small cleaned-up version of the page HTML (I strip scripts, styles, and long inline text to reduce token count). The results are surprisingly consistent. For a test set of 10 different product pages, it got the price right 9 out of 10 times — the one failure was a page with multiple prices (strike-through vs actual).
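
A minimal sketch of that trimming step is shown below. It uses only the standard library so it runs without extra dependencies; calling BeautifulSoup's `decompose()` on the unwanted tags would be an equivalent route. The names `trim_html` and `SKIP_TAGS` are illustrative assumptions, and the sketch only drops `<script>`, `<style>`, `<svg>`, and comments.

```python
import re
from html.parser import HTMLParser

# Tags whose entire content is dropped before the HTML goes to the LLM.
SKIP_TAGS = {"script", "style", "svg"}


class _Trimmer(HTMLParser):
    """Rebuilds the HTML while leaving out everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept unless inside a skipped element.
        if tag not in SKIP_TAGS and self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse runs of whitespace; ignore whitespace-only text nodes.
        text = re.sub(r"\s+", " ", data)
        if self.skip_depth == 0 and text.strip():
            self.parts.append(text)


def trim_html(html):
    """Return `html` without scripts, styles, SVGs, and comments."""
    trimmer = _Trimmer()
    trimmer.feed(html)
    trimmer.close()
    return "".join(trimmer.parts)
```

The trimmed string is what would be passed to `extract_product_data`. Character references are decoded by the parser along the way, which is harmless for this purpose because the output is only read by the model.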

The Pitfalls I Discovered

This approach isn’t magic. Here’s what I ran into:

  • Token limits: Full pages can be huge. I had to trim the HTML aggressively. I now extract only the <body> and remove <script>, <style>, and <svg> tags before sending. Even then, some pages exceed 8k tokens.
  • Cost: Each request costs ~$0.02–0.05. For 30 sites, scraping once an hour, that’s 720 requests a day, or up to about $36 per day at the high end (roughly $14 at the low end). That is not cheap. I switched to GPT-3.5-turbo for most calls, which is 10x cheaper but slightly less accurate.
  • Hallucination: I’ve seen the LLM invent a price if none is present, or guess a name from a menu item. I added validation: if the price is more than 3 standard deviations from historical data, flag it for manual review.
  • Latency: API calls take 2–5 seconds per page. If you need hundreds of pages, this won’t scale. I use async batching and limit concurrency.
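
The three-standard-deviation check from the hallucination point can be sketched with the standard library. The function name `is_price_outlier` and the handling of short or constant histories are assumptions, not details from the original pipeline.

```python
import statistics


def is_price_outlier(price, history, threshold=3.0):
    """Return True if `price` is more than `threshold` standard deviations
    away from the mean of `history` (a list of earlier prices)."""
    if len(history) < 2:
        return False  # assumption: too little data to judge, so let it through
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # Constant history: any different price counts as an outlier.
        return price != mean
    return abs(price - mean) / stdev > threshold


if __name__ == "__main__":
    past = [19.99, 20.49, 19.79, 20.19, 19.99]
    print(is_price_outlier(20.09, past))   # ordinary fluctuation
    print(is_price_outlier(199.90, past))  # likely a hallucinated or misread price
```

A flagged price would go to manual review rather than being discarded, since genuine sales and pricing errors on the store's side can also trip the check.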

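Limiting concurrency, as mentioned in the latency point, can be done with an `asyncio.Semaphore`. This is a sketch under assumptions: `extract_all` is a hypothetical helper, and `extract` stands for any async callable that processes one page.

```python
import asyncio


async def extract_all(pages, extract, max_concurrency=5):
    """Run the async callable `extract` over `pages`, keeping at most
    `max_concurrency` calls in flight. Results come back in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(page):
        async with semaphore:
            return await extract(page)

    return await asyncio.gather(*(worker(page) for page in pages))


if __name__ == "__main__":
    async def fake_extract(page):
        # Stand-in for a real API call; sleeps instead of hitting the network.
        await asyncio.sleep(0.1)
        return {"page": page}

    print(asyncio.run(extract_all(["a", "b", "c"], fake_extract, max_concurrency=2)))
```

A blocking function such as `extract_product_data` could be wrapped with `asyncio.to_thread` (Python 3.9+) to fit this shape.
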
When This Approach Works (and When It Doesn’t)

This technique is excellent for:

  • Pages with highly variable structure (different e-commerce platforms)
  • When you only need a handful of fields
  • Prototyping or small-scale projects

It’s not great for:

  • High-volume scraping (thousands of pages per hour)
  • When you need perfect accuracy (the LLM will sometimes fail)
  • Scraping behind login or CAPTCHAs (you still need to handle that separately)

What I’d Do Differently Next Time

If I were starting over, I’d:

  • First try a simple regex-based extraction on the HTML (e.g., look for "price": patterns) before calling the LLM. This catches 80% of cases instantly.
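  • Use a cheaper model like GPT-3.5-turbo-instruct for simple extractions. It is served through the completions endpoint rather than the chat endpoint, so the system and user messages would have to be merged into a single prompt string.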
  • Implement a caching layer so the same page isn’t re-processed if the HTML hasn’t changed.
  • Build a small validation pipeline that compares LLM output against known patterns (e.g., numeric price, non-empty name).
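
The regex-first idea can be sketched as follows. The pattern targets JSON-style `"price":` pairs, which many product pages embed in structured-data blocks. The names `quick_price` and `extract_price` are hypothetical, and `llm_extract` stands for any callable that returns a dict like the one from `extract_product_data`.

```python
import re

# Matches "price": "19.99" or "price": 19.99. Values with thousands
# separators are deliberately not matched, so they fall through to the LLM.
JSON_PRICE_RE = re.compile(
    r'"price"\s*:\s*(?:"(\d+(?:\.\d+)?)"|(\d+(?:\.\d+)?)(?=\s*[,}\]]))'
)


def quick_price(html):
    """Return the first JSON-style price found in `html`, or None."""
    match = JSON_PRICE_RE.search(html)
    if match is None:
        return None
    return float(match.group(1) or match.group(2))


def extract_price(html, llm_extract):
    """Try the cheap regex first; call the LLM only when it finds nothing."""
    price = quick_price(html)
    if price is not None:
        return price
    return llm_extract(html).get("price")
```

Only the pages where the regex comes up empty incur an API call, which reduces both cost and latency.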

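Caching and validation can be combined in a small wrapper. This is a sketch under assumptions: `ExtractionCache` and `is_valid_record` are hypothetical names, the cache lives in memory only, and the validity rules are just the two examples from the list above (numeric price, non-empty name).

```python
import hashlib


def is_valid_record(record):
    """Check the basic patterns: a non-empty name and a positive numeric price."""
    name = record.get("name")
    price = record.get("price")
    has_name = isinstance(name, str) and name.strip() != ""
    has_price = (
        isinstance(price, (int, float))
        and not isinstance(price, bool)  # bool is a subclass of int
        and price > 0
    )
    return has_name and has_price


class ExtractionCache:
    """Caches extraction results keyed by a hash of the HTML."""

    def __init__(self, extract):
        self._extract = extract  # callable: html -> dict
        self._results = {}

    def get(self, html):
        key = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if key not in self._results:
            record = self._extract(html)
            if not is_valid_record(record):
                return None  # invalid results are not cached, so they get retried
            self._results[key] = record
        return self._results[key]
```

Hashing the trimmed HTML rather than the raw page is likely to give more cache hits, because raw pages often contain tokens or timestamps that change on every request.
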
I also came across services like InterWest AI that offer pre-built extraction APIs. If I needed a production-ready solution without managing my own prompt pipelines, I’d evaluate those — but for my side project, the manual approach taught me a ton.

Final Thoughts

Using an LLM to extract data from HTML felt like cheating at first. But it turns out that for messy, semi-structured content, natural language understanding is often more robust than rigid selectors. I still use traditional parsing for well-behaved sites. But for those chaotic e-commerce pages? I’ll take the GPT route any day.

What’s your go-to method for extracting data from wildly different HTML structures? I’d love to hear how others handle this.
