DEV Community

zhongqiyue

I stopped fighting broken parsers — here's how I use LLMs to extract web data reliably

A few months ago, I was building a price tracker for limited-edition sneakers. I had a list of 50+ store URLs, and I needed to extract product name, price, availability, and size options. Classic scraping, right?

I started with CSS selectors. BeautifulSoup + requests. It worked for about a week. Then one site changed its class names. Another added a dynamic loader. A third injected ads that shifted the DOM. I spent more time fixing selectors than actually using the data.

I tried regex on the raw HTML. That was a disaster — fragile and unreadable. I tried headless browsers with Playwright, waiting for specific elements. Still broke when the layout changed.

The problem was fundamental: I was trying to reverse-engineer the presentation layer. But what I really wanted was the meaning of the content — the product's price, not the CSS class it lived in.

The turning point: LLMs for structured extraction

I had been using GPT for text generation, but I hadn't considered it for data extraction. Then I saw a demo of function calling — you can ask an LLM to return a JSON object matching a schema. That's exactly what I needed: give the model some raw HTML (or just the text), define the fields I want, and let it figure out the mapping.

Here's the core idea:

  1. Fetch the page (or use a simplified text version)
  2. Define a schema for the data you want
  3. Call the LLM with the page content and the schema
  4. Get back structured JSON

It's not magic — there are trade-offs — but it solved my immediate problem beautifully.

Code example with OpenAI function calling

Let me show you a concrete example. I'll use Python and the OpenAI API. The key is the functions parameter that tells the model exactly what JSON shape to return.

import json

import openai
import requests  # used in the usage example at the bottom
from pydantic import BaseModel

# Define your schema using Pydantic
class ProductExtract(BaseModel):
    name: str
    price: float
    currency: str = "USD"
    in_stock: bool
    size_options: list[str] = []
    url: str

# Convert schema to OpenAI function definition
# (takes the model class itself, not an instance)
def schema_to_function(model: type[BaseModel]):
    schema = model.model_json_schema()
    return {
        "name": "extract_product",
        "description": "Extract product information from webpage text",
        "parameters": schema
    }

# Example: scrape a product page
def extract_product_from_html(html_text: str, url: str) -> ProductExtract:
    # Truncate long text to fit context window
    truncated = html_text[:8000]

    system_prompt = """
    You are a data extraction assistant. Given the HTML content of a product page,
    extract the requested fields. If a field is not found, use null or empty list.
    Return only valid JSON matching the given schema.
    """

    response = openai.chat.completions.create(
        model="gpt-4o-mini",  # cheaper, fast enough
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Page content:\n\n{truncated}\n\nURL: {url}"}
        ],
        functions=[schema_to_function(ProductExtract)],
        function_call={"name": "extract_product"},
        temperature=0.0
    )

    args = response.choices[0].message.function_call.arguments
    return ProductExtract(**json.loads(args))

# Usage
url = "https://example-sneaker-store.com/air-max-2024"
raw_html = requests.get(url, timeout=10).text
product = extract_product_from_html(raw_html, url)
print(product.model_dump_json(indent=2))

This returns something like:

{
  "name": "Air Max 2024 'Triple Black'",
  "price": 189.99,
  "currency": "USD",
  "in_stock": true,
  "size_options": ["7", "8", "9", "10", "11", "12"],
  "url": "https://example-sneaker-store.com/air-max-2024"
}

The function definition tells the model exactly what fields to fill. No CSS, no regex, no fragile selectors.
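
One caveat about the snippet above: functions and function_call are the older spelling of this feature in the OpenAI chat completions API, and the current parameters are tools and tool_choice. Here's a minimal sketch of the equivalent request arguments. The hand-written JSON schema stands in for the one Pydantic generates, and the helper name build_tool_request is made up for illustration.

```python
import json

# Hand-written JSON Schema standing in for ProductExtract.model_json_schema().
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
        "size_options": {"type": "array", "items": {"type": "string"}},
        "url": {"type": "string"},
    },
    "required": ["name", "price", "in_stock", "url"],
}


def build_tool_request(schema: dict, name: str = "extract_product") -> dict:
    """Build the keyword arguments that take the place of functions/function_call.

    Hypothetical helper, not part of the OpenAI SDK.
    """
    return {
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": name,
                    "description": "Extract product information from webpage text",
                    "parameters": schema,
                },
            }
        ],
        # Force the model to call this one tool instead of replying in prose.
        "tool_choice": {"type": "function", "function": {"name": name}},
    }


request_kwargs = build_tool_request(PRODUCT_SCHEMA)
print(json.dumps(request_kwargs["tool_choice"]))
```

You would pass these as openai.chat.completions.create(model=..., messages=..., **request_kwargs). The arguments then come back in response.choices[0].message.tool_calls[0].function.arguments instead of message.function_call.arguments.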

When this approach shines

  • Frequently changing sites — blogs, e-commerce, news where HTML structure changes often
  • Heterogeneous sources — you're scraping 20 different sites, each with its own layout
  • Text-heavy content — articles, reviews, descriptions where the meaning is in the text, not the tags

I've been using this for my sneaker tracker for three months now. I've only had to adjust prompts twice (when a site started using images for prices, which I handled by adding an image-to-text step).

Trade-offs and limitations

Let's be real: this isn't a silver bullet.

Cost: Every extraction costs a fraction of a cent. For my 50 sites checked once a day, it's about $1.50/month. Fine. But if you're scraping millions of pages, the cost adds up fast.
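
The arithmetic behind that figure is worth spelling out, because it scales linearly. A quick sketch using only the numbers above (50 sites, one check a day, about $1.50/month). The 30-day month and the one-million-page scenario are my assumptions for illustration.

```python
SITES = 50
CHECKS_PER_DAY = 1
DAYS_PER_MONTH = 30        # assumption: a 30-day month
MONTHLY_COST_USD = 1.50    # the figure quoted above

calls_per_month = SITES * CHECKS_PER_DAY * DAYS_PER_MONTH
cost_per_call = MONTHLY_COST_USD / calls_per_month


def projected_cost(pages: int) -> float:
    """Cost in USD for a given number of pages at the same per-page rate."""
    return pages * cost_per_call


print(calls_per_month)                                         # 1500
print(f"${cost_per_call:.4f} per page")                        # $0.0010 per page
print(f"${projected_cost(1_000_000):,.0f} per million pages")  # $1,000 per million pages
```

So that's 1,500 calls a month at roughly a tenth of a cent each. At the same per-page rate, a million pages would land somewhere around $1,000.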

Latency: The API call takes 1-3 seconds per page. For a few hundred pages it's okay, but not for real-time scraping.

Hallucination: The model might invent data if it's not present. I've seen it return prices that don't exist. Mitigation: ask for confidence scores or require source text evidence. But that adds complexity.
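
Here's a minimal sketch of the "require source text evidence" idea: have the model return the exact snippet it read the price from, then check in plain Python that the snippet really occurs on the page and really contains that number. The evidence field and the helper name price_is_supported are hypothetical additions, not part of the schema above.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def price_is_supported(page_text: str, price: float, evidence: str) -> bool:
    """Return True only if the quoted evidence is on the page and contains the price."""
    if not evidence or _normalize(evidence) not in _normalize(page_text):
        return False  # the model quoted text that is not on the page
    # Pull every number out of the evidence, tolerating thousands separators.
    numbers = [
        float(n.replace(",", ""))
        for n in re.findall(r"\d[\d,]*(?:\.\d+)?", evidence)
    ]
    return any(abs(n - price) < 0.005 for n in numbers)


page = "Air Max 2024 'Triple Black'\nPrice:   $189.99\nIn stock"
print(price_is_supported(page, 189.99, "Price: $189.99"))    # True
print(price_is_supported(page, 149.99, "Now only $149.99"))  # False: quote is not on the page
```

This won't catch a model that quotes the wrong real snippet (say, a crossed-out old price), but it does reject prices that appear nowhere in the source.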

Context window: You can't dump the whole page. I truncate to 8000 chars, but sometimes the price is in a part of the HTML I cut off. Solution: pre-process the HTML to extract the main content area using something like trafilatura.
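
To make the pre-processing step concrete, here's a rough standard-library sketch that drops script, style, and similar non-content tags and collapses whitespace before truncating. It's a stand-in for a real main-content extractor like trafilatura, not a replacement for one, and the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping tags that never hold visible content."""

    SKIP = {"script", "style", "noscript", "template", "svg"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that sits outside the skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str, max_chars: int = 8000) -> str:
    """Strip markup first, then truncate, so the budget is spent on content."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)[:max_chars]


sample = (
    "<html><head><style>.p{color:red}</style><script>var x = 1;</script></head>"
    "<body><h1>Air Max 2024</h1><p>Price: <span>$189.99</span></p></body></html>"
)
print(html_to_text(sample))  # Air Max 2024 Price: $189.99
```

Truncating after stripping markup means the 8000 characters go to visible text instead of inline CSS and JavaScript, which makes it much less likely that the price gets cut off.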

Privacy: If you're scraping sensitive/internal sites, sending data to an API might be a concern. You could use a local model (e.g., Ollama with function calling) but results are less reliable.

Alternatives to consider

  • Traditional scraping + ML: Use a library like readabilipy to extract main content, then apply a smaller model for NER on prices. Cheaper but more setup.
  • Visual scraping tools: Services like Diffbot or Browse AI. They work but are pricey and sometimes still break.
  • Prompt-only extraction: Without function calling, just ask for JSON. Less reliable because the model might deviate from the schema.

What I'd do differently next time

I'd start with a hybrid approach: use a lightweight HTML parser to extract common patterns (like prices with a $ sign) as a cheap first pass. If that fails, fall back to the LLM. This reduces cost and improves speed for the 80% of pages that have standard formatting.
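
A sketch of that hybrid flow, assuming a deliberately simple dollar-price regex and an LLM extractor passed in as a callable. The names find_price_cheaply and get_price are hypothetical, and the rule of only trusting the regex when the page shows exactly one distinct price is my own assumption about what counts as "standard formatting".

```python
import re
from typing import Callable, Optional

# A dollar sign, optional space, digits with optional thousands separators and cents.
PRICE_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{1,2})?)")


def find_price_cheaply(text: str) -> Optional[float]:
    """Return the price if the text contains exactly one distinct $ amount, else None."""
    prices = {float(m.replace(",", "")) for m in PRICE_RE.findall(text)}
    return prices.pop() if len(prices) == 1 else None


def get_price(
    text: str, llm_fallback: Callable[[str], Optional[float]]
) -> tuple[Optional[float], str]:
    """Try the regex first; only pay for the LLM when the cheap path is ambiguous."""
    price = find_price_cheaply(text)
    if price is not None:
        return price, "regex"
    return llm_fallback(text), "llm"


# A stub stands in for the real LLM call here.
print(get_price("Air Max 2024, now $189.99", llm_fallback=lambda text: None))
# (189.99, 'regex')
```

Returning the source label alongside the price makes it easy to measure how often the cheap path actually wins before trusting that 80% estimate.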

Also, I'd build a small validation step: after extraction, check if the price looks plausible (positive number, within expected range) and retry with a different prompt if not.
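
A sketch of that validate-and-retry loop. The extractor is passed in as a callable taking the page text and a prompt, so the same loop works with any model call. The bounds and prompt strings are placeholders I picked for illustration, not tuned values.

```python
from typing import Callable, Optional, Sequence


def is_plausible_price(
    price: Optional[float], low: float = 1.0, high: float = 5000.0
) -> bool:
    """A price is plausible if it is a real number inside the expected range."""
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        return False
    return low <= price <= high


def extract_with_retry(
    page_text: str,
    extract: Callable[[str, str], Optional[float]],
    prompts: Sequence[str],
) -> Optional[float]:
    """Try each prompt in order and return the first plausible price, else None."""
    for prompt in prompts:
        price = extract(page_text, prompt)
        if is_plausible_price(price):
            return price
    return None


# Stub extractor: the first prompt yields garbage, the second a sane value.
def fake_extract(page_text: str, prompt: str) -> Optional[float]:
    return -1.0 if prompt == "default" else 189.99


print(extract_with_retry("...", fake_extract, ["default", "stricter"]))  # 189.99
```

Returning None when every prompt fails keeps bad data out of the tracker. A missing price is easier to spot and re-run than a wrong one.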

Final thoughts

LLMs aren't just for chat or content generation. They're surprisingly good at turning messy, unstructured data into clean JSON — as long as you define exactly what you want. If you're tired of chasing HTML changes, give this technique a try. Start with a small set of pages, measure the accuracy, and see if the trade-offs work for you.

What's your approach to pulling structured data from wild, untamed web pages? I'd love to hear how others handle this.
