Extracting structured data from messy text with LLMs (and why regex failed)

#tutorial #ai #webdev #python

I spent a weekend trying to scrape product listings from a dozen different e-commerce sites. The goal was simple: get name, price, availability, and description into a clean JSON array. What I got was a painful reminder that the web is a beautiful mess of inconsistent HTML, misspellings, and “creative” formatting.

The messy reality

I started with BeautifulSoup and regex. Each site had its own quirks – some wrapped prices in <span class="price">, others used data-price attributes, and one site just wrote “$19.99” inside a <p> tag with no class. My extraction logic grew into a nested if-else nightmare:

import re
from bs4 import BeautifulSoup

def extract_price(soup):
    # Attempt 1: common class
    price_tag = soup.find('span', class_='price')
    if price_tag:
        return price_tag.text.strip()
    # Attempt 2: data attribute (find() takes a tag name or filters, not a CSS selector)
    price_tag = soup.find(attrs={'data-price': True})
    if price_tag:
        return price_tag['data-price']
    # Attempt 3: regex fallback
    text = soup.get_text()
    match = re.search(r'\$?(\d+\.\d{2})', text)
    if match:
        return match.group(1)
    return None

This worked for maybe 60% of the cases. The rest? Wrong prices, missing data, or false positives. One product description included “Price: $0.00” as a placeholder, and because re.search returns the first match in the text, my regex grabbed that instead of the real price. I needed a better way.
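To make that failure concrete, here is a small reproduction using the same pattern as the regex fallback. The sample text is made up for illustration; it is not from any of the sites I scraped.

```python
import re

# Same pattern as the regex fallback in extract_price()
PRICE_RE = re.compile(r'\$?(\d+\.\d{2})')

# Hypothetical page text: a placeholder appears before the real price
text = "Price: $0.00 (placeholder) ... Widget Pro now only $19.99"

# re.search stops at the first match, which is the placeholder
first = PRICE_RE.search(text).group(1)

# findall shows that the real price was there, just later in the text
all_matches = PRICE_RE.findall(text)

print(first)        # 0.00
print(all_matches)  # ['0.00', '19.99']
```

The regex has no idea which number is "the" price. Picking the last match or the largest one just trades this bug for a different one.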

Why traditional parsers fall short

The fundamental issue is that HTML structure is not semantic. Two sites can display the same piece of information in completely different ways. Even on a single site, product cards would have slight variations – an extra <div> here, a missing class there. My parsing logic was brittle and required constant maintenance.

I considered using machine learning for text classification, but training a custom model for each field seemed overkill. Then I remembered: large language models (LLMs) are pretty good at understanding context and extracting information, as long as you ask them nicely.

The LLM approach

Instead of writing rules for every possible HTML structure, I could feed the raw HTML (or better, the visible text) to an LLM and ask it to extract the fields I needed in a structured format. The key technique is function calling (or tool use) – telling the LLM to output JSON in a specific schema.

I used OpenAI's GPT-4, but the same pattern works with any model that supports structured output (Claude, Gemini, local models via Ollama). Here's what I ended up with:

from openai import OpenAI
import json

client = OpenAI(api_key="sk-...")  # Or use a service like ai.interwestinfo.com for managed extraction

def extract_product_info(html_text: str) -> dict:
    """Extract structured product info from raw HTML text."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are a data extraction assistant. Extract product fields from the given HTML source. Return only valid JSON."
            },
            {
                "role": "user",
                "content": f"Extract the following fields from this HTML: name, price, availability, description. HTML:\n{html_text[:4000]}"
            }
        ],
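        # "tools" / "tool_choice" replace the deprecated "functions" / "function_call" parameters
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "extract_product",
                    "description": "Extract product info",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                            "availability": {"type": "boolean"},
                            "description": {"type": "string"}
                        },
                        "required": ["name", "price"]
                    }
                }
            }
        ],
        # Force the model to call extract_product so the reply follows the schema
        tool_choice={"type": "function", "function": {"name": "extract_product"}}
    )

    # The forced tool call carries the extracted fields as a JSON string
    args = response.choices[0].message.tool_calls[0].function.arguments
    return json.loads(args)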

# Example usage
html = """<div class="product"><h3>Widget Pro</h3><p class="price">$24.99</p>
<span class="stock">In Stock</span><div class="desc">High-quality widget for pros</div></div>"""

result = extract_product_info(html)
print(result)
# Typical output (model responses can vary between runs):
# {'name': 'Widget Pro', 'price': '$24.99', 'availability': True, 'description': 'High-quality widget for pros'}

This worked shockingly well. Even when the HTML had extra fluff or slightly different class names, the LLM figured out the intent. I tested it on 50 random product pages from different sites, and it correctly extracted all four fields in 84% of cases – versus 62% with my regex approach.
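For anyone who wants to run the same kind of comparison, here is a minimal sketch of an "all four fields correct" score. The function names and the two toy records are hypothetical, and exact string equality is an assumption; a real evaluation would probably need some normalization (whitespace, currency formatting) before comparing.

```python
FIELDS = ("name", "price", "availability", "description")

def all_fields_correct(predicted: dict, expected: dict) -> bool:
    """True only if every field matches the hand-labeled value exactly."""
    return all(predicted.get(f) == expected.get(f) for f in FIELDS)

def exact_match_rate(predictions: list, gold: list) -> float:
    """Fraction of pages where all fields were extracted correctly."""
    if not gold:
        return 0.0
    hits = sum(all_fields_correct(p, g) for p, g in zip(predictions, gold))
    return hits / len(gold)

# Toy data for illustration only: the second prediction has a wrong price
gold = [
    {"name": "Widget Pro", "price": "$24.99", "availability": True,
     "description": "High-quality widget for pros"},
    {"name": "Gadget", "price": "$5.00", "availability": True,
     "description": "A gadget"},
]
predictions = [
    dict(gold[0]),
    {**gold[1], "price": "$0.00"},
]

rate = exact_match_rate(predictions, gold)
print(rate)  # 0.5
```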

Lessons learned and trade-offs

Accuracy isn't perfect. The LLM sometimes hallucinated prices (e.g., “$0.00” when no price was found on the page) or misidentified availability. I added a post-processing step that validates each field against a simple pattern (e.g., price must match \$\d+\.\d{2}).
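A post-processing check along those lines might look like the sketch below. It uses only the standard library; validate_product and its messages are names I made up for this example, and treating “$0.00” as suspicious is an assumption that fits my data, not a general rule.

```python
import re

# Price pattern from the text above, anchored so the whole string must match
PRICE_PATTERN = re.compile(r'^\$\d+\.\d{2}$')

def validate_product(record: dict) -> list:
    """Return a list of problems found in one extracted record (empty if none)."""
    problems = []

    name = record.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("name is missing or empty")

    price = record.get("price")
    if not isinstance(price, str) or not PRICE_PATTERN.match(price):
        problems.append(f"price {price!r} does not look like $12.34")
    elif price == "$0.00":
        # Assumption: a zero price is a placeholder or a hallucination
        problems.append("price is $0.00, likely a placeholder")

    availability = record.get("availability")
    if availability is not None and not isinstance(availability, bool):
        problems.append("availability is not a boolean")

    return problems

print(validate_product({"name": "Widget Pro", "price": "$24.99", "availability": True}))
# []
print(validate_product({"name": "Widget Pro", "price": "$0.00"}))
# ['price is $0.00, likely a placeholder']
```

Note that this pattern rejects prices with thousands separators such as “$1,299.00” and anything in another currency, so it needs loosening if your sources include those.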

Cost and latency. Every extraction consumes tokens, and for high-volume scraping that adds up. I limited HTML input to 4000 characters (roughly 1000 tokens) and batched requests where possible. On average, a single product took 2-3 seconds: acceptable for a few hundred products, not for millions.

Privacy concerns. Sending full HTML to a third-party API means you're sharing potentially sensitive data (customer reviews, user scripts). For internal tools, I'd run a local model like Llama 3 via Ollama with the same function calling pattern.

When NOT to use this. If the HTML structure is perfectly consistent (e.g., internal admin pages with predictable forms), a traditional parser is faster, cheaper, and more reliable. LLMs shine when you have many different sources or semi-structured text (emails, PDFs, chat logs).

What I'd do differently next time

Use a Pydantic schema for better validation instead of raw JSON.
Feed only visible text – strip all HTML tags first to reduce tokens and noise.
Cache results – if the same page is scraped again, skip the LLM call.
Try smaller models – GPT-3.5 was cheaper but less accurate; for some fields, a fine-tuned small model might work.
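Two of these ideas, feeding only visible text and caching results, are easy to sketch with the standard library. This is a rough illustration rather than what I actually ran: visible_text, cached_extract, and the in-memory dict are all hypothetical, and a real scraper would want a persistent cache (a file or a database) instead.

```python
import hashlib
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of script/style/noscript."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def visible_text(html: str) -> str:
    """Strip tags and return the text nodes joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.parts)

_cache = {}  # maps SHA-256 of the visible text to the extraction result

def cached_extract(html: str, extract_fn):
    """Call extract_fn on the visible text only if this content is new."""
    text = visible_text(html)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = extract_fn(text)
    return _cache[key]
```

With this in place, cached_extract(html, extract_product_info) would send the stripped text to the model once and reuse the result on later runs. Keying on the visible text rather than the URL means a page whose markup changes but whose content doesn't still hits the cache.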

I also discovered that prepending a few examples of correct extraction (few-shot prompting) significantly improved accuracy for edge cases like discounts or bundle prices.
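Here is a minimal sketch of how few-shot examples can be prepended to the message list. build_messages and the discount example are made up for illustration, and the choice to return the sale price rather than the original one is an assumption about what you want. The sketch also puts each example answer in a plain assistant message as JSON, which is a simplification when you are forcing a function call.

```python
import json

SYSTEM_PROMPT = (
    "You are a data extraction assistant. Extract product fields "
    "from the given text. Return only valid JSON."
)

def build_messages(page_text: str, examples: list) -> list:
    """Build a chat message list with worked examples before the real input.

    examples is a list of (input_text, expected_fields) pairs.
    """
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example_text, example_fields in examples:
        # Each example is a user turn followed by the ideal assistant answer
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": json.dumps(example_fields)})
    # The page we actually want extracted goes last
    messages.append({"role": "user", "content": page_text})
    return messages

# Hypothetical edge case: a discounted item where we want the current price
examples = [
    (
        "Mega Widget Was $29.99 Now $19.99 In Stock",
        {"name": "Mega Widget", "price": "$19.99", "availability": True},
    ),
]

messages = build_messages("Widget Pro $24.99 In Stock", examples)
print(len(messages))  # 4
```

The resulting list can be passed as the messages argument of the chat completion call in place of the two-message list used earlier.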

The takeaway

LLMs aren't magic, but for extraction tasks that require human-like understanding of inconsistent sources, they beat brittle parsing far more often than not. The technique I shared, using function calls to get structured JSON, is becoming a standard pattern in the AI community. It's not a silver bullet, but it's a damn useful tool in your belt.

What's your go-to for messy data extraction? Regex purist or LLM convert?