DEV Community

zhongqiyue

From Regex to LLMs: My Journey Extracting Unstructured Web Data

A few months ago, I hit a wall. I was building a price comparison tool that needed to pull product specs from dozens of e‑commerce sites – and I mean dozens. Every site had its own HTML structure and CSS classes that looked like random strings, and some even rendered their content with JavaScript. My initial plan? Regex and BeautifulSoup. You know, the classic approach.

Here’s what actually happened – the dead ends, the aha moments, and finally a pragmatic middle ground that I still use today.

The First Attempt: BeautifulSoup + Regex

I started by writing a scraper for each site. BeautifulSoup made it easy to navigate the DOM, and I crafted regex patterns to extract prices, descriptions, and specs. For example:

import re
from bs4 import BeautifulSoup
import requests

html = requests.get('https://example-product-page.com').text
soup = BeautifulSoup(html, 'html.parser')

# Try to find the price
try:
    price_div = soup.find('div', class_=re.compile(r'price|cost', re.I))
    price_text = price_div.get_text(strip=True)
    price = re.search(r'\$[\d,.]+', price_text).group()
except AttributeError:
    price = None

It worked – for exactly two sites. Then one site redesigned its layout and my regex broke. Another started loading its content dynamically. I spent more time fixing scrapers than actually using the data. I needed a different approach.

The Second Attempt: Off‑the‑Shelf LLM – But It Was a Mess

“Let’s just throw AI at it,” I thought. I fed raw HTML into an LLM and asked it to extract the relevant fields. My first prompt looked something like this (I’m using a generic API here, but you can swap in any provider):

import openai

response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract product name, price, and specs from the following HTML."},
        {"role": "user", "content": html[:4000]}  # crude character cap to stay under the token limit
    ]
)
print(response.choices[0].message.content)

The output was inconsistent. Sometimes it returned JSON, sometimes a paragraph. It hallucinated spec values, and the cost skyrocketed because I was sending thousands of tokens of HTML with every request. I needed structured output and lower cost.

What Eventually Worked: Structured Output + Preprocessing

I combined two ideas:

  1. Preprocess the HTML – strip scripts, styles, and reduce to visible text lines.
  2. Use an LLM with function calling (or JSON mode) to force a structured response.

Here’s the core of the approach I use now. Note: the API endpoint in the config is just one option – you can point it to any compatible service. (In my setup I used the endpoint from https://ai.interwestinfo.com/ after testing a few.)

import json
import re
from bs4 import BeautifulSoup
from openai import OpenAI

def clean_html(html: str) -> str:
    """Remove scripts, styles, and reduce whitespace."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style', 'nav', 'footer']):
        tag.decompose()
    # Get only visible text lines, deduplicate
    lines = (line.strip() for line in soup.get_text().split('\n') if line.strip())
    return '\n'.join(dict.fromkeys(lines))  # preserve order, remove duplicates

def extract_product_info(html_text: str) -> dict:
    client = OpenAI(
        # Replace with your own endpoint
        base_url="https://ai.interwestinfo.com/v1",  # example service; any OpenAI-compatible endpoint works
        api_key="your-api-key"
    )

    # Define a schema for the output
    tools = [
        {
            "type": "function",
            "function": {
                "name": "store_product",
                "description": "Extract product details from the given text",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string", "description": "Product name"},
                        "price": {"type": "string", "description": "Price string like $29.99"},
                        "specs": {
                            "type": "object",
                            "additionalProperties": {"type": "string"},
                            "description": "Key specification pairs"
                        }
                    },
                    "required": ["name", "price", "specs"]
                }
            }
        }
    ]

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # or any model that supports tools
        messages=[
            {"role": "system", "content": "You extract structured product data from text. Be precise and use the provided function."},
            {"role": "user", "content": html_text[:8000]}  # still limited, but cheaper now
        ],
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "store_product"}}
    )

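    # Parse the function call arguments; fail loudly if the model returned no tool call
    tool_calls = response.choices[0].message.tool_calls
    if not tool_calls:
        raise ValueError("Model did not return a tool call")
    return json.loads(tool_calls[0].function.arguments)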

Handling Errors and Reducing Cost

This worked well, but I still hit issues:

  • Token limits – many product pages are huge. I truncate the cleaned text to its first 8000 characters, which usually covers the key info.
  • Rate limits – I added a simple exponential backoff retry.
  • Cost – I cached successful extractions per URL so I don’t re‑query for the same page.
  • Inconsistency – I make up to 3 attempts if the returned JSON is missing required fields.
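
As a rough illustration of the caching bullet above, here is a minimal file-based sketch. The cache directory, the `cached_extract` name, and the `extract` callable are all hypothetical; any key-value store would do the same job.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Hypothetical cache location; any writable directory would do.
CACHE_DIR = Path(tempfile.gettempdir()) / "extraction_cache"

def _cache_path(url: str) -> Path:
    # Hash the URL so it becomes a safe, fixed-length filename.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return CACHE_DIR / f"{digest}.json"

def cached_extract(url: str, html: str, extract) -> dict:
    """Return the cached result for `url`; call `extract(html)` only on a miss."""
    path = _cache_path(url)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = extract(html)
    # Only successful extractions reach this point, so failures are never cached.
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Note that this sketch never expires entries, so a price change on the page would go unnoticed until the cache file is deleted. For a price tool, storing a timestamp next to the result is probably worth the extra lines.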

Here’s a snippet of my retry decorator:

import time
from functools import wraps

def retry_on_failure(max_retries=3, delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    result = func(*args, **kwargs)
                    # Validate the result has the expected keys
                    if 'name' not in result or 'price' not in result:
                        raise ValueError("Missing required fields")
                    return result
                except Exception:
                    # Out of attempts: re-raise the last error to the caller
                    if attempt == max_retries - 1:
                        raise
                    # Exponential backoff: delay, 2*delay, 4*delay, ...
                    time.sleep(delay * (2 ** attempt))
        return wrapper
    return decorator

@retry_on_failure(max_retries=3, delay=1)
def safe_extract(html):
    return extract_product_info(html)
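
The decorator above only checks that two keys exist. A stricter check could look something like the sketch below. The `validation_errors` name, the field table, and the price pattern are assumptions for illustration, and the pattern only understands dollar prices.

```python
import re

# Assumed field/type table, mirroring the "required" list in the tool schema.
REQUIRED_FIELDS = {"name": str, "price": str, "specs": dict}

# Assumed price shape: "$", digits with optional commas, optional cents.
PRICE_RE = re.compile(r"^\$\d[\d,]*(?:\.\d{1,2})?$")

def validation_errors(product) -> list:
    """Return a list of human-readable problems; an empty list means it passed."""
    if not isinstance(product, dict):
        return ["result is not a JSON object"]
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in product:
            errors.append(f"missing field: {field}")
        elif not isinstance(product[field], expected_type):
            errors.append(f"wrong type for {field}")
    # Plausibility check on the price string itself.
    price = product.get("price")
    if isinstance(price, str) and not PRICE_RE.match(price.strip()):
        errors.append(f"implausible price: {price!r}")
    return errors
```

Returning a list instead of raising makes it easy to either retry (raise if the list is non-empty) or log the problems and flag the page for manual review.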

Lessons Learned & Trade‑offs

  • Regex/BeautifulSoup is still better for well‑structured, static pages. If the site uses consistent CSS classes and the layout rarely changes, a parser is faster and free.
  • LLM extraction shines when the structure is unpredictable or changes often. But it’s not magic – you must preprocess the input to keep costs sane.
  • You need metrics. I monitor the average cost per extraction and the success rate. If a site consistently fails, I fall back to a targeted scraper.
  • Latency is a real pain. Each LLM call takes 2–5 seconds. When processing many pages, parallelise with async calls, but watch your rate limits.
  • Don’t trust the output blindly. I added a validation step that checks if the extracted price looks like a real price (contains $ and digits). If not, flag it for manual review.
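
For the latency point, one common way to parallelise without blowing through rate limits is a semaphore that caps the number of in-flight calls. This is a sketch under assumptions: `extract_all`, `extract_one`, and the default limit of 5 are hypothetical, and `extract_one` is assumed to be an async callable taking `(url, html)`.

```python
import asyncio

async def extract_all(pages, extract_one, max_concurrent: int = 5) -> dict:
    """Run `extract_one(url, html)` for every page, at most `max_concurrent` at once.

    `pages` is a list of (url, html) tuples. The returned dict maps each URL
    to its result, or to the exception that extraction raised.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(url, html):
        async with semaphore:  # wait here while too many calls are in flight
            try:
                return url, await extract_one(url, html)
            except Exception as exc:
                # Keep going: one bad page should not cancel the whole batch.
                return url, exc

    pairs = await asyncio.gather(*(worker(url, html) for url, html in pages))
    return dict(pairs)
```

A synchronous extractor would need wrapping first, for example with `asyncio.to_thread` (Python 3.9+) or by switching to an async client. The right value for `max_concurrent` depends on your provider's limits.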

What I’d Do Differently Next Time

I’d start with a hybrid approach from day one: a lightweight parser for the easy sites, and an LLM fallback for the messy ones. I’d also invest more time in building a good text extraction step (for example, a readability‑style algorithm that keeps only the main content) to keep token counts low. And I’d definitely set up cost alerts before my API bill gives me a heart attack.
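
The hybrid idea can be sketched as a small router: use a site-specific parser when one exists and returns something, otherwise hand the page to the LLM. Everything named here (`make_router`, `site_parsers`, `llm_fallback`) is hypothetical and only shows the shape of the dispatch.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlparse

Extractor = Callable[[str], Optional[dict]]

def make_router(site_parsers: Dict[str, Extractor], llm_fallback: Extractor):
    """Build an `extract(url, html)` function that prefers cheap site parsers."""
    def extract(url: str, html: str) -> Optional[dict]:
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        parser = site_parsers.get(host)
        if parser is not None:
            try:
                result = parser(html)
            except Exception:
                # A redesign broke the parser; treat it as a miss.
                result = None
            if result:
                return result
        # Unknown site, or the targeted parser came back empty.
        return llm_fallback(html)
    return extract
```

Here `llm_fallback` could be something like `lambda html: safe_extract(clean_html(html))`, with the parsers being plain BeautifulSoup functions keyed by hostname.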

This approach isn’t perfect, but it’s saved my sanity. I can now add a new store in under an hour instead of a day.


What’s your go‑to strategy for extracting messy web data? Have you found a tool or technique that works better than LLMs for certain cases? I’d love to hear your war stories.
