zhongqiyue

Posted on Jun 1

Why I Gave Up on Regex and Built an AI Data Extractor

#tutorial #ai #webdev #python

I’ve been scraping the web for years. It’s a love-hate relationship: the thrill of finally pulling the data you need, followed by the despair when the site redesigns and everything breaks. Last month, I hit a wall. I needed to extract product specs from dozens of e-commerce pages. Each page had the same data (name, price, description, dimensions) but the HTML structure varied wildly. Some used <dl>, some <table>, some just <div> soup with inline CSS. My trusty regex and BeautifulSoup pipeline turned into a nightmare of conditional branches.

The regex abyss

I started optimistically. Write a few patterns, test, repeat. But soon my code looked like this:

import re
from bs4 import BeautifulSoup

def extract_price(html):
    patterns = [
        r'\$?\d+[\.\,]?\d*',
        r'price[\s:]*(\$?\d+[\.\,]?\d*)',
        r'<span class="price">(.*?)<\/span>',
        # ... more patterns
    ]
    for pat in patterns:
        match = re.search(pat, html, re.I)
        if match:
            return match.group(1) if match.lastindex else match.group()
    return None

It worked… for about three pages. Then a new site used € instead of $, or had the price embedded in a JavaScript object. I’d add more patterns. Then another site used an image of the price. I cried a little.
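To make the failure concrete, here is a small self-contained check of the function above against a made-up fragment (the HTML string is hypothetical, not taken from a real site). Because the first pattern matches any run of digits, it never reaches the price at all:

```python
import re

def extract_price(html):
    # Same three patterns as shown above.
    patterns = [
        r'\$?\d+[\.\,]?\d*',
        r'price[\s:]*(\$?\d+[\.\,]?\d*)',
        r'<span class="price">(.*?)<\/span>',
    ]
    for pat in patterns:
        match = re.search(pat, html, re.I)
        if match:
            return match.group(1) if match.lastindex else match.group()
    return None

# Hypothetical page fragment with a euro price and a comma decimal separator.
html = '<h1>Widget Pro</h1><span class="price">€49,90</span>'
result = extract_price(html)
print(result)  # prints 1: the digit in the <h1> tag name, not the price
```

Reordering the patterns helps for this one input, but the catch-all pattern will still fire on the next page that has a stray number before the price.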

The BeautifulSoup maze

I tried to be smarter: parse the HTML structure. But every site had a unique layout. I wrote a function that tried all common selectors:

def find_price(soup):
    for selector in [
        '.price', '.product-price', '#price',
        '[itemprop="price"]', 'meta[name="price"]',
    ]:
        el = soup.select_one(selector)
        if el:
            return el.get('content') or el.text.strip()
    return None

Better, but still brittle. One site used .prc; another had a price inside an <s> tag that was actually the old, struck-through price. The false positives mounted. I needed a different approach.

The lightbulb: treat it like a natural language understanding problem

I realized that what I really wanted was to read the page like a human: ignore the markup and just understand the semantic meaning. That’s exactly what large language models (LLMs) are good at. Why fight the HTML when I could ask an AI to extract the data?

The idea: feed the raw HTML (or a cleaned text version) to an LLM with a prompt that says "Give me the price, name, and description in JSON." The model can handle variations because it understands context.

The approach: structured extraction with LLMs

I chose to use LangChain with OpenAI’s GPT-4 (but later found cheaper alternatives). Here’s the core idea:

1. Fetch the HTML.
2. Strip script/style tags and reduce noise (optional, but helps with cost).
3. Send the text + a prompt to the LLM, requesting a JSON response.
4. Parse the JSON.

Example code

import json
import re

import requests
from bs4 import BeautifulSoup
# Assumes a recent LangChain with the langchain-openai package installed;
# older releases exposed these under langchain.chat_models and langchain.schema.
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

# ChatOpenAI reads the API key from the OPENAI_API_KEY environment variable
llm = ChatOpenAI(model="gpt-4", temperature=0)

def extract_product_info(url):
    # Fetch the page; fail fast on timeouts and HTTP errors
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Remove scripts, styles, and page chrome that carry no product data
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    # Cap the input at 4,000 characters to limit token usage
    text = soup.get_text(separator=" ", strip=True)[:4000]

    system_prompt = "You extract product information from unstructured text. Respond only with a JSON object containing: name, price, description, dimensions (if present)."
    user_prompt = f"Extract data from this text:\n\n{text}"

    response = llm.invoke([
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt)
    ])

    # Parse the reply; some models wrap the JSON in a Markdown code fence
    try:
        result = json.loads(response.content)
    except json.JSONDecodeError:
        # Fallback: pull the JSON out of the fence (`{3} matches three backticks)
        match = re.search(r'`{3}(?:json)?\s*([\s\S]*?)`{3}', response.content)
        if match:
            result = json.loads(match.group(1))
        else:
            raise
    return result

# Example usage
url = "https://example.com/product/123"
info = extract_product_info(url)
print(info)
# {'name': 'Widget Pro', 'price': '$29.99', 'description': 'A durable widget...', 'dimensions': '10x5x3 cm'}
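The JSON fallback in the function above is the part most likely to break quietly, so it may be worth pulling into a helper that can be tested without calling the API. This is a minimal sketch; the names parse_llm_json and FENCED_JSON are mine, not LangChain's:

```python
import json
import re

# Matches a Markdown code fence, optionally tagged "json"; `{3} means three backticks.
FENCED_JSON = re.compile(r'`{3}(?:json)?\s*([\s\S]*?)`{3}')

def parse_llm_json(content):
    """Parse a model reply as JSON, unwrapping a Markdown code fence if present."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        match = FENCED_JSON.search(content)
        if match is None:
            raise  # no fence to unwrap: surface the original parse error
        return json.loads(match.group(1))

fence = "`" * 3
bare = '{"name": "Widget Pro", "price": "$29.99"}'
wrapped = f"Here is the data:\n{fence}json\n{bare}\n{fence}"
print(parse_llm_json(bare) == parse_llm_json(wrapped))  # True
```

With the parsing isolated like this, a handful of canned replies can serve as regression tests for the prompt.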

Cost and speed trade-offs

This approach isn’t free. A request to GPT-4 costs around $0.03–$0.10 depending on input size, so a hundred products run $3–10. It’s also slow: 2–5 seconds per page. I mitigated both by:

Using GPT-3.5-turbo for simpler pages (much cheaper, about $0.001 per call).
Reducing input size: only send the visible text around the product area (use XPath or CSS to extract main content).
Batching: if multiple items are on one page, ask for all in one call.
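A rough way to see what routing buys you is to put the per-call figures above into a small estimator. This is only back-of-the-envelope arithmetic: the function name and the 70% share are made up for illustration, and real prices depend on token counts and change over time.

```python
def estimate_cost(pages, cheap_share, cheap_price=0.001, strong_price=(0.03, 0.10)):
    """Return a (low, high) dollar estimate for a batch of pages.

    cheap_share is the fraction of pages routed to the cheaper model.
    The default prices are the rough per-call figures quoted above.
    """
    cheap_pages = pages * cheap_share
    strong_pages = pages - cheap_pages
    low = cheap_pages * cheap_price + strong_pages * strong_price[0]
    high = cheap_pages * cheap_price + strong_pages * strong_price[1]
    return round(low, 2), round(high, 2)

print(estimate_cost(100, cheap_share=0.0))  # (3.0, 10.0): everything on GPT-4
print(estimate_cost(100, cheap_share=0.7))  # (0.97, 3.07): 70 pages on the cheaper model
```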

When it fails

LLMs aren’t perfect. I’ve seen hallucinations: inventing a price when none exists, or mixing up name and description. To guard against this, I always validate the output against expected types and ranges (e.g., the price should contain a match for \d+\.\d{2}). I also set temperature=0 to reduce randomness.
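A validation pass along those lines might look like the sketch below. The function name and the sample page text are hypothetical, and the "price must appear verbatim in the source" rule is an extra assumption of mine that will not hold if the model reformats the price.

```python
import re

PRICE_RE = re.compile(r'\d+\.\d{2}')  # the format check mentioned above

def validate_extraction(result, source_text):
    """Return a list of problems found in an extracted record (empty if none)."""
    problems = []
    name = result.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("name missing or empty")
    price = result.get("price")
    if not isinstance(price, str) or not PRICE_RE.search(price):
        problems.append("price missing or malformed")
    elif price not in source_text:
        # Assumption: a real price appears verbatim in the page text,
        # so a value absent from the source was probably invented.
        problems.append("price not found in source text")
    return problems

page = "Widget Pro. A durable widget. Now only $29.99 with free shipping."
print(validate_extraction({"name": "Widget Pro", "price": "$29.99"}, page))  # []
print(validate_extraction({"name": "Widget Pro", "price": "$19.99"}, page))
# ['price not found in source text']
```

Records that come back with problems can be retried, sent to a stronger model, or flagged for a manual look.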

Another limitation: if the page is mostly JavaScript-rendered, you need a headless browser first. That adds complexity.

Alternatives I considered

Commercial APIs like ai.interwestinfo.com (I haven't used it personally, but it offers a similar service). The advantage is no need to manage API keys or prompts; the downside is vendor lock-in and potentially higher per-request costs.
Local models (LLaMA, Mistral) via Ollama: free but slower and less accurate for extraction.
Fine-tuning: overkill for a one-off project, but could be worth it for a recurring domain.

What I learned

Don’t fight the format. If the data is unstructured, use a model that understands language, not markup.
LLM extraction is a complement, not a replacement. For well-structured pages, traditional parsing is faster and cheaper.
Prompt engineering matters a lot. A poorly written prompt returns garbage. Experiment and iterate.

What I’d do differently next time

I’d start with a small test suite of 5–10 representative pages and evaluate accuracy before scaling. I’d also use a more structured output format: LangChain offers a PydanticOutputParser that enforces a schema. That would catch malformed output, and some hallucinations, early.
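To show the kind of check a schema gives you, here is a dependency-free stand-in using a plain dataclass. It is not PydanticOutputParser and does not use LangChain; the Product class and to_product helper are hypothetical names, with fields taken from the prompt used earlier.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Product:
    name: str
    price: str
    description: str
    dimensions: Optional[str] = None  # "if present" in the prompt

def to_product(data):
    """Build a Product from parsed JSON, rejecting wrong shapes early."""
    required = ("name", "price", "description")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for key in required:
        if not isinstance(data[key], str):
            raise ValueError(f"{key} should be a string, got {type(data[key]).__name__}")
    dimensions = data.get("dimensions")
    if dimensions is not None and not isinstance(dimensions, str):
        raise ValueError("dimensions should be a string or null")
    return Product(data["name"], data["price"], data["description"], dimensions)

product = to_product({"name": "Widget Pro", "price": "$29.99", "description": "A durable widget"})
print(product.dimensions)  # None
```

A schema like this only checks shape and types. A well-formed but invented price still passes, so it complements value-level validation rather than replacing it.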

Closing thoughts

Regex and BeautifulSoup are still my go-tos for stable APIs or consistent HTML. But when the chaos level goes beyond 7/10, I now reach for an AI model. It’s like having a junior developer who can read any page—just a bit slower and more expensive.

What’s your approach for dealing with wildly variable web pages? Do you stick with pattern matching or have you tried AI extraction?
