DEV Community

zhongqiyue

Why I Gave Up on Regex and Built an AI Data Extractor

I’ve been scraping the web for years. It’s a love-hate relationship: the thrill of finally pulling the data you need, followed by the despair when the site redesigns and everything breaks. Last month, I hit a wall. I needed to extract product specs from dozens of e-commerce pages. Each page had the same data (name, price, description, dimensions) but the HTML structure varied wildly. Some used <dl>, some <table>, some just <div> soup with inline CSS. My trusty regex and BeautifulSoup pipeline turned into a nightmare of conditional branches.

The regex abyss

I started optimistically. Write a few patterns, test, repeat. But soon my code looked like this:

import re
from bs4 import BeautifulSoup

def extract_price(html):
    patterns = [
        r'\$?\d+[\.\,]?\d*',
        r'price[\s:]*(\$?\d+[\.\,]?\d*)',
        r'<span class="price">(.*?)<\/span>',
        # ... more patterns
    ]
    for pat in patterns:
        match = re.search(pat, html, re.I)
        if match:
            return match.group(1) if match.lastindex else match.group()
    return None

It worked… for about three pages. Then a new site used a different currency symbol instead of $, or had the price embedded in a JavaScript object. I’d add more patterns. Then another site used an image of the price. I cried a little.

The BeautifulSoup maze

I tried to be smarter: parse the HTML structure. But every site had a unique layout. I wrote a function that tried all common selectors:

def find_price(soup):
    for selector in [
        '.price', '.product-price', '#price',
        '[itemprop="price"]', 'meta[name="price"]',
    ]:
        el = soup.select_one(selector)
        if el:
            return el.get('content') or el.text.strip()
    return None

Better, but still brittle. One site used .prc, another put the price inside a <s> tag that actually held the old price. The false positives mounted. I needed a different approach.

The lightbulb: treat it like a natural language understanding problem

I realized that what I really wanted was to read the page like a human: ignore the markup and just understand the semantic meaning. That’s exactly what large language models (LLMs) are good at. Why fight the HTML when I could ask an AI to extract the data?

The idea: feed the raw HTML (or a cleaned text version) to an LLM with a prompt that says "Give me the price, name, and description in JSON." The model can handle variations because it understands context.

The approach: structured extraction with LLMs

I chose to use LangChain with OpenAI’s GPT-4 (but later found cheaper alternatives). Here’s the core idea:

  1. Fetch the HTML.
  2. Strip script/style tags and reduce noise (optional, but helps with cost).
  3. Send the text + a prompt to the LLM, requesting a JSON response.
  4. Parse the JSON.

Example code

import requests
from bs4 import BeautifulSoup
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

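# ChatOpenAI reads the API key from the OPENAI_API_KEY environment variable;
# swap in another provider's chat model if you prefer
llm = ChatOpenAI(model="gpt-4", temperature=0)

def extract_product_info(url):
    # Fetch and clean HTML
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()  # fail fast on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(resp.text, "html.parser")
    # Remove tags that carry no product data
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    # Truncate to 4000 characters to keep the token count (and cost) down
    text = soup.get_text(separator=" ", strip=True)[:4000]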

    system_prompt = "You extract product information from unstructured text. Respond only with a JSON object containing: name, price, description, dimensions (if present)."
    user_prompt = f"Extract data from this text:\n\n{text}"

    response = llm([
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt)
    ])

    # Try to parse JSON (the model sometimes wraps it in a markdown code block)
    import json
    try:
        result = json.loads(response.content)
    except json.JSONDecodeError:
        # Fallback: extract JSON from a markdown code block
        import re
        match = re.search(r'```(?:json)?\s*([\s\S]*?)```', response.content)
        if match:
            result = json.loads(match.group(1))
        else:
            raise
    return result

# Example usage
url = "https://example.com/product/123"
info = extract_product_info(url)
print(info)
# {'name': 'Widget Pro', 'price': '$29.99', 'description': 'A durable widget...', 'dimensions': '10x5x3 cm'}

Cost and speed trade-offs

This approach isn’t free. A request to GPT-4 costs around $0.03–$0.10 depending on input size, so a hundred products come to $3–10. It’s also slow (2–5 seconds per page). I mitigated this by:

  • Using GPT-3.5-turbo for simpler pages (much cheaper, about $0.001 per call).
  • Reducing input size: only send the visible text around the product area (use XPath or CSS to extract main content).
  • Batching: if multiple items are on one page, ask for all in one call.
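
To sanity-check those numbers before a large run, a rough estimate is enough. The sketch below assumes about 4 characters per token (a common rule of thumb for English text, not an exact count) and takes the per-1K-token prices as parameters. The prices in the usage example are placeholders, so substitute your provider's current rates. `estimate_cost` is a hypothetical helper, not part of any library.

```python
def estimate_cost(num_chars, pages, input_price_per_1k, output_price_per_1k,
                  output_tokens=150, chars_per_token=4):
    """Rough cost in dollars for extracting `pages` pages of `num_chars` characters each."""
    input_tokens = num_chars / chars_per_token
    per_page = ((input_tokens / 1000) * input_price_per_1k
                + (output_tokens / 1000) * output_price_per_1k)
    return per_page * pages

# Placeholder prices (assumptions): $0.03 per 1K input tokens, $0.06 per 1K output tokens.
total = estimate_cost(num_chars=4000, pages=100,
                      input_price_per_1k=0.03, output_price_per_1k=0.06)
print(f"${total:.2f}")  # $3.90
```

With the 4000-character cap used in the example code, input dominates the bill, which is why trimming the text sent to the model pays off first.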

When it fails

LLMs aren’t perfect. I’ve seen hallucinations: inventing a price when none exists, or mixing up name and description. To guard against this, I always validate the output against expected types and ranges (e.g., price should match \d+\.\d{2}). I also set temperature=0 to reduce randomness.
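
A minimal sketch of that validation step, using only the standard library. The field names follow the prompt used earlier; the price bounds and the `validate_product` helper are assumptions for illustration, so tune them to your catalog. The pattern is slightly looser than \d+\.\d{2} so that it accepts an optional leading currency symbol, as in `$29.99`.

```python
import re

# Optional single currency symbol, then digits with an optional two-digit fraction.
PRICE_RE = re.compile(r"^[^\d\s]?\s*(\d+(?:\.\d{2})?)$")

def validate_product(data, min_price=0.01, max_price=100_000):
    """Return a list of problems found in an extracted product dict (empty if OK)."""
    problems = []
    for field in ("name", "price", "description"):
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    price = data.get("price")
    if isinstance(price, str) and price.strip():
        match = PRICE_RE.match(price.strip())
        if not match:
            problems.append(f"price has unexpected format: {price!r}")
        elif not (min_price <= float(match.group(1)) <= max_price):
            problems.append(f"price out of range: {price!r}")
    return problems

print(validate_product({"name": "Widget Pro", "price": "$29.99",
                        "description": "A durable widget"}))  # []
print(validate_product({"name": "Widget Pro", "price": "about thirty",
                        "description": ""}))
```

Returning a list of problems rather than raising makes it easy to log the bad pages and retry them with a different prompt or model.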

Another limitation: if the page is mostly JavaScript-rendered, you need a headless browser first. That adds complexity.
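
One option for such pages is to render them in a headless browser and pass the resulting HTML into the same cleaning step. The sketch below assumes Playwright is installed (`pip install playwright`, then `playwright install chromium`). `render_html` is a hypothetical helper name, and the import sits inside the function so the rest of the pipeline still works without Playwright.

```python
def render_html(url, timeout_ms=15000):
    """Return the page's HTML after client-side JavaScript has run."""
    # Imported lazily: Playwright is an optional dependency (an assumption of this sketch).
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity settles; some sites never reach it,
            # in which case goto raises a timeout error after timeout_ms.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

The returned string can stand in for `resp.text` in `extract_product_info`, at the price of a slower fetch per page.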

Alternatives I considered

  • Commercial APIs like ai.interwestinfo.com (I haven't used it personally, but it offers a similar service). The advantage is no need to manage API keys or prompts; the downside is vendor lock-in and potentially higher per-request costs.
  • Local models (LLaMA, Mistral) via Ollama: free but slower and less accurate for extraction.
  • Fine-tuning: overkill for a one-off project, but could be worth it for a recurring domain.
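
For the local-model route, here is a minimal sketch against Ollama's HTTP API, using only the standard library. It assumes an Ollama server is running on its default port (11434) and that a model named `mistral` has already been pulled. `build_payload` and `extract_locally` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def build_payload(text, model="mistral"):
    """Build the request body asking the model for a JSON-only answer."""
    prompt = (
        "Extract product information from the text below. Respond only with a "
        "JSON object containing: name, price, description, dimensions.\n\n" + text
    )
    # "format": "json" asks Ollama to constrain the output to valid JSON;
    # "stream": False returns one complete response instead of chunks.
    return {"model": model, "prompt": prompt, "format": "json", "stream": False}

def extract_locally(text, model="mistral"):
    """Send the text to a local Ollama server and parse the model's JSON answer."""
    body = json.dumps(build_payload(text, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        reply = json.load(resp)
    # The generated text lives in the "response" field of the reply.
    return json.loads(reply["response"])
```

The output still needs the same validation as the hosted models, since valid JSON says nothing about whether the values are right.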

What I learned

  • Don’t fight the format. If the data is unstructured, use a model that understands language, not markup.
  • LLM extraction is a complement, not a replacement. For well-structured pages, traditional parsing is faster and cheaper.
  • Prompt engineering matters a lot. A poorly written prompt returns garbage. Experiment and iterate.

What I’d do differently next time

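I’d start with a small test suite of 5–10 representative pages and evaluate accuracy before scaling. I’d also use a more structured output format: LangChain offers a PydanticOutputParser that validates the response against a schema. That would catch malformed or missing fields early, though a well-formed but invented value still needs the type and range checks mentioned earlier.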
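
A minimal sketch of such a test suite: hand-label a few pages, run the extractor on each, and report per-field accuracy. `evaluate` and the stand-in `fake_extractor` are hypothetical; in practice you would pass `extract_product_info` and your own labeled URLs.

```python
def evaluate(extractor, labeled_pages, fields=("name", "price", "description")):
    """Return per-field accuracy of `extractor` over (url, expected_dict) pairs."""
    if not labeled_pages:
        raise ValueError("need at least one labeled page")
    correct = {field: 0 for field in fields}
    for url, expected in labeled_pages:
        try:
            result = extractor(url)
        except Exception:
            result = {}  # a crash counts as a miss on every field
        for field in fields:
            if result.get(field) == expected.get(field):
                correct[field] += 1
    total = len(labeled_pages)
    return {field: correct[field] / total for field in fields}

# Stand-in extractor so the sketch runs without network access (illustrative data only).
def fake_extractor(url):
    return {"name": "Widget Pro", "price": "$29.99", "description": "wrong"}

labeled = [
    ("https://example.com/product/123",
     {"name": "Widget Pro", "price": "$29.99", "description": "A durable widget"}),
]
print(evaluate(fake_extractor, labeled))
# {'name': 1.0, 'price': 1.0, 'description': 0.0}
```

Exact string matching is strict for free-text fields like the description; a looser comparison (for example, normalized whitespace or substring checks) may be a fairer measure there.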

Closing thoughts

Regex and BeautifulSoup are still my go-tos for stable APIs or consistent HTML. But when the chaos level goes beyond 7/10, I now reach for an AI model. It’s like having a junior developer who can read any page—just a bit slower and more expensive.

What’s your approach for dealing with wildly variable web pages? Do you stick with pattern matching or have you tried AI extraction?