zhongqiyue

Posted on May 31

How I stopped wrestling with regex and started using AI for data extraction

#ai #python #webdev #tutorial

Last month, I spent three days fighting with regular expressions.

I had a pile of unstructured product descriptions from various suppliers—some with prices hidden in paragraphs, others with specs scattered across bullet points. My job was to normalize them into a clean JSON structure: { name, price, specs, description }.

It started simple. A few regex patterns. \$\d+\.\d{2} for prices. (?<=Brand:)\w+ for brands. Then the edge cases hit me like a freight train.

The Regex Nightmare

The first supplier used "$12.99" format. The second used "USD 12.99". One even wrote "costs around twelve dollars and ninety nine cents". My regex grew into a monster spanning 40 lines, with lookaheads, groups, and conditional statements. It worked for the first 20 products. Then I ran it on the full dataset (10,000 records).

I got a 37% success rate. The rest were either wrong or empty. I spent another two days adding fallback patterns, but every new pattern introduced new false positives. I knew I was fighting a losing battle.

Attempting NLP

I considered spaCy and NLTK. Trained a custom NER model for product attributes? That would require labeled data, compute time, and ongoing maintenance as supplier formats changed. Overkill for a one-time migration project. I needed something that could handle unstructured text on the fly without training.

The AI Approach

A colleague mentioned using GPT-style models for data extraction. I was skeptical—seemed like using a sledgehammer to crack a nut. But after hitting that regex wall, I tried it.

The key insight: you don't need to fine-tune a model. You just need a well-crafted system prompt and a consistent output format. Here's what I ended up with:

import json
from openai import OpenAI

client = OpenAI()  # or pass your key from env

def extract_product_info(text):
    system_prompt = """
You are a data extraction assistant. Given a product description, extract the following fields and return ONLY a valid JSON object:
- name (string)
- price (float, in USD, if not specified use null)
- specs (object of key-value pairs if any specs mentioned, else empty object)
- description (string, cleaned summary of the product)

Rules:
- If price uses words like 'twelve dollars', convert to number.
- If multiple prices, pick the one for the product, not shipping.
- If no price found, use null.
- Return ONLY JSON, no markdown, no extra text.
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text}
        ],
        temperature=0.1,  # low for consistency
        max_tokens=500
    )
    raw = response.choices[0].message.content
    # Clean up possible markdown code fences
    raw = raw.strip().removeprefix("```

json").removesuffix("

```").strip()
    return json.loads(raw)

What I learned (the hard way)

Prompt engineering matters more than model size. I started with GPT-3.5 and got inconsistent outputs. Switching to GPT-4o-mini with a strict system prompt ("Return ONLY JSON") gave nearly 100% valid JSON. But I also learned to explicitly parse out markdown fences—models sometimes wrap JSON in triple backticks, even when told not to.

Validation saves the day. The json.loads will crash if the model hallucinates an extra comma. I added a retry loop with a fallback prompt:

import json
import re

def extract_with_retry(text, max_retries=2):
    for attempt in range(max_retries):
        try:
            return extract_product_info(text)
        except (json.JSONDecodeError, KeyError) as e:
            if attempt == max_retries - 1:
                raise
            # Ask model to fix the JSON
            pass

Cost isn't ridiculous. Processing 10,000 records with GPT-4o-mini cost about $8—far cheaper than my time debugging regex patterns. Each product description averaged ~150 tokens, and output ~80 tokens.

But it's not a silver bullet. The AI model still struggles with heavily ambiguous text. If a supplier describes a "wireless mouse" and later mentions "batteries not included" without a price, the model might guess a price based on training data—which is wrong. I learned to set null default and add a human review step for any record where price is null.

When NOT to do this

If you have a clean, consistent schema from a database, use SQL.
If you have labeled training data for specific fields, a fine-tuned NER model is faster and cheaper at scale.
If latency is critical (real-time API), AI model calls add 1-3 seconds per request. Consider a hybrid approach: regex for simple cases, AI for fallback.
If you're extracting from thousands of very short strings (e.g., user bios), traditional regex or rule parsing may be sufficient.

Trade-offs I made

Determinism vs. flexibility. Regex always gives the same answer. AI gives probabilistic outputs. For data migration, I'd rather have a few wrong fields than a broken pipeline. I added a manual review step for records with missing fields.
Control vs. ease. Prompt engineering feels like black magic. I spent an hour tuning the system prompt to avoid the model inventing prices. A simple line "If price is not mentioned, use null" fixed it.
Dependency on external API. If the AI service goes down, so does my extraction. I cached results locally to avoid reprocessing.

What I'd do differently next time

I'd start with AI from the beginning, but pair it with a robust validation layer: check that extracted fields conform to expected types (price as float, name non-empty). Use Pydantic models to enforce structure. Also, I'd batch the requests to amortize latency and reduce cost.

Oh, and I'd explore specialized extraction endpoints like the one at ai.interwestinfo.com that claims to handle this sort of thing—but honestly, the general-purpose approach with prompt engineering gave me enough control. I might use a dedicated tool if I revisit this project next quarter.

In the end, I stopped writing regex. I started writing prompts. And I got my weekends back.

What's your experience with AI for data parsing? Do you lean on regex or are you all-in on LLMs? I'm curious to hear what works for you.

DEV Community