zhongqiyue

Posted on Jun 5

How I Stopped Fighting Regex and Finally Extracted Data with LLMs

#ai #api #tutorial #python

I spent three days building a regex monster to parse customer emails. It had 47 patterns, each one more fragile than the last. A single missing space would break the whole thing. By day four, I wanted to throw my laptop out the window.

That’s when I decided to try something completely different: let a large language model do the heavy lifting.

Here’s the story of how I went from regex hell to a clean, maintainable data extraction pipeline using LLMs — and why I won’t go back to hand-crafted patterns for unstructured text.

The Problem: Messy, Human-Written Text

I was building an internal tool to process support tickets. Customers would write things like:

“Order #12345 is delayed. Can you refund my shipping?”
“Hi, I need a replacement for item SKU-9876. My order number is 54321.”
“Where’s my package? It’s order 98765, please help!”

Every single email had different wording, different ordering of information, and occasional typos. I needed to extract: order ID, intent (refund, replacement, tracking), and any SKU mentioned.

What I Tried First (and Failed)

Regex

I started with regex. I wrote patterns for common variations:

import re

order_pattern = r'order\s*[#:]?\s*(\d{5,8})'
sku_pattern = r'SKU[-\s]?(\w{4,10})'
intent_pattern = r'(refund|replace|return|cancel)'

It worked for about 20% of the emails. Then the real world hit:

Some orders had letters (like ORD-2024-001)
SKUs were sometimes written as item 1234
Intent words appeared in phrases like “don’t refund”
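To make these failure modes concrete, here is a small check that runs the three patterns above against a few messages. The extract helper and the sample strings are illustrative only, and re.IGNORECASE is an assumption added here (the snippet above does not say which flags were used), since without it even "Order" with a capital O would not match.

```python
import re

order_pattern = r'order\s*[#:]?\s*(\d{5,8})'
sku_pattern = r'SKU[-\s]?(\w{4,10})'
intent_pattern = r'(refund|replace|return|cancel)'


def extract(text):
    """Apply the three patterns; fields that do not match come back as empty strings."""
    def first(pattern):
        match = re.search(pattern, text, re.IGNORECASE)
        return match.group(1) if match else ""

    return {
        "order_id": first(order_pattern),
        "sku": first(sku_pattern),
        "intent": first(intent_pattern),
    }


samples = [
    "Order #12345 is delayed. Can you refund my shipping?",
    "My order is ORD-2024-001 and I need item 1234 replaced.",
    "Please don't refund me, just tell me where order 98765 is.",
]
for sample in samples:
    print(extract(sample))
```

The first message parses cleanly. The second loses both the order ID and the SKU, and the third is tagged as a refund even though the customer asked for the opposite.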

I kept adding patterns. My code became a tangled mess of lookaheads and optional groups. At one point, I had a regex that matched nothing but still didn’t throw an error — it just silently returned empty strings for everything.

NLP Libraries

Next, I turned to spaCy and custom NER (Named Entity Recognition). I spent another day training a model with 200 annotated examples. It didn’t generalize well. The model would label the word “order” as an entity but miss the actual order number. Plus, adding new intents meant retraining.

The LLM Approach: Prompt Engineering + Structured Output

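A colleague suggested, “Why not ask an LLM to extract it in JSON?” I was skeptical, since LLMs are chatty, not precise. But then I learned about function calling and JSON mode in OpenAI’s API. The idea is to tell the model exactly which fields you want and have it reply with a JSON object. JSON mode guarantees that the reply parses as JSON; it does not guarantee that the fields match your schema, which is why the output still gets validated.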

Here’s the core technique:

1. Define a JSON schema (I used Pydantic for validation)

from typing import Literal, Optional

from pydantic import BaseModel


class TicketInfo(BaseModel):
    order_id: str
    # Literal makes validation reject any intent outside this list.
    intent: Literal["refund", "replacement", "tracking", "other"]
    sku: Optional[str] = None

2. Write a system prompt that sets the rules

system_prompt = """You are a data extraction assistant. 
Extract the following fields from the user's email and return ONLY a valid JSON object.
Fields:
- order_id: string (the order number)
- intent: string (one of: refund, replacement, tracking, other)
- sku: string or null (SKU or item code, if present)
Do not include any other text."""
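Here is a sketch of how the request messages might be assembled from this prompt, including the few-shot examples mentioned later under Handling Edge Cases. The build_system_prompt and build_messages helpers and the FEW_SHOT pairs are hypothetical names and data made up for illustration, not the actual prompt used in the pipeline.

```python
import json

system_prompt = """You are a data extraction assistant.
Extract the following fields from the user's email and return ONLY a valid JSON object.
Fields:
- order_id: string (the order number)
- intent: string (one of: refund, replacement, tracking, other)
- sku: string or null (SKU or item code, if present)
Do not include any other text."""

# Hypothetical (email, expected extraction) pairs; not the author's real examples.
FEW_SHOT = [
    ("Please don't refund me, just tell me where order 98765 is.",
     {"order_id": "98765", "intent": "tracking", "sku": None}),
    ("Item SKU-9876 from order 54321 arrived broken, send a new one.",
     {"order_id": "54321", "intent": "replacement", "sku": "SKU-9876"}),
]


def build_system_prompt(base, examples):
    """Append worked examples to the base prompt as Email/JSON pairs."""
    parts = [base, "", "Examples:"]
    for email, expected in examples:
        parts.append(f"Email: {email}")
        parts.append(f"JSON: {json.dumps(expected)}")
    return "\n".join(parts)


def build_messages(email_text):
    """Return the messages list for a chat completion request."""
    return [
        {"role": "system", "content": build_system_prompt(system_prompt, FEW_SHOT)},
        {"role": "user", "content": email_text},
    ]


messages = build_messages("Where's my package? It's order 98765, please help!")
print(messages[0]["content"])
```

Keeping the examples in a list rather than hard-coding them into the prompt string makes it easy to add a new pair whenever a logged failure reveals a tricky case.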

3. Call the API with response_format set to json_object

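from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Example input; in the pipeline this is the body of the incoming ticket.
email_text = "Hi, I need a replacement for item SKU-9876. My order number is 54321."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": email_text},
    ],
    # JSON mode: the reply is syntactically valid JSON, but the fields are not checked.
    response_format={"type": "json_object"},
    temperature=0,
)

raw = response.choices[0].message.content
# Raises pydantic.ValidationError if a field is missing or has the wrong type.
parsed = TicketInfo.model_validate_json(raw)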

That’s it. A few lines of code replaced 47 regex patterns.

Handling Edge Cases

LLMs are powerful but not perfect. Here’s what I added to make it reliable:

Retry with validation: If the JSON doesn’t match the schema, I resend the prompt with an error message.
Few-shot examples: For tricky intents, I included 2-3 examples in the system prompt.
Logging all failures: I log every email where extraction fails so I can improve prompts over time.
Cost awareness: I used gpt-4o-mini for its speed and low cost. A single extraction costs ~0.01 cents.
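The retry and logging steps above can be sketched roughly as follows. This is a minimal illustration under a few assumptions: call_model is a hypothetical callable that wraps the API call and returns the raw reply string, and validate is a standard-library stand-in for TicketInfo.model_validate_json so the example runs without Pydantic or network access.

```python
import json
import logging

logger = logging.getLogger("ticket_extraction")
ALLOWED_INTENTS = {"refund", "replacement", "tracking", "other"}


def validate(raw):
    """Parse raw JSON and check the three fields; raise ValueError on any problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    if not isinstance(data.get("order_id"), str):
        raise ValueError("order_id must be a string")
    intent = data.get("intent")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError(f"intent must be one of {sorted(ALLOWED_INTENTS)}")
    sku = data.get("sku")
    if sku is not None and not isinstance(sku, str):
        raise ValueError("sku must be a string or null")
    return {"order_id": data["order_id"], "intent": intent, "sku": sku}


def extract_with_retry(call_model, system_prompt, email_text, max_attempts=3):
    """Call the model, feeding validation errors back until the output passes."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": email_text},
    ]
    for attempt in range(1, max_attempts + 1):
        raw = call_model(messages)
        try:
            return validate(raw)
        except ValueError as exc:
            logger.warning("attempt %d failed: %s", attempt, exc)
            # Show the model its previous answer and what was wrong with it.
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Invalid output ({exc}). Return corrected JSON only.",
            })
    # Every attempt failed: log the email so it can be reviewed by hand.
    logger.error("extraction failed after %d attempts: %r", max_attempts, email_text)
    return None


# Stand-in for a real API call: the first reply is invalid, the second is fine.
replies = iter([
    '{"order_id": 12345}',
    '{"order_id": "12345", "intent": "refund", "sku": null}',
])
result = extract_with_retry(lambda messages: next(replies), "prompt", "Order #12345 is delayed.")
print(result)
```

Capping the number of attempts keeps a stubbornly bad email from burning tokens forever; whatever still fails ends up in the log for manual review.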

When NOT to Use This Approach

High volume, low complexity: If your text is highly structured (like CSV files), regex is faster and cheaper.
Real-time systems: An LLM call adds roughly 500 ms to 2 s of latency per email. For real-time use, consider smaller models or traditional NLP.
Privacy constraints: Sending customer emails to a third-party API might violate compliance. You’d need a local LLM (like Llama 3) or a self-hosted service.
Deterministic requirements: LLMs are probabilistic, so if you need exactly the same output every time, this approach is a poor fit. Even with temperature=0 you might get different results on retry.

Lessons Learned

Prompt engineering is the new regex. But it’s much more maintainable. I can change a prompt in seconds; changing a regex pattern often broke something else.
Validation is mandatory. Always parse the output into a typed model and catch errors.
Start with the simplest LLM. I tried GPT-4 first, but GPT-4o-mini was good enough and 20x cheaper.
Combine with traditional extraction. For example, if the email includes a clearly formatted order number like ORD-12345, I can regex that faster and only use the LLM for the fuzzy parts.
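The last point, combining regex with the LLM, might look something like this. It is only a sketch: the ORD-12345 format is taken from the example above, while STRICT_ORDER_ID, extract_order_id, and the llm_extract callable are hypothetical names, with a fake function standing in for the real model call.

```python
import re

# Assumed shape of a "clearly formatted" order number, e.g. ORD-12345.
STRICT_ORDER_ID = re.compile(r"\bORD-\d{5}\b")


def extract_order_id(email_text, llm_extract):
    """Return (order_id, source): regex if exactly one ID is found, else the LLM."""
    matches = set(STRICT_ORDER_ID.findall(email_text))
    if len(matches) == 1:
        return matches.pop(), "regex"
    # Zero or several candidates: this is the fuzzy case, so ask the model.
    return llm_extract(email_text), "llm"


def fake_llm(email_text):
    """Stand-in for the real model call, used here so the example runs offline."""
    return "98765"


print(extract_order_id("Order ORD-12345 never arrived.", fake_llm))
print(extract_order_id("Where's my package? It's order 98765.", fake_llm))
```

Falling back to the model when the regex finds more than one candidate is a deliberate choice: picking the first match silently is the kind of quiet failure that made the pure-regex version so painful.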

What I’d Do Differently Next Time

I’d build a small abstraction layer so I can swap the LLM provider without changing the extraction logic. Services like ai.interwestinfo.com offer similar extraction endpoints with built-in validation — I’d consider using a dedicated API if I didn’t want to manage prompt upkeep.
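One possible shape for that abstraction layer, sketched with a typing.Protocol. Everything here is hypothetical: Extractor, CannedExtractor, and process_ticket are names invented for illustration, and a real implementation of the protocol would wrap whichever provider's client is in use.

```python
import json
from typing import Optional, Protocol


class Extractor(Protocol):
    """Anything that turns an email into a raw JSON string."""

    def extract_raw(self, email_text: str) -> str: ...


class CannedExtractor:
    """Test double; a real class would call one provider's API instead."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def extract_raw(self, email_text: str) -> str:
        return self.reply


def process_ticket(extractor: Extractor, email_text: str) -> Optional[dict]:
    """Provider-independent logic: parse the reply, return None if it is unusable."""
    try:
        data = json.loads(extractor.extract_raw(email_text))
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and "order_id" in data:
        return data
    return None


canned = CannedExtractor('{"order_id": "54321", "intent": "replacement", "sku": "SKU-9876"}')
ticket = process_ticket(canned, "Hi, I need a replacement for item SKU-9876.")
print(ticket["intent"])
```

Because process_ticket only depends on the extract_raw method, swapping providers means writing one new class and leaving the extraction logic alone.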

But for now, my pipeline runs smoothly. I process hundreds of tickets a day with 95% accuracy, and the remaining 5% are logged for manual review. No more regex nightmares.

If you’ve been fighting with fragile parsing, give LLMs a shot. Start with a simple prompt and iterate.

What’s your go-to trick for extracting data from messy text?

DEV Community