DEV Community

zhongqiyue


I spent a week on regex before realizing an AI agent was the answer for data extraction



A couple of months ago, I was building a small internal tool that had to parse user emails and extract structured data: names, dates, amounts, and some custom fields. The emails weren't formal forms — they were free-form requests like "Hey, can we schedule a meeting for next Tuesday at 3 PM to discuss the $500 invoice?"

At first, I thought, "Regex will handle this, it's just pattern matching." I was wrong. So wrong.

The problem in detail

I needed to extract:

  • A date (could be "next Tuesday", "March 5th", "tomorrow", "in two days")
  • A numeric amount (sometimes with $, sometimes not)
  • A person name (often misspelled or with a title like "Dr.")
  • A purpose (free text like "discuss invoice" or "project update")

The input was email bodies — no standard structure, no templates. People write the way they talk.

What I tried that didn't work

1. Regex (the first trap)

I started with Python's re module. I wrote patterns like r"\$?\d+(\.\d{2})?" for amounts, r"(next|this) (Monday|Tuesday|...)" for dates. It worked on my test cases but failed on real data:

  • "I owe you 500" (no $)
  • "Let's meet on the 5th"
  • "We should discuss the 1.2% fee" (amount got confused with percentage)

Regex is brittle. Every edge case required a new pattern. After 50 lines of regex, I was still missing half the extractions.
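
To make the brittleness concrete, here is a small sketch that runs the amount pattern quoted above over those same sentences. The helper name `find_amounts` is made up for illustration; the pattern itself is copied as written.

```python
import re
from typing import List

# The amount pattern quoted earlier in this section, unchanged.
AMOUNT_RE = re.compile(r"\$?\d+(\.\d{2})?")

def find_amounts(text: str) -> List[str]:
    """Return every substring the amount pattern matches, in order."""
    return [m.group(0) for m in AMOUNT_RE.finditer(text)]

samples = [
    "can we discuss the $500 invoice?",
    "I owe you 500",
    "Let's meet on the 5th",
    "We should discuss the 1.2% fee",
]
for sample in samples:
    print(f"{sample!r} -> {find_amounts(sample)}")

# 'can we discuss the $500 invoice?' -> ['$500']
# 'I owe you 500' -> ['500']
# "Let's meet on the 5th" -> ['5']
# 'We should discuss the 1.2% fee' -> ['1', '2']
```

The pattern has no notion of context. It matches the bare 500 but cannot tell that it is money, it picks up the 5 from "the 5th", and it splits 1.2% into two separate "amounts".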

2. spaCy NLP pipeline

I thought I'd be smart and use spaCy's named entity recognition (NER). I loaded the en_core_web_lg model and applied it to each email. It found dates and money entities reasonably well, but:

  • It didn't handle relative dates ("next Tuesday" → spaCy tagged "Tuesday" as a date but didn't resolve "next")
  • It tagged "$500 invoice" as MONEY + ORG (the word "invoice" triggered an org tag)
  • Custom fields (like a project code) were missed completely

I ended up writing post-processing rules on top of spaCy. That was another rabbit hole.
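
For a sense of what one of those post-processing rules looks like, here is a minimal sketch that resolves "next <weekday>" against a reference date. It is not the code from my tool; the function name `resolve_next_weekday` is hypothetical, and it assumes "next Tuesday" means the first Tuesday strictly after today, which not every writer intends.

```python
import re
from datetime import date, timedelta
from typing import Optional

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
NEXT_DAY_RE = re.compile(r"\bnext\s+(" + "|".join(WEEKDAYS) + r")\b", re.IGNORECASE)

def resolve_next_weekday(phrase: str, today: date) -> Optional[date]:
    """Resolve 'next <weekday>' to a calendar date, or None if the phrase is absent.

    Assumption: 'next X' is the first X strictly after `today`.
    """
    match = NEXT_DAY_RE.search(phrase)
    if match is None:
        return None
    target = WEEKDAYS.index(match.group(1).lower())  # Monday == 0, like date.weekday()
    days_ahead = (target - today.weekday() - 1) % 7 + 1  # always between 1 and 7
    return today + timedelta(days=days_ahead)

today = date(2024, 3, 20)  # a Wednesday
print(resolve_next_weekday("schedule a meeting for next Tuesday at 3 PM", today))  # 2024-03-26
print(resolve_next_weekday("how about next Wednesday?", today))                    # 2024-03-27
print(resolve_next_weekday("in two days", today))                                  # None
```

This covers exactly one phrasing. "Tomorrow", "in two days", and "the 5th" each need their own rule, which is how the rabbit hole starts.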

3. Building a custom ML classifier

I even tried fine-tuning a small BERT model on a dataset of 200 annotated emails. It was overkill. Training took hours, and the results weren't much better than spaCy's because the dataset was too small for emails that varied. I gave up on that after two days.

What eventually worked: an AI agent with function calling

After hitting multiple dead ends, I stepped back and asked: "What's the most flexible way to extract structured data from free text?" The answer was an AI language model that can follow instructions and output JSON.

I built a lightweight agent that takes the raw text and a schema definition, then calls an LLM (I used OpenAI's API, but you can use any model with function calling) to extract the fields. The key was function calling: I defined a function that the model could "call" with the extracted parameters.

Here's the core approach:

  1. Define a Pydantic model for the output schema
  2. Write a system prompt explaining the extraction rules
  3. Request the model to call a function with the extracted data
  4. Parse the function call arguments as JSON

Code example (Python)

# Requires openai>=1.0 and pydantic>=2
import json
from typing import Optional

from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Define the extraction schema (this is your structured output).
# default=None makes every field optional, so a missing value validates as null.
class EmailExtraction(BaseModel):
    date: Optional[str] = Field(default=None, description="The date mentioned, in YYYY-MM-DD format. Use relative date resolution.")
    amount: Optional[float] = Field(default=None, description="Monetary amount mentioned, as a number.")
    person: Optional[str] = Field(default=None, description="Full name of the person mentioned.")
    purpose: Optional[str] = Field(default=None, description="Short description of the meeting or request purpose.")

# Send the email to the model and force it to "call" extract_email_fields
def extract_email_data(text: str) -> EmailExtraction:
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",  # or gpt-3.5-turbo for faster/cheaper
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an extraction assistant. Extract structured fields from the user's email text. "
                    "If a field is not present, leave it as null. For relative dates like 'next Tuesday', "
                    "resolve them to an absolute date in YYYY-MM-DD format assuming today is 2024-03-20. "
                    "Output only the JSON matching the provided schema."
                )
            },
            {
                "role": "user",
                "content": f"Extract from this email:\n\n{text}"
            }
        ],
        # Describe the function using the JSON schema generated from the Pydantic model
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "extract_email_fields",
                    "description": "Extract structured fields from an email",
                    "parameters": EmailExtraction.model_json_schema()
                }
            }
        ],
        # Force the model to call this function instead of replying in prose
        tool_choice={"type": "function", "function": {"name": "extract_email_fields"}}
    )

    # Get the function call arguments (a JSON string) and validate them against the schema
    tool_calls = response.choices[0].message.tool_calls
    if not tool_calls:
        raise ValueError("Model did not call the extraction function")
    args = json.loads(tool_calls[0].function.arguments)
    return EmailExtraction(**args)

# Example usage
email_text = """
Hi, I need to meet with Dr. Alice next Wednesday at 2pm to go over the $3000 proposal. Let me know if that works.
"""

result = extract_email_data(email_text)
print(result.model_dump_json(indent=2))
# Example output (exact wording can vary between runs):
# {
#   "date": "2024-03-27",  // next Wednesday from March 20
#   "amount": 3000.0,
#   "person": "Dr. Alice",
#   "purpose": "discuss proposal"
# }

The beauty of this approach: you change the schema, and the model adapts. Adding a new field? Just add it to the Pydantic model. No regex rewrites, no pipeline changes.
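
One detail in the example is worth revisiting: the system prompt pins "today" to 2024-03-20, so relative dates drift as soon as the calendar moves on. A minimal sketch of building the prompt from the current date instead follows. The helper name `build_system_prompt` is hypothetical, and the prompt wording is the same as in the example.

```python
from datetime import date
from typing import Optional

def build_system_prompt(today: Optional[date] = None) -> str:
    """Build the extraction prompt with the reference date filled in.

    `today` is injectable so tests can pin it; it defaults to the current date.
    """
    reference = today or date.today()
    return (
        "You are an extraction assistant. Extract structured fields from the user's email text. "
        "If a field is not present, leave it as null. For relative dates like 'next Tuesday', "
        "resolve them to an absolute date in YYYY-MM-DD format "
        f"assuming today is {reference.isoformat()}. "
        "Output only the JSON matching the provided schema."
    )

print(build_system_prompt(date(2024, 3, 20)))
```

Passing the date in as a parameter also makes it possible to resolve against the date the email was sent rather than the date it was processed, which is probably what the sender meant.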

Where I'm hosting this

For my internal tool, I deployed a small Flask app that hits the OpenAI API. I also tested it with a local model via Ollama (like llama3), but the extraction accuracy was lower — enough for prototyping, not production. If you want to try a similar endpoint, there are services like https://ai.interwestinfo.com/ that offer structured extraction endpoints (I used a hosted one to offload the LLM call). But the technique is the same regardless of provider.

Lessons learned and trade-offs

Pros

  • Flexibility: Changing the schema takes seconds.
  • Accuracy: The LLM understands context (e.g., "next Tuesday" resolution).
  • Maintainability: One prompt and one model definition vs. 100 regex patterns.

Cons

  • Latency: LLM calls take 1-3 seconds. Not good for real-time streaming.
  • Cost: The OpenAI API charges per token. For high volume, you need to optimize (e.g., cache repeated patterns, use smaller models).
  • Determinism: The model might return slightly different JSON each time. Some use cases tolerate that variation; others need deterministic output (in which case regex might be better).
  • Hallucinations: It might invent a date if none is mentioned. You need to prompt it carefully to leave missing fields null.
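
The caching idea from the cost bullet can be sketched in a few lines. This is an in-memory illustration under assumptions of my own: `make_cached_extractor` is a hypothetical helper, and `extract` stands for any function that takes the email text and returns the extracted fields as a dict.

```python
import hashlib
from typing import Callable, Dict

def make_cached_extractor(extract: Callable[[str], dict]) -> Callable[[str], dict]:
    """Wrap an extractor so identical emails (ignoring whitespace) hit the LLM only once."""
    cache: Dict[str, dict] = {}

    def cached(text: str) -> dict:
        # Collapse runs of whitespace so trivially different copies share a key.
        normalized = " ".join(text.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = extract(text)
        return cache[key]

    return cached

# Stand-in for the real LLM call, so the sketch runs without an API key.
calls = []
def fake_extract(text: str) -> dict:
    calls.append(text)
    return {"amount": 500.0}

cached = make_cached_extractor(fake_extract)
cached("I owe you 500")
cached("I  owe you\n500")  # same text after whitespace normalization
print(len(calls))  # 1
```

A cache like this also helps with the determinism point: the same email always yields the same result for as long as the entry lives.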

When NOT to use this approach

  • If you have a very limited set of patterns and strict performance requirements (e.g., extracting order numbers from a standard format).
  • If you can't afford any cost per transaction.
  • If you need offline processing without any external API call (but open-source models are improving quickly).
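
As a contrast, here is the kind of case where plain regex is still the right tool. The `ORD-` plus six digits format is invented for illustration; substitute whatever fixed format your system actually uses.

```python
import re
from typing import List

# Hypothetical fixed format: "ORD-" followed by exactly six digits.
ORDER_RE = re.compile(r"\bORD-\d{6}\b")

def find_order_numbers(text: str) -> List[str]:
    """Return all order numbers in the assumed ORD-123456 format."""
    return ORDER_RE.findall(text)

print(find_order_numbers("Orders ORD-123456 and ORD-654321 shipped; ORD-12 is a typo."))
# ['ORD-123456', 'ORD-654321']
```

It runs without any network call, costs nothing per request, and returns the same answer every time.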

What I'd do differently next time

Next time, I'd start with the AI agent approach from day one, but also build a hybrid: use regex for the easy, high-confidence patterns (like email addresses), then fall back to the LLM for fuzzy extractions. I'd also add a validation layer to catch obvious LLM errors (e.g., date out of range).
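
A rough sketch of that hybrid, with the validation layer included, might look like the following. Everything here is illustrative rather than code from my tool: `extract_hybrid` and `validate_date` are hypothetical names, `llm_extract` stands for any callable that returns the LLM's fields as a dict, and the one-year window is an arbitrary assumption.

```python
import re
from datetime import date
from typing import Callable, Dict, Optional

# Simple email pattern; good enough for a high-confidence first pass.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def validate_date(value: Optional[str], today: date, max_days: int = 365) -> Optional[str]:
    """Keep an ISO date only if it parses and lies within max_days of today."""
    if value is None:
        return None
    try:
        parsed = date.fromisoformat(value)
    except (TypeError, ValueError):
        return None
    if abs((parsed - today).days) > max_days:
        return None  # out of range: more likely a hallucination than a real meeting
    return value

def extract_hybrid(text: str, llm_extract: Callable[[str], Dict], today: date) -> Dict:
    # Fuzzy fields (date, amount, person, purpose) come from the LLM.
    fields = dict(llm_extract(text))
    # Easy, high-confidence pattern: regex takes priority for email addresses.
    match = EMAIL_RE.search(text)
    fields["email"] = match.group(0) if match else None
    # Validation layer: drop values that are obviously wrong.
    fields["date"] = validate_date(fields.get("date"), today)
    return fields

# Stand-in for the LLM that returns an implausible date.
def fake_llm(text: str) -> Dict:
    return {"date": "1999-01-01", "amount": 500.0}

result = extract_hybrid("Ping alice@example.com about the $500 invoice", fake_llm, date(2024, 3, 20))
print(result)  # {'date': None, 'amount': 500.0, 'email': 'alice@example.com'}
```

Nulling out a suspicious value is the conservative choice. Depending on the tool, flagging the email for manual review may be a better fit.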

Also, I'd spend more time crafting the system prompt — a good prompt reduces hallucination and improves accuracy dramatically.

Over to you

Have you tried using LLMs for data extraction? Or do you still swear by regex? I'd love to hear about your experiences — especially if you've found a good open-source model that matches GPT-4 for this task.

Top comments (1)

Echo

Practical takeaways here. The framing around 'drift' is one I keep noticing in my own AI-assisted codebases too.