zhongqiyue

Posted on Jun 8

Why My Regex-Based Parser Failed and How LLM Function Calling Saved Me

#webdev #python #ai #tutorial

I have a confession: I once spent three full days writing regular expressions to parse doctor’s appointment emails from different providers. By the end, I had a 400-line monstrosity that worked for exactly two email formats. When a third clinic joined the system, I knew it was time for a different approach.

The Problem: Unstructured Text Everywhere

I was building a small integration that needed to extract structured data—dates, times, names, and addresses—from plain text messages. The sources were diverse: emails, Slack messages, even scanned PDF notes. Each had its own quirks. Regex was brittle. BeautifulSoup couldn’t help when there was no HTML. I tried custom NLP pipelines with spaCy, but training new entities for every field was overkill.

My team’s internal tool was on the verge of shipping, but every new text source meant another round of debugging regex patterns.

What Didn't Work

Regex per source: Worked for known formats, but failed on the next new one.
Rule-based keyword matching: Missed context. “Next Tuesday” was ambiguous without a reference date.
Offline NLP models: Required labeled data for each field, which we didn’t have.
Template matching: Assumed consistent structure that didn’t exist.
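To make the brittleness concrete, here is a minimal sketch of the failure mode. The pattern is a made-up example tuned to one clinic's wording, not one of the expressions from my original parser, and the second message is an invented variation.

```python
import re
from typing import Optional

# Hypothetical pattern written for one email format (illustration only).
PATTERN = re.compile(
    r"appointment for (?P<name>[A-Z][a-z]+ [A-Z][a-z]+) "
    r"on (?P<date>[A-Z][a-z]+ \d{1,2}) "
    r"at (?P<time>\d{1,2}(?::\d{2})? ?[AP]M)"
)

def parse(text: str) -> Optional[dict]:
    """Return the captured fields, or None when the text does not fit the pattern."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None

known = "Dr. Smith appointment for John Doe on March 12 at 10 AM at 123 Main St"
new = "John Doe is booked with Dr. Smith, 3/12/2025, 10:00"

print(parse(known))  # fields are captured for the format the pattern was written for
print(parse(new))    # None: same information, different wording
```

Both messages carry the same facts, but only the one the pattern was written for yields any data.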

I felt stuck. Then I remembered: large language models are great at understanding natural language instructions. Why not tell the model exactly what fields I want and let it extract them?

The Approach: Function Calling for Structured Extraction

OpenAI’s function calling (now exposed through the tools parameter of the Chat Completions API) lets you describe a function with a JSON schema and have the model return arguments that follow it. Instead of free text, you get back a JSON string you can parse and validate directly.

Here’s how I set it up.

Step 1: Define the schema

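import openai  # the default client reads OPENAI_API_KEY from the environment
from pydantic import BaseModel, Field

class Appointment(BaseModel):
    # Field descriptions end up in the JSON schema, so the model sees them;
    # plain code comments do not.
    patient_name: str
    date: str = Field(description="Appointment date in ISO 8601 format (YYYY-MM-DD)")
    time: str
    location: str
    notes: str = ""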

Step 2: Create the function definition

extraction_function = {
    "name": "extract_appointment",
    "description": "Extract appointment details from unstructured text",
    "parameters": Appointment.model_json_schema()
}

Step 3: Call the model

def extract_appointment(text: str) -> Appointment:
    """Send the text to the model and return the validated appointment."""
    response = openai.chat.completions.create(
        model="gpt-4o-mini",  # cheap and fast enough
        messages=[
            {"role": "system", "content": "You extract structured appointment data from text. If a field is missing, leave it empty."},
            {"role": "user", "content": f"Extract from: {text}"}
        ],
        tools=[{"type": "function", "function": extraction_function}],
        # Force the model to call our function instead of replying in free text.
        tool_choice={"type": "function", "function": {"name": "extract_appointment"}}
    )
    tool_call = response.choices[0].message.tool_calls[0]
    # The arguments arrive as a JSON string; Pydantic parses and type-checks it.
    return Appointment.model_validate_json(tool_call.function.arguments)

That’s it. Running extract_appointment("Dr. Smith appointment for John Doe on March 12 at 10 AM at 123 Main St") returns a clean Appointment object.
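One detail worth knowing: the arguments on a tool call arrive as a JSON string, not a dict, which is why the last line above goes through a JSON validator. The sketch below mimics that step with the standard library only, so it runs without an API key. The sample payload is invented for illustration and is not real model output, and parse_arguments is a hypothetical helper name.

```python
import json

REQUIRED = ("patient_name", "date", "time", "location")

def parse_arguments(arguments_json: str) -> dict:
    """Decode a tool call's arguments string and check the required keys."""
    try:
        data = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned malformed JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED if key not in data]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    data.setdefault("notes", "")  # optional field, same default as the schema
    return data

# Invented payload of the shape the schema describes.
sample = (
    '{"patient_name": "John Doe", "date": "2025-03-12", '
    '"time": "10:00", "location": "123 Main St"}'
)
print(parse_arguments(sample))
```

In the real pipeline Pydantic does this work and more, so treat this as a way to see what can go wrong rather than a replacement for it.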

Real-World Results

I tested this on 50 emails from different clinics. It handled:

Date formats: "March 12", "3/12/2025", "next Tuesday" (with context)
Time formats: "10:00 AM", "10:00", "10AM"
Missing fields: returned empty strings
Location variations: full addresses, building names, virtual meeting links

Accuracy was about 92% across all fields. The remaining 8% were mostly relative dates interpreted incorrectly (e.g., “next Monday” without knowing the reference date). For those, I added a prompt tweak: pass the current date as context.

Handling Edge Cases

import datetime

# text is the raw message to extract from
context = f"Today is {datetime.date.today().isoformat()}."
messages = [
    {"role": "system", "content": f"{context} Extract appointment data from the text below."},
    {"role": "user", "content": text}
]

That solved the relative date issue completely.
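If you want to test this behaviour, it helps to make the reference date injectable. A minimal sketch, assuming a hypothetical build_messages helper; adding the weekday name and the extra sentence about relative dates are my own additions, not something I measured.

```python
import datetime
from typing import Optional

def build_messages(text: str, today: Optional[datetime.date] = None) -> list:
    """Build the chat messages with an explicit reference date."""
    today = today or datetime.date.today()
    # The weekday name is an extra hint for phrases like "next Tuesday".
    context = f"Today is {today.isoformat()} ({today.strftime('%A')})."
    return [
        {
            "role": "system",
            "content": (
                f"{context} Extract appointment data from the text below. "
                "Resolve relative dates against today's date."
            ),
        },
        {"role": "user", "content": text},
    ]

messages = build_messages("See you next Tuesday at 10AM", datetime.date(2025, 6, 8))
print(messages[0]["content"])
```

Passing a fixed date in tests keeps results reproducible; in production the default falls back to the current date.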

Trade-offs and When NOT to Use This

This approach isn’t free (literally). Every extraction costs about 0.1¢ for gpt-4o-mini. For high-volume pipelines (thousands per day), the cost adds up. Also, latency is around 1–2 seconds per call, so it’s not suitable for real-time typing suggestions.
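A quick back-of-the-envelope calculation shows where the cost starts to matter. It uses the rough 0.1¢ per call figure from above; the daily volumes and the 30-day month are assumptions for illustration, and real prices depend on token counts.

```python
COST_PER_CALL_USD = 0.001  # 0.1 cents per extraction, the rough figure above

def monthly_cost(calls_per_day: int, days: int = 30) -> float:
    """Estimated monthly spend in USD at a flat per-call cost."""
    return calls_per_day * days * COST_PER_CALL_USD

# Assumed volumes: hundreds, thousands, and a much larger pipeline.
for volume in (300, 5_000, 100_000):
    print(f"{volume:>7} calls/day -> ${monthly_cost(volume):,.2f}/month")
```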

Alternatives I considered:

Regex: Free and fast, but a maintenance nightmare.
Specialized extraction APIs: Some services offer hosted extraction (like https://ai.interwestinfo.com/ in my config), but I wanted control over the schema.
Fine-tuned models: Great for fixed schemas, but required labeled data.

For my use case—medium volume (hundreds per day) with frequently changing sources—LLM function calling was the sweet spot.

Lessons Learned

Start with the strictest schema possible. The model will try to fill every field; empty strings for missing data are better than hallucinated values.
Include examples in the system prompt if the model makes consistent errors. I added one or two few-shot examples for tricky dates.
Validate the output before using it. The Pydantic model handles type checking, but you may want additional regex validation on phone numbers or zip codes.
Always pass context like today’s date for relative expressions.
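The validation lesson can be sketched as a small post-check on the extracted fields. This is a minimal example, assuming dates should be ISO 8601 and times should be 24-hour HH:MM; the time format is my assumption, since the schema above does not pin it down, and validation_errors is a hypothetical helper name.

```python
import datetime
import re

# Assumed time format for illustration: 24-hour HH:MM.
TIME_RE = re.compile(r"^(?:[01]\d|2[0-3]):[0-5]\d$")

def validation_errors(fields: dict) -> list:
    """Return a list of problems found in the extracted fields (empty if none)."""
    errors = []
    date_value = fields.get("date", "")
    if date_value:  # empty string means "missing", which is allowed
        try:
            datetime.date.fromisoformat(date_value)
        except ValueError:
            errors.append(f"date is not ISO 8601: {date_value!r}")
    time_value = fields.get("time", "")
    if time_value and not TIME_RE.match(time_value):
        errors.append(f"time is not HH:MM: {time_value!r}")
    return errors

print(validation_errors({"date": "2025-03-12", "time": "10:00"}))
print(validation_errors({"date": "March 12", "time": "10 AM"}))
```

The same pattern extends to phone numbers or zip codes: one check per field, collected into a list so you can log or re-prompt on all problems at once.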

What I’d Do Differently Next Time

I’d build a small validation loop: if the extracted date is in the past (and the text implies a future appointment), re-prompt the model with a hint. That would catch at least some hallucinated dates. I’d also consider running the extraction call in a background task to reduce perceived latency.
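Here is roughly what that validation loop could look like. It is a sketch, not something I have run in production: the extract callable is a hypothetical stand-in for the API call (it takes the text plus an optional hint and returns a dict), and the loop simplifies things by assuming every appointment should be in the future.

```python
import datetime
from typing import Callable

def extract_with_retry(
    text: str,
    extract: Callable[[str, str], dict],  # hypothetical: (text, hint) -> fields
    today: datetime.date,
    max_attempts: int = 2,
) -> dict:
    """Re-prompt with a hint when the extracted date is unparseable or in the past."""
    hint = ""
    result: dict = {}
    for _ in range(max_attempts):
        result = extract(text, hint)
        date_value = result.get("date", "")
        if not date_value:
            return result  # missing date: nothing to check
        try:
            parsed = datetime.date.fromisoformat(date_value)
        except ValueError:
            hint = "Return the date in ISO format (YYYY-MM-DD)."
            continue
        if parsed >= today:
            return result
        hint = f"Today is {today.isoformat()}; the appointment should be in the future."
    return result  # best effort after max_attempts

# Fake extractor for demonstration: wrong year first, corrected once hinted.
def fake_extract(text: str, hint: str) -> dict:
    return {"date": "2025-06-10"} if hint else {"date": "2024-06-10"}

result = extract_with_retry("Checkup next Tuesday", fake_extract, datetime.date(2025, 6, 8))
print(result)
```

Capping the attempts matters here, since every retry is another paid call.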

The Bigger Picture

This technique isn’t limited to appointments. I’ve since used it to extract invoice line items, meeting minutes, and even sentiment scores from customer feedback. Any problem that involves turning messy human text into structured data is a candidate.

If you’re still debugging regex patterns for the fifth format this week, give this a try. Your future self will thank you.

What’s your go-to method for extracting structured data from unstructured text? I’d love to hear what works for you.

DEV Community