I have a confession: I once spent three full days writing regular expressions to parse doctor’s appointment emails from different providers. By the end, I had a 400-line monstrosity that worked for exactly two email formats. When a third clinic joined the system, I knew it was time for a different approach.
The Problem: Unstructured Text Everywhere
I was building a small integration that needed to extract structured data—dates, times, names, and addresses—from plain text messages. The sources were diverse: emails, Slack messages, even scanned PDF notes. Each had its own quirks. Regex was brittle. BeautifulSoup couldn’t help when there was no HTML. I tried custom NLP pipelines with spaCy, but training new entities for every field was overkill.
My team’s internal tool was on the verge of shipping, but every new text source meant another round of debugging regex patterns.
What Didn't Work
- Regex per source: Worked for known formats, but failed on the next new one.
- Rule-based keyword matching: Missed context. “Next Tuesday” was ambiguous without a reference date.
- Offline NLP models: Required labeled data for each field, which we didn’t have.
- Template matching: Assumed consistent structure that didn’t exist.
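To make the first two failure modes concrete, here is a small sketch. The pattern and both sample messages are invented for illustration; they are not taken from the real clinic emails.

```python
import re
from datetime import date, timedelta

# Hypothetical pattern written for one clinic's wording: "on March 12 at 10 AM".
PATTERN = re.compile(
    r"on (?P<date>[A-Z][a-z]+ \d{1,2}) at (?P<time>\d{1,2}(?::\d{2})? ?[AP]M)"
)

known = "Your appointment is on March 12 at 10 AM."
new_source = "Appt: 3/12/2025, 10:00, Main St clinic"

print(PATTERN.search(known).groupdict())  # matches the format it was written for
print(PATTERN.search(new_source))         # None: the new format slips through


def next_weekday(reference: date, weekday: int) -> date:
    """Return the first date strictly after `reference` on `weekday` (Monday=0)."""
    days_ahead = (weekday - reference.weekday() - 1) % 7 + 1
    return reference + timedelta(days=days_ahead)


# "Next Tuesday" resolves to different dates depending on the reference date.
print(next_weekday(date(2025, 3, 3), 1))
print(next_weekday(date(2025, 3, 10), 1))
```

The same phrase yields two different calendar dates a week apart, which is why keyword matching alone cannot resolve it.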
I felt stuck. Then I remembered: large language models are great at understanding natural language instructions. Why not tell the model exactly what fields I want and let it extract them?
The Approach: Function Calling for Structured Extraction
OpenAI’s function calling (exposed through the tools parameter, and also referred to as tool calling) lets you define a JSON schema and ask the model to output data that matches it. Instead of returning free text, the model returns a JSON string of arguments that you can parse and validate directly.
Here’s how I set it up.
Step 1: Define the schema
    import openai
    from pydantic import BaseModel


    class Appointment(BaseModel):
        patient_name: str
        date: str  # ISO format
        time: str
        location: str
        notes: str = ""
Step 2: Create the function definition
    extraction_function = {
        "name": "extract_appointment",
        "description": "Extract appointment details from unstructured text",
        "parameters": Appointment.model_json_schema(),
    }
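For reference, the schema that Pydantic v2 generates for this model should look roughly like the hand-written dictionary below. This is an approximation written out by hand, not captured output, so key order and titles may differ in practice.

```python
import json

# Approximation of Appointment.model_json_schema() (an assumption, not captured output).
appointment_schema = {
    "title": "Appointment",
    "type": "object",
    "properties": {
        "patient_name": {"title": "Patient Name", "type": "string"},
        "date": {"title": "Date", "type": "string"},
        "time": {"title": "Time", "type": "string"},
        "location": {"title": "Location", "type": "string"},
        "notes": {"title": "Notes", "type": "string", "default": ""},
    },
    # Fields without a default are required; only `notes` is optional.
    "required": ["patient_name", "date", "time", "location"],
}

print(json.dumps(appointment_schema, indent=2))
```

Because four of the fields are required strings, a "missing" value has to be represented as an empty string rather than omitted.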
Step 3: Call the model
    def extract_appointment(text: str) -> Appointment:
        response = openai.chat.completions.create(
            model="gpt-4o-mini",  # cheap and fast enough
            messages=[
                {"role": "system", "content": "You extract structured appointment data from text. If a field is missing, leave it empty."},
                {"role": "user", "content": f"Extract from: {text}"},
            ],
            tools=[{"type": "function", "function": extraction_function}],
            tool_choice={"type": "function", "function": {"name": "extract_appointment"}},
        )
        tool_call = response.choices[0].message.tool_calls[0]
        return Appointment.model_validate_json(tool_call.function.arguments)
That’s it. Running extract_appointment("Dr. Smith appointment for John Doe on March 12 at 10 AM at 123 Main St") returns a clean Appointment object.
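The Step 3 code assumes the arguments string is always well-formed JSON with every required field. A minimal defensive sketch, using only the standard library, might look like this. The helper name parse_arguments and the REQUIRED_FIELDS tuple are hypothetical and simply mirror the Appointment model.

```python
import json

# Mirrors the required fields of the Appointment model (hypothetical constant).
REQUIRED_FIELDS = ("patient_name", "date", "time", "location")


def parse_arguments(raw: str) -> dict:
    """Parse a tool call's arguments string, raising ValueError with a readable message."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Can happen if the response was cut off, e.g. by a token limit.
        raise ValueError(f"model returned invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    data.setdefault("notes", "")  # same default as the Pydantic model
    return data
```

In the Pydantic version, Appointment.model_validate_json raises a ValidationError in these cases, so wrapping that call in a try/except is the equivalent move.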
Real-World Results
I tested this on 50 emails from different clinics. It handled:
- Date formats: "March 12", "3/12/2025", "next Tuesday" (with context)
- Time formats: "10:00 AM", "10:00", "10AM"
- Missing fields: returned empty strings
- Location variations: full addresses, building names, virtual meeting links
Accuracy was about 92% across all fields. The remaining 8% were mostly misread relative dates (e.g., “next Monday” without a known reference date). For those, I added a prompt tweak: pass the current date as context.
Handling Edge Cases
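    import datetime

    # Inside extract_appointment: replaces the messages list from Step 3.
    context = f"Today is {datetime.date.today().isoformat()}."
    messages = [
        {"role": "system", "content": f"{context} Extract appointment data from the text below. If a field is missing, leave it empty."},
        {"role": "user", "content": text},
    ]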
That resolved the relative-date failures in my test set.
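One possible refinement is to make the reference date a parameter instead of always reading the clock. The sketch below assumes a hypothetical helper, build_messages, and a prompt wording that is only a suggestion.

```python
import datetime
from typing import Dict, List, Optional

# Suggested wording only; adjust to taste.
SYSTEM_PROMPT = (
    "Today is {today}. Extract appointment data from the text below. "
    "Resolve relative dates against today's date and return them in ISO format. "
    "If a field is missing, leave it empty."
)


def build_messages(text: str, today: Optional[datetime.date] = None) -> List[Dict[str, str]]:
    """Build the chat messages; `today` is injectable so runs are reproducible."""
    reference = today or datetime.date.today()
    return [
        {"role": "system", "content": SYSTEM_PROMPT.format(today=reference.isoformat())},
        {"role": "user", "content": text},
    ]


messages = build_messages("See you next Monday at 3 PM", datetime.date(2025, 3, 10))
print(messages[0]["content"])
```

Passing the date in explicitly also makes it possible to use an email's sent date as the reference when a message is processed later than it was written, which is probably closer to what the sender meant.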
Trade-offs and When NOT to Use This
This approach isn’t free (literally). Every extraction costs about 0.1¢ for gpt-4o-mini. For high-volume pipelines (thousands per day), the cost adds up. Also, latency is around 1–2 seconds per call, so it’s not suitable for real-time typing suggestions.
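To see how the per-call figure scales, here is a back-of-the-envelope calculation. It takes the 0.1¢ estimate above at face value; the daily volumes and the 30-day month are illustrative assumptions.

```python
COST_PER_CALL_CENTS = 0.1  # estimate quoted above for gpt-4o-mini


def monthly_cost_dollars(calls_per_day: int, days: int = 30) -> float:
    """Approximate monthly spend in dollars for a given daily call volume."""
    return calls_per_day * days * COST_PER_CALL_CENTS / 100


# Illustrative volumes, not measurements.
for volume in (300, 5_000, 100_000):
    print(f"{volume:>7} calls/day -> ${monthly_cost_dollars(volume):,.2f}/month")
```

At a few hundred calls per day the bill stays around $9 a month; at 100,000 per day the same arithmetic gives roughly $3,000.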
Alternatives I considered:
- Regex: Free and fast, but maintenance nightmare.
- Specialized extraction APIs: Some services offer hosted extraction (https://ai.interwestinfo.com/ was the one in my config), but I wanted control over the schema.
- Fine-tuned models: Great for fixed schemas, but required labeled data.
For my use case—medium volume (hundreds per day) with frequently changing sources—LLM function calling was the sweet spot.
Lessons Learned
- Start with the strictest schema possible. The model will try to fill every field; empty strings for missing data are better than hallucinated values.
- Include examples in the system prompt if the model makes consistent errors. I added one or two few-shot examples for tricky dates.
- Validate the output before using it. The Pydantic model handles type checking, but you may want additional regex validation on phone numbers or zip codes.
- Always pass context like today’s date for relative expressions.
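The validation point can be sketched with the standard library alone. The function below is a hypothetical helper, and zip_code is not a field of the Appointment model above; it stands in for whatever extra fields a schema might carry, and the pattern assumes US ZIP codes.

```python
import re
from datetime import date
from typing import List

ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")  # US ZIP or ZIP+4 (assumes US addresses)


def validation_errors(fields: dict) -> List[str]:
    """Return human-readable problems; an empty list means the fields look sane."""
    errors = []
    raw_date = fields.get("date") or ""
    if raw_date:  # empty string means "missing", which is allowed
        try:
            date.fromisoformat(raw_date)
        except ValueError:
            errors.append(f"date is not in ISO format: {raw_date!r}")
    zip_code = fields.get("zip_code") or ""  # hypothetical extra field
    if zip_code and not ZIP_RE.match(zip_code):
        errors.append(f"zip code looks wrong: {zip_code!r}")
    return errors


print(validation_errors({"date": "March 12", "zip_code": "9021"}))
```

With the Pydantic model, the input could come from appointment.model_dump(); the same checks could also live in field validators on the model itself.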
What I’d Do Differently Next Time
I’d build a small validation loop: if the extracted date is in the past (and the text implies a future appointment), re-prompt the model with a hint. That would catch at least the most obvious hallucinated dates. Also, I’d consider streaming the function call in a background task to reduce perceived latency.
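A minimal sketch of such a loop follows. It assumes every appointment is upcoming, which is stricter than "the text implies future", and it takes the model call as an injected function. Both extract_with_retry and the extract callable (which accepts the text plus an optional hint and returns a dict) are hypothetical names.

```python
from datetime import date
from typing import Callable, Optional


def extract_with_retry(
    text: str,
    extract: Callable[[str, Optional[str]], dict],
    today: date,
    max_attempts: int = 2,
) -> dict:
    """Call `extract`, re-prompting with a hint when the date is unparseable or in the past."""
    hint: Optional[str] = None
    result: dict = {}
    for _ in range(max_attempts):
        result = extract(text, hint)
        try:
            extracted = date.fromisoformat(str(result.get("date") or ""))
        except ValueError:
            hint = "The date must be in ISO format (YYYY-MM-DD)."
            continue
        if extracted >= today:
            return result
        hint = (
            f"Today is {today.isoformat()}. The appointment should not be in the past, "
            f"but you returned {extracted.isoformat()}."
        )
    # Still suspicious after all attempts: return it and let the caller decide.
    return result
```

Injecting the extract function keeps the loop testable without network access; in production it would wrap the chat completion call and append the hint to the system prompt.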
The Bigger Picture
This technique isn’t limited to appointments. I’ve since used it to extract invoice line items, meeting minutes, and even sentiment scores from customer feedback. Any problem that involves turning messy human text into structured data is a candidate.
If you’re still debugging regex patterns for the fifth format this week, give this a try. Your future self will thank you.
What’s your go-to method for extracting structured data from unstructured text? I’d love to hear what works for you.