zhongqiyue

Posted on Jun 1

I stopped fighting with regex for data extraction. Here's how AI saved my sanity.

#ai #python #productivity #tutorial

I've been there. You have a pile of messy, unstructured text — support tickets, meeting notes, customer emails — and you need to extract specific fields. Maybe it's a ticket priority, a product name, or a deadline. Your first instinct is regex. It works for the first 10 cases. Then the 11th breaks everything.

I spent two days crafting a beautiful regex pattern only to watch it fail on a slightly different phrasing. I felt like Sisyphus in a Python console. That's when I finally gave in and let AI do the heavy lifting.

The pain of pattern matching

I was building a dashboard that ingested raw support tickets from multiple channels. Each ticket had an unstructured body, but we needed to extract:

Priority (high/medium/low)
Category (billing, technical, feature request)
Assigned agent

The tickets came from Slack, email, and a web form. Every source had its own quirks. People wrote "URGENT" in all caps, or "just a quick question". I tried:

Regex rules – worked for the first 50 tickets, then broke. Maintaining the list of patterns became a full-time job.
Keyword counting – good enough for category, but terrible for priority (people say "critical" and "blocker" interchangeably).
Manual tagging – no.
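To make the brittleness concrete, here is a minimal sketch of the kind of rule I mean. The pattern and the `regex_priority` helper are illustrative stand-ins, not the actual rules from my dashboard.

```python
import re
from typing import Optional

# Hypothetical keyword rule: treat a ticket as high priority only if it
# contains one of a fixed set of urgency phrases.
PRIORITY_RE = re.compile(r"\b(urgent|high priority|asap)\b", re.IGNORECASE)

def regex_priority(text: str) -> Optional[str]:
    """Return 'high' if an urgency keyword is present, else None."""
    return "high" if PRIORITY_RE.search(text) else None

print(regex_priority("URGENT: can't login"))              # high
print(regex_priority("Production is down for everyone"))  # None
```

The second ticket is arguably the more urgent one, but it contains none of the keywords, so the rule misses it. Every miss like that meant another pattern to add.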

I needed something that understood meaning, not just characters. Traditional NLP is complex. But then I realized: large language models are already good at this. The trick is getting them to return structured data reliably.

The approach: function calling (or JSON mode)

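Most modern LLM APIs support function calling or structured output. You define a JSON schema for the fields you want, and the model returns arguments that follow that shape, so you no longer have to parse natural-language responses. One caveat: with OpenAI, exact schema conformance is only guaranteed when strict structured outputs are enabled. Otherwise, treat the schema as a strong hint and validate what comes back.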

Here's the core pattern I use now:

import json
from openai import OpenAI

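# Placeholder key. In real code, don't hard-code it: OpenAI() with no
# arguments reads the OPENAI_API_KEY environment variable.
client = OpenAI(api_key="sk-...")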

def extract_ticket_data(raw_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You extract structured data from support tickets."},
            {"role": "user", "content": f"Extract from: {raw_text}"}
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "record_ticket",
                    "description": "Save extracted ticket information",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "priority": {
                                "type": "string",
                                "enum": ["high", "medium", "low"]
                            },
                            "category": {
                                "type": "string",
                                "enum": ["billing", "technical", "feature_request"]
                            },
                            "assignee": {
                                "type": "string",
                                "description": "Name or email of the requested agent, if mentioned"
                            }
                        },
                        "required": ["priority", "category"]
                    }
                }
            }
        ],
        tool_choice={"type": "function", "function": {"name": "record_ticket"}}
    )

    # tool_choice forces a record_ticket call; its arguments arrive as a JSON string.
    return json.loads(response.choices[0].message.tool_calls[0].function.arguments)

# Example
print(extract_ticket_data("URGENT: can't login after billing update. Please assign to Alice."))
# Typical output (model-dependent): {'priority': 'high', 'category': 'technical', 'assignee': 'Alice'}

This is clean, predictable, and cheap. For a typical ticket (~200 tokens), gpt-4o-mini costs fractions of a cent. And the output is ready to insert into a database.

Lessons learned the hard way

1. Be explicit with enums

If you let the model free‑text the priority, you'll get "critical", "high", "urgent", "asap", etc. By limiting with enum, you force consistency.
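The enum steers the model strongly, but unless strict structured outputs are enabled I wouldn't assume it is a hard guarantee. A minimal sketch of a cheap check on your side; `check_priority` and `ALLOWED_PRIORITIES` are hypothetical names, not part of any library.

```python
ALLOWED_PRIORITIES = {"high", "medium", "low"}

def check_priority(data: dict) -> dict:
    """Normalize the priority field and reject values outside the enum."""
    value = str(data.get("priority", "")).strip().lower()
    if value not in ALLOWED_PRIORITIES:
        raise ValueError(f"unexpected priority: {value!r}")
    # Return a copy so the caller's dict is left untouched.
    return {**data, "priority": value}
```

Raising here is a design choice. In a pipeline you might prefer to route the ticket to manual review instead of failing the whole batch.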

2. Handle missing values gracefully

Notice that I didn't make assignee required. If the text doesn't name anyone, the model can leave the field out, and your downstream code can default to "unassigned".
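A minimal sketch of that default, assuming the extracted data is a plain dict like the one returned by extract_ticket_data. The helper name `with_defaults` is hypothetical.

```python
def with_defaults(data: dict) -> dict:
    """Fill in 'unassigned' when the assignee is missing, None, or empty."""
    assignee = data.get("assignee") or "unassigned"
    return {**data, "assignee": assignee}

print(with_defaults({"priority": "low", "category": "billing"}))
# {'priority': 'low', 'category': 'billing', 'assignee': 'unassigned'}
```

Using `or` rather than a plain `.get()` default also covers the case where the model returns an empty string instead of omitting the field.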

3. Watch for hallucinated data

Once in a while the model invents a priority that looks plausible but is wrong. I mitigate this by:

Extracting only what's explicitly stated (system prompt: "Only extract if the text clearly states it")
Running a secondary validation (e.g., if priority is "high" but no urgency wording such as "urgent" or "high" appears anywhere in the text, flag it for review)
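The secondary validation can be sketched roughly like this. The hint list and the `needs_review` helper are assumptions for illustration, so tune the keywords to your own tickets.

```python
# Hypothetical list of phrases that justify a "high" priority.
URGENCY_HINTS = ("urgent", "asap", "critical", "blocker", "high priority", "immediately")

def needs_review(raw_text: str, data: dict) -> bool:
    """Flag tickets marked 'high' whose text contains no urgency wording."""
    if data.get("priority") != "high":
        return False
    lowered = raw_text.lower()
    return not any(hint in lowered for hint in URGENCY_HINTS)
```

This doesn't prove the extraction is wrong. It only surfaces the cases worth a second look, which keeps the manual review queue small.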

4. Latency and batching

Each call takes about 1 second, so 1,000 tickets processed one at a time take roughly 17 minutes. If you need speed, batch multiple tickets in a single call (prompt: "Extract data from the following 5 tickets, return a JSON array"). Test carefully: I've seen cross-contamination, where fields from one ticket leak into another ticket's result.
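Here is one way the batching could look, as a sketch only. The `ticket_id` tag, the prompt wording, and the helpers `chunked`, `build_batch_prompt`, and `ids_match` are all assumptions of mine. The idea is to label each ticket with an ID and check that the reply returns the same IDs in the same order, which catches the most obvious mix-ups.

```python
import json

def chunked(items: list, size: int) -> list:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_batch_prompt(batch: list) -> str:
    """Build one prompt from a list of (ticket_id, text) pairs."""
    header = (
        "Extract data from the following tickets. Return a JSON array with "
        "one object per ticket, each including its ticket_id."
    )
    parts = [header]
    for ticket_id, text in batch:
        parts.append(f"[ticket_id={ticket_id}]\n{text}")
    return "\n\n".join(parts)

def ids_match(batch: list, reply_json: str) -> bool:
    """Check that the reply has exactly the batch's ticket IDs, in order."""
    try:
        results = json.loads(reply_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(results, list):
        return False
    expected = [ticket_id for ticket_id, _ in batch]
    returned = [r.get("ticket_id") if isinstance(r, dict) else None for r in results]
    return expected == returned
```

Matching IDs won't detect a field copied from a neighbouring ticket, so it complements spot checks rather than replacing them.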

When NOT to use this approach

High‑throughput, low‑complexity – if you're extracting a simple ID from a fixed format, a regex is faster and cheaper.
Offline or air-gapped environments – you need API access. Local models exist, but they are often slower and less accurate.
Highly sensitive data – sending everything to OpenAI might violate compliance. Consider self‑hosted models or a service like AI Interwest Info that offers private endpoints. (I've tested it for a similar use case – works well if you can't use public APIs.)

What I'd do differently next time

I'd start with a simple prompt that outputs JSON directly, without function calling. Function calling is cleaner, but for an MVP you can just do:

"Extract priority (high/medium/low), category, and assignee from this ticket. Return only JSON: {}"

Then parse with json.loads(). In my experience that works roughly 90% of the time. Only add function calling when you need strict schema enforcement at scale.
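Much of the remaining failure rate tends to come from replies that wrap the JSON in a code fence or a sentence of prose. A minimal, tolerant parser might look like the sketch below. `parse_json_reply` is a hypothetical helper, and it assumes the reply contains at most one top-level JSON object.

```python
import json
from typing import Optional

def parse_json_reply(reply: str) -> Optional[dict]:
    """Return the JSON object found in a model reply, or None if there is none."""
    # Slice from the first "{" to the last "}" to skip fences and surrounding prose.
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Returning None instead of raising lets the caller decide whether to retry the prompt or send the ticket to manual review.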

Also, I'd invest in a small test suite of 50 edge‑case tickets before writing any extraction code. That way you can iterate quickly on prompts and evaluate accuracy.
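Such a test suite doesn't need a framework. The sketch below assumes each case is a (raw_text, expected_fields) pair and that the extractor is any function returning a dict, such as extract_ticket_data above. The `evaluate` helper and the stand-in extractor are hypothetical.

```python
def evaluate(extract, cases: list) -> tuple:
    """Run `extract` on each case and return (accuracy, failures)."""
    failures = []
    for raw_text, expected in cases:
        actual = extract(raw_text)
        # Compare only the fields the case specifies.
        mismatched = {
            key: (want, actual.get(key))
            for key, want in expected.items()
            if actual.get(key) != want
        }
        if mismatched:
            failures.append((raw_text, mismatched))
    accuracy = 1 - len(failures) / len(cases) if cases else 0.0
    return accuracy, failures

# Stand-in extractor so the harness can be tried without an API call.
def fake_extract(text: str) -> dict:
    return {"priority": "high" if "urgent" in text.lower() else "low"}

cases = [
    ("URGENT: site down", {"priority": "high"}),
    ("Invoice looks wrong", {"priority": "medium"}),
]
accuracy, failures = evaluate(fake_extract, cases)
print(accuracy)  # 0.5
```

Each failure records the expected and actual value per field, which makes it easy to see whether a prompt change fixed one edge case and broke another.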

Final thoughts

AI didn't replace me – it replaced the tedious, brittle patterns I used to write. Now I spend time on defining the schema and validating outputs, not debugging regex that explodes on a missing comma.

What's your go‑to method for extracting structured data from messy text? Are you still on the regex train, or have you jumped to an AI approach?

DEV Community