zhongqiyue

Posted on Jun 15

When Regex Fails: My Journey to AI-Powered Data Extraction

#python #ai #webdev #tutorial

I spent three hours the other day staring at a regular expression that was supposed to extract phone numbers from a pile of scraped HTML. It worked for 70% of the cases, then failed spectacularly on the rest. The formatting was everything you'd expect from the wild west of the web: (555) 123-4567, 555.123.4567, 5551234567, and the ever-popular call me at 555-123-4567 after 5.

Sound familiar? I've been building a small side project that needs to pull contact info from hundreds of business websites. I thought regex would be enough. I was wrong.

What I tried that didn't work

Regex-only approach

I started with the classic regex patterns from Stack Overflow. Something like:

import re

phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'

It caught the obvious ones, but missed numbers embedded in longer strings, tripped on international codes, and, worst of all, matched things like 123-456-7890 inside some random JavaScript variable. False positives everywhere.
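To make those failure modes concrete, here's a small check of the same pattern against a few made-up strings (the sample texts are mine, not from the scraped pages):

```python
import re

# Same pattern as above.
phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'

# Illustrative inputs only; none of these come from real pages.
samples = {
    "plain": "Call (555) 123-4567 today",
    "international": "Call +44 20 7946 0958 today",
    "javascript": 'var orderId = "123-456-7890";',
    "long_digits": "Tracking no. 98765432101234",
}

matches = {name: re.findall(phone_pattern, text) for name, text in samples.items()}

for name, found in matches.items():
    print(f"{name:>13}: {found}")
```

The UK-style number is missed entirely, the JavaScript string is reported as a phone number, and the first ten digits of the tracking number are reported as one too, because every separator in the pattern is optional.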

Beautiful Soup + manual cleaning

Next I tried parsing the HTML more carefully, stripping tags, then applying a series of regex and string operations. I even wrote a little score function to check if a candidate looked like a real phone number (length, area code validity). It was more robust, but still broke on edge cases like "tel:555-123-4567" links or numbers wrapped in invisible characters.
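A rough reconstruction of that cleaning-plus-scoring idea, not my original code: the function names and the 0 to 2 scale are made up for illustration, and the area-code rule assumes North American numbers, where area codes start with 2 through 9.

```python
import re
import unicodedata


def clean_candidate(raw: str) -> str:
    """Drop invisible format characters and a leading tel: scheme."""
    # Unicode category "Cf" covers zero-width spaces, joiners, BOMs and similar.
    visible = "".join(ch for ch in raw if unicodedata.category(ch) != "Cf")
    return re.sub(r"^\s*tel:", "", visible, flags=re.IGNORECASE).strip()


def score_candidate(raw: str) -> int:
    """Hypothetical 0-2 score: one point for length, one for a valid area code."""
    digits = re.sub(r"\D", "", clean_candidate(raw))
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the US/Canada country code
    score = 0
    if len(digits) == 10:
        score += 1
        if digits[0] in "23456789":  # NANP area codes never start with 0 or 1
            score += 1
    return score


print(score_candidate("tel:555-123-4567"))
print(score_candidate("555\u200b-123-4567"))  # zero-width space after the area code
print(score_candidate("123-456-7890"))
```

Scoring on digits only sidesteps most separator problems, but it still can't tell a phone number from any other ten-digit string, which is where this approach ran out of road for me.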

The dead-end: spaCy NER

I tried using spaCy's named entity recognition. It's great for general text, but the pretrained English pipelines don't include a phone-number entity type, so phone detection was spotty. Emails fared better. To get decent phone results I'd have had to train a custom model, which felt like overkill for a weekend project.

What eventually worked

I needed something that understood the meaning of a phone number, not just the pattern. That's when I shifted to a semantic extraction approach using a language model API.

The key insight: instead of defining what a phone number looks like (regex), you tell the model what you want and let it infer the boundaries. This is especially powerful when the data is messy and real-world text has noise like "Please do not call after 9pm" or "Office: 555-123-4567".

Here's the approach I settled on:

1. Extract the raw text from a web page (using Beautiful Soup or similar).
2. Send smaller chunks of text to an AI model with a clear instruction.
3. Parse the structured response (model returns JSON or a list).
4. Validate and deduplicate.
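The chunking and deduplication steps are plain Python and could be sketched like this. The helper names, the 2000-character window and the 200-character overlap are assumptions for illustration; the overlap is there so a number that straddles a chunk boundary still shows up whole in one chunk.

```python
import re


def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows of at most `size` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += step
    return chunks


def dedupe_phones(phones: list[str]) -> list[str]:
    """Keep the first spelling of each number, comparing by digits only."""
    seen: set[str] = set()
    unique = []
    for phone in phones:
        key = re.sub(r"\D", "", phone)
        if key and key not in seen:
            seen.add(key)
            unique.append(phone)
    return unique


print(len(chunk_text("x" * 5000)))
print(dedupe_phones(["(555) 123-4567", "555.123.4567", "555-987-6543"]))
```

Because of the overlap, the same number can come back from two neighbouring chunks, so deduplicating on digits rather than on the raw string matters.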

The code

import requests
import json

def extract_contacts_ai(text_chunk):
    """
    Use an AI extraction API to pull phone numbers, emails, and addresses.
    """
    # Keep the chunk small enough for API limits. This comment has to live out
    # here: inside the f-string it would be sent to the model as prompt text.
    snippet = text_chunk[:2000]
    prompt = f"""Extract all phone numbers, email addresses, and physical addresses from the following text.
Return the result as a JSON object with keys: phones, emails, addresses.
Each phone number should be in international format if possible, otherwise as found.
If none found, return empty lists.

Text:
{snippet}
"""

    # Example using InterWestInfo AI (https://ai.interwestinfo.com/)
    response = requests.post(
        "https://api.interwestinfo.com/v1/extract",  # fictional endpoint
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "model": "extraction-v1",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.1  # low for consistency
        },
        timeout=30,  # requests has no default timeout; don't hang forever
    )
    response.raise_for_status()
    data = response.json()
    return data.get("choices", [{}])[0].get("message", {}).get("content", "{}")

# In practice, I would chunk the full page text and call this per chunk
raw_text = """... scraped HTML as plain text ..."""
result = json.loads(extract_contacts_ai(raw_text))  # raises if the reply isn't valid JSON
print(result.get("phones", []))
print(result.get("emails", []))

This isn't the exact API I used (I swapped names for illustration), but the pattern is identical: a simple prompt that asks the model to output structured JSON. The low temperature makes the output more consistent from run to run, though it doesn't guarantee the reply is valid JSON.
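Since the model's reply is just text, it's worth parsing it defensively. Here's a minimal sketch, assuming the reply is supposed to contain one JSON object with the three keys from the prompt; parse_model_json is a hypothetical helper, not part of any API.

```python
import json
import re

KEYS = ("phones", "emails", "addresses")


def parse_model_json(content: str) -> dict:
    """Best-effort parse of a model reply that should contain one JSON object."""
    result = {key: [] for key in KEYS}
    # Models sometimes wrap JSON in a Markdown fence or add a sentence around
    # it, so take everything from the first "{" to the last "}".
    match = re.search(r"\{.*\}", content, flags=re.DOTALL)
    if match is None:
        return result
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return result  # treat unparseable replies as "nothing found"
    for key in KEYS:
        value = data.get(key)
        if isinstance(value, list):
            result[key] = [str(item) for item in value if item]
    return result


reply = '```json\n{"phones": ["555-123-4567"], "emails": []}\n```'
print(parse_model_json(reply))
```

Returning empty lists on a bad reply keeps a batch job running; for anything more serious you'd probably want to log the raw reply and retry instead of silently dropping it.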

Validation layer

Even with AI, you want a sanity check. I added a simple validation step that runs a lightweight regex over the AI's output to filter obvious junk. For example, if the model returns a phone number like "123-456-7890" that fits the pattern but is almost certainly a placeholder, I check it against a known list of fake numbers. This hybrid approach gave me 95%+ accuracy on my test set of 200 pages.
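A validation pass along those lines might look like the sketch below. The blocklist entries and the shape regex are illustrative assumptions, not the ones I actually used; the 15-digit ceiling comes from the E.164 numbering standard.

```python
import re

# Hypothetical blocklist of placeholder numbers, stored as digits only.
KNOWN_FAKE = {"1234567890", "0000000000", "5555555555"}

# Loose shape check: optional "+", then digits and common separators only.
PHONE_SHAPE = re.compile(r"^\+?[\d\s().-]{7,20}$")


def is_plausible_phone(candidate: str) -> bool:
    """Cheap sanity check for a phone number returned by the model."""
    if not PHONE_SHAPE.match(candidate.strip()):
        return False
    digits = re.sub(r"\D", "", candidate)
    if not 7 <= len(digits) <= 15:  # E.164 allows at most 15 digits
        return False
    return digits not in KNOWN_FAKE


def filter_phones(candidates: list[str]) -> list[str]:
    return [c for c in candidates if is_plausible_phone(c)]


print(filter_phones(["(555) 123-4567", "123-456-7890", "call us", "+44 20 7946 0958"]))
```

The regex here is deliberately loose. Its job is to reject replies that aren't phone numbers at all, not to re-do the extraction.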

Lessons learned

Regex is not dead, but it's not enough for real-world web data. Use it for quick validation, not primary extraction.
Semantic extraction (AI) excels at understanding context. It can distinguish between a phone number in a footer vs. a phone number mentioned in a blog post about movies.
Cost and latency are real trade-offs. Each API call costs money (even if cheap) and adds 1-3 seconds. For a one-time batch job, that's fine. For real-time requests, you'd want to cache or use a lighter model.
Prompt engineering matters more than model choice. A clear, specific prompt with expected output format returns reliable results. Vague prompts like "find the phone numbers" give messy responses.
Privacy and data sensitivity: Sending scraped text to a third-party API may violate terms of service for some sites. Make sure you're allowed to use the data that way.
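On the cost and latency point, the caching can be very simple. Here's a minimal in-memory sketch that wraps any extractor function and keys the cache on a hash of the chunk; make_cached_extractor is a hypothetical name, and a real batch job would probably persist the cache to disk.

```python
import hashlib
from typing import Callable


def make_cached_extractor(extract: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap an extractor so identical chunks are only sent to the API once."""
    cache: dict[str, str] = {}

    def cached(text_chunk: str) -> str:
        key = hashlib.sha256(text_chunk.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = extract(text_chunk)  # only pay for unseen chunks
        return cache[key]

    return cached


# Stand-in for the real API call, just to show the wrapper at work.
def fake_extract(text_chunk: str) -> str:
    print("calling API for", repr(text_chunk[:20]))
    return '{"phones": [], "emails": [], "addresses": []}'


cached_extract = make_cached_extractor(fake_extract)
cached_extract("Office: 555-123-4567")
cached_extract("Office: 555-123-4567")  # served from the cache, no second call
```

Shared headers and footers repeat across pages of the same site, so even this naive cache can skip a fair number of calls.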

When NOT to use this approach

If you're processing millions of documents, the cost will add up. Regex or a trained spaCy model will be cheaper and faster.
If the text is already well-structured (e.g., CSV exports), traditional parsing is better.
If you're working in an offline/air-gapped environment, you can't rely on external APIs. Consider local models like Llama 3 with Ollama.
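For the local route, here's a sketch of the same prompt sent to Ollama's HTTP API using only the standard library. It assumes Ollama is running on its default port with the llama3 model already pulled; I haven't benchmarked this setup, so treat it as a starting point and not a tested recipe.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_ollama_request(text_chunk: str, model: str = "llama3") -> urllib.request.Request:
    """Build (but don't send) a non-streaming generate request."""
    prompt = (
        "Extract all phone numbers, email addresses, and physical addresses "
        "from the following text. Return a JSON object with keys: "
        "phones, emails, addresses.\n\nText:\n" + text_chunk[:2000]
    )
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,   # one JSON reply instead of a token stream
        "format": "json",  # ask Ollama to constrain the output to JSON
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_contacts_local(text_chunk: str) -> str:
    """Send the request to a local Ollama server and return the model's reply text."""
    with urllib.request.urlopen(build_ollama_request(text_chunk), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

extract_contacts_local returns a string just like extract_contacts_ai does, so the rest of the pipeline doesn't need to change.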

What I'd do differently next time

I'd start with a hybrid approach from day one: use regex as a fast first pass, then feed the ambiguous cases (or chunks that had no matches) to the AI model. I'd expect that to cut API calls by around 60% and keep latency low.
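Here's roughly what that routing could look like. This sketch only handles the "no regex match" case, not the ambiguous one, and ai_extract stands for any function that takes a chunk and returns a list of phone strings (a hypothetical interface, stubbed out below).

```python
import re
from typing import Callable

PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")


def extract_hybrid(
    chunks: list[str], ai_extract: Callable[[str], list[str]]
) -> tuple[list[str], int]:
    """Regex first; only chunks with no regex hit go to the AI extractor."""
    phones: list[str] = []
    ai_calls = 0
    for chunk in chunks:
        found = PHONE_RE.findall(chunk)
        if not found:
            found = ai_extract(chunk)  # the slow, paid path
            ai_calls += 1
        phones.extend(found)
    return phones, ai_calls


# Stub standing in for the real model call.
def fake_ai(chunk: str) -> list[str]:
    return ["+44 20 7946 0958"] if "+44" in chunk else []


chunks = ["Office: 555-123-4567", "Call +44 20 7946 0958", "No contact info here"]
phones, ai_calls = extract_hybrid(chunks, fake_ai)
print(phones, ai_calls)
```

Counting ai_calls alongside the results makes it easy to measure how many calls the first pass actually saves on your own data.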

Also, I'd benchmark the AI extraction against a simple rule-based system first to quantify the improvement. Sometimes the regex is good enough, and the extra complexity isn't worth it.
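A benchmark like that doesn't need much: a hand-labelled list of expected numbers per page plus precision and recall for each extractor. A minimal sketch, comparing on digits only so formatting differences don't count as misses (the helper names are made up):

```python
import re


def digits_only(phone: str) -> str:
    return re.sub(r"\D", "", phone)


def precision_recall(predicted: list[str], expected: list[str]) -> tuple[float, float]:
    """Compare two lists of phone numbers, ignoring formatting."""
    pred = {digits_only(p) for p in predicted}
    gold = {digits_only(p) for p in expected}
    true_pos = len(pred & gold)
    # Convention used here: an empty set counts as perfect, not as zero.
    precision = true_pos / len(pred) if pred else 1.0
    recall = true_pos / len(gold) if gold else 1.0
    return precision, recall


# Made-up labels for one page, just to show the call.
print(precision_recall(
    ["555-123-4567", "123-456-7890"],
    ["(555) 123-4567", "555-987-6543"],
))
```

Running the regex-only extractor and the AI extractor through the same function on the same labelled pages gives a like-for-like comparison before committing to the extra complexity.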

Over to you

I'd love to hear your war stories on this. Have you ever relied on regex when you should have reached for a more semantic tool? Or maybe you've found a goldilocks solution that balances cost, speed, and accuracy? What's your go-to method for extracting structured data from messy text?

DEV Community