Most PII detection tools charge per API call because they run your text through an LLM. But for detecting structured patterns like emails, phone numbers, and credit cards, you don't need AI at all.
I built Origrid PII Detect -- a PII scanning API that uses pure regex pattern matching. Zero LLM calls, zero AI cost, sub-500ms response times.
## The problem
If you're building any app that handles user text (forms, comments, chat, logs), you probably need to check for accidentally exposed personal data before storing or forwarding it. GDPR requires it. Common sense demands it.
The existing options are:
- Microsoft Presidio -- powerful but requires self-hosting a full NLP pipeline
- AWS Comprehend -- great but $0.01+ per request adds up fast
- Google DLP -- enterprise pricing, enterprise complexity
For most use cases, you don't need NLP. Emails look like emails. Phone numbers look like phone numbers. Credit cards follow the Luhn algorithm.
## The approach: regex with smart deduplication
The API detects 6 entity types using pre-compiled regex patterns:
| Entity | How it's detected |
|---|---|
| Email | RFC 5322 simplified pattern |
| Phone | International formats (US, EU, UK, LATAM) |
| Credit card | Visa/MC/Amex/Discover patterns + Luhn validation |
| SSN | US format XXX-XX-XXXX with range validation |
| IBAN | European format with country code prefix |
| IP address | IPv4 with octet range validation |
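To make the table concrete, here's a minimal sketch of pre-compiled patterns and a scan loop. The regexes are simplified, illustrative versions, not the API's exact production patterns:

```python
import re

# Illustrative, simplified patterns -- production versions are stricter.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}"
        r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\b"
    ),
}

def scan(text: str) -> list[dict]:
    """Return every match with its type, value, and character span."""
    hits = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append({"type": entity_type, "value": m.group(),
                         "start": m.start(), "end": m.end()})
    return hits
```

Pre-compiling with `re.compile` at import time means each request only pays for the match itself.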
### Luhn validation for credit cards
This is the key differentiator from naive regex. A pattern like 4111-1111-1111-1111 matches the Visa format, but we also run Luhn's algorithm to verify it's a mathematically valid card number:
```python
def _luhn_check(number: str) -> bool:
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```
This eliminates false positives from random number sequences that happen to match card formats.
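As a quick sanity check, the same algorithm (restated in condensed form so this snippet runs standalone) accepts the classic Visa test number and rejects a one-digit tweak:

```python
def luhn_ok(number: str) -> bool:
    # Same checksum as _luhn_check above, condensed for the demo.
    digits = [int(d) for d in number if d.isdigit()]
    checksum = sum(d if i % 2 == 0 else (d * 2 - 9 if d > 4 else d * 2)
                   for i, d in enumerate(reversed(digits)))
    return len(digits) >= 13 and checksum % 10 == 0

print(luhn_ok("4111-1111-1111-1111"))  # classic Visa test number: True
print(luhn_ok("4111-1111-1111-1112"))  # one digit off: False
```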
### Smart deduplication
When patterns overlap (e.g., a phone number inside an IBAN), the API deduplicates by priority. Credit cards and SSNs have the highest priority since they're the most sensitive.
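One way to implement that resolution -- a sketch of the idea, not necessarily the API's exact code -- is to sort candidates by priority and keep only non-overlapping spans:

```python
# Lower number = higher priority; credit cards and SSNs win overlaps.
PRIORITY = {"credit_card": 0, "ssn": 1, "iban": 2, "phone": 3, "email": 4, "ip": 5}

def dedupe(matches: list[dict]) -> list[dict]:
    """Keep the highest-priority match for any set of overlapping spans."""
    kept: list[dict] = []
    for m in sorted(matches, key=lambda m: PRIORITY[m["type"]]):
        if all(m["end"] <= k["start"] or m["start"] >= k["end"] for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m["start"])

# A phone-like substring inside an IBAN match is dropped:
matches = [
    {"type": "phone", "start": 5, "end": 17},
    {"type": "iban", "start": 0, "end": 24},
]
print(dedupe(matches))  # only the IBAN survives
```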
## What you get back
Example response:
```json
{
  "pii_found": true,
  "entity_count": 3,
  "entities": [
    {"type": "email", "value": "john@test.com", "start": 6, "end": 19, "confidence": 1.0},
    {"type": "phone", "value": "+34 612 345 678", "start": 26, "end": 41, "confidence": 1.0},
    {"type": "credit_card", "value": "4111-1111-1111-1111", "start": 48, "end": 67, "confidence": 1.0}
  ],
  "redacted_text": "Email [EMAIL], call [PHONE], card [CREDIT_CARD]",
  "risk_level": "high"
}
```
Key features:
- Exact positions (`start`/`end`) so you can highlight or mask in your UI
- Redacted text ready to store safely
- Risk level (`high` = credit cards or SSNs found)
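Those offsets make client-side masking easy. A minimal sketch (assuming the entity shape from the example response above), replacing spans right-to-left so earlier offsets stay valid:

```python
def mask(text: str, entities: list[dict]) -> str:
    """Replace each detected span with a [TYPE] placeholder."""
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"[{e['type'].upper()}]" + text[e["end"]:]
    return text

entities = [{"type": "email", "start": 6, "end": 19}]
print(mask("Email john@test.com today", entities))  # Email [EMAIL] today
```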
## Performance
Because there's no AI model in the loop:
- Latency: ~100-400ms (network overhead, not compute)
- Cost per call: $0.00 (no LLM tokens)
- Reliability: deterministic -- same input always produces same output
## When you DO need AI
Regex won't catch:
- Names (without a dictionary)
- Street addresses (too many formats)
- Context-dependent PII ("my birthday is next Thursday")
For those, you need an LLM layer. I'm planning a "deep scan" mode for v2 that adds LLM analysis on top of regex. But for 80% of compliance use cases, regex covers what you need.
## Try it free
The API is live on RapidAPI with a free tier (50 requests/month):
Origrid PII Detect on RapidAPI
```python
import requests

response = requests.post(
    "https://origrid-pii-detect.p.rapidapi.com/v1/pii/scan",
    headers={
        "X-RapidAPI-Key": "YOUR_KEY",
        "Content-Type": "application/json",
    },
    json={"text": "Contact sarah@company.com or 555-123-4567"},
)
data = response.json()
print(data["redacted_text"])
# "Contact [EMAIL] or [PHONE]"
```
Built with FastAPI. Full OpenAPI docs available on the RapidAPI listing.
What PII patterns would you add? I'm considering passport numbers and driver's license formats for v2. Let me know in the comments.