Most PII detection tools charge per API call because they run your text through an LLM. But for detecting structured patterns like emails, phone numbers, and credit cards, you don't need AI at all.
I built Origrid PII Detect -- a PII scanning API that uses pure regex pattern matching. Zero LLM calls, zero AI cost, sub-500ms response times.
## The problem
If you're building any app that handles user text (forms, comments, chat, logs), you probably need to check for accidentally exposed personal data before storing or forwarding it. GDPR requires it. Common sense demands it.
The existing options are:
- Microsoft Presidio -- powerful but requires self-hosting a full NLP pipeline
- AWS Comprehend -- great but $0.01+ per request adds up fast
- Google DLP -- enterprise pricing, enterprise complexity
For most use cases, you don't need NLP. Emails look like emails. Phone numbers look like phone numbers. Credit cards follow the Luhn algorithm.
## The approach: regex with smart deduplication
The API detects 6 entity types using pre-compiled regex patterns:
| Entity | How it's detected |
|---|---|
| Email | RFC 5322 simplified pattern |
| Phone | International formats (US, EU, UK, LATAM) |
| Credit card | Visa/MC/Amex/Discover patterns + Luhn validation |
| SSN | US format XXX-XX-XXXX with range validation |
| IBAN | European format with country code prefix |
| IP address | IPv4 with octet range validation |
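To make the table concrete, here's a minimal sketch of pre-compiled patterns and a scan loop. The regexes are simplified, illustrative versions, not the API's exact production patterns:

```python
import re

# Illustrative, simplified patterns -- production versions are stricter.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}"
        r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\b"
    ),
}

def scan(text: str) -> list[dict]:
    """Return every match with its type, value, and character span."""
    hits = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append({"type": entity_type, "value": m.group(),
                         "start": m.start(), "end": m.end()})
    return hits
```

Pre-compiling with `re.compile` at import time means each request only pays for the match itself.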
### Luhn validation for credit cards
This is the key differentiator from naive regex. A pattern like 4111-1111-1111-1111 matches the Visa format, but we also run Luhn's algorithm to verify it's a mathematically valid card number:
```python
def _luhn_check(number: str) -> bool:
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```
This eliminates false positives from random number sequences that happen to match card formats.
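As a quick sanity check, the same algorithm (restated in condensed form so this snippet runs standalone) accepts the classic Visa test number and rejects a one-digit tweak:

```python
def luhn_ok(number: str) -> bool:
    # Same checksum as _luhn_check above, condensed for the demo.
    digits = [int(d) for d in number if d.isdigit()]
    checksum = sum(d if i % 2 == 0 else (d * 2 - 9 if d > 4 else d * 2)
                   for i, d in enumerate(reversed(digits)))
    return len(digits) >= 13 and checksum % 10 == 0

print(luhn_ok("4111-1111-1111-1111"))  # classic Visa test number: True
print(luhn_ok("4111-1111-1111-1112"))  # one digit off: False
```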
### Smart deduplication
When patterns overlap (e.g., a phone number inside an IBAN), the API deduplicates by priority. Credit cards and SSNs have the highest priority since they're the most sensitive.
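One way to implement that resolution -- a sketch of the idea, not necessarily the API's exact code -- is to sort candidates by priority and keep only non-overlapping spans:

```python
# Lower number = higher priority; credit cards and SSNs win overlaps.
PRIORITY = {"credit_card": 0, "ssn": 1, "iban": 2, "phone": 3, "email": 4, "ip": 5}

def dedupe(matches: list[dict]) -> list[dict]:
    """Keep the highest-priority match for any set of overlapping spans."""
    kept: list[dict] = []
    for m in sorted(matches, key=lambda m: PRIORITY[m["type"]]):
        if all(m["end"] <= k["start"] or m["start"] >= k["end"] for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m["start"])

# A phone-like substring inside an IBAN match is dropped:
matches = [
    {"type": "phone", "start": 5, "end": 17},
    {"type": "iban", "start": 0, "end": 24},
]
print(dedupe(matches))  # only the IBAN survives
```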
## What you get back
Example response:
```json
{
  "pii_found": true,
  "entity_count": 3,
  "entities": [
    {"type": "email", "value": "john@test.com", "start": 6, "end": 19, "confidence": 1.0},
    {"type": "phone", "value": "+34 612 345 678", "start": 26, "end": 41, "confidence": 1.0},
    {"type": "credit_card", "value": "4111-1111-1111-1111", "start": 48, "end": 67, "confidence": 1.0}
  ],
  "redacted_text": "Email [EMAIL], call [PHONE], card [CREDIT_CARD]",
  "risk_level": "high"
}
```
Key features:
- Exact positions (`start`/`end`) so you can highlight or mask in your UI
- Redacted text ready to store safely
- Risk level (`high` = credit cards or SSNs found)
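Those offsets make client-side masking easy. A minimal sketch (assuming the entity shape from the example response above), replacing spans right-to-left so earlier offsets stay valid:

```python
def mask(text: str, entities: list[dict]) -> str:
    """Replace each detected span with a [TYPE] placeholder."""
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"[{e['type'].upper()}]" + text[e["end"]:]
    return text

entities = [{"type": "email", "start": 6, "end": 19}]
print(mask("Email john@test.com today", entities))  # Email [EMAIL] today
```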
## Performance
Because there's no AI model in the loop:
- Latency: ~100-400ms (network overhead, not compute)
- Cost per call: $0.00 (no LLM tokens)
- Reliability: deterministic -- same input always produces same output
## When you DO need AI
Regex won't catch:
- Names (without a dictionary)
- Street addresses (too many formats)
- Context-dependent PII ("my birthday is next Thursday")
For those, you need an LLM layer. I'm planning a "deep scan" mode for v2 that adds LLM analysis on top of regex. But for 80% of compliance use cases, regex covers what you need.
## Try it free
The API is live on RapidAPI with a free tier (50 requests/month):
Origrid PII Detect on RapidAPI
```python
import requests

response = requests.post(
    "https://origrid-pii-detect.p.rapidapi.com/v1/pii/scan",
    headers={
        "X-RapidAPI-Key": "YOUR_KEY",
        "Content-Type": "application/json",
    },
    json={"text": "Contact sarah@company.com or 555-123-4567"},
)
data = response.json()
print(data["redacted_text"])
# "Contact [EMAIL] or [PHONE]"
```
Built with FastAPI. Full OpenAPI docs available on the RapidAPI listing.
What PII patterns would you add? I'm considering passport numbers and driver's license formats for v2. Let me know in the comments.