Every week another company makes headlines after a data breach. Often the root cause isn't a sophisticated attack; it's plaintext Personally Identifiable Information (PII) that was never supposed to be stored in the first place.
If your application accepts free-text input — support tickets, chat messages, form submissions, document uploads — there's a good chance you're inadvertently storing names, email addresses, phone numbers, national ID numbers, and even credit card details in your database right now.
In this tutorial you'll learn how to build a PII detection pipeline in Python that intercepts user-submitted text before it's persisted, flags sensitive data, and optionally redacts it. We'll use a dedicated PII detection API so the heavy NLP lifting is handled server-side, keeping your application logic clean.
Why Not Just Use a Regex?
Regex patterns work well for structured data like email addresses or phone numbers, but they completely miss contextual PII:
- "My doctor is John Smith and he works at St. Mary's" — contains a person's name in free text
- "Call me at oh-four-four-seven..." — phone number written out in words
- "IC: 901231-14-5678" — Malaysian national ID that no generic regex covers
A regex-based scanner will also generate a flood of false positives on things like order IDs, invoice numbers, or hex color codes. You need an approach that understands context, not just patterns.
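To make both failure modes concrete, here is a small sketch. The two patterns are deliberately naive stand-ins written for this illustration, not a recommended rule set: the email pattern finds the email, nothing catches the name or the spoken phone number, and the "16 digits" card pattern fires on an order ID.

```python
import re

# Illustrative patterns only (assumptions for this sketch).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD_RE = re.compile(r"\b\d{16}\b")  # naive: any run of exactly 16 digits

samples = {
    "email": "Reach me at sarah@example.com",
    "name": "My doctor is John Smith and he works at St. Mary's",
    "spoken_phone": "Call me at oh-four-four-seven",
    "order_id": "Your order 2024031500001234 has shipped",
}


def regex_hits(text: str) -> list[str]:
    """Return everything the two patterns match in the text."""
    return EMAIL_RE.findall(text) + CARD_RE.findall(text)


hits = {label: regex_hits(text) for label, text in samples.items()}

for label, found in hits.items():
    print(f"{label:13} -> {found}")
```

The name and the spoken phone number come back empty (false negatives), while the order ID is reported as if it were a card number (false positive).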
The Architecture
We'll build a three-layer pipeline:
User Input → PII Scan → Decision Gate → Database / Redacted Store
- Scan layer: Call a PII detection API with the raw text
- Decision gate: Block, redact, or log based on the entities found
- Storage layer: Only clean (or intentionally redacted) text reaches the DB
Setup
Install the dependencies:
pip install httpx fastapi uvicorn python-dotenv
Create a .env file:
RAPIDAPI_KEY=your_rapidapi_key_here
Step 1: The PII Scanner Client
# pii_scanner.py
import os
from dataclasses import dataclass
from typing import Optional

import httpx
from dotenv import load_dotenv

# Load RAPIDAPI_KEY from the .env file created during setup.
load_dotenv()

RAPIDAPI_KEY = os.getenv("RAPIDAPI_KEY")
GLOBALSHIELD_HOST = "globalshield-api.p.rapidapi.com"


@dataclass(frozen=True)
class PIIEntity:
    entity_type: str
    value: str
    start: int
    end: int
    confidence: float


@dataclass(frozen=True)
class ScanResult:
    has_pii: bool
    entities: tuple[PIIEntity, ...]
    redacted_text: Optional[str]


def scan_text(text: str) -> ScanResult:
    """
    Scan text for PII using GlobalShield API.
    Returns a ScanResult with detected entities and redacted version.
    """
    url = f"https://{GLOBALSHIELD_HOST}/detect"
    headers = {
        "x-rapidapi-key": RAPIDAPI_KEY,
        "x-rapidapi-host": GLOBALSHIELD_HOST,
        "Content-Type": "application/json",
    }
    payload = {"text": text, "redact": True}

    # Explicit timeout so a slow upstream cannot hang the request.
    with httpx.Client(timeout=10.0) as client:
        response = client.post(url, json=payload, headers=headers)
        response.raise_for_status()
        data = response.json()

    entities = tuple(
        PIIEntity(
            entity_type=e["type"],
            value=e["value"],
            start=e["start"],
            end=e["end"],
            confidence=e["confidence"],
        )
        for e in data.get("entities", [])
    )
    return ScanResult(
        has_pii=len(entities) > 0,
        entities=entities,
        redacted_text=data.get("redacted_text"),
    )
A few things to note:
- We use frozen=True on the dataclasses: scan results are immutable, which prevents accidental mutation downstream.
- We request "redact": True so the API returns a version of the text with PII replaced by [REDACTED_TYPE] tokens.
- The timeout is set explicitly. Never let a third-party API call block your request indefinitely.
Step 2: The Decision Gate
# pii_gate.py
from enum import Enum

from pii_scanner import ScanResult

# PII types that must NEVER be stored
BLOCKED_TYPES = {
    "CREDIT_CARD",
    "BANK_ACCOUNT",
    "NATIONAL_ID",
    "PASSPORT",
    "SOCIAL_SECURITY",
}

# PII types where we redact but allow storage
REDACT_TYPES = {
    "PERSON",
    "EMAIL_ADDRESS",
    "PHONE_NUMBER",
    "DATE_OF_BIRTH",
    "ADDRESS",
}


class GateDecision(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"


def evaluate(result: ScanResult) -> tuple[GateDecision, str]:
    """
    Evaluate a scan result and return a gate decision with the text to store.
    Returns (decision, text_to_store).
    """
    if not result.has_pii:
        return GateDecision.ALLOW, ""  # caller uses original text

    found_types = {e.entity_type for e in result.entities}

    # Blocked types take precedence over redactable ones.
    if found_types & BLOCKED_TYPES:
        return GateDecision.BLOCK, ""
    if found_types & REDACT_TYPES:
        return GateDecision.REDACT, result.redacted_text or ""

    # Detected types listed in neither set fall through to ALLOW;
    # add them to one of the sets above if that is not what you want.
    return GateDecision.ALLOW, ""
The gate has three outcomes:
| Decision | Meaning | Action |
|---|---|---|
| ALLOW | No PII found | Store original |
| REDACT | Soft PII (names, emails) | Store redacted version |
| BLOCK | Hard PII (SSN, credit card) | Reject with 422 |
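The precedence between the outcomes is easy to check without calling the API. The sketch below is a simplified stand-in for the gate: `decide` is a hypothetical helper that works on a bare set of entity type names rather than a full scan result, and `IP_ADDRESS` is a made-up type label used only to show what happens to a type that appears in neither set.

```python
from enum import Enum

BLOCKED_TYPES = {"CREDIT_CARD", "BANK_ACCOUNT", "NATIONAL_ID", "PASSPORT", "SOCIAL_SECURITY"}
REDACT_TYPES = {"PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "DATE_OF_BIRTH", "ADDRESS"}


class GateDecision(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"


def decide(found_types: set[str]) -> GateDecision:
    """Hypothetical helper: same precedence as the gate, keyed on type names only."""
    if found_types & BLOCKED_TYPES:
        return GateDecision.BLOCK
    if found_types & REDACT_TYPES:
        return GateDecision.REDACT
    return GateDecision.ALLOW


cases = {
    "nothing found": set(),
    "name and email": {"PERSON", "EMAIL_ADDRESS"},
    "name plus card": {"PERSON", "CREDIT_CARD"},  # one blocked type is enough
    "unlisted type": {"IP_ADDRESS"},              # in neither set
}
decisions = {label: decide(types) for label, types in cases.items()}

for label, decision in decisions.items():
    print(f"{label:15} -> {decision.value}")
```

A single blocked type wins over any number of redactable ones, and a type that is in neither set is allowed through, so it is worth reviewing both sets against the entity types your provider actually returns.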
Step 3: Wire It Into a FastAPI Endpoint
# main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from pii_gate import GateDecision, evaluate
from pii_scanner import scan_text

app = FastAPI(title="PII-Safe Submission API")


class SubmissionRequest(BaseModel):
    user_id: str
    message: str


class SubmissionResponse(BaseModel):
    stored_text: str
    pii_detected: bool
    decision: str


@app.post("/submit", response_model=SubmissionResponse)
def submit_message(req: SubmissionRequest):
    scan = scan_text(req.message)
    decision, clean_text = evaluate(scan)

    if decision == GateDecision.BLOCK:
        raise HTTPException(
            status_code=422,
            detail="Submission contains sensitive financial or identity data and cannot be stored.",
        )

    text_to_store = clean_text if decision == GateDecision.REDACT else req.message

    # --- Replace with your actual DB write ---
    # db.save(user_id=req.user_id, message=text_to_store)

    return SubmissionResponse(
        stored_text=text_to_store,
        pii_detected=scan.has_pii,
        decision=decision.value,
    )
Test it locally:
uvicorn main:app --reload
curl -X POST http://localhost:8000/submit \
-H "Content-Type: application/json" \
-d '{"user_id": "u123", "message": "Hi, my name is Sarah Chen and my email is sarah@example.com"}'
Response:
{
  "stored_text": "Hi, my name is [REDACTED_PERSON] and my email is [REDACTED_EMAIL_ADDRESS]",
  "pii_detected": true,
  "decision": "redact"
}
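The gate stores an empty string if the API ever returns a REDACT-worthy entity without a redacted_text. One possible fallback is to rebuild the same [REDACTED_TYPE] tokens locally from the entity offsets. This is only a sketch: `redact_locally` is a hypothetical helper, and it assumes `start`/`end` are character offsets with an exclusive `end`, which you should confirm against the API's actual responses.

```python
def redact_locally(text: str, entities: list[tuple[str, int, int]]) -> str:
    """Replace each (entity_type, start, end) span with a [REDACTED_TYPE] token.

    Assumes character offsets with an exclusive end (an assumption, not
    something the API is known to guarantee).
    """
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity_type, start, end in sorted(entities, key=lambda e: e[1], reverse=True):
        redacted = redacted[:start] + f"[REDACTED_{entity_type}]" + redacted[end:]
    return redacted


message = "Hi, my name is Sarah Chen and my email is sarah@example.com"


def span(entity_type: str, value: str) -> tuple[str, int, int]:
    """Build a span the way a detector might report it, for this demo only."""
    start = message.index(value)
    return entity_type, start, start + len(value)


entities = [span("PERSON", "Sarah Chen"), span("EMAIL_ADDRESS", "sarah@example.com")]
redacted = redact_locally(message, entities)
print(redacted)
```

Replacing from the end of the string backwards matters: the tokens are a different length from the text they replace, so a left-to-right pass would shift every later offset.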
Step 4: Logging for Compliance Audits
GDPR Article 30 requires you to maintain records of processing activities. A per-request log of what the gate detected and decided can support those records. Here's a minimal audit log:
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("pii_audit")


def log_pii_event(user_id: str, decision: str, entity_types: list[str]):
    audit_logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "decision": decision,
        "pii_types_detected": entity_types,
        # NOTE: Never log the actual PII values
    }))
Notice we log entity types but never the values. Logging {"type": "EMAIL_ADDRESS"} is fine. Logging {"value": "sarah@example.com"} defeats the entire purpose.
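One practical detail: a logger with no configuration inherits the root logger's WARNING level, so an info-level audit call produces no output until you set a level and attach a handler. The sketch below shows one way to wire that up. The in-memory buffer is a stand-in for whatever file or log shipper you actually use, and the `detected` list is made-up sample input.

```python
import io
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("pii_audit")
audit_logger.setLevel(logging.INFO)  # without this, info() calls are dropped

# Stand-in destination for the sketch; swap for a FileHandler or similar.
buffer = io.StringIO()
audit_logger.addHandler(logging.StreamHandler(buffer))


def log_pii_event(user_id: str, decision: str, entity_types: list[str]) -> None:
    audit_logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "decision": decision,
        "pii_types_detected": entity_types,
    }))


# Types as they might come back from a scan: duplicates are possible.
detected = ["PERSON", "EMAIL_ADDRESS", "PERSON"]
log_pii_event("u123", "redact", sorted(set(detected)))

record = json.loads(buffer.getvalue())
print(record["decision"], record["pii_types_detected"])
```

Deduplicating and sorting the types keeps each audit line small and stable, and only type names ever reach the log.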
Practical Considerations
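Latency: An extra API call adds network latency to every request, roughly 50–150ms as a ballpark; measure it against your own traffic. For high-volume endpoints, switch to httpx.AsyncClient and await the call from an async endpoint so a slow scan doesn't tie up a worker.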
False negatives: No PII detector is perfect. Use this as a first-pass filter, not as a compliance guarantee. Pair it with periodic retroactive scans of your existing data.
What counts as PII varies by jurisdiction: GDPR and CCPA differ in scope. For Malaysian data (PDPA 2010), national ID numbers (MyKad) and passport numbers are high-priority. Make sure your API provider's entity types map to your compliance requirements.
Caching: Don't cache scan results across users. A cache hit on a previous user's scan result could leak PII between accounts.
Why Use an API Instead of a Local Model?
Running a local NLP model (spaCy, Presidio, etc.) is a valid option — but it means you own the model maintenance, updates for new PII patterns, and the compute cost. A managed API:
- Keeps your deployment lightweight (no multi-GB model files)
- Gets updated automatically when new entity types are added
- Is easier to audit: one external dependency with a clear SLA
GlobalShield API is available on RapidAPI with a free tier for testing and pay-as-you-go pricing for production workloads.
Summary
With about 100 lines of Python you now have:
- A reusable PII scanner client
- A three-outcome decision gate (allow / redact / block)
- A FastAPI endpoint that prevents raw PII from reaching your database
- A GDPR-friendly audit log
The key principle: treat PII detection as infrastructure, not as a feature. Wire it in early, before data accumulates, and you'll save yourself a painful retroactive cleanup — or worse, a breach notification letter.
Dave Sng is an API builder based in Malaysia. He builds developer tools and publishes them on RapidAPI. Follow him on Dev.to for more API tutorials.