Dave Sng
How to Detect and Redact PII in Python Before It Hits Your Database

Every week another company makes headlines after a data breach. And nine times out of ten, the root cause isn't a sophisticated attack — it's plain text Personally Identifiable Information (PII) that was never supposed to be stored in the first place.

If your application accepts free-text input — support tickets, chat messages, form submissions, document uploads — there's a good chance you're inadvertently storing names, email addresses, phone numbers, national ID numbers, and even credit card details in your database right now.

In this tutorial you'll learn how to build a PII detection pipeline in Python that intercepts user-submitted text before it's persisted, flags sensitive data, and optionally redacts it. We'll use a dedicated PII detection API so the heavy NLP lifting is handled server-side, keeping your application logic clean.


Why Not Just Use a Regex?

Regex patterns work well for structured data like email addresses or phone numbers, but they completely miss contextual PII:

  • "My doctor is John Smith and he works at St. Mary's" — contains a person's name in free text
  • "Call me at oh-four-four-seven..." — phone number written out in words
  • "IC: 901231-14-5678" — Malaysian national ID that no generic regex covers

A regex-based scanner will also generate a flood of false positives on things like order IDs, invoice numbers, or hex color codes. You need an approach that understands context, not just patterns.
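To make the gap concrete, here's a toy regex scanner (the patterns are illustrative, not production-grade). It catches the structured email address but walks right past the person's name:

```python
import re

# Structured patterns only: the kind of thing a regex scanner can catch
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}

def naive_scan(text: str) -> list[str]:
    """Return the names of the patterns that matched anywhere in the text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]

print(naive_scan("Reach me at sarah@example.com"))           # ['EMAIL']
print(naive_scan("My doctor is John Smith at St. Mary's"))   # [] -- name missed
```

Both inputs contain PII, but the regex scanner only sees one of them.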


The Architecture

We'll build a three-layer pipeline:

```
User Input → PII Scan → Decision Gate → Database / Redacted Store
```
  1. Scan layer: Call a PII detection API with the raw text
  2. Decision gate: Block, redact, or log based on the entities found
  3. Storage layer: Only clean (or intentionally redacted) text reaches the DB

Setup

Install the dependencies:

```shell
pip install httpx fastapi uvicorn python-dotenv
```

Create a .env file:

```
RAPIDAPI_KEY=your_rapidapi_key_here
```

Step 1: The PII Scanner Client

```python
# pii_scanner.py
import os
from dataclasses import dataclass
from typing import Optional

import httpx
from dotenv import load_dotenv

load_dotenv()  # pick up RAPIDAPI_KEY from the .env file

RAPIDAPI_KEY = os.getenv("RAPIDAPI_KEY")
GLOBALSHIELD_HOST = "globalshield-api.p.rapidapi.com"

@dataclass(frozen=True)
class PIIEntity:
    entity_type: str
    value: str
    start: int
    end: int
    confidence: float

@dataclass(frozen=True)
class ScanResult:
    has_pii: bool
    entities: tuple[PIIEntity, ...]
    redacted_text: Optional[str]

def scan_text(text: str) -> ScanResult:
    """
    Scan text for PII using the GlobalShield API.
    Returns a ScanResult with detected entities and a redacted version.
    """
    url = f"https://{GLOBALSHIELD_HOST}/detect"
    headers = {
        "x-rapidapi-key": RAPIDAPI_KEY,
        "x-rapidapi-host": GLOBALSHIELD_HOST,
        "Content-Type": "application/json",
    }
    payload = {"text": text, "redact": True}

    with httpx.Client(timeout=10.0) as client:
        response = client.post(url, json=payload, headers=headers)
        response.raise_for_status()
        data = response.json()

    entities = tuple(
        PIIEntity(
            entity_type=e["type"],
            value=e["value"],
            start=e["start"],
            end=e["end"],
            confidence=e["confidence"],
        )
        for e in data.get("entities", [])
    )

    return ScanResult(
        has_pii=len(entities) > 0,
        entities=entities,
        redacted_text=data.get("redacted_text"),
    )
```

A few things to note:

  • We use frozen=True on the dataclasses — scan results are immutable, which prevents accidental mutation downstream.
  • We request "redact": True so the API returns a version of the text with PII replaced by [REDACTED_TYPE] tokens.
  • Timeout is set explicitly. Never let a third-party API call block your request indefinitely.
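One design decision the scanner leaves open: what happens when the API call itself fails? For a gate like this, failing closed (treating a scanner error as "PII present") is usually the safer default. A minimal wrapper sketch, assuming any callable with the same shape as scan_text:

```python
def has_pii_fail_closed(text: str, scanner) -> bool:
    """Treat any scanner failure as a positive hit, so an API outage
    never lets raw text through unchecked."""
    try:
        return scanner(text).has_pii
    except Exception:
        # Fail closed: timeouts, HTTP errors, and malformed responses
        # all count as "assume PII is present".
        return True
```

A fail-open variant (returning False on error) trades safety for availability; which you want depends on whether blocked submissions or leaked PII is the worse failure mode for your app.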

Step 2: The Decision Gate

```python
# pii_gate.py
from enum import Enum

from pii_scanner import ScanResult

# PII types that must NEVER be stored
BLOCKED_TYPES = {
    "CREDIT_CARD",
    "BANK_ACCOUNT",
    "NATIONAL_ID",
    "PASSPORT",
    "SOCIAL_SECURITY",
}

# PII types where we redact but allow storage
REDACT_TYPES = {
    "PERSON",
    "EMAIL_ADDRESS",
    "PHONE_NUMBER",
    "DATE_OF_BIRTH",
    "ADDRESS",
}

class GateDecision(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"

def evaluate(result: ScanResult) -> tuple[GateDecision, str]:
    """
    Evaluate a scan result and return a gate decision with the text to store.
    Returns (decision, text_to_store).
    """
    if not result.has_pii:
        return GateDecision.ALLOW, ""  # caller uses the original text

    found_types = {e.entity_type for e in result.entities}

    if found_types & BLOCKED_TYPES:
        return GateDecision.BLOCK, ""

    if found_types & REDACT_TYPES:
        return GateDecision.REDACT, result.redacted_text or ""

    return GateDecision.ALLOW, ""
```

The gate has three outcomes:

| Decision | Meaning | Action |
| --- | --- | --- |
| ALLOW | No PII found | Store original |
| REDACT | Soft PII (names, emails) | Store redacted version |
| BLOCK | Hard PII (SSN, credit card) | Reject with 422 |
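The priority ordering matters: blocked types are checked before redactable ones, so a message containing both a name and a card number is rejected outright rather than redacted and stored. Stripped to its core, the gate is just set intersection:

```python
BLOCKED_TYPES = {"CREDIT_CARD", "BANK_ACCOUNT", "NATIONAL_ID", "PASSPORT", "SOCIAL_SECURITY"}
REDACT_TYPES = {"PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "DATE_OF_BIRTH", "ADDRESS"}

def decide(found_types: set[str]) -> str:
    """Same decision ladder as evaluate(), operating on bare type names."""
    if found_types & BLOCKED_TYPES:
        return "block"   # hard PII wins, even if soft PII is also present
    if found_types & REDACT_TYPES:
        return "redact"
    return "allow"

print(decide({"PERSON", "CREDIT_CARD"}))  # block
print(decide({"EMAIL_ADDRESS"}))          # redact
print(decide(set()))                      # allow
```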

Step 3: Wire It Into a FastAPI Endpoint

```python
# main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from pii_scanner import scan_text
from pii_gate import GateDecision, evaluate

app = FastAPI(title="PII-Safe Submission API")

class SubmissionRequest(BaseModel):
    user_id: str
    message: str

class SubmissionResponse(BaseModel):
    stored_text: str
    pii_detected: bool
    decision: str

@app.post("/submit", response_model=SubmissionResponse)
def submit_message(req: SubmissionRequest):
    scan = scan_text(req.message)
    decision, clean_text = evaluate(scan)

    if decision == GateDecision.BLOCK:
        raise HTTPException(
            status_code=422,
            detail="Submission contains sensitive financial or identity data and cannot be stored.",
        )

    text_to_store = clean_text if decision == GateDecision.REDACT else req.message

    # --- Replace with your actual DB write ---
    # db.save(user_id=req.user_id, message=text_to_store)

    return SubmissionResponse(
        stored_text=text_to_store,
        pii_detected=scan.has_pii,
        decision=decision.value,
    )
```

Test it locally:

```shell
uvicorn main:app --reload
```
```shell
curl -X POST http://localhost:8000/submit \
  -H "Content-Type: application/json" \
  -d '{"user_id": "u123", "message": "Hi, my name is Sarah Chen and my email is sarah@example.com"}'
```

Response:

```json
{
  "stored_text": "Hi, my name is [REDACTED_PERSON] and my email is [REDACTED_EMAIL_ADDRESS]",
  "pii_detected": true,
  "decision": "redact"
}
```

Step 4: Logging for Compliance Audits

GDPR Article 30 requires you to maintain records of processing activities. Here's a minimal audit log:

```python
import logging
import json
from datetime import datetime, timezone

audit_logger = logging.getLogger("pii_audit")

def log_pii_event(user_id: str, decision: str, entity_types: list[str]):
    audit_logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "decision": decision,
        "pii_types_detected": entity_types,
        # NOTE: Never log the actual PII values
    }))
```

Notice we log entity types but never the values. Logging {"type": "EMAIL_ADDRESS"} is fine. Logging {"value": "sarah@example.com"} defeats the entire purpose.
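One caveat: a bare getLogger call has no handler attached, so by default these events may go nowhere. A minimal setup that emits one raw JSON line per event (the pii_audit.log filename is just an example):

```python
import logging

audit_logger = logging.getLogger("pii_audit")
audit_logger.setLevel(logging.INFO)

handler = logging.FileHandler("pii_audit.log")
handler.setFormatter(logging.Formatter("%(message)s"))  # raw JSON lines, no prefix
audit_logger.addHandler(handler)
```

In production you'd more likely ship these to your log aggregator, but the principle is the same: one structured record per decision, types only, no values.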


Practical Considerations

Latency: An extra API call adds ~50–150ms per request. For high-volume endpoints, add an async version using httpx.AsyncClient and await.

False negatives: No PII detector is perfect. Use this as a first-pass filter, not as a compliance guarantee. Pair it with periodic retroactive scans of your existing data.

What counts as PII varies by jurisdiction: GDPR and CCPA differ in scope. For Malaysian data (PDPA 2010), national ID numbers (MyKAD) and passport numbers are high-priority. Make sure your API provider's entity types map to your compliance requirements.

Caching: Don't cache scan results across users. A cache hit on a previous user's scan result could leak PII between accounts.
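If latency pressure forces you to cache anyway, scope the key to both the user and a hash of the content, so one account's result can never be served to another and raw text never leaks into cache keys or logs. A sketch (the key prefix is arbitrary):

```python
import hashlib

def scan_cache_key(user_id: str, text: str) -> str:
    """Per-user cache key; the text is hashed so no PII appears in the key."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"pii-scan:{user_id}:{digest}"

print(scan_cache_key("u123", "Hi, my name is Sarah Chen"))
```

The same text from two different users yields two different keys, which is exactly the isolation you want.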


Why Use an API Instead of a Local Model?

Running a local NLP model (spaCy, Presidio, etc.) is a valid option — but it means you own the model maintenance, updates for new PII patterns, and the compute cost. A managed API:

  • Keeps your deployment lightweight (no multi-GB model files)
  • Gets updated automatically when new entity types are added
  • Is easier to audit: one external dependency with a clear SLA

GlobalShield API is available on RapidAPI with a free tier for testing and pay-as-you-go pricing for production workloads.


Summary

With about 100 lines of Python you now have:

  • A reusable PII scanner client
  • A three-outcome decision gate (allow / redact / block)
  • A FastAPI endpoint that prevents raw PII from reaching your database
  • A GDPR-friendly audit log

The key principle: treat PII detection as infrastructure, not as a feature. Wire it in early, before data accumulates, and you'll save yourself a painful retroactive cleanup — or worse, a breach notification letter.


Dave Sng is an API builder based in Malaysia. He builds developer tools and publishes them on RapidAPI. Follow him on Dev.to for more API tutorials.
