Building GDPR-Compliant APIs: Automated PII Detection with Python

#python #api #security #gdpr

Handling user data without a solid PII detection layer is a liability. With EU AI Act obligations phasing in and GDPR fines hitting new records, developers can no longer treat privacy as an afterthought. In this tutorial, we'll build a practical PII detection middleware for FastAPI that automatically scans request payloads (the same hook can be extended to responses) and integrate a real-world API to do the heavy lifting.

The Problem: Data Leaks at the API Layer

Most PII leaks don't happen because of sophisticated hacks. They happen because a developer added a debug endpoint that returns full user objects, or a logging middleware that captures raw request bodies. By the time you catch it, the audit log already contains thousands of email addresses, passport numbers, or financial records.

The challenge is that PII comes in many forms:

Structured — email, phone, SSN, credit card numbers (easy to regex)
Semi-structured — names in free-text fields, addresses embedded in notes
Contextual — "patient John Doe" in a medical API, "customer ref: JD-2024-001" mapping to an identity

Regex catches the first category. The second and third require a smarter approach.
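
As a rough illustration of that first category, here is a minimal regex sketch. The pattern set and the find_structured_pii name are assumptions made for this example, not part of any library, and the patterns are deliberately simple: real formats vary by country and will produce both false positives and misses.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not a complete set).
STRUCTURED_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD_NUMBER": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
}


def find_structured_pii(text: str) -> list[dict]:
    """Return regex matches as entity dicts with type, value, start and end."""
    entities = []
    for pii_type, pattern in STRUCTURED_PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append({
                "type": pii_type,
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    # Order by position so callers can walk the text left to right.
    return sorted(entities, key=lambda e: e["start"])
```

A call such as find_structured_pii("patient John Doe was admitted") returns an empty list, which is exactly the gap the second and third categories expose.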

Architecture: Middleware-First PII Scanning

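The cleanest solution is a FastAPI middleware, which sits in the path of both incoming requests and outgoing responses. This gives you a single enforcement point without touching individual route handlers. The version below scans request bodies; response scanning can be added at the same point, after call_next returns.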

# pii_middleware.py
import json
import os
from typing import Callable

import httpx
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import Response

# Read the key from the environment rather than committing it to source control.
GLOBALSHIELD_API_KEY = os.environ.get("GLOBALSHIELD_API_KEY", "your_rapidapi_key_here")
GLOBALSHIELD_HOST = "globalshield-api.p.rapidapi.com"

async def scan_for_pii(text: str) -> dict:
    """Call GlobalShield API to detect PII in a text blob."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://globalshield-api.p.rapidapi.com/detect",
            json={"text": text},
            headers={
                "X-RapidAPI-Key": GLOBALSHIELD_API_KEY,
                "X-RapidAPI-Host": GLOBALSHIELD_HOST,
            },
            timeout=5.0,
        )
        response.raise_for_status()
        return response.json()


class PIIGuardMiddleware(BaseHTTPMiddleware):
    def __init__(self, app, block_on_detection: bool = False):
        super().__init__(app)
        self.block_on_detection = block_on_detection

    async def dispatch(self, request: Request, call_next: Callable) -> Response:
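        # Reading the body in middleware relies on Starlette replaying it to the
        # route handler; older releases could hang here, so check your pinned version.
        body_bytes = await request.body()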
        if body_bytes:
            body_text = body_bytes.decode("utf-8", errors="replace")
            scan_result = await scan_for_pii(body_text)
            if scan_result.get("pii_detected") and self.block_on_detection:
                return Response(
                    content=json.dumps({"error": "PII detected in request payload"}),
                    status_code=422,
                    media_type="application/json",
                )
            request.state.pii_scan = scan_result

        response = await call_next(request)
        return response

Wire it into your FastAPI app:

# main.py
from fastapi import FastAPI, Request
from pii_middleware import PIIGuardMiddleware

app = FastAPI()
app.add_middleware(PIIGuardMiddleware, block_on_detection=False)  # log-only mode

@app.post("/submit")
async def submit_form(request: Request, data: dict):
    pii_info = getattr(request.state, "pii_scan", {})
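    if pii_info.get("pii_detected"):
        # Log entity types only; printing the entities themselves would leak the raw values.
        found_types = sorted({e["type"] for e in pii_info.get("entities", [])})
        print(f"PII entity types found: {found_types}")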
    return {"status": "received"}

Start with block_on_detection=False so you can audit what's flowing through before you start rejecting requests. Flip it to True once you're confident in the detection accuracy.
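
One thing log-only mode does not cover is the scanner itself failing: as written, scan_for_pii raises on a timeout or a non-2xx response, which would turn a scanner outage into errors on every request. Below is a sketch of a fail-open wrapper. The safe_scan name and the scan_failed flag are hypothetical and not GlobalShield fields, and whether to fail open or fail closed is a policy decision for your team.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical fallback shape; "scan_failed" is an assumption, not an API field.
SCAN_UNAVAILABLE = {"pii_detected": False, "entities": [], "scan_failed": True}


async def safe_scan(
    scan: Callable[[str], Awaitable[dict]],
    text: str,
    timeout: float = 5.0,
) -> dict:
    """Run a scanner coroutine; return a fallback result if it fails or times out."""
    try:
        return await asyncio.wait_for(scan(text), timeout=timeout)
    except Exception:
        # Fail open: let the request proceed, but flag the gap so it shows up
        # in logs. A stricter deployment might reject the request instead.
        return dict(SCAN_UNAVAILABLE)
```

In the middleware this would be called as await safe_scan(scan_for_pii, body_text) in place of the direct call.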

GlobalShield API: What It Returns

GlobalShield API is a PII detection service that identifies over 30 entity types across multiple languages. A typical response looks like:

{
  "pii_detected": true,
  "confidence": 0.97,
  "entities": [
    {"type": "EMAIL", "value": "john@example.com", "start": 14, "end": 30},
    {"type": "PHONE", "value": "+60-12-345-6789", "start": 45, "end": 61}
  ],
  "risk_level": "HIGH"
}

The risk_level field is particularly useful for routing decisions: log LOW, quarantine MEDIUM, block HIGH.
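
That routing rule can be sketched as a small lookup. The Action enum and action_for function are hypothetical names for this example, and treating an unknown or missing risk_level as the strictest case is an assumption rather than documented API behavior.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    LOG = "log"
    QUARANTINE = "quarantine"
    BLOCK = "block"


# Mapping taken from the rule of thumb above: log LOW, quarantine MEDIUM, block HIGH.
RISK_ACTIONS = {"LOW": Action.LOG, "MEDIUM": Action.QUARANTINE, "HIGH": Action.BLOCK}


def action_for(scan_result: dict) -> Optional[Action]:
    """Pick an action from risk_level; None means no PII was detected."""
    if not scan_result.get("pii_detected"):
        return None
    level = str(scan_result.get("risk_level", "")).upper()
    # Assumption: an unknown or missing level is handled as the strictest case.
    return RISK_ACTIONS.get(level, Action.BLOCK)
```

What "quarantine" means in practice (a holding queue, a manual review table) is left to the application.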

Redaction: Masking PII Before Storage

Logging raw payloads for debugging is fine in dev. In production, you want to redact before writing to any persistent store:

def redact_pii(text: str, entities: list[dict]) -> str:
    """Replace detected PII spans with type-labeled placeholders."""
    sorted_entities = sorted(entities, key=lambda e: e["start"], reverse=True)
    chars = list(text)
    for entity in sorted_entities:
        placeholder = f"[{entity['type']}]"
        chars[entity["start"]:entity["end"]] = list(placeholder)
    return "".join(chars)

# Before: "Contact me at john@example.com or +60-12-345-6789"
# After:  "Contact me at [EMAIL] or [PHONE]"

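This works because the entities are sorted by start offset in descending order: replacing from the end of the string means the offsets of earlier spans never shift. It assumes the spans do not overlap; overlapping spans need to be merged first.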
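
If a detector ever returns overlapping spans (say, an email address and the domain inside it), they can be merged before redacting. This is a minimal sketch; merge_overlapping and redact_spans are hypothetical helper names, and keeping the type of the earliest span is just one possible policy.

```python
def merge_overlapping(entities: list[dict]) -> list[dict]:
    """Collapse overlapping spans into one, keeping the earliest span's type."""
    merged: list[dict] = []
    for entity in sorted(entities, key=lambda e: (e["start"], e["end"])):
        if merged and entity["start"] < merged[-1]["end"]:
            # Overlap: extend the previous span instead of adding a new one.
            merged[-1]["end"] = max(merged[-1]["end"], entity["end"])
        else:
            merged.append(
                {"type": entity["type"], "start": entity["start"], "end": entity["end"]}
            )
    return merged


def redact_spans(text: str, entities: list[dict]) -> str:
    """Right-to-left replacement, as in redact_pii, applied to merged spans."""
    spans = sorted(merge_overlapping(entities), key=lambda e: e["start"], reverse=True)
    for span in spans:
        text = text[: span["start"]] + f"[{span['type']}]" + text[span["end"]:]
    return text
```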

GDPR Article 25: Privacy by Design in Practice

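GDPR Article 25 requires data protection by design and by default, which in practice means building the data-protection principles of Article 5 into the system itself. Here's how the middleware approach maps to them: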

Requirement	Implementation
Data minimization	Block or strip PII not needed for the endpoint's purpose
Purpose limitation	Log which PII types appeared and whether they were expected
Storage limitation	Redact before writing to logs or databases
Integrity	Audit trail via `request.state.pii_scan` metadata

Part of a compliance audit that previously required manual code review can now be automated: replay sampled production traffic through the middleware in log-only mode (in an environment approved for that data) and check that no unexpected PII entity types appear for each endpoint.
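
The "unexpected entity types" check can be as simple as a set difference per endpoint. In this sketch the EXPECTED_PII allow-list, its paths, and the unexpected_types name are all hypothetical; the scan result shape follows the response example shown earlier.

```python
# Hypothetical per-endpoint allow-list; the paths and types are examples only.
EXPECTED_PII = {
    "/signup": {"EMAIL", "PHONE"},
    "/feedback": set(),
}


def unexpected_types(path: str, scan_result: dict) -> set[str]:
    """Return entity types in a payload that the endpoint is not expected to receive."""
    found = {entity["type"] for entity in scan_result.get("entities", [])}
    # Endpoints missing from the allow-list are treated as expecting no PII.
    return found - EXPECTED_PII.get(path, set())
```

An audit run would then fail whenever unexpected_types returns a non-empty set for any replayed request.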

Testing the Middleware

# test_pii_middleware.py
# Requires pytest-asyncio for the @pytest.mark.asyncio marker.
import pytest
from fastapi import FastAPI
from httpx import ASGITransport, AsyncClient

from main import app
from pii_middleware import PIIGuardMiddleware

@pytest.mark.asyncio
async def test_pii_detected_in_request(monkeypatch):
    async def mock_scan(text: str) -> dict:
        return {"pii_detected": True, "entities": [{"type": "EMAIL"}], "risk_level": "HIGH"}

    # Patch the name the middleware looks up at call time.
    monkeypatch.setattr("pii_middleware.scan_for_pii", mock_scan)

    # Recent httpx releases take the ASGI app via a transport rather than app=.
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as ac:
        response = await ac.post("/submit", json={"message": "email: user@test.com"})

    assert response.status_code == 200  # log-only mode, not blocked

@pytest.mark.asyncio
async def test_pii_blocked_in_strict_mode(monkeypatch):
    strict_app = FastAPI()
    strict_app.add_middleware(PIIGuardMiddleware, block_on_detection=True)

    async def mock_scan(text: str) -> dict:
        return {"pii_detected": True, "entities": [], "risk_level": "HIGH"}

    monkeypatch.setattr("pii_middleware.scan_for_pii", mock_scan)

    @strict_app.post("/submit")
    async def submit(data: dict):
        return {"ok": True}

    async with AsyncClient(transport=ASGITransport(app=strict_app), base_url="http://test") as ac:
        response = await ac.post("/submit", json={"note": "sensitive info"})

    assert response.status_code == 422  # strict mode rejects the request

Performance Considerations

Adding an external HTTP call on every request will add latency. A few mitigations:

Sampling in high-traffic APIs: scan 10% of requests in production, 100% in staging
Async non-blocking: the httpx.AsyncClient call won't block your event loop, although each scanned request still waits for the scan result
Caching by hash: SHA-256 the request body; if you've seen this exact payload before and it was clean, skip the scan
Size threshold: skip scanning very small payloads (under 50 bytes in the example below); a short body can still carry an email address or phone number, so tune this to your endpoints

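import hashlib

# Per-process, unbounded cache of clean payload hashes; bound or share it in production.
_clean_hashes: set[str] = set()

def _body_hash(body: bytes) -> str:
    return hashlib.sha256(body).hexdigest()

def should_skip_scan(body: bytes) -> bool:
    """Skip tiny payloads and payloads previously recorded as clean."""
    if len(body) < 50:  # length in bytes, not characters
        return True
    return _body_hash(body) in _clean_hashes

def mark_clean(body: bytes) -> None:
    """Record a payload as clean once a scan reports no PII."""
    _clean_hashes.add(_body_hash(body))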
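
The sampling mitigation can be sketched in a few lines as well. The should_sample name and the injectable rng parameter are assumptions for this example; random sampling is only one option, and hashing a request ID would give a deterministic alternative.

```python
import random
from typing import Optional


def should_sample(rate: float, rng: Optional[random.Random] = None) -> bool:
    """Return True for roughly `rate` of calls (0.0 = never, 1.0 = always)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    # Fall back to the module-level generator when no seeded one is supplied.
    source = rng or random
    # random() returns a float in [0.0, 1.0), so rate=1.0 always passes.
    return source.random() < rate
```

A production config would then pass 0.1 and staging 1.0, matching the split suggested above. Keep in mind that sampling trades coverage for latency: unsampled requests are not scanned at all.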

Wrapping Up

A three-layer approach covers most compliance requirements:

Detect — scan at the middleware level using GlobalShield or similar PII detection APIs
Redact — mask before logging or persisting
Audit — attach scan metadata to every request for compliance reporting

The middleware pattern keeps your route handlers clean and gives you a single place to update detection logic as regulations evolve. The EU AI Act adds new requirements around automated decision-making that build on GDPR foundations — getting the detection layer right now means less retrofitting later.

Full source code for the middleware is available in the examples above. If you're building an API that handles user data across jurisdictions, GlobalShield API is worth evaluating — it supports 30+ entity types including region-specific identifiers like Malaysian IC numbers, UK NINOs, and EU VAT IDs.

Dave Sng is an API builder based in Malaysia. He builds and publishes APIs on RapidAPI covering data validation, compliance, and document automation.