Dave Sng

Posted on Mar 23

How to Auto-Redact PII from User-Uploaded Documents (GDPR Compliant)

#api #python #security #privacy

If your application accepts document uploads (ID scans, contracts, medical records, financial statements), you're sitting on a GDPR liability. Any document that contains a name, phone number, email address, or national ID number holds personally identifiable information (PII) that you're legally required to protect.

Manual redaction doesn't scale. You need automated PII detection and redaction that works across languages, document types, and country-specific ID formats.

This article shows you how to build an automated PII redaction pipeline using GlobalShield — an API that combines OCR, language detection, translation, and 3-layer entity detection to find and redact PII from images and PDFs.

The PII Problem

Consider what a typical user-uploaded document contains:

Document Type	PII Found
Passport scan	Full name, DOB, passport number, nationality
Bank statement	Account number, name, address, transaction details
Medical record	Patient name, DOB, insurance number, diagnosis
Employment contract	Name, address, salary, tax ID, phone number
Invoice	Company name, VAT number, IBAN, contact details

Under GDPR Article 17 (Right to Erasure), users can request that you delete all their personal data. If PII is embedded in document images, you can't just DELETE FROM users — you need to redact the actual image pixels.

How GlobalShield Works

GlobalShield processes documents through a multi-stage pipeline:

Upload Image/PDF
    -> Preprocess (grayscale, contrast enhancement)
    -> OCR (Tesseract with hOCR bounding boxes)
    -> Language Detection (Unicode analysis)
    -> Translation (if non-English, via AI)
    -> 3-Layer Entity Detection:
        Layer 1: Regex patterns (emails, phones, IBANs)
        Layer 2: Country-specific rules (MY/SG/US/UK/CN/JP/IN/AU/TW)
        Layer 3: Microsoft Presidio NLP models
    -> Pixel-level Redaction (black boxes over PII regions)
    -> Return redacted image + metadata

The 3-layer detection approach is critical. Regex alone catches structured data like emails and phone numbers, but misses names and addresses. NLP models catch names but miss country-specific IDs like Malaysia's MyKad number or Japan's My Number. By combining all three layers, GlobalShield covers structured identifiers, country-specific IDs, and free-text entities such as names across 20+ languages.
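To make the layering concrete, here is a minimal sketch of how the first two layers might be combined. The patterns and the `detect_layers` helper are illustrative assumptions, not GlobalShield's actual rules. The MyKad pattern only checks the hyphenated 12-digit shape, and the NLP layer (Presidio) is left out.

```python
import re

# Layer 1: generic structured patterns (illustrative, deliberately simple).
GENERIC_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}

# Layer 2: country-specific patterns keyed by country code (assumed shapes).
COUNTRY_PATTERNS = {
    "MY": {"MY_MYKAD": re.compile(r"\b\d{6}-\d{2}-\d{4}\b")},
    "US": {"US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b")},
}


def detect_layers(text, countries=None):
    """Run the generic layer plus the selected country layers and merge hits."""
    patterns = dict(GENERIC_PATTERNS)
    # With no country filter, every country-specific pattern is applied.
    for code in countries or COUNTRY_PATTERNS:
        patterns.update(COUNTRY_PATTERNS.get(code, {}))

    hits = []
    for entity_type, pattern in patterns.items():
        for match in pattern.finditer(text):
            hits.append({
                "entity_type": entity_type,
                "text": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    # Sort by position so results read in document order.
    return sorted(hits, key=lambda hit: hit["start"])


sample = "Contact ali@example.com, MyKad 900101-14-5678."
for hit in detect_layers(sample, countries=["MY"]):
    print(hit["entity_type"], hit["text"])
```

A name such as "Ali bin Ahmad" would pass through both layers undetected, which is the gap the NLP layer is meant to close.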

Supported Languages

GlobalShield supports documents in:

English, Chinese (Simplified/Traditional), Japanese, Korean
Arabic, Hindi, Thai, Vietnamese, Malay
German, French, Spanish, Portuguese, Italian
Russian, Turkish, Polish, Dutch, Swedish
And more (20+ languages total)

Non-English documents are automatically detected and translated before entity detection, then redacted on the original image.
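The pipeline lists "Unicode analysis" for language detection. One simple way to approximate that step is to look at which script dominates the OCR text. The sketch below is an assumption about how such a check could work, not GlobalShield's implementation, and both `dominant_script` and `needs_translation` are hypothetical helper names.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script name among alphabetic characters."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        # Unicode character names start with the script,
        # e.g. "CYRILLIC SMALL LETTER A" or "CJK UNIFIED IDEOGRAPH-4F60".
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def needs_translation(text):
    """Assumption: only text in a non-Latin script is routed to translation."""
    return dominant_script(text) not in ("LATIN", "UNKNOWN")


print(dominant_script("Привет мир"))
print(needs_translation("Hello world"))
```

Script analysis alone cannot tell English from German or French, since all three use Latin letters, so a real pipeline would need a statistical language detector for those cases.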

Implementation

Basic Image Redaction

import asyncio
import json

import httpx

GLOBALSHIELD_URL = "https://globalshield.p.rapidapi.com"
HEADERS = {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
    "X-RapidAPI-Host": "globalshield.p.rapidapi.com"
}

async def redact_document(file_path: str) -> tuple[bytes, dict]:
    """Redact PII from an image. Returns redacted image bytes + metadata."""
    async with httpx.AsyncClient() as client:
        with open(file_path, "rb") as f:
            response = await client.post(
                f"{GLOBALSHIELD_URL}/v1/redact",
                headers=HEADERS,
                files={"file": (file_path, f, "image/png")}
            )

        # Raise on 4xx/5xx so an error body is never saved as an image
        response.raise_for_status()

        # Redacted image is in response body
        redacted_image = response.content

        # Detection metadata is in the header
        metadata = json.loads(
            response.headers.get("X-GlobalShield-Metadata", "{}")
        )

        return redacted_image, metadata

# Usage (await must run inside a coroutine in a plain script)
async def main():
    image_bytes, meta = await redact_document("passport_scan.png")
    print(f"Detected {meta.get('total_redacted', 0)} PII entities")
    print(f"Language: {meta.get('language_detected')}")
    for entity in meta.get("entities", []):
        print(f"  [{entity['entity_type']}] {entity['text']} "
              f"(confidence: {entity['confidence']})")

    # Save redacted image
    with open("passport_redacted.png", "wb") as f:
        f.write(image_bytes)

asyncio.run(main())

Detection Only (No Redaction)

If you want to detect PII without modifying the image:

async def detect_pii(file_path: str) -> dict:
    """Detect PII entities without redacting."""
    async with httpx.AsyncClient() as client:
        with open(file_path, "rb") as f:
            response = await client.post(
                f"{GLOBALSHIELD_URL}/v1/detect",
                headers=HEADERS,
                files={"file": (file_path, f, "image/png")}
            )
        # Raise on 4xx/5xx before trying to parse the body as JSON
        response.raise_for_status()
        return response.json()

# Usage
import asyncio

async def main():
    result = await detect_pii("contract.png")
    for entity in result["entities"]:
        print(f"[{entity['entity_type']}] '{entity['text']}' "
              f"at bbox {entity['bbox']}")

asyncio.run(main())

This is useful for audit logging — you can record what PII was found without modifying the original document.
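An audit log that copies the detected text would itself become a store of PII. The sketch below shows one way to summarize a detection result while leaving the matched text out. It assumes the response has the `entities`, `entity_type`, and `bbox` fields used above; `build_audit_record` and `document_id` are hypothetical names.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def build_audit_record(document_id, result):
    """Summarize a detection result without copying raw PII into the log."""
    entities = result.get("entities", [])
    return {
        "document_id": document_id,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "entity_count": len(entities),
        # Count per entity type; the matched text is deliberately dropped.
        "by_type": dict(Counter(e["entity_type"] for e in entities)),
        "bboxes": [e.get("bbox") for e in entities],
    }


example_result = {
    "entities": [
        {"entity_type": "EMAIL", "text": "jane@example.com", "bbox": [10, 20, 180, 40]},
        {"entity_type": "PHONE", "text": "+60 12-345 6789", "bbox": [10, 60, 150, 80]},
    ]
}
print(json.dumps(build_audit_record("doc-001", example_result), indent=2))
```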

PDF Redaction

GlobalShield handles multi-page PDFs natively:

async def redact_pdf(file_path: str, countries: str | None = None) -> bytes:
    """Redact PII from all pages of a PDF."""
    async with httpx.AsyncClient(timeout=120.0) as client:
        data = {}
        if countries:
            data["countries"] = countries

        with open(file_path, "rb") as f:
            response = await client.post(
                f"{GLOBALSHIELD_URL}/v1/redact-pdf",
                headers=HEADERS,
                files={"file": (file_path, f, "application/pdf")},
                data=data
            )

        # Raise on 4xx/5xx so an error body is never saved as a PDF
        response.raise_for_status()

        meta = json.loads(
            response.headers.get("X-GlobalShield-Metadata", "{}")
        )
        print(f"Processed {meta.get('total_pages', 0)} pages, "
              f"found {meta.get('total_entities', 0)} entities")

        return response.content

# Redact a multi-page contract
import asyncio

async def main():
    redacted_pdf = await redact_pdf(
        "employment_contract.pdf",
        countries="MY,SG"  # Focus on Malaysia + Singapore ID formats
    )
    with open("contract_redacted.pdf", "wb") as f:
        f.write(redacted_pdf)

asyncio.run(main())

Batch Processing

For bulk document processing (e.g., migrating a document archive):

import asyncio
import glob
from contextlib import ExitStack

async def batch_redact(file_paths: list[str]) -> bytes:
    """Redact PII from multiple images. Returns ZIP archive."""
    # ExitStack closes every opened file once the request has finished
    with ExitStack() as stack:
        files = [
            ("files", (path, stack.enter_context(open(path, "rb")), "image/png"))
            for path in file_paths
        ]
        async with httpx.AsyncClient(timeout=120.0) as client:
            response = await client.post(
                f"{GLOBALSHIELD_URL}/v1/batch-redact",
                headers=HEADERS,
                files=files
            )
    response.raise_for_status()
    return response.content  # ZIP archive

# Process 10 documents at once
async def main():
    docs = glob.glob("uploads/*.png")[:10]
    zip_data = await batch_redact(docs)
    with open("redacted_batch.zip", "wb") as f:
        f.write(zip_data)

asyncio.run(main())
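If you would rather store each redacted image individually than keep the archive, the ZIP can be unpacked in memory with the standard library. This is a sketch: the member names inside the archive are not documented here, so the hypothetical `unpack_redacted_zip` helper simply returns whatever names it finds.

```python
import io
import zipfile


def unpack_redacted_zip(zip_data):
    """Return a {member_name: bytes} mapping for every file in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_data)) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()
        }


# Build a small archive in memory to stand in for the API response.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("scan_1.png", b"fake image bytes")

for name, content in unpack_redacted_zip(buffer.getvalue()).items():
    print(name, len(content), "bytes")
```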

FastAPI Integration Example

Here's how to add automatic PII redaction to a FastAPI upload endpoint:

from fastapi import FastAPI, File, HTTPException, UploadFile
import httpx
import json

app = FastAPI()

@app.post("/upload-document")
async def upload_document(file: UploadFile = File(...)):
    """Upload a document. PII is automatically redacted before storage."""
    file_bytes = await file.read()

    # Send to GlobalShield for redaction
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://globalshield.p.rapidapi.com/v1/redact",
            headers={
                "X-RapidAPI-Key": "YOUR_KEY",
                "X-RapidAPI-Host": "globalshield.p.rapidapi.com"
            },
            files={"file": (file.filename, file_bytes, file.content_type)}
        )

    # If redaction failed, stop here so an error body is never stored
    if response.is_error:
        raise HTTPException(status_code=502, detail="PII redaction failed")

    metadata = json.loads(
        response.headers.get("X-GlobalShield-Metadata", "{}")
    )

    # Store the REDACTED version, not the original
    redacted_bytes = response.content
    # save_to_storage(redacted_bytes, file.filename)

    return {
        "status": "uploaded",
        "filename": file.filename,
        "pii_detected": metadata.get("total_redacted", 0),
        "language": metadata.get("language_detected"),
        "entities": [
            {"type": e["entity_type"], "confidence": e["confidence"]}
            for e in metadata.get("entities", [])
        ]
    }

Country-Specific Detection

GlobalShield's Layer 2 detection includes country-specific ID patterns:

Country	ID Types Detected
Malaysia (MY)	MyKad (NRIC), passport
Singapore (SG)	NRIC/FIN, passport
United States (US)	SSN, driver's license, passport
United Kingdom (UK)	NI number, passport, NHS number
China (CN)	Resident ID (18-digit), passport
Japan (JP)	My Number, passport, residence card
India (IN)	Aadhaar, PAN, passport
Australia (AU)	TFN, Medicare, passport
Taiwan (TW)	National ID, ARC

Pass the countries parameter to focus detection on specific regions:

# Only detect Malaysian and Singaporean ID patterns
async with httpx.AsyncClient() as client:
    response = await client.post(
        f"{GLOBALSHIELD_URL}/v1/redact",
        headers=HEADERS,
        files={"file": ("scan.png", image_bytes, "image/png")},
        data={"countries": "MY,SG"}
    )
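Since `countries` is a comma-separated string, it can help to build it from a list and reject codes that are not in the table above. The helper below is a hypothetical convenience, not part of the API. Whether the API accepts lowercase codes or ignores unknown ones is not stated here, so the sketch normalizes to uppercase and fails early.

```python
# Codes taken from the country table in this article.
SUPPORTED_COUNTRIES = {"MY", "SG", "US", "UK", "CN", "JP", "IN", "AU", "TW"}


def build_countries_param(codes):
    """Normalize country codes and join them into the comma-separated form."""
    normalized = []
    for code in codes:
        code = code.strip().upper()
        if code not in SUPPORTED_COUNTRIES:
            raise ValueError(f"Unsupported country code: {code!r}")
        if code not in normalized:  # drop duplicates, keep the original order
            normalized.append(code)
    return ",".join(normalized)


print(build_countries_param(["my", " SG ", "MY"]))
```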

Pricing

Endpoint	Credits per Call
`/v1/redact` (full pipeline)	10 credits
`/v1/detect` (detection only)	5 credits
`/v1/ocr` (text extraction)	3 credits
`/v1/redact-pdf`	10 credits/page
`/v1/batch-redact`	8 credits/image

Plan	Credits/Month	Price
Basic	500	Free
Pro	10,000	$29.99/mo
Ultra	100,000	$99.99/mo
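To see which plan a workload lands on, multiply each call count by its credit cost and compare the total against the monthly allowances. The sketch below uses the figures from the two tables above. The usage numbers are made up for illustration, and it assumes credits are a simple monthly allowance with no overage or rollover.

```python
# Credit costs and plan allowances copied from the tables above.
CREDITS_PER_UNIT = {
    "redact": 10,           # per call
    "detect": 5,            # per call
    "ocr": 3,               # per call
    "redact_pdf_page": 10,  # per page
    "batch_image": 8,       # per image
}
PLANS = [("Basic", 500), ("Pro", 10_000), ("Ultra", 100_000)]


def estimate_credits(usage):
    """Total monthly credits for a {unit_name: count} usage mapping."""
    return sum(CREDITS_PER_UNIT[unit] * count for unit, count in usage.items())


def smallest_plan(credits_needed):
    """Return the first plan whose allowance covers the need, else None."""
    for name, allowance in PLANS:
        if credits_needed <= allowance:
            return name
    return None


# Hypothetical month: 200 image redactions, 50 four-page PDFs, 100 detect calls.
monthly_usage = {"redact": 200, "redact_pdf_page": 50 * 4, "detect": 100}
total = estimate_credits(monthly_usage)
print(total, smallest_plan(total))
```

In this example the total is 200 × 10 + 200 × 10 + 100 × 5 = 4,500 credits, which exceeds the Basic allowance and fits within Pro.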

Get started: GlobalShield on RapidAPI

Why Not Build It Yourself?

You could assemble this pipeline from open-source components:

Tesseract for OCR
langdetect for language detection
Google Translate API for translation
Presidio for entity detection
Pillow for image redaction

But you'd spend weeks handling edge cases: multi-language documents, country-specific ID formats, bounding box alignment between OCR and redaction, PDF page rendering, and batch processing. GlobalShield packages all of this into a single API call.
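As a taste of one of those edge cases, consider bounding box alignment: a detected entity is a string, but redaction needs pixel rectangles. The sketch below parses word boxes out of hOCR markup and finds the boxes for a multi-word entity. It is a simplified illustration with hypothetical helper names. It assumes each `ocrx_word` span has its `class` attribute before `title` and contains plain text with no nested tags, and it requires an exact token match, which real OCR output often will not give you.

```python
import re

# Matches hOCR word spans such as:
# <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hi</span>
WORD_RE = re.compile(
    r"<span class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>([^<]*)</span>"
)


def parse_hocr_words(hocr):
    """Extract (text, (x0, y0, x1, y1)) pairs from hOCR word spans."""
    return [
        (m.group(5).strip(), tuple(int(v) for v in m.group(1, 2, 3, 4)))
        for m in WORD_RE.finditer(hocr)
    ]


def boxes_for_entity(words, entity_text):
    """Return the boxes of consecutive OCR words that spell out entity_text."""
    tokens = entity_text.split()
    for i in range(len(words) - len(tokens) + 1):
        window = words[i:i + len(tokens)]
        if [text for text, _ in window] == tokens:
            return [box for _, box in window]
    return []  # no exact match; a real pipeline would need fuzzy matching


hocr = (
    "<span class='ocrx_word' id='word_1_1' title='bbox 10 10 50 30; x_wconf 95'>Jane</span> "
    "<span class='ocrx_word' id='word_1_2' title='bbox 60 10 100 30; x_wconf 93'>Doe</span>"
)
print(boxes_for_entity(parse_hocr_words(hocr), "Jane Doe"))
```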

Processing documents with PII? Try GlobalShield's free tier and let me know how it works for your use case.
