If your application accepts document uploads (ID scans, contracts, medical records, financial statements), you're sitting on a GDPR liability. Any document that contains a name, phone number, email address, or national ID number holds personally identifiable information (PII) that you're legally required to protect.
Manual redaction doesn't scale. You need automated PII detection and redaction that works across languages, document types, and country-specific ID formats.
This article shows you how to build an automated PII redaction pipeline using GlobalShield — an API that combines OCR, language detection, translation, and 3-layer entity detection to find and redact PII from images and PDFs.
The PII Problem
Consider what a typical user-uploaded document contains:
| Document Type | PII Found |
|---|---|
| Passport scan | Full name, DOB, passport number, nationality |
| Bank statement | Account number, name, address, transaction details |
| Medical record | Patient name, DOB, insurance number, diagnosis |
| Employment contract | Name, address, salary, tax ID, phone number |
| Invoice | Company name, VAT number, IBAN, contact details |
Under GDPR Article 17 (Right to Erasure), users can request that you delete their personal data. If that data is embedded in document images you still need to keep, running DELETE FROM users isn't enough: you have to redact the image pixels themselves.
How GlobalShield Works
GlobalShield processes each document through the following pipeline:
Upload Image/PDF
  -> Preprocess (grayscale, contrast enhancement)
  -> OCR (Tesseract with hOCR bounding boxes)
  -> Language Detection (Unicode analysis)
  -> Translation (if non-English, via AI)
  -> 3-Layer Entity Detection:
       Layer 1: Regex patterns (emails, phones, IBANs)
       Layer 2: Country-specific rules (MY/SG/US/UK/CN/JP/IN/AU/TW)
       Layer 3: Microsoft Presidio NLP models
  -> Pixel-level Redaction (black boxes over PII regions)
  -> Return redacted image + metadata
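To make the OCR stage concrete: hOCR is an HTML-based output format in which each recognized word carries a title attribute holding its bounding box and confidence. The sketch below pulls those word boxes out with the standard library. The sample markup is illustrative, the regex is a simplification that assumes the attribute order shown, and none of this is GlobalShield's actual parser.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    conf: int                        # OCR word confidence, 0-100


# Matches one hOCR word span, e.g.
# <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 96'>Hello</span>
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)['\"][^>]*>"
    r"([^<]*)</span>"
)


def parse_hocr_words(hocr: str) -> list[Word]:
    """Extract (text, bbox, confidence) for every word span in hOCR markup."""
    words = []
    for x0, y0, x1, y1, conf, text in WORD_RE.findall(hocr):
        words.append(Word(text, (int(x0), int(y0), int(x1), int(y1)), int(conf)))
    return words


# Illustrative sample, not real OCR output
SAMPLE = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 36 92 96 116; x_wconf 96'>Name:</span> "
    "<span class='ocrx_word' id='word_1_2' "
    "title='bbox 104 92 180 116; x_wconf 91'>Aisha</span>"
)

for word in parse_hocr_words(SAMPLE):
    print(word)
```

For production use, a real HTML parser is a safer choice than a regex, since attribute order and quoting can vary.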
Each detection layer covers a gap the others leave. Regex alone catches structured data like emails and phone numbers, but misses names and addresses. NLP models catch names but miss country-specific IDs like Malaysia's MyKad number or Japan's My Number. Combining all three lets GlobalShield cover structured identifiers, national IDs, and free-text entities across 20+ languages.
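Here is a minimal sketch of how the first two layers could be combined, assuming deliberately simplified patterns. The regexes, the entity labels, and the detect_layers helper are all illustrative rather than GlobalShield's real rules, and Layer 3 (the NLP model) is left out entirely.

```python
import re

# Layer 1: generic structured patterns (simplified for illustration)
GENERIC_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

# Layer 2: country-specific rules, keyed by country code
COUNTRY_PATTERNS = {
    "MY": {"MY_MYKAD": re.compile(r"\b\d{6}-\d{2}-\d{4}\b")},  # YYMMDD-PB-###G
    "US": {"US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b")},
}


def detect_layers(text: str, countries=("MY", "US")) -> list[tuple[str, int, int]]:
    """Return (entity_type, start, end) spans, dropping overlapping matches."""
    patterns = dict(GENERIC_PATTERNS)
    for code in countries:
        patterns.update(COUNTRY_PATTERNS.get(code, {}))

    spans = []
    for label, pattern in patterns.items():
        for match in pattern.finditer(text):
            spans.append((label, match.start(), match.end()))

    # Sort by start offset; for equal starts, prefer the longer match
    spans.sort(key=lambda s: (s[1], -(s[2] - s[1])))
    merged = []
    for span in spans:
        if merged and span[1] < merged[-1][2]:
            continue  # overlaps a span we already kept
        merged.append(span)
    return merged


text = "Contact aisha@example.com, MyKad 900101-14-5678, SSN 123-45-6789"
for label, start, end in detect_layers(text):
    print(label, text[start:end])
```

Restricting the countries argument narrows Layer 2 to the formats you expect, which is the same idea as the API's countries parameter.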
Supported Languages
GlobalShield supports documents in:
- English, Chinese (Simplified/Traditional), Japanese, Korean
- Arabic, Hindi, Thai, Vietnamese, Malay
- German, French, Spanish, Portuguese, Italian
- Russian, Turkish, Polish, Dutch, Swedish
- And more (20+ languages total)
Non-English documents are automatically detected and translated before entity detection, then redacted on the original image.
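The pipeline lists "Unicode analysis" for language detection. One simple form of that idea is to count which script the letters in the OCR text belong to, using Unicode character names. The sketch below is an assumption about what such a check might look like, not GlobalShield's implementation. Note that a script is not a language: Latin alone covers English, Malay, German, and many more, so a real detector needs a further step.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Guess the dominant script of a string from Unicode character names."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        if not name:
            continue
        # Names start with the script, e.g. "CJK UNIFIED IDEOGRAPH-4E2D"
        if name.startswith("CJK"):
            counts["CJK"] += 1
        elif name.startswith(("HIRAGANA", "KATAKANA")):
            counts["JAPANESE_KANA"] += 1
        else:
            counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


for sample in ["Employment contract", "Привет", "中文合同", "こんにちは", "12345"]:
    print(repr(sample), "->", dominant_script(sample))
```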
Implementation
Basic Image Redaction
import asyncio
import json

import httpx

GLOBALSHIELD_URL = "https://globalshield.p.rapidapi.com"
HEADERS = {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
    "X-RapidAPI-Host": "globalshield.p.rapidapi.com",
}


async def redact_document(file_path: str) -> tuple[bytes, dict]:
    """Redact PII from an image. Returns redacted image bytes + metadata."""
    async with httpx.AsyncClient() as client:
        with open(file_path, "rb") as f:
            response = await client.post(
                f"{GLOBALSHIELD_URL}/v1/redact",
                headers=HEADERS,
                files={"file": (file_path, f, "image/png")},
            )
    # Fail loudly on 4xx/5xx instead of saving an error body as an image
    response.raise_for_status()
    # Redacted image is in the response body
    redacted_image = response.content
    # Detection metadata is in the header
    metadata = json.loads(
        response.headers.get("X-GlobalShield-Metadata", "{}")
    )
    return redacted_image, metadata


async def main() -> None:
    image_bytes, meta = await redact_document("passport_scan.png")
    print(f"Detected {meta['total_redacted']} PII entities")
    print(f"Language: {meta['language_detected']}")
    for entity in meta["entities"]:
        print(f"  [{entity['entity_type']}] {entity['text']} "
              f"(confidence: {entity['confidence']})")
    # Save redacted image
    with open("passport_redacted.png", "wb") as f:
        f.write(image_bytes)


# Top-level await only works in a notebook or REPL, so wrap the usage
asyncio.run(main())
Detection Only (No Redaction)
If you want to detect PII without modifying the image:
import asyncio

import httpx

# Reuses GLOBALSHIELD_URL and HEADERS from the basic redaction example.

async def detect_pii(file_path: str) -> dict:
    """Detect PII entities without redacting."""
    async with httpx.AsyncClient() as client:
        with open(file_path, "rb") as f:
            response = await client.post(
                f"{GLOBALSHIELD_URL}/v1/detect",
                headers=HEADERS,
                files={"file": (file_path, f, "image/png")},
            )
    response.raise_for_status()
    return response.json()


async def main() -> None:
    result = await detect_pii("contract.png")
    for entity in result["entities"]:
        print(f"[{entity['entity_type']}] '{entity['text']}' "
              f"at bbox {entity['bbox']}")


asyncio.run(main())
This is useful for audit logging — you can record what PII was found without modifying the original document.
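One caveat for audit logging: the log should not become a second copy of the PII. The sketch below summarizes a detection response into a record that keeps entity types, the lowest confidence, and keyed hashes of the values rather than the values themselves. It assumes the response shape used in this article (entities with entity_type, text, and confidence); the build_audit_record helper and its output fields are hypothetical.

```python
import hashlib
import hmac
import json
from collections import Counter


def build_audit_record(document_id: str, result: dict, secret: bytes) -> dict:
    """Summarize detected PII without copying raw values into the log."""
    entities = result.get("entities", [])
    return {
        "document_id": document_id,
        "entity_counts": dict(Counter(e["entity_type"] for e in entities)),
        "min_confidence": min((e["confidence"] for e in entities), default=None),
        # A keyed hash lets you match repeated values without storing them
        "value_digests": sorted(
            hmac.new(secret, e["text"].encode("utf-8"), hashlib.sha256).hexdigest()[:16]
            for e in entities
        ),
    }


# Illustrative response, shaped like the detection output shown above
sample_result = {
    "entities": [
        {"entity_type": "EMAIL", "text": "aisha@example.com", "confidence": 0.99},
        {"entity_type": "PERSON", "text": "Aisha Tan", "confidence": 0.85},
    ]
}
record = build_audit_record("doc-001", sample_result, secret=b"rotate-me")
print(json.dumps(record, indent=2))
```

Keep the secret out of the log store; without it, the digests cannot be checked against guessed values.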
PDF Redaction
GlobalShield handles multi-page PDFs natively:
import asyncio
import json
from typing import Optional

import httpx

# Reuses GLOBALSHIELD_URL and HEADERS from the basic redaction example.

async def redact_pdf(file_path: str, countries: Optional[str] = None) -> bytes:
    """Redact PII from all pages of a PDF."""
    # Optional form field: comma-separated country codes, e.g. "MY,SG"
    data = {"countries": countries} if countries else {}
    async with httpx.AsyncClient(timeout=120.0) as client:
        with open(file_path, "rb") as f:
            response = await client.post(
                f"{GLOBALSHIELD_URL}/v1/redact-pdf",
                headers=HEADERS,
                files={"file": (file_path, f, "application/pdf")},
                data=data,
            )
    response.raise_for_status()
    meta = json.loads(
        response.headers.get("X-GlobalShield-Metadata", "{}")
    )
    print(f"Processed {meta.get('total_pages', '?')} pages, "
          f"found {meta.get('total_entities', '?')} entities")
    return response.content


async def main() -> None:
    # Redact a multi-page contract
    redacted_pdf = await redact_pdf(
        "employment_contract.pdf",
        countries="MY,SG",  # Focus on Malaysia + Singapore ID formats
    )
    with open("contract_redacted.pdf", "wb") as f:
        f.write(redacted_pdf)


asyncio.run(main())
Batch Processing
For bulk document processing (e.g., migrating a document archive):
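import asyncio
import glob
from contextlib import ExitStack

import httpx

# Reuses GLOBALSHIELD_URL and HEADERS from the basic redaction example.

async def batch_redact(file_paths: list[str]) -> bytes:
    """Redact PII from multiple images. Returns ZIP archive."""
    # ExitStack closes every opened file, even if the request fails
    with ExitStack() as stack:
        files = [
            ("files", (path, stack.enter_context(open(path, "rb")), "image/png"))
            for path in file_paths
        ]
        async with httpx.AsyncClient(timeout=120.0) as client:
            response = await client.post(
                f"{GLOBALSHIELD_URL}/v1/batch-redact",
                headers=HEADERS,
                files=files,
            )
    response.raise_for_status()
    return response.content  # ZIP archive


async def main() -> None:
    # Process 10 documents at once
    docs = glob.glob("uploads/*.png")[:10]
    zip_data = await batch_redact(docs)
    with open("redacted_batch.zip", "wb") as f:
        f.write(zip_data)


asyncio.run(main())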
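For an archive larger than one request can reasonably carry, you'll want to split the file list into fixed-size groups and send one batch per group. The helper below is a small sketch of that; the default size of 10 simply mirrors the example above and is an assumption, not a documented GlobalShield limit.

```python
from collections.abc import Iterator


def chunked(paths: list[str], size: int = 10) -> Iterator[list[str]]:
    """Yield consecutive slices of `paths`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


# Illustrative file names
docs = [f"uploads/scan_{i:03d}.png" for i in range(25)]
for batch_number, batch in enumerate(chunked(docs), start=1):
    print(f"batch {batch_number}: {len(batch)} files")
```

Inside an async function, each group can then be passed to batch_redact in turn, writing one ZIP archive per batch.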
FastAPI Integration Example
Here's how to add automatic PII redaction to a FastAPI upload endpoint:
import json

import httpx
from fastapi import FastAPI, File, HTTPException, UploadFile

app = FastAPI()


@app.post("/upload-document")
async def upload_document(file: UploadFile = File(...)):
    """Upload a document; PII is automatically redacted before storage."""
    file_bytes = await file.read()
    # Send to GlobalShield for redaction
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://globalshield.p.rapidapi.com/v1/redact",
            headers={
                "X-RapidAPI-Key": "YOUR_KEY",
                "X-RapidAPI-Host": "globalshield.p.rapidapi.com",
            },
            files={"file": (file.filename, file_bytes, file.content_type)},
        )
    # Never fall through to storage if redaction failed
    if response.status_code != 200:
        raise HTTPException(status_code=502, detail="PII redaction failed")
    metadata = json.loads(
        response.headers.get("X-GlobalShield-Metadata", "{}")
    )
    # Store the REDACTED version, not the original
    redacted_bytes = response.content
    # save_to_storage(redacted_bytes, file.filename)
    return {
        "status": "uploaded",
        "filename": file.filename,
        "pii_detected": metadata.get("total_redacted", 0),
        "language": metadata.get("language_detected"),
        "entities": [
            {"type": e["entity_type"], "confidence": e["confidence"]}
            for e in metadata.get("entities", [])
        ],
    }
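The endpoint above sends every upload to /v1/redact and trusts the client's declared content type. Since PDFs have their own endpoint, it can help to sniff the first bytes of the file and route accordingly. The sketch below does that with well-known file signatures. The route_upload helper is hypothetical, and JPEG support is an assumption, since this article only demonstrates PNG and PDF.

```python
# File signatures (magic bytes) mapped to (content type, endpoint path)
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": ("image/png", "/v1/redact"),
    b"\xff\xd8\xff": ("image/jpeg", "/v1/redact"),  # assumed to be accepted
    b"%PDF-": ("application/pdf", "/v1/redact-pdf"),
}


def route_upload(file_bytes: bytes) -> tuple[str, str]:
    """Return (content_type, endpoint_path) based on the file's leading bytes."""
    for magic, target in SIGNATURES.items():
        if file_bytes.startswith(magic):
            return target
    raise ValueError("unsupported file type")


print(route_upload(b"%PDF-1.7\n..."))
print(route_upload(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
```

In the handler, a ValueError would map naturally to a 415 response before any credits are spent.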
Country-Specific Detection
GlobalShield's Layer 2 detection includes country-specific ID patterns:
| Country | ID Types Detected |
|---|---|
| Malaysia (MY) | MyKad (NRIC), passport |
| Singapore (SG) | NRIC/FIN, passport |
| United States (US) | SSN, driver's license, passport |
| United Kingdom (UK) | NI number, passport, NHS number |
| China (CN) | Resident ID (18-digit), passport |
| Japan (JP) | My Number, passport, residence card |
| India (IN) | Aadhaar, PAN, passport |
| Australia (AU) | TFN, Medicare, passport |
| Taiwan (TW) | National ID, ARC |
Pass the countries parameter to focus detection on specific regions:
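# Only detect Malaysian and Singaporean ID patterns
# (runs inside an async function; reuses GLOBALSHIELD_URL and HEADERS)
async with httpx.AsyncClient() as client:
    response = await client.post(
        f"{GLOBALSHIELD_URL}/v1/redact",
        headers=HEADERS,
        files={"file": ("scan.png", image_bytes, "image/png")},
        data={"countries": "MY,SG"},
    )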
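Because the countries value is a plain comma-separated string, a typo would be easy to miss. A small client-side check, sketched below, normalizes and validates codes against the table above before the request is sent. The countries_param helper is hypothetical, and how the API itself treats unknown codes is not covered here.

```python
from collections.abc import Iterable

# Country codes taken from the table in this section
SUPPORTED_COUNTRIES = {"MY", "SG", "US", "UK", "CN", "JP", "IN", "AU", "TW"}


def countries_param(codes: Iterable[str]) -> str:
    """Normalize, de-duplicate, and validate codes; return e.g. 'MY,SG'."""
    seen: list[str] = []
    for raw in codes:
        code = raw.strip().upper()
        if code not in SUPPORTED_COUNTRIES:
            raise ValueError(f"unsupported country code: {raw!r}")
        if code not in seen:
            seen.append(code)
    return ",".join(seen)


print(countries_param(["my", " SG", "MY"]))
```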
Pricing
| Endpoint | Credits per Call |
|---|---|
| /v1/redact (full pipeline) | 10 credits |
| /v1/detect (detection only) | 5 credits |
| /v1/ocr (text extraction) | 3 credits |
| /v1/redact-pdf | 10 credits/page |
| /v1/batch-redact | 8 credits/image |
| Plan | Credits/Month | Price |
|---|---|---|
| Basic | 500 | Free |
| Pro | 10,000 | $29.99/mo |
| Ultra | 100,000 | $99.99/mo |
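To see which plan a workload fits, multiply each call type by its credit cost and compare the total with the monthly quotas. The sketch below uses the figures from the tables in this section. The usage keys and helper names are made up for illustration, and it treats each quota as a hard ceiling because overage pricing is not described here.

```python
from typing import Optional

# Credits per unit, from the endpoint table
CREDITS = {
    "redact": 10,
    "detect": 5,
    "ocr": 3,
    "redact_pdf_page": 10,
    "batch_image": 8,
}

# (plan name, credits per month, USD per month), cheapest first
PLANS = [("Basic", 500, 0.0), ("Pro", 10_000, 29.99), ("Ultra", 100_000, 99.99)]


def monthly_credits(usage: dict[str, int]) -> int:
    """Total credits for a month of usage, keyed by the names in CREDITS."""
    return sum(CREDITS[kind] * count for kind, count in usage.items())


def cheapest_plan(credits: int) -> Optional[str]:
    """Smallest plan whose quota covers `credits`, or None if none does."""
    for name, quota, _price in PLANS:
        if credits <= quota:
            return name
    return None


usage = {"batch_image": 200, "redact_pdf_page": 300, "detect": 100}
total = monthly_credits(usage)  # 200*8 + 300*10 + 100*5
print(total, cheapest_plan(total))
```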
Get started: GlobalShield on RapidAPI
Why Not Build It Yourself?
You could assemble this pipeline from open-source components:
- Tesseract for OCR
- langdetect for language detection
- Google Translate API for translation
- Presidio for entity detection
- Pillow for image redaction
But you'd spend weeks handling edge cases: multi-language documents, country-specific ID formats, bounding box alignment between OCR and redaction, PDF page rendering, and batch processing. GlobalShield packages all of this into a single API call.
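Bounding box alignment is a good example of such an edge case. An entity detector works on text offsets, while redaction needs pixel rectangles, so each detected span has to be mapped back to the OCR words it overlaps. Here is a minimal sketch, assuming the detector ran on the OCR words joined by single spaces and that the entity sits on one line. The redaction_box helper and the sample boxes are illustrative.

```python
from typing import Optional

Box = tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def redaction_box(
    words: list[tuple[str, Box]], start: int, end: int, pad: int = 2
) -> Optional[Box]:
    """Union of the boxes of all OCR words overlapping chars [start, end).

    Offsets refer to the text produced by " ".join(word texts).
    """
    hits = []
    offset = 0
    for text, box in words:
        w_start, w_end = offset, offset + len(text)
        if w_start < end and start < w_end:  # character ranges overlap
            hits.append(box)
        offset = w_end + 1  # account for the joining space
    if not hits:
        return None
    return (
        max(0, min(b[0] for b in hits) - pad),
        max(0, min(b[1] for b in hits) - pad),
        max(b[2] for b in hits) + pad,
        max(b[3] for b in hits) + pad,
    )


words = [
    ("Name:", (36, 92, 96, 116)),
    ("Aisha", (104, 92, 180, 116)),
    ("Tan", (188, 90, 230, 118)),
]
text = " ".join(w for w, _ in words)
start = text.index("Aisha")
print(redaction_box(words, start, start + len("Aisha Tan")))
```

An entity that wraps across lines would need one rectangle per line, otherwise the union box would black out everything in between.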
Processing documents with PII? Try GlobalShield's free tier and let me know how it works for your use case.