This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.
AI Data Privacy: PII Detection, Data Anonymization, Local Processing
Introduction
AI applications process vast amounts of data, much of it containing personally identifiable information (PII). Sending raw PII to LLM APIs creates compliance risks under GDPR, CCPA, and other regulations. This article covers practical techniques for detecting and redacting PII, anonymizing training data, and processing sensitive information locally.
PII Detection
Automated detection identifies sensitive data before it reaches an LLM API:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Initialize the Presidio engines once and reuse them. With its default
# configuration, AnalyzerEngine loads the spaCy model (en_core_web_lg) itself,
# so the model must be installed but does not need to be loaded separately.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def detect_pii(text: str) -> list[dict]:
    """Return each detected PII span with its type, offsets, and confidence."""
    results = analyzer.analyze(
        text=text,
        entities=[
            "PHONE_NUMBER", "EMAIL_ADDRESS",
            "CREDIT_CARD", "US_SSN", "PERSON",
            "LOCATION", "DATE_TIME", "NRP",
            "US_BANK_NUMBER", "IP_ADDRESS",
        ],
        language="en",
    )
    return [
        {"entity": r.entity_type, "start": r.start, "end": r.end,
         "score": r.score, "text": text[r.start:r.end]}
        for r in results
    ]

def redact_pii(text: str) -> str:
    """Replace every detected span using Presidio's default operator,
    which substitutes a placeholder such as <PERSON>."""
    analyzer_results = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=analyzer_results).text
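Presidio combines pattern-based recognizers (regular expressions, checksums, and context words for credit cards, SSNs, and phone numbers) with NLP-based named-entity recognition (spaCy, for person names and locations). The pattern recognizers cover structured identifiers, and the NER model covers free-text PII that has no fixed format. Neither is exhaustive, so treat each result's confidence score as a threshold to tune rather than a guarantee.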
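To make the pattern-based half concrete, here is a minimal standard-library sketch of regex detection with typed placeholders. It is not how Presidio is implemented: the `PATTERNS` table and the function names are hypothetical, and the regular expressions are deliberately simplified and assume US-style formats.

```python
import re

# Illustrative patterns only; real-world formats vary far more than this.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
}


def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to discard digit runs that are not card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_structured_pii(text: str) -> list[dict]:
    """Return matches sorted by start offset."""
    findings = []
    for entity, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            if entity == "CREDIT_CARD" and not luhn_valid(match.group()):
                continue
            findings.append({
                "entity": entity,
                "start": match.start(),
                "end": match.end(),
                "text": match.group(),
            })
    return sorted(findings, key=lambda f: f["start"])


def redact_structured_pii(text: str) -> str:
    """Replace each match with <ENTITY>. Assumes matches do not overlap."""
    redacted = text
    # Work from the end so earlier offsets stay valid after each replacement.
    for f in reversed(find_structured_pii(text)):
        redacted = redacted[:f["start"]] + f"<{f['entity']}>" + redacted[f["end"]:]
    return redacted


sample = "Contact jane.doe@example.com, SSN 123-45-6789, card 4111 1111 1111 1111."
print(redact_structured_pii(sample))
```

The checksum step shows why a pattern recognizer is usually more than a bare regex: a 16-digit order number that fails the Luhn check is left alone instead of being flagged as a card.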
Data Anonymization
For training data or analytics, full removal may be too destructive. Anonymization preserves utility while protecting privacy:
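import hashlib

import numpy as np
from faker import Faker

fake = Faker()

class DataAnonymizer:
    def __init__(self):
        # Maps (field, original value) -> replacement, so the same input always
        # gets the same pseudonym within this instance. The cache holds the
        # original values, so it must be protected like the raw data.
        self.mapping_cache = {}

    def anonymize_record(self, record: dict, pii_fields: list[str]) -> dict:
        anonymized = record.copy()
        for field in pii_fields:
            if field in anonymized and anonymized[field]:
                anonymized[field] = self._replace_value(field, anonymized[field])
        return anonymized

    def _replace_value(self, field: str, value: str) -> str:
        # Reuse an earlier replacement so records stay joinable after anonymization
        key = (field, value)
        if key not in self.mapping_cache:
            self.mapping_cache[key] = self._generate_replacement(field, value)
        return self.mapping_cache[key]

    def _generate_replacement(self, field: str, value: str) -> str:
        if field == "email":
            return fake.email()
        elif field == "phone":
            return fake.phone_number()
        elif field == "name":
            return fake.name()
        elif field == "address":
            return fake.address()
        elif field == "ssn":
            return fake.ssn()
        else:
            # Tokenization: stable pseudonym via hashing. An unkeyed hash of a
            # low-entropy value can be reversed by brute force, so prefer a
            # keyed hash (HMAC) with a secret key for real data.
            hashed = hashlib.sha256(str(value).encode()).hexdigest()[:16]
            return f"USER_{hashed}"

# Differential privacy: add calibrated noise
def add_laplace_noise(true_value: float, epsilon: float = 1.0,
                      sensitivity: float = 1.0) -> float:
    """Add Laplace noise for differential privacy.
    Lower epsilon = more privacy, less accuracy.
    sensitivity is the most one individual can change the true value."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    noise = np.random.laplace(0, scale)
    return true_value + noise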
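The Laplace mechanism calibrates its noise scale to sensitivity divided by epsilon, where sensitivity is the largest change one individual can cause in the query result. For a counting query that change is at most 1, so the scale is 1/epsilon. The sketch below applies this to a count using only the standard library; `laplace_sample` and `private_count` are hypothetical helper names, not part of any library.

```python
import random


def laplace_sample(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random = None) -> float:
    """Return a noisy count of the records that satisfy `predicate`."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0  # one person changes a count by at most 1
    return true_count + laplace_sample(sensitivity / epsilon, rng)


ages = [23, 35, 41, 52, 67]
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5)
print(f"Noisy count of people aged 40+: {noisy:.2f}")
```

Each released answer consumes privacy budget: repeating the same query and averaging the results cancels the noise, so a real deployment also has to track the total epsilon spent.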
Anonymization Strategies
| Technique | Privacy Level | Utility | Description / Example |
|-----------|--------------|---------|----------|
| Removal | High | Low | Irreversible redaction |
| Masking | Medium | Medium | Partial visibility (e.g. "****-1234") |
| Pseudonymization | Medium | High | Replace with fake equivalent |
| Generalization | Medium | Medium | ZIP 94301 -> 9430x |
| Differential Privacy | High | Medium | Statistical queries |
| Tokenization | High | High | Deterministic replacement |
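Three of the table's rows (masking, generalization, and tokenization) can be sketched in a few lines of standard-library Python. The function names and the `TOK_` prefix are assumptions made for illustration. The tokenizer uses a keyed hash (HMAC) because an unkeyed hash of a low-entropy value such as a phone number can be reversed by hashing every candidate.

```python
import hashlib
import hmac


def mask_account(value: str, visible: int = 4) -> str:
    """Masking: keep only the last `visible` digits, e.g. '****-1234'."""
    digits = "".join(c for c in value if c.isdigit())
    if len(digits) <= visible:
        return "****"  # too short to reveal anything safely
    return "****-" + digits[-visible:]


def generalize_zip(zip_code: str, keep: int = 4) -> str:
    """Generalization: coarsen a ZIP code, e.g. '94301' -> '9430x'."""
    return zip_code[:keep] + "x" * max(len(zip_code) - keep, 0)


def tokenize(value: str, secret_key: bytes) -> str:
    """Tokenization: deterministic pseudonym derived from a keyed hash.

    The same value and key always give the same token, so joins still work.
    Without the key, the token cannot be recomputed from a guessed value.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"TOK_{digest[:16]}"


key = b"replace-with-a-secret-from-your-key-store"  # placeholder key
print(mask_account("4111 1111 1111 1234"))
print(generalize_zip("94301"))
print(tokenize("jane.doe@example.com", key))
```

Whether these techniques land at the privacy levels shown in the table depends on context: a generalized ZIP code combined with a birth date and gender may still single out one person.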
Local Processing
For maximum privacy, process sensitive data locally without sending it to external APIs:
from transformers import pipeline

class LocalTextProcessor:
    def __init__(self):
        # Load small models for local inference
        self.classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=-1,  # CPU
        )
        self.ner = pipeline(
            "ner",
            model="dslim/bert-base-NER",
            device=-1,
        )
        self.summarizer = pipeline(
            "summarization",
            model="facebook/bart-large-cnn",
            device=-1,
        )

    def process_sensitive_data(self, text: str, task: str) -> dict:
        # All inference happens locally; the input text never leaves this
        # machine (model weights are downloaded once, on first use)
        if task == "classify":
            return {"label": s
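One way to combine detection with local processing is a small router that keeps any text containing PII on the local path and sends only clean text to an external API. The sketch below assumes two caller-supplied functions, `local_handler` and `remote_handler`, which are hypothetical stand-ins for a local model and an API client. Its two-pattern detector is deliberately minimal; a production router would use a full detector such as Presidio.

```python
import re
from typing import Callable

# Minimal, illustrative detectors; a real router needs far broader coverage.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def contains_pii(text: str) -> bool:
    """Return True if any of the known patterns appears in the text."""
    return any(pattern.search(text) for pattern in (EMAIL_RE, SSN_RE))


def route_request(
    text: str,
    local_handler: Callable[[str], str],
    remote_handler: Callable[[str], str],
) -> dict:
    """Send text with detected PII to the local handler, everything else remote."""
    if contains_pii(text):
        return {"route": "local", "result": local_handler(text)}
    return {"route": "remote", "result": remote_handler(text)}


# Stand-in handlers for demonstration
result = route_request(
    "My SSN is 123-45-6789",
    local_handler=lambda t: "processed locally",
    remote_handler=lambda t: "processed by external API",
)
print(result)
```

Because a missed detection sends raw PII off the machine, a stricter variant would treat the remote path as opt-in, defaulting to local processing unless the text is positively known to be safe.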