AI Data Privacy: PII Detection, Data Anonymization, Local Processing

This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.

Introduction

AI applications process vast amounts of data, much of it containing personally identifiable information (PII). Sending raw PII to LLM APIs creates compliance risks under GDPR, CCPA, and other regulations. This article covers practical techniques for detecting and redacting PII, anonymizing training data, and processing sensitive information locally.

PII Detection

Automated detection identifies sensitive data before it reaches an LLM API:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Initialize the Presidio engines once. AnalyzerEngine loads its own spaCy
# model (en_core_web_lg by default), so no separate spacy.load() call is needed.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def detect_pii(text: str) -> list[dict]:
    """Return each detected PII span with its type, offsets, and confidence score."""
    results = analyzer.analyze(
        text=text,
        entities=[
            "PHONE_NUMBER", "EMAIL_ADDRESS",
            "CREDIT_CARD", "US_SSN", "PERSON",  # Presidio names the SSN entity US_SSN
            "LOCATION", "DATE_TIME", "NRP",
            "US_BANK_NUMBER", "IP_ADDRESS",
        ],
        language="en",
    )
    return [
        {"entity": r.entity_type, "start": r.start, "end": r.end,
         "score": r.score, "text": text[r.start:r.end]}
        for r in results
    ]

def redact_pii(text: str) -> str:
    """Replace every detected entity with a placeholder such as <PERSON>."""
    analyzer_results = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=analyzer_results).text

Presidio combines pattern-based detection (regex for credit cards, SSNs, phone numbers) with NLP-based detection (spaCy for person names, locations, organizations). This dual approach catches both structured and unstructured PII.
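To make the pattern-based half concrete, here is a minimal standard-library sketch of regex detection, with a Luhn checksum to discard digit runs that are not plausible card numbers. It is not Presidio's implementation: the patterns are deliberately simplified, and the names PATTERNS, luhn_valid, and find_structured_pii are assumptions made for illustration.

```python
import re

# Simplified, illustrative patterns; production recognizers are far stricter.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){12,18}\d\b"),
}


def luhn_valid(number: str) -> bool:
    """Check a digit string against the Luhn checksum used by payment cards."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) >= 13 and total % 10 == 0


def find_structured_pii(text: str) -> list[dict]:
    """Return regex matches, keeping card candidates only if they pass Luhn."""
    findings = []
    for entity, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            if entity == "CREDIT_CARD" and not luhn_valid(match.group()):
                continue  # digit run with an invalid checksum, e.g. an order number
            findings.append({"entity": entity, "start": match.start(),
                             "end": match.end(), "text": match.group()})
    return sorted(findings, key=lambda f: f["start"])


sample = "Mail jane@example.com, card 4111 1111 1111 1111, ssn 123-45-6789"
for finding in find_structured_pii(sample):
    print(finding["entity"], finding["text"])
```

The checksum step shows why pattern matching alone tends to over-report: validation logic layered on top of the regex is what keeps false positives manageable.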

Data Anonymization

For training data or analytics, full removal may be too destructive. Anonymization preserves utility while protecting privacy:

import hashlib

import numpy as np
from faker import Faker

fake = Faker()

class DataAnonymizer:
    def __init__(self):
        # Cache (field, original value) -> replacement so the same input always
        # maps to the same synthetic value within this instance.
        self.mapping_cache = {}

    def anonymize_record(self, record: dict, pii_fields: list[str]) -> dict:
        anonymized = record.copy()
        for field in pii_fields:
            if field in anonymized and anonymized[field]:
                anonymized[field] = self._replace_value(field, anonymized[field])
        return anonymized

    def _replace_value(self, field: str, value: str) -> str:
        key = (field, value)
        if key not in self.mapping_cache:
            self.mapping_cache[key] = self._generate(field, value)
        return self.mapping_cache[key]

    def _generate(self, field: str, value: str) -> str:
        if field == "email":
            return fake.email()
        elif field == "phone":
            return fake.phone_number()
        elif field == "name":
            return fake.name()
        elif field == "address":
            return fake.address()
        elif field == "ssn":
            return fake.ssn()
        else:
            # Tokenization: stable pseudonym via hashing
            hashed = hashlib.sha256(value.encode()).hexdigest()[:16]
            return f"USER_{hashed}"

# Differential privacy: add calibrated noise
def add_laplace_noise(true_value: float, epsilon: float = 1.0,
                      sensitivity: float = 1.0) -> float:
    """Add Laplace noise for differential privacy.

    Lower epsilon = more privacy, less accuracy. The noise scale is
    sensitivity / epsilon; the default sensitivity of 1.0 suits counting queries."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    noise = np.random.laplace(0, scale)
    return true_value + noise
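As a worked illustration of how the noise is calibrated, the following standard-library sketch releases a differentially private count. Adding or removing one record changes a count by at most 1, so the sensitivity is 1 and the Laplace scale is 1 / epsilon. The names laplace_sample and private_count are hypothetical, and the sampler (the difference of two exponential draws) is a simple textbook construction rather than a hardened production mechanism.

```python
import random


def laplace_sample(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample.

    The difference of two independent Exp(1) variables follows Laplace(0, 1).
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(records: list[dict], predicate, epsilon: float,
                  rng: random.Random) -> float:
    """Release a noisy count of the records matching predicate."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    # Sensitivity of a count is 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_sample(1.0 / epsilon, rng)


rng = random.Random(42)  # seeded only to make the demonstration repeatable
records = [{"age": age} for age in (25, 31, 47, 52, 68)]
for epsilon in (0.1, 1.0, 10.0):
    noisy = private_count(records, lambda r: r["age"] >= 40, epsilon, rng)
    print(f"epsilon={epsilon}: noisy count = {noisy:.2f} (true count is 3)")
```

Smaller epsilon values spread the released counts further from the true value, which is the privacy and accuracy trade-off the docstring above describes.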

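Anonymization Strategies

The code above covers three strategies: substitution (replacing a value with a realistic synthetic one from Faker), tokenization (replacing it with a stable hash-derived pseudonym), and noise addition (perturbing numeric values with Laplace noise for differential privacy).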
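One caveat on the hashing fallback in DataAnonymizer: an unkeyed SHA-256 of a low-entropy value such as a phone number can be reversed by hashing every candidate value. A common hardening is a keyed hash. The sketch below uses HMAC-SHA256 from the standard library; the function name pseudonymize and the example key are assumptions for illustration, and in practice the key would be loaded from a secret store rather than written in code.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes, prefix: str = "USER_") -> str:
    """Map a value to a stable token that cannot be recomputed without the key."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}{digest[:16]}"


# Hypothetical key for demonstration only.
key = b"demo-key-load-from-a-secret-store"
print(pseudonymize("555-0100", key))
print(pseudonymize("555-0100", key))  # same input and key give the same token
```

Because the token is deterministic for a given key, joins across tables still work, while rotating or destroying the key breaks the link back to the original values.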

Local Processing

For maximum privacy, process sensitive data locally without sending it to external APIs:

from transformers import pipeline

class LocalTextProcessor:
    def __init__(self):
        # Load small models for local inference
        self.classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=-1,  # CPU
        )
        self.ner = pipeline(
            "ner",
            model="dslim/bert-base-NER",
            device=-1,
        )
        self.summarizer = pipeline(
            "summarization",
            model="facebook/bart-large-cnn",
            device=-1,
        )

    def process_sensitive_data(self, text: str, task: str) -> dict:
        # All processing happens locally; nothing leaves this machine
        if task == "classify":
            return {"label": s
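The task dispatch in process_sensitive_data can also be organized as a lookup table instead of an if/elif chain. The sketch below is one possible shape, not the original author's code: LocalDispatcher, Handler, and the stub handlers are hypothetical names, and the stubs stand in for the pipelines so the example runs without downloading any model.

```python
from typing import Callable

# Each handler takes the input text and returns a plain dict result.
Handler = Callable[[str], dict]


class LocalDispatcher:
    """Route a task name to a handler that runs on the local machine."""

    def __init__(self, handlers: dict[str, Handler]):
        self.handlers = dict(handlers)

    def process(self, text: str, task: str) -> dict:
        try:
            handler = self.handlers[task]
        except KeyError:
            raise ValueError(f"unsupported task: {task!r}") from None
        return handler(text)


# Stand-in handlers; real ones would wrap local model pipelines.
def stub_classify(text: str) -> dict:
    label = "NEGATIVE" if "bad" in text.lower() else "POSITIVE"
    return {"label": label}


def stub_summarize(text: str) -> dict:
    first_sentence = text.split(".")[0].strip()
    return {"summary": first_sentence + "."}


dispatcher = LocalDispatcher({"classify": stub_classify,
                              "summarize": stub_summarize})
print(dispatcher.process("The service was bad.", "classify"))
print(dispatcher.process("First point. Second point.", "summarize"))
```

With real pipelines, a handler would typically unwrap the first element of the pipeline's output list, since a text-classification pipeline returns a list of dicts with label and score keys. Rejecting unknown task names explicitly avoids silently returning nothing for a typo.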
