Alain Airom (Ayrom)

Posted on May 22

Securing Your GenAI Pipeline: Automated PII Obfuscation with Docling & Microsoft Presidio

#opensource #microsoft #docling #bob

Architecting an Open-Source Document Anonymizer with Docling, Presidio, and Bob

Introduction: PII (Personally Identifiable Information)

As organizations increasingly rely on GenAI, Large Language Models (LLMs), and cloud-driven analytics, data has become both an enterprise’s greatest asset and its biggest liability. Every day, millions of documents — ranging from PDFs and slide decks to emails and spreadsheets — flow through corporate pipelines. Hidden within this unstructured data is a wealth of Personally Identifiable Information (PII): names, phone numbers, credit card details, financial statements, and geographic locations.

Leaving this data exposed is no longer just a technical oversight; it is a massive compliance and security risk. However, protecting this information isn’t a one-size-fits-all task. Depending on the objective, different workflows demand entirely different approaches to handling sensitive data.

Feeding GenAI and RAG Pipelines (Context Preservation)

When preparing documents for Retrieval-Augmented Generation (RAG) or fine-tuning LLMs, the goal is to protect privacy without destroying the utility of the text. If you completely delete a name or a location, the model loses the semantic context required to understand relationships within the document.

The Need: Anonymization or Pseudonymization (Masking/Replacing).
The Strategy: Replacing a specific name like “John Doe” with a generic token like [PERSON_1] or [EMPLOYEE] allows the LLM to maintain grammatical and logical coherence while keeping the actual individual completely anonymous.
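To make the numbered-token idea concrete, here is a minimal sketch in plain Python. It assumes a detector has already produced character spans; the `pseudonymize` helper and the `Span` tuple are hypothetical names used only for illustration and are not part of Presidio or Docling.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type), as a detector might report


def pseudonymize(text: str, spans: List[Span]) -> Tuple[str, Dict[str, str]]:
    """Replace each detected span with a numbered token such as [PERSON_1]."""
    mapping: Dict[str, str] = {}   # surface text -> token
    counters: Dict[str, int] = {}  # entity type -> last number handed out

    # Assign tokens in reading order so the numbering follows the document.
    for start, end, etype in sorted(spans):
        surface = text[start:end]
        if surface not in mapping:
            counters[etype] = counters.get(etype, 0) + 1
            mapping[surface] = f"[{etype}_{counters[etype]}]"

    # Substitute from the end of the string so earlier offsets stay valid.
    out = text
    for start, end, _ in sorted(spans, reverse=True):
        out = out[:start] + mapping[text[start:end]] + out[end:]
    return out, mapping


text = "John Doe met Jane Roe. Later, John Doe called her."
spans = [(0, 8, "PERSON"), (13, 21, "PERSON"), (30, 38, "PERSON")]
masked, mapping = pseudonymize(text, spans)
print(masked)  # [PERSON_1] met [PERSON_2]. Later, [PERSON_1] called her.
```

Because the same surface string always maps to the same token, the model can still tell that the person who met Jane is the one who called her later.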

Analytical and Database Processing (Data Linking)

Data engineering teams often need to analyze user behavior, processing trends, or system logs across multiple datasets without knowing exactly who the users are.

The Need: Consistency and Reversibility (Hashing/Encryption).
The Strategy: Using cryptographic Hashing (like SHA-256) ensures that the same PII always results in the same unique string. This allows analysts to join tables and track trends over time without exposing raw data. Alternatively, reversible encryption is used when downstream automated systems must temporarily hide PII but authorized users eventually need to decrypt and view the original values.
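As a rough illustration of the data-linking idea, the sketch below joins two toy datasets on a digest instead of the raw email address. It uses only the standard library; the helper names, the sample records, and the `SECRET` value are assumptions made for the example. The keyed (HMAC) variant is shown because plain hashes of low-entropy values such as phone numbers can be guessed by brute force.

```python
import hashlib
import hmac


def hash_pii(value: str) -> str:
    """Plain SHA-256: deterministic, so equal inputs give equal digests."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


def keyed_hash_pii(value: str, secret: bytes) -> str:
    """HMAC-SHA-256: still deterministic, but reproducing it requires the secret."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()


SECRET = b"demo-secret-not-for-production"  # hypothetical key, for illustration only

orders = [
    {"email": "alice@example.com", "total": 40},
    {"email": "bob@example.com", "total": 15},
]
tickets = [{"email": "Alice@Example.com", "issue": "refund"}]

# Replace the raw email with a keyed digest in both datasets.
orders_safe = [{"user": keyed_hash_pii(o["email"], SECRET), "total": o["total"]} for o in orders]
tickets_safe = [{"user": keyed_hash_pii(t["email"], SECRET), "issue": t["issue"]} for t in tickets]

# Join on the digest without ever looking at the address itself.
totals = {o["user"]: o["total"] for o in orders_safe}
joined = [(t["issue"], totals[t["user"]]) for t in tickets_safe if t["user"] in totals]
print(joined)  # [('refund', 40)]
```

Normalizing case and whitespace before hashing is a design choice of this sketch: without it, "Alice@Example.com" and "alice@example.com" would produce different digests and the join would silently miss the match.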

Regulatory Compliance and Public Release (Absolute Security)

When sharing documents with third-party vendors, publishing legal discovery files, or complying with strict data privacy mandates like GDPR, CCPA, or HIPAA, there is zero margin for error.

The Need: Irreversible Elimination (Redaction).
The Strategy: Redaction completely strips the text or burns a black bar over the sensitive areas. Once redacted, the original data is destroyed and cannot be recovered by any mathematical or algorithmic means, ensuring total compliance.

The Modern Challenge: Parsing Meets Detection

Implementing these varied obfuscation strategies requires a two-step dance: first, you must accurately extract text and layout structure from complex, messy enterprise documents (such as multi-column PDFs or scanned charts). Second, you must accurately detect and transform the PII without breaking the document’s formatting.

Balancing data utility with absolute privacy is the foundational challenge of modern data engineering — and it is exactly why pairing robust document parsing with intelligent PII detection has become an industry necessity.

Introducing the Microsoft Presidio Open-Source Project

The open-source landscape offers powerful, specialized tools to address these diverse data privacy needs. Foremost among them is Microsoft Presidio, a production-ready, open-source library designed to democratize data protection by providing fast, customizable, and scalable PII detection and anonymization.

Rather than locking developers into a specific vendor or a rigid, one-size-fits-all algorithm, Presidio acts as a highly modular pipeline framework. It allows organizations to automate the discovery and obfuscation of sensitive entities across text, images, and unstructured data streams.

Core Architecture: How Presidio Works

Presidio splits the privacy challenge into two distinct, highly configurable stages: Detection and Anonymization.

The Analyzer (PII Detection)

The Presidio Analyzer is the brain of the operation. It is an orchestrator that utilizes an array of diverse detection mechanisms to identify sensitive data with high accuracy. Instead of relying on a single method, it combines:

Pre-defined and Custom Recognizers: Out-of-the-box detectors for common entities such as credit card numbers, IBANs, US Social Security Numbers (SSNs), IP addresses, and email addresses.
Regex and Validation Checkers: Fast, rule-based matching combined with checksum validation algorithms (such as the Luhn algorithm for credit cards) to drastically reduce false positives.
spaCy / Hugging Face NLP Models: Leveraging advanced Named Entity Recognition (NER) to understand the semantic context of a sentence, allowing it to accurately differentiate between a common noun and a person’s name or a specific location.

The Analyzer outputs a structured list of findings, detailing exactly what type of PII was found, where it is located (character indices), and a confidence score for the detection.
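The sketch below shows, in plain Python, the general pattern behind the rule-based detectors: a regular expression proposes candidates and a Luhn checksum discards the ones that cannot be real card numbers. It is not Presidio's implementation; the `Finding` class, the pattern, and the fixed score of 1.0 are assumptions chosen to mirror the kind of output described above (entity type, character indices, confidence).

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    entity_type: str
    start: int
    end: int
    score: float


# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


def find_cards(text: str) -> List[Finding]:
    """Regex proposes candidates; the checksum filters out false positives."""
    return [
        Finding("CREDIT_CARD", m.start(), m.end(), 1.0)
        for m in CARD_PATTERN.finditer(text)
        if luhn_valid(m.group())
    ]


text = "Card 4111 1111 1111 1111 on file, ref 1234 5678 9012 3456."
findings = find_cards(text)
for f in findings:
    print(f.entity_type, f.start, f.end, text[f.start:f.end])
```

Both 16-digit strings match the pattern, but only the first passes the checksum, so the reference number is not reported as a card.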

The Anonymizer (PII Obfuscation)

Once the Analyzer maps out the sensitive entities, the Presidio Anonymizer takes over to apply the transformation rules. This engine is designed to handle the exact real-world tasks required by modern developers:

Redact: Deletes the text completely.
Replace: Swaps the sensitive data with a generic placeholder or entity token (e.g., changing “Alice” to <PERSON>).
Mask: Obfuscates a portion of the string (e.g., turning a credit card number into ************1234).
Hash: Computes a cryptographic hash (like SHA-256) of the PII, ensuring data consistency for analytical tracking without exposing the underlying identity.
Custom Functions: Allows developers to write tailored cryptographic or obfuscation logic directly into the pipeline.
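To show what these operators do to a string, here is a small standard-library sketch that applies one operator to a list of detected spans. The `OPERATORS` table and `apply_operator` function are hypothetical stand-ins written for this example, not Presidio's API; the real engine is configured through `OperatorConfig`, as shown later in this article.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)

# Each operator receives the original value and its entity type.
OPERATORS: Dict[str, Callable[[str, str], str]] = {
    "redact": lambda value, etype: "",
    "replace": lambda value, etype: f"<{etype}>",
    # Keep the last four characters visible and star out the rest.
    "mask": lambda value, etype: "*" * max(len(value) - 4, 0) + value[-4:],
    "hash": lambda value, etype: hashlib.sha256(value.encode("utf-8")).hexdigest(),
}


def apply_operator(text: str, spans: List[Span], operator: str) -> str:
    """Rewrite every span in `text` with the chosen operator."""
    transform = OPERATORS[operator]
    out = text
    # Work backwards so rewriting one span does not shift the offsets of the others.
    for start, end, etype in sorted(spans, reverse=True):
        out = out[:start] + transform(text[start:end], etype) + out[end:]
    return out


text = "Alice paid with 4111111111111111."
spans = [(0, 5, "PERSON"), (16, 32, "CREDIT_CARD")]

print(apply_operator(text, spans, "replace"))   # <PERSON> paid with <CREDIT_CARD>.
print(apply_operator(text, spans[1:], "mask"))  # Alice paid with ************1111.
print(apply_operator(text, spans, "redact"))    #  paid with .
```

Processing spans from right to left is the detail worth noting: replacements usually change the string length, so a left-to-right pass would invalidate every later index.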

Why Combine with Docling?

While Microsoft Presidio is exceptionally skilled at analyzing and transforming raw text, it cannot natively look inside a complex PDF, a multi-tab Excel spreadsheet, or a formatted PowerPoint deck and understand its layout. This is where Docling becomes an indispensable partner. As one of the most advanced document processing tools available today, Docling excels at tearing down the barriers of messy enterprise formats and converting them into clean, structured representations like Markdown or JSON. By accurately preserving document hierarchy, multi-column reading orders, and intricate table structures, Docling ensures that text is extracted exactly as it was meant to be read. Passing this high-fidelity, contextual text to Presidio ensures that PII detection algorithms don’t miss sensitive information split across broken layout lines, allowing developers to safely obfuscate documents while maintaining the structural integrity required for downstream GenAI and RAG pipelines.
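A toy example helps show why reading order matters for detection. The two-column page below is invented for illustration, and the regular expression is a deliberately simple stand-in for a real recognizer; the point is only that the same detector succeeds or fails depending on how the text was linearized.

```python
import re
from typing import List, Tuple

# Each tuple is one visual line of a two-column page: (left column, right column).
page_lines: List[Tuple[str, str]] = [
    ("Patient: Maria", "2024-05-22"),
    ("Gonzalez, born 1980", "120.00 USD"),
]

# Naive extraction reads straight across the page, interleaving the columns.
row_order = " ".join(f"{left} {right}" for left, right in page_lines)

# Layout-aware extraction finishes the left column before starting the right one.
column_order = " ".join(
    [left for left, _ in page_lines] + [right for _, right in page_lines]
)

# Toy detector: a capitalized first and last name following "Patient:".
NAME_AFTER_LABEL = re.compile(r"Patient: ([A-Z][a-z]+ [A-Z][a-z]+)")

print(row_order)     # Patient: Maria 2024-05-22 Gonzalez, born 1980 120.00 USD
print(column_order)  # Patient: Maria Gonzalez, born 1980 2024-05-22 120.00 USD
print(NAME_AFTER_LABEL.search(row_order))                # None
print(NAME_AFTER_LABEL.search(column_order).group(1))    # Maria Gonzalez
```

In the interleaved version the surname is separated from the first name by content from the other column, so the full name is never seen as one entity.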

Docling: Do we really need to introduce Docling? 😉

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

Docling provides many samples for business use cases such as PII obfuscation. The sample code below is the one I gave to Bob as a starting point for building the new application with Microsoft Presidio:

import argparse
import logging
import os
import re
from pathlib import Path
from typing import Dict, List, Tuple

from docling_core.types.doc import ImageRefMode, TableItem, TextItem
from tabulate import tabulate

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)

IMAGE_RESOLUTION_SCALE = 2.0
HF_MODEL = "dslim/bert-base-NER"  # Swap with another HF NER/PII model if desired, eg https://huggingface.co/urchade/gliner_multi_pii-v1 looks very promising too!
GLINER_MODEL = "urchade/gliner_multi_pii-v1"


def _build_simple_ner_pipeline():
    """Create a Hugging Face token-classification pipeline for NER.

    Returns a callable like: ner(text) -> List[dict]
    """
    try:
        from transformers import (
            AutoModelForTokenClassification,
            AutoTokenizer,
            pipeline,
        )
    except Exception:
        _log.error("Transformers not installed. Please run: pip install transformers")
        raise

    tokenizer = AutoTokenizer.from_pretrained(HF_MODEL)
    model = AutoModelForTokenClassification.from_pretrained(HF_MODEL)
    ner = pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",  # groups subwords into complete entities
        # Note: modern Transformers returns `start`/`end` when possible with aggregation
    )
    return ner


class SimplePiiObfuscator:
    """Tracks PII strings and replaces them with stable IDs per entity type."""

    def __init__(self, ner_callable):
        self.ner = ner_callable
        self.entity_map: Dict[str, str] = {}
        self.counters: Dict[str, int] = {
            "person": 0,
            "org": 0,
            "location": 0,
            "misc": 0,
        }
        # Map model labels to our coarse types
        self.label_map = {
            "PER": "person",
            "PERSON": "person",
            "ORG": "org",
            "ORGANIZATION": "org",
            "LOC": "location",
            "LOCATION": "location",
            "GPE": "location",
            # Fallbacks
            "MISC": "misc",
            "O": "misc",
        }
        # Only obfuscate these by default. Adjust as needed.
        self.allowed_types = {"person", "org", "location"}

    def _next_id(self, typ: str) -> str:
        self.counters[typ] += 1
        return f"{typ}-{self.counters[typ]}"

    def _normalize(self, s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()

    def _extract_entities(self, text: str) -> List[Tuple[str, str]]:
        """Run NER and return a list of (surface_text, type) to obfuscate."""
        if not text:
            return []
        results = self.ner(text)
        # Collect normalized items with optional span info
        items = []
        for r in results:
            raw_label = r.get("entity_group") or r.get("entity") or "MISC"
            label = self.label_map.get(raw_label, "misc")
            if label not in self.allowed_types:
                continue
            start = r.get("start")
            end = r.get("end")
            word = self._normalize(r.get("word") or r.get("text") or "")
            items.append({"label": label, "start": start, "end": end, "word": word})

        found: List[Tuple[str, str]] = []
        # If the pipeline provides character spans, merge consecutive/overlapping
        # entities of the same type into a single span, then take the substring
        # from the original text. This handles cases like subword tokenization
        # where multiple adjacent pieces belong to the same named entity.
        have_spans = any(i["start"] is not None and i["end"] is not None for i in items)
        if have_spans:
            spans = [
                i for i in items if i["start"] is not None and i["end"] is not None
            ]
            # Ensure processing order by start (then end)
            spans.sort(key=lambda x: (x["start"], x["end"]))

            merged = []
            for s in spans:
                if not merged:
                    merged.append(dict(s))
                    continue
                last = merged[-1]
                if s["label"] == last["label"] and s["start"] <= last["end"]:
                    # Merge identical, overlapping, or touching spans of same type
                    last["start"] = min(last["start"], s["start"])
                    last["end"] = max(last["end"], s["end"])
                else:
                    merged.append(dict(s))

            for m in merged:
                surface = self._normalize(text[m["start"] : m["end"]])
                if surface:
                    found.append((surface, m["label"]))

            # Include any items lacking spans as-is (fallback)
            for i in items:
                if i["start"] is None or i["end"] is None:
                    if i["word"]:
                        found.append((i["word"], i["label"]))
        else:
            # Fallback when spans aren't provided: return normalized words
            for i in items:
                if i["word"]:
                    found.append((i["word"], i["label"]))
        return found

    def obfuscate_text(self, text: str) -> str:
        if not text:
            return text

        entities = self._extract_entities(text)
        if not entities:
            return text

        # Deduplicate per text, keep stable global mapping
        unique_words: Dict[str, str] = {}
        for word, label in entities:
            if word not in self.entity_map:
                replacement = self._next_id(label)
                self.entity_map[word] = replacement
            unique_words[word] = self.entity_map[word]

        # Replace longer matches first to avoid partial overlaps
        sorted_pairs = sorted(
            unique_words.items(), key=lambda x: len(x[0]), reverse=True
        )

        def replace_once(s: str, old: str, new: str) -> str:
            # Use simple substring replacement; for stricter matching, use word boundaries
            # when appropriate (e.g., names). This is a demo, keep it simple.
            pattern = re.escape(old)
            return re.sub(pattern, new, s)

        obfuscated = text
        for old, new in sorted_pairs:
            obfuscated = replace_once(obfuscated, old, new)
        return obfuscated


def _build_gliner_model():
    """Create a GLiNER model for PII-like entity extraction.

    Returns a tuple (model, labels) where model.predict_entities(text, labels)
    yields entities with "text" and "label" fields.
    """
    try:
        from gliner import GLiNER  # type: ignore
    except Exception:
        _log.error(
            "GLiNER not installed. Please run: pip install gliner torch --extra-index-url https://download.pytorch.org/whl/cpu"
        )
        raise

    model = GLiNER.from_pretrained(GLINER_MODEL)
    # Curated set of labels for PII detection. Adjust as needed.
    labels = [
        # "work",
        "booking number",
        "personally identifiable information",
        "driver licence",
        "person",
        "full address",
        "company",
        # "actor",
        # "character",
        "email",
        "passport number",
        "Social Security Number",
        "phone number",
    ]
    return model, labels


class AdvancedPIIObfuscator:
    """PII obfuscator powered by GLiNER with fine-grained labels.

    - Uses GLiNER's `predict_entities(text, labels)` to detect entities.
    - Obfuscates with stable IDs per fine-grained label, e.g. `email-1`.
    """

    def __init__(self, gliner_model, labels: List[str]):
        self.model = gliner_model
        self.labels = labels
        self.entity_map: Dict[str, str] = {}
        self.counters: Dict[str, int] = {}

    def _normalize(self, s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()

    def _norm_label(self, label: str) -> str:
        return (
            re.sub(
                r"[^a-z0-9_]+", "_", label.lower().replace(" ", "_").replace("-", "_")
            ).strip("_")
            or "pii"
        )

    def _next_id(self, typ: str) -> str:
        self.cc(typ)
        self.counters[typ] += 1
        return f"{typ}-{self.counters[typ]}"

    def cc(self, typ: str) -> None:
        if typ not in self.counters:
            self.counters[typ] = 0

    def _extract_entities(self, text: str) -> List[Tuple[str, str]]:
        if not text:
            return []
        results = self.model.predict_entities(
            text, self.labels
        )  # expects dicts with text/label
        found: List[Tuple[str, str]] = []
        for r in results:
            label = self._norm_label(str(r.get("label", "pii")))
            surface = self._normalize(str(r.get("text", "")))
            if surface:
                found.append((surface, label))
        return found

    def obfuscate_text(self, text: str) -> str:
        if not text:
            return text
        entities = self._extract_entities(text)
        if not entities:
            return text

        unique_words: Dict[str, str] = {}
        for word, label in entities:
            if word not in self.entity_map:
                replacement = self._next_id(label)
                self.entity_map[word] = replacement
            unique_words[word] = self.entity_map[word]

        sorted_pairs = sorted(
            unique_words.items(), key=lambda x: len(x[0]), reverse=True
        )

        def replace_once(s: str, old: str, new: str) -> str:
            pattern = re.escape(old)
            return re.sub(pattern, new, s)

        obfuscated = text
        for old, new in sorted_pairs:
            obfuscated = replace_once(obfuscated, old, new)
        return obfuscated


def main():
    logging.basicConfig(level=logging.INFO)

    data_folder = Path(__file__).parent / "../../tests/data"
    input_doc_path = data_folder / "pdf/2206.01062.pdf"
    output_dir = Path("scratch")  # ensure this directory exists before saving

    # Choose engine via CLI flag or env var (default: hf)
    parser = argparse.ArgumentParser(description="PII obfuscation example")
    parser.add_argument(
        "--engine",
        choices=["hf", "gliner"],
        default=os.getenv("PII_ENGINE", "hf"),
        help="NER engine: 'hf' (Transformers) or 'gliner' (GLiNER)",
    )
    args = parser.parse_args()

    # Ensure output dir exists
    output_dir.mkdir(parents=True, exist_ok=True)

    # Keep and generate images so Markdown can embed them
    pipeline_options = PdfPipelineOptions()
    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    conv_res = doc_converter.convert(input_doc_path)
    conv_doc = conv_res.document
    doc_filename = conv_res.input.file.name

    # Save markdown with embedded pictures in original text
    md_filename = output_dir / f"{doc_filename}-with-images-orig.md"
    conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)

    # Build NER pipeline and obfuscator
    if args.engine == "gliner":
        _log.info("Using GLiNER-based AdvancedPIIObfuscator")
        gliner_model, gliner_labels = _build_gliner_model()
        obfuscator = AdvancedPIIObfuscator(gliner_model, gliner_labels)
    else:
        _log.info("Using HF Transformers-based SimplePiiObfuscator")
        ner = _build_simple_ner_pipeline()
        obfuscator = SimplePiiObfuscator(ner)

    for element, _level in conv_res.document.iterate_items():
        if isinstance(element, TextItem):
            element.orig = element.text
            element.text = obfuscator.obfuscate_text(element.text)
            # print(element.orig, " => ", element.text)

        elif isinstance(element, TableItem):
            for cell in element.data.table_cells:
                cell.text = obfuscator.obfuscate_text(cell.text)

    # Save markdown with embedded pictures and obfuscated text
    md_filename = output_dir / f"{doc_filename}-with-images-pii-obfuscated.md"
    conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)

    # Optional: log mapping summary
    if obfuscator.entity_map:
        data = []
        for key, val in obfuscator.entity_map.items():
            data.append([key, val])

        _log.info(
            f"Obfuscated entities:\n\n{tabulate(data)}",
        )


if __name__ == "__main__":
    main()

Technical Implementation: Orchestrating the Pipeline

To bring this privacy-first pipeline to life, we built a modular architecture that cleanly separates document layout ingestion from PII parsing and transformation. The backbone of this application is a custom Python orchestrator, wrapped in an intuitive Streamlit web interface for seamless user interaction.

Here is how the core components are structured and implemented.

Handling the Infrastructure & Apple Silicon Workarounds

When deploying layout-aware document processing libraries on diverse development environments — such as macOS systems running on ARM64 architecture — multimodal models can occasionally encounter floating-point compatibility issues with Apple Silicon’s Metal Performance Shaders (MPS).

To ensure absolute stability and uniform execution across all developer machines, we explicitly force the PyTorch execution backend to utilize the CPU right at the script’s initialization:

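import os

# Set these before importing torch so they are already in place when it loads.
# Let operations that MPS does not support fall back to the CPU
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
# Hide any CUDA devices
os.environ["CUDA_VISIBLE_DEVICES"] = ""
# Disable the upper limit on MPS memory allocations
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

import torch

# Force torch default device to CPU
torch.set_default_device("cpu")
if torch.backends.mps.is_available():
    # Monkeypatch: make code that checks is_built() conclude MPS is unavailable
    torch.backends.mps.is_built = lambda: False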

Building the Encapsulated PII Engine

We wrapped Microsoft Presidio’s dual-engine architecture (AnalyzerEngine and AnonymizerEngine) inside a clean, extensible class named PresidioPIIObfuscator. This wrapper manages the detection rules and houses the cryptographic configurations for our various obfuscation methods (Replace, Mask, Hash, Redact, Encrypt).

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

class PresidioPIIObfuscator:
    """Handles PII detection and obfuscation using Microsoft Presidio"""

    def __init__(self):
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()

    def obfuscate_text(self, text: str, method: str = "replace", language: str = "en"):
        # 1. Analyze text for native and context-based PII
        analyzer_results = self.analyzer.analyze(text=text, language=language)

        # 2. Fetch the corresponding operator configuration mapping
        operators = self._get_operator_config(method)

        # 3. Anonymize and transform the text string
        anonymized_result = self.anonymizer.anonymize(
            text=text,
            analyzer_results=analyzer_results,
            operators=operators
        )

        # Extract entity metadata for auditing/reporting
        entities = [
            {
                "entity_type": result.entity_type,
                "start": result.start,
                "end": result.end,
                "score": result.score,
                "original_text": text[result.start:result.end]
            }
            for result in analyzer_results
        ]

        return anonymized_result.text, entities

    def _get_operator_config(self, method: str):
        if method == "replace":
            return {
                "DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"}),
                "PERSON": OperatorConfig("replace", {"new_value": "<PERSON>"}),
                "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL>"}),
                "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "<PHONE>"}),
            }
        elif method == "hash":
            return {"DEFAULT": OperatorConfig("hash", {"hash_type": "sha256"})}
        elif method == "redact":
            return {"DEFAULT": OperatorConfig("redact", {})}
        # Additional methods (mask, encrypt) map here...

Layout-Aware Parsing and Document Reconstruction

The true magic happens when we iterate through a document processed by Docling. Standard text extractors flatten a PDF, losing layout context and blending tabular metrics into unreadable strings. Docling, on the other hand, breaks down the layout into distinct programmatic primitives like .texts and .tables.

Our application iterates through these individual structural nodes, skips empty elements, obfuscates the PII chunk by chunk via Presidio, and reassembles the final text stream:

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

def process_document(file_path: str, anonymization_method: str = "replace"):
    obfuscator = PresidioPIIObfuscator()

    # Configure Docling to run table structure recognition and OCR
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    # Structural parsing 
    result = converter.convert(file_path)
    doc = result.document

    obfuscated_parts = []
    all_entities = []

    # Process sequential textual blocks
    for element in doc.texts:
        if element.text.strip():
            obfuscated_text, entities = obfuscator.obfuscate_text(element.text, method=anonymization_method)
            obfuscated_parts.append(obfuscated_text)
            all_entities.extend(entities)

    # Process tables separately. str(table) would return the model's repr,
    # so export each table to Markdown to give Presidio the actual cell text.
    for table in doc.tables:
        table_text = table.export_to_markdown(doc=doc)
        if table_text.strip():
            obfuscated_text, entities = obfuscator.obfuscate_text(table_text, method=anonymization_method)
            obfuscated_parts.append(f"\n[TABLE]\n{obfuscated_text}\n")
            all_entities.extend(entities)

    return "\n\n".join(obfuscated_parts), all_entities

Interactive Streamlit Interface

To make this pipeline accessible, we built a web interface using Streamlit. The UI handles local file uploads, triggers the core processing pipeline, shows a breakdown table of the detected entities, and renders a live preview of the obfuscated text alongside timestamped text downloads.

The resulting UI shows at a glance how many PII elements were altered, groups them by entity type (such as PERSON, LOCATION, or CREDIT_CARD), and produces clean text that can be passed safely to RAG or fine-tuning pipelines.
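The per-type overview can be derived directly from the entity dictionaries that `obfuscate_text` returns. The following is a minimal sketch of that aggregation, independent of Streamlit; `summarize_entities` and its `min_score` parameter are hypothetical names for this example, and the sample records are made up.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_entities(
    entities: List[Dict[str, Any]], min_score: float = 0.0
) -> Dict[str, int]:
    """Count detected entities per type, ignoring low-confidence findings."""
    counts = Counter(
        e["entity_type"] for e in entities if e["score"] >= min_score
    )
    # most_common() orders the types from most to least frequent.
    return dict(counts.most_common())


# Sample records shaped like the dictionaries built in obfuscate_text.
entities = [
    {"entity_type": "PERSON", "start": 0, "end": 5, "score": 0.85, "original_text": "Alice"},
    {"entity_type": "PERSON", "start": 20, "end": 23, "score": 0.85, "original_text": "Bob"},
    {"entity_type": "CREDIT_CARD", "start": 40, "end": 56, "score": 1.0, "original_text": "4111111111111111"},
    {"entity_type": "LOCATION", "start": 60, "end": 65, "score": 0.3, "original_text": "Paris"},
]

print(summarize_entities(entities))                 # {'PERSON': 2, 'CREDIT_CARD': 1, 'LOCATION': 1}
print(summarize_entities(entities, min_score=0.5))  # {'PERSON': 2, 'CREDIT_CARD': 1}
```

A summary like this is also a convenient audit record, since it reports what was changed without repeating the original values.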

The Entire End-to-End Application

Install the requirements:

# Core dependencies
docling>=2.0.0
docling-core>=2.0.0
presidio-analyzer>=2.2.0
presidio-anonymizer>=2.2.0

# UI Framework - Streamlit for stable web interface
streamlit>=1.28.0

# Document processing
pypdf>=3.0.0
python-docx>=1.0.0
pillow>=10.0.0

# NLP and ML
transformers>=4.30.0
torch>=2.0.0
spacy>=3.5.0

# Utilities
tabulate>=0.9.0
python-dotenv>=1.0.0

The complete code:

#!/usr/bin/env python3
"""
PII Obfuscation Application using Docling and Microsoft Presidio
"""

import os
import sys
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any, Optional

import streamlit as st

# Force CPU usage to avoid MPS float64 compatibility issues on Apple Silicon
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

# Import torch and force CPU device
import torch
torch.set_default_device('cpu')
if torch.backends.mps.is_available():
    torch.backends.mps.is_built = lambda: False

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig


class PresidioPIIObfuscator:
    """Handles PII detection and obfuscation using Microsoft Presidio"""

    def __init__(self):
        """Initialize Presidio Analyzer and Anonymizer engines"""
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()

    def analyze_text(self, text: str, language: str = "en") -> List[Dict[str, Any]]:
        """
        Analyze text for PII entities

        Args:
            text: Text to analyze
            language: Language code (default: "en")

        Returns:
            List of detected PII entities
        """
        results = self.analyzer.analyze(text=text, language=language)
        return [
            {
                "entity_type": result.entity_type,
                "start": result.start,
                "end": result.end,
                "score": result.score,
                "text": text[result.start:result.end]
            }
            for result in results
        ]

    def obfuscate_text(
        self, 
        text: str, 
        method: str = "replace",
        language: str = "en"
    ) -> tuple[str, List[Dict[str, Any]]]:
        """
        Obfuscate PII in text

        Args:
            text: Text to obfuscate
            method: Obfuscation method (replace, mask, hash, redact, encrypt)
            language: Language code (default: "en")

        Returns:
            Tuple of (obfuscated_text, detected_entities)
        """
        # Analyze text for PII
        analyzer_results = self.analyzer.analyze(text=text, language=language)

        # Get operator configuration based on method
        operators = self._get_operator_config(method)

        # Anonymize the text
        anonymized_result = self.anonymizer.anonymize(
            text=text,
            analyzer_results=analyzer_results,
            operators=operators
        )

        # Extract entity information
        entities = [
            {
                "entity_type": result.entity_type,
                "start": result.start,
                "end": result.end,
                "score": result.score,
                "original_text": text[result.start:result.end]
            }
            for result in analyzer_results
        ]

        return anonymized_result.text, entities

    def _get_operator_config(self, method: str) -> Dict[str, OperatorConfig]:
        """
        Get operator configuration for anonymization method

        Args:
            method: Obfuscation method

        Returns:
            Dictionary of operator configurations
        """
        if method == "replace":
            return {
                "DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"}),
                "PERSON": OperatorConfig("replace", {"new_value": "<PERSON>"}),
                "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL>"}),
                "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "<PHONE>"}),
                "CREDIT_CARD": OperatorConfig("replace", {"new_value": "<CREDIT_CARD>"}),
                "US_SSN": OperatorConfig("replace", {"new_value": "<SSN>"}),
                "LOCATION": OperatorConfig("replace", {"new_value": "<LOCATION>"}),
            }
        elif method == "mask":
            return {
                "DEFAULT": OperatorConfig("mask", {
                    "masking_char": "*",
                    "chars_to_mask": 100,
                    "from_end": False
                })
            }
        elif method == "hash":
            return {
                "DEFAULT": OperatorConfig("hash", {"hash_type": "sha256"})
            }
        elif method == "redact":
            return {
                "DEFAULT": OperatorConfig("redact", {})
            }
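        elif method == "encrypt":
            # The literal below is a demo fallback only. Supply a real key through
            # the (hypothetical) PII_ENCRYPTION_KEY environment variable.
            return {
                "DEFAULT": OperatorConfig("encrypt", {
                    "key": os.environ.get("PII_ENCRYPTION_KEY", "WmZq4t7w!z%C*F-J")
                })
            }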
        else:
            return {
                "DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})
            }


def process_document(
    file_path: str,
    anonymization_method: str = "replace",
    output_dir: str = "output"
) -> tuple[str, List[Dict[str, Any]], str]:
    """
    Process a document to detect and obfuscate PII

    Args:
        file_path: Path to input document
        anonymization_method: Method for anonymization
        output_dir: Directory for output files

    Returns:
        Tuple of (obfuscated_text, detected_entities, output_file_path)
    """
    # Initialize obfuscator
    obfuscator = PresidioPIIObfuscator()

    # Initialize Docling converter with PDF pipeline options
    # CPU usage is forced via environment variables to avoid MPS float64 issues
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    # Convert document
    result = converter.convert(file_path)
    doc = result.document

    # Process document content
    obfuscated_parts = []
    all_entities = []

    # Process text elements
    for element in doc.texts:
        text = element.text
        if text.strip():
            obfuscated_text, entities = obfuscator.obfuscate_text(
                text, 
                method=anonymization_method
            )
            obfuscated_parts.append(obfuscated_text)
            all_entities.extend(entities)

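    # Process tables: export the cell content as Markdown, since str(table)
    # would dump the model's fields rather than readable table text
    for table in doc.tables:
        table_text = table.export_to_markdown(doc=doc)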
        if table_text.strip():
            obfuscated_text, entities = obfuscator.obfuscate_text(
                table_text,
                method=anonymization_method
            )
            obfuscated_parts.append(f"\n[TABLE]\n{obfuscated_text}\n")
            all_entities.extend(entities)

    # Combine obfuscated content
    final_text = "\n\n".join(obfuscated_parts)

    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Generate timestamped output filename
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    input_filename = Path(file_path).stem
    output_filename = f"{input_filename}_obfuscated_{timestamp}.txt"
    output_path = os.path.join(output_dir, output_filename)

    # Write obfuscated content to file
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write("PII Obfuscation Report\n")
        f.write(f"{'=' * 80}\n\n")
        f.write(f"Input File: {file_path}\n")
        f.write(f"Anonymization Method: {anonymization_method}\n")
        f.write(f"Timestamp: {timestamp}\n")
        f.write(f"Total PII Entities Detected: {len(all_entities)}\n\n")
        f.write(f"{'=' * 80}\n\n")
        f.write(final_text)

    return final_text, all_entities, output_path


def main():
    """Main Streamlit application"""

    st.set_page_config(
        page_title="PII Obfuscation Tool",
        page_icon="🔒",
        layout="wide"
    )

    st.title("🔒 PII Obfuscation Tool")
    st.markdown("Detect and obfuscate Personally Identifiable Information (PII) in documents using Microsoft Presidio and Docling")

    # Sidebar configuration
    st.sidebar.header("Configuration")

    anonymization_method = st.sidebar.selectbox(
        "Anonymization Method",
        ["replace", "mask", "hash", "redact", "encrypt"],
        help="Choose how to obfuscate detected PII"
    )

    st.sidebar.markdown("---")
    st.sidebar.markdown("### Method Descriptions")
    st.sidebar.markdown("""
    - **Replace**: Replace PII with entity type labels
    - **Mask**: Mask PII with asterisks
    - **Hash**: Replace PII with SHA-256 hash
    - **Redact**: Remove PII completely
    - **Encrypt**: Encrypt PII (reversible)
    """)

    # File upload
    uploaded_file = st.file_uploader(
        "Upload a PDF document",
        type=["pdf"],
        help="Select a PDF file to process"
    )

    if uploaded_file is not None:
        # Save uploaded file temporarily
        temp_dir = "temp"
        os.makedirs(temp_dir, exist_ok=True)
        temp_file_path = os.path.join(temp_dir, Path(uploaded_file.name).name)  # keep only the base name

        with open(temp_file_path, "wb") as f:
            f.write(uploaded_file.getbuffer())

        st.success(f"File uploaded: {uploaded_file.name}")

        # Process button
        if st.button("🚀 Process Document", type="primary"):
            with st.spinner("Processing document... This may take a few moments."):
                try:
                    # Process the document
                    obfuscated_text, entities, output_path = process_document(
                        temp_file_path,
                        anonymization_method
                    )

                    # Display results
                    st.success("✅ Document processed successfully!")

                    # Statistics
                    col1, col2, col3 = st.columns(3)
                    with col1:
                        st.metric("PII Entities Detected", len(entities))
                    with col2:
                        st.metric("Anonymization Method", anonymization_method.upper())
                    with col3:
                        st.metric("Output File", Path(output_path).name)

                    # Entity breakdown
                    if entities:
                        st.subheader("📊 Detected PII Entities")
                        entity_types = {}
                        for entity in entities:
                            entity_type = entity["entity_type"]
                            entity_types[entity_type] = entity_types.get(entity_type, 0) + 1

                        entity_df = [
                            {"Entity Type": k, "Count": v} 
                            for k, v in sorted(entity_types.items(), key=lambda x: x[1], reverse=True)
                        ]
                        st.table(entity_df)

                    # Display obfuscated text
                    st.subheader("📄 Obfuscated Content Preview")
                    st.text_area(
                        "Preview (first 2000 characters)",
                        obfuscated_text[:2000] + ("..." if len(obfuscated_text) > 2000 else ""),
                        height=300
                    )

                    # Download button
                    with open(output_path, 'r', encoding='utf-8') as f:
                        output_content = f.read()

                    st.download_button(
                        label="📥 Download Obfuscated Document",
                        data=output_content,
                        file_name=Path(output_path).name,
                        mime="text/plain"
                    )

                    st.info(f"💾 Full output saved to: `{output_path}`")

                except Exception as e:
                    st.error(f"❌ Error processing document: {str(e)}")
                    st.exception(e)
                finally:
                    # Clean up temp file
                    if os.path.exists(temp_file_path):
                        os.remove(temp_file_path)

    else:
        st.info("👆 Please upload a PDF document to begin")

        # Example section
        st.markdown("---")
        st.subheader("📚 Example Usage")
        st.markdown("""
        1. Upload a PDF document containing text
        2. Select an anonymization method from the sidebar
        3. Click "Process Document" to detect and obfuscate PII
        4. Review the detected entities and download the obfuscated document

        **Supported PII Types:**
        - Person names
        - Email addresses
        - Phone numbers
        - Credit card numbers
        - Social Security Numbers (SSN)
        - Locations
        - And more...
        """)


if __name__ == "__main__":
    main()

# Made with Bob
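
The `replace` configuration above maps every person to the same `<PERSON>` label, so a reader (or an LLM) can no longer tell two individuals apart. As a minimal sketch of the numbered-token idea from the introduction, the helper below works on the same `entities` dictionaries that `obfuscate_text` returns (`entity_type`, `start`, `end`). The function name `pseudonymize` is hypothetical and not part of Presidio, and the sketch assumes the detected spans do not overlap.

```python
from typing import Any, Dict, List, Tuple


def pseudonymize(text: str, entities: List[Dict[str, Any]]) -> str:
    """Replace each detected span with a numbered placeholder such as <PERSON_1>.

    Identical surface strings of the same entity type share one placeholder,
    so repeated mentions stay linked. Assumes non-overlapping spans.
    """
    ordered = sorted(entities, key=lambda e: e["start"])

    # First pass, left to right: number each distinct value in reading order.
    counters: Dict[str, int] = {}
    labels: Dict[Tuple[str, str], str] = {}
    for ent in ordered:
        etype = ent["entity_type"]
        key = (etype, text[ent["start"]:ent["end"]])
        if key not in labels:
            counters[etype] = counters.get(etype, 0) + 1
            labels[key] = f"<{etype}_{counters[etype]}>"

    # Second pass, right to left: splice so earlier offsets remain valid.
    result = text
    for ent in reversed(ordered):
        key = (ent["entity_type"], text[ent["start"]:ent["end"]])
        result = result[:ent["start"]] + labels[key] + result[ent["end"]:]
    return result
```

Because the mapping is keyed on the exact matched string, "Alice" and "Alice Smith" would receive different numbers; resolving such variants to one person would need extra logic that this sketch does not attempt.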

Conclusion: The Power of an AI-Driven SDLC with Bob

Building a production-ready, layout-aware privacy pipeline from scratch is typically a complex engineering task, because developers have to bridge structural document extraction and intelligent text transformation. Using Bob to drive this project dramatically accelerated the software development lifecycle. Bob orchestrated the integration of Docling and Microsoft Presidio and translated high-level architectural ideas into an end-to-end Python application. It handled low-level infrastructure workarounds, such as enforcing stable CPU execution paths on Apple Silicon, as well as the generation of an intuitive, responsive Streamlit web portal. The result is a fully local, privacy-first document obfuscation engine that shows how AI companions can turn complex engineering challenges into clean, modular, and deployable enterprise solutions.
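
The Apple Silicon workaround mentioned above can be sketched roughly as follows. The exact variables the application sets are not shown in this part of the listing, so the two names below are assumptions: `DOCLING_DEVICE` is the environment variable Docling's accelerator settings read, and `PYTORCH_ENABLE_MPS_FALLBACK` is PyTorch's switch for falling back to the CPU when an operator is missing on MPS. The helper name `force_cpu` is hypothetical.

```python
import os
from typing import MutableMapping


def force_cpu(env: MutableMapping[str, str] = os.environ) -> None:
    """Set CPU-only hints; call this before importing torch or docling."""
    # Ask Docling to run its models on the CPU instead of auto-selecting MPS.
    env["DOCLING_DEVICE"] = "cpu"
    # Let PyTorch fall back to the CPU for operators the MPS backend lacks.
    env["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
```

Calling `force_cpu()` at the very top of the Streamlit script, before the Docling imports, is the safest ordering. It trades speed for stability on machines where MPS rejects float64 tensors.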

>>> Thanks for reading <<<
