Architecting an Open-Source Document Anonymizer with Docling, Presidio, and Bob
Introduction-PII (Personally identifiable information)
As organizations increasingly rely on GenAI, Large Language Models (LLMs), and cloud-driven analytics, data has become both an enterprise’s greatest asset and its biggest liability. Every day, millions of documents — ranging from PDFs and slide decks to emails and spreadsheets — flow through corporate pipelines. Hidden within this unstructured data is a wealth of Personally Identifiable Information (PII): names, phone numbers, credit card details, financial statements, and geographic locations.
Leaving this data exposed is no longer just a technical oversight; it is a massive compliance and security risk. However, protecting this information isn’t a one-size-fits-all task. Depending on the objective, different workflows demand entirely different approaches to handling sensitive data.
Feeding GenAI and RAG Pipelines (Context Preservation)
When preparing documents for Retrieval-Augmented Generation (RAG) or fine-tuning LLMs, the goal is to protect privacy without destroying the utility of the text. If you completely delete a name or a location, the model loses the semantic context required to understand relationships within the document.
- The Need: Anonymization or Pseudonymization (Masking/Replacing).
- The Strategy: Replacing a specific name like “John Doe” with a generic token like [PERSON_1] or [EMPLOYEE] allows the LLM to maintain grammatical and logical coherence while keeping the actual individual completely anonymous.
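To make the token idea concrete, here is a minimal sketch of consistent pseudonymization using only the standard library. The function name `pseudonymize` and the `(surface_text, entity_type)` input shape are assumptions for illustration; in a real pipeline the entity list would come from a detector such as the ones discussed later.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def pseudonymize(text: str, entities: List[Tuple[str, str]]) -> Tuple[str, Dict[str, str]]:
    """Replace each detected surface string with a stable token such as [PERSON_1].

    `entities` is a list of (surface_text, entity_type) pairs, assumed to come
    from whatever detector runs upstream.
    """
    counters: Dict[str, int] = defaultdict(int)
    mapping: Dict[str, str] = {}
    for surface, entity_type in entities:
        if surface not in mapping:
            counters[entity_type] += 1
            mapping[surface] = f"[{entity_type}_{counters[entity_type]}]"

    # Replace longer strings first so "John Doe" is handled before "John".
    for surface in sorted(mapping, key=len, reverse=True):
        token = mapping[surface]
        # Lookarounds keep "Ann" from matching inside "Annabel".
        pattern = rf"(?<!\w){re.escape(surface)}(?!\w)"
        text = re.sub(pattern, lambda _match, token=token: token, text)
    return text, mapping


masked, mapping = pseudonymize(
    "John Doe met Jane Roe. John Doe paid.",
    [("John Doe", "PERSON"), ("Jane Roe", "PERSON")],
)
print(masked)
```

Because the same person always maps to the same token, a model reading the output can still tell that the first and third mentions refer to one individual.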
Analytical and Database Processing (Data Linking)
Data engineering teams often need to analyze user behavior, processing trends, or system logs across multiple datasets without knowing exactly who the users are.
- The Need: Consistency and Reversibility (Hashing/Encryption).
- The Strategy: Using cryptographic Hashing (like SHA-256) ensures that the same PII always results in the same unique string. This allows analysts to join tables and track trends over time without exposing raw data. Alternatively, reversible encryption is used when downstream automated systems must temporarily hide PII but authorized users eventually need to decrypt and view the original values.
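The consistency property can be shown with a short standard-library sketch. The helper names `plain_hash` and `keyed_hash`, and the idea of a secret key held by the data team, are assumptions for illustration. One caveat worth stating: a plain SHA-256 of a low-entropy value such as a phone number can be reversed by hashing every candidate value, so a keyed hash (HMAC) is shown alongside it as a common hardening step.

```python
import hashlib
import hmac


def plain_hash(value: str) -> str:
    """Deterministic SHA-256 digest: equal inputs always give equal outputs."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_hash(value: str, secret_key: bytes) -> str:
    """HMAC-SHA256: still deterministic, but useless to guess without the key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# The same phone number in two tables produces the same join key.
orders_key = plain_hash("+1-555-0100")
support_key = plain_hash("+1-555-0100")
print(orders_key == support_key)

# Hypothetical key, for illustration only; a real key belongs in a secret store.
print(keyed_hash("+1-555-0100", b"example-key-not-for-production"))
```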
Regulatory Compliance and Public Release (Absolute Security)
When sharing documents with third-party vendors, publishing legal discovery files, or complying with strict data privacy mandates like GDPR, CCPA, or HIPAA, there is zero margin for error.
- The Need: Irreversible Elimination (Redaction).
- The Strategy: Redaction completely strips the text or burns a black bar over the sensitive areas. Provided the underlying text is actually removed (and not merely covered by an overlay), the original data cannot be recovered from the released document by any mathematical or algorithmic means, which is what strict disclosure requirements call for.
The Modern Challenge: Parsing Meets Detection
Implementing these varied obfuscation strategies requires a two-step dance: first, you must accurately extract text and layout structure from complex, messy enterprise documents (such as multi-column PDFs or scanned charts). Second, you must accurately detect and transform the PII without breaking the document’s formatting.
Balancing data utility with absolute privacy is the foundational challenge of modern data engineering, and it is exactly why pairing robust document parsing with intelligent PII detection has become an industry necessity.
Introducing the Microsoft Presidio Open-Source Project
The open-source landscape offers powerful, specialized tools to address these diverse data privacy needs. Foremost among them is Microsoft Presidio, a production-ready, open-source library designed to democratize data protection by providing fast, customizable, and scalable PII detection and anonymization.
Rather than locking developers into a specific vendor or a rigid, one-size-fits-all algorithm, Presidio acts as a highly modular pipeline framework. It allows organizations to automate the discovery and obfuscation of sensitive entities across text, images, and unstructured data streams.
Core Architecture: How Presidio Works
Presidio splits the privacy challenge into two distinct, highly configurable stages: Detection and Anonymization.
The Analyzer (PII Detection)
The Presidio Analyzer is the brain of the operation. It is an orchestrator that utilizes an array of diverse detection mechanisms to identify sensitive data with high accuracy. Instead of relying on a single method, it combines:
- Pre-defined and Custom Recognizers: Out-of-the-box detectors for global entities like Credit Card numbers, IBANs, Social Security Numbers (SSN), IP addresses, and email formats.
- Regex and Validation Checkers: Fast, rule-based matching combined with checksum validation algorithms (such as the Luhn algorithm for credit cards) to drastically reduce false positives.
- spaCy / Hugging Face NLP Models: Leveraging advanced Named Entity Recognition (NER) to understand the semantic context of a sentence, allowing it to accurately differentiate between a common noun and a person’s name or a specific location.
The Analyzer outputs a structured list of findings, detailing exactly what type of PII was found, where it is located (character indices), and a confidence score for the detection.
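The checksum idea mentioned above is easy to illustrate. Below is a minimal standard-library sketch of the Luhn check; it is not Presidio's own code, and the function name `luhn_is_valid` is chosen here for illustration. A regex can find any 16-digit run, and a check like this then discards runs that cannot be real card numbers.

```python
def luhn_is_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_is_valid("4111 1111 1111 1111"))  # widely used test card number
print(luhn_is_valid("4111 1111 1111 1112"))  # same number with the last digit changed
```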
The Anonymizer (PII Obfuscation)
Once the Analyzer maps out the sensitive entities, the Presidio Anonymizer takes over to apply the transformation rules. This engine is designed to handle the exact real-world tasks required by modern developers:
- Redact: Deletes the text completely.
- Replace: Swaps the sensitive data with a generic placeholder or entity token (e.g., changing “Alice” to <PERSON>).
- Mask: Obfuscates a portion of the string (e.g., turning a credit card number into ************1234).
- Hash: Computes a cryptographic hash (like SHA-256) of the PII, ensuring data consistency for analytical tracking without exposing the underlying identity.
- Custom Functions: Allows developers to write tailored cryptographic or obfuscation logic directly into the pipeline.
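To show how character-index findings and per-entity operators fit together, here is a small standard-library sketch. It is an illustration of the general technique, not Presidio's implementation: the names `apply_operators` and `mask_keep_last` are hypothetical, and the sketch assumes the detected spans do not overlap.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type), end is exclusive


def mask_keep_last(value: str, keep: int = 4, char: str = "*") -> str:
    """Mask everything except the last `keep` characters."""
    hidden = max(len(value) - keep, 0)
    return char * hidden + value[hidden:]


def apply_operators(
    text: str,
    spans: List[Span],
    operators: Dict[str, Callable[[str], str]],
    default: Callable[[str], str],
) -> str:
    """Rewrite each span with the operator registered for its entity type."""
    # Work from the end of the text so earlier offsets stay valid.
    for start, end, entity_type in sorted(spans, key=lambda s: s[0], reverse=True):
        operator = operators.get(entity_type, default)
        text = text[:start] + operator(text[start:end]) + text[end:]
    return text


text = "Alice paid with 4111111111111111."
spans = [(0, 5, "PERSON"), (16, 32, "CREDIT_CARD")]
operators = {
    "PERSON": lambda value: "<PERSON>",  # replace
    "CREDIT_CARD": mask_keep_last,       # mask
}
# Unknown entity types fall back to redaction (empty string).
print(apply_operators(text, spans, operators, default=lambda value: ""))
```

Applying the edits right to left is the design choice that matters here: replacements change the string length, so processing from the start would shift every later offset.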
Why Combine with Docling?
While Microsoft Presidio is exceptionally skilled at analyzing and transforming raw text, it cannot natively look inside a complex PDF, a multi-tab Excel spreadsheet, or a formatted PowerPoint deck and understand its layout. This is where Docling becomes an indispensable partner. As one of the most advanced document processing tools available today, Docling excels at tearing down the barriers of messy enterprise formats and converting them into clean, structured representations like Markdown or JSON. By accurately preserving document hierarchy, multi-column reading orders, and intricate table structures, Docling ensures that text is extracted exactly as it was meant to be read. Passing this high-fidelity, contextual text to Presidio ensures that PII detection algorithms don’t miss sensitive information split across broken layout lines, allowing developers to safely obfuscate documents while maintaining the structural integrity required for downstream GenAI and RAG pipelines.
Docling-Do we really need to introduce Docling? 😉
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
Docling ships many sample scripts for business use cases such as PII obfuscation. The sample below is the one I gave to Bob as a starting point for building the new application with Microsoft Presidio:
import argparse
import logging
import os
import re
from pathlib import Path
from typing import Dict, List, Tuple
from docling_core.types.doc import ImageRefMode, TableItem, TextItem
from tabulate import tabulate
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
_log = logging.getLogger(__name__)
IMAGE_RESOLUTION_SCALE = 2.0
HF_MODEL = "dslim/bert-base-NER" # Swap with another HF NER/PII model if desired, eg https://huggingface.co/urchade/gliner_multi_pii-v1 looks very promising too!
GLINER_MODEL = "urchade/gliner_multi_pii-v1"


def _build_simple_ner_pipeline():
    """Create a Hugging Face token-classification pipeline for NER.

    Returns a callable like: ner(text) -> List[dict]
    """
    try:
        from transformers import (
            AutoModelForTokenClassification,
            AutoTokenizer,
            pipeline,
        )
    except Exception:
        _log.error("Transformers not installed. Please run: pip install transformers")
        raise

    tokenizer = AutoTokenizer.from_pretrained(HF_MODEL)
    model = AutoModelForTokenClassification.from_pretrained(HF_MODEL)
    ner = pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",  # groups subwords into complete entities
        # Note: modern Transformers returns `start`/`end` when possible with aggregation
    )
    return ner


class SimplePiiObfuscator:
    """Tracks PII strings and replaces them with stable IDs per entity type."""

    def __init__(self, ner_callable):
        self.ner = ner_callable
        self.entity_map: Dict[str, str] = {}
        self.counters: Dict[str, int] = {
            "person": 0,
            "org": 0,
            "location": 0,
            "misc": 0,
        }
        # Map model labels to our coarse types
        self.label_map = {
            "PER": "person",
            "PERSON": "person",
            "ORG": "org",
            "ORGANIZATION": "org",
            "LOC": "location",
            "LOCATION": "location",
            "GPE": "location",
            # Fallbacks
            "MISC": "misc",
            "O": "misc",
        }
        # Only obfuscate these by default. Adjust as needed.
        self.allowed_types = {"person", "org", "location"}

    def _next_id(self, typ: str) -> str:
        self.counters[typ] += 1
        return f"{typ}-{self.counters[typ]}"

    def _normalize(self, s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()

    def _extract_entities(self, text: str) -> List[Tuple[str, str]]:
        """Run NER and return a list of (surface_text, type) to obfuscate."""
        if not text:
            return []
        results = self.ner(text)
        # Collect normalized items with optional span info
        items = []
        for r in results:
            raw_label = r.get("entity_group") or r.get("entity") or "MISC"
            label = self.label_map.get(raw_label, "misc")
            if label not in self.allowed_types:
                continue
            start = r.get("start")
            end = r.get("end")
            word = self._normalize(r.get("word") or r.get("text") or "")
            items.append({"label": label, "start": start, "end": end, "word": word})

        found: List[Tuple[str, str]] = []
        # If the pipeline provides character spans, merge consecutive/overlapping
        # entities of the same type into a single span, then take the substring
        # from the original text. This handles cases like subword tokenization
        # where multiple adjacent pieces belong to the same named entity.
        have_spans = any(i["start"] is not None and i["end"] is not None for i in items)
        if have_spans:
            spans = [
                i for i in items if i["start"] is not None and i["end"] is not None
            ]
            # Ensure processing order by start (then end)
            spans.sort(key=lambda x: (x["start"], x["end"]))
            merged = []
            for s in spans:
                if not merged:
                    merged.append(dict(s))
                    continue
                last = merged[-1]
                if s["label"] == last["label"] and s["start"] <= last["end"]:
                    # Merge identical, overlapping, or touching spans of same type
                    last["start"] = min(last["start"], s["start"])
                    last["end"] = max(last["end"], s["end"])
                else:
                    merged.append(dict(s))
            for m in merged:
                surface = self._normalize(text[m["start"] : m["end"]])
                if surface:
                    found.append((surface, m["label"]))
            # Include any items lacking spans as-is (fallback)
            for i in items:
                if i["start"] is None or i["end"] is None:
                    if i["word"]:
                        found.append((i["word"], i["label"]))
        else:
            # Fallback when spans aren't provided: return normalized words
            for i in items:
                if i["word"]:
                    found.append((i["word"], i["label"]))
        return found

    def obfuscate_text(self, text: str) -> str:
        if not text:
            return text
        entities = self._extract_entities(text)
        if not entities:
            return text
        # Deduplicate per text, keep stable global mapping
        unique_words: Dict[str, str] = {}
        for word, label in entities:
            if word not in self.entity_map:
                replacement = self._next_id(label)
                self.entity_map[word] = replacement
            unique_words[word] = self.entity_map[word]
        # Replace longer matches first to avoid partial overlaps
        sorted_pairs = sorted(
            unique_words.items(), key=lambda x: len(x[0]), reverse=True
        )

        def replace_once(s: str, old: str, new: str) -> str:
            # Use simple substring replacement; for stricter matching, use word boundaries
            # when appropriate (e.g., names). This is a demo, keep it simple.
            pattern = re.escape(old)
            return re.sub(pattern, new, s)

        obfuscated = text
        for old, new in sorted_pairs:
            obfuscated = replace_once(obfuscated, old, new)
        return obfuscated


def _build_gliner_model():
    """Create a GLiNER model for PII-like entity extraction.

    Returns a tuple (model, labels) where model.predict_entities(text, labels)
    yields entities with "text" and "label" fields.
    """
    try:
        from gliner import GLiNER  # type: ignore
    except Exception:
        _log.error(
            "GLiNER not installed. Please run: pip install gliner torch --extra-index-url https://download.pytorch.org/whl/cpu"
        )
        raise

    model = GLiNER.from_pretrained(GLINER_MODEL)
    # Curated set of labels for PII detection. Adjust as needed.
    labels = [
        # "work",
        "booking number",
        "personally identifiable information",
        "driver licence",
        "person",
        "full address",
        "company",
        # "actor",
        # "character",
        "email",
        "passport number",
        "Social Security Number",
        "phone number",
    ]
    return model, labels


class AdvancedPIIObfuscator:
    """PII obfuscator powered by GLiNER with fine-grained labels.

    - Uses GLiNER's `predict_entities(text, labels)` to detect entities.
    - Obfuscates with stable IDs per fine-grained label, e.g. `email-1`.
    """

    def __init__(self, gliner_model, labels: List[str]):
        self.model = gliner_model
        self.labels = labels
        self.entity_map: Dict[str, str] = {}
        self.counters: Dict[str, int] = {}

    def _normalize(self, s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()

    def _norm_label(self, label: str) -> str:
        return (
            re.sub(
                r"[^a-z0-9_]+", "_", label.lower().replace(" ", "_").replace("-", "_")
            ).strip("_")
            or "pii"
        )

    def _next_id(self, typ: str) -> str:
        self.cc(typ)
        self.counters[typ] += 1
        return f"{typ}-{self.counters[typ]}"

    def cc(self, typ: str) -> None:
        # Create the counter for a label the first time it is seen
        if typ not in self.counters:
            self.counters[typ] = 0

    def _extract_entities(self, text: str) -> List[Tuple[str, str]]:
        if not text:
            return []
        results = self.model.predict_entities(
            text, self.labels
        )  # expects dicts with text/label
        found: List[Tuple[str, str]] = []
        for r in results:
            label = self._norm_label(str(r.get("label", "pii")))
            surface = self._normalize(str(r.get("text", "")))
            if surface:
                found.append((surface, label))
        return found

    def obfuscate_text(self, text: str) -> str:
        if not text:
            return text
        entities = self._extract_entities(text)
        if not entities:
            return text
        unique_words: Dict[str, str] = {}
        for word, label in entities:
            if word not in self.entity_map:
                replacement = self._next_id(label)
                self.entity_map[word] = replacement
            unique_words[word] = self.entity_map[word]
        sorted_pairs = sorted(
            unique_words.items(), key=lambda x: len(x[0]), reverse=True
        )

        def replace_once(s: str, old: str, new: str) -> str:
            pattern = re.escape(old)
            return re.sub(pattern, new, s)

        obfuscated = text
        for old, new in sorted_pairs:
            obfuscated = replace_once(obfuscated, old, new)
        return obfuscated


def main():
    logging.basicConfig(level=logging.INFO)
    data_folder = Path(__file__).parent / "../../tests/data"
    input_doc_path = data_folder / "pdf/2206.01062.pdf"
    output_dir = Path("scratch")  # ensure this directory exists before saving

    # Choose engine via CLI flag or env var (default: hf)
    parser = argparse.ArgumentParser(description="PII obfuscation example")
    parser.add_argument(
        "--engine",
        choices=["hf", "gliner"],
        default=os.getenv("PII_ENGINE", "hf"),
        help="NER engine: 'hf' (Transformers) or 'gliner' (GLiNER)",
    )
    args = parser.parse_args()

    # Ensure output dir exists
    output_dir.mkdir(parents=True, exist_ok=True)

    # Keep and generate images so Markdown can embed them
    pipeline_options = PdfPipelineOptions()
    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True
    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    conv_res = doc_converter.convert(input_doc_path)
    conv_doc = conv_res.document
    doc_filename = conv_res.input.file.name

    # Save markdown with embedded pictures in original text
    md_filename = output_dir / f"{doc_filename}-with-images-orig.md"
    conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)

    # Build NER pipeline and obfuscator
    if args.engine == "gliner":
        _log.info("Using GLiNER-based AdvancedPIIObfuscator")
        gliner_model, gliner_labels = _build_gliner_model()
        obfuscator = AdvancedPIIObfuscator(gliner_model, gliner_labels)
    else:
        _log.info("Using HF Transformers-based SimplePiiObfuscator")
        ner = _build_simple_ner_pipeline()
        obfuscator = SimplePiiObfuscator(ner)

    for element, _level in conv_res.document.iterate_items():
        if isinstance(element, TextItem):
            element.orig = element.text
            element.text = obfuscator.obfuscate_text(element.text)
            # print(element.orig, " => ", element.text)
        elif isinstance(element, TableItem):
            for cell in element.data.table_cells:
                cell.text = obfuscator.obfuscate_text(cell.text)

    # Save markdown with embedded pictures and obfuscated text
    md_filename = output_dir / f"{doc_filename}-with-images-pii-obfuscated.md"
    conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)

    # Optional: log mapping summary
    if obfuscator.entity_map:
        data = []
        for key, val in obfuscator.entity_map.items():
            data.append([key, val])
        _log.info(
            f"Obfuscated entities:\n\n{tabulate(data)}",
        )


if __name__ == "__main__":
    main()
Technical Implementation: Orchestrating the Pipeline
To bring this privacy-first pipeline to life, we built a modular architecture that cleanly separates document layout ingestion from PII parsing and transformation. The backbone of this application is a custom Python orchestrator, wrapped in an intuitive Streamlit web interface for seamless user interaction.
Here is how the core components are structured and implemented.
Handling the Infrastructure & Apple Silicon Workarounds
When deploying layout-aware document processing libraries on diverse development environments — such as macOS systems running on ARM64 architecture — multimodal models can occasionally encounter floating-point compatibility issues with Apple Silicon’s Metal Performance Shaders (MPS).
To ensure absolute stability and uniform execution across all developer machines, we explicitly force the PyTorch execution backend to utilize the CPU right at the script’s initialization:
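import os

# Set these before importing torch so they are in place when the backend initializes
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"  # run ops that MPS does not support on the CPU
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide any CUDA GPUs
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"  # lifts the MPS memory cap; does not select the CPU by itself

import torch

# Force torch default device to CPU
torch.set_default_device('cpu')
if torch.backends.mps.is_available():
    # Workaround: code that checks is_built() before using MPS will now skip it
    torch.backends.mps.is_built = lambda: False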
Building the Encapsulated PII Engine
We wrapped Microsoft Presidio’s dual-engine architecture (AnalyzerEngine and AnonymizerEngine) inside a clean, extensible class named PresidioPIIObfuscator. This wrapper manages the detection rules and houses the cryptographic configurations for our various obfuscation methods (Replace, Mask, Hash, Redact, Encrypt).
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig


class PresidioPIIObfuscator:
    """Handles PII detection and obfuscation using Microsoft Presidio"""

    def __init__(self):
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()

    def obfuscate_text(self, text: str, method: str = "replace", language: str = "en"):
        # 1. Analyze text for native and context-based PII
        analyzer_results = self.analyzer.analyze(text=text, language=language)
        # 2. Fetch the corresponding operator configuration mapping
        operators = self._get_operator_config(method)
        # 3. Anonymize and transform the text string
        anonymized_result = self.anonymizer.anonymize(
            text=text,
            analyzer_results=analyzer_results,
            operators=operators
        )
        # Extract entity metadata for auditing/reporting
        entities = [
            {
                "entity_type": result.entity_type,
                "start": result.start,
                "end": result.end,
                "score": result.score,
                "original_text": text[result.start:result.end]
            }
            for result in analyzer_results
        ]
        return anonymized_result.text, entities

    def _get_operator_config(self, method: str):
        if method == "replace":
            return {
                "DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"}),
                "PERSON": OperatorConfig("replace", {"new_value": "<PERSON>"}),
                "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL>"}),
                "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "<PHONE>"}),
            }
        elif method == "hash":
            return {"DEFAULT": OperatorConfig("hash", {"hash_type": "sha256"})}
        elif method == "redact":
            return {"DEFAULT": OperatorConfig("redact", {})}
        # Additional methods (mask, encrypt) map here...
        # Unknown methods fall back to a generic replacement, as in the full listing
        return {"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})}
Layout-Aware Parsing and Document Reconstruction
The true magic happens when we iterate through a document processed by Docling. Standard text extractors flatten a PDF, losing layout context and blending tabular metrics into unreadable strings. Docling, on the other hand, breaks down the layout into distinct programmatic primitives like .texts and .tables.
Our application iterates through these individual structural nodes, filters out empty spaces, strips out the PII chunk-by-chunk via Presidio, and reassembles the final safe text stream:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions


def process_document(file_path: str, anonymization_method: str = "replace"):
    obfuscator = PresidioPIIObfuscator()
    # Configure Docling to run table structure recognition and OCR
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    # Structural parsing
    result = converter.convert(file_path)
    doc = result.document
    obfuscated_parts = []
    all_entities = []
    # Process sequential textual blocks
    for element in doc.texts:
        if element.text.strip():
            obfuscated_text, entities = obfuscator.obfuscate_text(element.text, method=anonymization_method)
            obfuscated_parts.append(obfuscated_text)
            all_entities.extend(entities)
    # Process standalone table models separately to preserve contextual integrity
    for table in doc.tables:
        # Export the table as Markdown; str(table) would give the model's field
        # repr rather than the readable cell text
        table_text = table.export_to_markdown(doc=doc)
        if table_text.strip():
            obfuscated_text, entities = obfuscator.obfuscate_text(table_text, method=anonymization_method)
            obfuscated_parts.append(f"\n[TABLE]\n{obfuscated_text}\n")
            all_entities.extend(entities)
    return "\n\n".join(obfuscated_parts), all_entities
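The loop above treats each table as one chunk of text. An alternative, which is what the Docling sample earlier in this post does, is to obfuscate cell by cell so that a detection never spans two unrelated cells. The sketch below illustrates that idea on a plain list-of-lists grid; `obfuscate_grid` and the toy `fake_obfuscate` callable are hypothetical stand-ins, not part of Docling or Presidio.

```python
from typing import Callable, List


def obfuscate_grid(grid: List[List[str]], obfuscate: Callable[[str], str]) -> str:
    """Obfuscate each non-empty cell on its own, then render pipe-delimited rows."""
    rows = []
    for row in grid:
        cells = [obfuscate(cell) if cell.strip() else cell for cell in row]
        rows.append("| " + " | ".join(cells) + " |")
    return "\n".join(rows)


def fake_obfuscate(cell: str) -> str:
    """Toy detector for the demo: treat anything containing '@' as an email."""
    return "<EMAIL>" if "@" in cell else cell


grid = [["Name", "Email"], ["Alice", "alice@example.com"]]
print(obfuscate_grid(grid, fake_obfuscate))
```

In the real application the callable would wrap the obfuscator's `obfuscate_text` method and return only the text part of its result.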
Interactive Streamlit Interface
To make this pipeline accessible, we built a web interface using Streamlit. The UI handles local file uploads, triggers the core processing pipeline, displays a breakdown table of detected entities, shows a preview of the obfuscated text, and offers the timestamped output file for download.
The resulting UI shows at a glance how many PII elements were altered, groups them by entity type (such as PERSON, LOCATION, or CREDIT_CARD), and yields clean text that can safely feed RAG or fine-tuning pipelines.
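The per-type breakdown is a simple aggregation over the entity dictionaries returned by the obfuscator. A minimal sketch, assuming each entity has `entity_type` and `score` keys as in the code above; the function name `summarize_entities` and the `min_score` filter are additions for illustration:

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def summarize_entities(
    entities: List[Dict[str, Any]], min_score: float = 0.0
) -> List[Tuple[str, int]]:
    """Count detections per entity type, most frequent first."""
    counts = Counter(
        entity["entity_type"] for entity in entities if entity["score"] >= min_score
    )
    return counts.most_common()


entities = [
    {"entity_type": "PERSON", "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "score": 1.0},
    {"entity_type": "PERSON", "score": 0.6},
]
print(summarize_entities(entities))
print(summarize_entities(entities, min_score=0.8))
```

Filtering on the confidence score before counting is one way to keep low-confidence detections from inflating the totals shown to the user.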
The Entire End-to-End Application
- Install the requirements (for example, save the list below as requirements.txt and run pip install -r requirements.txt):
# Core dependencies
docling>=2.0.0
docling-core>=2.0.0
presidio-analyzer>=2.2.0
presidio-anonymizer>=2.2.0
# UI Framework - Streamlit for stable web interface
streamlit>=1.28.0
# Document processing
pypdf>=3.0.0
python-docx>=1.0.0
pillow>=10.0.0
# NLP and ML
transformers>=4.30.0
torch>=2.0.0
spacy>=3.5.0
# Utilities
tabulate>=0.9.0
python-dotenv>=1.0.0
- The complete code:
#!/usr/bin/env python3
"""
PII Obfuscation Application using Docling and Microsoft Presidio
"""
import os
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any

import streamlit as st

# Force CPU usage to avoid MPS float64 compatibility issues on Apple Silicon
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

# Import torch (after the environment variables are set) and force CPU device
import torch
torch.set_default_device('cpu')
if torch.backends.mps.is_available():
    torch.backends.mps.is_built = lambda: False

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig


class PresidioPIIObfuscator:
    """Handles PII detection and obfuscation using Microsoft Presidio"""

    def __init__(self):
        """Initialize Presidio Analyzer and Anonymizer engines"""
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()

    def analyze_text(self, text: str, language: str = "en") -> List[Dict[str, Any]]:
        """
        Analyze text for PII entities

        Args:
            text: Text to analyze
            language: Language code (default: "en")

        Returns:
            List of detected PII entities
        """
        results = self.analyzer.analyze(text=text, language=language)
        return [
            {
                "entity_type": result.entity_type,
                "start": result.start,
                "end": result.end,
                "score": result.score,
                "text": text[result.start:result.end]
            }
            for result in results
        ]

    def obfuscate_text(
        self,
        text: str,
        method: str = "replace",
        language: str = "en"
    ) -> tuple[str, List[Dict[str, Any]]]:
        """
        Obfuscate PII in text

        Args:
            text: Text to obfuscate
            method: Obfuscation method (replace, mask, hash, redact, encrypt)
            language: Language code (default: "en")

        Returns:
            Tuple of (obfuscated_text, detected_entities)
        """
        # Analyze text for PII
        analyzer_results = self.analyzer.analyze(text=text, language=language)
        # Get operator configuration based on method
        operators = self._get_operator_config(method)
        # Anonymize the text
        anonymized_result = self.anonymizer.anonymize(
            text=text,
            analyzer_results=analyzer_results,
            operators=operators
        )
        # Extract entity information
        entities = [
            {
                "entity_type": result.entity_type,
                "start": result.start,
                "end": result.end,
                "score": result.score,
                "original_text": text[result.start:result.end]
            }
            for result in analyzer_results
        ]
        return anonymized_result.text, entities

    def _get_operator_config(self, method: str) -> Dict[str, OperatorConfig]:
        """
        Get operator configuration for anonymization method

        Args:
            method: Obfuscation method

        Returns:
            Dictionary of operator configurations
        """
        if method == "replace":
            return {
                "DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"}),
                "PERSON": OperatorConfig("replace", {"new_value": "<PERSON>"}),
                "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL>"}),
                "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "<PHONE>"}),
                "CREDIT_CARD": OperatorConfig("replace", {"new_value": "<CREDIT_CARD>"}),
                "US_SSN": OperatorConfig("replace", {"new_value": "<SSN>"}),
                "LOCATION": OperatorConfig("replace", {"new_value": "<LOCATION>"}),
            }
        elif method == "mask":
            # chars_to_mask=100 with from_end=False masks up to the first 100
            # characters, which in practice hides the whole entity
            return {
                "DEFAULT": OperatorConfig("mask", {
                    "masking_char": "*",
                    "chars_to_mask": 100,
                    "from_end": False
                })
            }
        elif method == "hash":
            return {
                "DEFAULT": OperatorConfig("hash", {"hash_type": "sha256"})
            }
        elif method == "redact":
            return {
                "DEFAULT": OperatorConfig("redact", {})
            }
        elif method == "encrypt":
            # The hardcoded fallback is a demo key only. PII_ENCRYPTION_KEY is a
            # hypothetical variable name; load a real 128-, 192-, or 256-bit key
            # from a secret store or the environment instead of the source code.
            key = os.environ.get("PII_ENCRYPTION_KEY", "WmZq4t7w!z%C*F-J")
            return {
                "DEFAULT": OperatorConfig("encrypt", {"key": key})
            }
        else:
            return {
                "DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})
            }


def process_document(
    file_path: str,
    anonymization_method: str = "replace",
    output_dir: str = "output"
) -> tuple[str, List[Dict[str, Any]], str]:
    """
    Process a document to detect and obfuscate PII

    Args:
        file_path: Path to input document
        anonymization_method: Method for anonymization
        output_dir: Directory for output files

    Returns:
        Tuple of (obfuscated_text, detected_entities, output_file_path)
    """
    # Initialize obfuscator
    obfuscator = PresidioPIIObfuscator()
    # Initialize Docling converter with PDF pipeline options
    # CPU usage is forced via environment variables to avoid MPS float64 issues
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    # Convert document
    result = converter.convert(file_path)
    doc = result.document

    # Process document content
    obfuscated_parts = []
    all_entities = []

    # Process text elements
    for element in doc.texts:
        text = element.text
        if text.strip():
            obfuscated_text, entities = obfuscator.obfuscate_text(
                text,
                method=anonymization_method
            )
            obfuscated_parts.append(obfuscated_text)
            all_entities.extend(entities)

    # Process tables
    for table in doc.tables:
        # Export the table as Markdown; str(table) would give the model's field
        # repr rather than the readable cell text
        table_text = table.export_to_markdown(doc=doc)
        if table_text.strip():
            obfuscated_text, entities = obfuscator.obfuscate_text(
                table_text,
                method=anonymization_method
            )
            obfuscated_parts.append(f"\n[TABLE]\n{obfuscated_text}\n")
            all_entities.extend(entities)

    # Combine obfuscated content
    final_text = "\n\n".join(obfuscated_parts)

    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Generate timestamped output filename
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    input_filename = Path(file_path).stem
    output_filename = f"{input_filename}_obfuscated_{timestamp}.txt"
    output_path = os.path.join(output_dir, output_filename)

    # Write obfuscated content to file
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write("PII Obfuscation Report\n")
        f.write(f"{'=' * 80}\n\n")
        f.write(f"Input File: {file_path}\n")
        f.write(f"Anonymization Method: {anonymization_method}\n")
        f.write(f"Timestamp: {timestamp}\n")
        f.write(f"Total PII Entities Detected: {len(all_entities)}\n\n")
        f.write(f"{'=' * 80}\n\n")
        f.write(final_text)

    return final_text, all_entities, output_path


def main():
    """Main Streamlit application"""
    st.set_page_config(
        page_title="PII Obfuscation Tool",
        page_icon="🔒",
        layout="wide"
    )
    st.title("🔒 PII Obfuscation Tool")
    st.markdown("Detect and obfuscate Personally Identifiable Information (PII) in documents using Microsoft Presidio and Docling")

    # Sidebar configuration
    st.sidebar.header("Configuration")
    anonymization_method = st.sidebar.selectbox(
        "Anonymization Method",
        ["replace", "mask", "hash", "redact", "encrypt"],
        help="Choose how to obfuscate detected PII"
    )
    st.sidebar.markdown("---")
    st.sidebar.markdown("### Method Descriptions")
    st.sidebar.markdown("""
- **Replace**: Replace PII with entity type labels
- **Mask**: Mask PII with asterisks
- **Hash**: Replace PII with SHA-256 hash
- **Redact**: Remove PII completely
- **Encrypt**: Encrypt PII (reversible)
""")

    # File upload
    uploaded_file = st.file_uploader(
        "Upload a PDF document",
        type=["pdf"],
        help="Select a PDF file to process"
    )

    if uploaded_file is not None:
        # Save uploaded file temporarily
        temp_dir = "temp"
        os.makedirs(temp_dir, exist_ok=True)
        # Keep only the base name so the upload cannot escape temp_dir
        temp_file_path = os.path.join(temp_dir, Path(uploaded_file.name).name)
        with open(temp_file_path, "wb") as f:
            f.write(uploaded_file.getbuffer())
        st.success(f"File uploaded: {uploaded_file.name}")

        # Process button
        if st.button("🚀 Process Document", type="primary"):
            with st.spinner("Processing document... This may take a few moments."):
                try:
                    # Process the document
                    obfuscated_text, entities, output_path = process_document(
                        temp_file_path,
                        anonymization_method
                    )
                    # Display results
                    st.success("✅ Document processed successfully!")

                    # Statistics
                    col1, col2, col3 = st.columns(3)
                    with col1:
                        st.metric("PII Entities Detected", len(entities))
                    with col2:
                        st.metric("Anonymization Method", anonymization_method.upper())
                    with col3:
                        st.metric("Output File", Path(output_path).name)

                    # Entity breakdown
                    if entities:
                        st.subheader("📊 Detected PII Entities")
                        entity_types = {}
                        for entity in entities:
                            entity_type = entity["entity_type"]
                            entity_types[entity_type] = entity_types.get(entity_type, 0) + 1
                        entity_df = [
                            {"Entity Type": k, "Count": v}
                            for k, v in sorted(entity_types.items(), key=lambda x: x[1], reverse=True)
                        ]
                        st.table(entity_df)

                    # Display obfuscated text
                    st.subheader("📄 Obfuscated Content Preview")
                    st.text_area(
                        "Preview (first 2000 characters)",
                        obfuscated_text[:2000] + ("..." if len(obfuscated_text) > 2000 else ""),
                        height=300
                    )

                    # Download button
                    with open(output_path, 'r', encoding='utf-8') as f:
                        output_content = f.read()
                    st.download_button(
                        label="📥 Download Obfuscated Document",
                        data=output_content,
                        file_name=Path(output_path).name,
                        mime="text/plain"
                    )
                    st.info(f"💾 Full output saved to: `{output_path}`")
                except Exception as e:
                    st.error(f"❌ Error processing document: {str(e)}")
                    st.exception(e)
                finally:
                    # Clean up temp file
                    if os.path.exists(temp_file_path):
                        os.remove(temp_file_path)
    else:
        st.info("👆 Please upload a PDF document to begin")

    # Example section
    st.markdown("---")
    st.subheader("📚 Example Usage")
    st.markdown("""
1. Upload a PDF document containing text
2. Select an anonymization method from the sidebar
3. Click "Process Document" to detect and obfuscate PII
4. Review the detected entities and download the obfuscated document

**Supported PII Types:**
- Person names
- Email addresses
- Phone numbers
- Credit card numbers
- Social Security Numbers (SSN)
- Locations
- And more...
""")


if __name__ == "__main__":
    main()
# Made with Bob
Conclusion: The Power of an AI-Driven SDLC with Bob
Building a production-ready, layout-aware privacy pipeline from scratch is typically a complex engineering feat, requiring developers to carefully bridge the gap between structural document extraction and intelligent text transformation. By leveraging Bob to spearhead this project, the entire software development lifecycle was dramatically accelerated. Bob orchestrated the integration of Docling and Microsoft Presidio, translating high-level architectural ideas into an end-to-end Python application. From tackling low-level infrastructure workarounds (such as enforcing stable CPU execution paths on Apple Silicon) to generating an intuitive Streamlit web interface, Bob handled the heavy lifting of code generation and system design. The result is a fully local, privacy-first document obfuscation engine that shows how AI coding companions can turn complex engineering challenges into clean, modular, and deployable solutions.
>>> Thanks for reading <<<
Links
- Microsoft Presidio: https://github.com/microsoft/presidio
- Docling Project: https://docling-project.github.io/docling/
- IBM Bob: https://bob.ibm.com/
- What is personally identifiable information (PII)?: https://www.ibm.com/think/topics/pii






