Brian Spann

Posted on Jun 10

Anonymization Strategies

#presidio #microsoft #security #tutorial

Detection tells you where the PII is. Anonymization decides what to do about it. Presidio's anonymizer ships with five built-in operators, each suited to different compliance requirements and use cases. Choosing the wrong one means either destroying data you needed to recover or leaving sensitive information exposed in ways you didn't intend.

This part covers every anonymization operator, when to use each one, how to build pseudonymization with consistent name mappings, and how to process PII in PDFs.

The Five Built-In Operators

Replace

Replaces the detected entity with a specified value. This is the default operator.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "John Smith called from 206-555-0147 about his account."

results = analyzer.analyze(text=text, language="en")

# Replace with custom labels (without new_value, replace falls back to <ENTITY_TYPE>)
anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("replace", {"new_value": "[REDACTED NAME]"}),
        "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "[REDACTED PHONE]"})
    }
)

print(anonymized.text)
# Output: [REDACTED NAME] called from [REDACTED PHONE] about his account.

Use replace when you want the output to be human-readable and when the original values don't need to be recovered. Good for sharing anonymized datasets with external teams, displaying sanitized text in dashboards, and audit logs where the PII type matters but the value doesn't.

Redact

Removes the entity entirely, leaving no placeholder behind.

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("redact"),
        "PHONE_NUMBER": OperatorConfig("redact")
    }
)

print(anonymized.text)
# Output:  called from  about his account.

Redaction changes the text structure and can make sentences unreadable. It's appropriate for internal audit logs where readability isn't a priority, strict compliance scenarios where no trace of PII should remain, and automated pipelines where the text isn't shown to humans.

Mask

Replaces each character with a masking character, preserving the length of the original value.

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("mask", {
            "masking_char": "*",
            "chars_to_mask": 100,  # Mask all characters
            "from_end": False
        }),
        "PHONE_NUMBER": OperatorConfig("mask", {
            "masking_char": "#",
            "chars_to_mask": 8,    # Mask first 8 chars
            "from_end": False
        })
    }
)

print(anonymized.text)
# Output: ********** called from ########0147 about his account.

Masking is useful when you need to preserve the length or partial value. Think credit card receipts showing the last four digits, or support screens where agents need to confirm partial identifiers.
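
To make the parameters concrete, here is a minimal pure-Python sketch of the masking rule. The helper mask_value is hypothetical and only approximates what the mask operator does with masking_char, chars_to_mask, and from_end; it is not Presidio's implementation.

```python
def mask_value(value: str, masking_char: str, chars_to_mask: int, from_end: bool) -> str:
    """Mask up to chars_to_mask characters, counted from the start or the end."""
    # Never mask more characters than the value contains
    n = max(0, min(chars_to_mask, len(value)))
    if from_end:
        return value[: len(value) - n] + masking_char * n
    return masking_char * n + value[n:]


print(mask_value("John Smith", "*", 100, False))       # **********
print(mask_value("206-555-0147", "#", 8, False))       # ########0147
print(mask_value("4111111111111111", "*", 12, False))  # ************1111
print(mask_value("4111111111111111", "*", 12, True))   # 4111************
```

Note the direction: to keep the last four digits visible you mask from the start (from_end set to False), not from the end.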

Hash

Replaces the entity with a one-way hash. The same input always produces the same hash, which makes it useful for analytics without exposing raw PII.

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("hash", {"hash_type": "sha256"}),
        "PHONE_NUMBER": OperatorConfig("hash", {"hash_type": "sha256"})
    }
)

print(anonymized.text)
# Output: <64-character hex digest> called from <64-character hex digest> about his account.

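Hash supports sha256 (default) and sha512. Hashing is one-way: you can't compute the original value from the hash. Low-entropy values such as phone numbers can still be recovered by hashing every candidate and comparing, so treat hashed PII as pseudonymized, not anonymous. What you can do safely is compare hashes to determine if two records refer to the same person without knowing who that person is. Good for analytics pipelines, deduplication, and cross-referencing anonymized datasets.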
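
The record-matching idea can be sketched with the standard library alone. This assumes a plain, unsalted SHA-256 over the UTF-8 bytes of the value; the helper name and sample records are made up for illustration, and the digests will not necessarily match what Presidio's hash operator emits.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the hex SHA-256 digest of a string (assumption: UTF-8, no salt)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Two hypothetical datasets that store only hashed customer names
crm_records = [{"customer": sha256_hex("John Smith"), "plan": "pro"}]
support_tickets = [
    {"customer": sha256_hex("John Smith"), "ticket": 101},
    {"customer": sha256_hex("Jane Doe"), "ticket": 102},
]

# Join on the digest: equal inputs give equal hashes
known_customers = {record["customer"] for record in crm_records}
matches = [t["ticket"] for t in support_tickets if t["customer"] in known_customers]
print(matches)  # [101]
```

Only digests are compared, so neither dataset needs to hold the raw name.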

Encrypt

Replaces the entity with an encrypted value that can be decrypted later with the right key.

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "DEFAULT": OperatorConfig("encrypt", {"key": "WmZq4t7w!z%C*F-J"})
    }
)

print(anonymized.text)
# Entities replaced with base64-encoded encrypted strings

Encrypt is the only reversible operator. You can deanonymize later:

from presidio_anonymizer import DeanonymizeEngine
from presidio_anonymizer.entities import OperatorConfig

deanonymizer = DeanonymizeEngine()

deanonymized = deanonymizer.deanonymize(
    text=anonymized.text,
    entities=anonymized.items,
    operators={
        "DEFAULT": OperatorConfig("decrypt", {"key": "WmZq4t7w!z%C*F-J"})
    }
)

print(deanonymized.text)
# Output: John Smith called from 206-555-0147 about his account.

Use encrypt/decrypt for the PII proxy pattern (scrub before sending to LLM, decrypt after). We'll build that exact pipeline in Part 5.
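
The examples above hard-code the key for readability. Here is a minimal sketch of loading it from the environment instead. It assumes a hypothetical variable name, PRESIDIO_AES_KEY, and relies on the encrypt operator's requirement that the key be 128, 192, or 256 bits long.

```python
import os

VALID_KEY_LENGTHS = (16, 24, 32)  # bytes: AES-128, AES-192, AES-256


def load_encryption_key(env_var: str = "PRESIDIO_AES_KEY") -> str:
    """Read the AES key from the environment and check its length up front."""
    key = os.environ.get(env_var)
    if key is None:
        raise RuntimeError(f"{env_var} is not set")
    if len(key.encode("utf-8")) not in VALID_KEY_LENGTHS:
        raise ValueError(f"{env_var} must be 16, 24, or 32 bytes long")
    return key
```

The returned value can be passed wherever the literal key appears, for example OperatorConfig("encrypt", {"key": load_encryption_key()}), and the same call supplies the key for decrypt.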

Mixing Operators Per Entity Type

In practice you'll want different strategies for different entity types in the same document.

operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "<PERSON>"}),
    "EMAIL_ADDRESS": OperatorConfig("hash", {"hash_type": "sha256"}),
    "PHONE_NUMBER": OperatorConfig("mask", {
        "masking_char": "*",
        "chars_to_mask": 8,
        "from_end": False
    }),
    "CREDIT_CARD": OperatorConfig("encrypt", {"key": "WmZq4t7w!z%C*F-J"}),
    "US_SSN": OperatorConfig("redact"),
    "DEFAULT": OperatorConfig("replace", {"new_value": "<PII>"})
}

The DEFAULT operator catches any entity type that doesn't have a specific operator assigned. Always set a default so nothing slips through unhandled.
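
The fallback behaves like a dictionary lookup with a default. A rough sketch of that resolution logic, using operator names as plain strings in place of OperatorConfig objects (resolve_operator is a hypothetical helper, not part of Presidio):

```python
def resolve_operator(entity_type: str, operators: dict[str, str]) -> str:
    """Pick the operator for an entity type, falling back to DEFAULT."""
    if entity_type in operators:
        return operators[entity_type]
    if "DEFAULT" in operators:
        return operators["DEFAULT"]
    # Assumption: with no DEFAULT configured, the built-in default (replace) applies
    return "replace"


operators = {"EMAIL_ADDRESS": "hash", "US_SSN": "redact", "DEFAULT": "replace"}

for entity_type in ["EMAIL_ADDRESS", "US_SSN", "IBAN_CODE"]:
    print(entity_type, "->", resolve_operator(entity_type, operators))
# EMAIL_ADDRESS -> hash
# US_SSN -> redact
# IBAN_CODE -> replace
```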

Pseudonymization with Consistent Mappings

Standard replacement uses one generic placeholder per entity type. If "John Smith" and "Jane Doe" both appear in a document, every occurrence of either name becomes the same <PERSON> label, so you can no longer tell them apart. That's fine for redaction but breaks any analysis that needs to track individuals across records.

Pseudonymization maps each unique value to a consistent fake value. "John Smith" always becomes "Robert Chen." "Jane Doe" always becomes "Maria Santos." The mapping is consistent within a dataset but the original values are unrecoverable without the mapping table.

from faker import Faker

fake = Faker()
Faker.seed(42)  # Reproducible fake data

# Maintain a mapping for consistency
pii_mapping = {}

def get_consistent_replacement(original, entity_type):
    key = f"{entity_type}:{original}"
    if key not in pii_mapping:
        if entity_type == "PERSON":
            pii_mapping[key] = fake.name()
        elif entity_type == "EMAIL_ADDRESS":
            pii_mapping[key] = fake.email()
        elif entity_type == "PHONE_NUMBER":
            pii_mapping[key] = fake.phone_number()
        elif entity_type == "LOCATION":
            pii_mapping[key] = fake.city()
        else:
            pii_mapping[key] = f"[{entity_type}_{len(pii_mapping)}]"
    return pii_mapping[key]

To integrate this with Presidio, you can build a custom operator or post-process the results:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = """John Smith emailed john@example.com about the project.
Later, John Smith called to follow up. His colleague Jane Doe 
also reached out from jane@example.com."""

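results = analyzer.analyze(text=text, language="en")

# Keep only non-overlapping results, preferring higher scores and longer spans.
# The analyzer can return overlapping entities (for example, a URL inside an
# email address), and replacing both would corrupt the text.
selected = []
for result in sorted(results, key=lambda r: (-r.score, -(r.end - r.start))):
    if all(result.end <= kept.start or result.start >= kept.end for kept in selected):
        selected.append(result)

# Replace from end to start so earlier offsets stay valid
pseudonymized = text
for result in sorted(selected, key=lambda r: r.start, reverse=True):
    original = text[result.start:result.end]
    replacement = get_consistent_replacement(original, result.entity_type)
    pseudonymized = pseudonymized[:result.start] + replacement + pseudonymized[result.end:]

print(pseudonymized)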

Both occurrences of "John Smith" map to the same fake name, and each email address maps to its own fake email. The relationships in the data are preserved without exposing the real identities.
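
The mapping only stays consistent across runs if it is stored, and anyone holding it can reverse the pseudonyms. A minimal sketch of both points, assuming the same "ENTITY_TYPE:original" key format as pii_mapping; the helper functions and file path are hypothetical.

```python
import json
from pathlib import Path


def save_mapping(mapping: dict[str, str], path: Path) -> None:
    """Persist the pseudonym table so later runs reuse the same fake values."""
    path.write_text(json.dumps(mapping, indent=2), encoding="utf-8")


def load_mapping(path: Path) -> dict[str, str]:
    """Load a saved table, or start empty if none exists yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def reidentify(text: str, mapping: dict[str, str]) -> str:
    """Swap pseudonyms back to originals using the mapping table."""
    # Longest pseudonyms first so a short one never clobbers part of a longer one
    for key, fake in sorted(mapping.items(), key=lambda kv: len(kv[1]), reverse=True):
        original = key.split(":", 1)[1]
        text = text.replace(fake, original)
    return text
```

Since reidentify needs nothing but the table, the saved file deserves the same protection as the raw PII.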

Reversible vs. Irreversible: When to Use Which

Irreversible (replace, redact, mask, hash): Use when the original values should never be recoverable. Compliance with GDPR right-to-erasure, publishing anonymized datasets, any scenario where re-identification is a risk.

Reversible (encrypt): Use when you need the original values back later. The PII proxy pattern (anonymize before LLM, deanonymize after), temporary anonymization for testing, workflows where an authorized user needs to see the real data.

The key question: does anyone, ever, need to get the original PII back? If yes, encrypt. If no, use one of the irreversible operators. Don't hash when you need reversibility (common mistake). Don't encrypt when you need true anonymization (the key becomes a liability).

Processing PDFs

Presidio doesn't process PDFs natively, but you can extract the text from each page, detect PII in it, and annotate the original PDF with redaction boxes.

import fitz  # PyMuPDF
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

# Open the PDF
doc = fitz.open("customer_report.pdf")

for page in doc:
    text = page.get_text()

    # Detect PII
    results = analyzer.analyze(text=text, language="en")

    for result in results:
        # Find the text location on the page
        pii_text = text[result.start:result.end]
        instances = page.search_for(pii_text)

        # Draw redaction boxes
        for inst in instances:
            page.add_redact_annot(inst, fill=(0, 0, 0))

    # Apply all redactions on this page
    page.apply_redactions()

# Save the redacted PDF
doc.save("customer_report_redacted.pdf")
doc.close()

This approach searches for each detected PII string on the PDF page and draws a black box over it. The apply_redactions() call permanently removes the underlying text, so the PII is gone from the file, not just covered up visually.
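
Because the loop searches the page for each detected string, a repeated value triggers redundant searches and a very short one can black out unrelated text. Below is a small pre-filter sketch. The min_length threshold and function name are assumptions, spans are plain (start, end) tuples standing in for analyzer results, and duplicates are compared case-insensitively on the assumption that the page search ignores case.

```python
def redaction_terms(
    text: str, spans: list[tuple[int, int]], min_length: int = 3
) -> tuple[list[str], list[str]]:
    """Split detected strings into unique search terms and too-short ones."""
    seen = set()
    terms, skipped = [], []
    for start, end in spans:
        term = text[start:end].strip()
        if not term or term.lower() in seen:
            continue
        seen.add(term.lower())
        # Short strings are set aside, not searched, to avoid over-redaction
        (terms if len(term) >= min_length else skipped).append(term)
    return terms, skipped


page_text = "Al met John Smith. john smith called Al."
spans = [(0, 2), (7, 17), (19, 29), (37, 39)]
print(redaction_terms(page_text, spans))  # (['John Smith'], ['Al'])
```

Anything in the skipped list still contains PII, so it needs another route to redaction, such as manual review, not silent omission.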

What's Next

You now have the full anonymization toolkit. In Part 5, we'll put it all together as an LLM guardrail: building a PII proxy that intercepts prompts, scrubs PII with encrypt, forwards the clean prompt to the model, and deanonymizes the response. We'll also cover LiteLLM integration, deployment on Azure, and production hardening.

This is Part 4 of the Hands-On Microsoft Presidio series. I write about PII detection, AI infrastructure, and building with Claude Code on Dev.to.
