Brian Spann

Posted on Jun 7

Detecting PII in Real-World Text

#presidio #microsoft #security #tutorial

In Part 1 we installed Presidio and ran a basic detection on clean sample text. Real data is messier. Emails have signatures with phone numbers buried in HTML. Support tickets mix PII with technical jargon. Chat logs have informal name references that NER models struggle with. And sometimes the PII isn't in text at all. It's in screenshots and scanned documents.

This part covers how Presidio's detection engine actually works under the hood, how to process different text types you'll encounter in production, and how to handle structured data and images.

How the Analyzer Engine Works

Presidio doesn't rely on a single detection method. It layers three approaches and combines their results.

Named Entity Recognition (NER)

The NER model (spaCy by default) processes the text and identifies entities based on the language model's training. It's good at catching names, locations, and organizations even when they don't follow a fixed pattern. "John Smith" is easy. "Dr. J. Martinez-Garcia" is harder but the NER model handles it because it understands context and word patterns.

The tradeoff is that NER is probabilistic. It can miss unusual names or flag common words as entities. That's why Presidio doesn't stop here.

Pattern Matching (Regex)

For entities with predictable formats, Presidio uses regex recognizers. Credit card numbers, SSNs, email addresses, IP addresses, phone numbers all have known patterns. A Luhn-validated 16-digit number is almost certainly a credit card. A string matching \d{3}-\d{2}-\d{4} in the right context is probably an SSN.
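
To make the checksum idea concrete, here is a minimal sketch of a Luhn check paired with a simple 16-digit pattern. It is plain Python for illustration only, not Presidio's actual recognizer; the names CARD_PATTERN, luhn_valid, and find_card_numbers are made up for this example.

```python
import re

# Illustrative pattern: 16 digits, optionally grouped in fours by spaces or dashes.
CARD_PATTERN = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return pattern matches that also pass the Luhn check."""
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]


print(find_card_numbers("card 5105105105105100, order 1234567890123456"))
```

Both numbers match the pattern, but only the first survives the checksum. That second step is what lets a recognizer trust a bare digit sequence.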

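Pattern-based detections can score very high when the pattern is distinctive or backed by a checksum: a Luhn-valid card number gets the top score. Weak patterns, such as a bare run of digits, start with low scores and rely on context before they can be trusted.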

Context Scoring

Here's where it gets interesting. Presidio looks at the words surrounding a potential match to boost or lower confidence. If the text says "my SSN is 123-45-6789," the phrase "my SSN is" provides strong context that the number is actually a social security number and not some random ID. The context words push the confidence score higher.

Without context scoring, a 9-digit number in the format XXX-XX-XXXX could be an SSN or a product SKU or an internal reference number. The surrounding words help Presidio decide.

Each recognizer defines its own list of context words. The SSN recognizer looks for words like "social," "security," and "ssn." The credit card recognizer looks for words like "credit," "card," "visa," "mastercard," and "amex."
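
The boosting idea can be sketched in a few lines of plain Python. This is a simplified stand-in, not Presidio's context enhancer: the base score, the boost, the five-word window, and the function name score_ssn_candidates are all illustrative assumptions.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CONTEXT_WORDS = {"social", "security", "ssn"}

BASE_SCORE = 0.5      # illustrative score for a pattern match on its own
CONTEXT_BOOST = 0.35  # illustrative bump when a context word is nearby
WINDOW = 5            # how many preceding words to inspect


def score_ssn_candidates(text: str) -> list[tuple[str, float]]:
    """Return (match, score) pairs, boosting matches preceded by a context word."""
    results = []
    for match in SSN_PATTERN.finditer(text):
        # Lowercased words that appear just before the match.
        preceding = re.findall(r"[a-z]+", text[:match.start()].lower())[-WINDOW:]
        has_context = any(word in CONTEXT_WORDS for word in preceding)
        score = min(1.0, BASE_SCORE + CONTEXT_BOOST) if has_context else BASE_SCORE
        results.append((match.group(), round(score, 2)))
    return results


print(score_ssn_candidates("my SSN is 123-45-6789"))
print(score_ssn_candidates("order ref 123-45-6789 shipped"))
```

The same nine digits score higher in the first sentence than in the second, purely because of the word "SSN" in front of them.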

Processing Different Text Types

Emails

Email bodies often contain PII in signatures, forwarded messages, and inline contact details. The challenge is separating the PII you care about from the structural noise (headers, disclaimer text, HTML tags).

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

email_body = """
From: Sarah Chen <sarah.chen@acme.com>
To: support@company.com
Subject: Account Issue

Hi, I'm having trouble with my account. My customer ID is CUS-2847391 
and the last four of my card are 4242. Please call me at (415) 555-0198 
or email me at sarah.chen@acme.com.

Thanks,
Sarah Chen
VP of Engineering, Acme Corp
Office: (415) 555-0100
Mobile: (415) 555-0198
"""

results = analyzer.analyze(text=email_body, language="en")

for result in results:
    print(f"{result.entity_type}: '{email_body[result.start:result.end].strip()}' "
          f"(score: {result.score:.2f})")

Presidio will pick up the email addresses, the phone numbers, and the person's name from both the body and the signature. Don't count on "Acme Corp" being flagged: Presidio's default configuration ignores the NER model's organization label because it produces too many false positives. The same phone number appears twice (in the body and the signature), and Presidio reports each occurrence separately with its own position.
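
The sample above is plain text, but real email bodies often arrive as HTML. One option is to strip the tags before analysis so they don't break up names and phone numbers. Here is a minimal sketch using only the standard library; html_to_text is a hypothetical helper, not part of Presidio.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of `html` with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


signature = "<p>Thanks,<br><b>Sarah Chen</b></p><p>Mobile: (415) 555-0198</p>"
print(html_to_text(signature))
```

The cleaned string can then go to analyzer.analyze as usual. Joining text nodes with spaces keeps adjacent blocks from running together, at the cost of splitting a word that is styled mid-word.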

Support Tickets

Support tickets mix PII with technical content. Users paste error messages, stack traces, and config snippets alongside their personal details.

ticket = """
User report from jane.doe@company.com:

I'm getting error 500 when trying to update my billing info. 
My account number is 7829-4451-2290 and I'm using the card 
ending in 8847. The error started after I changed my address 
to 1234 Oak Street, Portland, OR 97201.

Stack trace:
java.lang.NullPointerException at com.billing.PaymentService.update(PaymentService.java:142)
"""

results = analyzer.analyze(text=ticket, language="en")

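Presidio handles part of this out of the box. The regex-based email recognizer catches jane.doe@company.com, and the NER model will usually tag "Portland" as a location. The account number, the ZIP code, and the full street address have no predefined recognizer, so they need custom recognizers (covered in Part 3). The stack trace is unlikely to trigger false positives because Java class names and file paths don't resemble the built-in PII patterns, though it's worth checking the results on your own traces.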

Chat Logs

Chat logs are the hardest text type for PII detection. Messages are short, informal, and full of abbreviations. Names appear without context. Phone numbers get typed without dashes.

chat_log = """
[10:42] mike_t: hey can someone help with my acct? 
[10:42] mike_t: email is m.thompson@gmail.com
[10:43] support_bot: Sure Mike! What's the issue?
[10:44] mike_t: charge on my visa ending 4242 wasnt mine
[10:44] mike_t: my number is 5105105105105100
[10:45] support_bot: I'll look into that. Can you confirm your DOB?
[10:45] mike_t: march 15 1990
"""

results = analyzer.analyze(text=chat_log, language="en")

The card number has no dashes or spaces, but Presidio's credit card recognizer matches plain digit sequences and validates them with the Luhn checksum, so it still flags it. The date of birth is trickier. Presidio can report "march 15 1990" as a DATE_TIME entity if the NER model recognizes the lowercase form, but it has no separate date-of-birth entity. Treating that date as a DOB means using the surrounding text, here the bot's "confirm your DOB," in your own logic or in a custom recognizer.
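
Timestamps and usernames add noise of their own, so it can help to split a log into messages before analysis. Here is a minimal sketch that assumes every line follows the "[HH:MM] user: message" shape of the sample above; CHAT_LINE and split_chat_log are hypothetical names.

```python
import re

# Matches lines shaped like "[10:42] mike_t: message text".
CHAT_LINE = re.compile(
    r"^\[(?P<time>\d{1,2}:\d{2})\]\s+(?P<user>[^:]+):\s*(?P<message>.*)$"
)


def split_chat_log(log: str) -> list[dict]:
    """Parse a chat log into dicts with 'time', 'user', and 'message' keys."""
    messages = []
    for line in log.splitlines():
        match = CHAT_LINE.match(line.strip())
        if match:  # Skip blank or malformed lines.
            messages.append(match.groupdict())
    return messages


sample = """
[10:42] mike_t: email is m.thompson@gmail.com
[10:43] support_bot: Sure Mike! What's the issue?
"""
for entry in split_chat_log(sample):
    print(entry["user"], "->", entry["message"])
```

Each entry["message"] can be analyzed on its own, which keeps every detection tied to a speaker. The tradeoff is that context spanning messages, like the bot asking for a DOB and the user answering on the next line, is no longer in the same string.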

Confidence Scores and Thresholds

Every result comes with a confidence score between 0 and 1. By default, Presidio returns everything above 0. In production you'll want to set thresholds.

text = "Please call me at (415) 555-0198 or email sarah.chen@acme.com."

# Only return high-confidence detections
results = analyzer.analyze(
    text=text,
    language="en",
    score_threshold=0.7
)

# Or filter after the fact for more control
high_confidence = [r for r in results if r.score >= 0.7]
medium_confidence = [r for r in results if 0.4 <= r.score < 0.7]
low_confidence = [r for r in results if r.score < 0.4]

A practical approach: use a high threshold (0.7 or above) for automated anonymization where false positives are costly, and a lower threshold (0.3-0.5) for audit/review workflows where a human checks the flagged items.
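
That split can be written as a small routing step. The sketch below uses a namedtuple as a stand-in for Presidio's result objects so it runs without Presidio installed; Detection, route_detections, and the bucket names are illustrative, and the cutoffs reuse the 0.7 and 0.4 values from the filter above.

```python
from collections import namedtuple

# Stand-in for an analyzer result; only the fields this sketch needs.
Detection = namedtuple("Detection", ["entity_type", "start", "end", "score"])

AUTO_THRESHOLD = 0.7    # at or above: anonymize automatically
REVIEW_THRESHOLD = 0.4  # at or above (but below AUTO): send to a human


def route_detections(detections):
    """Sort detections into 'anonymize', 'review', and 'ignore' buckets by score."""
    routed = {"anonymize": [], "review": [], "ignore": []}
    for detection in detections:
        if detection.score >= AUTO_THRESHOLD:
            routed["anonymize"].append(detection)
        elif detection.score >= REVIEW_THRESHOLD:
            routed["review"].append(detection)
        else:
            routed["ignore"].append(detection)
    return routed


sample = [
    Detection("EMAIL_ADDRESS", 10, 30, 1.0),
    Detection("PERSON", 40, 50, 0.55),
    Detection("US_BANK_NUMBER", 60, 70, 0.05),
]
for bucket, items in route_detections(sample).items():
    print(bucket, [d.entity_type for d in items])
```

Because the function only reads the score attribute, the same routing should work on the objects returned by analyzer.analyze.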

Batch Processing with presidio-structured

When your PII lives in CSVs, DataFrames, or JSON files, processing text column by column is tedious. The presidio-structured package handles this.

pip install presidio-structured

import pandas as pd
from presidio_structured import (
    StructuredEngine,
    PandasAnalysisBuilder,
    PandasDataProcessor,
)

# Sample DataFrame
df = pd.DataFrame({
    "customer_name": ["John Smith", "Jane Doe", "Bob Wilson"],
    "email": ["john@example.com", "jane@example.com", "bob@example.com"],
    "notes": [
        "Called about SSN 123-45-6789",
        "Address: 456 Elm St, Portland OR",
        "Card ending 4242, refund requested"
    ]
})

# Set up the structured engine with a pandas data processor
structured_engine = StructuredEngine(data_processor=PandasDataProcessor())

# Build the analysis: maps each column to the entity type detected in it
tabular_analysis = PandasAnalysisBuilder().generate_analysis(df)

# Anonymize the DataFrame using that mapping
anonymized_df = structured_engine.anonymize(df, tabular_analysis)
print(anonymized_df)

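The analysis builder scans each column with the same analyzer and maps the column to the entity type it finds there. The structured engine then anonymizes the values in the mapped columns. Because the mapping is per column, a free-text column like notes is handled as a single entity type rather than span by span. You can supply your own column-to-entity mapping to control which columns are processed, and pass an operators dictionary to apply different anonymization operators per entity type.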

Image Redaction with presidio-image-redactor

Sometimes PII isn't in text at all. It's in screenshots of forms, scanned documents, or photos of ID cards. Presidio's image redactor handles this by running OCR (via Tesseract) to extract text from images, detecting PII in the extracted text, and then drawing colored boxes over the PII regions in the original image.

# Install the image redactor
pip install presidio-image-redactor

# Make sure Tesseract is installed
# Mac: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr

from presidio_image_redactor import ImageRedactorEngine
from PIL import Image

# Load an image
image = Image.open("support_screenshot.png")

# Initialize the redactor
redactor = ImageRedactorEngine()

# Redact PII from the image
redacted_image = redactor.redact(image, fill=(0, 0, 0))

# Save the result
redacted_image.save("support_screenshot_redacted.png")

The fill parameter sets the color of the redaction boxes. Black (0, 0, 0) is the default, and the same color is used for every box. You can also limit redaction to specific entity types by passing entities, which is forwarded to the analyzer:

from presidio_image_redactor import ImageRedactorEngine
from PIL import Image

image = Image.open("support_screenshot.png")
redactor = ImageRedactorEngine()

# Redact only the listed entity types
redacted = redactor.redact(
    image,
    fill=(0, 0, 0),       # Box color: black
    entities=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"]
)

Image redaction accuracy depends heavily on the OCR quality. Clean screenshots with standard fonts work well. Handwritten text, low-resolution scans, and images with complex backgrounds will produce lower accuracy. For those cases, you may want to preprocess the image (deskew, enhance contrast) before sending it to the redactor.
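
As a starting point for that preprocessing, here is a minimal Pillow sketch that converts to grayscale, stretches the contrast, and upscales the image. It does not deskew, the 2x scale factor is an arbitrary assumption, and preprocess_for_ocr is a hypothetical helper rather than part of Presidio.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(image: Image.Image, scale: int = 2) -> Image.Image:
    """Return a grayscale, contrast-stretched, upscaled copy of `image` in RGB mode."""
    gray = ImageOps.grayscale(image)
    contrasted = ImageOps.autocontrast(gray)
    new_size = (contrasted.width * scale, contrasted.height * scale)
    enlarged = contrasted.resize(new_size, Image.LANCZOS)
    # Convert back to RGB so the result can be used like the original image.
    return enlarged.convert("RGB")


# Demo on a synthetic image; in practice, load a scan with Image.open(...).
demo = Image.new("RGB", (100, 50), (200, 200, 200))
prepared = preprocess_for_ocr(demo)
print(prepared.size, prepared.mode)
```

The prepared image can be passed to redactor.redact in place of the original. Note that the redacted output will then be the enlarged grayscale version, not the original scan.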

What's Next

Now you understand how Presidio's detection layers work together and how to process the text types you'll actually encounter. In Part 3, we'll build custom recognizers: deny-list recognizers for company-specific terms, regex recognizers for internal ID formats, rule-based recognizers with context enhancement, and no-code recognizers via YAML configuration.

This is Part 2 of the Hands-On Microsoft Presidio series. I write about PII detection, AI infrastructure, and building with Claude Code on Dev.to.
