Building Custom Recognizers

Presidio's built-in recognizers cover the common PII types: names, emails, phone numbers, credit cards, SSNs. But every organization has PII that's specific to their business. Internal employee IDs that follow a custom format. Project codenames that shouldn't leak externally. Customer account numbers that don't match any standard pattern. Medical record numbers, policy IDs, internal ticket references. The built-in recognizers don't know about these.

This part covers four ways to build custom recognizers, from the simplest (a list of words to flag) to the most sophisticated (connecting an external NLP service).

Deny-List Recognizers

The fastest way to add a custom recognizer is a deny list. You give Presidio a list of words or phrases and it flags any exact match as a specific entity type.

Use case: your company has internal project codenames (like "Project Titan," "Sapphire," "Nightingale") that are confidential and should never appear in data sent to external services.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer

# Create a deny-list recognizer
project_recognizer = PatternRecognizer(
    supported_entity="INTERNAL_PROJECT",
    deny_list=["Titan", "Sapphire", "Nightingale", "Ironclad", "Meridian"],
    deny_list_score=1.0
)

# Add it to the analyzer
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(project_recognizer)

# Test it
text = "The Titan rollout is scheduled for Q3. Contact sarah@company.com for details."
results = analyzer.analyze(text=text, language="en")

for r in results:
    print(f"{r.entity_type}: '{text[r.start:r.end]}' (score: {r.score:.2f})")

Output:

INTERNAL_PROJECT: 'Titan' (score: 1.00)
EMAIL_ADDRESS: 'sarah@company.com' (score: 1.00)

The deny_list_score parameter sets the confidence level for matches. Set it to 1.0 if the deny list is curated and every match is definitely PII. Lower it if some terms might appear in non-sensitive contexts.

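Deny lists are case-insensitive by default because Presidio's default regex flags include re.IGNORECASE. "titan," "TITAN," and "Titan" all match. Matching is on whole words, so "Titanic" is not flagged. The same default flags apply to the regex recognizers in the next section.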
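
To make the matching behavior concrete, here is a minimal standard-library sketch of how a deny list can be turned into one case-insensitive, whole-word regex. It illustrates the idea only and is not Presidio's internal code; build_deny_list_regex is a hypothetical helper.

```python
import re

def build_deny_list_regex(terms):
    """Compile deny-list terms into one case-insensitive, whole-word regex."""
    # Escape each term so characters like "." or "+" are matched literally
    alternatives = "|".join(re.escape(t) for t in terms)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

deny_regex = build_deny_list_regex(["Titan", "Sapphire", "Nightingale"])

for sample in ["The TITAN rollout", "sapphire budget review", "Titanic screening"]:
    hits = [m.group(0) for m in deny_regex.finditer(sample)]
    print(f"{sample!r} -> {hits}")
```

The first two samples produce a hit regardless of casing, while "Titanic" does not, because the word boundary after "Titan" fails.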

Regex Recognizers

When your PII follows a pattern but the built-in recognizers don't cover it, write a regex recognizer.

Use case: your company uses employee IDs in the format EMP-XXXXX (EMP- followed by 5 digits) and customer account numbers in the format ACC-XXXX-XXXX.

from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# Employee ID recognizer
emp_id_pattern = Pattern(
    name="employee_id_pattern",
    regex=r"\bEMP-\d{5}\b",
    score=0.9
)

emp_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[emp_id_pattern],
    name="EmployeeIdRecognizer"
)

# Customer account recognizer
account_pattern = Pattern(
    name="account_number_pattern",
    regex=r"\bACC-\d{4}-\d{4}\b",
    score=0.9
)

account_recognizer = PatternRecognizer(
    supported_entity="CUSTOMER_ACCOUNT",
    patterns=[account_pattern],
    name="CustomerAccountRecognizer"
)

# Register both
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(emp_recognizer)
analyzer.registry.add_recognizer(account_recognizer)

text = "Employee EMP-28471 processed refund for account ACC-9921-0047."
results = analyzer.analyze(text=text, language="en")

for r in results:
    print(f"{r.entity_type}: '{text[r.start:r.end]}' (score: {r.score:.2f})")

Output:

EMPLOYEE_ID: 'EMP-28471' (score: 0.90)
CUSTOMER_ACCOUNT: 'ACC-9921-0047' (score: 0.90)

The score in the Pattern object sets the base confidence. You can define multiple patterns for the same entity type if the format varies (some systems might use EMP-XXXXX and others use E-XXXXXXX).
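
Before wrapping several formats in Pattern objects, it can help to sanity-check the regexes with plain re. The sketch below assumes the two formats mentioned above; the E-XXXXXXX regex and the matching_formats helper are illustrative, not taken from a real system.

```python
import re

# EMP-XXXXX comes from the example above; the E-XXXXXXX variant is an
# assumed legacy format used only for illustration
EMPLOYEE_ID_REGEXES = {
    "emp_prefix": r"\bEMP-\d{5}\b",
    "e_prefix": r"\bE-\d{7}\b",
}

def matching_formats(text):
    """Return {pattern_name: [matches]} for every pattern that hits."""
    hits = {}
    for name, regex in EMPLOYEE_ID_REGEXES.items():
        found = re.findall(regex, text)
        if found:
            hits[name] = found
    return hits

print(matching_formats("EMP-28471 replaced legacy ID E-0012345"))
print(matching_formats("EMP-284711 has too many digits"))
```

Each regex that behaves as expected would then become its own Pattern in the patterns list of a single EMPLOYEE_ID recognizer.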

Context Enhancement

Regex patterns alone can produce false positives. A pattern like \d{5} matches any 5-digit number, not just employee IDs. Context words help Presidio distinguish between a zip code and an employee number.

from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# A medical record number recognizer with context
mrn_pattern = Pattern(
    name="mrn_pattern",
    regex=r"\b\d{7,10}\b",
    score=0.3  # Low base score because 7-10 digit numbers are common
)

mrn_recognizer = PatternRecognizer(
    supported_entity="MEDICAL_RECORD",
    patterns=[mrn_pattern],
    # Single-word context terms: the default context enhancer compares each
    # term against the lemmas of individual tokens near the match
    context=["medical", "record", "mrn", "patient", "chart", "health"],
    name="MedicalRecordRecognizer"
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(mrn_recognizer)

# With context: higher confidence
text1 = "Patient medical record number: 4829173"
results1 = analyzer.analyze(text=text1, language="en")
# Score boosted because context words ("patient", "medical", "record") precede the match

# Without context: low confidence (might be filtered by threshold)
text2 = "Order 4829173 shipped on Tuesday"
results2 = analyzer.analyze(text=text2, language="en")
# Score stays at base 0.3 because no context words present

The pattern starts with a low base score (0.3). When context words appear within a configurable window around the match, Presidio boosts the score. When they don't, the score stays at 0.3, so any score_threshold above that value filters the match out.

This is the right approach for any pattern that's too generic on its own. Set a low base score, provide strong context words, and let the context scoring do the disambiguation.
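
The arithmetic behind the boost can be sketched in a few lines. The numbers below (a boost of 0.35 and a floor of 0.4 when context is found) are assumptions based on what I understand to be the defaults of Presidio's context enhancer, so check them against your installed version; score_with_context is a hypothetical helper, not a Presidio function.

```python
def score_with_context(base_score, context_found,
                       similarity_factor=0.35, min_score_with_context=0.4):
    """Approximate the score after context enhancement (assumed defaults)."""
    if not context_found:
        return base_score
    # Add the boost, enforce the floor, and cap at the maximum score of 1.0
    boosted = max(base_score + similarity_factor, min_score_with_context)
    return min(boosted, 1.0)

print(round(score_with_context(0.3, context_found=True), 2))   # context present
print(round(score_with_context(0.3, context_found=False), 2))  # no context
```

Under these assumptions a base score of 0.3 becomes roughly 0.65 with context and stays at 0.3 without it, so a threshold of 0.5 cleanly separates the two cases.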

No-Code Recognizers via YAML

For teams that want to manage recognizers without touching Python code, Presidio supports YAML-based configuration. You define recognizers in a YAML file and load them at startup.

# custom_recognizers.yaml
recognizers:
  - name: "Project Code Recognizer"
    supported_language: "en"
    supported_entity: "INTERNAL_PROJECT"
    deny_list:
      - "Titan"
      - "Sapphire"
      - "Nightingale"
      - "Ironclad"
    deny_list_score: 1.0

  - name: "Employee ID Recognizer"
    supported_language: "en"
    supported_entity: "EMPLOYEE_ID"
    patterns:
      - name: "emp_id"
        regex: "\\bEMP-\\d{5}\\b"
        score: 0.9
    context:
      - "employee"
      - "emp"
      - "staff"
      - "worker"

  - name: "Policy Number Recognizer"
    supported_language: "en"
    supported_entity: "POLICY_NUMBER"
    patterns:
      - name: "policy_format"
        regex: "\\bPOL-[A-Z]{2}-\\d{6}\\b"
        score: 0.95
    context:
      - "policy"
      - "insurance"
      - "coverage"
      - "claim"

Load them into the analyzer:

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry

# Start from the built-in recognizers, then add the custom ones from YAML
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizers_from_yaml("custom_recognizers.yaml")

analyzer = AnalyzerEngine(registry=registry)

The YAML approach is useful when non-developers (security teams, compliance officers) need to update the recognizer list. They edit a YAML file, and the service picks up the new configuration on its next restart. No code changes, no deployments.
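
Because the file is edited by hand, a small pre-flight check can catch mistakes before the service restarts. This is a rough sketch under my own assumptions about which fields matter; validate_recognizers is hypothetical, and the config is inlined as a dict here, where in practice it would come from parsing the YAML file (for example with yaml.safe_load).

```python
import re

def validate_recognizers(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    for i, rec in enumerate(config.get("recognizers") or []):
        label = rec.get("name", f"recognizer #{i}")
        if "supported_entity" not in rec:
            problems.append(f"{label}: missing supported_entity")
        if not rec.get("patterns") and not rec.get("deny_list"):
            problems.append(f"{label}: needs patterns or deny_list")
        for pattern in rec.get("patterns") or []:
            try:
                re.compile(pattern["regex"])  # catches typos such as unbalanced parentheses
            except (re.error, KeyError) as exc:
                problems.append(f"{label}: bad pattern ({exc})")
            if not 0.0 <= pattern.get("score", 0.0) <= 1.0:
                problems.append(f"{label}: score must be between 0 and 1")
    return problems

config = {
    "recognizers": [
        {"name": "Employee ID Recognizer", "supported_entity": "EMPLOYEE_ID",
         "patterns": [{"name": "emp_id", "regex": r"\bEMP-\d{5}\b", "score": 0.9}]},
        {"name": "Broken Recognizer", "supported_entity": "BROKEN",
         "patterns": [{"name": "oops", "regex": r"\bEMP-(\d{5}", "score": 0.9}]},
    ]
}

for problem in validate_recognizers(config):
    print(problem)
```

Running a check like this in CI or as a startup step keeps a typo in one regex from taking down detection for every entity type.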

Connecting External Services

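For cases where regex and the default spaCy NER aren't enough, Presidio lets you plug in other models. Remote recognizers call external NLP services (Azure AI Language is the most common integration), and the NLP engine itself can be swapped for a transformer model. The example below shows the second option, using a model from Hugging Face.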

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Use a Hugging Face transformer model for NER; spaCy still handles tokenization and lemmas
nlp_config = {
    "nlp_engine_name": "transformers",
    "models": [
        {
            "lang_code": "en",
            "model_name": {
                "spacy": "en_core_web_sm",
                "transformers": "dslim/bert-base-NER"
            }
        }
    ]
}

nlp_engine = NlpEngineProvider(nlp_configuration=nlp_config).create_engine()
analyzer = AnalyzerEngine(nlp_engine=nlp_engine)

A transformer-based NER model (dslim/bert-base-NER or similar) often outperforms spaCy's models on names and locations, especially for unusual name formats. Note that dslim/bert-base-NER is trained on English data, so choose a multilingual model if you need non-English text. The tradeoff is speed: transformer models are slower than spaCy, so measure latency against your requirements before switching.
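
One simple way to compare engines is to time the same inputs against each of them. The helper below is a minimal sketch using only the standard library; measure_latency_ms is hypothetical, and fake_analyze stands in for a call such as lambda t: analyzer.analyze(text=t, language="en").

```python
import statistics
import time

def measure_latency_ms(fn, samples, warmup=2):
    """Call fn on each sample and return (median_ms, max_ms)."""
    for text in samples[:warmup]:
        fn(text)  # warm-up calls so one-time model loading does not skew timings
    timings = []
    for text in samples:
        start = time.perf_counter()
        fn(text)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings), max(timings)

# Stand-in for the real analyzer call, so the sketch runs on its own
def fake_analyze(text):
    return text.upper()

median_ms, worst_ms = measure_latency_ms(fake_analyze, ["Contact sarah@company.com"] * 20)
print(f"median={median_ms:.3f} ms, worst={worst_ms:.3f} ms")
```

Run it once with the spaCy-backed analyzer and once with the transformer-backed one, on texts that resemble your production traffic, and compare the medians.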

Testing Your Recognizers

Before deploying custom recognizers, test them against labeled data.

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
# (add your custom recognizers)

# Test cases: (input_text, entity_type, expected_value)
# expected_value=None means the entity type must NOT be detected
test_cases = [
    ("Employee EMP-12345 submitted the report", "EMPLOYEE_ID", "EMP-12345"),
    ("Contact acc-9921-0047 about the refund", "CUSTOMER_ACCOUNT", "ACC-9921-0047"),
    ("Project Titan launch is next month", "INTERNAL_PROJECT", "Titan"),
    ("The titan submarine was discovered", "INTERNAL_PROJECT", "titan"),  # Should this match?
    ("Order number 12345 shipped", "EMPLOYEE_ID", None),  # Should NOT match EMPLOYEE_ID
]

for text, entity_type, expected_value in test_cases:
    results = analyzer.analyze(text=text, language="en", score_threshold=0.5)
    # Collect every detected value of the entity type under test
    found = [text[r.start:r.end].lower() for r in results if r.entity_type == entity_type]

    if expected_value is None:
        status = "PASS" if not found else "FAIL"
    else:
        status = "PASS" if expected_value.lower() in found else "FAIL"

    expectation = expected_value if expected_value is not None else "no match"
    print(f"[{status}] '{text}' -> {entity_type}: {expectation}")

Pay particular attention to false positives (non-PII flagged as PII) and false negatives (actual PII missed). Adjust regex patterns, context words, and score thresholds based on your test results.
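
False positives and false negatives are easier to track over time as precision and recall. The following is a minimal sketch that compares labeled and detected (entity_type, value) pairs; the sample data is made up for illustration and precision_recall is a hypothetical helper.

```python
def precision_recall(expected, detected):
    """Compare sets of (entity_type, value) pairs and return (precision, recall)."""
    true_positives = len(expected & detected)
    false_positives = len(detected - expected)   # flagged but not labeled as PII
    false_negatives = len(expected - detected)   # labeled PII that was missed
    precision = true_positives / (true_positives + false_positives) if detected else 1.0
    recall = true_positives / (true_positives + false_negatives) if expected else 1.0
    return precision, recall

# Hypothetical labels and detections for one document
expected = {("EMPLOYEE_ID", "EMP-12345"), ("INTERNAL_PROJECT", "Titan")}
detected = {("EMPLOYEE_ID", "EMP-12345"), ("INTERNAL_PROJECT", "titan submarine")}

precision, recall = precision_recall(expected, detected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision points to patterns or deny-list terms that are too broad. Low recall points to missing formats or thresholds that are set too high.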

What's Next

You can now extend Presidio to detect any entity type your business needs. In Part 4, we'll cover anonymization strategies: the full set of operators (replace, redact, mask, hash, encrypt), pseudonymization with consistent mappings, synthetic data generation, and when to use reversible vs. irreversible anonymization.

This is Part 3 of the Hands-On Microsoft Presidio series. I write about PII detection, AI infrastructure, and building with Claude Code on Dev.to.