DEV Community

Malik B. Parker

How to Strip Sensitive Data Before It Hits Your LLM

You built an AI agent that logs into your bank, navigates to billing, and extracts your bill amount. Smart. But now Claude is reading your full name, home address, account numbers, and partial SSN — all sent through an API you don't control. That's not a pipeline. That's a liability.

Here's how I solved it with four regex patterns and an open-source library most people have never heard of.


The Context

I'm building Bill Analyzer — an agentic system that automatically logs into utility and financial sites, navigates to billing pages, and extracts what I owe and when it's due. It uses:

  • Playwright for browser automation
  • Claude (Haiku) as the AI agent for navigation and extraction
  • 1Password CLI for credential management

The architecture has two phases:

  1. Login Agent — navigates login flows with credentials handled opaquely (the agent never sees passwords)
  2. Extract Agent — reads post-login pages to find billing data

The extract agent needs to read the page to find dollar amounts and due dates. But those pages also contain names, addresses, SSNs, account numbers — PII that has no business being sent to any external API.

The constraint: the agent must understand the page well enough to extract billing data, without ever seeing personal information.


Why This Matters Beyond My Project

Any time you're feeding real user data into an LLM — customer support transcripts, medical records, financial documents, scraped web content — you face the same problem. The model needs the structure and relevant content, not the identity.

This isn't just good practice. Depending on your industry and jurisdiction, it can also be a GDPR, HIPAA, or CCPA compliance requirement.


The Tool: Microsoft Presidio

Presidio is Microsoft's open-source PII detection and anonymization library. It combines:

  • spaCy NER (Named Entity Recognition) for names and locations
  • Pattern-based recognizers (regex) for structured PII like SSNs, phone numbers, credit cards
  • Custom recognizers you can add for domain-specific patterns

Install:

pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

Basic usage:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "John Smith, (555) 123-4567, john@example.com"
results = analyzer.analyze(text=text, language="en")
redacted = anonymizer.anonymize(text=text, analyzer_results=results)

print(redacted.text)
# <PERSON>, <PHONE_NUMBER>, <EMAIL_ADDRESS>

What Presidio Catches Out of the Box

Entity          Example                         Method
PERSON          John Smith                      spaCy NER
PHONE_NUMBER    (555) 123-4567                  Regex
EMAIL_ADDRESS   john@example.com                Regex
CREDIT_CARD     4111-1111-1111-1111             Regex + Luhn
US_SSN          123-45-6789                     Regex
US_BANK_NUMBER  4829184729                      Regex
LOCATION        Springfield (city names only)   spaCy NER

What It Misses: Street Addresses

This is where I hit a wall. Presidio's LOCATION entity catches city names sometimes, but full street addresses? Completely invisible:

IN:  John Smith, 123 Main St Springfield IL 62701, Balance: $142.37
OUT: <PERSON>, 123 Main St Springfield IL 62701, Balance: $142.37
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                Address passes through unredacted

Bare addresses are even worse:

IN:  3498 Ebenezer Ave
OUT: 3498 Ebenezer Ave    ← completely missed

For a billing extraction pipeline, this is unacceptable. Your home address is on every utility bill page.


The Fix: Four Custom Regex Patterns

US street addresses follow predictable patterns. I built a custom PatternRecognizer with four patterns, ordered from most specific (highest confidence) to least:

from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

US_STATES = (
    "AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|"
    "MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|"
    "SD|TN|TX|UT|VT|VA|WA|WV|WI|WY|DC"
)

STREET_SUFFIXES = (
    r"St(?:reet)?|Ave(?:nue)?|Blvd|Boulevard|Rd|Road|Dr(?:ive)?|"
    r"Ln|Lane|Way|Ct|Court|Pl(?:ace)?|Cir(?:cle)?|Pkwy|Parkway|"
    r"Ter(?:race)?|Hwy|Highway|Loop|Run|Path|Trail"
)

address_recognizer = PatternRecognizer(
    supported_entity="ADDRESS",
    patterns=[
        # 1. Full: "123 Main St, Apt 2, Springfield, IL 62701"
        Pattern(
            "us_address_full",
            rf"\d{{1,5}}\s+[\w\s.]+?(?:{STREET_SUFFIXES})\b"
            rf"[,.\s]+(?:[\w\s#.,]+[,.\s]+)?"
            rf"(?:{US_STATES})\s+\d{{5}}(?:-\d{{4}})?",
            0.85,
        ),
        # 2. With state: "123 Main St, Springfield, IL"
        Pattern(
            "us_address_state",
            rf"\d{{1,5}}\s+[\w\s.]+?(?:{STREET_SUFFIXES})\b"
            rf"[,.\s]+(?:[\w\s#.,]+[,.\s]+)?"
            rf"(?:{US_STATES})\b",
            0.7,
        ),
        # 3. With ZIP: "123 Main St 62701"
        Pattern(
            "us_address_zip",
            rf"\d{{1,5}}\s+[\w\s.]+?(?:{STREET_SUFFIXES})\b"
            rf"[,.\s]+[\w\s.,#]+\d{{5}}(?:-\d{{4}})?",
            0.6,
        ),
        # 4. Bare: "3498 Ebenezer Ave" or "789 Elm Dr, Apt 4"
        Pattern(
            "us_address_street_only",
            rf"\d{{1,5}}\s+[\w\s.]+?(?:{STREET_SUFFIXES})\b"
            rf"(?:[,.\s]+(?:Apt|Suite|Ste|Unit|#)\s*[\w-]+)?",
            0.4,
        ),
    ],
    context=[
        "address", "street", "mailing", "billing",
        "home", "residence", "service"
    ],
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(address_recognizer)

How the patterns work

All four share a common prefix:

\d{1,5}\s+[\w\s.]+?(?:STREET_SUFFIXES)\b
│          │              │
│          │              └─ Street suffix (St, Ave, Blvd...)
│          └─ Street name (lazy match)
└─ House number (1-5 digits)

Key regex decisions:

  • Lazy +? on the street name stops at the first street suffix instead of the last one in a run of words. With a greedy +, 12 Oak St near the old mill road would match all the way through road, because Road is also a suffix.
  • Word boundary \b after the suffix prevents partial matches like "Driveways" matching Drive.
  • Context words ("address", "billing", "mailing") boost confidence for borderline matches — a bare 3498 Ebenezer Ave at 0.4 confidence gets boosted when near the word "address".
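The first two decisions can be checked in isolation with Python's standard `re` module, without Presidio. This is a minimal sketch: the helper names `build` and `first_match` are illustrative, and `re.IGNORECASE` is an assumption about how the patterns are matched.

```python
import re

STREET_SUFFIXES = (
    r"St(?:reet)?|Ave(?:nue)?|Blvd|Boulevard|Rd|Road|Dr(?:ive)?|"
    r"Ln|Lane|Way|Ct|Court|Pl(?:ace)?|Cir(?:cle)?|Pkwy|Parkway|"
    r"Ter(?:race)?|Hwy|Highway|Loop|Run|Path|Trail"
)


def build(name_quantifier: str, boundary: str) -> re.Pattern:
    """Compile the shared prefix with a chosen quantifier and trailing boundary.

    Hypothetical helper for this demo only; it is not part of Presidio.
    """
    return re.compile(
        rf"\d{{1,5}}\s+[\w\s.]{name_quantifier}(?:{STREET_SUFFIXES}){boundary}",
        re.IGNORECASE,  # assumption: case-insensitive matching
    )


def first_match(pattern: re.Pattern, text: str):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group() if m else None


lazy = build("+?", r"\b")        # the form used in the article
greedy = build("+", r"\b")       # same prefix with a greedy street name
no_boundary = build("+?", "")    # same prefix without the trailing \b

text = "Meet at 12 Oak St near the old mill road"
print(first_match(lazy, text))    # 12 Oak St
print(first_match(greedy, text))  # 12 Oak St near the old mill road

print(first_match(lazy, "3 Shared Driveways"))         # None
print(first_match(no_boundary, "3 Shared Driveways"))  # 3 Shared Drive
```

The greedy variant runs on to the last suffix it can find, and the variant without `\b` accepts a suffix that is only the start of a longer word.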

The Results

IN:  3498 Ebenezer Ave
OUT: <ADDRESS>

IN:  123 Main St
OUT: <ADDRESS>

IN:  789 Elm Drive, Apt 4
OUT: <ADDRESS>

IN:  Service address: 3498 Ebenezer Ave. Your balance is $142.37
OUT: Service address: <ADDRESS>. Your balance is $142.37

IN:  John Smith, 123 Main St Springfield IL 62701, Balance: $142.37
OUT: <PERSON>, <ADDRESS>, Balance: $142.37

IN:  Jane Doe, 456 Oak Avenue, Apt 2B, Chicago, IL 60601, Amount: $98.50
OUT: <PERSON>, <ADDRESS>, Amount: $98.50

IN:  789 Broadway New York NY 10003, Phone: (555) 123-4567
OUT: <ADDRESS>, Phone: <PHONE_NUMBER>

IN:  1234 W Elm Blvd, Suite 100, Los Angeles CA 90001
OUT: <ADDRESS>

IN:  55 Park Dr, Unit 3, Denver CO 80202. Your bill is $75.00 due May 1.
OUT: <ADDRESS>. Your bill is $75.00 due May 1.

IN:  Your bill of $200.00 is due May 1. No address here.
OUT: Your bill of $200.00 is due May 1. No address here.

Every address in this sample set is caught. Every dollar amount and date is preserved. There are no false positives on the non-address text.
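One way to keep results like these from regressing is a small harness that exercises the same four regexes with plain `re`. The sketch below is not Presidio's matching or conflict-resolution logic. `best_match` is a hypothetical helper that simply returns the highest-scoring pattern that fires, and `re.IGNORECASE` is an assumption about the matching mode.

```python
import re

US_STATES = (
    "AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|"
    "MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|"
    "SD|TN|TX|UT|VT|VA|WA|WV|WI|WY|DC"
)
STREET_SUFFIXES = (
    r"St(?:reet)?|Ave(?:nue)?|Blvd|Boulevard|Rd|Road|Dr(?:ive)?|"
    r"Ln|Lane|Way|Ct|Court|Pl(?:ace)?|Cir(?:cle)?|Pkwy|Parkway|"
    r"Ter(?:race)?|Hwy|Highway|Loop|Run|Path|Trail"
)

# Shared pieces, copied from the four patterns in the article.
PREFIX = rf"\d{{1,5}}\s+[\w\s.]+?(?:{STREET_SUFFIXES})\b"
MIDDLE = r"[,.\s]+(?:[\w\s#.,]+[,.\s]+)?"

PATTERNS = [
    ("us_address_full", 0.85,
     PREFIX + MIDDLE + rf"(?:{US_STATES})\s+\d{{5}}(?:-\d{{4}})?"),
    ("us_address_state", 0.7, PREFIX + MIDDLE + rf"(?:{US_STATES})\b"),
    ("us_address_zip", 0.6, PREFIX + r"[,.\s]+[\w\s.,#]+\d{5}(?:-\d{4})?"),
    ("us_address_street_only", 0.4,
     PREFIX + r"(?:[,.\s]+(?:Apt|Suite|Ste|Unit|#)\s*[\w-]+)?"),
]
COMPILED = [(name, score, re.compile(rx, re.IGNORECASE))
            for name, score, rx in PATTERNS]


def best_match(text: str):
    """Return (pattern_name, matched_text) for the highest-scoring hit, or None.

    Hypothetical stand-in for overlap resolution; Presidio's own logic may differ.
    """
    best = None
    for name, score, rx in COMPILED:
        m = rx.search(text)
        if m and (best is None or score > best[0]):
            best = (score, name, m.group())
    return None if best is None else (best[1], best[2])


print(best_match("John Smith, 123 Main St Springfield IL 62701, Balance: $142.37"))
# ('us_address_full', '123 Main St Springfield IL 62701')
print(best_match("3498 Ebenezer Ave"))
# ('us_address_street_only', '3498 Ebenezer Ave')
print(best_match("Your bill of $200.00 is due May 1. No address here."))
# None
```

Because it depends only on the standard library, a check like this can run in CI without downloading the spaCy model.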


What Survives Redaction (By Design)

The whole point is that the LLM still gets what it needs:

  • $142.37 — the bill amount (not PII)
  • April 15, 2026 — the due date (not PII)
  • Balance Due: — structural labels (not PII)
  • <PERSON> — the agent knows a name was there without knowing whose

The redacted text preserves enough structure for the AI to do its job while stripping everything that identifies the person.
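A quick sanity check that redaction has not eaten the billing fields might look like the sketch below. The function `surviving_fields` and both regexes are illustrative assumptions, not part of the pipeline described here. The placeholder regex assumes the `<ENTITY_NAME>` format shown in the outputs above.

```python
import re

# Dollar amounts such as $142.37 or $1,204.50 (assumed US-style formatting).
AMOUNT = re.compile(r"\$\d[\d,]*\.\d{2}")
# Placeholders in the <ENTITY_NAME> style shown in the redacted output.
PLACEHOLDER = re.compile(r"<[A-Z_]+>")


def surviving_fields(redacted: str) -> dict:
    """Report which amounts and placeholders are present in redacted text."""
    return {
        "amounts": AMOUNT.findall(redacted),
        "placeholders": PLACEHOLDER.findall(redacted),
    }


sample = "<PERSON>, <ADDRESS>, Balance: $142.37"
print(surviving_fields(sample))
# {'amounts': ['$142.37'], 'placeholders': ['<PERSON>', '<ADDRESS>']}
```

If the amounts list comes back empty for a page that should have a balance, the redaction step is probably too aggressive.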


Known Limitations

  • PO Box addresses (PO Box 1234, Springfield, IL) — not covered, would need another pattern
  • International addresses — US only; other countries need separate recognizers
  • No-suffix streets (123 Broadway works only because it ends in "way", which matches the Way suffix; 123 Maple does not)
  • Account numbers — Presidio sometimes tags these as PHONE_NUMBER due to digit patterns; close enough for redaction purposes
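For the PO Box gap, a candidate regex could look like the sketch below. It is an untested-in-production assumption, not a pattern from the project. It is shown with plain `re` so it can be tried on its own; if it holds up, the same string could be added to the recognizer as one more `Pattern` with a score of your choosing.

```python
import re

# Hypothetical PO Box pattern: "PO Box 1234", "P.O. Box 77", "Post Office Box 9".
PO_BOX = re.compile(
    r"\b(?:P\.?\s*O\.?|Post\s+Office)\s*Box\s+\d+\b",
    re.IGNORECASE,  # assumption: case-insensitive matching
)


def find_po_box(text: str):
    """Return the first PO Box substring in text, or None."""
    m = PO_BOX.search(text)
    return m.group() if m else None


print(find_po_box("PO Box 1234, Springfield, IL"))        # PO Box 1234
print(find_po_box("Mail to P.O. Box 77 by Friday"))       # P.O. Box 77
print(find_po_box("Your bill of $200.00 is due May 1."))  # None
```

This only covers the box itself. The city, state, and ZIP that follow would still need their own handling.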

The Bigger Picture

PII redaction isn't just about regex. It's an architectural decision. In my pipeline:

  1. Login phase — the agent never sees credentials (opaque tool fills them, snapshots redact filled fields)
  2. Navigation phase — the agent sees only links and buttons, not page content
  3. Extraction phase — the agent sees Presidio-redacted text (this article)

Three layers of privacy enforcement. The LLM is powerful but sandboxed. It can read a page without knowing who you are.
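The extraction layer can be enforced structurally by making the redactor the only path to the model. Here is a minimal sketch of that idea. Every name in it (`make_guarded_sender`, `demo_redact`, `fake_llm`) is hypothetical, the redactor is a one-regex stand-in for the Presidio calls shown earlier, and the model call is a stub that records what it receives.

```python
import re
from typing import Callable, List


def make_guarded_sender(
    redact: Callable[[str], str],
    send: Callable[[str], str],
) -> Callable[[str], str]:
    """Wrap `send` so that it can only ever receive redacted text."""
    def guarded(page_text: str) -> str:
        return send(redact(page_text))
    return guarded


def demo_redact(text: str) -> str:
    """Stand-in redactor; a real pipeline would call the Presidio engines."""
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "<US_SSN>", text)


sent_payloads: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stub for the model call; records the payload instead of sending it."""
    sent_payloads.append(prompt)
    return "ok"


ask_llm = make_guarded_sender(demo_redact, fake_llm)
ask_llm("SSN 123-45-6789, Balance Due: $142.37")
print(sent_payloads[0])
# SSN <US_SSN>, Balance Due: $142.37
```

If the rest of the code only ever holds `ask_llm`, there is no call site that can send raw page text by accident.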

If you're building any pipeline that sends real-world data through an LLM, ask yourself: does the model actually need to see the PII to do its job? Usually the answer is no.


