Malik B. Parker

Posted on Mar 15

How to Strip Sensitive Data Before It Hits Your LLM

#security #python #ai #webscraping

You built an AI agent that logs into your bank, navigates to billing, and extracts your bill amount. Smart. But now Claude is reading your full name, home address, account numbers, and partial SSN — all sent through an API you don't control. That's not a pipeline. That's a liability.

Here's how I solved it with four regex patterns and an open-source library most people have never heard of.

The Context

I'm building Bill Analyzer — an agentic system that automatically logs into utility and financial sites, navigates to billing pages, and extracts what I owe and when it's due. It uses:

Playwright for browser automation
Claude (Haiku) as the AI agent for navigation and extraction
1Password CLI for credential management

The architecture has two phases:

Login Agent — navigates login flows with credentials handled opaquely (the agent never sees passwords)
Extract Agent — reads post-login pages to find billing data

The extract agent needs to read the page to find dollar amounts and due dates. But those pages also contain names, addresses, SSNs, account numbers — PII that has no business being sent to any external API.

The constraint: the agent must understand the page well enough to extract billing data, without ever seeing personal information.

Why This Matters Beyond My Project

Any time you're feeding real user data into an LLM — customer support transcripts, medical records, financial documents, scraped web content — you face the same problem. The model needs the structure and relevant content, not the identity.

This isn't just good practice. Depending on your industry and jurisdiction, it can be a GDPR, HIPAA, or CCPA requirement.

The Tool: Microsoft Presidio

Presidio is Microsoft's open-source PII detection and anonymization library. It combines:

spaCy NER (Named Entity Recognition) for names and locations
Pattern-based recognizers (regex) for structured PII like SSNs, phone numbers, credit cards
Custom recognizers you can add for domain-specific patterns

Install:

pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

Basic usage:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "John Smith, (555) 123-4567, john@example.com"
results = analyzer.analyze(text=text, language="en")
redacted = anonymizer.anonymize(text=text, analyzer_results=results)

print(redacted.text)
# <PERSON>, <PHONE_NUMBER>, <EMAIL_ADDRESS>

What Presidio Catches Out of the Box

Entity	Example	Method
`PERSON`	John Smith	spaCy NER
`PHONE_NUMBER`	(555) 123-4567	Regex
`EMAIL_ADDRESS`	john@example.com	Regex
`CREDIT_CARD`	4532-1234-5678-9012	Regex + Luhn
`US_SSN`	123-45-6789	Regex
`US_BANK_NUMBER`	4829184729	Regex
`LOCATION`	Springfield (city names only)	spaCy NER

What It Misses: Street Addresses

This is where I hit a wall. Presidio's LOCATION entity catches city names sometimes, but full street addresses? Completely invisible:

IN:  John Smith, 123 Main St Springfield IL 62701, Balance: $142.37
OUT: <PERSON>, 123 Main St Springfield IL 62701, Balance: $142.37
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                Address passes through unredacted

Bare addresses are even worse:

IN:  3498 Ebenezer Ave
OUT: 3498 Ebenezer Ave    ← completely missed

For a billing extraction pipeline, this is unacceptable. Your home address is on every utility bill page.

The Fix: Four Custom Regex Patterns

US street addresses follow predictable patterns. I built a custom PatternRecognizer with four patterns, ordered from most specific (highest confidence) to least:

from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

US_STATES = (
    "AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|"
    "MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|"
    "SD|TN|TX|UT|VT|VA|WA|WV|WI|WY|DC"
)

STREET_SUFFIXES = (
    r"St(?:reet)?|Ave(?:nue)?|Blvd|Boulevard|Rd|Road|Dr(?:ive)?|"
    r"Ln|Lane|Way|Ct|Court|Pl(?:ace)?|Cir(?:cle)?|Pkwy|Parkway|"
    r"Ter(?:race)?|Hwy|Highway|Loop|Run|Path|Trail"
)

address_recognizer = PatternRecognizer(
    supported_entity="ADDRESS",
    patterns=[
        # 1. Full: "123 Main St, Apt 2, Springfield, IL 62701"
        Pattern(
            "us_address_full",
            rf"\d{{1,5}}\s+[\w\s.]+?(?:{STREET_SUFFIXES})\b"
            rf"[,.\s]+(?:[\w\s#.,]+[,.\s]+)?"
            rf"(?:{US_STATES})\s+\d{{5}}(?:-\d{{4}})?",
            0.85,
        ),
        # 2. With state: "123 Main St, Springfield, IL"
        Pattern(
            "us_address_state",
            rf"\d{{1,5}}\s+[\w\s.]+?(?:{STREET_SUFFIXES})\b"
            rf"[,.\s]+(?:[\w\s#.,]+[,.\s]+)?"
            rf"(?:{US_STATES})\b",
            0.7,
        ),
        # 3. With ZIP: "123 Main St 62701"
        Pattern(
            "us_address_zip",
            rf"\d{{1,5}}\s+[\w\s.]+?(?:{STREET_SUFFIXES})\b"
            rf"[,.\s]+[\w\s.,#]+\d{{5}}(?:-\d{{4}})?",
            0.6,
        ),
        # 4. Bare: "3498 Ebenezer Ave" or "789 Elm Dr, Apt 4"
        Pattern(
            "us_address_street_only",
            rf"\d{{1,5}}\s+[\w\s.]+?(?:{STREET_SUFFIXES})\b"
            rf"(?:[,.\s]+(?:Apt|Suite|Ste|Unit|#)\s*[\w-]+)?",
            0.4,
        ),
    ],
    context=[
        "address", "street", "mailing", "billing",
        "home", "residence", "service"
    ],
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(address_recognizer)

How the patterns work

All four share a common prefix:

\d{1,5}\s+[\w\s.]+?(?:STREET_SUFFIXES)\b
│          │              │
│          │              └─ Street suffix (St, Ave, Blvd...)
│          └─ Street name (lazy match)
└─ House number (1-5 digits)

Key regex decisions:

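Lazy +? on the street name stops at the first street suffix rather than the last. With a greedy +, a line like 12 Oak St near Pine Ave would match from 12 all the way through Ave. Commas, colons, and dollar signs are outside the [\w\s.] character class, so they end the street name either way.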
Word boundary \b after the suffix prevents partial matches like "Driveways" matching Drive.
Context words ("address", "billing", "mailing") boost confidence for borderline matches — a bare 3498 Ebenezer Ave at 0.4 confidence gets boosted when near the word "address".
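To see the first two decisions in isolation, here is a small check that uses only Python's re module. The shortened suffix list and the sample strings are mine, picked for illustration; they are not the recognizer's full list or real bill text, and matching here is case-sensitive for simplicity.

```python
import re

# Shortened suffix list, for illustration only.
SUFFIXES = r"St(?:reet)?|Ave(?:nue)?|Dr(?:ive)?|Blvd|Rd"

lazy_prefix = re.compile(rf"\d{{1,5}}\s+[\w\s.]+?(?:{SUFFIXES})\b")
greedy_prefix = re.compile(rf"\d{{1,5}}\s+[\w\s.]+(?:{SUFFIXES})\b")
no_boundary = re.compile(rf"\d{{1,5}}\s+[\w\s.]+?(?:{SUFFIXES})")

text = "Send mail to 12 Oak St near Pine Ave today"

lazy_match = lazy_prefix.search(text).group()
greedy_match = greedy_prefix.search(text).group()

print(lazy_match)    # 12 Oak St
print(greedy_match)  # 12 Oak St near Pine Ave

# "Driveways" contains "Drive", but no word boundary follows it.
print(lazy_prefix.search("5 Long Driveways"))          # None
print(no_boundary.search("5 Long Driveways").group())  # 5 Long Drive
```

The lazy version stops at the first suffix it can end on, and the \b is what keeps a suffix from matching inside a longer word.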

The Results

IN:  3498 Ebenezer Ave
OUT: <ADDRESS>

IN:  123 Main St
OUT: <ADDRESS>

IN:  789 Elm Drive, Apt 4
OUT: <ADDRESS>

IN:  Service address: 3498 Ebenezer Ave. Your balance is $142.37
OUT: Service address: <ADDRESS>. Your balance is $142.37

IN:  John Smith, 123 Main St Springfield IL 62701, Balance: $142.37
OUT: <PERSON>, <ADDRESS>, Balance: $142.37

IN:  Jane Doe, 456 Oak Avenue, Apt 2B, Chicago, IL 60601, Amount: $98.50
OUT: <PERSON>, <ADDRESS>, Amount: $98.50

IN:  789 Broadway New York NY 10003, Phone: (555) 123-4567
OUT: <ADDRESS>, Phone: <PHONE_NUMBER>

IN:  1234 W Elm Blvd, Suite 100, Los Angeles CA 90001
OUT: <ADDRESS>

IN:  55 Park Dr, Unit 3, Denver CO 80202. Your bill is $75.00 due May 1.
OUT: <ADDRESS>. Your bill is $75.00 due May 1.

IN:  Your bill of $200.00 is due May 1. No address here.
OUT: Your bill of $200.00 is due May 1. No address here.

All nine addresses in these samples are caught, every dollar amount and date is preserved, and the one address-free line passes through untouched.
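If you want to poke at the patterns without installing Presidio or a spaCy model, a rough stand-in with plain re is enough to see which pattern wins on a given line. This is a sketch, not Presidio's matching or conflict-resolution logic: best_address_match is a hypothetical helper, and passing re.IGNORECASE is my assumption about how Presidio applies patterns by default (it is consistent with 789 Broadway matching through the Way suffix).

```python
import re

US_STATES = (
    "AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|"
    "MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|"
    "SD|TN|TX|UT|VT|VA|WA|WV|WI|WY|DC"
)
STREET_SUFFIXES = (
    r"St(?:reet)?|Ave(?:nue)?|Blvd|Boulevard|Rd|Road|Dr(?:ive)?|"
    r"Ln|Lane|Way|Ct|Court|Pl(?:ace)?|Cir(?:cle)?|Pkwy|Parkway|"
    r"Ter(?:race)?|Hwy|Highway|Loop|Run|Path|Trail"
)
PREFIX = rf"\d{{1,5}}\s+[\w\s.]+?(?:{STREET_SUFFIXES})\b"
MIDDLE = r"[,.\s]+(?:[\w\s#.,]+[,.\s]+)?"

# (name, regex, score), copied from the recognizer's four patterns.
PATTERNS = [
    ("us_address_full",
     PREFIX + MIDDLE + rf"(?:{US_STATES})\s+\d{{5}}(?:-\d{{4}})?", 0.85),
    ("us_address_state",
     PREFIX + MIDDLE + rf"(?:{US_STATES})\b", 0.7),
    ("us_address_zip",
     PREFIX + r"[,.\s]+[\w\s.,#]+\d{5}(?:-\d{4})?", 0.6),
    ("us_address_street_only",
     PREFIX + r"(?:[,.\s]+(?:Apt|Suite|Ste|Unit|#)\s*[\w-]+)?", 0.4),
]


def best_address_match(text):
    """Return (pattern_name, matched_text) for the highest-scoring
    pattern that matches anywhere in text, or None if none match."""
    for name, regex, _score in sorted(PATTERNS, key=lambda p: -p[2]):
        match = re.search(regex, text, flags=re.IGNORECASE)
        if match:
            return name, match.group()
    return None


samples = [
    "3498 Ebenezer Ave",
    "789 Elm Drive, Apt 4",
    "John Smith, 123 Main St Springfield IL 62701, Balance: $142.37",
    "789 Broadway New York NY 10003, Phone: (555) 123-4567",
    "Your bill of $200.00 is due May 1. No address here.",
]
for line in samples:
    print(best_address_match(line))
```

The first two samples fall through to us_address_street_only, the next two are picked up by us_address_full, and the last one returns None.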

What Survives Redaction (By Design)

The whole point is that the LLM still gets what it needs:

$142.37 — the bill amount (not PII)
April 15, 2026 — the due date (not PII)
Balance Due: — structural labels (not PII)
<PERSON> — the agent knows a name was there without knowing whose

The redacted text preserves enough structure for the AI to do its job while stripping everything that identifies the person.
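One way to keep this honest is a small check that the billing signals are still present after redaction. The sketch below is illustrative only: billing_signals is a hypothetical helper, the sample string is made up from the placeholders and values mentioned above, and the date pattern only covers the "Month D, YYYY" format.

```python
import re

AMOUNT = re.compile(r"\$\d[\d,]*\.\d{2}")
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DUE_DATE = re.compile(rf"(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}")
PLACEHOLDER = re.compile(r"<[A-Z_]+>")


def billing_signals(redacted_text):
    """Collect what the extract agent still has to work with."""
    return {
        "amounts": AMOUNT.findall(redacted_text),
        "dates": DUE_DATE.findall(redacted_text),
        "placeholders": PLACEHOLDER.findall(redacted_text),
    }


sample = "<PERSON>, <ADDRESS>, Balance Due: $142.37 by April 15, 2026"
signals = billing_signals(sample)
print(signals["amounts"])       # ['$142.37']
print(signals["dates"])         # ['April 15, 2026']
print(signals["placeholders"])  # ['<PERSON>', '<ADDRESS>']
```

If a redaction change ever starts eating amounts or dates, a check like this fails before the agent does.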

Known Limitations

PO Box addresses (PO Box 1234, Springfield, IL) — not covered, would need another pattern
International addresses — US only; other countries need separate recognizers
No-suffix streets (123 Broadway works, 123 Maple does not)
Account numbers — Presidio sometimes tags these as PHONE_NUMBER due to digit patterns; close enough for redaction purposes
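For the PO Box gap, a fifth pattern could look something like the sketch below. I have not wired this into the recognizer; the regex and the find_po_boxes helper are hypothetical, it only covers the "PO Box" and "P.O. Box" spellings, and it matches the box itself, not the city and state that follow.

```python
import re

# Matches "PO Box 1234", "P.O. Box 77", "P O Box 5", case-insensitively.
PO_BOX = re.compile(r"\bP\.?\s?O\.?\s?Box\s+\d{1,6}\b", re.IGNORECASE)


def find_po_boxes(text):
    """Return every PO Box mention found in text."""
    return [m.group() for m in PO_BOX.finditer(text)]


print(find_po_boxes("PO Box 1234, Springfield, IL"))  # ['PO Box 1234']
print(find_po_boxes("Mail to P.O. Box 77 today"))     # ['P.O. Box 77']
print(find_po_boxes("Xbox 360 bundle"))               # []
```

The same regex string could be passed to another Pattern in the recognizer's list, with a score you would need to choose and test yourself.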

The Bigger Picture

PII redaction isn't just about regex. It's an architectural decision. In my pipeline:

Login phase — the agent never sees credentials (opaque tool fills them, snapshots redact filled fields)
Navigation phase — the agent sees only links and buttons, not page content
Extraction phase — the agent sees Presidio-redacted text (this article)

Three layers of privacy enforcement. The LLM is powerful but sandboxed. It can read a page without knowing who you are.
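As a sketch of what that extraction layer means in code: the model call is only reachable through a wrapper that redacts first. Everything here is hypothetical (make_guarded_extractor, the fake redactor, the fake model); in a real pipeline the redact callable would wrap the Presidio analyzer and anonymizer, and call_model would wrap the API client.

```python
from typing import Callable, List


def make_guarded_extractor(
    redact: Callable[[str], str],
    call_model: Callable[[str], str],
) -> Callable[[str], str]:
    """Build an extractor where raw page text never reaches the model."""

    def extract(page_text: str) -> str:
        safe_text = redact(page_text)
        return call_model(safe_text)

    return extract


# Stand-ins for demonstration only.
seen_by_model: List[str] = []


def fake_redact(text: str) -> str:
    return text.replace("John Smith", "<PERSON>")


def fake_model(prompt: str) -> str:
    seen_by_model.append(prompt)
    return "ok"


extract = make_guarded_extractor(fake_redact, fake_model)
result = extract("John Smith owes $142.37")
print(seen_by_model)  # ['<PERSON> owes $142.37']
```

Because the rest of the code only holds extract, not call_model, there is no path that sends unredacted text by accident.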

If you're building any pipeline that sends real-world data through an LLM, ask yourself: does the model actually need to see the PII to do its job? Usually the answer is no.
