DEV Community

Michal Vich

How to Protect PII in LLM Pipelines with Python

Tokenize personal data before it reaches the model, restore it in the output.


If we're building AI features that handle customer data — support tickets, medical intake, financial queries — we have a problem. Every prompt we send to an LLM API is logged, cached, and potentially used for training. Names, emails, SSNs, medical records: all of it lands on someone else's servers.

GDPR says we can't send EU personal data to third-party processors without safeguards. HIPAA says protected health information must be de-identified. And even outside regulated industries, sending raw customer data to OpenAI or Anthropic is a liability we shouldn't accept.

Here's what a naive implementation looks like:

from openai import OpenAI

client = OpenAI()

# Every name, email, and SSN in this text hits OpenAI's servers
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": (
            "Summarize this support ticket: Customer John Doe (john@acme.com) "
            "called about order #4521. His SSN 123-45-6789 was used for verification."
        ),
    }],
)

John Doe's name, email, and SSN are now in OpenAI's logs. We can do better.

The Approach: Tokenize, Process, Detokenize

The fix is straightforward: replace PII with deterministic tokens before it reaches the model, then restore the originals in the output.

"John Doe (john@acme.com)"
    ↓ tokenize
"<Person_1> (<Email Address_1>)"
    ↓ send to LLM
"I've noted <Person_1>'s issue..."
    ↓ detokenize
"I've noted John Doe's issue..."

The model never sees real data. It works with placeholders that preserve sentence structure, so the output quality stays the same. When we detokenize, the final response reads naturally with all the original values restored.
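To make the mechanics concrete, here's a from-scratch sketch of the round-trip using a single regex detector for email addresses. This illustrates the idea only — it is not Blindfold's implementation, which covers far more entity types:

```python
import re

# One toy detector: email addresses. Each match is swapped for a
# numbered token, and the token -> original mapping is kept for later.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def tokenize(text):
    mapping = {}
    def repl(match):
        token = f"<Email Address_{len(mapping) + 1}>"
        mapping[token] = match.group(0)
        return token
    return EMAIL_RE.sub(repl, text), mapping

def detokenize(text, mapping):
    # Restore every token back to its original value.
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

safe, mapping = tokenize("Reply to john@acme.com about the refund.")
print(safe)
# Reply to <Email Address_1> about the refund.
print(detokenize(safe, mapping))
# Reply to john@acme.com about the refund.
```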

Blindfold handles both sides of this — the PII detection and the token mapping — and it works with any LLM provider (OpenAI, Anthropic, Mistral, local models, etc.).

Setup

pip install blindfold-sdk

That's all we need to start. Blindfold works locally out of the box — no API key, no account, no network calls.

Example 1: Try It Locally (No API Key, No Network Calls)

Let's start with something we can run right now. Without an API key, Blindfold runs entirely on our machine using regex-based detection. Nothing leaves the process:

from blindfold import Blindfold

# No API key = local mode (regex-only, runs entirely on your machine)
bf = Blindfold()

text = "Contact us at sarah@example.com or 555-867-5309. SSN: 123-45-6789"

# Detect — find PII without modifying the text
detected = bf.detect(text)
print(f"Found {detected.entities_count} entities:")
for entity in detected.detected_entities:
    print(f"  {entity.type}: '{entity.text}' (score: {entity.score:.2f})")
# Found 3 entities:
#   Email Address: 'sarah@example.com' (score: 0.95)
#   Phone Number: '555-867-5309' (score: 0.90)
#   Social Security Number: '123-45-6789' (score: 1.00)

# Tokenize — replace PII with reversible tokens
result = bf.tokenize(text)
print(f"\nTokenized: {result.text}")
# Tokenized: Contact us at <Email Address_1> or <Phone Number_1>. SSN: <Social Security Number_1>

# Detokenize — restore originals
original = bf.detokenize(result.text, result.mapping)
print(f"Restored: {original.text}")
# Restored: Contact us at sarah@example.com or 555-867-5309. SSN: 123-45-6789

# Redact — permanently remove PII
redacted = bf.redact(text)
print(f"\nRedacted: {redacted.text}")
# Redacted: Contact us at [REDACTED] or [REDACTED]. SSN: [REDACTED]

Local mode covers 80+ pattern-based entity types (emails, phone numbers, SSNs, credit cards, IBANs, and more) across 30+ countries using 86 built-in regex detectors. It's fast, deterministic, and has zero external dependencies. For names, addresses, and other context-dependent entities, we'll switch to the cloud API in Example 4.

Example 2: Detect-Only Mode

Sometimes we don't want to modify text — just know what PII is in it. The detect() method returns every entity with its type, position, and confidence score:

from blindfold import Blindfold

bf = Blindfold()

result = bf.detect(
    "Email john@acme.com or call 555-0123. SSN: 123-45-6789",
    policy="strict",
)

for entity in result.detected_entities:
    print(f"{entity.type}: '{entity.text}' (score: {entity.score:.2f})")

# Email Address: 'john@acme.com' (score: 0.95)
# Phone Number: '555-0123' (score: 0.90)
# Social Security Number: '123-45-6789' (score: 1.00)

Use this to build guardrails: block messages containing PII before they reach the model, generate audit trails for compliance, or flag content for human review.
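As a sketch of that blocking guardrail, here's a self-contained version. The regex detector below is a stand-in so the sketch runs without the SDK; in a real pipeline it would wrap bf.detect() and inspect the returned entities:

```python
import re

# Stand-in detector: two regex patterns instead of bf.detect().
PATTERNS = {
    "Email Address": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "Social Security Number": r"\b\d{3}-\d{2}-\d{4}\b",
}

def detect_pii(text):
    # Return the sorted set of entity types found in the text.
    return sorted({etype for etype, pat in PATTERNS.items()
                   if re.search(pat, text)})

def guard(message):
    # Block the message before it ever reaches the model.
    found = detect_pii(message)
    if found:
        raise ValueError(f"blocked, message contains: {', '.join(found)}")
    return message

guard("What is our refund policy?")  # clean text passes through
# guard("My SSN is 123-45-6789")     # raises ValueError
```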

Going Further: Cloud API

When we need to detect names, addresses, and other context-dependent entities — not just structured patterns — we switch to the Blindfold cloud API, which adds AI-based detection on top of the regex layer.

export BLINDFOLD_API_KEY="your-blindfold-api-key"

Sign up at blindfold.dev to get an API key. The free tier covers 500K characters per month.

Example 3: Redact PII from Documents

When storing or indexing text (RAG pipelines, search indexes, logs), we often want PII permanently removed rather than tokenized:

from blindfold import Blindfold

bf = Blindfold()  # uses BLINDFOLD_API_KEY env var

medical_note = (
    "Patient Sarah Johnson (DOB 03/15/1985) was diagnosed with "
    "Type 2 diabetes. Contact: sarah.j@email.com, SSN 234-56-7890."
)

redacted = bf.redact(medical_note, policy="hipaa_us")
print(redacted.text)
# "Patient [REDACTED] (DOB [REDACTED]) was diagnosed with
#  Type 2 diabetes. Contact: [REDACTED], SSN [REDACTED]."

With redact(), PII is permanently removed — there's no mapping to reverse. Notice that "Sarah Johnson" is detected as a person name — this requires the cloud API's AI model; local regex mode can't detect arbitrary names. This is the right choice for indexing, logging, or any case where we want the data gone, not just hidden.

Example 4: Protect Any LLM Call (Cloud API)

The full pattern: tokenize the input (including names), send safe text to the model, detokenize the output.

pip install openai
export OPENAI_API_KEY="your-openai-api-key"
from blindfold import Blindfold
from openai import OpenAI

bf = Blindfold()
openai = OpenAI()

user_message = (
    "Write a follow-up email to John Doe at john@example.com "
    "about his refund for order #1234."
)

# Step 1: Tokenize — PII is replaced with tokens
tokenized = bf.tokenize(user_message, policy="basic")
print(tokenized.text)
# "Write a follow-up email to <Person_1> at <Email Address_1>
#  about his refund for order #1234."

# Step 2: Send safe text to the LLM
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful customer support assistant."},
        {"role": "user", "content": tokenized.text},
    ],
)
llm_output = response.choices[0].message.content
# The model drafts an email to <Person_1> at <Email Address_1>

# Step 3: Detokenize — restore original values
result = bf.detokenize(llm_output, tokenized.mapping)
print(result.text)
# The model's email now reads "John Doe" and "john@example.com"

The policy parameter controls which entity types are detected. "basic" covers names, emails, phone numbers, and locations. For regulated workloads, use "gdpr_eu", "hipaa_us", or "pci_dss".

This pattern works with any LLM provider — swap openai for anthropic, mistralai, or even a local model running on Ollama. Blindfold doesn't care what's in the middle.
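Since the middle step is just text in, text out, the whole pattern can be expressed as a wrapper around any prompt-to-response function. fake_llm below is a stand-in assumption for a real provider client:

```python
# The pipeline's middle step is just a function from prompt text to
# response text, so any provider fits in the slot.
def protected_call(tokenized_text, mapping, llm):
    output = llm(tokenized_text)              # model sees only tokens
    for token, original in mapping.items():   # restore originals
        output = output.replace(token, original)
    return output

def fake_llm(prompt):
    # Stand-in for an OpenAI, Anthropic, or Ollama call.
    return "Following up with <Person_1> about the refund."

mapping = {"<Person_1>": "John Doe"}
print(protected_call("Email <Person_1> about order #1234.", mapping, fake_llm))
# Following up with John Doe about the refund.
```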

Protection Methods

Blindfold supports six ways to handle detected PII, depending on the use case:

  • tokenize → <Person_1> — LLM pipelines, reversible round-trips
  • redact → [REDACTED] — Permanent removal, indexing, storage
  • mask → J*** D** — Display to end users, partial visibility
  • hash → a1b2c3d4... — Analytics, deduplication without exposing data
  • synthesize → Jane Smith — Realistic fake data for testing
  • encrypt → enc:x8f2k... — Reversible with encryption key
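To make the mask style above concrete, here's a toy function that produces the "J*** D**" format shown — an illustration of the output shape, not Blindfold's mask implementation:

```python
# Toy illustration of the masked output style ("John Doe" -> "J*** D**"):
# keep the first character of each word, star out the rest.
def mask_name(value):
    return " ".join(word[0] + "*" * (len(word) - 1)
                    for word in value.split())

print(mask_name("John Doe"))
# J*** D**
```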

Compliance Policies

Each policy defines which entity types to detect and at what sensitivity:

  • basic — Names, emails, phones, locations. Best for general apps.
  • gdpr_eu — Adds IBANs, addresses, dates of birth. Best for EU compliance.
  • hipaa_us — Adds SSNs, MRNs, medical terms. Best for healthcare.
  • pci_dss — Adds card numbers, CVVs, expiry dates. Best for payment processing.
  • strict — All entity types with a lower confidence threshold. Maximum coverage.

Data residency is controlled via the region parameter: "eu" routes to Frankfurt, "us" routes to Virginia. Both the API and all SDKs support this.

Wrapping Up

Three steps — tokenize, call the LLM, detokenize — and customer data never leaves our control. It works with any model provider, any framework, and takes minutes to add to an existing codebase. Start locally with zero setup, then switch to the cloud API when we need AI-powered accuracy.

Beyond Python, there are SDKs for JavaScript, Go, Java, .NET, a CLI, and an MCP server for AI agent workflows. There's also a LangChain integration if we want deeper framework support.
