Every AI pipeline tutorial shows you the happy path. Chunk your documents, embed them, stuff them into a context window, call your LLM, get a great answer.
None of them show what happens when your documents contain patient names, account numbers, or SSNs. At that point the demo breaks — and so does your compliance posture.
I've watched this play out repeatedly. Teams build a RAG pipeline, get it working, then legal or security asks a simple question: where exactly does the PII go?
The answer is usually everywhere. Into the embedding. Into the prompt. Logged by the orchestration framework. Possibly cached. Definitely inside the LLM API call leaving your network.
Here's how we think about the problem, and how we solve it.
The naive fix that makes things worse
First instinct: redaction. Find the PII, replace with [REDACTED], move on.
This breaks LLMs in a specific way that's easy to miss in testing and obvious in production.
Take this sentence:
Patient John Doe, DOB 04/12/1978, residing at 123 Main Street.
His email is john.doe@example.com and he was prescribed metformin.
Follow-up with Dr. Sarah Connors on 03/15/2025.
After naive redaction:
Patient [REDACTED], DOB [REDACTED], residing at [REDACTED].
His email is [REDACTED] and he was prescribed metformin.
Follow-up with [REDACTED] on [REDACTED].
Ask an LLM to summarize this. The model either hallucinates to fill the gaps or produces a hedged non-answer. Worse, if this record appears in two different chunks, [REDACTED] gives the model no way to know both refer to the same person.
Referential integrity is gone. The model can't reason across the context correctly.
What format-preserving tokenization does differently
Instead of blanking sensitive values, Protecto replaces them with typed, entity-scoped tokens that preserve meaning and referential integrity:
Patient <PER>005O 0BY</PER>, DOB <DATE_TIME>06N 00E1</DATE_TIME>,
residing at <ADDRESS>06N 00E1 00003b</ADDRESS>.
His email is <EMAIL>3</EMAIL> and he was prescribed metformin.
Follow-up with <PER>7H2K 9QR</PER> on <DATE_TIME>14P 88X2</DATE_TIME>.
The LLM sees a coherent record. It knows <PER>005O 0BY</PER> and <PER>7H2K 9QR</PER> are two distinct people. If the same patient appears in five chunks, their token is consistent across all of them.
Output quality holds. We track this with something we call RARI (Response Accuracy Retention Index): does the LLM still give accurate answers after masking? With typed tokens, yes. With [REDACTED], often no.
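The post doesn't spell out the formula, so here's an illustrative sketch of the retention-ratio idea (my framing, not Protecto's published definition): run the same eval set against raw and masked context, and measure what fraction of the baseline-correct answers survive masking.

```python
def rari(masked_correct: int, raw_correct: int) -> float:
    """Response Accuracy Retention Index (illustrative sketch): the
    fraction of answers the LLM got right on raw context that it still
    gets right once the context is masked."""
    if raw_correct == 0:
        raise ValueError("need at least one correct baseline answer")
    return masked_correct / raw_correct

# 92 of 100 baseline-correct answers survive masking:
print(rari(92, 100))  # 0.92
```

A score near 1.0 means masking cost you essentially nothing; [REDACTED]-style blanking tends to drag it down because the model loses the referents it needs to answer at all.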
The masking API in practice
Here's a basic scan-and-mask call using auto-detection:
import requests

payload = {
    "mask": [
        {
            "value": "John Doe lives at 123 Main Street. His email is john.doe@example.com"
        }
    ]
}

response = requests.put(
    "https://protecto-trial.protecto.ai/api/vault/mask",
    headers={
        "Authorization": "Bearer <AUTH_TOKEN>",
        "Content-Type": "application/json"
    },
    json=payload
)

print(response.json())
The response returns the original value alongside its masked token:
{
    "data": [
        {
            "value": "John Doe lives at 123 Main Street. His email is john.doe@example.com",
            "token_value": "<PER>005O 0BY</PER> lives at <ADDRESS>06N 00E1 00003b</ADDRESS>. His email is <EMAIL>3</EMAIL>"
        }
    ],
    "success": true,
    "error": { "message": "" }
}
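In a pipeline you'll want to fail closed on that success flag rather than risk passing raw text downstream. A small helper along these lines (a sketch; `tokens_from_mask_response` is my name for it, and the response shape is taken from the sample above):

```python
def tokens_from_mask_response(resp: dict) -> list[str]:
    """Extract masked token strings from a /mask response, raising on
    API-level errors instead of silently falling back to raw text."""
    if not resp.get("success"):
        message = resp.get("error", {}).get("message") or "unknown error"
        raise RuntimeError(f"mask failed: {message}")
    return [item["token_value"] for item in resp["data"]]

sample = {
    "data": [
        {
            "value": "John Doe lives at 123 Main Street.",
            "token_value": "<PER>005O 0BY</PER> lives at <ADDRESS>06N 00E1 00003b</ADDRESS>."
        }
    ],
    "success": True,
    "error": {"message": ""},
}
print(tokens_from_mask_response(sample))  # one token string per input value
```

Raising here is deliberate: in a privacy pipeline, the worst failure mode is continuing with unmasked text.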
To unmask, pass the token_value back to the unmask endpoint:
unmask_payload = {
    "unmask": [
        {
            "token_value": "<PER>005O 0BY</PER> lives at <ADDRESS>06N 00E1 00003b</ADDRESS>. His email is <EMAIL>3</EMAIL>"
        }
    ]
}

response = requests.put(
    "https://protecto-trial.protecto.ai/api/vault/unmask",
    headers={
        "Authorization": "Bearer <AUTH_TOKEN>",
        "Content-Type": "application/json"
    },
    json=unmask_payload
)
Unmask is role-gated. Only users or agents with the right access policy can retrieve the original values. Every call is logged for audit.
Where this fits in a RAG pipeline
The integration point is before data hits the vector store, and again before you send retrieved context to the LLM:
Raw documents
|
[MASK] <- Protecto API, at ingestion
|
Vector store (masked embeddings)
|
Retrieval
|
[MASK] <- Protecto API, on retrieved chunks at query time
|
LLM prompt (masked)
|
LLM response (masked)
|
[UNMASK] <- Protecto API, policy-checked, role-gated
|
Final response to authorized user
The LLM never sees real PII. Your vector store doesn't contain it. Your logs don't capture it. An authorized user gets the real data unmasked in the final response.
Everyone else gets tokens.
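Stitched together in code, the query-time half of that flow looks roughly like this. It's a minimal sketch, not a reference implementation: the endpoint URLs and payload shapes come from the examples earlier in this post, I'm assuming the unmask response mirrors the mask response, and `retrieve` and `call_llm` stand in for your own retrieval and model layers.

```python
import requests

API = "https://protecto-trial.protecto.ai/api/vault"
HEADERS = {"Authorization": "Bearer <AUTH_TOKEN>", "Content-Type": "application/json"}

def mask(texts):
    # Mask raw text before it reaches the vector store or the prompt.
    payload = {"mask": [{"value": t} for t in texts]}
    resp = requests.put(f"{API}/mask", headers=HEADERS, json=payload)
    return [item["token_value"] for item in resp.json()["data"]]

def unmask(token_values):
    # Role-gated: only callers whose access policy allows it get real values.
    payload = {"unmask": [{"token_value": t} for t in token_values]}
    resp = requests.put(f"{API}/unmask", headers=HEADERS, json=payload)
    return [item["value"] for item in resp.json()["data"]]

def answer(query, retrieve, call_llm):
    masked_chunks = mask(retrieve(query))           # mask retrieved chunks at query time
    masked_answer = call_llm(query, masked_chunks)  # the LLM only ever sees tokens
    return unmask([masked_answer])[0]               # policy-checked reveal for the caller
```

The important property is structural: there is no code path where raw PII and the LLM touch each other.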
The async path for batch workloads
Real-time masking is fine for live agent calls. For bulk jobs — generating embeddings from 50M records, running ETL pipelines — use async:
batch_payload = {
    "mask": [
        {"value": "He lives in the U.S.A"},
        {"value": "Ram lives in the U.S.A"}
    ]
}

response = requests.put(
    "https://protecto-trial.protecto.ai/api/vault/mask/async",
    headers={"Authorization": "Bearer <AUTH_TOKEN>"},
    json=batch_payload
)

tracking_id = response.json()["data"][0]["tracking_id"]
The job runs with autoscaling, Kafka integration for streaming pipelines, and built-in caching for repeated text patterns (useful when processing logs with the same patient ID across thousands of entries).
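The server-side cache targets exactly that repeated-pattern case. You can compound the win client-side by deduplicating values before submission. This is a generic pre-processing sketch, not part of Protecto's API:

```python
def dedupe_for_masking(values):
    """Collapse repeated strings into one masking request each, keeping
    an index map so results can be restored to the original order."""
    unique, index_of, order = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(unique)
            unique.append(v)
        order.append(index_of[v])
    return unique, order

def reassemble(masked_unique, order):
    # Expand masked results back to one entry per original input.
    return [masked_unique[i] for i in order]

unique, order = dedupe_for_masking(["id-42", "id-42", "id-7"])
print(unique)  # ['id-42', 'id-7'] (two values to mask instead of three)
```

On log-heavy workloads where the same identifiers recur thousands of times, this can shrink the payload dramatically before the API's own caching even kicks in.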
The edge case that surprised us most
Arabic-script numerals.
A Middle Eastern financial institution we worked with uses GPT-4o for financial summarization. Their input is mixed Arabic and English, and phone numbers often appear in Eastern Arabic digits (٠١٢٣٤٥٦٧٨٩ rather than 0123456789). Standard NER models, including AWS Comprehend, missed these almost entirely.
Building character-level patterns on top of the base model to handle this got them to 99% recall, 96% precision. It required treating the problem as multilingual NER, not just English PII detection.
If you're building for international deployments, this class of problem is worth solving before you hit production.
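To make the character-level idea concrete: Eastern Arabic digits are distinct Unicode code points (U+0660 through U+0669), so any pattern written against ASCII 0-9 simply never matches them. A minimal normalization sketch (an illustration of the mapping, not Protecto's implementation):

```python
# Eastern Arabic (Arabic-Indic) digits U+0660..U+0669 mapped onto ASCII 0-9.
TO_ASCII = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

def normalize_digits(text: str) -> str:
    """Rewrite Eastern Arabic digits as ASCII so phone-number patterns
    written for 0-9 can match mixed Arabic/English input."""
    return text.translate(TO_ASCII)

print(normalize_digits("٠١٢٣٤٥٦٧٨٩"))  # 0123456789
```

Normalization alone isn't the whole fix (you still need multilingual NER for names and addresses), but it closes the digit gap that pure-English detectors fall into.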
On GCP Marketplace
We just listed on Google Cloud Marketplace, which means if you're on GCP you can deploy Protecto Vault directly from your account without a separate procurement track. The APIs work with LangChain, Databricks, Snowflake, n8n, crewAI, and more.
Docs: docs.protecto.ai
Happy to answer questions in the comments about the architecture, tokenization approach, or how we handle semantic drift at scale.