DEV Community: Shashikiran ML

How We Stop PII From Leaking Through AI Pipelines (Without Breaking the LLM)

Shashikiran ML — Tue, 24 Mar 2026 15:41:03 +0000

Every AI pipeline tutorial shows you the happy path. Chunk your documents, embed them, stuff them into a context window, call your LLM, get a great answer.

None of them show what happens when your documents contain patient names, account numbers, or SSNs. At that point the demo breaks — and so does your compliance posture.

I've watched this play out repeatedly. Teams build a RAG pipeline, get it working, then legal or security asks a simple question: where exactly does the PII go?

The answer is usually everywhere. Into the embedding. Into the prompt. Logged by the orchestration framework. Possibly cached. Definitely inside the LLM API call leaving your network.

Here's how we think about the problem, and how we solve it.

The naive fix that makes things worse

First instinct: redaction. Find the PII, replace with [REDACTED], move on.

This breaks LLMs in a specific way that's easy to miss in testing and obvious in production.

Take this sentence:

Patient John Doe, DOB 04/12/1978, residing at 123 Main Street.
His email is john.doe@example.com and he was prescribed metformin.
Follow-up with Dr. Sarah Connors on 03/15/2025.

After naive redaction:

Patient [REDACTED], DOB [REDACTED], residing at [REDACTED].
His email is [REDACTED] and he was prescribed metformin.
Follow-up with [REDACTED] on [REDACTED].

Ask an LLM to summarize this. The model either hallucinates to fill the gaps or produces a hedged non-answer. Worse, if this record appears in two different chunks, [REDACTED] gives the model no way to know both refer to the same person.

Referential integrity is gone. The model can't reason across the context correctly.
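
The failure is easy to reproduce. Here's a toy redactor, not any real product's behavior: the `NAMES` list and the `redact` helper are made up for illustration. Once two different people are blanked to the same string, nothing in the text tells them apart.

```python
import re

# Hypothetical list of names a detector has already found in the text.
NAMES = ["John Doe", "Dr. Sarah Connors"]

def redact(text: str, entities: list[str]) -> str:
    """Replace every detected entity with the same opaque placeholder."""
    for entity in entities:
        text = re.sub(re.escape(entity), "[REDACTED]", text)
    return text

record = "Patient John Doe. Follow-up with Dr. Sarah Connors."
redacted = redact(record, NAMES)
print(redacted)  # Patient [REDACTED]. Follow-up with [REDACTED].

# Two placeholders, but only one distinct string: the text no longer
# says whether one person or two are involved.
placeholders = re.findall(r"\[REDACTED\]", redacted)
print(len(placeholders), len(set(placeholders)))  # 2 1
```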

What format-preserving tokenization does differently

Instead of blanking sensitive values, Protecto replaces them with typed, entity-scoped tokens that preserve meaning and referential integrity:

Patient <PER>005O 0BY</PER>, DOB <DATE_TIME>06N 00E1</DATE_TIME>,
residing at <ADDRESS>06N 00E1 00003b</ADDRESS>.
His email is <EMAIL>3</EMAIL> and he was prescribed metformin.
Follow-up with <PER>7H2K 9QR</PER> on <DATE_TIME>14P 88X2</DATE_TIME>.

The LLM sees a coherent record. It knows <PER>005O 0BY</PER> and <PER>7H2K 9QR</PER> are two distinct people. If the same patient appears in five chunks, their token is consistent across all of them.
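
One way to get that consistency is to derive each token deterministically from the entity type and value. The sketch below uses a keyed HMAC to do it. Treat it as an illustration of the idea only, not Protecto's actual token format or algorithm: `SECRET_KEY`, `make_token`, and the 8-character token length are all assumptions.

```python
import hashlib
import hmac

# Assumption: a secret key held by the masking service, never by the LLM.
SECRET_KEY = b"demo-key-do-not-use-in-production"

def make_token(entity_type: str, value: str) -> str:
    """Derive a stable, typed token for a sensitive value."""
    # Normalize so "John Doe" and " john doe " map to the same token.
    normalized = value.strip().lower().encode("utf-8")
    message = entity_type.encode("utf-8") + b":" + normalized
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"<{entity_type}>{digest[:8].upper()}</{entity_type}>"

# The same person gets the same token in every chunk.
chunk_a = f"Patient {make_token('PER', 'John Doe')} was prescribed metformin."
chunk_b = f"{make_token('PER', 'John Doe')} sees {make_token('PER', 'Sarah Connors')} next."
print(chunk_a)
print(chunk_b)
```

An HMAC is one-way, so a real system would still need a protected lookup table to unmask, and a token this short could collide at scale.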

Output quality holds. We track this with something we call RARI (Response Accuracy Retention Index): does the LLM still give accurate answers after masking? With typed tokens, yes. With [REDACTED], often no.
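
This post doesn't spell out a formula for RARI, so the definition below is an assumption: accuracy of the masked pipeline divided by accuracy of the unmasked baseline on the same question set. The function names and the toy answers are made up for illustration.

```python
def accuracy(answers: list[str], expected: list[str]) -> float:
    """Fraction of answers that match the expected answer (case-insensitive)."""
    if len(answers) != len(expected) or not expected:
        raise ValueError("answers and expected must be non-empty and the same length")
    correct = sum(a.strip().lower() == e.strip().lower() for a, e in zip(answers, expected))
    return correct / len(expected)

def retention_index(baseline: list[str], masked: list[str], expected: list[str]) -> float:
    """Masked-pipeline accuracy relative to the unmasked baseline (assumed definition)."""
    baseline_accuracy = accuracy(baseline, expected)
    if baseline_accuracy == 0:
        raise ValueError("baseline accuracy is zero; retention is undefined")
    return accuracy(masked, expected) / baseline_accuracy

# Toy data: four questions, answered with and without masking.
expected_answers = ["metformin", "sarah connors", "03/15/2025", "1978"]
baseline_answers = ["metformin", "Sarah Connors", "03/15/2025", "1978"]
masked_answers = ["metformin", "Sarah Connors", "03/15/2025", "unknown"]

print(retention_index(baseline_answers, masked_answers, expected_answers))  # 0.75
```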

The masking API in practice

Here's a basic scan-and-mask call using auto-detection:

import requests

payload = {
    "mask": [
        {
            "value": "John Doe lives at 123 Main Street. His email is john.doe@example.com"
        }
    ]
}

response = requests.put(
    "https://protecto-trial.protecto.ai/api/vault/mask",
    headers={
        "Authorization": "Bearer <AUTH_TOKEN>",
        "Content-Type": "application/json"
    },
    json=payload
)

print(response.json())

The response returns the original value alongside its masked token:

{
  "data": [
    {
      "value": "John Doe lives at 123 Main Street. His email is john.doe@example.com",
      "token_value": "<PER>005O 0BY</PER> lives at <ADDRESS>06N 00E1 00003b</ADDRESS>. His email is <EMAIL>3</EMAIL>"
    }
  ],
  "success": true,
  "error": { "message": "" }
}
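
In pipeline code you usually want just the masked strings, with failures surfaced early. Here's a minimal helper built on the response shape shown above. `extract_tokens` is a hypothetical name, and the way failures are reported (`success` false plus an `error.message`) is an assumption based on that one example.

```python
def extract_tokens(body: dict) -> list[str]:
    """Return the masked strings from a mask response, or raise on failure."""
    if not body.get("success"):
        message = body.get("error", {}).get("message") or "unknown error"
        raise RuntimeError(f"mask call failed: {message}")
    return [item["token_value"] for item in body.get("data", [])]

# The sample response from above, as a Python dict.
sample_response = {
    "data": [
        {
            "value": "John Doe lives at 123 Main Street. His email is john.doe@example.com",
            "token_value": "<PER>005O 0BY</PER> lives at <ADDRESS>06N 00E1 00003b</ADDRESS>. "
                           "His email is <EMAIL>3</EMAIL>",
        }
    ],
    "success": True,
    "error": {"message": ""},
}

masked_texts = extract_tokens(sample_response)
print(masked_texts[0])
```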

To unmask, pass the token_value back to the unmask endpoint:

unmask_payload = {
    "unmask": [
        {
            "token_value": "<PER>005O 0BY</PER> lives at <ADDRESS>06N 00E1 00003b</ADDRESS>. His email is <EMAIL>3</EMAIL>"
        }
    ]
}

response = requests.put(
    "https://protecto-trial.protecto.ai/api/vault/unmask",
    headers={
        "Authorization": "Bearer <AUTH_TOKEN>",
        "Content-Type": "application/json"
    },
    json=unmask_payload
)

Unmask is role-gated. Only users or agents with the right access policy can retrieve the original values. Every call is logged for audit.
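
To make the gating concrete, here's a toy version of the idea. It isn't how the Protecto service is implemented: `VAULT`, `ALLOWED_ROLES`, and `AUDIT_LOG` are hypothetical in-memory stand-ins for a token vault, a policy store, and an audit trail.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-ins for the vault, the policy, and the audit log.
VAULT = {"<PER>005O 0BY</PER>": "John Doe"}
ALLOWED_ROLES = {"clinician", "compliance"}
AUDIT_LOG: list[dict] = []

def unmask(token: str, user: str, role: str) -> str:
    """Return the original value only for permitted roles; log every attempt."""
    allowed = role in ALLOWED_ROLES
    AUDIT_LOG.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "token": token,
        "allowed": allowed,
    })
    # Denied callers get the token back unchanged, never the real value.
    return VAULT.get(token, token) if allowed else token

print(unmask("<PER>005O 0BY</PER>", "dr_lee", "clinician"))  # John Doe
print(unmask("<PER>005O 0BY</PER>", "intern_7", "analyst"))  # <PER>005O 0BY</PER>
```

Denied attempts are logged too, so the audit trail shows who tried to unmask as well as who succeeded.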

Where this fits in a RAG pipeline

There are two integration points: before data hits the vector store, and again before you send retrieved context to the LLM:

Raw documents
      |
   [MASK]           <- Protecto API, at ingestion
      |
Vector store (masked embeddings)
      |
Retrieval
      |
   [MASK]           <- Protecto API, on retrieved chunks at query time
      |
LLM prompt (masked)
      |
LLM response (masked)
      |
   [UNMASK]         <- Protecto API, policy-checked, role-gated
      |
Final response to authorized user

The LLM never sees real PII. Your vector store doesn't contain it. Your logs don't capture it. An authorized user gets the real data unmasked in the final response. Everyone else gets tokens.
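
Here's the same flow as a runnable toy. Everything in it is a stand-in: `toy_mask` and `toy_unmask` replace the API calls, `toy_llm` just echoes the first context line, and retrieval is word overlap rather than vector search. The point is only to show where masking and unmasking sit.

```python
from typing import Callable

TOKEN = "<PER>005O 0BY</PER>"

def toy_mask(text: str) -> str:
    # Stand-in for the mask API call; a no-op on already-masked text.
    return text.replace("John Doe", TOKEN)

def toy_unmask(text: str) -> str:
    # Stand-in for the role-gated unmask API call.
    return text.replace(TOKEN, "John Doe")

seen_prompts: list[str] = []

def toy_llm(prompt: str) -> str:
    # Records what the "LLM" saw, then echoes the first context line.
    seen_prompts.append(prompt)
    return prompt.splitlines()[1]

def build_index(docs: list[str], mask: Callable[[str], str]) -> list[str]:
    """Mask at ingestion so only tokenized text is stored or embedded."""
    return [mask(doc) for doc in docs]

def answer(query: str, index: list[str], authorized: bool) -> str:
    masked_query = toy_mask(query)
    # Stand-in retrieval: word overlap instead of vector similarity.
    terms = set(masked_query.lower().split())
    chunks = [c for c in index if terms & set(c.lower().split())]
    # Second mask pass on retrieved chunks, then build the prompt.
    context = "\n".join(toy_mask(c) for c in chunks)
    response = toy_llm(f"Context:\n{context}\n\nQuestion: {masked_query}")
    # Only authorized callers get real values back.
    return toy_unmask(response) if authorized else response

index = build_index(["John Doe was prescribed metformin."], toy_mask)
print(answer("What was John Doe prescribed?", index, authorized=True))
print(answer("What was John Doe prescribed?", index, authorized=False))
```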

The async path for batch workloads

Real-time masking is fine for live agent calls. For bulk jobs — generating embeddings from 50M records, running ETL pipelines — use async:

batch_payload = {
    "mask": [
        {"value": "He lives in the U.S.A"},
        {"value": "Ram lives in the U.S.A"}
    ]
}

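response = requests.put(
    "https://protecto-trial.protecto.ai/api/vault/mask/async",
    headers={"Authorization": "Bearer <AUTH_TOKEN>"},
    json=batch_payload
)
response.raise_for_status()  # fail fast on HTTP errors before reading the body

# Tracking ID returned for the submitted job
tracking_id = response.json()["data"][0]["tracking_id"]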

The job runs with autoscaling, Kafka integration for streaming pipelines, and built-in caching for repeated text patterns (useful when processing logs with the same patient ID across thousands of entries).
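
For bulk jobs you'll typically split records into several requests rather than send one huge payload. Here's a small batching helper that builds payloads in the shape used above. The default batch size of 500 is an arbitrary placeholder, since this post doesn't state any request size limits, and `batch_payloads` is a hypothetical name.

```python
from typing import Iterable, Iterator

def batch_payloads(values: Iterable[str], batch_size: int = 500) -> Iterator[dict]:
    """Yield mask payloads holding at most batch_size values each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: list[dict] = []
    for value in values:
        batch.append({"value": value})
        if len(batch) == batch_size:
            yield {"mask": batch}
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield {"mask": batch}

records = ["He lives in the U.S.A", "Ram lives in the U.S.A", "She moved to Canada"]
payloads = list(batch_payloads(records, batch_size=2))
print(len(payloads))  # 2
```

Because it works on any iterable, the helper can be fed from a file or database cursor without loading all 50M records into memory.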

The edge case that surprised us most

Arabic-script numerals.

A Middle Eastern financial institution we worked with uses GPT-4o for financial summarization. Their input is mixed Arabic and English, and phone numbers often appear in Eastern Arabic digits (٠١٢٣٤٥٦٧٨٩ rather than 0123456789). Standard NER models, including AWS Comprehend, missed these almost entirely.

Building character-level patterns on top of the base model to handle this got them to 99% recall, 96% precision. It required treating the problem as multilingual NER, not just English PII detection.
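
To give a feel for what a character-level pattern can look like, here's a minimal sketch. It isn't the production approach described above: it maps Eastern Arabic and Persian digits to ASCII one-for-one, then runs an ordinary phone regex. The `PHONE` pattern and the sample number are made up for illustration.

```python
import re

# Map Eastern Arabic (U+0660..U+0669) and Persian (U+06F0..U+06F9) digits to ASCII.
DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}
DIGIT_MAP.update({0x06F0 + i: str(i) for i in range(10)})

# Illustrative phone-like pattern: 9 to 16 digits, spaces, or dashes.
PHONE = re.compile(r"\+?[0-9][0-9\s\-]{7,14}[0-9]")

def find_phone_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of phone-like digit runs in the original text."""
    # The mapping is one character to one character, so offsets found in the
    # normalized copy are valid in the original string too.
    normalized = text.translate(DIGIT_MAP)
    return [m.span() for m in PHONE.finditer(normalized)]

# "0501234567" written in Eastern Arabic digits (escaped to avoid bidi rendering issues).
arabic_number = "\u0660\u0665\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667"
sample = f"Call {arabic_number} today"

print(PHONE.search(sample))      # None: the ASCII-only pattern misses it
print(find_phone_spans(sample))  # [(5, 15)]
```

Because the spans line up with the original text, the masking step can replace the Arabic-script digits directly without a separate alignment pass.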

If you're building for international deployments, this class of problem is worth solving before you hit production.

On GCP Marketplace

We just listed on Google Cloud Marketplace, which means if you're on GCP you can deploy Protecto Vault directly from your account without a separate procurement track. The APIs work with LangChain, Databricks, Snowflake, n8n, crewAI, and more.

Docs: docs.protecto.ai

Happy to answer questions in the comments about the architecture, tokenization approach, or how we handle semantic drift at scale.

Building Multi-Tenant AI SaaS Without the Data Privacy Nightmares

Shashikiran ML — Tue, 09 Dec 2025 16:21:19 +0000

You've built something cool. An AI agent that answers customer questions. A RAG system that extracts insights from documents. An LLM endpoint that your users love.

Then your CISO asks: "Where's the data protection?"

And you realize: You're shipping customer data through your system completely unmasked. It's in your logs. Your vector database. Your fine-tuning pipeline. Nowhere is it safe.

Now you have three options:

Buy an enterprise tool ($50K+/month, 3-month sales cycle) - Too expensive, too slow
Build your own masking solution (6+ months of engineering) - Too complex, too much maintenance
Find something built for developers (this is where Protecto SaaS comes in) - Fast, affordable, easy

This article walks through option 3. How to add production-grade PII masking to your AI stack in an afternoon.

Why is PII masking hard in AI?

Most data masking tools were built in the 1990s for enterprise data warehouses. They're designed for database admins and compliance officers. They require:

Infrastructure setup and management
Custom rule definition
Manual testing and validation
Vendor negotiations and contracts
3-month minimum commitments

Meanwhile, your AI stack moves at a different pace. You need to:

Add privacy in hours, not months
Integrate via API, not database connections
Pay for what you use, not reserved capacity
Use tools that understand your workflow (LangChain, LlamaIndex, Databricks, etc.)

The specific problem:

When you process customer data through an AI agent, that data needs to flow through multiple layers:

Input layer: Customer query with PII
Logging layer: Everything your agent does gets logged
Vector DB layer: Embeddings created from customer data
Fine-tuning layer: Training data with real customer information
Evaluation layer: Test sets with unmasked examples

Traditional masking tools can protect one or two layers. But they struggle with:

Unstructured text: Customer conversations, documents, support tickets
Context preservation: When you mask everything, you destroy data utility
Edge cases: Names hidden in unstructured data, informal identifiers
Performance: Traditional masking is slow (milliseconds matter in real-time)

The result: Most AI teams either ship unprotected (risky) or build custom masking (expensive).

Solution: How LLM-Based Detection Changes Everything

Here's the architecture we built at Protecto to solve this:

Layer 1: Intelligent PII Detection

Traditional approach: Regex patterns. Simple, fast, but misses 15-30% of actual PII.

Better approach: Combine LLMs + statistical validation.

Raw text: "John Smith from Acme Corp called about his account 123-45-6789"

Regex approach finds:

"123-45-6789" → SSN
Misses: "Acme Corp" (organization), "John Smith" (name, sometimes)

LLM approach finds:

"John Smith" → PERSON (98% confidence)
"Acme Corp" → ORG (99% confidence)
"123-45-6789" → SSN (99% confidence)
Validates each finding with a statistical model
Result: 99%+ accuracy
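
The regex half of that comparison is easy to reproduce. The patterns below are illustrative, not the ones any particular tool ships with. On the example sentence they find the SSN and nothing else.

```python
import re

# Illustrative patterns for two structured PII types.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def regex_detect(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for every pattern hit."""
    findings: list[tuple[str, str]] = []
    for label, pattern in PATTERNS.items():
        findings += [(label, m.group()) for m in pattern.finditer(text)]
    return findings

text = "John Smith from Acme Corp called about his account 123-45-6789"
findings = regex_detect(text)
print(findings)  # [('SSN', '123-45-6789')]
```

The name and the organization have no fixed shape for a pattern to match, which is the gap a model-based detector is meant to close.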

Why this matters: You catch edge cases that regex misses. You get high confidence scores. You reduce false positives.

Layer 2: Context-Aware Masking

Here's where most tools fail. They mask aggressively.

Before: "Patient John Smith has diabetes diagnosed in 2019 and takes metformin daily."

Traditional masking:
"Patient [PII] has [PII] diagnosed in [PII] and takes [PII] daily."
→ Completely useless for AI

Intelligent masking:
"Patient [PERSON] has diabetes diagnosed in 2019 and takes metformin daily."
→ AI still understands the context

The difference: Your LLM can work with masked data. It understands the structure. It knows there's a patient with a condition and a medication. Only the identifying detail (the patient's name) is masked; the clinical content and the semantic meaning are preserved.
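
Mechanically, this kind of masking replaces only the detected spans and leaves everything else untouched. Here's a minimal sketch, assuming a detector has already produced `(start, end, label)` spans. `mask_spans` is a hypothetical helper, and the span below is computed by hand.

```python
def mask_spans(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) span with a typed placeholder."""
    pieces: list[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported")
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[{label}]")        # typed placeholder for the span
        cursor = end
    pieces.append(text[cursor:])           # untouched tail
    return "".join(pieces)

text = "Patient John Smith has diabetes diagnosed in 2019 and takes metformin daily."
start = text.index("John Smith")
spans = [(start, start + len("John Smith"), "PERSON")]

print(mask_spans(text, spans))
# Patient [PERSON] has diabetes diagnosed in 2019 and takes metformin daily.
```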

Layer 3: Compliance & Control

Audit logging: Every operation tracked
Policy management: Define exactly what gets masked and how
Unmasking controls: Only authorized users can unmask specific records
Multi-tenancy: Customer data completely isolated
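
One simple way to picture tenant isolation is a vault whose every lookup is keyed by tenant. The sketch below is a toy with an in-memory dict, not Protecto's design. `TenantVault` and its token format are assumptions made for illustration.

```python
import secrets

class TenantVault:
    """Toy token vault that keeps a separate token map per tenant."""

    def __init__(self) -> None:
        self._forward: dict[tuple[str, str, str], str] = {}  # (tenant, label, value) -> token
        self._reverse: dict[tuple[str, str], str] = {}       # (tenant, token) -> value

    def mask(self, tenant: str, label: str, value: str) -> str:
        key = (tenant, label, value)
        if key not in self._forward:
            token = f"<{label}>{secrets.token_hex(4).upper()}</{label}>"
            self._forward[key] = token
            self._reverse[(tenant, token)] = value
        return self._forward[key]

    def unmask(self, tenant: str, token: str) -> str:
        # Lookups are keyed by tenant, so one tenant's token means nothing to another.
        try:
            return self._reverse[(tenant, token)]
        except KeyError:
            raise PermissionError("token not found for this tenant") from None

vault = TenantVault()
token_a = vault.mask("tenant_a", "PER", "John Smith")
print(vault.unmask("tenant_a", token_a))  # John Smith
```

Calling `vault.unmask("tenant_b", token_a)` raises `PermissionError`, even though the value exists in the vault.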

Real Numbers From Production

We've been running this with customers since June 2024:

Processing 50+ million API calls per month
99%+ accuracy on PII detection
Average real-time latency: 12ms; async processing: 30 seconds per 1M documents
Cost per million API calls: $15-50 depending on data complexity
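
To turn the per-million price into a budget figure, multiply it out for your own volume. The sketch below uses the $15-50 range quoted above. The 50M-call workload is a hypothetical example, not a quote for any specific customer.

```python
def monthly_cost_range(
    calls_per_month: int,
    low_per_million: float = 15.0,
    high_per_million: float = 50.0,
) -> tuple[float, float]:
    """Return the (low, high) monthly cost in dollars for a given call volume."""
    millions = calls_per_month / 1_000_000
    return millions * low_per_million, millions * high_per_million

low, high = monthly_cost_range(50_000_000)
print(f"${low:,.0f} to ${high:,.0f} per month")  # $750 to $2,500 per month
```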

Customer Results:

Series A fintech startup: Went from "we can't process customer data" to "training models on real masked data" in 48 hours.

Healthcare startup: Previously couldn't meet HIPAA requirements for unstructured text. Now processes patient notes with zero compliance risk.

Enterprise SaaS: Reduced privacy implementation time from 3 months (estimated) to 2 weeks.

How to Get Started

Visit https://portal.protecto.ai/
Sign up for a free account
Activate the account by email verification
Start using our API (it’s that simple)
No credit card for free tier. No long-term commitments.

Privacy doesn't have to slow you down. It can be as fast as the code you write.

The companies winning in 2026 will be the ones that built privacy in from day one, not as an afterthought.

Try Protecto SaaS free. See how fast you can add privacy to your AI.