**TL;DR:** We ran 109 tests across GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro to measure how different PII protection methods affect LLM output quality. Placeholder masking ([PERSON], [SSN]) dropped output quality to 54-68%. Deterministic tokenization (each entity gets its own unique opaque token) preserved 91-96%. We also found that leaving PII labels like "SSN" next to tokenized values causes safety refusals in 15-20% of cases. We built NoPII based on these findings: a reverse proxy that tokenizes PII before prompts reach the model and detokenizes responses on the way back. One base_url change in your existing SDK. Free tier, no credit card. Full paper here: Link
If you are building anything on top of LLM APIs that touches real user data, you have probably had the conversation. The one where the prototype works, the team is excited, and then someone from security or legal asks what exactly is being sent to OpenAI or Anthropic or whichever provider you are using.
That question tends to stall projects for weeks. Sometimes permanently.
We kept running into it ourselves while building AI features for clients in healthcare and financial services. So instead of treating PII protection as a side task to figure out later, we decided to study the problem properly and then build something based on what we found.
This post is about both: the research and the tool.
The question that started this
When you need to keep PII out of LLM prompts, the obvious approach is to detect sensitive entities and replace them with something safe before sending the prompt to the model.
But "replace them with something safe" is where things get interesting, because the replacement strategy you choose has a direct impact on whether the model's response is still useful.
Most guides and most tools default to placeholder masking. You scan the text, find "John Smith," and replace it with [PERSON]. Find "4111-1111-1111-1111" and replace it with [CREDIT_CARD]. It is simple, it is intuitive, and it is what you will land on if you reach for any open-source NER library or most AI gateway guardrails.
The alternative is deterministic tokenization. Instead of generic placeholders, each detected value gets its own unique opaque token. "John Smith" becomes PERSON_a8k2. "Jane Doe" becomes PERSON_m3x9. Same person, same token, every time. Different people, different tokens.
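To make the contrast concrete, here is a minimal in-memory sketch of deterministic tokenization. This is an illustration of the idea only, not NoPII's implementation (which is vault-backed, not a Python dict), and the token format shown is an assumption:

```python
import secrets

class DeterministicTokenizer:
    """Toy sketch: same value always maps to the same opaque token."""

    def __init__(self):
        self._forward = {}  # (entity_type, value) -> token
        self._reverse = {}  # token -> original value

    def tokenize(self, entity_type: str, value: str) -> str:
        key = (entity_type, value)
        if key not in self._forward:
            # Mint a new opaque token the first time we see this value.
            token = f"{entity_type}_{secrets.token_hex(2)}"
            self._forward[key] = token
            self._reverse[token] = value
        return self._forward[key]

    def detokenize(self, token: str) -> str:
        return self._reverse[token]

t = DeterministicTokenizer()
a = t.tokenize("PERSON", "John Smith")
b = t.tokenize("PERSON", "John Smith")  # same person -> same token
c = t.tokenize("PERSON", "Jane Doe")    # different person -> different token
```

The invariant that matters downstream is `a == b` and `a != c`: the model can still distinguish and track entities without ever seeing the real values.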
The difference sounds minor until you think about what the model is actually doing with the data.
Why it matters: a concrete example
Imagine a prompt that contains a customer support transcript with three participants: a customer, a support agent, and a supervisor who joins mid-conversation.
With placeholder masking, all three names become [PERSON]. The model receives a transcript where every speaker attribution is identical.
It cannot tell who escalated the issue, who offered the resolution, or who the customer was. The summary it generates is vague at best, wrong at worst.
With deterministic tokenization, each person has a distinct token. The model can track the flow of the conversation, correctly attribute statements, and produce a summary that is actually useful. It just never sees any real names.
This is not a hypothetical. This is the exact kind of prompt that breaks in production when teams ship masking-based PII protection without testing the downstream effects.
What we tested
We wanted to put actual numbers on this, so we built a structured test suite.
We sent each of the 109 prompts through three parallel pipelines:
Raw - the original prompt with real PII, sent directly to the model (baseline)
Masked - PII replaced with generic [TYPE] placeholders
Tokenized - PII replaced with deterministic opaque tokens
Each prompt was evaluated across GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. We scored on four dimensions: factual accuracy, entity consistency (can the model keep track of who is who), reasoning coherence, and practical usability (could you actually hand this output to an end user).
The prompt types covered multi-person HR summaries, financial transaction analysis, medical record summarization, customer support conversations, and legal document review. These are not toy examples. They are the actual workflows where PII shows up in production.
The results, briefly
Tokenized prompts preserved 91-96% of raw output quality across all three models. Entity relationships held. Reasoning chains stayed intact. The model performed almost identically to receiving the original prompt with real PII.
Masked prompts dropped to 54-68% quality. The steepest declines were in entity consistency and reasoning coherence, exactly as you would expect. When distinct entities collapse into identical placeholders, the model's ability to reason over relationships falls apart.
The gap scaled with prompt complexity. Single-entity prompts (one name, one number) worked fine with either method. But real-world prompts almost never have just one entity. A healthcare intake form might contain a patient name, a referring physician, a pharmacy, a date of birth, and an insurance ID. Mask all of those with generic labels and the model is trying to summarize a document where everything important has been removed.
The problem nobody talks about: safety refusals
We found something during testing that we had not anticipated.
When you tokenize the value of a PII field but leave the label intact, the model sometimes refuses to process the request. For example, the prompt might contain: "The customer's SSN is IDENTIFIER_k8m2." The model sees the word "SSN" next to an opaque string and its safety filter interprets the combination as suspicious. It declines to generate a response.
This happened in roughly 15-20% of our tests involving certain entity types: Social Security numbers, credit card numbers, and passport numbers. It was consistent across all three models.
This is a subtle failure mode. Your PII protection layer is technically working correctly. The data is safely tokenized. But the model refuses to cooperate because the context label next to the tokenized value triggers a safety gate.
We ended up calling this "context phrase neutralization." The fix is to replace the label alongside the value. Instead of "SSN: IDENTIFIER_k8m2," the prompt becomes "Reference number: IDENTIFIER_k8m2." The model processes it normally. No safety refusal. No degraded output.
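A minimal sketch of context phrase neutralization as described above. The specific label-to-replacement mapping here is hypothetical; the real mapping would come from the entity types your detector supports:

```python
import re

# Hypothetical label rewrites. The point is that the sensitive *label*
# is replaced alongside the tokenized value, not just the value itself.
NEUTRAL_LABELS = {
    r"\bSSN\b": "Reference number",
    r"\bSocial Security number\b": "Reference number",
    r"\bcredit card number\b": "Account reference",
    r"\bpassport number\b": "Document reference",
}

def neutralize_context(text: str) -> str:
    """Rewrite sensitive field labels so a label + opaque token pair
    does not trip the model's safety filter."""
    for pattern, replacement in NEUTRAL_LABELS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

prompt = "The customer's SSN is IDENTIFIER_k8m2."
print(neutralize_context(prompt))
# -> The customer's Reference number is IDENTIFIER_k8m2.
```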
As far as we can tell, this is not addressed by any other tool or guide in this space. It came directly from the test data.
What we built
The research made it clear that deterministic tokenization was the right approach, but the engineering gap between "I can detect PII and replace it with tokens" and "I have a production-ready privacy layer for LLM APIs" is significant.
We needed to solve for detection, tokenization, vault-backed token storage, response detokenization, streaming support, context phrase neutralization, audit logging, and fail-safe behavior. And we needed all of it to work transparently, without requiring application developers to change their prompt construction or response parsing logic.
That is what NoPII is.
It sits between your application and the LLM provider as a reverse proxy. You point your existing SDK at NoPII instead of directly at the provider. NoPII handles the rest.
```python
from openai import OpenAI

# Before
client = OpenAI(api_key="sk-...")

# After: same SDK, pointed at the NoPII proxy
client = OpenAI(
    api_key="sk-...",
    base_url="https://your-nopii-instance.com/v1",
)
```
Your application code does not change beyond that. Prompts go in with real data, get tokenized in transit, reach the model without PII, and come back detokenized so your app receives the real response.
How we approached the engineering
Rather than building a detection engine from scratch, we started with a proven open-source NER framework and layered our own detection pipeline on top of it. The NER layer handles the initial entity recognition, but raw NER output is noisy. It produces false positives, misses context-dependent PII, and does not understand the difference between a name in a sentence and a name in a data field. We added confidence thresholds, entity-type-specific tuning, and the context phrase neutralization logic on top of the base detection.
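The filtering step on top of raw NER output can be sketched like this. The threshold values and entity-type names below are illustrative assumptions, not NoPII's actual tuning:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    entity_type: str
    text: str
    confidence: float

# Hypothetical per-type thresholds: high-risk identifiers get low bars
# (prefer false positives over leaks); common words like names and dates
# need more confidence to avoid mangling ordinary prose.
THRESHOLDS = {"SSN": 0.35, "CREDIT_CARD": 0.35, "PERSON": 0.6, "DATE": 0.7}

def filter_detections(detections, default_threshold=0.5):
    """Keep only detections that clear their entity-type threshold."""
    return [
        d for d in detections
        if d.confidence >= THRESHOLDS.get(d.entity_type, default_threshold)
    ]
```

The asymmetry is the design choice: for an SSN-shaped string, a false positive costs a slightly odd token in the prompt; a false negative leaks PII.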
The tokenization layer is where things get more interesting. We did not build a hash map or an in-memory lookup. NoPII's tokenization is backed by a PCI Level 1 and SOC2 certified vault infrastructure that we had already built and operated for payment card tokenization. That existing vault is what gives us deterministic, reversible tokenization with real key management, not a weekend project bolted onto an LLM proxy.
For streaming, we had to solve a specific technical problem: LLM responses delivered via Server-Sent Events arrive fragment by fragment, often splitting a token mid-word. The proxy needs to buffer intelligently, detect token boundaries across fragments, detokenize in-flight, and forward the stream without adding visible latency. This was one of the harder engineering problems in the project and it works for both OpenAI and Anthropic streaming APIs.
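The core of that buffering problem can be sketched as follows: hold back any trailing text that could still grow into a complete token, emit everything else detokenized. The token format and vault interface here are assumptions for illustration, not NoPII's internals:

```python
import re

TOKEN_RE = re.compile(r"(?:PERSON|IDENTIFIER)_[0-9a-z]{4}")
PREFIXES = ("PERSON_", "IDENTIFIER_")

def _token_prefix_len(text: str) -> int:
    """Length of the longest suffix of `text` that could still grow
    into a complete token on the next fragment."""
    max_token = max(len(p) for p in PREFIXES) + 4
    for n in range(min(len(text), max_token - 1), 0, -1):
        tail = text[-n:]
        for p in PREFIXES:
            if p.startswith(tail):
                return n  # e.g. "PERSO" could become "PERSON_a8k2"
            if tail.startswith(p) and re.fullmatch(r"[0-9a-z]{0,3}", tail[len(p):]):
                return n  # e.g. "PERSON_a8" is still incomplete
    return 0

class StreamDetokenizer:
    """Sketch: reassemble tokens split across streamed fragments."""

    def __init__(self, vault):
        self.vault = vault  # token -> original value
        self.buffer = ""

    def _sub(self, text: str) -> str:
        return TOKEN_RE.sub(lambda m: self.vault.get(m.group(0), m.group(0)), text)

    def feed(self, fragment: str) -> str:
        self.buffer += fragment
        hold = _token_prefix_len(self.buffer)
        cut = len(self.buffer) - hold
        emit, self.buffer = self.buffer[:cut], self.buffer[cut:]
        return self._sub(emit)

    def flush(self) -> str:
        emit, self.buffer = self.buffer, ""
        return self._sub(emit)

vault = {"PERSON_a8k2": "John Smith"}
sd = StreamDetokenizer(vault)
out = sd.feed("Summary: PERSO")   # partial token held back
out += sd.feed("N_a8k2 called.")  # token completes, gets detokenized
out += sd.flush()
# out == "Summary: John Smith called."
```

A production version also has to bound how long text is held so the hold-back never becomes visible latency, which is where most of the difficulty lives.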
Fail-safe behavior is baked in by default. If tokenization fails for any reason, the request is blocked. PII does not leak. This is not a toggle. In regulated environments, fail-open is not a defensible design choice.
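The fail-closed contract is simple to state in code. A minimal sketch, assuming a `tokenize` callable that raises on any failure:

```python
class TokenizationError(Exception):
    """Raised instead of ever forwarding a prompt that failed tokenization."""

def protect_or_block(prompt: str, tokenize) -> str:
    """Fail-closed: if tokenization fails for any reason, block the
    request entirely rather than falling back to the raw prompt."""
    try:
        return tokenize(prompt)
    except Exception as exc:
        # No fail-open path here by design.
        raise TokenizationError("PII tokenization failed; request blocked") from exc
```

The fail-open alternative would be `except Exception: return prompt`, which is exactly one outage away from leaking PII.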
What you can actually do with it
The use cases are all variations of the same pattern: teams that want to use LLMs with data that contains sensitive information.
A healthcare team summarizing clinical notes that contain patient names and medical record numbers. A fintech company processing transaction descriptions that contain account numbers and counterparty details. An HR team using AI to analyze employee feedback that contains names and performance data. A legal team running contract analysis on documents that contain party names, addresses, and financial terms.
In each case, the team has a working AI workflow that is blocked or delayed because of the PII exposure question. NoPII is designed to make that question answerable without a multi-month infrastructure project.
It works with OpenAI, Anthropic, Google Gemini, xAI Grok, DeepSeek, Mistral, Groq, Together AI, and Fireworks. One proxy endpoint for all of them.
There is an admin dashboard for configuring detection thresholds, testing PII detection on sample text without writing code, and reviewing audit logs. Every detection is logged with entity type, confidence score, timestamp, provider, and model, which gives compliance teams the paper trail they need to sign off.
Try it or read the paper
If the research findings are useful to you on their own, the full paper is here: LINK
If you want to try NoPII, there is a free tier at app.nopii.co. No credit card. No sales call. You can be running tokenized prompts against any supported provider in under five minutes.
The website is www.nopii.co and docs are at docs.nopii.co.
If you are dealing with PII in LLM workflows, whether you have a solution in place or are still figuring it out, I would like to hear what your experience has been. What has worked? What has been frustrating? Happy to discuss in the comments.