N Jensen

Posted on May 21

The Hidden Privacy Problem in Every AI App

#datascience #machinelearning #agents #python

Every AI Product Has the Same Hidden Privacy Problem

The more useful the assistant becomes, the more sensitive the user data becomes.

A user writes something like:

I need to move my appointment from Tuesday to Friday because I had surgery last month and I’m still on medical leave.

My policy number is INS-48291.

The task is simple: reschedule an appointment. But the prompt also includes sensitive medical and employment-related details, plus an insurance policy number, that the model may not need to complete that task.

In many AI applications, that message is sent directly to a third-party LLM provider.

Why This Matters Now

In regulated environments, teams are expected to limit personal data processing, explain why data is needed, and apply privacy by design.

With LLMs, that becomes harder because users often paste sensitive information directly into prompts, even when that information is not required for the task.

The risk is not that every AI app will receive a massive fine.

The real risk is losing control over where sensitive data goes, how long it is stored, whether it appears in logs or traces, and whether the company can prove that the model actually needed that data in the first place.

Companies want to use LLMs in customer support, healthcare, finance, legal workflows, HR, internal tools, and enterprise automation.

But these are exactly the environments where sensitive data appears naturally in conversations.

So the question is not:

Can we remove all personal data?

The real question is:

Can we protect personal data while keeping the AI useful?

This is the problem I tried to solve with PII Firewall, an open-source privacy layer for AI applications.

PII Firewall behaves as a stateful proxy that sits between your application and any LLM provider. It detects sensitive information, replaces it with safe tokens before the model call, and restores the original values only inside your trusted environment.

It is model-agnostic by design, works across 55+ languages, and lets teams combine multiple PII detection techniques through a simple, unified framework.

The model still gets the context it needs. But it never needs to know who the user really is.

Redaction Is Not Enough

The most common approach to privacy is redaction.

Take this input:

My name is Maria Perez and my email is maria@example.com.

A simple redaction system might turn it into:

My name is [REDACTED] and my email is [REDACTED].

That protects the data, but it also destroys useful context.

If the model later responds with:

I have updated the record for [REDACTED].

The final answer is technically private, but not useful. The model does not know what was redacted. Was it a person, a company, an email address, or a case reference?

And in a multi-turn conversation, a stateless redaction layer cannot reliably preserve who or what each placeholder refers to across messages.

The Core Idea: Pseudonymize, Reason, Rehydrate

PII Firewall uses reversible pseudonymization.

Instead of deleting sensitive values, it replaces them with safe tokens:

John Doe         → PERSON_1
john@example.com → EMAIL_1

So the LLM receives:

Hi, I'm PERSON_1. My email is EMAIL_1.
Can you help me update my insurance claim?

The model can still understand that there is a person, an email address, and a task.

It can respond naturally:

Sure, PERSON_1. I can help you update the account linked to EMAIL_1.

Then PII Firewall rehydrates the response inside your trusted environment:

Sure, John Doe. I can help you update the account linked to john@example.com.

The user gets a personalized answer, but the LLM provider never receives the raw personal data.
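The round trip above can be sketched in a few lines of Python. This is a minimal illustration of reversible pseudonymization, not PII Firewall's actual implementation: the function names `pseudonymize` and `rehydrate` are hypothetical, emails are found with a simple regex, and names come from a caller-supplied list instead of a real detector.

```python
import re

# Simplified email pattern, good enough for illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def pseudonymize(text, known_names):
    """Replace emails and known names with numbered tokens.

    Returns the sanitized text and a token -> original value mapping.
    """
    mapping = {}   # token -> original value
    reverse = {}   # original value -> token, so repeats reuse the same token
    counters = {}  # entity kind -> last number used

    def token_for(kind, value):
        if value not in reverse:
            counters[kind] = counters.get(kind, 0) + 1
            token = f"{kind}_{counters[kind]}"
            reverse[value] = token
            mapping[token] = value
        return reverse[value]

    text = EMAIL_RE.sub(lambda m: token_for("EMAIL", m.group()), text)
    for name in known_names:
        text = re.sub(re.escape(name),
                      lambda m: token_for("PERSON", m.group()), text)
    return text, mapping

def rehydrate(text, mapping):
    """Put the original values back into a model response."""
    # Longer tokens first, so PERSON_10 is not partially replaced by PERSON_1.
    for token in sorted(mapping, key=len, reverse=True):
        text = text.replace(token, mapping[token])
    return text
```

The mapping never leaves the calling process, which is what makes the replacement reversible only inside the trusted environment.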

A Flexible Privacy Framework

PII Firewall is not the first attempt to protect LLM calls from sensitive data exposure.

Privacy proxies, PII redaction tools, anonymization layers, and AI security gateways are becoming part of the production LLM stack.

But in practice, privacy is rarely solved with a single detector, a single language, or a single rule.

Some applications need fast pattern matching for emails, phone numbers, credit cards, IBANs, or internal IDs. Others need language-aware models to detect names, locations, organizations, dates, or contextual references. Some teams need healthcare-specific detection, while others need finance, legal, HR, customer support, or internal enterprise rules.

PII Firewall lets teams:

Choose the LLM provider
Combine different PII detection techniques
Adapt detection to multiple languages
Define different actions per entity
Keep reversible mappings in a scoped vault

In other words, the goal is not to provide one fixed privacy policy, but a framework for defining the right privacy behavior for each application.

Model-Agnostic by Design

PII Firewall can sit in front of OpenAI, Anthropic, or any provider that accepts text input and returns text output.

This matters because AI infrastructure changes quickly. A team may start with one provider, later route workloads to another, or eventually move some models in-house.

Your privacy layer should not have to be rewritten every time your model strategy changes.

PII Firewall is designed as middleware: a thin privacy boundary between your application and whichever model you choose.
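One way to picture that boundary: treat every provider as a plain text-in, text-out callable and wrap it. The sketch below is an assumption about the shape of such middleware, not PII Firewall's API. `guarded_call`, `sanitize`, `restore`, and the two fake providers are hypothetical names, and the token mapping is hard-coded to keep the example short.

```python
from typing import Callable, Dict, List

# Any provider, wrapped as "text in, text out".
LLM = Callable[[str], str]

def sanitize(prompt: str, secrets: Dict[str, str]) -> str:
    """Swap each known sensitive value for its token."""
    for token, value in secrets.items():
        prompt = prompt.replace(value, token)
    return prompt

def restore(reply: str, secrets: Dict[str, str]) -> str:
    """Swap tokens back to the original values."""
    for token, value in secrets.items():
        reply = reply.replace(token, value)
    return reply

def guarded_call(llm: LLM, prompt: str, secrets: Dict[str, str]) -> str:
    """Privacy boundary: the provider only ever sees the sanitized prompt."""
    return restore(llm(sanitize(prompt, secrets)), secrets)

# Two stand-in providers that record what they were sent.
seen: List[str] = []

def provider_a(prompt: str) -> str:
    seen.append(prompt)
    return f"A noted: {prompt}"

def provider_b(prompt: str) -> str:
    seen.append(prompt)
    return f"B noted: {prompt}"
```

Switching from `provider_a` to `provider_b` changes one argument to `guarded_call`; the sanitize and restore steps stay untouched.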

Detector-Agnostic Too

PII detection is not a solved problem with a single perfect technique.

Some sensitive data is structured and easy to detect:

john@example.com
+34 600 123 456
4242 4242 4242 4242

Rule-based patterns work very well for these cases.
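As a rough sketch of what rule-based detection looks like, the snippet below matches the three examples above with regular expressions and adds a Luhn checksum so that arbitrary digit runs are not flagged as cards. The patterns are simplified assumptions for illustration, not the rules PII Firewall ships with.

```python
import re

PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # International format: "+", country code, then 2 to 4 digit groups.
    "PHONE": re.compile(r"\+\d{1,3}(?:[ -]?\d{2,4}){2,4}"),
    # 13 to 19 digits, optionally separated by spaces or hyphens.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
}

def luhn_ok(number: str) -> bool:
    """Luhn checksum, used to reject digit runs that are not card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def detect(text: str):
    """Return (kind, matched_text) pairs for every pattern hit."""
    found = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            if kind == "CARD" and not luhn_ok(m.group()):
                continue
            found.append((kind, m.group()))
    return found
```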

Other data is contextual:

John lives near the hospital and spoke with Dr. Martinez last Friday.

Here, names, roles, dates, and locations may need language understanding.

PII Firewall supports multiple detection engines. You can use:

Simple rule-based patterns when you want speed and precision
NER-based detection when you need context
Transformer models for more specialized domains
Hybrid mode to combine several approaches
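A hybrid mode has to decide what happens when two engines flag overlapping text. The sketch below shows one possible policy, "longest span wins", under stated assumptions: the `Span` type, the detector functions, and `hybrid_detect` are hypothetical, and the two regex-based name detectors merely stand in for a real NER or transformer model.

```python
import re
from typing import Callable, List, NamedTuple

class Span(NamedTuple):
    start: int
    end: int
    kind: str

Detector = Callable[[str], List[Span]]

def email_rules(text: str) -> List[Span]:
    pattern = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
    return [Span(m.start(), m.end(), "EMAIL") for m in pattern.finditer(text)]

def titled_names(text: str) -> List[Span]:
    # Stand-in for a context-aware model: catches "Dr. Surname".
    pattern = re.compile(r"\bDr\. [A-Z][a-z]+")
    return [Span(m.start(), m.end(), "PERSON") for m in pattern.finditer(text)]

def capitalized_words(text: str) -> List[Span]:
    # Deliberately naive second opinion that overlaps with titled_names.
    pattern = re.compile(r"\b[A-Z][a-z]+\b")
    return [Span(m.start(), m.end(), "PERSON") for m in pattern.finditer(text)]

def hybrid_detect(text: str, detectors: List[Detector]) -> List[Span]:
    """Run every detector, then keep the longest non-overlapping spans."""
    spans = [s for detect in detectors for s in detect(text)]
    spans.sort(key=lambda s: (-(s.end - s.start), s.start))
    kept: List[Span] = []
    for s in spans:
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)
```

Other merge policies are equally plausible, for example preferring the engine with the higher confidence score; the point is only that combining detectors needs an explicit rule for conflicts.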

Built for More Than English

A lot of privacy tooling works well in English and then quietly breaks down elsewhere.

Real users write in Spanish, French, Italian, Portuguese, German, Arabic, Japanese, and many other languages. They use different name structures, address formats, national identifiers, phone formats, bank account formats, and local conventions.

PII Firewall includes language-aware routing, so detection can adapt to the language and region of the text.

This allows the system to apply the right patterns and models for the context instead of relying on a single English-centric detector.
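Here is a small sketch of what routing by locale could look like. It assumes the caller already knows the locale (no language detection is attempted), and the pattern table is a hypothetical, heavily simplified one: a Spanish DNI is matched as eight digits plus a letter without validating the check letter, and a US Social Security number is matched by shape only.

```python
import re

# Patterns that apply regardless of locale.
COMMON = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}

# Locale-specific patterns (simplified shapes, no checksum validation).
BY_LOCALE = {
    "es-ES": {"NATIONAL_ID": re.compile(r"\b\d{8}[A-Z]\b")},
    "en-US": {"NATIONAL_ID": re.compile(r"\b\d{3}-\d{2}-\d{4}\b")},
}

def detect(text: str, locale: str):
    """Apply the common patterns plus those registered for the locale."""
    patterns = {**COMMON, **BY_LOCALE.get(locale, {})}
    hits = [
        (kind, m.group())
        for kind, pattern in patterns.items()
        for m in pattern.finditer(text)
    ]
    return sorted(hits)
```

The same Spanish sentence yields a national ID hit under `es-ES` and only the email under `en-US`, which is exactly the failure mode an English-centric detector would have.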

For example, a privacy layer used in Europe should understand IBANs, local phone numbers, national identifiers, and multilingual names.

A global AI product cannot treat privacy as an English-only problem.

Different Domains Need Different Privacy Behavior

Different industries have different privacy needs.

A healthcare assistant should preserve clinical utility while protecting patient identity.

A finance assistant should protect payment data while keeping enough context for analysis.

A legal assistant may need to preserve case references while anonymizing party names and addresses.

PII Firewall includes domain-specific presets for common scenarios such as:

Healthcare
Finance
Legal
Generic

These presets are not meant to be rigid. They are starting points.

Teams can override specific behaviors, add their own entity types, or define new profiles for their own internal data.

For example, a company may want to detect:

Employee IDs
Customer IDs
Ticket numbers
Contract references
Internal project codes

Those identifiers may not be “standard PII”, but they can still be sensitive inside a business context.

The same applies to how different values are handled.

Not all sensitive data should be handled the same way:

A person's name might need to be pseudonymized so the model can continue referencing that person.
A credit card number should usually be masked.
A precise date of birth might be generalized.
A value that should never be exposed may need to be fully redacted.

PII Firewall supports different actions depending on the type of data:

John Doe                   → PERSON_1
john@example.com            → EMAIL_1
4242 4242 4242 4242         → ************4242
17 March 1984               → 1980s
Highly sensitive value      → [REDACTED]

Privacy is not one-size-fits-all. The right behavior depends on the data type, the use case, and the risk profile.
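The non-reversible actions in that table are easy to sketch as a dispatch from entity type to a handler. The entity names and function names below are assumptions made for illustration, and the pseudonymize action is left out because it needs the stateful mapping discussed in the next section.

```python
import re

def mask_card(value: str) -> str:
    """Keep the last four digits, mask the rest."""
    digits = re.sub(r"\D", "", value)
    return "*" * (len(digits) - 4) + digits[-4:]

def generalize_date(value: str) -> str:
    """Reduce a date to its decade; redact if no four-digit year is found."""
    match = re.search(r"\b(\d{4})\b", value)
    if not match:
        return "[REDACTED]"
    year = int(match.group(1))
    return f"{year // 10 * 10}s"

def redact(value: str) -> str:
    return "[REDACTED]"

# Hypothetical entity kinds mapped to their handling policy.
ACTIONS = {
    "CARD": mask_card,
    "DATE_OF_BIRTH": generalize_date,
    "SECRET": redact,
}

def apply_action(kind: str, value: str) -> str:
    # Unknown entity kinds fall back to full redaction (fail closed).
    return ACTIONS.get(kind, redact)(value)
```

Falling back to redaction for unknown kinds is a design choice in this sketch: a new entity type is hidden until someone explicitly decides how it should be handled.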

Stateful Privacy

One of the most important parts of PII Firewall is that it is stateful.

A stateless filter can remove data, but it cannot remember mappings like these across a conversation:

PERSON_1 = John Doe
EMAIL_1  = john@example.com
CASE_1   = Insurance Claim 8821

PII Firewall keeps this mapping in a vault scoped to the right context.

That context can include things like:

Organization
User
Case
Thread

This makes it possible to track mappings safely and restore values only when appropriate.

It also makes compliance workflows easier.

For example, if a user submits a right-to-be-forgotten (erasure) request, the application can purge the mappings for that user, case, or session.

The goal is to manage sensitive data across its lifecycle, not just remove it from a single prompt.
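To make the idea of a scoped vault concrete, here is a minimal in-memory sketch. It is an assumption about the general shape, not PII Firewall's storage layer: `ScopedVault` and its methods are hypothetical names, the scope is reduced to a `(tenant_id, actor_id, thread_id)` tuple, and a production vault would also need persistence, encryption, and access control.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

# Simplified scope: (tenant_id, actor_id, thread_id).
Scope = Tuple[str, str, str]

class ScopedVault:
    """Keeps token mappings separate per scope and supports purging."""

    def __init__(self) -> None:
        self._forward: Dict[Scope, Dict[str, str]] = defaultdict(dict)  # value -> token
        self._reverse: Dict[Scope, Dict[str, str]] = defaultdict(dict)  # token -> value

    def tokenize(self, scope: Scope, kind: str, value: str) -> str:
        """Return the existing token for a value, or mint the next one."""
        forward = self._forward[scope]
        if value not in forward:
            used = sum(1 for t in self._reverse[scope] if t.startswith(kind + "_"))
            token = f"{kind}_{used + 1}"
            forward[value] = token
            self._reverse[scope][token] = value
        return forward[value]

    def resolve(self, scope: Scope, token: str) -> Optional[str]:
        """Look up a token within one scope only; None if unknown or purged."""
        return self._reverse.get(scope, {}).get(token)

    def purge_user(self, tenant_id: str, actor_id: str) -> int:
        """Delete every mapping for one user; returns the scopes removed."""
        doomed = [s for s in self._forward if s[:2] == (tenant_id, actor_id)]
        for scope in doomed:
            del self._forward[scope]
            self._reverse.pop(scope, None)
        return len(doomed)
```

Because lookups are keyed by scope, a token such as `PERSON_1` in one thread can never be rehydrated with a value from another user's thread, and an erasure request becomes a single purge call.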

Example

Here is the basic idea in code:

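from privacy_firewall import PrivacyFirewallSDK

# Placeholder inputs for illustration: swap in your real prompt and LLM client.
user_prompt = "Hi, I'm John Doe. My email is john@example.com."

def my_llm(prompt: str) -> str:
    # Stand-in for a call to your LLM provider; it only sees sanitized text.
    return f"Received: {prompt}"

# Create a firewall with the healthcare preset and hybrid detection.
firewall = PrivacyFirewallSDK.create(
    domain="healthcare",
    detector_backend="hybrid",
)

# Context that scopes the vault mappings for this conversation.
context = {
    "tenant_id": "acme-corp",
    "case_id": "case-8821",
    "thread_id": "thread-001",
    "actor_id": "user-42",
}

# Replace sensitive values with tokens before the model call.
sanitized = firewall.anonymize_text(
    text=user_prompt,
    context=context,
)

# The provider receives only the tokenized prompt.
response = my_llm(
    sanitized.sanitized_text,
)

# Restore the original values inside your trusted environment.
final = firewall.rehydrate_text(
    text=response,
    context=context,
)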

Conclusion

As AI moves into production, privacy cannot remain an afterthought, especially as regulation becomes stricter in Europe.

The safest personal data to send to an LLM is the data it never receives.

PII Firewall is an open-source project. If you are building AI products in healthcare, finance, legal tech, customer support, internal automation, or enterprise SaaS, I would love feedback, issues, and contributions from people working on real-world systems.
