Every executive team has now had the same uncomfortable meeting. Engineering wants to use Claude for code review. Sales wants GPT-4 to draft proposals. Customer support has been quietly piping tickets into a chatbot through someone's personal API key. Legal walks in, asks one question — "where is that data going?" — and the whole program freezes.
The freeze is rational. The frontier models do live on someone else's infrastructure. Your customer records, M&A drafts, source code, and medical histories are exactly the data you cannot ship to a third party. Yet the productivity gap between teams that have integrated AI well and teams that haven't is now the difference between weeks and quarters.
The usual answer — "self-host an open model" — costs millions, requires a team you don't have, and ships you a model that benchmarks 30% behind whatever Anthropic released last week.
There is a third path. You don't bring the AI inside your walls. You build a wall that stands between your data and the AI. This piece is about that wall — what it is, what it costs, how it scales, and how to deploy one in 30 days without disrupting a single existing system.
The architecture in one sentence
A data sanitization layer is a programmable proxy that sits in the egress path between your applications and any external LLM provider. Outbound: it detects sensitive entities in a prompt, replaces them with reversible tokens, stores the mapping in your vault, and forwards only the tokenized prompt. Inbound: it receives the model's response, restores the original values from the vault, and delivers a complete answer to the user.
The provider sees structure. You keep substance. The mapping never crosses your trust boundary, so the provider literally cannot leak what it never received — a property that matters enormously when your compliance officer asks for guarantees rather than promises.
Key idea. This is not a model. It is plumbing. The frontier model still does the thinking; you just changed what it gets to think about.
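To make the boundary concrete, here is a minimal before-and-after sketch. The entity classes and bracketed token format are assumptions of this illustration, not a fixed specification.

```python
# Illustrative only: entity classes and token format are assumed, not prescribed.

original_prompt = (
    "Draft a follow-up email to Ahmet Yılmaz (customer 12345678901) "
    "thanking him for his $45,000 order."
)

# What actually crosses the trust boundary to the provider:
sanitized_prompt = (
    "Draft a follow-up email to [PERSON_1] (customer [ID_1]) "
    "thanking him for his [AMOUNT_1] order."
)

# What stays home, encrypted, in your vault:
vault_mapping = {
    "[PERSON_1]": "Ahmet Yılmaz",
    "[ID_1]": "12345678901",
    "[AMOUNT_1]": "$45,000",
}
```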
Why this is the right primitive
There are four common alternatives, and each has a fatal flaw.
Self-hosted open-weight models (Llama 3.1 70B, Qwen 2.5, DeepSeek V3) sound appealing until you cost out the GPU bill, the model-ops headcount, and the gap between an open model and the closed frontier. Even the most generous self-host plans land at $30k–$120k per month for serious inference traffic, plus two to three MLOps FTEs. For most enterprises this is the worst of both worlds: high cost, lower capability. We dig into this trade-off more in our AI transformation playbook.
Provider data-processing agreements (the "we promise we won't train on your data" page) are necessary but insufficient. They are contracts about behavior, not about technical capability. An attacker who breaches the provider, an insider with the wrong access, or a future model that accidentally memorizes your data — none of these are stopped by a DPA. Modern security thinking has moved decisively from promise to prove. See OWASP's LLM Top 10 for why provider trust alone is no longer acceptable.
Pure local redaction in the client (regex stripping in the browser or SDK) is the right intuition applied in the wrong place. Client-side anything is bypassable, inconsistent, and impossible to audit. A central layer enforces a single policy that every team inherits automatically.
Synthetic-data generation sounds elegant — train a small model on synthetic versions of your real data — but it only solves training. Inference still involves real user data, which is the actual problem.
The sanitization layer is the only architecture that gives you frontier capability, central enforcement, and a clean audit trail at the same time.
What happens in a single request
Consider a sales operations analyst asking the AI to draft a follow-up email for a customer who placed a six-figure order. The prompt naturally contains a name, a customer ID, an order amount — the exact data that should never reach a public API in raw form.
Behind the wall, in milliseconds:
- Detection. A named-entity recognition model scans the prompt and flags Ahmet Yılmaz as PERSON, 12345678901 as NATIONAL_ID, $45,000 as MONETARY_AMOUNT. Detection runs through three layers: a transformer NER (multilingual, fine-tuned on your domain), regex rules (for things like IBANs, credit cards, IP addresses), and a domain dictionary (your product names, internal project codenames, partner companies).
- Tokenization. Each sensitive value is replaced with a format-preserving placeholder: [PERSON_1], [ID_1], [AMOUNT_1]. The original-to-token mapping goes into an encrypted vault inside your environment — typically AES-256 at rest with per-tenant keys via AWS KMS or HashiCorp Vault.
- Policy check. Before the request leaves your perimeter, the policy engine asks: Is this user allowed to send MONETARY_AMOUNT data to gpt-4o? If yes, forward. If no, block, escalate, or downgrade to a smaller model with stricter constraints.
- Transmission. Only the sanitized prompt goes to the provider. Your egress firewall can be configured to allow LLM provider IPs only via the wall — any direct call from an application becomes a policy violation.
- Generation. The model writes the email using tokens. It has no idea who Ahmet is or what he bought.
- Restoration. The response comes back. The wall walks the response text, replaces each token with its original value from the vault, and delivers the final output.
- Logging. Request metadata — user, timestamp, entity types involved, model used, policy applied, token count, cost — is written to your SIEM. The actual sensitive payload is never logged.
End-to-end latency added by the wall: typically 80–250ms on warm cache, less than the variance between OpenAI's own response times on the same prompt. Detection and tokenization can be parallelized; the vault lookup on restoration is the hot path.
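Put together, the hot path is roughly the sketch below. All helper names (detect_entities, tokenize, new_request_id, vault, policy, provider, restore, audit_log) are hypothetical stand-ins for the components just described, not a specific product API.

```python
def handle_request(user, prompt, model):
    entities = detect_entities(prompt)                 # NER + regex + domain dictionary
    sanitized, mapping = tokenize(prompt, entities)    # "Ahmet Yılmaz" -> "[PERSON_1]"
    request_id = new_request_id()
    vault.store(request_id, mapping)                   # the mapping never leaves your perimeter

    entity_types = {e.type for e in entities}
    if not policy.allows(user, model, entity_types):
        raise PolicyViolation(f"{user} may not send {entity_types} to {model}")

    response = provider.complete(model=model, prompt=sanitized)  # only tokens cross the boundary
    restored = restore(response.text, vault.load(request_id))    # "[PERSON_1]" -> "Ahmet Yılmaz"

    audit_log.write(user=user, model=model, entity_types=entity_types,
                    tokens=response.usage, payload=None)         # metadata only, never the payload
    return restored
```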
The six capabilities, properly scoped
A sanitization layer is six tightly coupled services behind one API.
1. Detection and classification. Multilingual NER (we use a fine-tuned XLM-RoBERTa for Turkish/English) plus regex plus dictionaries. Critically: the detector has to be tunable per industry. A bank cares about IBANs and SWIFT codes. A hospital cares about ICD-10 codes and medication names. A law firm cares about case numbers and party names. Out-of-the-box PII detection is the starting point, not the destination.
2. Tokenization and masking. Format-preserving so the model still reasons correctly. Ahmet Yılmaz becomes [PERSON_1] (not [REDACTED]) so the model knows it's a person and writes "Dear [PERSON_1]," in the right place. Numeric amounts become [AMOUNT_1] with the right magnitude class so calculations still work. Dates become [DATE_1] with preserved relative ordering.
3. Policy engine. Plain-English rules over (department, model, data class, action). "Marketing can use gpt-4o for any data except MEDICAL_RECORD. Engineering can use claude-3.5-sonnet for anything in the PUBLIC_REPO class but must use the on-prem model for anything in PRIVATE_REPO." These rules are versioned, reviewable in Git, and enforced before any external call (a minimal sketch follows this list). The engine ties closely to how we think about security at the application layer.
4. Audit and compliance. Every request, every response, every policy decision — without the sensitive payload. This is what converts AI from a compliance liability into a defensible process under KVKK, GDPR, ISO 27001, and HIPAA. The audit log is what your legal team will demand in year two and never had in year one.
5. Threat protection. LLMs have a unique attack surface: prompt injection (embedded instructions in user data), jailbreaks (clever prompts that bypass safety), and exfiltration (asking the model to leak its system prompt or training data). The wall inspects both directions for these patterns — incoming prompts for injection attempts, outgoing responses for leaked secrets or non-compliant content.
6. Model router. Different requests, different models. A simple summarization can go to gpt-4o-mini at $0.15 per million input tokens. A high-stakes contract review goes to claude-3.5-sonnet at $3 per million. The router optimizes for cost, latency, and capability per request — and gives you vendor independence as a side effect. We cover the cost-routing pattern in our microservices architecture writeup.
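To make items 3 and 6 concrete, a combined policy-and-routing decision can be as small as the sketch below. The rule schema, department names, and model identifiers are illustrative assumptions, not the layer's actual configuration format.

```python
# Default-deny rules: unknown departments and restricted data classes stay on-prem.
POLICY_RULES = [
    {"department": "marketing",
     "deny_classes": {"MEDICAL_RECORD"},
     "allowed_models": {"gpt-4o", "gpt-4o-mini"},
     "default_model": "gpt-4o"},
    {"department": "engineering",
     "deny_classes": {"PRIVATE_REPO"},
     "allowed_models": {"claude-3.5-sonnet", "gpt-4o-mini"},
     "default_model": "claude-3.5-sonnet"},
]

def route(department: str, entity_classes: set[str], complexity: str) -> str:
    rule = next((r for r in POLICY_RULES if r["department"] == department), None)
    if rule is None or entity_classes & rule["deny_classes"]:
        return "on-prem-model"        # default deny: restricted data never leaves the perimeter
    if complexity == "low" and "gpt-4o-mini" in rule["allowed_models"]:
        return "gpt-4o-mini"          # routine request -> cheap model
    return rule["default_model"]      # high-stakes request -> permitted flagship
```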
How it scales to enterprise volume
The naive implementation — single Node process, in-memory vault, sequential detection — works for a pilot but caps around 200 requests per second. Real enterprise traffic looks more like 5,000–50,000 RPS at peak. Three architectural decisions get you there.
Stateless detection workers behind a load balancer. Detection and tokenization are CPU-bound but stateless once your models are loaded. Run them as a Kubernetes deployment of 8–32 pods, scale horizontally on CPU. Each pod holds the NER model in memory; cold-start is mitigated by readiness probes that wait for model load. We've covered this Kubernetes pattern in our DevOps best practices guide.
Vault as a managed service. Don't build your own. Use Vault Enterprise, AWS Secrets Manager + KMS, or GCP Secret Manager. The vault is the most sensitive component in your architecture; making it bespoke is exactly the wrong place to save engineering time. Token-to-value lookups become a managed problem with audit logs you don't have to write.
Cache the model client. OpenAI-style HTTP/2 connections benefit hugely from connection pooling. Maintain a warm pool of 10–20 connections per provider per worker; the latency difference between cold-connect and warm is 200ms+ — bigger than your entire detection pipeline.
Background restoration for large responses. Streaming responses (server-sent events) need streaming restoration. As tokens arrive from the model, restore them on the fly and stream to the user. Do not buffer the full response; buffering forfeits the conversational latency advantage that makes LLMs feel magical.
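A hedged sketch of that streaming restoration, assuming a bracketed token format and an iterator of text fragments from the provider's SSE stream:

```python
import re

TOKEN_RE = re.compile(r"\[[A-Z]+_\d+\]")   # assumed token format, e.g. [PERSON_1]

def stream_restore(chunks, mapping):
    """Restore vault tokens on the fly while relaying the provider's stream to the user.

    chunks: iterator of text fragments from the provider's SSE stream.
    mapping: token -> original value, loaded from the vault.
    A small tail buffer handles tokens split across chunk boundaries.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        cut = len(buffer)
        bracket = buffer.rfind("[")
        if bracket != -1 and "]" not in buffer[bracket:]:
            cut = bracket                               # hold back a possibly incomplete token
        emit, buffer = buffer[:cut], buffer[cut:]
        if emit:
            yield TOKEN_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), emit)
    if buffer:                                          # flush whatever remains at end of stream
        yield TOKEN_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), buffer)
```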
At 50,000 RPS, a properly architected wall adds roughly $0.0001 per request in your own infrastructure (against $0.001–$0.020 in model API cost), uses ~15ms of detection time, and gives you a single auditable choke point for every AI interaction in the organization. The cost ratio is so favorable that the wall pays for itself just on model cost optimization — routing routine requests away from the flagship model is usually a 40–60% spend reduction. Database operations underneath this scale require their own discipline; we cover that in database optimization for high-traffic apps.
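The routing savings are easy to sanity-check with back-of-the-envelope arithmetic. The request volume and traffic split below are assumptions; the per-million-token prices are the ones quoted earlier in this piece.

```python
monthly_requests = 10_000_000
avg_input_tokens = 1_000
price_flagship = 3.00      # $/M input tokens (claude-3.5-sonnet class)
price_mini = 0.15          # $/M input tokens (gpt-4o-mini class)

tokens_m = monthly_requests * avg_input_tokens / 1e6   # input tokens, in millions

flagship_only = tokens_m * price_flagship                                 # $30,000 / month
routed = tokens_m * 0.7 * price_mini + tokens_m * 0.3 * price_flagship    # $10,050 / month

print(f"reduction: {1 - routed / flagship_only:.0%}")   # ~66% on this split; real mixes land nearer 40-60%
```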
A 30-day deployment plan that actually works
Big-bang rollouts of new security layers fail. Here's how to ship a sanitization layer in one month without disrupting anything.
Week 1 — Pick one workflow. Choose the highest-pain, highest-leverage AI use case currently blocked by data sensitivity. Customer support triage. Contract clause extraction. Internal knowledge search over Confluence or Notion. Code review on private repos. One workflow, one team, one model. Define the entity classes that matter for this workflow and nothing else.
Week 2 — Stand up the wall in shadow mode. Deploy the layer in front of the chosen workflow but in observe-only mode. It detects, logs, would-have-tokenized, but does not modify the request. You now have a real dataset showing exactly what sensitive entities your users send, in what frequency, in what context. This data is gold for the next step.
Week 3 — Tune the detection. Based on shadow data, adjust the entity catalog. Add the domain-specific patterns the off-the-shelf model missed. Suppress the false positives (every team has at least one — for us it was repeatedly flagging "Stripe" as a person). Get the legal team to review the catalog: do they agree these are the categories that matter for KVKK Article 9 / GDPR Article 9 / your sector regulation?
Week 4 — Switch to enforce, then expand. Flip from observe to enforce on the pilot workflow. Watch error rates for 48 hours. Review the audit log with legal and compliance. Once the pattern is validated, the second workflow plugs in with a fraction of the effort because the layer is already running, the policies are already written, and the team already trusts the audit trail.
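In configuration terms, the week-2-to-week-4 progression can be as small as flipping one field. The schema below is an illustrative sketch, not a product format.

```python
PILOT_CONFIG = {
    "workflow": "customer-support-triage",
    "mode": "observe",            # week 2: detect and log, never modify the request
    "entity_catalog": ["PERSON", "NATIONAL_ID", "IBAN", "MONETARY_AMOUNT"],
    "suppressions": [
        {"match": "Stripe", "reason": "vendor name, not a PERSON"},   # week 3: kill the known false positive
    ],
}

# Week 4: flip to enforce, watch error rates for 48 hours, then onboard the next workflow.
PILOT_CONFIG["mode"] = "enforce"
```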
This phased approach is how every enterprise security primitive (WAFs, secrets managers, SIEM) actually rolled out — and how the sanitization layer should roll out too. The same pattern works for moving regulated workloads to the cloud, which we cover in our cloud migration guide.
The compliance picture, briefly
Under KVKK (Turkish data protection), Article 9 governs cross-border transfer of personal data — which is exactly what happens every time someone sends a customer name to an API hosted in the US. The sanitization layer is the technical control that lets you argue, with audit evidence, that personal data did not cross the border because it never left your perimeter in identified form.
Under GDPR, the same logic applies via Article 44 (transfers to third countries). Pseudonymization is defined in Article 4(5) and recognized throughout the regulation (notably Recital 28) as a measure that materially reduces risk to data subjects. A sanitization layer is, by definition, pseudonymization with a properly-secured re-identification key.
Under ISO 27001 Annex A 8.10 (information deletion) and A 8.11 (data masking), the wall directly satisfies the technical control requirements that auditors look for.
Under HIPAA, the same architecture functions as a de-identification layer per the Safe Harbor method, with the vault holding the identifiers that would otherwise convert PHI exposure into a reportable incident.
The same wall, configured per-industry, gives you a defensible posture across all four regimes. Your security team writes the policy once; the application teams inherit compliance automatically. This is a major reduction in audit overhead.
What this changes for IT
For technology leadership, the sanitization layer is more than a privacy tool — it's a strategic chokepoint. Three implications matter.
Single point of governance. Instead of negotiating data-handling terms with every AI vendor and auditing every integration separately, IT manages one layer with one policy set. Every AI-touching application in the enterprise — from the internal LLM chatbot to the marketing copy generator to the customer support bot we built using modern web architecture — inherits those controls automatically.
Clean separation of concerns. Application teams build features. The wall enforces data protection. Security teams audit one boundary instead of dozens. Compliance teams have one log to review.
Observability into AI usage. For the first time, IT can answer questions that today's ad-hoc AI use makes impossible: which teams are using AI most, on what data, at what cost, with what risk profile? Per-team token spend, per-model cost trends, policy violation rates — all emerge as a byproduct of doing the primary job.
The strategic frame. Most enterprises will eventually have a single AI gateway. The question is whether you design it deliberately as a strategic asset, or accumulate it accidentally as ten different teams build ten different proxies. The first path takes a quarter and pays dividends forever. The second takes years and produces ten different audit liabilities.
Common objections, briefly
"Won't sanitization hurt the model's accuracy?" In practice, no — modern LLMs reason perfectly well over structured placeholders as long as the placeholder preserves the type of entity. Where accuracy does suffer is on natively unstructured tasks like sentiment analysis of customer feedback, where the customer's actual words matter. For those tasks you either accept the trade-off or run them through an on-prem model. The router can make this routing automatic.
"What about agents that need to take real actions on real data?" The wall is for the LLM call, not the tool call. When the model outputs send_email_to([PERSON_1]), your application layer restores [PERSON_1] to the real address before invoking the email tool. The agent's reasoning happens on tokens; the agent's actions happen on real data inside your perimeter.
"Can the provider deduce identity from context?" Possible in theory, mitigated in practice by entity rotation (the same person gets different tokens in different sessions), aggressive minimization (only send the prompt fragments that need to reach the model), and provider-side privacy policies. The threat model here is residual; the alternative is sending everything in clear text.
Ready to build one?
If your organization is currently sending raw customer data to public LLM APIs — and most are — you are accumulating compliance debt every day. If you're holding back AI adoption entirely because legal said no, you are losing the productivity race.
The sanitization layer is the architectural primitive that lets you stop both. Your data stays home. The AI thinks anyway. Compliance gets a defensible answer. Engineering gets to ship.
We've built sanitization layers for regulated industries — finance, healthcare, legal — across both Turkey and Europe. If you want to discuss what one would look like for your stack, your data, and your compliance regime, get in touch. The first conversation costs you 30 minutes and clarifies whether this is the right primitive for your problem.
Keep your data. Use the AI. Both can be true at once.