Every company using Large Language Models (LLMs) is sending data somewhere. Most of them don't have a clear answer for what happens to the personal information inside those API calls. That's not a future compliance problem; it's a right-now problem. And waiting for regulation to catch up isn't a strategy; it's a liability.
I founded CloakLLM because I kept seeing the same scenario play out across industries. Companies were ready to adopt LLMs for high-impact use cases - customer support automation, complex document processing, internal knowledge search - but they hit a wall. There was no clear path to move from a "cool demo" to "production-ready" without exposing sensitive customer data to a third-party API.
When that wall is hit, the project usually stalls. But "stalled" is a polite word for what actually happens.
In reality, the project often goes underground. When official AI initiatives are blocked by compliance concerns, employees don't stop needing the technology. They open ChatGPT in a personal browser tab and paste customer data, proprietary code, or sensitive legal summaries into a consumer interface. They do this with zero logging, zero PII protection, and zero audit trail. The compliance concern hasn't been solved; it's just become invisible to the C-suite. This "Shadow AI" is one of the largest unmanaged risks in the enterprise today.
The Vision vs. The Reality
This gap exists everywhere, but in the EU, the tension is particularly high. The regulatory framework is actually ahead of most jurisdictions. The Data Act went live on September 12, 2025. The AI Act mandates automatic record-keeping (Article 12) and transparency (Article 13). GDPR remains the bedrock, requiring data minimization and the right to erasure.
Europe got the vision right. I believe that a trust-based framework is a long-term competitive advantage. But there's a "last mile" problem: none of these high-level principles translates into tools that organizations can actually deploy today.
There is no off-the-shelf solution. No middleware says, "This request contains PII from an EU data subject; here is how to handle it before it leaves your infrastructure." That's why I founded CloakLLM - to create a transparent, open-source layer that sits between the application and the model, ensuring that the "legal yes" is as easy to achieve as the "technical yes."
Solving the Natural Language Problem
Protecting data in an LLM call is fundamentally different from protecting a database. Personal information in natural language isn't a solved problem. Simple pattern matching catches structured data such as IBANs, credit card numbers, and phone numbers, but it misses the nuance of human speech.
That's why CloakLLM uses a 3-pass detection pipeline:
Pattern Matching: For high-speed, high-confidence structured data.
Named Entity Recognition: To identify people, organizations, and locations in context.
Local AI Reasoning: This is the critical layer. A local model reasons about what is actually sensitive. It can distinguish between "123 Main St" as a generic example and "the house next to the bakery" as a specific, identifiable address.
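To make the three passes concrete, here is a minimal sketch of how such a pipeline could be wired together. All function names and regex patterns are illustrative assumptions, not CloakLLM's actual API; in a real system, pass 2 would call a local NER model (e.g. spaCy) and pass 3 a local LLM, both stubbed out here.

```python
import re

def pass_pattern_matching(text):
    """Pass 1: high-speed, high-confidence structured identifiers via regex.
    Patterns are simplified for illustration."""
    patterns = {
        "IBAN": r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b",
        "PHONE": r"\+\d{1,3}[\s\d-]{7,14}\d",
    }
    findings = []
    for label, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            findings.append((m.start(), m.end(), label))
    return findings

def pass_ner(text):
    """Pass 2 (stub): named entities - people, organizations, locations -
    detected in context by a locally hosted NER model."""
    return []

def pass_local_reasoning(text, findings):
    """Pass 3 (stub): a local model judges contextual sensitivity, e.g.
    'the house next to the bakery' as an identifiable address."""
    return findings

def detect_pii(text):
    # Run the passes in order; later passes can add or veto findings.
    findings = pass_pattern_matching(text) + pass_ner(text)
    return pass_local_reasoning(text, findings)
```

Each finding is an `(start, end, label)` span, so later stages can refine or discard what earlier stages flagged.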
All of this happens locally; the data never leaves the organization's infrastructure until sensitive values have been replaced with safe placeholder tokens. When inference happens on a cloud API, local redaction isn't a design preference; it's a requirement for GDPR compliance.
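The masking round trip described above can be sketched in a few lines. This is an assumption about one reasonable design, not CloakLLM's implementation: detected spans are swapped for placeholder tokens before the request leaves the infrastructure, and the mapping stays local so the model's response can be restored afterwards.

```python
def mask(text, findings):
    """Replace detected (start, end, label) spans with placeholder tokens.
    The token-to-value mapping never leaves local infrastructure."""
    mapping = {}
    # Replace from the end of the string so earlier offsets stay valid.
    for i, (start, end, label) in enumerate(sorted(findings, reverse=True)):
        token = f"[{label}_{i}]"
        mapping[token] = text[start:end]
        text = text[:start] + token + text[end:]
    return text, mapping

def unmask(text, mapping):
    """Restore original values in the model's response, locally."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

The cloud model only ever sees `[NAME_0]`-style tokens; the re-identification step happens after the response comes back.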
The Three Critical Gaps
From where I sit, three gaps are preventing the Data Act from becoming a catalyst for adoption:
The API-Layer Standard: The Data Act focuses heavily on IoT-generated data and portability. The AI Act focuses on risk classification. Neither addresses the primary data flow of the current AI boom: the API call. We need a standard for protecting the data in these requests and a recognized format for logging them. Without a standard, every legal department is forced to reinvent the wheel, leading to months of unnecessary friction.
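What might a recognized logging format for outbound LLM requests look like? The sketch below is purely a proposal to make the gap tangible; the field names are my assumptions, not an existing standard. The key property is that the raw request is hashed, never stored, while the PII categories and masking status remain auditable.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(request_text, masked_text, findings, model):
    """Build one structured, privacy-preserving log entry per outbound
    LLM API call. Field names are illustrative, not a real standard."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        # Hash the raw request so it can be referenced without being stored.
        "request_sha256": hashlib.sha256(request_text.encode()).hexdigest(),
        "pii_categories": sorted({label for _, _, label in findings}),
        "pii_count": len(findings),
        "masked_before_send": masked_text != request_text,
    }

# Serialize as one JSON line per request for append-only audit logs.
line = json.dumps(audit_record(
    "Call +49 170 1234567", "Call [PHONE_0]", [(5, 20, "PHONE")], "gpt-4o"))
```

A shared schema like this is exactly what would let legal departments stop reinventing the wheel.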
The Documentation-Agile Conflict: The AI Act's Annex IV demands comprehensive records of design decisions and data lineage. But for companies using Retrieval-Augmented Generation (RAG) - where the AI pulls from a live knowledge base to answer questions - "data lineage" is a moving target. Every time the knowledge base updates with new documents or fresh data, the system's context changes.
Annex IV assumes a static audit trail; RAG is inherently dynamic. If you change how the system processes or retrieves information, your compliance snapshot is technically obsolete. Compliance documentation needs to be generated automatically as part of the development process, not maintained manually after the fact. And every month spent on manual compliance work is a month your competitor ships without you.
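One way to generate that documentation automatically is to snapshot the knowledge base's state on every update. The sketch below is a hypothetical illustration, not an Annex IV-compliant tool: hashing the sorted corpus gives a content fingerprint that is stable across ordering but changes whenever any document changes.

```python
import hashlib
from datetime import datetime, timezone

def lineage_snapshot(documents, retriever_config):
    """Emit a data-lineage record whenever the RAG knowledge base changes,
    so the compliance snapshot tracks the system instead of going stale.
    'documents' is a list of document contents (or stable content IDs)."""
    corpus_hash = hashlib.sha256(
        "".join(sorted(documents)).encode()
    ).hexdigest()
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "document_count": len(documents),
        "corpus_sha256": corpus_hash,   # changes iff the corpus changes
        "retriever_config": retriever_config,
    }
```

Run on every ingest (e.g. from CI), this yields an append-only history of what the system knew and how it retrieved it, with no manual bookkeeping.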
The Missing Implementation Layer: Regulators have published a flood of guidance. These are valuable for legal teams, but the people actually building AI products need practical tools and implementation blueprints. The gap between a regulatory PDF and a working product is where most innovation quietly dies.
Moving Toward "Compliance-as-Code"
What would actually change the equation? Standardized, open-source reference implementations.
If the EU were to fund standard open-source components - privacy detection layers, audit logging tools, and consent-tracking systems - in the same way it funds research, the impact would be massive. Standardization removes the excuse. When a compliance layer is one install away, "we couldn't figure out how to handle PII" stops being a valid reason to delay AI adoption.
For companies outside the EU, the lesson is the same: adopting this infrastructure now isn't just about avoiding a future fine. It's about building the trust necessary to move AI out of the sandbox and into the core of your business.
I'd like to hear from others navigating this. If you're leading AI adoption at your company, what's getting in the way? Is it the model capability, or is it the infrastructure of trust?
The code is open-source: cloakllm.dev | github.com/cloakllm/CloakLLM