Every enterprise running RAG today is doing what Samsung engineers did in 2023 — sending sensitive data to LLM providers. Except it's automated, at scale, thousands of times per day.
Samsung's problem wasn't careless employees. It was architectural. And your RAG pipeline has the same architecture.
The 4 Leak Points
Your Documents (contracts, financials, HR, strategy)
        |
        v
1. Chunking                  ✅ Local, safe
        |
        v
2. Embedding API call        ❌ LEAK #1: raw text to provider
        |
        v
3. Vector DB (cloud)         ❌ LEAK #2: invertible embeddings
        |
        v
4. User query embedding      ❌ LEAK #3: query to embedding API
        |
        v
5. Retrieved context (your most sensitive chunks)
        |
        v
6. LLM generation call       ❌ LEAK #4: query + context in plaintext
        |
        v
Response to user
Six steps. Four leak points. Every single query.
Your compliance team saw a box labeled "LLM" in the architecture diagram and assumed it was local. It isn't.
"But Embeddings Are Just Numbers"
That was the conventional wisdom until Zero2Text (Feb 2026), a zero-training inversion attack that reconstructs text from embedding vectors with API access alone, achieving 1.8x higher ROUGE-L scores than all prior baselines.
Patient records, legal docs, proprietary code — all recoverable from vectors alone.
A Pinecone/Weaviate breach = full plaintext breach. OWASP now classifies this as a Top 10 LLM vulnerability.
Why Existing Solutions Don't Work
Redaction kills utility:
Before: "Tata Motors reported Rs 3.4L Cr revenue in Q3 2025"
After: "[REDACTED] reported [REDACTED] revenue in [REDACTED]"
Good luck getting useful embeddings from that. Your vector search returns garbage.
PII detectors (Presidio, LLM Guard):
- 50-200ms overhead per call (Python NER in the hot path)
- Only catch names and emails; miss revenue figures, deal sizes, project codenames
- Stateless: a different replacement on each call breaks vector search
Cloud-locked tools: Bedrock guardrails = Bedrock only. Private AI = another SaaS middleman.
| | Consistent mapping | Beyond PII | <10ms latency | Self-hosted | Pipeline-aware |
|---|---|---|---|---|---|
| Presidio | ❌ | ❌ | ❌ | ✅ | ❌ |
| LLM Guard | ❌ | ❌ | ❌ | ✅ | ❌ |
| Bedrock Guardrails | ❌ | ⚠️ | ✅ | ❌ | ❌ |
| CloakPipe | ✅ | ✅ | ✅ | ✅ | ✅ |
The Fix: Consistent Pseudonymization
Don't redact. Replace consistently.
Map "Tata Motors" → "ORG_7". Same token, every time, across every document and query.
Before: "Tata Motors reported Rs 3.4L Cr revenue in Q3 2025, up 12%"
After: "ORG_7 reported AMOUNT_12 revenue in DATE_3, up PCT_3"
Semantic structure preserved → embeddings still meaningful → vector search works → LLM responds with pseudonyms → rehydrate back to real values.
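A minimal sketch of the idea (a toy helper, not CloakPipe's actual code): a persistent entity-to-token map guarantees the same replacement every time, across documents and queries, so embeddings stay consistent.

```python
class Pseudonymizer:
    """Toy consistent-pseudonymization map (illustrative only)."""

    def __init__(self):
        self.mapping = {}   # real value -> pseudonym
        self.counters = {}  # entity type -> next id

    def token_for(self, value, entity_type):
        # Same value always yields the same token -- the key property
        # that keeps document and query embeddings aligned.
        if value not in self.mapping:
            n = self.counters.get(entity_type, 0) + 1
            self.counters[entity_type] = n
            self.mapping[value] = f"{entity_type}_{n}"
        return self.mapping[value]

    def pseudonymize(self, text, entities):
        # `entities` = [(surface_form, type)] from a detector (regex/NER).
        for value, etype in entities:
            text = text.replace(value, self.token_for(value, etype))
        return text

p = Pseudonymizer()
doc = p.pseudonymize(
    "Tata Motors reported Rs 3.4L Cr revenue in Q3 2025",
    [("Tata Motors", "ORG"), ("Rs 3.4L Cr", "AMOUNT"), ("Q3 2025", "DATE")],
)
query = p.pseudonymize("What was Tata Motors revenue?", [("Tata Motors", "ORG")])
print(doc)    # ORG_1 reported AMOUNT_1 revenue in DATE_1
print(query)  # What was ORG_1 revenue?
```

Because the query reuses the same map as the documents, "Tata Motors" becomes `ORG_1` in both, and vector search matches exactly as it would on plaintext.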
"What was Tata Motors' revenue last quarter?"
↓
Pseudonymize → "What was ORG_7's revenue last quarter?"
↓
Embed + Search → retrieve pseudonymized chunks
↓
LLM → "ORG_7 reported AMOUNT_12 in DATE_3..."
↓
Rehydrate → "Tata Motors reported Rs 3.4L Cr in Q3 2025..."
↓
✅ User sees real answer. Provider never saw "Tata Motors."
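The rehydration step is just a reverse lookup: replace each pseudonym in the LLM's output with the original value from the local vault. A sketch with a plain in-memory dict (CloakPipe keeps this mapping encrypted; the values below are illustrative):

```python
import re

# Local vault: pseudonym -> real value. Never leaves your infrastructure.
vault = {"ORG_7": "Tata Motors", "AMOUNT_12": "Rs 3.4L Cr", "DATE_3": "Q3 2025"}

def rehydrate(text: str) -> str:
    # Word-boundary match so ORG_7 doesn't clobber a hypothetical ORG_71.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, vault)) + r")\b")
    return pattern.sub(lambda m: vault[m.group(1)], text)

llm_output = "ORG_7 reported AMOUNT_12 in DATE_3, up 12% year over year."
print(rehydrate(llm_output))
# Tata Motors reported Rs 3.4L Cr in Q3 2025, up 12% year over year.
```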
Going Further: Kill 3 of 4 Leak Points
Vectorless tree search builds a local JSON index and lets the LLM reason about relevance. No embedding API. No vector DB. No inversion risk.
VECTOR RAG (4 leaks):          TREE-BASED RAG (1 leak):
Text → Embedding API     ❌    Tree index built locally   ✅
Vectors → Cloud DB       ❌    Tree stored locally        ✅
Query → Embedding API    ❌    LLM navigates tree         ✅
Context → LLM            ❌    Pseudonymized → LLM        ⚠️ (protected)
PageIndex (VectifyAI) reported 98.7% accuracy on FinanceBench for structured docs, versus ~31% for GPT-4o.
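A sketch of the vectorless idea, under stated assumptions: the index is a local tree of section summaries, and at each level a relevance scorer picks which branch to descend. Here a keyword-overlap score stands in for the LLM relevance call that PageIndex-style systems make; the tree contents and function names are invented.

```python
import re

# Local index: each node carries a summary the LLM would reason over.
# Leaf text is already pseudonymized before the one remaining LLM call.
tree = {
    "summary": "Annual report",
    "children": [
        {"summary": "Revenue and quarterly financials", "children": [],
         "text": "ORG_7 reported AMOUNT_12 revenue in DATE_3."},
        {"summary": "HR policies and headcount", "children": [],
         "text": "Headcount grew in DATE_4."},
    ],
}

def score(query: str, summary: str) -> int:
    # Stand-in for an LLM relevance judgment: keyword overlap.
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", summary.lower()))
    return len(q & s)

def navigate(node: dict, query: str) -> str:
    # Descend to the most relevant child until we reach a leaf.
    while node.get("children"):
        node = max(node["children"], key=lambda c: score(query, c["summary"]))
    return node["text"]

chunk = navigate(tree, "What was the quarterly revenue?")
print(chunk)  # ORG_7 reported AMOUNT_12 revenue in DATE_3.
```

Nothing here touches an embedding API or a cloud vector DB: the index is built and stored locally, and only the final pseudonymized chunk reaches the LLM.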
CloakPipe — Drop-In Privacy Proxy
I built CloakPipe — a Rust-native proxy that sits between your app and any OpenAI-compatible API.
Your App ──"Tata Motors"──▶ CloakPipe ──"ORG_1"──▶ LLM API
Your App ◀──"Tata Motors"── CloakPipe ◀──"ORG_1"── LLM API
Setup: change OPENAI_BASE_URL. That's it. Your LangChain/LlamaIndex/OpenAI SDK code works unchanged.
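Concretely, the setup looks like this, assuming CloakPipe listens on localhost port 8080 (the port is illustrative; check the repo's README for the actual default):

```shell
# Point any OpenAI-compatible SDK at the local proxy instead of the provider.
export OPENAI_BASE_URL="http://localhost:8080/v1"
# Your real key still goes along; CloakPipe forwards it upstream.
export OPENAI_API_KEY="sk-..."
# Existing LangChain / LlamaIndex / OpenAI SDK code now runs unchanged.
```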
v0.1 features:
- Multi-layer detection (API keys, JWTs, emails, IPs, financial amounts, fiscal dates, custom TOML rules)
- AES-256-GCM encrypted vault + `zeroize` memory safety
- OpenAI-compatible proxy (`/v1/chat/completions`, `/v1/embeddings`)
- SSE streaming rehydration
- Single binary, <5ms overhead
Coming soon:
- 🌳 CloakTree — vectorless retrieval, eliminates 3/4 leak points
- 🔐 CloakVector — distance-preserving vector encryption
- 🧠 ONNX-based NER
- 🏗️ TEE support (AWS Nitro, Intel TDX)
The privacy-preserving AI market is $4.25B today, projected $40B by 2035. 75% of enterprise leaders cite security as #1 barrier to AI adoption.
The era of sending raw enterprise data to LLM APIs in plaintext is ending.
github.com/rohansx/cloakpipe — star it, try it, break it.