Your RAG Pipeline Is Leaking Customer Data Into Vector Embeddings

#rag #ai #api #data

"If you embed a chunk that says ""Sarah Mitchell called about her order to 14 Beechwood Avenue, Manchester,"" the embedding captures the semantics of that entire passage. The vector database now contains a representation derived from a customer's personal data. And more importantly, the text chunks stored as metadata contain the original PII in plain text.

Three specific risks:

Cross-user data leakage: Agent A queries the system, and the retriever pulls chunks from Agent B's tickets, exposing the details of Agent B's customers to Agent A.
Right to erasure: A customer exercises their rights under GDPR Article 17. Their data may be fragmented across thousands of embeddings, and identifying and removing the specific embeddings that encode their PII is extremely difficult.
Vendor exposure: If your vector DB is hosted (Pinecone, Weaviate Cloud), the PII sits in a third party's infrastructure.
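
To make the erasure problem concrete, here is a minimal sketch, assuming a vector store whose records keep the chunk text as metadata. The `Record` type, the `find_records_mentioning` helper and the sample data are hypothetical and not tied to any particular vector database. It scans the stored text for a customer's known identifiers and lists the record IDs that would need deleting.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical shape of one vector-store entry."""
    id: str
    vector: list[float]
    text: str  # the original chunk, stored as metadata in plain text


def find_records_mentioning(records: list[Record], identifiers: list[str]) -> list[str]:
    """Return ids of records whose stored text contains any known identifier.

    Case-insensitive substring match only: it misses misspellings, nicknames
    and partial addresses, which is part of why erasure is hard in practice.
    """
    needles = [s.lower() for s in identifiers if s]
    return [r.id for r in records if any(n in r.text.lower() for n in needles)]


# Illustrative data only.
store = [
    Record("t1-c0", [0.1, 0.2], "Sarah Mitchell called about her order to 14 Beechwood Avenue, Manchester"),
    Record("t1-c1", [0.3, 0.1], "Customer asked for a refund on the delayed parcel"),
    Record("t2-c0", [0.2, 0.4], "Follow-up email sent to sarah@gmail.com"),
]

to_delete = find_records_mentioning(store, ["Sarah Mitchell", "sarah@gmail.com"])
print(to_delete)
```

Note that the lookup only works because the plain-text chunks are still there to scan. The vectors themselves cannot be searched for a name, so every matching record has to be found through its metadata and deleted together with its embedding.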

The fix: sanitise before embedding. Strip PII from chunks before generating embeddings, and store only the sanitised text as metadata. The semantic meaning is preserved: retrieval still works on "customer called about delivery issue," but the personal identifiers are gone.
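
As a rough sketch of where the sanitising step sits in an ingestion pipeline, assuming you have some embedding function to hand: the `redact`, `ingest` and `fake_embed` names below are hypothetical, and the regex redactor is a deliberately crude stand-in for a proper anonymisation service. It only catches emails and UK-style phone numbers, not names or street addresses.

```python
import re

# Crude illustrative patterns; real PII detection needs far more than regexes
# (names and street addresses, for instance, are not caught here).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
UK_PHONE = re.compile(r"(?:\+44\s?|\b0)\d{4}\s?\d{6}\b")


def redact(text: str) -> str:
    """Replace matched identifiers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return UK_PHONE.sub("[PHONE]", text)


def ingest(chunks, embed_fn, sanitise=redact):
    """Sanitise each chunk first, then embed and store only the clean text."""
    records = []
    for chunk in chunks:
        clean = sanitise(chunk)
        # Both the vector and the stored metadata come from the clean text.
        records.append({"text": clean, "vector": embed_fn(clean)})
    return records


def fake_embed(text: str) -> list:
    """Placeholder for a real embedding model (assumption for this sketch)."""
    return [float(len(text))]


records = ingest(
    ["Customer (sarah@gmail.com, 07700 900123) called about delayed delivery"],
    fake_embed,
)
print(records[0]["text"])
```

The point of the design is ordering: the raw chunk never reaches `embed_fn` or the stored record, so neither the vector nor the metadata is derived from the identifiers that were matched.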

curl -X POST https://api.comply-tech.co.uk/api/v1/anonymise \
  -H "X-Api-Key: demo-key-complytech" \
  -H "Content-Type: application/json" \
  -d '{"content":"Customer Sarah Mitchell (sarah@gmail.com) called about delayed delivery to 14 Beechwood Ave, Manchester","contentType":"text","strategy":"Redact","frameworks":["GDPR"]}'
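
If your pipeline is in Python, the same request can be built with the standard library. This is a sketch that mirrors the curl command above: the URL, header names, demo key and payload fields are copied from it, while the `build_request` and `anonymise_raw` helpers are hypothetical names. The response schema is not shown here, so the sketch returns the raw body rather than guessing at its fields.

```python
import json
import urllib.request

API_URL = "https://api.comply-tech.co.uk/api/v1/anonymise"  # from the curl example
API_KEY = "demo-key-complytech"  # demo key from the curl example


def build_request(content: str) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    payload = {
        "content": content,
        "contentType": "text",
        "strategy": "Redact",
        "frameworks": ["GDPR"],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Api-Key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )


def anonymise_raw(content: str, timeout: float = 10.0) -> str:
    """Send the request and return the raw response body as text.

    Nothing is parsed: the response format is an unknown in this sketch.
    """
    with urllib.request.urlopen(build_request(content), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `anonymise_raw("Customer Sarah Mitchell (sarah@gmail.com) called about delayed delivery")` would send the request; inspect the returned body before wiring it into the embedding step.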

We tested this. Retrieval quality was unaffected, because semantic search matches on the problem described, not the customer's name.
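
One way to run a similar check on your own data is to index the same chunks twice, raw and redacted, and compare the top-k results for a set of queries. The sketch below is only an illustration of that comparison: it uses a toy bag-of-words cosine similarity as a stand-in for a real embedding model, all function names and sample chunks are hypothetical, and whatever it prints says nothing about the test described above.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list, k: int) -> list:
    """Indices of the k chunks most similar to the query."""
    q = bow(query)
    scores = [cosine(q, bow(c)) for c in chunks]
    return sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]


def overlap_at_k(queries, raw_chunks, redacted_chunks, k=1) -> float:
    """Mean fraction of top-k hits shared by the raw and redacted indexes."""
    shared = [
        len(set(top_k(q, raw_chunks, k)) & set(top_k(q, redacted_chunks, k))) / k
        for q in queries
    ]
    return sum(shared) / len(shared)


# Illustrative data only; chunk i in both lists is the same ticket.
raw = [
    "Sarah Mitchell called about delayed delivery to Manchester",
    "John Price asked how to reset his password",
    "Refund requested for a damaged item",
]
redacted = [
    "[NAME] called about delayed delivery to [LOCATION]",
    "[NAME] asked how to reset his password",
    "Refund requested for a damaged item",
]
queries = ["delayed delivery", "reset password", "damaged item refund"]

print(overlap_at_k(queries, raw, redacted, k=1))
```

An overlap close to 1.0 on a realistic query set would support the claim for your data; a noticeably lower figure would suggest that some queries were matching on identifiers rather than on the problem described.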

DEV Community
