If you embed a chunk that says "Sarah Mitchell called about her order to 14 Beechwood Avenue, Manchester," the embedding captures the semantics of that entire passage. The vector database now contains a representation derived from a customer's personal data. More importantly, the text chunks stored as metadata contain the original PII in plain text.
Three specific risks:
Cross-user data leakage: Agent A queries the system, and the retriever pulls chunks from Agent B's tickets, which contain Agent B's customers' details.
Right to erasure: Customer exercises GDPR Article 17. Their data is fragmented across thousands of embeddings. Identifying and removing specific embeddings that encode their PII is extremely difficult.
Vendor exposure: If your vector DB is hosted (Pinecone, Weaviate Cloud), PII is in another third party's infrastructure.
The fix: sanitise before embedding. Strip PII from chunks before generating embeddings. The semantic meaning is preserved. Retrieval still works on "customer called about delivery issue." The personal identifiers are gone.
curl -X POST https://api.comply-tech.co.uk/api/v1/anonymise \
  -H "X-Api-Key: demo-key-complytech" \
  -H "Content-Type: application/json" \
  -d '{"content":"Customer Sarah Mitchell (sarah@gmail.com) called about delayed delivery to 14 Beechwood Ave, Manchester","contentType":"text","strategy":"Redact","frameworks":["GDPR"]}'
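To show where this step sits in an ingestion pipeline, here is a minimal Python sketch of sanitising chunks before they are embedded. It does not call the API above: two illustrative regular expressions stand in for it, and `redact_pii`, `sanitise_chunks`, and the `[EMAIL]` / `[PHONE]` placeholders are hypothetical names chosen for this example. Patterns like these catch structured identifiers only and will miss names and street addresses, which need a dedicated detection service or an NER model.

```python
import re

# Illustrative patterns only (assumptions for this sketch): they catch emails
# and UK-style phone numbers, not names or street addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\+44\s?|\b0)\d{4}\s?\d{6}\b")


def redact_pii(text: str) -> str:
    """Replace matched identifiers with fixed placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def sanitise_chunks(chunks: list[str]) -> list[str]:
    """Redact every chunk; only the returned text should be embedded or stored."""
    return [redact_pii(chunk) for chunk in chunks]


chunks = [
    "Customer (sarah@gmail.com) called about delayed delivery",
    "Call back on 07700 900123 about the refund",
]
clean_chunks = sanitise_chunks(chunks)
print(clean_chunks)
```

The important design point is ordering: the redacted text is what goes to both the embedding model and the metadata store, so the raw chunk never leaves the ingestion step.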
We tested this. Retrieval quality is unaffected because semantic search matches on the problem described, not the customer's name.
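One way to sanity-check this on your own data is to compare the top retrieved chunk before and after redaction. The sketch below is a toy version of that check and does not reproduce the test mentioned above: it uses a bag-of-words cosine similarity as a stand-in for a real embedding model, and the chunks, the `[NAME]` / `[ADDRESS]` placeholders, and the helper names are all made up for illustration.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Lower-cased bag-of-words vector; a toy stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_hit(query: str, chunks: list[str]) -> int:
    """Index of the chunk most similar to the query."""
    q = bow(query)
    return max(range(len(chunks)), key=lambda i: cosine(q, bow(chunks[i])))


original = [
    "Sarah Mitchell called about delayed delivery to 14 Beechwood Avenue, Manchester",
    "Customer asked how to reset the password on their account",
]
redacted = [
    "[NAME] called about delayed delivery to [ADDRESS]",
    "Customer asked how to reset the password on their account",
]

query = "delivery is delayed"
same_top_hit = top_hit(query, original) == top_hit(query, redacted)
print(same_top_hit)
```

With a real embedding model you would run the same comparison over a representative set of queries and look at how often the ranking changes, rather than relying on a single example.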