zahraarmantech

Posted on May 31

I tried to hide semantic meaning from embeddings without breaking search

#database #machinelearning #privacy #security

Every vector database has the same problem: embeddings leak meaning.

If someone gets access to your vector store — breach, insider, subpoena — they don’t need to read your documents. They just cluster the embeddings. Five minutes later they know: these 500 vectors are medical records, these 200 are legal cases, these 100 are salary data.

I wanted to know: can you destroy that structure while keeping search working?

The experiment

I took 626,906 real passages from Microsoft’s MS MARCO dataset and encoded them with a standard sentence transformer. Then I tried to make the embeddings unreadable without killing retrieval quality.

The approach I landed on: split each embedding into 200 independent channels, quantize each to an integer, mask it with a cryptographic salt, and store only the modular residue after dividing by prime numbers.
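The steps above can be sketched roughly as follows. This is an illustration only, not the ZATRON code, and it makes no claim to match its security properties. The bin count, the three primes, the value range, the way a mask is derived from a key, and the choice to treat each list element as one channel are all assumptions made for the example.

```python
import hashlib
import hmac

N_BINS = 256               # hypothetical number of quantization bins
PRIMES = (257, 263, 269)   # hypothetical moduli, each larger than N_BINS


def quantize(x, lo=-1.0, hi=1.0, n_bins=N_BINS):
    """Map a float in [lo, hi] to an integer bin in [0, n_bins - 1]."""
    x = min(max(x, lo), hi)  # clamp out-of-range values
    return min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)


def channel_mask(key: bytes, channel: int, prime: int) -> int:
    """Derive a per-channel mask from a secret key with HMAC-SHA256 (assumed scheme)."""
    digest = hmac.new(key, f"{channel}:{prime}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % prime


def encode(embedding, key: bytes):
    """Return, for each channel, the masked residues modulo each prime."""
    codes = []
    for i, x in enumerate(embedding):
        q = quantize(x)
        codes.append(tuple((q + channel_mask(key, i, p)) % p for p in PRIMES))
    return codes


# Only the residues would be stored; the floats and the bins are discarded.
codes = encode([0.12, -0.40, 0.93], key=b"demo-key")
```

In this sketch each channel keeps one small integer per prime, so the stored record contains no floating-point values at all.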

The raw embedding is never stored. It’s never reconstructed. Even the person running the search never sees it.

What happened

Search quality: 98.2% preserved. Across 500 queries, the protected system returns nearly the same rankings as plain cosine search.
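The post does not say how "preserved" is measured. One common way to compare two retrieval systems is top-k overlap: for each query, the fraction of the baseline's top-k results that the protected system also returns. The sketch below assumes that metric, and the function names are made up for the example.

```python
def topk_overlap(baseline, protected, k=10):
    """Fraction of the baseline top-k document IDs also present in the protected top-k."""
    base = set(baseline[:k])
    if not base:
        return 0.0
    return len(base & set(protected[:k])) / len(base)


def mean_overlap(baseline_runs, protected_runs, k=10):
    """Average top-k overlap across all queries."""
    scores = [topk_overlap(b, p, k) for b, p in zip(baseline_runs, protected_runs)]
    return sum(scores) / len(scores)


# Two toy queries: the first matches fully, the second shares one of two results.
score = mean_overlap([[1, 2], [3, 4]], [[1, 2], [4, 5]], k=2)
```

Here `score` is the mean of 1.0 and 0.5. The same function run over 500 real queries would give a single percentage to compare against the figure above.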

But here’s the part that surprised me:

Left side: standard embeddings. Same-topic documents cluster together. An attacker sees everything.

Right side: same documents after the transformation. Random scatter. No structure.

And search still returns the same results on both sides.

The attack test

I computed every pairwise distance in the protected system and checked: can you figure out which documents are similar to each other?

Left: raw embeddings — perfect correlation between distance and similarity (ρ = 1.00). Attacker wins.

Right: protected system — no correlation (ρ = 0.09). Attacker gets nothing useful.
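A correlation like ρ can be computed by ranking both series and correlating the ranks. The sketch below assumes ρ means Spearman rank correlation between an attacker's observed distances and the true similarities, which the post does not state explicitly. It uses only the standard library.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

A value near 1 (or -1) means the attacker's distances order document pairs the same way true similarity does. A value near 0 means the ordering carries little information.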

What didn’t work

Not everything was smooth. Some things I learned the hard way:

BGE embeddings don’t quantize well. MiniLM and MPNet both hit 98%+. BGE dropped to 87%. The embedding distribution matters — models that spread information more uniformly across dimensions lose more during quantization.

Small primes break everything. When I used primes smaller than the number of quantization bins, retrieval quality collapsed from 98% to 38%. The modular reduction needs to be injective — primes must be larger than the bin count. This took me a while to figure out.
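The injectivity point is easy to check directly. When the prime is smaller than the number of bins, distinct bins wrap around to the same residue and can no longer be told apart. The bin count and primes below are example values, not the ones used in the experiment.

```python
def is_injective(n_bins, prime):
    """True if every bin in [0, n_bins - 1] maps to a distinct residue mod prime."""
    residues = [q % prime for q in range(n_bins)]
    return len(set(residues)) == len(residues)


def colliding_bins(n_bins, prime, residue):
    """All bins that land on the given residue."""
    return [q for q in range(n_bins) if q % prime == residue]


ok = is_injective(256, 257)            # prime larger than the bin count
broken = is_injective(256, 101)        # prime smaller than the bin count
clash = colliding_bins(256, 101, 5)    # bins 5, 106 and 207 all become residue 5
```

With 256 bins and the prime 101, three bins that are far apart in value collapse onto one stored number, which is consistent with the collapse in retrieval quality described above.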

The key holder can partially recover geometry. If someone with the key computes thousands of pairwise distances, they can approximate the original embedding structure using MDS (ρ = 0.63). I mitigated this to 0.35 with a log transform, but it’s a fundamental limitation of any distance-preserving scheme. FHE has the same issue.

What this actually is

I want to be precise: this is NOT encryption in the AES sense. You can’t decrypt a barcode (the stored code for a document) back to an embedding. It’s a randomized privacy-preserving encoding: barcodes are computationally indistinguishable from random without the key, under standard cryptographic assumptions (PRF/HMAC-SHA256).

Speed

On the same hardware (Colab T4 GPU), fully homomorphic encryption (CKKS) takes 38.9 ms per comparison. This system takes 5 ms. Integer arithmetic only, no GPU needed for the comparison step.

Try it

I built a live demo where you can see this working in real time — search both systems side by side:

Live demo: https://huggingface.co/spaces/zahraarman/ZATRON

Code: https://github.com/zahraarmantech/ZATRON

Run locally:

pip install sentence-transformers scikit-learn matplotlib
python demo.py

What I want to know

I’m an independent researcher. I built this because I wanted to know if it was possible. It appears to work, but I’m sure there are things I’m missing.

If you work on vector search, privacy, or retrieval systems — what would break this? What am I not seeing?

Zahra Arman — Independent Researcher
*The method is covered by a US provisional patent.

DEV Community