DEV Community: zahraarmantech

Stop choosing between smart search and private data

zahraarmantech — Thu, 11 Jun 2026 23:27:29 +0000

A few months ago I built a way to search documents by meaning while keeping the embeddings hidden — even from the server doing the search. I called it ZATRON.

The obvious question everyone (including me) kept asking was: does it actually hide anything, or does it just look scrambled?

Scrambled-looking isn't the same as secure. So instead of trusting a correlation number, I did the thing that actually scares me: I trained a neural network to break it.

This post is the honest write-up — including the part where I tried hard to make the attack win.

The setup

Standard semantic search stores embeddings as plain vectors. Anyone with database access can cluster them by topic and infer content without reading a word. ZATRON transforms each embedding into a modular barcode: project onto PCA channels, quantize, add a per-document keyed mask, and keep only residues modulo a set of primes. You compare barcodes in modular space; the original embedding is never reconstructed.

Retrieval still works — 98% of cosine quality on 626K MSMARCO passages. The question is whether the barcodes leak.

Why a correlation number wasn't enough

My first security check was a Spearman correlation between barcode distance and true similarity. It came out near zero (ρ ≈ 0.05). Good, but Spearman is a rank correlation: a low value only rules out a monotonic relationship, which is all a simple attacker would exploit. A neural network doesn't need monotonicity. It can learn whatever structure is there.
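To make that concrete, here is a minimal pure-Python sketch (toy data, not ZATRON output) where one variable fully determines the other and the Spearman correlation is still zero. The helper names `ranks`, `pearson`, and `spearman` are illustrative, not taken from the repo.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data: y is a deterministic function of x, but not a monotonic one.
xs = list(range(-10, 11))
ys = [x * x for x in xs]
rho = spearman(xs, ys)
print(f"Spearman rho for y = x^2 on a symmetric range: {rho:.3f}")
```

A model that can fit a curve would predict `ys` from `xs` perfectly here, which is why a near-zero ρ alone cannot certify that nothing leaks.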

So the real test: give a neural network every advantage and see if it can recover similarity from the barcodes.

The threat model (making the attacker strong on purpose)

I used a known-plaintext attacker — the strongest realistic setting:

It sees all the stored barcodes.
It also gets 80,000 document pairs with their true cosine similarities (as if a chunk of plaintext leaked).
It trains a model — a linear probe and a 3-layer MLP — to predict the similarity of unseen pairs from per-prime circular-difference features.
Train and test pairs share no anchor documents, so it can't just memorize.
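A rough sketch of what per-prime circular-difference features and an anchor-disjoint split could look like. This is an assumption about the feature layout, not the code in benchmarks/: the barcode is treated as a flat list of residues aligned with a list of moduli, and `PRIMES`, `pair_features`, and `split_pairs_by_anchor` are hypothetical names.

```python
# Hypothetical moduli, chosen only for illustration.
PRIMES = [53, 59, 61]


def circular_diff(a, b, p):
    """Shortest distance between residues a and b on a ring of size p."""
    d = (a - b) % p
    return min(d, p - d)


def pair_features(barcode_a, barcode_b, primes=PRIMES):
    """One feature per residue slot, scaled to [0, 1] by the largest
    possible circular distance for that (odd) prime, which is p // 2."""
    return [
        circular_diff(a, b, p) / (p // 2)
        for a, b, p in zip(barcode_a, barcode_b, primes)
    ]


def split_pairs_by_anchor(pairs, test_anchors):
    """pairs: list of (anchor_id, other_id, similarity).
    Every pair whose anchor is in test_anchors goes to the test set,
    so no anchor document appears on both sides of the split."""
    train = [p for p in pairs if p[0] not in test_anchors]
    test = [p for p in pairs if p[0] in test_anchors]
    return train, test


features = pair_features([1, 0, 60], [52, 0, 0])
pairs = [("a", "x", 0.9), ("a", "y", 0.1), ("b", "x", 0.4)]
train, test = split_pairs_by_anchor(pairs, test_anchors={"b"})
```

The circular form matters because residues wrap: 1 and 52 are only 2 apart modulo 53, not 51.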

And the part that makes the result trustworthy: I ran the identical attack on the unprotected quantized signals as a control. If the attack can't break those, the attack is too weak and the test means nothing.

The result

On 50,000 MSMARCO passages, 100,000 labeled pairs:

Input the attacker sees	Linear probe	MLP (3-layer)
Unprotected signals (control)	ρ = 0.79, AUC = 0.985	ρ = 0.90, AUC = 0.999
ZATRON barcodes	ρ = 0.00, AUC = 0.498	ρ = 0.00, AUC = 0.505

The same network that recovers similarity from unprotected signals almost perfectly (AUC 0.999) lands at chance level on the barcodes (AUC 0.498 to 0.505), even with 80,000 labeled pairs to learn from. AUC 0.50 is a coin flip.
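For readers who want the "coin flip" reading spelled out: AUC is the probability that a randomly chosen positive pair gets a higher score than a randomly chosen negative pair. The sketch below computes it directly from that definition. It assumes similarity has been thresholded into binary similar / not-similar labels, which is one plausible setup and not necessarily how the benchmark script labels pairs.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative.
    Ties count as half a win. labels are 1 (similar) or 0 (not similar)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# A predictor that ranks every positive above every negative.
perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])

# A predictor that outputs the same score for everything: no information.
blind = auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

print(perfect, blind)
```

This quadratic version is only meant to show the definition; a rank-based formula gives the same number much faster on 20,000+ pairs.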

It learned nothing.

I also put it head-to-head with the classic baseline

"8x faster than FHE" is a weak flex — everyone knows FHE is slow. The fairer comparison is ASPE (Wong et al., SIGMOD 2009), the classic encrypted-kNN scheme. ASPE preserves scalar products exactly, so retrieval is perfect — but that same property means any observer can read similarities straight off the ciphertexts.

	ASPE (SIGMOD '09)	ZATRON
Retrieval recall@10 (strict)	100%	81%
Observer reads similarity directly	ρ = +0.87	ρ = −0.06
Learned attack (MLP)	ρ = +0.91, AUC = 0.99	ρ = +0.01, AUC = 0.52

ASPE buys perfect recall with total leakage. ZATRON gives up a margin on the strictest retrieval metric and, in these tests, leaks nothing measurable to a direct observer or to a trained network.
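The scalar-product property behind ASPE's leakage fits in a few lines. This is a stripped-down illustration of the core identity only: data points are multiplied by the transpose of a secret invertible matrix, queries by its inverse, and the dot product survives unchanged. The real scheme adds extra dimensions and random real-valued matrices; the tiny integer matrix `M` here is a made-up example chosen so the inverse is exact.

```python
def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Secret key: an invertible matrix and its inverse (determinant is 1).
M = [[2, 3], [1, 2]]
M_INV = [[2, -3], [-1, 2]]


def encrypt_point(p):
    """Stored data point: M^T p."""
    return mat_vec(transpose(M), p)


def encrypt_query(q):
    """Query: M^-1 q."""
    return mat_vec(M_INV, q)


p, q = [3, 4], [1, 2]
plain = dot(p, q)
cipher = dot(encrypt_point(p), encrypt_query(q))
print(plain, cipher)  # both values are the same scalar product
```

Because (Mᵀp)·(M⁻¹q) = pᵀMM⁻¹q = p·q, anyone who sees an encrypted point and an encrypted query can compute their true scalar product without the key.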

What I'm NOT claiming

Honesty is the whole point, so the limits:

This is the observer threat model. A key holder computing many pairwise distances can still partially recover geometry via MDS (ρ ≈ 0.35) — that's inherent to any distance-preserving scheme, FHE included.
It is a randomized privacy-preserving encoding, not a reversible cipher, and not yet independently audited by a cryptographer. That's the right bar before anyone calls it production-grade.
The strict recall metric here (full top-10 set overlap) is harder than the top-1-in-top-10 number I quote elsewhere. Same system, stricter ruler.
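The two rulers can be written down side by side. This is one reading of the descriptions above (set overlap of the two top-10 lists versus "is the true best hit anywhere in the protected top-10"); the function names are illustrative.

```python
def strict_recall_at_k(reference, retrieved, k=10):
    """Fraction of the reference top-k that also appears in the
    retrieved top-k (order inside the top-k is ignored)."""
    ref = set(reference[:k])
    got = set(retrieved[:k])
    return len(ref & got) / len(ref)


def top1_in_top_k(reference, retrieved, k=10):
    """1.0 if the reference's best hit appears anywhere in the
    retrieved top-k, else 0.0."""
    return 1.0 if reference[0] in retrieved[:k] else 0.0


# Reference ranking from plain cosine search (document ids).
cosine_top = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Protected ranking that misses two of the ten.
protected_top = [0, 1, 2, 3, 4, 5, 6, 7, 20, 21]

strict = strict_recall_at_k(cosine_top, protected_top)
lenient = top1_in_top_k(cosine_top, protected_top)
print(strict, lenient)
```

The same query scores 0.8 on the strict metric and 1.0 on the lenient one, which is how one system can report two different-looking numbers.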

Try it / break it

Everything is reproducible:

pip install zatron

The attack and the ASPE comparison are in the repo as runnable scripts (benchmarks/). If you can make the neural attack win — train it longer, give it more pairs, better features — I genuinely want to see it. Finding the weakness is the point.

Code + benchmarks: https://github.com/zahraarmantech/ZATRON
Live demo: https://huggingface.co/spaces/zahraarman/ZATRON

I'd rather have someone break this now than after I've claimed too much.

I tried to hide semantic meaning from embeddings without breaking search

zahraarmantech — Sun, 31 May 2026 14:05:04 +0000

Every vector database has the same problem: embeddings leak meaning.

If someone gets access to your vector store — breach, insider, subpoena — they don’t need to read your documents. They just cluster the embeddings. Five minutes later they know: these 500 vectors are medical records, these 200 are legal cases, these 100 are salary data.

I wanted to know: can you destroy that structure while keeping search working?

The experiment

I took 626,906 real passages from Microsoft’s MSMARCO dataset. I encoded them with a standard sentence transformer. Then I tried to make the embeddings unreadable without killing retrieval quality.

The approach I landed on: split each embedding into 200 independent channels, quantize each to an integer, mask it with a cryptographic salt, and store only the modular residue after dividing by prime numbers.

The raw embedding is never stored. It’s never reconstructed. Even the person running the search never sees it.

What happened

Search quality: 98.2% preserved. Out of 500 queries, the protected system returns nearly identical rankings to plain cosine search.

But here’s the part that surprised me:

Left side: standard embeddings. Same-topic documents cluster together. An attacker sees everything.

Right side: same documents after the transformation. Random scatter. No structure.

And search still returns nearly the same results on both sides.

The attack test

I computed every pairwise distance in the protected system and checked: can you figure out which documents are related from those distances alone?

Left: raw embeddings — perfect correlation between distance and similarity (ρ = 1.00). Attacker wins.

Right: protected system, almost no correlation (ρ = 0.09). Attacker gets nothing useful.

What didn’t work

Not everything was smooth. Some things I learned the hard way:

BGE embeddings don’t quantize well. MiniLM and MPNet both hit 98%+. BGE dropped to 87%. The embedding distribution matters — models that spread information more uniformly across dimensions lose more during quantization.

Small primes break everything. When I used primes smaller than the number of quantization bins, retrieval quality collapsed from 98% to 38%. The modular reduction needs to be injective — primes must be larger than the bin count. This took me a while to figure out.
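The collision is easy to see on small numbers. The sketch below assumes 50 quantization bins (values 0 to 49) and only looks at the unmasked quantized values; `is_injective` and `collisions` are illustrative helpers, not functions from the package.

```python
def is_injective(num_bins, prime):
    """True if every bin value 0..num_bins-1 has a distinct residue."""
    residues = {v % prime for v in range(num_bins)}
    return len(residues) == num_bins


def collisions(num_bins, prime):
    """List pairs of bin values that land on the same residue."""
    first_seen = {}
    clashes = []
    for v in range(num_bins):
        r = v % prime
        if r in first_seen:
            clashes.append((first_seen[r], v))
        else:
            first_seen[r] = v
    return clashes


# A prime below the bin count: distant bins become indistinguishable.
print(collisions(50, 47))  # [(0, 47), (1, 48), (2, 49)]

# A prime above the bin count: every bin keeps its own residue.
print(is_injective(50, 53))  # True
```

With prime 47, bin 0 and bin 47 look identical even though they sit at opposite ends of the channel, so the distance signal that retrieval depends on is corrupted.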

The key holder can partially recover geometry. If someone with the key computes thousands of pairwise distances, they can approximate the original embedding structure using MDS (ρ = 0.63). I mitigated this to 0.35 with a log transform, but it’s a fundamental limitation of any distance-preserving scheme. FHE has the same issue.

What this actually is

I want to be precise: this is NOT encryption in the AES sense. You can’t decrypt a barcode back to an embedding. It’s a randomized privacy-preserving encoding — barcodes are computationally indistinguishable from random without the key, under standard cryptographic assumptions (PRF/HMAC-SHA256).
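As an illustration of how a PRF-based mask can be derived, here is a sketch using HMAC-SHA256 from the Python standard library. The message layout (`doc_id|channel`), the 8-byte truncation, and the name `channel_masks` are assumptions made for this example, not the package's actual derivation.

```python
import hashlib
import hmac


def channel_masks(key: bytes, doc_id: str, num_channels: int, modulus: int):
    """Derive one pseudorandom mask per channel for a given document.

    Assumed construction: HMAC-SHA256(key, "doc_id|channel"), with the
    first 8 bytes read as an integer and reduced modulo `modulus`.
    The reduction has a tiny bias, which is ignored in this sketch.
    """
    masks = []
    for channel in range(num_channels):
        message = f"{doc_id}|{channel}".encode("utf-8")
        digest = hmac.new(key, message, hashlib.sha256).digest()
        masks.append(int.from_bytes(digest[:8], "big") % modulus)
    return masks


secret_key = b"k" * 32  # placeholder key for the example only

masks_doc1 = channel_masks(secret_key, "doc-1", num_channels=8, modulus=53)
masks_doc1_again = channel_masks(secret_key, "doc-1", num_channels=8, modulus=53)
masks_doc2 = channel_masks(secret_key, "doc-2", num_channels=8, modulus=53)

print(masks_doc1 == masks_doc1_again)  # same key and document: reproducible
print(masks_doc1 == masks_doc2)        # different document: different masks
```

The key holder can regenerate any document's masks on demand, while someone without the key sees values that should be indistinguishable from random as long as HMAC-SHA256 behaves as a PRF.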

Speed

On the same hardware (Colab T4 GPU), fully homomorphic encryption (CKKS) takes 38.9ms per comparison. This system takes 5ms. Integer arithmetic only, no GPU needed for the comparison step.

Try it

I built a live demo where you can see this working in real time — search both systems side by side:

Live demo: https://huggingface.co/spaces/zahraarman/ZATRON

Code: https://github.com/zahraarmantech/ZATRON

Run locally:

pip install sentence-transformers scikit-learn matplotlib
python demo.py

What I want to know

I’m an independent researcher. I built this because I wanted to know if it was possible. It appears to work, but I’m sure there are things I’m missing.

If you work on vector search, privacy, or retrieval systems — what would break this? What am I not seeing?

Zahra Arman — Independent Researcher
*The method is covered by a US provisional patent.

What happens when you hide embeddings but keep search working?

zahraarmantech — Tue, 26 May 2026 21:16:03 +0000

I spent the last few months building a system that does something counterintuitive: it takes semantic search embeddings, makes them completely unreadable, and somehow search still works at 98% quality.

Here’s what that looks like.

The problem nobody talks about

Every company using semantic search has a dirty secret: their vector database is a map of their entire document collection’s meaning.

Embeddings cluster by topic. If someone gets access to your vector database — a breach, an insider, a subpoena — they don’t need to read a single document. They can cluster the embeddings and immediately see: these 500 documents are about cancer patients, these 200 are about ongoing litigation, these 100 are salary records.

No decryption needed. The structure IS the leak.

Look at the left side. Same-color dots represent same-topic documents. They cluster together — an attacker immediately sees the structure. The right side is the same 50 documents after ZATRON processing. Random noise. No clusters. No structure.

But here’s the thing: search returns nearly the same results on both sides.

What I built

ZATRON (Zero-Access Transformed Retrieval Over Noise) transforms embeddings into modular barcodes. The process:

Project the embedding onto 200 independent channels (PCA)
Quantize each channel to an integer (0–49)
Mask each value with a cryptographic salt unique to each document
Store only the modular residues (remainder after dividing by prime numbers)
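The last three steps can be sketched as follows, taking already-projected channel values as input. Several things here are assumptions for illustration: the mask is applied additively, each channel is stored under every prime, channel values are clipped to [-1, 1], and the names `quantize` and `encode` are hypothetical. The PCA projection and the mask derivation are left out.

```python
def quantize(value, lo=-1.0, hi=1.0, num_bins=50):
    """Map a real value in [lo, hi] to an integer bin 0..num_bins-1."""
    clipped = min(max(value, lo), hi)
    scaled = (clipped - lo) / (hi - lo)
    # The top edge (value == hi) would give num_bins, so cap it.
    return min(int(scaled * num_bins), num_bins - 1)


def encode(channels, masks, primes, lo=-1.0, hi=1.0, num_bins=50):
    """Turn projected channel values into residues.

    channels: real-valued projections, one per channel
    masks:    one secret integer per channel for this document
    primes:   moduli, each larger than num_bins
    Returns one list of residues per channel.
    """
    barcode = []
    for value, mask in zip(channels, masks):
        q = quantize(value, lo, hi, num_bins)
        # Assumed additive masking; only the residues are kept.
        barcode.append([(q + mask) % p for p in primes])
    return barcode


barcode = encode(channels=[0.0, 1.0], masks=[7, 100], primes=[53, 59])
print(barcode)  # [[32, 32], [43, 31]]
```

Neither the quantized value nor the mask is stored, only the residues, so the original embedding is not present anywhere in the output.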

The key insight: modular arithmetic preserves distance relationships but destroys the original values. Two similar documents produce similar modular distances. But the individual barcodes look like random numbers.

Without the key, you can’t unmask them. With the key, you can compare them. You never reconstruct the original embedding.

Does it actually work?

I tested on real data, not toy examples.

MSMARCO passage retrieval — 626,906 real documents:
The system preserves 98.2% of cosine search quality. Out of 500 queries, the protected system returns nearly identical rankings to plain cosine search.

Three different embedding models:
MiniLM: 98.2%. MPNet: 99.2%. BGE: 86.6% (this model’s embedding distribution is less quantization-friendly — I report this honestly).

Five languages:
Arabic, Spanish, Korean, Chinese, English — all above 88%.

Comparison with existing methods:

Method	Quality	Encrypted?
Binary quantization	96.9%	No
Scalar int8	98.8%	No
Product quantization	97.9%	No
ZATRON	99.6%	Yes

Higher quality than every quantization method — and the only one that’s encrypted.

Can an attacker break it?

I ran eight independent attacks against it. ZATRON passed all eight, meaning none of the attacks succeeded.

But the most convincing evidence is visual:

Left: raw embedding distances perfectly predict true document similarity (ρ = 1.00). An attacker with database access knows exactly which documents are related.

Right: ZATRON barcode distances show near-zero correlation with true similarity (ρ = 0.09). The attacker gets nothing useful.

What about FHE?

Fully homomorphic encryption (CKKS) can do encrypted search too. But on the same hardware (Google Colab, T4 GPU), CKKS takes 38.9ms per comparison. ZATRON takes 5ms. That’s roughly 8x faster (38.9 / 5 ≈ 7.8), using only integer arithmetic, no GPU needed.

Both are computationally secure — CKKS under Ring-LWE, ZATRON under PRF (HMAC-SHA256). Different assumptions, both standard.

What this is NOT

I want to be precise about what ZATRON is and isn’t:

It is NOT classical encryption like AES. You can’t “decrypt” a barcode back to an embedding.
It IS a randomized privacy-preserving encoding. Barcodes are computationally indistinguishable from random without the key.
A key holder who computes many pairwise distances CAN partially recover embedding geometry (ρ = 0.63, mitigated to 0.35 with log transform). This is inherent to any distance-preserving scheme, including FHE.

I state these limitations explicitly because overselling helps nobody.

Try it yourself

Live demo (no install needed):
https://huggingface.co/spaces/zahraarman/ZATRON

Code and paper:
https://github.com/zahraarmantech/ZATRON

Run locally:

pip install sentence-transformers scikit-learn matplotlib
python demo.py

Who needs this?

Any organization that searches sensitive documents: hospitals (patient records), law firms (case files), financial institutions (client data), defense (classified documents).

The EU AI Act and GDPR are making embedding privacy a compliance issue, not just a nice-to-have.

What’s next

The system works. The patent is filed. I’m looking for technical feedback, especially from people building vector search infrastructure.

If you work on vector databases, privacy-preserving ML, or searchable encryption — I’d genuinely appreciate your thoughts. What did I miss? What would break it? What would make it useful?

Zahra Arman — Independent Researcher, Plano TX
US Provisional Patent Pending