This is Part 3 of a series on building a privacy-first dating platform for HIV-positive communities. Part 1, Building a Zero-Knowledge Dating Platform for HIV-Positive Communities, covers the architecture; Part 2, Matching in the Dark: Zero-Knowledge Filtering Using 32-Bit Bitmasks, covers bitmask filtering.
Bitmasks got us far.
Two people can match on gender, region, marital status, and relationship intent — all without the server understanding any of it. That's the hard filter layer, and it works beautifully.
But here's what bitmasks can't tell you: whether two people will actually connect.
Someone can check every categorical box and still be a terrible match. The things that create real compatibility — how someone writes about themselves, what they care about, how they think about life — are too rich, too nuanced, too human to reduce to a set of switches.
So how do you compute soft compatibility when the server isn't allowed to read a single word of anyone's profile?
This is the second half of the matching engine: client-side embeddings, binarization, and Hamming distance.
AI-powered matching. Zero semantic leakage.
The Problem with Sending Embeddings to the Server
The obvious approach: generate embeddings in the browser, send the float vectors to the server, compute similarity there.
The problem: embeddings leak meaning.
A 512-dimensional float vector like [0.12, -0.03, 0.88, ...] isn't random noise. It encodes semantic structure. With the right ML tools, you can extract approximate meaning from embeddings — infer topics, reconstruct phrases, identify patterns. Researchers have demonstrated embedding inversion attacks that recover sensitive information from vectors alone.
For a general dating app, that's a privacy concern. For HIV-positive users, it's a potential exposure vector.
So we can't send floats. We need something the server can compare without being able to understand.
🧬 Step 1: Generate Embeddings Locally
The browser uses Universal Sentence Encoder (USE) to convert profile text into a 512-dimensional embedding. This runs entirely client-side — on fields like:
- About Me
- Education & Employment
- Hobbies & Interests
- Lifestyle
"I love hiking and finding good espresso" → [0.12, -0.03, 0.88, ...]
The server never sees the text. It never sees the floats. Everything that follows happens before anything leaves the browser.
🧪 Step 2: Normalize
We normalize the vector so its magnitude doesn't affect comparisons — only direction matters:
// L2-normalize: divide each component by the vector's Euclidean length
const norm = Math.sqrt(vec.reduce((s, x) => s + x * x, 0));
const normalized = vec.map(x => x / norm);
This ensures consistent behaviour across different devices, browsers, and profile lengths.
⚫ Step 3: Binarize — Floats to Bits
This is where it gets elegant.
We convert each float into a single bit based on its sign:
// Keep only the sign of each component: non-negative → 1, negative → 0
const bits = normalized.map(x => (x >= 0 ? 1 : 0));
512 floats become a 512-bit binary vector.
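In practice, the 512 bits are usually packed into bytes before being stored or sent, so the vector travels as 64 bytes rather than 512 numbers. A minimal sketch of that packing step (the `packBits` helper is illustrative, not the platform's actual code):

```javascript
// Pack an array of 0/1 bits into a Uint8Array, 8 bits per byte, MSB first.
function packBits(bits) {
  const bytes = new Uint8Array(Math.ceil(bits.length / 8));
  bits.forEach((bit, i) => {
    if (bit) bytes[i >> 3] |= 0x80 >> (i & 7);
  });
  return bytes;
}

packBits([1, 0, 1, 1, 0, 1, 0, 0]); // one byte: 0b10110100
```

A 512-bit vector packs into exactly 64 bytes, which also makes the server-side XOR comparison in Step 5 cheap.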
What this destroys (intentionally):
- Exact magnitudes
- Fine-grained direction (everything except the sign pattern)
- Most of the recoverable semantic detail
- Any practical path back to the original floats
What this preserves:
- Relative similarity between profiles
- Compatibility with fast bitwise operations
Two people with similar embeddings will tend to share most of their sign pattern; two people who are very different will diverge across many bits. This is the same intuition behind sign-based locality-sensitive hashing schemes such as SimHash: nearby vectors agree on most signs. The signal survives. The meaning doesn't.
This is the step that makes the system genuinely zero-knowledge.
🔐 Step 4: Hash for Integrity
As an additional safeguard, we hash the binary vector:
// sha256 is any standard SHA-256 implementation (e.g. Web Crypto's subtle.digest)
const hash = sha256(bits.join(""));
The server stores both:
- The 512-bit vector — used for matching
- The SHA-256 hash — used for integrity verification
Neither reveals the original text. Neither reveals the original floats. A brute-force attack on the hash would mean searching a space of 2^512 possible bit vectors, which is computationally infeasible.
⚡ Step 5: Hamming Distance on the Backend
Now the server can compute similarity — without understanding what it's comparing.
Hamming distance counts the number of bit positions where two vectors differ:
User A: 1 0 1 1 0 1 0 0 1 ...
User B: 1 0 1 0 0 1 0 0 1 ...
              ↑
1 bit differs → distance = 1
Lower distance = more similar profiles.
In Erlang:
%% hamming/2 counts the differing bits between two 512-bit binaries
Distance = hamming(BinaryA, BinaryB),
Score = 1 / (1 + Distance).
This maps distance to a score in (0, 1]: identical vectors score exactly 1, and the score falls toward 0 as vectors diverge. The server ranks potential matches by score, returns the ranked IDs, and the frontend decrypts each profile locally.
The server computed meaningful compatibility rankings — while knowing nothing about what made those profiles compatible.
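For readers who don't speak Erlang, here is the same distance-and-score computation sketched in JavaScript, operating on packed byte vectors (a simple popcount loop, not the platform's backend code):

```javascript
// Count the differing bits between two equal-length packed byte vectors.
function hammingDistance(a, b) {
  let dist = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i]; // XOR marks every bit position that differs
    while (x) {
      dist += x & 1;
      x >>= 1;
    }
  }
  return dist;
}

// Lower distance means higher score; identical vectors score exactly 1.
const similarity = (a, b) => 1 / (1 + hammingDistance(a, b));

similarity([0b10110100], [0b10110100]); // 1
similarity([0b10110100], [0b10110000]); // 0.5 (one differing bit)
```

XOR plus popcount is all the server ever does with these vectors, which is why the comparison stays fast even across thousands of candidates.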
🧩 The Full Pipeline
Profile text (browser only)
│
▼
Universal Sentence Encoder
(runs locally in browser)
│ 512 floats
▼
Normalize vector
│ 512 floats (unit length)
▼
Binarize (sign threshold)
│ 512 bits
▼
SHA-256 hash
│
▼
Send to backend:
[512-bit vector] + [hash]
│
▼
Hamming distance matching
(server sees only math)
🧩 How the Two Layers Work Together
| Layer | Method | Handles |
|---|---|---|
| Hard filter | 32-bit bitmask | Gender, region, status, intent |
| Soft ranking | Hamming distance | Personality, hobbies, writing style, interests |
The bitmask layer finds possible matches — people who meet categorical criteria. The embedding layer ranks those matches by actual compatibility — the things that are harder to quantify but matter more.
Together they form a two-stage zero-knowledge pipeline:
Hard filter (bitmask) → Soft ranking (Hamming distance)
Neither stage requires the server to read, store, or understand a single word about any user.
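The two stages compose straightforwardly on the server: filter by bitmask, then rank the survivors by Hamming score. A sketch in JavaScript, where the candidate shape (`id`, `mask`, `bits`) and the field names are illustrative, not the platform's actual schema:

```javascript
// Count differing bits between two packed byte vectors.
function hamming(a, b) {
  let d = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i];
    while (x) { d += x & 1; x >>= 1; }
  }
  return d;
}

// Stage 1: hard filter, keep candidates whose bitmask satisfies the query mask.
// Stage 2: soft ranking, sort survivors by Hamming-based similarity score.
function rankMatches(seeker, candidates) {
  return candidates
    .filter(c => (c.mask & seeker.requiredMask) === seeker.requiredMask)
    .map(c => ({ id: c.id, score: 1 / (1 + hamming(seeker.bits, c.bits)) }))
    .sort((x, y) => y.score - x.score)
    .map(c => c.id);
}
```

The server runs exactly this shape of computation over opaque integers and returns only IDs; everything meaningful about those IDs is decrypted client-side.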
⚠️ What This Doesn't Protect Against
Being honest about the limits matters — especially for this community.
Binarization loses information. Converting 512 floats to 512 bits is lossy. Two candidate pairs whose cosine similarities are, say, 0.92 and 0.78 can land at the same Hamming distance. The ranking is a useful signal, not a precise measurement.
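A toy 2-D example makes the loss concrete: two vectors whose components are all positive binarize to identical bits even though they point in quite different directions (the numbers are illustrative):

```javascript
// Cosine similarity between two float vectors.
const cosine = (a, b) => {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const len = v => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (len(a) * len(b));
};

// Sign-based binarization, as in Step 3.
const sign = v => v.map(x => (x >= 0 ? 1 : 0));

const a = [0.9, 0.1];
const b = [0.1, 0.9];
sign(a); // [1, 1]
sign(b); // [1, 1], so Hamming distance 0
cosine(a, b); // ≈ 0.22, far from identical
```

In 512 dimensions the sign pattern carries far more signal than in this 2-D toy, but the example shows why the ranking is approximate rather than exact.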
USE itself has biases. Universal Sentence Encoder was trained on general internet text. It may encode cultural, linguistic, or demographic biases in ways that affect match quality for some communities. This is an active area of research and a known limitation of off-the-shelf embedding models.
The embedding model is public. USE is open-source. An attacker who knows the model and captures a binary vector could attempt partial reconstruction. Binarization makes this significantly harder — but not theoretically impossible for a well-resourced adversary. The threat model assumes the server is the primary attack surface, not a compromised client.
Embedding quality depends on input quality. Short or generic "about me" text produces less useful embeddings. Users who write more give the system more signal to work with — but that also means their vectors carry more information. The tradeoff is inherent.
🌑 What This Means for the People Using It
There's a version of this system that would be easier to build: store everything in plaintext, use a recommendation engine, optimize for engagement metrics.
That version would work. It would also mean that a single subpoena, a single disgruntled employee, or a single breach could expose the health status, location, and intimate preferences of every person on the platform.
For most people, that's a risk worth taking for convenience. For the communities this platform was built for, it isn't.
The binarized embedding system isn't perfect. But it means that even if everything goes wrong — the database is leaked, the server is compromised, the company is pressured — the attacker still gets binary vectors and Hamming distances. They don't get profiles. They don't get health information. They don't get names.
That gap — between what the system knows and what an attacker could extract — is the whole point.
The platform is live at HIVPositiveMatches.com — built on everything this series covers.
▶️ Coming Next: Key-Wrapping in Practice
The matching engine is now complete: bitmasks for hard filtering, embeddings for soft ranking, all computed without the server reading anything.
But what happens when a match is made and two users want to actually communicate?
In Part 4, I'll walk through the key-wrapping flow that allows two users to exchange encrypted messages — where even the server facilitating the exchange cannot read what's being said.