---
title: 65 deterministic tests for a matching engine, no database, no fixtures
published: true
canonical_url: https://byvibration.com/essays/why-matching-layer-is-physically-blind
tags: typescript, testing, opensource, webdev
---
I write the matching engine for a small dating and friendship app. The engine is open-source (github.com/donnowyu/soulmate-core, MIT). The hardest thing about working on it was not the math. It was that for the first three weeks I could not refactor anything because I had no way to know whether my changes had broken the ranking.
This post is the testing pattern I landed on. 65 deterministic tests. No database. No fixtures loaded from disk. No live embeddings. The whole suite runs in 1.8 seconds and tells me, in plain language, whether any of fifteen ranking invariants just regressed.
If you maintain anything that takes a query and returns a sorted list (search, recsys, matching, ranking, RAG retrieval), the shape of this should transfer.
The problem with the obvious test setup
The first version of the test suite was the obvious one. Seed a fake Postgres, insert 200 fake profiles, compute embeddings with the real model, ask the engine to rank them, snapshot the top 20.
Three problems showed up within a week.
It was slow. Real embedding calls plus database round trips meant ~40 seconds for the full suite. I stopped running tests before commits.
It was non-deterministic. The embedding model was probabilistic enough that the same input produced subtly different vectors across runs, which meant the snapshot diff was always noisy. I learned to ignore the diff. Predictably, the day a real regression slipped in I ignored that diff too.
And the failures, when they came, did not tell me anything. A snapshot would change. I would stare at a list of 20 user IDs and try to remember what each one was supposed to represent. The test had nothing to say about why the new order was wrong, only that it was different.
What deterministic tests look like for a matching engine
The flip was to stop testing the system end-to-end and start testing each invariant the ranker is supposed to satisfy, one at a time, with the smallest possible synthetic input.
Here is one of the tests, verbatim.
```typescript
test("intent overlap dominates lexical similarity", () => {
  const candidates = [
    profile({ intent: "relationship", text: "I like long walks" }),
    profile({ intent: "friendship", text: "I like long walks and hiking" }),
  ];
  const viewer = profile({ intent: "relationship", text: "I like hiking" });
  const ranked = rank(viewer, candidates);
  expect(ranked[0].intent).toBe("relationship");
});
```
There is no database. `profile` is a one-line helper that builds a Profile object with sensible defaults. `rank` is the real function from the engine. The vectors are synthetic, produced by a tiny deterministic embedder (a stable hash of each token, projected into 128 dimensions), not the live model.
The test reads, in English, as a sentence: "When two candidates have similar text but different intents, the one with matching intent should win." When that test fails, the diagnostic is not "snapshot differs." It is "rank()[0].intent was 'friendship', expected 'relationship'." That tells you exactly which invariant broke.
There are 65 of these. Each one is roughly four to ten lines. Together they cover the fifteen ranking invariants the matcher is supposed to hold, with several tests per invariant for edge cases (empty fields, very short text, language mismatch, intent set to community, etc.).
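For readers who want to picture the `profile` helper, here is a minimal sketch. The `TestProfile` shape, its field names, and the defaults are assumptions made for illustration; the real Profile type in the repo may look different.

```typescript
// Hypothetical test-only shape; the real Profile type in the repo may differ.
type Intent = "friendship" | "relationship" | "community";

interface TestProfile {
  id: string;
  intent: Intent;
  text: string;
  language: string;
  completed: boolean;
}

let nextId = 0;

// Builds a profile with neutral defaults so each test only spells out
// the fields its invariant depends on.
function profile(overrides: Partial<TestProfile> = {}): TestProfile {
  nextId += 1;
  return {
    id: `p${nextId}`,
    intent: "friendship",
    text: "",
    language: "en",
    completed: true,
    ...overrides,
  };
}

// Only the fields that matter to the invariant are written out.
const exampleViewer = profile({ intent: "relationship", text: "I like hiking" });
```

Keeping the defaults boring is the point: if a test fails, the only interesting fields are the ones the test itself set.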
How to keep them deterministic without faking the math
The trick is the deterministic embedder. It is about thirty lines. It tokenizes the input the same way the real pipeline does, then projects each token into a fixed 128-dimensional vector using a stable hash. Two inputs with identical tokens produce identical vectors. Two inputs that share most tokens produce vectors with a high cosine similarity, just like a real embedder would.
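```typescript
function tinyEmbed(text: string): Float32Array {
  const v = new Float32Array(128);
  for (const tok of tokenize(text)) {
    // JS shift counts wrap at 32, so shifting one hash by i >= 32 would just
    // repeat its bits. Assuming stableHash returns a 32-bit integer, hash a
    // salted copy of the token for each block of 32 dimensions instead.
    for (let base = 0; base < 128; base += 32) {
      const h = stableHash(`${tok}:${base}`);
      for (let bit = 0; bit < 32; bit++) {
        v[base + bit] += ((h >>> bit) & 1) ? 1 : -1;
      }
    }
  }
  return normalize(v);
}
```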
This is not the right embedder for production. It cannot tell that "happy" and "joyful" are similar. But for testing the ranker, it is exactly right. The ranker should care about intent overlap, language match, profile-completeness, and recency before it cares about the small nuance the real embedder adds. The tests assert those four things hold even with the dumb embedder. If a refactor accidentally makes the engine depend on the embedder's semantic nuance for a basic invariant, the test catches it.
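To make the similarity claim concrete, here is a self-contained sketch of a hash-based embedder of this kind. The `tokenize` and `stableHash` bodies are stand-ins I am assuming for illustration (a lowercase split and FNV-1a with a murmur3-style finalizer); the engine's real helpers may differ.

```typescript
// Stand-in helpers: assumptions for illustration, not the engine's real ones.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// FNV-1a over UTF-16 code units, then a murmur3-style finalizer so every
// output bit depends on the whole input. Returns an unsigned 32-bit integer.
function stableHash(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

const DIMS = 128;

// Each token adds a fixed +1/-1 pattern; the sum is scaled to unit length.
function tinyEmbed(text: string): Float32Array {
  const v = new Float32Array(DIMS);
  for (const tok of tokenize(text)) {
    // One salted 32-bit hash per block of 32 dimensions.
    for (let base = 0; base < DIMS; base += 32) {
      const h = stableHash(`${tok}#${base}`);
      for (let bit = 0; bit < 32; bit++) {
        v[base + bit] += (h >>> bit) & 1 ? 1 : -1;
      }
    }
  }
  let sumSq = 0;
  for (let i = 0; i < DIMS; i++) sumSq += v[i] * v[i];
  const norm = Math.sqrt(sumSq);
  if (norm > 0) {
    for (let i = 0; i < DIMS; i++) v[i] /= norm;
  }
  return v;
}

// Inputs are unit-length (or all zeros), so the dot product is the cosine.
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}

const same = cosine(tinyEmbed("I like long walks"), tinyEmbed("I like long walks"));
const overlap = cosine(tinyEmbed("I like long walks"), tinyEmbed("I like long walks and hiking"));
const unrelated = cosine(tinyEmbed("I like long walks"), tinyEmbed("quantum chess tournament schedule"));
```

Identical text always yields the same vector, and texts that share most tokens share most of the summed sign patterns, so `overlap` should land well above `unrelated`. Nothing here knows what a word means, which is the property the invariant tests rely on.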
Why no database
Every test runs against in-memory objects. The engine accepts a viewer profile and a list of candidate profiles, both as plain TypeScript values, and returns a ranked list. The Postgres + pgvector layer that production uses is a separate module that calls into this pure function. The tests do not touch it.
This is the most expensive design decision in the whole repo. It means the ranker cannot do anything fancy with the database (no clever joins, no SQL-side scoring). Every signal it uses must travel through the function signature. In exchange the tests run in 1.8 seconds and have no flakiness. I have come back to this tradeoff many times. I keep choosing the tests.
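The split between a pure ranking core and a thin persistence shell can be sketched like this. All names here (`rankCandidates`, `CandidateStore`, `rankFromStore`, the `Scorer` type) are hypothetical and chosen for the example; the repo's actual module boundaries may look different.

```typescript
interface Profile {
  id: string;
  intent: string;
}

type Scorer = (viewer: Profile, candidate: Profile) => number;

// Pure core: no I/O, no clock, no globals. Same inputs, same output.
function rankCandidates(
  viewer: Profile,
  candidates: readonly Profile[],
  score: Scorer,
): Profile[] {
  return candidates
    .map((candidate) => ({ candidate, s: score(viewer, candidate) }))
    .sort((a, b) => b.s - a.s)
    .map((entry) => entry.candidate);
}

// Thin shell: the only place that knows a database exists.
interface CandidateStore {
  nearest(viewer: Profile, limit: number): Promise<Profile[]>;
}

async function rankFromStore(
  viewer: Profile,
  store: CandidateStore,
  score: Scorer,
): Promise<Profile[]> {
  const candidates = await store.nearest(viewer, 100);
  return rankCandidates(viewer, candidates, score);
}

// Tests call the pure core directly with in-memory values.
const intentScorer: Scorer = (v, c) => (v.intent === c.intent ? 1 : 0);
const ranked = rankCandidates(
  { id: "viewer", intent: "relationship" },
  [
    { id: "a", intent: "friendship" },
    { id: "b", intent: "relationship" },
  ],
  intentScorer,
);
```

Everything the scorer needs has to arrive as an argument, which is the constraint described above: it rules out SQL-side scoring, and in return the tests never need a connection string.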
What this gets you
Refactor confidence. I rewrote the rerank step three times in two months. After each rewrite, I ran the suite. If the tests passed, the rerank was at least as good as the previous version on every invariant I care about. If a test failed, the message told me exactly which invariant I had broken.
Honest documentation. The test file is the most useful documentation of how the matcher behaves. New contributors read the tests first. The invariants are not in a wiki that will go stale; they are executable.
A floor under quality. I cannot make a change that ships a worse ranker by accident. The bar is "all 65 invariants still hold." Below that bar, CI is red.
The three things this does not test
Ranking order on real data. Snapshot tests are still useful for catching subtle changes there. I have a separate test file with two of those, gated on a RUN_INTEGRATION=1 env var, that uses real embeddings and a small fixture set. They are slow and they are noisy. They are not the suite I run on every commit.
Performance. The 65 tests do not check that ranking is fast. There is a separate benchmark file that runs nightly.
Calibration. The tests check that the ranker orders correctly. They do not check that the score thresholds match what should be a "good match" for users. That is a product question and the answer comes from user data, not from invariant tests.
The pattern, generalized
If you maintain a system that takes a query plus a candidate set and returns a sorted list, the pattern is:
- List the invariants the ranker should satisfy, in English. "When the user explicitly excludes a language, no candidate of that language should appear in the top N." Write that down.
- For each invariant, build a minimal synthetic input that exercises only it.
- Use a deterministic stand-in for any probabilistic component (embeddings, model calls). Make sure the stand-in is dumb enough that the invariant still has to be satisfied by the ranker, not by accident.
- Skip the persistence layer entirely. Treat the ranker as a pure function.
- Write the assertion in terms of the invariant, not in terms of which specific items came back in which order. "The top result should have intent X" not "the top result should be ID 47."
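As a rough illustration of the steps above, here is the language-exclusion invariant written against a toy ranker. `toyRank`, `checkInvariant`, and the field names are all made up for this sketch; in a real suite the toy ranker would be replaced by the function under test and `checkInvariant` by the test framework's own assertions.

```typescript
interface Candidate {
  id: string;
  language: string;
  score: number;
}

interface Viewer {
  excludedLanguages: string[];
}

// Toy stand-in for the system under test.
function toyRank(viewer: Viewer, candidates: readonly Candidate[]): Candidate[] {
  return candidates
    .filter((c) => viewer.excludedLanguages.indexOf(c.language) === -1)
    .sort((a, b) => b.score - a.score);
}

// Fails with the invariant's English name, not with a diff of IDs.
function checkInvariant(name: string, holds: boolean): void {
  if (!holds) {
    throw new Error(`invariant broken: ${name}`);
  }
}

// Minimal synthetic input: the excluded candidate has the highest raw score,
// so a ranker that forgets the exclusion would put it first.
const topN = 3;
const result = toyRank({ excludedLanguages: ["fr"] }, [
  { id: "a", language: "fr", score: 0.99 },
  { id: "b", language: "en", score: 0.7 },
  { id: "c", language: "de", score: 0.6 },
]);

checkInvariant(
  "an excluded language never appears in the top N",
  result.slice(0, topN).every((c) => c.language !== "fr"),
);
```

The assertion names a property of the output, so it keeps passing if a refactor legitimately reorders `b` and `c`, and it fails loudly only when the excluded candidate leaks through.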
The reward is a test suite that tells you, in English, what you just broke. After enough refactors that survived it, the suite starts to feel like a colleague who has read every line of the engine and will let you know the moment you do something stupid.
The full engine, all 65 tests, and the tiny deterministic embedder are at github.com/donnowyu/soulmate-core. If you ever want to see one specific test, tests/rank.invariants.test.ts is the file. The engine ships on byvibration.com, which is a small relationship and friendship app that genuinely cannot read your photos.

---
title: How a photo-blind dating engine actually ranks people (the TypeScript)
published: true
canonical_url: https://byvibration.com/essays/why-matching-layer-is-physically-blind
tags: typescript, webdev, postgres, ai
---
Last post I argued that the matcher in our dating app cannot read photos because the TypeScript types make it impossible. A few people asked the obvious follow-up. If the matcher never sees a face, what does it see, and how does it decide who you should meet this week?
This post is that. Code samples, vector math, the one heuristic that does most of the work, and the three things we explicitly chose not to do. Repo is at github.com/donnowyu/soulmate-core, MIT.
The thing the matcher actually sees
A profile, in the eyes of the ranker, is this:
```typescript
type Profile = {
  prompts: PromptAnswers; // five short text answers
  voice: VoiceTranscript; // ~30s recording, kept as text
  intent: Intent;         // 'friendship' | 'relationship' | 'community'
  meta: ProfileMeta;      // age band, language, city, locale
};
```
No photo field. No height. No income. No "tags." The strongest input by mass is the prompts plus the voice transcript, which together produce somewhere between 800 and 2,500 tokens of free-form text about how this person actually thinks.
That text is the matching substrate. Everything downstream is a function of it.
Step 1: turn text into a vector
We embed the concatenated prompts-plus-voice into a fixed-size vector using a text embedding model. The exact provider does not matter much. We use OpenAI's text-embedding-3-small (1536 dims) because it is cheap, multilingual, and good enough that the rest of the system survives provider churn.
```typescript
// soulmate-core/src/embed.ts
export async function embedProfile(p: Profile): Promise<Vector> {
  const text = formatForEmbedding(p);
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return data[0].embedding as Vector;
}

function formatForEmbedding(p: Profile): string {
  return [
    ...Object.values(p.prompts),
    p.voice.text,
  ].filter(Boolean).join("\n\n");
}
```
The vector is what gets stored in Postgres, in a column typed vector(1536) thanks to pgvector. The profile row also stores the prompts and the voice transcript for display, but the matcher reads the vector and only the vector. Whatever else lives on the row is not in the function signature, so the compiler cannot accidentally let it leak in.
Step 2: find candidates with pgvector
Given a viewer with embedded vector v, the candidate query is a cosine-distance ANN lookup, filtered by intent overlap and a completed-profile gate:
```sql
SELECT id, embedding <=> $1 AS distance
FROM profiles
WHERE id != $2
  AND completed_at IS NOT NULL
  AND $3 = ANY(intents)
  AND id NOT IN (SELECT target_id FROM blocks WHERE actor_id = $2)
ORDER BY embedding <=> $1
LIMIT 100;
```
<=> is the pgvector cosine-distance operator. The index is an HNSW on the embedding column so that "100 nearest" runs in milliseconds even at 100k+ profiles. Smaller-distance is more similar, since cosine-distance is 1 - cosine-similarity and the operator returns the distance form.
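For readers who want the distance/similarity relationship spelled out, here is a small sketch in plain TypeScript. It mirrors the definition only; it is not how pgvector computes the value internally, and the function names are my own.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // convention chosen for this sketch
  return dot / Math.sqrt(normA * normB);
}

// Cosine distance is the "1 - similarity" form, so it ranges from 0 to 2
// and smaller means more similar.
function cosineDistance(a: readonly number[], b: readonly number[]): number {
  return 1 - cosineSimilarity(a, b);
}

const identical = cosineDistance([1, 2, 3], [1, 2, 3]);
const orthogonal = cosineDistance([1, 0], [0, 1]);
const opposite = cosineDistance([1, 0], [-1, 0]);
```

One consequence worth keeping in mind: recovering similarity as `1 - distance` can give a negative number for vectors that point in opposite directions, so a signal documented as `0..1` presumably clamps or never sees such pairs in practice. That is an inference on my part, not something the post states.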
Two things to notice. First, the SQL itself reads no photo data. There is no photo table anywhere in this query. Second, the candidate set is capped at 100. The ranker never sees more.
Step 3: rank the 100 with a more expensive signal
Cosine distance on embeddings is the cheap pass. It is right about taste, off about intent depth. Two people can write similarly and want very different things. So we re-rank the 100 with a second function that does not call an LLM but does look at structured signals the embedding tends to flatten.
```typescript
// soulmate-core/src/rank.ts
export function rank(viewer: Profile, candidate: Profile): number {
  const text = textSim(viewer, candidate);          // 0..1
  const intent = intentOverlap(viewer, candidate);  // 0..1
  const energy = energyMatch(viewer, candidate);    // 0..1
  const cadence = cadenceMatch(viewer, candidate);  // 0..1
  return (
    text * 0.55 +
    intent * 0.25 +
    energy * 0.12 +
    cadence * 0.08
  );
}
```
textSim is the cosine similarity reconstructed from the distance returned by Postgres. intentOverlap weighs whether both sides want the same kind of connection (friendship, relationship, community), and how strongly. energyMatch and cadenceMatch are small heuristics derived from how much text the person wrote and how fast they answer messages historically. They mostly catch the case where two people are similar on substance but operate on incompatible rhythms.
The weights are not fitted. They are intuitions we did not have data to fit yet, and we kept them in code so any future change is a real diff and not a parameter twiddle that nobody notices. When we have enough signal to fit them, we will, and that PR will be reviewable in one page.
The function returns one float. We pick the top 5 above a 0.45 threshold for the weekly batch. If fewer than 5 cross the threshold, we send fewer. We do not pad.
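The selection rule above is small enough to sketch. The function name, the `Scored` shape, and the choice to treat a score of exactly 0.45 as not crossing the threshold are my assumptions; the post only says "above a 0.45 threshold".

```typescript
interface Scored {
  id: string;
  score: number;
}

// Keep at most `maxSize` candidates whose score is strictly above `threshold`,
// best first. If fewer qualify, return fewer: no padding.
function selectWeeklyBatch(
  scored: readonly Scored[],
  threshold: number = 0.45,
  maxSize: number = 5,
): Scored[] {
  return scored
    .filter((s) => s.score > threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxSize);
}

// Only two of these four clear the bar, so the batch has two entries.
const batch = selectWeeklyBatch([
  { id: "a", score: 0.9 },
  { id: "b", score: 0.44 },
  { id: "c", score: 0.5 },
  { id: "d", score: 0.45 },
]);
```

Filtering before slicing is what makes "we do not pad" hold by construction: there is no code path that can reach below the threshold to fill a slot.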
What we explicitly did not do
Three things kept coming up in design review and we kept choosing not to.
We did not build a feed. There is no infinite-scroll candidate stream in this product. The weekly batch is the whole surface. The argument for a feed is engagement; we are intentionally trading engagement for a different shape of behavior, the one where the user opens the app rarely and deliberately.
We did not let the matcher see photos, not even as a tiebreaker. We considered the version where photos enter at rank time with a small weight, and rejected it for the obvious type-system reason and the less obvious behavioral one: as soon as the matcher can see faces, the production data collection of "what humans clicked on" starts encoding face preference into the ranker even if no explicit feature does. The cleanest defense is to make the photo bytes literally unreachable from the function. The compiler is the policy.
We did not put an LLM in the ranker. The temptation is real, especially since we are already embedding text. We resisted because an LLM in the loop makes the function opaque in a way that the four-feature linear combination is not. If a match is wrong, we can read the four numbers. We cannot read an LLM the same way.
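Reading "the four numbers" can be made literal with a small breakdown helper. This is a sketch: the weights are copied from the `rank` function shown earlier, while `explainScore` and the surrounding types are hypothetical names that do not come from the repo.

```typescript
// Weights copied from the rank function shown earlier in the post.
const WEIGHTS = { text: 0.55, intent: 0.25, energy: 0.12, cadence: 0.08 } as const;

type SignalName = keyof typeof WEIGHTS;
type Signals = Record<SignalName, number>;

interface ScoreBreakdown {
  total: number;
  contributions: Signals;
}

// Returns the final score together with each signal's weighted share of it.
function explainScore(signals: Signals): ScoreBreakdown {
  const contributions: Signals = {
    text: signals.text * WEIGHTS.text,
    intent: signals.intent * WEIGHTS.intent,
    energy: signals.energy * WEIGHTS.energy,
    cadence: signals.cadence * WEIGHTS.cadence,
  };
  const total =
    contributions.text +
    contributions.intent +
    contributions.energy +
    contributions.cadence;
  return { total, contributions };
}

// A pair that writes alike but wants different things: text is high,
// intent is zero, and the breakdown shows where the score was lost.
const breakdown = explainScore({ text: 0.9, intent: 0, energy: 0.5, cadence: 0.5 });
```

With a linear combination, a surprising match can be debugged by looking at which contribution is out of line. That is the kind of inspection an LLM in the loop would not offer.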
Why this matters outside dating
The pattern, embedding-plus-pgvector-plus-small-linear-rerank, is good for any product where the primary signal is "how this user thinks" rather than "what this user clicked on." Documentation search, similar-issue triage, mentor matching, study-group formation. The dating context is just the one where the cost of being wrong is most visible to the user.
If you want to read the full implementation, it is at github.com/donnowyu/soulmate-core, all of it under MIT. The vector math is in src/rank.ts and src/embed.ts; the SQL is in db/migrations/. Tests cover the rank function and the edge cases of empty profiles, missing voices, and intent mismatch.
The product that wraps this engine is byvibration.com. It is the same idea taken all the way to a working app: you write, the engine reads how you think, you meet by mind not by face.
I work on byvibration. The framework above stands on its own; the product is one way to live inside it.