<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yanis Schweizer</title>
    <description>The latest articles on DEV Community by Yanis Schweizer (@yanishimself).</description>
    <link>https://dev.to/yanishimself</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3888763%2F67ec83bc-f8e5-4b01-ba0c-17600eb0e3da.jpeg</url>
      <title>DEV Community: Yanis Schweizer</title>
      <link>https://dev.to/yanishimself</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yanishimself"/>
    <language>en</language>
    <item>
      <title>Stop Benchmarking Embedding Models. 90% of Your Search Quality Lives Upstream.</title>
      <dc:creator>Yanis Schweizer</dc:creator>
      <pubDate>Mon, 20 Apr 2026 10:40:29 +0000</pubDate>
      <link>https://dev.to/yanishimself/stop-benchmarking-embedding-models-90-of-your-search-quality-lives-upstream-2pdf</link>
      <guid>https://dev.to/yanishimself/stop-benchmarking-embedding-models-90-of-your-search-quality-lives-upstream-2pdf</guid>
      <description>&lt;p&gt;Quick context: I'm CTO at Vaultt (formerly StudentVenture), a recruitment marketplace for top 1% non-traditional talent. 10,000+ candidate profiles, semantic matching in production for over a year.&lt;/p&gt;

&lt;p&gt;We run everything on pgvector inside our main Postgres database, with LLM-generated summaries as the embedding input and hybrid filtering at query time.&lt;/p&gt;

&lt;p&gt;Every few months a new "best embedding model in 2026" benchmark lands, and founders ask me the same question: should we be using model X instead of model Y?&lt;/p&gt;

&lt;p&gt;Almost always, it's the wrong question. Here's why, with numbers from our own pipeline.&lt;/p&gt;

&lt;h2&gt;The test&lt;/h2&gt;

&lt;p&gt;Same candidate corpus. Same eval set of real recruiter queries (not synthetic ones, actual text recruiters type). Same scoring rule: did the top 10 retrieved candidates contain the people the recruiter ended up interviewing?&lt;/p&gt;
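
&lt;p&gt;For concreteness, one reasonable reading of that scoring rule as code. This is a sketch; the function name and data shapes are illustrative, not our production harness:&lt;/p&gt;

```python
def hit_rate_at_k(retrieved, interviewed, k=10):
    """Fraction of eval queries where at least one candidate the
    recruiter actually interviewed appears in the top-k results."""
    hits = 0
    for query_id, ranked in retrieved.items():
        relevant = interviewed.get(query_id, set())
        if relevant and set(ranked[:k]).intersection(relevant):
            hits += 1
    return hits / len(retrieved)

# Toy eval set: query -> ranked candidate ids, and who got interviewed.
retrieved = {"q1": ["c1", "c2", "c3"], "q2": ["c9", "c4"]}
interviewed = {"q1": {"c2"}, "q2": {"c7"}}
print(hit_rate_at_k(retrieved, interviewed))  # 0.5
```

&lt;p&gt;Fifty real queries scored this way is a sharper instrument than any leaderboard number.&lt;/p&gt;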

&lt;p&gt;I ran it across five embedding models spanning the cost and capability range: an open-source model, Google's gemini-embedding-001, OpenAI text-embedding-3-large, Voyage voyage-3.5-large, and a smaller on-prem option.&lt;/p&gt;

&lt;p&gt;Best model to worst: a 7-point spread. That's within the range you'd see between two random seeds on the same model.&lt;/p&gt;

&lt;h2&gt;The other test&lt;/h2&gt;

&lt;p&gt;Same eval. Kept the worst model in place. Changed only one thing: what text I passed to the embedder.&lt;/p&gt;

&lt;p&gt;Version 1: raw profile data, serialized field by field. Name, bio, skills array, experience array, current role.&lt;/p&gt;

&lt;p&gt;Version 2: an LLM-generated structured summary. For each candidate, we run a one-time pipeline at ingestion that parses PDF portfolios, OCRs image-based projects, reads their CV, combines it with their self-written bio, and produces a single natural-language paragraph describing what this person actually does and is good at. That paragraph is what we embed.&lt;/p&gt;
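
&lt;p&gt;A minimal sketch of that one-time ingestion pass. Every helper here (extract_portfolio, ocr_projects, parse_cv, summarize_llm) is a placeholder for whatever extraction and LLM calls you actually use, not our real code:&lt;/p&gt;

```python
def build_embedding_input(candidate, extract_portfolio, ocr_projects,
                          parse_cv, summarize_llm):
    """Collect every text source for a candidate, then ask a strong LLM
    for one natural-language paragraph to use as the embedding input."""
    sources = {
        "bio": candidate.get("bio", ""),
        "portfolio": extract_portfolio(candidate),  # PDF parsing
        "projects": ocr_projects(candidate),        # OCR on image-based work
        "cv": parse_cv(candidate),
    }
    prompt = (
        "Write one paragraph describing what this person actually does "
        "and is good at, based on:\n"
        + "\n".join(f"{k}: {v}" for k, v in sources.items() if v)
    )
    return summarize_llm(prompt)  # this paragraph is what gets embedded
```

&lt;p&gt;The key property: it runs once per candidate at ingestion, so its cost never shows up at query time.&lt;/p&gt;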

&lt;p&gt;Quality delta on retrieval: 40 points.&lt;/p&gt;

&lt;p&gt;One variable. Upstream of the model call. Forty points.&lt;/p&gt;

&lt;h2&gt;Why this happens&lt;/h2&gt;

&lt;p&gt;Embedding models are trained to place semantically similar text close in vector space. The operative word is text. Feed them a JSON blob with redundant fields, inconsistent formatting, and no narrative coherence, and you get a vector that describes "a JSON blob from a recruitment platform." Feed them a clean description of a human's skills and work, and you get a vector that describes the human.&lt;/p&gt;

&lt;p&gt;MTEB, BEIR, and every benchmark on the leaderboards you're staring at assume clean, purposeful text as input. That's an implicit assumption most production pipelines violate from the first day.&lt;/p&gt;

&lt;h2&gt;The architecture we landed on&lt;/h2&gt;

&lt;p&gt;The code is the easy part. The architectural decisions that mattered:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decouple embedding input generation from the embedding call itself.&lt;/strong&gt; We store an embedding_input text column on every candidate. It gets regenerated whenever our preprocessing logic changes. The actual embed call reads from that column. Swapping models and improving preprocessing are both batch jobs; neither touches application code.&lt;/p&gt;
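
&lt;p&gt;In Postgres terms, the decoupling is just a second column next to the vector it produced. Table name and vector dimension below are illustrative:&lt;/p&gt;

```sql
-- The precomputed text lives next to the vector derived from it.
-- A preprocessing change is a batch job: regenerate embedding_input,
-- then re-embed only the rows whose input actually changed.
ALTER TABLE candidates
  ADD COLUMN embedding_input text,
  ADD COLUMN embedding vector(1536);
```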

&lt;p&gt;&lt;strong&gt;Spend your expensive model at ingestion, not at query time.&lt;/strong&gt; We call a strong LLM once per candidate to build the structured summary, then a cheap embedding model on that summary. At query time we use the same cheap embedding model on the query string and a vector distance lookup. The expensive LLM work is amortized across every future search the candidate ever appears in.&lt;/p&gt;
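
&lt;p&gt;The split looks something like this, with both model calls stubbed out as injected functions and a toy in-memory store standing in for the pgvector lookup (a sketch, not production code):&lt;/p&gt;

```python
def distance(a, b):
    # Stand-in for the database's vector distance (squared L2 here).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ingest(candidate, summarize_llm, embed_cheap, store):
    summary = summarize_llm(candidate)             # expensive, once per candidate
    store[candidate["id"]] = embed_cheap(summary)  # cheap vector, stored

def search(query, embed_cheap, store, k=10):
    qvec = embed_cheap(query)                      # cheap, per query
    ranked = sorted(store, key=lambda cid: distance(store[cid], qvec))
    return ranked[:k]
```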

&lt;p&gt;&lt;strong&gt;Store vectors in Postgres, not in a dedicated vector DB.&lt;/strong&gt; We already had Postgres. pgvector with HNSW indexing handles 10k+ vectors at sub-20ms query latency. We get one backup strategy, one permission model, one ORM, and hybrid queries that filter on structured columns and sort on vector distance in a single SQL statement. A dedicated vector store would buy us nothing at this scale. The "pgvector doesn't scale" cliff kicks in north of 50M vectors, and by then you can shard or migrate.&lt;/p&gt;
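
&lt;p&gt;Setup is two statements. The index parameters shown are pgvector's defaults, written out explicitly; table and column names are illustrative:&lt;/p&gt;

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE INDEX candidates_embedding_hnsw
  ON candidates
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
```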

&lt;p&gt;&lt;strong&gt;Filter structured data, embed unstructured meaning.&lt;/strong&gt; Location, availability, role type, time zone: plain Postgres columns with B-tree indexes. "Kind of thinker this person is," "shape of their portfolio": embedding. Composing them is a five-line WHERE clause plus an ORDER BY on vector cosine distance.&lt;/p&gt;
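
&lt;p&gt;Composed, it's one statement. Column names and the filter values are illustrative; the ORDER BY uses pgvector's cosine distance operator against the embedded query string:&lt;/p&gt;

```sql
SELECT id, full_name
FROM candidates
WHERE location = 'EU'
  AND availability = 'immediate'
  AND role_type = 'engineering'
ORDER BY embedding <=> $1   -- $1 = query embedding
LIMIT 10;
```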

&lt;h2&gt;The pricing angle&lt;/h2&gt;

&lt;p&gt;Google gemini-embedding-001: $0.006 per million tokens. Voyage voyage-3.5-large: $0.18 per million tokens. That's a 30x gap. On our corpus it's the difference between roughly $4 per month and $120 per month. Small money at our scale. At 1M profiles with weekly refreshes, it's thousands versus tens of thousands. For single-digit quality deltas on benchmarks that don't fully represent your actual retrieval task.&lt;/p&gt;

&lt;p&gt;The principle: when the quality gap between "premium" and "budget" embedding options is smaller than the quality gap available upstream in your data pipeline, every dollar should go upstream. It isn't close.&lt;/p&gt;
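
&lt;p&gt;The arithmetic, with monthly token volume left as the free parameter. Only the two per-million-token prices come from the vendors' published rates:&lt;/p&gt;

```python
GEMINI_PER_M = 0.006  # USD per 1M tokens, gemini-embedding-001
VOYAGE_PER_M = 0.18   # USD per 1M tokens, voyage-3.5-large

def monthly_cost(price_per_m_tokens, tokens_per_month):
    return price_per_m_tokens * tokens_per_month / 1_000_000

ratio = VOYAGE_PER_M / GEMINI_PER_M
print(f"premium/budget price ratio: {ratio:.0f}x")  # 30x
```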

&lt;h2&gt;What to do this week&lt;/h2&gt;

&lt;p&gt;If you run semantic search in production and haven't done this, here's the order of operations.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Build an eval set from real user queries. Not synthetic ones. 50 queries with known-good results is enough to start. If you can't measure a change, you'll optimize vibes.&lt;/li&gt;
  &lt;li&gt;Run your current retrieval against it. Write down the number.&lt;/li&gt;
  &lt;li&gt;Change the embedding &lt;strong&gt;input&lt;/strong&gt;: generate a purposeful, LLM-written summary at ingestion. Re-embed. Re-measure. This is where your 40 points live.&lt;/li&gt;
  &lt;li&gt;Only after you've exhausted input improvements, consider a different model. This is where your 7 points live.&lt;/li&gt;
&lt;/ol&gt;
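
&lt;p&gt;The measure, change-the-input, re-measure loop fits in a few lines. The retrieval functions and the scorer are injected; this is a sketch of the discipline, not a framework:&lt;/p&gt;

```python
def compare_inputs(eval_queries, retrieve_raw, retrieve_summary, score):
    """Score the same eval set against two pipelines that differ only
    in what text was embedded, and report the delta in points."""
    baseline = score(eval_queries, retrieve_raw)
    improved = score(eval_queries, retrieve_summary)
    return {
        "baseline": baseline,
        "summary_input": improved,
        "delta_points": round(100 * (improved - baseline)),
    }
```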

&lt;p&gt;Nine out of ten teams I've worked with do step 4 first. They spend weeks, see marginal improvement, and never loop back to the data prep that would have been 10x the ROI.&lt;/p&gt;

&lt;h2&gt;The broader lesson&lt;/h2&gt;

&lt;p&gt;AI systems degrade in predictable places, and almost always it's upstream of the flashy model call. Data cleanliness. Input construction. Query understanding. Eval rigor. These are the unsexy parts. They're also where the quality lives.&lt;/p&gt;

&lt;p&gt;The embedding leaderboard is a local maximum. Productive-looking work that moves the needle by single digits. The 40-point improvement is a preprocessing pass away, and most teams never take it because the preprocessing doesn't come with a blog post.&lt;/p&gt;

&lt;p&gt;Stop comparing models. Fix what you're feeding them.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>llm</category>
      <category>postgres</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
