- Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You inherit a vector index. Six million chunks at 1536 dimensions on text-embedding-3-small. The HNSW graph eats around 40 GB of RAM, the pgvector instance pages constantly, and the p99 query latency drifts up every time someone bulk-imports a tenant. The bill is fine; the infra is the problem.
A teammate shows you a paragraph in OpenAI's announcement post for the new embedding models: the third-generation models support truncation. You can ask for 256-dim vectors directly, or take the 1536-dim vectors you already have and keep only the first 256 floats. According to the announcement, retrieval quality on MTEB barely moves. Your index footprint drops by 6×. ANN gets faster because every distance computation touches a sixth of the memory.
It sounds like a free win. Sometimes it is. Sometimes it isn't, and the failure mode is invisible until a customer in your long tail says "I can't find anything anymore."
Here's when truncation is safe, when PCA still pays, and how to read the recall curve before you ship.
What Matryoshka actually means
Matryoshka Representation Learning (the original paper by Kusupati et al.) trains an embedding model so that the prefix of every output vector is itself a usable embedding. The first 64 dimensions form a coarse but usable vector. 128 is sharper. 256 sharper still. At 1536 you have the full vector. The property holds because the training objective rewards every prefix, not only the full width.
This is not the same as PCA-after-the-fact. PCA finds the directions of maximum variance on a fixed corpus, then rotates embeddings so the first k axes carry the most signal. It works, but it costs you a fit step on representative data, and the projection is only as good as the corpus you fit it on. Matryoshka bakes the same property into the model at training time. No fit. Truncating the first k floats is the operation.
OpenAI ships text-embedding-3-small (1536 native, truncatable) and text-embedding-3-large (3072 native, truncatable). Cohere's Embed v4 supports the same idea, with 1536 and 256 among its output widths. Per Voyage's model docs, voyage-3-lite is 512 native. Older models such as text-embedding-ada-002, the original BGE checkpoints, and MiniLM are not Matryoshka-trained. Truncating one of them produces a corrupted vector, not a smaller embedding. PCA is the right tool for those.
Recall on a sample workload
A small experiment, runnable in a notebook. Pull a public retrieval dataset (BEIR's NFCorpus, FiQA, SciFact are good candidates because they cover three different domain shapes), embed every passage and query with text-embedding-3-small, evaluate nDCG@10 and recall@10 at four truncation widths: 1536, 768, 512, 256. Do the same for a Matryoshka-aware open model (nomic-embed-text-v1.5 or bge-m3).
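A minimal sketch of that loop, assuming the passage and query embeddings (doc_embs, query_embs) and relevance judgments (qrels) are already loaded for one dataset; the recall function here is a simplified stand-in for the official BEIR evaluator, not a replacement for it:

import numpy as np

def truncate(mat: np.ndarray, k: int) -> np.ndarray:
    # Keep the Matryoshka prefix and re-normalize each row to unit length.
    head = mat[:, :k]
    norms = np.linalg.norm(head, axis=1, keepdims=True)
    return head / np.clip(norms, 1e-12, None)

def recall_at_10(query_vecs, doc_vecs, qrels) -> float:
    # qrels: list mapping query index -> set of relevant doc indices
    scores = query_vecs @ doc_vecs.T              # cosine, since rows are unit-norm
    top10 = np.argsort(-scores, axis=1)[:, :10]
    per_query = [
        len(set(top10[i]) & qrels[i]) / max(len(qrels[i]), 1)
        for i in range(len(qrels))
    ]
    return float(np.mean(per_query))

# doc_embs, query_embs: full 1536-dim vectors from text-embedding-3-small
for dim in (1536, 768, 512, 256, 128):
    r = recall_at_10(truncate(query_embs, dim), truncate(doc_embs, dim), qrels)
    print(f"{dim:>5}  recall@10 = {r:.3f}")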
The shape you usually see, in line with what the OpenAI post reports for MTEB:
text-embedding-3-small on FiQA-style retrieval
| dim | recall@10 | nDCG@10 | delta vs 1536 |
| ---: | ---: | ---: | ---: |
| 1536 | 0.640 | 0.503 | — |
| 768 | 0.633 | 0.498 | -1% |
| 512 | 0.625 | 0.491 | -2% |
| 256 | 0.604 | 0.474 | -6% |
| 128 | 0.541 | 0.418 | -15% |
The numbers above are illustrative. Exact figures shift with the dataset, the corpus size, the chunking, and the model version. The shape is consistent across published Matryoshka results. 768 is essentially free. 512 costs 1–3% of recall. 256 costs 4–7%. Below 256 the curve falls off a cliff. Your job is to run the same experiment on your own corpus and confirm the shape before you commit.
The same exercise on a non-Matryoshka model (e.g. text-embedding-ada-002, slicing naively) collapses to noise at 768 already. Truncating a model that wasn't trained for it doesn't give you a smaller embedding; it corrupts the one you had.
When truncation pays
Three places where 1536 → 256 is a clean trade.
Index size and RAM. A 1536-float f32 vector is 6 KB. A 256-float f32 vector is 1 KB. On the 6M-chunk scenario from the opener, that's 36 GB vs 6 GB of raw vector data; scale the same arithmetic to 50M chunks and you're at 300 GB vs 50 GB before HNSW overhead. The HNSW graph itself adds a multiplier (typically 1.5–3× the raw vector size, depending on M and link-list overhead). Cutting the dimension cuts both the vectors and the graph proportionally. RAM cost drops 5–6× in practice.
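The same arithmetic, for your own numbers (raw f32 payload only; multiply by your measured HNSW overhead, assumed 1.5–3× above, for the full footprint):

def raw_vector_gb(n_vectors: int, dim: int) -> float:
    # 4 bytes per float32 component, reported in decimal GB
    return n_vectors * dim * 4 / 1e9

print(raw_vector_gb(6_000_000, 1536))    # ~36.9 GB
print(raw_vector_gb(6_000_000, 256))     # ~6.1 GB
print(raw_vector_gb(50_000_000, 1536))   # ~307 GB
print(raw_vector_gb(50_000_000, 256))    # ~51 GB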
ANN search throughput. HNSW distance computations are dominated by the dot product between the query and candidate vectors. A 6× cut on dimension is roughly a 6× cut on the per-comparison work. Real-world throughput gains depend on memory bandwidth more than ALU throughput, but published HNSW benchmarks (see the ann-benchmarks site) consistently show smaller dims compute faster on the same hardware.
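A quick brute-force proxy for that claim — not an HNSW traversal, just the dot-product kernel over a contiguous candidate matrix, which is the per-comparison work each graph hop spends its time on:

import time
import numpy as np

rng = np.random.default_rng(0)
full = rng.standard_normal((100_000, 1536)).astype(np.float32)
query = rng.standard_normal(1536).astype(np.float32)

for dim in (1536, 768, 256):
    docs = np.ascontiguousarray(full[:, :dim])   # what a dim-wide index would store
    q = query[:dim]
    t0 = time.perf_counter()
    for _ in range(10):
        _ = docs @ q                             # the distance kernel
    print(f"{dim:>5} dims: {(time.perf_counter() - t0) / 10 * 1e3:.1f} ms per pass")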
First-stage retrieval in a two-stage pipeline. The pattern most production RAG converges on: a fast, lossy first-stage that pulls the top 100–200 candidates, then a heavier reranker (BGE rerank, Cohere rerank, a cross-encoder) that re-scores the top-N down to the final 10. The first stage doesn't need to be perfect; it just needs to keep the right answer in its top-100. A 256-dim Matryoshka prefix usually does that, and the reranker compensates for the small recall drop. You pay the cost of the reranker on 100 candidates, not on 50 million.
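A sketch of that shape, assuming the truncated 256-dim document matrix and the raw texts are in memory (brute force stands in for the real HNSW/pgvector query), and using the open BGE reranker through sentence-transformers as a stand-in for whichever reranker you actually run:

import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")  # stand-in reranker checkpoint

def first_stage(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 150) -> np.ndarray:
    # Lossy 256-dim retrieval; in production this is your ANN index, not brute force.
    return np.argsort(-(doc_vecs @ query_vec))[:k]

def retrieve(query_text: str, query_vec: np.ndarray,
             doc_vecs: np.ndarray, doc_texts: list[str], final_k: int = 10) -> list[int]:
    candidates = first_stage(query_vec, doc_vecs)
    # The cross-encoder re-scores only the candidates: its cost stays at ~150 pairs,
    # never the full corpus.
    scores = reranker.predict([(query_text, doc_texts[i]) for i in candidates])
    return [int(candidates[i]) for i in np.argsort(-scores)[:final_k]]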
When it hurts
Three situations where cutting to 256 dimensions costs you actual users.
Long-tail queries on rare vocabulary. Truncation preserves the high-variance directions. Rare entities (drug names, internal product codes, niche programming-language tokens) often live in low-variance dimensions; they only matter for a small slice of queries. Truncate, and you fold them into noise. The head of your query distribution looks fine on the eval set. The long tail breaks silently.
The detection signal: stratify your eval. Bucket queries by frequency, by domain, by token-rarity. Compute recall per bucket. If the head holds while the tail collapses, your truncation tax is landing on the customers who care most.
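A sketch of that check, assuming recall_by_query holds per-query recall@10 at each dimension (from the eval loop above) and bucket_of is whatever mapping from query to bucket — frequency decile, domain, token rarity — fits your traffic:

from collections import defaultdict
import numpy as np

def stratified_recall(per_query_recall: dict, bucket_of) -> dict:
    # per_query_recall: query_id -> recall@10; bucket_of: query_id -> bucket label
    buckets = defaultdict(list)
    for qid, r in per_query_recall.items():
        buckets[bucket_of(qid)].append(r)
    return {b: float(np.mean(v)) for b, v in buckets.items()}

# Ship the cut only if the *worst* bucket holds, not the average.
for dim in (1536, 256):
    by_bucket = stratified_recall(recall_by_query[dim], bucket_of)
    worst = min(by_bucket, key=by_bucket.get)
    print(dim, by_bucket, "worst bucket:", worst)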
Multi-domain corpora where each domain has its own vocabulary. A search system that indexes legal contracts, medical records, and customer support tickets has three separate "axes of variance." Matryoshka training packs the most-shared axes into the prefix, but cross-domain rare terms (the kind that only matter inside one domain) push down into the truncated tail. A single 256-dim space across all three is worse than three 1536-dim spaces, one per domain, even with the same total memory budget.
Hybrid lexical+vector retrieval that already has a strong BM25. If BM25 is already pulling the obvious matches, the vector side's job is to find the semantic near-misses. Those near-misses live in the finer-grained dimensions precisely because they're not the obvious matches. Truncate, and the vector side loses ground exactly where it was already weakest.
Truncation, in code
For OpenAI and other Matryoshka-aware APIs, request the dimension you want directly:
from openai import OpenAI
client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,
    dimensions=256,  # native truncation, normalized
)
vectors = [d.embedding for d in resp.data]
The API truncates and re-normalizes for you. If you already have 1536-dim vectors stored, slice and re-normalize manually:
import numpy as np
def truncate_and_renormalize(
    vec: np.ndarray, target: int
) -> np.ndarray:
    head = vec[:target]
    norm = np.linalg.norm(head)
    if norm == 0:
        return head
    return head / norm
Re-normalization matters. Cosine similarity assumes unit vectors. The 1536-dim vector has unit norm; its 256-dim prefix does not, in general.
For a non-Matryoshka model, fit PCA on a representative sample (10k–100k vectors is usually enough), persist the projection, and apply it on every read and write:
from sklearn.decomposition import PCA
import numpy as np
# Fit once on a sample
sample = np.array(load_sample_vectors(50_000))
pca = PCA(n_components=256, whiten=False)
pca.fit(sample)
# Persist pca.components_ (the projection matrix) and
# pca.mean_ (used to center each input before projection).
# Both must be available wherever you call reduce().
def reduce(vec: np.ndarray) -> np.ndarray:
    centered = vec - pca.mean_
    out = centered @ pca.components_.T
    n = np.linalg.norm(out)
    return out / n if n > 0 else out
The fit set should mirror your production corpus distribution. PCA fit on a small in-domain slice and applied to a wider production corpus drops recall on the unseen domains. If you can't fit on representative data, don't PCA — pick a Matryoshka-trained model instead.
A decision rule you can apply on Monday
Three questions, in order. Stop at the first "no."
- Is your model Matryoshka-trained? If yes, truncation is on the table. If no, PCA is on the table; truncation is not.
- Does your eval set, stratified by query frequency and domain, hold recall within 3–5% at the target dimension? Look at the worst bucket, not the global average. A 15% drop in the worst bucket means you're shipping a regression to your long tail.
- Is your retrieval a first stage in front of a reranker? If yes, the recall budget you can absorb is wider; the reranker cleans up. If no, the budget is tighter; treat any drop seriously.
If all three pass, truncate. Re-run the eval after a month of production traffic, because the query distribution drifts and the long-tail bucket grows. If any of the three fail, stay at the higher dimension and look for the win somewhere else. The usual next step is quantization (int8 or binary), which often buys 4–32× on storage with a smaller and more measurable recall hit than dimension cuts (see Qdrant's quantization benchmarks for one published reference).
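For reference, a minimal sketch of the binary variant — the sign of each float packed into bits, scored by Hamming distance. This is the general idea behind the published benchmarks, not any vendor's implementation:

import numpy as np

def binarize(vecs: np.ndarray) -> np.ndarray:
    # 1536 float32s (6 KB) become 1536 bits (192 bytes): a 32x storage cut.
    return np.packbits(vecs > 0, axis=-1)

def hamming_scores(query_bits: np.ndarray, doc_bits: np.ndarray) -> np.ndarray:
    # Lower Hamming distance roughly tracks higher cosine similarity for unit vectors.
    return np.unpackbits(np.bitwise_xor(doc_bits, query_bits), axis=-1).sum(axis=-1)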
Where to start
If you're already on text-embedding-3-small or -large and your index is RAM-bound, run the stratified eval at 768 first. It's almost free across published results, and a 2× index cut on 50M vectors is a real number on a real bill. Move to 512 only if the eval holds. Move to 256 only if you have a reranker covering you, or if your domain is narrow enough that the long tail doesn't bite.
If you're on a legacy embedder (ada-002, original BGE, all-MiniLM-L6-v2), do not slice. Either fit PCA on a representative sample with a held-out eval set, or migrate to a Matryoshka-trained model and skip the projection step.
The shape that fails most often: someone reads "1536 → 256 is fine" on a vendor blog, ships it, hits 95% of users with no problem, and quietly degrades the 5% who run their query against the rare end of the corpus. The 95% never tell you; the 5% just leave. Stratify your eval before you ship the cut.
If this was useful
The RAG Pocket Guide covers retrieval as a system, not as a single dial. The chapter on embedding choice walks through Matryoshka, PCA, quantization, and the eval design that catches the long-tail regressions before they hit production. If this post helped you decide whether to slice your vectors, the book is the next layer of the same decisions: chunking, reranking, hybrid retrieval, index choice, and the cost math that ties them together.
