Your RAG index is probably 12x bigger than it needs to be for the first retrieval pass. If you're storing full 3072-dimension vectors from text-embedding-3-large and running brute-force or HNSW search over all of them, you're paying for precision you don't use until the final ranking step. Matryoshka embeddings let you cut those vectors down to 256 dimensions for the shortlist pass and only pay for the full 3072 on the handful of candidates that survive.
This isn't lossy compression you bolt on afterward. It's a property baked into how the model was trained.
TL;DR
- Matryoshka embeddings (MRL, Matryoshka Representation Learning) train a single model so that any prefix of the output vector — the first 256, 512, or 1024 dims — is a usable embedding on its own. Information is front-loaded.
- OpenAI's text-embedding-3-small/-large, Nomic Embed, and several others expose this via a dimensions parameter. That parameter is MRL, not naive slicing.
- To truncate correctly you take the first k dimensions and re-normalize to unit length. Skip the re-normalization and your cosine scores break.
- The payoff is adaptive retrieval: shortlist the whole corpus with cheap low-dim vectors, then re-rank the top candidates with full-dim vectors. You get near-full recall at a fraction of the memory and dot-product cost.
- Truncation composes with int8/binary quantization — they attack different axes (dimensions vs. bits-per-dimension).
What are Matryoshka embeddings?
Matryoshka embeddings are vectors trained so that truncating them to a shorter prefix still yields a semantically meaningful embedding. The name comes from Russian nesting dolls: a 256-dim embedding sits inside the 512-dim one, which sits inside the full 3072.
A normal embedding model spreads information across all output dimensions with no ordering guarantee. Chop off the back half and you've destroyed roughly half the signal at random. MRL changes the training objective: during training the loss is computed at multiple prefix lengths simultaneously — say 64, 128, 256, 512, ..., 3072 — and summed. The model is forced to pack the most discriminative information into the earliest dimensions, because those dimensions have to stand alone as a valid embedding.
The result is a coarse-to-fine representation. The front dimensions capture broad semantic structure; later dimensions add finer distinctions. That ordering is the whole trick.
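To make the "loss at multiple prefix lengths, summed" idea concrete, here is a toy NumPy sketch. It assumes an in-batch contrastive (InfoNCE) loss and equal weights per prefix; real training runs inside an autodiff framework with the model's own objective, and the names `info_nce` and `matryoshka_loss` are made up for illustration.

```python
import numpy as np

def info_nce(q: np.ndarray, d: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch contrastive loss: row i of q should match row i of d."""
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)
    logits = (q @ d.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def matryoshka_loss(q: np.ndarray, d: np.ndarray, prefix_dims=(64, 128, 256)) -> float:
    """Sum the same loss over several prefixes, so every prefix must work alone."""
    return sum(info_nce(q[:, :m], d[:, :m]) for m in prefix_dims)
```

Because the 64-dim prefix appears in every term of the sum, the gradient pressure on the earliest dimensions is the strongest, which is one way to see why information ends up front-loaded.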
Why can you truncate an MRL vector but not a normal one?
Because MRL explicitly optimized the prefixes to be standalone embeddings, whereas a standard model never had that constraint. With a normal model, dimension 3000 is exactly as likely to carry critical signal as dimension 5. Truncation is random destruction. With MRL, the most useful signal is deliberately concentrated up front, so cutting the tail removes the least informative dimensions first.
Two practical consequences:
- You can only do this with models trained for it. Truncating a legacy all-MiniLM or an arbitrary sentence-transformer will tank your recall. Check the model card for "Matryoshka" or a supported dimensions argument.
- Recall degrades gracefully, not off a cliff. Dropping from 3072 to 1024 is usually near-lossless for coarse retrieval; 256 is often still strong for a first-pass shortlist; going to 64 starts to hurt on queries that need fine distinctions.
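One way to check how gracefully recall degrades for your own model and data is to measure how much of the full-dimension top-k survives truncation. The sketch below is a brute-force check under the assumption that your sample of query and document embeddings fits in memory as unit-norm NumPy arrays; `recall_at_k` is an illustrative helper, not part of any library.

```python
import numpy as np

def truncate(vecs: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and restore unit norm."""
    sliced = vecs[..., :dims]
    return sliced / np.linalg.norm(sliced, axis=-1, keepdims=True)

def recall_at_k(queries: np.ndarray, docs: np.ndarray, dims: int, k: int = 10) -> float:
    """Fraction of the full-dimension top-k that the truncated search also returns."""
    exact = np.argsort(-(queries @ docs.T), axis=1)[:, :k]
    q_s, d_s = truncate(queries, dims), truncate(docs, dims)
    approx = np.argsort(-(q_s @ d_s.T), axis=1)[:, :k]
    hits = [len(set(e) & set(a)) for e, a in zip(exact, approx)]
    return sum(hits) / (k * len(queries))
```

Run it over a few prefix sizes (say 64, 256, 1024) on real embeddings. Random vectors are not a useful test input here, since they have no front-loaded structure and will show poor recall at any truncation.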
How do you truncate correctly?
Take the first k dimensions, then L2-normalize the result. The re-normalization is the step people forget, and it quietly corrupts scores.
A dot product equals cosine similarity only when both vectors are unit-norm. A full MRL vector is unit-norm across all 3072 dims. When you slice off a prefix, the remaining sub-vector is not unit-norm: its magnitude is the square root of whatever fraction of the total energy lived in those dimensions, and that fraction differs from vector to vector. If you dot two un-normalized truncated vectors, you're mixing magnitude into a metric that's supposed to be angle-only, and comparisons across items become inconsistent.
import numpy as np

def truncate(vec: np.ndarray, dims: int) -> np.ndarray:
    """Slice an MRL embedding to `dims` and restore unit norm."""
    sliced = vec[..., :dims]
    norm = np.linalg.norm(sliced, axis=-1, keepdims=True)
    return sliced / norm

# get_embedding() stands in for whatever call returns your embedding as a NumPy array
# full 3072-dim embeddings from text-embedding-3-large
full = get_embedding(text)         # shape (3072,), unit norm
other = get_embedding(other_text)  # a second text, same model
short = truncate(full, 256)        # shape (256,), re-normalized to unit norm

# WRONG: dot product over non-unit-norm slices is not cosine similarity
bad = np.dot(full[:256], other[:256])

# RIGHT: cosine over re-normalized slices
good = np.dot(short, truncate(other, 256))
If your vector DB stores normalized vectors and uses cosine/inner-product distance, do the truncation before insertion so everything in the index shares the same dimensionality and norm convention. Don't mix 256-dim and 3072-dim vectors in the same distance computation — that's a category error, not an approximation.
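To see how skipping the re-normalization can flip a ranking, here is a small worked example. The 4-dim vectors are made up for illustration (real embeddings have thousands of dimensions), and they are truncated to 2 dims so the arithmetic is easy to follow by hand.

```python
import numpy as np

def truncate(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and restore unit norm."""
    sliced = vec[..., :dims]
    return sliced / np.linalg.norm(sliced, axis=-1, keepdims=True)

# Toy unit-norm vectors, invented for this example.
q = np.array([1.0, 0.0, 0.0, 0.0])
a = np.array([0.3, 0.0, np.sqrt(1 - 0.09), 0.0])  # prefix points exactly along q, but holds little energy
b = np.array([0.6, 0.6, np.sqrt(1 - 0.72), 0.0])  # prefix is 45 degrees off q, but holds more energy

# Raw dot products over the un-normalized 2-dim slices.
raw_a = np.dot(q[:2], a[:2])
raw_b = np.dot(q[:2], b[:2])

# Cosine similarity over re-normalized slices.
cos_a = np.dot(truncate(q, 2), truncate(a, 2))
cos_b = np.dot(truncate(q, 2), truncate(b, 2))

print("raw dot:", raw_a, raw_b)  # b scores higher than a
print("cosine: ", cos_a, cos_b)  # a scores higher than b
```

The raw dot product rewards `b` for having more of its energy in the prefix, even though `a` points in exactly the query's direction once the tail is dropped. Re-normalizing removes that magnitude effect.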
Can you truncate server-side with the API?
Yes. For OpenAI's text-embedding-3 models, pass dimensions and the API returns an already-truncated, re-normalized vector. That's the cheapest path if you know your target size at ingest time.
from openai import OpenAI

client = OpenAI()

resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="How does grouped-query attention reduce KV cache size?",
    dimensions=256,  # MRL truncation, server-side + re-normalized
)
vec = resp.data[0].embedding  # 256 floats, unit norm
The catch: if you request dimensions=256 at ingest, you've thrown away the tail permanently. You can't reconstruct the 3072-dim vector for a precise re-rank later. So the decision is architectural: do you want one fixed small size everywhere, or do you want to keep full vectors and truncate on the fly?
For a static shortlist-only system, request the small size and save storage. For adaptive retrieval, store the full vector and truncate client-side per query. Keeping full vectors and slicing locally costs more storage but unlocks the two-pass pattern below.
What is adaptive retrieval and why does it matter?
Adaptive retrieval is a two-pass search: a cheap low-dimensional shortlist over the entire corpus, then an expensive full-dimensional re-rank over only the shortlist. It's the reason MRL is worth the trouble.
The cost of exact nearest-neighbor search scales with dimensionality. A dot product over 256 floats is ~12x cheaper than over 3072, and your in-memory index is ~12x smaller. But 256 dims alone can miss fine-grained matches. So you split the work:
- Pass 1 (shortlist): search all N documents using 256-dim truncated vectors. Retrieve top ~100–200 candidates. This is where the 12x memory and speed savings land, because it runs over the whole corpus.
- Pass 2 (re-rank): compute exact similarity between the query and just those ~100 candidates using full 3072-dim vectors. Return top 10.
def adaptive_search(query_vec, index_256, store_full, shortlist=200, k=10):
    # Pass 1: cheap, over the entire corpus
    q_short = truncate(query_vec, 256)
    candidate_ids = index_256.search(q_short, top_k=shortlist)

    # Pass 2: exact, only over the shortlist
    q_full = query_vec  # already 3072-dim, unit norm
    scored = [
        (cid, np.dot(q_full, store_full[cid]))
        for cid in candidate_ids
    ]
    scored.sort(key=lambda x: -x[1])
    return scored[:k]
Pass 2 touches a fixed ~200 vectors regardless of corpus size, so its cost is negligible next to a full-dim search over millions of documents. You get accuracy close to searching everything at 3072 dims, at roughly the memory footprint and latency of a 256-dim index. The main storage cost is keeping the full vectors around for store_full, which can live on cheaper disk/SSD-backed storage since it's only hit for the shortlist.
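If you want to try the two-pass pattern without a vector database, a brute-force NumPy version is enough for experiments. This is a sketch that assumes both matrices fit in memory and every row is unit-norm; `two_pass_search` and its arguments are illustrative names, not any library's API.

```python
import numpy as np

def truncate(vecs: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and restore unit norm."""
    sliced = vecs[..., :dims]
    return sliced / np.linalg.norm(sliced, axis=-1, keepdims=True)

def two_pass_search(query, docs_full, docs_short, shortlist=200, k=10):
    """Shortlist with truncated vectors, then re-rank the shortlist at full dimension.

    query:      (D,)   full-dimension query, unit norm
    docs_full:  (N, D) full-dimension document vectors, unit norm
    docs_short: (N, d) truncated, re-normalized document vectors
    """
    # Pass 1: coarse scores over the whole corpus at d dims.
    q_short = truncate(query, docs_short.shape[1])
    coarse = docs_short @ q_short
    shortlist = min(shortlist, len(coarse))
    candidates = np.argpartition(-coarse, shortlist - 1)[:shortlist]

    # Pass 2: exact scores over the shortlist only, at D dims.
    exact = docs_full[candidates] @ query
    order = np.argsort(-exact)[:k]
    return candidates[order], exact[order]
```

Build `docs_short` once at ingest with `truncate(docs_full, 256)`. Setting `shortlist` equal to the corpus size makes the result identical to an exact full-dimension search, which gives you a baseline for tuning the shortlist width.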
Where does this break?
Truncating a non-MRL model. The single most common mistake. If the model wasn't trained with Matryoshka loss, prefixes are not standalone embeddings and recall collapses. Verify support first.
Forgetting re-normalization. Covered above — un-normalized slices poison cosine similarity. Symptoms: scores that look plausible but ranking that's subtly wrong, especially for short documents whose energy distribution differs from long ones.
Over-truncating for the query type. Coarse queries ("documents about transformers") survive aggressive truncation. Fine-grained queries ("the specific paper that introduced rotary embeddings") need the tail dimensions. Set your shortlist width generously (200+) so pass 2 can recover matches that pass 1 ranked low. If you shortlist only 20 candidates at 256 dims, a true match sitting at rank 60 never reaches the re-ranker.
Assuming it replaces quantization. It doesn't — it composes. Truncation reduces the number of dimensions; int8 or binary quantization reduces bits per dimension. A 256-dim int8 vector is 256 bytes; stack both techniques for the shortlist index and re-rank with full-precision full-dim vectors. Binary vectors with Hamming distance make an even cheaper first filter, with MRL truncation on top.
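The stacking can be sketched in a few lines. The quantizers below are deliberately simple assumptions (symmetric per-vector int8 scaling and sign-based binarization); production systems often calibrate these on a sample of the corpus, and the random vector is only a stand-in for a real embedding.

```python
import numpy as np

def truncate(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and restore unit norm."""
    sliced = vec[..., :dims]
    return sliced / np.linalg.norm(sliced, axis=-1, keepdims=True)

def to_int8(vec: np.ndarray):
    """Symmetric per-vector scalar quantization: int8 codes plus the scale to undo it."""
    scale = np.abs(vec).max() / 127.0  # assumes vec is not all zeros
    return np.round(vec / scale).astype(np.int8), scale

def to_binary(vec: np.ndarray) -> np.ndarray:
    """One bit per dimension (the sign), packed 8 dimensions per byte."""
    return np.packbits(vec > 0)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two packed binary codes."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# Stand-in for a real 3072-dim float32 embedding.
rng = np.random.default_rng(0)
full = rng.normal(size=3072).astype(np.float32)
full /= np.linalg.norm(full)

short = truncate(full, 256)
codes, scale = to_int8(short)
bits = to_binary(short)

print(full.nbytes)   # 12288 bytes: 3072 dims x 4 bytes
print(short.nbytes)  # 1024 bytes:  256 dims x 4 bytes
print(codes.nbytes)  # 256 bytes:   256 dims x 1 byte
print(bits.nbytes)   # 32 bytes:    256 dims x 1 bit
```

Each step shrinks the first-pass index along a different axis, which is why the savings multiply rather than overlap.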
Cross-model comparison. MRL prefixes are only interchangeable within the same model. A 256-dim slice from text-embedding-3-large and a 256-dim slice from text-embedding-3-small are not comparable; they live in different spaces.
So how much can you actually cut?
Matryoshka embeddings let you truncate a text-embedding-3-large vector from 3072 to 256 dimensions — a 12x reduction in index memory and per-comparison cost — while keeping retrieval quality high, because the model was trained to front-load information into the earliest dimensions. The recipe: slice the first k dims, re-normalize to unit length, use the small vectors for a cheap first-pass shortlist over the whole corpus, then re-rank the top candidates with the full-dimension vectors. Truncate only models trained with MRL (exposed as the dimensions parameter), always re-normalize, keep your shortlist wide enough to catch fine-grained matches, and layer quantization on top for the first pass. Done right, you run a 3072-dim-quality retriever at close to 256-dim-cost.