Krunal Kanojiya

Posted on May 27 • Originally published at krunalkanojiya.com

Dual Encoder vs Cross-Encoder: Why Your RAG Pipeline Needs Both

#nlp #python #rag #tutorial

My RAG pipeline looked fine on paper. Fast retrieval. Decent cosine scores. But when I tested it with real queries, the top results were always a little off. Documents that shared vocabulary with the query kept showing up instead of documents that actually answered it. The model was doing its job. The architecture was not.

The fix was not a better model. It was a second model doing a different job.

This post breaks down what that means, why it matters, and how to build the two-stage pipeline in Python.

The Problem With Single-Stage Retrieval

Every search system faces a hard tradeoff between speed and accuracy.

You cannot run a deep computation against every document in a million-item corpus at query time. The latency would be unacceptable. So most pipelines use a fast embedding model to retrieve candidates, stop there, and call it done.

The result is a "close but not quite right" problem. The retrieved documents are topically related but not precisely relevant. The pipeline optimized for speed at the cost of meaning.

Single-stage pipeline (the common mistake):
=============================================
  Query --> [Dual Encoder] --> Top-K results
                               (fast, imprecise)
=============================================

Two-stage pipeline (what actually works):
=============================================
  Query --> [Dual Encoder] --> Top-50 candidates
                --> [Cross-Encoder] --> Reranked Top-5
                                        (fast + precise)
=============================================

The two models are not competing alternatives. They solve different halves of the same problem.

What a Dual Encoder Does

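A dual encoder, also called a bi-encoder or two-tower model, encodes the query and the document independently. The two towers can be separate transformer networks or, as with most Sentence Transformers models, a single network with shared weights applied to each input. Either way, each side produces a fixed-size vector, and the system scores the pair by the cosine similarity between the two vectors.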

That single number is the relevance signal.

The reason this is fast is precomputation. You encode every document at index time and store those vectors. At query time, you only encode the query, which takes milliseconds, and run an approximate nearest-neighbor (ANN) search against the precomputed vectors. With an ANN index, query latency grows far more slowly than the corpus, so even millions of documents can stay within an interactive latency budget.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encoded once at index time and stored
documents = [
    "Python is a high-level programming language.",
    "Transformer models revolutionized NLP benchmarks.",
    "Cosine similarity measures the angle between two vectors.",
    "RAG systems combine retrieval with language model generation.",
]
doc_embeddings = model.encode(documents, convert_to_numpy=True)

# Only this runs at query time
query = "How does vector similarity work in search?"
query_embedding = model.encode(query, convert_to_numpy=True)

scores = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

ranked = sorted(zip(scores, documents), reverse=True)
for score, doc in ranked[:3]:
    print(f"{score:.4f} | {doc}")

The tradeoff is that the query and document never actually interact. Each gets compressed into one vector independently. That compression loses nuance. The model has no way to understand whether the document answers the query. It only knows whether they live in the same region of vector space.
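The precomputation idea extends to the similarity math itself. The example above divides by the vector norms on every query; if each document vector is scaled to unit length once at index time, cosine similarity reduces to a plain dot product. The sketch below shows this with pure-Python toy vectors. The 3-dimensional vectors, the `doc_a`/`doc_b`/`doc_c` ids, and the `search` helper are all illustrative assumptions, not output from a real embedding model, and the exhaustive scan stands in for a real ANN index.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for real embeddings (hypothetical values).
raw_doc_vectors = {
    "doc_a": [3.0, 4.0, 0.0],
    "doc_b": [0.0, 1.0, 1.0],
    "doc_c": [1.0, 0.0, 0.0],
}

# Index time: normalize once and store.
index = {doc_id: normalize(vec) for doc_id, vec in raw_doc_vectors.items()}


def search(query_vector, index, top_k=2):
    """Query time: normalize the query, then rank documents by dot product."""
    q = normalize(query_vector)
    scored = ((dot(q, vec), doc_id) for doc_id, vec in index.items())
    # nlargest avoids sorting the whole corpus when only top_k is needed.
    return heapq.nlargest(top_k, scored)


results = search([6.0, 8.0, 0.0], index)
for score, doc_id in results:
    print(f"{score:.4f} | {doc_id}")
```

With Sentence Transformers, passing `normalize_embeddings=True` to `encode` should give the same effect on real embeddings, so the per-query norm computation can be dropped.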

What a Cross-Encoder Does

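A cross-encoder takes the query and a candidate document as a single concatenated input: [CLS] query [SEP] document [SEP]. One transformer runs on this joint sequence, so every query token attends to every document token in every layer. The output is a single relevance score per pair. Depending on the model and its configured activation, that score is either a raw logit or a sigmoid-squashed value between 0 and 1, so treat it as a ranking signal within one query rather than a calibrated probability.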

Because the model reads both at the same time, it catches what a dual encoder misses. It handles negation far better. It distinguishes between a document that mentions a concept and a document that answers a question about it. It scores based on whether the document actually addresses the query, not whether they share vocabulary.

from sentence_transformers.cross_encoder import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "side effects of stopping medication suddenly"

candidates = [
    "Medication dosage guidelines for common prescriptions.",
    "Abrupt discontinuation of certain medications can cause withdrawal symptoms.",
    "Drug interaction checkers and pharmacy tools.",
    "How to schedule medication reminders on your phone.",
]

pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

ranked = sorted(zip(scores, candidates), reverse=True)
for score, doc in ranked:
    print(f"{score:.4f} | {doc}")
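
A note on reading those scores: depending on the model and the library version, `predict` may return unbounded logits (large positive for relevant pairs, negative for irrelevant ones) rather than values between 0 and 1. If a bounded score is easier to threshold or display, a sigmoid maps logits into that range without changing the ranking. The sketch below uses made-up logits, not real model output.

```python
import math


def sigmoid(logit):
    """Map an unbounded logit to the open interval (0, 1), numerically stable."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# Hypothetical raw reranker scores for four candidates.
raw_scores = [8.6, -4.3, 0.0, -11.2]
probabilities = [sigmoid(s) for s in raw_scores]

for raw, p in zip(raw_scores, probabilities):
    print(f"{raw:+.1f} -> {p:.4f}")
```

Because the sigmoid is monotonic, sorting by the squashed score gives the same order as sorting by the raw logit.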

The cost is real. Every query-document pair requires a full transformer forward pass on the combined input, and because that input depends on the query, nothing can be precomputed. Running this against a million documents at query time is not viable.
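A quick back-of-the-envelope calculation makes the gap concrete. The 5 ms per pair used below is an assumed figure for illustration only; real throughput depends on hardware, batch size, and sequence length.

```python
def rerank_seconds(num_candidates, ms_per_pair):
    """Estimated time to score every candidate, one forward pass per pair."""
    return num_candidates * ms_per_pair / 1000.0


MS_PER_PAIR = 5.0  # assumed per-pair latency, not a benchmark result

full_corpus = rerank_seconds(1_000_000, MS_PER_PAIR)
shortlist = rerank_seconds(50, MS_PER_PAIR)

print(f"1,000,000 documents: {full_corpus:,.0f} s (~{full_corpus / 60:.0f} minutes)")
print(f"50 candidates:       {shortlist:.2f} s")
```

Under that assumption, scoring the whole corpus takes well over an hour per query, while scoring a 50-document shortlist takes a fraction of a second.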

This is why you chain them.

The Two-Stage Pipeline

Stage one uses the dual encoder to retrieve the top 50 to 100 candidates. High recall matters here. Any relevant document that misses this cut is permanently gone.

Stage two passes only those candidates to the cross-encoder for reranking. The candidate set is now small enough that deep joint attention is computationally viable. The reranker reorders the list. Only the top 5 to 10 results reach the user or the LLM.

from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import CrossEncoder
import numpy as np

retriever = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Stopping blood pressure medication without a doctor's guidance can be dangerous.",
    "Common blood pressure drugs include ACE inhibitors and beta blockers.",
    "Medication adherence improves outcomes in chronic disease management.",
    "Withdrawal effects vary depending on the type and duration of medication use.",
    "Pharmacists can review drug interactions and dosage schedules.",
    "Abrupt cessation of antidepressants can cause discontinuation syndrome.",
    "Always consult a physician before changing any medication regimen.",
    "Over-the-counter pain relievers are generally safe for short-term use.",
]

query = "is it dangerous to stop taking my medication without a doctor?"

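# Stage 1: Dual encoder retrieval
# (the corpus is encoded inline to keep the demo short; in production,
# encode it once at index time and store the vectors)
doc_embeddings = retriever.encode(corpus, convert_to_numpy=True)
query_embedding = retriever.encode(query, convert_to_numpy=True)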

cosine_scores = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

top_k = 5  # small because the toy corpus has 8 documents; use 50 to 100 on a real corpus
top_indices = np.argsort(cosine_scores)[::-1][:top_k]
candidates = [corpus[i] for i in top_indices]

print("Stage 1 - Dual Encoder Retrieval:")
for i, doc in enumerate(candidates):
    print(f"  {i+1}. {doc}")

# Stage 2: Cross-encoder reranking
pairs = [[query, doc] for doc in candidates]
rerank_scores = reranker.predict(pairs)

final_results = sorted(zip(rerank_scores, candidates), reverse=True)

print("\nStage 2 - Cross-Encoder Reranked:")
for score, doc in final_results:
    print(f"  {score:.4f} | {doc}")

Note for RAG builders: The ceiling of this pipeline is always stage one recall. If the dual encoder misses a relevant document entirely, the cross-encoder never sees it. Retrieve generously with a higher top-k, then rerank aggressively.
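One way to check that ceiling is to measure stage-one recall at several values of k against a small labeled set. The sketch below assumes you have, for one query, the retriever's ranked document ids and a hand-labeled set of relevant ids; the ids shown are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k retrieved."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# Hypothetical stage-one ranking and ground-truth labels for one query.
retrieved = ["d7", "d2", "d9", "d4", "d1", "d5", "d3", "d8"]
relevant = {"d2", "d5", "d3"}

for k in (3, 6, 8):
    print(f"recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
```

In this toy ranking, a cutoff of 3 hands the reranker only one of the three relevant documents. No amount of reranking can recover the other two, which is the argument for a generous stage-one k.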

Side-by-Side Comparison

Property	Dual Encoder	Cross-Encoder
Input format	Query and document encoded separately	Query and document concatenated as one input
Output	Two vectors, cosine compared	One relevance score per pair
Speed	Very fast, supports precomputation	Slow, no precomputation possible
Accuracy	Moderate, misses nuanced relevance	High, full query-document interaction
Scalability	Scales to millions of documents	Practical only on 50 to 200 candidates
Pipeline role	Stage 1 retrieval	Stage 2 reranking
Example models	all-MiniLM-L6-v2, text-embedding-ada-002	ms-marco-MiniLM-L-6-v2, Cohere Rerank, BGE-Reranker
Token interaction	None (independent encoding)	Full cross-attention across all layers

When to Use What

Scenario	Recommended Approach
Corpus under 500 documents	Cross-encoder directly, skip dual encoder
Large corpus, latency is the priority	Dual encoder only
Large corpus, accuracy matters	Dual encoder retrieval plus cross-encoder reranking
High QPS, tight latency budget	Dual encoder plus ColBERT late interaction
Domain-specific content, general models underperform	Fine-tune the cross-encoder on your domain data
Multilingual corpus	BGE-Reranker or mGTE for multilingual reranking

ColBERT: The Middle Ground Worth Knowing

If cross-encoder latency is too high for your QPS budget, ColBERT is worth knowing. It sits between the two architectures. It encodes query and document separately like a dual encoder, preserving the ability to precompute document vectors. But instead of comparing two pooled vectors, it compares individual token embeddings using a MaxSim operation: for each query token, find the most similar document token across the full sequence.

This gives ColBERT much better accuracy than a standard dual encoder while keeping document precomputation intact. The original ColBERT paper from Stanford reports that it runs about two orders of magnitude faster than BERT-based rerankers and needs up to four orders of magnitude fewer FLOPs per query, while remaining competitive on retrieval quality. The price is storage: the index holds one vector per document token instead of one per document. The RAGatouille library is a convenient way to plug ColBERT into an existing pipeline.

Approach	Precompute Docs	Token Interaction	Speed	Accuracy
Dual Encoder	Yes	None	Fastest	Lowest
ColBERT	Yes (per token)	MaxSim per token	Fast	High
Cross-Encoder	No	Full cross-attention	Slowest	Highest
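
The MaxSim operation described above is simple enough to sketch in a few lines. This is a minimal illustration of the scoring rule only, assuming unit-length token vectors; the 2-dimensional vectors are toy values, and a real ColBERT model adds its own encoder, projection, and indexing on top.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: for each query token, take its best-matching
    document token, then sum those maxima over the query."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy per-token embeddings (hypothetical, already unit length).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has an exact match for each query token
doc_b_tokens = [[0.6, 0.8], [0.8, 0.6]]              # only partial matches

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)

print(f"doc_a: {score_a:.2f}")
print(f"doc_b: {score_b:.2f}")
```

The document token vectors can be computed and stored at index time, which is why this scoring keeps the precomputation advantage while still comparing at token granularity.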

FAQs

What is a dual encoder in NLP?
A dual encoder encodes the query and document into separate vectors and computes cosine similarity between them for fast, scalable retrieval.

What is a cross-encoder?
A cross-encoder takes a query and document as a single concatenated input and produces one precise relevance score by attending to both simultaneously.

Why not use a cross-encoder for all retrieval?
Cross-encoders cannot precompute anything, so running one against millions of documents at query time is computationally infeasible.

What is the two-stage retrieval pipeline?
The dual encoder retrieves the top 50 to 100 candidates for speed, then the cross-encoder reranks only those candidates for precision before passing results to the user or LLM.

What is ColBERT?
ColBERT is a late-interaction model that sits between dual and cross-encoders, comparing per-token vectors with MaxSim instead of pooled vectors, giving better precision than a dual encoder without losing document precomputation.

Which cross-encoder model should I start with?
cross-encoder/ms-marco-MiniLM-L-6-v2 from Sentence Transformers is the standard starting point for English; for managed reranking, Cohere Rerank works well in RAG pipelines.

Does LangChain support cross-encoder reranking?
Yes, both LangChain and LlamaIndex have built-in reranking steps that accept cross-encoder models as a second-stage reranker.

I write deeper technical breakdowns on search architecture, RAG systems, and AI infrastructure over at krunalkanojiya.com. The full version of this article with complete pipeline walkthroughs lives there if you want to go further.

Top comments (1)

Harjot Singh • May 31

"Documents that shared vocabulary with the query kept showing up instead of documents that actually answered it" is the cleanest description of the dual-encoder failure mode I've read, and your diagnosis is the important part: the model was fine, the architecture wasn't. Dual encoders embed query and doc separately, so they can only ever measure surface similarity, they literally never see the query and document together, which is why lexically-overlapping-but-irrelevant chunks win. A cross-encoder reads both at once and can judge actual relevance, but it's too slow to run over the whole corpus, hence the two-stage pattern: cheap dual-encoder for recall, expensive cross-encoder reranker for precision on the shortlist. The deeper lesson is the one people resist because it's less exciting than a bigger model: most RAG quality problems are architecture-and-retrieval problems, not model problems, and the fix is usually a second cheap component doing a different job, not a more expensive single one. Right tool per stage beats one big hammer. That match-the-component-to-the-job thinking is core to how I build retrieval in Moonshift. Did adding the reranker change which base embedding model you needed, or did a weaker dual-encoder become fine once the cross-encoder cleaned up after it?