
Krunal Kanojiya

Posted on • Originally published at krunalkanojiya.com

Dual Encoder vs Cross-Encoder: Why Your RAG Pipeline Needs Both

My RAG pipeline looked fine on paper. Fast retrieval. Decent cosine scores. But when I tested it with real queries, the top results were always a little off. Documents that shared vocabulary with the query kept showing up instead of documents that actually answered it. The model was doing its job. The architecture was not.

The fix was not a better model. It was a second model doing a different job.

This post breaks down what that means, why it matters, and how to build the two-stage pipeline in Python.


The Problem With Single-Stage Retrieval

Every search system faces a hard tradeoff between speed and accuracy.

You cannot run a deep computation against every document in a million-item corpus at query time. The latency would be unacceptable. So most pipelines use a fast embedding model to retrieve candidates, stop there, and call it done.

The result is a "close but not quite right" problem. The retrieved documents are topically related but not precisely relevant. The pipeline optimized for speed at the cost of meaning.

Single-stage pipeline (the common mistake):
=============================================
  Query --> [Dual Encoder] --> Top-K results
                               (fast, imprecise)
=============================================

Two-stage pipeline (what actually works):
=============================================
  Query --> [Dual Encoder] --> Top-50 candidates
                --> [Cross-Encoder] --> Reranked Top-5
                                        (fast + precise)
=============================================

The two models are not competing alternatives. They solve different halves of the same problem.


What a Dual Encoder Does

A dual encoder, also called a bi-encoder or two-tower model, encodes the query and the document independently. The two towers can be separate transformer networks or, as in most Sentence Transformers models, the same network with shared weights. Each side produces a fixed-size vector. Then the system measures cosine similarity between the two vectors.

That single number is the relevance signal.

The reason this is fast is precomputation. You encode every document at index time and store those vectors. At query time, you only encode the query, which takes milliseconds, and run an approximate nearest-neighbor search against the precomputed vectors. Query latency then grows far more slowly than corpus size.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encoded once at index time and stored
documents = [
    "Python is a high-level programming language.",
    "Transformer models revolutionized NLP benchmarks.",
    "Cosine similarity measures the angle between two vectors.",
    "RAG systems combine retrieval with language model generation.",
]
doc_embeddings = model.encode(documents, convert_to_numpy=True)

# Only this runs at query time
query = "How does vector similarity work in search?"
query_embedding = model.encode(query, convert_to_numpy=True)

scores = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

ranked = sorted(zip(scores, documents), reverse=True)
for score, doc in ranked[:3]:
    print(f"{score:.4f} | {doc}")
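
The cosine computation above recomputes every document norm on each query. One common refinement, sketched here in plain Python so it runs without a model download, is to L2-normalize vectors once at index time so that cosine similarity reduces to a dot product at query time. The helper names (`l2_normalize`, `top_k`) and the 3-dimensional toy vectors are illustrative assumptions standing in for real embeddings.

```python
import heapq
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, index, k):
    """Return the k (score, doc_id) pairs with the highest cosine similarity."""
    q = l2_normalize(query_vec)
    scored = ((dot(q, doc_vec), doc_id) for doc_id, doc_vec in index.items())
    return heapq.nlargest(k, scored)


# Index time: normalize once and store (toy vectors, not real embeddings).
raw_embeddings = {
    "doc_a": [3.0, 0.0, 0.0],
    "doc_b": [1.0, 1.0, 0.0],
    "doc_c": [0.0, 0.0, 2.0],
}
index = {doc_id: l2_normalize(vec) for doc_id, vec in raw_embeddings.items()}

# Query time: one normalization plus dot products.
results = top_k([2.0, 0.5, 0.0], index, k=2)
for score, doc_id in results:
    print(f"{score:.4f} | {doc_id}")
```

This loop is still exhaustive. On a large corpus, an approximate nearest-neighbor index would replace it, as described above.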

The tradeoff is that the query and document never actually interact. Each gets compressed into one vector independently. That compression loses nuance. The model has no way to understand whether the document answers the query. It only knows whether they live in the same region of vector space.
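
A deliberately extreme toy makes the information loss visible. It assumes static word vectors and mean pooling, which is far cruder than a real dual encoder (contextual embeddings do retain some word-order signal), so treat it as an illustration of the failure mode rather than a measurement of it. The vectors and the `mean_pool` helper are made up for this sketch.

```python
# Toy static word vectors (hypothetical, one axis per word).
TOY_VECTORS = {
    "dog": [1.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0],
    "man": [0.0, 0.0, 1.0],
}


def mean_pool(sentence):
    """Collapse a sentence into one vector by averaging its word vectors."""
    vectors = [TOY_VECTORS[word] for word in sentence.lower().split()]
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]


pooled_a = mean_pool("dog bites man")
pooled_b = mean_pool("man bites dog")

# Opposite meanings, identical pooled vectors: the single vector cannot tell them apart.
print(pooled_a == pooled_b)
```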

[Figure: cosine similarity score]

What a Cross-Encoder Does

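A cross-encoder takes the query and a candidate document as a single concatenated input: [CLS] query [SEP] document [SEP]. One transformer runs on this joint sequence. Every query token attends to every document token across all layers. The output is a single relevance score per pair. Some models emit a value from 0 to 1; others, including the MS MARCO cross-encoders, typically return an unbounded logit by default, so check the range before applying a threshold.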

Because the model reads both at the same time, it catches what a dual encoder misses. It understands negation. It distinguishes between a document that mentions a concept and a document that answers a question about it. It scores based on whether the document actually addresses the query, not whether they share vocabulary.

from sentence_transformers.cross_encoder import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "side effects of stopping medication suddenly"

candidates = [
    "Medication dosage guidelines for common prescriptions.",
    "Abrupt discontinuation of certain medications can cause withdrawal symptoms.",
    "Drug interaction checkers and pharmacy tools.",
    "How to schedule medication reminders on your phone.",
]

pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

ranked = sorted(zip(scores, candidates), reverse=True)
for score, doc in ranked:
    print(f"{score:.4f} | {doc}")
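
Depending on the model and library version, `predict` may return unbounded logits rather than values between 0 and 1, so it is worth printing the raw output to check. If you need bounded scores (for a cutoff threshold, say), a sigmoid maps logits into the 0 to 1 range without changing the ranking. The logits below are made-up illustrative values, not real model output.

```python
import math


def sigmoid(logit):
    """Map an unbounded logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical raw reranker outputs, one per candidate (not real model output).
raw_logits = [-4.2, 7.9, -1.3, -9.8]
probabilities = [sigmoid(x) for x in raw_logits]

# Sigmoid is monotonic, so the ranking is the same before and after.
order_by_logit = sorted(range(len(raw_logits)), key=lambda i: raw_logits[i], reverse=True)
order_by_prob = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
print(order_by_logit == order_by_prob)
```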

The cost is real. Every cross-encoder call requires a full transformer forward pass on the combined input. Nothing can be precomputed. At query time, each candidate costs a separate forward pass. Running this against a million documents is not viable.
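
A back-of-the-envelope calculation makes the cost concrete. The 5 ms per pair figure is an assumed placeholder, not a benchmark; substitute a number measured on your own hardware and batch size.

```python
# Assumed latency of one cross-encoder forward pass (placeholder, not measured).
SECONDS_PER_PAIR = 0.005


def rerank_seconds(num_candidates, seconds_per_pair=SECONDS_PER_PAIR):
    """Estimate sequential reranking time: one forward pass per candidate."""
    return num_candidates * seconds_per_pair


full_corpus = rerank_seconds(1_000_000)  # score every document in the corpus
shortlist = rerank_seconds(50)           # score only the stage-one candidates

print(f"Full corpus: {full_corpus / 60:.1f} minutes per query")
print(f"Top-50 shortlist: {shortlist * 1000:.0f} ms per query")
```

Batching on a GPU lowers the per-pair cost, but the work still grows linearly with the number of candidates.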

This is why you chain them.


The Two-Stage Pipeline

Stage one uses the dual encoder to retrieve the top 50 to 100 candidates. High recall matters here. Any relevant document that misses this cut is permanently gone.

Stage two passes only those candidates to the cross-encoder for reranking. The candidate set is now small enough that deep joint attention is computationally viable. The reranker reorders the list. Only the top 5 to 10 results reach the user or the LLM.

from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import CrossEncoder
import numpy as np

retriever = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Stopping blood pressure medication without a doctor's guidance can be dangerous.",
    "Common blood pressure drugs include ACE inhibitors and beta blockers.",
    "Medication adherence improves outcomes in chronic disease management.",
    "Withdrawal effects vary depending on the type and duration of medication use.",
    "Pharmacists can review drug interactions and dosage schedules.",
    "Abrupt cessation of antidepressants can cause discontinuation syndrome.",
    "Always consult a physician before changing any medication regimen.",
    "Over-the-counter pain relievers are generally safe for short-term use.",
]

query = "is it dangerous to stop taking my medication without a doctor?"

# Stage 1: Dual encoder retrieval
doc_embeddings = retriever.encode(corpus, convert_to_numpy=True)
query_embedding = retriever.encode(query, convert_to_numpy=True)

cosine_scores = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

top_k = 5  # small only because the demo corpus has 8 documents; use 50 to 100 on a real corpus
top_indices = np.argsort(cosine_scores)[::-1][:top_k]
candidates = [corpus[i] for i in top_indices]

print("Stage 1 - Dual Encoder Retrieval:")
for i, doc in enumerate(candidates):
    print(f"  {i+1}. {doc}")

# Stage 2: Cross-encoder reranking
pairs = [[query, doc] for doc in candidates]
rerank_scores = reranker.predict(pairs)

final_results = sorted(zip(rerank_scores, candidates), reverse=True)

print("\nStage 2 - Cross-Encoder Reranked:")
for score, doc in final_results:
    print(f"  {score:.4f} | {doc}")
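
The same flow can be factored into a function that takes the two scorers as arguments, which makes it easy to swap models or test the plumbing without downloading anything. The name `two_stage_search` and both toy scorers are hypothetical; in a real pipeline `retrieve_score` would wrap the dual encoder and `rerank_score` the cross-encoder.

```python
def two_stage_search(query, corpus, retrieve_score, rerank_score,
                     k_retrieve=50, k_final=5):
    """Retrieve k_retrieve candidates with a cheap scorer, then rerank them."""
    # Stage 1: cheap score for every document, keep the best k_retrieve.
    by_retrieval = sorted(corpus, key=lambda doc: retrieve_score(query, doc), reverse=True)
    candidates = by_retrieval[:k_retrieve]
    # Stage 2: expensive score for the shortlist only.
    reranked = sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)
    return reranked[:k_final]


def word_overlap(query, doc):
    """Stage-1 stand-in: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def phrase_match(query, doc):
    """Stage-2 stand-in: 1.0 if the document contains the query as a phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0


docs = [
    "medication taking schedule and when to stop",
    "how to stop taking medication safely",
    "medication reminders",
    "weather report",
]
results = two_stage_search("stop taking medication", docs, word_overlap, phrase_match,
                           k_retrieve=3, k_final=1)
print(results)
```

The first two documents tie on word overlap, so stage one cannot separate them. The stricter stage-two scorer is what promotes the document that actually contains the phrase.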

[Figure: two-stage pipeline flow diagram]

Note for RAG builders: The ceiling of this pipeline is always stage one recall. If the dual encoder misses a relevant document entirely, the cross-encoder never sees it. Retrieve generously with a higher top-k, then rerank aggressively.
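
Stage-one recall can be measured directly if you have even a small set of queries with known-relevant documents. The sketch below computes recall@k for one query; the document ids and relevance labels are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top = set(retrieved_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)


# Hypothetical labeled example: ids ranked by the dual encoder, plus known-relevant ids.
retrieved = ["d7", "d2", "d9", "d4", "d1", "d5"]
relevant = {"d2", "d5"}

for k in (1, 3, 6):
    print(f"recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
```

If recall keeps climbing as k grows, the retrieval cutoff is probably too tight for the reranker to do its job.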


Side-by-Side Comparison

| Property | Dual Encoder | Cross-Encoder |
| --- | --- | --- |
| Input format | Query and document encoded separately | Query and document concatenated as one input |
| Output | Two vectors, cosine compared | One relevance score per pair |
| Speed | Very fast, supports precomputation | Slow, no precomputation possible |
| Accuracy | Moderate, misses nuanced relevance | High, full query-document interaction |
| Scalability | Scales to millions of documents | Practical only on 50 to 200 candidates |
| Pipeline role | Stage 1 retrieval | Stage 2 reranking |
| Example models | all-MiniLM-L6-v2, text-embedding-ada-002 | ms-marco-MiniLM-L-6-v2, Cohere Rerank, BGE-Reranker |
| Token interaction | None (independent encoding) | Full cross-attention across all layers |

When to Use What

| Scenario | Recommended Approach |
| --- | --- |
| Corpus under 500 documents | Cross-encoder directly, skip dual encoder |
| Large corpus, latency is the priority | Dual encoder only |
| Large corpus, accuracy matters | Dual encoder retrieval plus cross-encoder reranking |
| High QPS, tight latency budget | Dual encoder plus ColBERT late interaction |
| Domain-specific content, general models underperform | Fine-tune the cross-encoder on your domain data |
| Multilingual corpus | BGE-Reranker or mGTE for multilingual reranking |

ColBERT: The Middle Ground Worth Knowing

If cross-encoder latency is too high for your QPS budget, ColBERT is worth knowing. It sits between the two architectures. It encodes query and document separately like a dual encoder, preserving the ability to precompute document vectors. But instead of comparing two pooled vectors, it compares individual token embeddings using a MaxSim operation: for each query token, find the most similar document token across the full sequence.
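
The MaxSim operation is small enough to sketch in plain Python. For each query token, take the highest dot product against any document token, then sum those maxima over the query tokens. The 2-dimensional token vectors are toy values assumed to be unit length, not output from a real ColBERT model, and `maxsim_score` is an illustrative name.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: best-matching document token per query token, summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy per-token embeddings (assumed unit length; not produced by a real model).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # matches both query tokens
doc_b_tokens = [[1.0, 0.0], [0.8, 0.6]]              # matches only the first one well

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
print(score_a, score_b)
```

The document token vectors can be computed and stored at index time, which is why this scheme keeps the precomputation benefit of a dual encoder.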

This gives ColBERT much better accuracy than a standard dual encoder while keeping document precomputation intact. According to the original ColBERT paper from Stanford, it runs two orders of magnitude faster and uses four orders of magnitude fewer FLOPs per query than BERT-based rerankers while maintaining strong retrieval quality. The RAGatouille library is one of the quickest ways to plug ColBERT into an existing pipeline.

| Approach | Precompute Docs | Token Interaction | Speed | Accuracy |
| --- | --- | --- | --- | --- |
| Dual Encoder | Yes | None | Fastest | Lowest |
| ColBERT | Yes (per token) | MaxSim per token | Fast | High |
| Cross-Encoder | No | Full cross-attention | Slowest | Highest |

FAQs

What is a dual encoder in NLP?
A dual encoder encodes the query and document into separate vectors and computes cosine similarity between them for fast, scalable retrieval.

What is a cross-encoder?
A cross-encoder takes a query and document as a single concatenated input and produces one precise relevance score by attending to both simultaneously.

Why not use a cross-encoder for all retrieval?
Cross-encoders cannot precompute anything, so running one against millions of documents at query time is computationally infeasible.

What is the two-stage retrieval pipeline?
The dual encoder retrieves the top 50 to 100 candidates for speed, then the cross-encoder reranks only those candidates for precision before passing results to the user or LLM.

What is ColBERT?
ColBERT is a late-interaction model that sits between dual and cross-encoders, comparing per-token vectors with MaxSim instead of pooled vectors, giving better precision than a dual encoder without losing document precomputation.

Which cross-encoder model should I start with?
cross-encoder/ms-marco-MiniLM-L-6-v2 from Sentence Transformers is the standard starting point for English; for managed reranking, Cohere Rerank works well in RAG pipelines.

Does LangChain support cross-encoder reranking?
Yes, both LangChain and LlamaIndex have built-in reranking steps that accept cross-encoder models as a second-stage reranker.


I write deeper technical breakdowns on search architecture, RAG systems, and AI infrastructure over at krunalkanojiya.com. The full version of this article with complete pipeline walkthroughs lives there if you want to go further.
