To Embed or Not to Embed? That Is the Question
Another story in my series about my grammar RAG assistant BookMind, and this time it frustrated me again.
Student asked: “Explain the Past Simple tense.”
The system gave a decent explanation.
Then the student said: “Give me an exercise on this topic.”
Instead of pulling an exercise from the same unit, the model retrieved one from a completely different section. The conversation broke.
That was the moment I finally added a proper reranker.
What changed in the pipeline
from sentence_transformers import CrossEncoder

# Stage 1: Hybrid retrieval (25 candidates)
candidates = retriever.invoke(question)

# Stage 2: Cross-Encoder reranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(question, doc.page_content) for doc in candidates])

# Stage 3: Only the best 5 go to the LLM
# Sort by score explicitly; sorting bare (score, doc) tuples would try to
# compare Document objects whenever two scores tie.
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
final_context = [doc for _, doc in ranked[:5]]
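The selection step in Stage 3 can be exercised on its own. Here is a minimal sketch with made-up scores and placeholder document IDs (a real cross-encoder would produce the scores) showing how scores and candidates are zipped, sorted, and truncated:

```python
def top_k_by_score(scores, candidates, k=5):
    """Pair each candidate with its reranker score, sort descending, keep the top k."""
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:k]]

# Stubbed example: scores a cross-encoder might return (values are made up)
fake_scores = [0.12, 0.97, 0.45, 0.88, 0.05, 0.73]
fake_docs = ["d0", "d1", "d2", "d3", "d4", "d5"]
print(top_k_by_score(fake_scores, fake_docs, k=3))  # ['d1', 'd3', 'd5']
```

The same function works unchanged on LangChain `Document` objects, since it never touches the candidate itself, only its score.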
Real conversation after adding reranker
Student asks for the rule → system correctly pulls the right section (Past Simple)
Student asks for a task on the same topic → system now pulls the correct exercise (p.378)
Student submits wrong answers (“goed”, “boughten”) → system gives precise feedback and points to the exact unit (Unit 68 > 68.3)
The whole conversation stayed coherent. No more jumping between unrelated parts of the book.
Measurable improvement
Before reranker: Top-1 Accuracy ≈ 40%
After reranker: Top-1 Accuracy ≈ 95%
Reranking 24–25 candidates takes ~1.51 seconds
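The latency figure implies a rough per-pair cost. Back-of-the-envelope arithmetic from the numbers above (illustrative only; real throughput depends on batching and hardware):

```python
# From the measurement above: ~1.51 s to rerank 25 candidates
latency_s = 1.51
n_candidates = 25
per_pair_ms = latency_s / n_candidates * 1000
print(f"~{per_pair_ms:.0f} ms per (question, passage) pair")  # ~60 ms
```

Since the cost grows linearly with the number of pairs, this is also why you rerank a few dozen retrieved candidates rather than scoring the whole book.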
So?
Embeddings + hybrid search are good at finding something. Cross-encoder reranking is what makes the system actually understand what is relevant for the current question.
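The difference comes from the scoring shape: a bi-encoder compresses question and passage into separate vectors before any comparison, so it can only measure generic similarity, while a cross-encoder reads both texts in one forward pass and judges relevance to this specific question. A toy sketch of the two shapes (the scoring functions are crude stand-ins, not real models):

```python
from typing import List

def bi_encoder_score(q_vec: List[float], p_vec: List[float]) -> float:
    """Bi-encoder shape: cosine of precomputed vectors; the question never sees the passage text."""
    dot = sum(a * b for a, b in zip(q_vec, p_vec))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(q_vec) * norm(p_vec))

def cross_encoder_score(question: str, passage: str) -> float:
    """Cross-encoder shape: both raw texts scored together (word overlap as a crude stand-in)."""
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)
```

A real cross-encoder like ms-marco-MiniLM-L-6-v2 replaces the overlap heuristic with a transformer that attends across both texts, which is exactly what makes it slower and more accurate than comparing two frozen vectors.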
The extra 1.5 seconds is worth every millisecond.
Have you tried cross-encoder reranking in your projects? How many candidates do you usually pass to it?