
Ksirailway Base
To Embed or Not to Embed? That Is the Question.

Another story in my series about BookMind, my grammar RAG assistant, and this time it frustrated me again.

A student asked: “Explain the Past Simple tense.”

The system gave a decent explanation.

Then the student said: “Give me an exercise on this topic.”

Instead of pulling an exercise from the same unit, the model brought something from a completely different section. The conversation broke.

That was the moment I finally added a proper reranker.

What changed in the pipeline

```python
from sentence_transformers import CrossEncoder

# Stage 1: Hybrid retrieval (25 candidates)
candidates = retriever.invoke(question)

# Stage 2: Cross-Encoder reranking — score each (question, chunk) pair jointly
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(question, doc.page_content) for doc in candidates])

# Stage 3: Only the best 5 go to the LLM
# (sort by score explicitly; comparing docs directly on score ties would fail)
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
final_context = [doc for _, doc in ranked[:5]]
```
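Stage 1's hybrid retrieval isn't shown above. One common way to fuse a keyword (BM25) ranking with a vector ranking is Reciprocal Rank Fusion; here is a minimal sketch, where the `keyword_ranking` and `vector_ranking` inputs are hypothetical stand-ins for real search results:

```python
def rrf_fuse(rankings, k=60, top_n=25):
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion.

    A doc's fused score is the sum of 1 / (k + rank) over every
    list it appears in; docs found by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical rankings from keyword and vector search:
keyword_ranking = ["unit68", "unit12", "unit03"]
vector_ranking = ["unit68", "unit03", "unit99"]
candidates = rrf_fuse([keyword_ranking, vector_ranking])
print(candidates)  # ['unit68', 'unit03', 'unit12', 'unit99']
```

The fused top-25 then goes to the cross-encoder, which does the fine-grained relevance judgment.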

Real conversation after adding reranker
Student asks for the rule → system correctly pulls the right section (Past Simple)
Student asks for a task on the same topic → system now pulls the correct exercise (p.378)
Student submits wrong answers (“goed”, “boughten”) → system gives precise feedback and points to the exact unit (Unit 68 > 68.3)

The whole conversation stayed coherent. No more jumping between unrelated parts of the book.

Measurable improvement
Before reranker: Top-1 Accuracy ≈ 40%
After reranker: Top-1 Accuracy ≈ 95%
Reranking 24–25 candidates takes ~1.51 seconds
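For reference, Top-1 accuracy here just means: for what fraction of test questions is the highest-ranked chunk the correct one? A tiny sketch of how I'd compute it (the eval pairs below are made up for illustration):

```python
def top1_accuracy(results):
    """results: list of (ranked_doc_ids, gold_doc_id) pairs."""
    hits = sum(1 for ranked, gold in results if ranked and ranked[0] == gold)
    return hits / len(results)

# Hypothetical eval runs: predicted ranking vs. the gold chunk id
eval_runs = [
    (["unit68", "unit12"], "unit68"),  # hit
    (["unit03", "unit68"], "unit68"),  # miss: gold chunk ranked second
]
print(top1_accuracy(eval_runs))  # 0.5
```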

So?
Embeddings + hybrid search are good at finding something. Cross-encoder reranking is what makes the system actually understand what is relevant for the current question.

The extra 1.5 seconds is worth every millisecond.

Have you tried cross-encoder reranking in your projects? How many candidates do you usually pass to it?
