To Embed or Not to Embed? That Is the Question
Another story in my series about my grammar RAG assistant BookMind, and this time it frustrated me again.
Student asked: “Explain the Past Simple tense.”
The system gave a decent explanation.
Then the student said: “Give me an exercise on this topic.”
Instead of pulling an exercise from the same unit, the model retrieved one from a completely different section. The conversation broke.
That was the moment I finally added a proper reranker.
What changed in the pipeline
from sentence_transformers import CrossEncoder

# Stage 1: Hybrid retrieval (25 candidates)
candidates = retriever.invoke(question)

# Stage 2: Cross-Encoder reranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(question, doc.page_content) for doc in candidates])

# Stage 3: Only the best 5 go to the LLM
# Sort by score explicitly; sorting bare (score, doc) tuples would try to
# compare Document objects whenever two scores tie.
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
final_context = [doc for _, doc in ranked[:5]]
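The selection step in Stage 3 can be exercised on its own. Here is a minimal sketch with made-up scores and placeholder document IDs (a real cross-encoder would produce the scores) showing how scores and candidates are zipped, sorted, and truncated:

```python
def top_k_by_score(scores, candidates, k=5):
    """Pair each candidate with its reranker score, sort descending, keep the top k."""
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:k]]

# Stubbed example: scores a cross-encoder might return (values are made up)
fake_scores = [0.12, 0.97, 0.45, 0.88, 0.05, 0.73]
fake_docs = ["d0", "d1", "d2", "d3", "d4", "d5"]
print(top_k_by_score(fake_scores, fake_docs, k=3))  # ['d1', 'd3', 'd5']
```

The same function works unchanged on LangChain `Document` objects, since it never touches the candidate itself, only its score.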
Real conversation after adding reranker
Student asks for the rule → system correctly pulls the right section (Past Simple)
Student asks for a task on the same topic → system now pulls the correct exercise (p.378)
Student submits wrong answers (“goed”, “boughten”) → system gives precise feedback and points to the exact unit (Unit 68 > 68.3)
The whole conversation stayed coherent. No more jumping between unrelated parts of the book.
Measurable improvement
Before reranker: Top-1 Accuracy ≈ 40%
After reranker: Top-1 Accuracy ≈ 95%
Reranking 24–25 candidates takes ~1.51 seconds
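The latency figure implies a rough per-pair cost. Back-of-the-envelope arithmetic from the numbers above (illustrative only; real throughput depends on batching and hardware):

```python
# From the measurement above: ~1.51 s to rerank 25 candidates
latency_s = 1.51
n_candidates = 25
per_pair_ms = latency_s / n_candidates * 1000
print(f"~{per_pair_ms:.0f} ms per (question, passage) pair")  # ~60 ms
```

Since the cost grows linearly with the number of pairs, this is also why you rerank a few dozen retrieved candidates rather than scoring the whole book.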
So?
Embeddings + hybrid search are good at finding something. Cross-encoder reranking is what makes the system actually understand what is relevant for the current question.
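The difference comes from the scoring shape: a bi-encoder compresses question and passage into separate vectors before any comparison, so it can only measure generic similarity, while a cross-encoder reads both texts in one forward pass and judges relevance to this specific question. A toy sketch of the two shapes (the scoring functions are crude stand-ins, not real models):

```python
from typing import List

def bi_encoder_score(q_vec: List[float], p_vec: List[float]) -> float:
    """Bi-encoder shape: cosine of precomputed vectors; the question never sees the passage text."""
    dot = sum(a * b for a, b in zip(q_vec, p_vec))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(q_vec) * norm(p_vec))

def cross_encoder_score(question: str, passage: str) -> float:
    """Cross-encoder shape: both raw texts scored together (word overlap as a crude stand-in)."""
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)
```

A real cross-encoder like ms-marco-MiniLM-L-6-v2 replaces the overlap heuristic with a transformer that attends across both texts, which is exactly what makes it slower and more accurate than comparing two frozen vectors.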
The extra 1.5 seconds is worth every millisecond.
Have you tried cross-encoder reranking in your projects? How many candidates do you usually pass to it?