DEV Community

Muhammad Zeeshan

Your Node.js RAG is grabbing the wrong sources. Here's a 4MB cross-encoder fix.

The bug nobody talks about in JS RAG tutorials

You wire up a Node.js retrieval pipeline. Embed your docs with OpenAI or Voyage, drop them into Pinecone or pgvector, build a top-5 query, feed it to Claude or GPT. The demo works.

Then you ship to production and users notice the same thing:

"The bot quoted the wrong document."

The retrieval looked fine in your eval. It pulled five plausibly-related chunks. The LLM picked the wrong one to cite because the WRONG ONE was at the top of the list.

This is the bi-encoder problem.

Why embedding similarity falls short

When you do cosine(query_vec, doc_vec) on a vector DB, you are using a bi-encoder: the query and the doc were encoded independently into the same space. That is fast (millisecond-scale ANN search over millions of vectors). It is also lossy. The encoder never saw the query-doc pair TOGETHER. It estimated relevance based on independent meaning.
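
To make the "encoded independently" point concrete, here is a minimal sketch of the scoring step. The three-dimensional vectors are made-up toy values, not output from a real embedding model, and your vector DB does this for you with an ANN index rather than a loop:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings" (hypothetical values, not from a real model).
const queryVec = [0.9, 0.1, 0.0];
const docVecs = [
  [0.8, 0.2, 0.1],
  [0.1, 0.9, 0.3],
];

// Each doc vector was computed ahead of time, without ever seeing the query.
const scores = docVecs.map((docVec) => cosine(queryVec, docVec));
```

The score is computed purely from two precomputed vectors. Nothing in this step can look at how the words of the query and the words of the doc relate to each other.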

A cross-encoder sees both at once. It encodes [query, doc] jointly through a small transformer and outputs a single relevance score per pair. Heavier per call. Massively more accurate at picking the right one.

The canonical RAG architecture in 2025 is:

  1. Bi-encoder plus ANN to fetch top-50 candidates (fast, cheap).
  2. Cross-encoder reranker to pick the top-5 from those 50.
  3. Feed top-5 to the LLM.
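
The three steps above can be sketched as one small function. This is a shape sketch under loud assumptions: retrieveThenRerank, topK, and both scorers are hypothetical names, the scorers are toy string matchers standing in for a bi-encoder and a cross-encoder, and everything is synchronous for brevity (real scorers are async):

```typescript
type Scorer = (query: string, doc: string) => number;

// Keep the k highest-scoring docs, best first.
function topK(query: string, docs: string[], score: Scorer, k: number): string[] {
  return docs
    .map((doc) => ({ doc, s: score(query, doc) }))
    .sort((a, b) => b.s - a.s)
    .slice(0, k)
    .map((x) => x.doc);
}

// Stage 1 narrows the corpus cheaply; stage 2 reorders only the survivors.
function retrieveThenRerank(
  query: string,
  corpus: string[],
  cheapScore: Scorer, // stands in for bi-encoder + ANN
  preciseScore: Scorer, // stands in for the cross-encoder
  fetchK = 50,
  finalK = 5,
): string[] {
  const shortlist = topK(query, corpus, cheapScore, fetchK);
  return topK(query, shortlist, preciseScore, finalK);
}

// Toy scorers, purely illustrative: word overlap, then exact-phrase match.
const overlap: Scorer = (q, d) =>
  q.toLowerCase().split(/\s+/).filter((w) => d.toLowerCase().includes(w)).length;
const phrase: Scorer = (q, d) => (d.toLowerCase().includes(q.toLowerCase()) ? 1 : 0);

const picked = retrieveThenRerank(
  "vector search",
  ["search engines index pages", "vector search uses ANN", "cooking pasta"],
  overlap,
  phrase,
  2, // fetchK
  1, // finalK
);
```

The expensive scorer only ever runs on fetchK documents, which is why the heavier per-pair cost stays affordable.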

Python devs have had this for years via sentence-transformers, and since 2023 via FlashRank and BGE rerankers. Node.js devs have had two options:

  1. Pay Cohere Rerank ($1 per 1000 calls, ~300 ms network latency).
  2. Hand-roll with @huggingface/transformers (~80 lines of model loading, batching, score normalization plumbing you do not want to maintain).
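
For a sense of what that plumbing involves, here is a minimal sketch of two of the pieces: batching the [query, doc] pairs and normalizing raw model outputs. The helper names are hypothetical, the logits are made-up stand-ins for model output, and sigmoid is one common normalization for single-logit rerankers rather than something every model requires:

```typescript
// Split [query, doc] pairs into fixed-size batches for the model.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Map a raw logit into the open interval (0, 1).
function sigmoid(logit: number): number {
  return 1 / (1 + Math.exp(-logit));
}

// Build one [query, doc] pair per candidate document.
const pairs: [string, string][] = ["doc a", "doc b", "doc c"].map(
  (doc): [string, string] => ["What is RAG?", doc],
);
const batches = toBatches(pairs, 2); // two batches: sizes 2 and 1

// Hypothetical raw logits standing in for real model output.
const normalized = [6.3, -7.6].map(sigmoid);
```

Model loading, tokenization with truncation, and error handling come on top of this, which is where the rest of the 80 lines go.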

That gap is what flashrank-js fills.

Install

npm install flashrank-js

Zero API keys. Zero cloud. ONNX cross-encoder running locally via @huggingface/transformers. Five model tiers from 4 MB to 571 MB. Pick the one that fits your latency budget.

The six-line tutorial

import { Reranker } from "flashrank-js";

const reranker = await Reranker.create({ model: "mini" });

const ranked = await reranker.rerank({
  query: "What is RAG?",
  documents: candidates,  // your top-50 from vector search
  topN: 5,
});

ranked[0] is the most relevant document. Done.

Before / after on a real query

Suppose your vector store returns these five candidates for the query "How does retrieval-augmented generation work?":

const candidates = [
  "RAG combines a retriever and a generator. The retriever finds relevant docs, the generator uses them to answer.",
  "Karachi is a city in Pakistan with a population over 16 million.",
  "Retrieval-augmented generation grounds LLM outputs in real documents to reduce hallucinations.",
  "The capital of France is Paris.",
  "Cross-encoders rerank retrieved documents to surface the most relevant ones for a query.",
];

Vector similarity might put them in this order (the scores are illustrative and depend on your embedding model, but off-topic snippets like the city and capital ones can still land high in a coarse similarity ranking):

0.81  RAG combines a retriever and a generator...
0.78  Karachi is a city in Pakistan...      (noise)
0.76  Retrieval-augmented generation grounds...
0.71  The capital of France is Paris...     (noise)
0.69  Cross-encoders rerank retrieved...

Two of five are noise. If the LLM cites one of them, your user sees a hallucination.

Run the same candidates through flashrank-js:

const ranked = await reranker.rerank({
  query: "How does retrieval-augmented generation work?",
  documents: candidates,
  topN: 3,
});

Output:

[0.9982] Retrieval-augmented generation grounds LLM outputs in real documents...
[0.0005] Cross-encoders rerank retrieved documents to surface the most relevant ones...
[0.0004] RAG combines a retriever and a generator...

Three RAG-related docs at top. Karachi and Paris dropped out entirely.

An LLM that sees this top-3 cannot cite a city snippet, because there is no city snippet in the context.
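
Handing the reranked snippets to the LLM can be as simple as numbering them. A minimal sketch: buildContext and buildPrompt are hypothetical helper names, and the instruction wording is just an example, not part of flashrank-js:

```typescript
// Number each reranked snippet so the model can cite it as [1], [2], ...
function buildContext(rankedDocs: string[]): string {
  return rankedDocs.map((doc, i) => `[${i + 1}] ${doc}`).join("\n");
}

// Assemble the final prompt: instruction, numbered sources, then the question.
function buildPrompt(question: string, rankedDocs: string[]): string {
  return [
    "Answer using only the sources below and cite them by number.",
    "",
    buildContext(rankedDocs),
    "",
    `Question: ${question}`,
  ].join("\n");
}

const llmPrompt = buildPrompt("How does retrieval-augmented generation work?", [
  "Retrieval-augmented generation grounds LLM outputs in real documents to reduce hallucinations.",
  "Cross-encoders rerank retrieved documents to surface the most relevant ones for a query.",
]);
```

Only what survives reranking reaches the prompt, so a snippet that was dropped cannot be cited by number.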

Why this matters in production

I have been shipping RAG pipelines and AI agent workflows in production for a while. The Python side of those stacks has had FlashRank wired in for years. The JavaScript pieces shipping to the client UI did not. Every new project that needed cross-encoder reranking on the JavaScript side started with the same 80 lines of @huggingface/transformers boilerplate. That boilerplate ages quickly because the transformers.js API keeps changing between releases.

So I packaged the boilerplate.

The five model tiers

flashrank-js ships with five pre-configured cross-encoder models, and you can also bring your own ONNX repo from Hugging Face Hub. Pick by latency budget:

| Alias | Size | Language | Use case |
| --- | --- | --- | --- |
| tiny | 4 MB | English | Lowest latency, edge runtimes |
| mini (default) | 23 MB | English | Balanced English RAG |
| bge-base | 280 MB | Multilingual | First multilingual tier |
| bge-v2-m3 | 571 MB | Multilingual | 2025-2026 SOTA-small |
| bge-large | 563 MB | Multilingual | Max quality |

Switching is one line:

const reranker = await Reranker.create({ model: "bge-v2-m3" });

Models download from Hugging Face Hub on the first call and are cached locally. After that, each call is pure inference.
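
Because the load cost is paid on the first call, it helps to create the reranker once per process and share it. Here is a minimal sketch of that pattern. The once helper is hypothetical and generic, and the loader below is a stand-in; in a real app it would be something like () => Reranker.create({ model: "mini" }):

```typescript
// Wrap an async factory so it runs at most once; later callers share the same promise.
function once<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) cached = factory();
    return cached;
  };
}

// Stand-in loader (hypothetical) that counts how often it actually runs.
let loads = 0;
const getReranker = once(async () => {
  loads += 1;
  return { modelAlias: "mini" };
});

// Both calls share one load, even if they start at the same time.
const first = getReranker();
const second = getReranker();
```

One caveat with this sketch: a failed load stays cached as a rejected promise, so you may want to clear the cache on error.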

Real benchmarks (Windows x64, Node 24, CPU)

End-to-end including tokenization, median of five runs:

| Model | Load (first call) | 5 docs | 10 docs | 20 docs |
| --- | --- | --- | --- | --- |
| tiny | 190 ms | 3 ms | 6 ms | 10 ms |
| mini | 260 ms | 37 ms | 68 ms | 97 ms |

Cohere Rerank API for comparison: 200 to 500 ms including network round-trip, $1 per 1000 calls. At 1 million queries per month, that is $1,000 in rerank fees that can be zero with flashrank-js.
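
The fee arithmetic, as a quick sketch you can plug your own numbers into. It assumes one rerank call per query and uses the $1 per 1000 calls figure quoted above; monthlyRerankCost is a hypothetical helper name:

```typescript
// Monthly API cost for a price quoted per 1000 calls.
function monthlyRerankCost(queriesPerMonth: number, usdPer1000Calls: number): number {
  return (queriesPerMonth / 1000) * usdPer1000Calls;
}

// 1,000,000 queries at $1 per 1000 calls.
const hostedCost = monthlyRerankCost(1000000, 1); // 1000 (USD)
```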

Vercel AI SDK style

If your stack uses Vercel AI SDK 6's rerank(), the call shape mirrors it:

import { rerank } from "flashrank-js/vercel-ai-sdk";

const { ranking, results } = await rerank({
  model: "mini",
  query: "...",
  documents: candidates,
  topN: 5,
});

Same { index, relevanceScore } shape Vercel's API returns. Swapping from the paid Cohere provider to local flashrank is a one-line import change.

(Heads up: this is a standalone function with a familiar shape, not a RerankingModelV2 provider you pass into Vercel's rerank() from the ai package. A true provider adapter is on the v1.x roadmap.)
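
If you have not worked with that shape before, here is a minimal sketch of how per-document scores turn into { index, relevanceScore } entries. toRanking is a hypothetical helper, not the library's implementation, and the two 0.0001 scores are made up for the candidates that were dropped in the earlier example:

```typescript
interface RankedItem {
  index: number; // position in the original documents array
  relevanceScore: number; // higher means more relevant
}

// Pair each score with its original index, sort best first, keep topN.
function toRanking(scoresByDoc: number[], topN: number): RankedItem[] {
  return scoresByDoc
    .map((relevanceScore, index) => ({ index, relevanceScore }))
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .slice(0, topN);
}

// One score per candidate, in the same order as the candidates array.
const ranking = toRanking([0.0004, 0.0001, 0.9982, 0.0001, 0.0005], 3);
// ranking.map((r) => r.index) is [2, 4, 0]
```

Keeping the original index lets you map each result back to the metadata (IDs, URLs, titles) you stored alongside the candidate text.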

Honest limits

This is not a silver bullet:

  • Cross-encoders are SLOWER per call than bi-encoders. You still need a vector store for the first-stage retrieval. Reranking happens on top-50, not on millions.
  • Multilingual cross-encoders are bigger. bge-v2-m3 is 571 MB. For pure English apps, mini at 23 MB is the sweet spot.
  • Pure-edge runtimes (Cloudflare Workers, Vercel Edge) need the bundled WASM build of onnxruntime-web. v0.1 targets Node.js 20+ first; edge runtime story lands in v1.1.

Try it

npm install flashrank-js

If you ship a Node.js RAG and your users have ever complained that "the bot quoted the wrong thing," try this. The 23 MB default model costs nothing to add and takes around 30 minutes to integrate, and citation accuracy should go up.

I would love your bug reports.
