You wired up a RAG pipeline, picked bge-reranker-v2-m3 off Hugging Face, plugged it in after the bi-encoder, and watched hit rate jump eight points. Then you stopped touching it. Two quarters later the domain has drifted. The corpus is twice the size, half the queries are jargon the reranker has never seen, and the support team is filing tickets faster than the model is winning them. Recall@5 still looks fine. Users don't.
This is the moment off-the-shelf rerankers stop being free wins. They were trained on MS MARCO and a handful of public IR datasets. Your queries are about your billing flow, your part numbers, your internal acronyms. The cross-encoder cannot tell which of two near-identical product specs is the one your engineers actually open. Your users can. They click.
Click data is the cheapest fine-tuning signal you will ever own. Every click is a labeled positive: this query, this doc, the user picked it. Every skip above the click is a labeled negative: this query, this doc, the user saw it and moved on. You already have it in your search logs. The question is what to do with it.
What "the reranker stopped winning" looks like
Before fine-tuning anything, prove the reranker is the bottleneck. The pattern that usually shows up in the trace:
- The bi-encoder retrieves 50 candidates. The right one is in there 94% of the time.
- The cross-encoder reranks them. The right one is in the top 5 about 78% of the time.
- A year ago that number was 88%. Nobody noticed it drift because nobody put it on a dashboard.
The 94 → 78 gap is the reranker's slice. If your retriever is missing the doc entirely, no amount of fine-tuning fixes that. Run the four-metric harness from RAG Evaluation Beyond Recall@K first. If faithfulness and coverage look fine, recall@50 is high, and rerank-top-5 is sliding, fine-tune the reranker. Any other shape, fix the upstream stage first.
Data shape: what click data should look like before you train
Forget anything fancy. The minimum useful row is five fields:
```jsonl
# click_events.jsonl — one row per query impression
{
  "query": "refund window for digital downloads",
  "shown_doc_ids": ["d_8821", "d_2207", "d_3340", "d_91", "d_44"],
  "clicked_doc_ids": ["d_3340"],
  "session_id": "s_5b21...",
  "ts": 1736201044
}
```
Five rules for cleaning it before it touches a trainer (a filtering sketch follows the list):
- Drop sessions with no click. No signal, only noise.
- Drop bot traffic. A user-agent allowlist plus a "more than 50 queries in 60 seconds" rule kills 80% of it.
- Drop queries whose click-through rate on a single doc-id stays above 0.95 across repeated impressions. That is either a bookmark or someone scripting your search bar.
- Drop the top 1% of queries by frequency. They will dominate the training set and overfit you to your homepage.
- Deduplicate by (query, clicked_doc_id). A user clicking the same result ten times is one positive, not ten.
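A minimal sketch of those filters over click_events.jsonl. The function names and the top_query_frac threshold are mine, not a prescribed API; the bot and high-CTR rules need user-agent strings and per-doc impression counts from your raw logs, so they are only noted in comments here.

```python
import json
from collections import Counter

def load_events(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

def clean(events, top_query_frac=0.01):
    # Rule 1: drop impressions with no click
    events = [e for e in events if e["clicked_doc_ids"]]

    # Rules 2 and 3 (bot traffic, >0.95 CTR on one doc) need user-agent
    # strings and per-doc impression counts from the raw logs; apply them
    # upstream, before this function.

    # Rule 4: drop the top 1% of queries by frequency
    freq = Counter(e["query"] for e in events)
    head = {q for q, _ in freq.most_common(max(1, int(len(freq) * top_query_frac)))}
    events = [e for e in events if e["query"] not in head]

    # Rule 5: deduplicate on (query, clicked_doc_id)
    seen, deduped = set(), []
    for e in events:
        key = (e["query"], tuple(sorted(e["clicked_doc_ids"])))
        if key not in seen:
            seen.add(key)
            deduped.append(e)
    return deduped
```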
After cleaning, what comes out the other end is a list of pairwise preferences:
```jsonl
# pairs.jsonl — one row per (positive, negative) pair
{
  "query": "refund window for digital downloads",
  "pos_doc_id": "d_3340",
  "neg_doc_id": "d_8821"
}
{
  "query": "refund window for digital downloads",
  "pos_doc_id": "d_3340",
  "neg_doc_id": "d_2207"
}
```
For each impression, every shown-but-not-clicked doc above the click position is a hard negative. Docs shown below the click are ambiguous — the user may have stopped scrolling — so drop them. With one click out of five shown, you get up to four negatives per impression. A million impressions a month gets you a few million pairs. That is enough.
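A sketch of that impression-to-pairs conversion, assuming the cleaned click_events.jsonl rows from the filtering step; the function name events_to_pairs is mine.

```python
import json

def events_to_pairs(events):
    """Turn one impression into (query, pos, neg) preference pairs.

    Negatives are docs shown ABOVE the clicked doc; docs below the
    click are dropped as ambiguous."""
    pairs = []
    for e in events:
        shown = e["shown_doc_ids"]
        for pos in e["clicked_doc_ids"]:
            if pos not in shown:
                continue
            click_rank = shown.index(pos)
            for neg in shown[:click_rank]:  # only docs ranked above the click
                if neg not in e["clicked_doc_ids"]:
                    pairs.append({"query": e["query"],
                                  "pos_doc_id": pos,
                                  "neg_doc_id": neg})
    return pairs

events = [json.loads(line) for line in open("click_events.jsonl")]  # cleaned rows
with open("pairs.jsonl", "w") as out:
    for pair in events_to_pairs(events):
        out.write(json.dumps(pair) + "\n")
```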
Pairwise loss: the only objective that fits this data
You have pairs. The right loss is the one that takes pairs. MarginRankingLoss, conceptually similar to BPR in the recsys literature, asks the model to score the positive higher than the negative by a margin. The form:
loss = max(0, margin - (score_pos - score_neg))
Gradient flows when the negative is too close. When the model already ranks the positive far above the negative, the loss is zero and the example contributes nothing. Margin is usually 1.0 for cross-encoders that emit logits in a [-10, +10]-ish range.
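To see the hinge behave, here it is on three made-up score pairs at margin 1.0 (the numbers are purely illustrative):

```python
def margin_ranking_loss(score_pos, score_neg, margin=1.0):
    return max(0.0, margin - (score_pos - score_neg))

margin_ranking_loss(4.2, 1.1)  # 0.0: already separated by more than the margin
margin_ranking_loss(2.3, 1.9)  # 0.6: too close, gradient flows
margin_ranking_loss(0.5, 2.0)  # 2.5: ranked the wrong way, large penalty
```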
The training loop is about forty lines. The version below is plain PyTorch over the Hugging Face checkpoint, using torch.nn.MarginRankingLoss so the code matches the objective above; load_pairs and doc_text are your own helpers over pairs.jsonl and your doc store:

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
model = AutoModelForSequenceClassification.from_pretrained(
    "BAAI/bge-reranker-v2-m3", num_labels=1
).to(device)

# load_pairs / doc_text are your own helpers from the pairs.jsonl step
triplets = [
    (r["query"], doc_text(r["pos_doc_id"]), doc_text(r["neg_doc_id"]))
    for r in load_pairs("pairs.jsonl")
]
loader = DataLoader(triplets, shuffle=True, batch_size=16)

loss_fn = torch.nn.MarginRankingLoss(margin=1.0)  # the hinge from above
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# Optionally add a linear warmup scheduler (e.g. 500 steps) before full LR.

def score(queries, passages):
    # Cross-encoder scoring: one logit per (query, passage) pair
    enc = tokenizer(list(queries), list(passages), padding=True,
                    truncation=True, max_length=512, return_tensors="pt").to(device)
    return model(**enc).logits.squeeze(-1)

model.train()
for queries, pos_texts, neg_texts in loader:  # one epoch
    s_pos, s_neg = score(queries, pos_texts), score(queries, neg_texts)
    # target = 1: the clicked doc should score higher than the skipped one
    loss = loss_fn(s_pos, s_neg, torch.ones_like(s_pos))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.save_pretrained("rerank-finetuned-v1")
tokenizer.save_pretrained("rerank-finetuned-v1")
```
One epoch over a few hundred thousand pairs on a single A10 takes about an hour. Two epochs is the limit. Past that, train loss keeps falling while eval loss climbs. Classic overfit.
Distillation: when click data is too sparse
If you have ten thousand pairs instead of a million, do not train from scratch. Distill. The shape: take a strong teacher reranker (Cohere rerank-3, bge-reranker-v2-gemma, or a closed-source one you can hit via API), score every (query, candidate) pair the teacher sees, and have your smaller student learn the teacher's continuous score, not the binary click label.
Why this works: the teacher's score for a doc is closer to the truth than a single click event. The teacher knows that doc A scores 0.91 and doc B scores 0.88; both are relevant, A is slightly more so. A click only tells you "the user picked A" — it has no view on how much more relevant A is. The distillation loss is regression on the teacher's score:
loss = mse(student_score(q, d), teacher_score(q, d))
MarginMSELoss in sentence-transformers does exactly this with a triplet (query, pos, neg): the student learns the difference between the teacher's scores for the positive and the negative. Learning the margin rather than the absolute score keeps training stable when the teacher is well calibrated within a query but drifts across queries.
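The objective itself is one line. A sketch in plain PyTorch, assuming you have already cached teacher scores for each (query, pos, neg) triplet; the variable names are mine:

```python
import torch

def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    # Match the student's score GAP to the teacher's score GAP,
    # so within-query calibration matters and cross-query drift does not.
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return torch.nn.functional.mse_loss(student_margin, teacher_margin)
```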
Teacher reranks via a hosted API run on the order of $1 per 1,000 calls at typical pricing (check current rates on the vendor's pricing page). A million (query, candidate) pairs is roughly a four-figure spend on teacher labels, paid once per major distillation refresh. Cache the labels in S3 and never re-pay for them.
When click data is the bottleneck, distillation widens the dataset 50x. A student trained on clicks ∪ teacher-labels typically beats either source alone by a few NDCG points on a held-out eval.
The training budget question: one GPU-hour for 100k pairs
Numbers people ask for, with caveats. On a single A10G at roughly $1.10/hour on a typical cloud GPU provider, with bge-reranker-base (~280M params):
| Pairs | Epochs | Approx wall time | Approx cost at ≈$1.10/hr |
|---|---|---|---|
| 50k | 1 | 30 min | $0.55 |
| 100k | 1 | 60 min | $1.10 |
| 100k | 2 | 2h | $2.20 |
| 500k | 1 | 5h | $5.50 |
| 500k | 2 | 10h | $11 |
These are rough. Your batch size, max-seq-len, and whether you're using LoRA all swing the number. The spend is in the noise. The cost of evaluating the fine-tuned model is higher than the cost of training it, because eval needs your full top-50 reranks scored across a held-out month of queries, and that is where most of the time goes.
Use LoRA if you want to ship multiple per-tenant rerankers. A LoRA adapter is a few MB, the base stays shared, and you can route at inference time. Without LoRA, every tenant model is the full 1.1 GB checkpoint, so each additional tenant adds 1.1 GB to your hot model footprint. Past 8–10 tenants you are paying for GPU memory you barely use.
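A sketch of the adapter setup with the peft library, assuming the same Hugging Face checkpoint as the training loop above. The target module names are typical for XLM-RoBERTa-style rerankers and are an assumption to verify against your model:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained(
    "BAAI/bge-reranker-v2-m3", num_labels=1
)

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed attention projection names for an XLM-RoBERTa backbone;
    # print(base) and adjust if your checkpoint names them differently.
    target_modules=["query", "key", "value"],
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # a fraction of a percent of the base model

# Train as before, then save only the adapter (a few MB, one per tenant)
model.save_pretrained("adapters/tenant-acme")
```

At inference, load the shared base once and attach a tenant's adapter with peft's PeftModel.from_pretrained(base, "adapters/tenant-acme"), so routing is a cheap adapter swap rather than a second full checkpoint in memory.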
Evaluation: the only number that matters is the lift over off-the-shelf
Train/test discipline first. Hold out the last N days of click logs. Never train on them. The fine-tuned model has to beat the off-the-shelf reranker on those exact same impressions, scored offline.
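A sketch of that split, keyed on the ts field from click_events.jsonl; the 14-day window is a placeholder:

```python
import json
import time

HOLDOUT_DAYS = 14
cutoff = time.time() - HOLDOUT_DAYS * 86400

events = [json.loads(line) for line in open("click_events.jsonl")]
train_events = [e for e in events if e["ts"] < cutoff]   # pairs.jsonl comes from these
held_out = [e for e in events if e["ts"] >= cutoff]      # never trained on, eval only
```

Build pairs.jsonl from train_events only; held_out feeds the script below.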
```python
# eval_lift.py — the whole eval is this
import math
from statistics import mean

def ndcg_at_5(scored, clicked):
    # scored: list of (doc_id, score); clicked: the clicked doc_ids for the query
    ranked = sorted(scored, key=lambda x: -x[1])[:5]
    dcg = sum(
        (1 if d in clicked else 0) / math.log2(i + 2)
        for i, (d, _) in enumerate(ranked)
    )
    idcg = sum(1 / math.log2(i + 2) for i in range(min(len(clicked), 5)))
    return dcg / idcg if idcg else 0

# score(model, query, docs) is your own reranking helper; held_out yields
# (query, candidate_docs, clicked_doc_ids) tuples from the held-out days.
baseline_ndcg = mean(
    ndcg_at_5(score(base, q, docs), clicks) for q, docs, clicks in held_out
)
finetuned_ndcg = mean(
    ndcg_at_5(score(ft, q, docs), clicks) for q, docs, clicks in held_out
)

lift = (finetuned_ndcg - baseline_ndcg) / baseline_ndcg
print(f"NDCG@5 lift: {lift:.1%}")
```
Read the lift number with hostility. Three rules of thumb that have held up in practice:
- Lift below 3% is a wash. The model is not statistically distinguishable from the baseline at typical eval-set sizes. Do not deploy.
- Lift between 3 and 10% is real. Deploy behind a flag, A/B for two weeks, watch the actual click-through rate, not the offline number.
- Lift above 15% is suspicious. You probably leaked the test set into the training set. Rebuild the split.
The trap: NDCG@5 lift on offline click data does not translate 1:1 to user-perceived quality. The realized online CTR lift is usually a fraction of the offline number; expect erosion, and budget your decision threshold around the online number.
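If you want to put a number on "statistically distinguishable", a paired bootstrap over the held-out queries is one common way to do it; this is not something the rules above prescribe, just a minimal sketch that reuses the per-query NDCG@5 values from eval_lift.py:

```python
import random

def paired_bootstrap(base_scores, ft_scores, n_resamples=2000, seed=0):
    """Fraction of resamples where the fine-tuned model does NOT beat the baseline.

    base_scores / ft_scores: per-query NDCG@5 lists, aligned by query."""
    rng = random.Random(seed)
    n = len(base_scores)
    worse = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        base_mean = sum(base_scores[i] for i in idx) / n
        ft_mean = sum(ft_scores[i] for i in idx) / n
        if ft_mean <= base_mean:
            worse += 1
    return worse / n_resamples  # roughly a one-sided p-value

# p = paired_bootstrap(per_query_baseline, per_query_finetuned)
# Deploy discussion starts only if p is small (e.g. < 0.05) AND the lift clears 3%.
```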
When NOT to bother
A short, opinionated list of moments to walk away from this entire project:
- You have fewer than 10k clicks per month. The signal is too thin. Spend the cycles on better embeddings (Embeddings on the Edge: Local vs Hosted) instead.
- Your top queries are 50% of traffic. Fine-tuning will memorize them and rot on the long tail. Fix the popularity skew first — query rewriting, intent classification — then revisit.
- Your bi-encoder is the bottleneck. If recall@50 is below 90%, the doc is not making it to the reranker in the first place. No reranker fine-tune fixes that.
- You ship a new product surface every month. The model will be stale before it's deployed. Use a teacher reranker via API and skip the train step entirely until the surface stabilizes.
- You don't have a held-out test set discipline. The lift number will be a fantasy. Without it you cannot tell whether the fine-tune helped or hurt. Build the eval first; train second.
Fine-tune when click logs have caught something the public datasets miss and you can prove the gap with NDCG@5 on a clean held-out set. Without that proof, you are training a model to overfit your own search bar. Build the held-out eval this week, run it against the off-the-shelf reranker, and only then decide whether the fine-tune is worth the GPU hours.
If this was useful
This post is one slice of a larger problem: how retrieval, reranking, and chunking all interact in production RAG. The RAG Pocket Guide walks through the full pipeline — retriever choice, chunking strategies, reranker patterns, query rewriting, and the eval discipline that makes any of it ship-ready. If you're building search or RAG at work, that's the book.


