Offline metrics look clean.
CTR is flat.
Conversion is down.
The problem isn't the model. It's what the model is ranking against.
This pattern shows up across recommendation engine audits:
The team ships a new model version. Offline retrieval score improves. A/B test shows neutral-to-positive CTR. Three months later, conversion is still flat.
Asked where the problem lived, 18 of 20 teams in these audits answered with a model name. Two answered with a pipeline diagram.
The two that answered with a pipeline diagram had fixed the problem.
## Why offline metrics lie
Offline retrieval metrics measure how well your model ranks items against a historical behavior sample. They cannot measure two things:
- Whether the behavioral signals feeding the model are fresh
- Whether those signals belong to the right user
Both failures are silent. The model scores look correct. The production output is fiction.
## The staleness gap
```
signal_age = current_timestamp - last_event_timestamp
relevance_loss ≈ f(signal_age, behavior_drift_rate)
```
Customer behavior shifts daily — sometimes hourly around promotions, seasonal events, or price changes. If your embeddings refresh weekly, you are ranking users against a week-old snapshot of their preferences.
The model is ranking a ghost.
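A minimal sketch of that check in Python. The category names and drift windows are illustrative, not universal constants; calibrate them against your own measured drift:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical drift windows per category; tune to observed behavior drift.
BEHAVIOR_DRIFT_WINDOW = {
    "fashion": timedelta(hours=24),   # fast-moving preferences
    "furniture": timedelta(days=14),  # slow purchase cycles
}

def is_stale(last_event_ts: datetime, category: str, now: datetime | None = None) -> bool:
    """Return True when the signal is older than the category's drift window."""
    now = now or datetime.now(timezone.utc)
    signal_age = now - last_event_ts
    return signal_age > BEHAVIOR_DRIFT_WINDOW[category]

# A signal captured five days ago is stale for fashion, fresh for furniture.
captured = datetime.now(timezone.utc) - timedelta(days=5)
print(is_stale(captured, "fashion"))    # True
print(is_stale(captured, "furniture"))  # False
```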
From a real rebuild:
| Metric | Value |
|---|---|
| Embedding refresh cadence | Weekly |
| Customer behavior shift | Daily (sometimes hourly) |
| Real retrieval accuracy | 38% (not the 0.91 offline score) |
| After pipeline rebuild (same model) | 87% |
| CAC change, next quarter | −34.7% |
No model changes. Same architecture. Zero new training data. The pipeline feeding the model was wrong.
## The identity gap
The staleness problem is the first layer. The identity problem is the second.
How many device keys does your pipeline assign to one customer on a cross-device journey?
```
customer_A (mobile)  → profile_1 → ranking_1
customer_A (desktop) → profile_2 → ranking_2
customer_A (app)     → profile_3 → ranking_3
```
Three different recommendation strategies for one converting customer. The engine ranks each in isolation, because the identity layer never merged them.
The fix is upstream of the model: session stitching, device graph resolution, cross-channel event merging. None of it requires retraining.
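A toy sketch of the merge step, assuming you already have pairwise links from session stitching or a device graph; the keys and links here are invented for illustration:

```python
from collections import defaultdict

def resolve_identities(links: list[tuple[str, str]]) -> dict[str, str]:
    """Union-find over identity keys: every linked key maps to one canonical id."""
    parent: dict[str, str] = {}

    def find(k: str) -> str:
        parent.setdefault(k, k)
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path compression
            k = parent[k]
        return k

    for a, b in links:
        parent[find(a)] = find(b)
    return {k: find(k) for k in parent}

# Illustrative links from session stitching and a device graph.
links = [("mobile_123", "desktop_456"), ("desktop_456", "app_789")]
canonical = resolve_identities(links)

merged = defaultdict(set)
for key, root in canonical.items():
    merged[root].add(key)
print(dict(merged))  # one customer, three device keys collapsed to one profile
```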
## The 4 audit checks
Run these before scheduling the next training run:
### 01 — Signal freshness

What is the median age of behavioral signals entering your model at serving time? If `signal_age > behavior_drift_window`, your ranking is stale by definition.
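One way to pull that number, sketched with pandas; the log schema (`served_at`, `last_signal_at`) is hypothetical, so substitute your own serving log's columns:

```python
import pandas as pd

# Hypothetical serving log: one row per recommendation request, with the
# timestamp of the newest behavioral signal the model actually saw.
serving_log = pd.DataFrame({
    "served_at": pd.to_datetime(["2024-05-01 12:00", "2024-05-01 12:05"]),
    "last_signal_at": pd.to_datetime(["2024-04-24 09:00", "2024-05-01 11:50"]),
})

signal_age = serving_log["served_at"] - serving_log["last_signal_at"]
print(signal_age.median())  # compare against your behavior_drift_window
```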
### 02 — Identity coverage
What percentage of your user base has more than one unmerged identity key? Run this query on your feature store. The number is almost always higher than expected.
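A pandas sketch of the coverage check, assuming you trust some linking attribute (a login email here) as ground truth; the extract shape is hypothetical:

```python
import pandas as pd

# Hypothetical feature-store extract: one row per identity key, with whatever
# linking attribute you trust (a login email here) as ground truth.
keys = pd.DataFrame({
    "login_email": ["a@x.com", "a@x.com", "a@x.com", "b@x.com", "c@x.com"],
    "identity_key": ["mobile_123", "desktop_456", "app_789", "web_111", "app_222"],
})

keys_per_user = keys.groupby("login_email")["identity_key"].nunique()
pct_fragmented = (keys_per_user > 1).mean() * 100
print(f"{pct_fragmented:.1f}% of users carry more than one unmerged key")
```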
### 03 — Offline-online metric gap
If offline accuracy is high but online CTR is flat, the distribution shift is coming from the pipeline, not the model. The serving distribution doesn't match the training distribution because the signals changed.
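A quick way to confirm the shift, sketched with a two-sample KS test on a single feature; the distributions here are synthetic stand-ins for your training and serving populations:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Synthetic feature values: what the model trained on vs. what it sees now.
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
serving_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)  # drifted signals

result = ks_2samp(training_feature, serving_feature)
print(f"KS statistic={result.statistic:.3f}, p={result.pvalue:.2e}")
# A large statistic with a tiny p-value means the serving distribution
# has drifted from training: a pipeline problem, not a model problem.
```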
### 04 — Conversion window alignment
What is the gap between signal capture and recommendation serving? Is it shorter than the typical purchase decision window for your category? If not, you are recommending into the wrong moment.
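The check itself is one comparison. A sketch with placeholder numbers you'd replace with your measured serve lag and decision window:

```python
from datetime import timedelta

# Placeholder values: swap in your own capture-to-serve latency and the
# typical purchase decision window for your category.
capture_to_serve = timedelta(days=7)  # e.g., weekly embedding refresh
decision_window = timedelta(days=2)   # e.g., fast-fashion purchase cycle

if capture_to_serve > decision_window:
    print("Recommending into the wrong moment: serve lag exceeds the decision window")
```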
## The principle
A recommendation engine ranks signals. If the signals are stale or fragmented across unresolved identities, the ranking is precise noise.
Fix the pipeline. The model is the last thing to change.
How fresh are the behavioral signals entering your recommendation model at serving time? That single number explains most of the offline-online gap I've seen — curious what it looks like in your stack.
