When Feature Importance Lies: Target Encoding at the Noise Floor

#machinelearning #datascience #lightgbm

We built a feature that dominated our model's importance rankings. Across four random seeds it ranked first at q90 by a wide margin and stayed in the top tier at q10 and q50. The gain scores were stable and reproducible. Every training-time diagnostic said it was working.

It made our model measurably worse out of sample.

This is a post-mortem on why tree-based models eagerly prioritize high-variance target encodings that fail to generalize, and how to detect when your feature engineering has collided with an irreducible noise floor.

The Setup

We run a quantile regression model (LightGBM, predicting q10/q50/q90 price ranges) for Flyback.ai, a buy-side intelligence platform for the luxury watch market, where labels are inherently noisy. The same reference, listed twice by two different sellers, will sell for materially different prices. This is not measurement error. It is structural market variance driven by unobserved factors that no feature matrix can fully capture.

That noise sets a strict floor on achievable MAPE. You cannot model what you cannot observe. At some point, adding features stops reducing error and starts fitting noise.

We were at that floor. Our headline MAPE had plateaued across multiple retrains despite architectural changes. The model had extracted the accessible signal; what remained was irreducible variance. This is the context in which we built the encoder that broke our mental model.

The Feature

The core idea was sensible. The same watch reference sells for meaningfully different prices depending on variant attributes such as dial color, bracelet type, and bezel insert. A Rolex GMT-Master II on a Jubilee bracelet commands a different price than the same reference on an Oyster. A reference-level price prior flattens this. We wanted a variant-level prior.

We built a Bayesian target encoder keyed on a combination of reference number, dial color, and bracelet type. Standard smoothing: cells below a minimum count fall back to the reference-level prior. The fallback chain continues up through progressively broader groupings to a global mean. Nothing exotic (this is a standard pattern for high cardinality categorical encoding in tree models).
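
The encoder described above can be sketched in a few lines. This is a minimal illustration, not the production code: the function name, the min_count and prior_weight parameters, and the shrinkage formula are assumptions, and the fallback chain is cut down to two levels (reference, then global mean) where the real one passes through more groupings.

```python
from collections import defaultdict
from statistics import mean


def fit_variant_encoder(rows, min_count=5, prior_weight=10.0):
    """Fit a smoothed target encoder (hypothetical sketch).

    rows: iterable of (reference, dial, bracelet, price) tuples.
    Returns encode(reference, dial, bracelet) -> smoothed price estimate.
    """
    rows = list(rows)
    global_mean = mean(r[3] for r in rows)

    by_ref = defaultdict(list)
    by_variant = defaultdict(list)
    for ref, dial, bracelet, price in rows:
        by_ref[ref].append(price)
        by_variant[(ref, dial, bracelet)].append(price)

    def shrink(values, prior):
        # Pull the cell mean toward the prior; prior_weight acts like a
        # number of pseudo-observations sitting at the prior value.
        return (sum(values) + prior_weight * prior) / (len(values) + prior_weight)

    # Reference-level estimates, themselves shrunk toward the global mean.
    ref_estimate = {ref: shrink(v, global_mean) for ref, v in by_ref.items()}

    def encode(ref, dial, bracelet):
        prior = ref_estimate.get(ref, global_mean)  # unseen reference -> global mean
        cell = by_variant.get((ref, dial, bracelet), [])
        if len(cell) < min_count:
            return prior  # sparse variant cell -> reference-level fallback
        return shrink(cell, prior)

    return encode


# Toy usage with made-up prices, only to show the fallback behavior.
toy_rows = (
    [("A", "black", "jubilee", 120.0)] * 6
    + [("A", "black", "oyster", 100.0)] * 2
    + [("B", "blue", "oyster", 50.0)] * 2
)
encode = fit_variant_encoder(toy_rows)
dense_cell = encode("A", "black", "jubilee")   # variant-level estimate
sparse_cell = encode("A", "black", "oyster")   # falls back to reference "A"
unseen_ref = encode("Z", "green", "oyster")    # falls back to global mean
```

Note that encode reads the same labels the model later trains on. That is the property the rest of this post turns on.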

The scoping numbers looked promising. A meaningful fraction of our training data had enough variant-cell coverage to receive variant-level estimates rather than reference-level fallbacks. For our highest-volume references (Rolex Submariner, GMT-Master II, Datejust), coverage was substantial. The variant price spreads within those cells were real: actual market differentiation, not noise.

We shipped it.

The Importance Rankings

The feature immediately claimed the top position in gain-based importance at q90 across all four seeds. The gains were large: several times the gain of the next-highest feature. At the lower quantiles it ranked slightly lower but remained firmly in the top tier.

Gain-based importance in LightGBM measures how much each feature reduces the loss when used in a split. A feature that consistently ranks first by this metric, across multiple seeds, is doing substantial work in the training loop.

To verify, we ran a multi-seed ablation: four random seeds, three variants (baseline, an intermediate configuration, and the full encoder), evaluated on the same held out split. The multi-seed structure matters. A single retrain would have shown a result within the seed's variance band. Before formalizing the test, we had seen the encoder produce a slightly better MAPE on some seeds and slightly worse on others. Without the multi-seed average, we might have shipped on a lucky run.
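
The shape of that ablation is simple enough to sketch. The harness below is an assumption about structure only: train_and_eval is a hypothetical callable standing in for "retrain this variant with this seed and score it on the fixed held out split", and the stand-in used in the usage lines returns placeholder numbers, not our results.

```python
from statistics import mean, stdev


def multi_seed_ablation(train_and_eval, variants, seeds):
    """Score every variant on every seed against the same held out split.

    train_and_eval(variant, seed) -> held out metric (e.g. MAPE in %).
    Returns {variant: (mean, sample standard deviation)} across seeds.
    """
    summary = {}
    for variant in variants:
        scores = [train_and_eval(variant, seed) for seed in seeds]
        summary[variant] = (mean(scores), stdev(scores))
    return summary


def fake_train_and_eval(variant, seed):
    # Placeholder only: a real version would retrain the model here.
    base = {"baseline": 10.0, "with_feature": 10.5}[variant]
    jitter = 0.1 if seed % 2 else -0.1  # stand-in for seed-to-seed variance
    return base + jitter


summary = multi_seed_ablation(
    fake_train_and_eval,
    variants=["baseline", "with_feature"],
    seeds=[0, 1, 2, 3],
)
```

The important design choice is that every variant sees the same seeds and the same held out split, so the only thing that differs between rows of the summary is the feature set.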

Variant	Validation MAPE (4-seed mean)	Interval Coverage (4-seed mean)
Baseline (no encoder)	14.27% ± 0.18pp	77.8% ± 1.15pp
With variant encoder	14.55% ± 0.04pp	77.3% ± 0.58pp

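The encoder consistently degraded the model on held out data. While 0.28 percentage points is small in absolute terms, the direction was unambiguous. The between-variant delta was 7x the encoder variant's seed-to-seed standard deviation (0.04pp) and about 1.6x the baseline's (0.18pp), which puts it at roughly three standard errors of the difference between the two 4-seed means. This was a clean regression.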
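
For readers who want to check the arithmetic, here is the calculation from the table values. It assumes the ± figures are sample standard deviations across the four seeds, and it uses a simple Welch-style standard error for the difference of two means. With only four seeds per variant this is a rough yardstick, not a formal test.

```python
from math import sqrt

n_seeds = 4
baseline_mean, baseline_sd = 14.27, 0.18  # validation MAPE (%), from the table
encoder_mean, encoder_sd = 14.55, 0.04

delta = encoder_mean - baseline_mean        # 0.28pp regression

# Delta relative to each variant's own seed-to-seed spread.
ratio_vs_encoder_sd = delta / encoder_sd    # about 7
ratio_vs_baseline_sd = delta / baseline_sd  # about 1.6

# Standard error of the difference between two 4-seed means.
se_diff = sqrt(baseline_sd**2 / n_seeds + encoder_sd**2 / n_seeds)
z_score = delta / se_diff                   # about 3

print(f"delta={delta:.2f}pp  se={se_diff:.3f}pp  z={z_score:.1f}")
```

The 7x figure comes from the tighter of the two spreads. Measured against the noisier baseline the delta is closer to 1.6x, which is why the standard error of the difference is the more useful number to look at.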

Coverage was also lower with the encoder, so it lost on both axes, although the 0.5pp coverage gap sits inside the baseline's seed spread (±1.15pp) and is weaker evidence than the MAPE result. The encoder did produce more stable coverage across seeds (±0.58pp vs ±1.15pp for the baseline). That looks like a positive property but is actually a warning sign: stability without accuracy means the model is confidently wrong in a consistent direction rather than uncertain in an honest one.

The general rule this enforces: if your measured improvement is smaller than your seed variance, you do not know whether you have a real improvement. This is especially true for features that interact with the training distribution (target encoders, smoothed priors, anything computed from labels).

The Deceptive Elegance of High-Quantile Gains

LightGBM's gain metric measures training loss reduction. A Bayesian target encoder built on a composite key creates cells with genuine within-cell target variance reduction. By construction, the cells are internally more homogeneous than the population they came from. This is real information gain on the training set.

The problem is that the model was already extracting the same signal implicitly.

Tree models are surprisingly good at approximating target-encoded representations through deep splits on underlying categoricals. Given the raw features that define the composite key, LightGBM can route similar listings to similar leaves through successive splits, effectively learning watch variant price levels without an explicit encoder. It requires more splits and more parameters, but it gets there.

When you hand it the encoder, you give it a direct shortcut to signal it was already finding. The model takes the shortcut enthusiastically; the gain is immediate and large. But the shortcut also introduces mild target leakage: the encoder's smoothed cell estimates are trained on the same data the model trains on. The cells with the most coverage, where the encoder looks most confident, are also the cells where it has most thoroughly memorized the training distribution.

The net effect: the model anchors on the encoder, the encoder slightly overfits, and held out predictions get marginally worse. Feature importance is technically correct (the encoder is doing a lot of work), but doing a lot of work in training and improving held out predictions are not the same thing.

The Noise Floor Problem

There is a deeper issue the encoder could not solve regardless.

Every supervised learning problem has an irreducible error component: variance in the labels that cannot be explained by any observable feature because it is driven by factors you do not measure. In regression on tabular data, this shows up as a MAPE floor that refuses to move despite architectural improvements. You add features, tune hyperparameters, improve training data quality, and the headline metric stays put. This is not a modeling failure. It is the ceiling imposed by what your features can actually see.

The variance we were trying to capture with variant-level encoding exists in the training labels. But a substantial portion of the price variance within any variant cell is not explained by variant identity; it is driven by variables we do not observe. In a transaction marketplace, these unobserved variables compound: the condition details a photograph cannot capture, the negotiating posture of a particular seller, the timing of a listing relative to a news cycle or a currency move, the patience of a specific buyer. None of these appear in the feature matrix. All of them influence the transaction price.

When you build a variant encoder on noisy labels, you encode a mixture of real variant signal and label noise. The smoothing prior helps at the extremes: cells with very few observations get pulled toward a broader prior and are protected from pure memorization. But smoothing does not separate signal from noise within a cell. A cell with fifty observations still contains the irreducible variance of those fifty transactions. The encoder learns the cell mean, including whatever noise contributed to that mean.

This creates a specific failure mode: the encoder appears informative because it reduces within-cell variance relative to the full training distribution, but the variance it reduces is partly real signal and partly noise that has already been realized in the training labels. On held out data, the noise component does not replicate. New transactions bring new noise, and the encoder's confident cell estimates are slightly wrong in unpredictable directions.
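
A small simulation makes the mechanism concrete. This is a toy sketch on synthetic data, not our pipeline: each cell has a true price level plus unit-variance noise, the "encoder" is just the cell mean, and the cell counts and sizes are arbitrary assumptions. For a cell of n rows with noise variance σ², the expected squared error of the cell mean is σ²(1 - 1/n) on the rows it was fit on and σ²(1 + 1/n) on fresh rows from the same cell.

```python
import random
from statistics import mean


def simulate_cell_means(n_cells=500, cell_size=10, seed=0):
    """Compare a cell-mean encoder's error on its own rows vs fresh rows."""
    rng = random.Random(seed)
    in_sample_sq, fresh_sq = [], []
    for _ in range(n_cells):
        true_level = rng.gauss(0.0, 1.0)  # real variant signal
        # Training rows and fresh rows share the level but not the noise.
        train = [true_level + rng.gauss(0.0, 1.0) for _ in range(cell_size)]
        fresh = [true_level + rng.gauss(0.0, 1.0) for _ in range(cell_size)]
        cell_mean = mean(train)  # the encoder's estimate for this cell
        in_sample_sq.extend((y - cell_mean) ** 2 for y in train)
        fresh_sq.extend((y - cell_mean) ** 2 for y in fresh)
    return mean(in_sample_sq), mean(fresh_sq)


in_sample_mse, fresh_mse = simulate_cell_means()
print(f"in-sample MSE: {in_sample_mse:.3f}  fresh MSE: {fresh_mse:.3f}")
```

The encoder looks better than the noise floor on the rows it memorized and worse than the noise floor on new ones. The gap shrinks as cells get larger, but it never reaches zero, and a tree that splits on the encoded value sees only the optimistic side during training.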

The problem is self-concealing at training time. The encoder's gain looks real because it is partly real. The only way to see the generalization failure is to evaluate on held out data across multiple seeds, which averages out seed-specific noise and surfaces the systematic direction of the effect.

Feature engineering can only extract signal that exists in the observable features. When labels have irreducible noise and the remaining unexplained variance is dominated by unobserved factors, more expressive features do not help. They give the model more surface area to fit noise. A simpler model that captures the same signal through coarser splits is more robust, because it maintains less surface area to overfit on the noise component.

The encoder was the most visible example, but the pattern generalizes. In the same period we tested binary flags derived from a computer vision condition-grading pipeline, interaction terms between existing high-importance categoricals, and a scalar feature encoding the Best Offer price premium that appeared in more than 40% of eBay training rows. Each looked promising on summary statistics. Each failed identically on held out evaluation across seeds. The consistent finding: features that appeared to explain residual variance were either (a) explaining variance the existing model structure already captured through slower implicit routes, or (b) fitting the noise component of the training labels. In both cases the held out metric did not improve. At the noise floor, the burden of proof inverts: a new feature needs to demonstrate held out improvement before you trust it, not just training-set gain.

The Outcome

We removed the encoder.

The 0.28pp regression was small, but at the noise floor the remaining headroom is itself measured in tenths of a point, so a regression of that size is material. More importantly, the encoder was concentrating the model's reliance on a single feature that was slightly overfitting. That is fragile architecture: when a single high-importance feature degrades under distribution shift, it takes the whole model with it.

The underlying categorical features stayed in the model. The variant signal they encode is real; it gets extracted through the model's natural tree structure rather than an explicit encoder shortcut. Slower and more diffuse, but more robust.

The General Lesson

The default feature importance in tree models, gain (or split) importance, is a training metric, not a generalization metric. It is computed during tree construction from the loss reduction on training splits, so it cannot even be evaluated on a holdout. (Permutation importance and SHAP are different tools: you compute them by perturbing or explaining a trained model on whatever data you choose, so when run on a holdout they become a genuine out of sample signal. More on that below.)

A feature can dominate gain rankings by giving the model a fast path to reducing training loss while simultaneously making held out predictions worse. This happens most reliably with:

High cardinality Bayesian or target encoders computed from training labels
Composite key encoders over interacting categoricals
Any feature whose value is derived directly from the target distribution

The underlying mechanism is always the same: the encoder provides a training shortcut to signal the model was already finding through slower routes. The shortcut wins on importance. It loses on generalization.

The check requires discipline. The cheapest version: compute permutation importance (or SHAP) on a holdout and compare it to the gain ranking. A feature that ranks high by gain but barely moves holdout error when its values are shuffled is the exact pattern this post is about: high signal in training, none that survives out of sample. The stronger version, and what we now rely on, is to run an ablation across several seeds on your holdout before shipping any feature that looks suspiciously dominant. If your measured performance does not move in the right direction across seeds, the feature is not working, regardless of what the importance chart says.
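
The cheap version of the check fits in one function. The sketch below is model-agnostic and uses only the standard library: predict is a hypothetical callable wrapping whatever trained model you have, X is a list of feature rows from the holdout, and the toy model in the usage lines is invented purely to show the two extremes. Libraries such as scikit-learn ship a maintained implementation of the same idea.

```python
import random
from statistics import mean


def mape(y_true, y_pred):
    return 100.0 * mean(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred))


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in holdout MAPE when each feature column is shuffled.

    predict(rows) -> list of predictions; X is a list of feature lists.
    """
    rng = random.Random(seed)
    base_error = mape(y, predict(X))
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(mape(y, predict(X_perm)) - base_error)
        importances.append(mean(increases))
    return importances


# Toy holdout: the target depends on feature 0 only; feature 1 is ignored.
X_holdout = [[i % 5, (i * 7) % 3] for i in range(50)]
y_holdout = [100.0 + 10.0 * row[0] for row in X_holdout]


def toy_predict(rows):
    return [100.0 + 10.0 * row[0] for row in rows]


importances = permutation_importance(toy_predict, X_holdout, y_holdout)
```

Put these numbers next to the gain ranking. A feature at the top of the gain chart with a holdout permutation importance near zero, or within the spread you see across repeats, deserves the full multi-seed ablation before it ships.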

In high noise domains, this check will save you from shipping a lot of confident looking regressions.

Flyback.ai is The Buy-Side Intelligence Platform for Luxury Watches. We publish technical writing on the ML and data engineering behind the platform at flyback.ai/engineering.