Ana Carolina Neumann Rodrigues

Posted on Jun 2

Your Regression Model Lies When X Doesn't Vary — And You Probably Don't Notice

#statistics #machinelearning

There's a classic trap in data science projects with linear regression that catches a lot of people: the model trains, the loss looks fine, the R² even seems reasonable — but the coefficient estimates are a mess.

The reason, almost always, is simple: X doesn't vary enough.

The Problem in 30 Seconds

In simple linear regression:

Y = β₀ + β₁X + ε

The variance of the estimated slope β̂₁, where σ² is the variance of the error term ε, is:

Var(β̂₁) = σ² / Σ(xᵢ - x̄)²

Read it as:

Var(β̂₁) = model noise / variation in X

Two direct conclusions:

Too much noise in Y → unstable estimate
Too little variation in X → unstable estimate

The denominator is the part that usually gets ignored.
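To make the denominator concrete, here is a minimal Python sketch of the variance formula above. The two x arrays and the noise level sigma = 1.0 are made-up values chosen for illustration, not data from this post.

```python
import numpy as np

def slope_variance(x, sigma):
    """Theoretical Var(beta1_hat) = sigma^2 / sum((x_i - x_bar)^2)."""
    x = np.asarray(x, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)  # variation in X (the denominator)
    return sigma ** 2 / sxx

# Hypothetical inputs: same sample size, same noise, different spread in X
x_narrow = [9, 10, 11]
x_wide = [0, 10, 20]
sigma = 1.0

var_narrow = slope_variance(x_narrow, sigma)
var_wide = slope_variance(x_wide, sigma)
print(var_narrow, var_wide)
```

Stretching the spread of X by a factor of 10 divides the variance of the slope by 100, with no change in the noise.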

Concrete Example: Lead Time Forecasting

You work in supply chain and want to predict delivery lead time (in days) based on distance traveled (in km).

Scenario A: Data from a single regional route

Distance (km)	Lead Time (days)
480	3
490	4
500	3
510	4
505	3

Everyone is on the same route, covering practically the same distance.

The model looks at this and thinks:

"X barely changed. How am I supposed to know the effect of X on Y?"

Any variation in lead time could be a port delay, supplier issue, or holiday — not necessarily distance. The estimated slope will be unstable and unreliable.

Scenario B: Data from multiple routes

Distance (km)	Lead Time (days)
80	1
250	2
600	4
1,200	7
2,800	12
4,500	18

Now the model has real "horizontal evidence." It sees short, medium, and long shipments — and can actually separate the effect of distance from random noise.
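As a rough check, both tables can be fitted with plain NumPy. This sketch computes the OLS slope and its standard error from the formulas in this post, with σ² estimated from the residuals (dividing by n - 2, a standard choice that the post itself does not spell out). With only five or six rows, the numbers are illustrative rather than conclusive.

```python
import numpy as np

def fit_slope(x, y):
    """Return (slope, standard error of the slope) for simple OLS."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    sxx = np.sum(dx ** 2)                     # variation in X
    slope = np.sum(dx * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    s2 = np.sum(residuals ** 2) / (x.size - 2)  # estimated noise variance
    return slope, np.sqrt(s2 / sxx)

# Scenario A: single regional route
slope_a, se_a = fit_slope([480, 490, 500, 510, 505], [3, 4, 3, 4, 3])

# Scenario B: multiple routes
slope_b, se_b = fit_slope([80, 250, 600, 1200, 2800, 4500], [1, 2, 4, 7, 12, 18])

print(f"A: slope={slope_a:.5f}, se={se_a:.5f}")
print(f"B: slope={slope_b:.5f}, se={se_b:.5f}")
```

On these numbers, Scenario A yields a standard error larger than the slope itself, while in Scenario B the standard error is a small fraction of the slope.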

Why Does Variation in X Matter So Much?

The formula for the estimated coefficient is:

β̂₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²

The denominator is the same one that appears in the variance: how much X varies.

When X barely moves, the denominator stays small. Any noise in Y distorts the ratio significantly. The result is a coefficient that looks reasonable on training data but oscillates wildly across different samples.
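The "oscillates across samples" claim can be checked with a small simulation. The sketch below keeps X fixed, redraws the noise many times, and looks at how much the fitted slope moves. The true slope (0.004 days/km), intercept (1 day), and noise level (0.5 days) are assumptions made up for the simulation, not values estimated in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_slopes(x, true_slope=0.004, sigma=0.5, n_sims=5000):
    """Fit the OLS slope on n_sims noisy samples that share the same X."""
    x = np.asarray(x, dtype=float)
    dx = x - x.mean()
    sxx = np.sum(dx ** 2)
    # Each row is one simulated sample of Y for the same fixed X
    noise = rng.normal(0.0, sigma, size=(n_sims, x.size))
    y = 1.0 + true_slope * x + noise
    dy = y - y.mean(axis=1, keepdims=True)
    return (dy @ dx) / sxx  # one slope estimate per simulated sample

slopes_narrow = simulate_slopes([480, 490, 500, 510, 505])
slopes_wide = simulate_slopes([80, 250, 600, 1200, 2800, 4500])

spread_narrow = slopes_narrow.std()
spread_wide = slopes_wide.std()
print(spread_narrow, spread_wide)
```

Under these assumptions the spread of the narrow-X slopes should sit close to the theoretical value σ / √Σ(xᵢ - x̄)², and far above the spread obtained with the wide X.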

Three Warnings Nobody Tells You About

1. Variation caused by an outlier doesn't count as good variation

Imagine your distance data looks like this:

480, 490, 500, 510, 4800

Mathematically, X has a lot of variation. In practice, almost all of it comes from a single extreme point.

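That point has high leverage: it pulls the entire line toward itself. Because Σ(xᵢ - x̄)² is now large, the standard error shrinks and the model looks "confident" by the numbers, but that confidence is false: the slope rests almost entirely on one observation.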
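One way to see this is to compute the leverage of each point. In simple regression, leverage is hᵢ = 1/n + (xᵢ - x̄)² / Σ(xᵢ - x̄)². The sketch below applies it to the five distances listed above and also compares the variation in X with and without the extreme point.

```python
import numpy as np

x = np.array([480, 490, 500, 510, 4800], dtype=float)

def sxx(values):
    """Sum of squared deviations from the mean (variation in X)."""
    return np.sum((values - values.mean()) ** 2)

# Leverage in simple regression: h_i = 1/n + (x_i - x_bar)^2 / Sxx
leverage = 1 / x.size + (x - x.mean()) ** 2 / sxx(x)

sxx_all = sxx(x)           # variation including the extreme point
sxx_without = sxx(x[:-1])  # variation from the four clustered points only

print(np.round(leverage, 4))
print(sxx_all, sxx_without)
```

The leverage of the 4800 km point is close to 1, which means the fitted line passes almost exactly through it. Variation in X jumps from 500 to roughly 14.8 million because of that single row.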

2. Variation in X doesn't fix a non-linear relationship

If lead time grows non-linearly with distance (for example, exponentially as shipments go from regional warehouse to cross-border), a straight line will not capture the pattern, no matter how wide the range of X is.

Having plenty of variation in X helps, but it doesn't replace choosing the right model.
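A quick way to spot this is to look at the residuals of the straight-line fit. The sketch below uses a noise-free, made-up exponential relationship (y = exp(x / 1500), purely hypothetical) over a wide range of distances, fits a line, and checks the sign pattern of the residuals.

```python
import numpy as np

# Hypothetical curved relationship over a wide range of distances
x = np.linspace(100, 4500, 10)
y = np.exp(x / 1500)

# Straight-line OLS fit
dx = x - x.mean()
slope = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
intercept = y.mean() - slope * x.mean()
residuals = y - (intercept + slope * x)

# A systematic sign pattern (+, -, +) signals a misspecified functional form
print(np.round(residuals, 2))
```

The residuals are positive at both ends and negative in the middle. Plenty of variation in X makes this curvature easier to see, but the line itself is still the wrong model.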

3. In multiple regression, X needs to vary independently

Did you add both distance and transit time to the same model? They move together: longer shipments tend to have more transit time.

That's multicollinearity. The model can't separate:

Does lead time increase because of distance or because of transit time?

In multiple regression the question becomes: is there variation in X₁ that isn't just a repeat of X₂?
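A standard diagnostic for this is the variance inflation factor (VIF). With two predictors, the variance of each slope is multiplied by 1 / (1 - r²), where r is the correlation between X₁ and X₂. The sketch below uses simulated data: the speed of 500 km per day, the noise levels, and the handling_hours variable are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated predictors (all values are assumptions for this sketch)
distance = rng.uniform(80, 4500, size=n)                     # km
transit_time = distance / 500 + rng.normal(0, 0.3, size=n)   # days, tied to distance
handling_hours = rng.normal(3, 1, size=n)                    # unrelated to distance

def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = np.corrcoef(x1, x2)[0, 1]
    return 1.0 / (1.0 - r ** 2)

vif_collinear = vif_two_predictors(distance, transit_time)
vif_independent = vif_two_predictors(distance, handling_hours)
print(vif_collinear, vif_independent)
```

A VIF near 1 means the predictor brings its own variation. A large VIF means most of its variation repeats the other predictor, so the coefficient variance is inflated by that factor.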

Mental Summary to Keep

Var(β̂₁) = noise / variation in X

Situation	Effect
Little variation in X	Unstable estimate ⚠️
Variation only from outlier	False confidence ⚠️
X₁ and X₂ are collinear	Multicollinearity ⚠️
Wide and useful variation	Reliable estimate ✅

The core idea is simple:

To estimate the effect of X, the model needs to observe X changing.

If your supply chain data comes from a short time window, a single region, or a very homogeneous supplier profile — revisit it before trusting the coefficients.

Enjoyed this? Follow me for more content on statistics applied to supply chain data.

DEV Community