There's a classic trap in data science projects with linear regression that catches a lot of people: the model trains, the loss looks fine, the R² even seems reasonable — but the coefficient estimates are a mess.
One common and easily overlooked reason: X doesn't vary enough.
The Problem in 30 Seconds
In simple linear regression:
Y = β₀ + β₁X + ε
Assuming the errors ε are independent with constant variance σ², the variance of the estimated slope is:
Var(β̂₁) = σ² / Σ(xᵢ - x̄)²
Read it as:
Var(β̂₁) = model noise / variation in X
Two direct conclusions:
- Too much noise in Y → unstable estimate
- Too little variation in X → unstable estimate
The denominator is the part that usually gets ignored.
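As a minimal sketch of that denominator effect, the snippet below evaluates the variance formula for two sets of X values that share the same noise level. The X values and the noise variance `sigma2` are made up for illustration, and the helper names are hypothetical, not taken from any library.

```python
def sum_sq_dev(xs):
    """Sum of squared deviations of x from its mean: sum((x_i - mean)^2)."""
    mean_x = sum(xs) / len(xs)
    return sum((x - mean_x) ** 2 for x in xs)


def slope_variance(xs, sigma2):
    """Var(beta1_hat) = sigma^2 / sum((x_i - mean)^2), for a known noise variance."""
    return sigma2 / sum_sq_dev(xs)


# Hypothetical X values, chosen only for illustration
narrow_x = [498, 499, 500, 501, 502]
wide_x = [100, 300, 500, 700, 900]
sigma2 = 1.0  # assumed noise variance, identical in both cases

var_narrow = slope_variance(narrow_x, sigma2)
var_wide = slope_variance(wide_x, sigma2)

print(f"narrow X: Var(slope) = {var_narrow:.7f}")
print(f"wide X:   Var(slope) = {var_wide:.7f}")
```

With identical noise, the narrow design gives a slope variance of 0.1 and the wide one 0.0000025. The only thing that changed is how far X spreads around its mean.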
Concrete Example: Lead Time Forecasting
You work in supply chain and want to predict delivery lead time (in days) based on distance traveled (in km).
Scenario A: Data from a single regional route
| Distance (km) | Lead Time (days) |
|---|---|
| 480 | 3 |
| 490 | 4 |
| 500 | 3 |
| 510 | 4 |
| 505 | 3 |
Everyone is on the same route, covering practically the same distance.
The model looks at this and thinks:
"X barely changed. How am I supposed to know the effect of X on Y?"
Any variation in lead time could be a port delay, supplier issue, or holiday — not necessarily distance. The estimated slope will be unstable and unreliable.
Scenario B: Data from multiple routes
| Distance (km) | Lead Time (days) |
|---|---|
| 80 | 1 |
| 250 | 2 |
| 600 | 4 |
| 1,200 | 7 |
| 2,800 | 12 |
| 4,500 | 18 |
Now the model has real "horizontal evidence." It sees short, medium, and long shipments — and can actually separate the effect of distance from random noise.
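To put numbers on the two scenarios, here is a small sketch that fits a simple least-squares line to each table and reports the slope together with its estimated standard error. The function name `fit_simple_ols` is hypothetical, and with only five or six points per scenario the standard errors are themselves rough, so treat this as an illustration rather than a proper analysis.

```python
import math


def fit_simple_ols(xs, ys):
    """Return (slope, slope_se, sxx) for y = b0 + b1*x fitted by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance estimate: s^2 = SSR / (n - 2)
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s2 = ssr / (n - 2)
    return slope, math.sqrt(s2 / sxx), sxx


# Scenario A: single regional route
dist_a = [480, 490, 500, 510, 505]
lead_a = [3, 4, 3, 4, 3]

# Scenario B: multiple routes
dist_b = [80, 250, 600, 1200, 2800, 4500]
lead_b = [1, 2, 4, 7, 12, 18]

slope_a, se_a, sxx_a = fit_simple_ols(dist_a, lead_a)
slope_b, se_b, sxx_b = fit_simple_ols(dist_b, lead_b)

for name, slope, se, sxx in [("A", slope_a, se_a, sxx_a), ("B", slope_b, se_b, sxx_b)]:
    print(f"Scenario {name}: Sxx = {sxx:.0f}, slope = {slope:.5f}, SE = {se:.5f}")
```

In Scenario A the standard error comes out larger than the slope itself, so the data cannot even pin down the sign of the effect. In Scenario B the slope is many times its standard error.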
Why Does Variation in X Matter So Much?
The formula for the estimated coefficient is:
β̂₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
The denominator is the same one that appears in the variance: how much X varies.
When X barely moves, the denominator stays small. Any noise in Y distorts the ratio significantly. The result is a coefficient that looks reasonable on training data but oscillates wildly across different samples.
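The "oscillates across samples" claim can be checked with a small simulation. The sketch below keeps the X values fixed (the distances from the two scenarios), redraws the noise many times, and measures how much the fitted slope moves. The true parameters `beta0`, `beta1`, and `sigma` are assumptions invented for the simulation, not values estimated from the article's data.

```python
import math
import random


def ols_slope(xs, ys):
    """Least-squares slope: sum((x - mx)(y - my)) / sum((x - mx)^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_spread(xs, n_trials=2000, beta0=1.0, beta1=0.004, sigma=0.5, seed=0):
    """Standard deviation of the fitted slope over repeated noisy samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    slopes = []
    for _ in range(n_trials):
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        slopes.append(ols_slope(xs, ys))
    mean_s = sum(slopes) / n_trials
    return math.sqrt(sum((s - mean_s) ** 2 for s in slopes) / (n_trials - 1))


narrow_x = [480, 490, 500, 510, 505]           # Scenario A distances
wide_x = [80, 250, 600, 1200, 2800, 4500]      # Scenario B distances

sd_narrow = slope_spread(narrow_x)
sd_wide = slope_spread(wide_x)

print(f"slope SD with narrow X: {sd_narrow:.5f}")
print(f"slope SD with wide X:   {sd_wide:.5f}")
```

Both designs see the same noise level, yet the slope fitted on the narrow design scatters far more from sample to sample, in line with σ / √Σ(xᵢ - x̄)².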
Three Warnings Nobody Tells You About
1. Variation caused by an outlier doesn't count as good variation
Imagine your distance data looks like this:
480, 490, 500, 510, 4800
Mathematically, X has a lot of variation. In practice, it almost entirely comes from one single extreme point.
That point has high leverage: it pulls the entire line toward itself. The standard error shrinks, so the model looks "confident" by the numbers, but the slope is resting almost entirely on one observation.
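One way to see this is to compute each point's leverage, which for simple linear regression is hᵢ = 1/n + (xᵢ - x̄)² / Σ(xⱼ - x̄)². The sketch below applies that formula to the five distances above; the helper names are hypothetical.

```python
def sum_sq_dev(xs):
    """Sum of squared deviations of x from its mean."""
    mean_x = sum(xs) / len(xs)
    return sum((x - mean_x) ** 2 for x in xs)


def leverages(xs):
    """Leverage h_i = 1/n + (x_i - mean)^2 / Sxx for simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum_sq_dev(xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]


distances = [480, 490, 500, 510, 4800]
h = leverages(distances)

sxx_with = sum_sq_dev(distances)          # all five points
sxx_without = sum_sq_dev(distances[:-1])  # outlier dropped

for x, h_i in zip(distances, h):
    print(f"x = {x:>4} km: leverage = {h_i:.4f}")
print(f"Sxx with outlier: {sxx_with:.0f}, without: {sxx_without:.0f}")
```

The 4800 km point has leverage very close to 1, the maximum possible, and dropping it shrinks Σ(xᵢ - x̄)² from about 14.8 million to 500. The apparent variation in X is almost entirely that one shipment.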
2. Variation in X doesn't fix a non-linear relationship
If lead time grows exponentially with distance (regional warehouse → cross-border), a straight line may not capture the pattern.
Having plenty of variation in X helps, but it doesn't replace choosing the right model.
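A quick way to spot this is to look at the residuals of the straight-line fit. The sketch below uses hypothetical, noise-free data where lead time is exactly exponential in distance (the growth rate is an arbitrary assumption) and fits a line to it.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: plenty of variation in X, but an exponential relationship
distances = [500, 1000, 1500, 2000, 2500, 3000, 3500]
lead_times = [math.exp(d / 1000) for d in distances]

intercept, slope = fit_line(distances, lead_times)
residuals = [y - (intercept + slope * x) for x, y in zip(distances, lead_times)]

for d, r in zip(distances, residuals):
    print(f"{d:>4} km: residual = {r:+.2f}")
```

The residuals are positive at both ends and negative in the middle. That systematic curve is the signature of a misspecified straight line, and no amount of spread in X removes it.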
3. In multiple regression, X needs to vary independently
Added both distance and transit time to the same model? They move together — longer shipments tend to have more transit time.
That's multicollinearity. The model can't separate:
Does lead time increase because of distance or because of transit time?
In multiple regression the question becomes: is there variation in X₁ that isn't just a repeat of X₂?
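A standard diagnostic for that question is the variance inflation factor (VIF). With exactly two predictors it reduces to 1 / (1 - r²), where r is their correlation. The sketch below is a rough illustration: the distances are the ones from Scenario B, while the transit times are hypothetical values invented for the example.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return sab / math.sqrt(saa * sbb)


distance_km = [80, 250, 600, 1200, 2800, 4500]
transit_days = [0.5, 1.0, 2.5, 4.5, 9.0, 14.0]  # hypothetical, for illustration

r = pearson_r(distance_km, transit_days)
# With exactly two predictors, VIF = 1 / (1 - r^2) for each of them
vif = 1 / (1 - r ** 2)

print(f"correlation between predictors: {r:.4f}")
print(f"VIF: {vif:.1f}")
```

Common rules of thumb treat a VIF above 5 or 10 as a warning sign. Here it lands far beyond that, meaning almost none of the variation in distance is independent of transit time.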
Mental Summary to Keep
Var(β̂₁) = noise / variation in X
| Situation | Effect |
|---|---|
| Little variation in X | Unstable estimate ⚠️ |
| Variation only from outlier | False confidence ⚠️ |
| X₁ and X₂ are collinear | Multicollinearity: effects can't be separated ⚠️ |
| Wide and useful variation | Reliable estimate ✅ |
The core idea is simple:
To estimate the effect of X, the model needs to observe X changing.
If your supply chain data comes from a short time window, a single region, or a very homogeneous supplier profile — revisit it before trusting the coefficients.

