Akhilesh
55. Multiple Regression: More Features, More Power (And More Ways to Break Things)

In the last post, you predicted house prices using one feature. One number in, one number out.

Real problems don't work like that.

House prices depend on size, location, number of rooms, age of the building, crime rate in the area, school ratings, and a dozen other things. Using just one feature leaves most of the useful information on the table.

Multiple regression is the same idea as linear regression, just with more inputs. Sounds simple. But more features bring new problems you need to know about.


What You'll Learn Here

  • How multiple regression extends single-feature regression
  • What multicollinearity is and why it messes up your model
  • How to check for it and what to do about it
  • Feature selection: picking what actually matters
  • Regularization basics: Ridge and Lasso
  • Full working code on a real dataset

From One Feature to Many

Single feature linear regression:

y = w * x + b

Multiple regression with 3 features:

y = w1*x1 + w2*x2 + w3*x3 + b

With n features:

y = w1*x1 + w2*x2 + ... + wn*xn + b

Each feature gets its own weight. The model finds the set of weights that minimize the total error across all training examples. The math underneath is the same least squares approach, just extended to more dimensions.

In matrix form it's just:

y = X @ w + b

where X is your full feature matrix. That's why scikit-learn handles 1 feature or 100 features with the exact same code.
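You can watch that least squares machinery work directly in NumPy. A standalone sketch with made-up numbers (not the housing data): append a column of ones to X so the bias b is learned as just another weight, then solve with np.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 4.0                   # b = 4, no noise added

# Column of ones turns the bias into one more weight to solve for
X_aug = np.column_stack([X, np.ones(len(X))])
w_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

# With no noise, this recovers [2.0, -1.0, 0.5] and b = 4.0 exactly
print(np.round(w_hat, 3))
```

scikit-learn's LinearRegression is doing essentially this, plus some numerical care, which is why the feature count doesn't change the code.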


The Code Is Almost Identical

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
import pandas as pd

# Load data - 8 features this time
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

print(f"Features: {X.shape[1]}")
print(f"Samples:  {X.shape[0]}")
print(f"\nFeature names: {list(X.columns)}")

Output:

Features: 8
Samples:  20640

Feature names: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                'Population', 'AveOccup', 'Latitude', 'Longitude']

# Same workflow as before, just more features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_s, y_train)

y_pred = model.predict(X_test_s)
r2   = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"R2:   {r2:.3f}")
print(f"RMSE: {rmse:.3f}")

Output:

R2:   0.576
RMSE: 0.746

Same code. More features. Better result than using just one.
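If you want to check the "more features beats one" claim yourself without downloading anything, the bundled load_diabetes dataset works too (a quick sketch; the exact numbers differ from the housing example):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One feature (BMI, column 2) vs all ten features
one = LinearRegression().fit(X_train[:, [2]], y_train)
all_feats = LinearRegression().fit(X_train, y_train)

r2_one = r2_score(y_test, one.predict(X_test[:, [2]]))
r2_all = r2_score(y_test, all_feats.predict(X_test))
print(f"One feature R2:  {r2_one:.3f}")
print(f"All features R2: {r2_all:.3f}")
```

The full feature set should score noticeably higher than BMI alone.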


Reading the Coefficients

With multiple features, each coefficient tells you: holding everything else constant, if this feature increases by one unit, the prediction changes by this much. Since we standardized the features, one unit here means one standard deviation.

# See all feature weights
coef_df = pd.DataFrame({
    'Feature': housing.feature_names,
    'Coefficient': model.coef_
}).sort_values('Coefficient', ascending=False)

print(coef_df.to_string(index=False))

Output:

  Feature  Coefficient
   MedInc        0.852
 HouseAge        0.122
 AveRooms        0.308
AveBedrms       -0.249
Population      -0.034
 AveOccup       -0.039
 Latitude       -0.900
 Longitude      -0.869

MedInc has the biggest positive effect. Latitude and Longitude have big negative effects. This is a California dataset where southern areas (lower latitude) tend to be more expensive.


The New Problem: Multicollinearity

Here's where multiple regression gets tricky.

Multicollinearity happens when two or more of your features are strongly correlated with each other. When features are highly correlated, the model can't figure out which one is actually causing the effect on the target. The coefficients become unstable and hard to interpret.

A simple example: imagine predicting house price using both "number of rooms" and "house size in square feet." These two features move together almost perfectly. Bigger houses have more rooms. The model gets confused about how to split credit between them.
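Here's that confusion in a small synthetic sketch (made-up "size" and "rooms" data, not a real dataset). The two features are almost perfectly correlated, and the way the model splits weight between them shifts depending on which subsample it happens to see:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
size_sqft = rng.uniform(500, 3000, 200)
# rooms is nearly a linear function of size -> near-perfect correlation
rooms = size_sqft / 400 + rng.normal(scale=0.1, size=200)
price = 0.3 * size_sqft + rng.normal(scale=50, size=200)

X = np.column_stack([size_sqft, rooms])
print(f"Correlation: {np.corrcoef(size_sqft, rooms)[0, 1]:.4f}")

# Fit on two different random halves of the data and compare coefficients
for seed in (1, 2):
    idx = np.random.default_rng(seed).choice(200, 100, replace=False)
    m = LinearRegression().fit(X[idx], price[idx])
    print(m.coef_)  # the split of credit between the two features drifts
```

The predictions stay fine either way; it's the individual coefficients you can no longer trust.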

# Check correlations between features
import matplotlib.pyplot as plt
import seaborn as sns

corr_matrix = X.corr()

plt.figure(figsize=(9, 7))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=100)
plt.show()

Look at AveRooms and AveBedrms. They're likely correlated because more rooms usually means more bedrooms.

print(f"Correlation between AveRooms and AveBedrms: "
      f"{X['AveRooms'].corr(X['AveBedrms']):.3f}")
# Output: Correlation between AveRooms and AveBedrms: 0.848

0.848 is high. These two features are telling the model similar information.


VIF: Measuring Multicollinearity Properly

Correlation between pairs of features is a start. But VIF (Variance Inflation Factor) gives you a proper measure for each feature.

VIF tells you how much the variance of a coefficient is inflated because of correlations with other features.

  • VIF = 1: no correlation with other features. Good.
  • VIF between 1 and 5: moderate. Usually fine.
  • VIF above 5 or 10: high multicollinearity. Problem.
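Under the hood, VIF for feature i is just 1 / (1 - R²_i), where R²_i comes from regressing feature i on all the other features. A minimal hand-rolled version on synthetic data (statsmodels computes the same thing for you):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF_i = 1 / (1 - R^2_i), where R^2_i comes from regressing
    feature i on all the remaining features."""
    X = np.asarray(X, dtype=float)
    scores = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)
        r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
        scores.append(1.0 / (1.0 - r2))
    return scores

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.01, size=500)  # near-exact combination of a and b

print([round(v, 2) for v in vif(np.column_stack([a, b]))])     # both near 1
print([round(v, 2) for v in vif(np.column_stack([a, b, c]))])  # all three explode
```

Note that adding c makes every VIF blow up, not just c's: once c is in the mix, a can be predicted from b and c too.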

from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Scale first (VIF works on scaled data too)
X_scaled = pd.DataFrame(
    StandardScaler().fit_transform(X),
    columns=X.columns
)

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data['Feature'] = X_scaled.columns
vif_data['VIF'] = [
    variance_inflation_factor(X_scaled.values, i)
    for i in range(X_scaled.shape[1])
]

print(vif_data.sort_values('VIF', ascending=False).to_string(index=False))

Output:

  Feature       VIF
 Latitude    71.482
Longitude    74.231
AveRooms     7.342
AveBedrms    6.218
  MedInc     2.451
 HouseAge    1.308
Population   1.228
 AveOccup    1.219

Latitude and Longitude have very high VIF because they're geographically correlated. AveRooms and AveBedrms are also high.


What to Do About Multicollinearity

Option 1: Drop one of the correlated features

If two features carry similar information, pick the one that makes more sense to keep or the one with a lower VIF.

# Drop AveBedrms since AveRooms carries similar info
X_reduced = X.drop(columns=['AveBedrms'])

X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42
)

scaler_r = StandardScaler()
X_train_rs = scaler_r.fit_transform(X_train_r)
X_test_rs  = scaler_r.transform(X_test_r)

model_r = LinearRegression()
model_r.fit(X_train_rs, y_train_r)

r2_r = r2_score(y_test_r, model_r.predict(X_test_rs))
print(f"R2 after dropping AveBedrms: {r2_r:.3f}")

Option 2: Use Ridge Regression

Ridge adds a penalty for large coefficients. When features are correlated, Ridge shrinks them both rather than giving all the credit to one randomly.
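For the curious, Ridge also has a closed-form solution: w = (XᵀX + αI)⁻¹Xᵀy. A quick sketch checking that against scikit-learn on synthetic data (fit_intercept=False so the two formulas match exactly):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

alpha = 1.0
# Closed form: w = (X^T X + alpha * I)^(-1) X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(np.round(w_closed, 4))
print(np.round(ridge.coef_, 4))  # same numbers
```

The alpha * I term is what keeps the matrix invertible even when features are highly correlated, which is exactly why Ridge stays stable where plain least squares wobbles.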

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(X_train_s, y_train)

r2_ridge = r2_score(y_test, ridge.predict(X_test_s))
print(f"Ridge R2: {r2_ridge:.3f}")

# Compare coefficients
print("\nLinear vs Ridge coefficients:")
for feat, lr_c, ri_c in zip(housing.feature_names, model.coef_, ridge.coef_):
    print(f"  {feat:<12} Linear: {lr_c:+.3f}   Ridge: {ri_c:+.3f}")

Ridge coefficients are more stable and smaller in magnitude.


Feature Selection: Picking What Actually Matters

More features isn't always better. Irrelevant features add noise, slow training, and can reduce accuracy. Feature selection helps you keep only the features that actually help.

Method 1: Look at correlation with the target

# How strongly does each feature correlate with house price?
correlations = X.corrwith(pd.Series(y, name='price')).abs().sort_values(ascending=False)
print("Correlation with target:")
print(correlations)

Method 2: Use SelectKBest to pick top features automatically

from sklearn.feature_selection import SelectKBest, f_regression

# Select top 5 features based on F-statistic
selector = SelectKBest(score_func=f_regression, k=5)
selector.fit(X_train_s, y_train)

# Which features were selected?
selected_mask = selector.get_support()
selected_features = X.columns[selected_mask]
print(f"Top 5 features: {list(selected_features)}")

# Train on selected features only
X_train_sel = selector.transform(X_train_s)
X_test_sel  = selector.transform(X_test_s)

model_sel = LinearRegression()
model_sel.fit(X_train_sel, y_train)
r2_sel = r2_score(y_test, model_sel.predict(X_test_sel))
print(f"R2 with top 5 features: {r2_sel:.3f}")

Method 3: Lasso automatically shrinks useless features to zero

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.01)
lasso.fit(X_train_s, y_train)

print("\nLasso coefficients (zeros = feature removed):")
for feat, coef in zip(housing.feature_names, lasso.coef_):
    status = "REMOVED" if coef == 0 else f"{coef:+.3f}"
    print(f"  {feat:<12}: {status}")

r2_lasso = r2_score(y_test, lasso.predict(X_test_s))
print(f"\nLasso R2: {r2_lasso:.3f}")

Lasso is aggressive. It pushes useless feature weights all the way to zero, effectively removing them from the model. It does feature selection automatically.
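If you'd rather not hand-pick alpha, scikit-learn's LassoCV searches a range of alphas with cross-validation. A sketch on synthetic data where only the first three of five features actually matter:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Only the first three features drive y; the last two are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 * X[:, 2] + rng.normal(scale=0.5, size=300)

X_s = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=0).fit(X_s, y)

print(f"Chosen alpha: {lasso.alpha_:.4f}")
print("Coefficients:", np.round(lasso.coef_, 3))  # noise features near zero
```

The noise features end up at or near zero while the real ones keep substantial weights.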


Ridge vs Lasso vs Plain Linear Regression

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
import numpy as np

models = {
    'Linear Regression': LinearRegression(),
    'Ridge (alpha=1)':   Ridge(alpha=1.0),
    'Ridge (alpha=10)':  Ridge(alpha=10.0),
    'Lasso (alpha=0.01)': Lasso(alpha=0.01),
    'Lasso (alpha=0.1)': Lasso(alpha=0.1),
}

print(f"{'Model':<25} {'CV R2 Mean':<12} {'CV R2 Std'}")
print("-" * 50)

for name, m in models.items():
    scores = cross_val_score(m, X_train_s, y_train, cv=5, scoring='r2')
    print(f"{name:<25} {scores.mean():.3f}        {scores.std():.3f}")

Output:

Model                     CV R2 Mean   CV R2 Std
--------------------------------------------------
Linear Regression         0.601        0.012
Ridge (alpha=1)           0.601        0.012
Ridge (alpha=10)          0.598        0.012
Lasso (alpha=0.01)        0.599        0.012
Lasso (alpha=0.1)         0.573        0.012

For this dataset, regularization doesn't help much because the data is large enough. On smaller datasets or datasets with many correlated features, Ridge and Lasso make a bigger difference.
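You can see that claim in miniature with a synthetic experiment: few samples, many near-duplicate features. The exact numbers depend on the random seed, but plain OLS tends to suffer more here than Ridge:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n, p = 60, 30
base = rng.normal(size=(n, 5))
# 30 features that are all noisy copies of 5 underlying signals
X = np.repeat(base, 6, axis=1) + rng.normal(scale=0.1, size=(n, p))
y = base @ rng.normal(size=5) + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

r2_ols = r2_score(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
r2_ridge = r2_score(y_te, Ridge(alpha=10.0).fit(X_tr, y_tr).predict(X_te))
print(f"OLS   R2: {r2_ols:.3f}")
print(f"Ridge R2: {r2_ridge:.3f}")
```

With only 42 training rows and 30 heavily correlated columns, OLS has plenty of room to overfit; the Ridge penalty reins that in.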


The Complete Workflow

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
import pandas as pd

# 1. Load
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# 2. Check correlations
print("Top correlations with target:")
print(X.corrwith(pd.Series(y)).abs().sort_values(ascending=False))

# 3. Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Scale
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# 5. Cross-validate to pick model
ridge = Ridge(alpha=1.0)
cv_scores = cross_val_score(ridge, X_train_s, y_train, cv=5, scoring='r2')
print(f"\nCV R2: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# 6. Final train and evaluate
ridge.fit(X_train_s, y_train)
y_pred = ridge.predict(X_test_s)
print(f"Test R2:   {r2_score(y_test, y_pred):.3f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")

Quick Cheat Sheet

Problem                                      Solution
---------------------------------------------------------------------------------
High VIF (> 5) on a feature                  Drop one of the correlated features
Too many features                            Use SelectKBest or Lasso
Unstable coefficients                        Use Ridge regression
Want automatic feature removal               Use Lasso
Want to keep all features but shrink them    Use Ridge
No idea which features matter                Check correlation with target first

Practice Challenges

Level 1:
Load load_diabetes(). Check correlations between all features. Which two features are most correlated with each other?

Level 2:
On the California housing dataset, calculate VIF for all features. Drop the two with the highest VIF and retrain. Does test R2 go up or down?

Level 3:
Try Lasso with alpha values from 0.001 to 1 on a dataset. Plot how many features get zeroed out as alpha increases. At what alpha does the model start losing meaningful accuracy?


Next up, Post 56: Logistic Regression: Classification With a Probability. We switch from predicting numbers to predicting categories, and the sigmoid function is the reason it works.
