In the last post, you predicted house prices using one feature. One number in, one number out.
Real problems don't work like that.
House prices depend on size, location, number of rooms, age of the building, crime rate in the area, school ratings, and a dozen other things. Using just one feature leaves most of the useful information on the table.
Multiple regression is the same idea as linear regression, just with more inputs. Sounds simple. But adding more features brings new problems you need to know about.
What You'll Learn Here
- How multiple regression extends single-feature regression
- What multicollinearity is and why it messes up your model
- How to check for it and what to do about it
- Feature selection: picking what actually matters
- Regularization basics: Ridge and Lasso
- Full working code on a real dataset
From One Feature to Many
Single feature linear regression:
y = w * x + b
Multiple regression with 3 features:
y = w1*x1 + w2*x2 + w3*x3 + b
With n features:
y = w1*x1 + w2*x2 + ... + wn*xn + b
Each feature gets its own weight. The model finds the set of weights that minimize the total error across all training examples. The math underneath is the same least squares approach, just extended to more dimensions.
In matrix form it's just:
y = X @ w + b
where X is your full feature matrix. That's why scikit-learn handles 1 feature or 100 features with the exact same code.
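To make the matrix form concrete, here's a minimal NumPy sketch with made-up synthetic data (the names X_demo, y_demo, true_w are just for this illustration): it builds targets as X @ w + b, then recovers the weights with ordinary least squares, the same math scikit-learn runs under the hood.
import numpy as np
# Synthetic example: 100 samples, 3 features, known weights
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 3.0
y_demo = X_demo @ true_w + true_b + rng.normal(scale=0.1, size=100)
# Least squares: append a column of ones so the intercept b is learned too
X_bias = np.column_stack([X_demo, np.ones(len(X_demo))])
solution, *_ = np.linalg.lstsq(X_bias, y_demo, rcond=None)
print("Recovered weights:", solution[:3].round(2))   # close to [2, -1, 0.5]
print("Recovered intercept:", solution[3].round(2))  # close to 3.0
The recovered weights land right on the true ones because the synthetic noise is small; with real data they only get as close as the noise allows.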
The Code Is Almost Identical
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
import pandas as pd
# Load data - 8 features this time
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"\nFeature names: {list(X.columns)}")
Output:
Features: 8
Samples: 20640
Feature names: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
'Population', 'AveOccup', 'Latitude', 'Longitude']
# Same workflow as before, just more features
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
model = LinearRegression()
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R2: {r2:.3f}")
print(f"RMSE: {rmse:.3f}")
Output:
R2: 0.576
RMSE: 0.746
Same code. More features. Better result than using just one.
Reading the Coefficients
With multiple features, each coefficient tells you: holding everything else constant, if this feature increases by one standard deviation (remember, the inputs were standardized), the prediction changes by that amount.
# See all feature weights
coef_df = pd.DataFrame({
'Feature': housing.feature_names,
'Coefficient': model.coef_
}).sort_values('Coefficient', ascending=False)
print(coef_df.to_string(index=False))
Output:
   Feature  Coefficient
    MedInc        0.852
  AveRooms        0.308
  HouseAge        0.122
Population       -0.034
  AveOccup       -0.039
 AveBedrms       -0.249
 Longitude       -0.869
  Latitude       -0.900
MedInc has the biggest positive effect. Latitude and Longitude have big negative effects. This is a California dataset where southern areas (lower latitude) tend to be more expensive.
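To see what a coefficient means in practice, here's a quick sketch reusing model and X_test_s from above: take one scaled row, bump its MedInc value (the first column) by 1, i.e. one standard deviation, and the prediction moves by exactly the MedInc coefficient, because the model is linear.
# MedInc is the first column of the scaled feature matrix
row = X_test_s[0].copy()
bumped = row.copy()
bumped[0] += 1.0  # +1 standard deviation of MedInc
before = model.predict(row.reshape(1, -1))[0]
after = model.predict(bumped.reshape(1, -1))[0]
print(f"Prediction change: {after - before:.3f}")  # matches the MedInc coefficient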
The New Problem: Multicollinearity
Here's where multiple regression gets tricky.
Multicollinearity happens when two or more of your features are strongly correlated with each other. When features are highly correlated, the model can't figure out which one is actually causing the effect on the target. The coefficients become unstable and hard to interpret.
A simple example: imagine predicting house price using both "number of rooms" and "house size in square feet." These two features move together almost perfectly. Bigger houses have more rooms. The model gets confused about how to split credit between them.
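Here's a tiny synthetic sketch of that confusion (made-up "size" and "rooms" features that track each other almost perfectly): the individual weights swing wildly from one random sample to the next, even though their sum stays put.
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(1)
for trial in range(3):
    size = rng.normal(size=200)
    rooms = size + rng.normal(scale=0.01, size=200)   # nearly a copy of size
    price = 3 * size + rng.normal(scale=0.5, size=200)
    lr = LinearRegression().fit(np.column_stack([size, rooms]), price)
    w1, w2 = lr.coef_
    print(f"trial {trial}: size={w1:+.1f}  rooms={w2:+.1f}  sum={w1 + w2:+.1f}")
The sum of the two weights stays stable because the combined effect is real, but how the credit gets split between them is essentially arbitrary. That's multicollinearity in a nutshell.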
# Check correlations between features
import matplotlib.pyplot as plt
import seaborn as sns
corr_matrix = X.corr()
plt.figure(figsize=(9, 7))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm',
center=0, square=True)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=100)
plt.show()
Look at AveRooms and AveBedrms. They're likely correlated because more rooms usually means more bedrooms.
print(f"Correlation between AveRooms and AveBedrms: "
f"{X['AveRooms'].corr(X['AveBedrms']):.3f}")
# Output: Correlation between AveRooms and AveBedrms: 0.848
0.848 is high. These two features are telling the model similar information.
VIF: Measuring Multicollinearity Properly
Correlation between pairs of features is a start. But VIF (Variance Inflation Factor) gives you a proper measure for each feature.
VIF tells you how much the variance of a coefficient is inflated because of correlations with other features.
- VIF = 1: no correlation with other features. Good.
- VIF between 1 and 5: moderate. Usually fine.
- VIF above 5 or 10: high multicollinearity. Problem.
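Under the hood, the VIF for a feature is just 1 / (1 - R²), where R² comes from regressing that feature on all the other features. Here's a hand-rolled sketch of that definition using the X DataFrame loaded earlier (the exact number can differ a little from the statsmodels output below because of how the intercept is handled):
from sklearn.linear_model import LinearRegression
def vif_by_hand(df, feature):
    # Regress one feature on all the others; VIF = 1 / (1 - R^2)
    others = df.drop(columns=[feature])
    r2 = LinearRegression().fit(others, df[feature]).score(others, df[feature])
    return 1.0 / (1.0 - r2)
print(f"AveRooms VIF (by hand): {vif_by_hand(X, 'AveRooms'):.2f}")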
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
# Standardize first - statsmodels' VIF helper doesn't add an intercept, so zero-mean features give cleaner numbers
X_scaled = pd.DataFrame(
StandardScaler().fit_transform(X),
columns=X.columns
)
# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data['Feature'] = X_scaled.columns
vif_data['VIF'] = [
variance_inflation_factor(X_scaled.values, i)
for i in range(X_scaled.shape[1])
]
print(vif_data.sort_values('VIF', ascending=False).to_string(index=False))
Output:
Feature VIF
Latitude 71.482
Longitude 74.231
AveRooms 7.342
AveBedrms 6.218
MedInc 2.451
HouseAge 1.308
Population 1.228
AveOccup 1.219
Latitude and Longitude have very high VIF because they're geographically correlated. AveRooms and AveBedrms are also high.
What to Do About Multicollinearity
Option 1: Drop one of the correlated features
If two features carry similar information, pick the one that makes more sense to keep or the one with a lower VIF.
# Drop AveBedrms since AveRooms carries similar info
X_reduced = X.drop(columns=['AveBedrms'])
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
X_reduced, y, test_size=0.2, random_state=42
)
scaler_r = StandardScaler()
X_train_rs = scaler_r.fit_transform(X_train_r)
X_test_rs = scaler_r.transform(X_test_r)
model_r = LinearRegression()
model_r.fit(X_train_rs, y_train_r)
r2_r = r2_score(y_test_r, model_r.predict(X_test_rs))
print(f"R2 after dropping AveBedrms: {r2_r:.3f}")
Option 2: Use Ridge Regression
Ridge adds a penalty for large coefficients. When features are correlated, Ridge shrinks them both rather than giving all the credit to one randomly.
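Concretely, plain least squares minimizes the sum of squared errors. Ridge minimizes the same thing plus a penalty on the squared weights:
loss = sum of (y - prediction)^2 + alpha * (w1^2 + w2^2 + ... + wn^2)
The alpha knob controls how hard the weights get squeezed: alpha = 0 is plain linear regression, and larger alpha means smaller weights.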
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_s, y_train)
r2_ridge = r2_score(y_test, ridge.predict(X_test_s))
print(f"Ridge R2: {r2_ridge:.3f}")
# Compare coefficients
print("\nLinear vs Ridge coefficients:")
for feat, lr_c, ri_c in zip(housing.feature_names, model.coef_, ridge.coef_):
print(f" {feat:<12} Linear: {lr_c:+.3f} Ridge: {ri_c:+.3f}")
Ridge coefficients are more stable and smaller in magnitude.
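You can see the shrinkage directly with a quick sweep (reusing X_train_s and y_train from above): as alpha grows, the overall size of the coefficient vector falls.
from sklearn.linear_model import Ridge
import numpy as np
for alpha in [0.1, 1, 10, 100, 1000]:
    r = Ridge(alpha=alpha).fit(X_train_s, y_train)
    print(f"alpha={alpha:>6}: coefficient norm = {np.linalg.norm(r.coef_):.3f}")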
Feature Selection: Picking What Actually Matters
More features isn't always better. Irrelevant features add noise, slow training, and can reduce accuracy. Feature selection helps you keep only the features that actually help.
Method 1: Look at correlation with the target
# How strongly does each feature correlate with house price?
correlations = X.corrwith(pd.Series(y, name='price')).abs().sort_values(ascending=False)
print("Correlation with target:")
print(correlations)
Method 2: Use SelectKBest to pick top features automatically
from sklearn.feature_selection import SelectKBest, f_regression
# Select top 5 features based on F-statistic
selector = SelectKBest(score_func=f_regression, k=5)
selector.fit(X_train_s, y_train)
# Which features were selected?
selected_mask = selector.get_support()
selected_features = X.columns[selected_mask]
print(f"Top 5 features: {list(selected_features)}")
# Train on selected features only
X_train_sel = selector.transform(X_train_s)
X_test_sel = selector.transform(X_test_s)
model_sel = LinearRegression()
model_sel.fit(X_train_sel, y_train)
r2_sel = r2_score(y_test, model_sel.predict(X_test_sel))
print(f"R2 with top 5 features: {r2_sel:.3f}")
Method 3: Lasso automatically shrinks useless features to zero
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.01)
lasso.fit(X_train_s, y_train)
print("\nLasso coefficients (zeros = feature removed):")
for feat, coef in zip(housing.feature_names, lasso.coef_):
status = "REMOVED" if coef == 0 else f"{coef:+.3f}"
print(f" {feat:<12}: {status}")
r2_lasso = r2_score(y_test, lasso.predict(X_test_s))
print(f"\nLasso R2: {r2_lasso:.3f}")
Lasso is aggressive. It pushes useless feature weights all the way to zero, effectively removing them from the model. It does feature selection automatically.
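The only difference from Ridge is the shape of the penalty. Lasso penalizes the absolute values of the weights instead of their squares:
loss = sum of (y - prediction)^2 + alpha * (|w1| + |w2| + ... + |wn|)
That absolute-value penalty is what lets weights land on exactly zero, which the squared penalty in Ridge almost never does.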
Ridge vs Lasso vs Plain Linear Regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
import numpy as np
models = {
'Linear Regression': LinearRegression(),
'Ridge (alpha=1)': Ridge(alpha=1.0),
'Ridge (alpha=10)': Ridge(alpha=10.0),
'Lasso (alpha=0.01)': Lasso(alpha=0.01),
'Lasso (alpha=0.1)': Lasso(alpha=0.1),
}
print(f"{'Model':<25} {'CV R2 Mean':<12} {'CV R2 Std'}")
print("-" * 50)
for name, m in models.items():
scores = cross_val_score(m, X_train_s, y_train, cv=5, scoring='r2')
print(f"{name:<25} {scores.mean():.3f} {scores.std():.3f}")
Output:
Model CV R2 Mean CV R2 Std
--------------------------------------------------
Linear Regression 0.601 0.012
Ridge (alpha=1) 0.601 0.012
Ridge (alpha=10) 0.598 0.012
Lasso (alpha=0.01) 0.599 0.012
Lasso (alpha=0.1) 0.573 0.012
For this dataset, regularization doesn't help much: there are over 20,000 samples and only 8 features, so plain least squares is already stable. On smaller datasets, or ones with many correlated features, Ridge and Lasso make a bigger difference.
The Complete Workflow
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
import pandas as pd
# 1. Load
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
# 2. Check correlations
print("Top correlations with target:")
print(X.corrwith(pd.Series(y)).abs().sort_values(ascending=False))
# 3. Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 4. Scale
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# 5. Cross-validate to pick model
ridge = Ridge(alpha=1.0)
cv_scores = cross_val_score(ridge, X_train_s, y_train, cv=5, scoring='r2')
print(f"\nCV R2: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
# 6. Final train and evaluate
ridge.fit(X_train_s, y_train)
y_pred = ridge.predict(X_test_s)
print(f"Test R2: {r2_score(y_test, y_pred):.3f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
Quick Cheat Sheet
| Problem | Solution |
|---|---|
| High VIF (> 5) on a feature | Drop one of the correlated features |
| Too many features | Use SelectKBest or Lasso |
| Unstable coefficients | Use Ridge regression |
| Want automatic feature removal | Use Lasso |
| Want to keep all features but shrink them | Use Ridge |
| No idea which features matter | Check correlation with target first |
Practice Challenges
Level 1:
Load load_diabetes(). Check correlations between all features. Which two features are most correlated with each other?
Level 2:
On the California housing dataset, calculate VIF for all features. Drop the two with the highest VIF and retrain. Does test R2 go up or down?
Level 3:
Try Lasso with alpha values from 0.001 to 1 on a dataset. Plot how many features get zeroed out as alpha increases. At what alpha does the model start losing meaningful accuracy?
References
- Scikit-learn: Ridge
- Scikit-learn: Lasso
- Scikit-learn: Feature Selection
- Statsmodels VIF
- StatQuest: Ridge vs Lasso (YouTube)
Next up, Post 56: Logistic Regression: Classification With a Probability. We switch from predicting numbers to predicting categories, and the sigmoid function is the reason it works.