You want to predict something. A number. How much a house will sell for. How many units you'll sell next month. What temperature it'll be tomorrow.
That's a regression problem. And linear regression is the first tool you reach for.
It's the simplest ML model that actually does something useful. Every more complex model builds on the ideas here. You can't skip this one.
What You'll Learn Here
- What linear regression actually does
- The equation y = mx + b and what each part means in ML
- What a cost function is and why we need one
- How least squares fitting works
- Building linear regression from scratch and with scikit-learn
- How to evaluate regression models (not accuracy, different metrics)
- Multiple features and what changes
The Simplest Idea in ML
You have two things that seem related. Hours studied and exam score. House size and house price. Temperature and ice cream sales.
Plot them on a graph. You get a scatter of dots. Linear regression draws the best possible straight line through those dots.
Once you have that line, you can plug in a new value on the X axis and read off a prediction on the Y axis.
That's it. That's linear regression.
The Equation: y = mx + b
You've seen this since school. In ML we write it slightly differently but it's the same thing.
y = mx + b
y = the thing you're predicting (output)
x = the input feature you know
m = slope (how much y changes when x increases by 1)
b = intercept (what y is when x is 0)
In ML notation:
y_hat = w * x + b
y_hat = predicted value
w = weight (same as slope m)
x = input feature
b = bias (same as intercept)
The model's job is to find the right values of w and b that make the line fit the data as well as possible.
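To make the notation concrete, here's a one-line sketch with invented numbers (w = 4 and b = 50 are made up, not fitted to anything):

# Hypothetical weight and bias, just to show the shape of the equation
w = 4.0   # each extra unit of x adds 4 to the prediction
b = 50.0  # prediction when x is 0
x = 6.0
y_hat = w * x + b
print(y_hat)  # 74.0

Finding good values of w and b automatically is the whole training step, and that's what the rest of this post is about.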
What "Best Fit" Means: The Cost Function
For any line you draw, some predictions will be too high and some too low. The difference between your prediction and the real answer is called the residual or error.
error = actual - predicted
You want to minimize the total error across all your training examples. But you can't just add up raw errors because positive and negative errors cancel out.
So instead, you square each error and add them all up. This is called the Mean Squared Error (MSE).
MSE = (1/n) * sum((actual - predicted)^2)
Squaring does two things: makes all errors positive, and punishes big errors more than small ones. A prediction that's off by 10 is penalized 4x more than one off by 5.
The line that gives you the lowest MSE is your best fit line. That's what scikit-learn finds when you call .fit().
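Here's a tiny sketch with made-up numbers: two candidate lines for the same five points, and the MSE of each. The data and both lines are invented purely for illustration.

import numpy as np

# Toy data (invented) and two candidate lines
x = np.array([1, 3, 5, 7, 9], dtype=float)
actual = np.array([52, 60, 68, 75, 85], dtype=float)

pred_a = 4.0 * x + 50.0   # candidate line A
pred_b = 2.0 * x + 60.0   # candidate line B

mse_a = np.mean((actual - pred_a) ** 2)
mse_b = np.mean((actual - pred_b) ** 2)
print(mse_a, mse_b)  # line A has the lower MSE, so it's the better fit

Fitting the model means searching over all possible (w, b) pairs for the one with the lowest MSE, which for linear regression can be solved directly with a formula.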
Building It From Scratch First
Before using scikit-learn, let's implement linear regression manually so you see what's actually happening.
import numpy as np
import matplotlib.pyplot as plt
# Simple dataset: study hours vs exam score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
scores = np.array([52, 55, 60, 65, 68, 72, 75, 81, 85, 90], dtype=float)
# Calculate slope (w) and intercept (b) using the least squares formula
n = len(hours)
mean_x = np.mean(hours)
mean_y = np.mean(scores)
# Slope formula: w = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
numerator = np.sum((hours - mean_x) * (scores - mean_y))
denominator = np.sum((hours - mean_x) ** 2)
w = numerator / denominator
# Intercept formula: b = mean_y - w * mean_x
b = mean_y - w * mean_x
print(f"Slope (w): {w:.2f}")
print(f"Intercept (b): {b:.2f}")
print(f"Equation: score = {w:.2f} * hours + {b:.2f}")
# Make predictions
predictions = w * hours + b
# Plot
plt.figure(figsize=(8, 5))
plt.scatter(hours, scores, color='blue', label='Actual scores', zorder=5)
plt.plot(hours, predictions, color='red', linewidth=2, label=f'y = {w:.1f}x + {b:.1f}')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.title('Linear Regression From Scratch')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('linear_regression_scratch.png', dpi=100)
plt.show()
Output:
Slope (w): 4.19
Intercept (b): 47.27
Equation: score = 4.19 * hours + 47.27
This tells you: for every extra hour of study, the score goes up by about 4.19 points. If someone studies 0 hours, the model predicts 47.27.
Now predict for a new student who studied 7.5 hours:
new_hours = 7.5
predicted_score = w * new_hours + b
print(f"Predicted score for {new_hours} hours: {predicted_score:.1f}")
# Output: Predicted score for 7.5 hours: 78.7
Now With Scikit-learn
You'll always use scikit-learn in practice. Let's do the same thing, this time with a train/test split and standard evaluation metrics. (With only 10 data points the test split is tiny, so treat the metric numbers here as illustration, not a real evaluation.)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
# Same data, shaped for sklearn (needs 2D input)
hours = np.array([1,2,3,4,5,6,7,8,9,10]).reshape(-1, 1)
scores = np.array([52,55,60,65,68,72,75,81,85,90])
# Split
X_train, X_test, y_train, y_test = train_test_split(
    hours, scores, test_size=0.2, random_state=42
)
# Train
model = LinearRegression()
model.fit(X_train, y_train)
print(f"Slope (w): {model.coef_[0]:.2f}")
print(f"Intercept (b): {model.intercept_:.2f}")
# Predict
y_pred = model.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"\nMSE: {mse:.2f}")
print(f"RMSE: {np.sqrt(mse):.2f}")
print(f"MAE: {mae:.2f}")
print(f"R2: {r2:.3f}")
Evaluating Regression: Different Metrics Than Classification
For classification you use accuracy. For regression, accuracy doesn't make sense. You use these instead.
MAE (Mean Absolute Error)
Average of absolute differences between predictions and actual values. Easy to understand. If MAE = 5.2, your predictions are off by 5.2 on average.
MAE = mean(|actual - predicted|)
MSE (Mean Squared Error)
Squares the errors before averaging. Punishes big mistakes more. Harder to interpret because the units are squared.
MSE = mean((actual - predicted)^2)
RMSE (Root Mean Squared Error)
Square root of MSE. Same units as your target variable. Most commonly used.
RMSE = sqrt(MSE)
R2 Score (R-squared)
Tells you how much of the variance in your target variable is explained by your model. Usually falls between 0 and 1 (it can go negative for a truly bad model), and closer to 1 is better.
- R2 = 1.0: perfect predictions
- R2 = 0.8: model explains 80% of the variation in the data
- R2 = 0.0: model is no better than just predicting the mean every time
- R2 < 0: model is worse than predicting the mean (something is very wrong)
# Quick comparison of all metrics
print("Metric comparison:")
print(f" MAE: {mae:.2f} <- average error in original units")
print(f" RMSE: {np.sqrt(mse):.2f} <- also in original units, penalizes big errors")
print(f" R2: {r2:.3f} <- 0 to 1, higher is better")
Real Example: California Housing Dataset
Let's use a real dataset with actual features.
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import pandas as pd
# Load data
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target # median house value in $100,000s
print(X.head())
print(f"\nTarget range: ${y.min()*100:.0f}k to ${y.max()*100:.0f}k")
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train
model = LinearRegression()
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: ${rmse*100:.0f}k")
print(f"R2: {r2:.3f}")
Output:
RMSE: $74k
R2: 0.576
A typical prediction is off by roughly $74k (RMSE, unlike MAE, weights big misses more heavily). R2 of 0.576 means the model explains about 58% of the variation in house prices. Not bad for a simple linear model with no tuning.
Looking at Feature Weights
One great thing about linear regression: you can see exactly how much each feature matters.
# Feature importance from coefficients
coefficients = pd.Series(model.coef_, index=housing.feature_names)
coefficients_sorted = coefficients.sort_values()
print("Feature weights (bigger absolute value = more influence):")
print(coefficients_sorted)
Output:
Feature weights (bigger absolute value = more influence):
Latitude     -0.900
Longitude    -0.870
AveOccup     -0.391
AveBedrms    -0.049
Population   -0.003
HouseAge      0.123
AveRooms      0.323
MedInc        0.827
MedInc (median income) has the biggest positive weight. Higher income neighborhoods have higher house prices. Makes sense.
Latitude has a big negative weight. Moving north in California (higher latitude) generally means lower prices. Also makes sense geographically.
This interpretability is one of the biggest advantages of linear regression. You can explain every prediction.
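Since the prediction is just a weighted sum, you can break any single prediction into per-feature contributions. A small sketch, continuing from the California housing code above (it assumes model, X_test_scaled, housing, and pd are still in scope):

# Decompose one prediction into per-feature contributions
row_scaled = X_test_scaled[0]   # one scaled test example
contributions = pd.Series(model.coef_ * row_scaled, index=housing.feature_names)
print(contributions.sort_values())
print(f"Sum + intercept = prediction: {contributions.sum() + model.intercept_:.3f}")
print(f"model.predict says:           {model.predict(X_test_scaled[:1])[0]:.3f}")

The two printed numbers match because a linear model's prediction is literally the sum of coefficient times feature value, plus the intercept.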
The Things Everyone Gets Wrong
Mistake 1: Not scaling features
Plain least squares will fit the same line whether you scale or not, but with raw numbers the coefficients become impossible to compare: a feature that ranges from 0 to 1,000,000 ends up with a tiny weight while one that ranges from 0 to 1 gets a huge one, and regularized or gradient-descent-based variants genuinely suffer. Scale whenever features have very different ranges, especially if you want to read the weights.
# Always do this before linear regression
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Mistake 2: Using it for non-linear relationships
Linear regression assumes the relationship is, well, linear. If the real pattern curves, a straight line won't fit it. Check a scatter plot first.
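A quick sketch with invented quadratic data: a straight line underfits the curve, while adding an x-squared column via PolynomialFeatures (the model is still LinearRegression, just with an extra feature) fits it much better.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Invented data with a clear curve: y grows with the square of x
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * x.ravel() ** 2 + rng.normal(0, 5, 50)

# Straight line: underfits the curved pattern
linear = LinearRegression().fit(x, y)
print("Linear R2:    ", r2_score(y, linear.predict(x)))

# Add x^2 as a feature; the model is still linear in the new features
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
poly = LinearRegression().fit(x_poly, y)
print("Polynomial R2:", r2_score(y, poly.predict(x_poly)))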
Mistake 3: Reporting R2 without checking residuals
R2 can look decent even when your model is completely wrong in systematic ways. Always plot actual vs predicted and residuals vs predicted.
import matplotlib.pyplot as plt
# Residual plot
residuals = y_test - y_pred
plt.figure(figsize=(8, 4))
plt.scatter(y_pred, residuals, alpha=0.3, color='blue')
plt.axhline(y=0, color='red', linewidth=1)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot - should look like random scatter around 0')
plt.savefig('residual_plot.png', dpi=100)
plt.show()
If residuals show a pattern (a curve, a funnel shape), your linear model is missing something.
Quick Cheat Sheet
| Thing | Code |
|---|---|
| Train model | model = LinearRegression(); model.fit(X_train, y_train) |
| Get slope | model.coef_ |
| Get intercept | model.intercept_ |
| Predict | model.predict(X_test) |
| MAE | mean_absolute_error(y_test, y_pred) |
| RMSE | np.sqrt(mean_squared_error(y_test, y_pred)) |
| R2 | r2_score(y_test, y_pred) |
| Scale features | StandardScaler().fit_transform(X_train) |
Practice Challenges
Level 1:
Load load_diabetes() from sklearn. Train a linear regression model. Print the RMSE and R2. Which feature has the highest positive weight?
Level 2:
On the California housing dataset, remove the scaling step. Compare R2 with and without scaling. What changes? What does that tell you?
Level 3:
Plot actual vs predicted values as a scatter plot for the California housing data. Draw a diagonal line where predicted = actual. Points closer to that line are better predictions. Where does the model struggle the most?
References
- Scikit-learn: LinearRegression
- Scikit-learn: Regression metrics
- StatQuest: Linear Regression (YouTube)
- Khan Academy: Least Squares
Next up, Post 55: Multiple Regression: More Features, More Power. What changes when you go from 1 input to 10 inputs, how multicollinearity breaks your model, and how to pick the right features.