The One-Line Summary: Linear regression finds the one straight line that minimizes the total squared vertical distance from your data points — it's the "best compromise" line that gets as close as possible to everyone, even though it perfectly fits no one.
The Lazy Architect
Architect Amanda was hired to design a neighborhood. She had data on 100 houses:
House 1: 1,000 sq ft → sold for $150,000
House 2: 1,500 sq ft → sold for $200,000
House 3: 2,000 sq ft → sold for $280,000
House 4: 1,200 sq ft → sold for $175,000
...
House 100: 1,800 sq ft → sold for $245,000
Her boss asked: "If someone builds a 1,600 sq ft house, what should they expect to sell it for?"
The Overachiever's Approach
Junior architect Bob said: "I'll analyze each house individually! Consider the neighborhood, the year built, the number of bedrooms, the kitchen quality, the school district..."
3 months later: Bob was still analyzing. No answer yet.
Amanda's Approach
Amanda took a piece of paper. Drew two axes. Plotted all 100 houses.
Then she grabbed a ruler and drew ONE straight line through the middle of the scatter.
Price ($)
│
300K│ ×
│ × ×
250K│ × ×
│ × × ×
200K│ × × ×
│ × × ×
150K│ × ×
│ ×
100K│────────────────────────────── Sq Ft
1000 1500 2000 2500
Amanda's line: Price = $50,000 + ($100 × sq ft)
"For a 1,600 sq ft house: $50,000 + ($100 × 1,600) = $210,000."
Done in 5 minutes.
Was Amanda Right?
Her estimates were off by about $15,000 per house on average. Not perfect.
But Bob never finished his analysis. Amanda shipped an answer.
Sometimes "approximately right now" beats "precisely right never."
What Linear Regression Actually Does
Linear regression is Amanda's ruler — it finds the "best" straight line through your data.
THE GOAL:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Given: Points (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)
Find: A line y = mx + b that gets "closest" to all points
Where:
m = slope (how much y changes per unit of x)
b = intercept (y value when x is 0)
But what does "closest" mean?
The Three Architects and Three Lines
Three architects drew three different lines through the same data:
│
300K │ ×
│ × ×
250K │ × ─────────────── Line C (too high)
│ × × ×
200K │ ×══════════════════════ Line B (just right?)
│ × × ×
150K │─────────────────────────── Line A (too low)
│×
100K │
└────────────────────────────
1000 1500 2000 2500
Which line is "best"?
We need a way to measure "badness" of each line.
Measuring Badness: The Error
For each point, the error (or residual) is how far off the line's prediction is:
ERROR = Actual value - Predicted value
House 3: Actual = $280,000
Line B predicts = $250,000
Error = $280,000 - $250,000 = $30,000 (under-predicted)
House 7: Actual = $190,000
Line B predicts = $220,000
Error = $190,000 - $220,000 = -$30,000 (over-predicted)
│
│ × ← Actual point
│ │
│ │ Error = $30K
│ │
│ ● ← Prediction on line
│══════════════════════ Line
│
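In code, a residual is just a subtraction. Here is a minimal sketch for House 3, assuming "Line B" uses the same pricing rule as Amanda's line (Price = $50,000 + $100 × sq ft), which matches the $250,000 prediction quoted above:
# Hypothetical "Line B": assumed here to be Amanda's rule
def line_b_prediction(sqft):
    return 50_000 + 100 * sqft

actual = 280_000                       # House 3 sold for $280,000
predicted = line_b_prediction(2_000)   # 2,000 sq ft → $250,000
residual = actual - predicted
print(residual)                        # 30000 → the line under-predicts by $30K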
Why Squared Errors?
Why not just add up all the errors?
PROBLEM WITH SIMPLE SUM:
House 1: Error = +$30,000
House 2: Error = -$30,000
─────────────────────────
Sum of errors = $0
"Perfect!" ...but the line was wrong for BOTH houses!
Positive and negative errors cancel out.
Solution: Square the errors first!
SQUARED ERRORS:
House 1: Error² = (+$30,000)² = 900,000,000
House 2: Error² = (-$30,000)² = 900,000,000
─────────────────────────────────────────
Sum of squared errors = 1,800,000,000
Now negatives don't cancel positives!
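A three-line sketch makes the cancellation problem concrete (the ±$30,000 errors are the same illustrative numbers as above):
errors = [30_000, -30_000]              # one under-prediction, one over-prediction
print(sum(errors))                      # 0 → looks "perfect", yet both are wrong
print(sum(e ** 2 for e in errors))      # 1800000000 → squaring keeps both misses visible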
The Objective: Minimize Sum of Squared Errors (SSE)
LINEAR REGRESSION GOAL:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Find m and b that MINIMIZE:
SSE = Σ(yᵢ - ŷᵢ)²
    = Σ(yᵢ - (m·xᵢ + b))²
Where:
yᵢ = actual value
ŷᵢ = predicted value = m·xᵢ + b
This is called "Ordinary Least Squares" (OLS)
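To make the objective concrete, here is a small sketch that scores two candidate lines on a tiny made-up dataset (both the points and the (m, b) pairs are invented for illustration, not the article's house data):
import numpy as np

x = np.array([1000, 1500, 2000, 2500])
y = np.array([150_000, 200_000, 280_000, 310_000])

def sse(m, b):
    """Sum of squared errors for the line y = m·x + b."""
    return np.sum((y - (m * x + b)) ** 2)

print(f"Candidate A (m=80,  b=60,000): SSE = {sse(80, 60_000):,.0f}")
print(f"Candidate B (m=100, b=50,000): SSE = {sse(100, 50_000):,.0f}")
# The candidate with the lower SSE is "closer"; OLS finds the exact minimizer.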
Visualizing the Best Line
BAD LINE (High SSE): GOOD LINE (Low SSE):
━━━━━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━
│ × │ ×
│ │ │ /
│ │ BIG error │ / small error
│ │ │ /
│════════════ Line ×═══════════ Line
│ × │ \
│ │ │ \ small error
│ │ BIG error │ \
│ │ │ ×
Sum of squared errors: LARGE Sum of squared errors: SMALL
The Math: Finding the Best Line
Calculus gives us exact formulas for the best m and b:
THE CLOSED-FORM SOLUTION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Slope (m):
m = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
Intercept (b):
b = ȳ - m·x̄
Where:
x̄ = mean of all x values
ȳ = mean of all y values
In plain English:
- The slope measures how much x and y move together (covariance) relative to how much x varies (variance)
- The intercept ensures the line passes through the point (mean of x, mean of y)
Code: Linear Regression from Scratch
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data: House prices
np.random.seed(42)
square_feet = np.random.uniform(1000, 3000, 100)
# True relationship: Price = $50,000 + $100/sqft + noise
true_price = 50000 + 100 * square_feet + np.random.normal(0, 30000, 100)
X = square_feet
y = true_price
# ============================================================
# LINEAR REGRESSION FROM SCRATCH
# ============================================================
def linear_regression_scratch(X, y):
    """
    Find the best line y = mx + b using least squares.
    """
    n = len(X)

    # Calculate means
    x_mean = np.mean(X)
    y_mean = np.mean(y)

    # Calculate slope (m)
    # m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
    numerator = np.sum((X - x_mean) * (y - y_mean))
    denominator = np.sum((X - x_mean) ** 2)
    m = numerator / denominator

    # Calculate intercept (b)
    # b = ȳ - m·x̄
    b = y_mean - m * x_mean

    return m, b
# Find the best line
slope, intercept = linear_regression_scratch(X, y)
print("LINEAR REGRESSION RESULTS (from scratch)")
print("="*50)
print(f"Equation: Price = ${intercept:,.0f} + ${slope:.2f} × SquareFeet")
print(f"\nInterpretation:")
print(f" Base price (0 sq ft): ${intercept:,.0f}")
print(f" Each additional sq ft adds: ${slope:.2f}")
print(f"\nPredictions:")
for sqft in [1000, 1500, 2000, 2500]:
    pred = intercept + slope * sqft
    print(f" {sqft} sq ft → ${pred:,.0f}")
Output:
LINEAR REGRESSION RESULTS (from scratch)
==================================================
Equation: Price = $47,892 + $101.34 × SquareFeet
Interpretation:
Base price (0 sq ft): $47,892
Each additional sq ft adds: $101.34
Predictions:
1000 sq ft → $149,232
1500 sq ft → $199,902
2000 sq ft → $250,572
2500 sq ft → $301,242
Code: Using Scikit-Learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Reshape X for sklearn (needs 2D array)
X_reshaped = X.reshape(-1, 1)
# Create and fit model
model = LinearRegression()
model.fit(X_reshaped, y)
# Get parameters
slope_sklearn = model.coef_[0]
intercept_sklearn = model.intercept_
print("LINEAR REGRESSION RESULTS (sklearn)")
print("="*50)
print(f"Slope: {slope_sklearn:.2f}")
print(f"Intercept: {intercept_sklearn:,.0f}")
print(f"\nSame as our scratch implementation? {np.isclose(slope, slope_sklearn)}")
# Make predictions
y_pred = model.predict(X_reshaped)
# Evaluate
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)
print(f"\nModel Performance:")
print(f" RMSE: ${rmse:,.0f}")
print(f" R² Score: {r2:.4f}")
print(f" → The model explains {r2*100:.1f}% of the variance in prices")
Output:
LINEAR REGRESSION RESULTS (sklearn)
==================================================
Slope: 101.34
Intercept: 47,892
Same as our scratch implementation? True
Model Performance:
RMSE: $29,847
R² Score: 0.7823
→ The model explains 78.2% of the variance in prices
Visualizing the Line
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
# Plot data points
plt.scatter(X, y, alpha=0.6, label='Actual houses')
# Plot regression line
X_line = np.array([X.min(), X.max()])
y_line = intercept + slope * X_line
plt.plot(X_line, y_line, 'r-', linewidth=2, label=f'Best fit: y = {intercept:.0f} + {slope:.1f}x')
# Plot some residuals
for i in range(0, len(X), 10):  # Every 10th point
    y_pred_i = intercept + slope * X[i]
    plt.plot([X[i], X[i]], [y[i], y_pred_i], 'g--', alpha=0.5)
plt.xlabel('Square Feet', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.title('Linear Regression: Finding the Best Line', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
# Format y-axis as currency
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))
plt.tight_layout()
plt.savefig('linear_regression.png', dpi=150)
plt.show()
The Geometry: Why This Line?
THE BEST LINE HAS A SPECIAL PROPERTY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. It passes through the "center of mass" of the data
→ The point (mean(x), mean(y)) is ALWAYS on the line
2. The sum of residuals is ZERO
→ Positive and negative errors balance out perfectly
3. No other straight line has smaller squared errors
→ It's mathematically optimal for this objective
4. If the true relationship IS linear (and the errors are uncorrelated
   with constant variance), OLS gives the best linear unbiased
   estimate (Gauss-Markov theorem)
# Verify property 1: Line passes through (mean_x, mean_y)
mean_x = np.mean(X)
mean_y = np.mean(y)
predicted_at_mean_x = intercept + slope * mean_x
print(f"Mean of X: {mean_x:.1f}")
print(f"Mean of Y: ${mean_y:,.0f}")
print(f"Predicted Y at mean X: ${predicted_at_mean_x:,.0f}")
print(f"Difference: ${abs(mean_y - predicted_at_mean_x):,.2f}") # Should be ~0
# Verify property 2: Sum of residuals is zero
residuals = y - (intercept + slope * X)
print(f"\nSum of residuals: ${np.sum(residuals):,.2f}") # Should be ~0
Output:
Mean of X: 1987.4
Mean of Y: $249,278
Predicted Y at mean X: $249,278
Difference: $0.00
Sum of residuals: $0.00
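Property 3 can also be sanity-checked numerically. The sketch below nudges the fitted slope and intercept by arbitrary made-up amounts and confirms the SSE only goes up:
def sse_of(m, b):
    return np.sum((y - (m * X + b)) ** 2)

best_sse = sse_of(slope, intercept)
for dm, db in [(1, 0), (-1, 0), (0, 5000), (0, -5000)]:   # arbitrary nudges
    worse_sse = sse_of(slope + dm, intercept + db)
    print(f"Δm={dm:+}, Δb={db:+}: SSE increases by {worse_sse - best_sse:,.0f}")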
Multiple Linear Regression
What if price depends on MORE than just square feet?
import numpy as np
from sklearn.linear_model import LinearRegression
# Multiple features
np.random.seed(42)
n = 500
square_feet = np.random.uniform(1000, 3000, n)
bedrooms = np.random.randint(1, 6, n)
bathrooms = np.random.randint(1, 4, n)
age = np.random.uniform(0, 50, n)
# True relationship (unknown in real life)
price = (
    50000 +
    100 * square_feet +
    15000 * bedrooms +
    20000 * bathrooms -
    1000 * age +
    np.random.normal(0, 25000, n)  # Noise
)
# Combine features into matrix
X = np.column_stack([square_feet, bedrooms, bathrooms, age])
y = price
# Fit multiple linear regression
model = LinearRegression()
model.fit(X, y)
print("MULTIPLE LINEAR REGRESSION")
print("="*60)
print(f"\nEquation:")
print(f"Price = ${model.intercept_:,.0f}")
print(f" + ${model.coef_[0]:.2f} × SquareFeet")
print(f" + ${model.coef_[1]:,.0f} × Bedrooms")
print(f" + ${model.coef_[2]:,.0f} × Bathrooms")
print(f" + ${model.coef_[3]:,.0f} × Age")
print(f"\nInterpretation:")
print(f" Each sq ft adds ${model.coef_[0]:.2f}")
print(f" Each bedroom adds ${model.coef_[1]:,.0f}")
print(f" Each bathroom adds ${model.coef_[2]:,.0f}")
print(f" Each year of age {'adds' if model.coef_[3] > 0 else 'subtracts'} ${abs(model.coef_[3]):,.0f}")
# Example prediction
new_house = np.array([[2000, 3, 2, 10]]) # 2000 sqft, 3 bed, 2 bath, 10 years old
predicted_price = model.predict(new_house)[0]
print(f"\nPrediction for 2000sqft, 3bed, 2bath, 10yr old:")
print(f" ${predicted_price:,.0f}")
Output:
MULTIPLE LINEAR REGRESSION
============================================================
Equation:
Price = $48,234
+ $100.45 × SquareFeet
+ $14,876 × Bedrooms
+ $20,234 × Bathrooms
- $987 × Age
Interpretation:
Each sq ft adds $100.45
Each bedroom adds $14,876
Each bathroom adds $20,234
Each year of age subtracts $987
Prediction for 2000sqft, 3bed, 2bath, 10yr old:
$324,360
The Assumptions of Linear Regression
Linear regression makes assumptions. Violate them and your model may be garbage:
THE FOUR KEY ASSUMPTIONS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. LINEARITY
The relationship between X and Y is actually linear
(not curved, not exponential, not logarithmic)
2. INDEPENDENCE
Each data point is independent of others
(not time series where today depends on yesterday)
3. HOMOSCEDASTICITY
The variance of errors is constant across all X values
(not bigger errors for bigger houses)
4. NORMALITY
The residuals are normally distributed
(required for confidence intervals and hypothesis tests)
Checking Assumptions
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
def check_assumptions(X, y, model):
    """Check linear regression assumptions."""
    y_pred = model.predict(X.reshape(-1, 1) if X.ndim == 1 else X)
    residuals = y - y_pred

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # 1. Linearity: Predicted vs Actual
    ax1 = axes[0, 0]
    ax1.scatter(y_pred, y, alpha=0.5)
    ax1.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', linewidth=2)
    ax1.set_xlabel('Predicted')
    ax1.set_ylabel('Actual')
    ax1.set_title('1. Linearity Check: Predicted vs Actual')

    # 2. Homoscedasticity: Residuals vs Predicted
    ax2 = axes[0, 1]
    ax2.scatter(y_pred, residuals, alpha=0.5)
    ax2.axhline(y=0, color='r', linestyle='--')
    ax2.set_xlabel('Predicted')
    ax2.set_ylabel('Residuals')
    ax2.set_title('2. Homoscedasticity: Residuals vs Predicted')

    # 3. Normality: Histogram of residuals
    ax3 = axes[1, 0]
    ax3.hist(residuals, bins=30, density=True, alpha=0.7)
    xmin, xmax = ax3.get_xlim()
    x = np.linspace(xmin, xmax, 100)
    p = stats.norm.pdf(x, residuals.mean(), residuals.std())
    ax3.plot(x, p, 'r-', linewidth=2, label='Normal distribution')
    ax3.set_xlabel('Residuals')
    ax3.set_ylabel('Density')
    ax3.set_title('3. Normality: Histogram of Residuals')
    ax3.legend()

    # 4. Normality: Q-Q plot
    ax4 = axes[1, 1]
    stats.probplot(residuals, dist="norm", plot=ax4)
    ax4.set_title('4. Normality: Q-Q Plot')

    plt.tight_layout()
    plt.savefig('assumption_checks.png', dpi=150)
    plt.show()

    # Statistical tests
    print("\nASSUMPTION CHECKS")
    print("="*50)

    # Normality test
    _, p_value = stats.shapiro(residuals[:500] if len(residuals) > 500 else residuals)
    print(f"Shapiro-Wilk Normality Test: p-value = {p_value:.4f}")
    print(f" → {'✓ Residuals appear normal' if p_value > 0.05 else '⚠️ Residuals may not be normal'}")

    # Mean of residuals should be ~0
    print(f"\nMean of residuals: {residuals.mean():.4f}")
    print(f" → {'✓ Centered around 0' if abs(residuals.mean()) < 0.01 * y.std() else '⚠️ Not centered'}")
# Run the checks on the multiple regression model and data fitted above
check_assumptions(X, y, model)
When Linear Regression Fails
Problem 1: Non-Linear Relationship
THE DATA: WHAT LINEAR REGRESSION SEES:
━━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
│ ××× │ ×××
│ ×× × │ ××───────── Line misses
│ × × │ × × the curve!
│ × × │ × ×
│ × × │ × ×
│× × │× ×
└──────────────── └────────────────
The relationship is CURVED.
A straight line can't capture it.
Solution: Transform features (e.g., log, polynomial) or use non-linear models.
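One common fix stays inside the linear regression toolbox: engineer new features so the relationship becomes linear in the transformed space. A hedged sketch with scikit-learn's PolynomialFeatures, using synthetic curved data invented for this example:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = (x.ravel() - 5) ** 2 + rng.normal(0, 1, 200)   # a U-shaped (curved) relationship

straight = LinearRegression().fit(x, y)
curved = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

print(f"Straight line R²:     {straight.score(x, y):.3f}")   # near 0: the line can't see the U shape
print(f"Degree-2 features R²: {curved.score(x, y):.3f}")     # near 1: the curve is captured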
Problem 2: Outliers
THE DATA: WITH OUTLIER:
━━━━━━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━━
│ │ × OUTLIER!
│ × │ × /
│ × × × │ × × × /
│ × × × × │ × × × × / Line pulled
│× × × × × │× × × × × / toward outlier
└──────────── └────────────
One extreme point pulls the entire line!
Solution: Remove outliers, use robust regression, or use regularization.
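Here is a hedged sketch of the robust-regression route, comparing plain OLS with scikit-learn's HuberRegressor on synthetic data with one planted high-leverage outlier (the data and the outlier are invented for illustration):
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(1000, 3000, 100).reshape(-1, 1)
y = 50_000 + 100 * x.ravel() + rng.normal(0, 20_000, 100)
x[0], y[0] = 3000, 2_000_000                      # one absurd, high-leverage sale

ols = LinearRegression().fit(x, y)
huber = HuberRegressor(max_iter=1000).fit(x, y)

print(f"OLS slope:   {ols.coef_[0]:.1f}")    # dragged away from ~100 by the outlier
print(f"Huber slope: {huber.coef_[0]:.1f}")  # typically stays much closer to ~100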
Problem 3: Multicollinearity
# When features are highly correlated with each other
# Example: Square feet AND number of bedrooms
# (Bigger houses have more rooms — they're measuring the same thing!)
correlation = np.corrcoef(square_feet, bedrooms)[0, 1]
print(f"Correlation between sqft and bedrooms: {correlation:.2f}")
# Note: our synthetic square_feet and bedrooms were generated independently,
# so this prints a correlation near zero; in real housing data it is typically high.
# Problem: Coefficients become unstable and hard to interpret
# "Each bedroom adds $50,000 but each sqft SUBTRACTS $10" — nonsense!
Solution: Remove redundant features, use regularization (Ridge, Lasso).
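A hedged sketch of the regularization route: two nearly duplicate features (made up here) can confuse plain OLS, while Ridge shrinks the coefficients toward something more stable. The features are standardized first so the penalty treats them equally:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
sqft = rng.uniform(1000, 3000, 50)
rooms = sqft / 400 + rng.normal(0, 0.1, 50)    # almost a rescaled copy of sqft
X = StandardScaler().fit_transform(np.column_stack([sqft, rooms]))
y = 50_000 + 100 * sqft + rng.normal(0, 20_000, 50)

print(LinearRegression().fit(X, y).coef_)   # the two coefficients can fight each other wildly
print(Ridge(alpha=10.0).fit(X, y).coef_)    # shrunk and shared more evenly between the twins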
Linear Regression: The Complete Picture
LINEAR REGRESSION SUMMARY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHAT IT DOES:
Finds the straight line (or hyperplane) that minimizes
the sum of squared errors between predictions and actual values.
THE EQUATION:
Simple: y = mx + b
Multiple: y = b + m₁x₁ + m₂x₂ + ... + mₙxₙ
THE OBJECTIVE:
Minimize SSE = Σ(yᵢ - ŷᵢ)²
STRENGTHS:
✓ Simple and interpretable
✓ Fast to train (closed-form solution)
✓ Works well when relationship is actually linear
✓ Coefficients have clear meaning
✓ Foundation for many other algorithms
WEAKNESSES:
✗ Assumes linear relationship
✗ Sensitive to outliers
✗ Can't capture complex patterns
✗ Struggles with multicollinearity
WHEN TO USE:
• Relationship appears linear
• Interpretability is important
• Quick baseline model needed
• Data size is moderate
Quick Reference
The Formulas
| Component | Formula |
|---|---|
| Prediction | ŷ = mx + b |
| Slope | m = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)² |
| Intercept | b = ȳ - m·x̄ |
| SSE | Σ(y - ŷ)² |
| R² | 1 - SSE/SST |
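For reference, R² can be computed by hand from the same pieces; a short sketch, assuming y and y_pred are NumPy arrays of actual and predicted values:
import numpy as np

sse = np.sum((y - y_pred) ** 2)         # unexplained variation (sum of squared errors)
sst = np.sum((y - np.mean(y)) ** 2)     # total variation around the mean
r2 = 1 - sse / sst
print(f"R² = {r2:.4f}")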
The Code
# Scikit-learn (recommended)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X_new)
# Coefficients
print(f"Slope: {model.coef_}")
print(f"Intercept: {model.intercept_}")
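A quick usage note, sketched under the assumption that X and y are already loaded: evaluating on held-out data gives a more honest score than scoring the training set.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(f"Held-out R²: {r2_score(y_test, model.predict(X_test)):.3f}")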
The Checklist
Before using linear regression:
□ Is the relationship approximately linear?
□ Are residuals normally distributed?
□ Is variance constant across X values?
□ Are observations independent?
□ Are features not too correlated with each other?
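For the last item, one quick (if rough) check is the feature correlation matrix; a sketch assuming X is a 2-D feature matrix with one feature per column:
import numpy as np

corr = np.corrcoef(X, rowvar=False)   # feature-by-feature correlation matrix
print(np.round(corr, 2))              # off-diagonal values near ±1 signal trouble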
Key Takeaways
Linear regression finds the best straight line — Minimizes sum of squared errors
"Best" means minimum squared errors — Squaring prevents positive/negative cancellation
There's a closed-form solution — No iteration needed, direct calculation
The line passes through (mean_x, mean_y) — Always!
Coefficients are interpretable — "Each unit of X adds m units to Y"
Assumptions matter — Linearity, independence, homoscedasticity, normality
Sensitive to outliers — One extreme point can ruin everything
Foundation for everything else — Ridge, Lasso, neural networks all build on this
The One-Sentence Summary
Amanda drew one straight line through 100 data points and called it a pricing model — linear regression is that line, found by minimizing the sum of squared vertical distances from every point, giving you the "best compromise" that gets close to everyone even though it perfectly fits no one.
What's Next?
This is the first article in the Linear Models series. Coming up:
- Ridge Regression — When your line is too wiggly
- Lasso Regression — When you have too many features
- Elastic Net — The best of both worlds
- Polynomial Regression — When a straight line isn't enough
- Logistic Regression — Linear models for classification
Follow me for the next article in this series!
Let's Connect!
If the "lazy architect" finally made linear regression click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's the first thing you predicted with linear regression? Mine was predicting exam scores from study hours. Simple, but it worked! 📚
The difference between "I have data, what should I do?" and "I have a prediction"? Often just one line of code: model.fit(X, y). Linear regression is 200 years old and still one of the most useful algorithms in data science. Respect the classics.
Share this with someone who's mystified by the math behind ML. Linear regression is where it all begins — and it's simpler than they think.
Happy regressing! 📈