The One-Line Summary: Linear regression finds the one straight line that minimizes the total squared vertical distance from your data points — it's the "best compromise" line that gets as close as possible to everyone, even though it perfectly fits no one.
The Lazy Architect
Architect Amanda was hired to design a neighborhood. She had data on 100 houses:
House 1: 1,000 sq ft → sold for $150,000
House 2: 1,500 sq ft → sold for $200,000
House 3: 2,000 sq ft → sold for $280,000
House 4: 1,200 sq ft → sold for $175,000
...
House 100: 1,800 sq ft → sold for $245,000
Her boss asked: "If someone builds a 1,600 sq ft house, what should they expect to sell it for?"
The Overachiever's Approach
Junior architect Bob said: "I'll analyze each house individually! Consider the neighborhood, the year built, the number of bedrooms, the kitchen quality, the school district..."
3 months later: Bob was still analyzing. No answer yet.
Amanda's Approach
Amanda took a piece of paper. Drew two axes. Plotted all 100 houses.
Then she grabbed a ruler and drew ONE straight line through the middle of the scatter.
Price ($)
│
300K│ ×
│ × ×
250K│ × ×
│ × × ×
200K│ × × ×
│ × × ×
150K│ × ×
│ ×
100K│────────────────────────────── Sq Ft
1000 1500 2000 2500
Amanda's line: Price = $50,000 + ($100 × sq ft)
"For a 1,600 sq ft house: $50,000 + ($100 × 1,600) = $210,000."
Done in 5 minutes.
Was Amanda Right?
Her estimates were off by about $15,000 per house on average. Not perfect.
But Bob never finished his analysis. Amanda shipped an answer.
Sometimes "approximately right now" beats "precisely right never."
What Linear Regression Actually Does
Linear regression is Amanda's ruler — it finds the "best" straight line through your data.
THE GOAL:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Given: Points (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)
Find: A line y = mx + b that gets "closest" to all points
Where:
m = slope (how much y changes per unit of x)
b = intercept (y value when x is 0)
But what does "closest" mean?
The Three Architects and Three Lines
Three architects drew three different lines through the same data:
│
300K │ ×
│ × ×
250K │ × ─────────────── Line C (too high)
│ × × ×
200K │ ×══════════════════════ Line B (just right?)
│ × × ×
150K │─────────────────────────── Line A (too low)
│×
100K │
└────────────────────────────
1000 1500 2000 2500
Which line is "best"?
We need a way to measure "badness" of each line.
Measuring Badness: The Error
For each point, the error (or residual) is how far off the line's prediction is:
ERROR = Actual value - Predicted value
House 3: Actual = $280,000
Line B predicts = $250,000
Error = $280,000 - $250,000 = $30,000 (under-predicted)
House 7: Actual = $190,000
Line B predicts = $220,000
Error = $190,000 - $220,000 = -$30,000 (over-predicted)
│
│ × ← Actual point
│ │
│ │ Error = $30K
│ │
│ ● ← Prediction on line
│══════════════════════ Line
│
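In code, a residual is just a subtraction. Here is a minimal sketch for House 3, assuming "Line B" uses the same pricing rule as Amanda's line (Price = $50,000 + $100 × sq ft), which matches the $250,000 prediction quoted above:
# Hypothetical "Line B": assumed here to be Amanda's rule
def line_b_prediction(sqft):
    return 50_000 + 100 * sqft

actual = 280_000                       # House 3 sold for $280,000
predicted = line_b_prediction(2_000)   # 2,000 sq ft → $250,000
residual = actual - predicted
print(residual)                        # 30000 → the line under-predicts by $30K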
Why Squared Errors?
Why not just add up all the errors?
PROBLEM WITH SIMPLE SUM:
House 1: Error = +$30,000
House 2: Error = -$30,000
─────────────────────────
Sum of errors = $0
"Perfect!" ...but the line was wrong for BOTH houses!
Positive and negative errors cancel out.
Solution: Square the errors first!
SQUARED ERRORS:
House 1: Error² = (+$30,000)² = 900,000,000
House 2: Error² = (-$30,000)² = 900,000,000
─────────────────────────────────────────
Sum of squared errors = 1,800,000,000
Now negatives don't cancel positives!
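A three-line sketch makes the cancellation problem concrete (the ±$30,000 errors are the same illustrative numbers as above):
errors = [30_000, -30_000]              # one under-prediction, one over-prediction
print(sum(errors))                      # 0 → looks "perfect", yet both are wrong
print(sum(e ** 2 for e in errors))      # 1800000000 → squaring keeps both misses visible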
The Objective: Minimize Sum of Squared Errors (SSE)
LINEAR REGRESSION GOAL:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Find m and b that MINIMIZE:
SSE = Σ(yᵢ - ŷᵢ)²
    = Σ(yᵢ - (m·xᵢ + b))²
Where:
yᵢ = actual value
ŷᵢ = predicted value = m·xᵢ + b
This is called "Ordinary Least Squares" (OLS)
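To make the objective concrete, here is a small sketch that scores two candidate lines on a tiny made-up dataset (both the points and the (m, b) pairs are invented for illustration, not the article's house data):
import numpy as np

x = np.array([1000, 1500, 2000, 2500])
y = np.array([150_000, 200_000, 280_000, 310_000])

def sse(m, b):
    """Sum of squared errors for the line y = m·x + b."""
    return np.sum((y - (m * x + b)) ** 2)

print(f"Candidate A (m=80,  b=60,000): SSE = {sse(80, 60_000):,.0f}")
print(f"Candidate B (m=100, b=50,000): SSE = {sse(100, 50_000):,.0f}")
# The candidate with the lower SSE is "closer"; OLS finds the exact minimizer.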
Visualizing the Best Line
BAD LINE (High SSE): GOOD LINE (Low SSE):
━━━━━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━
│ × │ ×
│ │ │ /
│ │ BIG error │ / small error
│ │ │ /
│════════════ Line ×═══════════ Line
│ × │ \
│ │ │ \ small error
│ │ BIG error │ \
│ │ │ ×
Sum of squared errors: LARGE Sum of squared errors: SMALL
The Math: Finding the Best Line
Calculus gives us exact formulas for the best m and b:
THE CLOSED-FORM SOLUTION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Slope (m):
m = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
Intercept (b):
b = ȳ - m·x̄
Where:
x̄ = mean of all x values
ȳ = mean of all y values
In plain English:
- The slope measures how much x and y move together (covariance) relative to how much x varies (variance)
- The intercept ensures the line passes through the point (mean of x, mean of y)
Code: Linear Regression from Scratch
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data: House prices
np.random.seed(42)
square_feet = np.random.uniform(1000, 3000, 100)
# True relationship: Price = $50,000 + $100/sqft + noise
true_price = 50000 + 100 * square_feet + np.random.normal(0, 30000, 100)
X = square_feet
y = true_price
# ============================================================
# LINEAR REGRESSION FROM SCRATCH
# ============================================================
def linear_regression_scratch(X, y):
    """
    Find the best line y = mx + b using least squares.
    """
    n = len(X)

    # Calculate means
    x_mean = np.mean(X)
    y_mean = np.mean(y)

    # Calculate slope (m)
    # m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
    numerator = np.sum((X - x_mean) * (y - y_mean))
    denominator = np.sum((X - x_mean) ** 2)
    m = numerator / denominator

    # Calculate intercept (b)
    # b = ȳ - m·x̄
    b = y_mean - m * x_mean

    return m, b
# Find the best line
slope, intercept = linear_regression_scratch(X, y)
print("LINEAR REGRESSION RESULTS (from scratch)")
print("="*50)
print(f"Equation: Price = ${intercept:,.0f} + ${slope:.2f} × SquareFeet")
print(f"\nInterpretation:")
print(f" Base price (0 sq ft): ${intercept:,.0f}")
print(f" Each additional sq ft adds: ${slope:.2f}")
print(f"\nPredictions:")
for sqft in [1000, 1500, 2000, 2500]:
    pred = intercept + slope * sqft
    print(f" {sqft} sq ft → ${pred:,.0f}")
Output:
LINEAR REGRESSION RESULTS (from scratch)
==================================================
Equation: Price = $47,892 + $101.34 × SquareFeet
Interpretation:
Base price (0 sq ft): $47,892
Each additional sq ft adds: $101.34
Predictions:
1000 sq ft → $149,232
1500 sq ft → $199,902
2000 sq ft → $250,572
2500 sq ft → $301,242
Code: Using Scikit-Learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Reshape X for sklearn (needs 2D array)
X_reshaped = X.reshape(-1, 1)
# Create and fit model
model = LinearRegression()
model.fit(X_reshaped, y)
# Get parameters
slope_sklearn = model.coef_[0]
intercept_sklearn = model.intercept_
print("LINEAR REGRESSION RESULTS (sklearn)")
print("="*50)
print(f"Slope: {slope_sklearn:.2f}")
print(f"Intercept: {intercept_sklearn:,.0f}")
print(f"\nSame as our scratch implementation? {np.isclose(slope, slope_sklearn)}")
# Make predictions
y_pred = model.predict(X_reshaped)
# Evaluate
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)
print(f"\nModel Performance:")
print(f" RMSE: ${rmse:,.0f}")
print(f" R² Score: {r2:.4f}")
print(f" → The model explains {r2*100:.1f}% of the variance in prices")
Output:
LINEAR REGRESSION RESULTS (sklearn)
==================================================
Slope: 101.34
Intercept: 47,892
Same as our scratch implementation? True
Model Performance:
RMSE: $29,847
R² Score: 0.7823
→ The model explains 78.2% of the variance in prices
Visualizing the Line
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
# Plot data points
plt.scatter(X, y, alpha=0.6, label='Actual houses')
# Plot regression line
X_line = np.array([X.min(), X.max()])
y_line = intercept + slope * X_line
plt.plot(X_line, y_line, 'r-', linewidth=2, label=f'Best fit: y = {intercept:.0f} + {slope:.1f}x')
# Plot some residuals
for i in range(0, len(X), 10):  # Every 10th point
    y_pred_i = intercept + slope * X[i]
    plt.plot([X[i], X[i]], [y[i], y_pred_i], 'g--', alpha=0.5)
plt.xlabel('Square Feet', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.title('Linear Regression: Finding the Best Line', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
# Format y-axis as currency
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))
plt.tight_layout()
plt.savefig('linear_regression.png', dpi=150)
plt.show()
The Geometry: Why This Line?
THE BEST LINE HAS A SPECIAL PROPERTY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. It passes through the "center of mass" of the data
→ The point (mean(x), mean(y)) is ALWAYS on the line
2. The sum of residuals is ZERO
→ Positive and negative errors balance out perfectly
3. No other straight line has smaller squared errors
→ It's mathematically optimal for this objective
4. If the true relationship IS linear (and the errors are uncorrelated
   with constant variance), OLS gives the best linear unbiased
   estimate (Gauss-Markov theorem)
# Verify property 1: Line passes through (mean_x, mean_y)
mean_x = np.mean(X)
mean_y = np.mean(y)
predicted_at_mean_x = intercept + slope * mean_x
print(f"Mean of X: {mean_x:.1f}")
print(f"Mean of Y: ${mean_y:,.0f}")
print(f"Predicted Y at mean X: ${predicted_at_mean_x:,.0f}")
print(f"Difference: ${abs(mean_y - predicted_at_mean_x):,.2f}") # Should be ~0
# Verify property 2: Sum of residuals is zero
residuals = y - (intercept + slope * X)
print(f"\nSum of residuals: ${np.sum(residuals):,.2f}") # Should be ~0
Output:
Mean of X: 1987.4
Mean of Y: $249,278
Predicted Y at mean X: $249,278
Difference: $0.00
Sum of residuals: $0.00
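Property 3 can also be sanity-checked numerically. The sketch below nudges the fitted slope and intercept by arbitrary made-up amounts and confirms the SSE only goes up:
def sse_of(m, b):
    return np.sum((y - (m * X + b)) ** 2)

best_sse = sse_of(slope, intercept)
for dm, db in [(1, 0), (-1, 0), (0, 5000), (0, -5000)]:   # arbitrary nudges
    worse_sse = sse_of(slope + dm, intercept + db)
    print(f"Δm={dm:+}, Δb={db:+}: SSE increases by {worse_sse - best_sse:,.0f}")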
Multiple Linear Regression
What if price depends on MORE than just square feet?
import numpy as np
from sklearn.linear_model import LinearRegression
# Multiple features
np.random.seed(42)
n = 500
square_feet = np.random.uniform(1000, 3000, n)
bedrooms = np.random.randint(1, 6, n)
bathrooms = np.random.randint(1, 4, n)
age = np.random.uniform(0, 50, n)
# True relationship (unknown in real life)
price = (
    50000 +
    100 * square_feet +
    15000 * bedrooms +
    20000 * bathrooms -
    1000 * age +
    np.random.normal(0, 25000, n)  # Noise
)
# Combine features into matrix
X = np.column_stack([square_feet, bedrooms, bathrooms, age])
y = price
# Fit multiple linear regression
model = LinearRegression()
model.fit(X, y)
print("MULTIPLE LINEAR REGRESSION")
print("="*60)
print(f"\nEquation:")
print(f"Price = ${model.intercept_:,.0f}")
print(f" + ${model.coef_[0]:.2f} × SquareFeet")
print(f" + ${model.coef_[1]:,.0f} × Bedrooms")
print(f" + ${model.coef_[2]:,.0f} × Bathrooms")
print(f" + ${model.coef_[3]:,.0f} × Age")
print(f"\nInterpretation:")
print(f" Each sq ft adds ${model.coef_[0]:.2f}")
print(f" Each bedroom adds ${model.coef_[1]:,.0f}")
print(f" Each bathroom adds ${model.coef_[2]:,.0f}")
print(f" Each year of age {'adds' if model.coef_[3] > 0 else 'subtracts'} ${abs(model.coef_[3]):,.0f}")
# Example prediction
new_house = np.array([[2000, 3, 2, 10]]) # 2000 sqft, 3 bed, 2 bath, 10 years old
predicted_price = model.predict(new_house)[0]
print(f"\nPrediction for 2000sqft, 3bed, 2bath, 10yr old:")
print(f" ${predicted_price:,.0f}")
Output:
MULTIPLE LINEAR REGRESSION
============================================================
Equation:
Price = $48,234
+ $100.45 × SquareFeet
+ $14,876 × Bedrooms
+ $20,234 × Bathrooms
- $987 × Age
Interpretation:
Each sq ft adds $100.45
Each bedroom adds $14,876
Each bathroom adds $20,234
Each year of age subtracts $987
Prediction for 2000sqft, 3bed, 2bath, 10yr old:
$324,360
The Assumptions of Linear Regression
Linear regression makes assumptions. Violate them and your model may be garbage:
THE FOUR KEY ASSUMPTIONS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. LINEARITY
The relationship between X and Y is actually linear
(not curved, not exponential, not logarithmic)
2. INDEPENDENCE
Each data point is independent of others
(not time series where today depends on yesterday)
3. HOMOSCEDASTICITY
The variance of errors is constant across all X values
(not bigger errors for bigger houses)
4. NORMALITY
The residuals are normally distributed
(required for confidence intervals and hypothesis tests)
Checking Assumptions
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
def check_assumptions(X, y, model):
    """Check linear regression assumptions."""
    y_pred = model.predict(X.reshape(-1, 1) if X.ndim == 1 else X)
    residuals = y - y_pred

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # 1. Linearity: Predicted vs Actual
    ax1 = axes[0, 0]
    ax1.scatter(y_pred, y, alpha=0.5)
    ax1.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', linewidth=2)
    ax1.set_xlabel('Predicted')
    ax1.set_ylabel('Actual')
    ax1.set_title('1. Linearity Check: Predicted vs Actual')

    # 2. Homoscedasticity: Residuals vs Predicted
    ax2 = axes[0, 1]
    ax2.scatter(y_pred, residuals, alpha=0.5)
    ax2.axhline(y=0, color='r', linestyle='--')
    ax2.set_xlabel('Predicted')
    ax2.set_ylabel('Residuals')
    ax2.set_title('2. Homoscedasticity: Residuals vs Predicted')

    # 3. Normality: Histogram of residuals
    ax3 = axes[1, 0]
    ax3.hist(residuals, bins=30, density=True, alpha=0.7)
    xmin, xmax = ax3.get_xlim()
    x = np.linspace(xmin, xmax, 100)
    p = stats.norm.pdf(x, residuals.mean(), residuals.std())
    ax3.plot(x, p, 'r-', linewidth=2, label='Normal distribution')
    ax3.set_xlabel('Residuals')
    ax3.set_ylabel('Density')
    ax3.set_title('3. Normality: Histogram of Residuals')
    ax3.legend()

    # 4. Normality: Q-Q plot
    ax4 = axes[1, 1]
    stats.probplot(residuals, dist="norm", plot=ax4)
    ax4.set_title('4. Normality: Q-Q Plot')

    plt.tight_layout()
    plt.savefig('assumption_checks.png', dpi=150)
    plt.show()

    # Statistical tests
    print("\nASSUMPTION CHECKS")
    print("="*50)

    # Normality test
    _, p_value = stats.shapiro(residuals[:500] if len(residuals) > 500 else residuals)
    print(f"Shapiro-Wilk Normality Test: p-value = {p_value:.4f}")
    print(f" → {'✓ Residuals appear normal' if p_value > 0.05 else '⚠️ Residuals may not be normal'}")

    # Mean of residuals should be ~0
    print(f"\nMean of residuals: {residuals.mean():.4f}")
    print(f" → {'✓ Centered around 0' if abs(residuals.mean()) < 0.01 * y.std() else '⚠️ Not centered'}")
# Run the checks on the multiple regression model and data fitted above
check_assumptions(X, y, model)
When Linear Regression Fails
Problem 1: Non-Linear Relationship
THE DATA: WHAT LINEAR REGRESSION SEES:
━━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
│ ××× │ ×××
│ ×× × │ ××───────── Line misses
│ × × │ × × the curve!
│ × × │ × ×
│ × × │ × ×
│× × │× ×
└──────────────── └────────────────
The relationship is CURVED.
A straight line can't capture it.
Solution: Transform features (e.g., log, polynomial) or use non-linear models.
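One common fix stays inside the linear regression toolbox: engineer new features so the relationship becomes linear in the transformed space. A hedged sketch with scikit-learn's PolynomialFeatures, using synthetic curved data invented for this example:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = (x.ravel() - 5) ** 2 + rng.normal(0, 1, 200)   # a U-shaped (curved) relationship

straight = LinearRegression().fit(x, y)
curved = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

print(f"Straight line R²:     {straight.score(x, y):.3f}")   # near 0: the line can't see the U shape
print(f"Degree-2 features R²: {curved.score(x, y):.3f}")     # near 1: the curve is captured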
Problem 2: Outliers
THE DATA: WITH OUTLIER:
━━━━━━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━━
│ │ × OUTLIER!
│ × │ × /
│ × × × │ × × × /
│ × × × × │ × × × × / Line pulled
│× × × × × │× × × × × / toward outlier
└──────────── └────────────
One extreme point pulls the entire line!
Solution: Remove outliers, use robust regression, or use regularization.
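Here is a hedged sketch of the robust-regression route, comparing plain OLS with scikit-learn's HuberRegressor on synthetic data with one planted high-leverage outlier (the data and the outlier are invented for illustration):
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(1000, 3000, 100).reshape(-1, 1)
y = 50_000 + 100 * x.ravel() + rng.normal(0, 20_000, 100)
x[0], y[0] = 3000, 2_000_000                      # one absurd, high-leverage sale

ols = LinearRegression().fit(x, y)
huber = HuberRegressor(max_iter=1000).fit(x, y)

print(f"OLS slope:   {ols.coef_[0]:.1f}")    # dragged away from ~100 by the outlier
print(f"Huber slope: {huber.coef_[0]:.1f}")  # typically stays much closer to ~100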
Problem 3: Multicollinearity
# When features are highly correlated with each other
# Example: Square feet AND number of bedrooms
# (Bigger houses have more rooms — they're measuring the same thing!)
correlation = np.corrcoef(square_feet, bedrooms)[0, 1]
print(f"Correlation between sqft and bedrooms: {correlation:.2f}")
# Note: our synthetic square_feet and bedrooms were generated independently,
# so this prints a correlation near zero; in real housing data it is typically high.
# Problem: Coefficients become unstable and hard to interpret
# "Each bedroom adds $50,000 but each sqft SUBTRACTS $10" — nonsense!
Solution: Remove redundant features, use regularization (Ridge, Lasso).
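A hedged sketch of the regularization route: two nearly duplicate features (made up here) can confuse plain OLS, while Ridge shrinks the coefficients toward something more stable. The features are standardized first so the penalty treats them equally:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
sqft = rng.uniform(1000, 3000, 50)
rooms = sqft / 400 + rng.normal(0, 0.1, 50)    # almost a rescaled copy of sqft
X = StandardScaler().fit_transform(np.column_stack([sqft, rooms]))
y = 50_000 + 100 * sqft + rng.normal(0, 20_000, 50)

print(LinearRegression().fit(X, y).coef_)   # the two coefficients can fight each other wildly
print(Ridge(alpha=10.0).fit(X, y).coef_)    # shrunk and shared more evenly between the twins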
Linear Regression: The Complete Picture
LINEAR REGRESSION SUMMARY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHAT IT DOES:
Finds the straight line (or hyperplane) that minimizes
the sum of squared errors between predictions and actual values.
THE EQUATION:
Simple: y = mx + b
Multiple: y = b + m₁x₁ + m₂x₂ + ... + mₙxₙ
THE OBJECTIVE:
Minimize SSE = Σ(yᵢ - ŷᵢ)²
STRENGTHS:
✓ Simple and interpretable
✓ Fast to train (closed-form solution)
✓ Works well when relationship is actually linear
✓ Coefficients have clear meaning
✓ Foundation for many other algorithms
WEAKNESSES:
✗ Assumes linear relationship
✗ Sensitive to outliers
✗ Can't capture complex patterns
✗ Struggles with multicollinearity
WHEN TO USE:
• Relationship appears linear
• Interpretability is important
• Quick baseline model needed
• Data size is moderate
Quick Reference
The Formulas
| Component | Formula |
|---|---|
| Prediction | ŷ = mx + b |
| Slope | m = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)² |
| Intercept | b = ȳ - m·x̄ |
| SSE | Σ(y - ŷ)² |
| R² | 1 - SSE/SST |
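For reference, R² can be computed by hand from the same pieces; a short sketch, assuming y and y_pred are NumPy arrays of actual and predicted values:
import numpy as np

sse = np.sum((y - y_pred) ** 2)         # unexplained variation (sum of squared errors)
sst = np.sum((y - np.mean(y)) ** 2)     # total variation around the mean
r2 = 1 - sse / sst
print(f"R² = {r2:.4f}")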
The Code
# Scikit-learn (recommended)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X_new)
# Coefficients
print(f"Slope: {model.coef_}")
print(f"Intercept: {model.intercept_}")
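A quick usage note, sketched under the assumption that X and y are already loaded: evaluating on held-out data gives a more honest score than scoring the training set.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(f"Held-out R²: {r2_score(y_test, model.predict(X_test)):.3f}")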
The Checklist
Before using linear regression:
□ Is the relationship approximately linear?
□ Are residuals normally distributed?
□ Is variance constant across X values?
□ Are observations independent?
□ Are features not too correlated with each other?
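For the last item, one quick (if rough) check is the feature correlation matrix; a sketch assuming X is a 2-D feature matrix with one feature per column:
import numpy as np

corr = np.corrcoef(X, rowvar=False)   # feature-by-feature correlation matrix
print(np.round(corr, 2))              # off-diagonal values near ±1 signal trouble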
Key Takeaways
Linear regression finds the best straight line — Minimizes sum of squared errors
"Best" means minimum squared errors — Squaring prevents positive/negative cancellation
There's a closed-form solution — No iteration needed, direct calculation
The line passes through (mean_x, mean_y) — Always!
Coefficients are interpretable — "Each unit of X adds m units to Y"
Assumptions matter — Linearity, independence, homoscedasticity, normality
Sensitive to outliers — One extreme point can ruin everything
Foundation for everything else — Ridge, Lasso, neural networks all build on this
The One-Sentence Summary
Amanda drew one straight line through 100 data points and called it a pricing model — linear regression is that line, found by minimizing the sum of squared vertical distances from every point, giving you the "best compromise" that gets close to everyone even though it perfectly fits no one.
What's Next?
This is the first article in the Linear Models series. Coming up:
- Ridge Regression — When your line is too wiggly
- Lasso Regression — When you have too many features
- Elastic Net — The best of both worlds
- Polynomial Regression — When a straight line isn't enough
- Logistic Regression — Linear models for classification
Follow me for the next article in this series!
Let's Connect!
If the "lazy architect" finally made linear regression click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's the first thing you predicted with linear regression? Mine was predicting exam scores from study hours. Simple, but it worked! 📚
The difference between "I have data, what should I do?" and "I have a prediction"? Often just one line of code: model.fit(X, y). Linear regression is 200 years old and still one of the most useful algorithms in data science. Respect the classics.
Share this with someone who's mystified by the math behind ML. Linear regression is where it all begins — and it's simpler than they think.
Happy regressing! 📈