Linear Regression Explained Simply (Using Only 3 Houses)

Step 1: Imagine you have only 3 houses

House size (x)   Real price (y)
1                1.5
2                2.2
3                2.9
  • x = size in thousands of square feet
  • y = price in hundreds of thousands of dollars

Your goal: predict the price from the size using a straight line.


Step 2: What does a straight line look like?

Every straight-line prediction model follows:

predicted price = (some number) × size + (another number)

We name these numbers:

  • w → weight/slope
  • b → bias/intercept

So the model is:

ŷ = w × x + b

Where ŷ means predicted y.
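
If it helps to see this in code, here is a tiny Python sketch of the model (the variable and function names are just for illustration):

# The 3 houses: x in thousands of square feet, y in hundreds of thousands of dollars
x = [1, 2, 3]
y = [1.5, 2.2, 2.9]

def predict(size, w, b):
    # straight-line model: y-hat = w * size + b
    return w * size + b

print(predict(2, w=0.5, b=1.0))   # 2.0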


Step 3: Pick a random line to start

Let's guess:

w = 0.5
b = 1.0

Now compute predictions:

x (size)   Prediction ŷ = 0.5x + 1.0   Real y   Error (ŷ − y)
1          1.5                         1.5      0
2          2.0                         2.2      −0.2
3          2.5                         2.9      −0.4

The first prediction is exact, but the other two are too low.


Step 4: Convert “a bit wrong” into ONE number

We need a single value describing how bad the line is.
Simply adding up the raw errors is not reliable, because positive and negative errors can cancel each other out.

So we use two tricks:

1. Square the errors

  • 0² = 0
  • (−0.2)² = 0.04
  • (−0.4)² = 0.16

2. Take the average

MSE = (0 + 0.04 + 0.16) / 3
    = 0.2 / 3
    ≈ 0.067

This is the Mean Squared Error (MSE).

Smaller MSE = better line.
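
To check these numbers yourself, here is the same MSE computation as a small Python sketch (variable names are illustrative):

x = [1, 2, 3]
y = [1.5, 2.2, 2.9]
w, b = 0.5, 1.0

predictions = [w * xi + b for xi in x]                      # [1.5, 2.0, 2.5]
squared_errors = [(p - yi) ** 2 for p, yi in zip(predictions, y)]
mse = sum(squared_errors) / len(x)
print(round(mse, 4))                                        # 0.0667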


Step 5: Try another line and compare

New guess:

w = 0.8
b = 0.7

x   Prediction ŷ = 0.8x + 0.7   Real y   Error (ŷ − y)   Squared Error
1   1.5                         1.5      0               0
2   2.3                         2.2      +0.1            0.01
3   3.1                         2.9      +0.2            0.04

MSE = (0 + 0.01 + 0.04) / 3
    ≈ 0.0167

Much smaller than 0.067, so this line is better.


Step 6: The goal

Try many combinations of w and b until you find the ones that give the smallest possible MSE.

That best pair:

(best w, best b)

is the optimal straight line for your data.


Final takeaway

Linear regression is:

“Find the straight line that makes the average squared error as small as possible.”


Why Not Brute Force Linear Regression? Introducing Gradient Descent

When we try millions of combinations of w and b to find the best line, we are doing brute force search.

Why is brute force a bad idea?

  1. It takes too long.
    Trying millions of pairs of parameters becomes extremely slow.

  2. It does not scale.

    With 2 parameters and 3 houses, brute force is fine.
    With many features (and therefore many parameters) and far more data, the number of combinations explodes and a computer would struggle.

So instead of guessing randomly, we use a far smarter method.


The Clever Trick: Ask the Loss Function for Directions

We treat the MSE (Mean Squared Error) like a landscape:

  • Every pair (w, b) is a point on the surface.
  • The height of that point is the MSE at those parameter values.
  • The lowest height = the best line.

Think of it as a bowl-shaped valley.
Your job is to walk to the bottom.

But here’s the key idea:

Mathematics can tell you exactly which direction is downhill from where you stand.

That direction is called the gradient.


What the Gradient Tells Us

At your current values:

w = 0.5
b = 1.0
MSE ≈ 0.067

We ask:

1. “If I increase w a tiny bit (+0.01), does MSE go up or down?”

  • If MSE goes up → the slope is positive → move w downward (decrease w).
  • If MSE goes down → the slope is negative → move w upward (increase w).
  • Here, MSE goes down, so w should increase.

2. “If I increase b a tiny bit (+0.01), does MSE go up or down?”

  • Here, MSE also goes down → the slope is negative → move b upward (increase b).

So the gradient tells us:

  • Move w slightly up.
  • Move b slightly up.
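
You can verify these directions numerically with a tiny finite-difference probe, a Python sketch (the nudge size 0.01 is arbitrary):

x = [1, 2, 3]
y = [1.5, 2.2, 2.9]

def mse(w, b):
    return sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

base = mse(0.5, 1.0)                  # ≈ 0.067
print(mse(0.51, 1.0) - base)          # negative → increasing w lowers MSE
print(mse(0.5, 1.01) - base)          # negative → increasing b lowers MSE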

The Update Rule (Gradient Descent)

We update both parameters:

new w = current w − (learning rate × slope_w)
new b = current b − (learning rate × slope_b)

Then:

  1. Recalculate the new MSE
  2. Recalculate the slopes
  3. Take another step downhill
  4. Repeat 20–100 times

Instead of testing millions of combinations, we follow the downhill slope directly to the minimum.
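
As a rough sketch of that loop in Python: the code below estimates the slopes numerically (the exact gradient formulas come later), using an arbitrary learning rate of 0.1 and 100 steps.

x = [1, 2, 3]
y = [1.5, 2.2, 2.9]

def mse(w, b):
    return sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

w, b, lr, eps = 0.5, 1.0, 0.1, 1e-6
for _ in range(100):
    slope_w = (mse(w + eps, b) - mse(w, b)) / eps   # how MSE reacts to a tiny change in w
    slope_b = (mse(w, b + eps) - mse(w, b)) / eps   # how MSE reacts to a tiny change in b
    w -= lr * slope_w                               # step downhill
    b -= lr * slope_b
print(round(w, 2), round(b, 2))                     # close to the best line (w ≈ 0.7, b ≈ 0.8)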


Why Linear Regression Still Feels Like a Mystery (And What Is Actually Happening)

Now you know the basic idea of linear regression, but the internal mechanics can still feel mysterious.
This walkthrough removes the mystery by showing exactly what is happening inside gradient descent, step by step, using the same 3-house example.


Our Dataset

x (size)   y (real price)
1          1.5
2          2.2
3          2.9

We start with a random guess:

w = 0.5
b = 1.0

Step 1: Make Predictions

Using ŷ = w·x + b:

  • House 1 → 0.5×1 + 1.0 = 1.5
  • House 2 → 0.5×2 + 1.0 = 2.0
  • House 3 → 0.5×3 + 1.0 = 2.5

Step 2: Compute Errors

error = y − ŷ

  • House 1: 1.5 − 1.5 = 0
  • House 2: 2.2 − 2.0 = +0.2
  • House 3: 2.9 − 2.5 = +0.4

These errors determine how we must adjust w and b.


Step 3: What Happens if w Changes a Little?

Increase w slightly:

new w = 0.51
b stays = 1.0

New predictions:

  • House 1 → 1.51
  • House 2 → 2.02
  • House 3 → 2.53

New errors (y − ŷ):

  • House 1: −0.01
  • House 2: +0.18
  • House 3: +0.37

The overall squared error becomes slightly smaller.
Conclusion: increasing w makes the model better → w should be increased.

How much the error changes for a tiny change in w is exactly the gradient with respect to w.


Step 4: The Gradient Formula (No Mystery Anymore)

For linear regression, the exact slope (gradient) of MSE tells us how to update w:

gradient_w = −2 × average(x × error)

Compute it, using error = y − ŷ:

  • House 1 → 1 × 0 = 0
  • House 2 → 2 × 0.2 = 0.4
  • House 3 → 3 × 0.4 = 1.2

Sum = 0 + 0.4 + 1.2 = 1.6
Average = 1.6 / 3 ≈ 0.533
Apply −2:

gradient_w ≈ −1.07

This negative gradient means: increasing w reduces the error, so the update (which moves opposite the gradient) will increase w.


Step 5: Gradient for b

gradient_b = −2 × average(error)

Average error = (0 + 0.2 + 0.4) / 3 = 0.2
Apply −2:

gradient_b ≈ −0.4

Same direction: increasing b slightly also reduces the error at this point.


Step 6: Update w and b

General update rule:

new_value = old_value − learning_rate × gradient

With a learning rate of 0.1:

w_new = 0.5 − 0.1 × (−1.07) ≈ 0.61
b_new = 1.0 − 0.1 × (−0.4) = 1.04

After just one update step, the MSE drops from about 0.067 to about 0.009.
The model is already noticeably better.


What Gradient Descent Is Really Doing

Gradient descent repeatedly performs these simple steps:

  1. Compute each prediction ŷ
  2. Compute each error (y − ŷ)
  3. Multiply errors by x to understand how each house influences w
  4. Average those influence values
  5. Adjust w toward lower error
  6. Adjust b using the average error
  7. Repeat 50–200 times

This is the entire mechanism behind linear regression training.
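
Here is that whole mechanism as a minimal Python sketch, with each numbered step marked in the comments (the learning rate 0.1 and 200 repetitions are arbitrary but work well for this tiny dataset):

x = [1, 2, 3]
y = [1.5, 2.2, 2.9]
w, b, lr = 0.5, 1.0, 0.1
n = len(x)

for _ in range(200):                                              # 7. repeat
    preds = [w * xi + b for xi in x]                              # 1. predictions
    errors = [yi - pi for yi, pi in zip(y, preds)]                # 2. errors (y − ŷ)
    grad_w = -2 * sum(e * xi for e, xi in zip(errors, x)) / n     # 3.–4. influence on w, averaged
    grad_b = -2 * sum(errors) / n                                 # 6. average error for b
    w -= lr * grad_w                                              # 5. adjust w toward lower error
    b -= lr * grad_b                                              #    adjust b
print(round(w, 2), round(b, 2))                                   # approaches 0.7 and 0.8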


Linear Regression Explained in Complete Beginner Mode


Part 1: What Linear Regression Is Trying to Do

We have houses.
For each house we know:

  • x = size of the house
  • y = real selling price

We want a straight line that predicts price from size:

predicted price = w × x + b
  • w = how much price increases when size increases by 1
  • b = base price when size is zero

Our goal is simple:
Find the best possible w and b.


Part 2: How We Measure “Best”

We measure how wrong our line is using MSE (Mean Squared Error).

For each house:

  1. Predict the price: ŷ = w×x + b
  2. Compute error: y − ŷ
  3. Square the error: (y − ŷ)²
  4. Add squared errors for all houses
  5. Divide by number of houses

Formula:

MSE = (1/N) × Σ (y − ŷ)²

Smaller MSE = a better line.

This is the only quantity we try to minimize.


Part 3: The Key Idea — Nudge w and b in the Right Direction

We want to adjust w and b so that MSE gets smaller.

Imagine nudging w slightly upward (by something tiny like +0.0001).
Two possibilities:

  • If MSE increases → wrong direction; w should move down
  • If MSE decreases → correct direction; w should move up

The amount MSE changes when w changes slightly is the slope or gradient.

Same idea applies to b.


Part 4: Deriving the Gradient in Simple Arithmetic

Start with one house.
Its squared error is:

(y − (w×x + b))²

Let:

e = y − (w×x + b)

Then squared error = e².

How does e² change when w changes slightly?

A basic math rule:

change in (e²) = 2 × e × (change in e)

What is the change in e when w increases?

e = y − w×x − b

If w increases, w×x increases, so e decreases:

change in e = −x

Thus:

change in (e²) = 2 × e × (−x) = −2 × e × x

This is for one house.

For all houses, we sum and average:

gradient_w = (1/N) × Σ [ −2 × (y − ŷ) × x ]

Gradient for b is easier:

change in e when b changes = −1
gradient_b = (1/N) × Σ [ −2 × (y − ŷ) ]

These are the exact formulas every linear regression library uses.
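
To convince yourself the formula is right, you can compare it with a brute-force numerical estimate, as in this sketch (the nudge size 1e-6 is arbitrary):

x = [1, 2, 3]
y = [1.5, 2.2, 2.9]
w, b = 0.5, 1.0
n = len(x)

def mse(w_, b_):
    return sum((yi - (w_ * xi + b_)) ** 2 for xi, yi in zip(x, y)) / n

analytic = -2 * sum((yi - (w * xi + b)) * xi for xi, yi in zip(x, y)) / n
numeric = (mse(w + 1e-6, b) - mse(w, b)) / 1e-6
print(round(analytic, 4), round(numeric, 4))    # both ≈ -1.0667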


Part 5: Final Gradient Formulas

gradient_w = −(2/N) × Σ [ (y − ŷ) × x ]
gradient_b = −(2/N) × Σ (y − ŷ)

To reduce MSE, we move opposite the gradient:

w ← w − learning_rate × gradient_w
b ← b − learning_rate × gradient_b

Or, expanding the negatives:

w ← w + learning_rate × (2/N) × Σ [ (y − ŷ) × x ]
b ← b + learning_rate × (2/N) × Σ (y − ŷ)

This is the complete update rule used in gradient descent.


Part 6: Full Example Done Completely by Hand

Our dataset:

x   y
1   1.5
2   2.2
3   2.9

Start with a very poor guess:

w = 0
b = 0

Step 1: Predictions

All predictions are zero:

ŷ1 = 0
ŷ2 = 0
ŷ3 = 0

Errors:

1.5, 2.2, 2.9

Step 2: Update w

Compute average of (error × x):

(1.5×1 + 2.2×2 + 2.9×3) / 3
= (1.5 + 4.4 + 8.7) / 3
= 14.6 / 3
≈ 4.867

With learning rate = 0.1 and factor 2:

new w ≈ 0 + 0.1 × 2 × 4.867 ≈ 0.973

Step 3: Update b

Average error:

(1.5 + 2.2 + 2.9) / 3 = 7.6 / 3 ≈ 2.533

Update:

new b ≈ 0 + 0.1 × 2 × 2.533 ≈ 0.507

After just one update, parameters jump from (0, 0) → approximately (0.97, 0.51).
This is already much closer to the optimal line.

Repeat 20–50 steps and the updates stabilize.
Those final w and b are the best-fitting straight line for the data.

This is essentially what happens inside a machine learning library when it trains with gradient descent via .fit() (as we'll see next, some libraries solve simple linear regression with a closed-form shortcut instead).
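
For comparison, here is a quick scikit-learn sketch; its LinearRegression actually uses the closed-form least-squares solution covered next, but it recovers the same line:

from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3]])     # sizes, one column per feature
y = np.array([1.5, 2.2, 2.9])     # prices

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # ≈ 0.7 and ≈ 0.8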


Summary in Plain Language

  1. Start with random w and b.
  2. Compute predictions for all houses.
  3. Compute errors (y − ŷ).
  4. To update w: multiply each error by its x, average them, and nudge w in that direction.
  5. To update b: average all errors and nudge b in that direction.
  6. Repeat until nothing changes.

There is no hidden machinery.
Only simple arithmetic repeated many times.


Two Ways to Solve Linear Regression: Gradient Descent vs the Closed-Form Formula

Now that gradient descent makes sense, it is important to know that there is actually another method to compute the best line. In fact, for simple linear regression, there is a formula that gives the perfect answer in one step with no looping at all.

There are two approaches:

  1. Gradient Descent → takes many small steps, works for any model
  2. Closed-Form Solution (Ordinary Least Squares) → gives exact w and b instantly

For basic linear regression, the closed-form method is faster, simpler, and exact.


The Closed-Form Formula (One-Step Solution)

For simple linear regression with one feature x, the optimal slope and intercept are:

w = Σ[(x − x_mean)(y − y_mean)] / Σ[(x − x_mean)²]
b = y_mean − w × x_mean

This computes the best-fit line in one calculation.


Applying the Formula to Our Example

Dataset:

x   y
1   1.5
2   2.2
3   2.9

Step 1: Compute Means

x_mean = (1 + 2 + 3) / 3 = 2
y_mean = (1.5 + 2.2 + 2.9) / 3 = 2.2

Step 2: Build the Deviation Table

x   y     x−2   y−2.2   (x−2)(y−2.2)   (x−2)²
1   1.5   −1    −0.7    0.7            1
2   2.2   0     0       0              0
3   2.9   1     0.7     0.7            1

Step 3: Sum the Required Columns

Σ(x−mean)(y−mean) = 1.4
Σ(x−mean)² = 2

Step 4: Apply the Formula

w = 1.4 / 2 = 0.7
b = 2.2 − 0.7 × 2 = 0.8

Final model:

price = 0.7 × size + 0.8

Check Against Data

  • x = 1 → 0.7 + 0.8 = 1.5
  • x = 2 → 1.4 + 0.8 = 2.2
  • x = 3 → 2.1 + 0.8 = 2.9

The line fits all three points exactly.
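
The same calculation in a short NumPy sketch (names are just illustrative):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.2, 2.9])

w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()
print(w, b)   # 0.7 and 0.8 (up to floating-point precision)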


Why This Formula Works (Intuition)

The numerator:

Σ[(x − x_mean)(y − y_mean)]

measures how much x and y move together.

  • If x is above average and y is also above average → positive contribution
  • If they move in opposite directions → negative contribution

The denominator:

Σ[(x − x_mean)²]

measures how much x varies on its own.

Thus:

w = (movement together) / (movement of x alone)

Once the slope is fixed, the intercept b simply shifts the line so it passes through the point:

(x_mean, y_mean)

The Matrix Version (for multiple features)

In general linear algebra form:

w = (XᵀX)⁻¹ Xᵀ y

This is the Ordinary Least Squares (OLS) solution.
For one feature, it reduces exactly to the two formulas we computed.
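
In code, this can be sketched as follows; a column of ones is appended so the intercept b is learned alongside w, and a least-squares solver is used instead of an explicit matrix inverse (which is what you would do in practice):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.2, 2.9])

X = np.column_stack([x, np.ones_like(x)])     # features plus a column of 1s for the intercept
w, b = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares solution of X @ [w, b] = y
print(w, b)                                   # ≈ 0.7 and ≈ 0.8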


Summary: Gradient Descent vs Closed-Form

Method             Steps Required       Loop Needed   Exact?        Works for Huge Data?
Gradient Descent   Many small updates   Yes           Approximate   Yes
Closed-Form OLS    One computation      No            Exact         Only if the data fits in RAM

For simple linear regression, the closed-form method is ideal.
For complex models (neural networks, large datasets, many parameters), gradient descent is required.


Why We Still Use Gradient Descent When a Perfect Closed-Form Formula Exists

After learning the closed-form solution for linear regression, it is natural to wonder:

“If we can compute w and b instantly, why do we ever bother with gradient descent?”

The short answer:
The closed-form formula is excellent for small problems, but it breaks down completely once the model or dataset becomes large.
Gradient descent, in contrast, scales to extremely large modern machine-learning problems.


Comparison: Closed-Form vs Gradient Descent

  • 1 feature, 100 data points → closed-form: instant and exact; gradient descent: works but slower. Winner: closed-form.
  • 10 features, 1M data points → both work fine.
  • 1,000+ features → closed-form must build a large XᵀX matrix (high memory); gradient descent computes updates step by step (efficient). Winner: gradient descent.
  • 100,000+ features (e.g., bag-of-words text features) → XᵀX is enormous and cannot fit in RAM; gradient descent still works with manageable memory. Winner: gradient descent.
  • Neural networks (millions or billions of parameters) → no closed-form solution exists; gradient descent is designed for such models. Winner: gradient descent.
  • Streaming/online data → closed-form must recompute from scratch; gradient descent updates incrementally. Winner: gradient descent.
  • Adding regularization (L1/L2) → the closed form becomes more complex (and for L1 it no longer exists); gradient descent needs only a small modification. Winner: gradient descent (usually simpler).

Why Closed-Form Breaks in Real Life

Example: Large tabular dataset

A housing dataset with:

  • 1,000,000 houses
  • 500 features

XᵀX becomes a 500 × 500 matrix → manageable.

But modern machine learning rarely has 500 features.
Instead, consider:

Example: Image or text models

A feature vector might have:

  • 100,000 dimensions (e.g., bag-of-words, embeddings)
  • or millions of parameters (neural networks)

The closed-form formula requires:

(XᵀX)⁻¹

But XᵀX becomes:

  • a 100,000 × 100,000 matrix (10 billion entries, roughly 80 GB as 64-bit floats)
  • far too large to store or invert on ordinary hardware

Gradient descent does not require any matrix inversion.
It only needs to compute simple operations on the dataset in batches.
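
Here is a sketch of what "in batches" means: mini-batch gradient descent for linear regression that never forms XᵀX or inverts anything. The synthetic data, batch size, learning rate, and step count below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d = 50_000, 200                           # many rows, many features
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + rng.normal(scale=0.1, size=n)

w = np.zeros(d)
lr, batch = 0.01, 256
for step in range(2_000):
    idx = rng.integers(0, n, size=batch)     # grab a small random batch
    Xb, yb = X[idx], y[idx]
    errors = yb - Xb @ w
    grad = -2 * Xb.T @ errors / batch        # same gradient formula, computed on the batch
    w -= lr * grad                           # no matrix inversion anywhere

print(round(float(np.mean((y - X @ w) ** 2)), 3))   # close to the noise level (≈ 0.01)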

This is why every modern machine learning framework—TensorFlow, PyTorch, JAX—uses gradient-based optimization.


A Simple Way to Remember It

  • Closed-form (OLS):
    Works perfectly, but only for small, simple linear models.

  • Gradient descent:
    Works for linear models, logistic regression, deep learning, transformers, large-scale systems—essentially everything.

scikit-learn’s LinearRegression() uses the closed-form solution because typical tabular datasets are small enough.
TensorFlow and PyTorch use gradient-based methods exclusively because they target large, complex models.

