DEV Community: the_undefined_architect

Linear Regression: Code (a) Line

the_undefined_architect — Sat, 02 May 2026 18:19:27 +0000

It's time to write your first ML model and predict house prices.
To follow along, go ahead and take a look at the complete product:
https://github.com/yotambelgoroski/ml_unchained-house_pricing

Step 1: It's all about data

ML is all about data - you can't create a model without training it, and you can't train it without data.

Our dataset is typically split into two parts:

Training data - Data used to train a model
Test data - Data held back from training. Once a model is trained, we can take input (x) from the test data, predict the output (ŷ), and compare that prediction to the real value (y). This tells us how well our model performs on data it has not seen.

In more advanced setups, you might also see a validation set, which is used to tune the model before testing it.

Where does data come from?

The answer depends on your business and use case. For learning purposes, Kaggle is a great source for datasets and ML resources. To keep things simple, I use a script that generates synthetic data.
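As a rough idea of what such a generator could look like, here is a minimal sketch. It is not the script from the repository: the function name `generate_houses`, the size range, the slope, the intercept, and the noise level are all made-up assumptions. Only the column names `sqm` and `price` match the training code in this post.

```python
import numpy as np
import pandas as pd


def generate_houses(n_rows: int, seed: int = 42) -> pd.DataFrame:
    """Generate a hypothetical dataset of house sizes (sqm) and prices."""
    rng = np.random.default_rng(seed)
    # House sizes between 40 and 219 square meters (assumed range)
    sqm = rng.integers(40, 220, size=n_rows)
    # Random scatter so the points do not sit exactly on a line
    noise = rng.normal(0, 25, size=n_rows)
    # Assumed linear trend; the numbers 2.0 and 50 are invented for illustration
    price = 2.0 * sqm + 50 + noise
    return pd.DataFrame({"sqm": sqm, "price": price.round(0)})


df = generate_houses(12)
print(df.head())
```

Fixing the seed keeps the output repeatable, which helps when you want to compare training runs.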

How much data do I need for training?

There is no fixed number — as model complexity increases, more data is required.

A common rule of thumb is:
Have 10 to 20 times as many data points as features (independent variables)

We currently have one feature (sqm), so I used 10 records to train the model — the bare minimum to keep things simple.

How much data do I need for testing?

There are several approaches, but a simple one is to split your dataset using an 80:20 ratio:

80% for training
20% for testing
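One way to do the split is scikit-learn's `train_test_split`. The sketch below assumes the data is already in a DataFrame with `sqm` and `price` columns and uses the 10 houses from this series' worked example. The repository may split its data differently.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The 10 houses from this series' worked example (size in sqm, price)
df = pd.DataFrame({
    "sqm": [50, 65, 80, 95, 110, 130, 150, 170, 190, 210],
    "price": [140, 210, 180, 260, 240, 330, 310, 420, 390, 470],
})

# test_size=0.2 holds back 20% of the rows for testing;
# random_state makes the shuffle repeatable between runs
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(len(train_df), len(test_df))  # 8 2
```

With only 10 rows the test set ends up with 2 houses, which is too few to say much about model quality. The ratio matters more once the dataset grows.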

Step 2: Training the model

Now that we have our dataset, it's time to train a model.

Training involves three steps:

Load the training data
Train the model in memory based on that data
Serialize the model: save the trained model to disk so it can be reused without retraining

Here is how it looks in code:

import joblib
import pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression

FEATURE_COLS = ["sqm"]
TARGET_COL = "price"
MODEL_FILENAME = "house_price_model.joblib"


def load_training_data(train_path: Path) -> pd.DataFrame:
    return pd.read_csv(train_path)


def train_model(df: pd.DataFrame) -> LinearRegression:
    model = LinearRegression()
    model.fit(df[FEATURE_COLS], df[TARGET_COL])
    return model


def save_model(model: LinearRegression, dest_path: Path) -> None:
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, dest_path)
    print(f"Model saved → {dest_path}")


def train(train_path: Path, model_dir: Path) -> LinearRegression:
    df = load_training_data(train_path)
    model = train_model(df)
    save_model(model, model_dir / MODEL_FILENAME)
    print(f"Model trained on {len(df)} samples.")
    return model

This is it - our first model!

Our Dependencies

Pandas — A data handling library for working with tabular data. Its core structure, the DataFrame, allows us to easily access and manipulate data.
scikit-learn — A machine learning library for Python. LinearRegression is one of its models, used to learn the best linear relationship between input features and a target value.
Joblib — A utility library used here for serialization. It allows us to save a trained model to disk and load it later for inference.
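To see the three libraries working together, here is a small sketch of the "load it later for inference" part. It assumes the model is trained on the 10 houses from this series and saved to a temporary folder. The temporary path and the variable names are illustrative, not the layout of the repository.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# The 10 houses from this series' worked example
train_df = pd.DataFrame({
    "sqm": [50, 65, 80, 95, 110, 130, 150, 170, 190, 210],
    "price": [140, 210, 180, 260, 240, 330, 310, 420, 390, 470],
})

model = LinearRegression().fit(train_df[["sqm"]], train_df["price"])

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "house_price_model.joblib"
    joblib.dump(model, model_path)      # serialize the trained model to disk
    loaded = joblib.load(model_path)    # load it back, as a separate process would

# Predict with the loaded model; the column name must match the training feature
new_house = pd.DataFrame({"sqm": [250]})
predicted_price = float(loaded.predict(new_house)[0])
print(round(predicted_price, 2))  # 535.57
```

The loaded model gives the same prediction as the in-memory one, so an inference service only needs the `.joblib` file, not the training data.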

Congratulations — you've created your first model!

However, it's not production-ready yet. Next, we’ll use the test data to evaluate how good our model really is.

Linear Regression: behind the lines

the_undefined_architect — Sat, 18 Apr 2026 11:10:19 +0000

As I mentioned in my introductory post to this series, I don’t want math to be a hurdle for developers who want to get into AI and machine learning. You can always start coding and come back to the math when you find yourself needing it.

However, there are some concepts in linear regression that you should definitely know, since they are core to many other algorithms as well.

That said, I don’t believe you should memorize equations. When developing, we rely on libraries to handle that for us. But I do recommend going through this post and building a mental model of the core concepts that are rooted in math — very simple math that you can absolutely handle.

On The Line

In the previous post, we predicted the price of a 250 m² house by drawing a line based on a dataset of 10 houses.

But why that line? And how did we get a predicted value (ŷ) from an input (x)?

The answer is this formula:
ŷ = bx + a

a - The intercept, where the line crosses the Y axis (when x = 0)
b - The slope - how steep the line is

To find a and b, we use:

a = ((Σy)(Σx^2) - (Σx)(Σxy)) / (n(Σx^2) - (Σx)^2)
b = (n(Σxy) - (Σx)(Σy)) / (n(Σx^2) - (Σx)^2)

Don't worry! It's simpler than it seems:

Σ = Total sum

Σx = Total sum of all house sizes (50 + 65 + 80 + 95 + 110 + 130 + 150 + 170 + 190 + 210 = 1250)
Σy = Total sum of all house prices (140 + 210 + 180 + 260 + 240 + 330 + 310 + 420 + 390 + 470 = 2950)
Σxy = Total sum of (x · y) for each row (7000 + 13650 + 14400 + 24700 + 26400 + 42900 + 46500 + 71400 + 74100 + 98700 = 419,750)
Σx² = Total sum of (x · x) for each row (2500 + 4225 + 6400 + 9025 + 12100 + 16900 + 22500 + 28900 + 36100 + 44100 = 182,750)
n = Total number of rows (10 houses)

So:

a = ((2950)(182750) - (1250)(419750)) / (10(182750) - (1250)^2) ≈ 54.43
b = (10(419750) - (1250)(2950)) / (10(182750) - (1250)^2) ≈ 1.92
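The sums and the two formulas above can be reproduced in a few lines of plain Python. This is only a sketch for checking the arithmetic; variable names such as `sum_xy` are my own.

```python
# House sizes (x) and prices (y) from the sums above
sizes = [50, 65, 80, 95, 110, 130, 150, 170, 190, 210]
prices = [140, 210, 180, 260, 240, 330, 310, 420, 390, 470]

n = len(sizes)
sum_x = sum(sizes)                                   # Σx
sum_y = sum(prices)                                  # Σy
sum_xy = sum(x * y for x, y in zip(sizes, prices))   # Σxy
sum_x2 = sum(x * x for x in sizes)                   # Σx²

# Both formulas share the same denominator
denominator = n * sum_x2 - sum_x ** 2

a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator  # intercept
b = (n * sum_xy - sum_x * sum_y) / denominator       # slope

print(round(a, 2), round(b, 2))  # 54.43 1.92
```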

Now we can predict the price of a 250 m² house:

ŷ = bx + a
ŷ = 1.92 * 250 + 54.43
ŷ = 534.43 (or about 535.57 if a and b are not rounded first)
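Rounding a and b before predicting shifts the result slightly. The short check below compares the prediction from the rounded coefficients with the one from the unrounded values of the same two expressions.

```python
# Unrounded coefficients, written out exactly as in the expressions above
a_exact = (2950 * 182750 - 1250 * 419750) / (10 * 182750 - 1250 ** 2)
b_exact = (10 * 419750 - 1250 * 2950) / (10 * 182750 - 1250 ** 2)

rounded_prediction = 1.92 * 250 + 54.43          # coefficients rounded to 2 decimals
exact_prediction = b_exact * 250 + a_exact       # full-precision coefficients

print(round(rounded_prediction, 2), round(exact_prediction, 2))  # 534.43 535.57
```

A gap of about one unit does not matter for intuition, but it is a good reason to let the library keep full precision instead of copying rounded numbers by hand.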

Or, as we saw in the previous post, we can read the value off the regression line at x = 250.

How Good Is Our Line?

We have a model — but is it a good one?

For each data point, we compare the actual value (y) with the predicted value (ŷ). The difference is called a residual:
residual = y - ŷ

Let's calculate a few residuals using our formula:
ŷ = 1.92x + 54.43
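Here is a small sketch that computes the residual for each of the 10 houses with the rounded formula. The helper name `predict` is my own.

```python
sizes = [50, 65, 80, 95, 110, 130, 150, 170, 190, 210]
prices = [140, 210, 180, 260, 240, 330, 310, 420, 390, 470]


def predict(sqm: float) -> float:
    """Predicted price from the rounded regression line."""
    return 1.92 * sqm + 54.43


# residual = y - ŷ for every house
residuals = [round(price - predict(sqm), 2) for sqm, price in zip(sizes, prices)]

for sqm, price, residual in zip(sizes, prices, residuals):
    label = "under-predicted" if residual > 0 else "over-predicted"
    print(f"{sqm} sqm: actual {price}, predicted {predict(sqm):.2f}, "
          f"residual {residual:+.2f} ({label})")
```

The first three houses come out as -10.43, +30.77 and -28.03, so the signs already alternate.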

Some residuals are positive (we under-predicted), some are negative (we over-predicted). If we simply summed them, they would cancel out — so we square them:
MSE = (1/n) Σ(yᵢ - ŷᵢ)²

This is the Mean Squared Error (MSE) — a single number that tells us how wrong the model is, on average.

The best line is the one with the lowest possible MSE.

How to do that? We already did.

The formulas for a and b are derived by minimizing the MSE.

This means the line we found is the best possible straight line for this data as measured by MSE: no other combination of a and b gives a lower error.
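You don't have to take this on faith. The sketch below computes the MSE for the a and b from the formulas, then nudges both values in every direction and checks that the error only goes up. The step sizes (5 for the intercept, 0.1 for the slope) are arbitrary choices for the demonstration.

```python
sizes = [50, 65, 80, 95, 110, 130, 150, 170, 190, 210]
prices = [140, 210, 180, 260, 240, 330, 310, 420, 390, 470]
n = len(sizes)


def mse(a: float, b: float) -> float:
    """Mean Squared Error of the line ŷ = b*x + a on the 10 houses."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(sizes, prices)) / n


# Unrounded a and b from the closed-form formulas
sum_x, sum_y = sum(sizes), sum(prices)
sum_xy = sum(x * y for x, y in zip(sizes, prices))
sum_x2 = sum(x * x for x in sizes)
denominator = n * sum_x2 - sum_x ** 2
a_best = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
b_best = (n * sum_xy - sum_x * sum_y) / denominator

best_mse = mse(a_best, b_best)

# Nudge the intercept and the slope in every direction and recompute the MSE
nudged_mses = [
    mse(a_best + da, b_best + db)
    for da in (-5.0, 0.0, 5.0)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]

print(best_mse < min(nudged_mses))  # True
```

This checks only a handful of neighbors, so it is a sanity check rather than a proof. The proof is the derivation behind the formulas.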

In the next post, we'll code our first model!

Linear Regression: Putting things in line

the_undefined_architect — Tue, 14 Apr 2026 14:27:24 +0000

As every developer knows, Hello World is famously the first application you build when you start learning how to code. In the same spirit, house price prediction is one of the best beginner examples for understanding how machine learning models are trained and used.

Say you want to build an app that predicts house prices. How would you do it? There is no simple if-else statement that can accurately predict the price of a house in your neighborhood.

So let’s simplify the problem.

Imagine you collected data for 10 houses and wanted to explore whether house size can help us predict house price. Your dataset might look like this:

Size (m²)   Price
50          140
65          210
80          180
95          260
110         240
130         330
150         310
170         420
190         390
210         470

If we plot these points on an X and Y axis, this is what we get:

As we can see, a correlation emerges between the size of the house and its price (duh).
But how can we define this relationship so that, given a real value of X (the house size), we can predict the price?

This is where Linear Regression comes in.

Instead of trying to match every point perfectly, linear regression finds a line, called the regression line, that best represents the overall trend in the data.

Here’s what that looks like:

Now that we have our regression line, we can actually use it to make predictions.

Let’s say we want to estimate the price of a house that is 250 m².

We simply take that value (X = 250), project it onto our regression line, and get the predicted price:

And that’s it — we’ve trained our first model for predicting house prices.

I know, I know… we didn’t go into how the model actually finds this line, or how to implement it in code. We’ll get there in the next post.

For now, the goal was to give you an intuition for how Machine Learning works:

how models learn from data
how training shapes their behavior
and most importantly — that there is no certainty, only probability and prediction

ML Unchained: Machine Learning for Developers

the_undefined_architect — Sun, 12 Apr 2026 19:01:53 +0000

If you’re an experienced developer, you’ve probably felt it already.

Machine Learning is everywhere.
Recommendations, pricing, search, fraud detection, and copilots.

And yet — it still feels… separate from what you do.

Like a different world.

Why You Should Care

Not because it’s hype.
Because it changes how you build.

As an experienced developer, you’re used to writing deterministic systems.

Given X → return Y

You structure it with:

Functions
Classes
Tests

Machine Learning doesn’t replace that.
But it introduces a component that doesn’t follow explicit rules — it predicts.

Instead of writing logic, you train it.
Instead of exact outputs, you get probabilities.
Instead of debugging code, you analyze behavior.

Why This Transition Is So Hard

Traditional developers struggle to get started with ML because the AI world feels intimidating. There’s no simple “hello world” to ease you in.

It’s not that simple — but it’s not that hard either.

I’ll say it clearly: you don’t need to be a math wizard to become an ML engineer. I’m not. Yes, math is involved, but as a developer you’ve already dealt with harder problems.

The real issue is how people approach learning it.

They think they need to:

Learn the math first
Then the theory
And only then start coding

That’s backwards.

We’re engineers.

We don’t learn by reading first —
we learn by building cool shit.

Final Thoughts

That’s exactly what we’re going to do together.

We’re going to dive into Machine Learning — but not through theory-first learning.

We’ll go project first.

We’ll build things.
We’ll break them.
We’ll understand how they behave.

And along the way, the concepts will start to make sense — naturally.

No unnecessary complexity.
No waiting until you’re “ready.”

We'll do it the engineering way, and we'll do it 500 words at a time.