John Ojo

Understanding Linear Regression Inside-Out: Gradients, Loss, and Learning from Scratch

This article is beginner-friendly for those looking to break into Artificial Intelligence/Machine Learning, as well as for those with experience who want to understand what goes on behind the scenes of most ML algorithms.
This will be a two-part series. In this part, I will explain the fundamentals of linear regression models and how they work behind the scenes to train and make predictions. In the next part, I will demonstrate how to use scikit-learn and TensorFlow to achieve the same thing.
Understanding the fundamentals is always valuable—when things aren’t working as expected, a solid foundation helps us troubleshoot more effectively.

The source code is available on GitHub: Basic Linear Regression

Introduction

Supervised machine learning trains models using labeled data, where each example consists of both input features and the corresponding target variable. After training, the model can make predictions on new data points.

Linear regression is one of the simplest yet most foundational algorithms in machine learning. It models the relationship between input features and a continuous target variable, forming the basis for more complex models used in modern predictive analytics. Linear regression is particularly useful for predicting continuous values like weather forecasts, sales figures, or stock prices.

Libraries like scikit-learn and TensorFlow make training a linear regression model straightforward using just a few lines of code. However, this convenience can become a black box that obscures how the model learns, which is why, in this article, we’ll demystify linear regression by building it completely from scratch — using only NumPy for mathematical operations with just a touch of TensorFlow for data normalization.

We’ll work with the classic advertising dataset, where the goal is to predict sales based on advertising spend across TV, radio, and newspapers. Along the way, we’ll cover:

  • Loading and preprocessing real-world data;
  • Implementing the prediction function, loss calculation, and gradient descent manually;
  • Evaluating the model on a separate test set; and
  • Visualizing model performance with clear plots.

At the end, you’ll not only understand what linear regression does, but also why it works and how to implement it step by step. Whether you're new to machine learning or looking to strengthen your foundations, this hands-on project will sharpen your skills.

Stick with me—this will be a bit long, as we’re covering everything from loading and processing the data, to calculating loss, implementing gradient descent, making predictions, and doing it all manually.

Let’s dive in.

Data Processing

In this project, we’ll use the advertising dataset, a well-known dataset in the machine learning community. It’s often used for regression problems, particularly for demonstrating how different types of advertising affect product sales.

The dataset contains 200 rows and includes the following columns:

  • tv: advertising budget spent on TV
  • radio: advertising budget spent on radio
  • newspaper: advertising budget spent on newspapers
  • sales: product sales

Here’s a quick look at the first few rows:

id   tv     radio  newspaper  sales
1   230.1   37.8    69.2      22.1
2    44.5   39.3    45.1      10.4
3    17.2   45.9    69.3       9.3
4   151.5   41.3    58.5      18.5

In our model:
The input features (X) will be: tv, radio, and newspaper
The target variable (y) will be: sales

This setup is ideal for linear regression because we want to predict a continuous value (sales) based on numerical inputs.

Before training, we’ll split the data into training, validation, and test sets, and normalize the features to ensure they are all on a similar scale. That way, our gradient descent optimizer can converge more efficiently.

Let’s get started by loading the needed packages.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import tensorflow as tf
import utils

NumPy: a popular library for scientific computing
Matplotlib: a popular library for plotting data
scikit-learn: just for splitting the dataset
TensorFlow: only for normalizing the input features
utils.py: contains helper functions needed to load the CSV data, extract the features and target from a dataset, and plot relevant graphs (loss, predictions).
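
The utils.py file itself isn't listed in this article, so here is a minimal sketch of what the two loading helpers could look like, assuming the CSV's header row matches the keys used in this article (the plotting helpers are omitted):

import csv
import numpy as np

def load_csv_data(path: str) -> list[dict]:
    # Read the CSV into a list of dicts keyed by the header row
    with open(path, newline='') as f:
        return list(csv.DictReader(f))

def prepare_data(raw_data: list[dict], feature_keys: list[str], target_key: str) -> tuple[np.ndarray, np.ndarray]:
    # Stack the selected feature columns into an (m, n) float array
    X = np.array([[float(row[key]) for key in feature_keys] for row in raw_data])
    # Extract the target column into an (m,) float array
    y = np.array([float(row[target_key]) for row in raw_data])
    return X, y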

Next, we load and preprocess the data:

raw_data = utils.load_csv_data('<local-path>/data/Advertising.csv')

feature_keys = ['tv', 'radio', 'newspaper']
target_key = 'sales'

X, y = utils.prepare_data(raw_data, feature_keys, target_key)

print("X Shape: ", X.shape)
print("X length: ", len(X))
print("X first 5 features: ", X[:5])
print("X type: ", type(X))

print("y Shape: ", y.shape)
print("y length: ", len(y))
print("y first 5 features: ", y[:5])
print("y type: ", type(y))

Make sure to update the local-path. The utils.load_csv_data() function loads the dataset from the CSV file. After loading the raw data, we’ll convert the relevant columns into NumPy arrays for processing. utils.prepare_data() takes the raw data, the feature keys, and the target key, then returns a tuple of the features and target NumPy arrays. We print the shape, length, type, and the first 5 items in the features and target arrays. This enables us to get a sense of the data we are working with.

Next, we plot each feature against the target to visualize the data:

# Plot the first 5 features vs target
print("Plot first 5 X vs y")
utils.plot_features_vs_target(X[:5], y[:5], feature_keys, target_key)

# Plot the entire features vs target
print("Plot entire X vs y")
utils.plot_features_vs_target(X, y, feature_keys, target_key)

Next, we process the data and get it ready for training.
To simulate a realistic training workflow, we split the data into three parts:

  • 75% for training
  • 12.5% for validation (model tuning to ensure the model is not overfitting on the training data)
  • 12.5% for final testing (unseen data)

To help gradient descent converge efficiently, we normalize the features using TensorFlow’s Normalization layer, trained on the training set only to prevent data leakage.

# Step 1: Split the dataset into training, validation, and test sets
# First split: 75% training, 25% temporary set
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.25, random_state=55)
# Second split: Divide the temporary set into validation and test sets (50% each, which is 12.5% of the original data each)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=55)

# Step 2: Normalize using TensorFlow (adapt on train only)
# Create a normalization layer that will standardize the features
normalizer = tf.keras.layers.Normalization(axis=-1)
# Fit the normalizer only on training data to avoid data leakage
normalizer.adapt(X_train)  # Only fit on training data

# Step 3: Transform all datasets using the fitted normalizer
# Apply normalization to training data and convert to numpy array
X_train_norm = normalizer(X_train).numpy()
# Apply same normalization to validation data
X_val_norm = normalizer(X_val).numpy() 
# Apply same normalization to test data
X_test_norm = normalizer(X_test).numpy()

# Print shapes of all datasets to verify the splitting worked correctly
print("X_train_norm Shape: ", X_train_norm.shape)
print("y_train Shape: ", y_train.shape)
print("X_val_norm Shape: ", X_val_norm.shape)
print("y_val Shape: ", y_val.shape)
print("X_test_norm Shape: ", X_test_norm.shape)
print("y_test Shape: ", y_test.shape)
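As an aside: if you'd rather not pull in TensorFlow just for normalization, the same standardization (statistics fit on the training set only, then applied to every split) is a few lines of plain NumPy. A minimal sketch, assuming the splits above:

# Compute statistics on the training set only to avoid data leakage
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the same transform to every split
X_train_norm = (X_train - mu) / sigma
X_val_norm = (X_val - mu) / sigma
X_test_norm = (X_test - mu) / sigma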

At this point, our data is clean, split, and ready to be used for training a linear regression model from scratch. Next up, we’ll implement the linear regression components one by one: prediction, loss function, gradient calculation, weight and bias update using gradient descent, and training. Apologies in advance as I will not go deep into the mathematical equations used, but there are resources available online if you need to explore this further. Let’s get into it.

Prediction Definition

A prediction is the output f(x) the model produces for a given set of input features, using its current weights and bias. In linear regression, the predicted output is computed as:

f_{w,b}(x^{(i)}) = w \cdot x^{(i)} + b

where:

  • W is the weight vector (one weight per feature; with our three features, W is a vector of three values)
  • X is the input feature vector
  • b is the bias (intercept term)

The training process aims to find the appropriate weights and bias that, when applied to new input values, produce accurate predictions. Essentially, training is about learning these optimal parameters (weights and bias) for future use on new data points.

The process of finding optimal weights and bias involves:

  1. Initializing the parameters (weights and bias) with random values
  2. Making predictions using the current parameters
  3. Calculating the prediction error using a loss function and storing the result
  4. Updating the weights and bias to new values using gradient descent
  5. Repeating steps 2-4 for multiple training epochs
  6. Monitoring the loss over time—decreasing values indicate the model is learning and converging.

This process yields trained parameters (weights and bias) that can make accurate predictions on new data.

def predict(
    X: np.ndarray, 
    W: np.ndarray, 
    b: float
) -> np.ndarray:
    """
    Predict target values using linear regression.

    Args:
        X (np.ndarray): Feature matrix of shape (n_samples, n_features)
        W (np.ndarray): Weight vector of shape (n_features,)
        b (float): Bias term

    Returns:
        np.ndarray: Predicted values of shape (n_samples,)
    """

    # Calculate predictions using the linear regression formula: f(x) = X·W + b
    # - np.dot(X, W) computes the matrix multiplication between features and weights
    # - Adding b applies the bias term to each prediction
    f_x = np.dot(X, W) + b

    return f_x

f_x = np.dot(X, W) + b uses vectorization to compute f(x) for all samples at once, instead of looping over each example.
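
To make the shapes concrete, here's a quick sanity check with made-up numbers (two samples, three features):

X_demo = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])  # shape (2, 3)
W_demo = np.array([0.5, -1.0, 2.0])   # shape (3,)
b_demo = 0.1

# Each prediction is the dot product of a row with W, plus b
print(predict(X_demo, W_demo, b_demo))  # [4.6 9.1]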

Loss

Next, we compute the loss. The loss measures the difference between the predicted values and the actual (true) values of the target variable: low loss values indicate good model performance, while high values suggest the model needs improvement.
We'll use mean squared error (MSE) as the loss function. It's defined as:

J(w,b) = \frac{1}{2m} \sum\limits_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2

def mean_squared_loss(
    X: np.ndarray, 
    y: np.ndarray, 
    W: np.ndarray, 
    b: float
) -> float:
    """
    Compute the mean squared error loss.

    Args:
        X (np.ndarray): Feature matrix of shape (m, n)
        y (np.ndarray): Target vector of shape (m,)
        W (np.ndarray): Weight vector of shape (n,)
        b (float): Bias term

    Returns:
        float: Mean squared error loss
    """

    # Get the number of training examples
    m = X.shape[0]

    # Calculate predictions using the predict function
    predictions = predict(X, W, b) 

    # Compute the squared differences between predictions and actual values
    squared_errors = (predictions - y) ** 2

    # Calculate the mean squared error loss with the 1/2m factor
    loss = np.sum(squared_errors) / (2 * m)

    return loss
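A quick sanity check with made-up values: when the parameters reproduce the targets exactly, the loss should be zero:

# y = 2x + 1 exactly, so W = [2.0] and b = 1.0 give zero loss
X_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array([3.0, 5.0, 7.0])
print(mean_squared_loss(X_demo, y_demo, np.array([2.0]), 1.0))  # 0.0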

Gradient and Gradient Descent

Next, we compute the gradient. It tells us how much to change the weights and bias to reduce the prediction error, which is how the model learns during training.
The gradients are defined as:

\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum\limits_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}

\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum\limits_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)

def compute_gradient(
    X: np.ndarray, 
    y: np.ndarray, 
    W: np.ndarray, 
    b: float
) -> tuple[np.ndarray, float]:
    """
    Compute the gradient of the cost function with respect to parameters W and b.

    Args:
        X: Input features, shape (m, n) where m is number of examples and n is number of features
        y: Target values, shape (m,)
        W: Weight parameters, shape (n,)
        b: Bias parameter

    Returns:
        tuple: Gradients with respect to W and b
            - d_dw: Gradient with respect to W, shape (n,)
            - d_db: Gradient with respect to b, scalar
    """

    # Get the number of examples (m) and features (n)
    m, n = X.shape

    # Calculate model predictions using current parameters
    predictions = predict(X, W, b) 
    # Compute the error (difference between predictions and actual values)
    error = predictions - y

    # Calculate gradient for weights by taking dot product of X transpose and error, then normalize by m
    # This is the partial derivative of the cost function with respect to W
    d_dw = np.dot(X.T, error) / m

    # Calculate gradient for bias by summing all errors and normalizing by m
    # This is the partial derivative of the cost function with respect to b
    d_db = np.sum(error) / m

    return d_dw, d_db
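Gradient code is easy to get subtly wrong, so a useful (optional) debugging trick is to compare the analytic gradient against a finite-difference approximation on random data. A small sketch, not part of the training pipeline:

def numerical_grad_b(X, y, W, b, eps=1e-6):
    # Central-difference approximation of dJ/db
    return (mean_squared_loss(X, y, W, b + eps)
            - mean_squared_loss(X, y, W, b - eps)) / (2 * eps)

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(10, 3))
y_chk = rng.normal(size=10)
W_chk = rng.normal(size=3)
b_chk = 0.5

_, db = compute_gradient(X_chk, y_chk, W_chk, b_chk)
print(db, numerical_grad_b(X_chk, y_chk, W_chk, b_chk))
# The two values should agree to several decimal places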

Next, we create the gradient descent function, which performs one step of gradient descent to update the model parameters. The update rule is:

w = w - \alpha \frac{\partial J(w,b)}{\partial w}

b = b - \alpha \frac{\partial J(w,b)}{\partial b}

def gradient_descent(
    X: np.ndarray, 
    y: np.ndarray, 
    W: np.ndarray, 
    b: float, 
    learning_rate: float
) -> tuple[np.ndarray, float]:
    """
    Perform one step of gradient descent to update model parameters.

    Args:
        X: Input features, shape (m, n) where m is number of examples and n is number of features
        y: Target values, shape (m,) or (m, 1)
        W: Current weight parameters, shape (n,) or (n, 1)
        b: Current bias parameter
        learning_rate: Step size for the gradient descent update

    Returns:
        tuple: Updated weights W and bias b after one step of gradient descent
    """

    # Calculate gradients for weights and bias using current parameters
    dW, db = compute_gradient(X, y, W, b)

    # Update weights by subtracting the learning rate multiplied by the gradient
    W =  W - (learning_rate * dW)
    # Update bias by subtracting the learning rate multiplied by the gradient
    b = b - (learning_rate * db)

    return W, b   

Model Training

Now that we have implemented the core functionalities needed, i.e., prediction, loss, gradient computation, and training parameter updates, we can train our linear regression model using gradient descent.
The training process involves:

  1. Calculating the loss using the current weights and bias (this predicts the output and then computes the MSE)
  2. Updating the model parameters (weights and bias) using gradient descent
  3. Repeating the process for the specified number of epochs (training steps)

We store the loss over time so that we can plot it. If the loss decreases over the epochs, the model is converging and learning. If it keeps increasing, something is wrong and we need to investigate; a common culprit is a learning rate that is too high.

def train(
    X_train: np.ndarray, 
    y_train: np.ndarray, 
    X_val: np.ndarray,
    y_val: np.ndarray,
    W: np.ndarray, 
    b: float, 
    learning_rate: float, 
    epochs: int
) -> tuple[np.ndarray, float, list[float], list[float]]:
    """
    Train a linear model using gradient descent optimization.

    Args:
        X_train: Training features, shape (n_samples, n_features)
        y_train: Training target values, shape (n_samples,)
        X_val: Validation features, shape (n_samples, n_features)
        y_val: Validation target values, shape (n_samples,)
        W: Initial weight matrix, shape (n_features,)
        b: Initial bias term
        learning_rate: Step size for gradient descent updates
        epochs: Number of training iterations

    Returns:
        tuple: Updated weights W, bias b, train loss history, and validation loss history after training
    """

    # Initialize empty lists to store loss values during training
    train_loss_history = []
    val_loss_history = []

    # Iterate through the specified number of training epochs
    for epoch in range(epochs):
        # Calculate and store the mean squared loss on training data
        train_loss = mean_squared_loss(X_train, y_train, W, b)
        train_loss_history.append(train_loss)

        # Calculate and store the mean squared loss on validation data
        val_loss = mean_squared_loss(X_val, y_val, W, b)
        val_loss_history.append(val_loss)      

        # Print progress every 100 epochs
        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Train Loss = {train_loss:.4f}, Val Loss = {val_loss:.4f}")

        # Update model parameters (weights and bias) using gradient descent
        W, b = gradient_descent(X_train, y_train, W, b, learning_rate)

    # Return the trained model parameters and loss histories
    return W, b, train_loss_history, val_loss_history

Now that we have our train function in place, our next step is to run it. We will initialize the weights and bias to zero, use a learning rate of 0.01, and train for 1000 epochs.

# Initialize weights and bias
W = np.zeros(X_train.shape[1])  # shape (n_features,)
b = 0.0
learning_rate = 0.01  # Step size for gradient descent
epochs = 1000  # Number of training iterations

# Train the linear regression model
W_trained, b_trained, train_losses, val_losses = train(X_train_norm, y_train, X_val_norm, y_val, W, b, learning_rate, epochs)

# Print the final trained parameters
print(f"Training parameters Weight: {W_trained}, bias {b_trained}")

# Visualize the training and validation loss over epochs
# This helps to monitor model convergence and potential overfitting
utils.plot_loss_curve(train_losses, val_losses)

Training Process with loss history

As you can see, the final weights are [ 3.78713104 2.88855073 -0.14126015], and the bias is 13.92606568567291.
Note: In our case, the model converged after about 200 epochs, with no further significant change in the loss. We could modify the training loop to implement early stopping based on a patience threshold; a possible sketch follows, and I'll leave wiring it into train() as an exercise for you.
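
A minimal sketch of that idea, assuming the same variables used inside train() and an arbitrary patience of 50 epochs:

best_val_loss = float('inf')
patience = 50  # epochs to wait for an improvement (arbitrary choice)
epochs_without_improvement = 0

for epoch in range(epochs):
    val_loss = mean_squared_loss(X_val, y_val, W, b)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Early stopping at epoch {epoch}")
            break
    W, b = gradient_descent(X_train, y_train, W, b, learning_rate)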

Testing The Trained Model

We can now use these learned parameters to make predictions on our test dataset.

# Make predictions on the test set using trained weights and bias
y_predict = predict(X_test_norm, W_trained, b_trained)

# Calculate the mean squared loss on the test set
y_predict_loss = mean_squared_loss(X_test_norm, y_test, W_trained, b_trained)   

# Print the test loss
print(f"Test Loss: {y_predict_loss}")

# Create a plot comparing actual vs predicted sales
utils.plot_predictions(y_test, y_predict, 'Predicted vs Actual Sales', 'Actual Sales', 'Predicted Sales')

# Print the first 25 actual and predicted values for comparison
print("Actual vs predicted values")
for i in range(25):
    print(f"Actual: {y_test[i]}, Predicted: {y_predict[i]:.1f}")

Predicted vs Actual Output

The test loss is 1.129492604788525. Given the 1/2m factor in our loss, that corresponds to a mean squared error of roughly 2.26, i.e., a typical prediction error of about 1.5 sales units, which shows that our model did very well in predicting the outcome.
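To double-check that interpretation yourself, you can convert the loss back to a root-mean-squared error (remembering the 1/2 factor in our loss definition):

# Undo the 1/2 factor to recover the plain MSE, then take the square root
rmse = np.sqrt(2 * y_predict_loss)
print(f"RMSE: {rmse:.2f} sales units")  # ~1.50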

Conclusion

In this tutorial, we built a complete linear regression model from scratch without relying on high-level machine learning libraries. We walked through each fundamental step from loading and preprocessing a real-world dataset to implementing the prediction and loss functions, computing gradients, and training with gradient descent.

Along the way, we used the advertising dataset to model the relationship between advertising spend and product sales. By manually implementing each component, we gained a clearer understanding of how linear regression works under the hood.

This hands-on approach reinforces not just the math behind machine learning, but also the practical workflow required to evaluate and test a simple model.

Unless you're developing entirely new ML algorithms, you typically won't need to implement this from scratch as we did, since packages like scikit-learn and TensorFlow handle the heavy lifting. However, as I mentioned earlier, understanding the fundamentals helps you troubleshoot when models are not performing as expected or failing to converge.

In part two of this series, I'll show you how to accomplish the same task using scikit-learn and TensorFlow with just a few lines of code. That approach abstracts away the underlying mathematical complexity of linear regression; however, the understanding you've gained from this article will give you a real appreciation for the work these libraries do behind the scenes. There's still plenty more to explore! Topics like regularization, feature engineering, and others offer fascinating extensions to what we've covered.

Thanks for following along! If you found this helpful or have questions, feel free to reach out. You can also buy me a coffee — happy learning!
