Pascal Thormeier

Posted on Dec 11, 2020 • Edited on Jan 18, 2023

Algorithm explained: Linear regression using gradient descent with PHP

#machinelearning #php #computerscience #tutorial

Part 4 of Algorithms explained! Every few weeks I write about an algorithm and explain and implement it!
Is there an algorithm you always wanted to know about? Leave a comment!

I once took an online class on machine learning by the amazing Andrew Ng from Stanford for my bachelors studies. One of the first ML algorithms I ever had the chance to look at, was linear regression.

This algorithm really struck me. It allows machines to do what humans are incredibly good at: Pattern recognition. It allows machines to create a linear function that more or less matches seemingly unrelated data by applying mathematics to it!

In this post I'll explain how linear regression works and implement it in PHP!

So what does linear regression actually do?

In one sentence: It creates a linear function that represents the trend of something. The go-to example is usually house prices compared to their size. The larger a house, the more expensive it is. Sounds reasonable, right? You would put a known house size, for example in square meters, into the function and you would get an estimated price out of it.

Let's have a look at this data set:

On first glance, the data seems to correlate. The larger the house, the more expensive it is. The data is a bit scattered, but there's an overall trend visible:

(This is an approximation)

Of course there's usually more factors to the price of a house in real life: Is there a grocery store nearby, is it modern, was it once considered haunted, what's the size of its backyard? These factors make up the variation in the data - it's not linear.

The algorithm itself works like this:

Define any odd linear function as a trend line (usually random)
Measure how far off it is by calculating the average distance between predicted Y and actual Y of every data point (the so called "error")
Adjust the trend line based on the measurement (the "gradient descent")
Repeat Step 2 and 3 until the average distance has reached a minimum (not necessarily 0)

Let me illustrate with the above example:

(Again, this is an approximation)

So, to implement the algorithm, we need three different pieces:

The definition of a linear function
The measurement
The adjustment
Repeating until it's at a minimum

Let's start with...

The definition of a linear function

A linear function $f(x)$ usually comes in the form of:

f(x) = ax + b

Whereas $a$ basically defines the slope of the function and $b$ moves it along the Y axis.

We can make a more general definition of a linear function by introducing a variable called $x_0$ into the function and renaming a and b to $c_0$ and $c_0$ ("c" stands for "coefficient"):

f(x_0, x_1) = c_0x_0 + c_1x_1

In PHP, this is pretty straight forward:

<?php

declare(strict_types=1);

function linearFunction (float $c0, float $x0, float $c1, float $x1) : float {
    return $c0 * $x0 + $c1 * $x1;
}

We can randomly define $c0 and $c1 to get a first starting point:

$c0 = mt_rand(-100, 100); // -100 and 100 are almost parallel to the X and Y axis
$c1 = mt_rand(-100, 100);

So far, so good. Let's generate some data next!

Getting data

I'm going to generate the data randomly with a correlation. To get the linear function $f(x_0, x_1) = c_0x_0 + c_1x_1$ back into it's original form $f(x) = c_0 + c_1x$ , we set $x_0$ to always be 1:

$data = [
    [1, mt_rand(0, 100), mt_rand(0, 100)],
    [1, mt_rand(100, 200), mt_rand(100, 200)],
    [1, mt_rand(200, 300), mt_rand(200, 300)],
    [1, mt_rand(300, 400), mt_rand(300, 400)],
    [1, mt_rand(400, 500), mt_rand(400, 500)],
    [1, mt_rand(500, 600), mt_rand(500, 600)],
    [1, mt_rand(600, 700), mt_rand(600, 700)],
    [1, mt_rand(700, 800), mt_rand(700, 800)],
    [1, mt_rand(800, 900), mt_rand(800, 900)],
    [1, mt_rand(900, 1000), mt_rand(900, 1000)],
];

For illustration purposes, I'll be going to work with this data set and predefined values for $c0 and $c1:

$data = [
  [1, 48, 36],
  [1, 121, 104],
  [1, 285, 208],
  [1, 345, 373],
  [1, 402, 442],
  [1, 576, 582],
  [1, 617, 629],
  [1, 723, 787],
  [1, 852, 857],
  [1, 965, 950],
];
$c0 = 16;
$c1 = 15;

To graph a few things, I'll use Desmos, which is an amazing tool to fiddle around with maths. On a graph, the data looks like this:

We can see a clear trend there. The trend line will likely be going from bottom left to top right.

With the data at hand, we can start to...

Measure the error

So, to get a good estimate of how far off the linear function is, we calculate the average squared distance of the function's output Y and the actual Y. Squared, because otherwise negative numbers might emerge, which is something we don't want.

For this, we'll define yet another function that looks like this:

error(x_0^{(1..n)}, x_1^{(1..n)}, y^{(1..n)}) = \frac{1}{n}\sum_{i=1}^{n}{(y^{(i)} - f(x_0^{(i)}, x_1^{(i)}))^2}

(Notice: $x^{(i)}$ is not "x to the power of (i)", but rather "the x at position (i) in the list)

Or, in PHP:

function squaredError(float $c0, float $c1, array $data): float {
  return array_sum(
    array_map(
      function ($point) use ($c0, $c1) {
        return ($point[2] - linearFunction($c0, $point[0], $c1, $point[1])) ** 2;
      },
      $data
    )
  ) / count($data);
}

var_dump(squaredError($c0, $c1, $data)); // 73679488.2

This formula sums up the squares (because otherwise there's negative numbers) of the differences between predicted Y of the trend line and actual Y for every data point. It then divides it by the total number of data points, resulting in an average.

Ok, but 73679488.2 is quite the number. Let's see what this would look like by adding the linear function to the graph:

Yup, that's pretty far off. The Y-value for each point is far off to the top, accounting for this very large error. But that was the random definition. When can start improving from here.

Calculating the adjustment

To improve the trend line's fit to the data, we'll use "gradient descent". According to Wikipedia, gradient descent is "a first-order iterative optimization algorithm for finding a local minimum of a differentiable function." And that pretty much sums it up.

The "differentiable function" is, in the case of linear regression, the error function. So gradient descent basically tries to find the local minimum of the error function (thus maximizing the fit of the original linear function to the data). To achieve this, it can adjust two things.

Since X is always given (it's the data, we can't change that), gradient descent can adjust $c_{0}$ and $c_{1}$ .

Gradient descent decides on a level of adjustment by looking at the error function's steepness. Again, Wikipedia has a good analogy on this: Imagine being stuck in the mountains. It's very foggy and you need to get to a valley to escape. All you can rely on is local information (stuff that is visible around you). Using gradient descent, you would go in the steepest direction downwards, eventually reaching a minimum - a valley. The larger the steep, the larger the step you would take.

Let me illustrate in 2D:

The red dots on the graph show the value for C for each iteration of gradient descent. The arrows indicate in which direction the algorithm would alter C in order to find a smaller error.

(Since we have two C's ( $c_{0}$ and $c_{1}$ ) and one extra dimension for the error, the graph would be 3D, though.)

To get to the amount of change for each value of $c_{0}$ and $c_{1}$ , we can do partial derivatives with respect to $c_{0}$ and $c_{1}$ (i.e. what's the steepness of the function for one variable, while the other is held constant?). I won't go too deep into the calculus here. We would end up with this formula:

D(c_{m}) = \frac{-2}{n} \sum_{i=1}^{n}{((y^{(i)} - f(x_0^{(i)}, x_1^{(i)})) \times x_m^{(i)})}

$D(c_{m})$ meaning the descent of $c_{m}$ and $c_{m}$ being $c_{0}$ or $c_{1}$ . $x^{(i)}_m$ basically means "the x at position (i) in the data for the coefficient m". So for example, $x^{(i)}_0$ would always be 1, whereas $x^{(i)}_1$ is the data.

We can implement this more or less the same way we implemented the error function:

function descent(int $m, float $c0, float $c1, array $data): float {
  return (-2 / count($data)) * array_sum(
    array_map(
      function ($point) use ($c0, $c1, $m) {
        return ($point[2] - linearFunction($c0, $point[0], $c1, $point[1])) * $point[$m];
      },
      $data
    )
  );
}

This gives us the descent that needs to be applied to $c_m$ in order to bring the linear function closer to the data. Now we can...

Apply the adjustment

This is rather straight forward: We calculate the descents for $c_0$ and $c_1$ , apply some learning rate (to adjust the step size) and set the new values.

Wait, now what's a learning rate?

The gradient descent might try to take steps that are way too large to yield good results. Remember the analogy of being stuck on a mountain? Now imagine being 50 meters tall. The smallest step you could take would still be way too large, especially if the valley you want to get in is rather narrow. You wouldn't go into the valley, but end up on the opposite side of it. The next step you would take would be too large, too, so you end up, yet again, on the other side of the valley, eventually going even more uphill than downhill:

The learning rate acts as a counter measurement to this: It adjusts your size so your steps are small enough for you to step into the valley.

The learning rate depends on the kind of data you have. A learning rate that's too small will drastically increase the number of steps to take to reach a good fit, a learning rate that's too large will worsen the problem of taking steps that are too large.

There's a sweet spot to every data set, but there's no "one fits all" learning rate. One could apply other ML algorithms to find the perfect learning rate, though, but for small examples, it's best to try a few values and settle for the one that gives the best results.

With this information, we can define two functions to calculate the adjusted values of $c_0$ and $c_1$ :

function adaptC0(float $c0, float $c1, array $data, float $learningRate): float {
    return $c0 - $learningRate * descent(0, $c0, $c1, $data);
}

function adaptC1(float $c0, float $c1, array $data, float $learningRate): float {
    return $c1 - $learningRate * descent(1, $c0, $c1, $data);
}

Almost there. Now we need a loop to adapt the values until a certain goal is reached. Let's just do 50 iterations and see where the values end up:

// Rather small learning rate
$learningRate = 0.000001;

$errors = [];
for ($i = 0; $i < 50; $i++) {
    // Keep the errors so we can graph them later.
    $errors[] = squaredError($c0, $c1, $data);

    // Do not assign immediately, because otherwise 
    // the result of $c0 would influence the descent 
    // of $c1!
    $newC0 = adaptC0($c0, $c1, $data, $learningRate);
    $newC1 = adaptC1($c0, $c1, $data, $learningRate);

    $c0 = $newC0;
    $c1 = $newC1;
}

echo $c0 . ', ' . $c1; // 14.976594533192, 0.99210801880553

Ok, that seems to have worked. Let's see how good those values fit our data:

That's as good as it gets and actually pretty close! This function can now do rather accurate predictions for unknown values of x of the same data set. How did the error do, though?

We see that the error is large at first, but gets smaller really fast: Within around 5 iterations the error stabilizes at around 1429.28 - our implementation works!

Takeaway thoughts

Linear regression is applied mathematics. Just like most ML algorithms. Machine learning and AI may seem like utter magic at first, but once I started reading into the topic, I realized that the people applying it are basically working with the same tools everyone else does.

It's never the less amazing, that a mathematical model is able to recognize patterns. The more complex algorithms even find patterns a human being probably wouldn't, which serves as a great tool.

I hope you enjoyed reading this article as much as I enjoyed writing it! If so, leave a ❤️ or a 🦄! I write tech articles in my free time and like to drink coffee every once in a while.

If you want to support my efforts, please consider buying me a coffee ☕ or follow me on Twitter 🐦!

Top comments (7)

Dan Silcox • Dec 11 '20

Nice job and well explained examples! As a PHP-by-day dev it’s nice to see it be used for something a bit different too :) I would imagine in PHP 8, even for more complex analysis it would be a great contender to try this sort of thing.

Pascal Thormeier • Dec 11 '20

Thank you very much for your positive feedback! I always try to explain things as understandable as I can 😃

I could very well imagine PHP catching up in data science topis, especially with PHP8. Natural language processing and machine learning are things PHP could absolutely do, it does have all the necessary tools.

Dan Silcox • Dec 11 '20

I’m thinking I might have to give it a go some time :)

Pascal Thormeier • Dec 11 '20

Is there a specific algorithm or method you're thinking about? Might cover that one in a post next 🙂

Dan Silcox • Dec 11 '20

To be honest I don’t really know! Not done much with machine learning or similar really, too maths heavy for me generally 😂 but willing to give it a go, was just going to basically replicate what you’ve done here initially for learning purposes, maybe with a different data source.