Pascal Thormeier

Algorithm explained: Linear regression using gradient descent with PHP

Part 4 of Algorithms explained! Every few weeks I write about an algorithm and explain and implement it!
Is there an algorithm you always wanted to know about? Leave a comment!

I once took an online class on machine learning by the amazing Andrew Ng from Stanford during my bachelor's studies. One of the first ML algorithms I ever had the chance to look at was linear regression.

This algorithm really struck me. It lets machines do something humans are incredibly good at: pattern recognition. By applying mathematics, a machine can create a linear function that more or less matches seemingly unrelated data!

In this post I'll explain how linear regression works and implement it in PHP!

So what does linear regression actually do?

In one sentence: It creates a linear function that represents the trend of something. The go-to example is usually house prices compared to their size. The larger a house, the more expensive it is. Sounds reasonable, right? You would put a known house size, for example in square meters, into the function and you would get an estimated price out of it.

Let's have a look at this data set:

House prices and sizes compared to each other in a two dimensional graph. There's some correlation visible.

At first glance, the data seems to correlate. The larger the house, the more expensive it is. The data is a bit scattered, but there's an overall trend visible:

The same graph as above, now with an added trend line

(This is an approximation)

Of course, there are usually more factors to the price of a house in real life: Is there a grocery store nearby, is it modern, was it once considered haunted, what's the size of its backyard? These factors make up the variation in the data - it's not perfectly linear.

The algorithm itself works like this:

  1. Define an arbitrary linear function as a trend line (usually with random coefficients)
  2. Measure how far off it is by calculating the average distance between the predicted Y and the actual Y of every data point (the so-called "error")
  3. Adjust the trend line based on the measurement (the "gradient descent")
  4. Repeat steps 2 and 3 until the average distance reaches a minimum (not necessarily 0)

Let me illustrate with the above example:

The different steps illustrated

(Again, this is an approximation)

So, to implement the algorithm, we need four different pieces:

  • The definition of a linear function
  • The measurement
  • The adjustment
  • Repeating until it's at a minimum

Let's start with...

The definition of a linear function

A linear function f(x) usually comes in the form of:

f(x) = ax + b

where a defines the slope of the function and b moves it along the Y axis.

We can make a more general definition of a linear function by introducing a variable called x_0 into the function and renaming a and b to c_0 and c_1 ("c" stands for "coefficient"):

f(x_0, x_1) = c_0 x_0 + c_1 x_1

In PHP, this is pretty straightforward:



function linearFunction (float $c0, float $x0, float $c1, float $x1) : float {
    return $c0 * $x0 + $c1 * $x1;
}

We can randomly define $c0 and $c1 to get a first starting point:

$c0 = mt_rand(-100, 100); // bounded, since slopes of -100 or 100 are already almost parallel to the Y axis
$c1 = mt_rand(-100, 100);
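As a quick sanity check, we can evaluate the function once by hand. The coefficients 2 and 3 and the input 10 below are made-up values, not part of the example data:

```php
function linearFunction (float $c0, float $x0, float $c1, float $x1) : float {
    return $c0 * $x0 + $c1 * $x1;
}

// With x0 fixed to 1, c0 acts as the intercept and c1 as the slope:
echo linearFunction(2, 1, 3, 10); // 2 * 1 + 3 * 10 = 32
```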

So far, so good. Let's generate some data next!

Getting data

I'm going to generate the data randomly with a correlation. To get the linear function f(x_0, x_1) = c_0 x_0 + c_1 x_1 back into its original form f(x) = c_0 + c_1 x, we set x_0 to always be 1:

$data = [
    [1, mt_rand(0, 100), mt_rand(0, 100)],
    [1, mt_rand(100, 200), mt_rand(100, 200)],
    [1, mt_rand(200, 300), mt_rand(200, 300)],
    [1, mt_rand(300, 400), mt_rand(300, 400)],
    [1, mt_rand(400, 500), mt_rand(400, 500)],
    [1, mt_rand(500, 600), mt_rand(500, 600)],
    [1, mt_rand(600, 700), mt_rand(600, 700)],
    [1, mt_rand(700, 800), mt_rand(700, 800)],
    [1, mt_rand(800, 900), mt_rand(800, 900)],
    [1, mt_rand(900, 1000), mt_rand(900, 1000)],
];

For illustration purposes, I'll work with this data set and predefined values for $c0 and $c1:

$data = [
  [1, 48, 36],
  [1, 121, 104],
  [1, 285, 208],
  [1, 345, 373],
  [1, 402, 442],
  [1, 576, 582],
  [1, 617, 629],
  [1, 723, 787],
  [1, 852, 857],
  [1, 965, 950],
];

$c0 = 16;
$c1 = 15;

To graph a few things, I'll use Desmos, which is an amazing tool to fiddle around with maths. On a graph, the data looks like this:

The above data, graphed as a scatter plot

We can see a clear trend there. The trend line will likely be going from bottom left to top right.

With the data at hand, we can start to...

Measure the error

So, to get a good estimate of how far off the linear function is, we calculate the average squared distance between the function's output Y and the actual Y. Squared, because otherwise positive and negative distances would cancel each other out, which is something we don't want.

For this, we'll define yet another function that looks like this:

error(x_0^{(1..n)}, x_1^{(1..n)}, y^{(1..n)}) = \frac{1}{n}\sum_{i=1}^{n}{(y^{(i)} - f(x_0^{(i)}, x_1^{(i)}))^2}

(Notice: x^{(i)} is not "x to the power of (i)", but rather "the x at position (i) in the list".)

Or, in PHP:

function squaredError(float $c0, float $c1, array $data): float {
  return array_sum(
      array_map(
          function ($point) use ($c0, $c1) {
            return ($point[2] - linearFunction($c0, $point[0], $c1, $point[1])) ** 2;
          },
          $data
      )
  ) / count($data);
}

var_dump(squaredError($c0, $c1, $data)); // 73679488.2

This formula sums up the squares (so negative differences can't cancel out positive ones) of the differences between the trend line's predicted Y and the actual Y for every data point. It then divides the sum by the total number of data points, resulting in an average.

Ok, but 73679488.2 is quite the number. Let's see what this would look like by adding the linear function to the graph:

Desmos showing the data and the graph

Yup, that's pretty far off. The predicted Y-value for each point lies far above the actual one, accounting for this very large error. But that was just the starting definition. We can start improving from here.

Calculating the adjustment

To improve the trend line's fit to the data, we'll use "gradient descent". According to Wikipedia, gradient descent is "a first-order iterative optimization algorithm for finding a local minimum of a differentiable function." And that pretty much sums it up.

The "differentiable function" is, in the case of linear regression, the error function. So gradient descent basically tries to find the local minimum of the error function (thus maximizing the fit of the original linear function to the data). To achieve this, it can adjust two things.

Since X is always given (it's the data, we can't change that), gradient descent can adjust c_0 and c_1.

Gradient descent decides on a level of adjustment by looking at the error function's steepness. Again, Wikipedia has a good analogy for this: Imagine being stuck in the mountains. It's very foggy and you need to get to a valley to escape. All you can rely on is local information (stuff that is visible around you). Using gradient descent, you would go in the steepest direction downwards, eventually reaching a minimum - a valley. The steeper the slope, the larger the step you would take.

Let me illustrate in 2D:

Gradient descent in two dimensions. A parabola showing the relation between a single coefficient C and the error. Red dots indicate which C is chosen and in which direction it is altered.

The red dots on the graph show the value for C for each iteration of gradient descent. The arrows indicate in which direction the algorithm would alter C in order to find a smaller error.

(Since we have two C's (c_0 and c_1) and one extra dimension for the error, the graph would actually be 3D, though.)

To get to the amount of change for each value of c_0 and c_1, we can take partial derivatives with respect to c_0 and c_1 (i.e. what's the steepness of the function for one variable while the other is held constant?). I won't go too deep into the calculus here. We would end up with this formula:

D(c_m) = \frac{-2}{n} \sum_{i=1}^{n}{((y^{(i)} - f(x_0^{(i)}, x_1^{(i)})) \times x_m^{(i)})}

D(c_m) means the descent of c_m, with c_m being c_0 or c_1. x_m^{(i)} basically means "the x at position (i) in the data for the coefficient m". So, for example, x_0^{(i)} would always be 1, whereas x_1^{(i)} is the actual data.
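For the curious, the formula comes from applying the chain rule to the error function. This is only a sketch (the partial derivative of f with respect to c_m is x_m, since the other term is held constant):

```latex
\frac{\partial}{\partial c_m} \frac{1}{n}\sum_{i=1}^{n}{(y^{(i)} - f(x_0^{(i)}, x_1^{(i)}))^2}
= \frac{1}{n}\sum_{i=1}^{n}{2 (y^{(i)} - f(x_0^{(i)}, x_1^{(i)})) \times (-x_m^{(i)})}
= \frac{-2}{n}\sum_{i=1}^{n}{((y^{(i)} - f(x_0^{(i)}, x_1^{(i)})) \times x_m^{(i)})}
```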

We can implement this more or less the same way we implemented the error function:

function descent(int $m, float $c0, float $c1, array $data): float {
  return (-2 / count($data)) * array_sum(
      array_map(
          function ($point) use ($c0, $c1, $m) {
            return ($point[2] - linearFunction($c0, $point[0], $c1, $point[1])) * $point[$m];
          },
          $data
      )
  );
}

This gives us the descent that needs to be applied to c_m in order to bring the linear function closer to the data. Now we can...

Apply the adjustment

This is rather straightforward: We calculate the descents for c_0 and c_1, apply some learning rate (to adjust the step size) and set the new values.

Wait, now what's a learning rate?

The gradient descent might try to take steps that are way too large to yield good results. Remember the analogy of being stuck on a mountain? Now imagine being 50 meters tall. The smallest step you could take would still be way too large, especially if the valley you want to get in is rather narrow. You wouldn't go into the valley, but end up on the opposite side of it. The next step you would take would be too large, too, so you end up, yet again, on the other side of the valley, eventually going even more uphill than downhill:

Gradient descent going in the wrong direction, an approximation.

The learning rate acts as a countermeasure to this: It adjusts your size so your steps are small enough for you to step into the valley.

The learning rate depends on the kind of data you have. A learning rate that's too small will drastically increase the number of steps to take to reach a good fit, a learning rate that's too large will worsen the problem of taking steps that are too large.

There's a sweet spot for every data set, but there's no "one size fits all" learning rate. One could apply other optimization techniques to find a good learning rate, but for small examples it's best to try a few values and settle for the one that gives the best results.
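To see the overshooting effect in isolation, here's a tiny, made-up example that is not part of our regression: gradient descent on the one-dimensional function f(c) = c², whose derivative is 2c and whose minimum is at c = 0. The helper descend1D and all the numbers are purely illustrative. A small learning rate converges towards the minimum; one that is too large bounces across the valley and diverges:

```php
function descend1D(float $c, float $learningRate, int $steps): float {
    for ($i = 0; $i < $steps; $i++) {
        $c = $c - $learningRate * 2 * $c; // the derivative of c^2 is 2c
    }
    return $c;
}

echo descend1D(10, 0.1, 20); // ~0.12, converging towards the minimum
echo "\n";
echo descend1D(10, 1.1, 20); // ~383, overshooting further with every step
```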

With this information, we can define two functions to calculate the adjusted values of c_0 and c_1:

function adaptC0(float $c0, float $c1, array $data, float $learningRate): float {
    return $c0 - $learningRate * descent(0, $c0, $c1, $data);
}

function adaptC1(float $c0, float $c1, array $data, float $learningRate): float {
    return $c1 - $learningRate * descent(1, $c0, $c1, $data);
}

Almost there. Now we need a loop to adapt the values until a certain goal is reached. Let's just do 50 iterations and see where the values end up:

// Rather small learning rate
$learningRate = 0.000001;

$errors = [];
for ($i = 0; $i < 50; $i++) {
    // Keep the errors so we can graph them later.
    $errors[] = squaredError($c0, $c1, $data);

    // Do not assign immediately, because otherwise 
    // the result of $c0 would influence the descent 
    // of $c1!
    $newC0 = adaptC0($c0, $c1, $data, $learningRate);
    $newC1 = adaptC1($c0, $c1, $data, $learningRate);

    $c0 = $newC0;
    $c1 = $newC1;
}

echo $c0 . ', ' . $c1; // 14.976594533192, 0.99210801880553

Ok, that seems to have worked. Let's see how good those values fit our data:

Desmos plotting the linear function above the data: It fits well.

That's as good as it gets and actually pretty close! This function can now make rather accurate predictions for unseen values of x from the same data set. How did the error do, though?

Desmos showing the errors for each iteration: They get small really fast and then settle near the X axis.

We see that the error is large at first, but gets smaller really fast: Within around 5 iterations the error stabilizes at around 1429.28 - our implementation works!
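For reference, here's the entire program assembled from the pieces above, runnable as-is (using the fixed example data instead of random starting values), plus a sample prediction at the end. The input 550 is an arbitrary, unseen x value I picked for illustration:

```php
function linearFunction (float $c0, float $x0, float $c1, float $x1) : float {
    return $c0 * $x0 + $c1 * $x1;
}

function squaredError(float $c0, float $c1, array $data): float {
    return array_sum(array_map(function ($point) use ($c0, $c1) {
        return ($point[2] - linearFunction($c0, $point[0], $c1, $point[1])) ** 2;
    }, $data)) / count($data);
}

function descent(int $m, float $c0, float $c1, array $data): float {
    return (-2 / count($data)) * array_sum(array_map(function ($point) use ($c0, $c1, $m) {
        return ($point[2] - linearFunction($c0, $point[0], $c1, $point[1])) * $point[$m];
    }, $data));
}

function adaptC0(float $c0, float $c1, array $data, float $learningRate): float {
    return $c0 - $learningRate * descent(0, $c0, $c1, $data);
}

function adaptC1(float $c0, float $c1, array $data, float $learningRate): float {
    return $c1 - $learningRate * descent(1, $c0, $c1, $data);
}

$data = [
    [1, 48, 36], [1, 121, 104], [1, 285, 208], [1, 345, 373], [1, 402, 442],
    [1, 576, 582], [1, 617, 629], [1, 723, 787], [1, 852, 857], [1, 965, 950],
];

$c0 = 16;
$c1 = 15;
$learningRate = 0.000001;

for ($i = 0; $i < 50; $i++) {
    // Calculate both descents before assigning, so the new $c0
    // doesn't influence the descent of $c1.
    $newC0 = adaptC0($c0, $c1, $data, $learningRate);
    $newC1 = adaptC1($c0, $c1, $data, $learningRate);
    $c0 = $newC0;
    $c1 = $newC1;
}

echo $c0 . ', ' . $c1; // 14.976594533192, 0.99210801880553

// A sample prediction for an unseen x of 550
echo "\n" . linearFunction($c0, 1, $c1, 550); // roughly 560.6
```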

Takeaway thoughts

Linear regression is applied mathematics. Just like most ML algorithms. Machine learning and AI may seem like utter magic at first, but once I started reading into the topic, I realized that the people applying it are basically working with the same tools everyone else does.

It's nevertheless amazing that a mathematical model is able to recognize patterns. The more complex algorithms even find patterns a human probably wouldn't, which makes them a great tool.

I hope you enjoyed reading this article as much as I enjoyed writing it! If so, leave a ❤️ or a 🦄! I write tech articles in my free time and like to drink coffee every once in a while.

If you want to support my efforts, please consider buying me a coffee or follow me on Twitter 🐦!


Top comments (7)

Dan Silcox

Nice job and well explained examples! As a PHP-by-day dev it’s nice to see it be used for something a bit different too :) I would imagine in PHP 8, even for more complex analysis it would be a great contender to try this sort of thing.

Pascal Thormeier

Thank you very much for your positive feedback! I always try to explain things as understandable as I can 😃

I could very well imagine PHP catching up in data science topics, especially with PHP 8. Natural language processing and machine learning are things PHP could absolutely do; it has all the necessary tools.

Dan Silcox

I’m thinking I might have to give it a go some time :)

Pascal Thormeier

Is there a specific algorithm or method you're thinking about? Might cover that one in a post next 🙂

Dan Silcox

To be honest I don’t really know! Not done much with machine learning or similar really, too maths heavy for me generally 😂 but willing to give it a go, was just going to basically replicate what you’ve done here initially for learning purposes, maybe with a different data source.

Isitar (Pascal Lüscher)

Typo in "The definition of a linear function"

f(x) = ax + b, not f(x) = a + bx. since you describe a as the slope variable & b as the intersection with the y axis at x = 0.

Pascal Thormeier

Fixed! Thank you for pointing it out, might have been confusing for someone. 🙂