Gradient Boosting explained (with pictures)

Tree-based models are a go-to solution for Machine Learning on tabular data. Google, Amazon and other big tech companies use them for fraud prediction, ad and search optimization, as well as recommendations and pricing. In this article you’ll explore the most basic building block of various tree models: the Decision Tree. You’ll learn what it is, why you need it and, most importantly, how it works. Building on top of this knowledge, you’ll then grasp Gradient Boosting, an ensemble Machine Learning algorithm that harnesses the power of Decision Trees. What is unusual about this particular article is that almost no math is required; in fact, there is only one formula. You don’t need any specialized knowledge to understand the explanation.

How to use a Decision Tree?

A Decision Tree is a Machine Learning algorithm used to predict various numerical measures such as price, temperature or the amount of sales. It can also be used to predict discrete properties such as loan defaults, plant species or even medical diagnoses.

To make predictions, Decision Trees need to learn first. How do they learn? They use past observations to make predictions about the future.

For example, imagine we want to train a Decision Tree to predict the price of a house. First, we need a table with house prices along with other information about each house: its size in square feet, the number of bedrooms, the total number of rooms, and so on.
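To make this concrete, here is a tiny made-up version of such a table in Python. The prices match the numbers used later in the article; the sizes, bedroom counts and room counts are invented purely for illustration.

```python
# A toy training table: one dictionary per house.
# Prices are in thousands of dollars (k$); the other values are made up.
houses = [
    {"size": 100, "bedrooms": 1, "rooms": 3, "price": 100},
    {"size": 150, "bedrooms": 2, "rooms": 4, "price": 200},
    {"size": 250, "bedrooms": 3, "rooms": 6, "price": 400},
    {"size": 300, "bedrooms": 2, "rooms": 4, "price": 300},
]
```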


How do we suggest a fair price for a new house using all this information? For example, let’s say we have a house with a size of 230 square feet, 2 bedrooms and 2 rooms in total. We can predict its price with a Decision Tree.


To make a prediction we do the following: 

  1. Give the algorithm a set of examples to learn from: a table of houses complete with prices and other features.
  2. Provide a dataset of houses that need the algorithm’s predictions: a table of houses with all the same features except prices.
  3. In return, the Decision Tree gives us price predictions for these houses (a minimal code sketch of this flow follows right after the list).
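A minimal sketch of these three steps in code, using scikit-learn’s DecisionTreeRegressor (the library is my choice for illustration; the article itself is library-agnostic) and the same made-up feature values as above:

```python
from sklearn.tree import DecisionTreeRegressor

# 1. Examples to learn from: features plus the known prices (in k$).
X_train = [[100, 1, 3], [150, 2, 4], [250, 3, 6], [300, 2, 4]]  # size, bedrooms, rooms
y_train = [100, 200, 400, 300]

# 2. A house that needs a prediction: 230 sq ft, 2 bedrooms, 2 rooms in total.
X_new = [[230, 2, 2]]

# 3. Train the tree on past observations, then ask it for a price estimate.
tree = DecisionTreeRegressor()
tree.fit(X_train, y_train)
print(tree.predict(X_new))  # the predicted price in k$
```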


How do Decision Trees make predictions?

Trained Decision Trees store their understanding of house prices in the form of a tree. They ask questions about the data and choose the next question based on the previous answer. In the end they give a numerical estimate for our set of answers. In this part of the article let’s focus on understanding how a trained Decision Tree makes predictions. We will cover how it learns from data later.

For example, let’s see how the process looks for our house from the previous example. Firstly, the Decision Tree asks ‘Is the size of the house less than 200 square feet?’. In our case the answer is ‘No’ so we proceed to the corresponding branch.


The next question: ‘Are there less than 5 rooms?’ For our particular house the answer is ‘yes’. Therefore the algorithm chooses the prediction of 300 k$ for this house.
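Written out as plain code, the walk down this particular tree looks roughly like the sketch below. Only the two questions and the 300 k$ leaf come from the article; the other leaf values are placeholders I filled in for illustration.

```python
def predict_price(size, rooms):
    # First question: 'Is the size of the house less than 200 square feet?'
    if size >= 200:
        # Second question: 'Are there less than 5 rooms?'
        if rooms < 5:
            return 300  # k$: the leaf our example house falls into
        return 400      # k$: placeholder for the remaining leaf
    return 150          # k$: placeholder for the branch not walked through here

print(predict_price(size=230, rooms=2))  # -> 300
```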


The process of making predictions is very straightforward for a Decision Tree but we still have lots of unanswered questions:

  1. How does it know which questions to ask? Why was the first question ‘Is the size of the house less than 200 square feet?’ and not ‘Are there less than 5 rooms?’?
  2. How does it know what value to predict given the set of answered questions? For example, what price should it predict when we answered ‘No’ and ‘Yes’ to the two previous questions? It predicted 300, but why not 400 or 200?

Both questions boil down to one thing: how does it learn? Phrased differently: how does the Decision Tree algorithm build a tree?

How does the Decision Tree algorithm build a tree?

Let’s approach the problem step by step. The first problem to solve is finding an optimal first question which splits the data into two parts. To find the best split we must test all possible splits of the data. Otherwise, how would we know that we chose the best one?

To test a split we apply it to the data and see how good the result is. For example, suppose we split the data using the first size value, 100: houses with a size of 100 square feet or less go to the left and the others go to the right.


But how do we know if a splitting condition is any good? More importantly, how do we compare two splits to each other to know which one is better? For example, which question should we ask first: ‘Is the size less than 100?’ or ‘Is the size less than 200?’?

To compare two splits with each other we calculate a metric known as the Mean Squared Error (MSE). Here’s the formula:

MSE = (1/n) · Σᵢ (Yᵢ − Ȳ)², where the sum Σ runs over i = 1, …, n

Let’s calculate the MSE for the house prices in our dataset step by step. First of all, since we are calculating the MSE for house prices, Y here is a house price. Secondly, n is just the number of observations in the dataset.


Our dataset has 4 observations so n = 4.


Next, Ȳ (the Y with a funny dash above it) is just our average price. To calculate the average price we sum all the prices and divide by their number (which is 4). The average Ȳ is 250.


The weird Σ symbol just means summing from i = 1 to n. We know that n = 4, so we just need to substitute each Yᵢ with the corresponding value from the dataset. Remember that 250 is our average Ȳ, the average price. We subtract the average price from each price and square the result every time.


In the end we get (100 − 250)² + (200 − 250)² + (400 − 250)² + (300 − 250)² = 50 000. The only thing left to do is to divide by the number of observations. As a result we get an MSE of 12 500 for our dataset.


The meaning behind MSE is that it measures the error we make when we predict a value with its average. The MSE we calculated answers the question ‘What would the average squared error be if we predicted every price with the average price?’ We calculated the average price: 250 k$. If we predict with this value, the mean squared error is 12 500 (measured in squared thousands of dollars). It seems like a bad idea to predict using the average price at this point. How can we make our prediction better?
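The same calculation in a few lines of Python, with prices in thousands of dollars:

```python
prices = [100, 200, 400, 300]   # the four house prices from the example, in k$
n = len(prices)                 # n = 4
average = sum(prices) / n       # the average price, 250

# Square each difference from the average, sum them up, divide by n.
mse = sum((y - average) ** 2 for y in prices) / n
print(average, mse)             # -> 250.0 12500.0
```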

By splitting the data! Let's split the data by the first size condition, size <= 100. But how do we calculate the MSE for a split? We know how to calculate it for one table, but splitting results in two tables. For a split we can calculate a weighted MSE: we compute the MSE of each table, multiply it by the table’s weight and add the two results together. The weight of a branch is the number of observations in that branch divided by the total number of observations.
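As a quick sketch, the weighted MSE of a split can be computed like this. The grouping below assumes the size <= 200 condition puts the 100 k$ and 200 k$ houses on one side and the 400 k$ and 300 k$ houses on the other, which matches the depth-one tree shown later in the article.

```python
def mse(values):
    avg = sum(values) / len(values)
    return sum((v - avg) ** 2 for v in values) / len(values)

def weighted_mse(left, right):
    n = len(left) + len(right)
    # Each branch's MSE is weighted by its share of the observations, then summed.
    return len(left) / n * mse(left) + len(right) / n * mse(right)

left_prices = [100, 200]   # houses on one side of the split (assumed grouping)
right_prices = [400, 300]  # houses on the other side
print(weighted_mse(left_prices, right_prices))  # -> 2500.0, down from 12500
```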

But let’s get back to the data. We were splitting the data by the first size condition, size <= 100. This split is not a particularly good one; we can do better.


Let’s try another split: size <= 200. This split actually makes our MSE lower. It means we can predict the house price better if we split the data into two groups using this condition and use the average price within each group as the prediction.


We test all possible splits in the dataset for all fields: size, bedrooms, rooms, etc. In the end we choose the best split among them, ‘best’ meaning the one which gives us the greatest MSE drop. But what comes next? How do we keep splitting the data inside both groups?


We repeat the same procedure for both groups, treating each group as if it were the beginning of a new tree. At each step we choose the split with the lowest weighted MSE.

How long should we continue? The Decision Tree algorithm can be controlled using a manually selected stopping condition, e.g. do not go deeper than 3 levels. If no stopping condition is set by the user, the tree keeps growing as long as there is something to split: as long as a group still contains at least 2 observations, the building continues. In the end, a tree without a stopping condition essentially splits the data down to individual samples.
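To make the whole procedure concrete, here is a rough, deliberately naive sketch of the building loop in Python. The prices are the ones from the example; the other feature values are made up, and real libraries implement this far more efficiently.

```python
def mse(values):
    avg = sum(values) / len(values)
    return sum((v - avg) ** 2 for v in values) / len(values)

def best_split(rows, features):
    """Try every feature/threshold pair and keep the one with the lowest weighted MSE."""
    best = None
    for feature in features:
        for threshold in sorted({r[feature] for r in rows}):
            left = [r for r in rows if r[feature] <= threshold]
            right = [r for r in rows if r[feature] > threshold]
            if not left or not right:
                continue  # a split must leave observations on both sides
            n = len(rows)
            score = (len(left) / n) * mse([r["price"] for r in left]) \
                  + (len(right) / n) * mse([r["price"] for r in right])
            if best is None or score < best[0]:
                best = (score, feature, threshold, left, right)
    return best

def build_tree(rows, features, depth=0, max_depth=None):
    prices = [r["price"] for r in rows]
    split = best_split(rows, features)
    # Stop if the depth limit is reached or there is nothing left to split;
    # the leaf then predicts the average price of its observations.
    if split is None or (max_depth is not None and depth >= max_depth):
        return {"predict": sum(prices) / len(prices)}
    _, feature, threshold, left, right = split
    return {
        "question": f"Is {feature} <= {threshold}?",
        "yes": build_tree(left, features, depth + 1, max_depth),
        "no": build_tree(right, features, depth + 1, max_depth),
    }

# The article's four prices; sizes, bedrooms and rooms are made-up illustrative values.
houses = [
    {"size": 100, "bedrooms": 1, "rooms": 3, "price": 100},
    {"size": 150, "bedrooms": 2, "rooms": 4, "price": 200},
    {"size": 250, "bedrooms": 3, "rooms": 6, "price": 400},
    {"size": 300, "bedrooms": 2, "rooms": 4, "price": 300},
]
print(build_tree(houses, ["size", "bedrooms", "rooms"], max_depth=1))
```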

How do we make predictions with a fully built tree? We answer its questions for the new observation and take the average value of all the training observations that match the same answers; that average is our prediction.


In our case the tree had no stopping condition, so it was built until only single observations were left. If we stopped the tree building at a depth of one, it would look like this: we ask ‘Is the size of the house less than 200 square feet?’. If yes, we take all the houses with a size of less than 200 square feet and average their prices; the resulting value is our prediction. The same goes for the other houses.


Let’s have a recap because it might be getting hazy:

  1. We train a Decision Tree by giving it observations of houses with prices.
  2. The algorithm builds a tree by splitting the data using questions. Each time it chooses the question with the least mean squared error.
  3. We give observations for which we want a prediction. Each observation should have all the same features as the dataset we trained the tree on except the price.


Gradient Boosting

Gradient Boosting serves the same purpose as Decision Trees: it learns from past observations to predict numerical measures. The difference is that it usually does so better than an individual tree. Why? Because this algorithm trains many trees, each a little differently, and incorporates every tree’s prediction into the final estimate.

Now that we have an understanding of Decision Trees, we can learn how Gradient Boosting works. It is actually the easiest part. Gradient Boosting works like this:

  • It trains a Decision Tree with a maximum depth of 2 or some other value which restricts tree growth to some extent. The restriction is needed in order for the algorithm to work; we will see why later.

  • The algorithm uses the first tree to make predictions. It then subtracts these predictions from our target variable.

  • It then uses the values of this new column (the leftover errors) to train another Decision Tree. The intuition is as follows: each consecutive tree learns from the mistakes of all the previous ones.

  • Then it repeats the process over and over, a number of times set by the user; usually it builds one hundred trees or more. Each time the error gets smaller because every new tree catches more nuance: if the first tree has a huge error on one observation, the next tree will try to correct it. A code sketch of the whole loop follows right after this list.
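Here is the promised sketch of the whole loop. It uses scikit-learn’s DecisionTreeRegressor for the individual trees, reuses the made-up features and the example prices from before, and leaves out refinements such as a learning rate that real implementations add.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, n_trees=100, max_depth=2):
    """Train shallow trees one after another, each on the errors left by the previous ones."""
    trees = []
    residual = np.array(y, dtype=float)        # start with the original target
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                  # learn whatever error is still left
        residual = residual - tree.predict(X)  # subtract this tree's predictions
        trees.append(tree)
    return trees

def gradient_boosting_predict(trees, X):
    # The final estimate is the sum of every tree's contribution.
    return sum(tree.predict(X) for tree in trees)

# Toy data: prices from the article, other feature values are made up.
X = [[100, 1, 3], [150, 2, 4], [250, 3, 6], [300, 2, 4]]
y = [100, 200, 400, 300]
trees = gradient_boosting_fit(X, y, n_trees=100, max_depth=2)
print(gradient_boosting_predict(trees, [[230, 2, 2]]))
```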

Generalization vs Accuracy

There is one unanswered question left. Why do we have to limit the tree depth in Gradient Boosting?

Previously we’ve seen how to grow a tree without a stopping condition. In practice people rarely use unbounded Decision Trees. The reason? Overfitting. It means the tree fits the original dataset too closely. In the real world we rarely need predictions for observations that are exactly like the ones in the training dataset; the need to extrapolate, or generalize, emerges. By limiting the maximum depth of the tree we fight overfitting and improve generalization.
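In practice the depth limit is simply a parameter you set. For instance, with scikit-learn’s GradientBoostingRegressor (my library choice here, not something the article prescribes), it looks like this:

```python
from sklearn.ensemble import GradientBoostingRegressor

# max_depth keeps every individual tree shallow so the ensemble generalizes
# instead of memorizing the training houses; n_estimators is the number of trees.
model = GradientBoostingRegressor(n_estimators=100, max_depth=2)
model.fit([[100, 1, 3], [150, 2, 4], [250, 3, 6], [300, 2, 4]],  # made-up features
          [100, 200, 400, 300])                                  # prices in k$
print(model.predict([[230, 2, 2]]))                              # estimate for the 230 sq ft house
```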

Imagine we have two trees: the first of depth one and the second of depth two. We use both trees to make a prediction for a sample which is not that close to the training dataset. Which tree performs better?


The first tree predicts using an average value of two houses with size less than 200. The resulting error is 10k because the true price is 160k while the prediction is 150k.


The second one is deeper so it gets down to individual samples for predictions. The closest house to the one we want a prediction for costs 200k so it predicts 200k. The resulting error is 40k.


The tree with the smaller depth wins because it is better at generalization; too much nuance can be harmful. Of course, tree depth should be fine-tuned to the data: there is a tradeoff between generalization and accuracy on the training data.

Concept review

Let’s review what we learned today:

  • A Decision Tree is an algorithm which learns from past observations to predict numerical values.
  • It learns by splitting the data using questions. Each question is selected to minimize the mean squared error, so the error decreases with each consecutive split.
  • Decision Trees predict using the average value of all observations matching the questions, e.g. we predict the price of a house using the average price of similar houses, where ‘similar’ means having the same answers to our questions.
  • Decision Trees are used to train a more complex algorithm — Gradient Boosting.
  • In Gradient Boosting we train lots of trees with limited depth.
  • Each consecutive tree learns from the errors of previous ones.
  • As a result, with each new tree error decreases.
  • It is important not to use deep trees in Gradient Boosting: shallow trees generalize better.

If you find this article interesting and want more content like this, feel free to connect on LinkedIn:
https://www.linkedin.com/in/bdrgn
