Gradient Boosting explained (with pictures)

Tree-based models are a go-to solution for Machine Learning on tabular data. Google, Amazon and other big tech companies use them for fraud prediction, ad and search optimization, as well as recommendations and pricing. In this article you’ll explore the most basic building block of various tree models: the Decision Tree. You’ll learn what it is, why you need it and, most importantly, how it works. Building on top of this knowledge, you’ll then grasp Gradient Boosting, an ensemble Machine Learning algorithm that harnesses the power of Decision Trees. What is unusual about this particular article is that almost no math is required; in fact, there is only one formula. You don’t need any specialized knowledge to understand the explanation.

How to use a Decision Tree?

A Decision Tree is a Machine Learning algorithm used to predict various numerical measures such as price, temperature or the amount of sales. It can also be used to predict discrete properties such as loan defaults, plant species or even medical diagnoses.

To make predictions, Decision Trees need to learn first. How do they learn? They use past observations to make predictions about the future.

For example, imagine we want to train a Decision Tree to predict the price of a house. First, we need a table with house prices along with other information about each house: its size in square feet, the number of bedrooms, the total number of rooms, and so on.
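To make this concrete, here is a tiny made-up version of such a table in Python. The prices match the numbers used later in the article; the sizes, bedroom counts and room counts are invented purely for illustration.

```python
# A toy training table: one dictionary per house.
# Prices are in thousands of dollars (k$); the other values are made up.
houses = [
    {"size": 100, "bedrooms": 1, "rooms": 3, "price": 100},
    {"size": 150, "bedrooms": 2, "rooms": 4, "price": 200},
    {"size": 250, "bedrooms": 3, "rooms": 6, "price": 400},
    {"size": 300, "bedrooms": 2, "rooms": 4, "price": 300},
]
```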


How do we suggest a fair price for a new house using all this information? For example, let’s say we have a house with a size of 230 square feet, 2 bedrooms and 2 rooms in total. We can predict its price with a Decision Tree.


To make a prediction we do the following: 

  1. Give the algorithm a set of examples to learn from: a table of houses complete with prices and other features.
  2. Provide a dataset of houses that need the algorithm’s predictions: a table of houses with all the same features except prices.
  3. In return, the Decision Tree gives us price predictions for these houses (a minimal code sketch of this flow follows right after the list).
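A minimal sketch of these three steps in code, using scikit-learn’s DecisionTreeRegressor (the library is my choice for illustration; the article itself is library-agnostic) and the same made-up feature values as above:

```python
from sklearn.tree import DecisionTreeRegressor

# 1. Examples to learn from: features plus the known prices (in k$).
X_train = [[100, 1, 3], [150, 2, 4], [250, 3, 6], [300, 2, 4]]  # size, bedrooms, rooms
y_train = [100, 200, 400, 300]

# 2. A house that needs a prediction: 230 sq ft, 2 bedrooms, 2 rooms in total.
X_new = [[230, 2, 2]]

# 3. Train the tree on past observations, then ask it for a price estimate.
tree = DecisionTreeRegressor()
tree.fit(X_train, y_train)
print(tree.predict(X_new))  # the predicted price in k$
```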


How do Decision Trees make predictions?

Trained Decision Trees store their understanding of house prices in the form of a tree. They ask questions about the data and choose the next question based on the previous answer. In the end they give a numerical estimate for our set of answers. In this part of the article let’s focus on understanding how a trained Decision Tree makes predictions. We will cover how it learns from data later.

For example, let’s see how the process looks for our house from the previous example. Firstly, the Decision Tree asks ‘Is the size of the house less than 200 square feet?’. In our case the answer is ‘No’ so we proceed to the corresponding branch.


The next question: ‘Are there less than 5 rooms?’ For our particular house the answer is ‘yes’. Therefore the algorithm chooses the prediction of 300 k$ for this house.
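Written out as plain code, the walk down this particular tree looks roughly like the sketch below. Only the two questions and the 300 k$ leaf come from the article; the other leaf values are placeholders I filled in for illustration.

```python
def predict_price(size, rooms):
    # First question: 'Is the size of the house less than 200 square feet?'
    if size >= 200:
        # Second question: 'Are there less than 5 rooms?'
        if rooms < 5:
            return 300  # k$: the leaf our example house falls into
        return 400      # k$: placeholder for the remaining leaf
    return 150          # k$: placeholder for the branch not walked through here

print(predict_price(size=230, rooms=2))  # -> 300
```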


The process of making predictions is very straightforward for a Decision Tree but we still have lots of unanswered questions:

  1. How does it know which questions to ask? Why was the first question ‘Is the size of the house less than 200 square feet?’ and not ‘Are there less than 5 rooms?’?
  2. How does it know what value to predict given the set of answered questions? For example, what price should it predict when we answered ‘No’ and ‘Yes’ to the two previous questions? It predicted 300, but why not 400 or 200?

Both questions boil down to one thing: how does it learn? Phrased differently: how does the Decision Tree algorithm build a tree?

How does the Decision Tree algorithm build a tree?

Let’s approach the problem step by step. The first problem to solve is finding an optimal first question which splits the data into two parts. To find the best split we must test all possible splits of the data. Otherwise, how would we know that we chose the best one?

To test a split we apply it to the data and see how good the result is. For example, suppose we split the data using the first size value, 100: houses with a size of 100 square feet or less go to the left and the others go to the right.


But how do we know if a splitting condition is any good? More importantly, how do we compare two splits to each other to know which one is better? For example, which question should we ask first: ‘Is the size less than 100?’ or ‘Is the size less than 200?’?

To compare two splits with each other we calculate a metric known as the Mean Squared Error (MSE). Here’s the formula:

MSE = (1/n) · Σᵢ (Yᵢ − Ȳ)², where the sum Σ runs over i = 1, …, n

Let’s calculate the MSE for the house prices in our dataset step by step. First of all, since we are calculating the MSE for house prices, Y here is a house price. Secondly, n is just the number of observations in the dataset.


Our dataset has 4 observations so n = 4.


Next, Ȳ (the Y with a funny dash above it) is just our average price. To calculate the average price we sum all the prices and divide by their number (which is 4). The average Ȳ is 250.


The weird Σ symbol just means summing from i = 1 to n. We know that n = 4, so we just need to substitute each Yᵢ with the corresponding value from the dataset. Remember that 250 is our average Ȳ, the average price. We subtract the average price from each price and square the result every time.


In the end we get (100 − 250)² + (200 − 250)² + (400 − 250)² + (300 − 250)² = 50 000. The only thing left to do is to divide by the number of observations. As a result we get an MSE of 12 500 for our dataset.


The meaning behind MSE is that it measures the error we make when we predict a value with its average. The MSE we calculated answers the question ‘What would the average squared error be if we predicted every price with the average price?’ We calculated the average price: 250 k$. If we predict with this value, the mean squared error is 12 500 (measured in squared thousands of dollars). It seems like a bad idea to predict using the average price at this point. How can we make our prediction better?
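The same calculation in a few lines of Python, with prices in thousands of dollars:

```python
prices = [100, 200, 400, 300]   # the four house prices from the example, in k$
n = len(prices)                 # n = 4
average = sum(prices) / n       # the average price, 250

# Square each difference from the average, sum them up, divide by n.
mse = sum((y - average) ** 2 for y in prices) / n
print(average, mse)             # -> 250.0 12500.0
```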

By splitting the data! Let's split the data by the first size condition, size <= 100. But how do we calculate the MSE for a split? We know how to calculate it for one table, but splitting results in two tables. For a split we can calculate a weighted MSE: we compute the MSE of each table, multiply it by the table’s weight and add the two results together. The weight of a branch is the number of observations in that branch divided by the total number of observations.
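As a quick sketch, the weighted MSE of a split can be computed like this. The grouping below assumes the size <= 200 condition puts the 100 k$ and 200 k$ houses on one side and the 400 k$ and 300 k$ houses on the other, which matches the depth-one tree shown later in the article.

```python
def mse(values):
    avg = sum(values) / len(values)
    return sum((v - avg) ** 2 for v in values) / len(values)

def weighted_mse(left, right):
    n = len(left) + len(right)
    # Each branch's MSE is weighted by its share of the observations, then summed.
    return len(left) / n * mse(left) + len(right) / n * mse(right)

left_prices = [100, 200]   # houses on one side of the split (assumed grouping)
right_prices = [400, 300]  # houses on the other side
print(weighted_mse(left_prices, right_prices))  # -> 2500.0, down from 12500
```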

But let’s get back to the data. We were splitting the data by the first size condition, size <= 100. This split is not a particularly good one; we can do better.


Let’s try another split: size <= 200. This split actually makes our MSE lower. It means we can predict the house price better if we split the data into two groups using this condition and use the average price within each group as the prediction.


We test all possible splits in the dataset for all fields: size, bedrooms, rooms, etc. In the end we choose the best split among them, ‘best’ meaning the one which gives us the greatest MSE drop. But what comes next? How do we keep splitting the data inside both groups?


We repeat the same procedure for both groups, treating each group as if it were the beginning of a new tree. At each step we choose the split with the lowest weighted MSE.

How long should we continue? The Decision Tree algorithm can be controlled using a manually selected stopping condition, e.g. do not go deeper than 3 levels. If no stopping condition is set by the user, the tree keeps growing as long as there is something to split: as long as a group still contains at least 2 observations, the building continues. In the end, a tree without a stopping condition essentially splits the data down to individual samples.
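To make the whole procedure concrete, here is a rough, deliberately naive sketch of the building loop in Python. The prices are the ones from the example; the other feature values are made up, and real libraries implement this far more efficiently.

```python
def mse(values):
    avg = sum(values) / len(values)
    return sum((v - avg) ** 2 for v in values) / len(values)

def best_split(rows, features):
    """Try every feature/threshold pair and keep the one with the lowest weighted MSE."""
    best = None
    for feature in features:
        for threshold in sorted({r[feature] for r in rows}):
            left = [r for r in rows if r[feature] <= threshold]
            right = [r for r in rows if r[feature] > threshold]
            if not left or not right:
                continue  # a split must leave observations on both sides
            n = len(rows)
            score = (len(left) / n) * mse([r["price"] for r in left]) \
                  + (len(right) / n) * mse([r["price"] for r in right])
            if best is None or score < best[0]:
                best = (score, feature, threshold, left, right)
    return best

def build_tree(rows, features, depth=0, max_depth=None):
    prices = [r["price"] for r in rows]
    split = best_split(rows, features)
    # Stop if the depth limit is reached or there is nothing left to split;
    # the leaf then predicts the average price of its observations.
    if split is None or (max_depth is not None and depth >= max_depth):
        return {"predict": sum(prices) / len(prices)}
    _, feature, threshold, left, right = split
    return {
        "question": f"Is {feature} <= {threshold}?",
        "yes": build_tree(left, features, depth + 1, max_depth),
        "no": build_tree(right, features, depth + 1, max_depth),
    }

# The article's four prices; sizes, bedrooms and rooms are made-up illustrative values.
houses = [
    {"size": 100, "bedrooms": 1, "rooms": 3, "price": 100},
    {"size": 150, "bedrooms": 2, "rooms": 4, "price": 200},
    {"size": 250, "bedrooms": 3, "rooms": 6, "price": 400},
    {"size": 300, "bedrooms": 2, "rooms": 4, "price": 300},
]
print(build_tree(houses, ["size", "bedrooms", "rooms"], max_depth=1))
```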

How do we make predictions with a fully built tree? We answer its questions for the new observation and take the average value of all the training observations that match the same answers; that average is our prediction.


In our case the tree had no stopping condition, so it was built until only single observations were left. If we stopped the tree building at a depth of one, it would look like this: we ask ‘Is the size of the house less than 200 square feet?’. If yes, we take all the houses with a size of less than 200 square feet and average their prices; the resulting value is our prediction. The same goes for the other houses.


Let’s have a recap because it might be getting hazy:

  1. We train a Decision Tree by giving it observations of houses with prices.
  2. The algorithm builds a tree by splitting the data using questions. Each time it chooses the question with the least mean squared error.
  3. We give observations for which we want a prediction. Each observation should have all the same features as the dataset we trained the tree on except the price.


Gradient Boosting

Gradient Boosting serves the same purpose as Decision Trees: it learns from past observations to predict numerical measures. The difference is that it usually does so better than an individual tree. Why? Because this algorithm trains many trees, each a little differently, and incorporates every tree’s prediction into the final estimate.

Now that we have an understanding of Decision Trees, we can learn how Gradient Boosting works. It is actually the easiest part. Gradient Boosting works like this:

  • It trains a Decision Tree with a maximum depth of 2 or some other value which restricts tree growth to some extent. The restriction is needed in order for the algorithm to work; we will see why later.

  • The algorithm uses the first tree to make predictions. It then subtracts these predictions from our target variable.

  • It then uses the values of this new column (the leftover errors) to train another Decision Tree. The intuition is as follows: each consecutive tree learns from the mistakes of all the previous ones.

  • Then it repeats the process over and over, a number of times set by the user; usually it builds one hundred trees or more. Each time the error gets smaller because every new tree catches more nuance: if the first tree has a huge error on one observation, the next tree will try to correct it. A code sketch of the whole loop follows right after this list.
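Here is the promised sketch of the whole loop. It uses scikit-learn’s DecisionTreeRegressor for the individual trees, reuses the made-up features and the example prices from before, and leaves out refinements such as a learning rate that real implementations add.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, n_trees=100, max_depth=2):
    """Train shallow trees one after another, each on the errors left by the previous ones."""
    trees = []
    residual = np.array(y, dtype=float)        # start with the original target
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                  # learn whatever error is still left
        residual = residual - tree.predict(X)  # subtract this tree's predictions
        trees.append(tree)
    return trees

def gradient_boosting_predict(trees, X):
    # The final estimate is the sum of every tree's contribution.
    return sum(tree.predict(X) for tree in trees)

# Toy data: prices from the article, other feature values are made up.
X = [[100, 1, 3], [150, 2, 4], [250, 3, 6], [300, 2, 4]]
y = [100, 200, 400, 300]
trees = gradient_boosting_fit(X, y, n_trees=100, max_depth=2)
print(gradient_boosting_predict(trees, [[230, 2, 2]]))
```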

Generalization vs Accuracy

There is one unanswered question left. Why do we have to limit the tree depth in Gradient Boosting?

Previously we’ve seen how to grow a tree without a stopping condition. In practice people rarely use unbounded Decision Trees. The reason? Overfitting. It means the tree fits the original dataset too closely. In the real world we rarely need predictions for observations that are exactly like the ones in the training dataset; the need to extrapolate, or generalize, emerges. By limiting the maximum depth of the tree we fight overfitting and improve generalization.
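In practice the depth limit is simply a parameter you set. For instance, with scikit-learn’s GradientBoostingRegressor (my library choice here, not something the article prescribes), it looks like this:

```python
from sklearn.ensemble import GradientBoostingRegressor

# max_depth keeps every individual tree shallow so the ensemble generalizes
# instead of memorizing the training houses; n_estimators is the number of trees.
model = GradientBoostingRegressor(n_estimators=100, max_depth=2)
model.fit([[100, 1, 3], [150, 2, 4], [250, 3, 6], [300, 2, 4]],  # made-up features
          [100, 200, 400, 300])                                  # prices in k$
print(model.predict([[230, 2, 2]]))                              # estimate for the 230 sq ft house
```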

Imagine we have two trees: the first of depth one and the second of depth two. We use both trees to make a prediction for a sample which is not that close to the training dataset. Which tree performs better?


The first tree predicts using an average value of two houses with size less than 200. The resulting error is 10k because the true price is 160k while the prediction is 150k.


The second one is deeper so it gets down to individual samples for predictions. The closest house to the one we want a prediction for costs 200k so it predicts 200k. The resulting error is 40k.


The tree with the smaller depth wins because it is better at generalization; too much nuance can be harmful. Of course, tree depth should be fine-tuned to the data: there is a tradeoff between generalization and accuracy on the training data.

Concept review

Let’s review what we learned today:

  • A Decision Tree is an algorithm which learns from past observations to predict numerical values.
  • It learns by splitting the data using questions. Each question is selected to minimize the mean squared error, so the error decreases with each consecutive split.
  • Decision Trees predict using the average value of all observations matching the questions, e.g. we predict the price of a house using the average price of similar houses, where ‘similar’ means having the same answers to our questions.
  • Decision Trees are used to train a more complex algorithm — Gradient Boosting.
  • In Gradient Boosting we train lots of trees with limited depth.
  • Each consecutive tree learns from the errors of previous ones.
  • As a result, with each new tree error decreases.
  • It is important not to use deep trees in Gradient Boosting: shallow trees generalize better.

If you find this article interesting and want more content like this, feel free to connect on LinkedIn:
https://www.linkedin.com/in/bdrgn
