If you've spent any time on Kaggle, you've seen XGBoost win. Over and over. Structured data competition? XGBoost. Tabular data problem? XGBoost. Real-world ML pipeline? XGBoost.
It's not hype. It genuinely is that good on most problems with structured data.
But a lot of people use it without understanding why it works. They just copy the code, tune a few numbers, and hope for the best. This post fixes that.
What You'll Learn Here
- The difference between bagging and boosting
- How gradient boosting works step by step
- What makes XGBoost faster and better than basic gradient boosting
- How to train XGBoost for classification and regression
- The most important hyperparameters and what they actually do
- Early stopping so you never have to guess the right number of trees
Bagging vs Boosting: The Core Difference
Random Forest uses bagging. Trees are built independently, in parallel, on random subsets of data. Final answer = average of all trees.
XGBoost uses boosting. Trees are built one at a time, in sequence. Each new tree focuses specifically on the examples the previous trees got wrong. Final answer = weighted sum of all trees.
Bagging (Random Forest):
Tree 1 ──┐
Tree 2 ──┤──> Average ──> Prediction
Tree 3 ──┘
Boosting (XGBoost):
Tree 1 ──> finds errors ──> Tree 2 fixes them ──> finds errors ──> Tree 3 fixes those ──> ...
Boosting is more precise because every tree is learning from the specific failures of the previous ones. But it's also more prone to overfitting if you're not careful.
How Gradient Boosting Works Step by Step
Let's say you're predicting house prices. Here's what happens inside a gradient boosting model:
Step 1: Start with a simple prediction. Usually the mean of all target values.
Initial prediction for everyone: $300,000 (the mean)
Step 2: Calculate the residuals. How wrong was that prediction for each house?
House A: actual $350k, predicted $300k → residual = +$50k
House B: actual $250k, predicted $300k → residual = -$50k
House C: actual $420k, predicted $300k → residual = +$120k
Step 3: Train a small tree to predict those residuals.
Tree 1 learns: "when bedrooms > 3, predict residual = +$60k"
Step 4: Update predictions by adding a fraction of tree 1's output.
learning_rate = 0.1
New prediction = $300k + 0.1 * $60k = $306k
Step 5: Calculate new residuals based on updated predictions. Train tree 2 on those.
Step 6: Repeat for as many trees as you specify.
Each tree is small and weak on its own. But 100 or 500 of them added together become very accurate. That's why boosting is called an ensemble of weak learners.
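To make those steps concrete, here's a minimal from-scratch sketch of exactly that loop, using plain sklearn decision trees on toy data (the data and variable names are made up for illustration; nothing here is XGBoost-specific):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))              # one toy feature
y_demo = 50 * X_demo.ravel() + rng.normal(0, 20, 200)   # noisy target

learning_rate = 0.1
n_trees = 100

# Step 1: start every prediction at the mean
prediction = np.full(len(y_demo), y_demo.mean())
trees = []

for _ in range(n_trees):
    residuals = y_demo - prediction                      # Step 2: how wrong are we?
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X_demo, residuals)                          # Step 3: small tree on residuals
    prediction += learning_rate * tree.predict(X_demo)   # Step 4: nudge predictions
    trees.append(tree)                                   # Steps 5-6: repeat

print(f"Mean absolute error after boosting: {np.abs(y_demo - prediction).mean():.1f}")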
What Makes XGBoost Special
Plain gradient boosting existed before XGBoost. So why did XGBoost take over?
A few reasons:
Speed: XGBoost uses parallelism within each tree (not between trees). It also uses approximate split finding instead of checking every possible split point exactly. Much faster than vanilla gradient boosting.
Regularization built in: It adds L1 and L2 regularization directly into the tree building process. This controls overfitting better than basic gradient boosting.
Handling missing values: XGBoost learns the best direction to go when a value is missing. You don't need to impute first.
Pruning: It grows trees to full depth first, then prunes backwards, removing splits whose gain falls below the gamma threshold. This avoids the greedy trap of stopping at a weak split that would have unlocked a strong one deeper down.
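Most of these features need no extra code, but the speed and regularization knobs are worth knowing how to switch on. A quick sketch (the parameter names are real XGBoost options; how much speedup you see depends on your data size):

import xgboost as xgb

# Histogram-based approximate split finding: big speedups on large datasets
fast_model = xgb.XGBClassifier(
    tree_method='hist',  # bin continuous features instead of scanning every split
    n_jobs=-1,           # within-tree parallelism across all cores
    reg_alpha=0.1,       # built-in L1 regularization
    reg_lambda=1.0,      # built-in L2 regularization
)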
Installing XGBoost
pip install xgboost
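Then confirm the install with a quick version check. The constructor-style early stopping used later in this post needs a reasonably recent release (1.6 or newer, to the best of my knowledge):

import xgboost
print(xgboost.__version__)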
Your First XGBoost Classifier
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train XGBoost classifier
model = xgb.XGBClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=4,
random_state=42,
eval_metric='logloss',
verbosity=0
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"XGBoost Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print()
print(classification_report(y_test, y_pred, target_names=data.target_names))
Output:
XGBoost Accuracy: 0.974
              precision    recall  f1-score   support

   malignant       0.98      0.95      0.96        42
      benign       0.97      0.99      0.98        72

    accuracy                           0.97       114
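Because the wrapper follows the sklearn API, you can also pull class probabilities instead of hard labels, which matters whenever you want to tune the decision threshold:

# Predicted probabilities for the first 5 test samples
# (column order follows the class labels: 0 = malignant, 1 = benign)
proba = model.predict_proba(X_test[:5])
for p in proba:
    print(f"P(malignant)={p[0]:.3f}  P(benign)={p[1]:.3f}")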
Comparing XGBoost to Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
xgb_model = xgb.XGBClassifier(
n_estimators=100, learning_rate=0.1, max_depth=4,
random_state=42, eval_metric='logloss', verbosity=0
)
rf_scores = cross_val_score(rf, X, y, cv=5)
xgb_scores = cross_val_score(xgb_model, X, y, cv=5)
print(f"Random Forest: {rf_scores.mean():.3f} +/- {rf_scores.std():.3f}")
print(f"XGBoost: {xgb_scores.mean():.3f} +/- {xgb_scores.std():.3f}")
Output:
Random Forest: 0.962 +/- 0.014
XGBoost: 0.967 +/- 0.016
Very close on this dataset. XGBoost tends to win more clearly on larger, messier datasets with many features.
Early Stopping: Never Guess the Right Number of Trees
One of XGBoost's best features. Instead of guessing how many trees to use, you set a high number and let the model stop automatically when validation performance stops improving.
from sklearn.model_selection import train_test_split
# Need a validation set for early stopping
X_train_es, X_val, y_train_es, y_val = train_test_split(
X_train, y_train, test_size=0.2, random_state=42
)
model_es = xgb.XGBClassifier(
n_estimators=1000, # set high, early stopping will find the right number
learning_rate=0.05,
max_depth=4,
random_state=42,
eval_metric='logloss',
verbosity=0,
early_stopping_rounds=20 # stop if no improvement for 20 rounds
)
model_es.fit(
X_train_es, y_train_es,
eval_set=[(X_val, y_val)],
verbose=False
)
print(f"Best number of trees: {model_es.best_iteration}")
print(f"Test accuracy: {accuracy_score(y_test, model_es.predict(X_test)):.3f}")
Output:
Best number of trees: 47
Test accuracy: 0.982
Training stopped once 20 rounds passed without improvement, and the model kept its best iteration, 47, even though you told it to try up to 1000 trees. It found the sweet spot automatically. This is one of the most practical features in XGBoost.
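If you want to see the curve early stopping was watching, evals_result() returns the per-round validation metric; the loss flattens out right around the best iteration:

# Per-round validation log-loss recorded during training
history = model_es.evals_result()
val_logloss = history['validation_0']['logloss']
print(f"Rounds trained before stopping: {len(val_logloss)}")
print(f"Validation log-loss at best iteration: {val_logloss[model_es.best_iteration]:.4f}")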
XGBoost for Regression
It works the same way. Swap XGBClassifier for XGBRegressor, and the default objective switches to squared-error regression.
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
housing = fetch_california_housing()
X_h = pd.DataFrame(housing.data, columns=housing.feature_names)
y_h = housing.target
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
X_h, y_h, test_size=0.2, random_state=42
)
X_train_h2, X_val_h, y_train_h2, y_val_h = train_test_split(
X_train_h, y_train_h, test_size=0.2, random_state=42
)
reg = xgb.XGBRegressor(
n_estimators=1000,
learning_rate=0.05,
max_depth=5,
random_state=42,
eval_metric='rmse',
verbosity=0,
early_stopping_rounds=20
)
reg.fit(
X_train_h2, y_train_h2,
eval_set=[(X_val_h, y_val_h)],
verbose=False
)
y_pred_h = reg.predict(X_test_h)
print(f"Best trees: {reg.best_iteration}")
print(f"R2: {r2_score(y_test_h, y_pred_h):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test_h, y_pred_h)):.3f}")
Output:
Best trees: 284
R2: 0.836
RMSE: 0.462
Compare that to:
- Linear Regression: R2 = 0.576
- Random Forest: R2 = 0.805
- XGBoost: R2 = 0.836
XGBoost wins on this dataset without much tuning at all.
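If you want to reproduce those baselines, something like this on the same split should land close (exact numbers will wobble a bit across library versions):

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Fit each baseline on the same train/test split used above
for name, baseline in [
    ("Linear Regression", LinearRegression()),
    ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)),
]:
    baseline.fit(X_train_h, y_train_h)
    print(f"{name}: R2 = {r2_score(y_test_h, baseline.predict(X_test_h)):.3f}")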
The Key Hyperparameters
These are the ones that actually matter. You don't need to tune all of them.
model = xgb.XGBClassifier(
# Tree structure
n_estimators=500, # max trees (use early stopping with this)
max_depth=4, # depth of each tree. 3 to 6 is typical. Lower = less overfit.
min_child_weight=1, # minimum sum of instance weights in a leaf. Higher = less overfit.
# Learning
learning_rate=0.05, # how much each tree contributes. Lower = needs more trees but usually generalizes better.
subsample=0.8, # fraction of training data used per tree. Adds randomness.
colsample_bytree=0.8, # fraction of features used per tree. Like max_features in RF.
# Regularization
reg_alpha=0, # L1 regularization on weights. Makes some weights exactly 0.
reg_lambda=1, # L2 regularization on weights. Shrinks all weights.
gamma=0, # minimum loss reduction to make a split. Higher = more conservative.
random_state=42,
eval_metric='logloss',
verbosity=0
)
Where to start when tuning:
- Set learning_rate=0.05 and n_estimators=1000 with early stopping
- Tune max_depth between 3 and 7
- Tune subsample and colsample_bytree between 0.6 and 1.0
- If still overfitting, increase reg_alpha or reg_lambda
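Here's that recipe as a compact manual sweep over max_depth, reusing the X_train_es/X_val split from the early stopping section. A sketch, not a full grid search; plug in your own splits:

# Early stopping finds the tree count at each depth; pick the best validation score
best_depth, best_acc = None, 0.0
for depth in [3, 4, 5, 6, 7]:
    candidate = xgb.XGBClassifier(
        n_estimators=1000, learning_rate=0.05, max_depth=depth,
        random_state=42, eval_metric='logloss', verbosity=0,
        early_stopping_rounds=20,
    )
    candidate.fit(X_train_es, y_train_es, eval_set=[(X_val, y_val)], verbose=False)
    acc = candidate.score(X_val, y_val)
    print(f"max_depth={depth}: val accuracy {acc:.3f} at iteration {candidate.best_iteration}")
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print(f"Best max_depth: {best_depth}")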
Feature Importance in XGBoost
import matplotlib.pyplot as plt
# Train on breast cancer data
model_fi = xgb.XGBClassifier(
n_estimators=100, learning_rate=0.1, max_depth=4,
random_state=42, eval_metric='logloss', verbosity=0
)
model_fi.fit(X_train, y_train)
# Plot feature importance (create the figure first; plot_importance has no figsize argument)
fig, ax = plt.subplots(figsize=(9, 7))
xgb.plot_importance(model_fi, max_num_features=15, ax=ax)
plt.title('XGBoost Feature Importance')
plt.tight_layout()
plt.savefig('xgb_feature_importance.png', dpi=100)
plt.show()
# Or get as a dict
importance = model_fi.get_booster().get_score(importance_type='gain')
importance_df = pd.DataFrame(
list(importance.items()), columns=['Feature', 'Gain']
).sort_values('Gain', ascending=False)
print("Top 10 features by gain:")
print(importance_df.head(10).to_string(index=False))
XGBoost has three types of feature importance:
- weight: how many times a feature was used to split
- gain: average improvement in loss from splits using this feature
- cover: average number of samples affected by splits using this feature
gain is usually the most meaningful. It tells you how much each feature actually helped reduce error.
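The three rankings often disagree, which is worth checking before you trust any one of them. Same get_score call as above, just with a different importance_type:

# Top-ranked feature under each importance definition
booster = model_fi.get_booster()
for imp_type in ['weight', 'gain', 'cover']:
    scores = booster.get_score(importance_type=imp_type)
    top_feature = max(scores, key=scores.get)
    print(f"Top feature by {imp_type}: {top_feature} ({scores[top_feature]:.1f})")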
Handling Missing Values Automatically
This is a real advantage over most other algorithms.
import numpy as np
# Introduce some missing values
X_missing = X_train.copy()
mask = np.random.rand(*X_missing.shape) < 0.1 # 10% of values missing
X_missing[mask] = np.nan
# XGBoost handles this directly, no imputation needed
model_nan = xgb.XGBClassifier(
n_estimators=100, learning_rate=0.1,
random_state=42, eval_metric='logloss', verbosity=0
)
model_nan.fit(X_missing, y_train)
X_test_missing = X_test.copy()
mask_test = np.random.rand(*X_test_missing.shape) < 0.1
X_test_missing[mask_test] = np.nan
print(f"Accuracy with 10% missing values: {accuracy_score(y_test, model_nan.predict(X_test_missing)):.3f}")
XGBoost learns which direction to send missing values at each split. It doesn't just impute with the mean. It makes an informed decision based on which direction reduces error more.
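You can actually inspect those learned default directions by dumping a tree as text. The node format in the comment is illustrative; your feature names and thresholds will differ:

# Text dump of the first tree; every split records a default branch for missing values
first_tree = model_nan.get_booster().get_dump()[0]
print(first_tree[:400])
# Each node looks roughly like: "[some feature<16.8] yes=1,no=2,missing=1"
# where missing= names the child that NaN rows are sent to.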
The Things Everyone Gets Wrong
Mistake 1: Using a high learning rate with few trees
learning_rate=0.3 with 50 trees is worse than learning_rate=0.05 with 500 trees and early stopping. Lower learning rate almost always gives better results. It just needs more trees.
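You can see the tradeoff directly, again reusing the earlier validation split. The exact numbers will vary, but the lower rate typically stops later and scores at least as well:

# Same data, two learning rates; early stopping picks the tree count for each
for lr in [0.3, 0.05]:
    m = xgb.XGBClassifier(
        n_estimators=1000, learning_rate=lr, max_depth=4,
        random_state=42, eval_metric='logloss', verbosity=0,
        early_stopping_rounds=20,
    )
    m.fit(X_train_es, y_train_es, eval_set=[(X_val, y_val)], verbose=False)
    print(f"lr={lr}: stopped at iteration {m.best_iteration}, "
          f"test accuracy {accuracy_score(y_test, m.predict(X_test)):.3f}")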
Mistake 2: Ignoring early stopping
Setting n_estimators=100 and guessing is a beginner move. Use early stopping and let the data tell you the right number.
Mistake 3: Over-tuning on small datasets
XGBoost has many hyperparameters. On small datasets, the random variation across 5-fold CV splits is often larger than the improvement you get from tuning. Don't over-engineer it. Tune max_depth, learning_rate, and subsample. That's usually enough.
Mistake 4: Thinking XGBoost works well on everything
It dominates on tabular/structured data. For images, audio, and text, deep learning is usually better. XGBoost is not a universal answer.
Quick Cheat Sheet
| Task | Code |
|---|---|
| Classification | xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=4) |
| Regression | xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4) |
| Early stopping | early_stopping_rounds=20 + eval_set=[(X_val, y_val)] |
| Best iteration | model.best_iteration |
| Feature importance | xgb.plot_importance(model) |
| Reduce overfitting | lower max_depth, increase reg_lambda, lower subsample |
| Speed up | tree_method='hist' for large datasets |
| Missing values | handled automatically, no code needed |
Practice Challenges
Level 1:
Train XGBoost on load_wine(). Use early stopping with a validation set. Print how many trees were actually used. Compare accuracy to Random Forest.
Level 2:
On the California housing dataset, try learning_rate values of 0.3, 0.1, 0.05, 0.01 with early stopping each time. See how the best iteration count changes. Plot final R2 for each learning rate.
Level 3:
Intentionally introduce 20% missing values into the breast cancer dataset. Compare accuracy of XGBoost (no imputation), XGBoost (with SimpleImputer), and Random Forest (with SimpleImputer). Which handles missing values best?
Next up, Post 60: Support Vector Machines: Drawing the Perfect Boundary. We'll cover hyperplanes, margins, and the kernel trick that lets SVMs handle non-linear problems without explicitly transforming your features.