If you've spent any time on Kaggle, you've seen XGBoost win. Over and over. Structured data competition? XGBoost. Tabular data problem? XGBoost. Real-world ML pipeline? XGBoost.
It's not hype. It genuinely is that good on most problems with structured data.
But a lot of people use it without understanding why it works. They just copy the code, tune a few numbers, and hope for the best. This post fixes that.
What You'll Learn Here
- The difference between bagging and boosting
- How gradient boosting works step by step
- What makes XGBoost faster and better than basic gradient boosting
- How to train XGBoost for classification and regression
- The most important hyperparameters and what they actually do
- Early stopping so you never have to guess the right number of trees
Bagging vs Boosting: The Core Difference
Random Forest uses bagging. Trees are built independently, in parallel, on random subsets of data. Final answer = average of all trees.
XGBoost uses boosting. Trees are built one at a time, in sequence. Each new tree focuses specifically on the examples the previous trees got wrong. Final answer = weighted sum of all trees.
Bagging (Random Forest):
Tree 1 ──┐
Tree 2 ──┤──> Average ──> Prediction
Tree 3 ──┘
Boosting (XGBoost):
Tree 1 ──> finds errors ──> Tree 2 fixes them ──> finds errors ──> Tree 3 fixes those ──> ...
Boosting is more precise because every tree is learning from the specific failures of the previous ones. But it's also more prone to overfitting if you're not careful.
How Gradient Boosting Works Step by Step
Let's say you're predicting house prices. Here's what happens inside a gradient boosting model:
Step 1: Start with a simple prediction. Usually the mean of all target values.
Initial prediction for everyone: $300,000 (the mean)
Step 2: Calculate the residuals. How wrong was that prediction for each house?
House A: actual $350k, predicted $300k → residual = +$50k
House B: actual $250k, predicted $300k → residual = -$50k
House C: actual $420k, predicted $300k → residual = +$120k
Step 3: Train a small tree to predict those residuals.
Tree 1 learns: "when bedrooms > 3, predict residual = +$60k"
Step 4: Update predictions by adding a fraction of tree 1's output.
learning_rate = 0.1
New prediction = $300k + 0.1 * $60k = $306k
Step 5: Calculate new residuals based on updated predictions. Train tree 2 on those.
Step 6: Repeat for as many trees as you specify.
Each tree is small and weak on its own. But 100 or 500 of them added together become very accurate. That's why boosting is called an ensemble of weak learners.
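To make those steps concrete, here's a minimal from-scratch sketch of exactly that loop, using plain sklearn decision trees on toy data (the data and variable names are made up for illustration; nothing here is XGBoost-specific):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))              # one toy feature
y_demo = 50 * X_demo.ravel() + rng.normal(0, 20, 200)   # noisy target

learning_rate = 0.1
n_trees = 100

# Step 1: start every prediction at the mean
prediction = np.full(len(y_demo), y_demo.mean())
trees = []

for _ in range(n_trees):
    residuals = y_demo - prediction                      # Step 2: how wrong are we?
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X_demo, residuals)                          # Step 3: small tree on residuals
    prediction += learning_rate * tree.predict(X_demo)   # Step 4: nudge predictions
    trees.append(tree)                                   # Steps 5-6: repeat

print(f"Mean absolute error after boosting: {np.abs(y_demo - prediction).mean():.1f}")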
What Makes XGBoost Special
Plain gradient boosting existed before XGBoost. So why did XGBoost take over?
A few reasons:
Speed: XGBoost uses parallelism within each tree (not between trees). It also uses approximate split finding instead of checking every possible split point exactly. Much faster than vanilla gradient boosting.
Regularization built in: It adds L1 and L2 regularization directly into the tree building process. This controls overfitting better than basic gradient boosting.
Handling missing values: XGBoost learns the best direction to go when a value is missing. You don't need to impute first.
Pruning: It grows trees to full depth first, then prunes backwards, removing splits whose gain falls below the gamma threshold. This avoids the greedy trap of stopping at a weak split that would have unlocked a strong one deeper down.
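Most of these features need no extra code, but the speed and regularization knobs are worth knowing how to switch on. A quick sketch (the parameter names are real XGBoost options; how much speedup you see depends on your data size):

import xgboost as xgb

# Histogram-based approximate split finding: big speedups on large datasets
fast_model = xgb.XGBClassifier(
    tree_method='hist',  # bin continuous features instead of scanning every split
    n_jobs=-1,           # within-tree parallelism across all cores
    reg_alpha=0.1,       # built-in L1 regularization
    reg_lambda=1.0,      # built-in L2 regularization
)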
Installing XGBoost
pip install xgboost
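Then confirm the install with a quick version check. The constructor-style early stopping used later in this post needs a reasonably recent release (1.6 or newer, to the best of my knowledge):

import xgboost
print(xgboost.__version__)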
Your First XGBoost Classifier
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train XGBoost classifier
model = xgb.XGBClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=4,
random_state=42,
eval_metric='logloss',
verbosity=0
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"XGBoost Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print()
print(classification_report(y_test, y_pred, target_names=data.target_names))
Output:
XGBoost Accuracy: 0.974
              precision    recall  f1-score   support

   malignant       0.98      0.95      0.96        42
      benign       0.97      0.99      0.98        72

    accuracy                           0.97       114
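Because the wrapper follows the sklearn API, you can also pull class probabilities instead of hard labels, which matters whenever you want to tune the decision threshold:

# Predicted probabilities for the first 5 test samples
# (column order follows the class labels: 0 = malignant, 1 = benign)
proba = model.predict_proba(X_test[:5])
for p in proba:
    print(f"P(malignant)={p[0]:.3f}  P(benign)={p[1]:.3f}")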
Comparing XGBoost to Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
xgb_model = xgb.XGBClassifier(
n_estimators=100, learning_rate=0.1, max_depth=4,
random_state=42, eval_metric='logloss', verbosity=0
)
rf_scores = cross_val_score(rf, X, y, cv=5)
xgb_scores = cross_val_score(xgb_model, X, y, cv=5)
print(f"Random Forest: {rf_scores.mean():.3f} +/- {rf_scores.std():.3f}")
print(f"XGBoost: {xgb_scores.mean():.3f} +/- {xgb_scores.std():.3f}")
Output:
Random Forest: 0.962 +/- 0.014
XGBoost: 0.967 +/- 0.016
Very close on this dataset. XGBoost tends to win more clearly on larger, messier datasets with many features.
Early Stopping: Never Guess the Right Number of Trees
One of XGBoost's best features. Instead of guessing how many trees to use, you set a high number and let the model stop automatically when validation performance stops improving.
from sklearn.model_selection import train_test_split
# Need a validation set for early stopping
X_train_es, X_val, y_train_es, y_val = train_test_split(
X_train, y_train, test_size=0.2, random_state=42
)
model_es = xgb.XGBClassifier(
n_estimators=1000, # set high, early stopping will find the right number
learning_rate=0.05,
max_depth=4,
random_state=42,
eval_metric='logloss',
verbosity=0,
early_stopping_rounds=20 # stop if no improvement for 20 rounds
)
model_es.fit(
X_train_es, y_train_es,
eval_set=[(X_val, y_val)],
verbose=False
)
print(f"Best number of trees: {model_es.best_iteration}")
print(f"Test accuracy: {accuracy_score(y_test, model_es.predict(X_test)):.3f}")
Output:
Best number of trees: 47
Test accuracy: 0.982
Training stopped once 20 rounds passed without improvement, and the model kept its best iteration, 47, even though you told it to try up to 1000 trees. It found the sweet spot automatically. This is one of the most practical features in XGBoost.
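If you want to see the curve early stopping was watching, evals_result() returns the per-round validation metric; the loss flattens out right around the best iteration:

# Per-round validation log-loss recorded during training
history = model_es.evals_result()
val_logloss = history['validation_0']['logloss']
print(f"Rounds trained before stopping: {len(val_logloss)}")
print(f"Validation log-loss at best iteration: {val_logloss[model_es.best_iteration]:.4f}")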
XGBoost for Regression
It works the same way. Swap XGBClassifier for XGBRegressor, and the default objective switches to squared-error regression.
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
housing = fetch_california_housing()
X_h = pd.DataFrame(housing.data, columns=housing.feature_names)
y_h = housing.target
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
X_h, y_h, test_size=0.2, random_state=42
)
X_train_h2, X_val_h, y_train_h2, y_val_h = train_test_split(
X_train_h, y_train_h, test_size=0.2, random_state=42
)
reg = xgb.XGBRegressor(
n_estimators=1000,
learning_rate=0.05,
max_depth=5,
random_state=42,
eval_metric='rmse',
verbosity=0,
early_stopping_rounds=20
)
reg.fit(
X_train_h2, y_train_h2,
eval_set=[(X_val_h, y_val_h)],
verbose=False
)
y_pred_h = reg.predict(X_test_h)
print(f"Best trees: {reg.best_iteration}")
print(f"R2: {r2_score(y_test_h, y_pred_h):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test_h, y_pred_h)):.3f}")
Output:
Best trees: 284
R2: 0.836
RMSE: 0.462
Compare that to:
- Linear Regression: R2 = 0.576
- Random Forest: R2 = 0.805
- XGBoost: R2 = 0.836
XGBoost wins on this dataset without much tuning at all.
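If you want to reproduce those baselines, something like this on the same split should land close (exact numbers will wobble a bit across library versions):

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Fit each baseline on the same train/test split used above
for name, baseline in [
    ("Linear Regression", LinearRegression()),
    ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)),
]:
    baseline.fit(X_train_h, y_train_h)
    print(f"{name}: R2 = {r2_score(y_test_h, baseline.predict(X_test_h)):.3f}")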
The Key Hyperparameters
These are the ones that actually matter. You don't need to tune all of them.
model = xgb.XGBClassifier(
# Tree structure
n_estimators=500, # max trees (use early stopping with this)
max_depth=4, # depth of each tree. 3 to 6 is typical. Lower = less overfit.
min_child_weight=1, # minimum sum of instance weights in a leaf. Higher = less overfit.
# Learning
learning_rate=0.05, # how much each tree contributes. Lower = needs more trees but usually generalizes better.
subsample=0.8, # fraction of training data used per tree. Adds randomness.
colsample_bytree=0.8, # fraction of features used per tree. Like max_features in RF.
# Regularization
reg_alpha=0, # L1 regularization on weights. Makes some weights exactly 0.
reg_lambda=1, # L2 regularization on weights. Shrinks all weights.
gamma=0, # minimum loss reduction to make a split. Higher = more conservative.
random_state=42,
eval_metric='logloss',
verbosity=0
)
Where to start when tuning:
- Set learning_rate=0.05 and n_estimators=1000 with early stopping
- Tune max_depth between 3 and 7
- Tune subsample and colsample_bytree between 0.6 and 1.0
- If still overfitting, increase reg_alpha or reg_lambda
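Here's that recipe as a compact manual sweep over max_depth, reusing the X_train_es/X_val split from the early stopping section. A sketch, not a full grid search; plug in your own splits:

# Early stopping finds the tree count at each depth; pick the best validation score
best_depth, best_acc = None, 0.0
for depth in [3, 4, 5, 6, 7]:
    candidate = xgb.XGBClassifier(
        n_estimators=1000, learning_rate=0.05, max_depth=depth,
        random_state=42, eval_metric='logloss', verbosity=0,
        early_stopping_rounds=20,
    )
    candidate.fit(X_train_es, y_train_es, eval_set=[(X_val, y_val)], verbose=False)
    acc = candidate.score(X_val, y_val)
    print(f"max_depth={depth}: val accuracy {acc:.3f} at iteration {candidate.best_iteration}")
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print(f"Best max_depth: {best_depth}")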
Feature Importance in XGBoost
import matplotlib.pyplot as plt
# Train on breast cancer data
model_fi = xgb.XGBClassifier(
n_estimators=100, learning_rate=0.1, max_depth=4,
random_state=42, eval_metric='logloss', verbosity=0
)
model_fi.fit(X_train, y_train)
# Plot feature importance (create the figure first; plot_importance has no figsize argument)
fig, ax = plt.subplots(figsize=(9, 7))
xgb.plot_importance(model_fi, max_num_features=15, ax=ax)
plt.title('XGBoost Feature Importance')
plt.tight_layout()
plt.savefig('xgb_feature_importance.png', dpi=100)
plt.show()
# Or get as a dict
importance = model_fi.get_booster().get_score(importance_type='gain')
importance_df = pd.DataFrame(
list(importance.items()), columns=['Feature', 'Gain']
).sort_values('Gain', ascending=False)
print("Top 10 features by gain:")
print(importance_df.head(10).to_string(index=False))
XGBoost has three types of feature importance:
- weight: how many times a feature was used to split
- gain: average improvement in loss from splits using this feature
- cover: average number of samples affected by splits using this feature
gain is usually the most meaningful. It tells you how much each feature actually helped reduce error.
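The three rankings often disagree, which is worth checking before you trust any one of them. Same get_score call as above, just with a different importance_type:

# Top-ranked feature under each importance definition
booster = model_fi.get_booster()
for imp_type in ['weight', 'gain', 'cover']:
    scores = booster.get_score(importance_type=imp_type)
    top_feature = max(scores, key=scores.get)
    print(f"Top feature by {imp_type}: {top_feature} ({scores[top_feature]:.1f})")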
Handling Missing Values Automatically
This is a real advantage over most other algorithms.
import numpy as np
# Introduce some missing values
X_missing = X_train.copy()
mask = np.random.rand(*X_missing.shape) < 0.1 # 10% of values missing
X_missing[mask] = np.nan
# XGBoost handles this directly, no imputation needed
model_nan = xgb.XGBClassifier(
n_estimators=100, learning_rate=0.1,
random_state=42, eval_metric='logloss', verbosity=0
)
model_nan.fit(X_missing, y_train)
X_test_missing = X_test.copy()
mask_test = np.random.rand(*X_test_missing.shape) < 0.1
X_test_missing[mask_test] = np.nan
print(f"Accuracy with 10% missing values: {accuracy_score(y_test, model_nan.predict(X_test_missing)):.3f}")
XGBoost learns which direction to send missing values at each split. It doesn't just impute with the mean. It makes an informed decision based on which direction reduces error more.
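You can actually inspect those learned default directions by dumping a tree as text. The node format in the comment is illustrative; your feature names and thresholds will differ:

# Text dump of the first tree; every split records a default branch for missing values
first_tree = model_nan.get_booster().get_dump()[0]
print(first_tree[:400])
# Each node looks roughly like: "[some feature<16.8] yes=1,no=2,missing=1"
# where missing= names the child that NaN rows are sent to.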
The Things Everyone Gets Wrong
Mistake 1: Using a high learning rate with few trees
learning_rate=0.3 with 50 trees is worse than learning_rate=0.05 with 500 trees and early stopping. Lower learning rate almost always gives better results. It just needs more trees.
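You can see the tradeoff directly, again reusing the earlier validation split. The exact numbers will vary, but the lower rate typically stops later and scores at least as well:

# Same data, two learning rates; early stopping picks the tree count for each
for lr in [0.3, 0.05]:
    m = xgb.XGBClassifier(
        n_estimators=1000, learning_rate=lr, max_depth=4,
        random_state=42, eval_metric='logloss', verbosity=0,
        early_stopping_rounds=20,
    )
    m.fit(X_train_es, y_train_es, eval_set=[(X_val, y_val)], verbose=False)
    print(f"lr={lr}: stopped at iteration {m.best_iteration}, "
          f"test accuracy {accuracy_score(y_test, m.predict(X_test)):.3f}")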
Mistake 2: Ignoring early stopping
Setting n_estimators=100 and guessing is a beginner move. Use early stopping and let the data tell you the right number.
Mistake 3: Over-tuning on small datasets
XGBoost has many hyperparameters. On small datasets, the random variation across 5-fold CV splits is often larger than the improvement you get from tuning. Don't over-engineer it. Tune max_depth, learning_rate, and subsample. That's usually enough.
Mistake 4: Thinking XGBoost works well on everything
It dominates on tabular/structured data. For images, audio, and text, deep learning is usually better. XGBoost is not a universal answer.
Quick Cheat Sheet
| Task | Code |
|---|---|
| Classification | xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=4) |
| Regression | xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4) |
| Early stopping | early_stopping_rounds=20 + eval_set=[(X_val, y_val)] |
| Best iteration | model.best_iteration |
| Feature importance | xgb.plot_importance(model) |
| Reduce overfitting | lower max_depth, increase reg_lambda, lower subsample |
| Speed up | tree_method='hist' for large datasets |
| Missing values | handled automatically, no code needed |
Practice Challenges
Level 1:
Train XGBoost on load_wine(). Use early stopping with a validation set. Print how many trees were actually used. Compare accuracy to Random Forest.
Level 2:
On the California housing dataset, try learning_rate values of 0.3, 0.1, 0.05, 0.01 with early stopping each time. See how the best iteration count changes. Plot final R2 for each learning rate.
Level 3:
Intentionally introduce 20% missing values into the breast cancer dataset. Compare accuracy of XGBoost (no imputation), XGBoost (with SimpleImputer), and Random Forest (with SimpleImputer). Which handles missing values best?
Next up, Post 60: Support Vector Machines: Drawing the Perfect Boundary. We'll cover hyperplanes, margins, and the kernel trick that lets SVMs handle non-linear problems without explicitly transforming your features.