Understanding how vehicle characteristics affect fuel efficiency is a classic regression problem — and an excellent way to explore tree-based models like Decision Trees, Random Forests, and XGBoost. In this project, I analyzed a dataset of cars and built models to predict fuel efficiency (MPG) with different configurations.
🧩 Step 1 — Data Preparation
The dataset contained various vehicle features, including:
- vehicle_weight
- engine_displacement
- horsepower
- acceleration
- model_year
- origin
- fuel_type
To ensure data consistency, all missing values were filled with zeros.
Then I performed a train/validation/test split (60%/20%/20%), using a random_state=1 for reproducibility.
Next, I used DictVectorizer(sparse=True) to convert categorical and numerical features into a format suitable for scikit-learn models.
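A minimal sketch of this preparation, assuming the data lives in a DataFrame df and the target column is named fuel_efficiency_mpg (the column name is an assumption for illustration):

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

df = df.fillna(0)  # fill all missing values with zeros

# 60/20/20 split: hold out 20% for test, then 25% of the remainder for validation
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

# Separate the target (column name assumed) from the features
y_train = df_train.pop('fuel_efficiency_mpg').values
y_val = df_val.pop('fuel_efficiency_mpg').values

# Vectorize the feature dictionaries into a sparse matrix
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(df_train.to_dict(orient='records'))
X_val = dv.transform(df_val.to_dict(orient='records'))
```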
🌳 Step 2 — Decision Tree Regressor
I began with a Decision Tree Regressor with max_depth=1.
This simple tree helps visualize which feature the model uses first to split the data — effectively revealing the most influential variable in predicting MPG.
Result:
The feature used for splitting was model_year, showing that newer vehicles tend to have different fuel efficiencies compared to older models.
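A quick sketch of that probe, reusing X_train, y_train, and dv from the preparation step above:

```python
from sklearn.tree import DecisionTreeRegressor, export_text

dt = DecisionTreeRegressor(max_depth=1, random_state=1)
dt.fit(X_train, y_train)

# With max_depth=1 the tree makes a single split; printing it reveals
# which feature the model considers most informative
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))
```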
🌲 Step 3 — Random Forest Model
Next, I trained a Random Forest Regressor with the parameters:
- n_estimators=10
- random_state=1
- n_jobs=-1
Random forests aggregate multiple decision trees to reduce overfitting and improve accuracy.
Validation RMSE: ≈ 4.5
This confirmed the model could capture relationships between engine specs and fuel efficiency quite effectively.
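The training and evaluation, sketched with the same variables as above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# RMSE on the validation set
rmse = np.sqrt(mean_squared_error(y_val, rf.predict(X_val)))
print(f'Validation RMSE: {rmse:.3f}')
```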
⚙️ Step 4 — Tuning n_estimators
To see how the number of trees affects performance, I trained models with n_estimators ranging from 10 to 200 (step = 10).
After monitoring RMSE, I observed the improvement plateaued after around 80 estimators, indicating that adding more trees didn’t significantly enhance accuracy.
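The sweep looked roughly like this:

```python
for n in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, rf.predict(X_val)))
    print(f'n_estimators={n:3d}  RMSE={rmse:.3f}')
```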
🌾 Step 5 — Tuning max_depth
I then compared four values of max_depth — [10, 15, 20, 25] — each with increasing n_estimators from 10 to 200.
The best mean RMSE occurred at max_depth = 20, which struck the right balance between bias and variance.
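A sketch of the grid, averaging the validation RMSE over the n_estimators range for each depth:

```python
for depth in [10, 15, 20, 25]:
    rmses = []
    for n in range(10, 201, 10):
        rf = RandomForestRegressor(n_estimators=n, max_depth=depth,
                                   random_state=1, n_jobs=-1)
        rf.fit(X_train, y_train)
        rmses.append(np.sqrt(mean_squared_error(y_val, rf.predict(X_val))))
    print(f'max_depth={depth}  mean RMSE={np.mean(rmses):.3f}')
```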
🔍 Step 6 — Feature Importance
Random Forests provide an excellent built-in mechanism for feature importance.
Training the model with:
n_estimators=10, max_depth=20, random_state=1
I found the most influential feature for predicting fuel efficiency to be engine_displacement, followed by vehicle_weight and horsepower.
This aligns well with domain knowledge — larger engines and heavier vehicles typically consume more fuel.
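Extracting the importances from a fitted forest is straightforward; a sketch:

```python
rf = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1)
rf.fit(X_train, y_train)

# Pair each vectorized feature name with its importance, highest first
ranked = sorted(zip(dv.get_feature_names_out(), rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:
    print(f'{name:25s} {score:.3f}')
```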
⚡ Step 7 — XGBoost Experiments
Finally, I trained an XGBoost regressor, comparing two values of the eta (learning rate) parameter: 0.3 and 0.1.
```python
xgb_params = {
    'eta': 0.3,  # first run; repeated with eta=0.1 for comparison
    'max_depth': 6,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
}
```
After 100 training rounds, the model with eta = 0.1 delivered slightly better RMSE on the validation set, suggesting that a smaller learning rate can yield smoother, better-generalizing models.
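For reference, the training loop looked roughly like this (the DMatrix construction assumes the vectorized X_train/X_val from earlier):

```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

# Train for 100 rounds, printing validation RMSE every 10 rounds
model = xgb.train(xgb_params, dtrain, num_boost_round=100,
                  evals=[(dtrain, 'train'), (dval, 'val')],
                  verbose_eval=10)
```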
🎯 Key Takeaways
- model_year strongly influences fuel efficiency in modern cars.
- Random Forests with n_estimators ≈ 80 and max_depth=20 gave the most balanced performance.
- Engine displacement emerged as the most important predictor of MPG.
- XGBoost with a lower learning rate (eta=0.1) achieved the best validation score.
💡 Final Thoughts
This project demonstrates how iterative experimentation with tree-based models reveals both predictive strength and interpretability.
From simple decision trees to tuned XGBoost models, each step provided insight into how vehicle characteristics drive fuel efficiency — and how model parameters affect performance.
If you’re learning machine learning, projects like this are perfect for mastering feature engineering, evaluation metrics, and model tuning.