Understanding how vehicle characteristics affect fuel efficiency is a classic regression problem — and an excellent way to explore tree-based models like Decision Trees, Random Forests, and XGBoost. In this project, I analyzed a dataset of cars and built models to predict fuel efficiency (MPG) with different configurations.
🧩 Step 1 — Data Preparation
The dataset contained various vehicle features, including:
- vehicle_weight
- engine_displacement
- horsepower
- acceleration
- model_year
- origin
- fuel_type
To ensure data consistency, all missing values were filled with zeros.
Then I performed a train/validation/test split (60%/20%/20%), using a random_state=1 for reproducibility.
Next, I used DictVectorizer(sparse=True) to convert categorical and numerical features into a format suitable for scikit-learn models.
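A minimal sketch of this preparation, assuming the data lives in a DataFrame df and the target column is named fuel_efficiency_mpg (the column name is an assumption for illustration):

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

df = df.fillna(0)  # fill all missing values with zeros

# 60/20/20 split: hold out 20% for test, then 25% of the remainder for validation
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

# Separate the target (column name assumed) from the features
y_train = df_train.pop('fuel_efficiency_mpg').values
y_val = df_val.pop('fuel_efficiency_mpg').values

# Vectorize the feature dictionaries into a sparse matrix
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(df_train.to_dict(orient='records'))
X_val = dv.transform(df_val.to_dict(orient='records'))
```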
🌳 Step 2 — Decision Tree Regressor
I began with a Decision Tree Regressor with max_depth=1.
This simple tree helps visualize which feature the model uses first to split the data — effectively revealing the most influential variable in predicting MPG.
Result:
The feature used for splitting was model_year, showing that newer vehicles tend to have different fuel efficiencies compared to older models.
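A quick sketch of that probe, reusing X_train, y_train, and dv from the preparation step above:

```python
from sklearn.tree import DecisionTreeRegressor, export_text

dt = DecisionTreeRegressor(max_depth=1, random_state=1)
dt.fit(X_train, y_train)

# With max_depth=1 the tree makes a single split; printing it reveals
# which feature the model considers most informative
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))
```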
🌲 Step 3 — Random Forest Model
Next, I trained a Random Forest Regressor with the parameters:
- n_estimators=10
- random_state=1
- n_jobs=-1
Random forests aggregate multiple decision trees to reduce overfitting and improve accuracy.
Validation RMSE: ≈ 4.5
This confirmed the model could capture relationships between engine specs and fuel efficiency quite effectively.
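The training and evaluation, sketched with the same variables as above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# RMSE on the validation set
rmse = np.sqrt(mean_squared_error(y_val, rf.predict(X_val)))
print(f'Validation RMSE: {rmse:.3f}')
```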
⚙️ Step 4 — Tuning n_estimators
To see how the number of trees affects performance, I trained models with n_estimators ranging from 10 to 200 (step = 10).
After monitoring RMSE, I observed the improvement plateaued after around 80 estimators, indicating that adding more trees didn’t significantly enhance accuracy.
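The sweep looked roughly like this:

```python
for n in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, rf.predict(X_val)))
    print(f'n_estimators={n:3d}  RMSE={rmse:.3f}')
```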
🌾 Step 5 — Tuning max_depth
I then compared four values of max_depth — [10, 15, 20, 25] — each with increasing n_estimators from 10 to 200.
The best mean RMSE occurred at max_depth = 20, which struck the right balance between bias and variance.
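A sketch of the grid, averaging the validation RMSE over the n_estimators range for each depth:

```python
for depth in [10, 15, 20, 25]:
    rmses = []
    for n in range(10, 201, 10):
        rf = RandomForestRegressor(n_estimators=n, max_depth=depth,
                                   random_state=1, n_jobs=-1)
        rf.fit(X_train, y_train)
        rmses.append(np.sqrt(mean_squared_error(y_val, rf.predict(X_val))))
    print(f'max_depth={depth}  mean RMSE={np.mean(rmses):.3f}')
```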
🔍 Step 6 — Feature Importance
Random Forests provide an excellent built-in mechanism for feature importance.
Training the model with:
n_estimators=10, max_depth=20, random_state=1
I found the most influential feature for predicting fuel efficiency to be engine_displacement, followed by vehicle_weight and horsepower.
This aligns well with domain knowledge — larger engines and heavier vehicles typically consume more fuel.
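Extracting the importances from a fitted forest is straightforward; a sketch:

```python
rf = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1)
rf.fit(X_train, y_train)

# Pair each vectorized feature name with its importance, highest first
ranked = sorted(zip(dv.get_feature_names_out(), rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:
    print(f'{name:25s} {score:.3f}')
```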
⚡ Step 7 — XGBoost Experiments
Finally, I trained an XGBoost regressor, comparing two values of the eta (learning rate) parameter: 0.3 and 0.1.
```python
xgb_params = {
    'eta': 0.3,  # first run; repeated with eta=0.1 for comparison
    'max_depth': 6,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
}
```
After 100 training rounds, the model with eta = 0.1 delivered slightly better RMSE on the validation set, suggesting that a smaller learning rate can yield smoother, better-generalizing models.
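For reference, the training loop looked roughly like this (the DMatrix construction assumes the vectorized X_train/X_val from earlier):

```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

# Train for 100 rounds, printing validation RMSE every 10 rounds
model = xgb.train(xgb_params, dtrain, num_boost_round=100,
                  evals=[(dtrain, 'train'), (dval, 'val')],
                  verbose_eval=10)
```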
🎯 Key Takeaways
- model_year strongly influences fuel efficiency in modern cars.
- Random Forests with n_estimators ≈ 80 and max_depth=20 gave the most balanced performance.
- Engine displacement emerged as the most important predictor of MPG.
- XGBoost with a lower learning rate (eta=0.1) achieved the best validation score.
💡 Final Thoughts
This project demonstrates how iterative experimentation with tree-based models reveals both predictive strength and interpretability.
From simple decision trees to tuned XGBoost models, each step provided insight into how vehicle characteristics drive fuel efficiency — and how model parameters affect performance.
If you’re learning machine learning, projects like this are perfect for mastering feature engineering, evaluation metrics, and model tuning.