Gatusso


Why My Baseline Random Forest Model Beat XGBoost: A Deep Dive into the Titanic Survival Prediction Dataset

A practical look at feature engineering, model optimization, and why simpler models sometimes win on smaller datasets.

When you start out in data science, you are often led to believe that there is a strict hierarchy of algorithms. You start with Linear Regression, move up to Random Forests, and eventually reach the holy grail: Gradient Boosting models like XGBoost. The assumption is usually that more complex equals better results.

But data science in the real world rarely follows a perfect script.

I recently built a survival classification model using the classic Titanic dataset for my portfolio. I set up an end-to-end pipeline, built a solid baseline, ran a rigorous hyperparameter grid search, and threw an XGBoost classifier at the problem.

The results threw me a curveball, and they taught me a massive lesson about data scale and model variance. Here is how I built the pipeline and what the results actually mean.

Before feeding any data into a machine learning model, it’s critical to understand that algorithms are essentially giant math equations. They don't understand context, and many of them can't handle missing values without help (XGBoost is an exception, since it handles them natively). My workflow followed six key stages:

  1. Exploratory Data Analysis (EDA): Finding the historical patterns.
  2. Missing Data Imputation: Smart strategies to fill the blanks.
  3. Feature Engineering: Creating high-signal columns from raw text.
  4. Categorical Encoding: Transforming strings to numbers safely.
  5. Model Evaluation: Setting up an 80/20 train-validation split.
  6. Hyperparameter Tuning & Comparison: Pitting the baseline RF against a GridSearch-tuned RF and XGBoost.

The Power of Feature Engineering

Most beginners simply drop text columns or fill missing values with a global average. To build a production-grade portfolio project, I implemented domain-specific feature engineering choices using pandas:

  • Smart Age Imputation via Titles: Instead of filling the 177 missing Age values with the overall average age (about 29.7), I extracted social titles (Mr., Mrs., Miss, Master) from the names. Because "Master" was historically the title for a young boy, filling his missing age with the median of the Master group is far more plausible than giving him an adult's age.
  • The Family Size Feature: I combined SibSp (siblings/spouses) and Parch (parents/children) into a single FamilySize feature. The EDA showed that passengers traveling alone and those in families larger than 5 had poor survival rates, whereas small families (2-4 people) fared much better.
  • Handling the Cabin Sparsity: Over 70% of the Cabin column was missing. Rather than dropping it, I turned it into a binary feature, Has_Cabin (1 or 0). This captured a strong socioeconomic signal, as 1st-class passengers were far more likely to have a recorded cabin, typically on the upper decks.
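The three features above can be sketched in a few lines of pandas. This is a minimal illustration, not the exact code from the project: the sample rows are made up, the regular expression for titles is one common choice, and adding 1 to FamilySize (to count the passenger themselves) is an assumption.

```python
import pandas as pd

# Tiny hand-made sample using the Kaggle Titanic column names (not real passenger rows).
df = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Smith, Mrs. Jane",
        "Smith, Master. Tom",
        "Jones, Master. Sam",
        "Brown, Miss. Ann",
        "Brown, Miss. Eve",
    ],
    "Age": [40.0, 38.0, 4.0, None, 22.0, None],
    "SibSp": [1, 1, 0, 0, 1, 1],
    "Parch": [1, 1, 2, 0, 0, 0],
    "Cabin": ["C85", "C85", None, None, "E46", None],
})

# 1. Title: the text between the comma and the first period in Name.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# 2. Fill each missing Age with the median Age of passengers sharing that Title.
df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("median"))

# 3. FamilySize: siblings/spouses + parents/children + the passenger (the +1 is an assumption).
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# 4. Has_Cabin: 1 if a cabin was recorded, 0 otherwise.
df["Has_Cabin"] = df["Cabin"].notna().astype(int)

print(df[["Title", "Age", "FamilySize", "Has_Cabin"]])
```

On the real dataset, a rare title whose ages are all missing would still be left empty by step 2, so a fallback such as the overall median may be needed.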

The Showdown: Comparing 3 Architectures

After splitting the data and one-hot encoding the categorical columns with pd.get_dummies(drop_first=True), I trained three distinct setups and evaluated each one on my validation data.
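As a rough sketch of that encoding and split step, the snippet below one-hot encodes two categorical columns and holds out 20% of the rows. The sample data, the choice of Sex and Embarked as the encoded columns, and random_state=42 are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows with Kaggle Titanic column names (illustrative only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male", "female",
            "male", "male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S", "S", "C", "Q", "S", "C", "S"],
    "Pclass": [3, 1, 3, 3, 1, 2, 3, 2, 1, 3],
})

# drop_first=True drops one level per column (here "female" and "C"),
# so the remaining dummy columns are not perfectly collinear.
X = pd.get_dummies(
    df.drop(columns="Survived"),
    columns=["Sex", "Embarked"],
    drop_first=True,
)
y = df["Survived"]

# 80/20 train-validation split; random_state is an arbitrary seed chosen here.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(list(X.columns))
print(len(X_train), len(X_val))
```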

Here is how they performed:

| Strategy | Validation Accuracy | Notes / Settings |
|---|---|---|
| 1. Baseline Random Forest | 82.68% | Simple setup, max_depth=5 |
| 2. XGBoost Classifier | 82.12% | learning_rate=0.05, max_depth=4 |
| 3. GridSearchCV Tuned RF | 81.56% | Optimized via 5-Fold Cross-Validation |
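One way to put these numbers in perspective is to convert each accuracy back into a count of correctly classified passengers. The sketch below assumes the standard 891-row Kaggle train.csv was used and that the split came from scikit-learn's train_test_split with test_size=0.2, which rounds the validation share up.

```python
import math

n_rows = 891                      # rows in the standard Kaggle train.csv (assumed)
n_val = math.ceil(n_rows * 0.2)   # train_test_split rounds the test share up

# Validation accuracies (in percent) as reported in the table.
reported = {
    "Baseline Random Forest": 82.68,
    "XGBoost Classifier": 82.12,
    "GridSearchCV Tuned RF": 81.56,
}

# Number of validation passengers each accuracy corresponds to.
correct_counts = {
    name: round(acc / 100 * n_val) for name, acc in reported.items()
}

print(f"validation rows: {n_val}")
for name, correct in correct_counts.items():
    print(f"{name}: {correct} of {n_val} correct")
```

Under those assumptions, the validation set has 179 rows and the three models got 148, 147, and 146 of them right, so each step in the ranking comes down to a single passenger.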

The GridSearchCV block methodically checked combinations of estimator counts, depths, and split criteria, ultimately selecting the parameters with the best cross-validated score:


```python
{'criterion': 'entropy', 'max_depth': 6, 'min_samples_split': 10, 'n_estimators': 50}
```
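For context, a search of this kind might look roughly like the sketch below. The parameter grid is hypothetical: only the winning values are known from the result above, so the other candidates are assumptions, and synthetic data from make_classification stands in for the engineered Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real project used the engineered Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid: it contains the reported winners plus assumed alternatives.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 6],
    "min_samples_split": [2, 10],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),  # seed chosen arbitrarily
    param_grid,
    cv=5,                 # 5-fold cross-validation, as in the table
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

Note that best_score_ is the mean accuracy across the cross-validation folds, which is a different measurement from accuracy on a separate validation split, so the two numbers do not have to agree.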
