Meftahul Jannat Mila

Building a Machine Learning Pipeline with a Decision Tree Classifier

In this article, we will walk through preparing and training a machine learning model with a pipeline, using a Decision Tree to predict passenger survival on the Titanic dataset. The process involves data cleaning, preprocessing, training, and tuning, all structured within a neat, reusable pipeline.


Introduction

A Machine Learning Pipeline is a systematic workflow designed to automate the process of building, training, and deploying ML models. It includes several steps, such as data collection, preprocessing, feature engineering, model training, evaluation, and deployment.

Pipelines simplify and standardize workflows, accelerating machine learning development. They enhance data management by enabling the extraction, transformation, and loading of data from diverse sources.

Step 1: Importing Libraries

First, we import the essential libraries for data handling, preprocessing, model training, and evaluation.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier
import pickle

Step 2: Load and Clean the Data

We load the Titanic dataset and drop columns that aren’t useful for our model.

df = pd.read_csv('tested.csv')
df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)

These columns are identifiers or free text with little predictive value, so we remove them.
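
Before choosing an imputation strategy, it helps to check which columns actually contain missing values. A quick exploratory check (not part of the final pipeline) looks like this:

# Count missing values per column to guide the imputation choices in Step 4
print(df.isnull().sum())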


Step 3: Split the Data

We separate our dataset into features (x) and target (y), then split them into training and testing sets.

x_train, x_test, y_train, y_test = train_test_split(
    df.drop(columns=['Survived']),
    df['Survived'],
    test_size=0.2,
    random_state=42
)

This helps us train the model on one part of the data and test its performance on the unseen part.
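
As a quick sanity check, we can confirm the split sizes and the class balance. If the classes were heavily imbalanced, passing the stratify argument to train_test_split would keep the survival ratio consistent across both sets; that is an optional tweak not used above.

# Verify the 80/20 split and inspect the class balance of the target
print(x_train.shape, x_test.shape)
print(y_train.value_counts(normalize=True))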


Step 4: Build the Preprocessing Pipeline

4.1 Impute Missing Values

We fill in missing values, using the mean for "Age" and the most frequent value for "Embarked".

columntransformer1 = ColumnTransformer([
    ('impute_age', SimpleImputer(), [2]),  # Age
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6])  # Embarked
], remainder='passthrough')
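One subtlety worth knowing: ColumnTransformer places the transformed columns first and appends the passthrough columns after them, so this step reorders the features. You can verify the new layout with a quick exploratory check (not part of the final pipeline):

# After this transformer the column order becomes:
# [Age, Embarked, Pclass, Sex, SibSp, Parch, Fare]
out = columntransformer1.fit_transform(x_train)
print(out.shape)  # (n_rows, 7)
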

4.2 One-Hot Encode Categorical Features

We convert the text columns "Sex" and "Embarked" into numbers using one-hot encoding. Note that, because of the reordering described above, "Embarked" now sits at index 1 and "Sex" at index 3.

columntransformer2 = ColumnTransformer([
    ('ohe_sex_embarked', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), [1, 3])  # Embarked, Sex
], remainder='passthrough')

4.3 Scale Numerical Features

We scale all numerical values to a range between 0 and 1. A Decision Tree itself doesn't require scaling, but MinMaxScaler keeps every feature non-negative, which the chi2 feature-selection step below requires.

columntransformer3 = ColumnTransformer([
    # After one-hot encoding there are 10 columns in total
    # (3 for Embarked, 2 for Sex, plus the 5 passthrough features),
    # so slice(0, 10) covers every column.
    ('scale', MinMaxScaler(), slice(0, 10))
])

Step 5: Feature Selection and Model

We select the 5 best features and use a Decision Tree for classification.

selectkbest = SelectKBest(score_func=chi2, k=5)
decisiontreeclassifier = DecisionTreeClassifier()

Step 6: Create the Pipeline

We combine all steps — preprocessing, feature selection, and modeling — into one reusable pipeline.

pipe = make_pipeline(
    columntransformer1,
    columntransformer2,
    columntransformer3,
    selectkbest,
    decisiontreeclassifier
)

Now we can treat this entire setup as a single model object.
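
In a Jupyter notebook, you can also render the pipeline as an interactive diagram to double-check the step order (optional):

from sklearn import set_config

set_config(display='diagram')  # render estimators as an HTML diagram
pipe  # displaying the object now shows every step in the pipeline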


Step 7: Train the Model

We train the pipeline using our training data.

pipe.fit(x_train, y_train)
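Because make_pipeline names each step after its class, the fitted SelectKBest step is reachable as named_steps['selectkbest'], and its get_support() method shows which of the 10 preprocessed columns were kept:

# Boolean mask over the 10 preprocessed columns: True = selected by chi2
print(pipe.named_steps['selectkbest'].get_support())
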

Step 8: Evaluate the Model

We make predictions on the test data and calculate the accuracy.

from sklearn.metrics import accuracy_score

y_pred = pipe.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)
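Accuracy alone can hide class-level problems, so it is often worth printing per-class precision and recall as well (an optional extra check):

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))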

We can also evaluate performance using cross-validation:

print(cross_val_score(pipe, x_train, y_train, cv=5, scoring='accuracy').mean())

Step 9: Hyperparameter Tuning with GridSearchCV

We use GridSearchCV to find the best values for some parameters, like how many features to select and the maximum depth of the tree.

params = {
    'selectkbest__k': [5, 10],
    'decisiontreeclassifier__max_depth': [3, 5, 10]
}

grid = GridSearchCV(pipe, param_grid=params, cv=5, scoring='accuracy')
grid.fit(x_train, y_train)

print("Best Score:", grid.best_score_)
print("Best Parameters:", grid.best_params_)
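Because GridSearchCV refits the best configuration on the full training set by default (refit=True), the tuned pipeline is available as grid.best_estimator_, and it is usually worth evaluating and saving that object:

best_pipe = grid.best_estimator_  # pipeline refitted with the best parameters
print(best_pipe.score(x_test, y_test))  # test accuracy of the tuned pipeline
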

Step 10: Save the Trained Pipeline

Finally, we save the trained pipeline so we can reuse it later without retraining. (If you ran the grid search above, saving grid.best_estimator_ instead keeps the tuned version.)

with open('pipe.pkl', 'wb') as f:
    pickle.dump(pipe, f)
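As a side note, joblib is often recommended for persisting scikit-learn objects because it handles large NumPy arrays efficiently; an equivalent save would look like this (assuming joblib is installed):

import joblib

joblib.dump(pipe, 'pipe.joblib')  # equivalent of the pickle dump above
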

Production Prediction


Step 1: Import Required Libraries and Load the Model

We start by importing necessary libraries and loading the saved pipeline.

import pickle
import numpy as np

with open('pipe.pkl', 'rb') as f:
    pipe = pickle.load(f)

Step 2: Create a New Input as a NumPy Array

Assume this is the data provided by the user. It must follow the same format (columns and order) as the training data: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked.

# Assume user input, in training column order:
# Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
test_input2 = np.array([2, 'female', 16.0, 0, 0, 10.5, 'S'], dtype=object).reshape(1, 7)

Step 3: Make the Prediction Using the Pipeline

We call .predict() on the pipeline just like any other Scikit-learn model.

pipe.predict(test_input2)

This will output either 0 (Did Not Survive) or 1 (Survived), which you can format as needed.
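
Since the pipeline was fitted on a pandas DataFrame, passing a DataFrame with the original column names is a bit safer than a positional array and avoids feature-name warnings (a small variation on the example above):

import pandas as pd

test_input_df = pd.DataFrame(
    [[2, 'female', 16.0, 0, 0, 10.5, 'S']],
    columns=['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
)
print(pipe.predict(test_input_df))  # [0] = did not survive, [1] = survived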


Final Summary of the Article

In this article, we built an end-to-end machine learning pipeline using the Titanic dataset. Here’s a brief overview:

  • Cleaned the dataset by removing irrelevant columns.
  • Used Scikit-learn’s Pipeline and ColumnTransformer for data preprocessing, including:
    • Imputing missing values
    • Encoding categorical variables
    • Scaling numerical features
  • Selected the most informative features with SelectKBest.
  • Trained a Decision Tree Classifier and validated it with the accuracy score and cross-validation.
  • Optimized hyperparameters using GridSearchCV.
  • Saved the trained pipeline with pickle.
