Meftahul Jannat Mila

Building a Machine Learning Pipeline with a Decision Tree Classifier

In this article, we will walk through preparing and training a machine learning model with a pipeline, using a Decision Tree to predict passenger survival on the Titanic dataset. The process involves data cleaning, preprocessing, training, and tuning, all structured within a neat, reusable pipeline.


Introduction

A Machine Learning Pipeline is a systematic workflow designed to automate the process of building, training, and deploying ML models. It includes several steps, such as data collection, preprocessing, feature engineering, model training, evaluation, and deployment.

Pipelines simplify and standardize workflows, accelerating machine learning development. They enhance data management by enabling the extraction, transformation, and loading of data from diverse sources.

Step 1: Importing Libraries

First, we import the essential libraries for data handling, preprocessing, model training, and evaluation.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier
import pickle

Step 2: Load and Clean the Data

We load the Titanic dataset and drop columns that aren’t useful for our model.

df = pd.read_csv('tested.csv')
df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)

These columns are identifiers or free text with little predictive value, so we remove them.
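
Before choosing an imputation strategy, it helps to check which columns actually contain missing values. A quick exploratory check (not part of the final pipeline) looks like this:

# Count missing values per column to guide the imputation choices in Step 4
print(df.isnull().sum())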


Step 3: Split the Data

We separate our dataset into features (x) and target (y), then split them into training and testing sets.

x_train, x_test, y_train, y_test = train_test_split(
    df.drop(columns=['Survived']),
    df['Survived'],
    test_size=0.2,
    random_state=42
)

This helps us train the model on one part of the data and test its performance on the unseen part.
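
As a quick sanity check, we can confirm the split sizes and the class balance. If the classes were heavily imbalanced, passing the stratify argument to train_test_split would keep the survival ratio consistent across both sets; that is an optional tweak not used above.

# Verify the 80/20 split and inspect the class balance of the target
print(x_train.shape, x_test.shape)
print(y_train.value_counts(normalize=True))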


Step 4: Build the Preprocessing Pipeline

4.1 Impute Missing Values

We fill in missing values, using the mean for "Age" and the most frequent value for "Embarked".

columntransformer1 = ColumnTransformer([
    ('impute_age', SimpleImputer(), [2]),  # Age
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6])  # Embarked
], remainder='passthrough')
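One subtlety worth knowing: ColumnTransformer places the transformed columns first and appends the passthrough columns after them, so this step reorders the features. You can verify the new layout with a quick exploratory check (not part of the final pipeline):

# After this transformer the column order becomes:
# [Age, Embarked, Pclass, Sex, SibSp, Parch, Fare]
out = columntransformer1.fit_transform(x_train)
print(out.shape)  # (n_rows, 7)
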

4.2 One-Hot Encode Categorical Features

We convert the text columns "Sex" and "Embarked" into numbers using one-hot encoding. Note that, because of the reordering described above, "Embarked" now sits at index 1 and "Sex" at index 3.

columntransformer2 = ColumnTransformer([
    ('ohe_sex_embarked', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), [1, 3])  # Embarked, Sex
], remainder='passthrough')

4.3 Scale Numerical Features

We scale all numerical values to a range between 0 and 1. A Decision Tree itself doesn't require scaling, but MinMaxScaler keeps every feature non-negative, which the chi2 feature-selection step below requires.

columntransformer3 = ColumnTransformer([
    # After one-hot encoding there are 10 columns in total
    # (3 for Embarked, 2 for Sex, plus the 5 passthrough features),
    # so slice(0, 10) covers every column.
    ('scale', MinMaxScaler(), slice(0, 10))
])

Step 5: Feature Selection and Model

We select the 5 best features and use a Decision Tree for classification.

selectkbest = SelectKBest(score_func=chi2, k=5)
decisiontreeclassifier = DecisionTreeClassifier()

Step 6: Create the Pipeline

We combine all steps — preprocessing, feature selection, and modeling — into one reusable pipeline.

pipe = make_pipeline(
    columntransformer1,
    columntransformer2,
    columntransformer3,
    selectkbest,
    decisiontreeclassifier
)

Now we can treat this entire setup as a single model object.
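
In a Jupyter notebook, you can also render the pipeline as an interactive diagram to double-check the step order (optional):

from sklearn import set_config

set_config(display='diagram')  # render estimators as an HTML diagram
pipe  # displaying the object now shows every step in the pipeline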


Step 7: Train the Model

We train the pipeline using our training data.

pipe.fit(x_train, y_train)
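Because make_pipeline names each step after its class, the fitted SelectKBest step is reachable as named_steps['selectkbest'], and its get_support() method shows which of the 10 preprocessed columns were kept:

# Boolean mask over the 10 preprocessed columns: True = selected by chi2
print(pipe.named_steps['selectkbest'].get_support())
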

Step 8: Evaluate the Model

We make predictions on the test data and calculate the accuracy.

from sklearn.metrics import accuracy_score

y_pred = pipe.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)
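Accuracy alone can hide class-level problems, so it is often worth printing per-class precision and recall as well (an optional extra check):

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))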

We can also evaluate performance using cross-validation:

print(cross_val_score(pipe, x_train, y_train, cv=5, scoring='accuracy').mean())

Step 9: Hyperparameter Tuning with GridSearchCV

We use GridSearchCV to find the best values for some parameters, like how many features to select and the maximum depth of the tree.

params = {
    'selectkbest__k': [5, 10],
    'decisiontreeclassifier__max_depth': [3, 5, 10]
}

grid = GridSearchCV(pipe, param_grid=params, cv=5, scoring='accuracy')
grid.fit(x_train, y_train)

print("Best Score:", grid.best_score_)
print("Best Parameters:", grid.best_params_)
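Because GridSearchCV refits the best configuration on the full training set by default (refit=True), the tuned pipeline is available as grid.best_estimator_, and it is usually worth evaluating and saving that object:

best_pipe = grid.best_estimator_  # pipeline refitted with the best parameters
print(best_pipe.score(x_test, y_test))  # test accuracy of the tuned pipeline
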

Step 10: Save the Trained Pipeline

Finally, we save the trained pipeline so we can reuse it later without retraining. (If you ran the grid search above, saving grid.best_estimator_ instead keeps the tuned version.)

with open('pipe.pkl', 'wb') as f:
    pickle.dump(pipe, f)
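As a side note, joblib is often recommended for persisting scikit-learn objects because it handles large NumPy arrays efficiently; an equivalent save would look like this (assuming joblib is installed):

import joblib

joblib.dump(pipe, 'pipe.joblib')  # equivalent of the pickle dump above
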

Production Prediction


Step 1: Import Required Libraries and Load the Model

We start by importing necessary libraries and loading the saved pipeline.

import pickle
import numpy as np

with open('pipe.pkl', 'rb') as f:
    pipe = pickle.load(f)

Step 2: Create a New Input as a NumPy Array

Assume this is the data provided by the user. It must follow the same format (columns and order) as the training data: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked.

# Assume user input, in training column order:
# Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
test_input2 = np.array([2, 'female', 16.0, 0, 0, 10.5, 'S'], dtype=object).reshape(1, 7)

Step 3: Make the Prediction Using the Pipeline

We call .predict() on the pipeline just like any other Scikit-learn model.

pipe.predict(test_input2)

This will output either 0 (Did Not Survive) or 1 (Survived), which you can format as needed.
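
Since the pipeline was fitted on a pandas DataFrame, passing a DataFrame with the original column names is a bit safer than a positional array and avoids feature-name warnings (a small variation on the example above):

import pandas as pd

test_input_df = pd.DataFrame(
    [[2, 'female', 16.0, 0, 0, 10.5, 'S']],
    columns=['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
)
print(pipe.predict(test_input_df))  # [0] = did not survive, [1] = survived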


Final Summary of the Article

In this article, we built an end-to-end machine learning pipeline using the Titanic dataset. Here’s a brief overview:

  • Cleaned the dataset by removing irrelevant columns.
  • Used Scikit-learn’s Pipeline and ColumnTransformer for data preprocessing, including:
    • Imputing missing values
    • Encoding categorical variables
    • Scaling numerical features
  • Selected the most informative features with SelectKBest.
  • Trained a Decision Tree Classifier and validated it with the accuracy score and cross-validation.
  • Optimized hyperparameters using GridSearchCV.
  • Saved the trained pipeline with pickle.
