yuval mehta

Evolve Your Machine Learning: Automate the Process of Model Selection through TPOT.

One day I was googling ways to optimize my machine learning projects and came across the TPOT library. TPOT, short for Tree-based Pipeline Optimization Tool, is built on genetic algorithms and automates model selection and hyperparameter tuning. In this blog we will look at TPOT, its key features, and a step-by-step guide on using it to automate your machine learning process.


What is TPOT?
TPOT is a Python library that uses genetic programming to optimize machine learning pipelines. It tackles two otherwise time-consuming problems, model selection and hyperparameter tuning, so that data scientists can spend their time on the genuinely hard parts of a project. TPOT searches over many candidate models, tunes their hyperparameters on the fly, and keeps the best pipelines it discovers as the search evolves.

Key Features of TPOT
Automation: TPOT automates both model selection and the tuning of the chosen models' hyperparameters.
Genetic Programming: Uses genetic algorithms to evolve machine learning pipelines.
Scikit-Learn Compatibility: TPOT is implemented in Python on top of scikit-learn, so it is highly flexible and the pipelines it produces integrate well with most scikit-learn workflows.
Customizability: Users can also define their own operators and pipeline configurations (a minimal example follows this list).
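As a quick illustration of that customizability, you can hand TPOT a restricted search space through the config_dict argument. The dictionary below is only a hypothetical sketch that limits the search to two scikit-learn classifiers; see the TPOT documentation for the full configuration format.

from tpot import TPOTClassifier

# Hypothetical custom search space: only these two estimators and
# hyperparameter ranges are considered during the evolutionary search.
custom_config = {
    'sklearn.ensemble.RandomForestClassifier': {
        'n_estimators': [100, 200],
        'max_depth': [5, 10, None],
    },
    'sklearn.linear_model.LogisticRegression': {
        'C': [0.01, 0.1, 1.0, 10.0],
    },
}

tpot = TPOTClassifier(generations=5, verbosity=2, config_dict=custom_config)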

[Image: example machine learning pipeline]

How TPOT Works
TPOT applies genetic programming to evolve machine learning pipelines. It starts with a population of random pipelines and improves them over successive generations through selection, crossover, and mutation. The fitness function is a key element of the process, since it is what scores each pipeline's performance (by default, cross-validated accuracy for classification).
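To build some intuition for what "evolving pipelines" means, here is a toy, hypothetical sketch of such a loop. It only keeps the fittest candidates each generation and refills the population with fresh random models (no crossover), so it is far simpler than TPOT's real search over tree-shaped pipelines, but the selection-and-mutation rhythm is the same.

import random
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)

def random_model():
    # "Mutation" here is simply re-sampling a model and its hyperparameters
    return random.choice([
        DecisionTreeClassifier(max_depth=random.choice([3, 5, 10])),
        LogisticRegression(C=random.choice([0.1, 1.0, 10.0]), max_iter=1000),
        KNeighborsClassifier(n_neighbors=random.choice([3, 5, 11])),
    ])

def fitness(model):
    # Fitness = mean cross-validated accuracy
    return cross_val_score(model, X, y, cv=3).mean()

population = [random_model() for _ in range(10)]
for generation in range(5):
    # Selection: keep the fittest half, then refill with new random candidates
    population = sorted(population, key=fitness, reverse=True)
    population = population[:5] + [random_model() for _ in range(5)]
    print(f"Generation {generation}: best CV accuracy = {fitness(population[0]):.3f}")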

[Image: workflow]

Getting Started with TPOT
Now let's walk through the steps of setting up TPOT to automate most of the ML process.

Step 1: Installing TPOT
You can install TPOT using pip:



pip install tpot



Step 2: Importing Necessary Libraries
Once installed, you can import TPOT and other necessary libraries.



import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier



Step 3: Loading and Preparing Data
I am using the MAGIC gamma telescope dataset, which I found in the Kaggle datasets.



# Load the dataset and drop the unnamed index column
telescope = pd.read_csv('/kaggle/input/magic-gamma-telescope-dataset/telescope_data.csv')
telescope.drop(telescope.columns[0], axis=1, inplace=True)
telescope.head()




# Shuffle the rows and encode the class labels (gamma = 0, hadron = 1)
telescope_shuffle = telescope.iloc[np.random.permutation(len(telescope))]
telescope = telescope_shuffle.reset_index(drop=True)
telescope['class'] = telescope['class'].map({'g': 0, 'h': 1})


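Before splitting, it can be worth a quick look at the class balance, since we will use stratified sampling in the next step. This is just an optional sanity check:

# Count how many gamma (0) and hadron (1) events we have
telescope['class'].value_counts()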

[Image: Data]

Step 4: Configuring and Running TPOT
Configure the TPOT classifier and fit it.



# Separate the features from the target and create a stratified 75/25 train/test split
tele_class = telescope['class'].values
tele_features = telescope.drop('class', axis=1).values
training_data, testing_data, training_classes, testing_classes = train_test_split(
    tele_features, tele_class, test_size=0.25, random_state=42, stratify=tele_class)




# Run the evolutionary search for 5 generations
tpot = TPOTClassifier(generations=5, verbosity=2)
tpot.fit(training_data, training_classes)


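A quick note on the arguments: generations controls how many rounds of the evolutionary search TPOT runs, and verbosity=2 prints a progress bar along with the best score found in each generation. If you want more control, a slightly more explicit configuration might look like the sketch below; the exact values are illustrative rather than recommendations.

tpot = TPOTClassifier(
    generations=5,        # number of evolutionary iterations
    population_size=50,   # pipelines evaluated per generation
    cv=5,                 # cross-validation folds used by the fitness function
    random_state=42,      # make the search reproducible
    verbosity=2,          # print per-generation progress
    n_jobs=-1,            # use all available CPU cores
)
tpot.fit(training_data, training_classes)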

[Image: Output of TPOT]

Step 5: Evaluating the Best Pipeline
The score method evaluates the best pipeline found during the search on the held-out test set; for TPOTClassifier the default metric is accuracy.



tpot.score(testing_data, testing_classes)



[Image: score]
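Besides score, the fitted TPOT object also exposes predict, so you can generate predictions with the best evolved pipeline directly:

# Predictions from the best pipeline found during the search
predictions = tpot.predict(testing_data)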

Step 6: Understanding the Output
The export function saves the best pipeline as a standalone Python script:



import os
os.makedirs('Output',exist_ok=True)
tpot.export('Output/tpot_pipeline.py')



Output file (tpot_pipeline.py):



import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from tpot.builtins import ZeroCount

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=None)

# Average CV score on the training set was: 0.8779530318962496
exported_pipeline = make_pipeline(
    ZeroCount(),
    RobustScaler(),
    MLPClassifier(alpha=0.001, learning_rate_init=0.01)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)


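The exported script is a template: it re-reads the data from a placeholder path and expects the target column to be named 'target'. If you just want to reuse the discovered pipeline on the split we already built, one way (a hypothetical sketch, not part of TPOT's output) is to fit the same pipeline directly:

from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from tpot.builtins import ZeroCount

# Same pipeline as in tpot_pipeline.py, fitted on our existing telescope split
exported_pipeline = make_pipeline(
    ZeroCount(),
    RobustScaler(),
    MLPClassifier(alpha=0.001, learning_rate_init=0.01)
)
exported_pipeline.fit(training_data, training_classes)
print(exported_pipeline.score(testing_data, testing_classes))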

Conclusion
TPOT is a powerful tool that applies genetic algorithms to automatically search for a well-performing machine learning pipeline. By bringing TPOT into your development environment, you can cut down the time spent on model selection and hyperparameter tuning and devote it to the more intricate parts of your tasks.

Resources:
TPOT Documentation
Genetic Programming
