While searching for ways to optimize my machine learning projects, I came across the TPOT library. TPOT, short for Tree-based Pipeline Optimization Tool, uses genetic programming to automate model selection and hyperparameter tuning. This post covers what TPOT is, its key features, and a step-by-step guide to using it to automate your machine learning workflow.
What is TPOT?
TPOT is a Python library that uses genetic programming to optimize machine learning pipelines. It tackles two otherwise time-consuming problems, model selection and hyperparameter tuning, so that data scientists can reach good solutions to difficult tasks faster. TPOT searches across many candidate models, tunes their hyperparameters as it goes, and keeps the best pipelines it finds along the way.
Key Features of TPOT
Automation: TPOT automates both the selection of models and the tuning of their hyperparameters.
Genetic Programming: TPOT uses genetic programming to evolve machine learning pipelines.
Scikit-Learn Compatibility: TPOT is written in Python and built on scikit-learn, so its pipelines should integrate well with most existing workflows.
Customizability: Users can also define their own operators and pipeline settings.
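As a rough illustration of the customizability point, here is a minimal sketch of a restricted search space. It assumes the classic TPOT API (the one used in the rest of this post), where TPOTClassifier accepts a config_dict argument. The particular operators and value ranges are my own picks for illustration, and build_classifier is a hypothetical helper, not part of TPOT.

```python
# Restricted search space in the classic TPOT config_dict format:
# keys are import paths of scikit-learn operators, values map each
# hyperparameter name to the list of values TPOT is allowed to try.
# The operators and ranges here are illustrative choices only.
custom_config = {
    "sklearn.naive_bayes.GaussianNB": {},
    "sklearn.tree.DecisionTreeClassifier": {
        "criterion": ["gini", "entropy"],
        "max_depth": list(range(1, 11)),
    },
    "sklearn.preprocessing.RobustScaler": {},
}


def build_classifier(config=None):
    """Hypothetical helper: a TPOTClassifier limited to the given operators."""
    # Imported lazily so the dictionary above can be inspected without TPOT
    # installed; assumes a classic TPOT release that supports config_dict.
    from tpot import TPOTClassifier

    return TPOTClassifier(
        generations=5,
        verbosity=2,
        config_dict=custom_config if config is None else config,
    )
```

Calling build_classifier() and then fit on your training data would search only over the operators listed in the dictionary.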
How TPOT Works
TPOT uses genetic programming to evolve machine learning pipelines. It starts with a population of random pipelines and improves them over successive generations through selection, crossover, and mutation. The fitness function is a key element of the process, since it scores each pipeline's performance and so decides which pipelines carry over to the next generation.
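The loop of selection, crossover, and mutation can be sketched with a toy example. This is a simplified genetic algorithm over a tiny, made-up search space, not TPOT's actual implementation (TPOT evolves tree-structured pipelines and scores them with cross-validation). Every name below, including SEARCH_SPACE and toy_fitness, is hypothetical, and the fitness values are invented stand-ins for a real validation score.

```python
import random

# Hypothetical search space: a "pipeline" is one choice per slot.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "robust"],
    "model": ["tree", "logistic", "mlp"],
    "alpha": [0.0001, 0.001, 0.01, 0.1],
}


def random_pipeline(rng):
    """Draw one random candidate from the search space."""
    return {slot: rng.choice(options) for slot, options in SEARCH_SPACE.items()}


def toy_fitness(pipeline):
    """Invented score standing in for a cross-validated accuracy."""
    score = 0.5
    score += {"none": 0.0, "standard": 0.1, "robust": 0.15}[pipeline["scaler"]]
    score += {"tree": 0.1, "logistic": 0.05, "mlp": 0.2}[pipeline["model"]]
    score -= abs(pipeline["alpha"] - 0.001)
    return score


def crossover(parent_a, parent_b, rng):
    """Build a child by taking each slot from one of the two parents."""
    return {slot: rng.choice([parent_a[slot], parent_b[slot]]) for slot in SEARCH_SPACE}


def mutate(pipeline, rng, rate=0.2):
    """Re-draw each slot at random with probability `rate`."""
    child = dict(pipeline)
    for slot, options in SEARCH_SPACE.items():
        if rng.random() < rate:
            child[slot] = rng.choice(options)
    return child


def evolve(generations=5, population_size=20, seed=0):
    """Run the toy search; return the best pipeline and the best score per step."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=toy_fitness, reverse=True)
        history.append(toy_fitness(ranked[0]))
        survivors = ranked[: population_size // 2]  # selection: keep the top half
        children = []
        while len(survivors) + len(children) < population_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = survivors + children
    best = max(population, key=toy_fitness)
    history.append(toy_fitness(best))
    return best, history


best_pipeline, best_scores = evolve()
print(best_pipeline, best_scores)
```

Because the top half of each generation survives unchanged, the best score in this sketch can never get worse from one generation to the next.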
Getting Started with TPOT
Now let's walk through the steps of setting up TPOT to automate most of the ML process.
Step 1: Installing TPOT
You can install TPOT using pip:
pip install tpot
Step 2: Importing Necessary Libraries
Once installed, you can import TPOT and other necessary libraries.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier
Step 3: Loading and Preparing Data
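I am using the MAGIC gamma telescope dataset, which I found on Kaggle.
# Load the dataset (path as it appears in a Kaggle notebook)
telescope = pd.read_csv('/kaggle/input/magic-gamma-telescope-dataset/telescope_data.csv')
# Drop the first column, which is not used as a feature
telescope.drop(telescope.columns[0], axis=1, inplace=True)
telescope.head()
# Shuffle the rows and rebuild the index
telescope_shuffle = telescope.iloc[np.random.permutation(len(telescope))]
telescope = telescope_shuffle.reset_index(drop=True)
# Encode the labels as integers: 'g' (gamma) -> 0, 'h' (hadron) -> 1
telescope['class'] = telescope['class'].map({'g': 0, 'h': 1})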
Step 4: Configuring and Running TPOT
Configure the TPOT classifier and fit it.
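# Separate the labels from the feature matrix
tele_class = telescope['class'].values
tele_features = telescope.drop('class', axis=1).values
# Hold out 25% of the rows for testing, preserving the class balance
training_data, testing_data, training_classes, testing_classes = train_test_split(tele_features, tele_class, test_size=0.25, random_state=42, stratify=tele_class)
# Search for 5 generations and print progress along the way
tpot = TPOTClassifier(generations=5, verbosity=2)
tpot.fit(training_data, training_classes)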
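To get a feel for what this run costs: as I understand the classic TPOT documentation, the search evaluates population_size + generations × offspring_size pipelines in total, with population_size defaulting to 100 and offspring_size defaulting to population_size. The helper below is a hypothetical illustration of that arithmetic, not a TPOT function.

```python
def pipelines_evaluated(generations, population_size=100, offspring_size=None):
    """Total pipelines evaluated in a run, per the classic TPOT documentation.

    The defaults mirror what I believe are TPOT's own defaults; treat them
    as assumptions and check them against your installed version.
    """
    if offspring_size is None:
        offspring_size = population_size
    return population_size + generations * offspring_size


# The configuration used above: generations=5, everything else left at default.
total_for_this_run = pipelines_evaluated(generations=5)
print(total_for_this_run)
```

With generations=5 that works out to 600 pipelines, and each of them is cross-validated, which is why even a short run can take a while on a dataset of this size.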
Step 5: Evaluating the Best Pipeline
tpot.score(testing_data, testing_classes)
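Assuming the scoring metric was left at its default, which for the classic TPOTClassifier is accuracy as far as I know, the number returned by score is the fraction of test rows that the best pipeline labels correctly. A minimal sketch of that calculation, using made-up labels rather than the telescope data:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical labels, encoded as above: 0 = gamma, 1 = hadron
example_true = [0, 1, 1, 0]
example_pred = [0, 1, 0, 0]
example_accuracy = accuracy(example_true, example_pred)  # 3 of 4 match
print(example_accuracy)
```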
Step 6: Understanding the Output
The export function saves the best pipeline as a standalone Python script:
import os
os.makedirs('Output',exist_ok=True)
tpot.export('Output/tpot_pipeline.py')
Output file (tpot_pipeline.py):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from tpot.builtins import ZeroCount
# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=None)
# Average CV score on the training set was: 0.8779530318962496
exported_pipeline = make_pipeline(
    ZeroCount(),
    RobustScaler(),
    MLPClassifier(alpha=0.001, learning_rate_init=0.01)
)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
Conclusion
TPOT is a powerful tool that applies genetic programming to the task of automatically finding a well-performing machine learning pipeline. By bringing TPOT into your development workflow, you can cut the time spent on model selection and hyperparameter tuning and devote it to the more intricate parts of your project.
Resources:
TPOT Documentation
Genetic Programming