Innocent Mambo
Machine Learning Pipelines

Ever wondered what a machine learning pipeline is? In this article, I discuss in detail what a machine learning pipeline is and the steps it involves.

A machine learning pipeline is a series of steps that solve a certain business problem.
You begin by identifying a business problem that machine learning can solve. An example of a business problem is identifying fraudulent credit card transactions so that you can stop them before they are processed.

You then formulate a research problem, articulating your business problem and converting it into a machine learning problem. In this phase you evaluate whether machine learning can really solve your problem. Make the problem measurable and give it an intended outcome. In this phase you also determine the kind of algorithm you'll use for your problem: is it supervised learning? Unsupervised? Then devise a solution to the problem. From our business problem, our goal can be to reduce the number of customers who end their membership because of fraud.

Thereafter you convert the business problem into a machine learning problem. What output do you want to see from your model? An example would be whether a credit card transaction is fraudulent or not. Framing the output this way also makes it simpler to choose a model type; in this case it's supervised learning.

You then move to the data collection, preparation and preprocessing phase. In this phase you think about the data you want to use to train your model and how accessible it is to you. You can decide to use private data, which is existing data in the company and can include anything from logs to customer invoices. You can use commercial data, that is, data a commercial entity collected and made available, such as Reuters. You can also opt for open-source datasets, which are usually available for research or teaching purposes, such as those on Kaggle. You must, however, be careful to understand your data and evaluate its quality.

Since your data might be collected in different raw forms, you need to convert it into a format that a machine learning model can understand, such as a CSV file. In this phase you also clean the data: dropping null values or filling them, and ensuring data has been entered correctly. You do this to maximize your model's accuracy and efficiency.
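A minimal sketch of this cleaning step with pandas (the file name and column names here are hypothetical, chosen to fit our fraud example):

```python
import pandas as pd

# Load the raw data; "transactions.csv" is a hypothetical file name.
df = pd.read_csv("transactions.csv")

# See how many null values each column has.
print(df.isnull().sum())

# Drop rows where the target label is missing...
df = df.dropna(subset=["is_fraud"])

# ...and fill missing numeric values, here with the column median.
df["amount"] = df["amount"].fillna(df["amount"].median())
```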
Feature engineering is the science and art of extracting more information from existing data to improve your model's predictive power so that it learns faster. You select or create the features you will use to train the model; the features are the columns of data within your dataset. The goal of the model is to correctly estimate the target value for new data. You perform two things: feature extraction and feature selection. Feature extraction involves building up valuable information from raw data by reformatting, combining and transforming primary features into new ones. Some activities include (see the sketch after this list):

  • Encoding ordinal and non-ordinal data

  • Finding and filling missing data

  • Handling outliers in the data
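Here is a minimal sketch of those three activities, assuming the same hypothetical transactions file and columns as before:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.read_csv("transactions.csv")  # hypothetical file and columns

# Encode ordinal data: categories that have a natural order.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["risk_level"] = encoder.fit_transform(df[["risk_level"]]).ravel()

# Encode non-ordinal (nominal) data with one-hot encoding.
df = pd.get_dummies(df, columns=["merchant_category"])

# Find and fill missing data.
df["amount"] = df["amount"].fillna(df["amount"].median())

# Handle outliers, e.g. by clipping to the 1st and 99th percentiles.
low, high = df["amount"].quantile([0.01, 0.99])
df["amount"] = df["amount"].clip(low, high)
```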

Feature selection is all about keeping the features that are most relevant and discarding the rest. It is applied to prevent redundancy or irrelevance in the existing features, or to reduce the number of features to prevent overfitting. The three methods used here are (a short example follows the list):

  • Filter methods use statistical measures to score the relevance of features by their correlation with the target variable. They are cheaper and faster than wrapper methods because they do not involve training models repeatedly.

  • Wrapper methods measure the usefulness of a subset of features by training a model on it and measuring the success of that model.

  • Embedded methods are algorithm-specific and might use a combination of both.
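As one example, a filter method can be sketched with scikit-learn's SelectKBest, which scores each feature against the target statistically; the synthetic dataset here is just a stand-in for your prepared features and labels:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the prepared features (X) and labels (y).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=42)

# Filter method: score each feature against the target with an
# ANOVA F-test and keep the 10 best; no model is trained here.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (1000, 10)
print(selector.get_support(indices=True))  # indices of the kept features
```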

Thereafter you train and tune your model. Remember, you do not use all the data to train your model; you split it and hold the rest back for testing. You have the training dataset, which is used to fit the model, and the validation dataset, where you tweak and tune the model. The test dataset is fed to the model as features only, because you want it to predict the labels, which you then compare against the held-out true labels. The performance you get on the test dataset is what you can reasonably expect in production.
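A common way to sketch this three-way split is to apply scikit-learn's train_test_split twice (the proportions here are just one reasonable choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold back 20% as the test set, untouched until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Split the remainder into training and validation sets
# (0.25 of the remaining 80% = 20% of the whole dataset).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

# Roughly 60% train, 20% validation, 20% test.
```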

This process is iterative so that you can evaluate which model works best. If the model doesn't meet your business goals, you go back and re-evaluate a couple of things, measuring it against your success metric.
If the model meets your business needs, you can move forward and deploy it to production to perform predictions.
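A sketch of that evaluation, assuming a simple classifier and using accuracy and recall as the success metrics (for fraud detection, recall on the fraudulent class often matters more than raw accuracy):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~95% legitimate, ~5% fraudulent.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare against the success metric agreed on in the business-problem phase.
print("accuracy:", accuracy_score(y_test, y_pred))
print("recall:  ", recall_score(y_test, y_pred))
```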

The deployment stage is the final stage. Remember, deploying for testing and deploying for production are different, although the mechanics are the same. Here you can deploy your model online, for example to AWS SageMaker.
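As a very rough sketch of what that can look like with the sagemaker Python SDK (the S3 path, IAM role, and entry-point script below are all placeholders you would supply yourself):

```python
from sagemaker.sklearn.model import SKLearnModel

# Placeholders: point these at your own trained model artifact,
# IAM role, and inference script.
model = SKLearnModel(
    model_data="s3://my-bucket/model.tar.gz",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    entry_point="inference.py",
    framework_version="1.2-1",
)

# Deploy the model as a real-time HTTPS endpoint.
predictor = model.deploy(initial_instance_count=1,
                         instance_type="ml.m5.large")
```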
