
Phylis Jepchumba, MSc

Machine Learning Steps

1. Collecting Data.

  • Since machines learn from the data you give them, it is important to collect reliable data so that your machine learning model can find the correct patterns.
  • Good data is:

    • Relevant
    • Contains very few missing and repeated values
    • Has a good representation of the various subcategories/classes present (a quick check for these properties is sketched after this list).
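
A minimal way to check these properties, assuming the collected data sits in a hypothetical `data.csv` with a `label` column and is loaded into a pandas DataFrame:

```python
import pandas as pd

# Hypothetical file and column names, purely for illustration.
df = pd.read_csv("data.csv")

# Missing values per column.
print(df.isna().sum())

# Repeated (duplicate) rows.
print("duplicate rows:", df.duplicated().sum())

# How well each class/subcategory is represented.
print(df["label"].value_counts(normalize=True))
```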

2. Preparing the Data.

  • Data Preparation is the process of transforming raw data so that data scientists and analysts can run it through machine learning algorithms to uncover insights or make predictions.

  • It can be done by:

Putting together all the data you have and randomizing (shuffling) it. This helps make sure the data is evenly distributed across the splits and that the original ordering does not affect the learning process.

Cleaning the data to remove unwanted records, handle missing values, drop irrelevant rows and columns, remove duplicates, convert data types, and so on. You might even have to restructure the dataset, changing the rows and columns or the index of rows and columns.

Visualizing the data to understand how it is structured and the relationships between the various variables and classes present.

Splitting the cleaned data into two sets: a training set and a testing set. The training set is the set your model learns from; the testing set is used to check the accuracy of your model after training. (These preparation steps are sketched in code below.)
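
Putting the steps above together, here is a minimal sketch with pandas and scikit-learn, again assuming a hypothetical `data.csv` with a `label` column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset, purely for illustration.
df = pd.read_csv("data.csv")

# Clean: drop duplicate rows and rows with missing values.
df = df.drop_duplicates().dropna()

# Randomize: shuffle so the original ordering does not influence learning.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Split into features/target, then into training and testing sets.
X = df.drop(columns=["label"])
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```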

3. Choosing a Model.

A machine learning model determines the output you get after running a machine learning algorithm on the collected data, so it is important to know how to match a machine learning algorithm to a particular problem.

Here are some important considerations when choosing an algorithm.

Size of the training data

  • If the training data is small, or the dataset has few observations and a large number of features (as with genetic or textual data), choose high-bias/low-variance algorithms such as linear regression, Naïve Bayes, or a linear SVM.

  • If the training data is sufficiently large and the number of observations is high compared to the number of features, go for low-bias/high-variance algorithms such as KNN, decision trees, or a kernel SVM.

Speed or Training time

  • Higher accuracy typically means higher training time.
  • Algorithms like Naïve Bayes and Linear and Logistic regression are easy to implement and quick to run.

  • Algorithms like SVMs (which involve parameter tuning), neural networks (which can take a long time to converge), and random forests need much more time to train.

Number of features

  • The dataset may have a large number of features that may not all be relevant and significant. A large number of features can bog down some learning algorithms, making training time unfeasibly long.

  • SVMs (Support Vector Machines) are better suited to data with a large feature space and fewer observations.

  • Other considerations include linearity of the data and the accuracy and/or interpretability required of the output. A small comparison of two candidate models is sketched below.
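
These rules of thumb usually leave more than one candidate. A common practical step is to compare a high-bias and a low-bias model with cross-validation; here is a minimal sketch using scikit-learn and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for your prepared training data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),  # high bias / low variance
    "random forest": RandomForestClassifier(random_state=42),  # low bias / high variance
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```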

4. Training the Model.

In training, you pass the prepared data to your machine learning model to find patterns and make predictions.
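
A minimal training sketch, using synthetic data in place of the prepared dataset from steps 1 and 2:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training: the model learns patterns from the training set only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```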

5. Evaluating the Model.

  • Model evaluation aims to estimate the generalization accuracy of a model on future (unseen/out-of-sample) data.

  • It uses a metric or combination of metrics to measure the objective performance of the model.

  • The model is tested against previously unseen data, i.e. the testing set held out during data preparation (see the sketch below).
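
A minimal evaluation sketch (same synthetic setup as in step 4), scoring the model on the held-out testing set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on data the model has never seen during training.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```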

6. Parameter Tuning.

  • Also known as hyperparameter tuning.

  • Once you have created and evaluated your model, see if its accuracy can be improved in any way. This is done by tuning the parameters present in your model, as sketched below.

  • Tune model parameters for improved performance
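
A minimal hyperparameter-tuning sketch using a grid search with cross-validation (scikit-learn's GridSearchCV, synthetic stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the prepared data.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Search a small grid of hyperparameters with cross-validation on the training set.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```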

7. Making Predictions.

  • Now you can use your model on unseen data to make predictions accurately.
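
A minimal prediction sketch, reusing the same synthetic setup; in practice `X_new` would be genuinely new, unlabeled observations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the prepared, labeled data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Placeholder for new, unlabeled records arriving after training.
X_new = X[:5]
print(model.predict(X_new))
```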
