Silvester

Posted on Apr 10

Top machine learning algorithms for a beginner

#beginners #linear #machinelearning #datascience

With so much data available today, machines are learning at a fast pace. They learn through machine learning algorithms, which are the building blocks of artificial intelligence. Today, you can analyze large volumes of data and make predictions about what will happen tomorrow, next week, next month, or even in a year.

Platforms like e-commerce sites use machine learning to suggest what you can pair with what you have already put in your cart. Businesses are also using machine learning to optimize their marketing campaigns and to understand their customers. These examples show that a future filled with machine learning is inevitable, which is why you should have some knowledge of the commonly used machine learning algorithms. Whether you are an aspiring data scientist or a curious person, this beginner-friendly article will give you a good foundation to learn more about machine learning algorithms.

Models

Linear Regression

This is probably the first machine learning algorithm that you will create in your data science/machine learning journey because of its simplicity. This algorithm is usually used to establish relationships between input and output variables.
Linear regression is represented by the linear equation y = mx + c, where y is the dependent variable, m is the gradient (slope), x is the independent variable, and c is the intercept (where the line cuts the y-axis). With linear regression, the target is finding the best-fit line that shows how variables x and y are related.

Let us look at this example by Javatpoint, a simple linear regression task to determine the relationship between salary and years of experience.

The plot above shows salary against years of experience. The red line represents the regression line while the blue dots represent the observations. In this prediction, you can see that the observations (blue dots) are close to the regression line, an indication that salary increases with years of experience and that the two have a strong linear relationship.
Linear regression is used for prediction tasks and it works best with continuous data that have a linear relationship. As a beginner, linear regression will give you the necessary foundation to learn other machine learning algorithms.
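To make the equation y = mx + c concrete, here is a minimal sketch of fitting the best-fit line with ordinary least squares in plain Python. The salary and experience figures are made-up illustration data, not the Javatpoint dataset, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    m = cov_xy / var_x
    # The best-fit line always passes through the point (mean_x, mean_y)
    c = mean_y - m * mean_x
    return m, c


# Hypothetical data: years of experience vs. salary (in thousands)
years = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]

m, c = fit_line(years, salary)
predicted = m * 6 + c  # predicted salary for 6 years of experience
print(m, c, predicted)  # 5.0 30.0 60.0
```

Because the made-up points lie exactly on a line, the fit recovers a slope of 5 and an intercept of 30. Real data will scatter around the line, as in the plot described above.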

Decision Tree

The Decision Tree algorithm reaches a decision by following a tree-like structure that has nodes, branches, and leaves. Many implementations build this structure as a binary tree. Each internal node tests an input variable (x), while each leaf node holds a value of the output variable (y). To make a prediction, you start at the root node and follow the splits until you arrive at a leaf node, which is the output of the model.

This image by IBM gives a good illustration of how the decision tree algorithm works.

The image above represents an example of how a person can use a decision tree to decide whether to surf or not. From the decision tree, you can see that the surfer only goes to surf when there is offshore wind direction or when there is low to no wind.
A Decision Tree classifies by asking a series of questions, which gives it a flowchart-like structure. The algorithm learns quickly, and it is often used in banking to classify loan applicants based on their probability of defaulting.
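The traversal from root to leaf can be sketched in a few lines. The tree below is a hand-written, simplified version of the surfing example (the feature names wind_strength and wind_direction are assumptions, not taken from the IBM image), and it only shows prediction, not how a tree is learned from data.

```python
# Each internal node asks about one feature; each leaf holds a decision.
# This tree is written by hand for illustration, not learned from data.
tree = {
    "feature": "wind_strength",
    "branches": {
        "low": {"leaf": "surf"},
        "high": {
            "feature": "wind_direction",
            "branches": {
                "offshore": {"leaf": "surf"},
                "onshore": {"leaf": "stay home"},
            },
        },
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following the branch that matches the sample."""
    while "leaf" not in node:
        value = sample[node["feature"]]
        node = node["branches"][value]
    return node["leaf"]


calm_day = predict(tree, {"wind_strength": "low", "wind_direction": "onshore"})
windy_offshore = predict(tree, {"wind_strength": "high", "wind_direction": "offshore"})
windy_onshore = predict(tree, {"wind_strength": "high", "wind_direction": "onshore"})
print(calm_day, windy_offshore, windy_onshore)
```

A learning algorithm would choose which feature to ask about at each node by measuring how well each candidate split separates the classes.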

Logistic Regression

Logistic Regression, like linear regression, finds coefficient values that weight the input variables. The difference is that the dependent variable is binary. For instance, you can use this algorithm to predict whether an event will occur or not. If I wanted to predict whether a customer will default on their loan, this algorithm would make perfect sense to use. Similarly, an e-commerce company can use this model to predict whether a customer will make a purchase or abandon their cart based on their browsing behavior.

This illustration by Analytics Vidhya gives a simplified approach to how logistic regression works.

In the illustration, the logistic regression model takes in various features and predicts whether the bird is happy or sad. As discussed earlier, standard logistic regression expects a binary target, and in the image above the two values are sad and happy.
The key thing to know about this model is that it is ideal for binary classification and that it is often used for tasks like filtering spam or predicting customer churn.
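As a rough sketch of how the coefficients are found, the code below trains a one-feature logistic regression with plain gradient descent. The browsing data, the learning rate, and the number of epochs are all assumptions chosen for illustration. Libraries typically use more sophisticated solvers.

```python
import math


def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Error = predicted probability minus the true 0/1 label
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


# Hypothetical data: minutes spent browsing, and whether a purchase followed
minutes = [1, 2, 3, 4, 6, 7, 8, 9]
purchased = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic(minutes, purchased)
p_short = sigmoid(w * 2 + b)  # probability of purchase after 2 minutes
p_long = sigmoid(w * 8 + b)   # probability of purchase after 8 minutes
print(round(p_short, 3), round(p_long, 3))
```

The model outputs a probability, and a threshold (commonly 0.5) turns it into one of the two classes.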

Naïve Bayes algorithm

This machine learning algorithm is based on Bayes' theorem. Naïve Bayes is a supervised machine learning algorithm used for classification problems. The algorithm takes its name from the assumption that the input features are independent of one another, which is naïve given that variables are usually related in the real world.
Bayes' theorem: P(X|Y) = P(Y|X) P(X) / P(Y), where X is the class and Y is the observed evidence.
This model is efficient when used to classify data using independent features. Examples of areas where you can use this algorithm include email spam filtering, sentiment analysis, recommendation systems, and weather prediction among others.
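Here is a small worked example of Bayes' theorem applied to spam filtering. All the probabilities are invented for illustration, and spam_posterior is a hypothetical helper. In practice these values are estimated from word counts in labelled emails.

```python
def spam_posterior(words, p_spam, p_word_given_spam, p_word_given_ham):
    """Return P(spam | words) using Bayes' theorem with the naive assumption."""
    # Naive assumption: words are independent given the class, so the
    # per-word likelihoods are simply multiplied together.
    score_spam = p_spam
    score_ham = 1.0 - p_spam
    for word in words:
        score_spam *= p_word_given_spam[word]
        score_ham *= p_word_given_ham[word]
    # Dividing by the total is the P(Y) term in Bayes' theorem.
    return score_spam / (score_spam + score_ham)


# Hypothetical probabilities (not measured from any real dataset)
p_spam = 0.2
p_word_given_spam = {"free": 0.6, "meeting": 0.1}
p_word_given_ham = {"free": 0.05, "meeting": 0.4}

one_word = spam_posterior(["free"], p_spam, p_word_given_spam, p_word_given_ham)
two_words = spam_posterior(["free", "meeting"], p_spam, p_word_given_spam, p_word_given_ham)
print(round(one_word, 3), round(two_words, 3))  # 0.75 0.429
```

Seeing "free" alone pushes the spam probability from 0.2 up to 0.75, while adding "meeting", a word more typical of normal email in this made-up data, pulls it back down to about 0.43.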

K-means

K-Means is an unsupervised machine learning model used for clustering problems. The model breaks the data down into k groups of closely related points and outputs them as clusters. To start, you choose a value for k and pick k initial centroids, often at random. The next step is assigning each data point to its closest centroid, after which each centroid is recalculated as the mean of the points assigned to it. The process is repeated until the clusters become stable, that is, until the centroids no longer change. In plots, the clusters are usually differentiated with colors to reduce confusion.
K-means is probably the first unsupervised machine-learning algorithm that you will use. K-means is commonly used for problems that require clustering such as determining the shopping habits of customers, anomaly detection, or market segmentation.
This illustration by Stanford offers a good demonstration of how k-means clustering works.

The first step (a) represents the initial dataset. In (b), two initial cluster centroids were chosen, represented by the blue and red crosses. From (c) to (f), the points were reassigned and the centroids recalculated several times until the stable clustering in (f) was reached. If you look closely, you will see that the red and blue clusters are distinct, an indication that the clustering process is complete.
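The repeat-until-stable loop can be sketched in plain Python as follows. The points and the starting centroids are made up, and the starting centroids are fixed (and deliberately poor) rather than random so that the run is reproducible.

```python
def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points and moving centroids until stable."""
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean_x = sum(p[0] for p in cluster) / len(cluster)
                mean_y = sum(p[1] for p in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # centroids stopped moving, so the clusters are stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data with two obvious groups, and k = 2
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, [(1, 1), (2, 1)])
print(centroids)
print(clusters)
```

Even though both starting centroids sit inside the same group, a couple of iterations are enough here for one centroid to migrate to the other group. On harder datasets the result can depend on the starting centroids, which is why libraries often run k-means several times.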

Random forest

The Random Forest algorithm is a supervised machine learning algorithm designed to address the weaknesses of single decision trees. It combines several decision trees to create a better-performing model. This approach reduces overfitting, a problem that individual decision trees often face. Some of the uses of this algorithm include image recognition and customer churn predictions.
Note that this algorithm is used for both regression and classification problems. Also, Random Forests usually perform better than a single decision tree.
This demonstration from Analytics Vidhya gives a better view of the working of Random Forest algorithms.

In the image above, the decision was reached by a majority vote across the trees. You can see that most of the trees classified the fruit as an apple while a few classified it as a banana. Apple received the most votes, so that is the forest's final prediction.
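The sketch below covers only the two ingredients that make a forest out of trees: giving each tree a different bootstrap sample of the data, and combining the trees' outputs. The tree predictions are hard-coded stand-ins, since training real trees is beyond the scope of a short example, and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so each tree sees different data."""
    return rng.choices(rows, k=len(rows))


def majority_vote(predictions):
    """Classification: the class predicted by the most trees wins."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: the forest's output is the mean of the trees' outputs."""
    return sum(predictions) / len(predictions)


# Stand-in outputs from five hypothetical trees for one fruit
tree_votes = ["apple", "apple", "banana", "apple", "banana"]
forest_class = majority_vote(tree_votes)

# Stand-in outputs from three hypothetical regression trees
tree_estimates = [10.0, 12.0, 11.0]
forest_value = average(tree_estimates)

rows = [1, 2, 3, 4, 5]
sample = bootstrap_sample(rows, random.Random(0))  # seeded for reproducibility
print(forest_class, forest_value, sample)
```

Because each tree is trained on a different sample, the trees make different mistakes, and voting or averaging tends to cancel those mistakes out.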

K-nearest neighbor algorithm (KNN)

KNN is a supervised machine learning model that handles both regression and classification problems. The algorithm is based on the assumption that observations close to each other in the feature space are similar. As a result, we can classify an unseen point based on its closeness to existing points. That is, for a new data point, we look at its k nearest neighbors and predict the majority class among them (or the average of their values for regression).

The image below by Towards Data Science gives a good visual of how KNN works.

A closer look at the image shows that similar data points are close to each other. The image captures the model’s assumption that similar observations are near each other.
Some of the uses of the algorithm include loan approvals, credit scoring, and optical character recognition.
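A minimal KNN classifier fits in a few lines of plain Python. The training points and labels below are invented, and k = 3 is an arbitrary choice for the example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Sort the labelled points by Euclidean distance to the query
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labelled points: ((feature_1, feature_2), label)
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]

near_a = knn_predict(train, (2, 2))
near_b = knn_predict(train, (6, 5))
print(near_a, near_b)  # A B
```

KNN does no training as such. It stores the data and does all the work at prediction time, which is why it can become slow on large datasets.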

Support Vector Machines (SVM)

SVM is an ML algorithm used to handle both regression and classification problems. When using SVM, the objective is to find an optimal hyperplane to separate data points in different classes. The optimal hyperplane has the largest margin.
This Datacamp visual shows how the SVM hyperplane works.

In the visual, three hyperplanes were initially drawn to separate the two classes. However, two of them (the blue and the orange) did not separate the classes effectively. The only hyperplane that separated the classes correctly is the black one, so it became the chosen hyperplane, as shown in the second image.
SVM is used for various problems like image classification, handwriting identification, anomaly detection, face detection, spam detection, gene expression analysis, and text classification.
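To illustrate what "largest margin" means, the sketch below measures the margin of three candidate lines on made-up 2-D data and picks the best one. The candidates and their color names are assumptions that loosely echo the visual described above. A real SVM does not compare a handful of candidates. It solves an optimization problem to find the maximum-margin hyperplane directly.

```python
import math


def margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w.x + b = 0.

    A negative result means at least one point is on the wrong side.
    """
    norm = math.hypot(w[0], w[1])
    return min(
        y * (w[0] * x[0] + w[1] * x[1] + b) / norm
        for x, y in zip(points, labels)
    )


# Hypothetical data: class -1 in the lower left, class +1 in the upper right
points = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
labels = [-1, -1, -1, 1, 1, 1]

# Hypothetical candidate lines, each given as (w, b)
candidates = {
    "blue": ((1, 0), -3),      # vertical line x = 3
    "orange": ((0, 1), -1.5),  # horizontal line y = 1.5, cuts through class -1
    "black": ((1, 1), -6.5),   # diagonal line x + y = 6.5
}

margins = {name: margin(w, b, points, labels) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins)
print(best)  # black
```

The orange line gets a negative margin because it misclassifies a point, the blue line separates the classes but passes close to one of them, and the black line keeps both classes as far away as possible.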

Apriori (frequent pattern mining)

This unsupervised algorithm uses prior knowledge of frequent itemsets to generate associations between events. Apriori creates association rules by observing which items appeared together in past transactions. An association rule like ‘if a customer bought item B, then they are likely to buy item C’ is represented as B -> C.
This visual by Intellipaat summarizes what the apriori algorithm in market basket analysis is all about.

This is the logic in the image above: if the itemset {wine, chips, bread} is frequent, then {wine, bread} must also be a frequent itemset, since every transaction containing {wine, chips, bread} automatically contains {wine, bread}.
The Apriori algorithm is used for problems like Google autocomplete and market basket analysis.
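The subset logic above is what lets Apriori grow itemsets one item at a time: a larger itemset is only worth counting if all of its smaller subsets were already frequent. Here is a minimal sketch of that search, assuming made-up transactions and a minimum support of 3 transactions. It finds frequent itemsets only and does not go on to generate association rules.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_count):
    """Return {itemset: count} for all itemsets found in at least min_count transactions."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # start with single items
    frequent = {}
    while current:
        k = len(current[0]) + 1  # size of the next level's candidates
        level = {}
        for itemset in current:
            count = support_count(itemset, transactions)
            if count >= min_count:
                level[itemset] = count
        frequent.update(level)
        # Build size-k candidates, keeping only those whose (k-1)-subsets are all frequent
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return frequent


# Hypothetical shopping baskets
transactions = [
    {"wine", "chips", "bread"},
    {"wine", "bread"},
    {"wine", "chips", "bread"},
    {"chips", "milk"},
    {"wine", "chips", "bread", "milk"},
]

frequent = frequent_itemsets(transactions, min_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this data, milk appears in only two baskets, so no itemset containing milk is ever counted beyond the first level. That pruning is what keeps the search manageable on large catalogs.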

Conclusion

As you dive into the data science world, you will encounter most of the machine learning algorithms listed above. No single algorithm works for every data problem. Each one suits specific kinds of problems. In this article, we have looked at nine popular algorithms, most of them supervised and two of them (K-means and Apriori) unsupervised.

Remember that while these algorithms are powerful, understanding their capabilities, strengths, and limitations is important if you want to use them well.

Through active experimentation with the various models, you will gain valuable technical skills and a better understanding of how to approach data science problems. With this knowledge, you will be better placed to apply machine learning to real-world problems.
