<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jhalyl Mason</title>
    <description>The latest articles on DEV Community by Jhalyl Mason (@jhalylm).</description>
    <link>https://dev.to/jhalylm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1466991%2Fdf12a7df-1022-4841-b7b4-b29bd3564db7.png</url>
      <title>DEV Community: Jhalyl Mason</title>
      <link>https://dev.to/jhalylm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jhalylm"/>
    <language>en</language>
    <item>
      <title>Machine Learning Models: Linear Regression</title>
      <dc:creator>Jhalyl Mason</dc:creator>
      <pubDate>Fri, 03 May 2024 17:13:25 +0000</pubDate>
      <link>https://dev.to/jhalylm/machine-learning-models-linear-regression-56lp</link>
      <guid>https://dev.to/jhalylm/machine-learning-models-linear-regression-56lp</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Linear Regression&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Linear Regression is the simplest of machine learning algorithms and is usually the first one you will learn in any course or class on the subject. However, its simplicity does not deprive it of any power. On the contrary, despite its simple nature it is still renowned for its predictive ability; according to the International Business Machines Corporation (IBM), “Linear regression is the most commonly used method of predictive analysis”. It is a type of &lt;em&gt;supervised learning&lt;/em&gt; model, meaning it learns from a training set of labeled data, where the labels are sometimes known as targets, in order to make a prediction. Specifically, linear regression tries to fit a straight line through the given data points in order to best model the relationship between the inputs and the given targets. To do this, it calculates a weighted sum of the inputs plus a &lt;em&gt;bias term.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmd42zfkzun3wwqpozutj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmd42zfkzun3wwqpozutj.png" alt="Linear Regression Formula" width="628" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The bias term is usually denoted θ₀, with the weights for each input denoted θ₁ through θₙ for n input values. Thus, linear regression can be represented as a linear function ŷ = θ₁x₁ + … + θₙxₙ + θ₀, with ŷ being the prediction. &lt;em&gt;Training&lt;/em&gt; the model means finding the parameters (θ) that best fit the training data. In order to train the model, we first must find the difference between the predicted value (the model’s output) and the expected value (the actual target value) for a data point of the training set (yᵢ − ŷᵢ). This is called finding the &lt;em&gt;error&lt;/em&gt; for that particular prediction. Finding the error for one specific point isn’t too helpful on its own. What’s most important is the total error the model has made, which is known as the &lt;em&gt;cost.&lt;/em&gt; The equation used to find the cost of a given model is known as the &lt;em&gt;cost function.&lt;/em&gt; The cost function associated with a linear regression model is the &lt;em&gt;Mean Squared Error.&lt;/em&gt; In simpler terms, this just means taking the average (mean) of all of the squared errors; hence the name.&lt;/p&gt;
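&lt;p&gt;As a quick sketch of the idea (the data points, weight, and bias below are made-up values for illustration), the prediction and MSE can be computed with NumPy:&lt;/p&gt;

```python
import numpy as np

# Made-up data: three points with one feature each
X = np.array([1.0, 2.0, 3.0])   # inputs
y = np.array([3.1, 4.9, 7.2])   # targets

# Assumed parameters: one weight (theta_1) and the bias term (theta_0)
theta_1, theta_0 = 2.0, 1.0

y_hat = theta_1 * X + theta_0   # predictions: y-hat = theta_1 * x_1 + theta_0
errors = y - y_hat              # per-point errors (y_i - y-hat_i)
mse = np.mean(errors ** 2)      # Mean Squared Error: mean of the squared errors
```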

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u67897a6w1js3acux3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u67897a6w1js3acux3t.png" alt="Mean Squared Error Equation" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This gives the cost of the model, or how far your model’s predictions are from the target values. The next step is to figure out how to minimize the result of this equation. Training a linear regression model consists of finding the values for the weights and bias that give the smallest possible MSE on the training set.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Normal Equation&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One way to find the ideal set of parameters for a linear regression model is by using the &lt;em&gt;normal equation&lt;/em&gt;. This is a closed-form equation that yields the optimal parameters directly in a single computation, in contrast to the iterative method that will be discussed later. To use it, first arrange all of the features (x) for every data point in the training data into a matrix (&lt;strong&gt;X&lt;/strong&gt;), with each row representing an instance of recorded data. Then create a vector (&lt;strong&gt;y&lt;/strong&gt;) containing all of the target values of the training set. Afterwards, compute the equation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fna5htpx6qgsm1tqf4qff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fna5htpx6qgsm1tqf4qff.png" alt="Normal Equation for Linear Regression" width="779" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To explain exactly what that means: multiply the &lt;em&gt;transpose&lt;/em&gt; of your matrix &lt;strong&gt;X&lt;/strong&gt; by the matrix itself, take the inverse, and multiply that by the product of the transpose of matrix &lt;strong&gt;X&lt;/strong&gt; and vector &lt;strong&gt;y&lt;/strong&gt;. This gives you the values of θ that minimize the cost function. The normal equation is an effective way to compute the optimal θ values when the number of inputs isn’t that large; however, as the number of features or instances of data grows, this computation becomes slower and less efficient. This brings up the next common way to train this model, and many others as well.&lt;/p&gt;
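&lt;p&gt;To make that matrix recipe concrete, here is a small self-contained sketch on synthetic data (the slope of 3 and intercept of 4 are made up), verifying that the normal equation recovers the parameters that generated the targets:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))   # 100 instances, one feature each
y = 3.0 * X + 4.0                       # targets from a known line (no noise)

# Prepend a column of ones so the bias term theta_0 is learned as well
X_b = np.c_[np.ones((100, 1)), X]

# Normal equation: theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
# theta[0] is approximately 4 (the bias) and theta[1] approximately 3 (the weight)
```

&lt;p&gt;In practice, &lt;code&gt;np.linalg.pinv&lt;/code&gt; (the pseudoinverse) is often preferred over &lt;code&gt;np.linalg.inv&lt;/code&gt;, since it still works when &lt;strong&gt;X&lt;/strong&gt;ᵀ&lt;strong&gt;X&lt;/strong&gt; is not invertible.&lt;/p&gt;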

&lt;p&gt;&lt;em&gt;Gradient Descent&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Gradient descent is a common optimization algorithm that is widely used across a vast range of machine learning models. The idea behind gradient descent is to iteratively change the parameters of the model in order to minimize the overall cost. To do this, the gradient descent algorithm calculates how much the cost function changes if you change a parameter slightly. It does this by computing the &lt;em&gt;partial derivative&lt;/em&gt; of the cost with respect to the parameter. The best analogy I’ve seen for gradient descent was by Luis Serrano in the &lt;em&gt;Math for Machine Learning &amp;amp; Data Science Specialization&lt;/em&gt; hosted by DeepLearning.AI on Coursera. To summarize: imagine you are in a really hot room and want to get to the coldest spot possible. You might take a step in any direction and check whether it is hotter or colder than where you were before. You would keep doing this until every spot you could move to next is hotter than the spot you are currently at; this is when you have found the coldest spot in the room. This is essentially how gradient descent works: slowly taking steps, the size of which is dictated by the &lt;em&gt;learning rate,&lt;/em&gt; toward the minimum of the model’s cost function. The number of steps taken, or &lt;em&gt;iterations&lt;/em&gt; of the training algorithm, is known as the number of &lt;em&gt;epochs&lt;/em&gt;. In order to implement gradient descent, you would compute the partial derivative of the cost function with respect to each parameter, using the equation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9g2egqgqd6jornguw3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9g2egqgqd6jornguw3b.png" alt="Partial Derivative of Cost Function" width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that &lt;strong&gt;θ&lt;/strong&gt;ᵀ&lt;strong&gt;x&lt;/strong&gt; is another commonly used way of representing the prediction ŷ. It simply expresses the prediction as the product of the transpose of the parameter vector &lt;strong&gt;θ&lt;/strong&gt; and the feature vector &lt;strong&gt;x&lt;/strong&gt;. Instead of computing these partial derivatives one at a time, a common method is &lt;em&gt;batch gradient descent,&lt;/em&gt; which calculates all of them over the whole training set at each step. This involves creating a vector of gradients:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflm01wpo4qvag94vmv9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflm01wpo4qvag94vmv9j.png" alt="Gradient Vector" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;
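&lt;p&gt;(An aside on notation: the &lt;strong&gt;θ&lt;/strong&gt;ᵀ&lt;strong&gt;x&lt;/strong&gt; form can be checked with a tiny made-up example. Prepend a 1 to the feature vector for the bias term, and the dot product reproduces the weighted sum from earlier.)&lt;/p&gt;

```python
import numpy as np

theta = np.array([[1.0], [2.0]])   # [theta_0 (bias), theta_1 (weight)] -- assumed values
x = np.array([[1.0], [3.0]])       # [1 for the bias term, feature x_1]

y_hat = (theta.T @ x).item()       # theta^T x
# identical to theta_1 * x_1 + theta_0 = 2.0 * 3.0 + 1.0
```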

&lt;p&gt;Once you have the gradient vector, you use it to step in the correct direction. This is where the learning rate comes into play. The update step subtracts the product of the gradient vector and the learning rate from your parameter vector θ. This is how the learning rate controls the size of the steps you take against the gradient. The equation looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqf813l37ni3x2y2tsxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqf813l37ni3x2y2tsxf.png" alt="Gradient Descent Step Algorithm" width="744" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The η represents the learning rate. The size of the learning rate is important: too large and you may continuously jump over the lowest point, but too small and it may take forever to converge on it. However, since the MSE is a convex function, the algorithm is guaranteed to get close to the &lt;em&gt;global minimum&lt;/em&gt; (lowest point) with a small learning rate as long as you wait long enough (and run through enough epochs). Thus it is usually safer to start with a smaller learning rate and experiment from there to see what works best.&lt;/p&gt;
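&lt;p&gt;The effect of the learning rate can be sketched with a toy one-parameter cost function f(θ) = (θ − 5)², a made-up stand-in for the MSE with its minimum at θ = 5:&lt;/p&gt;

```python
def step(theta, lr):
    # Gradient of f(theta) = (theta - 5)^2 is 2 * (theta - 5);
    # move against the gradient, scaled by the learning rate
    return theta - lr * 2 * (theta - 5)

def run(lr, epochs=100):
    theta = 0.0
    for _ in range(epochs):
        theta = step(theta, lr)
    return theta

small = run(0.1)   # converges toward the minimum at theta = 5
large = run(1.1)   # overshoots further on every step and diverges
```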

&lt;p&gt;&lt;em&gt;Code Implementation&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In order to fully demonstrate how and when to implement and train a linear regression model, I will go through the steps of a regression/prediction task. The dataset and project come from the book &lt;em&gt;Hands-On Machine Learning with Scikit-Learn and TensorFlow&lt;/em&gt; by Aurélien Géron; however, all code presented here is typed and published solely by me. &lt;em&gt;The author has expressed his intent to keep the code for the datasets and projects available open-source through his GitHub, which is linked in the reference list at the bottom of this text.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This project assumes you want to find out whether money correlates with happiness. To find out, you collect data on the life satisfaction of certain countries along with their GDP (gross domestic product). Your goal is to find whether there is some correlation and, if so, to create a model that can predict someone’s expected happiness based on their country’s GDP.&lt;/p&gt;

&lt;p&gt;We will start by importing the necessary libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we go through the steps of downloading and preprocessing the data. This includes opening the data and determining our features/targets. Since we want to be able to determine happiness given money, it stands to reason that our features would be the GDP of the country and our targets will be the life satisfaction of that country.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/ageron/data/raw/main/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;life_satisfaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lifesat/lifesat.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;life_satisfaction&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GDP per capita (USD)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;life_satisfaction&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Life satisfaction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will then go through the steps of visualizing our data; an important step before model selection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;life_satisfaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scatter&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                       &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GDP per capita (USD)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Life satisfaction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;23_500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;62_500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6rhsqnem7ntoo1ngn5j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6rhsqnem7ntoo1ngn5j.png" alt="Scatterplot of Data" width="688" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the points don’t fall on a perfectly straight line, there definitely appears to be a linear correlation between the GDP and life satisfaction of the given countries. Therefore a linear regression model will do well to make predictions. Now that we know which model we want to use, let’s train (or fit) the model to the training set. We’ll use the normal equation first:&lt;/p&gt;

&lt;p&gt;Normal Equation Implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add_dummy_feature&lt;/span&gt;

&lt;span class="n"&gt;X_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;add_dummy_feature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;best_theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;X_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;X_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a code implementation of the same normal equation explained above, with the output mapped to a variable called “best_theta”. Although knowing the actual numbers themselves isn’t all that helpful, let’s look at what the normal equation says anyway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;best_theta&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmijidt0l0sdcpb22m31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmijidt0l0sdcpb22m31.png" alt="Best Parameter Values" width="293" height="55"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we have the optimal parameter values, we can start making predictions. Given Cyprus’ GDP of 37,655.2, what would the model predict its life satisfaction to be?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X_new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mf"&gt;37_655.2&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;X_new_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;add_dummy_feature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_new&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X_new_b&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;best_theta&lt;/span&gt;
&lt;span class="n"&gt;y_prediction&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawiqdy6ntdh91mmrpw2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawiqdy6ntdh91mmrpw2f.png" alt="Prediction for Cyprus" width="281" height="41"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like our model expects Cyprus to have a life satisfaction rating of about 6.3. Now let’s look at a visualization of the prediction line that the normal equation has fit to our data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;life_satisfaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scatter&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                       &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GDP per capita (USD)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Life satisfaction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_theta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;best_theta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;23_500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;62_500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkc7z9fpy8ph9t2xmgm78.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkc7z9fpy8ph9t2xmgm78.png" alt="Prediction Plot" width="786" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that is pretty much all there is to linear regression with the normal equation. Now let’s look at solving the same problem with gradient descent.&lt;/p&gt;

&lt;p&gt;Gradient Descent Implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; 
&lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;gradients&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_b&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradients&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This randomizes the initial values of our parameter vector, computes the gradient of the cost function at each step, and repeats the update for the number of epochs we set. Now let’s see what parameters the algorithm returns this time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cbt0que9dzkme7k1b2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cbt0que9dzkme7k1b2z.png" alt="Gradient Descent Optimal Parameters" width="308" height="79"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We got the same result as with the normal equation. Remember, gradient descent is useful for many models, not just linear regression. There’s actually one more way to implement and train this model, and it’s the easiest.&lt;/p&gt;

&lt;p&gt;SciKit Learn Implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;intercept_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coef_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlv3s0m0qlb9fsfj14yv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlv3s0m0qlb9fsfj14yv.png" alt="SciKit Learn Optimal Parameters" width="605" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Scikit-Learn library already has a built-in linear regression model. Using the “.fit” method, we can train the model in one line of code. This way, you can simplify the previous methods down to just three lines of code. Easier is not always better, however, and although it is useful to have most of the details abstracted away, it is always important to understand what is going on beneath the simplified code. With that being said, this is the most common way you will see a linear regression model implemented and trained, as most people won’t need to code a model from scratch. However, if you ever do need to, now you know how.&lt;/p&gt;
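&lt;p&gt;For completeness, here is a minimal end-to-end sketch of this workflow on made-up data (the numbers below are stand-ins, not the lifesat dataset):&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up GDP / life-satisfaction pairs that happen to lie on a line
X = np.array([[30_000.0], [40_000.0], [50_000.0], [60_000.0]])
y = np.array([[5.5], [6.0], [6.5], [7.0]])

model = LinearRegression()
model.fit(X, y)                            # train in one line

prediction = model.predict([[37_655.2]])   # analogous to the Cyprus example above
```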

&lt;p&gt;&lt;em&gt;Reference List&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;IBM&lt;/em&gt; — &lt;a href="https://www.ibm.com/docs/en/db2oc?topic=procedures-linear-regression"&gt;https://www.ibm.com/docs/en/db2oc?topic=procedures-linear-regression&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hands On Machine Learning 3rd Edition Github —&lt;/em&gt; &lt;a href="https://github.com/ageron/handson-ml3/tree/main"&gt;https://github.com/ageron/handson-ml3/tree/main&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>computerscience</category>
      <category>deeplearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Architecture of Neural Networks</title>
      <dc:creator>Jhalyl Mason</dc:creator>
      <pubDate>Fri, 03 May 2024 16:57:44 +0000</pubDate>
      <link>https://dev.to/jhalylm/architecture-of-neural-networks-k1m</link>
      <guid>https://dev.to/jhalylm/architecture-of-neural-networks-k1m</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Introduction to Neural Networks&lt;/em&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;What is a Neural Network?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Neural networks are the fundamental machine learning algorithm responsible for spawning the field of deep learning. According to the International Business Machines Corporation (IBM), &lt;strong&gt;&lt;em&gt;“A neural network is a machine learning program, or model, that makes decisions in a manner similar to the human brain, by using processes that mimic the way biological neurons work&lt;/em&gt;”&lt;/strong&gt;. Sometimes referred to as &lt;em&gt;Artificial Neural Networks&lt;/em&gt; (ANNs) to differentiate them from their biological influence, neural networks have become extremely popular for machine learning due to their versatility and ability to handle large and especially complex tasks.&lt;/p&gt;

&lt;p&gt;While other algorithms are very useful for simple tasks, such as linear regression for price/cost prediction and support vector machines for binary classification, ANNs have paved the way for some of the largest and most impressive accomplishments in machine learning and AI as a whole. These include image classification for Google Images, speech recognition for Apple’s Siri, and video recommendation on YouTube. The creation and widespread adoption of neural networks has truly changed the field, and the world as a whole, and has helped reshape what we deem computationally feasible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Biological Neurons&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As their name suggests, artificial neural networks are modeled after the neurons in the brains of animals, including humans. Neurons are nerve cells that, according to the National Institute of Neurological Disorders and Stroke (NINDS), “&lt;strong&gt;&lt;em&gt;allow you to do everything from breathing to talking, eating, walking, and thinking&lt;/em&gt;&lt;/strong&gt;”. Each neuron has a long extension called an &lt;em&gt;axon&lt;/em&gt;, which branches off into tips bearing what are known as synaptic terminals, or &lt;em&gt;synapses&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;These synapses connect to other neurons and allow them to exchange information. Neurons produce electrical impulses that travel down their axons to the synapses, causing them to release chemicals called &lt;em&gt;neurotransmitters&lt;/em&gt; to the other neurons. When a neuron receives enough neurotransmitters within a short span of time, it will either fire its own impulse or stop firing, depending on the type of neurotransmitter. This small action is the essential basis of brain activity and the process that artificial neural networks intend to mimic.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;em&gt;From Biological to Artificial&lt;/em&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The Artificial Neuron&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The idea behind ANNs has been around for many decades. They were first introduced by neuropsychiatrist Warren McCulloch and mathematician Walter Pitts in their landmark paper “&lt;em&gt;A Logical Calculus Of The Ideas Immanent In Nervous Activity&lt;/em&gt;”, published in 1943. In the paper, they introduce a simple computational model that mimics the function of neurons using propositional logic (true or false). Their model of the neuron consisted of one or more binary (on/off) inputs and one binary output. The paper was instrumental in demonstrating that, even with these relatively simple neurons, it is possible to build a network capable of computing any logical proposition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The TLU&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building off of the early artificial neuron, the &lt;em&gt;threshold logic unit,&lt;/em&gt; or TLU, was the next big step for ANNs. The TLU differs from McCulloch and Pitts’ original model in that its inputs and output are numbers instead of just binary on/off signals. The model associates values, known as &lt;em&gt;weights,&lt;/em&gt; with each of its inputs. It then calculates a linear function of its inputs and their weights, along with a bias term, and applies what’s known as a &lt;em&gt;step function&lt;/em&gt; to the result. The step function introduces a threshold: the unit outputs one value if the result is at or above the threshold and another if it is below. A single TLU can perform simple binary classification tasks; however, TLUs become far more useful when stacked together.&lt;/p&gt;
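&lt;p&gt;As a rough sketch, a single TLU comes down to just a few lines of Python: a weighted sum of the inputs plus a bias, passed through a step function (here using the common 0/1 output convention; the weights and inputs below are arbitrary placeholders, not learned values):&lt;/p&gt;

```python
def step(z):
    # Heaviside step: output 1 if at or above the threshold of 0, else 0
    return 1 if z >= 0 else 0

def tlu(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias term, then the step function
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)

# Arbitrary example: two numeric inputs with hand-picked weights
print(tlu([1.0, 0.5], weights=[0.6, -0.4], bias=-0.3))  # 1, since 0.6 - 0.2 - 0.3 >= 0
```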

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The Perceptron&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Created by psychologist Frank Rosenblatt in 1957, the perceptron consists of one or more TLUs stacked in a layer, with each input connected to each unit. These layers are known as fully connected (or &lt;em&gt;dense&lt;/em&gt;) layers, with the layer of inputs taking the name &lt;em&gt;input layer&lt;/em&gt;. A perceptron with just two inputs and three units can simultaneously classify instances of data into three different binary classes, making it useful for multilabel classification. For the same reason, it also became useful for multiclass classification.&lt;/p&gt;

&lt;p&gt;Another benefit of the perceptron was the ability to adjust the weights, or &lt;em&gt;train,&lt;/em&gt; the model. To train it, the perceptron is fed multiple training samples and each output is recorded. After each sample, the weights are adjusted to reduce the error between the actual output and the desired output. This allows the model to get better, or &lt;em&gt;learn,&lt;/em&gt; from each instance it is trained on.&lt;/p&gt;
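&lt;p&gt;A minimal sketch of that training loop, using the classic perceptron learning rule on the linearly separable OR function (the learning rate and epoch count here are arbitrary choices):&lt;/p&gt;

```python
# After each sample, nudge each weight in proportion to the error between
# the desired output and the actual output.

def step(z):
    return 1 if z >= 0 else 0

def train_perceptron(samples, epochs=20, lr=0.1):
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = step(weights[0] * x1 + weights[1] * x2 + bias)
            error = target - output
            weights[0] += lr * error * x1
            weights[1] += lr * error * x2
            bias += lr * error
    return weights, bias

# The OR function is linearly separable, so a single perceptron can learn it
or_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_perceptron(or_gate)
print([step(weights[0] * a + weights[1] * b + bias) for (a, b), _ in or_gate])  # [0, 1, 1, 1]
```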

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The Multilayer Perceptron&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One step up from the perceptron is the &lt;em&gt;multilayer perceptron&lt;/em&gt;, or MLP. An MLP consists of an input layer, multiple TLU layers in the center (called &lt;em&gt;hidden layers&lt;/em&gt;), and one more layer of units called the output layer. Neural networks with two or more hidden layers are known as &lt;em&gt;deep neural networks,&lt;/em&gt; and the study of deep neural networks became known as &lt;em&gt;deep learning&lt;/em&gt;. MLPs were found to perform increasingly well at complex tasks. They could still handle binary classification and regression, but they also showed promise in more difficult jobs such as image classification. Over time, researchers were able to modify and adapt these deep neural networks for a plethora of different functions, including speech recognition, sentiment analysis, and image recognition.&lt;/p&gt;
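&lt;p&gt;To see why hidden layers add power, here is an illustrative MLP with hand-picked weights (not learned ones) that computes XOR, a function no single-layer perceptron can represent:&lt;/p&gt;

```python
import numpy as np

def step(z):
    # Vectorized step: 1 where the weighted sum is at or above 0, else 0
    return (z >= 0).astype(int)

def mlp_xor(x1, x2):
    x = np.array([x1, x2], dtype=float)
    # Hidden layer of two units: the first acts like OR, the second like AND
    h = step(np.array([[1.0, 1.0], [1.0, 1.0]]) @ x + np.array([-0.5, -1.5]))
    # Output layer combines them as "OR and not AND", i.e. XOR
    return int(np.array([1.0, -2.0]) @ h - 0.5 >= 0)

print([mlp_xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```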

&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Common Types of Neural Networks&lt;/em&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Feedforward Neural Networks&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Feedforward neural networks are among the simplest types of ANNs. They get their name from the fact that data input into the model moves in only one direction: forward. That is to say, data comes in through the input layer, is transferred through the hidden layers, and is then fed through the output layer. Every neuron in one layer is connected to every neuron in the next, and no neurons are connected to others within the same layer. These networks are the foundation for more complex and specialized networks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Convolutional Neural Networks&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modeled after the visual cortex region of the brain, convolutional neural networks, or CNNs, are networks specialized for image and audio inputs. They work by using a layer, known as the &lt;em&gt;convolutional layer,&lt;/em&gt; to detect important features within image or audio files. The data is then fed through a &lt;em&gt;pooling layer,&lt;/em&gt; which reduces the dimensions of the data, helping reduce complexity and increase efficiency. The data is then pushed through a fully connected layer, similar to a normal feedforward network. Convolutional neural networks are the backbone of &lt;em&gt;computer vision,&lt;/em&gt; the field of AI dedicated to enabling computers to derive information from digital images and videos. Computer vision is used in many industries: radiology, allowing doctors to identify cancerous tumors more efficiently; security, allowing cameras to identify and flag possible threats; and the automotive industry, aiding systems such as lane detection and even self-driving capabilities.&lt;/p&gt;
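&lt;p&gt;As a toy illustration of those two steps (a 1-D signal and a hand-picked edge-detecting kernel, purely for demonstration):&lt;/p&gt;

```python
import numpy as np

# Illustrative 1-D "image": two rising edges and two falling edges
signal = np.array([0, 0, 1, 1, 0, 0, 1, 1, 0], dtype=float)
kernel = np.array([-1.0, 1.0])  # responds strongly where the signal steps up

# Convolutional step (cross-correlation, as deep learning libraries do it):
# slide the kernel across the signal and record each response
feature_map = np.array([signal[i:i + 2] @ kernel for i in range(len(signal) - 1)])

# Pooling step: max pooling with window 2 and stride 2 halves the length
pooled = feature_map.reshape(-1, 2).max(axis=1)
print(feature_map)  # 1.0 at each rising edge, -1.0 at each falling edge
print(pooled)
```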

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Recurrent Neural Networks&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recurrent neural networks, or RNNs, are networks that use sequential or time series data. They are most popular for their use in speech recognition and &lt;em&gt;natural language processing&lt;/em&gt; (NLP). They differ from other neural networks in that they have “memory”: they take information from prior inputs to influence the next output. This is necessary for tasks like natural language processing, as the position of each word in a sentence is important in determining the purpose or sentiment of the sentence. Some of the most popular uses of RNNs are Siri on the iPhone, voice search, and Google Translate.&lt;/p&gt;
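&lt;p&gt;The “memory” idea can be sketched in a few lines: the hidden state is updated from both the current input and the previous state, so earlier inputs influence later outputs (the weights below are random placeholders, not trained values):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(3, 2)) * 0.5  # input -> hidden
W_h = rng.normal(size=(3, 3)) * 0.5  # hidden -> hidden (the "memory" path)
b = np.zeros(3)

def run_rnn(sequence):
    h = np.zeros(3)  # hidden state starts empty
    for x in sequence:
        h = np.tanh(W_x @ x + W_h @ h + b)  # new state depends on old state
    return h

# Two sequences that END with the same input but have different histories
seq_a = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
seq_b = [np.array([0.0, 0.0]), np.array([0.0, 1.0])]
print(np.allclose(run_rnn(seq_a), run_rnn(seq_b)))  # False: history matters
```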

&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Further Reading&lt;/em&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Hands-On Machine Learning with Scikit-Learn, Keras, &amp;amp; TensorFlow — Aurelien Geron&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;The Hundred Page Machine Learning Book — Andriy Burkov&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Deep Learning — Ian Goodfellow&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Machine Learning: A Probabilistic Perspective — Kevin P. Murphy&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;em&gt;References&lt;/em&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;IBM&lt;/strong&gt;- &lt;a href="https://www.ibm.com/topics/neural-networks"&gt;https://www.ibm.com/topics/neural-networks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.ibm.com/topics/recurrent-neural-networks"&gt;https://www.ibm.com/topics/recurrent-neural-networks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.ibm.com/topics/convolutional-neural-networks"&gt;https://www.ibm.com/topics/convolutional-neural-networks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NINDS&lt;/strong&gt;- &lt;a href="https://www.ninds.nih.gov/health-information/public-education/brain-basics/brain-basics-life-and-death-neuron#:~:text=Neurons%20are%20nerve%20cells%20that,were%20ever%20going%20to%20have"&gt;https://www.ninds.nih.gov/health-information/public-education/brain-basics/brain-basics-life-and-death-neuron#:~:text=Neurons%20are%20nerve%20cells%20that,were%20ever%20going%20to%20have&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;McCulloch &amp;amp; Pitts&lt;/strong&gt; — &lt;a href="https://www.cs.cmu.edu/~./epxing/Class/10715/reading/McCulloch.and.Pitts.pdf"&gt;https://www.cs.cmu.edu/~./epxing/Class/10715/reading/McCulloch.and.Pitts.pdf&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>datascience</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>History of Math &amp; Machine Learning</title>
      <dc:creator>Jhalyl Mason</dc:creator>
      <pubDate>Fri, 03 May 2024 16:48:40 +0000</pubDate>
      <link>https://dev.to/jhalylm/history-of-math-machine-learning-23c8</link>
      <guid>https://dev.to/jhalylm/history-of-math-machine-learning-23c8</guid>
<description>&lt;p&gt;With the increase in popularity of AI and machine learning, more and more people are looking to break into the field.&lt;/p&gt;

&lt;p&gt;A quick look at the Google Trends statistics for the keyword “machine learning” shows that interest in the term has more than quadrupled since March of 2016 (from a popularity score of 20 to a peak of 93 in March of 2024, at the time of writing).&lt;/p&gt;

&lt;p&gt;With this, there has been a recent discourse amongst people wanting to get into AI/ML. With the tools necessary to build your own neural networks so easily accessible, and a multitude of premade models already available for use, do you still need to learn the math to get into machine learning?&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Background&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we dive into the arguments on both sides and I give my opinion on the topic, it’s important that we first get a background on what led us here.&lt;/p&gt;

&lt;p&gt;The creation of artificial intelligence is usually dated back to 1950. This was the year &lt;strong&gt;Alan Turing&lt;/strong&gt; published his paper “&lt;strong&gt;&lt;em&gt;Computing Machinery and Intelligence&lt;/em&gt;&lt;/strong&gt;”, all about the theory of building intelligent machines and how we could measure or test that intelligence. However, at the time, theory was far ahead of technology, so although &lt;strong&gt;Turing helped popularize the idea&lt;/strong&gt;, his immediate impact was limited.&lt;/p&gt;

&lt;p&gt;A few years later, in 1956, a man named &lt;strong&gt;John McCarthy&lt;/strong&gt; became the next to propel the field forward. McCarthy, &lt;strong&gt;who is also credited with coining the term “Artificial Intelligence”&lt;/strong&gt;, hosted a conference with Marvin Minsky at Dartmouth College in 1956 called the “&lt;strong&gt;&lt;em&gt;Dartmouth Summer Research Project on Artificial Intelligence&lt;/em&gt;&lt;/strong&gt; (DSRPAI)”. McCarthy brought together researchers from multiple different fields to discuss the possibility of artificial intelligence and the techniques necessary to make it happen. Although the conference didn’t exactly go to plan (reportedly only six people, including McCarthy and Minsky, stayed consistently present), &lt;strong&gt;it was integral in starting the field as we know it today&lt;/strong&gt;. However, as with Turing, the technology still wasn’t fast, accessible, or capable enough to turn theory into practice at that time. That would not stay the case for long.&lt;/p&gt;

&lt;p&gt;Nine years later, in 1965, &lt;strong&gt;Gordon Moore&lt;/strong&gt;, co-founder of Intel, noticed that the number of transistors on an integrated circuit they were manufacturing had increased by a factor of two over the previous five years. He later posited that, at the rate they were going, the number of components on a single chip would double every two years. This came to be known as &lt;strong&gt;Moore’s Law&lt;/strong&gt; and would have massive effects on computing and society as a whole, and on AI especially. The exponential growth in computing power helped push the ML boom that occurred around that time. By 1970, Marvin Minsky was quoted as telling TIME Magazine, “&lt;strong&gt;from three to eight years we will have a machine with the general intelligence of an average human being.&lt;/strong&gt;”&lt;/p&gt;

&lt;p&gt;The field has gone through several setbacks and rollercoasters since, which helped prove Minsky wrong: AI winters, defunding of projects, lack of access to data, and more. However, all of those events laid the foundation for the modern space of AI/ML today. Skipping over multiple years and achievements, we come to the surge in 2016 mentioned earlier. Google’s DeepMind team had created an AI that successfully beat some of the best human players at Go, a feat so difficult that people still questioned whether it was possible, even five years after the release of Siri on the iPhone. This helped push the idea that &lt;strong&gt;intelligent AI models were not only possible, they were here&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What’s Math Got to Do With It?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So we know what led us to where we are now, and it sounds like it had a lot to do with computers. So where does math come in? &lt;strong&gt;It has been there from the beginning&lt;/strong&gt;. Computer science itself wasn’t its own discipline until the early 1960s, and even then it was considered the intersection of math and electrical engineering. &lt;strong&gt;Many of the important names mentioned earlier were mathematicians by degree&lt;/strong&gt;. McCarthy was a mathematics professor at Dartmouth when he organized the summer conference with Minsky, who had received his PhD in mathematics from Princeton two years prior. Alan Turing graduated from King’s College with a mathematics degree in 1934, sixteen years before publishing “&lt;strong&gt;&lt;em&gt;Computing Machinery and Intelligence&lt;/em&gt;&lt;/strong&gt;”. Math has always been an integral part of the study of computing, and of ML by extension. In fact, the majority of what happens in machine learning is a computer being fed data and using math to optimize around a certain metric. Just like with the rest of computer science, when you peel back the curtain of programming, it’s all math.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Current Discourse&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So if it all started with math, and still consists of math, where does the discourse come in? Well, AI has come a long way since its inception. In the early days of AI, and even up until recently (AI isn’t that old), it was assumed &lt;strong&gt;you needed a PhD to do anything in the field&lt;/strong&gt;. The complex nature of the science, and the fact that many of the biggest names in it held PhDs, helped push that narrative. Because of this, many stayed away from studying or pursuing ML unless they were already in a highly quantitative degree like computer science, math, or physics. Since those degrees carry higher-level math requirements, math was simply accepted as a necessity for getting into ML. &lt;strong&gt;All that changed with this most recent boom in AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With ChatGPT helping bring AI to the mainstream and Tesla showing its applications in self-driving cars, &lt;strong&gt;AI has become the biggest buzzword&lt;/strong&gt; again and &lt;strong&gt;everybody wants in&lt;/strong&gt;. This, along with advancements making ML more accessible to the common person, caused people to see AI as a hugely lucrative opportunity. It also brought an influx of people from different, less quantitative fields: people who may not have had to take Calculus II or Statistics in college, or who simply don’t remember it or don’t care to. And these people had a good question: &lt;strong&gt;If there are multiple types of models already made, polished, and available on the internet for whatever I may need, why learn the math behind them?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To their credit, there is a case to be made there. The math behind AI is used to make the algorithms, right? So &lt;strong&gt;if the algorithm is already premade, then hasn’t the necessity for math been abstracted away?&lt;/strong&gt; And it isn’t just people from outside the field saying this. Many data analysts and even ML engineers say that tough math is no longer needed to break into the field, citing the fact that &lt;strong&gt;the majority of the models used in production are premade models&lt;/strong&gt; and that focusing on data cleaning and prepping would be a better use of time.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Case for Math&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So I just solved the problem right there: most of the models are premade, so the math is a thing of the past. Right?&lt;/p&gt;

&lt;p&gt;So here’s where I state my stance and make the case for learning the math first. As previously stated, &lt;strong&gt;math has always been a huge part of all of computer science, &lt;em&gt;especially AI&lt;/em&gt;&lt;/strong&gt;. The math is necessary for understanding what the algorithm behind the model you’re using is, what it does, and how it does it. Now, it is 100% true that nowadays you can pick up a model online or clone a GitHub repository and have an algorithm up and running without so much as knowing how it works. And oftentimes you can get by with most tasks doing just that. However, what happens when you want to change the algorithm to better fit your use case? Or when it breaks under certain conditions?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the math is important to knowing what is going on behind the scenes in an algorithm.&lt;/strong&gt; Taking the time to build at least a good foundation in the math behind ML can make a huge difference in your ability to use AI, knowing when to use it, and how. It will give you better intuition about which models work best under which conditions, and how to tune and track your model accurately. &lt;strong&gt;Overall, the math is a crucial part of actually knowing what’s going on in machine learning.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;So Where Do I Start?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So now that we’ve established why math is crucial for Machine Learning, how do you learn the math necessary? &lt;strong&gt;Most of the math necessary you can pick up in an average college degree&lt;/strong&gt;. It’s pretty much all just linear algebra, calculus, and statistics. What if you didn’t go to college, or don’t remember? &lt;strong&gt;Luckily, the internet is full of resources, both free and paid, to help you catch up and fill the gaps in your knowledge.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/u/2df6dfeb5a48?source=post_page-----935f3a3929ee--------------------------------"&gt;Abhishek (Adam) Divekar&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;wrote an article with recommendations that I advocate you take a look at &lt;a href="https://medium.com/@ardivekar/high-quality-math-resources-that-helped-me-become-an-amazon-ml-scientist-82f7164500b2"&gt;&lt;em&gt;here&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I also recommend the &lt;a href="https://www.coursera.org/specializations/mathematics-for-machine-learning-and-data-science?utm_campaign=websitecourses-m4ml-navbutton&amp;amp;utm_medium=institutions&amp;amp;utm_source=deeplearning-ai"&gt;&lt;strong&gt;&lt;em&gt;Mathematics for Machine Learning and Data Science&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; specialization by Andrew Ng hosted on Coursera. The course breaks the math up into three categories: Linear Algebra, Calculus, and Probability &amp;amp; Statistics. I will also be doing a separate summary/review of each section of the course, in case you want to see what each section is about or just want a reference guide to freshen up with or come back to.&lt;/p&gt;

&lt;p&gt;Now that you know all about the history of math and Machine Learning, you know about the discourse, and have the resources to fill in the blanks, the only thing left is to get started studying. &lt;strong&gt;Enjoy the process.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Reference List&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://trends.google.com/trends/explore?date=all&amp;amp;geo=US&amp;amp;q=%2Fm%2F01hyh_&amp;amp;hl=en"&gt;https://trends.google.com/trends/explore?date=all&amp;amp;geo=US&amp;amp;q=%2Fm%2F01hyh_&amp;amp;hl=en&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.britannica.com/science/computer-science"&gt;https://www.britannica.com/science/computer-science&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.alumni.cam.ac.uk/news/cambridge-alumnus-alan-turing-to-be-the-face-of-new-%C2%A350-note-1#:~:text=Turing%20studied%20mathematics%20at%20King's,a%20first%2Dclass%20honours%20degree"&gt;https://www.alumni.cam.ac.uk/news/cambridge-alumnus-alan-turing-to-be-the-face-of-new-%C2%A350-note-1#:~:text=Turing%20studied%20mathematics%20at%20King's,a%20first%2Dclass%20honours%20degree&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://home.dartmouth.edu/about/artificial-intelligence-ai-coined-dartmouth?source=post_page-----935f3a3929ee--------------------------------"&gt;https://home.dartmouth.edu/about/artificial-intelligence-ai-coined-dartmouth?source=post_page-----935f3a3929ee--------------------------------&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ai100.stanford.edu/2016-report"&gt;https://ai100.stanford.edu/2016-report&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.investopedia.com/terms/m/mooreslaw.asp"&gt;https://www.investopedia.com/terms/m/mooreslaw.asp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://bjc.edc.org/bjc-r/cur/programming/6-computers/2-history-impact/2-moore.html?topic=nyc_bjc%2F6-how-computers-work.topic&amp;amp;course=bjc4nyc.html&amp;amp;source=post_page-----935f3a3929ee--------------------------------"&gt;https://bjc.edc.org/bjc-r/cur/programming/6-computers/2-history-impact/2-moore.html?topic=nyc_bjc%2F6-how-computers-work.topic&amp;amp;course=bjc4nyc.html&amp;amp;source=post_page-----935f3a3929ee--------------------------------&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.theguardian.com/technology/2016/dec/28/2016-the-year-ai-came-of-age?source=post_page-----935f3a3929ee--------------------------------"&gt;https://www.theguardian.com/technology/2016/dec/28/2016-the-year-ai-came-of-age?source=post_page-----935f3a3929ee--------------------------------&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/AlphaGo?source=post_page-----935f3a3929ee--------------------------------"&gt;https://en.wikipedia.org/wiki/AlphaGo?source=post_page-----935f3a3929ee--------------------------------&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>math</category>
    </item>
  </channel>
</rss>
