What is Machine Learning?
Machine Learning: The development of algorithms that enable computers to learn and make predictions based on data (in essence, learning from experience) instead of explicitly coding each individual step (as is done in regular programming).
Motivations
To understand what this means, let's consider a simple, typical program: a list-sorting algorithm. Writing this algorithm would involve specifying each step needed to carry out the task of sorting a list.
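To make "specifying each step" concrete, here is a minimal sketch of one such list-sorting algorithm. Insertion sort is only an illustrative choice, and the function name `insertion_sort` is our own.

```python
def insertion_sort(items):
    """Return a new list with the elements of `items` in ascending order."""
    result = list(items)  # work on a copy so the input is left untouched
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Shift every larger element one slot to the right...
        while j >= 0 and result[j] > current:
            result[j + 1] = result[j]
            j -= 1
        # ...then drop the current element into the gap.
        result[j + 1] = current
    return result


print(insertion_sort([3, 1, 2]))  # [1, 2, 3]
```

Every decision the program makes is spelled out by us in advance; nothing is learned from data.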
This approach works well on a lot of problems, but there are some for which this is impractical. Suppose we want to write an algorithm that, given an image (of either a cat or dog), will tell us whether that image contains a cat or a dog.
How would we even specify the steps to take in order to do such a thing? When we look at a dog or a cat, we can somehow tell them apart, but we cannot easily write down the rules we use to do so. This is where machine learning comes into play.
At a high level, we feed data (called the training set) to the algorithm. This data consists of example inputs and their corresponding outputs (put simply, example questions along with their answers). In the case of the cat and dog classifier, the data would consist of cat and dog images, along with their labels ("cat" or "dog"). Based on this data, the algorithm learns how to differentiate between dogs and cats (this process is called training), so that when it is eventually shown a new image it has not seen before, it can accurately determine whether that image contains a dog or a cat.
Machine Learning models as functions
A machine learning model is essentially a mathematical function. Given an input $x$, the value of the function $f(x)$ is the output, or prediction, of the model.
Machine learning models consist of two components:
Parameters (or weights): Variables within the mathematical function that influence how the input is transformed into the output.
Architecture: The template of the model, without any specific choices of parameter values.
In other words, a model is a specific instance of an architecture with particular parameter values.
For example, if we had the function $f(x) = ax + b$, then:
- $f(x) = ax + b$ would be the architecture.
- $a$, $b$ would be the parameters.
Then, say that, for a particular application, we find that the 'best values' for $a$ and $b$ (those that yield the most accurate predictions) are 5 and 20 respectively. In this case, the machine learning model would be the function $f(x) = 5x + 20$.
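The split between architecture and parameters can be sketched in a few lines of code. The names `linear_architecture`, `a`, and `b` are our own illustrative choices; only the values 5 and 20 come from the example above.

```python
def linear_architecture(a, b):
    """Architecture: the template 'a * x + b' with the parameters left open."""
    def model(x):
        return a * x + b
    return model


# Choosing parameter values turns the architecture into one concrete model.
model = linear_architecture(a=5, b=20)
print(model(2))  # 30
```

Calling `linear_architecture` with different values gives a different model, but the architecture (a straight line) stays the same.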
In this example, we used a linear function. This is one out of many choices of functions we could potentially use.
Machine learning algorithms that use the linear function as their model are called linear regression algorithms.
What is a Neural Network?
A neural network is a particular type of machine learning model, just like the linear function is one particular option for a machine learning model.
Therefore, the neural network is also a mathematical function. However, the special thing about the neural network is its flexibility. While the linear model ($f(x) = ax + b$) can only represent straight lines, a neural network can, in theory, approximate just about any function, given enough parameters with the right values. This is the statement of the universal approximation theorem.
The term "neural network" comes from its initial inspiration: the neurons in the brain.
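As a small taste of that flexibility, here is a sketch of a very small neural network with one input, two hidden units, and one output. The ReLU activation and the hand-picked parameter values are assumptions made for this illustration (they are not learned here); with them, the network computes the absolute value of its input, a V-shaped function that no single straight line can represent.

```python
def relu(z):
    """A common activation function: negative values are clipped to zero."""
    return max(0.0, z)


def tiny_network(x, hidden_weights, hidden_biases, output_weights, output_bias):
    # Hidden layer: each unit is a small linear function followed by ReLU.
    hidden = [relu(w * x + b) for w, b in zip(hidden_weights, hidden_biases)]
    # Output layer: a weighted sum of the hidden units.
    return sum(v * h for v, h in zip(output_weights, hidden)) + output_bias


# Hand-picked parameters (illustrative only).
params = dict(
    hidden_weights=[1.0, -1.0],
    hidden_biases=[0.0, 0.0],
    output_weights=[1.0, 1.0],
    output_bias=0.0,
)

print([tiny_network(x, **params) for x in (-2.0, 0.0, 3.0)])  # [2.0, 0.0, 3.0]
```

Just like in the linear case, the architecture is the template and the numbers in `params` are the parameters.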
Loss (and Cost)
In the linear regression example, the idea of 'best values' for the parameters of a model was briefly mentioned. The best values for parameters are those that make for the most accurate predictions.
Loss is a measure of how "off" a single prediction is from the real value. The sum of loss values over the training set is called the cost.
The values we want our parameters to take are those that minimize this cost value.
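Here is a minimal sketch of loss and cost. It assumes squared error as the loss function (one common choice, not the only one), and the training set is made up for the illustration.

```python
def squared_error_loss(prediction, target):
    """Loss: how far off a single prediction is (squared error is assumed)."""
    return (prediction - target) ** 2


def cost(model, training_set):
    """Cost: the sum of the losses over the whole training set."""
    return sum(squared_error_loss(model(x), y) for x, y in training_set)


def good_model(x):
    return 5 * x + 20


def poor_model(x):
    return 2 * x + 20


# Made-up (input, output) pairs that lie exactly on 5 * x + 20.
training_set = [(1, 25), (2, 30), (3, 35)]

print(cost(good_model, training_set))  # 0
print(cost(poor_model, training_set))  # 126
```

The parameter values 5 and 20 give a lower cost than 2 and 20, so they are the better choice for this data.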
Essentially, the process of training an algorithm consists of trying a set of parameter values on the training set, evaluating how much loss those parameter values incur, then adjusting those values until the best values are found. This is how the algorithm "learns".
The algorithm most commonly used for training neural networks is called Stochastic Gradient Descent (SGD), usually in one of its many variants.
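To give a feel for this "try, measure, adjust" loop, here is a rough sketch of SGD applied to the simple linear model rather than to a neural network. It assumes a squared-error loss; the learning rate, the number of epochs, and the noise-free training data are all made-up values for the illustration.

```python
import random


def train_sgd(training_set, learning_rate=0.05, epochs=1000, seed=0):
    """Fit a * x + b by stochastic gradient descent on the squared-error loss."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0  # arbitrary starting values for the parameters
    samples = list(training_set)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the examples in a random order
        for x, y in samples:
            error = (a * x + b) - y
            # Gradient of error ** 2 with respect to a and b, for this one
            # example; step a little in the direction that lowers the loss.
            a -= learning_rate * 2 * error * x
            b -= learning_rate * 2 * error
    return a, b


# Made-up, noise-free data generated from 5 * x + 20.
training_set = [(x, 5 * x + 20) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]

a, b = train_sgd(training_set)
print(round(a, 3), round(b, 3))
```

On this data the learned values end up very close to 5 and 20, even though we never told the algorithm those numbers directly.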
Overfitting & Underfitting
When we build (and train) a machine learning model, our ultimate objective is to have it make accurate predictions on new data (data it hasn't seen before). In other words, we want our model to generalize well to new examples.
A crucial thing to realize here is that having a low cost value does not imply that our model generalizes well.
Overfitting is a phenomenon that occurs when a model memorizes its training data instead of learning the underlying patterns in it. This leads to a model with a very low cost value, but a poor ability to generalize.
However, this is not to say that the cost value does not matter. Ultimately, we want our model to learn the underlying trend/pattern present within its training data (that will hopefully also be the same in the data it will see in practice), and a model with an extremely high cost probably does not do a good job at this. It could then run into the opposite problem: underfitting.
Underfitting is a phenomenon that occurs when a model fails to capture the underlying patterns in its training data due to being too simplistic.
Notice that both overfitting and underfitting are bad for the same reason: both cause the model to fail at adequately capturing the underlying patterns in the data. However, the reasons for failing to do so differ.
In essence, an overfitted model "reads too much into the data", treating the noise (random variations) in the data as if it were part of the underlying pattern.
An underfitted model suffers from the opposite problem: it can be said to "read too little". It oversimplifies the data, making false assumptions about the underlying patterns in the data.
Thus, if we are to build a model that generalizes well, we want to strike a balance between the two. This is where the concepts of validation and test sets come into play.
Example
Say we are building a model that aims to predict the price of a house, given its area.
We have a data set, gathered from existing houses. It consists of multiple points $(x, y)$, where:
- $x$: The area of the given house.
- $y$: The price of the house.
We want to use our model to predict how much a house with an area of 8 units would cost.
As shown in the first plot, the points in our training set roughly follow a quadratic curve, a polynomial of order 2. This is the underlying pattern we want our model to learn.
The second plot shows an overfit. One way this could happen is if we were to use a very high-order polynomial for our model $f(x)$. As you can see, even though this model perfectly fits all points in the training set (its cost is equal or very close to 0), it does not produce an accurate prediction. Looking at the trend, it would seem that an $x$-value of 8 should produce a higher $y$-value than 4.3.
The third plot shows an underfit. In this case, we are using the linear function, a polynomial of order 1. It does not adequately capture the trend in our training data. As a result, it also fails to provide an accurate prediction.
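The two failure modes can be reproduced with a small sketch. The data below is made up (it is not the data from the plots): seven points that roughly follow $y = x^2$ with some noise. The underfit is a least-squares straight line, and the overfit is the degree-6 polynomial that passes through every training point, built here with Lagrange interpolation as one convenient way to get such a polynomial.

```python
def fit_line(points):
    """Underfit: the least-squares straight line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_through_every_point(points):
    """Overfit: the polynomial of degree len(points) - 1 hitting every point."""
    def model(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return model


def cost(model, points):
    """Sum of squared errors over the given points."""
    return sum((model(x) - y) ** 2 for x, y in points)


# Made-up training points that roughly follow y = x^2, plus some noise.
training_set = [(1, 1.3), (2, 3.6), (3, 9.5), (4, 15.4),
                (5, 25.8), (6, 35.1), (7, 49.9)]

underfit = fit_line(training_set)
overfit = fit_through_every_point(training_set)

print("training cost, line:      ", cost(underfit, training_set))
print("training cost, high-order:", cost(overfit, training_set))
print("prediction at x = 8, line:      ", underfit(8))
print("prediction at x = 8, high-order:", overfit(8))
```

The high-order polynomial has a training cost of zero, yet its prediction at an input of 8 is far from the roughly 64 that the quadratic trend suggests. The straight line has a high training cost and also misses, though for the opposite reason.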
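Finally, our desired model would look something like the quadratic curve described above: it follows the overall trend of the training points without chasing every individual one.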
Training, Validation, & Test sets
An indispensable part of building a machine learning model is the dataset. As mentioned previously, the dataset is essentially a collection of example inputs and their associated outputs.
Typically, this dataset will be divided into 3 portions: the training set, validation set, and test set.
Why the need for this? Why not just use a training set? As mentioned before, the cost value of a model is not an adequate indicator of its ability to generalize.
More specifically, the cost value will probably be an underestimate of the error our model's predictions will incur on new data. This is because, during training, our choice of parameters has been directly influenced by the data in the training set itself. This potentially leads to overfitting.
Therefore, a logical next step would be to test our model's performance on a set of data that was not seen by the model during training. This set of data is called the validation set.
Hyperparameters
But it turns out that parameters are not the only values that have to be adjusted in order to build a good model.
Hyperparameters are a separate category of parameters whose values are determined before training. Thus, they are not learned from data; rather, they are set by the person developing the model.
Some examples of hyperparameters include the learning rate (a value that determines how quickly the model updates its parameters during training) and the degree of the polynomial used for $f(x)$.
It is common for multiple variants of a model, with different hyperparameter value choices, to be trained, then compared on the validation set. Then, certain metrics, such as accuracy or loss, are used to determine which set of hyperparameter values is the best.
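Here is a sketch of what such a comparison could look like, using the learning rate as the hyperparameter. Everything specific is an assumption for the illustration: the linear model, plain full-batch gradient descent with a fixed budget of 200 epochs, the three candidate learning rates, and the made-up data.

```python
def train(training_set, learning_rate, epochs=200):
    """Fit a * x + b with full-batch gradient descent on the mean squared error."""
    a, b = 0.0, 0.0
    n = len(training_set)
    for _ in range(epochs):
        grad_a = sum(2 * ((a * x + b) - y) * x for x, y in training_set) / n
        grad_b = sum(2 * ((a * x + b) - y) for x, y in training_set) / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


def mean_cost(params, dataset):
    """Mean squared error of the model with parameters (a, b) on a dataset."""
    a, b = params
    return sum(((a * x + b) - y) ** 2 for x, y in dataset) / len(dataset)


# Made-up data scattered around 5 * x + 20.
training_set = [(0.0, 20.1), (0.5, 22.4), (1.0, 25.2), (1.5, 27.3), (2.0, 30.1)]
validation_set = [(0.25, 21.2), (1.25, 26.3), (1.75, 28.6)]

# Train one variant per candidate value, then score each on the validation set.
candidates = [0.001, 0.01, 0.1]
results = {lr: mean_cost(train(training_set, lr), validation_set)
           for lr in candidates}
best_learning_rate = min(results, key=results.get)

print(results)
print("best learning rate:", best_learning_rate)
```

Note that the parameters are fitted on the training set only; the validation set is used purely to choose between the trained variants.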
Wait, but that means we can't use the validation set as an unbiased test of our model's performance anymore, since the data in the validation set has directly influenced our model design decisions via the choice of hyperparameter values.
Therefore, overfitting can not only occur during the training process with the training set, but also during the process of hyperparameter tuning with the validation set.
And thus, in the same way that we need a set of data to hold back from the training process, we need another set of data to hold back from the ones designing the model.
The key here is to understand that, for a set of data to be an unbiased test of a model's performance, that set of data should not have played any role in the design choices of the model.
And this is where the test set comes in. It is a portion of the data set that is completely independent and unseen by both the training process and the model design decisions. It serves as an unbiased evaluation of the model's performance on new, unseen data.
To summarize:
- The training set is used for finding the best parameter values.
- The validation set is used for finding the best hyperparameter values.
- The test set is used as an unbiased evaluation of the model's performance.
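A three-way split can be sketched as follows. The 60/20/20 proportions, the function name `split_dataset`, and the toy dataset are assumptions for the illustration; in practice the proportions depend on how much data is available.

```python
import random


def split_dataset(dataset, train_fraction=0.6, validation_fraction=0.2, seed=0):
    """Shuffle the examples, then cut them into training, validation and test sets."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)  # shuffle so each set is a random sample
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # whatever is left over
    return training_set, validation_set, test_set


# Toy dataset of (input, output) pairs.
dataset = [(x, 5 * x + 20) for x in range(10)]

training_set, validation_set, test_set = split_dataset(dataset)
print(len(training_set), len(validation_set), len(test_set))  # 6 2 2
```

The important property is that the three sets do not overlap: the test set stays untouched until the parameters and hyperparameters have both been chosen.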