PRACTICAL MACHINE LEARNING
WHO PREDICTS THINGS
1. Google -> Uses machine learning to show you the ads you are most likely to click, based on ads you have clicked in the past, to increase revenue.
2. Netflix -> Shows you movies you might be interested in, based on movies you have watched in the past, to increase revenue.
3. Insurance companies -> Use machine learning to predict your risk of death so they can increase revenue.
WHY PEOPLE PREDICT
1. For glory – a sense of pride.
2. For money – winning prediction contests, e.g. Kaggle.
3. To save lives – better medical decisions.
WHAT ARE THE COMPONENTS OF A PREDICTOR
Question (a very specific and well-defined question, e.g. what are you trying to predict, and what are you trying to predict it with?)
Input data (collect the best input data you can to predict with)
Features (either use measured characteristics that you already have, or use computation to build features that you think will help predict the outcome you want)
Algorithm (use a machine learning algorithm you have learnt about, such as random forests or decision trees)
Parameters (estimate the parameters of the algorithm, then use those parameters to apply the algorithm to a new data set)
Evaluate (evaluate the algorithm on that new data)
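The six components can be walked through on a toy problem. This is only a sketch: the emails, the labels, the "$" feature and the one-rule classifier are all made up for illustration and are not taken from these notes.

```python
# Question: can the number of "$" signs in an email predict whether it is spam?

# Input data: a handful of labelled emails (invented for this sketch).
training_emails = [
    ("Win $$$ now, claim your $1000 prize", "spam"),
    ("Lunch at noon tomorrow?", "ham"),
    ("Cheap loans, save $500 today", "spam"),
    ("Minutes from the meeting attached", "ham"),
]
new_emails = [
    ("You owe me $5 for coffee", "ham"),
    ("Earn $$$ working from home", "spam"),
    ("See you at the gym", "ham"),
]

# Features: summarise each raw email as a single number.
def dollar_count(text):
    return text.count("$")

# Algorithm: a one-rule classifier, "spam if the count is above a threshold".
def predict(text, threshold):
    return "spam" if dollar_count(text) > threshold else "ham"

# Parameters: pick the threshold that makes the fewest mistakes on the training emails.
def training_errors(threshold):
    return sum(1 for text, label in training_emails if predict(text, threshold) != label)

threshold = min(range(0, 5), key=training_errors)

# Evaluate: apply the fitted rule to emails it has never seen.
correct = sum(1 for text, label in new_emails if predict(text, threshold) == label)
accuracy = correct / len(new_emails)
print("threshold:", threshold)
print("accuracy on new emails:", accuracy)
```

The rule is perfect on the four training emails but misclassifies one of the new ones, which is the usual pattern when a predictor meets data it was not tuned on.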
IN SAMPLE VS OUT OF SAMPLE
In-sample error is the error you get on the same data you used to train your predictor. It is sometimes called resubstitution error.
Out-of-sample error is the error you get when you use your predictor on new data. It is sometimes called generalization error.
In every data set we have the noise and the signal. The goal of the predictor is to locate the signal and ignore the noise. The signal is the part we use to predict and the noise is the random variation we get in the data set during measurement.
In-sample error is usually lower than out-of-sample error because the prediction algorithm sometimes tunes itself to the noise in the particular data set it was trained on. A new data set has different noise, so the accuracy goes down a little. This is called overfitting. Hence it’s always best to test your predictor on a new data set to get a realistic idea of how well the algorithm works.
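A small simulation makes the gap visible. The sketch below assumes a made-up signal (the label is 1 when x is above 0.5), a made-up noise level (20% of labels flipped) and a 1-nearest-neighbour predictor, chosen because it memorises its training data, noise included.

```python
import random

def make_data(n, rng, noise=0.2):
    """Signal: label is 1 when x > 0.5. Noise: each label is flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def error_rate(train, data):
    """Fraction of points in `data` that the 1-NN predictor gets wrong."""
    wrong = sum(1 for x, label in data if predict_1nn(train, x) != label)
    return wrong / len(data)

rng = random.Random(0)
train = make_data(200, rng)
new_data = make_data(200, rng)   # same signal, different noise

in_sample_error = error_rate(train, train)
out_of_sample_error = error_rate(train, new_data)
print("in-sample error:", in_sample_error)
print("out-of-sample error:", out_of_sample_error)
```

Every training point is its own nearest neighbour, so the in-sample error is zero even though a fifth of the labels are noise. The error on the new data is the realistic one.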
PREDICTION STUDY DESIGN
Define your error rate
Split your data set into training, testing and (optionally) validation sets
Training set is used to build the model
Testing set is used to evaluate the model
Validation set is used to validate the model (optional)
On the training set, pick features using cross-validation (the idea is to use the training set to pick the features that are most important in your model)
On the training set, use cross-validation to pick the prediction function and estimate all the parameters you need to build the model.
If there is no validation set, apply the best model you have to the test set exactly one time. If we apply multiple models to the test set and then pick the best one, then in some sense we are using the test set to train the model, which shouldn’t be the case.
If there is both a validation set and a test set, apply the best prediction model to the test set, refine it a little, and then apply the result to the validation set exactly one time.
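The split itself can be sketched in a few lines. The 60/20/20 proportions, the fixed seed and the function name `split_data` are assumptions made for this example; the notes do not prescribe particular sizes.

```python
import random

def split_data(rows, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle a copy of `rows` and cut it into training, testing and validation sets.

    Whatever is left after the training and testing cuts becomes the validation set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation

rows = list(range(100))   # stand-in for 100 observations
train, test, validation = split_data(rows)
print(len(train), len(test), len(validation))
```

Shuffling before cutting matters when the rows are stored in some meaningful order, for example sorted by date or by outcome.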
Cross-validation is one of the most widely used tools for detecting relevant features for building models and estimating their parameters.
Accuracy on the training set (resubstitution accuracy) will always be optimistic. We try many different models and pick the best one on the training set, so the predictor can end up tuned to that data set and may not give the best accuracy when a new data set is introduced to it.
A better estimate of accuracy comes from an independent set (test set accuracy). However, if we keep using the test set to evaluate the out-of-sample accuracy, then in a sense the test set has become part of the training set and we still don’t have an outside, independent evaluation of the predictive model.
Hence the goal is to build our model entirely on the training set and evaluate it only once on the test set. The way to do that is cross-validation:
- Use training set
- Split training set into training set and test set
- Build a model on the training set
- Evaluate it on the test set
- Repeat and average the estimated errors
DIFFERENT WAYS TO SPLIT THE DATA INTO TRAINING AND TEST SETS
1. Random sub-sampling
Imagine every observation you are trying to predict arrayed along an axis. The first row represents only the training samples. We take a subset of the training samples and call them the test samples; in the picture these are the light grey bars. We then build our predictor on the dark grey samples, apply it to predict the light grey samples, and evaluate its accuracy. We do that for several random samples. The first three rows represent three different random samplings, and we average their errors to estimate what the accuracy would be.
2. K-fold cross-validation
The idea here is to break the data set into K equal-sized subsets (folds). For 3-fold cross-validation it looks like this: the test set is the first third of the data in the first fold, the second third in the second fold and the last third in the third fold. In the first fold we build the model on the dark grey training data and apply it to the light grey test data. We do this for all three folds, then average the errors across the three folds to get an estimate of the error for the predictive model we have built.
3. Leave-one-out cross-validation
We leave out one sample, build the predictive model on all the remaining samples, and predict the sample we left out. We repeat this for every sample and average the errors.
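The three schemes differ only in how they choose which samples to hold out, so they can be sketched as three index generators feeding one evaluation loop. Everything here is illustrative: the function names, the six data values and the deliberately trivial "model" (predict the mean of the training values, score with mean squared error) are assumptions made for the example.

```python
import random

def random_subsampling(n, n_repeats, test_size, seed=0):
    """Random sub-sampling: each repeat holds out a fresh random subset."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        test_idx = sorted(rng.sample(range(n), test_size))
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        yield train_idx, test_idx

def k_fold(n, k):
    """K-fold: cut the indices into k contiguous blocks; each block is the test set once."""
    start = 0
    for fold in range(k):
        size = n // k + (1 if fold < n % k else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

def leave_one_out(n):
    """Leave-one-out: k-fold with k equal to the number of samples."""
    return k_fold(n, n)

def cv_error(values, splits):
    """Average held-out mean squared error of a model that predicts the training mean."""
    errors = []
    for train_idx, test_idx in splits:
        prediction = sum(values[i] for i in train_idx) / len(train_idx)
        mse = sum((values[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
        errors.append(mse)
    return sum(errors) / len(errors)

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n = len(values)
print("random sub-sampling:", cv_error(values, random_subsampling(n, n_repeats=3, test_size=2)))
print("3-fold:", cv_error(values, k_fold(n, 3)))
print("leave-one-out:", cv_error(values, leave_one_out(n)))
```

With random sub-sampling some samples may be held out several times and others never, whereas K-fold and leave-one-out hold out every sample exactly once. The contiguous folds mirror the 3-fold description above; in practice the data is often shuffled first.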
WHAT DATA SHOULD YOU USE?
After identifying the question being asked you should use the best possible data in answering the exact question you’ve been asked.
When you are shown images of different kinds of fish, how can you tell that they are fish? You can tell because the images have all the features of a fish. If the same image is fed to a machine, how would the machine identify it as a fish? This is where machine learning comes in: we keep feeding the machine images of fish, labelled as fish, until it learns the features associated with a fish. Once it has learned them, we feed the machine new data to see how much it has learnt. In other words, raw data called training data is given to a machine so that it learns the features associated with that data; once the learning is done, the machine is given new data, the test data, to determine how well it has learned. That is the underlying concept of machine learning.
STEPS IN MACHINE LEARNING
- Acquiring data from various sources
- Cleaning data. You usually gain insight on the features of data when cleaning it.
- Split the data into training and testing sets
- Build your model with the training set. Teach the model all the features of the training set until it has learnt everything.
- Test the accuracy of the model on the test set.
- Evaluate the model using a confusion matrix (for classification) or root mean squared error (for regression).
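The two evaluation measures in the last step can be computed by hand. This is a minimal sketch with invented labels and numbers; the function names are chosen for this example.

```python
import math
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted numbers."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Classification: compare predicted labels with the true labels.
actual_labels = ["spam", "ham", "spam", "ham", "ham"]
predicted_labels = ["spam", "ham", "ham", "ham", "spam"]
matrix = confusion_matrix(actual_labels, predicted_labels)
for (actual, predicted), count in sorted(matrix.items()):
    print(f"actual={actual} predicted={predicted}: {count}")
print("accuracy:", accuracy(actual_labels, predicted_labels))

# Regression: compare predicted numbers with the true numbers.
actual_values = [1.0, 2.0, 3.0, 4.0]
predicted_values = [1.0, 2.0, 3.0, 8.0]
print("RMSE:", rmse(actual_values, predicted_values))
```

The confusion matrix shows which kind of mistake the classifier makes, something a single accuracy figure hides. RMSE penalises large misses heavily: here one prediction that is off by 4 produces an RMSE of 2 even though the other three are exact.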
DIFFERENT TYPES OF MACHINE LEARNING ALGORITHMS
Supervised learning: The machine is given a labelled training data set to learn from, and when it is done we test its accuracy on the test set. E.g. a student who is taught by a teacher and then writes a test.
Unsupervised learning: An unsupervised learning algorithm draws inferences from data that has no labels. E.g. a student who has no teacher and learns on their own before a test.
Reinforcement learning: A machine learning algorithm placed in an environment learns the ideal behaviour in order to maximize its performance. A simple feedback signal is required for the machine to learn. E.g. a self-driving car that obeys traffic lights to keep its passengers safe.
THE CARET PACKAGE
Caret stands for Classification and Regression Training. This R package was specifically designed to make machine learning tasks easier and faster to carry out.
FUNCTIONS OF THE CARET PACKAGE
Many people in the R community create new packages every day. Caret makes them easier to use by wrapping hundreds of these packages together behind a common interface.
It provides a lot of functionality for data splitting and sampling.
It has simple functionality for feature selection. That is, it can automatically run through your data set and algorithm and bring out the combination of features that works best for the particular problem.
Some of the best models are very difficult to operate because they have a lot of options. Caret can help you simplify the use of various models while maintaining their peak efficiency.
HOW TO CHOOSE FEATURES FOR A MODEL (FEATURE CREATION)
Features are variables that you include in a model and combine to help you predict the outcome you want. The best features are those that capture only the relevant information. The levels of feature creation include:
Turning the raw data that you have into a predictor that you can use. Raw data often takes the form of an image, a text file or a website. It is very difficult to build a model on raw data in those forms until it is summarized into qualitative or quantitative variables. These new variables are called features or covariates: they describe the data as fully as possible while compressing it and making it easier to fit standard machine learning algorithms.
Example: typical email data would be very difficult to use in a prediction model unless it is summarized into features such as the number of times a word occurs, the number of capitalized words, the number of currency signs, and the like.
Transforming tidy features. This means taking features that you have already created and making new features out of them. These could be functions or transformations of a feature that might be useful when building a prediction model. Examples include the average number of times a word occurs, the average number of capitalized words, and so on.
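Both levels can be sketched on a single made-up email. The choices below are assumptions for illustration only: the target word "free", counting a word as capitalized when it is written entirely in upper case, the three currency signs, and the particular ratios and log transform used as derived features.

```python
import math
import re

def raw_to_features(text, word="free"):
    """Level 1: summarise raw email text into a few numeric features."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "word_count": len(words),
        "target_word_count": sum(1 for w in words if w.lower() == word),
        # Assumption: "capitalized" means written entirely in upper case.
        "capitalized_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "currency_signs": sum(text.count(sign) for sign in "$€£"),
    }

def transform_features(features):
    """Level 2: build new features out of the ones already extracted."""
    words = features["word_count"]
    return {
        "capital_fraction": features["capitalized_words"] / words if words else 0.0,
        "currency_per_word": features["currency_signs"] / words if words else 0.0,
        "log_word_count": math.log(words + 1),
    }

email = "WIN a FREE prize! Claim your free $1000 now, only $5 shipping."
features = raw_to_features(email)
derived = transform_features(features)
print(features)
print(derived)
```

The first function throws away most of the raw text and keeps only a handful of numbers, which is the compression described above. The second uses nothing but those numbers, so it can be changed without going back to the raw emails.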