Beginning with Machine Learning (6 Part Series)
In the last few posts, we covered the basics of machine learning and regression techniques. Here we will discuss classification, which, like regression, is a supervised learning problem. But instead of predicting continuous values, we predict discrete values, i.e. we categorize objects. The output variable here is a "category" such as red or blue, yes or no. Take an example where we need to classify emails as spam or not spam. A set of features is used to classify the emails, and each email is assigned to one of the categories depending on the values of those features. Once the model is trained on seen/labeled data, it predicts the labels for unseen data.
Let's get started with the types of classification algorithms 😃
In SVM, or Support Vector Machines, we differentiate between the categories by separating the classes with an optimal hyperplane. The optimal hyperplane is the one with the maximum margin.
In the figure, multiple hyperplanes could have separated the classes, but the optimal one has the maximum margin, as shown above. The points closest to the hyperplane, which define its margin, are called support vectors. Even if all the other points in the dataset are removed (except the support vectors), the position of the hyperplane doesn't change, because its distance from the support vectors stays the same. To summarize, in SVM we want an optimal hyperplane, with the maximum margin, that separates the classes.
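As a minimal sketch of this idea, assuming scikit-learn is available, we can fit a linear SVM on a small made-up dataset and inspect the support vectors that define the margin:

```python
# Fit a linear SVM on two toy, linearly separable clusters and look at
# the support vectors (the points closest to the separating hyperplane).
import numpy as np
from sklearn.svm import SVC

# Toy data, invented for illustration
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)           # only these points define the hyperplane
print(clf.predict([[3, 2], [7, 6]]))  # classify two new points
```

Removing any non-support-vector point from `X` and refitting would leave the learned hyperplane unchanged, which is exactly the property described above.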
Often the data is not linearly separable, and defining a linear hyperplane is not possible. In this figure, the data points belong to two classes: an inner ring and an outer ring. As we can see, these classes are not linearly separable, i.e. we cannot define a straight line to separate them. However, an elliptical or circular "hyperplane" can easily separate the two classes. The features x and y can be used to create a new feature z, defined as z = sqrt(x² + y²), in which the classes become linearly separable.
The kernel function is applied to each data instance to map the original non-linear observations into a higher-dimensional space in which they become separable. These functions can be of different types, for example linear, polynomial, radial basis function (RBF), and sigmoid.
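A short sketch of the kernel trick in action, assuming scikit-learn: the concentric-rings data described above defeats a linear SVM, but an RBF-kernel SVM separates it easily.

```python
# Compare a linear-kernel and an RBF-kernel SVM on two concentric rings.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Inner ring vs outer ring, as in the figure above
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

print("linear accuracy:", linear_svm.score(X, y))  # poor: no straight line works
print("rbf accuracy:", rbf_svm.score(X, y))        # near perfect
```

The RBF kernel implicitly performs a mapping similar to the z = sqrt(x² + y²) feature described above, without us having to construct the new feature by hand.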
Decision trees have a flowchart-like structure. They classify data by identifying a splitting node at each step.
The split is decided by information gain: the attribute with the maximum information gain is chosen as the splitting node. The higher the information gain, the lower the entropy after the split. Entropy measures the impurity (lack of homogeneity) of the data; a set's entropy is zero when it contains instances of only one class.
The steps to build a decision tree classifier are briefly described below:
- Calculate the entropy of the target variable.
- The dataset is then split on the different attributes. The entropy of each branch is calculated and summed in proportion to the branch sizes, giving the total entropy of the split. This is subtracted from the entropy before the split; the result is the information gain, i.e. the decrease in entropy.
- The attribute with maximum information gain is selected as splitting attribute and the process is repeated on every branch.
- A branch with entropy 0 represents the leaf node. Branch with entropy more than 0 requires further splitting.
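The entropy and information-gain calculations in the steps above can be sketched in plain Python (the tiny dataset and the `outlook` attribute are made up for illustration):

```python
# Entropy of a set of labels, and the information gain of splitting
# a dataset on one attribute.
from collections import Counter
import math

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    base = entropy(labels)                      # entropy before the split
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels)
                  if row[attribute] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder                     # decrease in entropy

rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))  # 1.0: a perfect split
```

Each branch of this split is pure (entropy 0), so it would become a leaf node, matching the stopping rule above.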
Decision trees are very prone to overfitting. To fit the training data perfectly, splitting is sometimes carried to an extreme, which causes the classifier to lose its generalization capability and perform poorly on the test dataset (unseen data).
To handle this overfitting, one of the approaches used is Pruning. It can start either at roots or leaves and involves cutting off some branches of decision trees, thus reducing complexity.
Reduced error pruning: Starting at the leaves, each node is replaced with its most popular class. This change is kept if it doesn’t deteriorate accuracy.
Cost complexity pruning: Here the overall goal is to minimize the cost-complexity function. A sequence of subtrees is created, where T0 is the whole tree and Tn is the tree consisting of only the root node. At step i, the tree is created by removing a subtree from tree i−1 and replacing it with a leaf node. At each step, the subtree chosen for removal is the one that minimizes the increase in the cost-complexity function per pruned leaf, and is hence the weakest link of the tree.
Handling of missing values is also built into the decision tree algorithm itself. While finding a candidate for the split, missing values produce no information gain, so they can safely be ignored. Now consider an example where gender is the splitting node, with two possible values, male and female. If an instance has a missing value in the gender column, how do we decide which branch it belongs to (interesting, right? :p)? This situation is handled by sending the instance with the missing value to all the child nodes, but with a diminished weight. If there are 10 instances with 'male', 30 instances with 'female' and 5 instances with a missing value, then all 5 missing-value instances are sent to both child nodes, with their weights multiplied by 10/40 for the 'male' node and 30/40 for the 'female' node.
Similarly, when at prediction time we encounter a node in the decision tree that tests a variable A, and the instance has a missing value for A, all the possibilities are explored.
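The fractional-weight idea from the gender example above can be sketched in a few lines of Python (the counts are the ones from the example):

```python
# Instances with a missing value on the split attribute are sent to every
# child branch, weighted by the observed proportion of each branch value.
counts = {"male": 10, "female": 30}          # known values at this node
known_total = sum(counts.values())           # 40
missing_instances = 5

weights = {branch: n / known_total for branch, n in counts.items()}
for branch, w in weights.items():
    print(f"{branch}: {missing_instances} instances with weight {w}")
# 'male' branch gets weight 10/40 = 0.25, 'female' branch 30/40 = 0.75
```

Downstream entropy calculations then count each forwarded instance by its fractional weight rather than as a whole instance.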
Logistic Regression is one of the most widely used classification algorithms. Don't be confused by the term regression: it is a classification method in which the probability of the default class is predicted. Consider again the example where we need to classify an email as spam or not spam (the default class being spam). It is a binary classification problem. Logistic regression predicts the probability that the email is spam, a value ranging between 0 and 1. If the value is close to 0, the probability of the email being spam is low, and the email is labeled not-spam. A threshold value determines the class: if the probability is below the threshold, the email is identified as not-spam, otherwise as spam (since spam is the default class).
The value predicted by the linear part of logistic regression can be anywhere between -infinity and +infinity, so a sigmoid/logistic function is applied to squash it into [0, 1].
The logistic regression model can be written as ln(p(X) / (1 − p(X))) = b0 + b1·X, where the right-hand side is a linear equation (similar to linear regression) with b0 and b1 as the coefficients and X as the input feature. It is these coefficients that are learned during the training process. The left-hand side is the log of the odds of the default class: odds are the ratio of the probability of the event to the probability of not the event, and p(X) is the probability of the default class.
The coefficients are estimated by maximum likelihood estimation, an optimization procedure that finds the coefficient values best supported by the training data. The best coefficients result in a model that predicts a value very close to 1 (e.g. spam) for the default class and a value very close to 0 (e.g. not-spam) for the other class.
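A minimal sketch of the whole pipeline, assuming scikit-learn: the learned coefficients feed the sigmoid, which squashes the linear score into a probability, and a 0.5 threshold picks the class. The 1-D toy data is made up for illustration.

```python
# Logistic regression on toy 1-D data: small feature values -> class 0,
# large values -> class 1 (the "default" class, e.g. spam).
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes (-inf, inf) into (0, 1)

X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
b0, b1 = model.intercept_[0], model.coef_[0][0]   # learned coefficients

p = sigmoid(b0 + b1 * 9.0)            # probability of the default class at x = 9
print(p, "spam" if p >= 0.5 else "not-spam")
```

Changing the 0.5 threshold trades off false positives against false negatives without retraining the model.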
Points to remember while applying logistic regression:
- Outliers and misclassified data should be removed from the training set.
- Remove features that are highly correlated, as they would result in overfitting of the model.
Naive Bayes is a classification algorithm based on Bayes' Theorem. It makes two main assumptions: i) each feature is independent of the other features, and ii) each feature is given the same weight/importance.
By P(A|B) we mean the probability of event A given that event B is true; Bayes' Theorem states that P(A|B) = P(B|A)·P(A) / P(B). Some popular naive Bayes classification algorithms are:
- Gaussian Naive Bayes
- Multinomial Naive Bayes
- Bernoulli Naive Bayes
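A minimal sketch of the first variant listed above, assuming scikit-learn: Gaussian Naive Bayes treats each feature as independent and normally distributed within each class. The two clusters are toy data invented for illustration.

```python
# Gaussian Naive Bayes on two toy clusters.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],    # class 0 cluster
              [7.0, 8.0], [7.5, 8.5], [6.8, 7.9]])   # class 1 cluster
y = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB().fit(X, y)
print(nb.predict([[1.3, 2.1], [7.2, 8.1]]))   # one query point near each cluster
print(nb.predict_proba([[1.3, 2.1]]))         # per-class posterior probabilities
```

`MultinomialNB` and `BernoulliNB` from the same module implement the other two variants, suited to count features and binary features respectively.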
This classification algorithm builds an ensemble of decision trees, which helps give a more accurate prediction. Each decision tree casts a vote for the class of the input, and the majority class is assigned. Because multiple decision trees with different subsets of features are involved, the overfitting problem can be avoided.
The steps for building random forest classifier are:
- Select k random features, where k < m (m being the total number of features).
- Build a decision tree on these features by finding the best split node. Repeat steps 1 and 2 to create n decision tree classifiers.
- To predict an input variable, take votes from each decision tree and assign the class with maximum votes.
Random forests usually perform well on all kinds of classification problems, and they can handle missing features and categorical features.
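The steps above can be sketched with scikit-learn's implementation (a minimal example on the standard Iris dataset):

```python
# Random forest: an ensemble of decision trees, each trained on random
# subsets of samples and features, votes on the class of each input.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# n_estimators = n (number of trees); max_features = k random features per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```

The `predict` method aggregates the votes of all 100 trees internally, so the majority-vote step above happens automatically.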
KNN stands for K nearest neighbors. Here K is the number of nearest neighbors considered when classifying an input. The K nearest neighbors are identified, and the majority class among them is assigned to the input.
In the figure, while identifying the class of the small circle with K = 3, the 3 nearest points are taken into consideration. As red cross marks are in the majority, the small circle is assigned the red-cross class.
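The majority-vote idea from the figure can be sketched with scikit-learn (the two clusters are toy data standing in for the cross and circle classes):

```python
# KNN with K = 3: the three nearest neighbours of the query point vote,
# and the majority class wins.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1],       # class 0 cluster
              [6, 6], [6, 7], [7, 6]])      # class 1 cluster ("red crosses")
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[6.5, 6.5]]))  # all 3 nearest neighbours belong to class 1
```

Note that KNN does no training in the usual sense; it simply stores the data and computes distances at prediction time, so the choice of K directly controls how smooth the decision boundary is.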
Phew! If you managed to read the entire article and made it to this point, you are doing fantastic 😄. This post was a brief overview of classification algorithms; to understand them better, you will certainly want to dig into the math behind them. If you have any feedback or points to add, feel free to comment. The next post will be a classification project similar to what we did for regression. Stay tuned for more! 😸