Hey, Josiah here. Today I’ll be discussing SVMs (Support Vector Machines), a supervised machine learning method commonly used for classification and regression analysis. First off, we must discuss how an SVM works from a theoretical standpoint.
The three lines seen (H1, H2, and H3) are known as “Hyperplanes”, which divide our data into classes. To define a hyperplane, one must follow a simple procedure:
Draw a line between the two classes such that the closest data points from each class (known as the support vectors) are equidistant from the line. Thus, we have defined a hyperplane.
However, it soon becomes apparent that an infinite number of hyperplanes can be derived. The obvious question arises: “How do we determine which hyperplane to use?” The answer: we use the hyperplane that maximizes the distance to the support vectors. The hyperplane that maximizes this gap between the two classes is known as the maximum-margin hyperplane, and it is the ideal hyperplane for predicting the class of future data points.
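To make this concrete, here is a minimal sketch of the idea using Scikit-learn. The toy 2D data below is my own illustration (not from the article); fitting an SVC with a linear kernel finds the maximum-margin hyperplane, and the points that pin down the margin are exposed as `support_vectors_`:

```python
import numpy as np
from sklearn import svm

# Two small, linearly separable classes in 2D (toy data for illustration)
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel fits the maximum-margin hyperplane directly
clf = svm.SVC(kernel="linear")
clf.fit(X, y)

# The points closest to the separating line define the margin
print(clf.support_vectors_)
```

Points on either side of the gap are then classified by which side of the learned hyperplane they fall on.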
Now that we have discussed the basic theoretical concepts of SVM classification, let us get into a simple example involving the breast-cancer dataset from Scikit-learn. (Relatively trivial code snippets will be captioned)
from sklearn import datasets, metrics, model_selection, svm
cn = datasets.load_breast_cancer() # load the data set
x = cn.data # set x to the entries
y = cn.target # set y to the labels
Here, we set two variables, x and y, equal to our data’s entries and labels, respectively. We will need these in the next snippet to split the data further.
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.2)
We use Scikit-learn’s model_selection.train_test_split method to take our entries and labels and split them into training and testing subsets. By setting test_size to 0.2, we allocate 20% of our data for testing purposes and the other 80% for training purposes. Note that it’s not recommended to set test_size greater than 0.3.
# Create an SVC
clf = svm.SVC()
# Train it on the training set
clf.fit(x_train, y_train)
Here, we create the Support Vector Machine as an SVC (Support Vector Classifier) and assign it to the variable clf. Next, we train the model on the training entries and labels (the 80% of the dataset left over after the test_size split).
# Predict the labels for the test data
y_predict = clf.predict(x_test)
# Compare the predicted labels to the actual labels and convert to a percentage to output
acc = metrics.accuracy_score(y_test, y_predict)*100
print(acc)
We set a variable y_predict equal to the results of the classifier predicting the labels of the testing entries. We then use the sklearn.metrics.accuracy_score() method to compare the actual testing labels to y_predict, the machine’s attempt at classifying them. Finally, we convert this to a percentage by multiplying by 100 and print the result. The accuracy you should get is ~96%, which is very good!
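Putting the snippets together, the full example can be sketched as the script below. The random_state argument is my addition (it is not in the snippets above); it pins the shuffle so the accuracy is reproducible from run to run:

```python
from sklearn import datasets, metrics, model_selection, svm

# Load the breast-cancer data set
cn = datasets.load_breast_cancer()
x, y = cn.data, cn.target

# 80/20 train/test split; random_state pins the shuffle (my addition,
# for reproducibility only)
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Create and train the support vector classifier
clf = svm.SVC()
clf.fit(x_train, y_train)

# Predict the test labels and score the result as a percentage
y_predict = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_predict) * 100
print(acc)
```

Exact accuracy will vary slightly with the split, but it should land in the mid-90s.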
If you enjoy articles like this and find it useful or have any feedback, message me on Discord @Necloremius#6412