DEV Community


A foray into SVM w/ Scikit-Learn

Hey, Josiah here. Today I’ll be discussing Support Vector Machines (SVMs), a family of supervised machine learning models commonly used for classification and regression analysis. First off, we need to look at how an SVM works from a theoretical standpoint.

Figure: A simple example of an SVM classifying data
Suppose some given data points each belong to one of two classes (“Black” and “White”), and the goal is to decide which class a new data point will be in.

The three lines seen (H1, H2, and H3) are known as “hyperplanes”: candidate decision boundaries that divide our data into classes. In two dimensions, a hyperplane is simply a line. To define a separating hyperplane, we can follow a simple procedure:

Draw a line between the two classes such that the closest data points from each class (known as the support vectors) are equidistant from the line. That line is a hyperplane separating the two classes.

However, it soon becomes apparent that an infinite number of such hyperplanes can be drawn. The obvious question arises: “How do we determine which hyperplane to use?” The answer is that we use the hyperplane which maximizes the distance to the support vectors. The gap between the two classes is known as the margin, and the hyperplane that maximizes it is known as the maximum-margin hyperplane. It is the ideal hyperplane for predicting the class of future data points.
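To make the margin idea a bit more concrete, here is a minimal sketch on a tiny hand-made 2-D data set. The points are hypothetical (they are not from the figure), and using a linear kernel with a very large C is my assumption for approximating a hard-margin SVM. For a linear SVM the margin width works out to 2 / ||w||, where w is the normal vector of the hyperplane.

```python
import numpy as np
from sklearn import svm

# Toy, hand-made 2-D points (hypothetical data, not taken from the article)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],   # class 0
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel with a very large C approximates a hard-margin SVM
clf = svm.SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```

The points listed in support_vectors_ are the ones that sit on the edge of the margin; moving any other point (without crossing the margin) would leave the hyperplane unchanged.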

Now that we have discussed the basic theoretical concepts of SVM classification, let us get into a simple example involving the breast-cancer dataset from Scikit-learn. (Relatively trivial code snippets will be captioned)

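import sklearn.datasets
import sklearn.model_selection
from sklearn import metrics
from sklearn import svm

cn = sklearn.datasets.load_breast_cancer()  # load the data set
x = cn.data    # set x to the entries (feature values)
y = cn.target  # set y to the labels (0 = malignant, 1 = benign)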

Here, we set two variables x and y equal to our data’s entries and labels, respectively. We will need this in the next snippet to separate the data further.
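If you want to see what the entries and labels actually look like, a quick inspection along these lines can help. This is just a sketch that uses attributes of the object returned by load_breast_cancer; the data set has 569 samples with 30 features each.

```python
from sklearn import datasets

cn = datasets.load_breast_cancer()
x, y = cn.data, cn.target

print(x.shape)               # (number of samples, number of features)
print(y.shape)               # one label per sample
print(cn.target_names)       # class names for labels 0 and 1
print(cn.feature_names[:3])  # the first few feature names
```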

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
We use Scikit-learn’s model_selection.train_test_split function to take our entries and labels and split them into training and testing subsets. By setting test_size to 0.2, we allocate 20% of our data for testing and the other 80% for training. The split is shuffled randomly, so the exact rows in each subset change from run to run. As a rule of thumb, test_size is usually kept at or below about 0.3 so that most of the data remains available for training.
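Because the split is random, results can differ between runs. One way to make it repeatable is sketched below; the random_state and stratify arguments are my additions (the snippet above does not use them), and 42 is an arbitrary seed. stratify=y keeps the class proportions roughly the same in both subsets.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

x, y = datasets.load_breast_cancer(return_X_y=True)

# random_state fixes the shuffle; stratify=y preserves the class balance
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

print(len(x_train), len(x_test))  # sizes of the training and testing subsets
```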

# Create an SVC
clf = svm.SVC()

# Train the classifier on the training set
clf.fit(x_train, y_train)

Here, we create a Support Vector Machine using the SVC (Support Vector Classifier) class and assign it to the variable clf. Next, we train the model on the training entries and labels (80% of the dataset, since test_size reserved the other 20% for testing).
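Once the classifier is trained, you can peek at what it learned and tie it back to the theory. The sketch below is self-contained (it reloads and splits the data, with an arbitrary random_state of 0 that I added for repeatability) and prints the kernel in use plus the support vectors the model kept. Note that SVC defaults to an RBF kernel rather than a straight line, so the boundary it learns is not necessarily linear.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

x, y = datasets.load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0  # arbitrary seed, for repeatability
)

clf = svm.SVC()
clf.fit(x_train, y_train)

print(clf.kernel)                  # kernel used by default
print(clf.n_support_)              # number of support vectors per class
print(clf.support_vectors_.shape)  # (total support vectors, features)
```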

# Predict the labels for the testing data
y_predict = clf.predict(x_test)

# Compare the predicted labels to the actual labels and convert to a percentage to output
acc = metrics.accuracy_score(y_test, y_predict)*100
print(f"{acc:.2f}% accuracy")

We set a variable y_predict equal to the results of the classifier predicting the labels of the testing entries. We then use the sklearn.metrics.accuracy_score() function to compare the actual testing labels with y_predict, which was the machine’s attempt at classifying them. We then convert this to a percentage by multiplying by 100, and print the result. The accuracy you get should be around 96%, which is very good, though the exact figure will vary from run to run because the train/test split is random.
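Under the hood, accuracy is simply the fraction of predictions that match the true labels. Here is a small sketch with made-up labels (hypothetical values, not output from the model above) that compares a manual calculation with accuracy_score.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration: 4 of the 5 predictions are correct
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

manual = np.mean(y_true == y_pred)                # fraction of matching labels
library = metrics.accuracy_score(y_true, y_pred)  # same calculation via scikit-learn

print(manual, library)  # both are 0.8
```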

If you enjoy articles like this and find them useful, or have any feedback, message me on Discord @Necloremius#6412

Top comments (1)

nikhartz

First, thank you for the good explanation of this topic!

I would be very grateful if you could answer the following question.

How do I create a dataset out of images? In my particular case, a camera watching the sky should recognize clouds, not by detecting them with bounding boxes but by detecting their shape.

Is there any tool to "mark" the clouds and create from that a dataset which can be used by classifiers?

Best Regards