petercour

Party time! Classify data with Support Vector Machine

Classifying data is a common task in machine learning. The support vector machine (SVM) is one of the models that can do that.

What is classification?

Given new data, decide which class it belongs to.
Like the Garfield below:

That would be a classifier that doesn't use data. Programmers try to be smarter and use historical data in their algorithms.

Data?

The SVM trains on data, like every ML algorithm. This data is labeled: for every input there is an output.

A note on terminology: scientists call this 'supervised learning'. Programmers just call it labeled data.

All your bytes are belong to us

So how do you load this data? Where can you find it?
The machine learning module sklearn comes with some data sets, like the iris data set. It has four feature columns; take only the first two so the result can be plotted in 2D.

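``````#!/usr/bin/python3
from sklearn import datasets

# load the iris data set that ships with sklearn
iris = datasets.load_iris()

# we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
X = iris.data[:, :2]
Y = iris.target
``````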
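To see what was actually loaded, here is a small sketch that prints the shape of the data and a few input/label pairs. It only assumes that sklearn is installed; the variable names follow the snippet above.

```python
from sklearn import datasets

# Load the bundled iris data set (150 flowers, 4 measurements each).
iris = datasets.load_iris()

X = iris.data[:, :2]   # keep only the first two feature columns
Y = iris.target        # one integer class label per row of X

print(X.shape)                   # (150, 2)
print(iris.feature_names[:2])    # names of the two columns we kept
print(sorted(set(Y.tolist())))   # the distinct class labels

# Each row of X is paired with the label at the same index in Y.
for features, label in zip(X[:3], Y[:3]):
    print(features, "->", iris.target_names[label])
```

The loop at the end is the "for every input there is an output" idea in practice: row i of X goes with entry i of Y.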

Then apply the SVM and plot the output. Without going into too much detail: the ML algorithm is created with svm.SVC() and trained with the fit() function. The variables X and Y are the data we loaded before.

Remember we said "for every input there is an output". So each entry of Y is the output (the class label) for the matching input row of X, both taken from the data set loaded above.

``````#!/usr/bin/python3
from sklearn import svm
# my_kernel is the custom kernel function defined in the full code below
clf = svm.SVC(kernel=my_kernel)
clf.fit(X, Y)
``````
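What does `kernel=my_kernel` do? As a rough sketch: SVC calls the function with two arrays of samples and expects one similarity value per pair of rows back. The snippet below uses the `my_kernel` function from the full code in this post; the two points `a` and `b` are made-up values, only there so the result can be checked by hand.

```python
import numpy as np
from sklearn import svm, datasets

def my_kernel(A, B):
    """Weighted dot product: k(a, b) = 2*a[0]*b[0] + 1*a[1]*b[1]."""
    M = np.array([[2, 0], [0, 1.0]])
    return np.dot(np.dot(A, M), B.T)

# Two hand-picked points (hypothetical values, only for illustration).
a = np.array([[1.0, 2.0]])
b = np.array([[3.0, 4.0]])
print(my_kernel(a, b))   # [[14.]] because 2*1*3 + 1*2*4 = 14

iris = datasets.load_iris()
X = iris.data[:, :2]
Y = iris.target

# The kernel returns one value per pair of rows.
gram = my_kernel(X, X)
print(gram.shape)        # (150, 150)

# Passing the function to SVC makes the model use it during fit().
clf = svm.SVC(kernel=my_kernel)
clf.fit(X, Y)
print(clf.classes_)      # the labels the model learned
```

So this particular kernel is a plain dot product in which the first feature counts double.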

So the algorithm is trained with X and Y. Its only goal is to make predictions: which group does new data fit into?

``````# predict() returns the class for each input row
clf.predict(X[:3])
``````
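Here is a minimal sketch of asking the trained model about one new flower. The measurements in `new_flower` are made up for illustration and are not part of the data set; the rest follows the code in this post.

```python
import numpy as np
from sklearn import svm, datasets

iris = datasets.load_iris()
X = iris.data[:, :2]
Y = iris.target

def my_kernel(A, B):
    # same custom kernel as in the full code
    M = np.array([[2, 0], [0, 1.0]])
    return np.dot(np.dot(A, M), B.T)

clf = svm.SVC(kernel=my_kernel)
clf.fit(X, Y)

# A made-up flower: sepal length 5.0 cm, sepal width 3.5 cm (hypothetical).
new_flower = np.array([[5.0, 3.5]])

# predict() expects a 2D array (one row per sample) and returns one label per row.
predicted = clf.predict(new_flower)
print(predicted, iris.target_names[predicted[0]])
```

Note the double brackets: even a single sample has to be passed as a row in a 2D array.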

Run the code

That's all; the rest is detail. Full code:

``````#!/usr/bin/python3
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()

# we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
X = iris.data[:, :2]
Y = iris.target

def my_kernel(X, Y):
    """
    We create a custom kernel:

                 (2  0)
    k(X, Y) = X  (    ) Y.T
                 (0  1)
    """
    M = np.array([[2, 0], [0, 1.0]])
    return np.dot(np.dot(X, M), Y.T)

h = .02  # step size in the mesh

# we create an instance of SVM and fit our data.
clf = svm.SVC(kernel=my_kernel)
clf.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired, edgecolors='k')
plt.title('3-Class classification using Support Vector Machine with custom kernel')
plt.axis('tight')
plt.show()
``````
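The plotting part of the full code is the least obvious bit, so here is a tiny sketch of the mesh trick on its own. It uses a coarse 3 x 2 grid instead of the fine `h = .02` mesh, and a stand-in array `labels` in place of the real `clf.predict()` result; both are assumptions made to keep the example small.

```python
import numpy as np

# A coarse 3 x 2 grid instead of the fine h = .02 mesh (values are illustrative).
xx, yy = np.meshgrid(np.arange(0, 3, 1), np.arange(0, 2, 1))
print(xx.shape)       # (2, 3): rows follow y, columns follow x

# ravel() flattens each grid; np.c_ pairs them up as (x, y) rows,
# which is the 2D input shape that clf.predict() expects.
points = np.c_[xx.ravel(), yy.ravel()]
print(points.shape)   # (6, 2)
print(points[:3])     # first grid row: (0, 0), (1, 0), (2, 0)

# predict() returns one label per point; reshape puts the labels back on the grid.
labels = np.arange(len(points))   # stand-in for clf.predict(points)
Z = labels.reshape(xx.shape)
print(Z.shape)        # (2, 3)
```

With the real classifier, `Z` holds a class label for every grid point, and `plt.pcolormesh(xx, yy, Z)` colors each point by its class. That is what draws the decision regions behind the training points.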