Lance Galletti
Support Vector Machines From Scratch

In this article, you will learn how to implement a simple algorithm for training an SVM from scratch.

[image: perceptron algorithm]

TL;DR: Support Vector Machines

The goal is to find the widest street that separates the classes. The street is defined by 3 lines: the decision boundary $w^T x + b = 0$ in the middle, and the two edges of the street, $w^T x + b = 1$ and $w^T x + b = -1$.

In this example we have two classes (blue = +1 and green = -1). The line in red is the decision boundary. To classify an unknown point $u$ using this SVM:

  • IF $w^T u + b \geq 0$ THEN blue
  • IF $w^T u + b < 0$ THEN green
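As a quick sketch of this rule (assuming numpy and some already-learned w and b), classification is just the sign of $w^T u + b$:

import numpy as np

def classify(w, b, u):
    # +1 (blue) if w·u + b >= 0, otherwise -1 (green)
    return 1 if np.dot(w, u) + b >= 0 else -1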

The width of the street is

$$width = \frac{2}{\lVert w \rVert}$$

That means the width of the street is inversely proportional to the magnitude of w.

For example, dividing w and b by 4 keeps the same decision boundary but makes the street 4 times wider.

So finding the widest street means finding the smallest w (in magnitude) that still forms a street separating the classes.
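A quick numerical sketch of this relationship (the values of w and b below are purely illustrative):

import numpy as np

w = np.array([3.0, 4.0])       # ||w|| = 5
b = 2.0
print(2 / np.linalg.norm(w))   # width = 2/5 = 0.4

# dividing w and b by 4 never changes the sign of w·x + b,
# so the decision boundary stays put, but the street becomes 4 times wider
w, b = w / 4, b / 4
print(2 / np.linalg.norm(w))   # width = 1.6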

Expanding / Retracting Rate

Notice that multiplying w and b by the same constant c doesn’t change the decision boundary but does change the width of the street. If:

  • $0 < c < 1$: the width will expand
  • $c > 1$: the width will retract

Assuming we have a linearly separable dataset, we would impose the constraint that no data points lie in the street. This means that if we find a line that separates our classes but some points lie in the street, we should make the street more narrow. And if we find a street that separates our classes and none of them are in the street, we should expand the street.

Ideally, some special points (called support vectors) would exactly lie on the edge of the street.
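As a rough check (the tolerance below is an assumption, not from the article), a point lies on the edge of the street when $y_i (w \cdot x_i + b)$ is approximately 1:

import numpy as np

def is_support_vector(w, b, x_i, y_i, tol=1e-2):
    # support vectors satisfy y_i * (w·x_i + b) ≈ 1, i.e. they sit on the street's edge
    return abs(y_i * (np.dot(w, x_i) + b) - 1) < tol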

Perceptron Algorithm

If the current w and b misclassify a randomly chosen point $x_i$ (with label $y_i$) from our dataset, we can move the line toward $x_i$ by some amount lr as follows:

$$w_{new} = w_{old} + y_i \cdot lr \cdot x_i \quad \text{and} \quad b_{new} = b_{old} + y_i \cdot lr$$

If that point is classified correctly but lies in the street, we might want to retract the street. If the point lies outside the street or was misclassified, we might want to expand it.

Putting it all together

We start with a random w and b (note: the initial values of w and b have a large impact on which line is learned through this process).
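The training loop below assumes numpy is imported and that a dataset X (points) with labels Y (+1 or -1) is already defined, along with the random initial w and b. One possible setup (the toy data here is purely illustrative, not from the article) is:

import numpy as np

np.random.seed(0)

# toy linearly separable dataset: two blobs labeled +1 and -1
X = np.vstack([np.random.randn(20, 2) + 3, np.random.randn(20, 2) - 3])
Y = np.array([1] * 20 + [-1] * 20)

# random initial line
w = np.random.rand(2)
b = np.random.rand()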

epochs = 100
lr = .05
expanding_rate = .99
retracting_rate = 1.01
for _ in range(epochs):
    # pick a point from X at random
    i = np.random.randint(0, len(X))
    x, y = X[i], Y[i]
    ypred = w[0] * x[0] + w[1] * x[1] + b
    if (ypred > 0 and y > 0) or (ypred < 0 and y < 0):
        # classified correctly
        if ypred < 1 and ypred > -1:
            # in the street / street is too wide
            w = w + x * y * lr * retracting_rate
            b = b + y * lr * retracting_rate
        else:
            # street is too narrow
            w = w * expanding_rate
            b = b * expanding_rate
    else:
        # misclassified
        w = w + x * y * lr * expanding_rate
        b = b + y * lr * expanding_rate

Circled in red are the points that were misclassified. Circled in yellow are the points that were correctly classified.

[image: perceptron algorithm]

What if the data is not linearly separable?

Looking at the problem statement again: we want to maximize the width of the street subject to the constraint that no points in our dataset lie in the street. Mathematically, we want to find:

$$\max\left(\frac{2}{\lVert w \rVert}\right) = \min\left(\lVert w \rVert\right) = \min\left(\frac{1}{2} \lVert w \rVert^2\right)$$

subject to:

$$y_i (w \cdot x_i + b) - 1 \geq 0$$

with equality for the points that lie exactly on the edge of the street (the support vectors).

Introducing Lagrange multipliers $\alpha_i$, this equates to minimizing:

$$L = \frac{1}{2} \lVert w \rVert^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$$

Taking the derivative wrt w:

$$\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0$$

Which means:

$$w = \sum_i \alpha_i y_i x_i$$

So, folding the label $y_i$ into $\alpha_i$ (which is what the code below does, since each update adds $y_i \cdot lr$ to $\alpha_i$), the decision boundary can be re-written as:

$$\sum_i \alpha_i \langle x_i, x \rangle + b = 0$$

Updating w means updating α.

$$\alpha_{i, new} = \alpha_{i, old} + y_i \cdot lr \quad \text{and} \quad b_{new} = b_{old} + y_i \cdot lr$$

So, we can re-write the above algorithm as such:

epochs = 100
lr = .05
expanding_rate = .99
retracting_rate = 1.01

# one dual coefficient per training point plus a bias
# (zero initial values are illustrative; the article does not specify them)
alpha_i = np.zeros(len(X))
b = 0.0

def predict(alpha_i, b, x):
    # w·x expressed through the alphas: sum_j alpha_j * <x_j, x>
    wx = 0
    for j in range(len(X)):
        wx += alpha_i[j] * np.dot(X[j], x)
    return wx + b
for _ in range(epochs):
    # pick a point from X at random
    i = np.random.randint(0, len(X))
    x, y = X[i], Y[i]
    ypred = predict(alpha_i, b, x)
    if (ypred > 0 and y > 0) or (ypred < 0 and y < 0):
        # classified correctly
        if ypred < 1 and ypred > -1:
            # in the street / street is too wide
            alpha_i[i] += y * lr
            alpha_i = alpha_i * retracting_rate           
            b += y * lr * retracting_rate
        else:
            # street is too narrow
            alpha_i = alpha_i * expanding_rate
            b *= expanding_rate
    else:
        # misclassified
        alpha_i[i] += y * lr
        alpha_i = alpha_i * expanding_rate
        b += y * lr * expanding_rate
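Once trained, a new point can be classified with the same predict function by taking the sign of its output (the query point below is just an example value):

u = np.array([1.0, 2.0])   # some unknown point
print("blue" if predict(alpha_i, b, u) >= 0 else "green")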

Kernel Trick

To find a non-linear decision boundary, we can use the "kernel trick". Instead of defining an explicit mapping to a new feature space where the dataset is linearly separable, we only need to define what the inner product in that transformed space is. This kernel function defines that inner product and can replace the dot product in the predict function:

$$\sum_i \alpha_i K(x_i, x) + b = 0$$

For example, we can use the polynomial kernel:

# kernel hyperparameters (example values; a degree-2 polynomial is enough for XOR)
C, N = 1, 2

def polynomial(x_i, x_j, c, n):
    # inner product in the implicit polynomial feature space
    return (np.dot(x_i, x_j) + c) ** n

def predict(alpha_i, b, x):
    wx = 0
    for j in range(len(X)):
        wx += alpha_i[j] * polynomial(X[j], x, C, N)
    return wx + b

[image: polynomial kernel to solve XOR]
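For reference, the classic XOR layout that a degree-2 polynomial kernel can separate looks like this (a toy illustration, not the exact data behind the figure):

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([-1, 1, 1, -1])
# running the same training loop with the polynomial predict above
# yields a non-linear boundary separating the two classes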

Or we can use the RBF kernel:

[image: RBF kernel]
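A minimal sketch of that kernel (the bandwidth parameter gamma is an assumption; the article does not specify one):

def rbf(x_i, x_j, gamma=1.0):
    # Gaussian similarity: near 1 for close points, near 0 for distant ones
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

def predict(alpha_i, b, x):
    wx = 0
    for j in range(len(X)):
        wx += alpha_i[j] * rbf(X[j], x)
    return wx + b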

Conclusion

While other methods can solve this problem more effectively, I hope this simple algorithm helped you better understand SVMs!
