Akhilesh
60. Support Vector Machines: Drawing the Perfect Boundary

Most classification algorithms find a boundary that separates classes. SVM finds the boundary that is as far away from both classes as possible.

That extra distance is called the margin. And maximizing it is what makes SVM so good, especially when you don't have a lot of data.

It sounds simple. The math underneath is not. But you don't need the math to use it well. You need to understand the ideas.


What You'll Learn Here

  • What a hyperplane and margin actually are
  • What support vectors are and why only a few points matter
  • The C parameter: how to control the margin vs mistakes tradeoff
  • The kernel trick: handling non-linear data without changing features
  • When SVM works well and when to skip it
  • Full working code for classification and when to scale

The Problem With Just Any Boundary

Imagine two groups of points on a 2D graph. Many different lines could separate them.

Logistic regression finds a line that separates them. Decision trees find one too. But there are infinitely many valid lines.

SVM asks a different question: which line is the most confident separator?

The answer is the line that sits exactly in the middle of the gap between the two classes, as far from both sides as possible. That gap is the margin.

Class A (circles):   o    o    o

                  . . . . . . . .  <- margin boundary
                  - - - + - - - -  <- decision boundary (hyperplane)
                  . . . . . . . .  <- margin boundary

Class B (crosses):   x    x    x

The points closest to the decision boundary are called support vectors. They're the ones that define where the boundary sits. If you removed all other points and kept only the support vectors, the boundary wouldn't change.

That's a powerful property. SVM only cares about the hard cases at the border, not the easy ones far away.
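You can see this property directly in scikit-learn: a fitted SVC exposes its support vectors, and refitting on only those points barely changes the predictions. A quick sketch on synthetic blobs (make_blobs is my choice here, not from the post):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of points
X, y = make_blobs(n_samples=200, centers=2, random_state=42)

svm = SVC(kernel='linear', C=1.0).fit(X, y)

# Only a handful of border points define the boundary
print(f"Training points: {len(X)}")
print(f"Support vectors: {len(svm.support_vectors_)}")

# Refit on the support vectors alone - predictions stay essentially the same
svm_sv = SVC(kernel='linear', C=1.0).fit(X[svm.support_], y[svm.support_])
agreement = (svm.predict(X) == svm_sv.predict(X)).mean()
print(f"Prediction agreement: {agreement:.2%}")
```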


Hyperplanes in Higher Dimensions

In 2D, the decision boundary is a line. In 3D, it's a plane. In 100 dimensions, it's called a hyperplane.

The math works the same way regardless of dimensions:

hyperplane equation: w · x + b = 0

w = weight vector (perpendicular to the hyperplane)
x = input features
b = bias (shifts the hyperplane)

Points where w · x + b > 0 go to one class. Points where w · x + b < 0 go to the other. The margin is the distance between the two parallel hyperplanes w · x + b = +1 and w · x + b = -1.
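For a linear SVM, scikit-learn exposes w and b as coef_ and intercept_, so you can check the formula by hand. A small sketch on synthetic data (make_classification is my choice here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=42)
svm = SVC(kernel='linear', C=1.0).fit(X, y)

w = svm.coef_[0]       # weight vector, perpendicular to the hyperplane
b = svm.intercept_[0]  # bias term

# w . x + b computed by hand matches decision_function
manual = X @ w + b
print(np.allclose(manual, svm.decision_function(X)))  # True

# The sign of w . x + b decides the class
preds = svm.predict(X)
print(((manual > 0).astype(int) == preds).all())  # True
```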


Your First SVM

from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# SVM needs scaling - this is non-negotiable
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# Linear SVM
svm = SVC(kernel='linear', C=1.0, random_state=42)
svm.fit(X_train_s, y_train)

y_pred = svm.predict(X_test_s)
print(f"SVM (linear) Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print()
print(classification_report(y_test, y_pred, target_names=data.target_names))

Output:

SVM (linear) Accuracy: 0.982

              precision    recall  f1-score   support

   malignant       0.98      0.98      0.98        42
      benign       0.99      0.99      0.99        72

    accuracy                           0.98       114

Notice we always scale before SVM. Always. Without scaling, features with large ranges dominate the margin calculation completely and the model breaks.
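To see the effect for yourself, here's a self-contained sketch that reloads the same dataset and trains the same SVM with and without scaling (exact numbers will vary, but scaling should come out ahead):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Unscaled: wide-range features like area dominate the distance calculations
raw = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X_tr, y_tr)
print(f"Unscaled accuracy: {raw.score(X_te, y_te):.3f}")

# Scaled: every feature contributes on the same footing
scaler = StandardScaler().fit(X_tr)
scaled = SVC(kernel='rbf', C=1.0, gamma='scale').fit(scaler.transform(X_tr), y_tr)
print(f"Scaled accuracy:   {scaled.score(scaler.transform(X_te), y_te):.3f}")
```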


The C Parameter: Margin vs Mistakes

Real data is messy. Classes overlap. A perfect margin might not exist.

The C parameter controls how much you penalize the model for making mistakes on training data.

  • Small C: allow more mistakes, prefer a wider margin. Simpler boundary. May underfit.
  • Large C: allow fewer mistakes, accept a narrower margin. Complex boundary. May overfit.

from sklearn.model_selection import cross_val_score
import numpy as np

print(f"{'C value':<10} {'CV Mean':<10} {'CV Std'}")
print("-" * 32)

for C in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    svm_c = SVC(kernel='linear', C=C, random_state=42)
    scores = cross_val_score(svm_c, X_train_s, y_train, cv=5)
    print(f"{C:<10} {scores.mean():.3f}      {scores.std():.3f}")

Output:

C value    CV Mean    CV Std
--------------------------------
0.001      0.934      0.021
0.01       0.956      0.018
0.1        0.967      0.016
1          0.974      0.013
10         0.974      0.015
100        0.974      0.015
1000       0.967      0.017

CV accuracy peaks around C=1 to C=10. Very small C underfits. Very large C starts to overfit. C=1 is a safe default to start with.


The Kernel Trick: Handling Non-Linear Data

Here's the problem. SVM draws a straight hyperplane. But a lot of real data isn't linearly separable.

Look at this case: you have two concentric circles of points. No straight line separates the inner circle from the outer one.

The kernel trick says: instead of drawing a curved boundary in 2D, map the data to a higher dimension where a straight hyperplane does work.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Non-linearly separable data
X_circles, y_circles = make_circles(n_samples=300, noise=0.1, factor=0.3, random_state=42)

# Plot the raw data
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='bwr', alpha=0.7)
plt.title('Raw Data - Not Linearly Separable')
plt.axis('equal')

# Linear SVM - will fail
scaler = StandardScaler()
X_c_s = scaler.fit_transform(X_circles)

svm_linear = SVC(kernel='linear', C=1.0)
svm_linear.fit(X_c_s, y_circles)
print(f"Linear kernel accuracy: {svm_linear.score(X_c_s, y_circles):.3f}")

# RBF kernel - works
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_rbf.fit(X_c_s, y_circles)
print(f"RBF kernel accuracy:    {svm_rbf.score(X_c_s, y_circles):.3f}")

Output:

Linear kernel accuracy: 0.503
RBF kernel accuracy:    0.990

Linear SVM is basically guessing on circular data. RBF SVM gets 99%.

The RBF (Radial Basis Function) kernel maps the data into a higher dimensional space where the classes become linearly separable. You don't manually transform features. The kernel does it internally during training.
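You can verify the idea behind the trick by doing one mapping manually: add a third feature z = x1² + x2² (squared distance from the origin) to the circles data, and a plain linear SVM separates it. A sketch using the same make_circles settings as above:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.3, random_state=42)

# Lift 2D points into 3D: z = x1^2 + x2^2
z = (X ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X, z])

# In 3D the inner circle sits below the outer ring, so a flat plane separates them
svm_3d = SVC(kernel='linear', C=1.0).fit(X_lifted, y)
print(f"Linear SVM on lifted data: {svm_3d.score(X_lifted, y):.3f}")
```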


The Four Main Kernels

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score

X_m, y_m = make_moons(n_samples=300, noise=0.2, random_state=42)
X_m_s = StandardScaler().fit_transform(X_m)

kernels = ['linear', 'poly', 'rbf', 'sigmoid']

print(f"{'Kernel':<10} {'CV Accuracy'}")
print("-" * 25)

for k in kernels:
    svm_k = SVC(kernel=k, C=1.0, random_state=42)
    score = cross_val_score(svm_k, X_m_s, y_m, cv=5).mean()
    print(f"{k:<10} {score:.3f}")

Output:

Kernel     CV Accuracy
-------------------------
linear     0.873
poly       0.947
rbf        0.977
sigmoid    0.873

  • linear: works when data is linearly separable. Fast. Interpretable.
  • poly: fits polynomial boundaries. The degree parameter controls complexity.
  • rbf: most flexible. Works for most non-linear problems. Best default choice.
  • sigmoid: less commonly used. Similar to a neural network activation.

When in doubt, start with rbf.


The Gamma Parameter for RBF

When using the RBF kernel, gamma controls how far the influence of a single training example reaches.

  • Small gamma: far reach. Smooth boundary. May underfit.
  • Large gamma: close reach. Complex boundary. May overfit.

print(f"{'Gamma':<12} {'CV Mean':<10} {'CV Std'}")
print("-" * 35)

for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    svm_g = SVC(kernel='rbf', C=1.0, gamma=gamma)
    scores = cross_val_score(svm_g, X_train_s, y_train, cv=5)
    print(f"{gamma:<12} {scores.mean():.3f}      {scores.std():.3f}")

In practice, use gamma='scale' as your default. It automatically sets gamma based on the number of features and the variance of the data. Much better than guessing.

svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
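For reference, 'scale' resolves to 1 / (n_features × X.var()), per the scikit-learn documentation. You can compute it yourself (a sketch on the breast cancer features):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_s = StandardScaler().fit_transform(X)

# gamma='scale' == 1 / (n_features * variance of the training data)
gamma_scale = 1.0 / (X_s.shape[1] * X_s.var())
print(f"Effective gamma: {gamma_scale:.4f}")
# After StandardScaler, X.var() is ~1, so this is roughly 1 / n_features
```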

Getting Probabilities From SVM

By default, SVM only gives you class labels. If you need probabilities, set probability=True. It uses Platt scaling internally, which adds a bit of training time.

svm_prob = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True, random_state=42)
svm_prob.fit(X_train_s, y_train)

# Now you can get probabilities
proba = svm_prob.predict_proba(X_test_s)
preds = svm_prob.predict(X_test_s)  # predict once, outside the loop
print("Sample probabilities (malignant, benign):")
for i in range(5):
    print(f"  P(malignant)={proba[i][0]:.3f}  P(benign)={proba[i][1]:.3f}  "
          f"Predicted: {data.target_names[preds[i]]}")

SVM for Regression: SVR

SVM also works for regression with a slightly different setup. Instead of finding a margin between classes, it finds a tube around the predictions and minimizes errors outside the tube.

from sklearn.svm import SVR
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score
import numpy as np

housing = fetch_california_housing()
X_h = housing.data[:2000]  # use subset - SVR is slow on large datasets
y_h = housing.target[:2000]

X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)

scaler_h = StandardScaler()
X_train_hs = scaler_h.fit_transform(X_train_h)
X_test_hs  = scaler_h.transform(X_test_h)

svr = SVR(kernel='rbf', C=10, gamma='scale', epsilon=0.1)
svr.fit(X_train_hs, y_train_h)

y_pred_h = svr.predict(X_test_hs)
print(f"SVR R2: {r2_score(y_test_h, y_pred_h):.3f}")
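The epsilon parameter sets the tube's width: errors inside the tube cost nothing, so a wider tube leaves fewer points as support vectors. A quick sketch on a synthetic noisy sine wave (my own toy data, not the housing set):

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic 1D regression: sin(x) plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=300)

# Wider tube -> more points fall inside it cost-free -> fewer support vectors
counts = []
for eps in [0.01, 0.1, 0.5]:
    svr = SVR(kernel='rbf', C=10, epsilon=eps).fit(X, y)
    counts.append(len(svr.support_))
    print(f"epsilon={eps:<5} support vectors: {counts[-1]} / {len(X)}")
```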

When SVM Shines and When to Skip It

SVM works well when:

  • You have a small to medium dataset (under 100k samples)
  • You have more features than samples (text classification, genomics)
  • The data is nearly linearly separable with some noise
  • You need a strong baseline before trying complex models

Skip SVM when:

  • Dataset is very large (100k+ samples). SVM scales poorly, training becomes very slow.
  • You need fast predictions on millions of examples. SVM prediction is slower than trees.
  • You need easy feature importance. SVM with RBF kernel is a black box.
  • You need to retrain frequently. SVM training doesn't scale to streaming data.

# Quick benchmark: compare SVM to XGBoost and Random Forest
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import time

models = {
    'SVM (rbf)':      SVC(kernel='rbf', C=1, gamma='scale', random_state=42),
    'Random Forest':  RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'XGBoost':        xgb.XGBClassifier(n_estimators=100, random_state=42,
                                         eval_metric='logloss', verbosity=0),
}

print(f"{'Model':<18} {'Accuracy':<12} {'Train Time'}")
print("-" * 45)

for name, m in models.items():
    start = time.time()
    m.fit(X_train_s, y_train)
    elapsed = time.time() - start

    acc = accuracy_score(y_test, m.predict(X_test_s))
    print(f"{name:<18} {acc:.3f}        {elapsed:.3f}s")

Output:

Model              Accuracy     Train Time
---------------------------------------------
SVM (rbf)          0.982        0.021s
Random Forest      0.974        0.312s
XGBoost            0.974        0.183s

On this small dataset, SVM is actually fastest and most accurate. On larger datasets, that flips completely.
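You can watch the flip coming by timing SVC on growing synthetic datasets. A rough sketch (absolute times depend on your machine, but kernel SVM training grows superlinearly with sample count):

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

times = {}
for n in [1000, 4000]:
    X, y = make_classification(n_samples=n, n_features=20, random_state=42)
    start = time.time()
    SVC(kernel='rbf', C=1, gamma='scale').fit(X, y)
    times[n] = time.time() - start
    print(f"n={n:<6} fit time: {times[n]:.3f}s")
```

Quadrupling the samples typically multiplies training time by far more than four, which is exactly why the 100k+ advice above exists.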


The Complete Workflow

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# 1. Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Scale (always)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# 3. Find best C and gamma with grid search
param_grid = {
    'C':     [0.1, 1, 10, 100],
    'gamma': ['scale', 0.01, 0.1],
}

grid = GridSearchCV(
    SVC(kernel='rbf', random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid.fit(X_train_s, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")

# 4. Evaluate final model
best_svm = grid.best_estimator_
print(f"\nTest accuracy: {accuracy_score(y_test, best_svm.predict(X_test_s)):.3f}")
print()
print(classification_report(y_test, best_svm.predict(X_test_s),
                              target_names=data.target_names))

Quick Cheat Sheet

Task                       Code
------------------------------------------------------------------
Linear SVM                 SVC(kernel='linear', C=1.0)
RBF SVM (default choice)   SVC(kernel='rbf', C=1.0, gamma='scale')
Get probabilities          SVC(probability=True) then .predict_proba()
Regression                 SVR(kernel='rbf', C=1.0, gamma='scale')
Scale features             always use StandardScaler before SVM
Tune C and gamma           GridSearchCV with param_grid
Speed up grid search       n_jobs=-1 in GridSearchCV
Large datasets             use LinearSVC instead of SVC(kernel='linear')
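That last row deserves a sketch: LinearSVC uses the liblinear solver and handles far larger datasets than SVC(kernel='linear'). The sizes below are illustrative, not a benchmark:

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# A dataset this size would already be slow for kernel-based SVC
X, y = make_classification(n_samples=20000, n_features=50, random_state=42)
X_s = StandardScaler().fit_transform(X)

# dual=False is the faster formulation when samples far outnumber features
lin = LinearSVC(C=1.0, dual=False, max_iter=5000, random_state=42).fit(X_s, y)
print(f"LinearSVC accuracy: {lin.score(X_s, y):.3f}")
```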

Practice Challenges

Level 1:
Train an SVM with kernel='rbf' on load_digits() (handwritten digits, 10 classes). Scale the data. Print accuracy. SVM handles this surprisingly well.

Level 2:
On make_moons(noise=0.3), compare linear, poly, and rbf kernels. Plot the decision boundaries for each. Which one fits the moon shape correctly?

Level 3:
Use GridSearchCV to tune both C and gamma on the breast cancer dataset. Plot a heatmap of CV accuracy for different C and gamma combinations. Which region of the grid gives the best results?


Next up, Post 61: K-Nearest Neighbors: Judge by Your Company. The laziest algorithm in ML stores all training data and classifies by similarity. Simple, no training phase, surprisingly effective.
