Most classification algorithms find a boundary that separates classes. SVM finds the boundary that is as far away from both classes as possible.
That extra distance is called the margin. And maximizing it is what makes SVM so good, especially when you don't have a lot of data.
It sounds simple. The math underneath is not. But you don't need the math to use it well. You need to understand the ideas.
What You'll Learn Here
- What a hyperplane and margin actually are
- What support vectors are and why only a few points matter
- The C parameter: how to control the margin vs mistakes tradeoff
- The kernel trick: handling non-linear data without changing features
- When SVM works well and when to skip it
- Full working code for classification and when to scale
The Problem With Just Any Boundary
Imagine two groups of points on a 2D graph. Many different lines could separate them.
Logistic regression finds one that separates them. A decision tree finds another. But there are infinitely many valid lines.
SVM asks a different question: which line is the most confident separator?
The answer is the line that sits exactly in the middle of the gap between the two classes, as far from both sides as possible. That gap is the margin.
Class A (circles): o o o
| <- margin boundary
- - - - + - - - - <- decision boundary (hyperplane)
| <- margin boundary
Class B (crosses): x x x
The points closest to the decision boundary are called support vectors. They're the ones that define where the boundary sits. If you removed all other points and kept only the support vectors, the boundary wouldn't change.
That's a powerful property. SVM only cares about the hard cases at the border, not the easy ones far away.
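You can check that property directly. Here's a minimal sketch on synthetic blob data (my own toy example, not the dataset used later): fit a linear SVM, throw away everything except the support vectors, refit, and compare the two boundaries.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two separable clusters of points
X, y = make_blobs(n_samples=200, centers=2, random_state=42)

svm_full = SVC(kernel='linear', C=1.0).fit(X, y)
print(f"Support vectors: {len(svm_full.support_)} of {len(X)} points")

# Refit using only the support vectors
sv_idx = svm_full.support_
svm_sv_only = SVC(kernel='linear', C=1.0).fit(X[sv_idx], y[sv_idx])

# The two hyperplanes should match, up to the solver's numerical tolerance
print("w, b (all points):   ", svm_full.coef_[0], svm_full.intercept_[0])
print("w, b (support only): ", svm_sv_only.coef_[0], svm_sv_only.intercept_[0])
```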
Hyperplanes in Higher Dimensions
In 2D, the decision boundary is a line. In 3D, it's a plane. In higher dimensions, the general term is a hyperplane: a flat surface with one fewer dimension than the space it lives in.
The math works the same way regardless of dimensions:
hyperplane equation: w · x + b = 0
w = weight vector (perpendicular to the hyperplane)
x = input features
b = bias (shifts the hyperplane)
Points where w · x + b > 0 go to one class. Points where w · x + b < 0 go to the other. The margin is the distance between the two parallel hyperplanes w · x + b = +1 and w · x + b = -1.
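For a linear SVM in scikit-learn, you can pull w and b out of the fitted model and verify this by hand. A small sketch on synthetic data (again my own toy example):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
clf = SVC(kernel='linear', C=1.0).fit(X, y)

w = clf.coef_[0]        # weight vector, perpendicular to the hyperplane
b = clf.intercept_[0]   # bias

# For a linear kernel, decision_function(x) is exactly w . x + b
manual_scores = X @ w + b
print(np.allclose(manual_scores, clf.decision_function(X)))  # True

# Width of the margin between w.x+b = +1 and w.x+b = -1
print(f"Margin width: {2 / np.linalg.norm(w):.3f}")
```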
Your First SVM
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# SVM needs scaling - this is non-negotiable
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Linear SVM
svm = SVC(kernel='linear', C=1.0, random_state=42)
svm.fit(X_train_s, y_train)
y_pred = svm.predict(X_test_s)
print(f"SVM (linear) Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print()
print(classification_report(y_test, y_pred, target_names=data.target_names))
Output:
SVM (linear) Accuracy: 0.982
precision recall f1-score support
malignant 0.98 0.98 0.98 42
benign 0.99 0.99 0.99 72
accuracy 0.98 114
Notice we always scale before SVM. Always. Without scaling, features with large ranges dominate the margin calculation completely and the model breaks.
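If you want to see the effect yourself, here's a quick before-and-after sketch using the split from above. Exact numbers will vary by dataset, but the unscaled model is usually slower to train and often less accurate.

```python
# Same model, raw vs. standardized features
svm_raw = SVC(kernel='linear', C=1.0, random_state=42).fit(X_train, y_train)
svm_scaled = SVC(kernel='linear', C=1.0, random_state=42).fit(X_train_s, y_train)

print(f"Unscaled accuracy: {svm_raw.score(X_test, y_test):.3f}")
print(f"Scaled accuracy:   {svm_scaled.score(X_test_s, y_test):.3f}")
```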
The C Parameter: Margin vs Mistakes
Real data is messy. Classes overlap. A perfect margin might not exist.
The C parameter controls how much you penalize the model for making mistakes on training data.
- Small C: allow more mistakes, prefer a wider margin. Simpler boundary. May underfit.
- Large C: allow fewer mistakes, accept a narrower margin. Complex boundary. May overfit.
from sklearn.model_selection import cross_val_score
import numpy as np
print(f"{'C value':<10} {'CV Mean':<10} {'CV Std'}")
print("-" * 32)
for C in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    svm_c = SVC(kernel='linear', C=C, random_state=42)
    scores = cross_val_score(svm_c, X_train_s, y_train, cv=5)
    print(f"{C:<10} {scores.mean():.3f} {scores.std():.3f}")
Output:
C value CV Mean CV Std
--------------------------------
0.001 0.934 0.021
0.01 0.956 0.018
0.1 0.967 0.016
1 0.974 0.013
10 0.974 0.015
100 0.974 0.015
1000 0.967 0.017
CV accuracy plateaus from C=1 through C=100. Very small C underfits, and by C=1000 the score starts to drop as the model overfits. C=1 is a safe default to start with.
The Kernel Trick: Handling Non-Linear Data
Here's the problem. SVM draws a straight hyperplane. But a lot of real data isn't linearly separable.
Look at this case: you have two concentric circles of points. No straight line separates the inner circle from the outer one.
The kernel trick says: instead of drawing a curved boundary in 2D, map the data to a higher dimension where a straight hyperplane does work.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
# Non-linearly separable data
X_circles, y_circles = make_circles(n_samples=300, noise=0.1, factor=0.3, random_state=42)
# Plot the raw data
plt.figure(figsize=(5, 4))
plt.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='bwr', alpha=0.7)
plt.title('Raw Data - Not Linearly Separable')
plt.axis('equal')
plt.show()
# Linear SVM - will fail
scaler = StandardScaler()
X_c_s = scaler.fit_transform(X_circles)
svm_linear = SVC(kernel='linear', C=1.0)
svm_linear.fit(X_c_s, y_circles)
print(f"Linear kernel accuracy: {svm_linear.score(X_c_s, y_circles):.3f}")
# RBF kernel - works
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_rbf.fit(X_c_s, y_circles)
print(f"RBF kernel accuracy: {svm_rbf.score(X_c_s, y_circles):.3f}")
Output:
Linear kernel accuracy: 0.503
RBF kernel accuracy: 0.990
Linear SVM is basically guessing on circular data. RBF SVM gets 99%.
The RBF (Radial Basis Function) kernel behaves as if the data had been mapped into a much higher-dimensional space where the classes become linearly separable. You don't transform the features manually, and the model never builds that space explicitly: the kernel computes, directly from the original features, the similarities that mapping would produce.
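To make "mapping to a higher dimension" concrete, here's a sketch of doing a mapping by hand for the circles data above: add the squared radius x1² + x2² as a third feature, and a plain linear SVM can now separate the classes with a flat plane. The RBF kernel achieves something similar implicitly, without ever constructing the extra feature.

```python
import numpy as np
from sklearn.svm import SVC

# Manual feature map: (x1, x2) -> (x1, x2, x1^2 + x2^2)
radius_sq = (X_circles ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X_circles, radius_sq])

# In the lifted 3D space the inner and outer circles sit at different
# "heights", so a flat hyperplane separates them
svm_lifted = SVC(kernel='linear', C=1.0)
svm_lifted.fit(X_lifted, y_circles)
print(f"Linear kernel on lifted features: {svm_lifted.score(X_lifted, y_circles):.3f}")
```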
The Four Main Kernels
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
X_m, y_m = make_moons(n_samples=300, noise=0.2, random_state=42)
X_m_s = StandardScaler().fit_transform(X_m)
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
print(f"{'Kernel':<10} {'CV Accuracy'}")
print("-" * 25)
for k in kernels:
    svm_k = SVC(kernel=k, C=1.0, random_state=42)
    score = cross_val_score(svm_k, X_m_s, y_m, cv=5).mean()
    print(f"{k:<10} {score:.3f}")
Output:
Kernel CV Accuracy
-------------------------
linear 0.873
poly 0.947
rbf 0.977
sigmoid 0.873
- linear: works when the data is (roughly) linearly separable. Fast and interpretable.
- poly: fits polynomial boundaries; the degree parameter controls complexity (see the quick sketch below).
- rbf: the most flexible. Works for most non-linear problems. Best default choice.
- sigmoid: less commonly used. Behaves similarly to a neural network activation.
When in doubt, start with rbf.
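Since the poly kernel's degree came up in the list above, here's a quick sketch sweeping it on the same moons data (scikit-learn's default is degree=3). Higher degrees allow more curved boundaries and can eventually overfit.

```python
# degree controls how curved the polynomial boundary can be
for degree in [2, 3, 5, 8]:
    svm_p = SVC(kernel='poly', degree=degree, C=1.0, random_state=42)
    score = cross_val_score(svm_p, X_m_s, y_m, cv=5).mean()
    print(f"poly degree={degree}: CV accuracy = {score:.3f}")
```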
The Gamma Parameter for RBF
When using the RBF kernel, gamma controls how far the influence of a single training example reaches.
- Small gamma: far reach. Smooth boundary. May underfit.
- Large gamma: close reach. Complex boundary. May overfit.
print(f"{'Gamma':<12} {'CV Mean':<10} {'CV Std'}")
print("-" * 35)
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    svm_g = SVC(kernel='rbf', C=1.0, gamma=gamma)
    scores = cross_val_score(svm_g, X_train_s, y_train, cv=5)
    print(f"{gamma:<12} {scores.mean():.3f} {scores.std():.3f}")
In practice, use gamma='scale' as your default. It automatically sets gamma based on the number of features and the variance of the data. Much better than guessing.
svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
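For reference, scikit-learn defines 'scale' as 1 / (n_features * X.var()). Since the features here are standardized, that works out to roughly 1 / n_features. A quick sketch of the calculation on the scaled training data:

```python
# gamma='scale' is documented as 1 / (n_features * X.var())
n_features = X_train_s.shape[1]
gamma_scale = 1 / (n_features * X_train_s.var())
print(f"gamma='scale' resolves to about {gamma_scale:.4f}")
```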
Getting Probabilities From SVM
By default, SVM only gives you class labels. If you need probabilities, set probability=True. Internally this fits a Platt-scaling calibration using an extra cross-validation pass, which adds noticeable training time, and the calibrated probabilities can occasionally disagree with the plain predict output.
svm_prob = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True, random_state=42)
svm_prob.fit(X_train_s, y_train)
# Now you can get probabilities
proba = svm_prob.predict_proba(X_test_s)
print("Sample probabilities (malignant, benign):")
for i in range(5):
    print(f" P(malignant)={proba[i][0]:.3f} P(benign)={proba[i][1]:.3f} "
          f"Predicted: {data.target_names[svm_prob.predict(X_test_s)[i]]}")
SVM for Regression: SVR
SVM also works for regression with a slightly different setup. Instead of maximizing a margin between classes, SVR fits a tube of width epsilon around the predictions: points inside the tube are not penalized at all, and only errors that fall outside the tube count.
from sklearn.svm import SVR
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score
import numpy as np
housing = fetch_california_housing()
X_h = housing.data[:2000] # use subset - SVR is slow on large datasets
y_h = housing.target[:2000]
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
X_h, y_h, test_size=0.2, random_state=42
)
scaler_h = StandardScaler()
X_train_hs = scaler_h.fit_transform(X_train_h)
X_test_hs = scaler_h.transform(X_test_h)
svr = SVR(kernel='rbf', C=10, gamma='scale', epsilon=0.1)
svr.fit(X_train_hs, y_train_h)
y_pred_h = svr.predict(X_test_hs)
print(f"SVR R2: {r2_score(y_test_h, y_pred_h):.3f}")
When SVM Shines and When to Skip It
SVM works well when:
- You have a small to medium dataset (under 100k samples)
- You have more features than samples (text classification, genomics)
- The data is nearly linearly separable with some noise
- You need a strong baseline before trying complex models
Skip SVM when:
- Dataset is very large (100k+ samples). SVC training time grows more than quadratically with the number of samples, so it becomes very slow; for linear problems, LinearSVC scales much better (see the sketch after this list).
- You need fast predictions on millions of examples. SVM prediction is slower than trees.
- You need easy feature importance. SVM with RBF kernel is a black box.
- You need to retrain frequently. SVM training doesn't scale to streaming data.
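Picking up the large-dataset point from the list above: LinearSVC only supports a linear boundary, but it uses the liblinear solver and handles far more samples than SVC. A minimal sketch, reusing the breast cancer split from earlier just to show the API:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Scaling still matters; a pipeline keeps the scaler tied to the model
fast_linear = make_pipeline(StandardScaler(), LinearSVC(C=1.0, random_state=42))
fast_linear.fit(X_train, y_train)
print(f"LinearSVC accuracy: {fast_linear.score(X_test, y_test):.3f}")
```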
# Quick benchmark: compare SVM to XGBoost and Random Forest
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import time
models = {
'SVM (rbf)': SVC(kernel='rbf', C=1, gamma='scale', random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42,
eval_metric='logloss', verbosity=0),
}
print(f"{'Model':<18} {'Accuracy':<12} {'Train Time'}")
print("-" * 45)
for name, m in models.items():
    start = time.time()
    m.fit(X_train_s, y_train)
    elapsed = time.time() - start
    acc = accuracy_score(y_test, m.predict(X_test_s))
    print(f"{name:<18} {acc:.3f} {elapsed:.3f}s")
Output:
Model Accuracy Train Time
---------------------------------------------
SVM (rbf) 0.982 0.021s
Random Forest 0.974 0.312s
XGBoost 0.974 0.183s
On this small dataset, SVM is actually fastest and most accurate. On larger datasets, that flips completely.
The Complete Workflow
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target
# 1. Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 2. Scale (always)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# 3. Find best C and gamma with grid search
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 0.01, 0.1],
}
grid = GridSearchCV(
SVC(kernel='rbf', random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid.fit(X_train_s, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")
# 4. Evaluate final model
best_svm = grid.best_estimator_
print(f"\nTest accuracy: {accuracy_score(y_test, best_svm.predict(X_test_s)):.3f}")
print()
print(classification_report(y_test, best_svm.predict(X_test_s),
target_names=data.target_names))
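One refinement worth knowing, sketched below: wrapping the scaler and the SVC in a Pipeline and grid-searching over that means the scaler is re-fit inside every cross-validation fold, so the validation fold never influences the scaling statistics.

```python
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='rbf', random_state=42)),
])

# Pipeline step parameters are addressed as <step_name>__<parameter>
pipe_grid = GridSearchCV(
    pipe,
    {'svc__C': [0.1, 1, 10, 100], 'svc__gamma': ['scale', 0.01, 0.1]},
    cv=5, scoring='accuracy', n_jobs=-1,
)
pipe_grid.fit(X_train, y_train)  # raw features: the pipeline handles scaling
print(f"Best params: {pipe_grid.best_params_}")
print(f"Test accuracy: {pipe_grid.score(X_test, y_test):.3f}")
```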
Quick Cheat Sheet
| Task | Code |
|---|---|
| Linear SVM | SVC(kernel='linear', C=1.0) |
| RBF SVM (default choice) | SVC(kernel='rbf', C=1.0, gamma='scale') |
| Get probabilities | SVC(probability=True) then .predict_proba() |
| Regression | SVR(kernel='rbf', C=1.0, gamma='scale') |
| Scale features | always use StandardScaler before SVM |
| Tune C and gamma | GridSearchCV with param_grid |
| Speed up grid search | n_jobs=-1 in GridSearchCV |
| Large datasets | use LinearSVC instead of SVC(kernel='linear') |
Practice Challenges
Level 1:
Train an SVM with kernel='rbf' on load_digits() (handwritten digits, 10 classes). Scale the data. Print accuracy. SVM handles this surprisingly well.
Level 2:
On make_moons(noise=0.3), compare linear, poly, and rbf kernels. Plot the decision boundaries for each. Which one fits the moon shape correctly?
Level 3:
Use GridSearchCV to tune both C and gamma on the breast cancer dataset. Plot a heatmap of CV accuracy for different C and gamma combinations. Which region of the grid gives the best results?
Next up, Post 61: K-Nearest Neighbors: Judge by Your Company. The laziest algorithm in ML stores all training data and classifies by similarity. Simple, no training phase, surprisingly effective.