Problem with Non-Linear Data in SVMs
Support Vector Machines are powerful classifiers when dealing with linearly separable data. For a set of data points, if there exists a hyperplane that can separate points of different classes, SVM identifies the maximal margin hyperplane. However, not all data is linearly separable. In many real-world problems, the classes are often non-linearly separable, meaning a straight line or plane cannot separate them accurately.
Let's create a synthetic dataset where the classes are arranged in concentric circles:
```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

# 100 points arranged in two concentric circles
# (factor = ratio of inner to outer radius, noise = Gaussian jitter)
X, y = make_circles(n_samples=100, factor=0.1, noise=0.1)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='bwr')
plt.title('Concentric Circles Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Class')
plt.grid(True)
plt.show()
```
As seen in the plot, the data points form two concentric circles. A straight line (or plane) cannot effectively separate these two classes.
Solution Using Projection in Higher Dimensions
In many non-linear classification problems, the data may not be separable by a linear hyperplane in its original feature space. By projecting the data into a higher-dimensional space, we can often make it linearly separable.
For example, consider data arranged in concentric circles. In 2D space, these circles cannot be separated by a straight line. However, if we map this data into 3D, the two rings end up at different heights and can be separated by a plane.
Illustration:
Imagine the 2D dataset of concentric circles above: two classes arranged in concentric circles.
Now, if we project this data into 3D, each point $x = (x_1, x_2)$ from the 2D space is mapped to a new point $\phi(x)$ in the 3D space, where $\phi$ is the function that defines the transformation. For instance, if we use a transformation function that maps

$$\phi(x_1, x_2) = (x_1,\; x_2,\; x_1^2 + x_2^2)$$

we obtain a third coordinate equal to the squared distance of each point from the origin. In this 3D space, a horizontal plane can be used to separate the classes.
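As a minimal sketch of this idea (using the illustrative lift $\phi(x_1, x_2) = (x_1, x_2, x_1^2 + x_2^2)$ assumed above, not a transformation the SVM itself would compute), we can apply the mapping explicitly and plot the result:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

# Explicit 3D lift: third coordinate is the squared distance from the origin
X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=42)
r2 = X[:, 0] ** 2 + X[:, 1] ** 2

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
ax.scatter(X[:, 0], X[:, 1], r2, c=y, s=50, cmap='bwr')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_zlabel('x1^2 + x2^2')
ax.set_title('Concentric circles lifted into 3D')
plt.show()
```

In the resulting plot the inner circle sits near the bottom and the outer circle near the top, so a flat plane between the two heights separates the classes.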
Problem with the Above Solution:
- Computational Cost: Mapping data to higher dimensions requires additional computational resources. The complexity of training a model on high-dimensional data grows with the dimensionality, making the training process more time-consuming and resource-intensive.
- Curse of Dimensionality: As the dimensionality increases, the volume of the space grows exponentially. This "curse of dimensionality" can lead to sparse data, making it harder for the model to find meaningful patterns and increasing the risk of overfitting.
- Model complexity: In high-dimensional spaces, models might become overly complex, capturing noise rather than the underlying structure of the data. This can lead to poor generalization on unseen data.
Need for a solution: We require a method that allows us to benefit from higher-dimensional mappings without the associated computational burden.
Kernel Functions: Measuring Similarity
A **kernel function** $K(x_i, x_j)$ computes the inner product of two data points in the transformed feature space without explicitly performing the transformation $\phi$:

$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$$
Interpretation:
- Similarity Measure: The kernel quantifies the similarity between data points in the feature space.
- Implicit Mapping: We can operate in the high-dimensional space indirectly.
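To make the implicit-mapping idea concrete, here is a small sketch using the homogeneous degree-2 polynomial kernel as an assumed example (the explicit feature map `phi_2` is written out only for illustration; in practice it is never computed):

```python
import numpy as np

def phi_2(x):
    # Explicit feature map for K(x, z) = (x . z)^2 in 2D: (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    # Kernel trick: same value as the feature-space inner product,
    # computed directly in the original 2D space.
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi_2(x), phi_2(z)))   # inner product after explicit mapping
print(poly2_kernel(x, z))           # same number, no mapping needed
```

Both print the same value, which is exactly what lets the SVM work in the feature space without ever constructing it.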
Understanding the Primal and Dual Form of SVM
The SVM optimization problem can be formulated in both primal and dual forms, as discussed earlier:
Primal Problem:
For a given training dataset $\{(x_i, y_i)\}_{i=1}^{n}$ with labels $y_i \in \{-1, +1\}$, the primal optimization problem seeks the weight vector $w$ and bias $b$ that minimize the following objective:

$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^\top x_i + b) \geq 1, \; i = 1, \dots, n$$
This formulation aims to maximize the margin between the two classes while ensuring that all the data points are correctly classified.
Dual Problem:
To solve the primal problem more efficiently, we introduce Lagrange multipliers $\alpha_i \geq 0$ and formulate the dual problem:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, (x_i^\top x_j) \quad \text{subject to} \quad \alpha_i \geq 0, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0$$

In this dual formulation, the optimization depends only on the inner products $x_i^\top x_j$ between the data points, not on the data points themselves.
Observations:
- The dual problem involves only the inner products of the input vectors.
- The solution can be expressed in terms of these inner products.
Motivation for Kernels
Given that the dual SVM formulation relies solely on inner products, we can think of these inner products as measures of similarity between data points. The inner product quantifies how similar two vectors are in the input space.
Idea
- Enhance the similarity measure: Use a function that better captures the similarity between data points in a way that can handle non-linear relationships, as made explicit after this list.
- Define a Kernel Function: Instead of using the standard inner product, introduce a function K(x_i, x_j) that computes the similarity between the data points.
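Concretely, replacing the inner product $x_i^\top x_j$ in the dual with a kernel gives

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, K(x_i, x_j)$$

and the resulting classifier becomes

$$f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i y_i \, K(x_i, x) + b \right)$$

so both training and prediction only ever need kernel evaluations, never the mapped vectors $\phi(x)$.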
Defining Kernel Functions:
A kernel function is a function

$$K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$$

that computes a generalized inner product between two input vectors:

$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$$

where:
- $\phi$ is a mapping from the input space to a (possibly higher-dimensional) feature space.
- $\langle \cdot, \cdot \rangle$ denotes the inner product in the feature space.

Key Points:
- Implicit Mapping: We do not need to know the explicit form of $\phi$; the kernel value is all the algorithm requires.
- Similarity Measure: The kernel function acts as a measure of similarity in the feature space.
Mathematics behind this
Mercer's Theorem:
A symmetric function $K(x, z)$ can be expressed as a kernel (an inner product in some feature space) if and only if the kernel matrix $\mathbf{K}$, defined by $\mathbf{K}_{ij} = K(x_i, x_j)$, is positive semi-definite for any finite set of vectors $\{x_i\}$.
Implications:
- Validity: Not all functions can serve as kernels. Mercer's Theorem ensures that the chosen kernel corresponds to an inner product in some feature space.
- Optimization: Using a valid kernel guarantees that the optimization problem remains convex and solvable.
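As a quick sanity check of the positive semi-definite condition, the sketch below builds a kernel (Gram) matrix for a handful of random points and inspects its eigenvalues. It uses `rbf_kernel` from `sklearn.metrics.pairwise`; the `1e-10` tolerance is an arbitrary allowance for floating-point error.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))        # 10 random 2D points

K = rbf_kernel(X, X, gamma=1.0)     # Gram matrix K_ij = K(x_i, x_j)
eigenvalues = np.linalg.eigvalsh(K) # eigenvalues of the symmetric matrix

# Mercer's condition: all eigenvalues should be >= 0 (up to numerical error)
print(eigenvalues.min() >= -1e-10)
```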
Common Kernel functions
Several kernel functions can be used with SVM to handle non-linear data.
1. Polynomial Kernel
The polynomial kernel allows the SVM to fit polynomial decision boundaries. It is defined as:

$$K(x_i, x_j) = (\gamma \, x_i^\top x_j + r)^d$$

Parameters:
- $\gamma$: Scale parameter.
- $r$: Coefficient (offset) term.
- $d$: Degree of the polynomial.
2. Radial Basis Function (RBF) Kernel
The RBF kernel is one of the most popular kernels for non-linear classification. It measures the similarity between two points based on their distance:

$$K(x_i, x_j) = \exp\!\left( -\gamma \, \|x_i - x_j\|^2 \right)$$

Parameters:
- $\gamma$: Determines the spread (width) of the kernel.
3. Sigmoid Kernel
Inspired by neural networks, the sigmoid kernel is defined as:

$$K(x_i, x_j) = \tanh(\gamma \, x_i^\top x_j + r)$$

Parameters:
- $\gamma$: Scale parameter.
- $r$: Coefficient (offset) term.
4. Linear Kernel
The linear kernel is simply the standard inner product; it is suitable for linearly separable data:

$$K(x_i, x_j) = x_i^\top x_j$$
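All of these kernels are available as pairwise functions in scikit-learn. The following sketch evaluates each of them on two example vectors; the specific `gamma`, `coef0`, and `degree` values are arbitrary illustrations, not recommendations.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel
)

x = np.array([[1.0, 2.0]])   # pairwise functions expect 2D arrays
z = np.array([[3.0, -1.0]])

print("linear:    ", linear_kernel(x, z)[0, 0])                                # x^T z
print("polynomial:", polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0)[0, 0])
print("rbf:       ", rbf_kernel(x, z, gamma=0.5)[0, 0])
print("sigmoid:   ", sigmoid_kernel(x, z, gamma=0.1, coef0=0.0)[0, 0])
```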
Applying Kernel SVM to the Concentric Circles Dataset
Let's apply an SVM with different kernels to our synthetic dataset and visualize the decision boundaries.
```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from mlxtend.plotting import plot_decision_regions

X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=42)

kernels = ['linear', 'poly', 'rbf']
degree = 3

plt.figure(figsize=(18, 5))
for i, kernel in enumerate(kernels):
    # The polynomial kernel additionally needs a degree
    if kernel == 'poly':
        clf = SVC(kernel=kernel, degree=degree, gamma='auto')
    else:
        clf = SVC(kernel=kernel, gamma='auto')
    clf.fit(X, y)

    plt.subplot(1, 3, i + 1)
    plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(f'SVM with {kernel.capitalize()} Kernel')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

plt.tight_layout()
plt.show()
```
Advantages of Using Kernel Functions
- Handling Non-linearity: Kernels enable SVMs to perform well on non-linearly separable data by implicitly mapping them into higher-dimensional spaces where linear separation is possible.
- Computational Efficiency: Thanks to the kernel trick, SVMs can operate in high-dimensional spaces without explicitly computing the transformed features, reducing computational overhead.
Considerations and Best Practices
- Choice of Kernel: The choice of kernel and its parameters significantly impact the model's performance. It's essential to experiment with different kernels and perform hyperparameter tuning to find the optimal settings.
- Overfitting: High-degree polynomial kernels or very small bandwidth values (equivalently, very large $\gamma$) in the RBF kernel can lead to overfitting. The regularization parameter C should be adjusted to balance the trade-off between margin maximization and classification error.
- Interpretability: Models using non-linear kernels are often less interpretable compared to linear models. Ensure that model interpretability aligns with your application requirements.
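One possible way to put the kernel and regularization choices on an empirical footing is a small cross-validated grid search over kernel, C, and gamma; the grid values below are arbitrary examples, not recommendations.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=42)

# Candidate kernels and hyperparameters (illustrative values only)
param_grid = {
    'kernel': ['linear', 'poly', 'rbf'],
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.1, 1],
}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # best kernel / C / gamma on this dataset
print(search.best_score_)    # mean cross-validated accuracy of the best model
```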
Conclusion
Support Vector Machines, when combined with kernel functions, become a versatile tool capable of handling complex, non-linearly separable datasets like concentric circles. By leveraging the kernel trick, SVMs can implicitly operate in higher-dimensional feature spaces, enabling them to find optimal decision boundaries without the computational burden of explicit feature transformation.