DEV Community: Harsimranjit Singh

Enhancing Support Vector Machines for Non-Linear Classification with Kernel Functions

Harsimranjit Singh — Sat, 21 Sep 2024 00:43:32 +0000

Problem with Non-Linear Data in SVMs

Support Vector Machines are powerful classifiers when the data is linearly separable. For a set of data points, if there exists a hyperplane that can separate points of different classes, an SVM identifies the maximal-margin hyperplane. However, not all data is linearly separable. In many real-world problems the classes are non-linearly separable, meaning a straight line or plane cannot separate them accurately.

Let's create a synthetic dataset where the classes are arranged in concentric circles:

import matplotlib.pyplot as plt
from sklearn.datasets import make_circles 

X, y = make_circles(100, factor=.1, noise=.1)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='bwr')
plt.title('Concentric Circles Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Class')
plt.grid(True)
plt.show()

As seen in the plot, the data points form two concentric circles. A straight line (or plane) cannot effectively separate these two classes.

Solution Using Projection in Higher Dimensions

In many non-linear classification problems, the data may not be separable by a linear hyperplane in its original feature space. By projecting the data into a higher-dimensional space, we can often make it linearly separable.
For example, consider data in concentric circles. In 2D space, these circles cannot be separated by a straight line. However, if we map this data into 3D with a suitable function, the inner circle and the outer circle end up at different heights, so a plane can separate them.

Illustration:

Consider the 2D dataset plotted above, in which the two classes are arranged in concentric circles.
Now, if we project this data into 3D, each point $(x_{1}, x_{2})$ from the 2D space is mapped to a new point $(x_{1}, x_{2}, ϕ(x_{1}, x_{2}))$ in the 3D space, where $ϕ$ is a function that defines the transformation. For instance, if we use the transformation that maps $(x_{1}, x_{2})$ to $(x_{1}, x_{2}, x_{1}^{2} + x_{2}^{2})$, the third coordinate is the squared distance from the origin, so the inner circle sits low and the outer circle sits high.

In this 3D space, a plane can be used to separate the classes.
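The effect of this mapping can be checked numerically. The following is a minimal sketch, assuming the same kind of `make_circles` data as above (the `random_state` and the variable names are choices made here for illustration): it compares a linear SVM on the original 2D features with a linear SVM on the lifted 3D features.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Same kind of data as above; random_state is an assumption added for reproducibility.
X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=42)

# Explicit feature map: (x1, x2) -> (x1, x2, x1^2 + x2^2)
z = (X ** 2).sum(axis=1)
X_3d = np.column_stack([X, z])

# A linear SVM on the 2D features versus a linear SVM on the lifted 3D features
acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)
acc_3d = SVC(kernel="linear").fit(X_3d, y).score(X_3d, y)

print(f"Linear SVM, original 2D features: training accuracy = {acc_2d:.2f}")
print(f"Linear SVM, lifted 3D features:   training accuracy = {acc_3d:.2f}")
```

The linear model on the lifted features should reach a much higher training accuracy, because a horizontal plane at a suitable height splits the two rings.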

Problem with the Above Solution:

Computational Cost: Mapping data to higher dimensions requires additional computational resources. The complexity of training a model on high-dimensional data grows with the dimensionality, making the training process more time-consuming and resource-intensive.
Curse of Dimensionality: As the dimensionality increases, the volume of the space grows exponentially. This "curse of dimensionality" can lead to sparse data, making it harder for the model to find meaningful patterns and increasing the risk of overfitting.
Model complexity: In high-dimensional spaces, models might become overly complex, capturing noise rather than the underlying structure of the data. This can lead to poor generalization on unseen data.

Need for a solution: We require a method that allows us to benefit from higher-dimensional mappings without the associated computational burden.

Kernel Functions: Measuring Similarity

A kernel function $K(x, z)$ computes the inner product of two data points in the transformed feature space without explicitly performing the transformation $ϕ(x)$:
$K(x, z) = ⟨ϕ(x), ϕ(z)⟩$

Interpretation:

Similarity Measure: The kernel quantifies the similarity between data points in the feature space.
Implicit Mapping: We can operate in the high-dimensional space indirectly.
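A small numerical check makes the idea concrete. This sketch uses the homogeneous degree-2 polynomial kernel $K(x, z) = (x \cdot z)^{2}$ and its explicit feature map as an illustrative example (this particular kernel and the function names are assumptions chosen here, not taken from the article):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (illustrative choice)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # inner product after mapping to 3D
implicit = poly2_kernel(x, z)      # same quantity, computed in the original 2D space

print(explicit, implicit)
```

Both values equal 1 up to floating-point rounding, even though the kernel never constructs the 3D vectors.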

Understanding the Primal and Dual Form of SVM

The SVM optimization problem can be formulated in both primal and dual forms as we discussed it earlier:

Primal Problem:

For a given training dataset $\{(x_{i}, y_{i})\}_{i=1}^{n}$, where $x_{i} \in \mathbb{R}^{d}$ and $y_{i} \in \{-1, 1\}$, the primal optimization problem seeks the weight vector $w$ and bias $b$ that minimize the following objective function:

$\min_{w, b} \frac{1}{2} \|w\|^{2} \quad \text{subject to} \quad y_{i} (w^{⊤} x_{i} + b) \geq 1, \ \forall i$
This formulation aims to maximize the margin between the two classes while ensuring that all the data points are correctly classified.

Dual Problem:

To solve the primal problem more efficiently, we introduce Lagrange multipliers $α_{i}$ and formulate the dual problem:

$\max_{α} \sum_{i=1}^{n} α_{i} - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} α_{i} α_{j} y_{i} y_{j} ⟨x_{i}, x_{j}⟩ \quad \text{subject to} \quad α_{i} \geq 0 \ \forall i, \quad \sum_{i=1}^{n} α_{i} y_{i} = 0$

In this dual formulation, the optimization depends only on the inner products $⟨x_{i}, x_{j}⟩$ between the data points, not on the data points themselves.

Observations:

The dual problem involves only the inner products of the input vectors.
The solution can be expressed in terms of these inner products.
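These observations can be verified on a fitted model. The sketch below assumes scikit-learn's `SVC`, whose `dual_coef_` attribute holds the products $α_{i} y_{i}$ for the support vectors, and rebuilds the decision function $f(x) = \sum_{i} α_{i} y_{i} ⟨x_{i}, x⟩ + b$ from inner products alone. The dataset is just a convenient example.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=42)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i, for the support vectors only
alpha_times_y = clf.dual_coef_[0]
support_vectors = clf.support_vectors_

# Plain inner products between every support vector and every data point
inner_products = support_vectors @ X.T

# f(x) = sum_i alpha_i y_i <x_i, x> + b
manual_decision = alpha_times_y @ inner_products + clf.intercept_[0]

print(np.allclose(manual_decision, clf.decision_function(X)))
```

Only the support vectors contribute to the sum, and the data enters exclusively through inner products.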

Motivation for Kernels

Given that the dual SVM formulation relies solely on inner products, we can think of these inner products as measures of similarity between data points. The inner product quantifies how similar two vectors are in the input space.

Idea

Enhance the similarity measure: Use a function that better captures the similarity between data points in a way that can handle non-linear relationships.
Define a Kernel Function: Instead of using the standard inner product, introduce a function $K(x_{i}, x_{j})$ that computes the similarity between the data points.

Defining Kernel Functions:

A kernel function is a function $K : \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}$ that computes a generalized inner product between two input vectors:
$K(x_{i}, x_{j}) = ⟨ϕ(x_{i}), ϕ(x_{j})⟩_{H}$
where:

$ϕ : \mathbb{R}^{d} \to H$ is a mapping from the input space to a (possibly higher-dimensional) feature space.
$⟨\cdot, \cdot⟩_{H}$ denotes the inner product in the feature space.

Key Points:

Implicit Mapping: We do not need to know the explicit form of $ϕ(x)$.
Similarity Measure: The kernel function acts as a measure of similarity in the feature space.

Mathematics behind this

Mercer's Theorem:

A symmetric function $K(x, z)$ can be expressed as a kernel (an inner product in some feature space) if and only if the kernel matrix $K$, defined by $K_{ij} = K(x_{i}, x_{j})$, is positive semi-definite for every finite set of vectors $\{x_{1}, \dots, x_{m}\}$.

Implications:

Validity: Not all functions can serve as kernels. Mercer's Theorem ensures that the chosen kernel corresponds to an inner product in some feature space.
Optimization: Using a valid kernel guarantees that the optimization problem remains convex and solvable.
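The positive semi-definite condition can be probed numerically on a sample of points. In this sketch the RBF kernel (with an assumed $γ = 0.5$) is compared with the negative squared distance, a symmetric similarity score chosen here as a counterexample. A check on one finite sample can expose an invalid kernel, but it cannot prove that a function is valid.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

# Pairwise squared Euclidean distances between all points
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

K_rbf = np.exp(-0.5 * sq_dists)  # RBF kernel with gamma = 0.5 (a valid kernel)
K_bad = -sq_dists                # symmetric, but not a valid kernel

def min_eigenvalue(K):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(K).min()

print("RBF Gram matrix, smallest eigenvalue:", min_eigenvalue(K_rbf))
print("Negative squared distance, smallest eigenvalue:", min_eigenvalue(K_bad))
```

The second matrix has a zero diagonal and non-zero off-diagonal entries, so its eigenvalues sum to zero and at least one of them must be negative.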

Common Kernel functions

Several kernel functions can be used with SVM to handle non-linear data.

1. Polynomial Kernel

The polynomial kernel allows the SVM to fit polynomial boundaries. It's defined as:

$K (x_{i}, x_{j}) = (γ ⟨ x_{i}, x_{j} ⟩ + r)^{d}$

Parameters:
$γ$ : Scale parameter.
$r$ : Coefficient term.
$d$ : Degree of the polynomial.

2. Radial Basis Function (RBF) Kernel

The RBF kernel is one of the most popular kernels for non-linear classification. It measures the similarity between two points based on their distance.
$K(x_{i}, x_{j}) = \exp(-γ \|x_{i} - x_{j}\|^{2})$

Parameters:
$γ$ : Determines the spread of the kernel. Larger values make the similarity fall off faster with distance.

3. Sigmoid Kernel

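Inspired by neural networks, the sigmoid kernel is defined as:
$K(x_{i}, x_{j}) = \tanh(γ ⟨x_{i}, x_{j}⟩ + r)$

Parameters:
$γ$ : Scale parameter.
$r$ : Coefficient term.

Unlike the other kernels listed here, the sigmoid kernel is not positive semi-definite for every choice of $γ$ and $r$, so it does not always satisfy Mercer's condition.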

4. Linear Kernel

The linear kernel is simply the standard inner product. It is suitable for linearly separable data.
$K (x_{i}, x_{j}) = ⟨ x_{i}, x_{j} ⟩$
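The four formulas translate directly into NumPy. As a sanity check, this sketch compares hand-written versions with the corresponding helpers in `sklearn.metrics.pairwise`. The parameter values and the random data are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel,
)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
gamma, r, d = 0.5, 1.0, 3  # illustrative parameter values

inner = X @ X.T                                                   # <x_i, x_j>
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # ||x_i - x_j||^2

K_poly = (gamma * inner + r) ** d
K_rbf = np.exp(-gamma * sq_dists)
K_sigmoid = np.tanh(gamma * inner + r)
K_linear = inner

print(np.allclose(K_poly, polynomial_kernel(X, degree=d, gamma=gamma, coef0=r)))
print(np.allclose(K_rbf, rbf_kernel(X, gamma=gamma)))
print(np.allclose(K_sigmoid, sigmoid_kernel(X, gamma=gamma, coef0=r)))
print(np.allclose(K_linear, linear_kernel(X)))
```

In scikit-learn the coefficient term $r$ is called `coef0`, both here and in `SVC`.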

Applying Kernel SVM to the Concentric Circles Dataset

Let's apply an SVM with different kernels to our synthetic dataset and visualize the decision boundaries.

import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC
import numpy as np
from mlxtend.plotting import plot_decision_regions

X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=42)


kernels = ['linear', 'poly', 'rbf']
degree = 3  
plt.figure(figsize=(18, 5))

for i, kernel in enumerate(kernels):
    if kernel == 'poly':
        clf = SVC(kernel=kernel, degree=degree, gamma='auto')
    else:
        clf = SVC(kernel=kernel, gamma='auto')

    clf.fit(X, y)
    plt.subplot(1, 3, i+1)
    plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(f'SVM with {kernel.capitalize()} Kernel')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

plt.tight_layout()
plt.show()
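To put numbers next to the decision-region plots, the kernels can also be compared by cross-validated accuracy. This is a minimal sketch using the same dataset and settings as the plotting code. The 5-fold split is an assumption made here.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=42)

scores = {}
for kernel in ["linear", "poly", "rbf"]:
    # degree is only used by the polynomial kernel and ignored by the others
    clf = SVC(kernel=kernel, degree=3, gamma="auto")
    scores[kernel] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:>6}: mean 5-fold accuracy = {scores[kernel]:.2f}")
```

On this dataset the RBF kernel should score far higher than the linear kernel, matching what the decision regions show.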

Advantages of Using Kernel Functions

Handling Non-linearity: Kernels enable SVMs to perform well on non-linearly separable data by implicitly mapping them into higher-dimensional spaces where linear separation is possible.
Computational Efficiency: Thanks to the kernel trick, SVMs can operate in high-dimensional spaces without explicitly computing the transformed features, reducing computational overhead.

Considerations and Best Practices

Choice of Kernel: The choice of kernel and its parameters significantly impact the model's performance. It's essential to experiment with different kernels and perform hyperparameter tuning to find the optimal settings.
Overfitting: High-degree polynomial kernels or very large values of $γ$ in RBF kernels can lead to overfitting, because the decision boundary can then bend around individual training points. Regularization parameters (like C in SVM) should be adjusted to balance the trade-off between margin maximization and classification error.
Interpretability: Models using non-linear kernels are often less interpretable compared to linear models. Ensure that model interpretability aligns with your application requirements.
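One common way to carry out this tuning is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV`. The grid values are arbitrary starting points chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=42)

# Illustrative search space: one sub-grid per kernel
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 1, 10]},
]

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

For a real project the final model should still be evaluated on data that was not used during the search.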

Conclusion

Support Vector Machines, when combined with kernel functions, become a versatile tool capable of handling complex, non-linearly separable datasets like concentric circles. By leveraging the kernel trick, SVMs can implicitly operate in higher-dimensional feature spaces, enabling them to find optimal decision boundaries without the computational burden of explicit feature transformation.

Support Vector Machines: From Hard Margin to Soft Margin

Harsimranjit Singh — Mon, 12 Aug 2024 03:46:45 +0000

Support Vector Machines are powerful tools in machine learning for classifying data and predicting values. They are popular in various fields like bioinformatics and financial forecasting because they handle complex data problems well. This article aims to explain SVMs in a simple way, covering topics like maximal margin classifier, support vectors, kernel trick and infinite dimensional mapping.

What Is an SVM?

At its core, an SVM performs classification by finding the hyperplane that best divides a dataset into classes. The unique aspect of SVMs lies in their ability to find the optimal hyperplane that maximizes the margin, which is the distance between the hyperplane and the nearest data points from each class. A larger margin helps the SVM make better predictions on new data because the decision boundary is clearer. The SVM does this with the help of support vectors.

Support Vectors

Support vectors are the subset of training data that are directly involved in constructing the optimal separating hyperplane. These points are crucial because they lie on the edge of the margin and influence the position and orientation of the hyperplane.

Mathematically, if you have a dataset with class labels $y_{i} \in \{-1, +1\}$ and feature vectors $x_{i}$, the support vectors satisfy the condition:

$y_{i} (w \cdot x_{i} + b) = 1$

Where does this equation come from?

The hyperplane in d-dimensional space is defined by the equation:

$w \cdot x + b = 0$
where:

w is the normal vector (perpendicular to the hyperplane).
b is the bias term (offset from the origin).

For a given data point $x_{i}$ with class label $y_{i}$:

if $y_{i} = 1$, the data point belongs to the positive class.
if $y_{i} = -1$, the data point belongs to the negative class.

We want the hyperplane to classify these data points correctly, so we need to ensure that:

Points from the positive class are on one side of the hyperplane: $w \cdot x_{i} + b \geq 1$
Points from the negative class are on the other side: $w \cdot x_{i} + b \leq -1$

To combine these constraints into a single form, we multiply by the class label $y_{i}$, and the constraint can be written as:

$y_{i} (w \cdot x_{i} + b) \geq 1$
Here's why:

if $y_{i} = 1$, the constraint becomes $w \cdot x_{i} + b \geq 1$, which ensures correct classification for positive class points.
if $y_{i} = -1$, the constraint becomes $-(w \cdot x_{i} + b) \geq 1$, that is $w \cdot x_{i} + b \leq -1$, which ensures correct classification for negative class points.
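The combined constraint can be checked on a fitted model. The sketch below uses a small, hand-made separable dataset and scikit-learn's `SVC` with a very large `C` as a stand-in for a hard margin. Both the data and the value `C=1e6` are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (illustrative values, not from the article)
X = np.array([[0.0, 0.0], [0.0, 1.0], [-1.0, 0.5],
              [2.0, 0.0], [2.0, 1.0], [3.0, 0.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates a hard margin
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# y_i (w . x_i + b) should be >= 1 everywhere and about 1 on the support vectors
functional_margins = y * (X @ w + b)
print("y_i (w . x_i + b):", np.round(functional_margins, 3))
print("support vector indices:", clf.support_)
```

The points reported as support vectors are the ones for which the constraint holds with equality, up to solver tolerance.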

Margin Calculation

In SVMs, the margin is the width of the gap between the two classes, which is twice the distance from the hyperplane to the closest data points (the support vectors).

To calculate the margin, we use the following formula:
$\text{Margin} = \frac{2}{\|w\|}$

Where does this formula come from?

The perpendicular distance $d$ from a point $x_{i}$ to the hyperplane is:
$d = \frac{|w \cdot x_{i} + b|}{\|w\|}$
Support vectors lie on the boundaries of the margin, where
$|w \cdot x_{i} + b| = 1$
Therefore, for the support vectors, the distance from the hyperplane is exactly:
$d = \frac{1}{\|w\|}$

The margin is the distance between the two boundary hyperplanes
$w \cdot x + b = 1$
$w \cdot x + b = -1$
Each lies at distance $\frac{1}{\|w\|}$ from the decision boundary, on opposite sides, so the distance between them is:
$\text{Margin} = \frac{2}{\|w\|}$
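The formula can be compared with a direct geometric measurement. In this sketch the toy data is constructed (as an assumption for illustration) so that the closest points of the two classes lie on the vertical lines $x_{1} = 0$ and $x_{1} = 2$, which makes the expected margin equal to 2.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data: the closest points of each class sit at x1 = 0 and x1 = 2
X = np.array([[0.0, 0.0], [0.0, 1.0], [-1.0, 0.5],
              [2.0, 0.0], [2.0, 1.0], [3.0, 0.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Large C approximates the hard-margin solution (assumption)
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Margin from the formula 2 / ||w||
margin = 2 / np.linalg.norm(w)

# Perpendicular distance of every point to the hyperplane
distances = np.abs(X @ w + b) / np.linalg.norm(w)

print(f"2 / ||w||                    = {margin:.3f}")
print(f"2 * smallest point distance  = {2 * distances.min():.3f}")
```

Both numbers should agree with the geometric gap of 2 between the two classes.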

Understanding Hard Margin SVM

The term "Hard Margin" comes from the fact that the algorithm requires every data point to satisfy $y_{i} (w \cdot x_{i} + b) \geq 1$. In other words, there are no allowances for misclassification or for points inside the margin. These strict requirements are why it's called a "hard" margin.

Formulating the Hard Margin SVM

1. Objective function

The goal of the hard-margin SVM is to maximize the margin between the two classes. As we previously discussed:

$\text{Margin} = \frac{2}{\|w\|}$

Maximizing the margin is equivalent to minimizing $\|w\|$, which in turn is equivalent to minimizing:
$\frac{1}{2} \|w\|^{2}$

Why square the norm? Because the squared norm is smooth and differentiable everywhere, which makes it easier to compute gradients and apply standard optimization methods.
Minimizing the squared norm is equivalent to minimizing the norm because $\|w\|^{2}$ increases monotonically with $\|w\|$, so both have the same optimal $w$.

2. Constraints:

The constraints ensure that each point is correctly classified and lies on or outside the margin boundary, that is, at a distance of at least $\frac{1}{\|w\|}$ from the hyperplane.
$y_{i} (w \cdot x_{i} + b) \geq 1$

3. Hard Margin SVM Optimization Problem:

Putting it all together, the hard-margin SVM optimization problem is:
$\min_{w, b} \frac{1}{2} \|w\|^{2} \quad \text{subject to} \quad y_{i} (w \cdot x_{i} + b) \geq 1, \ \forall i$

Solving this convex quadratic program yields the optimal $w$ and $b$.
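For a tiny dataset, the problem can be handed directly to a general-purpose constrained optimizer. The sketch below uses SciPy's `minimize` with the SLSQP method. This is a choice made here for illustration, not how dedicated SVM libraries solve the problem, and the toy data is an assumption.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data (illustrative values, not from the article)
X = np.array([[0.0, 0.0], [0.0, 1.0], [-1.0, 0.5],
              [2.0, 0.0], [2.0, 1.0], [3.0, 0.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

def objective(params):
    """Hard-margin objective 0.5 * ||w||^2, with params = (w_1, ..., w_d, b)."""
    w = params[:-1]
    return 0.5 * np.dot(w, w)

def margin_constraints(params):
    """Every entry must be >= 0, i.e. y_i (w . x_i + b) - 1 >= 0."""
    w, b = params[:-1], params[-1]
    return y * (X @ w + b) - 1

result = minimize(
    objective,
    x0=np.zeros(X.shape[1] + 1),
    method="SLSQP",
    constraints=[{"type": "ineq", "fun": margin_constraints}],
)
w_opt, b_opt = result.x[:-1], result.x[-1]

print("w =", np.round(w_opt, 3), "b =", round(float(b_opt), 3))
print("margin =", round(float(2 / np.linalg.norm(w_opt)), 3))
```

For this data the widest separating band lies between $x_{1} = 0$ and $x_{1} = 2$, so the optimizer should return approximately $w = (1, 0)$ and $b = -1$.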

Problem with Hard Margin

While hard-margin SVMs are effective for linearly separable data, they come with certain limitations.

A hard-margin SVM fails in the presence of outliers or overlapping classes.

For example, if two points are outliers that lie among the points of the other class, no hyperplane can classify every point correctly, so the hard-margin constraints cannot all be satisfied and no decision boundary is found.

To tackle this, the soft-margin SVM is used.

Soft Margin SVM

While the hard-margin SVM works well with linearly separable data, it struggles with datasets containing outliers or overlapping classes. To address these limitations, the soft-margin SVM introduces a concept called "slack variables".

Slack Variables

Slack variables $ξ_{i}$ are introduced to measure the degree to which a data point is allowed to violate the margin constraints. They allow for some misclassification and margin violations. For each data point i, the slack variable $ξ_{i}$ is non-negative and represents the extent of the constraint violation.

if $ξ_{i} = 0$, the data point is correctly classified and lies on or outside the margin.
if $0 < ξ_{i} < 1$, the data point is correctly classified but within the margin.
if $ξ_{i} \geq 1$, the data point is misclassified (or lies exactly on the decision boundary when $ξ_{i} = 1$).

Soft-Margin SVM: Objective Function and Constraints

Objective Function
The soft-margin SVM modifies the objective function to incorporate slack variables. The goal is now to balance between maximizing the margin and minimizing the classification error. The new objective function is:
$\text{Minimize} \quad \frac{1}{2} \|w\|^{2} + C \sum_{i=1}^{n} ξ_{i}$
where:

$\frac{1}{2} \|w\|^{2}$ still represents the margin maximization term.
C is the regularization parameter that controls the trade-off between the margin size and classification error.
$ξ_{i}$ are the slack variables representing the extent of the margin violation for each data point.

Constraints
The constraints are modified to account for slack variables:
$y_{i} (w^{T} x_{i} + b) \geq 1 - ξ_{i}$
$ξ_{i} \geq 0$
where:

The term $1 - ξ_{i}$ allows some flexibility by permitting data points to lie within the margin or on the wrong side of the hyperplane.
$ξ_{i} \geq 0$ ensures that slack variables are non-negative, representing non-negative deviations from the ideal margin.
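For a fixed hyperplane, the smallest slack that satisfies both constraints is $ξ_{i} = \max(0, 1 - y_{i}(w^{T} x_{i} + b))$. The sketch below evaluates this for three points and a hand-picked hyperplane. The values of `w`, `b` and the points are illustrative assumptions, not a fitted model.

```python
import numpy as np

# Hand-picked hyperplane (not fitted): w . x + b = 0 with w = (1, 0), b = -1
w, b = np.array([1.0, 0.0]), -1.0

# Three positive-class points at different positions relative to the margin
X = np.array([[3.0, 0.0], [1.5, 0.0], [0.5, 0.0]])
y = np.array([1, 1, 1])

f = X @ w + b
# Smallest slack satisfying y_i f(x_i) >= 1 - xi_i together with xi_i >= 0
xi = np.maximum(0.0, 1.0 - y * f)

def describe(slack):
    """Map a slack value to the three cases discussed in the text."""
    if slack == 0:
        return "on or outside the margin"
    if slack < 1:
        return "inside the margin, correctly classified"
    return "misclassified (or on the decision boundary if slack == 1)"

for point, slack in zip(X, xi):
    print(point, f"xi = {slack:.2f}:", describe(slack))
```

The three points produce slack values of 0, 0.5 and 1.5, one for each case in the list of slack-variable ranges.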

What is $ξ_{i}$

Hinge Loss
The hinge loss function is a crucial concept in understanding the role of $ξ_{i}$ in SVMs. It measures the cost associated with misclassifications and margin violations. The hinge loss for a data point $(x_{i}, y_{i})$ is defined as:
$L(y_{i}, f(x_{i})) = \max(0, 1 - y_{i} f(x_{i}))$
where $f(x_{i}) = w \cdot x_{i} + b$. At the optimum of the soft-margin problem, each slack variable takes exactly this value: $ξ_{i} = \max(0, 1 - y_{i} f(x_{i}))$.

Intuition behind the Hinge Loss Function

1. Decision Function:

In SVMs, the decision function is $f(x_{i}) = w \cdot x_{i} + b$,
where $w$ is the weight vector, $x_{i}$ is the input feature vector, and $b$ is the bias term.

2. Margin and Classification:

For a point that is correctly classified and respects the margin, we want $y_{i} f(x_{i}) \geq 1$. This ensures that the point is on the correct side of the margin.
A point with $0 \leq y_{i} f(x_{i}) < 1$ is on the correct side of the decision boundary but inside the margin, and a point with $y_{i} f(x_{i}) < 0$ is misclassified.

3. Hinge Loss:

The hinge loss penalizes points that do not satisfy the margin requirement $y_{i} f(x_{i}) \geq 1$. The loss is defined as:

$L(y_{i}, f(x_{i})) = \max(0, 1 - y_{i} f(x_{i}))$

Correct Classification with Margin:
when $y_{i} f(x_{i}) \geq 1$
$L(y_{i}, f(x_{i})) = \max(0, 1 - y_{i} f(x_{i})) = 0$
This means there is no penalty because the point is correctly classified and lies on or outside the margin.
Margin Violation:
when $0 \leq y_{i} f(x_{i}) < 1$
$L(y_{i}, f(x_{i})) = \max(0, 1 - y_{i} f(x_{i})) = 1 - y_{i} f(x_{i})$
This means the point is correctly classified but lies within the margin, resulting in a penalty proportional to its distance from the margin boundary.
Misclassification:
when $y_{i} f(x_{i}) < 0$
$L(y_{i}, f(x_{i})) = \max(0, 1 - y_{i} f(x_{i})) = 1 - y_{i} f(x_{i})$
This means the point is misclassified, resulting in a penalty greater than 1 that increases as the point moves further away from the correct side of the decision boundary.

Explaining Hinge Loss with example

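Let's use a small dataset consisting of five points:
(1,2) with label 1
(2,2) with label 1
(2.5,2.5) with label -1
(3,1) with label -1
(3,3) with label 1

Let's fit a linear SVM model with a soft margin (C=1).

For the worked calculation below, take the following rounded weight vector and bias (the script further down prints the decision values and hinge losses of the model that scikit-learn actually fits, which may differ from these numbers):

w = [0.67, 0.67]
b = -2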

Hinge Loss Calculation

For each point, we calculate the hinge loss:

Point (1,2):
- y = 1
- $f(x)$ = 0.67 x 1 + 0.67 x 2 - 2 = 0.01
- Hinge Loss: max(0, 1 - $y \cdot f(x)$) = max(0, 1 - 1 x 0.01) = 0.99
- On the correct side of the boundary (barely) but within the margin
Point (2,2):
- y = 1
- $f(x)$ = 0.67 x 2 + 0.67 x 2 - 2 = 0.68
- Hinge Loss: max(0, 1 - $y \cdot f(x)$) = max(0, 1 - 1 x 0.68) = 0.32
- Correctly classified but within the margin
Point (2.5,2.5):
- y = -1
- $f(x)$ = 0.67 x 2.5 + 0.67 x 2.5 - 2 = 1.35
- Hinge Loss: max(0, 1 - $y \cdot f(x)$) = max(0, 1 + 1.35) = 2.35
- Misclassified
Point (3,1):
- y = -1
- $f(x)$ = 0.67 x 3 + 0.67 x 1 - 2 = 0.68
- Hinge Loss: max(0, 1 - $y \cdot f(x)$) = max(0, 1 + 0.68) = 1.68
- Misclassified
Point (3,3):
- y = 1
- $f(x)$ = 0.67 x 3 + 0.67 x 3 - 2 = 2.02
- Hinge Loss: max(0, 1 - $y \cdot f(x)$) = max(0, 1 - 2.02) = 0.00
- Correctly classified and outside the margin

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 2], [2.5, 2.5], [3, 1], [3, 3]])
y = np.array([1, 1, -1, -1, 1])  


model = SVC(kernel='linear', C=1)
model.fit(X, y)
w = model.coef_[0]
b = model.intercept_[0]
x_plot = np.linspace(0, 4, 100)
y_plot = -w[0] / w[1] * x_plot - b / w[1]


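# The margin boundaries are the lines w.x + b = +1 and w.x + b = -1.
# Measured along the vertical axis, they are offset from the decision
# boundary by 1 / w[1] (not by the perpendicular distance 1 / ||w||).
offset = 1 / w[1]
y_margin1 = y_plot + offset
y_margin2 = y_plot - offset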


plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', s=50)
plt.plot(x_plot, y_plot, 'k-', label='Decision Boundary')
plt.plot(x_plot, y_margin1, 'k--', label='Margin')
plt.plot(x_plot, y_margin2, 'k--')


for i, txt in enumerate(y):
    plt.annotate(f'({X[i][0]},{X[i][1]})', (X[i][0], X[i][1]))

plt.annotate('Misclassified Point', xy=(2.5, 2.5), xytext=(3, 1.5),
             arrowprops=dict(facecolor='red', shrink=0.05),
             fontsize=10, color='red')
plt.annotate('Correctly Classified\nBut Within Margin', xy=(2, 2), xytext=(0.5, 2.5),
             arrowprops=dict(facecolor='blue', shrink=0.05),
             fontsize=10, color='blue')

plt.xlim(0, 4)
plt.ylim(0, 4)
plt.title('Soft Margin SVM')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

def hinge_loss_point(x, y, w, b):
    f_x = np.dot(w, x) + b
    return max(0, 1 - y * f_x)

for i in range(len(X)):
    x_i = X[i]
    y_i = y[i]
    loss = hinge_loss_point(x_i, y_i, w, b)
    print(f'Point {i+1}: x={x_i}, y={y_i}, f(x)={np.dot(w, x_i) + b:.2f}, Hinge Loss={loss:.2f}')

Short Summary

Correctly classified and outside the margin: No penalty, like Point (3,3).
Correctly classified but within the margin: Penalty proportional to the distance from the margin boundary, like Point (2,2).
Misclassified: Penalty greater than 1, increasing with distance from the correct side of the boundary, like Point (2.5,2.5).

Constraint for Soft-Margin

$y_{i} (w^{T} x_{i} + b) \geq 1 - ξ_{i}$
$ξ_{i} \geq 0$

Why $1 - ξ_{i}$

Let's take simple examples in different cases:

Point A: Correctly classified and outside the margin
- Slack Variable: $ξ_{A} = 0$
- Constraint: $y_{A} (w^{T} x_{A} + b) \geq 1 - ξ_{A} = 1$
- Satisfies: The constraint is met because $ξ_{A}$ is 0, so there is no margin violation.
Point B: Correctly classified but within the margin ($0 < ξ_{B} < 1$)
- Slack Variable: $ξ_{B} = 0.5$
- Constraint: $y_{B} (w^{T} x_{B} + b) \geq 1 - ξ_{B} = 0.5$
- Satisfies: The point is within the margin but still classified correctly. The slack variable $ξ_{B}$ of 0.5 allows this.
- Interpretation
  - Without Slack Variable: The constraint would be $y_{B} (w^{T} x_{B} + b) \geq 1$. This means the point $x_{B}$ would need to be classified correctly and lie on or outside the margin boundary.
  - With Slack Variable: The slack variable $ξ_{B}$ of 0.5 allows the constraint to be relaxed. It changes the requirement from 1 to 0.5, meaning the point can be within the margin but still satisfy the relaxed constraint.
Point C: Misclassified ($ξ_{C} > 1$)
- Slack Variable: $ξ_{C} = 1.5$
- Constraint: $y_{C} (w^{T} x_{C} + b) \geq 1 - ξ_{C} = -0.5$
- Satisfies: The point is on the wrong side of the decision boundary, but the slack variable $ξ_{C}$ of 1.5 relaxes the constraint enough to allow this.

Real Objective Function

In the soft-margin SVM, we aim to balance two objectives:

Maximizing the margin (which is equivalent to minimizing $\|w\|^{2}$).
Minimizing the classification error, which is measured by the slack variables.

Putting these together, we need to

$\text{Minimize} \quad \frac{1}{2} \|w\|^{2} + C \sum_{i=1}^{n} ξ_{i}$
subject to
$y_{i} (w^{T} x_{i} + b) \geq 1 - ξ_{i}$
$ξ_{i} \geq 0$
The soft-margin SVM problem is a quadratic programming problem. It involves a quadratic objective function and linear constraints. To solve the problem efficiently, it is often useful to convert it to its dual form using Lagrange multipliers.
The dual form of this problem is
$\max_{α} \left( \sum_{i=1}^{n} α_{i} - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} α_{i} α_{j} y_{i} y_{j} \, x_{i}^{T} x_{j} \right) \quad \text{subject to} \quad 0 \leq α_{i} \leq C, \quad \sum_{i=1}^{n} α_{i} y_{i} = 0$

Derivation
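The following is a brief sketch of the standard Lagrangian argument that leads from the primal problem to the dual above. The multipliers $μ_{i}$ for the constraints $ξ_{i} \geq 0$ are notation introduced here and do not appear elsewhere in the article.

```latex
% Lagrangian with multipliers \alpha_i \ge 0 and \mu_i \ge 0
L(w, b, \xi, \alpha, \mu)
  = \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{n} \xi_{i}
  - \sum_{i=1}^{n} \alpha_{i} \left[ y_{i}(w^{T} x_{i} + b) - 1 + \xi_{i} \right]
  - \sum_{i=1}^{n} \mu_{i} \xi_{i}

% Setting the derivatives with respect to the primal variables to zero
\frac{\partial L}{\partial w} = 0
  \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_{i} y_{i} x_{i}
\qquad
\frac{\partial L}{\partial b} = 0
  \;\Rightarrow\; \sum_{i=1}^{n} \alpha_{i} y_{i} = 0
\qquad
\frac{\partial L}{\partial \xi_{i}} = 0
  \;\Rightarrow\; C - \alpha_{i} - \mu_{i} = 0

% Since \mu_i \ge 0, the last condition gives the box constraint
0 \le \alpha_{i} \le C

% Substituting w back into L eliminates w, b and \xi
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_{i}
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_{i} \alpha_{j} y_{i} y_{j} \, x_{i}^{T} x_{j}
```

Compared with the hard-margin dual, the only change is the upper bound $C$ on each $α_{i}$.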

Regularization Parameter (C) in SVM

The regularization parameter C in SVM plays a crucial role in the bias-variance tradeoff:

High C Value:
- Effect: A high C value aims to minimize the training error by allowing fewer misclassifications on the training set. This reduces bias but increases variance, making the model more prone to overfitting.
- Result: The margin becomes narrow and the decision boundary follows the training points closely, potentially fitting noise in the training data.
Low C Value:
- Effect: A low C value allows more misclassifications on the training set, which increases bias but reduces variance. The model is less flexible and smoother.
- Result: The margin is wider and the decision boundary is less sensitive to individual training points, which can prevent overfitting but might lead to underfitting.
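The effect of C on the margin can be observed directly. This sketch fits a linear SVM with a low and a high C on two overlapping Gaussian clusters. The synthetic data and the two C values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clusters (synthetic data for illustration)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)),
               rng.normal(1, 1, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

results = {}
for C in [0.01, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])   # margin width 2 / ||w||
    n_sv = int(clf.n_support_.sum())            # total number of support vectors
    results[C] = (margin, n_sv)
    print(f"C = {C:>6}: margin width = {margin:.2f}, support vectors = {n_sv}")
```

With the low C the optimizer tolerates many margin violations in exchange for a wide margin, while the high C shrinks the margin to reduce violations.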

Conclusion

Support Vector Machines offer a robust approach for classification tasks, excelling in scenarios where a clear decision boundary is essential. By understanding key concepts such as support vectors, the margin, slack variables, and hinge loss, practitioners can effectively leverage SVMs for a wide range of machine learning applications.

Exploring the Depths of Support Vector Machines: Hard Margin SVM

Harsimranjit Singh — Tue, 30 Jul 2024 03:20:12 +0000

What Is an SVM?

Support Vectors

Mathematically, if you have a dataset with class labels $y_{i} \in \{-1, +1\}$ and feature vectors $x_{i}$, the support vectors satisfy the condition:

$y_{i} (w \cdot x_{i} + b) = 1$

Where does this equation come from?

The hyperplane in d-dimensional space is defined by the equation:

$w \cdot x + b = 0$
where:

w is the normal vector (perpendicular to the hyperplane).
b is the bias term (offset from the origin).

For a given data point $x_{i}$ with class label $y_{i}$:

if $y_{i} = 1$, the data point belongs to the positive class.
if $y_{i} = -1$, the data point belongs to the negative class.

We want the hyperplane to classify these data points correctly, so we need to ensure that:

Points from the positive class are on one side of the hyperplane: $w \cdot x_{i} + b \geq 1$
Points from the negative class are on the other side: $w \cdot x_{i} + b \leq -1$

To combine these constraints into a single form, we multiply by the class label $y_{i}$, and the constraint can be written as:

$y_{i} (w \cdot x_{i} + b) \geq 1$
Here's why:

if $y_{i} = 1$, the constraint becomes $w \cdot x_{i} + b \geq 1$, which ensures correct classification for positive class points.
if $y_{i} = -1$, the constraint becomes $-(w \cdot x_{i} + b) \geq 1$, that is $w \cdot x_{i} + b \leq -1$, which ensures correct classification for negative class points.

Margin Calculation

In SVMs, the margin is the width of the gap between the two classes, which is twice the distance from the hyperplane to the closest data points (the support vectors).

To calculate the margin, we use the following formula:
$\text{Margin} = \frac{2}{\|w\|}$

Where does this formula come from?

The margin is the distance between the two boundary hyperplanes
$w \cdot x + b = 1$
$w \cdot x + b = -1$
Each lies at distance $\frac{1}{\|w\|}$ from the decision boundary, on opposite sides, so the distance between them is:
$\text{Margin} = \frac{2}{\|w\|}$

Understanding Hard Margin SVM

Formulating the Hard Margin SVM

1. Objective function

The goal of the hard-margin SVM is to maximize the margin between the two classes. As we previously discussed:

$\text{Margin} = \frac{2}{\|w\|}$

Maximizing the margin is equivalent to minimizing $\|w\|$, which in turn is equivalent to minimizing:
$\frac{1}{2} \|w\|^{2}$

2. Constraints:

The constraints ensure that each point is correctly classified and lies on or outside the margin boundary, that is, at a distance of at least $\frac{1}{\|w\|}$ from the hyperplane.
$y_{i} (w \cdot x_{i} + b) \geq 1$

3. Hard Margin SVM Optimization Problem:

Putting it all together, the hard-margin SVM optimization problem is:
$\min_{w, b} \frac{1}{2} \|w\|^{2} \quad \text{subject to} \quad y_{i} (w \cdot x_{i} + b) \geq 1, \ \forall i$

Solving this convex quadratic program yields the optimal $w$ and $b$.

Problem with Hard Margin

While hard-margin SVMs are effective for linearly separable data, they come with certain limitations.

A hard-margin SVM fails in the presence of outliers or overlapping classes.

For example, if two points are outliers that lie among the points of the other class, no hyperplane can classify every point correctly, so the hard-margin constraints cannot all be satisfied and no decision boundary is found.

To tackle this, the soft-margin SVM is used.

ROC-AUC Curve in Machine Learning

Harsimranjit Singh — Wed, 03 Jul 2024 03:51:52 +0000

In machine learning, evaluating the performance of your models is crucial. One powerful tool for this purpose is the ROC-AUC curve. This article will explore what the ROC-AUC curve is, and how it works.

Understanding the ROC Curve

The ROC curve visually represents the model's performance across all possible classification thresholds. It plots the True Positive Rate (TPR) on the y-axis and the False Positive Rate (FPR) on the x-axis.

TPR (True Positive Rate): Also known as recall or sensitivity, it measures the proportion of actual positive cases that the model correctly classifies as positive: TPR = TP / (TP + FN).

FPR (False Positive Rate): It measures the proportion of actual negative cases incorrectly classified as positive by the model: FPR = FP / (FP + TN).

Plotting the ROC Curve

To plot a ROC curve, you vary the threshold for classifying positive and negative samples. At each threshold, you calculate the TPR and FPR, which gives you a point on the ROC curve. By connecting these points, you create the ROC curve.
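The threshold sweep can be written out by hand and compared with scikit-learn's `roc_curve`. The labels and scores below are made-up values for illustration, and `tpr_fpr` is a helper defined only for this sketch.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and classifier scores for illustration
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5])

def tpr_fpr(threshold):
    """Predict positive when score >= threshold, then return (TPR, FPR)."""
    y_pred = y_scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    return tp / np.sum(y_true == 1), fp / np.sum(y_true == 0)

for t in [0.9, 0.5, 0.3, 0.0]:
    tpr, fpr = tpr_fpr(t)
    print(f"threshold = {t:.1f}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")

# The same points, computed by scikit-learn at every distinct threshold
fpr_sk, tpr_sk, thresholds = roc_curve(y_true, y_scores, drop_intermediate=False)
manual = np.array([tpr_fpr(t) for t in thresholds])
print(np.allclose(manual[:, 0], tpr_sk), np.allclose(manual[:, 1], fpr_sk))
```

Lowering the threshold moves the point up and to the right, from (0, 0) towards (1, 1).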

Interpreting the ROC Curve

The ideal ROC curve hugs the top-left corner of the plot, indicating a high TPR and a low FPR. The closer the ROC curve is to this corner, the better the model. Conversely, a ROC curve along the diagonal line from (0,0) to (1,1) indicates a model with no discrimination ability.

Area Under the ROC Curve (AUC)

The AUC provides a single number summary of the ROC curve. It represents the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance. An AUC of 0.5 indicates no discrimination, while an AUC of 1.0 represents perfect discrimination.
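The ranking interpretation can be checked directly. The sketch below, on made-up labels and scores, counts the fraction of (positive, negative) pairs in which the positive instance gets the higher score (ties counted as one half) and compares it with scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_scores[y_true == 1]
neg = y_scores[y_true == 0]

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

sklearn_auc = roc_auc_score(y_true, y_scores)
print(pairwise_auc, sklearn_auc)
```

Here three of the four positive-negative pairs are ranked correctly, so both numbers come out as 0.75.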

Advantages of the ROC-AUC Curve

The ROC-AUC curve has several advantages. It is threshold-independent, meaning it evaluates model performance across all thresholds. Because TPR and FPR are each computed within a single class, it is also less distorted by class imbalance than metrics like accuracy, which can be misleading on imbalanced datasets.

Practical Considerations

ROC-AUC is particularly useful in fields like medical diagnostics and fraud detection, where the costs of false positives and false negatives differ. However, it's essential to consider the specific context of your problem when interpreting AUC values.

ROC-AUC in Practice

Here's a brief guide on how to plot ROC curves and calculate AUC using Python's scikit-learn library:

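from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Illustrative labels and scores; replace them with your own model's outputs
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.3, 0.35, 0.8, 0.4, 0.7, 0.2, 0.9, 0.55, 0.6]

# roc_curve sweeps the threshold and returns one (FPR, TPR) pair per threshold
fpr, tpr, _ = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)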

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

Conclusion

The ROC-AUC curve is a vital tool for evaluating the performance of classification models. By understanding and correctly interpreting this metric, you can gain deeper insights into your model's strengths and weaknesses. Remember to use ROC-AUC alongside other metrics to get a comprehensive evaluation of your models.

Maximum Likelihood Estimation with Logistic Regression

Harsimranjit Singh — Thu, 27 Jun 2024 03:52:13 +0000

In our previous article, we introduced logistic regression, a fundamental technique in machine learning used for binary classification. Logistic regression predicts the probability of binary outcomes based on input features.
This article dives into the mathematical foundation of logistic regression.

Understanding Likelihood

Likelihood refers to the chance of observing a specific outcome or event given a particular model or set of conditions.
Here is a breakdown to understand it better:

Focus on Specific Outcome: Unlike probability, which deals with the general chance of an event happening, likelihood focuses on a specific outcome given something else is true.
Model-Based: We use the model to calculate the likelihood of observing a specific data set, assuming the model's parameters are true.
Higher Likelihood: A higher likelihood means the model's parameters are a better fit for explaining the data.

Example

Imagine you have a coin that might be biased, and you flip it 5 times, getting the results:
Heads, Tails, Heads, Heads, Tails. You want to estimate the probability θ of getting heads.
1 -> Suppose θ = 0.5:

The likelihood of getting the sequence is: L(0.5) = P(H) × P(T) × P(H) × P(H) × P(T) = 0.5 × 0.5 × 0.5 × 0.5 × 0.5 = 0.03125

2 -> Suppose θ = 0.7:

The likelihood of getting the same sequence is: L(0.7) = P(H) × P(T) × P(H) × P(H) × P(T) = 0.7 × 0.3 × 0.7 × 0.7 × 0.3 = 0.03087

Of these two candidates, θ = 0.5 explains the observed sequence slightly better.
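The coin example can be reproduced in a few lines. This is a minimal sketch: the helper `likelihood` and the grid of candidate values are assumptions made for illustration, and the grid search simply stands in for a proper optimizer.

```python
import numpy as np

flips = "HTHHT"  # the observed sequence from the example

def likelihood(theta, flips):
    """Probability of the exact sequence if P(heads) = theta."""
    heads = flips.count("H")
    tails = flips.count("T")
    return theta ** heads * (1 - theta) ** tails

L_05 = likelihood(0.5, flips)
L_07 = likelihood(0.7, flips)

# Try many candidate values of theta and keep the one with the highest likelihood
grid = np.linspace(0.01, 0.99, 99)
theta_hat = grid[np.argmax([likelihood(t, flips) for t in grid])]

print(L_05, L_07, theta_hat)
```

The best value on the grid is the observed fraction of heads, 3 out of 5, which is the maximum likelihood estimate for a coin.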

Difference between likelihood and probability

Probability: Describes how likely different outcomes are when the model and its parameters are fixed.
Likelihood: Describes how well different parameter values explain an outcome that has already been observed.

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is a method used to estimate the parameters of a statistical model. The goal is to find the parameter values that maximize the likelihood function, i.e. the values that best fit the observed data.

Step-by-Step

Define the Likelihood Function: The likelihood function L(θ) represents the probability of observing the data as a function of the model parameters θ.
Log-Likelihood Function: For mathematical convenience, we often use the log-likelihood function ℓ(θ), which is the natural log of the likelihood function:
ℓ(θ) = log L(θ)
Maximize the log-likelihood: Find the parameter values that maximize the log-likelihood function. This involves taking the derivative of the log-likelihood with respect to the parameters and setting it to zero to solve for the parameters.

MLE in Logistic Regression

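Logistic regression models the probability of a binary outcome (success/failure) based on input features x:

$P(y_{i} = 1 \mid X_{i}) = \frac{1}{1 + e^{-X_{i} \beta}}$

where $X_{i}$ are the input features, β are the parameters to be estimated, and $y_{i}$ is the binary outcome.

Log-Likelihood Function:

The likelihood of observing the given data under logistic regression is:

$L(\beta) = \prod_{i=1}^{n} p_{i}^{y_{i}} (1 - p_{i})^{1 - y_{i}}, \quad p_{i} = P(y_{i} = 1 \mid X_{i})$

Taking the log gives the log-likelihood:

$\ell(\beta) = \sum_{i=1}^{n} \left[ y_{i} \log p_{i} + (1 - y_{i}) \log (1 - p_{i}) \right]$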

Deriving the MLE for Logistic Regression

To find the MLE for β, we need to maximize the log-likelihood function. This involves:

Calculating the Gradient: Compute the derivative of the log-likelihood with respect to β.
Optimization: Use an optimization algorithm (e.g., gradient ascent, or gradient descent on the negative log-likelihood) to find the parameter values that maximize the log-likelihood.
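The two steps above can be sketched with plain NumPy. For the logistic model the gradient of the log-likelihood with respect to β is $X^{T}(y - p)$, where p is the vector of predicted probabilities. The tiny dataset, learning rate, and iteration count below are assumptions chosen for illustration; the classes deliberately overlap so that a finite maximum exists.

```python
import numpy as np

# Hypothetical data: first column of ones is the intercept; classes overlap
X = np.array([[1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([0, 1, 0, 1], dtype=float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood_gradient(beta, X, y):
    """Gradient of the log-likelihood: X^T (y - p)."""
    p = sigmoid(X @ beta)
    return X.T @ (y - p)

beta = np.zeros(X.shape[1])
learning_rate = 0.05

# Gradient ascent: step in the direction that increases the log-likelihood
for _ in range(20_000):
    beta = beta + learning_rate * log_likelihood_gradient(beta, X, y)

final_gradient = log_likelihood_gradient(beta, X, y)
print("beta:", beta)
print("gradient norm at the end:", np.linalg.norm(final_gradient))
```

At the maximum the gradient is (numerically) zero, which is the "set the derivative to zero" condition from the step-by-step description.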

Practical Implementation

import numpy as np
import scipy.optimize as opt

# First column of ones is the intercept.
# The labels overlap on purpose: with perfectly separable classes the MLE does not exist.
X = np.array([[1, 2], [1, 3], [1, 4], [1, 5]])
y = np.array([0, 1, 0, 1])

# Sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Negative log-likelihood (minimizing it maximizes the log-likelihood)
def neg_log_likelihood(beta, X, y):
    z = np.dot(X, beta)
    return -np.sum(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))

beta_init = np.zeros(X.shape[1])

# Optimization
result = opt.minimize(neg_log_likelihood, beta_init, args=(X, y), method='BFGS')
beta_hat = result.x

print("Parameters", beta_hat)

Conclusion

Understanding the mathematical foundations of logistic regression and maximum likelihood estimation is essential for effectively applying these techniques in machine learning. By maximizing the likelihood function, logistic regression identifies the parameters β that best fit the observed data, enabling accurate predictions of binary outcomes based on input features.

Understanding Logistic Regression

Harsimranjit Singh — Tue, 18 Jun 2024 14:23:30 +0000

Previously, we discussed linear regression, a method used to predict continuous outcomes. Today let's explore logistic regression, which is essential for binary classification problems in data science.

What is Logistic Regression?

Logistic regression is used to predict a binary outcome (such as Yes/No or True/False) based on one or more input variables. Unlike linear regression, which deals with continuous data, logistic regression estimates the probability that a given input belongs to a certain class.

Applications of Logistic Regression

Spam Detection: Email services apply logistic regression to classify emails as spam or not based on input variables extracted from each message.
Medical Predictions: Logistic regression can be used to determine the probability of medical conditions, such as predicting heart attacks based on variables like weight and exercise habits.
Educational Outcomes: Application aggregators use logistic regression to predict the probability of a student being accepted to a particular university or degree course by analyzing scores.

Logistic Regression Equation and Assumptions

Logistic Regression Equation

Logistic regression uses the logistic function (sigmoid function) to map a linear combination of the input features to a probability between 0 and 1. The sigmoid function is defined as:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Graph of Sigmoid function

If the output of the sigmoid function is greater than 0.5, the model predicts the instance as the positive class (1).
If the output is less than 0.5, the model predicts the instance as the negative class (0).

Interpretation of sigmoid function

The sigmoid function's output can be interpreted as the probability of the instance belonging to the positive class. For example

If the output is 0.7, there is a 70% chance that the instance belongs to the positive class
If the output is 0.2, there is a 20% chance that the instance belongs to the positive class
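A short sketch of how sigmoid outputs turn into class predictions. The input values are made up, and treating an output of exactly 0.5 as the positive class is an assumption, since the rule above only covers values strictly above or below 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical values of the linear combination z for four instances
z = np.array([-2.0, 0.0, 0.85, 3.0])

probabilities = sigmoid(z)                       # probability of the positive class
predictions = (probabilities >= 0.5).astype(int)  # threshold at 0.5

for zi, p, label in zip(z, probabilities, predictions):
    print(f"z = {zi:5.2f} -> probability = {p:.2f} -> predicted class {label}")
```

An input of z = 0 sits exactly on the decision boundary, where the probability is 0.5.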

Assumptions:

Binary dependent variable: Binary logistic regression requires the dependent variable to be binary. That means the outcome variable must have two possible outcomes, such as 'yes' or 'no'.
Independence of observations: The observations should be independent of each other. In other words, the outcome of one instance should not affect the outcome of another.
Linearity of independent variables and log odds: Logistic regression does not require the dependent and independent variables to be linearly related, but it does require the independent variables to be linearly related to the log odds.
Absence of multicollinearity: The independent variables should not be too highly correlated with each other.
Large sample size: Logistic regression requires a large sample size. As a rule of thumb, you need at least 10 cases of the least frequent outcome for each independent variable.

Types of Logistic Regression

Binary Logistic Regression: when the dependent variable has two outcomes, such as predicting whether a loan will be approved (yes/no)
Multinomial Logistic Regression: when the dependent variable has more than two discrete outcomes, such as predicting the type of transport a person will choose (car, bike, bus)
Ordinal Logistic Regression: Used when the dependent variable is ordinal, such as survey responses on an ordered scale (disagree, neutral, agree)

Conclusion

Logistic regression is a powerful and flexible tool for binary outcome modeling. Its simplicity, interpretability, and effectiveness when the classes are approximately linearly separable make it a preferred choice for many binary classification tasks in machine learning and predictive analytics. Understanding its assumptions and best practices ensures the development of robust and reliable models.

Elastic Net Regularization: Balancing Between L1 and L2 Penalties

Harsimranjit Singh — Fri, 07 Jun 2024 18:46:40 +0000

Elastic Net regularization stands out by combining the strengths of both L1 (Lasso) and L2 (Ridge) regularization methods. This article will explore the theoretical, mathematical and practical aspects of Elastic Net regularization.

Lasso vs. Ridge Regression

Lasso Regression: Adds an L1 norm penalty, promoting sparsity by driving some coefficients to zero. This can lead to feature selection. However, Lasso can struggle with highly correlated features.
Ridge Regression: Adds an L2 norm penalty, shrinking all coefficients towards zero without driving them exactly to zero. This keeps every feature in the model, so Ridge does not perform feature selection.

Elastic Net Regularization

Elastic Net regularization is a combined approach that blends L1 and L2 regularization penalties. Elastic Net addresses some limitations of Lasso and Ridge, particularly in scenarios with highly correlated features.

Mathematical Formulation

The Elastic Net regularization adds both L1 and L2 penalties to the loss function. The penalty term is:

$\lambda_{1} \sum_{j} |\beta_{j}| + \lambda_{2} \sum_{j} \beta_{j}^{2}$

Understanding the impact:

The L1 penalty from Lasso encourages sparsity, potentially driving some coefficients to zero (feature selection).
The L2 penalty from Ridge regression shrinks all coefficients towards zero, promoting smoother coefficient shrinkage and potentially better handling of correlated features.

By adjusting the values of λ₁ and λ₂, we can control the relative influence of the L1 and L2 penalties. A higher λ₁ encourages more sparsity, while a higher λ₂ produces stronger, smoother coefficient shrinkage.

Benefits of Elastic Net:

Overfitting: Elastic net helps prevent overfitting by penalizing overly complex models.
Feature Selection: The L1 component can drive coefficients to zero, potentially performing feature selection.
Handles Correlated Features: Elastic net can be more robust to highly correlated features.

Choosing the Right value:

Finding the optimal values for λ₁ and λ₂ is crucial for optimal performance. Techniques like cross-validation are employed to identify the combination of λ₁ and λ₂ that minimizes the validation error while maintaining a desirable sparsity level.
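A minimal sketch of such a cross-validated search, using scikit-learn's `ElasticNetCV`. Note that scikit-learn parameterizes the penalty with `alpha` and `l1_ratio` rather than λ₁ and λ₂: its penalty is `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The candidate grids and the synthetic dataset below are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic data, for illustration only
X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=42)

# Hypothetical candidate grids for the two hyperparameters
l1_ratios = [0.1, 0.5, 0.9]
alphas = [0.01, 0.1, 1.0, 10.0]

# 5-fold cross-validation over every (alpha, l1_ratio) combination
model = ElasticNetCV(l1_ratio=l1_ratios, alphas=alphas, cv=5)
model.fit(X, y)

print("Best alpha:", model.alpha_)
print("Best l1_ratio:", model.l1_ratio_)
print("Non-zero coefficients:", int(np.sum(model.coef_ != 0)))
```

The selected pair is the one with the lowest average validation error across the folds; the number of non-zero coefficients shows how sparse the resulting model is.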

When to use

The dataset is quite large.
The input columns exhibit multicollinearity.

Practical Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=42)

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # alpha controls L1 & L2, l1_ratio controls L1 vs L2 ratio
elastic_net.fit(X, y)

plt.figure(figsize=(12, 6))
plt.plot(range(X.shape[1]), elastic_net.coef_, marker='o', linestyle='none')
plt.xlabel('Feature Index')
plt.ylabel('Coefficient Value')
plt.title('Elastic Net Coefficients')
plt.xticks(range(X.shape[1]))
plt.grid(True)
plt.show()

Conclusion

In conclusion, Elastic Net regularization is a versatile and effective technique for improving the performance and interpretability of linear regression models. By leveraging both L1 and L2 penalties, it offers a comprehensive solution that can be fine-tuned to suit a variety of datasets and modelling challenges.

Polynomial Regression: Exploring Non-Linear Relationships

Harsimranjit Singh — Sun, 26 May 2024 15:16:28 +0000

In our previous discussions, we explored the fundamentals of linear regression and gradient descent optimization. Today, we discuss a new topic: polynomial regression. This technique lets us capture non-linear relationships between independent and dependent variables, a more flexible approach when a straight line does not fit the data.

Beyond Straight Lines

Linear regression assumes a linear relationship between the independent variables and the dependent variables. However, real-world data often exhibits different patterns.
Polynomial regression addresses this by introducing polynomial terms of the independent variables. If we have one variable X, we can transform it into X^2, X^3, and so on. These terms allow the model to capture curves, bends, and other non-linear trends in the data:

$Y = b_{0} + b_{1} X + b_{2} X^{2} + \dots + b_{d} X^{d}$

Here:

Y: Dependent variable
b0: The intercept term (bias)
b_i: Coefficient associated with each term (i = 1 to d)
X: Independent variable
X^i: The polynomial terms of X (i = 1 to d)

Let's take the example of a small dataset

Suppose we have a dataset representing the relationship between hours studied (x) and exam scores (y)

let's first visualize the dataset

import numpy as np
import matplotlib.pyplot as plt

hours_studied = np.array([1, 2, 3, 4, 5])
exam_scores = np.array([50, 65, 75, 80, 82])

plt.scatter(hours_studied, exam_scores, color='blue', label='Data points')
plt.xlabel('Hours studied')
plt.ylabel('Exam score')
plt.legend()
plt.grid(True)
plt.show()

The plot shows an upward trend between hours studied and exam scores that flattens out at higher values. A straight line cannot capture this kind of curved relationship well.

To implement polynomial regression, we first transform the independent variable with PolynomialFeatures. For example, with degree 2 it turns the single column x into the columns x (original) and x^2 (transformed), plus a constant column of ones for the bias by default.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = hours_studied.reshape(-1, 1)
y = exam_scores.reshape(-1, 1)

# Polynomial features (change the dataset)
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)


model = LinearRegression() # use the normal linear regression model
model.fit(X_poly, y)

y_pred = model.predict(X_poly)

plt.scatter(hours_studied, exam_scores, color='blue', label='Data points')
plt.plot(hours_studied, y_pred, color='red', label='Polynomial Regression')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.title('Polynomial Regression: Hours Studied vs. Exam Score')
plt.legend()
plt.grid(True)
plt.show()

The red curve represents the polynomial regression line fitted to the data. In the above example, we use the degree 3 polynomial.
By doing so we can capture the non-linear data as well.

Choosing the Right Degree:

The degree of the polynomial dictates the model's complexity. We encounter a trade-off here:

Higher Degrees: Capture a more complex relationship but can lead to overfitting, where it performs well on training data and performs poorly on unseen data.
Lower Degrees: Less prone to overfitting but might miss important non-linear patterns.
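The trade-off can be made concrete on the hours-studied example. The sketch below fits degrees 1 to 4 and reports the training error for each; the choice of degrees and the use of a scikit-learn pipeline are assumptions made for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Same small dataset as in the example above
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([50, 65, 75, 80, 82])

train_mse = {}
for degree in [1, 2, 3, 4]:
    # include_bias=False because LinearRegression already fits an intercept
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X, y)
    train_mse[degree] = mean_squared_error(y, model.predict(X))

for degree, mse in train_mse.items():
    print(f"degree {degree}: training MSE = {mse:.6f}")
```

The training error keeps falling as the degree grows, and a degree-4 polynomial passes through all five points exactly. That perfect training fit is a warning sign rather than a success: it says nothing about unseen data, which is why the degree should be chosen with a validation set or cross-validation.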

Estimations of Coefficients:

The coefficients in polynomial regression can be estimated using OLS, the same method used for linear regression, because the model is still linear in its coefficients.

Conclusion:

Polynomial regression is a powerful statistical technique used to model complex relationships between the data. It can capture non-linear patterns that linear regression might miss.

Some important points to remember

Polynomial regression assumes that the relationship between x and y is polynomial.
There are several types of polynomial regression: simple, multiple, and orthogonal polynomial regression.
The interpretation of the coefficients in polynomial regression is similar to linear regression, with the addition of higher-degree terms.
Polynomial regression also assumes that the error terms are randomly distributed with a mean of zero.

Understanding Lasso Regularization: Enhancing Model Performance and Feature Selection

Harsimranjit Singh — Fri, 24 May 2024 02:16:33 +0000

Lasso regularization is a powerful technique in machine learning used to prevent overfitting. But lasso goes a step further: it can also help us identify the most important features of the model. In this article, we will discuss the theoretical aspects of lasso along with its mathematical formulation.

Lasso Regularization

Lasso regularization is designed to enhance model sparsity, meaning it can zero out coefficients of less important features, effectively performing feature selection. This is particularly useful in high-dimensional data scenarios where we want to identify the most relevant predictors.

Mathematical formulation

Lasso regularization modifies the linear regression objective function by adding a penalty term. This penalty is the L1 norm of the coefficient vector, defined as the sum of the absolute values of the coefficients:

$Loss = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} + \lambda \sum_{j=1}^{p} |\beta_{j}|$

where:

λ is the regularization parameter that controls the strength of the penalty.
$\sum_{j=1}^{p} |\beta_{j}|$ is the L1 norm of the coefficients.

The L1 penalty encourages sparsity in the model by shrinking some coefficients to zero, effectively performing the feature selection.

Benefits of Lasso Regularization

Feature Selection: Lasso can automatically perform feature selection by setting the coefficients of less important features to zero.
Prevents Overfitting: By reducing the variance of the model, lasso helps to prevent overfitting.

Practical Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=42)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

plt.figure(figsize=(12, 6))
plt.plot(range(X.shape[1]), lasso.coef_, marker='o', linestyle='none')
plt.xlabel('Feature Index')
plt.ylabel('Coefficient Value')
plt.title('Lasso Coefficients')
plt.xticks(range(X.shape[1]))
plt.grid(True)
plt.show()

In the above code, alpha is the hyperparameter that we need to tune. It plays the role of lambda in the equation.

Choosing the Optimal Parameter

The value of lambda significantly impacts the sparsity and performance of the model. A higher value leads to a stronger penalty, potentially driving more coefficients to zero and risking underfitting. Conversely, a lower value provides less regularization, potentially resulting in overfitting.
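The effect of the penalty strength on sparsity can be sketched by fitting Lasso for several values of alpha and counting the non-zero coefficients. The dataset settings and the list of alphas are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data where only 3 of the 10 features carry signal
X, y = make_regression(
    n_samples=100, n_features=10, n_informative=3, noise=10, random_state=0
)

alphas = [0.01, 1.0, 10.0, 1000.0]
nonzero_counts = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    nonzero_counts.append(int(np.sum(lasso.coef_ != 0)))

for alpha, count in zip(alphas, nonzero_counts):
    print(f"alpha = {alpha:>7}: {count} non-zero coefficients")
```

A very small alpha leaves most coefficients in the model, while a very large alpha removes every feature and the model underfits. In practice a value in between is picked with cross-validation, for example with `LassoCV`.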

Feature Selection

Consider a more complex dataset with multiple features, only some of which actually influence the target. By fitting a Lasso model and examining the coefficients, we can determine which features are most important.


# Only 3 of the 10 features are informative, so Lasso has irrelevant features to remove
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=0.1, random_state=42)

lasso = Lasso(alpha=0.5)
lasso.fit(X, y)

plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), lasso.coef_)
plt.title('Lasso Coefficients with Strong Regularization')
plt.xlabel('Feature index')
plt.ylabel('Coefficient value')
plt.grid(True)
plt.show()

In this plot, the coefficients of the uninformative features should be zero, indicating that Lasso has kept only the most relevant features.

Conclusion

Lasso regularization is a robust technique for enhancing model interpretability and performance. By adding an L1 penalty to the linear regression objective function, Lasso encourages sparsity in the model, effectively performing feature selection. This helps in identifying the most relevant predictors and prevents overfitting, making it particularly useful in high-dimensional datasets.

Bias and Variance Tradeoff

Harsimranjit Singh — Sat, 04 May 2024 02:16:57 +0000

In the process of project building, bias and variance are two fundamental concepts that help us understand the behaviour and performance of predictive models.

Bias

Bias is the inability of a machine learning model to fit the training data. It measures how much the predictions of a model deviate, on average, from the true values we are trying to predict.
A high-bias model tends to underfit the data, meaning it fails to capture the underlying patterns and relationships present in the data.

Variance

Variance is defined as how much the predictions of a model vary when the training data changes (for different datasets). It shows the model's sensitivity to small fluctuations or noise in the training data. Models with high variance are overly complex and tend to capture not only the underlying patterns but also the noise present in the training data. As a result, they perform well on the training data but poorly on new, unseen data.

Let's Take a Real World Example

Let's imagine we built a machine-learning model to predict the output values. There may be some error while predicting the output.

We can decompose this prediction error into three parts:

Error due to bias
Error due to variance
Irreducible error

Decomposing the Error:

Imagine you are a party planner trying to guess how much food to order for your guests. Here's how the different errors in your prediction would play out:

Bias: This is like consistently underestimating how much your friends eat. Maybe you forget a key factor, like there always being a big group with healthy appetites, so you always order too little food (your predictions are consistently off in the same direction).
Variance: This is like your predictions being all over the place. Sometimes you order just the right amount, but other times you overestimate or underestimate. This happens because you are too focused on what your friends ate last time, which might not be a good indicator of how much they will eat this time.
Irreducible Error: This is like unexpected things happening, such as someone bringing a surprise dish or a guest with a smaller appetite than usual.

End Goal

Our goal is to make an overall good guess, one with low bias and low variance.

The Bias-Variance Tradeoff:

The key takeaway is that bias and variance have a tradeoff. Reducing one often leads to an increase in the other.
A high-bias model has low variance (underfitting), while a high-variance model has low bias (overfitting). The ideal scenario lies in achieving a balance between the two: a model with both bias and variance as low as possible. This sweet spot minimizes the overall error, leading to accurate and generalizable predictions on unseen data.
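The tradeoff can be estimated with a small simulation. The sketch below is built entirely on assumptions made for illustration: a sine curve as the true function, Gaussian noise, and polynomial fits of degree 1 (simple) and degree 5 (flexible). It trains each model on many noisy datasets and measures how far the average prediction is from the truth (squared bias) and how much predictions scatter across datasets (variance).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed ground truth for the simulation
x = np.linspace(0, 1, 20)
f_true = np.sin(2 * np.pi * x)

def bias2_and_variance(degree, n_datasets=200, noise=0.3):
    """Estimate squared bias and variance of a polynomial fit of a given degree."""
    predictions = np.empty((n_datasets, x.size))
    for i in range(n_datasets):
        # Each dataset shares the same x values but has fresh noise
        y = f_true + rng.normal(0, noise, x.size)
        coefficients = np.polyfit(x, y, degree)
        predictions[i] = np.polyval(coefficients, x)
    mean_prediction = predictions.mean(axis=0)
    bias2 = np.mean((mean_prediction - f_true) ** 2)
    variance = np.mean(predictions.var(axis=0))
    return bias2, variance

bias2_simple, var_simple = bias2_and_variance(degree=1)
bias2_flexible, var_flexible = bias2_and_variance(degree=5)

print(f"degree 1: bias^2 = {bias2_simple:.4f}, variance = {var_simple:.4f}")
print(f"degree 5: bias^2 = {bias2_flexible:.4f}, variance = {var_flexible:.4f}")
```

The straight line has the larger squared bias (it cannot follow the curve), while the degree-5 polynomial has the larger variance (it reacts more to the noise in each dataset).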

Conclusion

Today, we learned about the bias-variance tradeoff and what it means, with a real-life example.

Stay tuned for more articles on this topic. Next, we will dive deep into regularization and its types.

Gradient Descent: Optimizer Behind Machine Learning

Harsimranjit Singh — Sat, 30 Mar 2024 01:04:34 +0000

Today, we will discuss one of the most important topics in machine learning: Gradient Descent. Gradient descent is a way for our programs to learn and improve by constantly getting better at what they do.

At its core, Gradient Descent is an optimization algorithm used to minimize a loss function by iteratively moving towards the minimum.

Understanding the Intuition

Imagine you are blindfolded on a hill, trying to reach the bottom where the surface is flat. As you are blindfolded, you rely on feeling the slope beneath your feet to guide your steps. This is like Gradient Descent in machine learning.
In this scenario:

The Hill: The mathematical function of our model (the loss function).
The Hiker: The model that tries to find the best parameters.
Feeling the Slope: Calculating the gradient.
Descending: The steps taken to reach the bottom of the hill.

Why Gradient Descent over OLS

In our previous discussion, we used the OLS method to find the optimal values of the parameters, so why use gradient descent, which only provides approximate values? To compute the coefficients with OLS, we need the inverse of $X^{T}X$, which is computationally expensive: inverting an n × n matrix has a time complexity of roughly O(n^3), where n here is the number of features. Because of that, we use other methods to find or approximate the values of the coefficients.

Requirements for Functions

Before delving into Gradient Descent, it's crucial to understand the requirements for functions that this algorithm can work with.

Differentiability: The function must have a derivative at each point in its domain.
Convexity: A function is convex if it curves upward, like a bowl, so that the line segment connecting any two points on its graph lies on or above the graph itself. This ensures that any local minimum is also the global minimum, so following the slope downhill cannot get stuck in a poor solution.

Continuing with Gradient Descent

Now let's get back to our main topic and talk about Gradient Descent. As mentioned earlier, Gradient Descent is an optimization algorithm used to minimize a function by iteratively moving towards the minimum. Here's how it works:

Basic Idea

1. Start at a point: Begin at a random point on the function or surface.
2. Calculate Gradient: Calculate the gradient (derivative) of the function at that point. The gradient points in the direction of the steepest increase of the function.
3. Move Opposite to Gradient: Take a small step in the opposite direction of the gradient to decrease the function value.
4. Update Position: Update the position using the rule $\theta_{new} = \theta_{old} - \eta \nabla f(\theta_{old})$, where $\eta$ is the learning rate.
5. Repeat: Repeat this process until convergence or until the stopping criteria are met.

Stopping Criteria: Stop when the parameters no longer improve, i.e. when the difference between the old and new parameter values is negligible.
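The loop and the stopping criterion can be sketched on a one-dimensional function. Everything here is an illustrative assumption: the function f(x) = (x - 3)^2, whose gradient is 2(x - 3), the starting point, the learning rate, and the tolerance.

```python
def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a 1-D function given its gradient; return (minimum, iterations used)."""
    x = x0
    for i in range(max_iter):
        x_new = x - lr * grad(x)       # step against the gradient
        if abs(x_new - x) < tol:       # stopping criterion: negligible change
            return x_new, i + 1
        x = x_new
    return x, max_iter

# f(x) = (x - 3)^2 is convex with its minimum at x = 3
x_min, n_iterations = gradient_descent(lambda x: 2 * (x - 3), x0=-5.0)
print(f"minimum found at x = {x_min:.6f} after {n_iterations} iterations")
```

A learning rate that is too large makes the steps overshoot and diverge, while one that is too small makes convergence very slow; `max_iter` is a safety net for both cases.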

Learning gradient descent by applying it to simple linear regression

In this example, we will use a dataset generated with 'make_regression' from 'sklearn' to create a linear relationship between a single feature and a target variable with some added noise. Our goal is to find the optimal value of the intercept.
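The setup code for this dataset is not reproduced in the text. A minimal sketch of what it could look like follows; the sample count of 4 matches the `reshape(4)` used later, but the noise level and `random_state` are assumptions, so the fitted slope and intercept may differ from the values quoted below.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Assumed dataset settings (not taken from the article)
X, y = make_regression(
    n_samples=4, n_features=1, n_informative=1, n_targets=1,
    noise=80, random_state=13,
)

# OLS fit, used as the reference line that gradient descent should approach
reg = LinearRegression()
reg.fit(X, y)

print("OLS slope:", reg.coef_[0])
print("OLS intercept:", reg.intercept_)
```

With `reg` fitted, `reg.predict(X)` gives the points of the OLS line used for comparison in the plots.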

The red line in the plot represents the best-fit line calculated using the OLS method. This will be used to distinguish between the OLS and gradient descent.

Gradient Descent

Now, let's dive into gradient descent. In our example, let's assume an initial value of slope (m) = 78.35 and intercept (b) = 0. We will apply Gradient Descent to update the intercept (b) and plot the resulting line.

We calculate the slope of the loss with respect to b, i.e. the derivative of the squared-error loss $\sum_{i} (y_{i} - m x_{i} - b)^{2}$ with respect to b (we will discuss the detailed maths in the next articles):
loss_slope = -2 * np.sum(y - m * X.ravel() - b)

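import numpy as np
import matplotlib.pyplot as plt

# X, y and the fitted OLS model `reg` come from the dataset described above
m = 78.35  # slope, kept fixed in this example
b = 0      # initial intercept

# Slope of the loss with respect to the intercept b
loss_slope = -2 * np.sum(y - m * X.ravel() - b)

# Learning rate
lr = 0.1

# Update intercept using Gradient Descent
step_size = loss_slope * lr
b = b - step_size

# Predicted values with updated intercept
y_pred = (m * X + b).ravel()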

# Plotting
plt.scatter(X, y)
plt.plot(X, reg.predict(X), color='red', label='OLS')
plt.plot(X, y_pred, color='#00a65a', label='b = {}'.format(b))
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.title('Linear Regression with Gradient Descent (Iteration 1)')
plt.show()

In the plot, the green line represents the linear regression line with the intercept (b) updated by one iteration of Gradient Descent. We can see it has started moving from the initial guess towards the best-fit OLS line (red).

Iterative Updates

Now, let's continue the process for multiple iterations to see how the line moves. We will iterate through the Gradient Descent process multiple times, updating the intercept (b) based on the slope of the loss.

b = -100
m = 78.35
lr = 0.01

epochs = 100

for i in range(epochs):
    loss_slope = -2*np.sum(y-m*X.ravel()-b)
    b = b-(lr*loss_slope)
    y_pred = m*X +b
    plt.plot(X,y_pred)
plt.scatter(X,y)

In the final plot, we can see the regression line evolve towards the optimal intercept.

Conclusion

Gradient Descent is a powerful optimization technique used in various machine learning algorithms, especially in cases where the cost function is complex and not easily minimized through analytical methods like OLS. By iteratively updating parameters, such as the intercept (b) and slope (m) in linear regression, it moves step by step towards the values that minimize the loss.

Next, we will discuss in detail the mathematics behind this and fill in the gaps that are left out today. Stay tuned.
