What is KNN?
K-Nearest Neighbors (KNN) is a non-parametric classification algorithm. Instead of learning explicit model parameters, it classifies a new sample by looking at the majority label among its K closest points in the training dataset, based on a chosen distance metric.
How do you determine the value of K?
Choosing a value of K can be tricky: if it is too small, the model overfits the training data; if it is too large, the model underfits. A common technique for picking K is the elbow method, where you plot K against a performance metric (such as validation accuracy) for a range of K values and choose the K with the best score.
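The idea above can be sketched in a few lines of Python. Everything here is illustrative: the toy dataset, the helper names `knn_predict` and `accuracy_for_k`, and the use of a held-out validation set to score each K are all assumptions, not part of the original article.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, point, k):
    """Classify `point` by majority vote among its k nearest training points."""
    distances = np.sqrt(((train_x - point) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return Counter(train_y[nearest]).most_common(1)[0][0]

def accuracy_for_k(train_x, train_y, val_x, val_y, k):
    """Fraction of validation points classified correctly for a given K."""
    preds = [knn_predict(train_x, train_y, p, k) for p in val_x]
    return np.mean(np.array(preds) == val_y)

# Toy data: two well-separated clusters (purely for illustration)
train_x = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
train_y = np.array([0, 0, 0, 1, 1, 1])
val_x = np.array([[0.5, 0.5], [5.5, 5.5]])
val_y = np.array([0, 1])

# Score a range of K values; in practice you would plot these
# and pick the K with the best metric.
for k in range(1, 6):
    print(k, accuracy_for_k(train_x, train_y, val_x, val_y, k))
```

On real data the curve is rarely this flat; you would typically see accuracy peak at some intermediate K, which is the value to choose.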
How does it work?
1.1 Loading Your Data
KNN is a computationally expensive algorithm because it needs to store and process the entire training dataset to make predictions. The larger your dataset, the more time and memory it requires.
1.2 Finding the Distance from the Data Point to Each Training Point
In this step, we calculate the distance between the new data point and each existing data point in your dataset. There are multiple ways to measure the distance between two vectors:
- Euclidean Distance
- Manhattan Distance
- Cosine Similarity
The most commonly used metric is Euclidean Distance, given by

d(p, q) = √((q₁ − p₁)² + (q₂ − p₂)²)

where p and q are two data points and (p₁, p₂) and (q₁, q₂) are the x and y coordinates of the respective data points. Or, in a more generalized n-dimensional form:

d(p, q) = √(Σᵢ₌₁ⁿ (qᵢ − pᵢ)²)

Note: Depending on your problem, you can also use other distance metrics such as Manhattan Distance or Cosine Similarity.
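These metrics are straightforward to write with NumPy. A minimal sketch (the function names are my own): note that cosine similarity measures similarity rather than distance, so in practice 1 − similarity is often used as the distance.

```python
import numpy as np

def euclidean(p, q):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def manhattan(p, q):
    # Sum of absolute coordinate differences
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

def cosine_similarity(p, q):
    # Cosine of the angle between the vectors; 1 means same direction.
    # This is a similarity score: 1 - cosine_similarity(p, q) gives a distance.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

print(euclidean([0, 0], [3, 4]))          # 5.0
print(manhattan([0, 0], [3, 4]))          # 7
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (perpendicular vectors)
```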
1.3 Finding the K Nearest Neighbours
We sort the distances in increasing order and take the first K data points, as these are the K nearest neighbours.
1.4 Finding the most frequent class
Finally, we look at the class labels of these K nearest neighbors, count how many times each class appears, and assign the most frequent one to the new data point.
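Steps 1.3 and 1.4 together can be sketched like this, using a hypothetical list of (distance, label) pairs as it would come out of step 1.2:

```python
from collections import Counter

k = 3
# Hypothetical (distance, label) pairs computed in step 1.2
distances = [(2.5, "A"), (0.9, "B"), (1.2, "A"), (3.1, "B"), (0.4, "A")]

# Step 1.3: sort by distance and keep the K closest points
nearest_k = sorted(distances, key=lambda d: d[0])[:k]

# Step 1.4: majority vote over the labels of the K neighbours
labels = [label for _, label in nearest_k]
prediction = Counter(labels).most_common(1)[0][0]
print(prediction)  # "A" (two of the three nearest points are labelled "A")
```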
Code Implementation
```python
import numpy as np
from collections import Counter


class KNN:
    def __init__(self, K):
        self.k = K
        self.x = self.y = None

    def fit(self, x, y):
        # KNN has no real training phase: just store the data
        self.x = np.array(x)
        self.y = np.array(y)

    def predict(self, x):
        # Predict the class of a single data point
        x = np.array(x)
        distances = []
        for i in range(len(self.x)):
            # Euclidean distance to the i-th training point
            distance = np.sqrt(np.sum((self.x[i] - x) ** 2))
            distances.append([distance, self.y[i]])
        # Sort by distance and keep the K nearest neighbours
        distances.sort(key=lambda d: d[0])
        nearest_k = distances[:self.k]
        # Majority vote among their labels
        labels = [label for _, label in nearest_k]
        return Counter(labels).most_common(1)[0][0]
```