Machine Learning #3: KNN - You are who you surround yourself with

What is KNN?

K-Nearest Neighbors (KNN) is a non-parametric classification algorithm. Instead of learning explicit model parameters, it classifies a new sample by looking at the majority label among its K closest points in the training dataset, based on a chosen distance metric.

How do you determine the value of K?

Choosing a value of K can be tricky: if the value is too small, the model overfits the training data; if it's too large, the model underfits. There are various techniques to determine K, such as the Elbow method, where you plot K against a performance metric for a range of K values and choose the K with the best metric, as sketched below.
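A minimal sketch of this search using scikit-learn (the iris dataset and the 1–30 range are just example choices):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate a range of K values with 5-fold cross-validation.
k_values = range(1, 31)
scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in k_values
]

# Pick the K with the best mean accuracy.
best_k = k_values[int(np.argmax(scores))]
print(f"Best K: {best_k} (accuracy: {max(scores):.3f})")

Plotting k_values against scores shows the "elbow" visually; the snippet above just picks the best score directly.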

How does it work?

1.1 Loading Your Data

KNN is a computationally expensive algorithm because it needs to store and process the entire training dataset to make predictions. The larger your dataset, the more time and memory it requires.
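As a sketch of this step, here's how you might load a toy dataset with scikit-learn (iris is just an example; any labeled dataset works):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load a small example dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)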

1.2 Finding Distances of Data Point to Each Training Data

In this step, we calculate the distance between the new data point and each existing data point in the dataset. There are multiple ways to measure the distance between two vectors, including:

  • Euclidean Distance
  • Manhattan Distance
  • Cosine Similarity

The most commonly used metric is Euclidean distance, given by the formula

Euclidean Distance(A,B) = \sqrt{(A_x - B_x)^2 + (A_y - B_y)^2}

Where A and B are two data points, and the x and y subscripts denote their respective coordinates.

or a more generalized form

Euclidean Distance(A,B) = \sqrt{\sum_{i=1}^{n}(A_i - B_i)^2}

Note: Depending on your problem, you can also use other distance metrics such as Manhattan Distance or Cosine Similarity.
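As a rough sketch, all three metrics listed above can be written in a few lines of NumPy (the example points are made up; note that cosine similarity measures closeness, so it's converted to a distance by subtracting it from 1):

import numpy as np

def euclidean_distance(a, b):
    # Square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences.
    return np.sum(np.abs(a - b))

def cosine_distance(a, b):
    # 1 - cosine similarity, so that smaller means closer.
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean_distance(a, b))  # 5.0
print(manhattan_distance(a, b))  # 7.0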

1.3 Finding the K Nearest Neighbors

We sort the distances in increasing order and take the first K data points; these are the K nearest neighbors.
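In NumPy this step is a one-liner with np.argsort (a sketch, assuming distances holds the distance to every training point):

import numpy as np

distances = np.array([2.3, 0.5, 1.1, 3.8, 0.9])
k = 3

# Indices of the k smallest distances, in increasing order.
nearest_k = np.argsort(distances)[:k]
print(nearest_k)  # [1 4 2]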

1.4 Finding the most frequent class

Finally, we look at the class labels of these K nearest neighbors, count how many times each class appears, and assign the most frequent one to the new data point.
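collections.Counter makes the vote counting a one-liner (the labels here are made up):

from collections import Counter

neighbor_labels = ["cat", "dog", "cat"]

# most_common(1) returns [(label, count)] for the most frequent label.
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)  # cat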

Code Implementation


import numpy as np
from collections import Counter

class KNN:
    def __init__(self, K):
        self.k = K
        self.x = self.y = None

    def fit(self, x, y):
        # KNN is a lazy learner: "training" just stores the dataset.
        self.x = np.array(x)
        self.y = np.array(y)

    def predict(self, x):
        # x is a single sample (1D array of features).
        x = np.array(x)

        # Step 1: Euclidean distance from x to every training point.
        distances = np.sqrt(np.sum((self.x - x) ** 2, axis=1))

        # Step 2: indices of the K smallest distances.
        nearest_k = np.argsort(distances)[:self.k]

        # Step 3: majority vote among the K nearest labels.
        labels = self.y[nearest_k]
        return Counter(labels).most_common(1)[0][0]
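A quick usage example on made-up 2D points (purely illustrative):

# Two clusters of toy 2D points.
X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = ["A", "A", "A", "B", "B", "B"]

model = KNN(K=3)
model.fit(X_train, y_train)
print(model.predict([2, 2]))  # A
print(model.predict([8, 7]))  # B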

