
Koki Esaki


Brushing Up on k-NN for Classification in Python: Theory to Practice

Theory

The k-Nearest Neighbor (k-NN) algorithm is often described as one of the simplest algorithms in machine learning. It works by calculating the distances between the points in a test dataset and the points in a training dataset to identify the closest points, termed "nearest neighbors." The method is not restricted to a single nearest neighbor; a chosen number (k) of nearest neighbors can be consulted when making a prediction.

[Figure: k-NN]

Euclidean and Manhattan distances are commonly used to calculate distances.

Euclidean distance:

d = \sqrt{(b_1 - a_1)^2 + (b_2 - a_2)^2}

Manhattan distance:

d = |b_1 - a_1| + |b_2 - a_2|
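
As a quick sanity check, both metrics can be computed in a few lines of NumPy. This is a minimal sketch with two made-up points a and b, not part of the classifier built later in this post:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean: sqrt((4 - 1)^2 + (6 - 2)^2) = 5.0
print(np.sqrt(np.sum((b - a) ** 2)))

# Manhattan: |4 - 1| + |6 - 2| = 7.0
print(np.sum(np.abs(b - a)))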

This method is not limited to classification tasks; it can also be used for regression problems, as sketched below.
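
For regression, the same machinery applies; the usual approach is to average the target values of the k nearest neighbors instead of taking a class label. Here is a minimal sketch, assuming NumPy arrays X_train and y_train with a numeric target (the function name knn_regress is mine, purely for illustration):

import numpy as np

def knn_regress(X_train: np.ndarray, y_train: np.ndarray, x: np.ndarray, k: int = 3) -> float:
    # Euclidean distance from the query point to every training point.
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Indices of the k nearest training points.
    nearest = np.argsort(distances)[:k]
    # The prediction is the mean of the neighbors' target values.
    return float(np.mean(y_train[nearest]))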

Implementation

To implement the k-NN algorithm, we will use the Iris dataset, which is a popular dataset for classification tasks. The dataset contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The target variable is the species of the iris flower, which can be one of three classes: setosa, versicolor, or virginica.

First, we will load the dataset and split it into training and test datasets. With its default settings, train_test_split holds out 25% of the samples as the test set.

pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection


dataset = datasets.load_iris()
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
y = pd.Series(data=dataset.target, name="target")

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)

print("samples: {}; features: {}".format(*X.shape))
print("samples: {}; values: {}".format(*y.shape, y.unique()))
samples: 150; features: 4
samples: 150; values: [0 1 2]

The following code snippet implements the k-NN algorithm using the Euclidean distance metric. For simplicity, it uses only the single nearest neighbor (k = 1).

class KNeighborsClassifier:

    def __init__(self) -> None:
        self._X_train = None  # The training features to be saved
        self._y_train = None  # The training target to be saved

    def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
        """Fit the model from the training dataset.

        :param X: The training features.
        :param y: The training target.
        """

        self._X_train = X
        self._y_train = y

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict the class labels for the provided data.

        :param X: The data to be classified.
        :return: The class labels for the provided data.
        """
        classlabels = []
        for p0 in X.values:
            distances = []
            for p1 in self._X_train.values:
                # Calculate the Euclidean distance between two points.
                distance = self.calculate_euclidean_distance(p0, p1)
                distances.append(distance)

            # This implementation uses only the single nearest neighbor (k = 1):
            # the label of the closest training point becomes the prediction.
            # With a larger k, a majority vote among the k nearest points would be used.
            nearest_index = np.array(distances).argmin()
            classlabels.append(self._y_train.values[nearest_index])

        return np.array(classlabels)

    def calculate_euclidean_distance(self, p0: np.ndarray, p1: np.ndarray) -> float:
        """Calculate the Euclidean distance between two points.

        :param p0: The first point.
        :param p1: The second point.
        :return: The Euclidean distance between the two points.
        """
        return float(np.sqrt(np.sum((p0 - p1) ** 2)))
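The comment in predict already hints at the general case: with k > 1, the class label is decided by a majority vote among the k nearest neighbors. Below is a minimal sketch of that variant, written as a drop-in method for the class above; the n_neighbors parameter, the predict_k name, and the use of collections.Counter are my additions, not part of the original implementation.

from collections import Counter


def predict_k(self, X: pd.DataFrame, n_neighbors: int = 3) -> np.ndarray:
    """Predict class labels by majority vote among the n_neighbors nearest points."""
    classlabels = []
    for p0 in X.values:
        # Distances from the query point to every training point, computed at once.
        distances = np.sqrt(np.sum((self._X_train.values - p0) ** 2, axis=1))
        # Indices of the n_neighbors closest training points.
        nearest_indices = np.argsort(distances)[:n_neighbors]
        # Majority vote among the neighbors' class labels.
        votes = Counter(self._y_train.values[nearest_indices])
        classlabels.append(votes.most_common(1)[0][0])
    return np.array(classlabels)


KNeighborsClassifier.predict_k = predict_k  # attach the method to the class above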

Now that we have implemented the k-NN algorithm, we can fit the model to the training dataset.

model = KNeighborsClassifier()
model.fit(X_train, y_train)

Finally, we can use the model to predict the class labels for the test dataset and evaluate the model's performance.

from sklearn.metrics import accuracy_score

y_test_pred = model.predict(X_test)
print(f"Accuracy score for test data: {accuracy_score(y_test, y_test_pred)}")
Accuracy score for test data: 0.9736842105263158
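
As a cross-check, the same experiment can be run with scikit-learn's built-in k-NN classifier. With n_neighbors=1 it mirrors the single-nearest-neighbor logic above, so a similar accuracy is expected (the alias below is only to avoid clashing with our own class name):

from sklearn.neighbors import KNeighborsClassifier as SklearnKNeighborsClassifier

sk_model = SklearnKNeighborsClassifier(n_neighbors=1)
sk_model.fit(X_train, y_train)
y_test_pred_sk = sk_model.predict(X_test)
print(f"Accuracy score for test data: {accuracy_score(y_test, y_test_pred_sk)}")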
