Theory
The k-Nearest Neighbors (k-NN) algorithm is often described as one of the simplest algorithms in machine learning. To classify a point from the test dataset, it computes the distances between that point and every point in the training dataset and identifies the closest points, termed the "nearest neighbors." The method is not restricted to a single nearest neighbor: a chosen number (k) of nearest neighbors can be used when making a prediction.
Two distance metrics are commonly used: the Euclidean distance and the Manhattan distance. For two points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ with $n$ features each:

Euclidean distance:

$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$

Manhattan distance:

$d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$
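To make the metrics concrete, here is a minimal sketch computing both distances with NumPy for two hypothetical points p and q:

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 8.0])

# Euclidean distance: square root of the summed squared differences.
euclidean = np.sqrt(np.sum((p - q) ** 2))  # sqrt(9 + 16 + 25) ≈ 7.07

# Manhattan distance: sum of the absolute differences.
manhattan = np.sum(np.abs(p - q))  # 3 + 4 + 5 = 12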
The method is not limited to classification tasks; it can also be used for regression problems.
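For regression, the prediction is typically the mean target value of the k nearest neighbors rather than their majority class. A minimal sketch, assuming NumPy arrays for the training data (the function name knn_regress is illustrative, not from a library):

import numpy as np

def knn_regress(X_train: np.ndarray, y_train: np.ndarray, x: np.ndarray, k: int = 3) -> float:
    # Euclidean distances from the query point to every training point.
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # The prediction is the average target of those neighbors.
    return float(np.mean(y_train[nearest]))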
Implementation
To implement the k-NN algorithm, we will use the Iris dataset, which is a popular dataset for classification tasks. The dataset contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The target variable is the species of the iris flower, which can be one of three classes: setosa, versicolor, or virginica.
First, we will load the dataset and split it into training and test datasets.
pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection
dataset = datasets.load_iris()
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
y = pd.Series(data=dataset.target, name="target")
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)
print("samples: {}; features: {}".format(*X.shape))
print("samples: {}; values: {}".format(*y.shape, y.unique()))
samples: 150; features: 4
samples: 150; values: [0 1 2]
The following code snippet demonstrates an implementation of the k-NN algorithm using the Euclidean distance metric. For simplicity, this classifier uses k = 1, i.e. the label of the single nearest neighbor.
class KNeighborsClassifier:
    def __init__(self) -> None:
        self._X_train = None  # The training features, stored at fit time
        self._y_train = None  # The training target, stored at fit time

    def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
        """Fit the model from the training dataset.

        :param X: The training features.
        :param y: The training target.
        """
        # k-NN is a lazy learner: fitting simply stores the training data.
        self._X_train = X
        self._y_train = y

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict the class labels for the provided data.

        :param X: The data to be classified.
        :return: The class labels for the provided data.
        """
        class_labels = []
        for p0 in X.values:
            distances = []
            for p1 in self._X_train.values:
                # Calculate the Euclidean distance between the two points.
                distance = self.calculate_euclidean_distance(p0, p1)
                distances.append(distance)
            # With k = 1, the label of the nearest training point is the
            # prediction; a general k-NN classifier would instead take a
            # majority vote among the k nearest neighbors.
            nearest_index = np.array(distances).argmin()
            class_labels.append(self._y_train.values[nearest_index])
        return np.array(class_labels)

    def calculate_euclidean_distance(self, p0: np.ndarray, p1: np.ndarray) -> float:
        """Calculate the Euclidean distance between two points.

        :param p0: The first point.
        :param p1: The second point.
        :return: The Euclidean distance between the two points.
        """
        return np.sqrt(np.sum((p0 - p1) ** 2))
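The comment in predict hints at the general case with k > 1. As a sketch of how this could look (the helper predict_top_k below is illustrative, not part of the class above), the prediction becomes the majority vote among the k nearest neighbors:

from collections import Counter

def predict_top_k(model: KNeighborsClassifier, X: pd.DataFrame, k: int = 3) -> np.ndarray:
    class_labels = []
    for p0 in X.values:
        # Distances from the query point to every training point.
        distances = np.array([
            model.calculate_euclidean_distance(p0, p1)
            for p1 in model._X_train.values
        ])
        # Indices of the k smallest distances.
        nearest_indices = np.argsort(distances)[:k]
        # Majority vote: the most common label among the k neighbors wins.
        votes = Counter(model._y_train.values[i] for i in nearest_indices)
        class_labels.append(votes.most_common(1)[0][0])
    return np.array(class_labels)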
Now that we have implemented the k-NN algorithm, we can fit the model to the training dataset.
model = KNeighborsClassifier()
model.fit(X_train, y_train)
Finally, we can use the model to predict the class labels for the test dataset and evaluate the model's performance.
from sklearn.metrics import accuracy_score
y_test_pred = model.predict(X_test)
print(f"Accuracy score for test data: {accuracy_score(y_test, y_test_pred)}")
Accuracy score for test data: 0.9736842105263158
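As a sanity check, we could compare this against scikit-learn's built-in nearest-neighbor classifier with n_neighbors=1, which implements the same decision rule (importing the module rather than the class, since our own class shares its name):

from sklearn import neighbors

sk_model = neighbors.KNeighborsClassifier(n_neighbors=1)
sk_model.fit(X_train, y_train)
y_test_pred_sk = sk_model.predict(X_test)
print(f"Accuracy score for test data (scikit-learn): {accuracy_score(y_test, y_test_pred_sk)}")

On this split, the score should match the one above, though tie-breaking between equidistant neighbors can differ in principle.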