Timothy Cummins

K Nearest Neighbors: Using What is Around You to Make a Prediction

Introduction

K Nearest Neighbors, also known simply as KNN, is a regression and classification algorithm that uses the points closest to a new observation to create predictions, with K being the number of points you want to look at for the prediction. That said, it works a little differently for regression and classification. For regression, KNN takes the average of the K nearest neighbors to make a prediction, and for classification, KNN takes the most common class among the nearest neighbors, simple as that. Now let me try to brush the dust off my art skills and make a visual.

[Figure: KNN classification example with K = 3]

Above is an example of KNN being used for classification, with the star being the point we want to predict using K = 3 (the 3 closest points). The star is classified as red because red is the most common class among those neighbors.
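
If you want to see that vote in plain Python, here is a minimal sketch (the neighbor labels and values below are made up just to mirror the picture):

from collections import Counter

# Hypothetical classes of the 3 nearest neighbors to the star
nearest_labels = ["red", "red", "blue"]

# Classification: predict the most common class among the neighbors
print(Counter(nearest_labels).most_common(1)[0][0])  # red

# Regression: predict the average of the neighbors' values instead
nearest_values = [4.0, 5.5, 6.0]
print(sum(nearest_values) / len(nearest_values))  # 5.166...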

Calculating Distance

To get more technical, since your points don't always exist on a 2D plane, let's go over how the distance is calculated. The most common distances used are the Manhattan distance, the Euclidean distance, and the Minkowski distance.

Manhattan distance

[Formula: Manhattan distance d(A, B) = Σ |Aᵢ − Bᵢ|]

The Manhattan distance has a lot to do with exactly what you think of when you hear Manhattan and math: counting the number of blocks you traveled. The formula is the summation of the absolute values of the distance you traveled in each direction. So if we start at location (0,0) and move to location (4,5), it is as simple as |0-4| + |0-5| = 9.

If you want to play with it in Python, it would look like the following:

# Locations of two points A and B
A = (0, 0)
B = (4, 5)

manhattan_distance = 0

# Use a for loop to iterate over each element
for i in range(2):
    # Calculate the absolute difference and add it
    manhattan_distance += abs(A[i] - B[i])

manhattan_distance

*Try adding more numbers to the A and B coordinates and then increasing the range to match the number of values in one of your coordinates.
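
If you would rather not adjust the range by hand every time, here is a small sketch of a helper that handles any number of dimensions (just a generalization of the loop above using zip):

def manhattan_distance(a, b):
    # Sum the absolute differences across every dimension
    return sum(abs(x - y) for x, y in zip(a, b))

print(manhattan_distance((0, 0), (4, 5)))        # 9
print(manhattan_distance((1, 2, 3), (4, 0, 3)))  # 5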

Euclidean distance

[Formula: Euclidean distance d(A, B) = √(Σ (Aᵢ − Bᵢ)²)]

This one may look like we are adding a lot to the calculation, but we really aren't. Take the formula with only two dimensions: a² + b² = c². That's right, it's just the Pythagorean theorem! The Euclidean distance is the distance between points when you travel in a straight line.

Again, if you want to try it out in Python, here is some code:

from math import sqrt 

# Locations of two points A and B
A = (2, 3, 5)
B = (1, -1, 3)

euclidean_distance = 0

# Use a for loop to iterate over each element
for i in range(3):
    # Calculate the difference, square, and add it
    euclidean_distance += (A[i] - B[i])**2

# Square root of the final result 
euclidean_distance = sqrt(euclidean_distance)

euclidean_distance

Minkowski distance

[Formula: Minkowski distance d(A, B) = (Σ |Aᵢ − Bᵢ|ᶜ)^(1/c)]

Lastly, of the distances I will cover, we have the Minkowski distance. I won't get too far into the mathematics of calculating this distance, except to say that it actually encompasses both the Manhattan and Euclidean distances: if you exchange c with 1 you have the Manhattan distance, and if you do the same with 2 you have the Euclidean distance. That is because the Minkowski distance is a generalized distance metric across a normed vector space.
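
To make that concrete, here is a sketch of the Minkowski distance in Python, with c as the exponent; plugging in c = 1 and c = 2 reproduces the Manhattan and Euclidean results from the code blocks above:

def minkowski_distance(a, b, c):
    # Sum of |differences|**c across dimensions, then take the c-th root
    return sum(abs(x - y) ** c for x, y in zip(a, b)) ** (1 / c)

A = (2, 3, 5)
B = (1, -1, 3)

print(minkowski_distance(A, B, 1))  # Manhattan distance: 7.0
print(minkowski_distance(A, B, 2))  # Euclidean distance: ~4.58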

How many K's should I use?

So now that we have gone through distance metrics and have an idea of how to find the "nearest neighbors", let's take a look at how many of those closest points to use. This can be tricky and depends on your dataset: if we choose a K that is too small, we will wind up picking up a lot of noise and our model will lean towards being "overfit"; if we choose too big of a K, all of our predictions will lean toward the same outcome and be "underfit". As a general rule it is best to start with K = sqrt(N), with N being your sample size, and to keep K odd if the number of classes is even. From here you can fit your model with different values of K around your starting value and compare the error rates.
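
As a rough sketch of how that comparison might look with scikit-learn (assuming the built-in iris dataset and a simple train/test split, purely for illustration):

from math import sqrt

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Start near sqrt(N) and keep K odd to help avoid ties
start = int(sqrt(len(X_train)))
if start % 2 == 0:
    start += 1

for k in range(start - 4, start + 5, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    error_rate = 1 - model.score(X_test, y_test)
    print(f"K={k}: error rate {error_rate:.3f}")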

Conclusion

I hope this helps give you a better understanding of what is going on behind the scenes when running a KNN model: how the distance is calculated and what the neighbors actually are.
