Every other algorithm we've covered so far actually learns something during training. It builds a model, adjusts weights, grows a tree.
KNN does none of that.
It stores the entire training dataset. When a new example comes in, it finds the K most similar training examples and lets them vote. Majority wins.
That's it. No training. No model. Just memory and distance.
And yet it works surprisingly well on many problems. Understanding why it works, and more importantly when it fails, will make you a better ML practitioner.
What You'll Learn Here
- How KNN classifies using distance and voting
- The three most common distance metrics and when to use each
- How K affects the bias-variance tradeoff
- Why scaling is critical for KNN
- KNN for regression
- The curse of dimensionality and why KNN struggles with many features
- Weighted KNN and when it helps
How KNN Actually Works
You have a new point. You want to classify it.
KNN does three things:
- Calculate the distance from the new point to every training example
- Find the K training examples with the smallest distance
- Take a vote. The majority class among those K neighbors wins.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
# Simple 2D example to visualize
X, y = make_classification(
n_samples=50, n_features=2, n_redundant=0,
n_informative=2, random_state=42
)
# New point to classify
new_point = np.array([[0.5, 0.5]])
# Calculate distances from new_point to all training examples
distances = np.sqrt(np.sum((X - new_point) ** 2, axis=1))
# Find 5 nearest neighbors
K = 5
nearest_indices = np.argsort(distances)[:K]
nearest_labels = y[nearest_indices]
print(f"5 nearest neighbors labels: {nearest_labels}")
print(f"Votes: class 0 = {(nearest_labels == 0).sum()}, "
f"class 1 = {(nearest_labels == 1).sum()}")
print(f"Prediction: class {np.bincount(nearest_labels).argmax()}")
Output:
5 nearest neighbors labels: [1 1 0 1 0]
Votes: class 0 = 2, class 1 = 3
Prediction: class 1
Three neighbors said class 1. Two said class 0. Class 1 wins.
Distance Metrics
How you measure distance changes which neighbors get picked. In the formulas below, x and y are two points with components x1 through xn and y1 through yn. Three main options:
Euclidean Distance (default)
Straight-line distance. What you'd measure with a ruler.
d = sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2)
Works well when features are continuous and have similar scales.
Manhattan Distance
Sum of absolute differences along each axis. Like navigating a city grid.
d = |x1-y1| + |x2-y2| + ... + |xn-yn|
Often works better in high dimensions and is less sensitive to outliers in individual features.
Minkowski Distance
Generalization of both. Parameter p controls which one you get.
d = (|x1-y1|^p + |x2-y2|^p + ... )^(1/p)
p=1: Manhattan
p=2: Euclidean
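Here's a quick sketch that computes all three by hand on a pair of made-up points, just to make the formulas concrete (the numbers are arbitrary):
import numpy as np

# Two made-up points x and y, purely for illustration
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])

# Euclidean (p=2): square, sum, square root
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan (p=1): sum of absolute differences
manhattan = np.sum(np.abs(x - y))

# Minkowski with p=3: same recipe, different exponent
p = 3
minkowski_p3 = np.sum(np.abs(x - y) ** p) ** (1 / p)

print(f"Euclidean:     {euclidean:.3f}")
print(f"Manhattan:     {manhattan:.3f}")
print(f"Minkowski p=3: {minkowski_p3:.3f}")
Now let's see whether the choice of metric actually matters on a real dataset: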
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
import numpy as np
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Compare distance metrics
metrics = [
('euclidean', 2),
('manhattan', 1),
('minkowski_p3', 3),
]
print(f"{'Metric':<18} {'CV Accuracy'}")
print("-" * 32)
for name, p in metrics:
    knn = KNeighborsClassifier(n_neighbors=5, p=p)
    score = cross_val_score(knn, X_train_s, y_train, cv=5).mean()
    print(f"{name:<18} {score:.3f}")
Output:
Metric CV Accuracy
--------------------------------
euclidean 0.956
manhattan 0.958
minkowski_p3 0.955
On this dataset the differences are small. Try both euclidean and manhattan and pick whichever performs better on your validation set.
Choosing K: The Most Important Decision
K is the number of neighbors that vote. It directly controls the bias-variance tradeoff.
- Small K (K=1): very flexible boundary. Memorizes training data. High variance. Overfits easily.
- Large K: very smooth boundary. Makes the same prediction for large regions. High bias. Underfits.
from sklearn.metrics import accuracy_score
train_scores = []
test_scores = []
k_values = range(1, 51)
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_s, y_train)
    train_scores.append(accuracy_score(y_train, knn.predict(X_train_s)))
    test_scores.append(accuracy_score(y_test, knn.predict(X_test_s)))
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(k_values, train_scores, label='Train accuracy', color='blue')
plt.plot(k_values, test_scores, label='Test accuracy', color='orange')
plt.xlabel('K (number of neighbors)')
plt.ylabel('Accuracy')
plt.title('KNN: Accuracy vs K')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('knn_k_vs_accuracy.png', dpi=100)
plt.show()
best_k = k_values[np.argmax(test_scores)]
print(f"Best K: {best_k}, Test accuracy: {max(test_scores):.3f}")
Output:
Best K: 7, Test accuracy: 0.974
The sweet spot is usually somewhere between 3 and 20. Always use cross-validation to find it rather than guessing.
A quick rule of thumb: start with K = sqrt(n_training_samples). It's not perfect but gives you a reasonable starting point.
import math
starting_k = math.isqrt(len(X_train))
print(f"Suggested starting K: {starting_k}")
Why Scaling Is Not Optional for KNN
This is the most important thing to know about KNN. Without scaling, features with large ranges completely dominate the distance calculation.
import pandas as pd
# Example: two features with very different scales
data_example = pd.DataFrame({
    'age': [25, 30, 35],              # range 0-100
    'salary': [30000, 50000, 80000]   # range 0-200000
})
row0 = data_example.iloc[0].values    # [25, 30000]
point_a = np.array([26, 31000])   # close to row 0 in both features
point_b = np.array([26, 55000])   # same age as point_a, very different salary
# Unscaled Euclidean distances from row 0
dist_a = np.sqrt(np.sum((row0 - point_a) ** 2))
dist_b = np.sqrt(np.sum((row0 - point_b) ** 2))
print(f"Distance to point_a (close age, close salary): {dist_a:.1f}")
print(f"Distance to point_b (close age, far salary): {dist_b:.1f}")
print()
print("Salary difference of 25k completely dominates the distance.")
print("Age difference of 1 year is invisible.")
Output:
Distance to point_a (close age, close salary): 1000.0
Distance to point_b (close age, far salary): 25000.0
After scaling, both features contribute equally. Always apply StandardScaler before KNN.
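Here's a minimal sketch of the same toy age/salary example after scaling, so you can see both features pulling their weight (the values are the illustrative ones from above):
from sklearn.preprocessing import StandardScaler
import numpy as np

# Same toy values as above: columns are [age, salary]
rows = np.array([[25, 30000], [30, 50000], [35, 80000]], dtype=float)
point_a = np.array([[26, 31000]], dtype=float)  # close to row 0 in both features
point_b = np.array([[26, 55000]], dtype=float)  # same age, very different salary

scaler = StandardScaler()
rows_s = scaler.fit_transform(rows)
a_s = scaler.transform(point_a)[0]
b_s = scaler.transform(point_b)[0]

dist_a = np.linalg.norm(rows_s[0] - a_s)
dist_b = np.linalg.norm(rows_s[0] - b_s)

# After scaling, the 1-year age gap and the salary gaps are on comparable
# scales, so point_a (close in both features) now comes out much closer.
print(f"Scaled distance to point_a: {dist_a:.2f}")
print(f"Scaled distance to point_b: {dist_b:.2f}")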
Weighted KNN: Closer Neighbors Vote More
By default, all K neighbors get equal votes. With weights='distance', closer neighbors have more influence.
# Compare uniform vs distance weighting
for weights in ['uniform', 'distance']:
    scores = []
    for k in [3, 5, 7, 10, 15]:
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        score = cross_val_score(knn, X_train_s, y_train, cv=5).mean()
        scores.append(score)
    print(f"weights='{weights}': best CV = {max(scores):.3f} at K={[3,5,7,10,15][np.argmax(scores)]}")
Output:
weights='uniform': best CV = 0.965 at K=7
weights='distance': best CV = 0.967 at K=5
Distance weighting often helps slightly, especially when K is larger. It's worth trying both.
KNN for Regression
Instead of voting, neighbors average their values.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
X_h, y_h, test_size=0.2, random_state=42
)
scaler_h = StandardScaler()
X_train_hs = scaler_h.fit_transform(X_train_h)
X_test_hs = scaler_h.transform(X_test_h)
# Try different K values for regression
print(f"{'K':<6} {'R2':<8} {'RMSE'}")
print("-" * 25)
for k in [3, 5, 10, 20, 50]:
    knn_r = KNeighborsRegressor(n_neighbors=k, weights='distance')
    knn_r.fit(X_train_hs, y_train_h)
    y_pred_h = knn_r.predict(X_test_hs)
    r2 = r2_score(y_test_h, y_pred_h)
    rmse = np.sqrt(mean_squared_error(y_test_h, y_pred_h))
    print(f"{k:<6} {r2:.3f} {rmse:.3f}")
Output:
K R2 RMSE
-------------------------
3 0.692 0.632
5 0.706 0.621
10 0.718 0.608
20 0.717 0.608
50 0.697 0.630
K=10 looks best here. KNN regression is rarely competitive with Random Forest or XGBoost on large datasets but works fine on small ones.
The Curse of Dimensionality
This is why KNN breaks down with many features.
In low dimensions, points cluster naturally. Your nearest neighbors are genuinely similar to you.
In high dimensions (50+, 100+, 500+ features), something strange happens. The distances between all points start becoming similar. Every point is roughly the same distance from every other point. The concept of "nearest neighbor" breaks down.
import numpy as np
# Show how distance behavior changes with dimensions
np.random.seed(42)
for n_dims in [2, 10, 50, 100, 500, 1000]:
    # Generate 1000 random points in n_dims dimensions
    points = np.random.rand(1000, n_dims)
    # Distances from point 0 to every other point
    distances = np.sqrt(np.sum((points[1:] - points[0]) ** 2, axis=1))
    print(f"Dims={n_dims:<6} "
          f"Min dist={distances.min():.3f} "
          f"Max dist={distances.max():.3f} "
          f"Ratio={distances.max()/distances.min():.2f}")
Output:
Dims=2 Min dist=0.041 Max dist=1.321 Ratio=32.46
Dims=10 Min dist=0.744 Max dist=1.680 Ratio=2.26
Dims=50 Min dist=1.843 Max dist=2.398 Ratio=1.30
Dims=100 Min dist=2.683 Max dist=3.195 Ratio=1.19
Dims=500 Min dist=6.149 Max dist=6.671 Ratio=1.08
Dims=1000 Min dist=8.799 Max dist=9.268 Ratio=1.05
In 2D, the nearest neighbor is 32x closer than the farthest. In 1000 dimensions, it's only 1.05x closer. Nearest neighbors become meaningless.
This is called the curse of dimensionality. KNN degrades badly when you have many features. If you have 50+ features, use dimensionality reduction (PCA, covered in Post 68) before KNN, or switch to a different algorithm.
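If you want to try that remedy, here's a minimal sketch of putting PCA in front of KNN inside a pipeline. The breast cancer data only has 30 features, so this is just to show the pattern, and n_components=10 is an arbitrary illustrative choice, not a recommendation:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale, project onto a handful of principal components, then run KNN
pca_knn = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),   # illustrative; tune like any other hyperparameter
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])

score = cross_val_score(pca_knn, X, y, cv=5).mean()
print(f"CV accuracy with StandardScaler + PCA(10) + KNN: {score:.3f}")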
Full Pipeline With Best Practices
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, accuracy_score
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Use a Pipeline so scaling is always applied correctly
pipeline = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier())
])
# Tune K and weights together
param_grid = {
'knn__n_neighbors': [3, 5, 7, 9, 11, 15, 20],
'knn__weights': ['uniform', 'distance'],
'knn__p': [1, 2], # manhattan vs euclidean
}
grid = GridSearchCV(
pipeline,
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")
print(f"Test accuracy: {accuracy_score(y_test, grid.predict(X_test)):.3f}")
print()
print(classification_report(y_test, grid.predict(X_test),
target_names=data.target_names))
Using a Pipeline is cleaner and prevents data leakage: during cross-validation, the scaler inside the pipeline is fit only on the training folds, automatically.
KNN vs Other Algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import xgboost as xgb
models = {
'KNN (tuned)': grid.best_estimator_,
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
'SVM (rbf)': Pipeline([('s', StandardScaler()),
('m', SVC(kernel='rbf', gamma='scale', random_state=42))]),
'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42,
eval_metric='logloss', verbosity=0),
}
print(f"{'Model':<18} {'CV Mean':<10} {'CV Std'}")
print("-" * 38)
for name, m in models.items():
    scores = cross_val_score(m, X_train, y_train, cv=5)
    print(f"{name:<18} {scores.mean():.3f} {scores.std():.3f}")
Output:
Model CV Mean CV Std
--------------------------------------
KNN (tuned) 0.967 0.015
Random Forest 0.962 0.014
SVM (rbf) 0.974 0.013
XGBoost 0.967 0.016
KNN tuned properly is competitive here. On larger, messier datasets it falls behind. On small, clean datasets it's a solid choice.
The Things Everyone Gets Wrong
Mistake 1: Not scaling features
Covered above but worth repeating. KNN without scaling is broken. No exceptions.
Mistake 2: Using K=1
K=1 overfits to every noise point in training data. Always try K > 1.
Mistake 3: Using KNN on high-dimensional data without dimensionality reduction
Curse of dimensionality kills KNN performance on 50+ features. Apply PCA first.
Mistake 4: Using KNN on large datasets
Prediction time is O(n * d) for each new example, where n is the training set size and d is the number of features. With 1 million training examples, every single prediction requires computing 1 million distances. Use a tree-based index or approximate nearest neighbors (like Faiss or Annoy) for large-scale KNN.
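Before reaching for an external library, note that scikit-learn's KNN can build a tree index instead of brute-forcing every distance, which helps in low to moderate dimensions. A minimal sketch on synthetic data (sizes and values here are purely illustrative):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, purely illustrative: 100k points in 8 dimensions
rng = np.random.default_rng(0)
X_big = rng.normal(size=(100_000, 8))
y_big = (X_big[:, 0] + X_big[:, 1] > 0).astype(int)

# 'kd_tree' / 'ball_tree' build an index at fit time and prune distance
# computations at query time; 'brute' recomputes every distance per query.
knn_tree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn_tree.fit(X_big, y_big)
print(knn_tree.predict(X_big[:5]))

# In high dimensions the trees degrade back toward brute force; that's when
# approximate libraries like Faiss or Annoy earn the extra dependency.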
Mistake 5: Ignoring class imbalance
KNN votes by majority. If 90% of your neighbors are class 0 just because class 0 is dominant in your dataset, predictions will be biased. Use weights='distance' or oversample the minority class.
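One way to handle it, sketched below with made-up data: oversample the minority class with sklearn.utils.resample and keep weights='distance':
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample

# Toy imbalanced data, purely illustrative: 900 class-0 vs 100 class-1 points
rng = np.random.default_rng(42)
X_maj = rng.normal(loc=0.0, size=(900, 2))
X_min = rng.normal(loc=2.0, size=(100, 2))

# Oversample the minority class (with replacement) up to the majority count
X_min_up = resample(X_min, replace=True, n_samples=900, random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * 900 + [1] * 900)

# A balanced training set plus distance weighting keeps the vote from
# defaulting to the majority class
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn.fit(X_bal, y_bal)
print(f"Training class counts after oversampling: {np.bincount(y_bal)}")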
Quick Cheat Sheet
| Task | Code |
|---|---|
| Basic classifier | KNeighborsClassifier(n_neighbors=5) |
| Distance weighted | KNeighborsClassifier(n_neighbors=5, weights='distance') |
| Manhattan distance | KNeighborsClassifier(p=1) |
| Regressor | KNeighborsRegressor(n_neighbors=5) |
| Always scale first | StandardScaler() before KNN |
| Use pipeline | Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]) |
| Tune K | GridSearchCV with n_neighbors range |
| Starting K guess | math.isqrt(n_training_samples) |
Practice Challenges
Level 1:
Load load_digits(). Apply StandardScaler. Train KNN with K=5. Print accuracy. Then try K=1 and K=50. See the difference.
Level 2:
On the breast cancer dataset, plot accuracy vs K (1 to 50) for both weights='uniform' and weights='distance'. Which weighting wins at higher K values? Why?
Level 3:
Take the wine dataset (13 features). Apply PCA to reduce to 2, 5, and 10 components. Train KNN on each version. Compare accuracy. At what number of components does KNN perform best? Does dimensionality reduction actually help here?
References
- Scikit-learn: KNeighborsClassifier
- Scikit-learn: Nearest Neighbors guide
- StatQuest: K-Nearest Neighbors (YouTube)
- Curse of dimensionality explained
- Faiss: fast nearest neighbors at scale
Next up, Post 62: Naive Bayes: Fast, Simple, Surprisingly Effective. We use probability and Bayes theorem to classify text and other data in milliseconds. The math is simple once you see it.