Every other algorithm we've covered so far actually learns something during training. It builds a model, adjusts weights, grows a tree.
KNN does none of that.
It stores the entire training dataset. When a new example comes in, it finds the K most similar training examples and lets them vote. Majority wins.
That's it. No training. No model. Just memory and distance.
And yet it works surprisingly well on many problems. Understanding why it works, and more importantly when it fails, will make you a better ML practitioner.
What You'll Learn Here
- How KNN classifies using distance and voting
- The three most common distance metrics and when to use each
- How K affects the bias-variance tradeoff
- Why scaling is critical for KNN
- KNN for regression
- The curse of dimensionality and why KNN struggles with many features
- Weighted KNN and when it helps
How KNN Actually Works
You have a new point. You want to classify it.
KNN does three things:
- Calculate the distance from the new point to every training example
- Find the K training examples with the smallest distance
- Take a vote. The majority class among those K neighbors wins.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
# Simple 2D example to visualize
X, y = make_classification(
n_samples=50, n_features=2, n_redundant=0,
n_informative=2, random_state=42
)
# New point to classify
new_point = np.array([[0.5, 0.5]])
# Calculate distances from new_point to all training examples
distances = np.sqrt(np.sum((X - new_point) ** 2, axis=1))
# Find 5 nearest neighbors
K = 5
nearest_indices = np.argsort(distances)[:K]
nearest_labels = y[nearest_indices]
print(f"5 nearest neighbors labels: {nearest_labels}")
print(f"Votes: class 0 = {(nearest_labels == 0).sum()}, "
f"class 1 = {(nearest_labels == 1).sum()}")
print(f"Prediction: class {np.bincount(nearest_labels).argmax()}")
Output:
5 nearest neighbors labels: [1 1 0 1 0]
Votes: class 0 = 2, class 1 = 3
Prediction: class 1
Three neighbors said class 1. Two said class 0. Class 1 wins.
Distance Metrics
How you measure distance changes which neighbors get picked. In the formulas below, x and y are two points with components x1 through xn and y1 through yn. Three main options:
Euclidean Distance (default)
Straight-line distance. What you'd measure with a ruler.
d = sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2)
Works well when features are continuous and have similar scales.
Manhattan Distance
Sum of absolute differences along each axis. Like navigating a city grid.
d = |x1-y1| + |x2-y2| + ... + |xn-yn|
Often works better in high dimensions and is less sensitive to outliers in individual features.
Minkowski Distance
Generalization of both. Parameter p controls which one you get.
d = (|x1-y1|^p + |x2-y2|^p + ... )^(1/p)
p=1: Manhattan
p=2: Euclidean
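Here's a quick sketch that computes all three by hand on a pair of made-up points, just to make the formulas concrete (the numbers are arbitrary):
import numpy as np

# Two made-up points x and y, purely for illustration
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])

# Euclidean (p=2): square, sum, square root
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan (p=1): sum of absolute differences
manhattan = np.sum(np.abs(x - y))

# Minkowski with p=3: same recipe, different exponent
p = 3
minkowski_p3 = np.sum(np.abs(x - y) ** p) ** (1 / p)

print(f"Euclidean:     {euclidean:.3f}")
print(f"Manhattan:     {manhattan:.3f}")
print(f"Minkowski p=3: {minkowski_p3:.3f}")
Now let's see whether the choice of metric actually matters on a real dataset: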
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
import numpy as np
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Compare distance metrics
metrics = [
('euclidean', 2),
('manhattan', 1),
('minkowski_p3', 3),
]
print(f"{'Metric':<18} {'CV Accuracy'}")
print("-" * 32)
for name, p in metrics:
    knn = KNeighborsClassifier(n_neighbors=5, p=p)
    score = cross_val_score(knn, X_train_s, y_train, cv=5).mean()
    print(f"{name:<18} {score:.3f}")
Output:
Metric CV Accuracy
--------------------------------
euclidean 0.956
manhattan 0.958
minkowski_p3 0.955
On this dataset the differences are small. Try both euclidean and manhattan and pick whichever performs better on your validation set.
Choosing K: The Most Important Decision
K is the number of neighbors that vote. It directly controls the bias-variance tradeoff.
- Small K (K=1): very flexible boundary. Memorizes training data. High variance. Overfits easily.
- Large K: very smooth boundary. Makes the same prediction for large regions. High bias. Underfits.
from sklearn.metrics import accuracy_score
train_scores = []
test_scores = []
k_values = range(1, 51)
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_s, y_train)
    train_scores.append(accuracy_score(y_train, knn.predict(X_train_s)))
    test_scores.append(accuracy_score(y_test, knn.predict(X_test_s)))
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(k_values, train_scores, label='Train accuracy', color='blue')
plt.plot(k_values, test_scores, label='Test accuracy', color='orange')
plt.xlabel('K (number of neighbors)')
plt.ylabel('Accuracy')
plt.title('KNN: Accuracy vs K')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('knn_k_vs_accuracy.png', dpi=100)
plt.show()
best_k = k_values[np.argmax(test_scores)]
print(f"Best K: {best_k}, Test accuracy: {max(test_scores):.3f}")
Output:
Best K: 7, Test accuracy: 0.974
The sweet spot is usually somewhere between 3 and 20. Always use cross-validation to find it rather than guessing.
A quick rule of thumb: start with K = sqrt(n_training_samples). It's not perfect but gives you a reasonable starting point.
import math
starting_k = math.isqrt(len(X_train))
print(f"Suggested starting K: {starting_k}")
Why Scaling Is Not Optional for KNN
This is the most important thing to know about KNN. Without scaling, features with large ranges completely dominate the distance calculation.
import pandas as pd
# Example: two features with very different scales
data_example = pd.DataFrame({
    'age': [25, 30, 35],              # range 0-100
    'salary': [30000, 50000, 80000]   # range 0-200000
})
row0 = data_example.iloc[0].values    # [25, 30000]
point_a = np.array([26, 31000])   # close to row 0 in both features
point_b = np.array([26, 55000])   # same age as point_a, very different salary
# Unscaled Euclidean distances from row 0
dist_a = np.sqrt(np.sum((row0 - point_a) ** 2))
dist_b = np.sqrt(np.sum((row0 - point_b) ** 2))
print(f"Distance to point_a (close age, close salary): {dist_a:.1f}")
print(f"Distance to point_b (close age, far salary): {dist_b:.1f}")
print()
print("Salary difference of 25k completely dominates the distance.")
print("Age difference of 1 year is invisible.")
Output:
Distance to point_a (close age, close salary): 1000.0
Distance to point_b (close age, far salary): 25000.0
After scaling, both features contribute equally. Always apply StandardScaler before KNN.
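Here's a minimal sketch of the same toy age/salary example after scaling, so you can see both features pulling their weight (the values are the illustrative ones from above):
from sklearn.preprocessing import StandardScaler
import numpy as np

# Same toy values as above: columns are [age, salary]
rows = np.array([[25, 30000], [30, 50000], [35, 80000]], dtype=float)
point_a = np.array([[26, 31000]], dtype=float)  # close to row 0 in both features
point_b = np.array([[26, 55000]], dtype=float)  # same age, very different salary

scaler = StandardScaler()
rows_s = scaler.fit_transform(rows)
a_s = scaler.transform(point_a)[0]
b_s = scaler.transform(point_b)[0]

dist_a = np.linalg.norm(rows_s[0] - a_s)
dist_b = np.linalg.norm(rows_s[0] - b_s)

# After scaling, the 1-year age gap and the salary gaps are on comparable
# scales, so point_a (close in both features) now comes out much closer.
print(f"Scaled distance to point_a: {dist_a:.2f}")
print(f"Scaled distance to point_b: {dist_b:.2f}")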
Weighted KNN: Closer Neighbors Vote More
By default, all K neighbors get equal votes. With weights='distance', closer neighbors have more influence.
# Compare uniform vs distance weighting
for weights in ['uniform', 'distance']:
    scores = []
    for k in [3, 5, 7, 10, 15]:
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        score = cross_val_score(knn, X_train_s, y_train, cv=5).mean()
        scores.append(score)
    print(f"weights='{weights}': best CV = {max(scores):.3f} at K={[3,5,7,10,15][np.argmax(scores)]}")
Output:
weights='uniform': best CV = 0.965 at K=7
weights='distance': best CV = 0.967 at K=5
Distance weighting often helps slightly, especially when K is larger. It's worth trying both.
KNN for Regression
Instead of voting, neighbors average their values.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
X_h, y_h, test_size=0.2, random_state=42
)
scaler_h = StandardScaler()
X_train_hs = scaler_h.fit_transform(X_train_h)
X_test_hs = scaler_h.transform(X_test_h)
# Try different K values for regression
print(f"{'K':<6} {'R2':<8} {'RMSE'}")
print("-" * 25)
for k in [3, 5, 10, 20, 50]:
    knn_r = KNeighborsRegressor(n_neighbors=k, weights='distance')
    knn_r.fit(X_train_hs, y_train_h)
    y_pred_h = knn_r.predict(X_test_hs)
    r2 = r2_score(y_test_h, y_pred_h)
    rmse = np.sqrt(mean_squared_error(y_test_h, y_pred_h))
    print(f"{k:<6} {r2:.3f} {rmse:.3f}")
Output:
K R2 RMSE
-------------------------
3 0.692 0.632
5 0.706 0.621
10 0.718 0.608
20 0.717 0.608
50 0.697 0.630
K=10 looks best here. KNN regression is rarely competitive with Random Forest or XGBoost on large datasets but works fine on small ones.
The Curse of Dimensionality
This is why KNN breaks down with many features.
In low dimensions, points cluster naturally. Your nearest neighbors are genuinely similar to you.
In high dimensions (50+, 100+, 500+ features), something strange happens. The distances between all points start becoming similar. Every point is roughly the same distance from every other point. The concept of "nearest neighbor" breaks down.
import numpy as np
# Show how distance behavior changes with dimensions
np.random.seed(42)
for n_dims in [2, 10, 50, 100, 500, 1000]:
    # Generate 1000 random points in n_dims dimensions
    points = np.random.rand(1000, n_dims)
    # Distances from point 0 to every other point
    distances = np.sqrt(np.sum((points[1:] - points[0]) ** 2, axis=1))
    print(f"Dims={n_dims:<6} "
          f"Min dist={distances.min():.3f} "
          f"Max dist={distances.max():.3f} "
          f"Ratio={distances.max()/distances.min():.2f}")
Output:
Dims=2 Min dist=0.041 Max dist=1.321 Ratio=32.46
Dims=10 Min dist=0.744 Max dist=1.680 Ratio=2.26
Dims=50 Min dist=1.843 Max dist=2.398 Ratio=1.30
Dims=100 Min dist=2.683 Max dist=3.195 Ratio=1.19
Dims=500 Min dist=6.149 Max dist=6.671 Ratio=1.08
Dims=1000 Min dist=8.799 Max dist=9.268 Ratio=1.05
In 2D, the nearest neighbor is 32x closer than the farthest. In 1000 dimensions, it's only 1.05x closer. Nearest neighbors become meaningless.
This is called the curse of dimensionality. KNN degrades badly when you have many features. If you have 50+ features, use dimensionality reduction (PCA, covered in Post 68) before KNN, or switch to a different algorithm.
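If you want to try that remedy, here's a minimal sketch of putting PCA in front of KNN inside a pipeline. The breast cancer data only has 30 features, so this is just to show the pattern, and n_components=10 is an arbitrary illustrative choice, not a recommendation:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale, project onto a handful of principal components, then run KNN
pca_knn = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),   # illustrative; tune like any other hyperparameter
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])

score = cross_val_score(pca_knn, X, y, cv=5).mean()
print(f"CV accuracy with StandardScaler + PCA(10) + KNN: {score:.3f}")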
Full Pipeline With Best Practices
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, accuracy_score
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Use a Pipeline so scaling is always applied correctly
pipeline = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier())
])
# Tune K and weights together
param_grid = {
'knn__n_neighbors': [3, 5, 7, 9, 11, 15, 20],
'knn__weights': ['uniform', 'distance'],
'knn__p': [1, 2], # manhattan vs euclidean
}
grid = GridSearchCV(
pipeline,
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")
print(f"Test accuracy: {accuracy_score(y_test, grid.predict(X_test)):.3f}")
print()
print(classification_report(y_test, grid.predict(X_test),
target_names=data.target_names))
Using a Pipeline is cleaner and prevents data leakage: during cross-validation, the scaler inside the pipeline is fit only on the training folds, automatically.
KNN vs Other Algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import xgboost as xgb
models = {
'KNN (tuned)': grid.best_estimator_,
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
'SVM (rbf)': Pipeline([('s', StandardScaler()),
('m', SVC(kernel='rbf', gamma='scale', random_state=42))]),
'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42,
eval_metric='logloss', verbosity=0),
}
print(f"{'Model':<18} {'CV Mean':<10} {'CV Std'}")
print("-" * 38)
for name, m in models.items():
    scores = cross_val_score(m, X_train, y_train, cv=5)
    print(f"{name:<18} {scores.mean():.3f} {scores.std():.3f}")
Output:
Model CV Mean CV Std
--------------------------------------
KNN (tuned) 0.967 0.015
Random Forest 0.962 0.014
SVM (rbf) 0.974 0.013
XGBoost 0.967 0.016
KNN tuned properly is competitive here. On larger, messier datasets it falls behind. On small, clean datasets it's a solid choice.
The Things Everyone Gets Wrong
Mistake 1: Not scaling features
Covered above but worth repeating. KNN without scaling is broken. No exceptions.
Mistake 2: Using K=1
K=1 overfits to every noise point in training data. Always try K > 1.
Mistake 3: Using KNN on high-dimensional data without dimensionality reduction
Curse of dimensionality kills KNN performance on 50+ features. Apply PCA first.
Mistake 4: Using KNN on large datasets
Prediction time is O(n * d) for each new example, where n is the training set size and d is the number of features. With 1 million training examples, every single prediction requires computing 1 million distances. Use a tree-based index or approximate nearest neighbors (like Faiss or Annoy) for large-scale KNN.
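Before reaching for an external library, note that scikit-learn's KNN can build a tree index instead of brute-forcing every distance, which helps in low to moderate dimensions. A minimal sketch on synthetic data (sizes and values here are purely illustrative):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, purely illustrative: 100k points in 8 dimensions
rng = np.random.default_rng(0)
X_big = rng.normal(size=(100_000, 8))
y_big = (X_big[:, 0] + X_big[:, 1] > 0).astype(int)

# 'kd_tree' / 'ball_tree' build an index at fit time and prune distance
# computations at query time; 'brute' recomputes every distance per query.
knn_tree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn_tree.fit(X_big, y_big)
print(knn_tree.predict(X_big[:5]))

# In high dimensions the trees degrade back toward brute force; that's when
# approximate libraries like Faiss or Annoy earn the extra dependency.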
Mistake 5: Ignoring class imbalance
KNN votes by majority. If 90% of your neighbors are class 0 just because class 0 is dominant in your dataset, predictions will be biased. Use weights='distance' or oversample the minority class.
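One way to handle it, sketched below with made-up data: oversample the minority class with sklearn.utils.resample and keep weights='distance':
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample

# Toy imbalanced data, purely illustrative: 900 class-0 vs 100 class-1 points
rng = np.random.default_rng(42)
X_maj = rng.normal(loc=0.0, size=(900, 2))
X_min = rng.normal(loc=2.0, size=(100, 2))

# Oversample the minority class (with replacement) up to the majority count
X_min_up = resample(X_min, replace=True, n_samples=900, random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * 900 + [1] * 900)

# A balanced training set plus distance weighting keeps the vote from
# defaulting to the majority class
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn.fit(X_bal, y_bal)
print(f"Training class counts after oversampling: {np.bincount(y_bal)}")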
Quick Cheat Sheet
| Task | Code |
|---|---|
| Basic classifier | KNeighborsClassifier(n_neighbors=5) |
| Distance weighted | KNeighborsClassifier(n_neighbors=5, weights='distance') |
| Manhattan distance | KNeighborsClassifier(p=1) |
| Regressor | KNeighborsRegressor(n_neighbors=5) |
| Always scale first | StandardScaler() before KNN |
| Use pipeline | Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]) |
| Tune K | GridSearchCV with n_neighbors range |
| Starting K guess | math.isqrt(n_training_samples) |
Practice Challenges
Level 1:
Load load_digits(). Apply StandardScaler. Train KNN with K=5. Print accuracy. Then try K=1 and K=50. See the difference.
Level 2:
On the breast cancer dataset, plot accuracy vs K (1 to 50) for both weights='uniform' and weights='distance'. Which weighting wins at higher K values? Why?
Level 3:
Take the wine dataset (13 features). Apply PCA to reduce to 2, 5, and 10 components. Train KNN on each version. Compare accuracy. At what number of components does KNN perform best? Does dimensionality reduction actually help here?
References
- Scikit-learn: KNeighborsClassifier
- Scikit-learn: Nearest Neighbors guide
- StatQuest: K-Nearest Neighbors (YouTube)
- Curse of dimensionality explained
- Faiss: fast nearest neighbors at scale
Next up, Post 62: Naive Bayes: Fast, Simple, Surprisingly Effective. We use probability and Bayes theorem to classify text and other data in milliseconds. The math is simple once you see it.