Jashwanth Thatipamula

Posted on Nov 29, 2025

SmartKNN: An Interpretable Weighted Distance Framework for K-Nearest Neighbours

#career #ai #machinelearning #productivity

SmartKNN vs Weighted_KNN & KNN: A Practical Benchmark on Real Regression Datasets

K-Nearest Neighbours is still widely used in industry because it’s simple, interpretable, and surprisingly strong on tabular data.
But vanilla KNN falls apart when the real world kicks in — noise, irrelevant features, skewed scales, and mixed data types.
**
This benchmark compares three variants across 31 real OpenML regression datasets:**

KNN
Weighted_KNN
SmartKNN

All experiments were performed on ~31 real regression datasets across two batches.

1. Benchmark Summary

Metric	Weighted_KNN	SmartKNN
Avg MSE (Batch 1)	4.146e7	4.181e7
Avg MSE (Batch 2)	2.354e6	1.423e6
Typical R²	0.10 – 0.50	0.50 – 0.88
RMSE trend	Higher	Lower

Interpretation:

Batch 1 was influenced by several outlier datasets with huge variance, which inflated MSE for both models.

Batch 2 shows the true behaviour: SmartKNN produces more accurate, stable, and variance-aware predictions.

SmartKNN consistently achieves higher R² and noticeably lower RMSE on realistic tabular tasks.

2. SmartKNN vs Weighted_KNN — Where Each One Shines

SmartKNN Strong Wins

Datasets [openML]: 622, 634, 637, 638, 645, 654, 656, 657, 659, 695, 712

SmartKNN performs exceptionally well on:

Medium-to-large tabular data
Mixed numeric/categorical datasets
High-variance or noisy features
Datasets with uneven feature importance or irrelevant columns

Weighted_KNN tends to break when noise or skewed scaling appears. SmartKNN stays stable due to feature weighting + filtering.

Weighted_KNN Wins

Datasets [openML] : 675, 683, 687, 690

Weighted_KNN has a slight advantage when:

The dataset is tiny
Features are clean and linear
No noise or irrelevant dimensions
Complexity is low

SmartKNN introduces small overhead on trivial datasets, which can slightly reduce performance.

SmartKNN vs Vanilla KNN

Metric	SmartKNN	KNN
Avg MSE (Batch 1)	1.304e6	1.613e6
Avg MSE (Batch 2)	4.622e7	4.649e7
R² trend	Higher in complex datasets	Higher in trivial datasets

SmartKNN Wins: 24
KNN wins: 7

SmartKNN is a substantially better baseline for regression tasks in modern ML pipelines.

Why SmartKNN Works Better

SmartKNN Component	Effect
Weighted neighbour influence	Handles noisy & imbalanced features
Adaptive feature scaling	Reduces collapse on high variance datasets
Noise-aware preprocessing	Boosts resilience to outliers
Feature filtering	Removes weak or non-informative dimensions to improve signal clarity
Weighted Euclidean distance	Improves neighbour ranking through data-driven feature importance

**
SmartKNN keeps the interpretability of KNN but fixes its long-standing weaknesses**

Code Example

from smart_knn import SmartKNN

model = SmartKNN(k=8, weight_threshold=0.009)
model.fit(X_train, y_train)
preds = model.predict(X_test)

Install

pip install smart-knn

Conclusion

SmartKNN isn’t trying to replace neural networks or ensemble methods.

It’s designed to be a modern, robust upgrade to classical KNN:

Handles noisy features
Adapts to complex datasets
Improves R² and RMSE significantly in most real-world scenarios
More stable on almost every real-world regression task

A modern KNN baseline for regression tasks.

Useful Links

GitHub Repo — https://github.com/thatipamula-jashwanth/smart-knn

Kaggle Notebook

DOI - https://doi.org/10.5281/zenodo.17713746

Huggingface - https://huggingface.co/JashuXo/smart-knn

Feedback & Collaboration

If you'd like to benchmark SmartKNN on your own dataset or contribute to upcoming features (classification mode, automated hyperparameter search), you're welcome to connect and collaborate.

DEV Community