DEV Community

Cover image for SmartKNN: An Interpretable Weighted Distance Framework for K-Nearest Neighbours
Jashwanth Thatipamula
Jashwanth Thatipamula

Posted on

SmartKNN: An Interpretable Weighted Distance Framework for K-Nearest Neighbours

SmartKNN vs Weighted_KNN & KNN: A Practical Benchmark on Real Regression Datasets

K-Nearest Neighbours is still widely used in industry because it’s simple, interpretable, and surprisingly strong on tabular data.
But vanilla KNN falls apart when the real world kicks in — noise, irrelevant features, skewed scales, and mixed data types.
**
This benchmark compares three variants across 31 real OpenML regression datasets:**

  • KNN

  • Weighted_KNN

  • SmartKNN

All experiments were performed on ~31 real regression datasets across two batches.


1. Benchmark Summary

Metric Weighted_KNN SmartKNN
Avg MSE (Batch 1) 4.146e7 4.181e7
Avg MSE (Batch 2) 2.354e6 1.423e6
Typical R² 0.10 – 0.50 0.50 – 0.88
RMSE trend Higher Lower

Interpretation:

Batch 1 was influenced by several outlier datasets with huge variance, which inflated MSE for both models.

Batch 2 shows the true behaviour: SmartKNN produces more accurate, stable, and variance-aware predictions.

SmartKNN consistently achieves higher R² and noticeably lower RMSE on realistic tabular tasks.


2. SmartKNN vs Weighted_KNN — Where Each One Shines

SmartKNN Strong Wins

Datasets [openML]: 622, 634, 637, 638, 645, 654, 656, 657, 659, 695, 712

SmartKNN performs exceptionally well on:

  • Medium-to-large tabular data

  • Mixed numeric/categorical datasets

  • High-variance or noisy features

  • Datasets with uneven feature importance or irrelevant columns

Weighted_KNN tends to break when noise or skewed scaling appears. SmartKNN stays stable due to feature weighting + filtering.

Weighted_KNN Wins

Datasets [openML] : 675, 683, 687, 690

Weighted_KNN has a slight advantage when:

  • The dataset is tiny

  • Features are clean and linear

  • No noise or irrelevant dimensions

  • Complexity is low

SmartKNN introduces small overhead on trivial datasets, which can slightly reduce performance.


SmartKNN vs Vanilla KNN

Metric SmartKNN KNN
Avg MSE (Batch 1) 1.304e6 1.613e6
Avg MSE (Batch 2) 4.622e7 4.649e7
R² trend Higher in complex datasets Higher in trivial datasets
  • SmartKNN Wins: 24

  • KNN wins: 7

SmartKNN is a substantially better baseline for regression tasks in modern ML pipelines.


Why SmartKNN Works Better

SmartKNN Component Effect
Weighted neighbour influence Handles noisy & imbalanced features
Adaptive feature scaling Reduces collapse on high variance datasets
Noise-aware preprocessing Boosts resilience to outliers
Feature filtering Removes weak or non-informative dimensions to improve signal clarity
Weighted Euclidean distance Improves neighbour ranking through data-driven feature importance

**
SmartKNN keeps the interpretability of KNN but fixes its long-standing weaknesses**


Code Example

from smart_knn import SmartKNN

model = SmartKNN(k=8, weight_threshold=0.009)
model.fit(X_train, y_train)
preds = model.predict(X_test)

Enter fullscreen mode Exit fullscreen mode

Install

pip install smart-knn

Enter fullscreen mode Exit fullscreen mode

Conclusion

SmartKNN isn’t trying to replace neural networks or ensemble methods.

It’s designed to be a modern, robust upgrade to classical KNN:

  • Handles noisy features

  • Adapts to complex datasets

  • Improves R² and RMSE significantly in most real-world scenarios

  • More stable on almost every real-world regression task

A modern KNN baseline for regression tasks.


Useful Links

GitHub Repo — https://github.com/thatipamula-jashwanth/smart-knn

Kaggle Notebook

DOI - https://doi.org/10.5281/zenodo.17713746

Huggingface - https://huggingface.co/JashuXo/smart-knn


Feedback & Collaboration

If you'd like to benchmark SmartKNN on your own dataset or contribute to upcoming features (classification mode, automated hyperparameter search), you're welcome to connect and collaborate.

Top comments (0)