SmartKNN vs Weighted_KNN & KNN: A Practical Benchmark on Real Regression Datasets
K-Nearest Neighbours is still widely used in industry because it’s simple, interpretable, and surprisingly strong on tabular data.
But vanilla KNN falls apart when the real world kicks in — noise, irrelevant features, skewed scales, and mixed data types.
**This benchmark compares three variants across 31 real OpenML regression datasets, run in two batches:**

- KNN
- Weighted_KNN
- SmartKNN
## 1. Benchmark Summary
| Metric | Weighted_KNN | SmartKNN |
|---|---|---|
| Avg MSE (Batch 1) | 4.146e7 | 4.181e7 |
| Avg MSE (Batch 2) | 2.354e6 | 1.423e6 |
| Typical R² | 0.10 – 0.50 | 0.50 – 0.88 |
| RMSE trend | Higher | Lower |
**Interpretation:**

- Batch 1 includes several outlier datasets with huge target variance, which inflated MSE for both models.
- Batch 2 is more representative: SmartKNN produces more accurate, stable, and variance-aware predictions.
- SmartKNN consistently achieves higher R² and noticeably lower RMSE on realistic tabular tasks.
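For reference, the two headline metrics can be computed as follows. This is a minimal pure-Python sketch; in practice `sklearn.metrics` provides `mean_squared_error` and `r2_score`.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]
print(round(rmse(y_true, y_pred), 3))  # → 0.612
print(round(r2(y_true, y_pred), 3))    # → 0.925
```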
## 2. SmartKNN vs Weighted_KNN — Where Each One Shines
### SmartKNN Strong Wins

OpenML dataset IDs: 622, 634, 637, 638, 645, 654, 656, 657, 659, 695, 712

SmartKNN performs exceptionally well on:

- Medium-to-large tabular data
- Mixed numeric/categorical datasets
- High-variance or noisy features
- Datasets with uneven feature importance or irrelevant columns
Weighted_KNN tends to break down when noise or skewed scaling appears; SmartKNN stays stable thanks to its feature weighting and filtering.
### Weighted_KNN Wins

OpenML dataset IDs: 675, 683, 687, 690

Weighted_KNN has a slight advantage when:

- The dataset is tiny
- Features are clean and roughly linear
- There are no noisy or irrelevant dimensions
- Overall complexity is low
SmartKNN introduces small overhead on trivial datasets, which can slightly reduce performance.
## SmartKNN vs Vanilla KNN
| Metric | SmartKNN | KNN |
|---|---|---|
| Avg MSE (Batch 1) | 1.304e6 | 1.613e6 |
| Avg MSE (Batch 2) | 4.622e7 | 4.649e7 |
| R² trend | Higher in complex datasets | Higher in trivial datasets |
SmartKNN wins: 24
KNN wins: 7
SmartKNN is a substantially better baseline for regression tasks in modern ML pipelines.
## Why SmartKNN Works Better
| SmartKNN Component | Effect |
|---|---|
| Weighted neighbour influence | Handles noisy & imbalanced features |
| Adaptive feature scaling | Reduces collapse on high variance datasets |
| Noise-aware preprocessing | Boosts resilience to outliers |
| Feature filtering | Removes weak or non-informative dimensions to improve signal clarity |
| Weighted Euclidean distance | Improves neighbour ranking through data-driven feature importance |
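The last two components in the table can be illustrated together. The sketch below is a simplified, hypothetical rendering of the weighting-plus-filtering idea, not SmartKNN's actual internals: features whose importance weight falls below a threshold are skipped, and the remaining features contribute to a weighted Euclidean distance.

```python
import math

def weighted_distance(a, b, weights, threshold=0.01):
    """Weighted Euclidean distance with feature filtering
    (illustrative only; not SmartKNN's actual implementation)."""
    total = 0.0
    for ai, bi, w in zip(a, b, weights):
        if w < threshold:
            continue  # feature filtering: drop weak features entirely
        total += w * (ai - bi) ** 2  # importance-weighted squared difference
    return math.sqrt(total)

# Feature 3 (weight 0.001) is treated as noise and skipped,
# so its wildly different values don't distort the distance.
weights = [0.6, 0.3, 0.001]
d = weighted_distance([1.0, 2.0, 100.0], [2.0, 4.0, -50.0], weights)
print(round(d, 4))  # → 1.3416
```

With plain Euclidean distance, the noisy third feature would dominate the computation; filtering it out keeps the neighbour ranking driven by informative dimensions.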
**SmartKNN keeps the interpretability of KNN but fixes its long-standing weaknesses.**
## Code Example

```python
from smart_knn import SmartKNN

# k = 8 neighbours; features whose learned weight falls below
# weight_threshold are filtered out before distance computation
model = SmartKNN(k=8, weight_threshold=0.009)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```
## Install

```shell
pip install smart-knn
```
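For a head-to-head run like the tables above, the vanilla-KNN side of the comparison can be reproduced with scikit-learn. The synthetic dataset and parameters below are illustrative only; SmartKNN would be fitted the same way via `model = SmartKNN(...)` and scored with the same metrics.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression data standing in for an OpenML dataset
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Vanilla-KNN baseline with the same k as the SmartKNN example
knn = KNeighborsRegressor(n_neighbors=8)
knn.fit(X_train, y_train)
preds = knn.predict(X_test)

print(f"MSE: {mean_squared_error(y_test, preds):.2f}")
print(f"R^2: {r2_score(y_test, preds):.3f}")
```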
## Conclusion

SmartKNN isn't trying to replace neural networks or ensemble methods. It's designed to be a modern, robust upgrade to classical KNN:

- Handles noisy features
- Adapts to complex datasets
- Improves R² and RMSE significantly in most real-world scenarios
- More stable on almost every real-world regression task

A modern KNN baseline for regression tasks.
## Useful Links

- GitHub repo: https://github.com/thatipamula-jashwanth/smart-knn
- Kaggle notebooks:
  - https://www.kaggle.com/code/jashwanththatipamula/smartknn-vs-weightedknn-benchmarks
  - https://www.kaggle.com/code/jashwanththatipamula/smartknn-vs-weightedknn-benchmarks-proof
  - https://www.kaggle.com/code/jashwanththatipamula/smartknn-benchmark-proof
  - https://www.kaggle.com/code/jashwanththatipamula/smartknn-benchmarks-proof
- DOI: https://doi.org/10.5281/zenodo.17713746
- Hugging Face: https://huggingface.co/JashuXo/smart-knn
## Feedback & Collaboration
If you'd like to benchmark SmartKNN on your own dataset or contribute to upcoming features (classification mode, automated hyperparameter search), you're welcome to connect and collaborate.