SmartKNN v2.2 is a focused update aimed at making the library more scalable, predictable, and efficient when working with large datasets. While this is a minor version bump, the release introduces meaningful internal improvements that directly impact training-time performance and backend correctness, especially at scale.
This update does not change the public API or inference behavior, making it a safe upgrade for existing users.
Smarter Feature Weighting at Scale
Feature weighting based on Mutual Information (MI) plays a critical role in SmartKNN’s performance. In v2.2, MI computation has been optimized to better handle very high-dimensional datasets.
The key improvement is parallelized MI computation, which significantly reduces training time when the number of features is large. Importantly, the behavior for low- and medium-dimensional datasets remains unchanged, ensuring consistency and reproducibility for existing workflows.
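To make the idea concrete, here is a minimal sketch of what parallelized per-feature MI scoring can look like, chunking feature columns across workers with joblib and scikit-learn. The function name and chunking scheme are illustrative assumptions, not SmartKNN's actual internals:

```python
# Illustrative sketch only -- not SmartKNN's internal implementation.
# Per-feature mutual information is embarrassingly parallel: split the
# feature columns into chunks and score each chunk on a separate worker.
import numpy as np
from joblib import Parallel, delayed
from sklearn.feature_selection import mutual_info_classif

def parallel_mi_weights(X, y, n_jobs=-1, chunk_size=256):
    """Compute one MI score per feature, distributing column chunks across workers."""
    chunks = [np.arange(i, min(i + chunk_size, X.shape[1]))
              for i in range(0, X.shape[1], chunk_size)]
    scores = Parallel(n_jobs=n_jobs)(
        delayed(mutual_info_classif)(X[:, cols], y, random_state=0)
        for cols in chunks
    )
    return np.concatenate(scores)  # one weight per feature
```

Because each chunk is scored independently, results match the sequential computation exactly, which is consistent with the release's claim that low- and medium-dimensional behavior is unchanged.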
Correct Automatic Backend Selection
SmartKNN supports multiple backends, including brute-force and ANN-based approaches. In earlier versions, automatic backend selection could introduce unnecessary overhead for small datasets.
In v2.2, this logic has been corrected:
- The brute-force backend is now explicitly enforced below 10K rows
- ANN backends are avoided when they provide no practical benefit
This change improves correctness, reduces setup overhead, and ensures the most appropriate backend is used by default.
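A hypothetical sketch of the corrected selection rule, using the 10K-row threshold stated above (the function and constant names are illustrative; SmartKNN's real logic is internal):

```python
# Hypothetical sketch of the selection rule described above -- the real
# implementation lives inside SmartKNN and may differ in detail.
BRUTE_FORCE_MAX_ROWS = 10_000  # threshold stated in the release notes

def choose_backend(n_rows: int) -> str:
    # Small datasets: brute force is exact and skips ANN index build cost.
    if n_rows < BRUTE_FORCE_MAX_ROWS:
        return "brute_force"
    # Larger datasets: an ANN index amortizes its build time over many queries.
    return "ann"
```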
More Stable Feature Selection
Feature selection has been refined with updates to the Random Forest–based feature relevance logic. Improved split constraints make feature pruning more stable, particularly when dealing with noisy or skewed data distributions.
The result is more reliable feature selection without increasing model complexity or changing user-facing behavior.
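As an illustration of the general technique (the concrete constraint values below are assumptions, not SmartKNN's): tightening split constraints such as `min_samples_leaf` keeps importances from being dominated by noise, which in turn stabilizes importance-threshold pruning:

```python
# Illustrative only: one common way to stabilize importance-based pruning
# is to constrain splits so importances are not dominated by noise.
# The exact constraints SmartKNN uses are internal; values here are examples.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_features(X, y, threshold=0.01):
    rf = RandomForestClassifier(
        n_estimators=200,
        min_samples_leaf=5,   # forbid tiny leaves that fit noise
        max_features="sqrt",  # decorrelate trees for steadier importances
        random_state=0,
    )
    rf.fit(X, y)
    keep = np.where(rf.feature_importances_ >= threshold)[0]
    return keep  # indices of features that survive pruning
```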
Faster ANN Training for Very Large Datasets
For users working at scale, ANN index construction can be a major bottleneck. SmartKNN v2.2 introduces internal optimizations that significantly improve ANN training performance on multi-million-row datasets.
These changes:
- Improve overall scalability
- Reduce ANN index build time
Inference accuracy remains unchanged.
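The `nlist`/`nprobe` parameters mentioned in the final notes suggest an IVF-style index. Below is a minimal FAISS sketch of such a build; subsampling the training set is one standard way to cut index build time, though whether SmartKNN uses exactly this technique is not stated:

```python
# Minimal FAISS IVF build sketch -- an assumption about the backend style,
# not SmartKNN's code. Values for d, nlist, and nprobe are examples.
import numpy as np
import faiss

d = 128                                    # vector dimensionality
xb = np.random.rand(200_000, d).astype("float32")

nlist = 1024                               # number of IVF cells
quantizer = faiss.IndexFlatL2(d)           # coarse quantizer
index = faiss.IndexIVFFlat(quantizer, d, nlist)

# Training the coarse quantizer dominates build time; training on a
# subsample instead of the full dataset is a standard way to reduce it.
train_sample = xb[np.random.choice(len(xb), 50_000, replace=False)]
index.train(train_sample)
index.add(xb)                              # add all vectors to the index

index.nprobe = 32                          # cells probed per query (recall/speed trade-off)
```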
Measured Performance Improvements
Across internal benchmarks, the following training-time improvements were observed:
- Around 10% faster training on medium-sized datasets
- Up to 25% faster training on multi-million-row datasets
- Reduced ANN index build overhead for large-scale workloads
No regressions were observed in inference accuracy.
Improved Robustness During Inference
This release also fixes inference-time handling of NaN and Inf values in query inputs. SmartKNN now consistently emits a warning when invalid values are detected, while preserving existing normalization and prediction behavior.
This makes inference safer and easier to debug in real-world pipelines.
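A minimal sketch of the described behavior, assuming a NumPy query array (the helper name is hypothetical):

```python
# Minimal sketch of the behavior described above; SmartKNN's own check
# is internal and may differ.
import warnings
import numpy as np

def check_query(X_query: np.ndarray) -> np.ndarray:
    if not np.isfinite(X_query).all():  # catches both NaN and Inf
        warnings.warn(
            "Query contains NaN or Inf values; predictions may be unreliable.",
            RuntimeWarning,
        )
    return X_query  # normalization and prediction proceed unchanged
```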
Final Notes
- No API changes were introduced
- ANN inference behavior and tuning parameters (nlist, nprobe) remain unchanged
- Improvements primarily target training-time scalability and correctness
SmartKNN v2.2 is a safe, drop-in upgrade that makes the system faster and more predictable, especially for large-scale and production workloads.
If you’re running SmartKNN on big data, this “minor” release is very much worth it.
Try SmartKNN:

```
pip install smart-knn
```