DEV Community

Jashwanth Thatipamula


SmartKNN Regression Benchmarks on High-Dimensional Datasets

This release presents initial regression benchmarks for SmartKNN on high-dimensional (large-D) datasets, focusing on single-prediction p95 latency and R² under production-style constraints.

All benchmarks are:

  • CPU-only
  • Single-query inference
  • Non-parametric, nonlinear models
  • Large-scale datasets

More benchmarks (higher-dimensional datasets, classification tasks, mixed feature spaces) will be released soon.


Datasets Used

| Dataset | OpenML ID | Approx. Rows | Features (D) | Task | Source |
| --- | --- | --- | --- | --- | --- |
| Buzzinsocialmedia_Twitter | 4549 | 466,600 | 77 | Regression | OpenML |
| Allstate_Claims_Severity | 44045 | 150,654 | 124 | Regression | OpenML |
| College Scorecard | 46674 | 99,759 | 118 | Regression | OpenML |
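For readers who want to pull the same data, the sketch below shows one way to fetch the datasets by the OpenML IDs listed above, using scikit-learn's `fetch_openml`. The helper name `load_openml_regression` is hypothetical, and the post does not state which target column, preprocessing, or train/test split was used, so the sketch assumes each dataset's default OpenML target.

```python
# OpenML IDs as listed in the dataset table.
OPENML_IDS = {
    "Buzzinsocialmedia_Twitter": 4549,
    "Allstate_Claims_Severity": 44045,
    "College Scorecard": 46674,
}


def load_openml_regression(name):
    """Fetch one benchmark dataset from OpenML as (X, y).

    Hypothetical helper. Needs scikit-learn, pandas and network access;
    the import is local so the ID table is usable without them.
    """
    from sklearn.datasets import fetch_openml

    # Assumption: the dataset's default OpenML target is the regression
    # target used in the benchmarks (the post does not say).
    X, y = fetch_openml(data_id=OPENML_IDS[name], as_frame=True, return_X_y=True)
    return X, y
```

Calling `load_openml_regression("College Scorecard")` downloads and caches the dataset on first use.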

Benchmark Results

Buzzinsocialmedia_Twitter — OpenML ID 4549

| Model | RMSE ↓ | R² ↑ | Train (s) | Batch (ms) | Single Median (ms) | Single P95 (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| XGBoost | 254.43 | 0.8274 | 22.21 | 0.005 | 0.228 | 0.280 |
| LightGBM | 214.79 | 0.8770 | 25.67 | 0.008 | 0.511 | 0.650 |
| CatBoost | 231.43 | 0.8572 | 39.53 | 0.000 | 0.809 | 1.021 |
| SmartKNN (wt=0.0) | 167.15 | 0.9255 | 214.39 | 0.060 | 0.383 | 0.561 |

Allstate_Claims_Severity — OpenML ID 44045

| Model | RMSE ↓ | R² ↑ | Train (s) | Batch (ms) | Single Median (ms) | Single P95 (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| XGBoost | 0.5355 | 0.5604 | 11.20 | 0.005 | 0.211 | 0.272 |
| LightGBM | 0.5356 | 0.5603 | 8.40 | 0.020 | 0.511 | 0.630 |
| CatBoost | 0.5408 | 0.5516 | 22.84 | 0.043 | 1.035 | 1.308 |
| SmartKNN (wt=0.0) | 0.6219 | 0.4071 | 51.51 | 0.062 | 0.305 | 0.366 |

College Scorecard — OpenML ID 46674

| Model | RMSE ↓ | R² ↑ | Train (s) | Batch (ms) | Single Median (ms) | Single P95 (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| XGBoost | 0.1855 | 0.6935 | 8.36 | 0.006 | 0.237 | 0.329 |
| LightGBM | 0.1864 | 0.6905 | 5.77 | 0.010 | 0.505 | 0.635 |
| CatBoost | 0.1946 | 0.6626 | 14.25 | 0.001 | 0.879 | 0.950 |
| SmartKNN (wt=0.0) | 0.2300 | 0.5290 | 27.31 | 0.054 | 0.248 | 0.286 |
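The RMSE and R² columns follow the standard definitions: RMSE is the square root of the mean squared error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal standard-library sketch is below; the function names and toy values are illustrative only and are not taken from the SmartKNN benchmark code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration, not benchmark data.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(f"RMSE={rmse(y_true, y_pred):.4f}  R2={r2_score(y_true, y_pred):.4f}")
```

Because RMSE is expressed in the target's units, it is only comparable between models on the same dataset, whereas R² is unitless.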

Notes

  • SmartKNN is a non-parametric, instance-based model with ANN acceleration.
  • Benchmarks emphasize tail latency (p95) rather than average inference time.
  • All results are reproducible using publicly available datasets.
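To make the tail-latency emphasis concrete, here is a rough sketch of how single-query median and p95 latency could be measured. It is not the harness used for these benchmarks: the function `single_query_latency_ms`, the warm-up count, and the stand-in `toy_predict` are all assumptions for illustration.

```python
import statistics
import time


def single_query_latency_ms(predict_one, queries, warmup=10):
    """Time predict_one on each query separately; return (median, p95) in ms."""
    # Untimed warm-up calls so one-off costs (caches, lazy init) do not
    # inflate the tail.
    for q in queries[:warmup]:
        predict_one(q)

    timings_ms = []
    for q in queries:
        start = time.perf_counter()
        predict_one(q)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    median = statistics.median(timings_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(timings_ms, n=100, method="inclusive")[94]
    return median, p95


# Hypothetical stand-in model: any callable taking one feature vector works.
def toy_predict(x):
    return sum(x) / len(x)


queries = [[float(i + j) for j in range(77)] for i in range(500)]
median_ms, p95_ms = single_query_latency_ms(toy_predict, queries)
print(f"median={median_ms:.4f} ms  p95={p95_ms:.4f} ms")
```

Timing each query individually, rather than dividing a batch time by the batch size, is what lets the p95 capture slow outliers that an average would hide.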

Positioning & Claim (Carefully Worded)

SmartKNN shows competitive p95 single-prediction latency on CPU among non-parametric, nonlinear models on large datasets, while preserving instance-based decision behavior.

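Tree-based models remain stronger on accuracy for two of the three datasets (Allstate_Claims_Severity and College Scorecard) and on batch latency throughout. On single-query tail latency, however, SmartKNN's p95 was lower than LightGBM's and CatBoost's on all three datasets and lower than XGBoost's on College Scorecard. This suggests that KNN-style models can be competitive on tail latency, which is often the dominant concern in production systems.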
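The tail-latency comparison can be checked directly against the reported numbers. The short sketch below copies the Single P95 (ms) column from the three result tables and lists, per dataset, which models were slower than SmartKNN; the helper name is illustrative.

```python
# Single P95 (ms) values copied from the three result tables.
P95_MS = {
    "Buzzinsocialmedia_Twitter": {
        "XGBoost": 0.280, "LightGBM": 0.650, "CatBoost": 1.021, "SmartKNN": 0.561,
    },
    "Allstate_Claims_Severity": {
        "XGBoost": 0.272, "LightGBM": 0.630, "CatBoost": 1.308, "SmartKNN": 0.366,
    },
    "College Scorecard": {
        "XGBoost": 0.329, "LightGBM": 0.635, "CatBoost": 0.950, "SmartKNN": 0.286,
    },
}


def models_slower_than_smartknn(p95_by_model):
    """Return the models whose reported p95 exceeds SmartKNN's, sorted by name."""
    ref = p95_by_model["SmartKNN"]
    return sorted(m for m, v in p95_by_model.items() if v > ref)


summary = {ds: models_slower_than_smartknn(vals) for ds, vals in P95_MS.items()}
for ds, slower in summary.items():
    print(ds, "->", slower)
```

On these figures, XGBoost keeps the lowest p95 on two of the three datasets, so the claim is about being competitive rather than uniformly fastest.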

To our knowledge, SmartKNN is among the fastest CPU-only, nonlinear, instance-based regression models evaluated at this scale with reported p95 single-query latency.


Reproducibility & Community Benchmarks

We strongly encourage the community to:

  • Run these benchmarks on different hardware
  • Test alternative ANN configurations
  • Compare against additional models
  • Share results publicly

If you:

  • Find a performance regression -> open a GitHub Issue
  • Have questions, ideas, or improvements -> start a GitHub Discussion
  • Run new benchmarks -> post your results

Community validation and feedback will directly shape future releases.


