Jashwanth

Posted on Dec 29, 2025

SmartKNN Regression Benchmarks on High-Dimensional Datasets

#machinelearning #ai #webdev #programming

This release presents initial regression benchmarks for SmartKNN on datasets with many features (D between 77 and 124), with a focus on single-prediction p95 latency and R² under production-style constraints.

All benchmarks are:

CPU-only
Single-query inference
Non-parametric, nonlinear models
Large-scale datasets

More benchmarks (higher-dimensional datasets, classification tasks, mixed feature spaces) will be released soon.

Datasets Used

Dataset	OpenML ID	Approx Rows	Features (D)	Task	Source
Buzzinsocialmedia_Twitter	4549	466,600	77	Regression	OpenML
Allstate_Claims_Severity	44045	150,654	124	Regression	OpenML
College Scorecard	46674	99,759	118	Regression	OpenML
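For readers who want to pull the same data, here is a minimal sketch of loading the three datasets by their OpenML IDs with scikit-learn's `fetch_openml`. The helper name `load_dataset` is hypothetical, and the sketch assumes OpenML's default target column; the exact preprocessing and train/test split used for the benchmarks are not stated in this post.

```python
# OpenML IDs copied from the dataset table above.
OPENML_IDS = {
    "Buzzinsocialmedia_Twitter": 4549,
    "Allstate_Claims_Severity": 44045,
    "College Scorecard": 46674,
}


def load_dataset(name):
    """Download one benchmark dataset from OpenML and return (X, y).

    Hypothetical helper: it relies on OpenML's default target column,
    which may differ from the target used in the benchmarks.
    """
    # Imported lazily so the ID table is usable without scikit-learn installed.
    from sklearn.datasets import fetch_openml

    # as_frame=True returns pandas objects (requires pandas).
    return fetch_openml(data_id=OPENML_IDS[name], as_frame=True, return_X_y=True)
```

Calling `load_dataset("College Scorecard")` downloads and caches the data on first use, so it needs network access.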

Benchmark Results

Buzzinsocialmedia_Twitter — OpenML ID 4549

Model	RMSE ↓	R² ↑	Train (s)	Batch (ms)	Single Med (ms)	Single P95 (ms)
XGBoost	254.43	0.8274	22.21	0.005	0.228	0.280
LightGBM	214.79	0.8770	25.67	0.008	0.511	0.650
CatBoost	231.43	0.8572	39.53	0.000	0.809	1.021
SmartKNN (wt=0.0)	167.15	0.9255	214.39	0.060	0.383	0.561

Allstate_Claims_Severity — OpenML ID 44045

Model	RMSE ↓	R² ↑	Train (s)	Batch (ms)	Single Med (ms)	Single P95 (ms)
XGBoost	0.5355	0.5604	11.20	0.005	0.211	0.272
LightGBM	0.5356	0.5603	8.40	0.020	0.511	0.630
CatBoost	0.5408	0.5516	22.84	0.043	1.035	1.308
SmartKNN (wt=0.0)	0.6219	0.4071	51.51	0.062	0.305	0.366

College Scorecard — OpenML ID 46674

Model	RMSE ↓	R² ↑	Train (s)	Batch (ms)	Single Med (ms)	Single P95 (ms)
XGBoost	0.1855	0.6935	8.36	0.006	0.237	0.329
LightGBM	0.1864	0.6905	5.77	0.010	0.505	0.635
CatBoost	0.1946	0.6626	14.25	0.001	0.879	0.950
SmartKNN (wt=0.0)	0.2300	0.5290	27.31	0.054	0.248	0.286
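As a reference for the RMSE and R² columns, the sketch below computes both metrics from their standard definitions in plain Python. It is an illustration only, not the benchmark's own evaluation code, and the function names are assumptions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: lower is better, in the target's own units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1.0 is perfect, 0.0 matches predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot
```

Because RMSE is expressed in the units of each dataset's target, it is only comparable between models on the same dataset, while R² can be read across datasets.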

Notes

SmartKNN is a non-parametric, instance-based model with approximate nearest neighbor (ANN) acceleration.
Benchmarks emphasize tail latency (p95) rather than average inference time.
All results are reproducible using publicly available datasets.
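To make the "Single Med" and "Single P95" columns concrete, here is a minimal sketch of how single-query latency could be measured using only the standard library. The function name, the warm-up count, and the stand-in model are assumptions for illustration; the actual benchmark harness may differ.

```python
import statistics
import time


def measure_single_query_latency(predict_one, queries, warmup=10):
    """Time predict_one(query) one query at a time; return (median_ms, p95_ms)."""
    # Warm-up calls so caches and lazy initialisation do not skew the timings.
    for query in queries[:warmup]:
        predict_one(query)

    latencies_ms = []
    for query in queries:
        start = time.perf_counter()
        predict_one(query)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    median_ms = statistics.median(latencies_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95_ms = statistics.quantiles(latencies_ms, n=100)[94]
    return median_ms, p95_ms


def dummy_predict(row):
    """Stand-in for a real model's single-row predict call (hypothetical)."""
    return sum(row) / len(row)


# 1,000 synthetic queries with 77 features each, matching the smallest D above.
example_queries = [[float(i + j) for j in range(77)] for i in range(1000)]
```

A call such as `measure_single_query_latency(dummy_predict, example_queries)` returns the median and p95 in milliseconds. For a real model, `predict_one` would wrap a one-row call to its predict method.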

Positioning & Claim (Carefully Worded)

SmartKNN demonstrates competitive p95 single-prediction latency on CPU among non-parametric, nonlinear models at large-scale data sizes, while preserving instance-based decision behavior.

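While tree-based models remain strong on accuracy and average latency, these benchmarks suggest that KNN-style models can be competitive in tail latency, which is often the dominant concern in real production systems. In the tables above, SmartKNN's single-query p95 is lower than LightGBM's and CatBoost's on all three datasets and lower than XGBoost's on College Scorecard, while XGBoost has the lowest p95 on the other two. On accuracy, SmartKNN has the highest R² on Buzzinsocialmedia_Twitter and the lowest on Allstate_Claims_Severity and College Scorecard.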

To our knowledge, SmartKNN is among the fastest CPU-only, nonlinear, instance-based regression models evaluated at this scale with reported p95 single-query latency.

Reproducibility & Community Benchmarks

We strongly encourage the community to:

Run these benchmarks on different hardware
Test alternative ANN configurations
Compare against additional models
Share results publicly

If you:

Find a performance regression -> open a GitHub Issue
Have questions, ideas, or improvements -> start a GitHub Discussion
Run new benchmarks -> post your results

Community validation and feedback will directly shape future releases.
