AI Explore

Posted on Jun 28

samkhya v1.0: Plug Claude, GPT-4o-mini, or Local Ollama Into Your SQL Query Optimizer

#rust #database #opensource #datascience

TL;DR

samkhya v1.0 ships the first LLM-pluggable corrector backend for an embedded SQL query optimizer. Plug Anthropic Claude (claude-opus-4-7, claude-sonnet-4-6), OpenAI GPT-4o-mini, or local Ollama (llama3.2:1b) into the cardinality-estimation slot of DataFusion, DuckDB, or Polars via a 4-line HTTP wire contract. Two reference servers ship in the box: Python FastAPI (canonical) and Node TypeScript (broader operator appeal). It's all wrapped in a provable safety envelope so a hallucinating LLM cannot make your plan worse than the engine's native estimate. Apache-2.0, sole author. 13-crate Rust workspace.

cargo add samkhya-core --features llm_http · pip install samkhya
github.com/singhpratech/samkhya— 10 crates on crates.io, Python wheel on PyPI
Companion technical paper + gpudb sibling release

Measured numbers, all reported honestly. 40.95× LpJoinBound tightness over AGM on the star-5 join family at p=1 (Wilcoxon p=1.73×10⁻⁶, all 30 cells dominated). LLM transport-floor P95 0.07–0.11 ms across batch sizes 1/4/8/16/32— H1-A PASS, the LLM plug works. JOB-Slow head-to-head vs unmodified DataFusion 46 came in at 1.038× geomean wallclock (BCa 95% CI [1.026, 1.056], Wilcoxon p=3×10⁻⁶, 17 wins / 38 ties / 0 losses)— pre-registered ≥1.35× target falsified on magnitude. Never-regress holds.

Honestly projected, not yet measured. Live-LLM end-to-end latency cells (Claude, GPT-4o-mini, Ollama) are documented as PROJECTED pending API-key budget and a 30-trial campaign run. The mechanism is shipped and measured at the transport floor; the headline magnitudes are paper-projections sized from public per-token latencies.

Every embedded analytical engine on your laptop right now is committing the same crime in slow motion. DuckDB does it. DataFusion does it. Polars does it. ClickHouse-local does it. My own GPU engine gpudb does it. The crime is this: every time the process dies, eight years of incredible cardinality-estimation research dies with it, and the next session starts over from zero. The HyperLogLog your midnight ELT job built? Gone. The Bloom filter that knew exactly which customer IDs lived in that 400 GB Parquet partition? Gone. The histogram that took twelve seconds to compute over the join key? Gone. The optimizer wakes up at 9 a.m. with no memory of anything it ever learned and goes scanning the same columns again.

That waste isn't a bug. It's the inevitable consequence of a missing library— a library so obvious in hindsight that the absence is almost embarrassing. Apache DataSketches gave the world sketches but never wired them to a query optimizer. Iceberg's Puffin specification gave us a sidecar format but no producer/consumer library to fill it. AQO, the PostgreSQL-only adaptive estimator, has feedback but no portability between engines. The three pieces have been sitting on three different shelves for years, and nobody has bolted them together.

I shipped samkhya v1.0.0 today to bolt them together. And to do four other things while I was at it.

Repository: github.com/singhpratech/samkhya. Technical paper: /publications/samkhya-portable…. This is the engineering tour.

The five-layer stack, in plain English

samkhya is सांख्य— Sanskrit for "enumeration, counting"— the name of the classical darshana whose entire discipline is counting reality's constituents honestly. The library has exactly that job. It is a 13-crate Cargo workspace under a single Apache-2.0 license with the explicit §3 patent grant, structured as five replaceable layers, each failing safely toward the engine's native plan when the layer above is missing.

Layer 1— portable stats. Five classical sketch families ship in v1.0: HyperLogLog (precisions 4–18, measured RSE 0.676% at p=14 / n=10⁶— comfortably below the Flajolet 2007 0.8125% envelope); Bloom filters via Kirsch-Mitzenmacher double hashing; Count-Min; equi-depth histograms; and a 2D correlated histogram that captures the pairwise column dependencies the four scalar sketches miss. Each sketch carries a stable KIND tag (samkhya.hll-v1, samkhya.bloom-v1) and a to_bytes / from_bytes serialization contract. A Puffin sidecar produced by the Python wheel is byte-identical to one produced by the DuckDB extension and fully readable by the DataFusion adapter. That byte-identity is the moat. No engine owns the stats. The sidecar does.

Layer 2— feedback recorder. The recorder hooks query execution at the adapter boundary, captures (plan template, estimated rows, actual rows) triples, and writes them to a SQLite sidecar keyed by template. A per-template residual model— a gradient-boosted tree under 100 KB on disk in the default backend— learns the systematic bias between what the planner thought and what actually happened, then surfaces the correction as a hint. This is the observe-and-hint pattern from Stillger's LEO (IBM, 2001), Marcus's Bao (SIGMOD 2021 Best Paper), and Anneser's AutoSteer (VLDB 2023)— the only learned-QO pattern with documented production deployment, full stop. Cold start sees the native plan. The recorder fires only when it has evidence.

Layer 3— LpJoinBound. The never-regress guarantee, made provable. This is the non-negotiable contract every other layer must honour. Every corrected estimate is bounded from above by a pessimistic ceiling derived from LpBound (Zhang et al., SIGMOD 2025 Best Paper)— LP relaxation over ℓp-norms of degree sequences, no machine learning involved. The samkhya refinement is strictly tighter than the Atserias-Grohe-Marx AGM bound (PODS 2008) on the star-5 join family at p=1. Not approximately tighter. Strictly tighter on every cell of the 30-cell evaluation grid. Wilcoxon signed-rank test gives W=0, p=1.73×10⁻⁶— a complete dominance result. Translated to wallclock: 40.95× speedup over native DataFusion 46 LpBound tightness, BCa 95% CI [30.93, 47.45]. A correction that breaches the LpJoinBound ceiling is rejected and the native estimate is used in its place. The worst case is silently degenerate. Never catastrophic. That is the whole point.

Layer 4— GPU batch inference (opt-in, via gpudb). Subplan enumeration is embarrassingly parallel: each candidate is an independent forward pass through a small GBT or PFN. CPU does this serially. The GPU collapses what would be a thousand-iteration loop into one CUDA or Apple Silicon Metal kernel launch. When samkhya is paired with my gpudb extension, the correction model scores thousands of subplan candidates in a single launch. Strictly opt-in: the default cargo build --release --workspace links no CUDA, no Metal, no GPU runtime of any kind.

Layer 5— the pluggable Corrector backend. This is the headline. One Rust trait (Corrector), four shipped backend slots:

GBT (default). Gradient-boosted-tree, sub-MB on disk, sub-millisecond inference, no external dependencies. The conservative classical bet, and what the cold-start safety analysis assumes.
TabPFN-2.5 (opt-in, tabpfn_http feature). Hollmann et al. ICLR 2023 + Prior Labs 2026— foundation tabular model. Measured P95 31.15 ms at batch size 8, sequence length 128, on RTX 4090 Laptop (BCa 95% CI [29.39, 35.32]); q-error reduction vs GBT on synthetic 7.84% (BCa 95% CI [2.21, 14.62], p=1.04×10⁻⁵).
LLM-via-HTTP (opt-in, llm_http feature). The viral one. samkhya-core::residual::llm::LlmHttpCorrector calls an HTTP server you control. Two reference servers ship in samkhya-gpudb/scripts/— llm_infer_server.py (Python FastAPI, canonical) and llm_infer_server.ts (Node TypeScript port, broader operator appeal). Documented backends: Anthropic Claude (claude-opus-4-7, claude-sonnet-4-6), OpenAI GPT-4o-mini, local Ollama (llama3.2:1b via http://127.0.0.1:11434). Plug your own foundation model in ~50 lines of glue. Wire contract is dead-simple: POST /infer {"features": [...], "baseline_estimate": <u64>} → {"estimate": <u64>}.
Dummy (transport-floor). Echo backend for measuring wire overhead independent of model latency. Produced the H1-A PASS— P95 0.07–0.11 ms across batch sizes 1/4/8/16/32, proving the LLM plug works at sub-millisecond cost when the model isn't the bottleneck.

The safety story is what makes this sane. Every backend— Claude, GPT, Ollama, GBT, TabPFN, anything you write tomorrow— is clamped from above by the LpJoinBound ceiling at Layer 3. If a hallucinating LLM returns "the join is 10¹² rows," LpJoinBound says "the provable ceiling is 4.2 million" and the planner sees 4.2 million. A miscalibrated TabPFN, a stale GBT, an LLM with a cosmic-ray bit-flip— none of them can break never-regress, by construction. That's the contract the field has been waiting for: let foundation models help with query planning, without giving them the keys.

What's measured, what's projected— being explicit. The transport-floor latency (0.07–0.11 ms P95) and the wire contract are MEASURED on the dummy backend (full receipts in bench-results/19_llm_corrector.md). The live-LLM end-to-end latency cells— Claude (~1.2s P95 paper-projection), GPT-4o-mini (comparable), Ollama (latency-bounded by local hardware)— are PROJECTED pending API-key budget and the 30-trial measurement campaign. The mechanism ships in v1.0; the headline live numbers are next-revision work. I would rather you know that now than discover it when you read the bench-results dossier.

The LpJoinBound clamp, visualised. Claude (amber), GPT-4o-mini (cyan), and local Ollama (emerald) each contribute a cardinality estimate; the crystalline lens— the provable pessimistic envelope— refracts and bounds every stream before it reaches the query-plan tree below. A hallucinating LLM cannot exceed the ceiling. The worst case is the engine's native plan.

Now the part nobody else in this field tells you

The standard playbook in learned cardinality estimation— for about eight years now, across roughly fifty papers— has been: hand-tune a system against a target workload, report the geometric mean of the wins in the headline, mention the losses in a §9 disclosures section that nobody reads. The credibility deficit that resulted from this playbook is exactly as large as the cumulative gap between those papers and the production deployments that followed them. Which is to say large. Naru is dead. NeuroCard is dead. MSCN is dead. DeepDB is dead. BayesCard is dead. None of them shipped into a production database. The 2021–2022 critique papers (Are We Ready For Learned CE?, In-depth Study of Learned CE) wrote the obituary collectively. The field has been quiet about why ever since.

Before the WAVE4-F head-to-head against native DataFusion 46 on the IMDb Join-Order-Benchmark Slow subset (n=55 paired warm-cache, scale factor 1), I pre-registered three upper-bound performance claims. ≥1.6× geometric mean on join-heavy queries. ≥1.35× on aggregate-heavy. ≥1.50× overall headline.

All three were falsified.

The measured geometric mean is 1.038× wallclock— BCa 95% CI [1.026, 1.056], Wilcoxon W=212 p=3.00×10⁻⁶. The Benjamini-Hochberg FDR procedure at q=0.05 rejects the null on 24 of 55 cells. The full record across the suite is 17 wins / 38 ties / 0 losses. Never-regress holds— that is the whole point of the LpJoinBound clamp— but the magnitude of the wins is small, and I am reporting it as such, in the headline section, where this sentence lives.

The TabPFN-2.5 backend had two pre-registered hypotheses. H1-A (P95 below 50 ms) passed comfortably. H1-B (≥15% q-error reduction over GBT on synthetic) failed on magnitude— measured reduction is 7.84%, statistically real, half the pre-registered effect size. The paper reports H1-B as falsified.

I am telling you this here, in the launch post, on the first day, in the second-most-prominent section, because the alternative is the §9-disclosures playbook that hollowed out this field's credibility in the first place. The cost of admitting a falsified pre-reg in public is smaller than the cost of having someone discover it on their own three months later. The 17/38/0 record is the appropriate evidence to weigh: a real, modest, statistically significant, never-regress improvement, in the only kind of public paired-warm-cache head-to-head whose results would actually deserve to be believed.

The interesting number in this release is not the JOB-Slow geomean. It is the 40.95× LpJoinBound tightness over AGM on the star-5 family. That is a theoretical-tightness result, with a wallclock translation on the synthetic topology, and it is the right way to communicate what this library actually contributes.

The 1000 → 42 demo (thirty seconds to feel the mechanism)

The repository ships a stats_propagation_demo example that proves the end-to-end path in plain Rust. A 1000-row table wrapped in DataFusion 46's default TableProvider reports num_rows = 1000 to the physical plan. Wrap the same provider with SamkhyaTableProvider plus the optimizer rule, and the physical plan reports num_rows = 42. The example prints, verbatim: "without rule: 1000, with rule: 42."

The DataFusion integration is intentionally a five-line change:

use datafusion::prelude::SessionContext;
use samkhya_datafusion::{SamkhyaTableProvider, SamkhyaOptimizerRule};

let ctx = SessionContext::new();
ctx.state().add_optimizer_rule(Arc::new(SamkhyaOptimizerRule::default()));
let provider = SamkhyaTableProvider::wrap(inner_provider).with_puffin_sidecar("orders.puffin")?;
ctx.register_table("orders", Arc::new(provider))?;

The samkhya_leaves_seen diagnostic on the optimizer rule confirms the corrected stats reached the physical plan. No fork of DataFusion required. DataFusion 46's Distribution framework already accepts external column statistics— samkhya simply supplies better ones.

What is in the v1.0 box, plainly

13 crates. Approximately 266 #[test] blocks across the workspace. 17 property tests. ~31 million cargo-fuzz executions, zero crashes. Criterion microbenchmarks for sketches and Puffin I/O. clippy -D warnings clean. The full workspace builds in under two minutes on a laptop with no network access. An ACM Artifact Evaluation v1.1 reviewer entry ships in REPRODUCIBILITY.md— Functional, Reusable, Available badges all in reach. Full reproduction budget for the published numbers: roughly 90 minutes wallclock on the reference hardware. Apache-2.0 single license with explicit §3 patent grant, matching DataFusion, Iceberg, Arrow, and ClickHouse— every downstream user gets the same patent grant, not a dual-license toggle.

Engine status, no hedge words:

DataFusion— three-layer integration (SamkhyaTableProvider + SamkhyaStatsExec + SamkhyaOptimizerRule). Production. First-class target.
DuckDB— Rust-client integration via the bundled feature, production. The cxx extension (cdylib + runtime LOAD samkhya;) ships as staticlib+rlib in v1.0; cdylib waits on upstream DuckDB Issue #11638.
Polars— Beta. Series-to-sketch helpers + lazy_collect_with_feedback behind the engine feature. Optimizer hook is upstream-blocked on Polars Issue #23345.
Postgres— Scaffold only. pgrx-shaped, double-gated behind pg_extension feature + samkhya_pgrx_enabled rustc cfg, pinned to PG17. Real planner/executor hooks land in v1.1 after pgrx ≥ 0.13.
Iceberg— Production. Puffin reader/writer with KIND-tag registration for all five sketch types.
Arrow— Production. IPC round-trip helpers, byte-identical serialization for all five sketch types.
gpudb— CPU fallback production. GPU and TabPFN-2.5 HTTP backends opt-in behind tabpfn_http.
Python— Single abi3-py39 wheel on PyPI as samkhya. Covers the portable-stats layer (sketches, Puffin reader/writer, ColumnStats). Use case: dbt-style nightly ELT writes the Puffin sidecar next to the Parquet file; the morning's DuckDB/DataFusion ad-hoc queries inherit it for free. No Python ML stack required.

Why this is the right shape now

The cardinality-estimation field has spent eight years producing extraordinary research and almost no embedded-tier production code. The pattern is older than the learned-CE wave. Stillger's LEO at IBM in 2001 was the first feedback-driven query optimizer to reach a mainstream server-class DBMS— and it was the last. Every learned system since has carried the same shape: server-class assumption about a long-lived optimizer process, model footprint that doesn't fit the embedded tier (40–300 MB), inference latency the embedded tier cannot afford (5–50 ms), and a cold-start story that handwaves around the worst case.

The 2021–2022 critique papers said this plainly. The production-database field then routed around it via adaptive query execution— a gorgeous technique that is structurally inapplicable to engines without a long-lived process to adapt within. The embedded tier— DuckDB, DataFusion, Polars, gpudb— has been waiting for a library that addresses its constraints, not the constraints of the systems whose obituaries those critique papers were writing.

samkhya v1.0 is my bet that the three pieces an embedded engine actually needs— portable sketches, feedback-driven residuals, a provable safety envelope— are independent enough to ship as one library, and that the pluggable model-backend slot above them is the right level of abstraction to admit whatever the field consolidates on next. GBT default is the conservative classical bet. TabPFN-2.5 opt-in is the bet that foundation tabular models are real and that the right way to use them is behind a pessimistic envelope rather than as a replacement for one. The LLM-via-HTTP backend is the bet that the third wave is already here— Claude, GPT-4o-mini, and Ollama are usable today, the wire contract is intentionally trivial so swapping providers is a 50-line change, and the safety envelope means the worst case is the engine's native plan. The library doesn't have to know what model you plugged in.

The honest measurement is the second bet: that the field's credibility deficit is repairable by pre-registration and falsification reporting, and that the right way for a sole-author project to participate in that repair is to do it first and conspicuously. Falsified pre-regs are information. The papers that hid theirs are the reason this field has the reputation it has.

What ships next, in the order I will close it

The v1.0 limitations are named explicitly in §9 of the technical paper. The v1.1 roadmap closes them in the order they were named:

Tighter effect-size attribution on JOB-Slow— cold-cache and Parquet methodologies, larger n, larger memory ceiling. The 1.038× geomean is partly methodology and partly intrinsic; v1.1 separates the two with a clean experimental design.
Broader workload coverage starting with STATS-CEB (Han et al., VLDB 2022). JOB-Slow is the field's standard but it is one schema. STATS-CEB is the obvious next workload.
In-process TabPFN-2.5 backend. The current HTTP round-trip eats 10–15 ms; collapsing it gets the P95 well under 20 ms.
Real Postgres planner/executor hooks. The scaffold is in v1.0. The implementation lands after pgrx ≥ 0.13.
Live-LLM end-to-end measurement campaign. The LLM-via-HTTP backend ships in v1.0 with a measured transport floor (P95 0.07–0.11 ms) and three documented providers (Claude, GPT-4o-mini, Ollama), but the 30-trial paired live-provider campaign hasn't run yet. ANTHROPIC_API_KEY and OPENAI_API_KEY need budget approval; Ollama just needs me to install it on the bench host. The mechanism is in v1.0; the measured live numbers are v1.1.

One sentence each, for the three audiences who will read this

If you have ever wanted to put Claude or GPT-4o-mini into the query path and stopped because nobody had built the safety envelope: that envelope is shipped. cargo add samkhya-core --features llm_http, point llm_infer_server.py at your favorite provider, and your DataFusion (or DuckDB, or Polars) optimizer will route every cardinality estimate through the LLM, clamped from above by a 2008-bound-tightening LP so a hallucination can never produce a worse plan than the engine's native one. That's the contract. That's the thing the field has been waiting for.

If you ship DataFusion, DuckDB, Polars, or Iceberg code today and you have ever cursed at a 1000-row estimate that should have been 42: cargo add samkhya-core and the five-line DataFusion snippet above gets you there in an hour. No LLM involved by default. The GBT backend is the conservative classical bet.

If you are a database researcher who has been quietly hoping someone would build the embedded-tier learned-CE library that fifty server-class papers couldn't: this is that library, the design is in the companion paper, the falsified pre-regs are in the headline section because that is where they belong, and the bench-results/ directory in the repository is where the receipts live— I would rather have the argument there than in the abstract.

The library's whole job is to count reality's constituents honestly. The release is doing the same thing about its own measurements. Both are deliberate. Both are how this field starts to look like the production field it always wanted to be.

samkhya v1.0.0— Apache-2.0. Repository: github.com/singhpratech/samkhya. Technical paper: /publications/samkhya-portable-feedback-driven-cardinality-correction-embedded-analytics. Companion to the prior gpudb release— the two share the embedded-tier engine target, the GPU-optional architecture, and the single-Apache-2.0 license posture.