DEV Community: AI Explore

ferrovec: a 33 KB Rust HNSW Vector Index That Compiles to WebAssembly

AI Explore — Sat, 18 Jul 2026 23:58:19 +0000

I wanted semantic search that ran with no server — the vectors, the index, and the query all inside a browser tab, working offline, nothing leaving the machine. The plan was the boring, winning one: a fast core in Rust, compiled to WebAssembly, wrapped in a plain JS API.

Then I went looking for a Rust HNSW crate to drop in, and every mature one refused to compile to wasm32-unknown-unknown. They hard-depend on rayon, or mmap-rs, or num_cpus — reasonable on a server, immovable walls in the browser. So I wrote my own.

ferrovec is a tiny, dependency-light HNSW vector index — the same approximate-nearest-neighbor algorithm behind Pinecone, Weaviate, and Qdrant. It's at 0.3.1 on crates.io (Rust core) and npm (browser package).

The design constraints (on purpose)

Featherweight — the Rust core's only dependencies are serde and postcard. The wasm build is reported at ~33 KB gzipped.
No unsafe outside one audited SIMD kernel — #![deny(unsafe_code)] crate-wide, with a single scoped exception.
No system randomness — a deterministic seeded splitmix64 PRNG, so there is no getrandom in the tree. That's exactly what lets it land on wasm32-unknown-unknown with no shims.
Incremental + portable — upsert inserts, tombstoning removals, and a compact binary format with a versioned header that reloads natively or in the browser.

Layer	Dependencies	Runs in the browser?
ferrovec core (Rust)	`serde` + `postcard` — 2, total	✅ ~33 KB gzipped wasm
a typical HNSW crate	`rayon` / `mmap-rs` / `num_cpus`	❌ no `wasm32` backend

The whole learning curve for the Rust core:

use ferrovec::{Hnsw, Metric, Config};

let mut index = Hnsw::new(4); // 4-dim, Cosine by default
index.insert("a", &[1.0, 0.0, 0.0, 0.0]).unwrap();
index.insert("b", &[0.0, 1.0, 0.0, 0.0]).unwrap();
index.insert("c", &[0.9, 0.1, 0.0, 0.0]).unwrap();

let hits = index.search(&[1.0, 0.0, 0.0, 0.0], 2).unwrap();
assert_eq!(hits[0].id, "a"); // nearest first

search returns Neighbor { id, distance }, nearest first. Three metrics, all "smaller = closer": Cosine (default, 1 - cos), Dot (1 - dot), and L2 (squared Euclidean).
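To make the "smaller = closer" convention concrete, here is a minimal scalar sketch of the three distances as described above. The function names and the zero-norm handling are assumptions for illustration; this is not ferrovec's source.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine distance, 1 - cos(a, b). Zero-norm input returns 1.0, which is a
/// choice made for this sketch only.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let norm = (dot(a, a) * dot(b, b)).sqrt();
    if norm == 0.0 {
        return 1.0;
    }
    1.0 - dot(a, b) / norm
}

/// Dot distance, 1 - dot(a, b). It goes negative when the dot product exceeds 1.
fn dot_distance(a: &[f32], b: &[f32]) -> f32 {
    1.0 - dot(a, b)
}

/// Squared Euclidean distance; skipping the square root preserves the ordering.
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}
```

With the vectors from the example above, all three functions rank "a" ahead of "c" and "c" ahead of "b" for the query [1.0, 0.0, 0.0, 0.0].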

Why HNSW is fast

It never scores your query against the whole dataset. HNSW stacks graphs: the bottom layer holds every node densely connected; each layer above is a sparser sample with longer hops, like an express lane over local streets. A search enters at the top, greedily walks toward the query, drops a layer, repeats — so only a handful of comparisons happen at the dense bottom. You tune it with three knobs:

Config {
 max_connections: 16, // M — neighbors per node per layer
 ef_construction: 200, // candidate list while building
 ef_search: 50, // candidate list while querying
 metric: Metric::L2,
 seed: 42,
}

Determinism is the feature that unlocks the browser

Most index libraries call the OS random source to pick each node's top layer — and that one call is what makes them impossible to compile for the browser, because getrandom has no wasm32-unknown-unknown backend without a JS shim. ferrovec carries its own seeded splitmix64 instead. No getrandom anywhere.
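For readers who have not met splitmix64, the sketch below shows the widely published mixing constants and the textbook HNSW level draw, floor(-ln(u) * mL) with mL = 1 / ln(M). The struct and function names are hypothetical, and this is an illustration of the idea, not ferrovec's actual code.

```rust
/// splitmix64: a tiny seeded PRNG with a single u64 of state.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in (0, 1]: the top 53 bits plus one, so ln() never sees 0.
    fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }
}

/// Textbook HNSW level draw. Assumes max_connections >= 2, since ln(1) = 0.
fn sample_level(rng: &mut SplitMix64, max_connections: usize) -> usize {
    let m_l = 1.0 / (max_connections as f64).ln();
    (-rng.next_unit().ln() * m_l).floor() as usize
}
```

Because the only state is the seed, two generators started from the same seed assign the same layer to the same insertion order, which is what makes a build reproducible across machines.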

The bonus: builds are reproducible. Same vectors, same order, same seed → byte-identical graphs on your laptop and in a user's browser. That also keeps compaction honest. remove() only tombstones a node (it stays in place for graph connectivity and is filtered from results), so heavy churn grows memory; compact() rebuilds in place from the live vectors only, rewinding the PRNG to Config::seed so the result is identical to a fresh build of the survivors.

index.remove("drop"); // tombstoned, still using memory
index.compact(); // rebuild from live nodes; PRNG rewound to seed
assert!(index.contains("keep"));
assert!(!index.contains("drop"));
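The tombstone-then-compact life cycle can be sketched with a toy store in which brute-force search stands in for the HNSW graph. Everything here (ToyIndex, Slot, slot_count, nearest) is hypothetical and only illustrates why removed entries keep using memory until a rebuild.

```rust
use std::collections::HashMap;

struct Slot {
    id: String,
    vector: Vec<f32>,
    deleted: bool, // tombstone flag
}

struct ToyIndex {
    slots: Vec<Slot>,              // grows on every insert, even after removals
    by_id: HashMap<String, usize>, // live ids only
}

impl ToyIndex {
    fn new() -> Self {
        Self { slots: Vec::new(), by_id: HashMap::new() }
    }

    fn insert(&mut self, id: &str, vector: &[f32]) {
        self.remove(id); // upsert: tombstone any previous version
        self.by_id.insert(id.to_string(), self.slots.len());
        self.slots.push(Slot { id: id.to_string(), vector: vector.to_vec(), deleted: false });
    }

    /// Marks the slot as deleted but leaves it in place.
    fn remove(&mut self, id: &str) -> bool {
        match self.by_id.remove(id) {
            Some(pos) => {
                self.slots[pos].deleted = true;
                true
            }
            None => false,
        }
    }

    fn contains(&self, id: &str) -> bool {
        self.by_id.contains_key(id)
    }

    fn slot_count(&self) -> usize {
        self.slots.len()
    }

    /// Brute-force nearest by squared L2, skipping tombstones.
    fn nearest(&self, q: &[f32]) -> Option<&str> {
        self.slots
            .iter()
            .filter(|s| !s.deleted)
            .map(|s| (s, s.vector.iter().zip(q).map(|(a, b)| (a - b) * (a - b)).sum::<f32>()))
            .min_by(|x, y| x.1.total_cmp(&y.1))
            .map(|(s, _)| s.id.as_str())
    }

    /// Rebuilds from live slots only, dropping every tombstone.
    fn compact(&mut self) {
        self.slots.retain(|s| !s.deleted);
        self.by_id = self.slots.iter().enumerate().map(|(i, s)| (s.id.clone(), i)).collect();
    }
}
```

In this toy, slot_count() stays at its pre-removal value until compact() runs, which is the memory growth under churn that the paragraph above describes.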

The SIMD kernel — the one place unsafe is allowed

Distance is the inner loop, so it's the one place worth hand-optimizing. Scalar dot/l2 stay as the always-available reference; on wasm32 + simd128 the dispatch swaps in a hand-written SIMD128 kernel that accumulates in four lanes and folds them — and that kernel is the only code allowed to use unsafe. Four-lane accumulation isn't bit-identical to a left-to-right scalar sum (a few ULPs under IEEE-754), so a wasm integration test asserts search returns the same nearest id as a scalar brute force. The wobble never changes who wins.
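The summation-order point can be reproduced without intrinsics. The sketch below emulates four-lane accumulation in plain safe Rust and compares it with a left-to-right sum; it is an illustration of the effect under that assumption, not the SIMD128 kernel itself.

```rust
/// Reference: strict left-to-right accumulation.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    let mut sum = 0.0f32;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

/// Four independent partial sums, folded at the end. This mirrors the shape of
/// a 4 x f32 SIMD loop without using any intrinsics or unsafe code.
fn dot_four_lane(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    let mut lanes = [0.0f32; 4];
    let body = a.len() - a.len() % 4;
    for i in (0..body).step_by(4) {
        for lane in 0..4 {
            lanes[lane] += a[i + lane] * b[i + lane];
        }
    }
    // Horizontal fold, then a scalar tail for lengths not divisible by 4.
    let mut sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for i in body..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```

On inputs that are exactly representable (small integers, for example) the two agree bit for bit; on general floats they can differ in the last few bits, which is why comparing the winning id is a sturdier test than comparing raw distances.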

Into the browser

The core exposes a FerrovecCore class via wasm-bindgen — bring your own Float32Array embeddings. But the npm package closes the last gap: automatic embedding via transformers.js, a dedicated Web Worker so nothing runs on the main thread, and OPFS persistence. The API collapses to three lines:

import { Ferrovec } from "ferrovec";

const db = await Ferrovec.open("notes"); // spawns the worker, loads the model
await db.insert("the cat sat on the mat"); // embed + index
const hits = await db.query("a napping kitten", 5); // → [{ id, text, score },...]

No network call in that snippet, no server on the other end.

Surviving a reload — and a second tab

Indexes persist to the Origin Private File System by default: the worker opens ferrovec/<name>/index.bin via a FileSystemSyncAccessHandle, and writes are full snapshots debounced ~250 ms, with close() forcing a final flush. Where sync access isn't available (Node, insecure context) it degrades to in-memory instead of crashing.

The interesting problem is the second tab. With two workers and one OPFS file, early versions let the second tab silently fork and diverge. The 0.3.x fix is single-writer leader election: every store worker requests an exclusive Web Lock ferrovec-leader:<name>; the winner owns the OPFS file and the only engine, and the losers become followers that proxy insert/query/remove to the leader over a BroadcastChannel as correlation-id round-trips. One authoritative index, so no tab can diverge. db.role tells you which you got: leader, follower, or solo. (Both OPFS sync handles and Web Locks need a secure context, so outside HTTPS leader election switches off gracefully.)

See it run

The live demo is the real wasm core running semantic search over 24 sentence embeddings entirely in your tab — no server, no network, the wasm binary and vectors baked into one HTML file.

Status, honestly

Roadmap complete, both registries at 0.3.1:

Milestone	Ships	Where
M1–M2 + compaction	HNSW core, WASM boundary, SIMD128 kernel	crates.io
M3–M6	transformers.js embeddings, OPFS persistence, 3-line API, leader election	npm

Two things this post does not contain: benchmarks and recall numbers — I haven't measured them, so I won't quote them. The ~33 KB is the reported build size; everything else here you can read out of the source or reproduce from a clone. Benchmarks are the next post, and they'll ship with the commands to reproduce them.

Project site: https://singhpratech.github.io/ferrovec/
Live demo: https://singhpratech.github.io/ferrovec/demo.html
crates.io: https://crates.io/crates/ferrovec · cargo add ferrovec
npm: https://www.npmjs.com/package/ferrovec · npm install ferrovec
GitHub: https://github.com/singhpratech/ferrovec

ferrovec is an independent open-source project (MIT). Nothing here is a benchmark. transformers.js is a Hugging Face project used by the browser package.

Your Data Pipeline's Real Bug Is a Lineage Problem, Not a Cleaning Problem

AI Explore — Sat, 18 Jul 2026 19:31:51 +0000

TL;DR — Data cleaning for AI gets treated like a one-time batch job borrowed from BI tooling, but the real failure mode is losing the trail from a bad model output back to the source record that caused it. Cleaning, chunking, and embedding are all lossy transforms that erase provenance by default. Pipelines that feed models reliably need lineage as a first-class layer, not an afterthought bolted on for compliance.

Every data quality checklist you've inherited was built for a different consumer. Null checks, deduplication, schema validation, referential integrity — this is the vocabulary of BI dashboards and financial reporting, where the question is "does this aggregate reconcile." AI pipelines get handed the same checklist and told to feed a model instead of a report. The checklist mostly works. But it's answering the wrong question, and the gap it leaves is where your hardest debugging sessions live.

The question that matters for a model isn't "is this row correct." It's "when the model says something wrong, can I find the exact byte of source data that caused it." Most pipelines cannot answer that question. Not because nobody tried to clean the data, but because cleaning, chunking, and embedding are lossy transforms, and lossy transforms destroy the evidence trail by construction.

Cleaning Destroys the Evidence

Think about what a typical ingestion pipeline does to a document before it reaches a model. HTML gets stripped. Boilerplate gets removed. Whitespace gets normalized. Near-duplicate paragraphs get collapsed. Text gets split into chunks at arbitrary token boundaries. Chunks get embedded into vectors. Each of these steps is individually defensible, sometimes necessary, and each one throws away information about where the surviving content came from and what it looked like before.

By the time a chunk is sitting in a vector store, it is several transforms removed from the source document. If that chunk later gets retrieved into a prompt and produces a hallucinated or subtly wrong answer, tracing it back requires reconstructing a chain that nobody logged: which raw document, which version of that document, which cleaning pass, which chunking parameters, which embedding model checkpoint. In practice, almost nobody can reconstruct that chain. So the debugging default becomes re-reading the whole corpus by hand, or worse, shrugging and re-tuning a prompt until the symptom disappears while the actual cause stays buried.

This is the part that gets missed in most "clean your data" advice: the goal isn't just to remove garbage before it reaches the model. It's to remove garbage without losing the ability to explain, after the fact, why any given piece of surviving data looks the way it does. Those are different engineering problems, and the second one is the one that actually determines whether your AI system is debuggable in production.

Lineage Is the Missing Layer

Data engineering has a mature answer to this in the abstract: lineage. Track where data came from, what transformed it, and what it turned into. The discipline exists. It's just rarely applied past the ingestion layer in AI systems, because lineage tooling was built to answer "what feeds this dashboard," not "what feeds this specific model output at inference time."

Feeding a model reliably means lineage has to survive three transforms that traditional data lineage tools weren't designed for: cleaning that rewrites content rather than just filtering rows, chunking that splits one record into many with no natural key, and embedding that turns text into a vector with no human-readable trace at all. If your lineage graph stops at "raw document ingested," you have lineage for the easy 20% of the pipeline and none for the part that actually shapes what the model sees.

What This Actually Looks Like in Practice

A lineage-first pipeline doesn't need exotic infrastructure. It needs a handful of disciplined habits that most teams skip because they feel like overhead until the day they save you.

Keep the raw source immutable and content-addressed, so every downstream artifact can point back to an exact byte-identical version of the original, not "the document as it exists today." Version every transform function, not just the data — a cleaning rule changing silently is functionally the same bug as bad data arriving, and it should be diffable the same way. Give every chunk a stable identifier that survives re-chunking runs, so "this chunk" means something specific rather than "chunk number seventeen from whatever the last run produced." Log retrieval events with the chunk identifiers actually used, not just the query and the answer, so a bad generation can be joined back to the exact evidence that was in context.
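A minimal sketch of the first three habits, assuming a hypothetical ChunkRecord shape and using 64-bit FNV-1a purely because it fits in a few lines of std-only Rust. FNV-1a is deterministic but not collision-resistant, so a real pipeline would more plausibly use a cryptographic hash such as SHA-256.

```rust
/// 64-bit FNV-1a. Deterministic across runs and platforms, NOT cryptographic.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h
}

/// Content address: the id is a function of the bytes and nothing else.
fn content_id(bytes: &[u8]) -> String {
    format!("{:016x}", fnv1a64(bytes))
}

/// Hypothetical lineage record carried alongside every chunk.
#[derive(Debug, Clone, PartialEq)]
struct ChunkRecord {
    chunk_id: String,           // derived from source id + chunk bytes
    source_id: String,          // content address of the immutable raw document
    transform_version: String,  // which cleaning/chunking code produced it
    byte_range: (usize, usize), // where the chunk sits in the raw source
}

fn make_chunk(raw: &[u8], transform_version: &str, start: usize, end: usize) -> ChunkRecord {
    let source_id = content_id(raw);
    // Keyed on content, not on position in a run, so the id is the same
    // whenever a re-chunking run reproduces the same span of the same source.
    let mut key = source_id.clone().into_bytes();
    key.push(b':');
    key.extend_from_slice(&raw[start..end]);
    ChunkRecord {
        chunk_id: content_id(&key),
        source_id,
        transform_version: transform_version.to_string(),
        byte_range: (start, end),
    }
}
```

One consequence of keying on content: two identical spans within the same source share an id. Whether that is desirable depends on the pipeline, and adding the byte offset to the key is the obvious alternative. Retrieval logs would then store chunk_id values next to each query and answer.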

None of this is exotic. It's the same discipline good software engineers apply to production incidents — keep enough state around that a bug report is reproducible — applied to the data feeding the model instead of the code running it. The reason it's skipped isn't difficulty, it's that the cost is invisible until the first time you desperately need it and don't have it.

The Cost You're Actually Paying

Skipping lineage doesn't show up as a clean failure. It shows up as a tax you pay repeatedly in slightly different forms. An eval regresses and nobody can tell if it's a model change, a prompt change, or a silent shift in what the retrieval pipeline is now returning. A user reports a factually wrong answer and the incident review consists of guessing which document it probably came from. A source document turns out to be legally or factually wrong and needs to be removed, and there's no way to know which embeddings, which fine-tuning examples, or which cached outputs were derived from it, so removal becomes "rebuild everything from scratch and hope."

That last one deserves more attention than it gets. Deletion in AI pipelines is not a delete statement. If you can't trace what a piece of source data touched downstream, you can't actually remove its influence — you can only remove the row and hope the model forgets, which it usually doesn't in any verifiable way. Lineage isn't just a debugging convenience. It's what makes deletion, correction, and targeted retraining possible at all instead of theoretical.

Cleaning Is Necessary. It Just Isn't Sufficient.

None of this argues against cleaning your data. Cleaning is still necessary, and most pipelines should do more of it, not less. The argument is narrower: cleaning alone optimizes for what the model sees on the way in, while lineage is what lets you reason about what the model produced on the way out. A pipeline that's spotless at ingestion but opaque past that point will still fail in ways nobody can explain, because the failure surfaces three transforms downstream from where the visibility stops.

The teams that will have an easier time running AI systems in production aren't the ones with the cleanest corpus. They're the ones who can point at any output and say, without guessing, exactly which piece of source data made it possible.

The Inference Hardware Wars: Why Your Token Bill Is Decided in a Fab, Not a Prompt

AI Explore — Sat, 18 Jul 2026 03:35:01 +0000

TL;DR — The token-price collapse you keep celebrating is a hardware war, not a software miracle. NVIDIA's near-monopoly is under attack on two flanks: Cerebras and Groq are racing on raw speed (thousands of tokens per second), while NVIDIA's own next-gen Rubin is counterpunching on cost-per-token. Whoever wins that fight — not your prompt engineering — sets your real inference bill and your latency floor.

Every few weeks someone publishes a chart of plummeting LLM prices and credits it to "efficiency" — better quantization, smarter batching, leaner prompts. That story is comforting because it's the part you control. It's also mostly wrong. The dominant force pushing your cost-per-token down isn't a clever decoding trick; it's a brawl over the silicon that the tokens run on. The price you pay is, to a first approximation, a hardware number with a software wrapper.

That matters because it changes what you should be paying attention to. If your inference economics are set by prompt golf, you optimize prompts. If they're set by which accelerator wins the next 18 months, you watch the foundries, the power contracts, and the architecture roadmaps — and you build so you can move when the floor drops. Here's the actual war.

The incumbent: NVIDIA owns the ground you're standing on

Start with the fact everyone knows and few price correctly: NVIDIA isn't winning inference because its chips are the fastest at generating tokens. It's winning because it owns training, and because CUDA is the most expensive switching cost in computing. Every serious model is trained on NVIDIA. Every framework, kernel, and serving stack assumes CUDA first. That gravity well means most inference lands on NVIDIA hardware by default, not by benchmark.

The result is a quasi-monopoly on the plumbing of AI. And monopolies don't usually cut their own prices — until someone makes them. The interesting development of 2026 is that NVIDIA is now being attacked from two directions at once, and it is responding by attacking its own margins before anyone else can.

Flank one: the speed war (Cerebras and Groq)

The first flank is latency and throughput, and it is not being fought with GPUs at all. Two companies decided the GPU was the wrong shape for inference and built something else.

Cerebras builds a wafer-scale processor — the CS-3 system is a single chip the size of a dinner plate, the largest commercial chip ever made. The whole point is to keep the model's weights in enormous on-wafer SRAM instead of shuttling them across a network of smaller GPUs. Memory bandwidth is the thing that throttles token generation, and wafer-scale attacks it head-on. On the open GPT-OSS-120B model, Cerebras reports roughly 2,700 tokens/sec with a time-to-first-token around 280ms. For comparison, Cerebras's own benchmark puts a Blackwell GPU at roughly 650 tokens/sec on the same model — call it a 4x gap. (That figure is vendor-sourced and Cerebras has every incentive to flatter it, so treat the exact multiple as marketing-adjacent. The order of magnitude, though, is real and repeatable across independent third-party measurements.)

Groq takes a different route to the same goal with its LPU — a "Language Processing Unit" built around deterministic, software-scheduled dataflow instead of the GPU's dynamic, cache-driven execution. No speculative scheduling, no memory-hierarchy guessing; the compiler lays out exactly when every value moves. That determinism is what lets Groq sustain very high tokens/sec at low, predictable latency. On price, Groq sits in the $0.15/M input, $0.75/M output class for open models (secondary figures — confirm against the live price sheet before you build a budget on them).

The trade is explicit and worth saying plainly: both architectures buy speed by sacrificing flexibility. Wafer-scale and LPU systems are tuned for a curated set of models. You don't get the GPU's "run literally anything, including the training run you'll do next month" generality. You get a blisteringly fast appliance for serving the models it supports. For a lot of production inference, that's exactly the trade you want — but it is a trade, not a free lunch.

Why thousands of tokens per second is a product feature, not a flex

It's tempting to read 2,700 tok/s as a vanity number. It isn't. There's a UX phase change that happens somewhere north of human reading speed, and these systems blow past it.

At 50 tok/s, typical of streamed GPU output, you watch the model think. You stream tokens because you have to; the latency is the experience. At 2,000+ tok/s, a multi-paragraph answer materializes effectively instantly, which kills the entire premise of streaming and unlocks workloads that were previously impossible. Agentic loops are the big one: an agent that makes ten sequential model calls to plan, act, and reflect is dead on arrival at 50 tok/s and snappy at 2,000. Long chain-of-thought, multi-pass self-critique, and the other techniques that trade more tokens for better answers only become economical when each token is nearly free in time. Speed doesn't just make the same product faster; it changes which products are buildable.
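The agent-loop claim is easy to check with back-of-envelope arithmetic. The sketch below assumes ten sequential calls of 500 output tokens each; the 500 is a made-up figure for illustration, and time-to-first-token is a separate parameter so it can be zeroed out.

```rust
/// Wall-clock seconds for a chain of sequential model calls, where each call
/// must finish before the next one starts.
fn sequential_loop_seconds(
    calls: u32,
    tokens_per_call: u32,
    tokens_per_second: f64,
    time_to_first_token_s: f64,
) -> f64 {
    calls as f64 * (time_to_first_token_s + tokens_per_call as f64 / tokens_per_second)
}
```

Under those assumptions and ignoring time-to-first-token, sequential_loop_seconds(10, 500, 50.0, 0.0) comes to 100 seconds, while the same loop at 2,000 tok/s takes 2.5 seconds.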

The OpenAI–Cerebras bet: speed at industrial scale

If you thought the speed flank was a niche play, the numbers say otherwise. In January 2026, Cerebras signed a roughly $10 billion cloud-inference contract with OpenAI — about 750 MW of capacity, which works out to something like 32,768 CS-3 systems, deploying from Q1 2026 through 2028 (reported by The Next Platform, January 15, 2026).

OpenAI committing ten billion dollars to non-NVIDIA inference silicon is the loudest possible signal that the latency war is real, well-funded, and aimed squarely at the GPU's most profitable workload.

On pricing, Cerebras's cloud sits around $0.25/M input and $0.69/M output for the open models it serves — competitive with GPU clouds on dollars while being multiples faster on tokens/sec. That combination — comparable price, far better latency — is precisely the wedge a challenger needs to pry workloads off the incumbent. You don't have to be cheaper if you're dramatically faster at the same price; you just have to be fast enough to enable products the incumbent can't.

Flank two: the efficiency war (NVIDIA vs. NVIDIA)

The second flank is cost-per-token, and here NVIDIA is the one swinging — at its own installed base. At CES in January 2026, NVIDIA announced Vera Rubin NVL72, the successor to the Blackwell generation. The headline claims (NVIDIA's own, so read them as vendor targets, not benchmarks): up to 5x inference performance and up to 10x lower cost-per-token versus Blackwell, with volume ramp in the second half of 2026.

Sit with that 10x. Even if the real-world number lands at half the claim, NVIDIA is telling its customers that the chips they bought last year will cost roughly five times more per token to run than the chips shipping this year. That is not a company resting on a monopoly. That is a company that has looked at Cerebras, Groq, and the hyperscalers' in-house silicon (Google's TPUs, Amazon's Trainium/Inferentia) and concluded the only safe move is to obsolete itself on a schedule before a competitor does it for them. The CUDA moat protects the platform; it does not protect any single generation's pricing.

This is the part the "it's just software efficiency" crowd misses. A 10x cost-per-token improvement at the silicon layer swamps anything you can squeeze out of a prompt. When the token-price chart drops a notch in late 2026, the cause won't be your batching strategy. It'll be Rubin entering the fleet.

Power is the wall everyone eventually hits

Underneath both flanks sits the constraint that increasingly decides the whole game: electricity. The Stanford AI Index 2026 (via IEEE Spectrum) pegs AI datacenter power at roughly 29.6 GW — the scale of a large national grid, devoted to running models. At that scale, efficiency stops being an environmental footnote and becomes the binding cost input.

The same data offers an illustrative spread worth internalizing: the least-efficient inference setups can emit more than 10x the carbon of the most efficient for comparable work, with one model drawing on the order of 23W for a medium prompt where another draws around 5W (these are illustrative single-model figures, not a universal law). The lesson generalizes cleanly: when you're power-bound, the accelerator that delivers more tokens per watt wins on cost and on how much capacity you can physically stand up behind a fixed power contract. Tokens-per-watt is quietly becoming the metric that decides which of these architectures can actually scale — because at 750 MW deployments, you can't buy your way past the substation.

What this means if you serve models at scale

Strip away the vendor theater and the operating reality for an infra team is blunt: your serving cost is dominated by hardware dollars-per-token and your utilization, not by which model you picked. A few practical consequences follow.

Latency is now a procurement decision, not a code decision. If your product needs real-time agents, no amount of streaming polish gets you what a Cerebras or Groq endpoint gives you out of the box. Buy the latency floor; don't try to engineer around physics.
Don't hard-couple to one accelerator's quirks. The same lesson as model-swapping applies one layer down. The vendor with the best dollars-per-token will rotate — Blackwell, then Rubin, then whatever Cerebras and the TPU teams answer with. Build your serving layer so the accelerator behind the endpoint is a config value, not an assumption baked into forty call sites.
Track tokens-per-watt, not just dollars-per-token. In a power-bound world the two converge, and the watt number is the one that tells you whether you can grow.
Utilization is the silent margin killer. A faster, pricier chip running at 80% beats a cheaper one idling at 20%. The hardware war only helps you if your scheduling actually keeps the silicon hot.
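The utilization point above reduces to one formula: effective cost per token is the hourly hardware cost divided by the tokens actually served in that hour. The sketch below uses invented prices and throughputs purely to show the shape of the calculation; none of the numbers come from a vendor.

```rust
/// Effective dollars per million tokens for an accelerator billed by the hour.
/// `utilization` is the fraction of peak throughput actually served (0.0..=1.0).
fn cost_per_million_tokens(hourly_cost_usd: f64, peak_tokens_per_second: f64, utilization: f64) -> f64 {
    let tokens_per_hour = peak_tokens_per_second * utilization * 3600.0;
    hourly_cost_usd / tokens_per_hour * 1_000_000.0
}
```

With hypothetical inputs, a $8/hour chip at 2,000 tok/s peak and 80% utilization works out to about $1.39 per million tokens, while a $3/hour chip at 1,000 tok/s peak and 20% utilization works out to about $4.17.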

Where this lands by 2027

Here's my read. NVIDIA keeps the throne — CUDA and its lock on training are not falling in eighteen months, and Rubin's cost-per-token claims, even discounted, are enough to defend the bulk of the market. But the monopoly on inference specifically gets carved up at the edges, and the edges are the high-value parts: ultra-low-latency, agent-heavy, real-time workloads peel off to wafer-scale and LPU systems, with OpenAI's $10B Cerebras bet as the template others copy. The hyperscalers' in-house chips quietly eat another slice for their own first-party traffic.

The net effect for everyone serving models: cost-per-token keeps falling fast through 2027, and almost none of that win will come from your prompts. It'll come from Rubin ramping, from challengers forcing price discipline, and from power efficiency becoming the real ceiling. The teams who profit are the ones who treated the accelerator as a swappable component and built the eval and serving harness to move the moment a cheaper, faster floor appears. The token-price collapse is real — just don't flatter yourself about who's causing it. It's being decided in a fab, not in your prompt.

Structured Output Gives You Syntax. It Doesn't Give You Semantics

AI Explore — Fri, 17 Jul 2026 13:01:08 +0000

TL;DR — Constrained decoding and JSON schema enforcement guarantee that model output parses — they say nothing about whether the values are true, safe, or grounded in real system state. Treat structured output like you'd treat any untrusted client at an API boundary: schema validation is step one, not the whole job. The dangerous bugs left in production aren't malformed JSON anymore; they're well-formed lies.

Structured output was sold as a solved problem. Constrain the decoder to a grammar, force the model to emit valid JSON against a schema, and the parsing errors that plagued early LLM integrations disappear. They did disappear. But somewhere along the way, teams started treating schema-valid as a synonym for correct, and that substitution is quietly causing a new class of production bugs that don't look like bugs at all — they look like clean, well-typed data.

This is worth being precise about, because the word "type" is doing a lot of unearned work in how people talk about function calling and structured generation.

The Guarantee You Actually Got

Grammar-constrained decoding gives you exactly one guarantee: the output will conform to a shape. If your schema says status is an enum of three strings, the model will emit one of those three strings. If it says amount is a number, you get a number, not a sentence fragment. That's real, and it's valuable — it eliminated an entire category of glue code that used to exist purely to recover from malformed responses.

But a type system in the compiler sense does more than check shape. It checks that a value belongs to a domain that makes the rest of your program sound. A JSON schema can say user_id is a string. It cannot say user_id refers to a user that exists. It can say refund_amount is a positive number. It cannot say that number is less than or equal to the actual balance on the account. It can say date matches an ISO format. It cannot say the date isn't in the future for a field that logically can't be in the future. Grammar constraints operate entirely inside the syntax of the value. They have zero visibility into the semantics of your domain, because that semantics lives in your database, your business rules, and the current state of the world — none of which the decoder has access to at generation time.

Well-Formed Lies

Before structured output, a bad model response usually broke your parser. You got an exception, a retry, a visible failure. That failure was annoying but it was honest — the system told you something was wrong.

Now the failure mode has changed shape. The model still hallucinates, still misreads context, still guesses under uncertainty — but it does all of that inside a perfectly valid JSON object. The enum value is real, just the wrong one. The order ID is correctly formatted, just doesn't exist. The function call has the right argument types, just the wrong argument values. Nothing throws. Nothing logs an error. The malformed response of two years ago has become the well-formed lie of today, and well-formed lies are much harder to catch because your existing observability was built to catch parse failures, not semantic ones.

This is the same mistake web developers made with client-side form validation fifteen years ago: confusing "the browser won't let the user submit garbage" with "the server can trust what it receives." We relearned that lesson the hard way once already. Structured LLM output is asking us to relearn it a second time, in a domain where the "client" is a probabilistic model instead of a browser, which makes the untrusted input even less predictable, not more.

Function Calling Is an RPC Client You Didn't Write

Function calling makes this sharper because the stakes go up. A structured object that's wrong sits in a database field. A function call that's wrong executes. The model is choosing which method to invoke and with what arguments, and the tool schema you gave it is documentation, not a contract in the enforceable sense. The schema can constrain the type of an argument. It cannot encode the precondition that a refund can't exceed the original charge, that a cancellation can't target an order that already shipped, that a permission-scoped action can't be invoked for a resource outside the caller's tenant.

Those preconditions exist in your codebase already — they're the same invariants you'd check for any caller, human or programmatic. The mistake teams make is skipping that check specifically for LLM-originated calls, on the theory that schema compliance already did the validating. It didn't. The LLM is best modeled as an RPC client you didn't write, running code you didn't review, calling into your system with arguments generated by a process that has no concept of your business rules unless you put them there explicitly and check them again downstream.

Two Validation Layers, Not One

The fix is unglamorous and it's the same fix that's always existed for untrusted input: layer your validation instead of collapsing it into one step.

Schema validation at the boundary — shape, type, required fields. This is what constrained decoding and JSON schema already give you for free. Keep it, don't over-invest in it further.
Semantic validation after parsing, before any side effect. Does this ID resolve to a real, accessible entity? Is this value within the domain's actual legal range, not just its syntactic range? Does this combination of fields represent a state your system can actually be in? This layer has to be hand-written, because it encodes knowledge the model's grammar cannot see.

Neither layer replaces the other. Teams that only run the first layer are shipping the equivalent of an API that trusts its own request validator to also be a database consistency check. Teams that run both are treating LLM output the way they'd treat any input from a system they don't fully control — which, given what an LLM actually is, is exactly the right level of trust.
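As a sketch of the second layer, assume a refund tool call has already passed schema validation and been parsed into a typed struct. All names here (RefundCall, Order, SemanticError, validate_refund) are hypothetical, and the in-memory map stands in for whatever database holds the real state.

```rust
use std::collections::HashMap;

/// What schema validation already guarantees: the fields exist with these types.
struct RefundCall {
    order_id: String,
    refund_cents: u64,
}

/// System state the decoder never sees.
struct Order {
    tenant: String,
    charged_cents: u64,
    refunded_cents: u64,
}

#[derive(Debug, PartialEq)]
enum SemanticError {
    UnknownOrder,
    WrongTenant,
    ExceedsCharge { max_cents: u64 },
}

/// Layer two: domain checks that run after parsing and before any side effect.
fn validate_refund(
    call: &RefundCall,
    caller_tenant: &str,
    orders: &HashMap<String, Order>,
) -> Result<(), SemanticError> {
    // Does the id resolve to a real entity?
    let order = orders.get(&call.order_id).ok_or(SemanticError::UnknownOrder)?;
    // Is it accessible to this caller?
    if order.tenant != caller_tenant {
        return Err(SemanticError::WrongTenant);
    }
    // Is the value inside the domain's legal range, not just the syntactic one?
    let max_cents = order.charged_cents.saturating_sub(order.refunded_cents);
    if call.refund_cents > max_cents {
        return Err(SemanticError::ExceedsCharge { max_cents });
    }
    Ok(())
}
```

Every rejected call in this sketch is schema-valid: the order id is a string and the amount is a non-negative integer. The errors only exist because the check consults state outside the model's output.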

Measuring the Gap

If you evaluate your structured-output or tool-calling pipeline only on schema compliance rate, you are measuring the layer that was already guaranteed by construction. It will look great and tell you almost nothing about production risk. The metric that matters is the gap between schema-valid and semantically-valid — how often does a well-formed response fail your domain checks after it parses cleanly? That number is your actual hallucination rate for structured tasks, and it's usually far more informative than any aggregate accuracy score, because it isolates exactly the failure mode your syntax guarantees were never designed to catch.
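One way to compute that gap, assuming each evaluated response has been labeled with two booleans (the EvalRow shape and function name are hypothetical):

```rust
/// One evaluated model response.
struct EvalRow {
    schema_valid: bool,   // parsed and matched the schema
    semantic_valid: bool, // also passed the domain checks
}

/// Fraction of schema-valid responses that still fail domain checks.
/// Returns None when no response was schema-valid.
fn semantic_gap(rows: &[EvalRow]) -> Option<f64> {
    let parsed = rows.iter().filter(|r| r.schema_valid).count();
    if parsed == 0 {
        return None;
    }
    let lies = rows.iter().filter(|r| r.schema_valid && !r.semantic_valid).count();
    Some(lies as f64 / parsed as f64)
}
```

The denominator is deliberately the schema-valid set rather than all responses, so the number isolates well-formed failures instead of blending them with parse errors.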

Structured output didn't make LLM integrations safe. It made them legible. Legibility is genuinely useful — you can't validate what you can't parse — but it's the beginning of the trust boundary, not the end of it. Treat the schema as a cast, not a proof, and build the semantic checks you'd build for any other untrusted caller. The model will keep being confidently wrong inside perfectly valid JSON. Your validation layer is the only thing standing between that confidence and your production data.

crimson-crab: a Production-Grade Rust SDK for Claude (and Why tokio Leaves the Tree on wasm32)

AI Explore — Thu, 16 Jul 2026 23:26:47 +0000

On a native target, cargo tree --edges normal --invert tokio puts tokio v1.52.4 squarely in your build. It arrives through reqwest, by way of hyper, hyper-rustls, hyper-util, tokio-rustls, tokio-util and tower.

Retarget that same query at wasm32-unknown-unknown and the tokio count is zero.

Same crate, same default features, same command, two correct answers. On wasm32 reqwest swaps to the browser's fetch backend, hyper leaves the dependency graph, and tokio leaves with it.

What it is

crimson-crab 🦀 is a Rust SDK for Anthropic's Claude API. v0.1.0 went up on crates.io on 16 July 2026. It covers Messages and token counting, fine-grained SSE streaming with an accumulated final Message, tool use with custom tools and server-tool passthrough, extended and adaptive thinking, prompt caching with 5-minute and 1-hour TTLs, structured output via JSON Schema, Message Batches, and the Models endpoint.

The scope is deliberately narrow: Claude, and nothing else. If you are building against several model vendors, a multi-provider framework will serve you better, and rig and genai are genuinely good at that job. crimson-crab is for teams who have already chosen Claude and want the whole surface, exactly as Anthropic ships it.

cargo add crimson-crab

use crimson_crab::model_ids::CLAUDE_OPUS_4_8;
use crimson_crab::prelude::*;

#[tokio::main]
async fn main() -> crimson_crab::Result<()> {
    // Reads ANTHROPIC_API_KEY from the environment.
    let client = Client::from_env()?;

    let request = MessagesRequest::builder()
        .model(CLAUDE_OPUS_4_8)
        .max_tokens(1024)
        .messages(vec![MessageParam::user(
            "Explain Rust's borrow checker in one line.",
        )])
        .build()?;

    let message = client.messages().create(&request).await?;
    println!("{}", message.text());
    Ok(())
}

Note the #[tokio::main] in that snippet. On a native target you still bring a runtime; the crate simply declines to pick one for you. tokio sits in crimson-crab's manifest under [dev-dependencies], where the test suite and the seven examples use it. Client is Clone + Send + Sync and shares one connection pool, so you build it once and drop it in your axum state or a plain struct field. No Arc, no Mutex.

The dependency that depends on your target

Here is the native tree in full, because the tidy one-line summary of it would be a lie:

$ cargo tree --edges normal --invert tokio
tokio v1.52.4
├── hyper v1.10.1
│   ├── hyper-rustls v0.27.9
│   │   └── reqwest v0.12.28
│   │       └── crimson-crab v0.1.0
│   ├── hyper-util v0.1.20
│   │   ├── hyper-rustls v0.27.9 (*)
│   │   └── reqwest v0.12.28 (*)
│   └── reqwest v0.12.28 (*)
├── hyper-rustls v0.27.9 (*)
├── hyper-util v0.1.20 (*)
├── reqwest v0.12.28 (*)
├── tokio-rustls v0.26.4
│   ├── hyper-rustls v0.27.9 (*)
│   └── reqwest v0.12.28 (*)
├── tokio-util v0.7.18
│   └── reqwest v0.12.28 (*)
└── tower v0.5.3
    ├── reqwest v0.12.28 (*)
    └── tower-http v0.6.11
        └── reqwest v0.12.28 (*)

Seven distinct paths converge on reqwest there. On a native target tokio is in your build, and it has to be: reqwest's default backend is hyper, and hyper runs on tokio. Any reqwest-based crate advertising itself as tokio-free on native is wrong, and cargo tree -i tokio settles the argument in about a second.

What crimson-crab actually claims is narrower. tokio is not a direct dependency of the library: it appears in the manifest only under [dev-dependencies], and in a native build it reaches you transitively through reqwest. Nothing in the public API names a runtime type either. Streaming hands back a futures_core::Stream, MessageStream is Send + Unpin, and crimson_crab::Error is Send + Sync + std::error::Error. The crate has no opinion about your executor and never asks for one.

use crimson_crab::prelude::*;
use futures_util::StreamExt;

// `request` is the same MessagesRequest built above; `stream` borrows it.
let mut stream = client.messages().stream(&request).await?;
while let Some(event) = stream.next().await {
    if let StreamEvent::ContentBlockDelta {
        delta: ContentDelta::TextDelta { text },
        ..
    } = event?
    {
        print!("{text}");
    }
}
// After draining, the accumulated `Message` is identical in shape to a
// non-streaming response.
if let Some(message) = stream.final_message() {
    println!("\n[stop_reason: {:?}]", message.stop_reason);
}

That plain Stream is what lets the wasm32 result hold. Because no public type names tokio, nothing pins it in place when reqwest resolves to the browser's fetch backend there: hyper falls out of the graph, and tokio falls out with it, which is how the count reaches zero.

cargo check --target wasm32-unknown-unknown passes on default features. Be precise about what that buys you, though: cargo check type-checks and borrow-checks, and it neither codegens nor links. So this is evidence that the crate and its dependency graph resolve cleanly for wasm32, with no feature juggling and no default-features = false incantation to memorize. It is not a wasm artifact, and it is not me promising you a browser demo. Whether your application links and runs in a browser is your integration problem, and this post makes no claim about it.

The honest headline is therefore about the direct dependency list and the public API surface, and it stops there. Say it more loosely than that and a reader runs one command and stops trusting you. Only the precise version survives contact with cargo tree.

The docs cannot rot

cargo test --all-features gives 191 passed, 0 failed. The composition is the interesting part:

Kind	Count
Unit tests (in `src/`)	43
Integration tests (7 files, wiremock)	35
Doc-tests	113
Total	191 passed, 0 failed

113 of 191. The majority of this test suite is the documentation, executing.

Precision matters here too, because "every example runs" is exactly the sort of thing people say loosely. Of the 113 doc-tests, 21 are marked no_run: cargo test compiles and type-checks those against the real API but does not execute them, since they would need a live API key. The other 92 compile and run. None are marked ignore, so there is no example in the docs that the compiler never sees. A doc example that drifts out of sync with the API does not quietly become a stale snippet somebody files an issue about eighteen months later; it becomes a red build on the commit that broke it.

The integration tests run against wiremock, so the whole suite works offline: no API key, no network, no rate limits, no flakes. You can clone the repo on a plane and get a green run.

Panic-freedom you can check

Plenty of libraries describe themselves as panic-free in a README paragraph. This one is checkable in about ten seconds. src/lib.rs opens with:

#![forbid(unsafe_code)]
#![cfg_attr(
    not(test),
    deny(clippy::unwrap_used, clippy::expect_used, clippy::panic, clippy::todo)
)]
#![deny(missing_docs)]

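and cargo clippy --all-features --all-targets passes with 0 warnings. The not(test) scope means those denies bind the library build, so the property holds for a mechanical reason: cargo clippy fails on any library code that calls unwrap, expect, panic! or todo!. These are Clippy lints, so it is cargo clippy that enforces them, not a plain cargo build, and they do not cover every way to panic (slice indexing, for example, sits behind a separate lint). deny(missing_docs) is a rustc lint, so an undocumented public item fails the build outright, which is quietly what keeps the doc-test count honest too.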

A response from a model that does not exist yet

Every wire enum in the crate carries an Unknown catch-all: content blocks, stream events, deltas, stop reasons, tool definitions, cache TTLs, thinking configs. An unrecognised variant is preserved as raw JSON and re-serialized unchanged, so deserialization never errors on it. The enums are #[non_exhaustive], so you match with a wildcard arm and a minor release can add a known variant without breaking your build.

That is the whole mechanism, and it happens to be one of the 113 doc-tests. Here it is exactly as it appears in src/types/content.rs:

use crimson_crab::types::{ContentBlock, TextBlock};

let json = serde_json::json!({"type": "text", "text": "Hello"});
let block: ContentBlock = serde_json::from_value(json.clone()).unwrap();
assert_eq!(block, ContentBlock::Text(TextBlock::new("Hello")));
assert_eq!(serde_json::to_value(&block).unwrap(), json);

// An unrecognised block type is preserved rather than rejected.
let novel = serde_json::json!({"type": "brand_new", "foo": 1});
let block: ContentBlock = serde_json::from_value(novel.clone()).unwrap();
assert!(matches!(block, ContentBlock::Unknown(_)));
assert_eq!(serde_json::to_value(&block).unwrap(), novel);

The last two assertions are the ones that matter. A block type nobody had invented when the crate was published deserializes cleanly, and serializes back to a JSON value equal to the one that arrived. Your process keeps running and your logs keep the payload.

The model field is an open string everywhere, which is the same principle applied one level up. The crate exports constants (CLAUDE_OPUS_4_8, CLAUDE_FABLE_5, CLAUDE_SONNET_5, CLAUDE_HAIKU_4_5) purely as conveniences; a model that is absent from that list still works, passed as a plain string. Beta flags get the same escape hatch. .beta("some-flag") appends an anthropic-beta flag and .extra_field(key, value) sets a top-level body field, so a beta that shipped this morning is reachable today without waiting on an SDK release.

Retries, and the streaming timeout

Connection errors, timeouts, 408, 409, 429 and 5xx retry with full-jitter exponential backoff: 0.5s base, 8s cap. retry-after is honored and capped at 60s, so a hostile or broken server cannot park your retry loop for an hour. Streaming requests retry only before the first byte, which is the only safe answer once tokens are already on the wire.
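To make the arithmetic concrete, here is a sketch of a delay calculation with those numbers. It is not crimson-crab's code: retry_delay is a hypothetical function, the random sample is passed in as `unit` to keep the example std-only, and returning the capped retry-after value without jitter is my assumption.

```rust
use std::time::Duration;

const BASE: Duration = Duration::from_millis(500);
const CAP: Duration = Duration::from_secs(8);
const RETRY_AFTER_CAP: Duration = Duration::from_secs(60);

/// `attempt` counts from 0; `unit` is a caller-supplied random sample in [0, 1].
fn retry_delay(attempt: u32, unit: f64, retry_after: Option<Duration>) -> Duration {
    // A server hint wins, but only up to the cap.
    if let Some(hint) = retry_after {
        return hint.min(RETRY_AFTER_CAP);
    }
    // Exponential ceiling: base * 2^attempt, saturating, then capped at 8s.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    let ceiling = BASE.saturating_mul(factor).min(CAP);
    // Full jitter: scale the ceiling by the sample, treating bad input as 0.
    let unit = if unit.is_finite() { unit.clamp(0.0, 1.0) } else { 0.0 };
    ceiling.mul_f64(unit)
}
```

With full jitter the exponential curve only sets the upper edge of the window. The actual wait is drawn from anywhere below it, which is what spreads out a crowd of clients that all failed at the same moment.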

The streaming detail I like most is the timeout. The client applies an idle read timeout rather than a total-request deadline, so a long but actively flowing SSE response is never cut off merely for crossing an elapsed-time limit. If you have ever had a long generation guillotined at exactly thirty seconds by an HTTP client that counts elapsed time and ignores whether data is still arriving, you already know why that distinction earns its place in the design.
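The difference between the two kinds of timeout is easiest to see in code. This sketch is not the crate's implementation: it uses a std channel as a stand-in for an SSE body, and read_with_idle_timeout is a hypothetical name.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Collects chunks until the sender hangs up, failing only if the stream
/// goes quiet for longer than `idle`.
fn read_with_idle_timeout(
    chunks: &Receiver<String>,
    idle: Duration,
) -> Result<String, String> {
    let mut body = String::new();
    loop {
        // The timer restarts on every call, so it measures silence, not total time.
        match chunks.recv_timeout(idle) {
            Ok(chunk) => body.push_str(&chunk),
            // Sender dropped: the stream ended normally.
            Err(RecvTimeoutError::Disconnected) => return Ok(body),
            Err(RecvTimeoutError::Timeout) => {
                return Err(format!("no data for {:?}", idle));
            }
        }
    }
}
```

A stream that delivers a chunk every 100 ms can run for minutes under a 300 ms idle limit, while a connection that goes silent is cut off after 300 ms. A total-request deadline would get both cases wrong.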

Status, honestly

v0.1.0, published 16 July 2026. A single crate: roughly 6,076 lines in src/ and 1,445 in tests/, with 7 runnable examples (basic, batches, prompt_caching, streaming, structured_output, thinking, tool_use). MSRV 1.75, edition 2021, dual-licensed MIT OR Apache-2.0. Raising the MSRV would be a minor-version change.

Things this post does not contain: benchmarks, latency figures, throughput numbers. None were measured, so none are quoted. There are no adoption numbers either, because it launched hours before this went up and there aren't any yet. What I can hand you instead is a test suite, a clippy run and a dependency tree, every one of which you can reproduce yourself in a few minutes on your own machine. For a v0.1.0, that seems like the honest trade.

Find crimson-crab

crimson-crab project site — the one-page tour
crimson-crab on crates.io · cargo add crimson-crab
docs.rs/crimson-crab — the API reference, and the source of those 113 doc-tests
crimson-crab on GitHub — README, ARCHITECTURE.md, and the examples

crimson-crab is an independent open-source project and is not affiliated with Anthropic. Every number in this post comes from a command you can run against a fresh clone: cargo test --all-features, cargo clippy --all-features --all-targets, and cargo tree.

Prompt Injection Isn't a Filtering Problem, It's an Architecture Problem

AI Explore — Thu, 16 Jul 2026 13:00:58 +0000

TL;DR: Most prompt injection defenses treat it like spam detection: classify the bad input, block it, move on. That approach plateaus because natural language has no reliable boundary between instruction and data. The fix that actually scales is architectural: provenance tracking and capability-scoped execution, the same lesson SQL injection taught us decades ago with prepared statements.

Every few weeks a new jailbreak technique makes the rounds, a vendor ships a patch, and the cycle repeats. Look at the shape of that cycle closely and you'll notice it's identical to the spam filter arms race from the 2000s: classify the bad input, block it, wait for the next mutation. It never converged then either. The reason prompt injection defense keeps plateauing is that the industry is solving the wrong layer of the problem. Detecting malicious text is a losing game because natural language has no syntactic boundary between an instruction and a piece of data. SQL had this exact problem, and it wasn't solved by getting better at pattern-matching malicious strings. It was solved by making the ambiguity structurally impossible.

The category error

A prompt injection classifier, whether it's a fine-tuned model, a set of regex heuristics, or a hosted moderation API, is fundamentally trying to answer: does this text look like an attack? That's a content question. But the actual vulnerability isn't in the content. It's in the fact that your system concatenates untrusted data (a scraped webpage, a support ticket, a tool response, another agent's output) into the same context window as your trusted system instructions, with no mechanism for the model to reliably tell them apart.

This is the same category error as thinking SQL injection is a string-sanitization problem. You can strip quotes, escape special characters, and blocklist keywords all day, and someone will find the encoding you missed. The actual fix was prepared statements: separate the query structure from the data values at the protocol level, so the database engine never has the opportunity to misinterpret data as code. The vulnerability didn't get patched. It got architected out.

Why detection plateaus

Guardrail models and jailbreak classifiers behave like spam filters: statistical classifiers operating on an adversarial, unbounded input space. Every defense you publish becomes a training signal for the next attack. Indirect prompt injection makes this worse, because the attacker doesn't even need access to your chat interface. They just need their payload to end up somewhere your agent will read it: a PDF, a calendar invite, a GitHub issue, a product review. The attack surface isn't your prompt box. It's every document your agent is allowed to ingest.

The deeper issue is that a transformer processes a token stream, not a labeled data structure. Instructions and data live in the same representational space. You can add delimiters, role tags, and system-prompt framing, but the model is still doing next-token prediction over a flat sequence, and a sufficiently well-crafted piece of "data" can shift its behavior exactly like an instruction would. This isn't a bug that better training eliminates. It's a structural property of how these models currently ingest context. Treating it as a bug you can patch away with a stronger classifier is why the arms race never ends.

What the SQL analogy actually teaches

The lesson from prepared statements isn't "add more escaping." It's "stop trusting the channel to self-report its own trust level, and instead track provenance outside the content itself." Applied to LLM systems, that means every piece of context entering a prompt should carry a provenance tag that lives in your orchestration layer, not in the text the model sees: system, user, retrieved_untrusted, tool_output_untrusted. The model doesn't need to understand this tag semantically. Your control plane does.

Once you have provenance, you can enforce policy outside the model: content tagged retrieved_untrusted is never allowed to trigger a tool call directly, is never concatenated into a position that resembles a system instruction, and is never permitted to modify the agent's plan without a re-authorization step that originates from a trusted source. This is taint tracking, borrowed straight from web security, applied to context assembly instead of variable assignment. It doesn't require the model to be smarter about detecting attacks. It requires your pipeline to stop giving untrusted text the same execution privileges as trusted instructions.
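A minimal sketch of what that looks like in an orchestration layer. The four tags come from the text above; everything else (ContextItem, may_trigger_tool, assemble_prompt, and the choice to treat both system and user as trusted) is an assumption for illustration. How a proposed tool call gets attributed to an origin is the hard part and is left out.

```rust
/// Where a piece of context came from. Lives in the orchestration layer only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Provenance {
    System,
    User,
    RetrievedUntrusted,
    ToolOutputUntrusted,
}

struct ContextItem {
    provenance: Provenance,
    text: String,
}

impl Provenance {
    fn is_trusted(self) -> bool {
        matches!(self, Provenance::System | Provenance::User)
    }
}

/// Policy check made by the control plane, never by the model.
fn may_trigger_tool(origin: Provenance) -> bool {
    origin.is_trusted()
}

/// The model sees only the text; the tags stay behind for enforcement.
fn assemble_prompt(items: &[ContextItem]) -> String {
    items
        .iter()
        .map(|item| item.text.as_str())
        .collect::<Vec<_>>()
        .join("\n")
}
```

Note that assemble_prompt never writes the tag into the prompt. The injected text still reaches the model, but the decision about what it may cause is made on data the attacker cannot edit.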

Defense in depth means capability boundaries, not more classifiers

If provenance tracking is the schema-level fix, capability scoping is the runtime fix. The actual damage from a successful injection almost never happens in the text output. It happens in the action the model was permitted to take afterward: sending an email, executing a database write, calling a payment API, exfiltrating data through a crafted URL. Defense in depth for agentic systems means treating every tool call as if the instruction behind it might be attacker-controlled, and asking: what's the blast radius if it is?

Least privilege by default. An agent summarizing customer emails should not hold the same credential as one that can issue refunds. Split them, even if it's the same underlying model.
No untrusted-to-action shortcuts. Content from retrieval or tool output should never be able to directly trigger a high-privilege action without passing through a policy check that doesn't rely on the model's own judgment.
Output-side sanitization. If your agent's output can end up rendered in a browser, executed as code, or interpolated into another prompt downstream, treat that output the way you'd treat any untrusted string headed for a sink: encode it for its destination.
Human confirmation at privilege boundaries. Not for every action, which just trains people to click through, but specifically at the moment an agent crosses from read to write, or from internal to external effect.

None of this stops an injection from happening. It stops an injection from mattering. That's the actual definition of defense in depth: layers that assume the outer layer will eventually fail, and are designed so failure there doesn't cascade into a compromised action.
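The first, second and fourth rules in that list can be sketched as a single authorization function that runs outside the model. All names here (Effect, Decision, ToolCall, authorize) are hypothetical, and the exact mapping from effect to decision is one reasonable policy, not the only one.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Effect {
    Read,
    Write,
    External,
}

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Allow,
    NeedsHumanConfirmation,
    Deny,
}

struct ToolCall<'a> {
    tool: &'a str,
    effect: Effect,
    /// True only if the triggering instruction came from a trusted source.
    trusted_origin: bool,
}

/// `granted` is the list of tools this particular agent is allowed to use.
fn authorize(call: &ToolCall, granted: &[&str]) -> Decision {
    // Least privilege: a tool outside this agent's grant is never callable.
    if !granted.contains(&call.tool) {
        return Decision::Deny;
    }
    match (call.effect, call.trusted_origin) {
        // Reads stay cheap so people are not trained to click through.
        (Effect::Read, _) => Decision::Allow,
        // No untrusted-to-action shortcut for writes or external effects.
        (_, false) => Decision::Deny,
        // Trusted, but crossing a privilege boundary: ask a human.
        (_, true) => Decision::NeedsHumanConfirmation,
    }
}
```

Nothing in authorize inspects the text of the instruction. It only looks at what the action can do and where the request came from, which is why a cleverly worded payload has nothing to argue with.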

What to actually build

If you're responsible for an LLM system today, the ROI ordering is not: better jailbreak classifier, then better system prompt, then better guardrail model. It's the reverse. First, map every place untrusted content enters your context and tag it at the orchestration layer. Second, enumerate every tool call your agent can make and assign it the minimum privilege required, with no tool holding blanket access because it was convenient during a demo. Third, put policy enforcement on actions, not on text: the model's opinion about whether an instruction is legitimate is not a security boundary. Only after those three are in place does a jailbreak classifier earn its keep, as a monitoring signal rather than a gate.

The model providers will keep improving instruction-following robustness, and that helps. But waiting for the model to become un-injectable is waiting for a property no current architecture guarantees. The systems that hold up under real adversarial pressure are the ones that assumed injection would eventually succeed, and built the blast radius small enough that it wouldn't matter when it did.

The Silent Socket: A Gemini-TTS Batch That Hung Forever, and the Deadline That Smashed It

AI Explore — Wed, 15 Jul 2026 15:00:02 +0000

This is a submission for DEV's Summer Bug Smash: Smash Stories powered by Sentry.

There's a special kind of bug that a retry loop can't save you from: the request that never fails. It doesn't throw. It doesn't time out. It just… sits there, socket open, forever — and your carefully-written exponential backoff waits right alongside it, because backoff only runs when something errors, and nothing did.

I hit exactly that while running a large batch of text-to-speech jobs through Google's Gemini TTS to narrate long-form content. Here's how a job that was "still running" for hours turned out to be doing nothing at all, and the one-line reframe that fixed it: a hang is not an error, so give every await a deadline.

The setup

The pipeline is simple in shape: take a long piece of text, split it into paragraph-sized chunks, send each chunk to Gemini TTS, get back audio, then concatenate the clips into one file. Dozens to hundreds of chunks per run. Because any single network call can flake, each chunk was wrapped in retry-with-backoff:

async function ttsWithRetry(chunk) {
  for (let attempt = 0; attempt < 4; attempt++) {
    try {
      return await callGeminiTTS(chunk); // <- the await that betrayed me
    } catch (err) {
      await sleep(2 ** attempt * 1000); // 1s, 2s, 4s, 8s
    }
  }
  throw new Error("TTS failed after retries");
}

Clean. Resilient. Ran beautifully for small batches.

The symptom

On the big runs, the job would chew through most of the chunks, gradually slow down, and then simply stop making progress. No error. No crash. No log line. The process was still alive — it just never produced another clip.

The tell was in the system stats: the process sat at 0% CPU with no new output files appearing. A crashed job is dead; this one was a ghost — present, "running," accomplishing nothing. And my backoff, the thing I'd added specifically to survive network trouble, never printed a single retry.

Why the retries never fired

I checked the connections and there it was: an ESTABLISHED but completely silent socket. The request had gone out, the connection was open, and then… nothing came back. Not an error, not a reset, not a close. Just an open pipe with no bytes flowing.

That's the whole bug in one sentence: my retry logic only handled failures, and a hang is not a failure. callGeminiTTS(chunk) never rejected, so the catch never ran, so backoff never fired. await will wait as long as you let it — and I had told it to wait forever. Under sustained load, an occasional connection would degrade into this silent state, and one stuck chunk was enough to freeze the entire batch behind it.

The fix: race every call against a deadline

The repair wasn't a bigger retry count or a fancier backoff curve. It was to make a hang look like the error it morally is — by racing every TTS call against a timeout, so a stall becomes a rejection that backoff can actually catch:

function withTimeout(promise, ms, label) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timeout after ${ms}ms: ${label}`)), ms);
  });
  // Clear the timer once the race settles so a finished call leaves nothing pending.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

async function ttsWithRetry(chunk) {
  for (let attempt = 0; attempt < 4; attempt++) {
    try {
      // a silent socket now loses the race and REJECTS in 180s
      return await withTimeout(callGeminiTTS(chunk), 180_000, chunk.id);
    } catch (err) {
      await sleep(2 ** attempt * 1000);
    }
  }
  throw new Error("TTS failed after retries");
}

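Now a stuck request rejects after 180 seconds, the catch runs, backoff kicks in, and the next attempt opens a fresh connection instead of inheriting the dead one. One caveat: Promise.race only stops waiting on the hung call. It does not cancel it, so the silent socket stays open until the client or the OS gives up on it. The batch stopped freezing. Runs that used to wedge indefinitely now self-heal and finish.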

Two things carried the win:

Every network await got an upper bound. An await without a timeout is an unbounded promise to wait — and "forever" is a value your program can actually take. Bounding it turned an un-catchable hang into an ordinary, retryable error.
Detection got a real signal. I stopped assuming "process alive" meant "work happening." The reliable liveness check was new output files over time + CPU, not the mere existence of the process.
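The second point can be turned into a small watchdog. This is a sketch under assumptions, not what I ran: ProgressWatch is a hypothetical type, it tracks only the count of finished clips (not CPU), and the stall threshold is whatever you pass in.

```rust
use std::time::{Duration, Instant};

/// Flags a job as stalled when its output count stops moving for too long.
struct ProgressWatch {
    last_count: usize,
    last_change: Instant,
    stall_after: Duration,
}

impl ProgressWatch {
    fn new(now: Instant, stall_after: Duration) -> Self {
        ProgressWatch { last_count: 0, last_change: now, stall_after }
    }

    /// Record how many clips exist at `now`; returns true if the job looks stalled.
    fn is_stalled(&mut self, finished: usize, now: Instant) -> bool {
        if finished != self.last_count {
            // Progress was made, so restart the stall clock.
            self.last_count = finished;
            self.last_change = now;
            return false;
        }
        now.duration_since(self.last_change) >= self.stall_after
    }
}
```

Polling this every few seconds with the number of output files on disk gives an alert that fires on silence, which "is the process alive?" never will.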

Before / after

Before: one degraded connection to Gemini TTS silently stalls a chunk → the whole batch hangs at 0% CPU → retries never fire because nothing errored → I babysit and kill it by hand.
After: a stall loses a 180s race → rejects → backoff opens a new connection → batch finishes on its own.

Where Google AI comes in

The engine doing the actual work here is Google's Gemini TTS, turning text into natural narration at batch scale. Google AI made the product possible; this fix made the pipeline around it reliable — which is the unglamorous half of shipping anything on top of a model. The most valuable thing I did for a Google-AI feature this quarter wasn't a prompt. It was giving its network calls a deadline.

The lesson

Retry logic guards against failure. It does nothing against silence. If a call can hang — and any network call can — then "resilient" means every single await has a deadline, and a stall is converted into an error your existing recovery can see.

A hang is not an error. So make it one. This bug is smashed — the batch now fails fast and heals itself instead of waiting politely, forever, for a socket that already gave up.

samkhya v1.1: Never Regress — Putting a Model in Your Query Optimizer Without Letting It Wreck the Plan

AI Explore — Wed, 15 Jul 2026 02:09:09 +0000

Every SQL query optimizer runs on a guess. Before it picks a join order it asks a question it usually can't answer well — how many rows will this step produce? — and the whole plan hangs off that number. Get it wrong by a few orders of magnitude, which optimizers routinely do on multi-join queries, and the planner cheerfully builds a hash table for a billion rows that turn out to be a thousand. Bad row-count estimates are the single most reliable way to turn a good schema into a slow one.

samkhya — Sanskrit सांख्य, "enumeration," the classical discipline of counting reality's constituents honestly — is an engine-agnostic Rust SDK whose only job is to make those counts less wrong, and to do it safely. Safely is the whole trick. The moment you let a learned model correct the optimizer's estimates, you've handed the most performance-critical number in the system to something that can be miscalibrated, stale, or — if it's an LLM — outright hallucinating. samkhya's answer is a guarantee it calls never-regress at the bound level: a corrected estimate is clamped under a ceiling the library can prove, so a bad model can never push the optimizer past a bound it can defend.

This is the deep dive — the clamp, the architecture, the three swappable backends, and the benchmark I pre-registered and then failed, reported as such.

The optimizer's oldest bug: it doesn't know how many rows

Cardinality estimation is decades old and still unsolved in practice. Engines lean on histograms and independence assumptions that fall apart the instant columns correlate or joins compose. The errors don't just accumulate; they multiply down a join tree, so a modest per-predicate mistake becomes a catastrophic whole-plan one. The academic fix is well known — feed real execution feedback back into the estimator — but feedback-driven correction has a dangerous failure mode: a correction that's wrong in the other direction makes the plan worse than the naïve estimate you started with. You can't ship a "usually helps" optimizer input. You need one that can't hurt.

Never-regress: a ceiling a model can't argue with

samkhya separates the suggestion from the guarantee. A corrector proposes a number. Then that number is clamped from above by a provable pessimistic ceiling before the optimizer ever sees it. The ceiling is an LpJoinBound: an LP relaxation over the ℓp-norms of the join's degree sequences, inspired by Zhang et al.'s LpBound (SIGMOD 2025 Best Paper — the idea, not a reimplementation), with no machine learning anywhere in it. It's pure combinatorics on the data's structure, so it holds regardless of what the model believes.

The library's own worked example says it best. An over-eager corrector proposes one million rows for a join whose true cardinality is six:
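A minimal sketch of that scenario, which is not samkhya's own code or API. clamp_estimate is a hypothetical function, the ceiling is taken as an already-computed number, and falling back to the native estimate on a non-finite or negative proposal is my assumption.

```rust
/// `proposed`: the corrector's suggestion, or None on a cold start.
/// `native`: the engine's own estimate. `ceiling`: the provable upper bound.
fn clamp_estimate(proposed: Option<f64>, native: f64, ceiling: f64) -> f64 {
    match proposed {
        // A usable proposal is kept, but never above the ceiling.
        Some(rows) if rows.is_finite() && rows >= 0.0 => rows.min(ceiling),
        // No proposal, or a nonsensical one: fall back to the engine's estimate.
        _ => native,
    }
}
```

With an illustrative ceiling of 24 rows, a proposal of 1,000,000 comes out as 24, a proposal of 10 passes through untouched, and a cold start returns whatever the engine would have guessed anyway.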

That is the guarantee in one picture: a hallucinating model gets pinned to a ceiling it can't breach, and with no feedback yet — a cold start — samkhya falls back to the engine's own native estimate. The worst case is "the engine you already had." One honesty note the library insists on: this is a guarantee about the bound, not about wallclock. On some workloads samkhya is slower; the benchmarks below say exactly where.

The shape: one sidecar, one trait, one ceiling

samkhya is a library, not a service — no daemon, no background thread, no GPU in the default build, and the whole 13-crate workspace compiles in under two minutes on a laptop with no network. Three layers, safety always last:

Portable stats via Iceberg Puffin sidecars. Classical sketches — HyperLogLog, Bloom, Count-Min, equi-depth and 2D correlated histograms — are serialized into versioned, KIND-tagged blobs inside an Iceberg Puffin file. The same sidecar an ELT pipeline writes is loaded, unchanged, and handed to the DataFusion and client-side DuckDB adapters. No engine owns the stats; the sidecar does. That's what "engine-agnostic" means: DuckDB, DataFusion, Polars, Postgres, Iceberg, and gpudb all read the same portable payload.
A single Corrector trait. One pluggable surface: propose a corrected estimate given the sketches and whatever feedback exists. Swap the backend without touching the engine integration or the safety layer.
The LpBound envelope — never regress. Every proposal, from every backend, passes through the clamp above before it reaches the planner. This layer is the floor of the design; there's no soft switch to turn it off.
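The ordering of those layers can be sketched with a trait and one function. The real Corrector trait takes sketches and feedback; the one-argument signature here, the two toy backends, and estimate_for_planner are all simplifications invented for this example.

```rust
/// Simplified stand-in for the pluggable surface (not the library's signature).
trait Corrector {
    /// Propose a corrected row count, or None when there is no feedback yet.
    fn propose(&self, native_estimate: f64) -> Option<f64>;
}

/// A backend with nothing to say yet.
struct ColdStart;
impl Corrector for ColdStart {
    fn propose(&self, _native_estimate: f64) -> Option<f64> {
        None
    }
}

/// A backend that wildly overshoots, like a hallucinating model might.
struct Overeager;
impl Corrector for Overeager {
    fn propose(&self, native_estimate: f64) -> Option<f64> {
        Some(native_estimate * 1_000.0)
    }
}

/// Safety runs last, whichever backend is plugged in.
fn estimate_for_planner(backend: &dyn Corrector, native_estimate: f64, ceiling: f64) -> f64 {
    match backend.propose(native_estimate) {
        Some(rows) => rows.min(ceiling),
        None => native_estimate,
    }
}
```

Because the clamp lives in estimate_for_planner and not in any backend, swapping ColdStart for Overeager changes the suggestion but not the guarantee.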

Three backends behind one trait

Because correction and safety are separated, the corrector is genuinely swappable. samkhya ships three reference backends, each behind a Cargo feature flag and all capped by the same LpBound ceiling:

GBT (default). A sub-megabyte gradient-boosted-tree backend (gbdt-rs). No GPU, always on, boring in the best way — this is what ships.
TabPFN-2.5 (opt-in). A tabular foundation model behind the tabpfn_http feature, for when you have a GPU and want more accuracy.
LLM-pluggable (opt-in). An HTTP corrector shipping dual transport in v1.0: a canonical Python FastAPI server (port 8766) and a parity Node TypeScript port (port 8767) speaking the same wire contract, each with four reference backends — Anthropic, OpenAI, local Ollama, and a dummy. You can put an LLM in the query-optimizer loop precisely because the clamp means it can't hurt you.

The two backends worth putting numbers on, each measured against its own pre-registered bar:

The honest numbers — including the one I failed

Here is where most READMEs get shy. samkhya pre-registers its targets and reports every result, including the misses. The headline real-workload number comes from the actual Join-Order Benchmark — JOB-Slow, 55 paired warm-cache queries from the 113-query IMDb suite, against unmodified DataFusion:

Read that scoreboard honestly, because it's the whole point. The 0 losses is the guarantee paying off — the clamp did its job and nothing regressed. But the effect is small: a 1.038× geomean, statistically real (BCa 95% CI [1.026, 1.056], Wilcoxon p=3×10⁻⁶) yet nowhere near the ≥1.35× I pre-registered. That target — along with ≥1.6× join-heavy and ≥1.50× headline — is falsified, and the receipts name why: warm-cache only, CSV not Parquet, a small query budget, OOM past one heavy query. On a deliberately adversarial workload of seven patterns, samkhya is slower — a 0.949× cross-pattern geomean, ~5% off, worst cold-start cell +12.4%. That row exists on purpose.
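For readers who want to check a geomean like that against raw timings, this is the usual calculation, sketched with a hypothetical geomean_speedup function. It assumes speedup is defined as baseline time divided by samkhya time per query, so values above 1 mean faster; the repo's receipts are the authority on the exact definition used.

```rust
/// Geometric mean of per-query speedups (baseline time / treated time).
/// Returns None for empty or mismatched inputs.
fn geomean_speedup(baseline_ms: &[f64], treated_ms: &[f64]) -> Option<f64> {
    if baseline_ms.is_empty() || baseline_ms.len() != treated_ms.len() {
        return None;
    }
    // Average the logs of the ratios, then exponentiate.
    let log_sum: f64 = baseline_ms
        .iter()
        .zip(treated_ms)
        .map(|(b, t)| (b / t).ln())
        .sum();
    Some((log_sum / baseline_ms.len() as f64).exp())
}
```

A geomean is used instead of an arithmetic mean so that one query that got 10× faster cannot hide ten that each got slightly slower.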

The one microbenchmark that shines is the bound's tightness — how close the provable ceiling sits to the truth, which is what makes the clamp useful rather than vacuous:

Here's the measured-headlines table, straight from the repo:

Headline	Measured	Significance
JOB-Slow vs DataFusion (real IMDb, n=55) — pre-reg ≥1.35× FALSIFIED	geomean 1.038×; 17W / 38T / 0L	CI [1.026, 1.056]; p=3×10⁻⁶
Adversarial A–G (7 patterns) — reported on purpose	0.949× (~5% slower); worst +12.4%	per-pattern CIs in receipts
LpJoinBound vs AGM tightness (star-5, p=1)	40.95× tighter	CI [30.93, 47.45]; p=1.73×10⁻⁶
TabPFN-2.5 latency (RTX 4090 Laptop)	P95 31.15 ms (< 50 ms)	CI [29.39, 35.32]
TabPFN-2.5 q-error reduction vs GBT	7.84% (target 15% — magnitude falsified)	CI [2.21, 14.62]; p=1.04×10⁻⁵

Why ship a benchmark you failed

Because a number you pre-register and then report even when it embarrasses you is worth more than one you reverse-engineered from whatever your library happened to do. The falsified 1.35× isn't a bug in the write-up; it is the write-up. A cardinality corrector that quietly cherry-picked its wins would be exactly the "usually helps" input you can't trust in a planner. samkhya's pitch is the opposite: a small, real, statistically-defensible improvement on honest workloads, a provable ceiling that guarantees you never regress, and a benchmark table that shows the losses next to the wins. For an input this deep in the critical path, "never worse, sometimes better, and here are the receipts" beats a headline multiple.

Where it fits, and how to try it

samkhya is Apache-2.0, single-author, and built to drop into embedded analytical engines rather than replace them. If you run DuckDB, DataFusion, Polars, Postgres, or an Iceberg lakehouse and your multi-join plans occasionally fall off a cliff, the value proposition is narrow and honest: portable stats you write once and read everywhere, a corrector as simple (GBT) or as ambitious (an LLM) as you like, and a clamp that means the ambitious option can't cost you a regression.

cargo add samkhya-core # Rust
pip install samkhya # Python — PyO3 bindings, single abi3 wheel

Build a Puffin sidecar from a column and hand it to the DataFusion adapter — the full quick start, the architecture, and every receipt behind the numbers above are in the repo. For a one-page tour with a live demo command, see the samkhya project page.

Repo & receipts: project page · samkhya on GitHub · crates.io · PyPI — pip install samkhya · docs.rs · Iceberg Puffin spec

Every figure here is copied from the samkhya README's "Measured headlines" table and its linked bench-results/ receipts. Synthetic microbenchmarks are scoped to exactly what they measure; the real-workload JOB-Slow number is the honest headline.

A Redirect Is Not a Receipt: The Automation Bug That Told Me It Succeeded

AI Explore — Tue, 14 Jul 2026 23:41:19 +0000

This is a submission for DEV's Summer Bug Smash: Smash Stories powered by Sentry.

The worst bugs don't crash. They smile at you, hand you a green checkmark, and send a "success!" email while the actual work quietly didn't happen. I shipped one of those. It cost me trust before it cost me anything else — because for a while, I believed it.

Here's the story of a false-success bug in a browser automation, why the check passed when the action had failed, and the one-line reframe that fixes this whole class of mistake: a redirect is not a receipt.

The setup

I have a small Playwright automation that drives a web editor: open a draft, fill it in, click Publish, done. To decide whether the publish worked, I used the most natural signal I could see — the URL. In this app, a draft lives at a URL ending in /edit. Once you publish, the app navigates you away from /edit to the published post. So my success check was, essentially, "did we leave the edit page?"

// click Publish, then wait to leave the editor
await publishButton.click();
await page.waitForURL((u) => !u.pathname.endsWith("/edit"), { timeout: 25000 });
// got here? must be published. 🎉
return page.url();

It worked in every test. It worked for weeks. Then one day it lied to my face.

The day it lied

I ran the automation. Console said PUBLISHED. I got the "🟢 published" email. I went to look at the live post to admire it — and it wasn't there. The content was still sitting in the draft, untouched.

The action had been rejected — the platform rate-limited me — but my automation reported a clean success anyway. Worse than a crash: a crash tells the truth. This handed me a confident lie and an email to match.

Why the check passed when the publish failed

I pulled the real URL the automation ended on and saw it:

/p/<id>/submission?redirectUrl=%2Fp%2F<id>%2Fedit

Read that carefully. When the publish was rejected, the app bounced the request through an intermediate /submission URL whose query string contained the edit path (redirectUrl=…%2Fedit), before dumping me back on the editor with a red banner.

Now look at my check again: !u.pathname.endsWith("/edit"). The pathname of that bounce URL is /p/<id>/submission — it does not end in /edit. The /edit is in the query string, which pathname ignores. So my "did we leave the edit page?" test returned true… on a page that was, functionally, still the edit page mid-rejection.
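The pathname/query split is easy to reproduce outside the browser. This sketch uses the standard URL class; the host and the id 123 are placeholders I made up, and only the path shape comes from the bounce URL above.

```javascript
// Hypothetical host and id; only the path shape comes from the bounce URL.
const bounce = new URL(
  "https://example.com/p/123/submission?redirectUrl=%2Fp%2F123%2Fedit"
);

// pathname is the path only; the query string is not part of it.
const pathname = bounce.pathname;
// searchParams.get() returns the percent-decoded query value.
const redirectTarget = bounce.searchParams.get("redirectUrl");

// What the old check computed: true, even though the publish was rejected.
const leftEditor = !pathname.endsWith("/edit");
// What the URL was actually saying: "I am about to send you back to /edit".
const bouncingBackToEdit =
  redirectTarget !== null && redirectTarget.endsWith("/edit");

console.log({ pathname, redirectTarget, leftEditor, bouncingBackToEdit });
```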

I had confused a side effect (the URL changed) with a confirmation (the post is live). Those are not the same thing, and the gap between them is exactly where this bug lived. My "success" signal was a proxy for success, and the day the proxy and the reality disagreed, I believed the proxy.

The fix: confirm the postcondition, don't infer it

The repair wasn't a smarter URL regex. It was to stop inferring success from navigation and instead poll for the real outcome — either a genuinely published post URL, or the rejection banner that tells me it didn't work:

// Resolve the TRUE outcome: a real published URL, or an explicit rejection.
let outcome = { published: false, reason: "timeout — no nav, no banner" };
for (let i = 0; i < 20; i++) {
  await page.waitForTimeout(1000);
  const s = await page.evaluate(() => {
    const txt = document.body.innerText || "";
    // the platform's own rejection / rate-limit banner
    const rejected = /maximum of.* in the past 24 hours/i.test(txt);
    return { path: location.pathname, rejected };
  });
  if (s.rejected) { outcome = { published: false, reason: "rate-limited" }; break; }
  // a bounce lands on /submission or stays on /edit — NOT success
  const stillEditing =
    s.path.endsWith("/edit") || s.path.includes("/submission");
  if (!stillEditing) { outcome = { published: true, reason: "live URL" }; break; }
}
return outcome;

Two changes carry all the weight:

Positive confirmation, not absence of the old state. "Not on /edit" is not proof of publication. "On a real post URL" is. I now require the success state to be present, not merely the failure state to be gone.
Make failure loud. The platform was already telling me it rejected the publish — there was a banner. I just wasn't reading it. The automation now watches for that banner and reports rate-limited honestly, with a truthful email, instead of faking a win.
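One way to make "positive confirmation" concrete is to pull the decision into a pure function that only says "published" when the path matches a known-good shape. This is a sketch, not the script I run: classifyOutcome is a hypothetical helper, and the /posts/<slug> pattern is an assumed URL shape you would replace with your app's real one.

```javascript
// Decide the outcome from observed page state.
// `publishedPattern` is an assumption: a regex describing what a live post URL looks like.
function classifyOutcome(state, publishedPattern) {
  // Loud failure first: the platform's own rejection wins over any URL signal.
  if (state.rejected) return { published: false, reason: "rejected" };
  // Positive confirmation: the success state must be present.
  if (publishedPattern.test(state.path)) return { published: true, reason: "live URL" };
  // Anything else (bounce pages, unknown routes) is unconfirmed, never success.
  return { published: false, reason: "unconfirmed" };
}

const livePost = /^\/posts\/[a-z0-9-]+$/; // hypothetical URL shape
console.log(classifyOutcome({ path: "/p/123/submission", rejected: false }, livePost));
console.log(classifyOutcome({ path: "/posts/my-first-post", rejected: false }, livePost));
```

The design choice is the default branch: an unrecognized page counts as "unconfirmed" rather than "published", so a new redirect the app invents later fails closed.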

Before / after

Before: URL no longer ends in /edit → "published" → success email. Silently wrong the moment the app used a redirect bounce.
After: poll until either a real post URL (published) or the rejection banner (not published); report exactly what happened. No more confident lies.

What I learned

The lesson generalizes far past this one script. Any automation that decides success from an easy-to-observe side effect — a redirect, an HTTP 200 on a page that renders its real error in the body, a file that exists but is half-written, a job that "started" — is one edge case away from cheerfully reporting success on a failure. The failure modes hide in the gap between "something changed" and "the thing I wanted is true."

So: verify the postcondition, not a proxy for it. And when the system is already shouting its failure at you in a red banner, read the banner. A navigation is not a confirmation. A redirect is not a receipt.

The bug that smiles at you is worse than the one that screams. This one is smashed — now it screams when it should.

Your LLM Evals Are a Flaky Test Suite — Treat Them Like One

AI Explore — Tue, 14 Jul 2026 21:39:16 +0000

Your eval score moved from 82 to 79 overnight. Nobody changed the prompt. Nobody changed the model. You re-ran it and got 84. So you shipped, because 84 is up and to the right, and the standup was in ten minutes.

That is not an evaluation. That is a coin flip wearing a lab coat. If a unit test passed, failed, and passed again on the same code, you'd call it flaky and either fix it or quarantine it — you would never let it gate a release. LLM evals behave exactly like that flaky suite, and most teams read the noise as signal anyway.

TL;DR — An LLM eval is a test suite over a non-deterministic system graded by another non-deterministic system, on a set that quietly changes. Its real failure modes are the ones every flaky suite has: uncontrolled seeds, a drifting judge, a leaking or mutating dataset, and point scores reported without variance. Fix them the boring way — pin what you can, calibrate the judge, freeze and version the set, report distributions, gate on overlap — and "the model got worse" mostly turns back into "the measurement was noisy."

The score jitters because everything under it is non-deterministic

There are at least three independent random processes stacked in a single eval run. The model samples tokens with temperature. The judge — if you use LLM-as-judge — samples its verdict the same way. And the eval set you're scoring against is usually small, so a handful of items flipping swings the headline number by several points.

Stack three noise sources and you get a metric that moves on its own. Treating a 3-point delta as a regression, when your run-to-run standard deviation is 4 points, is the eval equivalent of chasing a race condition by adding print statements: you're reacting to jitter, not to change.

Pin the seeds you can; budget for the ones you can't

The first move with any flaky test is to remove the removable randomness. Same here. For scoring evals — factual accuracy, format compliance, pass/fail on a rubric — run the model at temperature = 0 and pin every sampling parameter you're allowed to pin. You will not get true determinism (kernels, batching, and hardware still wobble the logits), but you cut the variance hard.

For anything where you actually care about the spread — creativity, refusal behavior, robustness — do the opposite deliberately: fix a set of seeds and run each item across all of them. The point is that randomness is either controlled or measured. What you can't do is leave it uncontrolled and then read the average as if it were stable.
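A seed grid can be sketched in a few lines. Here scoreItem is a hypothetical stand-in for "call the model with this seed, then grade the output"; the toy scorer exists only so the example runs without a model.

```javascript
// Run every item under every seed so the spread is measured, not ignored.
// `scoreItem(item, seed)` is a hypothetical stand-in for model call + grading.
function scoreAcrossSeeds(items, seeds, scoreItem) {
  return seeds.map((seed) => {
    const scores = items.map((item) => scoreItem(item, seed));
    const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
    return { seed, mean };
  });
}

// Toy scorer: deterministic in (item, seed), so re-running reproduces the table.
const toyScore = (item, seed) => ((item.length + seed) % 2 === 0 ? 1 : 0);

const perSeed = scoreAcrossSeeds(["alpha", "beta", "gamma"], [1, 2, 3], toyScore);
console.log(perSeed); // one mean per seed; the variation between them is the signal
```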

The LLM judge is a flaky dependency, so calibrate it

The moment your grader is itself a model, you've added a second system-under-test and started trusting its output without a test of its own. Judges drift across model versions, contradict themselves on re-run, and carry position and verbosity biases — the same answer scores differently depending on whether it was labeled A or B.

Treat the judge like any dependency you don't control: hold it to a golden set. Assemble a few dozen items you've hand-labeled, measure the judge's agreement with those human labels, and track that agreement number over time. If the judge agrees with humans only 70% of the time, up to 30% of its verdicts are suspect before the model under test says a single word. Calibrate the ruler before you measure the thing.
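The agreement number itself is cheap to compute. A minimal sketch, assuming labels are simple strings and that raw agreement (not a chance-corrected statistic) is good enough to start tracking; the labels below are invented.

```javascript
// Fraction of golden-set items where the judge's label matches the human label.
function judgeAgreement(humanLabels, judgeLabels) {
  if (humanLabels.length === 0 || humanLabels.length !== judgeLabels.length) {
    throw new Error("label lists must be non-empty and the same length");
  }
  let matches = 0;
  for (let i = 0; i < humanLabels.length; i++) {
    if (humanLabels[i] === judgeLabels[i]) matches++;
  }
  return matches / humanLabels.length;
}

// Hypothetical golden set of four hand-labeled items.
const human = ["pass", "fail", "pass", "pass"];
const judge = ["pass", "pass", "pass", "fail"];
console.log(judgeAgreement(human, judge)); // 0.5
```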

Version the eval set like source code, because it is

Here's the silent killer: someone "improves" the eval set. They fix a typo in a gold answer, drop three items that "weren't fair," add ten new hard cases. Now this week's 86 and last week's 81 are scored against different exams, and every comparison you make across that boundary is meaningless. That's not a better eval — it's a broken git history for your ground truth.

Eval data is code. Put it under version control, review changes to it in PRs, and stamp every result with the dataset hash it ran against. Never compare two scores that ran against different hashes without saying so out loud. And guard against the other direction — leakage — where eval items drift into training data or few-shot examples and the score climbs because the model has seen the answer key. A number that only goes up is often a number that's been contaminated.
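Stamping results with a dataset hash might look like the sketch below. To keep it import-free I use a small FNV-1a hash over the JSON text; that is a stand-in I chose for illustration, and a real pipeline would more likely use SHA-256. All function names here are hypothetical.

```javascript
// FNV-1a (32-bit) over UTF-16 code units: a tiny non-cryptographic stand-in.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Note: JSON.stringify is order-sensitive, so reordering items changes the hash too.
const datasetHash = (items) => fnv1a(JSON.stringify(items));

// Every result carries the hash of the exam it was scored against.
const stampResult = (score, items) => ({ score, datasetHash: datasetHash(items) });

// Refuse to compare scores from different eval sets.
function scoreDelta(before, after) {
  if (before.datasetHash !== after.datasetHash) {
    throw new Error("scores ran against different eval sets");
  }
  return after.score - before.score;
}

const v1 = [{ q: "Capital of France?", gold: "Paris" }];
const v2 = [{ q: "Capital of France?", gold: "paris" }]; // someone "fixed" a gold answer
console.log(scoreDelta(stampResult(81, v1), stampResult(86, v1))); // 5
```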

Report the distribution, not the point score

A single number — "84" — throws away the one thing you needed: how much it moves when nothing changes. Run each eval N times (5 is enough to start), and report the mean with a spread — standard deviation or a confidence interval. That interval is your noise floor. It tells you the smallest change you're actually able to detect.

Once you have that, regressions get a real definition. A new prompt scoring 84 ± 3 against a baseline of 82 ± 3 has not demonstrably improved anything: the intervals overlap, so you can't tell it apart from the same system. A drop to 74 ± 3 against 82 ± 3 is a genuine regression, because the intervals are cleanly apart. This is the same discipline as flaky-test triage: you don't chase a failure until you've shown it reproduces outside the noise band.
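The overlap rule is simple enough to write down. This sketch treats "mean ± one sample standard deviation" as the interval, which is a crude choice I'm assuming for illustration; a proper confidence interval would be narrower or wider depending on N.

```javascript
// Mean and sample standard deviation of repeated runs of the same eval.
function summarize(scores) {
  const n = scores.length;
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  const variance = scores.reduce((acc, s) => acc + (s - mean) ** 2, 0) / (n - 1);
  return { mean, spread: Math.sqrt(variance) };
}

// Two results are indistinguishable if their [mean - spread, mean + spread] bands touch.
function intervalsOverlap(a, b) {
  return a.mean - a.spread <= b.mean + b.spread &&
         b.mean - b.spread <= a.mean + a.spread;
}

const baseline = { mean: 82, spread: 3 };
console.log(intervalsOverlap({ mean: 84, spread: 3 }, baseline)); // true: same system
console.log(intervalsOverlap({ mean: 74, spread: 3 }, baseline)); // false: real regression
console.log(summarize([80, 82, 84])); // { mean: 82, spread: 2 }
```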

Gate merges on evals like tests, and quarantine the noisy ones

Evals only protect you if they run in CI and can actually block a merge. Wire them to the pipeline, set thresholds in terms of the confidence interval — "must not regress by more than the noise floor" — and fail the build when a change crosses that line. A dashboard nobody's merge depends on is a decoration, not a safety net.

And do the last flaky-test move: quarantine. When an eval is too noisy to gate on, don't average it into the headline number where it hides real movement — pull it into a watch list, tighten it, and only promote it back to gating once its variance is under control. Averaging a noisy eval into a clean one doesn't cancel the noise; it launders it into a number you'll trust by mistake.
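Gating and quarantine can both be expressed as small pure functions that a CI step calls. The names, the "sum of spreads" noise floor, and the quarantine threshold are assumptions for this sketch, not a standard.

```javascript
// Fail the build only when the drop is larger than the combined noise band.
function gate(baseline, candidate) {
  const noiseFloor = baseline.spread + candidate.spread;
  const drop = baseline.mean - candidate.mean;
  return { pass: drop <= noiseFloor, drop, noiseFloor };
}

// Evals too noisy to gate on go to a watch list instead of the headline number.
function partitionEvals(evals, maxSpread) {
  return {
    gating: evals.filter((e) => e.spread <= maxSpread),
    watchList: evals.filter((e) => e.spread > maxSpread),
  };
}

// Hypothetical eval summaries (mean ± spread over repeated runs).
const evals = [
  { name: "format-compliance", mean: 97, spread: 1 },
  { name: "open-ended-helpfulness", mean: 71, spread: 9 },
];
console.log(partitionEvals(evals, 4));
console.log(gate({ mean: 82, spread: 3 }, { mean: 74, spread: 3 })); // pass: false
```

In CI, a failing gate would translate to a non-zero exit code; the watch list would be reported but not block the merge.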

The reframe

Stop treating a moved eval score as a verdict about the model's intelligence. It's a measurement, produced by a noisy pipeline you built, and measurements have error bars whether or not you print them. Almost everything that makes LLM evals feel unknowable is a discipline software testing already solved: control the randomness, calibrate the grader, version the fixtures, report the variance, gate on the signal.

Do that, and most "the model regressed" incidents resolve into "the measurement was noisy and we finally noticed." Your model is probably fine. Go fix the test suite around it.

The Model You Shipped Isn't the Model That Runs

AI Explore — Tue, 14 Jul 2026 21:34:55 +0000

TL;DR— Small language models get treated as portable artifacts— a single GGUF or ONNX file that behaves identically everywhere. It doesn't. Quantization kernels, compiler graph fusion, and accelerator-specific math mean the same weights produce different outputs on different devices. On-device AI's real bottleneck isn't model size— it's that evaluation has to happen per hardware target, and almost nobody is doing that.

Everyone treats a small language model as a file. You quantize it, you export it to GGUF or ONNX or a vendor-specific format, and you ship it. The assumption baked into that workflow is that the file is the model— that the same weights produce the same behavior wherever they land. On a server cluster running one accelerator type, that assumption mostly holds. On a device fleet spanning phones, laptops, wearables, and embedded boards, it's false, and the gap between those two worlds is the real unsolved problem in on-device AI.

The model is not a file. It's a stack: weights, a quantization scheme, a compiled execution graph, and an accelerator's specific arithmetic. Change any layer of that stack and you change the outputs— sometimes subtly, sometimes in ways that break a downstream agent loop or silently degrade a classifier's confidence. Nobody evaluates at that granularity. Almost everyone evaluates the checkpoint once, in one environment, and assumes portability that doesn't exist.

Quantization is not one thing

"We shipped an INT4 model" sounds like a single decision. It isn't. INT4 on one NPU uses per-channel scaling; on another it uses per-tensor scaling with a different calibration set. Some backends use symmetric quantization, others asymmetric with a zero-point offset. Rounding modes differ. Outlier handling differs— some runtimes clip, others use mixed precision for a handful of sensitive channels. Two "INT4" deployments of the identical checkpoint can diverge in perplexity by a meaningful margin, and that divergence is invisible unless you actually run both and compare.

This matters more for language models than for, say, image classifiers, because autoregressive generation compounds small numeric differences. A slightly different logit distribution at token five changes the sampled token, which changes the context for token six, which changes everything downstream. A model that's "basically the same" numerically can produce a noticeably different completion, tool call, or refusal decision depending on which chip generated it.
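To see how two "INT4" schemes diverge on identical weights, here is a toy sketch of symmetric quantization with per-tensor versus per-channel scaling. It is a simplified illustration under my own assumptions (tiny matrix, round-half-up via Math.round, no calibration set), not any vendor's kernel.

```javascript
// Quantize one group of values to signed `bits`-bit integers with a shared scale,
// then dequantize back to floats ("fake quantization").
function fakeQuantize(values, bits) {
  const qmax = 2 ** (bits - 1) - 1; // 7 for INT4
  const maxAbs = Math.max(...values.map((v) => Math.abs(v)));
  const scale = maxAbs === 0 ? 1 : maxAbs / qmax;
  return values.map((v) => {
    const q = Math.max(-qmax, Math.min(qmax, Math.round(v / scale)));
    return q * scale;
  });
}

// Per-tensor: one scale for the whole matrix (rows assumed equal length).
function perTensor(rows, bits) {
  const flat = fakeQuantize(rows.flat(), bits);
  return rows.map((row, r) => flat.slice(r * row.length, (r + 1) * row.length));
}

// Per-channel: one scale per row.
const perChannel = (rows, bits) => rows.map((row) => fakeQuantize(row, bits));

// Largest absolute reconstruction error across the matrix.
function maxError(original, restored) {
  const a = original.flat();
  const b = restored.flat();
  return Math.max(...a.map((v, i) => Math.abs(v - b[i])));
}

// One small-magnitude channel next to one large-magnitude channel.
const weights = [[0.1, -0.2], [8.0, -8.0]];
console.log("per-tensor error: ", maxError(weights, perTensor(weights, 4)));
console.log("per-channel error:", maxError(weights, perChannel(weights, 4)));
```

With a single shared scale, the large channel sets the step size and the small channel collapses to zero; with per-row scales it survives. Same checkpoint, same bit width, different model.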

The compiler is a silent co-author

Beneath quantization sits the compiler that turns your graph into device-specific kernels. Operator fusion, memory layout decisions, and scheduling heuristics are not neutral— they change numerical precision at intermediate steps, especially in attention and normalization layers where operation order affects floating-point accumulation. Two compilers targeting the same hardware can produce measurably different outputs from the same ONNX graph, because they fuse operations differently and accumulate in different intermediate precisions.
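The "operation order affects accumulation" point can be demonstrated with nothing but addition. The sketch below shows it in float64, then emulates float32 accumulation with Math.fround; it illustrates the arithmetic effect only and says nothing about what any particular compiler does.

```javascript
// Same three numbers, two association orders, float64.
const leftToRight = (0.1 + 0.2) + 0.3; // 0.6000000000000001
const rightToLeft = 0.1 + (0.2 + 0.3); // 0.6

// Emulate float32 accumulation: round to float32 after every add.
const f32 = Math.fround;
const addF32 = (a, b) => f32(f32(a) + f32(b));

// In float32 the spacing between representable values near 1e8 is 8,
// so adding 1 to 1e8 is lost entirely.
const bigFirst = addF32(addF32(1e8, 1), -1e8);          // 0
const cancelFirst = addF32(addF32(1e8, -1e8), 1);       // 1

console.log({ leftToRight, rightToLeft, bigFirst, cancelFirst });
```

A fused kernel that reorders a reduction, or accumulates in a narrower type, is making exactly this kind of choice at every layer.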

This is the part teams underestimate. You didn't just pick a model and a quantization level. You picked a model, a quantization level, a compiler, and a compiler version, and each of those is a variable that can move independently of the others when a vendor ships an update. Your model didn't change. Its behavior did anyway.

Fragmentation is an evaluation problem, not a model problem

The instinct when facing this is to look for a magic format— one intermediate representation that guarantees identical behavior everywhere. That's not coming, for the same reason browser rendering never fully converged despite decades of standards work: the underlying hardware is genuinely different, vendors compete partly on their own kernel implementations, and "identical" isn't actually what accelerator vendors are optimizing for. They're optimizing for throughput and power on their own silicon, and numerical bit-for-bit parity with a competitor's chip is not on their roadmap.

So the fragmentation is permanent. The only thing that changes is whether you treat it as an evaluation problem. Web engineering solved its version of this with browser compatibility matrices and cross-browser CI— not by demanding browsers converge, but by testing against the real, divergent set of targets you actually ship to. On-device AI needs the same discipline and mostly doesn't have it. Most eval pipelines run once, on one reference build, and call it done.

What a real device-tier eval matrix looks like

Practically, this means building an evaluation matrix keyed by hardware tier, not just by model version. For each accelerator class you actually ship to, you need a held-out eval set run through the exact production stack— same quantization, same compiler, same runtime— not a proxy run on a dev machine with a generic backend. You're not evaluating the checkpoint. You're evaluating the deployed artifact on that specific tier.

You also need drift detection across compiler and runtime updates, the same way you'd gate a dependency bump in any other production system. A vendor SDK update that changes kernel fusion should trigger a re-run of your eval suite on affected devices before rollout, not after a support ticket tells you accuracy dropped on one phone model.
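One way to wire drift detection is to fingerprint the whole deployed stack per tier and re-run the eval whenever the fingerprint changes. Everything in this sketch is hypothetical: the field names, the tier names, and the version strings are placeholders.

```javascript
// Everything that can change behavior on a tier, not just the model version.
function stackKey(t) {
  return [t.tier, t.model, t.quantization, t.compiler, t.runtime].join("|");
}

// Tiers whose deployed stack differs from the stack their last eval ran on.
function tiersNeedingRerun(targets, lastEvaluated) {
  return targets
    .filter((t) => lastEvaluated[t.tier] !== stackKey(t))
    .map((t) => t.tier);
}

// Hypothetical fleet: the flagship tier just received a compiler update.
const targets = [
  { tier: "flagship-npu", model: "slm-v3", quantization: "int4-per-channel", compiler: "vendorc-2.1", runtime: "rt-5.0" },
  { tier: "midrange-gpu", model: "slm-v3", quantization: "int4-per-tensor", compiler: "vendorc-2.0", runtime: "rt-5.0" },
];
const lastEvaluated = {
  "flagship-npu": "flagship-npu|slm-v3|int4-per-channel|vendorc-2.0|rt-5.0",
  "midrange-gpu": stackKey(targets[1]),
};

console.log(tiersNeedingRerun(targets, lastEvaluated)); // [ 'flagship-npu' ]
```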

Telemetry from the fleet matters more here than in server deployments, precisely because you can't reproduce every device locally. Sampling real outputs from a representative slice of the device population— with consent and privacy constraints respected— is the only way to catch divergence you didn't anticipate in the lab. Server-side, one bad pod gets drained. Edge-side, a bad output pattern on one chip family might sit undetected for a release cycle.

Tiered fallback beats uniform expectations

The other implication is architectural: stop expecting uniform quality across your device fleet and design for tiers instead. A flagship phone's NPU and a five-year-old mid-range chip are not going to deliver the same effective model, even with identical weights, because the achievable quantization level and kernel efficiency differ. Build explicit tiers— full precision where the hardware supports it, more aggressive quantization with a validated accuracy floor where it doesn't, and a defined fallback (a smaller model, a cloud call, or a narrower feature set) below that floor. Pretending every device gets the same model is how you end up debugging a "random" quality complaint that's actually a specific chip generation running a specific compiler build that nobody put in the eval matrix.
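Tier selection can be a lookup against measured accuracy rather than a hope. A minimal sketch, assuming you have per-variant accuracy measured on each tier's production stack; the variant names, the 0.90 floor, and the scores are invented.

```javascript
// Pick the best variant that clears the accuracy floor on this tier,
// otherwise fall back (smaller model, cloud call, or narrower feature set).
function chooseVariant(measuredAccuracy, floor) {
  // Preference order is an assumption: highest fidelity first.
  for (const variant of ["full-precision", "quantized"]) {
    const score = measuredAccuracy[variant];
    if (score !== undefined && score >= floor) return variant;
  }
  return "fallback";
}

// Hypothetical measurements per tier.
console.log(chooseVariant({ "full-precision": 0.94, quantized: 0.92 }, 0.9)); // full-precision
console.log(chooseVariant({ quantized: 0.91 }, 0.9));                         // quantized
console.log(chooseVariant({ quantized: 0.85 }, 0.9));                         // fallback
```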

Small models made on-device AI possible. They didn't make it uniform. The size of the weights was never the hard part— the hard part is that the same weights are not the same model once they leave your test environment, and evaluation has to catch up to that reality before on-device deployment can be trusted the way server deployment is.

Your AI Agent Is a Distributed System — Debug It Like One

AI Explore — Mon, 13 Jul 2026 14:37:53 +0000

Your agent didn't "hallucinate a wrong action." It called a tool that timed out, retried without an idempotency key, charged the customer twice, lost its scratchpad on the third hop, and then produced a confident summary of a state that no longer existed. None of that is an intelligence problem. It's the same class of problem a payments team or an orchestration team has been fighting for decades.

The moment you give a model tools, memory, and a loop, you have stopped building a prompt and started building a distributed system — one where every step is a network call to a flaky, slow, non-deterministic service. The good news: that's a solved-ish discipline. The bad news: most agent code ignores all of it.

TL;DR — An agent is a control loop over unreliable remote calls. Its real failure modes are distributed-systems failures: non-idempotent retries, timeouts and partial failure, lost or corrupted state, and zero observability. Fix those with the boring tools — idempotency keys, timeouts, durable state, tracing, trajectory tests — and the "AI reliability" problem mostly dissolves.

An agent is a control loop over calls that fail

Strip the branding and an agent is a while loop: ask the model what to do, execute a tool, feed the result back, repeat until done. Every arrow in that loop crosses a network boundary to something that can be slow, down, or lying — the LLM API, a search endpoint, your own microservices, a database.

That is the exact shape of a distributed system: independent components, unreliable links, no shared clock, partial failures. "The network is reliable" is the first of the classic fallacies of distributed computing, and the literature is unambiguous about what happens if you pretend otherwise. An agent that assumes every tool call succeeds instantly and exactly once is a demo, not a system.
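Stripped to its skeleton, the loop looks something like this. It is a synchronous sketch with a scripted decide function standing in for the model and plain functions standing in for remote tools; in a real agent every one of those calls is asynchronous and can fail.

```javascript
// The agent loop: ask what to do, execute a tool, feed the result back, repeat.
function runAgent(decide, tools, maxSteps) {
  const history = [];
  for (let step = 0; step < maxSteps; step++) {
    const action = decide(history);                  // stand-in for the LLM call
    if (action.done) return { answer: action.answer, history };
    const result = tools[action.tool](action.input); // stand-in for a remote tool call
    history.push({ action, result });
  }
  throw new Error("step budget exhausted");
}

// Scripted "model": search once, then answer from the result.
const decide = (history) =>
  history.length === 0
    ? { tool: "search", input: "order 17 status" }
    : { done: true, answer: history[0].result };

const tools = { search: (query) => `result for: ${query}` };

console.log(runAgent(decide, tools, 5).answer); // result for: order 17 status
```

Note that even this toy needs a step budget: without maxSteps, a model that never says "done" loops forever.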

Retries need idempotency, or you double-charge the customer

The first thing everyone adds when agents flake is a retry. The second thing everyone learns — usually in production — is that retrying a non-idempotent action is how you send two emails, create two tickets, or move money twice.

The model can't see this. It calls charge_card, the call times out on the response (but succeeded on the server), the loop retries, and now there are two charges. The fix is not a smarter prompt — it's the same one Stripe hands every integrator: an idempotency key. Every side-effecting tool takes a caller-supplied key; the downstream service dedupes on it. Reads can retry freely; writes retry safely. This is a plumbing decision, and it belongs in your tool layer, never in the model's judgment.
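The dedupe itself is a few lines on the service side. This is an in-memory sketch of the idea, not Stripe's API: makePaymentService, the key format, and the amounts are all hypothetical, and a real service would persist the key table.

```javascript
// A toy downstream service that dedupes on a caller-supplied idempotency key.
function makePaymentService() {
  const charges = [];
  const seen = new Map(); // idempotency key -> original result
  return {
    charges,
    chargeCard(idempotencyKey, amountCents) {
      // Same key again: return the original result, do NOT charge twice.
      if (seen.has(idempotencyKey)) return seen.get(idempotencyKey);
      const result = { chargeId: charges.length + 1, amountCents };
      charges.push(result);
      seen.set(idempotencyKey, result);
      return result;
    },
  };
}

const service = makePaymentService();
// The tool layer picks the key; it must stay the same across retries of one step.
const key = "run-42:step-3:charge_card";
const first = service.chargeCard(key, 1999);
const retry = service.chargeCard(key, 1999); // retry after a lost response

console.log(service.charges.length, first.chargeId === retry.chargeId); // 1 true
```

The key is derived from the run and step, not generated fresh per attempt; a fresh key per attempt would defeat the whole mechanism.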

Timeouts and partial failure are the default, not the edge case

Ask what happens when a tool call hangs, and most agent frameworks answer "it hangs." No timeout, no deadline budget, no cancellation: one slow dependency and the whole agent sits there burning tokens and wall-clock time.

Borrow the service playbook: put a timeout on every call, propagate a deadline across the whole run (this task has 30 seconds total, not 30 seconds per hop), and decide explicitly what a timed-out step means. Did it fail, or did it succeed and you just didn't hear back? That ambiguity (success-but-no-ack) is one of the hardest problems in distributed systems, and your agent hits it constantly. Handle it on purpose, or the model will improvise on top of a state it can't observe.
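A per-call timeout plus a run-wide deadline can be sketched with plain promises. Both helpers below are hypothetical names; note that withTimeout only stops waiting, it does not cancel the underlying work, which is exactly where the success-but-no-ack case comes from.

```javascript
// How long the next call may take: the per-call cap or what is left of the run, whichever is smaller.
function callBudgetMs(deadlineAt, now, perCallCapMs) {
  const remaining = deadlineAt - now;
  if (remaining <= 0) throw new Error("run deadline exceeded");
  return Math.min(perCallCapMs, remaining);
}

// Reject if the promise has not settled within `ms`. The work itself is NOT cancelled.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// A run that started at t=0 with a 30 s budget, now at t=29 s, with a 5 s per-call cap:
console.log(callBudgetMs(30000, 29000, 5000)); // 1000

// Usage sketch: wrap any tool call with the computed budget.
withTimeout(Promise.resolve("tool result"), callBudgetMs(30000, 29000, 5000))
  .then((value) => console.log(value));
```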

State is the hard part — where does the agent's memory actually live?

The conversation buffer is not your state. The moment an agent does anything durable — files a refund, updates a record, kicks off a job — the truth lives out in the world, and your in-context "memory" is a cache of it that goes stale the instant the process restarts.

Treat agent state like you'd treat any workflow's state: make it durable and resumable. Persist each committed step so a crashed run can resume instead of redoing side effects. Separate what's been decided from what's been executed. This is why durable-execution engines are quietly eating the agent-infra space — they bring exactly-once-ish workflow semantics to a loop that otherwise loses its mind on the first restart. If your agent can't survive a process kill mid-run without repeating actions, you have a state bug, not a model bug.
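Resumability boils down to a journal of committed steps that survives the process. In this sketch a Map stands in for that durable store, and the step names and the simulated crash are invented; it shows the idea, not how any particular durable-execution engine works.

```javascript
// Run steps in order, skipping any step already recorded in the journal.
function runWorkflow(steps, journal) {
  for (const step of steps) {
    if (journal.has(step.id)) continue; // already committed: do not redo the side effect
    const result = step.run();
    journal.set(step.id, result);       // commit before moving on
  }
}

let refundsIssued = 0;
let crashedOnce = false;
const steps = [
  { id: "issue-refund", run: () => { refundsIssued++; return "refunded"; } },
  { id: "notify-user", run: () => {
      if (!crashedOnce) { crashedOnce = true; throw new Error("process killed"); }
      return "sent";
    } },
];

const journal = new Map(); // stand-in for a durable store (database, log, workflow engine)
try { runWorkflow(steps, journal); } catch (e) { /* simulated crash mid-run */ }
runWorkflow(steps, journal); // resume after restart

console.log(refundsIssued, journal.get("notify-user")); // 1 sent
```

A crash between run() and journal.set() would still repeat that one step on resume, which is why durable state and idempotency keys are complements rather than alternatives.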

You can't debug what you don't trace

When a service misbehaves, you open a distributed trace and follow the request. When an agent misbehaves, most teams open a wall of print statements and squint. Same problem, worse tooling.

Log the trajectory: every model decision, the exact tool inputs and outputs, latencies, token counts, retries, and the state at each step — correlated by a run ID, ideally as real spans. Then a wrong outcome becomes a traceable incident ("it retried search four times, each timed out at 5s, then answered from stale context") instead of a vibe. You cannot improve what you cannot see, and "the AI is unreliable" is what unreliability looks like when you have no traces.
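A trace does not need a tracing backend to start being useful. This sketch records one span per step into an array, correlated by run ID; the span fields and the traced helper are my own hypothetical shape, and a real setup would export them as proper spans.

```javascript
// Wrap one step: record input, output or error, and latency, even when it throws.
function traced(trace, runId, name, input, fn) {
  const startedAt = Date.now();
  const span = { runId, name, input, startedAt };
  try {
    span.output = fn(input);
    span.status = "ok";
    return span.output;
  } catch (err) {
    span.status = "error";
    span.error = String(err.message);
    throw err;
  } finally {
    span.durationMs = Date.now() - startedAt;
    trace.push(span);
  }
}

const trace = [];
traced(trace, "run-42", "search", "order 17", (q) => `hit for ${q}`);
try {
  traced(trace, "run-42", "charge_card", { amountCents: 1999 }, () => {
    throw new Error("timeout after 5000 ms");
  });
} catch (e) { /* the failure is now in the trace, not just in a log line */ }

console.log(trace.map((s) => `${s.name}: ${s.status}`)); // [ 'search: ok', 'charge_card: error' ]
```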

Test the trajectory, not just the output

A unit test asserts an output. An agent needs more, because the same final answer can come from a clean two-step path or a flailing nine-step one that happened to land. Build a golden set of tasks and assert on the path: did it call the right tools, in a sane order, without redundant or destructive steps, inside the step and cost budget?

Track those metrics over time so a bad model swap or a broken tool shows up as a regression on a dashboard, not as a slow bleed of angry users. This is the unglamorous part everyone skips, and it's the line between an agent you can ship and a party trick you can demo.
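A trajectory assertion can start as a plain function over the list of tool names an agent called. The checks and the spec shape below are assumptions for illustration, and the "immediate repeat" rule is deliberately crude: a real check would compare inputs too.

```javascript
// Return a list of problems with a run's tool-call path; empty means the path is acceptable.
function checkTrajectory(toolCalls, spec) {
  const problems = [];
  if (toolCalls.length > spec.maxSteps) problems.push("over step budget");
  for (const tool of spec.forbidden) {
    if (toolCalls.includes(tool)) problems.push(`forbidden tool: ${tool}`);
  }
  // Required tools must appear in this order (other calls may sit between them).
  let cursor = 0;
  for (const call of toolCalls) {
    if (call === spec.requiredInOrder[cursor]) cursor++;
  }
  if (cursor < spec.requiredInOrder.length) {
    problems.push("required tools missing or out of order");
  }
  // Crude redundancy check: the same tool twice in a row.
  toolCalls.forEach((call, i) => {
    if (i > 0 && call === toolCalls[i - 1]) problems.push(`redundant repeat: ${call}`);
  });
  return problems;
}

// Hypothetical golden task: look up the order, then refund it, in at most 4 steps.
const spec = {
  requiredInOrder: ["lookup_order", "issue_refund"],
  forbidden: ["delete_account"],
  maxSteps: 4,
};

console.log(checkTrajectory(["lookup_order", "issue_refund"], spec)); // []
console.log(checkTrajectory(["search", "search", "issue_refund", "lookup_order", "search"], spec));
```

Both runs could end with the same final answer; only the second one shows up as a failure here.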

The reframe that fixes most of it

Next time your agent "goes rogue," don't reach for a bigger model or a cleverer prompt first. Ask the distributed-systems questions:

Are my side-effecting tools idempotent, with keys, so retries are safe?
Does every call have a timeout and the run a deadline?
Is my state durable and resumable across a restart?
Can I pull the full trace of any run?
Do I have trajectory evals that catch a bad path before users do?

The teams shipping reliable agents aren't the ones with the smartest model. They're the ones who noticed they'd built a distributed system — and engineered it like one.

Your model is probably fine. Go engineer the loop around it.