I spent the last five days publishing 31 Rust crates to crates.io. Most of them are tiny, 150 to 300 lines of code each. None of them are a framework. Each one fixes one specific thing that breaks when you point a smaller open model at a real task.
This is the retrospective. What I built, what I would not build again, and the composition the 31 crates settle into.
The thesis
Big API models hide a lot. They ship strict JSON modes, server-side schema validators, giant context windows, and quietly-correct streaming endpoints. When you swap them for a smaller open model running locally, every one of those crutches disappears. Your output drifts. Your tool args go wrong. Your context budget gets tight fast. The piece you were going to copy from the cloud SDK no longer exists.
I started by writing one helper function for a Bedrock job at work. Then another. Then a few more. By day three it was clear they all wanted to be small Cargo crates instead of a private monorepo of helpers. So I broke them apart and shipped each one.
The 31
Grouped by purpose:
Agent reliability stack (10 crates): agentfit, agentguard, agentsnap, agentvet, agentcast, agenttrace, agentprompt, agentidemp, agenttap, llmfleet. Each one fixes a single failure mode: context budget, egress allowlist, trace snapshots, tool-arg validation, structured-output enforcement, run-level cost rollup, Jinja2 prompt templates, idempotency keys, wire-level introspection, fleet-level batching.
Anthropic-specific primitives (3): claude-cost, claude-stream, llm-json-repair. Cache-aware cost calc, SSE event parser, three-pass JSON repair.
Observability (2): cachebench (per-call prompt-cache hit ratio), otel-genai-bridge (translates between OpenInference and OTel GenAI semconv).
RAG + retrieval (7): ragdrift and ragdrift-core for 5-dimensional drift, embedrank for cosine top-k, promptbudget for token-budget truncation, stopstream for safe stop-sequence detection, citecite for citation markers, ragmetric for IR metrics.
Pure-Rust utility cores (9): snipsplit-core, lshdedup-core, vecnorm-core, toklab-core, annflat-core, maskprompt-core, embedcache-core, textsanity-core, secretsniff-core. These are the boring shared internals everything else depends on.
All MIT, all on crates.io under the same handle. None of them ship a model client. None of them lock you to an async runtime or HTTP client.
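To make the "one crate, one fix" scale concrete: the three-pass repair idea behind llm-json-repair can be sketched in std-only Rust. This is an illustration of the approach, not the crate's actual implementation, and the specific passes here (strip fences, trim to braces, drop trailing commas) are assumptions.

```rust
/// Illustrative three-pass repair for model output that is "almost JSON".
/// A sketch of the idea only; llm-json-repair's real passes may differ.
fn repair_json(raw: &str) -> String {
    // Pass 1: strip markdown code fences the model may have wrapped around the JSON.
    let mut s: String = raw
        .lines()
        .filter(|l| !l.trim_start().starts_with("```"))
        .collect::<Vec<_>>()
        .join("\n");

    // Pass 2: trim leading/trailing prose to the outermost brace pair.
    if let (Some(start), Some(end)) = (s.find('{'), s.rfind('}')) {
        if start < end {
            s = s[start..=end].to_string();
        }
    }

    // Pass 3: drop trailing commas before a closing brace or bracket.
    let chars: Vec<char> = s.chars().collect();
    let mut out = String::with_capacity(s.len());
    for (i, &c) in chars.iter().enumerate() {
        if c == ',' {
            if let Some(&next) = chars[i + 1..].iter().find(|ch| !ch.is_whitespace()) {
                if next == '}' || next == ']' {
                    continue;
                }
            }
        }
        out.push(c);
    }
    out
}
```

Each pass is independent and cheap, which is why a pipeline of them stays under a couple hundred lines.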
What worked
Shipping one crate per pain point, not one framework. A user hitting a Gemma 4 JSON output bug does not want to install a framework. They want the one crate that fixes that specific bug, added with a single line in Cargo.toml. The composition emerges from people picking 3 of the 31, not from me deciding the architecture.
Refusing to wrap a model client. Every Rust LLM library I have used couples me to one HTTP client and one async runtime. The crates I shipped take a usage block, a Value, or a byte stream as input. You bring your own transport. The same cachebench instance works against Anthropic, OpenAI, and Bedrock because it does not care how you made the call.
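The "bring your own transport" shape can be sketched with plain data types: the library takes a usage block as a struct of numbers, so it never touches an HTTP client or a runtime. The field names and price figures below are assumptions for the sketch, not the crates' real API or real pricing.

```rust
/// Illustrative transport-agnostic cost accounting. The function takes plain
/// usage numbers, so it does not care which client or runtime made the call.
struct Usage {
    input_tokens: u64,
    output_tokens: u64,
    cache_read_tokens: u64,
}

/// Per-million-token prices (hypothetical numbers, not a real price sheet).
struct Pricing {
    input_per_mtok: f64,
    output_per_mtok: f64,
    cache_read_per_mtok: f64,
}

fn cost_usd(u: &Usage, p: &Pricing) -> f64 {
    (u.input_tokens as f64 * p.input_per_mtok
        + u.output_tokens as f64 * p.output_per_mtok
        + u.cache_read_tokens as f64 * p.cache_read_per_mtok)
        / 1_000_000.0
}
```

Because the input is a struct you fill in yourself, the same function works whether the usage block came from Anthropic, OpenAI, or Bedrock response JSON.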
Stable wire-format contracts as fingerprints. cachebench::fingerprint(messages, system, tools, model) returns a 16-character SHA prefix. Two calls with the same cacheable prefix get the same fingerprint. That makes per-prefix hit-rate alerting trivial. The trick is excluding the trailing user turn from the hash, so the cacheable prefix is what gets fingerprinted, not the whole request.
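The prefix-exclusion trick can be sketched in a few lines. The real cachebench returns a 16-character SHA prefix; this dependency-free sketch substitutes std's `DefaultHasher` and represents a message as a `(role, content)` pair, both assumptions for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative prefix fingerprint (cachebench uses a SHA prefix instead).
/// A message is a (role, content) pair.
fn prefix_fingerprint(messages: &[(&str, &str)], system: &str, model: &str) -> String {
    // Key trick: drop the trailing user turn, so only the cacheable
    // prefix contributes to the fingerprint.
    let drop_last = messages.last().map_or(false, |(role, _)| *role == "user");
    let prefix = if drop_last {
        &messages[..messages.len() - 1]
    } else {
        messages
    };
    let mut h = DefaultHasher::new();
    system.hash(&mut h);
    model.hash(&mut h);
    prefix.hash(&mut h);
    format!("{:016x}", h.finish())
}
```

Two calls that share a cacheable prefix but differ in the final user question now fingerprint identically, which is exactly what per-prefix hit-rate alerting needs.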
Tests that look like agent traces. agentsnap makes the test for an agent run look like a JSON snapshot of an LLM call sequence. First run records, subsequent runs diff with a unified diff and fail the test if anything regresses. AGENTSNAP_UPDATE=1 refreshes when the change is intentional. Five lines per test.
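The record-or-diff pattern itself fits in one function. This is a std-only sketch of the shape, not agentsnap's actual API; the unified-diff output and JSON handling of the real crate are elided.

```rust
use std::{env, fs, path::Path};

/// Illustrative record-or-diff snapshot helper (agentsnap's real API differs).
/// First run records; later runs compare and fail on any regression.
fn assert_snapshot(path: &str, current: &str) -> Result<(), String> {
    let update = env::var("AGENTSNAP_UPDATE").ok().as_deref() == Some("1");
    if !Path::new(path).exists() || update {
        // First run, or an intentional refresh: record the snapshot.
        fs::write(path, current).map_err(|e| e.to_string())?;
        return Ok(());
    }
    let recorded = fs::read_to_string(path).map_err(|e| e.to_string())?;
    if recorded == current {
        Ok(())
    } else {
        Err(format!(
            "snapshot mismatch:\n--- recorded\n{recorded}\n+++ current\n{current}"
        ))
    }
}
```

The point of the pattern is that the test body stays tiny: serialize the call sequence, hand it to the helper, done.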
What I would not build again
Two crates that overlap. I shipped both agenttrace (run-level cost + latency rollup) and cachebench (per-call cache observability), and they overlap on the "what did this call cost" question. They compose cleanly enough, but a v0.2 of either is going to need to pick a single ownership boundary. Right now both can compute cost_usd and that is one boundary too many.
A crate I shipped before I had a real user. Two of the 31 are crates I built for "this seems useful" without anyone actually asking. They sit at zero downloads and they will probably stay there. If I had it to do over I would write the helper inline first, hit the same problem twice in two different real projects, and only then extract.
The first draft of the retry policy. agentcast shipped with a retry policy that exponentially backed off on schema-validation failures. That was wrong. Schema failures are deterministic; an exponential backoff does not help, because the same prompt will fail the same way no matter how long you wait. The right policy is "retry once with a precise hint, then give up and surface the issue." Fixed in v0.1.1, but the v0.1.0 reasoning was bad and I should have caught it.
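The corrected policy is small enough to sketch. This is an illustration of the "retry once with a precise hint" shape, not agentcast's real API; `call` stands in for whatever transport you bring.

```rust
/// Illustrative "retry once with a precise hint" policy for deterministic
/// schema failures. The closure stands in for your model transport and
/// receives an optional repair hint to fold into the prompt.
fn call_with_schema_retry<F>(mut call: F) -> Result<String, String>
where
    F: FnMut(Option<&str>) -> Result<String, String>,
{
    match call(None) {
        Ok(out) => Ok(out),
        // No backoff: a deterministic failure repeats identically.
        // Retry exactly once, feeding the validator's error back as a hint.
        Err(schema_err) => call(Some(&schema_err)),
    }
}
```

If the second attempt also fails, the error surfaces to the caller instead of burning more tokens on doomed retries.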
The composition I keep coming back to
A minimal agent loop on a small open model, sketched in Rust:
use agentcast::Caster;
use agentfit::{Fitter, Strategy};
use agenttap::Tap;
use agentvet::Validator;

async fn one_turn(question: &str, tap: &Tap) -> anyhow::Result<String> {
    // 1. Fit the prompt to a budget the model will not gag on
    let messages = build_messages(question);
    let fitted = Fitter::new(8_000).fit(messages, Strategy::DropOldest);

    // 2. Call the model (your transport, wrapped by `tap` for debug)
    let raw = call_with_tap(&fitted, tap).await?;

    // 3. Repair + validate the structured output
    let action = action_caster()?.parse(&raw)?;

    // 4. If the model picked a tool, validate args before running it
    if action["kind"] == "tool" {
        let v = tool_validator(&action["tool"])?;
        v.validate(&action["args"])
            .map_err(|e| anyhow::anyhow!(e.for_llm()))?;
        run_tool(&action).await
    } else {
        Ok(action["text"].as_str().unwrap_or("").to_string())
    }
}
Twelve lines of glue. Five concerns. Each concern is its own crate, each crate is around 200 lines of code, each crate ships with tests and a README and a doctest.
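The budget-fitting step in that loop is representative of how small each concern is. Here is a std-only sketch of a DropOldest strategy; the chars-divided-by-four token estimate and the function shape are assumptions, not agentfit's real implementation.

```rust
use std::collections::VecDeque;

/// Illustrative DropOldest budget fitter. Token counting is approximated
/// as chars / 4 here; agentfit's real counting and API may differ.
fn fit_drop_oldest(messages: Vec<String>, budget_tokens: usize) -> Vec<String> {
    let approx_tokens = |m: &str| m.chars().count() / 4 + 1;
    let mut total: usize = messages.iter().map(|m| approx_tokens(m)).sum();
    let mut msgs = VecDeque::from(messages);
    // Drop the oldest turns until the remainder fits the budget,
    // always keeping the most recent message.
    while total > budget_tokens && msgs.len() > 1 {
        if let Some(old) = msgs.pop_front() {
            total -= approx_tokens(&old);
        }
    }
    msgs.into_iter().collect()
}
```

A real fitter would also pin the system prompt and count tokens with the model's tokenizer, but the control flow is this simple.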
The unglamorous lesson
Most of the time the right thing to ship is a 200-line crate, not a 2000-line framework. The 200-line crate composes with three other 200-line crates that someone else wrote. The 2000-line framework composes with itself.
If you are building anything on a smaller open model and want a starting point, all 31 are on crates.io. Names above; the family handle is the same as my dev.to handle.
Happy shipping.