Eli Hadam Zucker
No LLM Classifier. No Latency Tax. How Rada Routes Cloud Requests in Pure Rust.

In Post 1 I covered the broad architecture. In Post 2 I went deep on the co-determination matrix and Sentinel. This post is the third piece: the Autorouter.

The Autorouter answers a deceptively simple question: when a request needs cloud, which cloud model should handle it?

Most platforms solve this with either a dropdown menu (pick your model) or a lightweight LLM classifier that reads the prompt and decides where to send it. Both approaches have real costs. The dropdown puts the decision on the developer. The classifier adds a serial dependency: you pay latency and tokens before the actual work starts.

Rada does neither. The Autorouter is a pure Rust function. It pattern-matches on a handful of signals and resolves a cloud tier in sub-millisecond time. The decision is made before the HTTP request leaves your machine.

Three lanes, not one highway

The cloud side of Rada is organized into three tiers:

Heavyweight: frontier reasoning models for complex architecture, multi-file planning, and large-context tasks. Think Claude Sonnet, DeepSeek V3.

Workhorse: solid general-purpose models for standard builds and explanations. Gemini Flash, Mistral Nemo.

Micro: lightweight models for small completions, lint fixes, and refactors that slightly exceed local capability. Qwen 2.5 Coder 32B, Gemini Flash-8B.

All routing goes through OpenRouter, which is live in production. The tier system means Rada picks the right weight class for the task rather than defaulting to the most expensive option.
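To make the tier idea concrete, here's roughly what a tier-to-model mapping looks like as plain data. This is a sketch: the function name is mine, and the identifiers are illustrative shorthand, not exact OpenRouter slugs or the full production lineup.

```rust
// Sketch of a tier-to-model mapping. Identifiers are illustrative,
// not the exact OpenRouter slugs or the full production lineup.
fn models_for_tier(tier: &str) -> &'static [&'static str] {
    match tier {
        "heavyweight" => &["claude-sonnet", "deepseek-v3"],
        "workhorse" => &["gemini-flash", "mistral-nemo"],
        _ => &["qwen-2.5-coder-32b", "gemini-flash-8b"],
    }
}
```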

The routing signals

The Autorouter reads four signals to classify each request:

  1. Intent. The developer's selected mode (Refactor, Build, or Explain) establishes a starting tier. Refactors start light. Builds start heavier. This is the same intent axis from the co-determination matrix, now extending into cloud routing.

  2. Token payload. The function estimates the total token count of the prompt, active code, and any retrieval context. Larger payloads get bumped up a tier because they typically represent more complex tasks that benefit from stronger reasoning.

  3. Content signals. The prompt is scanned for architectural indicators. If the request involves system design, scaffolding, or multi-file coordination, it routes to Heavyweight regardless of starting tier. I won't enumerate the exact keyword set here, but the detection is string-based and fast; a sketch of the scan follows after this list.

  4. Memory band. Sentinel's real-time RAM classification feeds directly into the Autorouter. If the developer's machine is under critical memory pressure, even a lightweight refactor gets bumped to a higher cloud tier. The logic: if local is unsafe, don't also cheap out on cloud. Give the request the best chance of a complete, useful response so the developer isn't waiting for a re-run.

These four signals resolve to a base_tier_id through a Rust match expression. No weights. No inference. No network call. Just pattern matching.
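For a flavor of the content scan, here's a minimal sketch. The keyword list below is invented for illustration; the production set stays unpublished.

```rust
// Sketch of the string-based content scan. The keyword list is
// invented for illustration; the production set isn't published here.
fn has_architecture_signals(prompt: &str) -> bool {
    const HINTS: &[&str] = &["architecture", "scaffold", "multi-file", "system design"];
    // Plain substring scan: fast, deterministic, and allocation-free.
    HINTS.iter().copied().any(|hint| prompt.contains(hint))
}
```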

Preset overrides

On top of the base tier, Rada supports cloud presets that let users nudge the routing without picking a specific model. The current presets are:

Reasoning: biases toward stronger models. A request that would have routed to Micro gets bumped to Workhorse.

Speed: biases toward faster models. A request that would have routed to Heavyweight can drop to Workhorse if the payload is small enough and memory isn't critical.

Coding: biases toward code-specialized models. Similar to Speed but tuned for code completion patterns.

Presets are optional. If the developer doesn't set one, the base tier stands. If they do, the preset applies a second pass over the base tier resolution. Two match expressions, still sub-millisecond.

```rust
// Simplified structure (not the exact production code)
let base_tier = match intent {
    "refactorDebug" => evaluate_refactor_signals(tokens, memory),
    "buildFromScratch" => evaluate_build_signals(tokens, memory, content),
    "explain" => evaluate_explain_signals(tokens, memory),
    _ => "workhorse",
};

let final_tier = match preset {
    Some("reasoning") => bias_toward_stronger(base_tier),
    Some("speed") => bias_toward_faster(base_tier, tokens, memory),
    Some("coding") => bias_toward_code(base_tier, tokens, content),
    _ => base_tier,
};
```

The actual implementation is more involved, but the shape is the same: two deterministic passes, both pure functions, no allocations on the hot path.
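To give a feel for those helpers, here's a hedged sketch of one evaluator and one bias pass. The thresholds and tier choices are invented for this example, not the production values.

```rust
// Illustrative helpers; thresholds and tier choices are invented
// for this example, not the production values.
fn evaluate_refactor_signals(tokens: usize, memory: &str) -> &'static str {
    match (memory, tokens) {
        // Critical local memory: don't also cheap out on cloud.
        ("critical", _) => "workhorse",
        // Larger payloads usually mean more complex refactors.
        (_, t) if t > 8_000 => "workhorse",
        // Small, safe refactors stay on the cheapest lane.
        _ => "micro",
    }
}

fn bias_toward_stronger(tier: &'static str) -> &'static str {
    match tier {
        "micro" => "workhorse",
        "workhorse" => "heavyweight",
        other => other,
    }
}
```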

Why not an LLM classifier?

Some platforms route prompts by running them through a lightweight model first: "read this prompt and decide if it's a simple or complex task." There are a few problems with this.

Latency. Even a small classifier model adds 200-800ms of serial latency before the real work begins. For a tool that's supposed to feel responsive, that's a lot of dead air.

Cost. The classifier consumes tokens. On a platform where every cloud interaction counts against a daily quota, spending tokens on routing instead of the actual task is waste.

Brittleness. LLM classifiers are probabilistic. The same prompt might route differently on consecutive runs. Debugging why a request went to the wrong tier becomes an inference problem instead of a code path you can trace.

Rada's routing is a Rust function. It's deterministic. Given the same inputs, it always produces the same output. You can unit test it. You can trace exactly why a specific request went to a specific tier. And the routing decision takes microseconds, not hundreds of milliseconds.
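Here's what such a test can look like, building on the helper sketches above. The `resolve_tier` wrapper and its signature are invented for this example, standing in for the two match passes shown earlier.

```rust
// Sketch of testing the routing. `resolve_tier` is an invented wrapper
// around the two match passes shown earlier.
fn resolve_tier(intent: &str, tokens: usize, memory: &str, preset: Option<&str>) -> &'static str {
    let base = match intent {
        "refactorDebug" => evaluate_refactor_signals(tokens, memory),
        _ => "workhorse",
    };
    match preset {
        Some("reasoning") => bias_toward_stronger(base),
        _ => base,
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn same_inputs_same_tier() {
        // Pure function: no sampling, no drift between runs.
        let a = resolve_tier("refactorDebug", 500, "normal", None);
        let b = resolve_tier("refactorDebug", 500, "normal", None);
        assert_eq!(a, b);
    }

    #[test]
    fn reasoning_preset_bumps_micro() {
        assert_eq!(
            resolve_tier("refactorDebug", 500, "normal", Some("reasoning")),
            "workhorse"
        );
    }
}
```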

The token estimation trick

One detail worth calling out: the Autorouter needs a token count to make its tier decision, but running a real tokenizer before routing would add overhead. Instead, Rada uses a character-based heuristic:


```rust
fn estimate_token_count(text: &str) -> usize {
    text.chars().count().div_ceil(4)
}
```

Characters divided by four, rounded up. This is a rough approximation (real tokenization varies by model and content), but it's accurate enough for tier classification. The Autorouter doesn't need an exact count. It needs to know whether the payload is small, medium, or large. A 4:1 ratio gets that right consistently, and it costs zero allocations.
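A sketch of that bucketing, with invented cutoffs (the production thresholds aren't shown in this post):

```rust
// The coarse size classes the router cares about. Cutoffs are
// invented for illustration, not the production thresholds.
fn payload_bucket(text: &str) -> &'static str {
    match estimate_token_count(text) {
        0..=1_000 => "small",
        1_001..=8_000 => "medium",
        _ => "large",
    }
}
```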

What this means for the developer

The Autorouter is invisible by design. The developer selects an intent, writes a prompt, and the system decides whether to handle it locally (via Behavioral Routing and Sentinel) or route to cloud (via the Autorouter). If it goes to cloud, the Autorouter picks the tier.

The developer never sees a model picker. They don't need to know which cloud model is best for their task. The system handles it, and the routing logic is fast enough that it adds no perceptible delay.

Requests routed through the Autorouter consume daily cloud quota at half the rate of manually selected models. This creates a structural incentive: trust the routing, get twice the cloud interactions per day. It's Rada's way of rewarding developers for letting the system do what it's designed to do.

The full picture

Three posts, three systems:

Post 1: the broad architecture. Local-first with intentional cloud. Why this approach and why now.

Post 2: the co-determination matrix. One model, nine behavioral profiles. Intent crossed with real-time hardware state.

Post 3 (this one): the Autorouter. When local isn't enough, a pure Rust function picks the right cloud tier in microseconds.

Together, Behavioral Routing, Sentinel, and the Autorouter form a single decision pipeline: measure the machine, understand the intent, adapt the local model or route to the right cloud tier. No hot-swaps. No model pickers. No wasted tokens on routing classifiers.

Rada is in closed beta. If you want to see how these systems work together on your hardware, the waitlist is at userada.dev.

Eli Hadam Zucker is the founder of Rada. Previously at Wise. Building local-first AI tooling in Rust.
