DEV Community: Eli Hadam Zucker

No LLM Classifier. No Latency Tax. How Rada Routes Cloud Requests in Pure Rust.

Eli Hadam Zucker — Thu, 07 May 2026 06:00:00 +0000

In Post 1 I covered the broad architecture. In Post 2 I went deep on the co-determination matrix and Sentinel. This post is the third piece: the Autorouter.

The Autorouter answers a deceptively simple question: when a request needs cloud, which cloud model should handle it?
Most platforms solve this with either a dropdown menu (pick your model) or a lightweight LLM classifier that reads the prompt and decides where to send it. Both approaches have real costs. The dropdown puts the decision on the developer. The classifier adds a serial dependency: you pay latency and tokens before the actual work starts.

Rada does neither. The Autorouter is a pure Rust function. It pattern-matches on a handful of signals and resolves a cloud tier in sub-millisecond time. The decision is made before the HTTP request leaves your machine.

Three lanes, not one highway

The cloud side of Rada is organized into three tiers:

Heavyweight: frontier reasoning models for complex architecture, multi-file planning, and large-context tasks. Think Claude Sonnet, DeepSeek V3.

Workhorse: solid general-purpose models for standard builds and explanations. Gemini Flash, Mistral Nemo.

Micro: lightweight models for small completions, lint fixes, and refactors that slightly exceed local capability. Qwen 2.5 Coder 32B, Gemini Flash-8B.

All routing goes through OpenRouter, which is live in production. The tier system means Rada picks the right weight class for the task rather than defaulting to the most expensive option.

The routing signals

The Autorouter reads four signals to classify each request:

Intent. The developer's selected mode (Refactor, Build, or Explain) establishes a starting tier. Refactors start light. Builds start heavier. This is the same intent axis from the co-determination matrix, now extending into cloud routing.
Token payload. The function estimates the total token count of the prompt, active code, and any retrieval context. Larger payloads get bumped up a tier because they typically represent more complex tasks that benefit from stronger reasoning.
Content signals. The prompt is scanned for architectural indicators. If the request involves system design, scaffolding, or multi-file coordination, it routes to Heavyweight regardless of starting tier. I won't enumerate the exact keyword set here, but the detection is string-based and fast.
Memory band. Sentinel's real-time RAM classification feeds directly into the Autorouter. If the developer's machine is under critical memory pressure, even a lightweight refactor gets bumped to a higher cloud tier. The logic: if local is unsafe, don't also cheap out on cloud. Give the request the best chance of a complete, useful response so the developer isn't waiting for a re-run.

These four signals resolve to a base_tier_id through a Rust match expression. No weights. No inference. No network call. Just pattern matching.
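To make the shape of that resolution concrete, here is a minimal sketch of how four signals could collapse into a tier using only `match`. It is not Rada's production code: the type names, the `resolve_base_tier` function, the starting tier for Explain, and the `LARGE_PAYLOAD_TOKENS` threshold are all assumptions made for illustration.

```rust
/// Cloud tiers, lightest to heaviest (names taken from the post).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Micro,
    Workhorse,
    Heavyweight,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Intent {
    Refactor,
    Build,
    Explain,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemoryBand {
    Normal,
    Elevated,
    Critical,
}

/// Hypothetical cutoff for a "large" payload; the real value is not published.
const LARGE_PAYLOAD_TOKENS: usize = 8_000;

/// Move up one tier, saturating at Heavyweight.
fn bump(tier: Tier) -> Tier {
    match tier {
        Tier::Micro => Tier::Workhorse,
        Tier::Workhorse | Tier::Heavyweight => Tier::Heavyweight,
    }
}

/// Pure function: same inputs always give the same tier.
fn resolve_base_tier(
    intent: Intent,
    estimated_tokens: usize,
    architectural: bool, // result of the string-based content scan
    band: MemoryBand,
) -> Tier {
    // Content signal: architectural work goes to Heavyweight regardless.
    if architectural {
        return Tier::Heavyweight;
    }
    // Intent sets the starting tier (the Explain starting point is an assumption).
    let start = match intent {
        Intent::Refactor => Tier::Micro,
        Intent::Build | Intent::Explain => Tier::Workhorse,
    };
    let large = estimated_tokens >= LARGE_PAYLOAD_TOKENS;
    let critical = band == MemoryBand::Critical;
    // Payload size and critical memory pressure each bump the tier once.
    match (large, critical) {
        (false, false) => start,
        (true, true) => bump(bump(start)),
        _ => bump(start),
    }
}
```

Under these assumed rules, a 500-token refactor on a healthy machine resolves to Micro, while the same refactor under critical memory pressure resolves to Workhorse.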

Preset overrides

On top of the base tier, Rada supports cloud presets that let users nudge the routing without picking a specific model. The current presets are:

Reasoning: biases toward stronger models. A request that would have routed to Micro gets bumped to Workhorse.

Speed: biases toward faster models. A request that would have routed to Heavyweight can drop to Workhorse if the payload is small enough and memory isn't critical.

Coding: biases toward code-specialized models. Similar to Speed but tuned for code completion patterns.

Presets are optional. If the developer doesn't set one, the base tier stands. If they do, the preset applies a second pass over the base tier resolution. Two match expressions, still sub-millisecond.

```rust
// Simplified structure (not the exact production code)
let base_tier = match intent {
    "refactorDebug" => evaluate_refactor_signals(tokens, memory),
    "buildFromScratch" => evaluate_build_signals(tokens, memory, content),
    "explain" => evaluate_explain_signals(tokens, memory),
    _ => "workhorse",
};

let final_tier = match preset {
    Some("reasoning") => bias_toward_stronger(base_tier),
    Some("speed") => bias_toward_faster(base_tier, tokens, memory),
    Some("coding") => bias_toward_code(base_tier, tokens, content),
    _ => base_tier,
};
```

The actual implementation is more involved, but the shape is the same: two deterministic passes, both pure functions, no allocations on the hot path.

Why not an LLM classifier?

Some platforms route prompts by running them through a lightweight model first: "read this prompt and decide if it's a simple or complex task." There are a few problems with this.

Latency. Even a small classifier model adds 200-800ms of serial latency before the real work begins. For a tool that's supposed to feel responsive, that's a lot of dead air.

Cost. The classifier consumes tokens. On a platform where every cloud interaction counts against a daily quota, spending tokens on routing instead of the actual task is waste.

Brittleness. LLM classifiers are probabilistic. The same prompt might route differently on consecutive runs. Debugging why a request went to the wrong tier becomes an inference problem instead of a code path you can trace.

Rada's routing is a Rust function. It's deterministic. Given the same inputs, it always produces the same output. You can unit test it. You can trace exactly why a specific request went to a specific tier. And the routing decision takes microseconds, not hundreds of milliseconds.
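As an illustration of what "you can unit test it" looks like, here is a small sketch of a preset pass written as a pure function. The `Preset` enum, the `apply_preset` signature, and the `SMALL_PAYLOAD_TOKENS` threshold are hypothetical. It encodes only the two preset behaviors spelled out earlier (Reasoning lifts Micro to Workhorse; Speed can drop Heavyweight to Workhorse) and leaves the Coding preset out because its rules aren't described in enough detail to sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Micro,
    Workhorse,
    Heavyweight,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Preset {
    Reasoning,
    Speed,
}

/// Hypothetical cutoff for "small enough" payloads.
const SMALL_PAYLOAD_TOKENS: usize = 2_000;

/// Second pass over the base tier. No I/O, no randomness, no shared state.
fn apply_preset(
    base: Tier,
    preset: Option<Preset>,
    estimated_tokens: usize,
    memory_critical: bool,
) -> Tier {
    match (preset, base) {
        // Reasoning: a Micro request is bumped to Workhorse.
        (Some(Preset::Reasoning), Tier::Micro) => Tier::Workhorse,
        // Speed: Heavyweight may drop, but only for small payloads
        // and only when memory is not critical.
        (Some(Preset::Speed), Tier::Heavyweight)
            if estimated_tokens <= SMALL_PAYLOAD_TOKENS && !memory_critical =>
        {
            Tier::Workhorse
        }
        // No preset, or no rule applies: the base tier stands.
        _ => base,
    }
}
```

Because the function depends only on its arguments, a test such as `assert_eq!(apply_preset(Tier::Heavyweight, Some(Preset::Speed), 100, true), Tier::Heavyweight)` passes or fails the same way on every run, and a failure points at one specific match arm.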

The token estimation trick

One detail worth calling out: the Autorouter needs a token count to make its tier decision, but running a real tokenizer before routing would add overhead. Instead, Rada uses a character-based heuristic:


```rust
fn estimate_token_count(text: &str) -> usize {
    text.chars().count().div_ceil(4)
}
```

Characters divided by four, rounded up. This is a rough approximation (real tokenization varies by model and content), but it's accurate enough for tier classification. The Autorouter doesn't need an exact count. It needs to know whether the payload is small, medium, or large. A 4:1 ratio gets that right consistently, and it costs zero allocations.
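To show how a rough count can still drive a coarse decision, here is a sketch that sums the estimate over the three payload parts and sorts the total into a bucket. The `PayloadSize` enum, the `classify_payload` function, and both thresholds are assumptions for illustration. Only `estimate_token_count` comes from the post.

```rust
#[derive(Debug, PartialEq, Eq)]
enum PayloadSize {
    Small,
    Medium,
    Large,
}

/// Character-based heuristic from the post: characters / 4, rounded up.
fn estimate_token_count(text: &str) -> usize {
    text.chars().count().div_ceil(4)
}

// Hypothetical bucket boundaries, in estimated tokens.
const MEDIUM_FROM: usize = 2_000;
const LARGE_FROM: usize = 8_000;

/// Sum the estimate over prompt, active code, and retrieval context,
/// then bucket the total. Borrows only; nothing is copied.
fn classify_payload(prompt: &str, active_code: &str, retrieval: &str) -> PayloadSize {
    let total = estimate_token_count(prompt)
        + estimate_token_count(active_code)
        + estimate_token_count(retrieval);
    match total {
        t if t >= LARGE_FROM => PayloadSize::Large,
        t if t >= MEDIUM_FROM => PayloadSize::Medium,
        _ => PayloadSize::Small,
    }
}
```

An estimate that is off by 20% only changes the bucket for payloads sitting close to a boundary, which is why the coarse ratio is good enough for this job.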

What this means for the developer

The Autorouter is invisible by design. The developer selects an intent, writes a prompt, and the system decides whether to handle it locally (via Behavioral Routing and Sentinel) or route to cloud (via the Autorouter). If it goes to cloud, the Autorouter picks the tier.
The developer never sees a model picker. They don't need to know which cloud model is best for their task. The system handles it, and the routing logic is fast enough that it adds no perceptible delay.
Requests routed through the Autorouter consume daily cloud quota at half the rate of manually selected models. This creates a structural incentive: trust the routing, get twice the cloud interactions per day. It's Rada's way of rewarding developers for letting the system do what it's designed to do.

The full picture

Three posts, three systems:

Post 1: the broad architecture. Local-first with intentional cloud. Why this approach and why now.

Post 2: the co-determination matrix. One model, nine behavioral profiles. Intent crossed with real-time hardware state.

Post 3 (this one): the Autorouter. When local isn't enough, a pure Rust function picks the right cloud tier in microseconds.

Together, Behavioral Routing, Sentinel, and the Autorouter form a single decision pipeline: measure the machine, understand the intent, adapt the local model or route to the right cloud tier. No hot-swaps. No model pickers. No wasted tokens on routing classifiers.

Rada is in closed beta. If you want to see how these systems work together on your hardware, the waitlist is at userada.dev.

Eli Hadam Zucker is the founder of Rada. Previously at Wise. Building local-first AI tooling in Rust.

How One Local Model Gets Nine Personalities (Without Ever Swapping Weights)

Eli Hadam Zucker — Tue, 05 May 2026 18:32:07 +0000

In my first post, I laid out the broad architecture of Rada: local-first AI coding, Behavioral Routing, Sentinel, the Autorouter. Think of that post as the map. This one is the terrain.

Today I'm going deep on the co-determination matrix: the system that lets a single resident model produce nine distinct behavioral profiles by crossing developer intent with real-time hardware state. I'll also cover Sentinel, the Rust module that measures the hardware side of that equation on every single request.

Quick context if you missed Post 1

Rada keeps one local GGUF model loaded in RAM. No hot-swapping. The model adapts its behavior based on what you're doing (intent) and what your machine can handle (memory band). Sentinel monitors RAM. The Autorouter handles cloud when local isn't enough.

That's the 30-second version. Now let's get into the actual implementation.

The co-determination matrix

The core idea is that the model's behavior isn't determined by a single variable. It's the product of two axes:

Intent (what the developer is doing):

Refactor/Debug: tighten existing code, fix bugs, improve structure
Build from Scratch: generate new code, scaffold features
Explain: walk through logic, teach concepts

Memory Band (what the hardware can handle right now):

Normal (< 70% RAM): full capability
Elevated (70-84% RAM): constrained but functional
Critical (≥ 85% RAM): local generation unsafe, escalate

Cross those two axes and you get a 3×3 matrix. Each cell is a BehaviorProfile in Rust:

struct BehaviorProfile {
    label: &'static str,
    memory_band: MemoryBand,
    prompt_suffix: &'static str,
    temperature: f32,
    max_output_tokens: u32,
    max_retrieval_chars: usize,
    local_token_cap: usize,
}

Every profile controls four knobs: temperature, output token budget, retrieval context size, and a behavioral instruction injected into the system prompt.

Here's the actual matrix from the codebase:

Intent	Normal (< 70%)	Elevated (70-84%)	Critical (≥ 85%)
Refactor/Debug	temp 0.1, 1200 tokens	temp 0.05, 900 tokens	temp 0.0, 700 tokens
Build	temp 0.45, 12000 tokens	temp 0.3, 8000 tokens	temp 0.2, 5000 tokens
Explain	temp 0.35, 1800 tokens	temp 0.25, 1200 tokens	temp 0.15, 800 tokens

Look at the spread. A Build task at normal memory gets temperature 0.45 and a 12,000-token budget. The same Build task under critical memory pressure? Temperature drops to 0.2 and the budget cuts to 5,000 tokens. The model becomes conservative precisely when your machine needs it to.

Refactor at normal memory runs at 0.1 temperature. Already deterministic, already tight. Under critical pressure it drops to 0.0 and the token budget shrinks from 1,200 to 700. At that point the model is producing the smallest safe diff it can manage.
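The table translates naturally into a two-axis `match`. The sketch below is a simplified stand-in for the real lookup: it carries only the temperature and output budget shown in the table, and the `Knobs` struct and `lookup_knobs` function are names invented here, not taken from the codebase.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Intent {
    RefactorDebug,
    Build,
    Explain,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemoryBand {
    Normal,
    Elevated,
    Critical,
}

/// Reduced profile: just the two values listed in the matrix above.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Knobs {
    temperature: f32,
    max_output_tokens: u32,
}

/// One arm per cell of the 3x3 matrix. The compiler rejects the match
/// if any (intent, band) combination is left out.
fn lookup_knobs(intent: Intent, band: MemoryBand) -> Knobs {
    use Intent::*;
    use MemoryBand::*;
    let (temperature, max_output_tokens) = match (intent, band) {
        (RefactorDebug, Normal) => (0.1, 1200),
        (RefactorDebug, Elevated) => (0.05, 900),
        (RefactorDebug, Critical) => (0.0, 700),
        (Build, Normal) => (0.45, 12000),
        (Build, Elevated) => (0.3, 8000),
        (Build, Critical) => (0.2, 5000),
        (Explain, Normal) => (0.35, 1800),
        (Explain, Elevated) => (0.25, 1200),
        (Explain, Critical) => (0.15, 800),
    };
    Knobs { temperature, max_output_tokens }
}
```

Writing the matrix as an exhaustive match means that adding a fourth intent or a fourth band becomes a compile error until every new cell has values.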

The prompt suffix: behavioral steering without fine-tuning

Each cell in the matrix also carries a prompt_suffix that gets injected into the system prompt at generation time. This is how a 7B model "knows" it's supposed to refactor conservatively vs. build expansively vs. explain clearly.

The prompt composition pipeline builds five layers in order:

Role preamble (language expertise based on the active file)
Sentinel status ("Current Sentinel profile: Refactor / Elevated. Memory band: elevated.")
Intent persona (operational instructions for the selected mode)
Behavioral routing instruction (the prompt_suffix from the matrix cell)
Output format spec (how to structure the response with @@file tags)

The model sees its own constraints. It knows its memory band. It knows its token budget. That transparency turns out to matter. A model that's told "you have 900 tokens and memory is elevated" produces tighter output than one that's just given a smaller max_tokens parameter silently. The behavioral instruction gives the model permission to be concise.
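For illustration, the five-layer composition can be sketched as a struct with one field per layer and a function that joins them in order. The field names, the `compose_system_prompt` function, and the blank-line separator are assumptions. The post only specifies the order of the layers.

```rust
/// One field per layer, in the order the pipeline applies them.
struct PromptLayers<'a> {
    role_preamble: &'a str,
    sentinel_status: &'a str,
    intent_persona: &'a str,
    behavior_suffix: &'a str, // the prompt_suffix from the matrix cell
    output_format: &'a str,
}

/// Join the layers top to bottom, skipping any that are empty.
/// The blank-line separator is an assumption.
fn compose_system_prompt(layers: &PromptLayers<'_>) -> String {
    [
        layers.role_preamble,
        layers.sentinel_status,
        layers.intent_persona,
        layers.behavior_suffix,
        layers.output_format,
    ]
    .iter()
    .copied()
    .filter(|layer| !layer.trim().is_empty())
    .collect::<Vec<&str>>()
    .join("\n\n")
}
```

Keeping the Sentinel status as its own layer is what lets the model "see" its memory band: the text is part of the prompt itself rather than an out-of-band sampling parameter.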

Sentinel: measuring reality on every request

Here's where the memory axis of the matrix becomes real.

Sentinel is a Rust module that checks system RAM before every local generation. Not a polling loop. Not a timer. A synchronous check at the decision point.

The implementation is platform-specific:

macOS: calls memory_pressure -Q first. If that fails (older macOS versions), falls back to parsing vm_stat output, calculating used memory from free pages, speculative pages, and purgeable pages.

Linux: reads /proc/meminfo and computes usage from MemTotal minus MemAvailable.

Windows: runs Get-CimInstance Win32_OperatingSystem via PowerShell and calculates from total vs. free physical memory.

All three paths produce the same output: a single u8 representing percent of RAM in use. That number feeds into this function:

fn classify_memory_band(memory_usage_percent: Option<u8>) -> MemoryBand {
    match memory_usage_percent {
        Some(usage) if usage >= 85 => MemoryBand::Critical,
        Some(usage) if usage >= 70 => MemoryBand::Elevated,
        _ => MemoryBand::Normal,
    }
}

Notice the Option<u8>. If platform detection fails entirely (permissions issues, unexpected output format), the system defaults to Normal. Fail open, not closed. A developer who can't get RAM readings still gets full local capability rather than being locked out.
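The Linux path is simple enough to sketch. The function below takes the text of /proc/meminfo and returns the same `Option<u8>` shape, yielding `None` when the fields it needs are missing or inconsistent. The function names are hypothetical and the real Sentinel parsing may differ. Reading the file itself (for example with `std::fs::read_to_string`) is left to the caller so the parsing stays a pure function.

```rust
/// Extract a numeric field such as "MemTotal:" (value in kB) from meminfo text.
fn meminfo_field(meminfo: &str, name: &str) -> Option<u64> {
    meminfo
        .lines()
        .find_map(|line| line.strip_prefix(name))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|value| value.parse().ok())
}

/// Percent of RAM in use, computed as (MemTotal - MemAvailable) / MemTotal.
/// Returns None on any parse problem so the caller can fail open.
fn parse_meminfo_usage_percent(meminfo: &str) -> Option<u8> {
    let total = meminfo_field(meminfo, "MemTotal:")?;
    let available = meminfo_field(meminfo, "MemAvailable:")?;
    if total == 0 || available > total {
        return None;
    }
    let used = total - available;
    // used <= total, so the result is always in 0..=100 and fits in a u8.
    Some((used * 100 / total) as u8)
}
```

A `None` from this function flows straight into the catch-all arm of `classify_memory_band`, which is what produces the fail-open behavior described above.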

Two enforcement points

Sentinel doesn't just label the memory band. It enforces it.

Gate 1: Pre-generation check. Before any local inference starts, Sentinel reads memory and resolves the behavior profile:

let memory_usage_percent = detect_system_memory_usage_percent();
let behavior_profile = resolve_behavior_profile(intent, memory_usage_percent);

If the result is a Critical-band profile, the request never reaches the local model. It gets rerouted to the Autorouter for cloud handling, with a message explaining why: "Sentinel escalated this request because local memory pressure is critical."

Gate 2: Budget enforcement. Even within Normal and Elevated bands, the resolved profile's token budgets and retrieval caps constrain the generation. A Build request that would happily consume 12,000 tokens at 60% RAM gets capped at 8,000 tokens at 75% RAM. The model adapts its output to fit.
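Put together, the two gates amount to one decision per request. Here is a rough sketch for a Build request: the `Dispatch` enum and `dispatch_build_request` function are invented for this example, while the thresholds, budgets, and escalation message come from the sections above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemoryBand {
    Normal,
    Elevated,
    Critical,
}

/// Where the request goes after both gates have been applied.
#[derive(Debug, PartialEq, Eq)]
enum Dispatch {
    Local { max_output_tokens: u32 },
    EscalateToCloud { reason: &'static str },
}

/// Same thresholds as the classifier shown earlier; None fails open to Normal.
fn classify_memory_band(memory_usage_percent: Option<u8>) -> MemoryBand {
    match memory_usage_percent {
        Some(usage) if usage >= 85 => MemoryBand::Critical,
        Some(usage) if usage >= 70 => MemoryBand::Elevated,
        _ => MemoryBand::Normal,
    }
}

/// Gate 1 (escalate on Critical) and Gate 2 (cap the budget) for Build.
fn dispatch_build_request(memory_usage_percent: Option<u8>) -> Dispatch {
    match classify_memory_band(memory_usage_percent) {
        MemoryBand::Critical => Dispatch::EscalateToCloud {
            reason: "Sentinel escalated this request because local memory pressure is critical.",
        },
        MemoryBand::Elevated => Dispatch::Local { max_output_tokens: 8000 },
        MemoryBand::Normal => Dispatch::Local { max_output_tokens: 12000 },
    }
}
```

With this shape, `dispatch_build_request(Some(60))` stays local with the full 12,000-token budget, `Some(75)` stays local with 8,000, and `Some(90)` never touches the local model.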

Why per-request, not periodic

Your RAM state changes constantly. You open a browser tab. Docker pulls an image. Spotlight re-indexes. The memory band from 30 seconds ago might not be the one you're in now.

A periodic check (say, every 5 seconds) creates a window where the system's picture of memory is stale. A per-request check means the behavioral profile is always current. The memory axis of the co-determination matrix isn't a configuration setting. It's measured. Every time.

The overhead is negligible. Reading /proc/meminfo on Linux takes microseconds. The macOS memory_pressure call is similarly lightweight. The cost of one extra memory check per request is invisible compared to the inference time that follows.

Why this is the patentable piece

System prompts aren't novel. Temperature scaling isn't novel. RAM monitoring isn't novel.

What's novel is the co-determination: using real-time hardware state as a first-class input to model behavioral configuration, intersected with developer intent, to produce adaptive profiles from a single resident model. The model doesn't just respond to what you asked. It responds to what you asked given what your machine can handle right now.

We filed a US provisional patent on this mechanism. The claim isn't any individual piece. It's the intersection.

What this looks like in practice

Developer on a 16 GB MacBook. Running VS Code, Chrome with 15 tabs, Docker, Slack. RAM sitting at 72%. They select Refactor intent and ask the model to clean up a function.

Sentinel reads 72%, classifies Elevated band. The co-determination matrix resolves to: temperature 0.05, 900-token output budget, tightened retrieval context. The prompt suffix tells the model to emit the smallest possible safe diff.

The model produces a focused, conservative refactor. No rambling. No unnecessary changes. The developer's machine stays stable.

Same developer, same session, but they close Docker and Chrome. RAM drops to 55%. They switch to Build intent and ask for a new feature module. Sentinel reads 55%, classifies Normal band. Temperature jumps to 0.45, output budget goes to 12,000 tokens, full retrieval context. The model produces expansive, complete code.

Same model. Same weights. Different behavior. Zero downtime between the two requests.

What's next

The Autorouter is the other half of this story. When Sentinel escalates to cloud, the Autorouter decides which cloud tier to hit (and the routing is a pure Rust function, not another LLM call). That's the next post.

Rada is in closed beta. If you want to see how the co-determination matrix behaves on your specific hardware, the waitlist is at userada.dev.

Eli Hadam Zucker is the founder of Rada. Previously at Wise. Building local-first AI tooling in Rust.

Why I'm Building a Local-First AI Coding Workspace (And How Behavioral Routing Makes It Work)

Eli Hadam Zucker — Wed, 29 Apr 2026 19:28:22 +0000

There's a pattern forming in the AI coding tool space that I think is worth paying attention to.

GitHub paused Copilot Pro+ signups because agentic workloads broke their cost model. Cursor Pro+ is $60/mo and climbing. Claude Code might leave the Pro tier entirely. The common thread: these tools are all cloud-only, which means every user interaction is a cost event on someone else's infrastructure. At scale, the math stops working. Prices go up, access gets gated, and developers end up paying more for less.

I left my role at Wise earlier this year to build a different kind of AI coding tool. One where local inference is the default and cloud is a resource you use intentionally. That tool is called Rada.

This post is a technical walkthrough of how it works under the hood.

The core thesis

Most of what developers ask an AI coding assistant to do doesn't need a frontier cloud model. Refactors, explanations, boilerplate, quick fixes. That's maybe 80% of interactions, and all of it can run on a local LLM.

The remaining 20% (complex architecture decisions, large-scale code generation, multi-file refactors) benefits from cloud models. So the architecture needs to handle both, and it needs to make the transition between them seamless.

The problem with hot-swapping

The naive approach to local AI tooling is to keep multiple models and swap them based on the task. Need a coding model? Load it. Need a general model? Unload the coder, load the general one.

This is a terrible experience in practice. GGUF models at Q4_K_M quantization run anywhere from 4-11 GB in RAM. Loading and unloading them takes time, spikes memory usage, and creates gaps in responsiveness. If you're in a flow state and the tool needs 30 seconds to swap models, you've already lost the thread.

Rada takes a fundamentally different approach.

Behavioral Routing

Instead of swapping models, Rada keeps a single model resident in RAM and adjusts how it behaves based on what the developer is doing. We call this Behavioral Routing.

The system supports three intent modes:

Refactor: tighten existing code, improve naming, reduce complexity
Build: generate new code, scaffold features, implement from scratch
Learn: explain unfamiliar code, walk through logic, teach concepts

When the developer selects an intent (or the system infers it from context), Behavioral Routing adjusts three parameters on the resident model:

System prompt: each intent gets a tailored system prompt that shapes the model's approach. A Refactor prompt emphasizes preserving behavior while improving structure. A Build prompt focuses on completeness and best practices. A Learn prompt prioritizes clarity and step-by-step explanation.

Temperature: Refactor tasks use lower temperature (more deterministic, safer transformations). Build tasks use slightly higher temperature (more creative solutions). Learn tasks sit in the middle (clear but not robotic).

Context window: the system dynamically adjusts how much context gets sent to the model based on intent. Refactoring a single function needs a narrow window. Building a new feature might need broader project context. Learning about a module needs enough surrounding code to give a complete picture.

The result: one model, three distinct behaviors. No load/unload cycle. No RAM spikes. The model stays warm and responsive.

The local model roster

Rada uses a tiered model roster, all GGUF Q4_K_M quantizations:

Model	Size in RAM	Primary Intent
Qwen 2.5 Coder 7B	~4.7 GB	Refactor
Llama 3.1 8B	~5.3 GB	Learn
DeepSeek Coder V2 Lite 16B (MoE)	~10.6 GB	Build

Which model gets loaded depends on your hardware, not your preference.

Sentinel: RAM-aware model selection

Sentinel is a Rust background process that monitors system memory and determines which model tier your machine can support. The selection is deterministic: Sentinel reads available RAM, checks it against the model ladder, and picks the highest tier that fits without putting the system under pressure.

On a 16 GB machine, you'll get the 7B or 8B models. On 32 GB+, you can run the DeepSeek MoE. The ladder scales up from there as hardware allows.
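A ladder check like this can be sketched in a few lines. The model names and sizes below come from the roster table. The `HEADROOM_MB` value, the `pick_model` function, and the rule itself (largest model whose footprint plus headroom fits in available RAM) are assumptions about how such a selection might work, not a description of Sentinel's actual heuristics.

```rust
struct LocalModel {
    name: &'static str,
    ram_mb: u64, // approximate resident size at Q4_K_M
}

/// Model ladder from the roster table, smallest to largest.
static LADDER: [LocalModel; 3] = [
    LocalModel { name: "Qwen 2.5 Coder 7B", ram_mb: 4_700 },
    LocalModel { name: "Llama 3.1 8B", ram_mb: 5_300 },
    LocalModel { name: "DeepSeek Coder V2 Lite 16B (MoE)", ram_mb: 10_600 },
];

/// Hypothetical safety margin left for the OS and other apps.
const HEADROOM_MB: u64 = 4_000;

/// Walk the ladder from the top and take the first model that fits.
/// Returns None if even the smallest model would not fit.
fn pick_model(available_mb: u64) -> Option<&'static LocalModel> {
    LADDER
        .iter()
        .rev()
        .find(|model| model.ram_mb + HEADROOM_MB <= available_mb)
}
```

Under these assumed numbers, a machine with about 10 GB available lands on the 8B model, and one with about 20 GB available can take the DeepSeek MoE.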

This is a deliberate design choice. Asking users to pick their own model leads to people overloading their systems or underutilizing their hardware. Sentinel removes that friction. You open Rada and it works with whatever machine you have.

The Autorouter: when local isn't enough

Some tasks genuinely need cloud models. Complex multi-file refactors, architectural decisions that need broad reasoning, large-scale code generation. For these, Rada has the Autorouter.

The Autorouter evaluates the incoming request and routes it to the appropriate cloud endpoint based on task complexity and the intent mode. The cloud model roster is tiered:

Heavyweight: Claude Sonnet, DeepSeek V3 (for complex reasoning tasks)
Workhorse: Gemini Flash, Mistral Nemo (for solid general-purpose tasks)
Micro: Qwen 2.5 Coder 32B, Gemini Flash-8B (for lighter cloud tasks that still exceed local capability)

Cloud routing is managed through OpenRouter, which is already live in production.

Here's a detail that matters: requests routed through the Autorouter consume at 0.5x the normal daily cloud quota rate. If a user manually selects a cloud model, it burns at 1x. This creates an incentive to trust the Autorouter's judgment rather than always reaching for the biggest model. The system rewards efficient routing.

The daily cloud burst quota

Every Rada user gets a daily cloud burst quota. Free tier users can bring their own API keys. Pro users ($19/mo) get 20 daily bursts. Ultra users ($55/mo) get 75.

The quota resets daily (UTC). This is intentional. Monthly quotas create anxiety and hoarding behavior. Daily quotas encourage consistent usage without the fear of burning through your allocation in the first week.

Combined with the 0.5x Autorouter rate, Pro users effectively get 40 cloud interactions per day when they let the system route. That's enough for a full day of coding where cloud is used for the tasks that actually need it.
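The arithmetic is easy to check with a small sketch. To avoid fractional bursts, the example counts in half-burst units: an autorouted request costs one unit and a manually selected model costs two. The `DailyQuota` type and its methods are hypothetical. Only the 0.5x and 1x rates and the 20-burst Pro allowance come from the post.

```rust
/// Daily quota tracked in half-burst units so the 0.5x rate stays an integer.
struct DailyQuota {
    remaining_half_units: u32,
}

impl DailyQuota {
    /// `daily_bursts` is the plan allowance, e.g. 20 for Pro.
    fn new(daily_bursts: u32) -> Self {
        Self { remaining_half_units: daily_bursts * 2 }
    }

    /// Charge one request. Autorouted requests cost 0.5 bursts,
    /// manually selected models cost 1 burst. Returns false if the
    /// quota cannot cover the request (nothing is charged in that case).
    fn try_consume(&mut self, autorouted: bool) -> bool {
        let cost = if autorouted { 1 } else { 2 };
        if self.remaining_half_units >= cost {
            self.remaining_half_units -= cost;
            true
        } else {
            false
        }
    }
}
```

Starting from `DailyQuota::new(20)`, forty autorouted requests succeed and the forty-first is refused, while the same quota covers only twenty manually routed requests.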

Why Rust and Tauri

The backend is built in Rust with Tauri as the desktop framework. A few reasons:

Memory matters: when your app is managing local LLM inference, every megabyte of overhead counts. Rust gives predictable, low memory usage. An Electron wrapper eating 500 MB of RAM on top of your model isn't acceptable.

Sentinel needs to be fast: the RAM monitoring process runs continuously. It needs to be lightweight and responsive. Rust's performance characteristics make this straightforward.

Tauri's footprint: compared to Electron, Tauri produces significantly smaller binaries and uses the system's native webview. On a tool where local resource management is the core feature, the framework can't be the one wasting resources.

The frontend is React, which gives us the UI flexibility we need without fighting the framework.

Why this matters now

The AI coding tool market is at an inflection point. Cloud-only architectures worked when usage was lower and providers were subsidizing costs to gain market share. That era is ending.

GitHub pausing Copilot Pro+ isn't a temporary measure. It's a signal that cloud-only AI tooling at consumer price points doesn't have sustainable unit economics. The cost per agentic session is too high.

Local-first isn't about being anti-cloud. It's about using cloud intentionally and keeping the majority of interactions local where they belong. The developer gets a faster, more responsive tool. The provider gets margins that actually work. The dependency on any single cloud provider goes away.

What's next

Rada is heading into closed beta. We're looking for developers who want to test the local-first approach and help calibrate the system with real-world hardware data. Early testers get first right to a lifetime deal on cloud routing.

If this architecture interests you, the waitlist is at userada.dev.

I'll be writing more about the technical decisions behind Rada here on dev.to. Next up: how Sentinel's RAM heuristics work in practice across different hardware configurations.

Eli Hadam Zucker is the developer behind Rada. Previously at Wise. Building local-first AI tooling in Rust.