On-device first, cloud when it counts. oneinfer-edge routes every request to where it runs best.

#ai #productivity #opensource #startup

Most teams deploying LLMs in production face the same three failure modes: overpaying for frontier models on trivial tasks, routing latency bleeding into TTFT on real-time pipelines, and zero fallback when a node goes down. We built oneinfer-edge routing to solve all three at once, and we did it by deeply integrating five open source routing models, not replacing them.

"Why not just use a Mixture-of-Experts model?"

It's the right question. Mixtral, DeepSeek, and other MoE architectures already route inputs dynamically. Each token activates only a subset of expert layers, giving you a large effective parameter count at a fraction of the compute. So why build a separate routing layer at all?

Because MoE and LLM routing solve fundamentally different problems.

MoE routing happens inside a single model, at the token level, during a single inference pass. The model decides internally which expert neurons fire. You have no control over it, no visibility into it, and critically, it cannot route between different hardware, different cost tiers, or different models entirely. Teams that have tried to use MoE as a routing solution hit this wall directly: all inference stays pinned to one hardware tier regardless of task complexity, cost sensitivity, or privacy requirements.

External LLM routing happens between models, at the request level, before inference begins. It answers a different question: not "which expert layers should activate for this token?" but "which model, local or cloud, cheap or powerful, should handle this request at all?"

In practice it looks like this. A routine summarization request comes in. The router scores it, decides local Apple Silicon can handle it, and dispatches it without spending a single cloud token. Three turns later in the same session, the user asks for a multi-step legal analysis. The router scores that differently, escalates it to a cloud GPU, gets the response, and hands it back. The user sees one continuous conversation. Under the hood, two completely different pieces of hardware did the work.

That is the hybrid. Not a setting you configure once, but a decision made fresh on every request based on complexity, cost, and what hardware is available.

If you are running inference on local Apple Silicon for privacy-sensitive or cost-sensitive tasks and escalating to cloud GPUs only for complex workloads, a MoE model cannot do that. It lives on one piece of hardware. External routing is the layer that makes the local-cloud boundary intelligent rather than static, and that's the gap OneInfer Edge was built to fill.

The Five Routers Under the Hood

RouteLLM BERT (routellm/bert) handles our Fastest routing goal. It classifies prompt intent before the inference request is even dispatched. Arch-Router benchmarks show competing solutions averaging 1000ms+ for routing decisions; BERT-class routers cut that to sub-10ms, which is the only acceptable overhead for voice pipelines and autocomplete where TTFT is the product.

RouteLLM Matrix Factorization (routellm/mf) powers Lowest Cost routing. Per the RouteLLM work (accepted at ICLR 2025), the MF router achieves 95% of GPT-4 performance while sending only 26% of calls to GPT-4, roughly 48% cheaper than a random baseline, with cost reductions of over 85% reported on MT Bench. In OneInfer Edge, this router is what actually orchestrates the local-cloud split: it scores incoming prompts and dispatches routine work to local Apple Silicon or NVIDIA endpoints, spending zero cloud tokens on tasks that don't warrant them. The escalation signal is prompt complexity, a score derived from semantic density, instruction count, and chain-of-thought depth. Below the threshold, local. Above it, cloud.
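A minimal sketch of that threshold decision, under loud assumptions: the three inputs are taken to be pre-normalized to the range 0 to 1, the weights and the 0.5 threshold are made-up illustration values, and the names PromptFeatures, complexity_score, and choose_tier are hypothetical. This is not RouteLLM's matrix-factorization scoring and not the OneInfer Edge API; it only shows the shape of "below the threshold, local; above it, cloud".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptFeatures:
    # Hypothetical features, each assumed to be normalized to [0, 1].
    semantic_density: float
    instruction_count: float
    cot_depth: float


# Illustrative weights and threshold; real values would need calibration.
WEIGHTS = (0.4, 0.3, 0.3)
THRESHOLD = 0.5


def complexity_score(features: PromptFeatures) -> float:
    """Weighted sum of the three complexity signals."""
    w_density, w_instructions, w_cot = WEIGHTS
    return (
        w_density * features.semantic_density
        + w_instructions * features.instruction_count
        + w_cot * features.cot_depth
    )


def choose_tier(features: PromptFeatures, threshold: float = THRESHOLD) -> str:
    """Return "cloud" above the threshold, "local" otherwise.

    A score exactly at the threshold stays local (an assumption; the
    post does not say which way ties go).
    """
    return "cloud" if complexity_score(features) > threshold else "local"


summarize = PromptFeatures(semantic_density=0.2, instruction_count=0.1, cot_depth=0.0)
legal_analysis = PromptFeatures(semantic_density=0.9, instruction_count=0.8, cot_depth=0.9)
print(choose_tier(summarize), choose_tier(legal_analysis))
```

Because the decision is a pure function of the request, it can be re-evaluated on every turn of a session, which is what lets one conversation move between local and cloud hardware.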

Arch-Router 1.5B (katanemo/Arch-Router-1.5B) is the Balanced router. At inference time it analyzes incoming prompts using semantic similarity, task indicators, and contextual cues, then applies user-defined routing preferences to select the best model. It achieves median routing latency of 50ms with 99th percentile under 75ms, fast enough for production without sacrificing intent accuracy.

Router-R1 Qwen2.5 3B (ulab-ai/Router-R1-Qwen2.5-3B-Instruct) drives the Highest Quality goal. Unlike single-round routers, Router-R1 (NeurIPS 2025) interleaves "think" actions with "route" actions and integrates each response into its evolving context. This is what handles multi-step code refactoring, financial analysis chains, agentic tool-calling pipelines, and anything where a wrong model selection cascades into compounding errors across turns.

MMBERT32K (llm-semantic-router/mmbert32k) handles Reliability. It classifies input modality (text, image, audio) and reroutes traffic around degraded nodes in real time. Critical for multimodal enterprise workloads where a single endpoint failure should never surface to the user.
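The goal-to-router mapping described above can be written down as a small lookup table. The router IDs are the ones named in this post; the goal keys and the router_for helper are hypothetical names chosen for illustration, not the actual OneInfer Edge configuration format.

```python
# Router IDs come from the descriptions above. The goal keys and this
# helper are hypothetical, for illustration only.
ROUTERS_BY_GOAL = {
    "fastest": "routellm/bert",
    "lowest_cost": "routellm/mf",
    "balanced": "katanemo/Arch-Router-1.5B",
    "highest_quality": "ulab-ai/Router-R1-Qwen2.5-3B-Instruct",
    "reliability": "llm-semantic-router/mmbert32k",
}


def router_for(goal: str) -> str:
    """Look up the router ID for a routing goal such as "Lowest Cost"."""
    key = goal.strip().lower().replace(" ", "_")
    try:
        return ROUTERS_BY_GOAL[key]
    except KeyError:
        raise ValueError(f"unknown routing goal: {goal!r}") from None


print(router_for("Highest Quality"))
```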

Which Router Combination Applies to Your Workload

Real-time voice and autocomplete: BERT as the primary router. Sub-10ms classification means the routing decision is invisible to the user. Arch-Router handles preference matching; MMBERT32K provides fallback if the speech-to-text endpoint degrades.

Document processing pipelines: MF router routes paragraph-level extraction to local endpoints, achieving near-GPT-4 quality at a fraction of the cost. Contract clause analysis and multi-hop legal reasoning escalate to Router-R1's quality path. Same pipeline, two radically different cost profiles per task type.

Agentic and tool-calling workflows: Router-R1 is the right primary router here. Multi-turn reasoning chains are exactly where poor initial model selection compounds. Router-R1's deliberation step catches mismatches before they cascade. BERT handles latency-sensitive tool invocations within the same pipeline.

Privacy-sensitive or on-device-only workloads: MF router with local-only routing enforced. The escalation threshold can be set to never trigger cloud routing, keeping all inference on-device. MMBERT32K monitors local endpoint health and fails gracefully rather than silently escalating to cloud.

Multimodal enterprise support: MMBERT32K classifies whether an incoming request contains a screenshot, log file, or plain text, then routes each modality to the appropriate specialized endpoint. Local processing handles simple, privacy-sensitive tasks; cloud offloading is reserved for complex queries requiring more capacity.

Structured data, SQL generation, and analytics: Arch-Router's semantic similarity scoring handles this well. These prompts have strong task indicators that map cleanly to model capabilities. MF handles the cost layer; Router-R1 steps in for multi-hop analytical reasoning.
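The privacy-sensitive case above has two parts: an escalation threshold that can never trigger, and a health check that fails loudly rather than escalating. A rough sketch of both, assuming a numeric complexity score and a boolean health flag; route_local_only and LocalUnavailableError are hypothetical names, not part of any library mentioned here.

```python
import math


class LocalUnavailableError(RuntimeError):
    """Raised instead of escalating when the local endpoint cannot serve."""


def route_local_only(
    score: float,
    local_healthy: bool,
    threshold: float = math.inf,  # no score can exceed infinity
) -> str:
    """Route under a local-only policy (hypothetical helper).

    With the default threshold the cloud branch is unreachable. If the
    local endpoint is degraded, raise rather than silently sending
    sensitive data to a cloud API.
    """
    if not local_healthy:
        raise LocalUnavailableError("local endpoint degraded; refusing cloud fallback")
    return "cloud" if score > threshold else "local"


print(route_local_only(score=0.99, local_healthy=True))
```

Raising an explicit error keeps the failure visible to the caller, who can retry or queue the request, instead of hiding a privacy-relevant decision inside the router.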

What Combining All Five Actually Unlocks

The insight isn't that any single router is better. It's that each router optimizes for exactly one dimension, and in production, you need all five dimensions working simultaneously on the same request lifecycle.

Here's what that looks like technically.

Request Lifecycle: How the Five Routers Chain Together

Every incoming prompt passes through a lightweight classification layer first. MMBERT32K identifies the input modality (text, image, or audio) and checks endpoint health in the same step. If the primary node is degraded, traffic is hot-swapped before any model is invoked, with zero user-facing downtime and zero manual intervention.
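As a sketch of that first hop, assume the modality has already been classified and each endpoint exposes a health flag. The Endpoint type, the endpoint names, and pick_endpoint are hypothetical; how health is actually probed is not specified in this post.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    modality: str  # "text", "image", or "audio"
    healthy: bool = True


def pick_endpoint(modality: str, endpoints: list[Endpoint]) -> Endpoint:
    """Return the first healthy endpoint for a modality, in priority order.

    Degraded endpoints are skipped before any model is invoked.
    """
    for endpoint in endpoints:
        if endpoint.modality == modality and endpoint.healthy:
            return endpoint
    raise LookupError(f"no healthy endpoint for modality {modality!r}")


# Hypothetical fleet: the primary audio node is degraded.
endpoints = [
    Endpoint("audio-primary", "audio", healthy=False),
    Endpoint("audio-backup", "audio"),
    Endpoint("text-primary", "text"),
]
print(pick_endpoint("audio", endpoints).name)
```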

From there, prompt complexity determines the next hop. A fundamental mismatch exists between task complexity and resource allocation: offloading simple queries to the cloud invites prohibitive latency, while on-device models lack capacity for demanding computation. OneInfer Edge solves this by scoring prompt difficulty inline. Routine queries go to the MF router and are dispatched to local Apple Silicon or NVIDIA hardware via OneInfer Edge endpoints, with zero cloud tokens spent. Complex, multi-step queries escalate to Router-R1 for deep reasoning before model selection.

Latency-sensitive paths bypass this chain entirely. Arch-Router achieves median routing latency of 50ms with 99th percentile under 75ms, compared to 1000ms+ for competing solutions, but for real-time voice or autocomplete, even 50ms is overhead. BERT classification runs sub-10ms on prompt intent, committing to a model before a human perceives any gap.
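Put together, the branching described in this section might look like the sketch below. It assumes the caller already knows whether the request is latency-sensitive and has a complexity score in hand; select_router and the 0.5 default threshold are hypothetical, and the modality and health step is left out for brevity.

```python
def select_router(
    latency_sensitive: bool,
    complexity: float,
    threshold: float = 0.5,  # illustrative value, not a tuned default
) -> str:
    """Pick which router handles a request (hypothetical helper).

    Latency-sensitive traffic bypasses the chain and goes straight to the
    BERT classifier. Otherwise routine prompts go to the MF router and
    complex ones escalate to Router-R1.
    """
    if latency_sensitive:
        return "routellm/bert"
    if complexity > threshold:
        return "ulab-ai/Router-R1-Qwen2.5-3B-Instruct"
    return "routellm/mf"


print(select_router(latency_sensitive=True, complexity=0.9))
print(select_router(latency_sensitive=False, complexity=0.2))
print(select_router(latency_sensitive=False, complexity=0.9))
```

Checking the latency flag first is the point of the bypass: a voice or autocomplete request never pays for complexity scoring it does not need.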

What You Can Build With This Stack

Coding Copilots: If you are building a coding copilot, use BERT for autocomplete and inline suggestions. Anything slower than roughly 10ms of routing overhead and the suggestion feels laggy. When the same copilot handles a refactor or multi-file architectural change, Router-R1 takes over and escalates to cloud if local cannot handle it. The developer sees one tool. Under the hood, two completely different routing paths are doing the work.

Real-Time Voice Assistants: If you are building a real-time voice assistant, route through BERT for intent classification in under 10ms. When follow-ups involve multi-step reasoning, escalate to Router-R1. If the speech-to-text endpoint degrades mid-session, MMBERT32K reroutes traffic automatically. Your users never see the transition.

Document Processing Pipelines: If you are building a document processing pipeline, use the MF router to send paragraph-level extraction to local endpoints. You get near-frontier quality at a fraction of the cost. When the same pipeline hits contract clause analysis or multi-hop legal reasoning, Router-R1 takes over. One pipeline, two radically different cost profiles depending on what the task actually demands.

Agentic and Tool-Calling Workflows: If you are building agentic workflows where the model calls tools across multiple turns, Router-R1 is where you start. This is exactly the failure mode it was built for: a wrong model selection in step one compounds quietly through steps three, five, and seven before you notice. Router-R1 deliberates before committing. BERT handles the latency-sensitive tool invocations within the same pipeline.

Privacy-First and On-Device Apps: If you are building for privacy-sensitive or on-device-only requirements, lock the MF router to local-only routing. Set the escalation threshold so it never touches the cloud. MMBERT32K monitors local endpoint health and fails gracefully rather than silently pushing sensitive data to a cloud API.

Multimodal Support Systems: If you are building a multimodal support system, MMBERT32K classifies whether an incoming request contains a screenshot, log file, or plain text before routing begins. Each modality goes to the right endpoint. Simple and privacy-sensitive tasks stay local. Complex queries that need more capacity go to the cloud.

Why Open Source is the Only Way to Build This

The lack of open source infrastructure has led to a highly fragmented research landscape: individual studies implement bespoke routing pipelines using different model ensembles, datasets, and evaluation protocols, hindering fair comparison, reproducibility, and cumulative progress.

We are a small team. We could not have built a five-dimensional routing layer from scratch. What we could do, and what any team can do, is deeply integrate published research, contribute back edge deployment patterns and real-world benchmark data that academic benchmarks don't capture, and push production findings upstream.

That feedback loop is exactly how open source accelerates. RouteLLM gave us the cost-quality math. Arch-Router gave us sub-100ms production-grade routing. Router-R1 gave us RL-trained reasoning-aware selection. MMBERT32K gave us modality awareness and reliability. We are combining those primitives into patterns that run on local hardware in Bengaluru as reliably as they do on A100s in Virginia.

The community built the components. The opportunity now is to build the connective tissue and put it back in the open.

Repo: GitHub link

Feature requests and issues: Issues and Feedback

Learn more: oneinfer edge

Open source and contributions are very welcome. Happy to answer any questions about the routing implementation here.