We've been running a small experiment for the past few months: instead of picking one LLM for all tasks, we built a simple router that sends different queries to different models based on what they're good at.
We used DeepSeek-V4 Pro, Kimi 2.6, MiniMax 2.7, and Qwen3 235B. No single model won across the board. Here's what surprised us.
What we actually did
We set up a lightweight proxy that normalizes API requests to an OpenAI-compatible format. When a request comes in, it checks:
Task type (reasoning, long context, summarization, etc.)
Context length
Cost budget (optional)
Then it routes to one of the four models.
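Here is a minimal sketch of that decision step. It is not our actual proxy code: `Request`, `pick_model`, the task labels, the tier name "low", and the model ID strings are hypothetical names chosen for illustration. The task-to-model mapping follows the results in this post; the 100K cutoff and the budget rule are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model identifiers; real API model names will differ.
KIMI = "kimi-2.6"
QWEN = "qwen3-235b"
DEEPSEEK = "deepseek-v4-pro"
MINIMAX = "minimax-2.7"

# Assumed cutoff for "long context" requests.
LONG_CONTEXT_TOKENS = 100_000

# Task type -> preferred model.
TASK_MODELS = {
    "reasoning": QWEN,
    "multimodal": MINIMAX,
    "chat": DEEPSEEK,
}


@dataclass
class Request:
    task_type: str
    context_tokens: int
    budget_tier: Optional[str] = None  # e.g. "low"; None means no constraint


def pick_model(req: Request) -> str:
    """Return the model ID for a request, checking the signals in order."""
    # 1. Very long inputs go to the long-context model regardless of task.
    if req.context_tokens > LONG_CONTEXT_TOKENS:
        return KIMI
    # 2. Otherwise route by task type, defaulting to the general model.
    model = TASK_MODELS.get(req.task_type, DEEPSEEK)
    # 3. Assumed budget rule: a "low" tier avoids the expensive reasoning model.
    if req.budget_tier == "low" and model == QWEN:
        return DEEPSEEK
    return model
```

For example, `pick_model(Request("reasoning", 2_000))` picks the reasoning model, while the same request with `budget_tier="low"` drops to the general-purpose one.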
We didn't build anything fancy – just a few hundred lines of Python with retry logic and basic fallbacks.
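The retry-and-fallback part can be sketched roughly like this. The `RateLimited` exception, the `send` callable, and the backoff defaults are assumptions for illustration; a real client would raise its own error type for HTTP 429 responses.

```python
import time
from typing import Callable, Optional, Sequence


class RateLimited(Exception):
    """Hypothetical error a client raises when the provider is rate-limiting."""


def call_with_fallback(
    send: Callable[[str, str], str],
    prompt: str,
    models: Sequence[str],
    retries: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order; retry rate-limited calls with exponential backoff."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(retries + 1):
            try:
                return send(model, prompt)
            except RateLimited as err:
                last_error = err
                # Back off before the next attempt on the same model.
                if attempt < retries:
                    sleep(base_delay * 2 ** attempt)
        # Retries exhausted for this model; move on to the next one in the list.
    raise RuntimeError("all models were rate-limited") from last_error
```

Passing `sleep` as a parameter keeps the function easy to test without real delays. The order of `models` is the fallback chain, for example the primary model first and a cheaper one second.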
The surprising results
| Task type | Best model | Why |
| --- | --- | --- |
| Long document QA (>100K tokens) | Kimi 2.6 | Almost no retrieval degradation |
| Complex reasoning / math | Qwen3 235B | Highest GSM8K, but expensive |
| General chat + quick responses | DeepSeek-V4 Pro | Good balance of speed and accuracy |
| Multimodal understanding | MiniMax 2.7 | Surprisingly good at image + text |
For similar tasks, the most expensive model cost 2x as much as the cheapest. That alone made routing worth it.
What broke (and what didn't)
Things that worked well:
OpenAI-compatible interface saved us from rewriting app code
Simple YAML routing rules were enough for 80% of cases
Fallback to a cheaper model when the primary was rate-limited
Things that failed:
Automatic "smart routing" based on embeddings was overkill and slow
Trying to predict cost per task turned into a mess – we switched to simple budget tiers
Streaming normalization between models was surprisingly painful (SSE formats differ)
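To make the streaming pain concrete, here is a rough sketch of normalizing SSE lines into plain text deltas. The first payload shape is the OpenAI chat-completions streaming chunk (`choices[0].delta.content`, terminated by `data: [DONE]`). The second shape, `{"text": ...}`, is a made-up stand-in for a provider that deviates, not any specific model's real format. The function names are hypothetical too.

```python
import json
from typing import Iterable, Iterator


def extract_text(payload: dict) -> str:
    """Pull the text delta out of one decoded chunk.

    Shape A: OpenAI-style chunk with choices[0].delta.content.
    Shape B: {"text": ...}, an invented example of a divergent format.
    """
    choices = payload.get("choices")
    if choices:
        delta = choices[0].get("delta") or {}
        return delta.get("content") or ""
    return payload.get("text") or ""


def normalize_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield plain text deltas from raw SSE lines, stopping at the end marker."""
    for raw in lines:
        line = raw.strip()
        # Skip blank separators, comments (":" prefix), and non-data fields.
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        text = extract_text(json.loads(data))
        if text:
            yield text
```

This sketch assumes one JSON object per `data:` line. Events that split a payload across several `data:` lines, which the SSE format allows, would need buffering until the blank line that ends the event.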
A concrete example
One request that surprised us: summarizing a 90K-token legal document.
Kimi 2.6 retrieved all key clauses correctly
DeepSeek missed 2 out of 12 critical points
Qwen3 did well but cost 2x more
So we now route any document >80K tokens to Kimi by default.
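As a sketch, that default rule might look like the following. The 80K threshold is the one above; the 4-characters-per-token estimate is a rough heuristic for English text and an assumption on our part (a real tokenizer would be more accurate), and the function names and model ID string are hypothetical.

```python
LONG_DOC_TOKENS = 80_000  # threshold for the long-document default
CHARS_PER_TOKEN = 4       # rough heuristic for English text (assumption)


def estimate_tokens(text: str) -> int:
    """Cheap token estimate; swap in a real tokenizer for accuracy."""
    return len(text) // CHARS_PER_TOKEN


def route_document(text: str, default_model: str) -> str:
    """Send long documents to the long-context model, else keep the default."""
    if estimate_tokens(text) > LONG_DOC_TOKENS:
        return "kimi-2.6"  # hypothetical model ID
    return default_model
```

Because the estimate is crude, documents near the threshold can land on either side, so it may be worth leaving some margin below the real cutoff.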
Questions for the community
We're still early in this. Would love to hear from others who've tried multi-model routing:
Do you route dynamically or just pick one model per use case?
How do you handle cost control without breaking user experience?
Has anyone compared open-source routers with building their own?
Also curious if people have benchmarked these models on their own workloads – our numbers might not generalize.
Happy to share our routing config if helpful. Just ask.