Here's a pattern I see everywhere in LLM code:
import openai  # assumes OPENAI_API_KEY is set in the environment

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
The problem isn't the code. The problem is that "gpt-4o" is a hardcoded decision that will silently become wrong. Models update, get deprecated, and are undercut by cheaper alternatives. Three months from now, whatever you've hardcoded might be the expensive choice when a better-for-your-use-case option exists.
More importantly: what you actually wanted to say was "I want a high-quality model for this request". The specific model name is an implementation detail that got promoted to an interface.
The intent gap
Consider these scenarios:
- A summarization script that runs on a cron job. What you care about: cheap. You do not care which model.
- A code review tool where correctness matters. What you care about: quality reasoning. The model name is incidental.
- A real-time autocomplete feature. What you care about: latency. Cheapest fast model, updated automatically as the landscape changes.
In each case, you know your intent. You don't know (and shouldn't have to know) which specific model best serves that intent right now.
Enter the prefer parameter
Model Router is an OpenAI-compatible API that adds one non-standard parameter: prefer.
{
"model": "auto",
"prefer": "cheap"
}
prefer can be: cheap, fast, balanced, quality, or coding.
- prefer: cheap → routes to Qwen or DeepSeek via AWS Bedrock. Surprisingly capable, very cheap.
- prefer: fast → routes to whichever model currently has the lowest observed latency
- prefer: quality → routes to Claude Sonnet or GPT-4o
- prefer: coding → routes by SWE-bench-weighted composite score; highest coding benchmark wins
- prefer: balanced → balanced cost/quality within the standard tier (default)
The routing decision is made at request time, not at code-write time. When a better cheap model comes out, prefer: cheap routes to it automatically. Your code doesn't change.
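To make the request-time idea concrete, here's a hypothetical sketch of the selection step. The catalog, model names, prices, and quality scores below are invented for illustration; they are not Model Router's actual routing tables or logic:

```python
# Hypothetical catalog a router might consult at request time.
# All names and numbers are made up for this sketch.
CATALOG = {
    "qwen-2.5-72b":      {"cost_per_mtok": 0.40, "quality": 0.78},
    "deepseek-v3":       {"cost_per_mtok": 0.50, "quality": 0.82},
    "claude-3-5-sonnet": {"cost_per_mtok": 3.00, "quality": 0.95},
}

def route(prefer: str) -> str:
    """Pick a model from the current catalog based on declared intent."""
    if prefer == "cheap":
        return min(CATALOG, key=lambda m: CATALOG[m]["cost_per_mtok"])
    if prefer == "quality":
        return max(CATALOG, key=lambda m: CATALOG[m]["quality"])
    # "balanced": best quality per dollar
    return max(CATALOG, key=lambda m: CATALOG[m]["quality"] / CATALOG[m]["cost_per_mtok"])

print(route("cheap"))  # picks the cheapest model in today's catalog

# When a cheaper model ships, only the catalog changes.
# Caller code keeps saying route("cheap") and is untouched.
CATALOG["new-cheap-model"] = {"cost_per_mtok": 0.10, "quality": 0.75}
print(route("cheap"))
```

The point of the sketch: the caller's code encodes intent ("cheap"), and the binding to a concrete model happens against whatever the catalog holds at request time.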
How it works under the hood
There are two independent axes:
Tier — constrains the eligible model pool:
- economy: open-weight models (Qwen, DeepSeek, Mistral, GLM via Bedrock). Some are free.
- standard: mid-tier commercial models (Claude Haiku, Gemini Flash, Llama)
- premium: top-tier models (Claude Sonnet, GPT-4o, Gemini Pro)
Prefer — selects within the pool:
- Takes the tier's eligible models and picks based on the declared optimization target
You can use them independently or together:
{"model": "auto", "tier": "economy", "prefer": "quality"}
// → best quality model within the economy tier
{"model": "auto", "prefer": "fast"}
// → fastest model across all tiers
{"model": "auto"}
// → default: standard tier, balanced routing
You can also pin a specific model the normal way:
{"model": "claude-3-5-sonnet"}
// → routes directly, tier/prefer ignored
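The resolution order described above can be sketched in a few lines. This is my simplified reading of the two-axis design with an invented model table (real tier defaults and scoring will differ): a pinned model bypasses routing, tier constrains the pool, prefer selects within it.

```python
# Illustrative model table: tiers, costs, and scores are invented.
MODELS = {
    "qwen-2.5-72b":      {"tier": "economy",  "cost": 0.40, "quality": 0.78},
    "claude-3-haiku":    {"tier": "standard", "cost": 0.80, "quality": 0.85},
    "claude-3-5-sonnet": {"tier": "premium",  "cost": 3.00, "quality": 0.95},
}

def resolve(model="auto", tier=None, prefer="balanced"):
    # A pinned model name bypasses tier/prefer entirely.
    if model != "auto":
        return model
    # Tier constrains the eligible pool; no tier means all tiers.
    pool = {m: s for m, s in MODELS.items() if tier is None or s["tier"] == tier}
    # Prefer selects within the pool.
    if prefer == "cheap":
        return min(pool, key=lambda m: pool[m]["cost"])
    if prefer == "quality":
        return max(pool, key=lambda m: pool[m]["quality"])
    # "balanced": quality per dollar
    return max(pool, key=lambda m: pool[m]["quality"] / pool[m]["cost"])

print(resolve(tier="economy", prefer="quality"))  # best quality within economy
print(resolve(model="claude-3-5-sonnet"))         # pinned: routing skipped
```

Keeping the two axes orthogonal is what lets "best quality within the economy tier" fall out of the same mechanism as "fastest model across all tiers".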
The Bedrock angle
One thing that surprised me building this: AWS Bedrock has a genuinely interesting set of open-weight models that are quite cheap and reasonably capable for the right workloads. Qwen, DeepSeek, MiniMax, Kimi — models that most developers aren't routing to directly because the Bedrock setup is a bit of friction.
Model Router handles that friction. The tier: economy pool is essentially "give me the good stuff from Bedrock" as a single API call. Some of those models — Llama 3, some Qwen variants — have no usage cost at all.
Compatibility
The API is OpenAI-compatible. Drop-in replacement:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.lxg2it.com/v1",
    api_key="your-model-router-key",
)

response = client.chat.completions.create(
    model="auto",
    extra_body={"prefer": "cheap"},
    messages=[{"role": "user", "content": "Hello!"}],
)
The prefer parameter is passed via extra_body in the Python SDK (or as a top-level parameter in raw JSON).
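For raw JSON callers, a sketch of the request body with prefer at the top level (the sketch only builds the payload; the sending step is commented out, and the key is a placeholder):

```python
import json

# In raw JSON, "prefer" sits at the top level of the body
# rather than inside the SDK's extra_body wrapper.
payload = {
    "model": "auto",
    "tier": "economy",
    "prefer": "cheap",
    "messages": [{"role": "user", "content": "Hello!"}],
}

body = json.dumps(payload)
print(body)

# To actually send it, something like:
# import urllib.request
# req = urllib.request.Request(
#     "https://api.lxg2it.com/v1/chat/completions",
#     data=body.encode(),
#     method="POST",
#     headers={
#         "Authorization": "Bearer your-model-router-key",
#         "Content-Type": "application/json",
#     },
# )
```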
Where it's at
This is early but running. The routing works, the billing works, and real external users are building with it — including someone processing 80k-token documents through it as part of a pipeline. What it doesn't have yet is scale.
If you have use cases where you're currently hardcoding model names and wishing you could express intent instead — I'd genuinely love to hear if this helps. $1 signup credit, no commitment.