Here's a pattern I see everywhere in LLM code:
import openai  # assumes OPENAI_API_KEY is set in the environment

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
The problem isn't the code. The problem is that "gpt-4o" is a hardcoded decision that will silently become wrong. Models update, get deprecated, and are undercut by cheaper alternatives. Three months from now, whatever you've hardcoded might be the expensive choice when a better-for-your-use-case option exists.
More importantly: what you actually wanted to say was "I want a high-quality model for this request". The specific model name is an implementation detail that got promoted to an interface.
The intent gap
Consider these scenarios:
- A summarization script that runs on a cron job. What you care about: cheap. You do not care which model.
- A code review tool where correctness matters. What you care about: quality reasoning. The model name is incidental.
- A real-time autocomplete feature. What you care about: latency. Cheapest fast model, updated automatically as the landscape changes.
In each case, you know your intent. You don't know (and shouldn't have to know) which specific model best serves that intent right now.
Enter the prefer parameter
Model Router is an OpenAI-compatible API that adds one non-standard parameter: prefer.
{
"model": "auto",
"prefer": "cheap"
}
prefer can be: cheap, fast, balanced, quality, or coding.
- prefer: cheap → routes to Qwen or DeepSeek via AWS Bedrock. Surprisingly capable, very cheap.
- prefer: fast → routes to whichever model currently has the lowest observed latency
- prefer: quality → routes to Claude Sonnet or GPT-4o
- prefer: coding → routes by SWE-bench-weighted composite score; highest coding benchmark wins
- prefer: balanced → balanced cost/quality within the standard tier (default)
The routing decision is made at request time, not at code-write time. When a better cheap model comes out, prefer: cheap routes to it automatically. Your code doesn't change.
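To make the request-time idea concrete, here's a hypothetical sketch of the selection step. The catalog, model names, prices, and quality scores below are invented for illustration; they are not Model Router's actual routing tables or logic:

```python
# Hypothetical catalog a router might consult at request time.
# All names and numbers are made up for this sketch.
CATALOG = {
    "qwen-2.5-72b":      {"cost_per_mtok": 0.40, "quality": 0.78},
    "deepseek-v3":       {"cost_per_mtok": 0.50, "quality": 0.82},
    "claude-3-5-sonnet": {"cost_per_mtok": 3.00, "quality": 0.95},
}

def route(prefer: str) -> str:
    """Pick a model from the current catalog based on declared intent."""
    if prefer == "cheap":
        return min(CATALOG, key=lambda m: CATALOG[m]["cost_per_mtok"])
    if prefer == "quality":
        return max(CATALOG, key=lambda m: CATALOG[m]["quality"])
    # "balanced": best quality per dollar
    return max(CATALOG, key=lambda m: CATALOG[m]["quality"] / CATALOG[m]["cost_per_mtok"])

print(route("cheap"))  # picks the cheapest model in today's catalog

# When a cheaper model ships, only the catalog changes.
# Caller code keeps saying route("cheap") and is untouched.
CATALOG["new-cheap-model"] = {"cost_per_mtok": 0.10, "quality": 0.75}
print(route("cheap"))
```

The point of the sketch: the caller's code encodes intent ("cheap"), and the binding to a concrete model happens against whatever the catalog holds at request time.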
How it works under the hood
There are two independent axes:
Tier — constrains the eligible model pool:
- economy: open-weight models (Qwen, DeepSeek, Mistral, GLM via Bedrock). Some are free.
- standard: mid-tier commercial models (Claude Haiku, Gemini Flash, Llama)
- premium: top-tier models (Claude Sonnet, GPT-4o, Gemini Pro)
Prefer — selects within the pool:
- Takes the tier's eligible models and picks based on the declared optimization target
You can use them independently or together:
{"model": "auto", "tier": "economy", "prefer": "quality"}
// → best quality model within the economy tier
{"model": "auto", "prefer": "fast"}
// → fastest model across all tiers
{"model": "auto"}
// → default: standard tier, balanced routing
You can also pin a specific model the normal way:
{"model": "claude-3-5-sonnet"}
// → routes directly, tier/prefer ignored
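The resolution order described above can be sketched in a few lines. This is my simplified reading of the two-axis design with an invented model table (real tier defaults and scoring will differ): a pinned model bypasses routing, tier constrains the pool, prefer selects within it.

```python
# Illustrative model table: tiers, costs, and scores are invented.
MODELS = {
    "qwen-2.5-72b":      {"tier": "economy",  "cost": 0.40, "quality": 0.78},
    "claude-3-haiku":    {"tier": "standard", "cost": 0.80, "quality": 0.85},
    "claude-3-5-sonnet": {"tier": "premium",  "cost": 3.00, "quality": 0.95},
}

def resolve(model="auto", tier=None, prefer="balanced"):
    # A pinned model name bypasses tier/prefer entirely.
    if model != "auto":
        return model
    # Tier constrains the eligible pool; no tier means all tiers.
    pool = {m: s for m, s in MODELS.items() if tier is None or s["tier"] == tier}
    # Prefer selects within the pool.
    if prefer == "cheap":
        return min(pool, key=lambda m: pool[m]["cost"])
    if prefer == "quality":
        return max(pool, key=lambda m: pool[m]["quality"])
    # "balanced": quality per dollar
    return max(pool, key=lambda m: pool[m]["quality"] / pool[m]["cost"])

print(resolve(tier="economy", prefer="quality"))  # best quality within economy
print(resolve(model="claude-3-5-sonnet"))         # pinned: routing skipped
```

Keeping the two axes orthogonal is what lets "best quality within the economy tier" fall out of the same mechanism as "fastest model across all tiers".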
The Bedrock angle
One thing that surprised me building this: AWS Bedrock has a genuinely interesting set of open-weight models that are quite cheap and reasonably capable for the right workloads. Qwen, DeepSeek, MiniMax, Kimi — models that most developers aren't routing to directly because the Bedrock setup is a bit of friction.
Model Router handles that friction. The tier: economy pool is essentially "give me the good stuff from Bedrock" as a single API call. Some of those models — Llama 3, some Qwen variants — have no usage cost at all.
Compatibility
The API is OpenAI-compatible. Drop-in replacement:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.lxg2it.com/v1",
    api_key="your-model-router-key",
)

response = client.chat.completions.create(
    model="auto",
    extra_body={"prefer": "cheap"},
    messages=[{"role": "user", "content": "Hello!"}],
)
The prefer parameter is passed via extra_body in the Python SDK (or as a top-level parameter in raw JSON).
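For raw JSON callers, a sketch of the request body with prefer at the top level (the sketch only builds the payload; the sending step is commented out, and the key is a placeholder):

```python
import json

# In raw JSON, "prefer" sits at the top level of the body
# rather than inside the SDK's extra_body wrapper.
payload = {
    "model": "auto",
    "tier": "economy",
    "prefer": "cheap",
    "messages": [{"role": "user", "content": "Hello!"}],
}

body = json.dumps(payload)
print(body)

# To actually send it, something like:
# import urllib.request
# req = urllib.request.Request(
#     "https://api.lxg2it.com/v1/chat/completions",
#     data=body.encode(),
#     method="POST",
#     headers={
#         "Authorization": "Bearer your-model-router-key",
#         "Content-Type": "application/json",
#     },
# )
```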
Where it's at
This is early but running. The routing works, the billing works, and real external users are building with it — including someone processing 80k-token documents through it as part of a pipeline. What it doesn't have yet is scale.
If you have use cases where you're currently hardcoding model names and wishing you could express intent instead — I'd genuinely love to hear if this helps. $1 signup credit, no commitment.