sm1ck

Posted on • Originally published at honeychat.bot
LLM routing per tier via OpenRouter — when one model doesn't fit all

📦 Full runnable example: github.com/sm1ck/honeychat/tree/main/tutorial/02-routing. `docker compose up` exposes POST /complete on localhost:8000. Every snippet below is pulled from that repo.

Most introductory "chat with AI" tutorials pick one model and call it a day. That works in a toy. It stops being enough in production, where users have different price sensitivity, different conversation styles, and different expectations for what the product should allow.

Here's how to route LLM calls across a handful of providers via OpenRouter, how that routing handles finish_reason=content_filter empty-completion edge cases, and the fallback chain pattern that keeps replies flowing.

TL;DR

  • Route by tier (price elasticity) and by content mode (what kind of turn this is). A single default model can't do both.
  • Some reasoning/model-provider combinations can return finish_reason=content_filter with empty content on borderline content. A retry policy that only catches HTTP errors can miss this.
  • The working pattern: primary → different-provider fallback → specialized last resort, with retries triggered by both error responses and suspicious empty completions.

Run it yourself in 3 minutes

1. Clone and configure

git clone https://github.com/sm1ck/honeychat
cd honeychat/tutorial/02-routing
cp .env.example .env

Open .env and paste your OPENROUTER_API_KEY (from your OpenRouter account). The three default model slots all point to free-tier OpenRouter models so you can experiment without spending.

2. Start the service

docker compose up --build -d
curl http://localhost:8000/health   # {"ok":true}

3. Send a normal turn — primary answers

curl -X POST http://localhost:8000/complete \
  -H 'content-type: application/json' \
  -d '{"messages":[{"role":"user","content":"Name three cold-climate fruits."}]}' \
  | jq

Expected response:

{
  "content": "Apples, pears, and cloudberries...",
  "model": "meta-llama/llama-3.1-8b-instruct:free",
  "attempt": 0,
  "used_fallback": false
}

attempt: 0 means the primary model answered. used_fallback: false means no retry was needed.

4. Force a fallback

Override the primary to point at a model you know tends to refuse — or any bogus model name — and watch the chain kick in:

curl -X POST http://localhost:8000/complete \
  -H 'content-type: application/json' \
  -d '{"messages":[{"role":"user","content":"Say hi"}],"primary":"this/model-does-not-exist"}' \
  | jq '.model, .attempt, .used_fallback'

attempt: 1 (or 2) — the next rung answered. In production, log this metric: a rising fallback rate on a class of content means it's time to move the content to a different primary, not to tweak retry logic.

5. Run the unit tests

pip install -e ".[dev]"
pytest -v

Seven tests cover the failure modes in this chain: content_filter with empty content, transient 5xx errors, non-transient 4xx errors, and all models failing.

With the service running and the tests green, the rest of this post explains why the chain is shaped this way.


Why one model doesn't fit all

Three distinct pressures push against a single-model setup:

Price elasticity by tier. A free user generating 20 messages a day at flagship-model prices can burn cash every month per active user for zero revenue. A paying top-tier user sending the same 20 messages may reasonably expect higher quality. The unit economics do not agree.

Content mode. Mainstream-aligned models can refuse content that some legitimate companion/roleplay products allow on paid tiers. Conversely, less-restrictive models can have weaker long-context coherence. The right model depends on the turn.

Latency vs. depth. Instant conversational turns need sub-3-second responses. Long scene-writing turns can tolerate 10+ seconds for better prose. Hardcoding a single model optimizes for one and sacrifices the other.
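These three pressures reduce naturally to a lookup keyed by (tier, content mode). A minimal sketch — the tier names, modes, and premium model slug below are illustrative placeholders, not the repo's actual config (only the free llama slug appears in this post):

```python
# Illustrative routing table: (tier, content mode) -> OpenRouter model slug.
# Tier/mode names and the premium slug are example values, not HoneyChat's real config.
ROUTING: dict[tuple[str, str], str] = {
    ("free", "chat"):     "meta-llama/llama-3.1-8b-instruct:free",
    ("free", "scene"):    "meta-llama/llama-3.1-8b-instruct:free",
    ("premium", "chat"):  "anthropic/claude-3.5-sonnet",
    ("premium", "scene"): "anthropic/claude-3.5-sonnet",
}

def pick_model(tier: str, mode: str) -> str:
    """Resolve a model for this turn; unknown combinations fall back to the free chat model."""
    return ROUTING.get((tier, mode), ROUTING[("free", "chat")])
```

The point is that the routing decision is data, not code: adding a tier or a content mode is a table edit, not a new branch in the handler.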

The reasoning-model empty-completion edge case

This is the one that cost me a full afternoon to diagnose.

Some reasoning-class model/provider combinations do server-side moderation or filtering before returning a final answer. On borderline turns, they may not return an HTTP error. Instead, they can return a valid response with:

{
  "choices": [{
    "finish_reason": "content_filter",
    "message": { "content": "" }
  }]
}

Empty string. No exception. No status code to check. If you don't guard for it, your user sees a blank reply.

If your retry logic only triggers on httpx.HTTPStatusError, this can pass through.

The guard

The whole failure mode is caught by a tiny function:

def _is_silent_refusal(choice: dict) -> bool:
    """
    The whole point of this post: reasoning models can return a successful
    HTTP response with finish_reason=content_filter AND an empty content.
    If you only check HTTP status, you ship blank replies to users.
    """
    reason = choice.get("finish_reason")
    content = choice.get("message", {}).get("content") or ""
    return reason in ("content_filter", "length") and not content.strip()

full source

Resilient fallback chain

*Diagram: a chat turn tries a tier-specific primary model, retries on a different-provider fallback after empty content_filter responses, then falls back to a specialized last resort.*

async def complete(
    messages: list[dict],
    *,
    primary: str | None = None,
    chain: Iterable[str] | None = None,
) -> CompletionResult:
    """Run the fallback chain. Return the first usable response."""
    models = list(chain) if chain is not None else _build_chain(primary)

    async with httpx.AsyncClient() as client:
        for attempt, model in enumerate(models):
            try:
                data = await _call(client, model, messages)
            except httpx.HTTPStatusError as e:
                if e.response.status_code in TRANSIENT_CODES:
                    continue
                raise
            except (httpx.ReadTimeout, httpx.ConnectError):
                continue

            choice = (data.get("choices") or [{}])[0]
            if _is_silent_refusal(choice):
                continue

            content = choice.get("message", {}).get("content") or ""
            if not content.strip():
                continue

            return CompletionResult(content=content, model=model, attempt=attempt)

    raise AllModelsFailedError(f"no model returned usable content; tried {models}")

full source

Two details worth calling out:

  1. Empty content check is separate from the finish reason. Some models can return finish_reason=stop with empty content when they refuse. Always check not content.strip().
  2. Track which model ultimately answered. Log attempt > 0 as a fallback event. If your primary fails 10% of the time on a class of content, that's a routing decision, not a retry problem — move that content to a different primary.
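A minimal way to capture that fallback signal — the logger name and counter scheme here are assumptions for illustration, not the repo's actual instrumentation:

```python
import logging
from collections import Counter

log = logging.getLogger("llm.router")
answered_by: Counter[str] = Counter()  # which model ultimately answered, per call

def record_result(model: str, attempt: int) -> None:
    """Count the answering model; attempt > 0 is a fallback event worth alerting on."""
    answered_by[model] += 1
    if attempt > 0:
        log.warning("fallback_used model=%s attempt=%d", model, attempt)

record_result("meta-llama/llama-3.1-8b-instruct:free", 0)
record_result("some/fallback-model", 1)  # hypothetical fallback slug
```

A dashboard over this counter is what turns "retries are firing" into the routing decision described above.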

Picking the fallback order

For a permissive roleplay mode, the shape looks like this:

content-mode primary   → first model for this type of turn
  ↓ (on failure / empty)
diff-provider fallback → avoids the same upstream failure mode
  ↓
specialized last resort
  ↓
abort — ask the user to try a shorter or clearer prompt

The ordering rule: different-provider fallbacks. If the primary is hosted on provider A and fails for a provider-side reason, prefer a fallback hosted on provider B. Same-provider fallbacks can fail on the same content because the provider's moderation layer may be upstream of the model. OpenRouter makes this easier because each model's provider metadata is visible.

Content-level gating happens before the LLM, not after

The fallback chain handles model-level refusals. But if the user's intent is clearly above your product's content ceiling, retrying on a more permissive model just burns extra tokens before the user hits the real limit. Gate the content level in your system prompt assembly — don't rely on the model to enforce policy.

Keep the tier-level policy simple: the escalation class (detected from user intent) must be at or below the user's plan ceiling. If it exceeds the ceiling, the character responds in-character and the bot sends the upsell. The LLM does not need to know the tier exists — it just gets a system prompt with the right constraints for this turn.
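Under the assumption that escalation classes and plan ceilings are drawn from one small ordered scale, the gate is a single comparison done during prompt assembly — the level names here are illustrative:

```python
# Ordered content levels: higher index = more permissive. Names are illustrative.
LEVELS = ["sfw", "suggestive", "explicit"]

def within_ceiling(detected: str, plan_ceiling: str) -> bool:
    """True if the detected escalation class is at or below the user's plan ceiling."""
    return LEVELS.index(detected) <= LEVELS.index(plan_ceiling)

# A free-tier user capped at "sfw" asking for a "suggestive" turn is gated
# before any tokens are spent; the bot can answer in-character and upsell instead.
assert within_ceiling("sfw", "sfw")
assert not within_ceiling("suggestive", "sfw")
```

Doing this before the LLM call is exactly what avoids burning fallback retries on content the product would reject anyway.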

Instrumentation that matters

Log three things per LLM call:

  • Model that answered (primary or fallback index)
  • Time to first token vs total time — tells you whether latency was model-side or network-side
  • Token cost (input + output) per message, bucketed by tier

Costs are tracked in Redis counters with short TTLs — a daily sum and a per-user daily sum. A global daily ceiling blocks new generations once spend crosses a configured threshold (fail-closed: if the counter is unreachable, block rather than pass). This capped a runaway generation loop at a known ceiling.

What I'd change if starting over

  • Route by content mode from day 1, not as an afterthought. Retrofitting the split into an existing handler is painful.
  • Instrument the silent-refusal rate. It may be rare, but you won't know unless you measure it specifically.
  • Don't share a single OpenRouter key across environments. Rate limits are per-key and dev noise eats prod quota.
  • Publish the tier → model map in your public docs. Users comparing products care. Competitors already know. Keeping the docs in sync with the code forces alignment.

Where this lives

HoneyChat's LLM router sits behind the chat handler on both the Telegram bot and the web app. Public architecture: github.com/sm1ck/honeychat/blob/main/docs/architecture.md.

Previous in the series: dual-layer memory with Redis + ChromaDB.
Next: character consistency with custom LoRA.

Curious how others have solved the silent-refusal pattern. If you've hit it on a different provider, drop a comment — I want to know which models ship which behavior.
