If you build on top of LLMs, you've probably hit this: you ship a feature, traffic spikes, and the API bill comes back way higher than you expected. Per-token pricing makes costs hard to predict — you're billed by how verbose the model is, not by the value you ship.
I got tired of that (plus juggling three API keys), so here's a setup that fixes both: one OpenAI-compatible endpoint that auto-picks the best model and charges a flat price per call.
The core idea
Instead of calling each provider directly, you point your existing OpenAI SDK at a single gateway and send one model name: modelis-auto. It routes each request to the best model for the task (GPT-5.5, Claude Opus 4.8, Gemini 3.1, Grok, DeepSeek…) and bills a flat per-call rate — so your cost is predictable regardless of which model handled it.
Zero migration: just change base_url
If you already use the OpenAI SDK, the only code change is the base_url (plus swapping in your gateway API key).
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODELIS_KEY",
    base_url="https://modelishub.com/v1",  # the only change
)

resp = client.chat.completions.create(
    model="modelis-auto",  # let it pick the best model
    messages=[{"role": "user", "content": "Explain CRDTs in two sentences."}],
)
print(resp.choices[0].message.content)
Or with curl:
curl https://modelishub.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_MODELIS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"modelis-auto","messages":[{"role":"user","content":"Hi"}]}'
That's it. Your existing code, SDKs, and OpenAI-compatible tools keep working.
"But which model actually answered?"
Fair question: auto-routing shouldn't be a black box. Every response includes a header telling you exactly which model handled the request:
X-Modelis-Routed-Model: claude-opus-4-8
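To read that header from Python, one option is the OpenAI SDK's raw-response accessor. The following is a minimal sketch, assuming the openai Python package v1 or later and assuming the gateway sets the header on chat completion responses as described above. The helper names `routed_model` and `ask_with_routing_info` are made up for illustration.

```python
ROUTED_MODEL_HEADER = "X-Modelis-Routed-Model"  # header name from the article


def routed_model(headers, default="unknown"):
    """Return the routed model name from a headers mapping, or `default`."""
    return headers.get(ROUTED_MODEL_HEADER, default)


def ask_with_routing_info(client, prompt, model="modelis-auto"):
    """Send one chat request; return (answer_text, routed_model_name).

    `client` is expected to be an `openai.OpenAI` instance whose base_url
    points at the gateway. `with_raw_response` exposes the HTTP headers
    alongside the parsed completion.
    """
    raw = client.chat.completions.with_raw_response.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    completion = raw.parse()  # the usual ChatCompletion object
    return completion.choices[0].message.content, routed_model(raw.headers)
```

With the client from the snippet above, `answer, model = ask_with_routing_info(client, "Explain CRDTs in two sentences.")` would give you both values, which makes it easy to log the routed model next to each request.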
And if you want control, you can stay in a quality tier or call a specific model directly:
model: "modelis-auto:premium" # stay in a quality tier
model: "gpt-5.5" # or pin a specific model
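One way to use these options in an application is to keep the model choice in a single table keyed by feature, so pinning or unpinning a model is a one-line config change. This is only a sketch: the feature names and the `model_for` helper are hypothetical, and only the model strings come from the examples above.

```python
# Model string per application feature (feature names are hypothetical).
MODEL_BY_FEATURE = {
    "chat": "modelis-auto",                     # let the gateway choose
    "contract_review": "modelis-auto:premium",  # stay in a quality tier
    "codegen": "gpt-5.5",                       # pinned to a specific model
}


def model_for(feature, default="modelis-auto"):
    """Return the model string for a feature, falling back to auto-routing."""
    return MODEL_BY_FEATURE.get(feature, default)
```

The result goes straight into the `model=` argument of `client.chat.completions.create(...)`.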
Why flat per-call instead of per-token
The point isn't "cheaper than everyone" — it's predictable. With a flat per-call price:
- A verbose response doesn't cost more than a terse one.
- On a busy day, cost scales with the number of calls, not with how many tokens each call happens to use.
- You can actually budget, and price your own product with confidence.
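To make the difference concrete, here is a small back-of-the-envelope comparison. All prices and token counts below are made-up placeholders, not actual rates from any provider or from the gateway, so plug in your own numbers.

```python
def per_token_cost(calls, input_tokens, output_tokens,
                   usd_per_m_input, usd_per_m_output):
    """Total cost in USD when each call is billed by tokens in and out."""
    per_call = (input_tokens * usd_per_m_input
                + output_tokens * usd_per_m_output) / 1_000_000
    return calls * per_call


def flat_cost(calls, usd_per_call):
    """Total cost in USD when each call has one fixed price."""
    return calls * usd_per_call


# Hypothetical numbers: 10,000 calls, 500 input tokens each,
# $3 per million input tokens, $15 per million output tokens, $0.01 flat.
terse = per_token_cost(10_000, 500, 200, 3.0, 15.0)
verbose = per_token_cost(10_000, 500, 1_000, 3.0, 15.0)
flat = flat_cost(10_000, 0.01)

print(f"per-token, terse answers:   ${terse:.2f}")
print(f"per-token, verbose answers: ${verbose:.2f}")
print(f"flat per-call, either way:  ${flat:.2f}")
```

With these placeholder numbers the per-token bill ranges from $45 to $165 for the same number of calls, while the flat bill stays at $100 whether the answers are short or long.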
Honest take: when per-token is still fine
If your workload is steady, you control prompt/response sizes tightly, and you've already optimized model choice per route, per-token billing can be cheaper. Flat per-call shines when traffic is bursty, prompts vary, or you just don't want to babysit model selection and cost. Pick what fits your reality.
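A quick way to check which side of that line you are on is to compute the break-even output length: the number of output tokens per call at which per-token billing costs the same as the flat price. The sketch below uses hypothetical prices, and none of the figures are real rates.

```python
def break_even_output_tokens(usd_per_call, input_tokens,
                             usd_per_m_input, usd_per_m_output):
    """Output tokens per call at which per-token cost equals the flat price.

    Solves (input_tokens * p_in + x * p_out) / 1e6 == usd_per_call for x.
    Below the result, per-token billing is cheaper; above it, flat is cheaper.
    Returns 0.0 if the input tokens alone already cost more than the flat price.
    """
    input_cost = input_tokens * usd_per_m_input / 1_000_000
    remaining = usd_per_call - input_cost
    return max(0.0, remaining * 1_000_000 / usd_per_m_output)


# Hypothetical: $0.01 per call flat, 500 input tokens per call,
# $3 per million input tokens, $15 per million output tokens.
threshold = break_even_output_tokens(0.01, 500, 3.0, 15.0)
print(f"break-even at about {threshold:.0f} output tokens per call")
```

With these placeholder numbers, calls that average fewer than roughly 567 output tokens would be cheaper on per-token billing, and longer ones would be cheaper on the flat rate.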
Try it
There's a free tier: modelishub.com. I'd genuinely love feedback — especially whether predictable pricing actually matters for how you build, or if you prefer per-token control.
Top comments (1)
The "juggling three API keys" pain is the same root issue whether you're on the API side or just using these models day to day — the ecosystem is multi-model now and nobody wants to live inside one vendor. (I run my day-to-day through MultipleChat for exactly that reason, so the routing instinct here resonates.)
On the actual tradeoff: flat per-call vs per-token is really about where you want your variance. Per-token puts variance in cost; flat per-call moves it into your margin per request — great when traffic is bursty and prompt sizes vary, exactly as you said. The X-Modelis-Routed-Model header is the right call too; auto-routing only works if I can audit what answered and pin a model when a route regresses.
One question: how does modelis-auto decide "best model for the task" — latency/cost-weighted, or quality-scored per category? That routing logic is basically the whole product, and I'd trust it faster if the criteria were transparent.