Stop guessing your AI bill: one endpoint for GPT-5.5, Claude & Gemini at a flat per-call price

#ai #python #llm #api

If you build on top of LLMs, you've probably hit this: you ship a feature, traffic spikes, and the API bill comes back way higher than you expected. Per-token pricing makes costs hard to predict — you're billed by how verbose the model is, not by the value you ship.

I got tired of that (plus juggling three API keys), so here's a setup that fixes both: one OpenAI-compatible endpoint that auto-picks the best model and charges a flat price per call.

The core idea

Instead of calling each provider directly, you point your existing OpenAI SDK at a single gateway and send one model name: modelis-auto. It routes each request to the best model for the task (GPT-5.5, Claude Opus 4.8, Gemini 3.1, Grok, DeepSeek…) and bills a flat per-call rate — so your cost is predictable regardless of which model handled it.

Zero migration: just change base_url

If you already use the OpenAI SDK, this is a one-line change.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODELIS_KEY",
    base_url="https://modelishub.com/v1",   # the only change
)

resp = client.chat.completions.create(
    model="modelis-auto",                    # let it pick the best model
    messages=[{"role": "user", "content": "Explain CRDTs in two sentences."}],
)
print(resp.choices[0].message.content)

Or with curl:

curl https://modelishub.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_MODELIS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"modelis-auto","messages":[{"role":"user","content":"Hi"}]}'

That's it. Your existing code, SDKs, and OpenAI-compatible tools keep working.

"But which model actually answered?"

Fair question — auto-routing shouldn't be a black box. Every response returns a header telling you exactly which model handled the request:

X-Modelis-Routed-Model: claude-opus-4-8

And if you want control, you can stay in a quality tier or call a specific model directly:

model: "modelis-auto:premium"     # stay in a quality tier
model: "gpt-5.5"                   # or pin a specific model

Why flat per-call instead of per-token

The point isn't "cheaper than everyone" — it's predictable. With a flat per-call price:

A verbose response doesn't cost more than a terse one.
A busy day scales with calls, not with token noise.
You can actually budget, and price your own product with confidence.

Honest take: when per-token is still fine

If your workload is steady, you control prompt/response sizes tightly, and you've already optimized model choice per route, per-token billing can be cheaper. Flat per-call shines when traffic is bursty, prompts vary, or you just don't want to babysit model selection and cost. Pick what fits your reality.

Try it

There's a free tier: modelishub.com. I'd genuinely love feedback — especially whether predictable pricing actually matters for how you build, or if you prefer per-token control.

Top comments (2)

Saruman22 • Jun 18

The "juggling three API keys" pain is the same root issue whether you're on the API side or just using these models day to day — the ecosystem is multi-model now and nobody wants to live inside one vendor. (I run my day-to-day through MultipleChat for exactly that reason, so the routing instinct here resonates.)

On the actual tradeoff: flat per-call vs per-token is really about where you want your variance. Per-token puts variance in cost; flat per-call moves it into your margin per request — great when traffic is bursty and prompt sizes vary, exactly as you said. The X-Modelis-Routed-Model header is the right call too; auto-routing only works if I can audit what answered and pin a model when a route regresses.

One question: how does modelis-auto decide "best model for the task" — latency/cost-weighted, or quality-scored per category? That routing logic is basically the whole product, and I'd trust it faster if the criteria were transparent.

chenxiao5580-cmd • Jun 19

Great question — and you're right that the routing logic is the whole product, so here's the honest answer:

modelis-auto isn't latency/cost-weighted. It classifies each request by task difficulty and routes within the quality tier you pick — so the criterion is quality-fit-per-difficulty, not "cheapest that might work." Cost predictability comes from the flat per-call tier, not from down-routing to save money.

To your trust point: the X-Modelis-Routed-Model header on every response is exactly so you can audit which model answered and pin one when a route regresses. Agreed that auto-routing is only trustworthy if it's inspectable. Thanks for the sharp read on the tradeoffs.