DEV Community

chenxiao5580-cmd
chenxiao5580-cmd

Posted on

Stop guessing your AI bill: one endpoint for GPT-5.5, Claude & Gemini at a flat per-call price

If you build on top of LLMs, you've probably hit this: you ship a feature, traffic spikes, and the API bill comes back way higher than you expected. Per-token pricing makes costs hard to predict — you're billed by how verbose the model is, not by the value you ship.

I got tired of that (plus juggling three API keys), so here's a setup that fixes both: one OpenAI-compatible endpoint that auto-picks the best model and charges a flat price per call.

The core idea

Instead of calling each provider directly, you point your existing OpenAI SDK at a single gateway and send one model name: modelis-auto. It routes each request to the best model for the task (GPT-5.5, Claude Opus 4.8, Gemini 3.1, Grok, DeepSeek…) and bills a flat per-call rate — so your cost is predictable regardless of which model handled it.

Zero migration: just change base_url

If you already use the OpenAI SDK, this is a one-line change.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODELIS_KEY",
    base_url="https://modelishub.com/v1",   # the only change
)

resp = client.chat.completions.create(
    model="modelis-auto",                    # let it pick the best model
    messages=[{"role": "user", "content": "Explain CRDTs in two sentences."}],
)
print(resp.choices[0].message.content)
Enter fullscreen mode Exit fullscreen mode

Or with curl:

curl https://modelishub.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_MODELIS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"modelis-auto","messages":[{"role":"user","content":"Hi"}]}'
Enter fullscreen mode Exit fullscreen mode

That's it. Your existing code, SDKs, and OpenAI-compatible tools keep working.

"But which model actually answered?"

Fair question — auto-routing shouldn't be a black box. Every response returns a header telling you exactly which model handled the request:

X-Modelis-Routed-Model: claude-opus-4-8
Enter fullscreen mode Exit fullscreen mode

And if you want control, you can stay in a quality tier or call a specific model directly:

model: "modelis-auto:premium"     # stay in a quality tier
model: "gpt-5.5"                   # or pin a specific model
Enter fullscreen mode Exit fullscreen mode

Why flat per-call instead of per-token

The point isn't "cheaper than everyone" — it's predictable. With a flat per-call price:

  • A verbose response doesn't cost more than a terse one.
  • A busy day scales with calls, not with token noise.
  • You can actually budget, and price your own product with confidence.

Honest take: when per-token is still fine

If your workload is steady, you control prompt/response sizes tightly, and you've already optimized model choice per route, per-token billing can be cheaper. Flat per-call shines when traffic is bursty, prompts vary, or you just don't want to babysit model selection and cost. Pick what fits your reality.

Try it

There's a free tier: modelishub.com. I'd genuinely love feedback — especially whether predictable pricing actually matters for how you build, or if you prefer per-token control.

Top comments (1)

Collapse
 
sarumaninan profile image
Saruman22

The "juggling three API keys" pain is the same root issue whether you're on the API side or just using these models day to day — the ecosystem is multi-model now and nobody wants to live inside one vendor. (I run my day-to-day through MultipleChat for exactly that reason, so the routing instinct here resonates.)

On the actual tradeoff: flat per-call vs per-token is really about where you want your variance. Per-token puts variance in cost; flat per-call moves it into your margin per request — great when traffic is bursty and prompt sizes vary, exactly as you said. The X-Modelis-Routed-Model header is the right call too; auto-routing only works if I can audit what answered and pin a model when a route regresses.

One question: how does modelis-auto decide "best model for the task" — latency/cost-weighted, or quality-scored per category? That routing logic is basically the whole product, and I'd trust it faster if the criteria were transparent.