DEV Community: chenxiao5580-cmd

Cut your LLM bill by matching the model to the task — a real price comparison

chenxiao5580-cmd — Tue, 07 Jul 2026 11:25:12 +0000

With per-token pricing, the model you pick is the single biggest lever on your bill. Running a frontier model on a simple classification job can cost 10–90× more than a smaller model that handles it just as well. Here's a real price comparison and where the cheaper option is usually enough.

All prices are USD per 1M tokens (input / output), list prices — the savings below come from model choice, not a discount.

The price spread

Model	Input	Output
Claude Opus 4.8	$5.00	$25.00
Claude Sonnet 4.6	$3.00	$15.00
Claude Haiku 4.5	$1.00	$5.00
GPT-4o	$2.50	$10.00
GPT-4o mini	$0.15	$0.60
Gemini 2.5 Pro	$1.25	$10.00
Gemini 2.5 Flash	$0.30	$2.50
DeepSeek V4 Flash	$0.10	$0.20

Where a cheaper model is usually enough

Simple tasks (classification, extraction, rewriting, batch jobs) → GPT-4o mini instead of GPT-4o
$0.15 vs $2.50 input, $0.60 vs $10 output — ~94% cheaper. For structured, well-defined tasks the mini model rarely loses quality.

Everyday coding & general work → Claude Sonnet 4.6 instead of Opus 4.8
$3 vs $5 input, $15 vs $25 output — 40% cheaper. Save Opus for genuinely hard reasoning.

High-concurrency lightweight calls → Claude Haiku 4.5 instead of Sonnet 4.6
$1 vs $3 input, $5 vs $15 output — ~67% cheaper.

Long-document summarization & retrieval QA → Gemini 2.5 Flash instead of 2.5 Pro
$0.30 vs $1.25 input, $2.50 vs $10 output — ~75% cheaper.

Chinese-language work & high-volume throughput → DeepSeek V4 Flash instead of GPT-4o
$0.10 vs $2.50 input, $0.20 vs $10 output — 96–98% cheaper.
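
The percentages above can be reproduced from the table. The sketch below is only arithmetic over the listed prices; the 10M-input / 1M-output batch is a made-up workload, and `job_cost` and `percent_cheaper` are illustrative helper names.

```python
# List prices from the table above, USD per 1M tokens: (input, output).
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "GPT-4o": (2.50, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "DeepSeek V4 Flash": (0.10, 0.20),
}


def job_cost(model, input_tokens, output_tokens):
    """Cost in USD of one job at list price."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def percent_cheaper(cheap, expensive, input_tokens, output_tokens):
    """How much cheaper `cheap` is than `expensive` for the same token mix, in percent."""
    cheap_cost = job_cost(cheap, input_tokens, output_tokens)
    expensive_cost = job_cost(expensive, input_tokens, output_tokens)
    return 100 * (1 - cheap_cost / expensive_cost)


# Hypothetical batch: 10M input tokens, 1M output tokens.
saving = percent_cheaper("GPT-4o mini", "GPT-4o", 10_000_000, 1_000_000)
print(f"GPT-4o mini vs GPT-4o: {saving:.0f}% cheaper")
```

Because each pair in this post has the same (or nearly the same) ratio on input and output, the saving barely depends on the token mix; for pairs where the ratios differ, plug in your own mix.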

The catch: routing by hand is tedious

Knowing the cheaper model exists is one thing; actually picking it per request — and remembering to switch back for the hard prompts — is friction nobody keeps up with.

That's the problem Modelis is built around: it's an OpenAI-compatible gateway that classifies each request by difficulty and routes it to a fitted model automatically, then bills a flat per-call price so your cost stays predictable regardless of which model answered. Every response carries an X-Modelis-Routed-Model header so you can see exactly which model handled it.

Free tier — get a key.

Use GPT, Claude, and Gemini with the OpenAI SDK — one base_url, any language

chenxiao5580-cmd — Sat, 04 Jul 2026 01:35:07 +0000

If you're juggling separate SDKs and API keys for OpenAI, Anthropic, and Google, here's a simpler setup: point the OpenAI SDK at an OpenAI-compatible gateway and call GPT, Claude, and Gemini through one base_url and one key — at a flat, predictable per-call price instead of three separate per-token bills.

Here's the exact change in four common stacks.

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://modelishub.com/v1",
    api_key="YOUR_MODELIS_KEY",
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

The rest of your code stays the same — only base_url and api_key change.

Node.js (openai v4)

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://modelishub.com/v1',
  apiKey: process.env.MODELIS_API_KEY,
});

const resp = await client.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: 'Hello' }],
});
console.log(resp.choices[0].message.content);

curl (works in any language)

curl https://modelishub.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_MODELIS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Hello"}]}'

The same pattern works in Go, Java, PHP, Rust — anywhere you can make an HTTP request.

LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://modelishub.com/v1",
    api_key="YOUR_MODELIS_KEY",
    model="gpt-4o-mini",
)
print(llm.invoke("Hello").content)

Switching between GPT, Claude, and Gemini

To use a different model, just change the model id — claude-sonnet-4-6, gemini-2.5-flash, and so on. Same endpoint, same key. Fetch the full list any time with GET /v1/models.

resp = client.chat.completions.create(
    model="claude-sonnet-4-6",      # or gemini-2.5-flash, gpt-4o, ...
    messages=[{"role": "user", "content": "Hello"}],
)

This runs on Modelis, an OpenAI-compatible LLM gateway that auto-routes each request to a model fitted to the task and bills a flat per-call price (not per-token), so your bill stays predictable regardless of which model answered. Free tier to try it — details + free key.

Call GPT, Claude, and Gemini from one API key — a 3-step setup

chenxiao5580-cmd — Wed, 01 Jul 2026 01:35:04 +0000

If you want to try GPT, Claude, and Gemini without signing up for three separate platforms and juggling three billing dashboards, here's a 3-step setup using an OpenAI-compatible gateway.

1. Create an account

Go to modelishub.com, sign up with a username/email + password, and pass the quick human check. You're logged into the console immediately — no approval wait.

2. Create an API key

In the console, open API Tokens → New Token and give it a name (e.g. my-app). Optionally set a quota cap so you can control spend per project. Copy the key (it starts with sk-) — that's what you call the API with. Keep it private; usage and quota are tracked per key.

3. Make your first call

Point any OpenAI-compatible tool at https://modelishub.com/v1 with that key:

from openai import OpenAI

client = OpenAI(base_url="https://modelishub.com/v1", api_key="YOUR_KEY")
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

To switch models, just change the model id — claude-sonnet-4-6, gemini-2.5-flash, and so on. Same endpoint, same key. Fetch the full list any time with GET /v1/models.
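
For the model list, here is a minimal standard-library sketch. It assumes the gateway accepts the same Bearer key on GET /v1/models and returns the OpenAI-style list shape ({"data": [{"id": ...}]}); neither is confirmed above, and the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "https://modelishub.com/v1"


def build_models_request(api_key, base_url=BASE_URL):
    """Build the GET /v1/models request without sending it."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/models",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def model_ids(payload):
    """Extract ids from an OpenAI-style list response (assumed shape)."""
    return sorted(item["id"] for item in payload.get("data", []))


def fetch_model_ids(api_key):
    """Send the request and return the model ids (makes a network call)."""
    with urllib.request.urlopen(build_models_request(api_key), timeout=30) as resp:
        return model_ids(json.load(resp))
```

Calling `fetch_model_ids("YOUR_KEY")` before hardcoding a model id is a cheap way to catch a typo that would otherwise only show up in the Logs view.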

Checking your usage

The console Dashboard shows overall consumption; the Logs view shows every call (model, tokens used, quota spent, success or error), so you can pinpoint a wrong model id or a quota issue in seconds.

This runs on Modelis, an OpenAI-compatible LLM gateway that auto-routes each request to a fitted model and bills a flat per-call price (not per-token). Free tier to try it — get a key.

A flat per-call endpoint for summarize / classify / extract in your n8n and Make automations

chenxiao5580-cmd — Sun, 28 Jun 2026 13:00:34 +0000

If you run automations that summarize, classify, or pull fields out of text at volume, the LLM step is where per-token pricing turns budgeting into a guessing game: one batch of long inputs and the bill spikes. For these bounded-output jobs, a flat price per call fits better than a per-token frontier model. Here is how I wire it into n8n / Make, and when not to.

Why flat-per-call fits automation

Automation runs are repetitive and high-volume, and the outputs are short by nature: a summary, a label, a few extracted fields. I route them through Modelis, an OpenAI-compatible gateway that auto-routes each request to a fitting model and charges a flat price per call with output capped at ~1024 tokens. Because the output is bounded, each run costs the same and your monthly total stays predictable no matter the input size.

Wiring it into n8n / Make

It is a standard OpenAI-compatible POST /v1/chat/completions. Use an HTTP Request node:

Method: POST
URL: https://modelis-auto-chat.p.rapidapi.com/v1/chat/completions
Headers: x-rapidapi-host: modelis-auto-chat.p.rapidapi.com, x-rapidapi-key: YOUR_KEY, content-type: application/json
Body:

{"model":"modelis-auto","messages":[{"role":"user","content":"Label sentiment (positive/negative/neutral): {{ $json.text }}"}]}

The curl equivalent:

curl --request POST \
  --url https://modelis-auto-chat.p.rapidapi.com/v1/chat/completions \
  --header 'content-type: application/json' \
  --header 'x-rapidapi-host: modelis-auto-chat.p.rapidapi.com' \
  --header 'x-rapidapi-key: YOUR_KEY' \
  --data '{"model":"modelis-auto","messages":[{"role":"user","content":"Summarize in 2 sentences: ..."}]}'
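
If you build the same request in code (for example in a script step), a sketch like the one below may help. It mirrors the curl call with the standard library and serializes the body with json.dumps, so quotes and newlines in the input text are escaped instead of breaking the JSON. `build_request` and its default instruction are illustrative names, not part of the API.

```python
import json
import urllib.request

RAPIDAPI_HOST = "modelis-auto-chat.p.rapidapi.com"
URL = f"https://{RAPIDAPI_HOST}/v1/chat/completions"


def build_request(text, rapidapi_key,
                  instruction="Label sentiment (positive/negative/neutral): "):
    """Build the same POST as the curl call, without sending it."""
    body = {
        "model": "modelis-auto",
        "messages": [{"role": "user", "content": instruction + text}],
    }
    return urllib.request.Request(
        URL,
        # json.dumps escapes quotes and newlines inside `text`
        data=json.dumps(body).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "x-rapidapi-host": RAPIDAPI_HOST,
            "x-rapidapi-key": rapidapi_key,
        },
        method="POST",
    )
```

Pass the result to `urllib.request.urlopen(...)` to send it.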

If you would rather use a built-in OpenAI node that expects an Authorization: Bearer key and a custom base URL, run the tiny open-source adapter next to your workflow runner:

npx modelis-openai      # local proxy on 127.0.0.1:8787, MIT, ~120 lines

Then point the node at http://127.0.0.1:8787/v1 with model modelis-auto.

Prompts that fit (short outputs)

Summarize: Summarize in 2 sentences: ...
Classify: Label sentiment (positive/negative/neutral): ...
Extract: Return JSON with {name, email, company} from: ...

All produce short outputs, so the flat per-call price keeps high-volume runs cheap to reason about.
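
For the extract prompt, the reply still has to be parsed before the next automation step can use it. A minimal, defensive sketch follows; it assumes the model may wrap the JSON in extra prose, and `parse_extraction` is a hypothetical helper name.

```python
import json

REQUIRED_FIELDS = ("name", "email", "company")


def parse_extraction(reply):
    """Parse a reply that should contain one JSON object with the required fields.

    Tolerates surrounding prose by taking the text between the first "{" and
    the last "}". Returns None if parsing fails or a required field is missing.
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(f not in data for f in REQUIRED_FIELDS):
        return None
    return {f: data[f] for f in REQUIRED_FIELDS}
```

Returning None rather than raising lets the workflow branch to a retry or a manual-review path.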

When NOT to use it

Long-form generation (articles, whole files, large code) will hit the ~1024-token cap and get truncated. Keep a high-output model for those. Use this for the short, structured outputs that automations actually need.

Try it

Free tier: https://rapidapi.com/chenxiao5580/api/modelis-auto-chat
Adapter source (read it before you run it): https://github.com/modelishub/modelis-openai

I built the adapter. I am most curious which extraction and classification tasks the routing handles well versus badly. If you point an automation at it, I would love to hear how it routed.

When a flat-price, capped-output LLM API is exactly right (and when it isn't)

chenxiao5580-cmd — Fri, 26 Jun 2026 01:59:53 +0000

Per-token pricing makes LLM bills hard to predict: a chatty model on a verbose prompt can cost several times what you budgeted, and that variance compounds at volume. For one whole class of work, though — bounded-output tasks — a flat price per call with a capped output is a better fit than the usual per-token frontier model. Here's the trade-off, honestly, and how I wire it up.

The idea: flat per call, output bounded

I've been routing this kind of work through Modelis, an OpenAI-compatible gateway that auto-routes each request to a fitting model (GPT / Claude / Gemini) and charges a flat price per call. Output is capped at ~1024 tokens.

That cap sounds like a pure limitation — and for some work it is (see the end). But for bounded-output tasks, the capped output is exactly why the price can stay flat and predictable.

Where it shines (bounded output is a feature)

Chat / support bots — replies are short, so cost per message is fixed.
Summarization — summaries are short by definition.
Classification / tagging / extraction — outputs are tiny.
RAG answer generation — you want concise, source-grounded answers, not essays.

For all of these, a capped flat price means high volume stays cheap and your monthly bill is predictable regardless of input size.

How to use it

It's a standard OpenAI-compatible POST /v1/chat/completions. Plain HTTP:

curl --request POST \
  --url https://modelis-auto-chat.p.rapidapi.com/v1/chat/completions \
  --header 'content-type: application/json' \
  --header 'x-rapidapi-host: modelis-auto-chat.p.rapidapi.com' \
  --header 'x-rapidapi-key: YOUR_KEY' \
  --data '{"model":"modelis-auto","messages":[{"role":"user","content":"Summarize this in one line: ..."}]}'

If your tool or SDK expects the standard Authorization: Bearer header, there's a tiny open-source adapter that bridges it (MIT, ~120 lines, runs locally):

npx modelis-openai      # local proxy on 127.0.0.1:8787

Point any OpenAI client at it:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8787/v1", api_key="YOUR_KEY")
print(client.chat.completions.create(
    model="modelis-auto",
    messages=[{"role": "user", "content": "Classify sentiment: 'shipping was slow but product is great'"}],
).choices[0].message.content)

Or, in Continue (~/.continue/config.yaml):

models:
  - name: Modelis
    provider: openai
    model: modelis-auto
    apiBase: http://127.0.0.1:8787/v1
    apiKey: YOUR_KEY

When NOT to use it (being honest)

Do not point this at:

generating whole files or large multi-file diffs,
autonomous coding agents (large refactors in Aider, or Cline / Roo).

The ~1024-token output cap will truncate those. For code generation, keep a high-output model configured and switch to it. Use the flat-price endpoint for the bounded-output jobs above, where short output is what you want anyway.
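
One way to notice truncation rather than silently shipping a cut-off answer is to check finish_reason. In the OpenAI Chat Completions format, a choice that stopped at the token limit reports finish_reason "length"; the sketch assumes the gateway relays that field unchanged, which is not confirmed above.

```python
def was_truncated(response):
    """True if any choice in an OpenAI-style chat completion stopped at the token limit."""
    return any(c.get("finish_reason") == "length" for c in response.get("choices", []))


def answer_or_raise(response):
    """Return the first choice's text, or raise if the output was cut off."""
    if was_truncated(response):
        raise ValueError("output hit the token cap; resend to a high-output model")
    return response["choices"][0]["message"]["content"]
```

Here `response` is the parsed JSON body as a dict; with the OpenAI SDK, `resp.model_dump()` gives the same structure.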

Try it

Free tier: https://rapidapi.com/chenxiao5580/api/modelis-auto-chat
Adapter source (read it before you run it): https://github.com/modelishub/modelis-openai

I built the adapter. I'm most curious which bounded-output tasks the routing handles well versus badly — if you try it for summaries, classification, or RAG answers, I'd love to hear how it routed.

Use a flat-priced, auto-routing LLM API in Aider or Cline — one npx command

chenxiao5580-cmd — Tue, 23 Jun 2026 04:03:09 +0000

Coding assistants like Aider, Cline, and Continue all speak the OpenAI wire protocol — point them at a base_url, give them an API key, done. That makes swapping in a different LLM backend trivial... if that backend uses Authorization: Bearer.

The flat-priced, auto-routing API I'd been using doesn't. It's distributed through RapidAPI, which authenticates with an X-RapidAPI-Key header instead of Bearer. So I couldn't just drop it into Aider. The fix turned out to be ~120 lines, so I open-sourced it.

modelis-openai

A zero-dependency local proxy (MIT, Node 18+). It listens on 127.0.0.1, speaks plain OpenAI, rewrites the auth header, and forwards to the upstream gateway. Streaming (stream: true) is piped straight through, so token-by-token output works exactly as with the OpenAI API.

your tool ──OpenAI(Bearer)──▶ modelis-openai (localhost) ──X-RapidAPI-Key──▶ upstream ──▶ best model

Quickstart

npx modelis-openai

Then point any OpenAI-compatible tool at it:

Setting	Value
Base URL	`http://127.0.0.1:8787/v1`
API key	your RapidAPI key
Model	`modelis-auto`

Drop it into your tool

Aider

export OPENAI_API_BASE=http://127.0.0.1:8787/v1
export OPENAI_API_KEY=<your-rapidapi-key>
aider --model openai/modelis-auto

Cline / Roo Code — API Provider OpenAI Compatible, Base URL http://127.0.0.1:8787/v1, Model ID modelis-auto.

Continue (~/.continue/config.yaml)

models:
  - name: Modelis
    provider: openai
    model: modelis-auto
    apiBase: http://127.0.0.1:8787/v1
    apiKey: <your-rapidapi-key>

Any OpenAI SDK

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8787/v1", api_key="<your-rapidapi-key>")
print(client.chat.completions.create(
    model="modelis-auto",
    messages=[{"role": "user", "content": "Hello"}],
).choices[0].message.content)

How it works

Reads the key from Authorization: Bearer <key> (or MODELIS_RAPIDAPI_KEY).
Rewrites the request model to modelis-auto (configurable).
Forwards to the RapidAPI gateway with X-RapidAPI-Key / X-RapidAPI-Host.
Relays the response — including SSE streams and rate-limit headers — unchanged.

It also answers GET /v1/models and GET /health so tools that probe on startup don't choke.
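
The first three steps can be sketched as a pure function. This is a Python illustration of the described behavior, not the adapter's actual source (the adapter is a Node program). The upstream host is taken from the RapidAPI listing; the precedence of the header over the environment variable is an assumption.

```python
import json
import os

UPSTREAM_HOST = "modelis-auto-chat.p.rapidapi.com"


def rewrite(headers, body, model="modelis-auto", env=os.environ):
    """Turn an OpenAI-style request into the upstream RapidAPI-style request.

    headers: incoming request headers (dict, any key casing)
    body:    incoming JSON body as bytes
    Returns (upstream_headers, upstream_body_bytes).
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    auth = lowered.get("authorization", "")
    # Step 1: key from "Authorization: Bearer <key>", else from the environment.
    key = auth[len("Bearer "):].strip() if auth.startswith("Bearer ") else ""
    key = key or env.get("MODELIS_RAPIDAPI_KEY", "")
    if not key:
        raise ValueError("no API key in Authorization header or MODELIS_RAPIDAPI_KEY")
    # Step 2: force the model name.
    payload = json.loads(body)
    payload["model"] = model
    # Step 3: swap the auth scheme for the RapidAPI headers.
    upstream_headers = {
        "content-type": "application/json",
        "x-rapidapi-key": key,
        "x-rapidapi-host": UPSTREAM_HOST,
    }
    return upstream_headers, json.dumps(payload).encode("utf-8")
```

The fourth step, relaying the response unchanged, is plain byte forwarding and is left out here.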

Honest notes

It routes to a paid API (there's a free tier to start). The point of the proxy is to remove the integration friction, not to give anything away.
Cursor isn't supported — it sends requests from its own servers, so a localhost endpoint can't be reached. This is for tools that call the API from your machine.

One OpenAI-compatible API that auto-picks the model — and tells you which one answered

chenxiao5580-cmd — Mon, 22 Jun 2026 13:04:38 +0000

I had three API keys in my .env — OpenAI, Anthropic, Google — plus an unwritten
rule in my head: "cheap model for easy stuff, big model for the hard stuff."

In practice I'd forget to switch, burn a frontier model on a one-line prompt, and
then watch a per-token bill swing 5x week to week with no real change in what I
was doing. So I built a small gateway to take that decision off my plate, and
wrote a cookbook of drop-in examples for it. Sharing in case the same thing bugs
you.

The idea

Modelis is an OpenAI-compatible /chat/completions API. You send one model
name — modelis-auto — and it routes each request to the right model (GPT,
Claude, Gemini, …). Two things make it more than "yet another gateway":

Flat per-call pricing. You pay a fixed rate per request, not per token, so the bill doesn't balloon when a model gets chatty. It's predictable.
It tells you who answered. The response's model field is the real model that handled the request. No black box.

It's the OpenAI API with a different base_url, so there's no new SDK to learn —
your existing code keeps working.

How the routing actually works

You pick a quality tier (basic / standard / premium). For each request,
Modelis classifies it (how hard is it? code? reasoning? a one-liner?) and routes
to the cheapest model that clears that tier's quality floor for that kind of
task. A trivial prompt might land on Gemini Flash; a gnarly reasoning task gets
bumped to Claude or GPT — but you pay the same flat per-call rate for the tier
either way. The model choice is the system's problem; your cost stays fixed.
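
A toy version of "cheapest model that clears the tier's floor for this kind of task" might look like the sketch below. The catalog, the quality scores, and the floors are invented for illustration; they are not Modelis's real models or numbers.

```python
# Hypothetical catalog: relative cost per call and a 0-1 quality score per task kind.
CATALOG = [
    {"model": "small",  "cost": 1,  "quality": {"one_liner": 0.90, "code": 0.60, "reasoning": 0.50}},
    {"model": "medium", "cost": 4,  "quality": {"one_liner": 0.95, "code": 0.80, "reasoning": 0.75}},
    {"model": "large",  "cost": 20, "quality": {"one_liner": 0.97, "code": 0.92, "reasoning": 0.93}},
]

# Hypothetical quality floor per tier.
TIER_FLOOR = {"basic": 0.55, "standard": 0.75, "premium": 0.90}


def pick_model(task_kind, tier):
    """Return the cheapest model whose quality for this task kind clears the tier floor."""
    floor = TIER_FLOOR[tier]
    eligible = [m for m in CATALOG if m["quality"][task_kind] >= floor]
    if not eligible:
        # Nothing clears the floor: fall back to the strongest model for this task.
        return max(CATALOG, key=lambda m: m["quality"][task_kind])["model"]
    return min(eligible, key=lambda m: m["cost"])["model"]
```

With these made-up numbers, a premium-tier one-liner still lands on the small model, while premium-tier reasoning goes to the large one.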

30-second integration

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODELIS_KEY",           # free key at https://modelishub.com
    base_url="https://modelishub.com/v1",  # <- the only change
)

resp = client.chat.completions.create(
    model="modelis-auto",
    messages=[{"role": "user", "content": "In one sentence, what is an LLM gateway?"}],
)

print(resp.choices[0].message.content)
print("answered by:", resp.model)

Output — note the second line:

An LLM gateway is a middleware service that routes, manages, and secures
requests to one or more large language models.

answered by: openai/gpt-4.1-2025-04-14

I sent modelis-auto; for that easy prompt the router picked gpt-4.1 and said
so. Ask something harder and it moves up; ask something trivial and it drops to a
cheaper model — the per-call price doesn't move.

Use it from the tools you already have

Because it's just the OpenAI API with a different base_url, it drops into most
stacks unchanged. Runnable examples for each are in the cookbook:

curl — one request, see the model field
OpenAI Python / Node SDKs — change base_url, done
LangChain — ChatOpenAI(base_url=...)
Vercel AI SDK — createOpenAI({ baseURL })
Tool / function calling — works on the direct endpoint

👉 Cookbook: https://github.com/chenxiao5580-cmd/modelis-cookbook

How it's different from per-token gateways

Most multi-model gateways still bill you per token and still expect you to name a
model. That's fine until your spend becomes a function of how verbose a model
decides to be, and until "which model for this call?" becomes a decision you make
hundreds of times. Modelis trades that for a flat per-call price and an automatic
choice — and then tells you what it chose so you can sanity-check it. Different
trade-off, not a silver bullet; whether it fits depends on your workload.

Honest caveats

Text chat is the first-class path today. Tool calling works on the direct endpoint; vision isn't the focus yet.
The free signup quota is small — it's for kicking the tires, not running production for free.
There's a managed pay-as-you-go option on RapidAPI, but that plan is basic text chat only (no tools/vision) — the auto-routing across premium models lives on the direct endpoint.

Why I bother

Per-token billing punishes you for verbose models and makes spend unpredictable;
hand-picking a model per request is busywork you forget to do. A flat per-call
price + automatic routing + a response that's honest about which model ran felt
like the gateway I actually wanted to use.

If you try it, I'd genuinely like feedback on the routing quality and whether flat
pricing matters for your workload — drop a comment.

Use GPT, Claude, and Gemini with the OpenAI SDK - one base_url, any language

chenxiao5580-cmd — Fri, 19 Jun 2026 15:24:17 +0000

Here's the exact change in four common stacks.

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://modelishub.com/v1",
    api_key="YOUR_MODELIS_KEY",
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

The rest of your code stays the same — only base_url and api_key change.

Node.js (openai v4)

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://modelishub.com/v1',
  apiKey: process.env.MODELIS_API_KEY,
});

const resp = await client.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: 'Hello' }],
});
console.log(resp.choices[0].message.content);

curl (works in any language)

curl https://modelishub.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_MODELIS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Hello"}]}'

The same pattern works in Go, Java, PHP, Rust — anywhere you can make an HTTP request.

LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://modelishub.com/v1",
    api_key="YOUR_MODELIS_KEY",
    model="gpt-4o-mini",
)
print(llm.invoke("Hello").content)

Switching between GPT, Claude, and Gemini

To use a different model, just change the model id — claude-sonnet-4-6, gemini-2.5-flash, and so on. Same endpoint, same key. Fetch the full list any time with GET /v1/models.

resp = client.chat.completions.create(
    model="claude-sonnet-4-6",      # or gemini-2.5-flash, gpt-4o, ...
    messages=[{"role": "user", "content": "Hello"}],
)

Stop guessing your AI bill: one endpoint for GPT-5.5, Claude & Gemini at a flat per-call price

chenxiao5580-cmd — Thu, 18 Jun 2026 16:17:46 +0000

If you build on top of LLMs, you've probably hit this: you ship a feature, traffic spikes, and the API bill comes back way higher than you expected. Per-token pricing makes costs hard to predict — you're billed by how verbose the model is, not by the value you ship.

I got tired of that (plus juggling three API keys), so here's a setup that fixes both: one OpenAI-compatible endpoint that auto-picks the best model and charges a flat price per call.

The core idea

Instead of calling each provider directly, you point your existing OpenAI SDK at a single gateway and send one model name: modelis-auto. It routes each request to the best model for the task (GPT-5.5, Claude Opus 4.8, Gemini 3.1, Grok, DeepSeek…) and bills a flat per-call rate — so your cost is predictable regardless of which model handled it.

Zero migration: just change base_url

If you already use the OpenAI SDK, this is a one-line change.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODELIS_KEY",
    base_url="https://modelishub.com/v1",   # the only change
)

resp = client.chat.completions.create(
    model="modelis-auto",                    # let it pick the best model
    messages=[{"role": "user", "content": "Explain CRDTs in two sentences."}],
)
print(resp.choices[0].message.content)

Or with curl:

curl https://modelishub.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_MODELIS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"modelis-auto","messages":[{"role":"user","content":"Hi"}]}'

That's it. Your existing code, SDKs, and OpenAI-compatible tools keep working.

"But which model actually answered?"

Fair question — auto-routing shouldn't be a black box. Every response returns a header telling you exactly which model handled the request:

X-Modelis-Routed-Model: claude-opus-4-8

And if you want control, you can stay in a quality tier or call a specific model directly:

model: "modelis-auto:premium"     # stay in a quality tier
model: "gpt-5.5"                   # or pin a specific model
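
Two small helpers can make both halves concrete: building the model string and reading the routing header from a response. Only modelis-auto:premium appears above; using the same suffix syntax for other tier names is an assumption, and both function names are illustrative.

```python
def model_name(tier=None, pinned=None):
    """Build the `model` string: a pinned id, "modelis-auto:<tier>", or plain "modelis-auto"."""
    if pinned:
        return pinned
    return f"modelis-auto:{tier}" if tier else "modelis-auto"


def routed_model(headers):
    """Case-insensitive lookup of X-Modelis-Routed-Model in a response-header mapping."""
    for name, value in headers.items():
        if name.lower() == "x-modelis-routed-model":
            return value
    return None
```

With the OpenAI Python SDK, response headers are available through raw-response mode: `raw = client.chat.completions.with_raw_response.create(...)`, then `routed_model(raw.headers)` and `raw.parse()` for the usual completion object.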

Why flat per-call instead of per-token

The point isn't "cheaper than everyone" — it's predictable. With a flat per-call price:

A verbose response doesn't cost more than a terse one.
A busy day scales with calls, not with token noise.
You can actually budget, and price your own product with confidence.

Honest take: when per-token is still fine

If your workload is steady, you control prompt/response sizes tightly, and you've already optimized model choice per route, per-token billing can be cheaper. Flat per-call shines when traffic is bursty, prompts vary, or you just don't want to babysit model selection and cost. Pick what fits your reality.
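
To check which side of that line a workload falls on, compare totals over a sample of real calls. In the sketch below the flat price of $0.002 per call is a placeholder, not a quoted Modelis price, and the per-token prices are list prices in USD per 1M tokens.

```python
def per_token_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost of one call in USD; prices are USD per 1M tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def cheaper_billing(calls, flat_price):
    """Compare the per-token total of a sample of calls with the flat total.

    calls: list of (input_tokens, output_tokens, in_price, out_price)
    Returns (winner, per_token_total, flat_total).
    """
    per_token_total = sum(per_token_cost(*call) for call in calls)
    flat_total = flat_price * len(calls)
    winner = "per-token" if per_token_total < flat_total else "flat"
    return winner, per_token_total, flat_total


FLAT_PRICE = 0.002  # hypothetical USD per call

# Small, steady calls on a cheap model ($0.15 / $0.60 per 1M tokens).
steady = [(500, 100, 0.15, 0.60)] * 1000
# Larger prompts on a pricier model ($2.50 / $10.00 per 1M tokens).
heavy = [(6000, 800, 2.50, 10.00)] * 1000

print(cheaper_billing(steady, FLAT_PRICE)[0])
print(cheaper_billing(heavy, FLAT_PRICE)[0])
```

Feeding in token counts from your own logs gives a more useful answer than either synthetic sample.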

Try it

There's a free tier: modelishub.com. I'd genuinely love feedback — especially whether predictable pricing actually matters for how you build, or if you prefer per-token control.

Stop hand-picking an LLM per request: a practical case for auto-routing

chenxiao5580-cmd — Tue, 16 Jun 2026 16:24:38 +0000

Most LLM features ship with the model name hardcoded. You picked it once — usually the strongest one you could justify — and now every request, trivial or gnarly, hits the same expensive model. The easy ones overpay; if you down-picked to save money, the hard ones quietly degrade. You're paying the frontier price for "reformat this list," or shipping a weak answer on "find the bug in this trace."

Routing per request fixes the mismatch: classify each request's difficulty, then send it to the cheapest model in your quality tier that can actually handle it.

What "difficulty" can mean in practice

You don't need a research model to route well. Cheap, legible signals get you most of the way:

Input shape and length. A 30-token reformat is not a 6,000-token "reason over this codebase" request.
Task type keywords / structure. Extraction and classification skew easy; multi-step reasoning, code debugging, and math skew hard.
A tiny classifier up front. A small, fast model can score difficulty in a few milliseconds and a fraction of a cent, then hand off to the right tier.

The router doesn't have to be perfect. It has to be better than a single hardcoded choice — which is a low bar, because a hardcoded choice is wrong for half your distribution by construction.
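
As a rough illustration of the first two signals, here is a heuristic scorer. The keyword lists, weights, and 2,000-word scale are arbitrary assumptions for the sketch, and word count stands in for token count; a real router would tune these against labeled traffic.

```python
import re

# Hypothetical keyword hints; tune against your own traffic.
HARD_HINTS = re.compile(
    r"\b(prove|debug|derive|optimi[sz]e|refactor|step[- ]by[- ]step|np-complete)\b", re.I
)
EASY_HINTS = re.compile(
    r"\b(classify|label|extract|reformat|translate|summari[sz]e)\b", re.I
)


def difficulty(prompt):
    """Score a prompt from 0.0 (easy) to 1.0 (hard) using cheap signals."""
    words = len(prompt.split())
    score = min(words / 2000, 0.5)  # length contributes at most 0.5
    if HARD_HINTS.search(prompt):
        score += 0.5  # hard task keywords dominate length
    if EASY_HINTS.search(prompt):
        score -= 0.2
    return max(0.0, min(1.0, score))
```

Note that the keyword signal is what lifts a short prompt such as "Prove this is NP-complete" above 0.5; length alone would score it near zero.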

Where routing misfires (and why you must plan for it)

Anyone who's run this in production will tell you the failure modes are the interesting part:

Deceptively short hard prompts. "Prove this is NP-complete" is 5 tokens and very hard. Length-only routing sends it to a weak model and you get a confident-wrong answer.
Misclassification is silent. A bad route doesn't error — it just returns a worse answer. Without eval, you won't see it.
Tier boundaries are fuzzy. Requests near the easy/hard line will flip routes run-to-run, making behavior feel non-deterministic.

So routing is not "set and forget." It needs guardrails.

Keeping it safe

Route within a quality floor, never below it. Let the user pick a tier; the router chooses within it, so the worst case is still acceptable. Never silently drop to a model the user wouldn't accept.
Bias toward the stronger model on uncertainty. When the difficulty score is ambiguous, round up. A few cents of overspend beats a wrong answer.
Make routing decisions observable. Log which model served each request and why. You can't debug or tune a router you can't see.
Eval the routes, not just the models. Periodically check that easy-routed requests would've gotten the same answer from the strong model, and that hard-routed ones aren't degrading.
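
The first three guardrails fit in a few lines. The sketch assumes a difficulty score in [0, 1] from some upstream classifier; the ladder names, thresholds, and margin are invented values, and every ladder step is assumed to already clear the user's quality floor.

```python
import logging

logger = logging.getLogger("router")

# Hypothetical ladder, weakest to strongest; all steps assumed to clear the tier's floor.
LADDER = ["tier-small", "tier-medium", "tier-large"]
THRESHOLDS = [0.35, 0.70]  # score boundaries between ladder steps
MARGIN = 0.05              # scores this close below a boundary count as ambiguous


def route(score, request_id="-"):
    """Pick a ladder step for a difficulty score, rounding up near boundaries."""
    step, reason = 0, "below first threshold"
    for i, threshold in enumerate(THRESHOLDS):
        if score >= threshold:
            step, reason = i + 1, f"score >= {threshold}"
        elif threshold - score <= MARGIN:
            # Ambiguous: bias toward the stronger model.
            step, reason = i + 1, f"within {MARGIN} of {threshold}, rounded up"
    model = LADDER[step]
    # Observable decision: which model, and why.
    logger.info("request=%s score=%.2f model=%s reason=%s", request_id, score, model, reason)
    return model
```

Widening the margin trades a little overspend for fewer borderline requests landing on the weaker model.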

What it buys you

When it's tuned, you stop overpaying on the (usually majority) easy traffic without downgrading the hard tail, and you stop hand-maintaining a model choice that drifts out of date every time providers ship something new. The routing logic lives in one place instead of smeared across feature code.

Takeaway

Hardcoding one model per feature optimizes for nothing: it's a coin flip that's wrong for half your request distribution. Difficulty-based routing within a quality tier, with a "round up when unsure" bias and real observability, is a better default. I've built this into a small OpenAI-compatible gateway called Modelis (send model: "auto", it routes within your tier and bills a flat per-call price, free tier) at modelishub.com, but you can build the same idea yourself with a small front classifier. I'd love to hear the nastiest "short prompt, secretly hard" example you've hit; those are the routing killers.

One base_url for GPT, Claude, and Gemini: cutting three SDKs down to one

chenxiao5580-cmd — Tue, 16 Jun 2026 16:16:17 +0000

The first time you add a second LLM provider to a codebase, it feels manageable. By the third, you've got three SDKs, three auth schemes, three slightly different message shapes, three retry policies, and three places a model deprecation can break you. The "use the best model for each job" advice is correct, but the integration tax is real, and it compounds.

Here's the pattern I keep coming back to: put everything behind one OpenAI-compatible endpoint and stop maintaining three clients.

The actual cost of multi-SDK glue

It's not the happy path that hurts — it's everything around it:

Auth drift. OpenAI wants a Bearer key, Anthropic wants x-api-key + a version header, Google wants its own scheme. Three secrets, three rotation schedules.
Response-shape divergence. choices[0].message.content vs. a content block array vs. candidates[].content.parts[]. Every call site that reads a response needs a per-provider branch.
Streaming differences. SSE framing and event names differ enough that your stream parser grows a switch statement.
Failure-mode sprawl. Rate-limit codes, timeout behavior, and error bodies all differ. Your retry/backoff logic forks per provider.

None of it is hard. All of it is maintenance — the kind that quietly slows a small team down.
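
The response-shape branch, for instance, tends to end up looking something like this. The sketch covers only the three text paths named above, with each response trimmed to the fields involved; `extract_text` is an illustrative name.

```python
def extract_text(provider, response):
    """Pull the reply text out of each provider's native response shape."""
    if provider == "openai":
        return response["choices"][0]["message"]["content"]
    if provider == "anthropic":
        # content is a list of blocks; keep only the text blocks
        return "".join(b["text"] for b in response["content"] if b.get("type") == "text")
    if provider == "google":
        parts = response["candidates"][0]["content"]["parts"]
        return "".join(p.get("text", "") for p in parts)
    raise ValueError(f"unknown provider: {provider}")
```

Every new provider adds another branch here, and the same fork repeats in the streaming parser and the retry logic.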

The OpenAI-compatible shim

Most providers can be normalized to the OpenAI Chat Completions contract, because the ecosystem already standardized on it. So you point the official OpenAI SDK at a different base_url:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-gateway/v1",
    api_key="YOUR_GATEWAY_KEY",
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",          # or claude-*, gemini-*, etc.
    messages=[{"role": "user", "content": "Summarize this PR in two lines."}],
)
print(resp.choices[0].message.content)

Same call site, same response shape, one key. To switch the underlying model you change one string. Your retry logic, your streaming parser, your logging — all written once, against one contract.

What you actually gain

One auth + one secret to rotate, not three.
One response shape at every call site — no per-provider branching.
No lock-in. Because you code against the OpenAI contract (an open de-facto standard), moving off any single provider — or off the gateway itself — is a base_url change, not a rewrite. That cuts both ways, and that's the point.

What you give up (the honest part)

A normalization layer can't be a free lunch:

Provider-specific features get flattened. Anthropic's prompt caching, Google's grounding, OpenAI's structured-output modes — anything that doesn't map cleanly to the common contract is either unavailable or exposed through non-standard fields you'll have to special-case anyway.
You add a hop. One more network segment and one more thing that can be down. For latency-sensitive paths, measure it.
You inherit the shim's coverage gaps. If the gateway hasn't mapped a parameter you need, you're blocked until it does. Pick one whose mapping is transparent and documented.

The trade is: lose access to the long tail of provider-specific knobs, gain a dramatically smaller integration surface. For most app teams — who use maybe 5% of any provider's surface — that's a good deal. For teams leaning hard on one provider's exclusive features, it isn't.

Takeaway

If your codebase has grown a per-provider branch at every LLM call site, collapsing to one OpenAI-compatible base_url removes a whole category of maintenance, at the cost of the provider-specific long tail. I built this into a small gateway called Modelis (one key for GPT/Claude/Gemini, optional auto routing, free tier) at modelishub.com, but the shim pattern works with anything OpenAI-compatible. Curious which provider-specific feature you'd refuse to give up; that's usually the deciding factor.

Stop getting surprise per-token LLM bills: a flat-rate, auto-routing API approach

chenxiao5580-cmd — Tue, 16 Jun 2026 16:13:36 +0000

If you ship anything on top of an LLM API, you've probably had this moment: you check the dashboard at the end of the month and the bill is 3x what you modeled. Nothing broke. Usage just... drifted. A few prompts got chattier, one model started "thinking" more, and your per-token math quietly fell apart.

I've been living in that loop, so I want to lay out why per-token pricing is hard to forecast, and a different billing shape that trades some theoretical savings for a number you can actually predict.

Why per-token spend is so hard to model

Per-token billing looks simple — price × tokens — but three things make it slippery in practice:

1. Output length is not yours to control. max_tokens is a hard ceiling, but the model decides how much of that ceiling it actually uses. Two models given the identical prompt can produce wildly different output lengths, and the verbose one costs you more for the same task.

2. "Reasoning" tokens are invisible until billed. Some newer models (e.g. OpenAI's o-series) emit internal reasoning tokens that are billed as output tokens but never appear in the response text. They are reported in the response's usage counts, but if you only log the visible answer, your logs show 200 tokens while the invoice counts the 1,500 it took to get there.

3. Input grows silently. RAG context, longer chat histories, system prompts that accrete over time — input token counts creep up release by release, and without active monitoring nobody notices until the bill does.

Individually these are fine. Together they mean your cost-per-request has a fat tail, and a fat tail is exactly what you can't put in a pricing page or a unit-economics spreadsheet.
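
Here is a rough sketch of how the reasoning-token gap shows up in the math. The `Usage` class is a hypothetical stand-in for whatever usage record your provider returns, it assumes the billed output count includes reasoning tokens, and the $2.50 / $10.00 per-1M prices are purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int     # billed output; assumed to include reasoning tokens
    reasoning_tokens: int = 0  # hidden part of completion_tokens


def request_cost(usage: Usage, input_per_m: float, output_per_m: float) -> float:
    """USD cost of one request, given prices per 1M tokens."""
    return (usage.prompt_tokens * input_per_m
            + usage.completion_tokens * output_per_m) / 1_000_000


def visible_output_tokens(usage: Usage) -> int:
    """Output tokens you actually see in the response text."""
    return usage.completion_tokens - usage.reasoning_tokens


# What the invoice sees: 200 visible tokens plus 1,300 reasoning tokens.
reasoning_call = Usage(prompt_tokens=800, completion_tokens=1500, reasoning_tokens=1300)
# What a log of the visible answer would suggest.
naive_estimate = Usage(prompt_tokens=800, completion_tokens=200)

if __name__ == "__main__":
    print(visible_output_tokens(reasoning_call))
    print(request_cost(reasoning_call, 2.50, 10.00))
    print(request_cost(naive_estimate, 2.50, 10.00))
```

With these made-up numbers the billed cost is about $0.017 per call against a naive estimate of about $0.004, which is the kind of gap that only shows up at invoice time.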

The alternative: bill a flat price per call

The idea is simple: instead of charging for whatever tokens happened, charge a flat rate per request within a quality tier you pick. The bill becomes calls × tier_price, full stop. Output length, reasoning tokens, which specific model answered — none of it changes what you pay.

You give up something real here (more on that below), but you gain the one thing per-token can't offer: a cost you can forecast before you ship. For a lot of products — fixed-price SaaS features, internal tools with a budget, anything where you quote a customer a number — that predictability is worth more than shaving a few percent off the theoretical optimum.

Concretely: say a request averages 1,000 tokens and a cheap model bills $0.0005/1K; that's $0.0005/call. A flat tier at, say, $0.002/call costs 4× as much on that uniform workload. But the moment your requests vary (some 200 tokens, some 8,000, some routed to a frontier model), the per-token average climbs and its variance explodes, while the flat price stays put. Flat pricing wins on the spread, not the average.
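
The same arithmetic as a small sketch, using the $0.0005/1K and $0.002/call figures from the example. The two workloads are invented for illustration, and everything is priced on the one cheap model (no frontier routing), so this shows the spread effect only.

```python
from statistics import mean, pstdev

PRICE_PER_1K = 0.0005   # per-token price from the example, USD per 1K tokens
FLAT_PER_CALL = 0.002   # flat tier price from the example, USD per call


def per_token_costs(token_counts, price_per_1k=PRICE_PER_1K):
    """Per-request cost under per-token billing."""
    return [n / 1000 * price_per_1k for n in token_counts]


def break_even_tokens(flat=FLAT_PER_CALL, price_per_1k=PRICE_PER_1K) -> float:
    """Request size at which per-token cost equals the flat price."""
    return flat / price_per_1k * 1000


uniform = [1000] * 6                        # every request is 1,000 tokens
mixed = [200, 200, 1000, 1000, 8000, 8000]  # hypothetical varied workload

if __name__ == "__main__":
    for name, workload in (("uniform", uniform), ("mixed", mixed)):
        costs = per_token_costs(workload)
        print(name, "mean:", mean(costs), "stdev:", pstdev(costs), "max:", max(costs))
    print("break-even tokens:", break_even_tokens())
```

On the mixed workload the mean per-token cost is still below the flat price, but individual requests cost up to twice the flat rate and the spread is no longer zero. The break-even sits at roughly 4,000 tokens per request at these prices.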

Pairing it with auto-routing

Flat-per-call pricing gets more interesting when you stop hand-picking the model. If every request costs the same regardless of which model serves it, you can let a router classify each request's difficulty and send it to the cheapest model that can actually handle it — a small model for "summarize this", a frontier model for "debug this race condition" — without you wiring up that logic or eating a surprise when a hard request lands on an expensive model.
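
To make "classify by difficulty" concrete, here is a deliberately toy sketch. The keyword lists, tier names, and length cutoff are all made up for illustration; a real router would more likely use a trained classifier, and this is not a description of how any particular gateway does it.

```python
# Hypothetical model tiers, ordered cheapest first; the names are placeholders.
TIERS = ["small-model", "mid-model", "frontier-model"]

HARD_HINTS = ("debug", "race condition", "deadlock", "prove")
EASY_HINTS = ("summarize", "classify", "extract", "translate")


def difficulty(prompt: str) -> int:
    """Return 0 (easy), 1 (medium) or 2 (hard) from crude keyword heuristics."""
    text = prompt.lower()
    if any(hint in text for hint in HARD_HINTS):
        return 2
    if any(hint in text for hint in EASY_HINTS) and len(text) < 2000:
        return 0
    return 1  # unknown prompts default to the middle tier


def route(prompt: str) -> str:
    """Pick the cheapest tier trusted with this difficulty level."""
    return TIERS[difficulty(prompt)]


if __name__ == "__main__":
    for p in ("Summarize this", "Debug this race condition", "Write a haiku"):
        print(p, "->", route(p))
```

Defaulting unknown prompts to the middle tier is one possible choice: it costs more than the small model but avoids sending an unrecognized hard request to a model that can't handle it.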

The developer-facing contract stays boring on purpose:

curl https://your-gateway/v1/chat/completions \
  -H "content-type: application/json" \
  -H "authorization: Bearer $KEY" \
  -d '{
    "model": "auto",
    "messages": [{"role":"user","content":"Explain quantum entanglement in one sentence."}]
  }'

It's an OpenAI-compatible /chat/completions body, so existing SDKs work by changing base_url and the key. You send one model name (auto) and let routing + a flat tier price do the rest. Zero migration is part of the pitch — if it needs a rewrite, nobody switches.
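
For reference, the same request built in Python with only the standard library. This is a sketch: `build_chat_request` is a hypothetical helper, `https://your-gateway` is the placeholder from the curl example, and nothing is sent until you pass the request to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style /chat/completions request."""
    body = {
        "model": "auto",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request(
    "https://your-gateway", "KEY", "Explain quantum entanglement in one sentence."
)

if __name__ == "__main__":
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```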

When flat pricing does NOT pay off (the honest part)

This isn't a free lunch, and pretending otherwise would be dishonest:

High-volume, short, predictable calls. If your requests are tiny and uniform (think: classification of one-line inputs at scale), per-token on a cheap model will almost certainly beat a flat per-call rate. Flat pricing's value is in variance reduction, and you have no variance.
You've already done the FinOps work. If you have tight token budgets, prompt-length guards, and good observability, you've manufactured your own predictability and a flat tier buys you less.
You need features flat tiers cap. Vision inputs, tool/function-calling, huge outputs — flat per-call tiers often bound these (depending on the gateway) to keep the price honest. If you need them, a metered model fits better.
You have strict latency SLAs. Cost-only routing can pick a cheap-but-slow model. If you're under a hard latency budget, the router needs a latency filter too — extra complexity that eats some of the simplicity win.
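
The latency point can be sketched as one extra filter in front of the cost comparison. The candidates, prices, and p95 figures below are invented, and `pick` is a hypothetical function, so treat this as an outline of the idea rather than a real routing table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_call: float  # USD, illustrative
    p95_latency_ms: int   # illustrative
    max_difficulty: int   # hardest difficulty level this model is trusted with


CANDIDATES = [
    Candidate("cheap-slow", 0.0004, 4000, 1),
    Candidate("cheap-fast", 0.0008, 900, 1),
    Candidate("frontier", 0.0100, 2500, 2),
]


def pick(difficulty: int, latency_budget_ms: int) -> Optional[Candidate]:
    """Cheapest candidate that is both capable enough and fast enough."""
    eligible = [
        c for c in CANDIDATES
        if c.max_difficulty >= difficulty and c.p95_latency_ms <= latency_budget_ms
    ]
    return min(eligible, key=lambda c: c.cost_per_call) if eligible else None


if __name__ == "__main__":
    print(pick(1, 10_000))  # loose budget
    print(pick(1, 1_000))   # tight budget rules out the slow model
    print(pick(2, 1_000))   # nothing is both capable and fast enough
```

Returning `None` when nothing fits forces the caller to decide what to do (relax the budget, fail fast, or fall back), which is the extra complexity the list item above is talking about.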

The rule of thumb: flat pricing is insurance against variance. The more your per-request cost jumps around — mixed task difficulty, verbose or reasoning-heavy models, growing context — the more that insurance is worth. The flatter your workload already is, the less you need it.

Takeaway

Per-token billing isn't wrong, but it optimizes for a metric (tokens) that isn't the one you're trying to control (a predictable bill). If you're quoting fixed prices to customers, or you just want your LLM line item to stop surprising you, a flat per-call rate — ideally with auto-routing underneath — is worth a look.

I've been building this approach into a small OpenAI-compatible gateway called Modelis (one key for GPT/Claude/Gemini, auto routing, flat per-plan pricing, free tier to try). If you want to kick the tires, it's at modelishub.com. But the billing argument stands on its own — I'd genuinely like to hear where flat-vs-metered breaks for your workload in the comments.
