Per-token pricing makes LLM bills hard to predict: a chatty model on a verbose prompt can cost several times what you budgeted, and that variance compounds at volume. For one class of work, bounded-output tasks, a flat price per call with capped output is a better fit than the usual per-token frontier model. Here's the trade-off, stated honestly, and how I wire it up.
The idea: flat per call, output bounded
I've been routing this kind of work through Modelis, an OpenAI-compatible gateway that auto-routes each request to a fitting model (GPT / Claude / Gemini) and charges a flat price per call. Output is capped at ~1024 tokens.
That cap sounds like a pure limitation — and for some work it is (see the end). But for bounded-output tasks, the capped output is exactly why the price can stay flat and predictable.
Where it shines (bounded output is a feature)
- Chat / support bots — replies are short, so cost per message is fixed.
- Summarization — summaries are short by definition.
- Classification / tagging / extraction — outputs are tiny.
- RAG answer generation — you want concise, source-grounded answers, not essays.
For all of these, a flat price with capped output means cost scales only with the number of calls, so high volume stays cheap and your monthly bill is predictable regardless of input size.
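To make the predictability argument concrete, here is a small sketch comparing the two billing models. Every price in it is a made-up placeholder, not Modelis's or any provider's real rate, so swap in actual numbers before drawing conclusions.

```python
# Hypothetical prices for illustration only; substitute real rates.
FLAT_PRICE_PER_CALL = 0.002    # USD per call (assumed)
PRICE_PER_1K_INPUT = 0.003     # USD per 1,000 input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.015    # USD per 1,000 output tokens (assumed)


def flat_cost(calls: int) -> float:
    """Flat pricing: cost depends only on the number of calls."""
    return calls * FLAT_PRICE_PER_CALL


def per_token_cost(calls: int, input_tokens: int, output_tokens: int) -> float:
    """Per-token pricing: cost depends on the tokens in and out of each call."""
    per_call = (
        (input_tokens / 1000) * PRICE_PER_1K_INPUT
        + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    )
    return calls * per_call
```

Under flat pricing, doubling the prompt length leaves `flat_cost` unchanged, while `per_token_cost` rises with every extra input or output token. That difference is the variance the flat model removes.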
How to use it
It's a standard OpenAI-compatible POST /v1/chat/completions. Plain HTTP:
curl --request POST \
  --url https://modelis-auto-chat.p.rapidapi.com/v1/chat/completions \
  --header 'content-type: application/json' \
  --header 'x-rapidapi-host: modelis-auto-chat.p.rapidapi.com' \
  --header 'x-rapidapi-key: YOUR_KEY' \
  --data '{"model":"modelis-auto","messages":[{"role":"user","content":"Summarize this in one line: ..."}]}'
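If you'd rather skip the shell, the same request can be sketched with Python's standard library. The URL, headers, and model name are taken from the curl command; the helper names `build_request` and `extract_text` are my own, and the response shape is assumed to follow the OpenAI chat-completions format because the endpoint is described as OpenAI-compatible.

```python
import json
import urllib.request

API_HOST = "modelis-auto-chat.p.rapidapi.com"
API_URL = f"https://{API_HOST}/v1/chat/completions"


def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build the same POST as the curl command, without sending it."""
    payload = {
        "model": "modelis-auto",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "x-rapidapi-host": API_HOST,
            "x-rapidapi-key": api_key,
        },
        method="POST",
    )


def extract_text(response_body: dict) -> str:
    """Pull the reply out of an OpenAI-style chat completion body (assumed shape)."""
    return response_body["choices"][0]["message"]["content"]
```

To actually send it, pass the result to `urllib.request.urlopen(...)`, parse the body with `json.load`, and hand the parsed dict to `extract_text`.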
If your tool or SDK expects the standard Authorization: Bearer header, there's a tiny open-source adapter that bridges it (MIT, ~120 lines, runs locally):
npx modelis-openai # local proxy on 127.0.0.1:8787
Point any OpenAI client at it:
from openai import OpenAI

# Talk to the local adapter, which bridges the Authorization: Bearer header.
client = OpenAI(base_url="http://127.0.0.1:8787/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="modelis-auto",
    messages=[{"role": "user", "content": "Classify sentiment: 'shipping was slow but product is great'"}],
)
print(response.choices[0].message.content)
Or, in Continue (~/.continue/config.yaml):
models:
  - name: Modelis
    provider: openai
    model: modelis-auto
    apiBase: http://127.0.0.1:8787/v1
    apiKey: YOUR_KEY
When NOT to use it (being honest)
Do not point this at:
- generating whole files or large multi-file diffs,
- autonomous coding agents (large refactors in Aider, or Cline / Roo).
The ~1024-token output cap will truncate those. For code generation, keep a high-output model configured and switch to it. Use the flat-price endpoint for the bounded-output jobs above, where short output is what you want anyway.
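One way to wire up that split is a small router plus a truncation check. This is a sketch under assumptions: the task labels and `HIGH_OUTPUT_MODEL` are hypothetical placeholders, and the check assumes the gateway reports `finish_reason: "length"` on a cut-off reply the way the OpenAI chat-completions format does, which I haven't confirmed for every routed model.

```python
FLAT_MODEL = "modelis-auto"                    # model name used in the examples above
HIGH_OUTPUT_MODEL = "your-high-output-model"   # hypothetical placeholder

# Illustrative task labels; they are not part of any API.
BOUNDED_TASKS = {"chat", "summarize", "classify", "extract", "rag_answer"}


def pick_model(task: str) -> str:
    """Send bounded-output tasks to the flat-price endpoint, the rest elsewhere."""
    return FLAT_MODEL if task in BOUNDED_TASKS else HIGH_OUTPUT_MODEL


def was_truncated(response: dict) -> bool:
    """True if the first choice stopped because it hit the output limit.

    Assumes an OpenAI-style body where a cut-off reply has
    finish_reason == "length".
    """
    choices = response.get("choices") or []
    return bool(choices) and choices[0].get("finish_reason") == "length"
```

If `was_truncated` fires on a task you thought was bounded, that's a signal to tighten the prompt or move that task to the high-output model.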
Try it
- Free tier: https://rapidapi.com/chenxiao5580/api/modelis-auto-chat
- Adapter source (read it before you run it): https://github.com/modelishub/modelis-openai
I built the adapter. I'm most curious which bounded-output tasks the routing handles well versus badly — if you try it for summaries, classification, or RAG answers, I'd love to hear how it routed.