Where Claude API costs actually go
Most large Claude bills are not caused by the model choice alone — they come from sending more tokens than necessary, on a more expensive model than the task needs, without caching anything. Before switching providers, it is worth fixing the parts you control. This guide walks through practical ways to cut Claude (and GPT) API costs, then where a discount gateway fits in.
1. Match the model to the task
Opus-class models are worth it for hard reasoning and agentic work, but a lot of real traffic — classification, extraction, short rewrites, routing — runs fine on a smaller, much cheaper model. A common pattern is to route easy requests to a cheap model and only escalate to a frontier model when needed.
| Task | Reasonable model tier |
|---|---|
| Classification, routing, extraction | Cheapest (Haiku-class) |
| General chat, summaries, drafts | Mid (Sonnet-class) |
| Hard reasoning, agents, long context | Frontier (Opus-class) |
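The routing pattern can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed design: the task labels, the `pick_model` helper, and the two `example-...` model IDs are hypothetical placeholders, and the default tier for unknown tasks is an assumption.

```python
# Hypothetical tier-to-model mapping; substitute the model IDs your provider lists.
MODEL_BY_TIER = {
    "cheap": "claude-haiku-4-5",
    "mid": "example-sonnet-class-model",       # placeholder ID
    "frontier": "example-opus-class-model",    # placeholder ID
}

# Task categories loosely following the table above.
TIER_BY_TASK = {
    "classification": "cheap",
    "routing": "cheap",
    "extraction": "cheap",
    "chat": "mid",
    "summary": "mid",
    "draft": "mid",
    "reasoning": "frontier",
    "agent": "frontier",
    "long_context": "frontier",
}

ESCALATION = ["cheap", "mid", "frontier"]


def pick_model(task: str, escalate: int = 0) -> str:
    """Return a model ID for the task, optionally bumped up by `escalate` tiers."""
    tier = TIER_BY_TASK.get(task, "mid")  # assumption: unknown tasks start at the mid tier
    index = min(ESCALATION.index(tier) + escalate, len(ESCALATION) - 1)
    return MODEL_BY_TIER[ESCALATION[index]]
```

A caller would start with `pick_model("classification")` and retry with `escalate=1` only when the cheap model's answer fails a validation check, so the frontier price is paid only for the requests that need it.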
2. Use prompt caching for repeated context
If you send the same large system prompt, instructions, or document on every call, prompt caching lets the provider reuse that prefix at a large discount instead of re-billing it each time. Put the stable content first and the variable user input last, so the cacheable prefix stays identical across requests. For agents and RAG that resend the same context repeatedly, this is often the single biggest saving.
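As a rough sketch of the "stable first, variable last" layout, the helper below builds a request body in the shape of Anthropic's Messages API, where a `cache_control` marker on a system block flags the prefix as cacheable. The `build_cached_request` name and its parameters are hypothetical, and whether a given gateway or model honors the marker is something to check against current provider documentation.

```python
def build_cached_request(model, stable_instructions, reference_doc, user_input, max_tokens=512):
    """Build a Messages-style payload with the stable prefix marked as cacheable."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        # Stable content first: it must be identical across requests to be reused.
        "system": [
            {"type": "text", "text": stable_instructions},
            {
                "type": "text",
                "text": reference_doc,
                # Cache marker: the prefix up to and including this block is eligible for caching.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Variable content last, so it never changes the cacheable prefix.
        "messages": [{"role": "user", "content": user_input}],
    }
```

Two calls that differ only in `user_input` produce the same `system` list, which is the property the cache depends on. Anything that varies per request, such as a timestamp or user name, belongs in `messages`, not in the prefix.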
3. Trim tokens you are paying for
- Shorten bloated system prompts — every request pays for them.
- Cap `max_tokens` to what the task needs instead of leaving a huge ceiling.
- Summarize or window long chat history rather than resending the entire transcript.
- Strip boilerplate from retrieved documents before adding them to context.
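The history-windowing idea from the list above might look like the following. It is a sketch under stated assumptions: `estimate_tokens` uses a crude four-characters-per-token heuristic as a stand-in for a real tokenizer, and both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); use a real tokenizer for billing."""
    return max(1, len(text) // 4)


def window_history(messages, budget_tokens):
    """Keep the most recent messages whose estimated size fits within budget_tokens."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        # Always keep the newest message; stop once the budget would be exceeded.
        if kept and used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

A production version would likely also make sure the window starts on a user turn and might replace the dropped messages with a short summary instead of discarding them outright.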
4. Stream to cut perceived latency, not token cost
Streaming does not reduce token cost, but it improves responsiveness, which lets you use a cheaper model without the UX feeling slow. That can let you downgrade a tier for interactive features.
5. Use a discounted, OpenAI-compatible gateway
After you have optimized usage, the remaining lever is the per-token price itself. A gateway like APIVAI resells Claude and GPT models at a steep discount off official pricing behind an OpenAI- and Anthropic-compatible API, so you keep all of the optimizations above and simply lower the unit price. It drops into existing code with a base-URL and key change.
```bash
curl https://api.apivai.com/v1/chat/completions \
  -H "Authorization: Bearer $APIVAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"claude-haiku-4-5","messages":[{"role":"user","content":"Classify: refund request"}]}'
```
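For code that does not use curl, the same request can be assembled with Python's standard library. The sketch below only builds the request object, so the base-URL and key swap is visible without any network call; `build_chat_request` is a hypothetical helper, and the body follows the OpenAI-style chat completions shape used in the curl example.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build (but do not send) an OpenAI-style chat completions request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing `"https://api.apivai.com/v1"` as `base_url` together with the value of `APIVAI_API_KEY` targets the gateway, and `urllib.request.urlopen(request)` would send it. Everything else in the calling code stays the same, which is the point of a compatible API.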
A simple cost checklist
- Route easy work to the cheapest capable model.
- Cache stable prompt prefixes.
- Cap `max_tokens` and trim history and documents.
- Measure real token usage before and after each change.
- Lower the unit price with a discount gateway once usage is tight.
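To make the "measure before and after" item concrete, a small calculator like the one below can turn token counts into dollars. It is a sketch: the function names are hypothetical, and prices are passed in as arguments because per-token rates vary by model and provider. Token counts would normally come from the usage figures the API returns with each response.

```python
def request_cost(input_tokens, output_tokens, input_price_per_mtok, output_price_per_mtok):
    """Dollar cost of one request, given prices per million tokens."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


def fractional_saving(before, after):
    """Share of the original cost that was removed (0.0 when there was no cost)."""
    return 0.0 if before == 0 else (before - after) / before


# Placeholder prices for illustration only; look up the real rates for your model.
before = request_cost(8000, 1000, input_price_per_mtok=1.0, output_price_per_mtok=5.0)
after = request_cost(3000, 400, input_price_per_mtok=1.0, output_price_per_mtok=5.0)
print(f"saving per request: {fractional_saving(before, after):.0%}")
```

Running the same calculation on a sample of real traffic before and after each change shows which lever moved the bill, instead of relying on estimates.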
Get started
Pick one of these levers and measure the difference on your real traffic. If you want the unit-price win with no code rewrite, create an APIVAI key, point your OpenAI-compatible client at the APIVAI base URL, and choose a model from /v1/models.