Bruno Pérez

Posted on May 20 • Originally published at manifest.build

10 Ways To Reduce Your LLM API Costs

#ai #llm #money #models

So you made it: your AI app is in production, you are onboarding users, and they like what you've built. Cheers! Now comes the hard-to-swallow part: the AI bill.

Serving those users consumes AI inference, and it's eating into your margins. Let's look at 10 ways to reduce that AI bill of yours.

Choose a well-fitted AI model
Use your Pro subscriptions
Reduce output tokens to cut your LLM bill
Use prompt caching when you can
Use Batch API for nightly workflows
Be FLEX-ible and accept slow tiers
Don't use AI
Use free models and free tiers
Don't miss those Cloud provider credits
Observe your AI costs and take back control

1. Choose a well-fitted AI model

Choosing the right model is not that easy. Realizing that your model is not "smart enough" for your requirements is the easy part. The other side is trickier: you may be using an overkill model and overspending without noticing.

Potential reductions are huge here. For example, switching from GPT-5.5 to GPT-5.4 reduces costs by 50%. Use GPT-5.4 Mini instead? That's 85% cheaper.

However, there is no magic here: cheaper models generally produce lower-quality outputs. Benchmarking models on your particular use cases is the key to evaluating the impact of that quality loss. The closer you are to real production data, the better. Here are some ideas to put you on the right track:

Downshift within your model's family: flagship -> mini -> nano
Look at other providers' models: for example, Chinese providers tend to be cheaper for equivalent quality
Go for the previous generation: Opus 4.7 -> Opus 4.6 -> Opus 4.5
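Before benchmarking quality, it helps to know how much money is at stake. Here is a minimal sketch of the cost side only, assuming a placeholder price table (the model names and prices below are made up for illustration, so plug in your providers' current pricing and your own traffic numbers):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Price in USD per million tokens (placeholder values, not real quotes)."""
    input_per_m: float
    output_per_m: float


# Hypothetical price table: replace with your providers' current pricing.
PRICES = {
    "flagship": ModelPrice(input_per_m=5.00, output_per_m=25.00),
    "mini": ModelPrice(input_per_m=0.75, output_per_m=3.75),
}


def monthly_cost(price: ModelPrice, calls: int, in_tokens: int, out_tokens: int) -> float:
    """Estimated monthly spend for `calls` requests of a given average size."""
    per_call = (in_tokens * price.input_per_m + out_tokens * price.output_per_m) / 1_000_000
    return calls * per_call


def savings_pct(current: str, candidate: str, **workload) -> float:
    """Percentage saved by moving the same workload to the candidate model."""
    before = monthly_cost(PRICES[current], **workload)
    after = monthly_cost(PRICES[candidate], **workload)
    return 100 * (before - after) / before


workload = dict(calls=100_000, in_tokens=2_000, out_tokens=500)
print(f"flagship: ${monthly_cost(PRICES['flagship'], **workload):,.2f}/month")
print(f"mini saves {savings_pct('flagship', 'mini', **workload):.0f}%")
```

This says nothing about quality: the savings only count if the cheaper model still passes your benchmark on production-like data.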

In some cases, queries vary and you can't anticipate how demanding they will be. An LLM router that evaluates each query on the fly and picks a model accordingly can be a good choice.
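To make the routing idea concrete, here is a deliberately naive sketch. Real routers usually rely on a trained classifier or a small model to judge difficulty; this one just uses length and keywords. The model names, the keyword list and the word threshold are all assumptions for illustration:

```python
CHEAP_MODEL = "small-model"    # hypothetical model names
PREMIUM_MODEL = "large-model"

# Hypothetical hints that a query needs heavier reasoning.
HARD_HINTS = ("prove", "step by step", "refactor", "analyze", "debug")


def route(query: str, max_cheap_words: int = 150) -> str:
    """Pick a model tier for a query using simple heuristics."""
    text = query.lower()
    if len(text.split()) > max_cheap_words:
        return PREMIUM_MODEL  # long inputs go to the stronger model
    if any(hint in text for hint in HARD_HINTS):
        return PREMIUM_MODEL  # wording suggests a hard task
    return CHEAP_MODEL        # everything else takes the cheap path


print(route("What are your opening hours?"))
print(route("Debug this stack trace and refactor the handler"))
```

Bucketing a sample of your real traffic with a function like this is a cheap way to see how many calls could move to a smaller model before investing in a proper router.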

2. Use your Pro subscriptions

Are you paying for a ChatGPT, MiniMax or other pro subscription? Make the most of it and plug it into your app. With Manifest, you can plug several pro subscriptions into your apps while keeping your code and favorite SDK as they are. As of today, Manifest lets you plug Anthropic, GitHub Copilot, MiniMax, Ollama Cloud, OpenAI, OpenCode Go and Z.ai pro subscriptions directly into your AI app. Make sure this is allowed by your providers' terms of service.

The thing to watch here is the rate limit. Those subscriptions are often much cheaper than their API key alternative, but they are limited in usage. You need to set up a fallback model in case you hit the subscription's rate limit. See how MyTrainer did exactly this in production.
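The fallback logic itself can stay small. Below is a minimal sketch, assuming each provider is wrapped in a plain function that raises a custom `RateLimited` exception when its quota is exhausted. Both the exception and the two stand-in providers are hypothetical; in a real app you would translate your SDK's rate-limit error into it:

```python
from typing import Callable, Sequence


class RateLimited(Exception):
    """Raised by a provider wrapper when its quota is exhausted (hypothetical)."""


def complete_with_fallback(prompt: str, providers: Sequence[Callable[[str], str]]) -> str:
    """Try each provider in order and move on only when one is rate limited."""
    last_error = None
    for call in providers:
        try:
            return call(prompt)
        except RateLimited as err:
            last_error = err  # remember why we skipped this provider
    raise RuntimeError("all providers are rate limited") from last_error


# Stand-ins for real provider calls, cheapest first.
def subscription(prompt: str) -> str:
    raise RateLimited("subscription quota exhausted")


def api_key(prompt: str) -> str:
    return f"[api] answer to: {prompt}"


print(complete_with_fallback("Hello", [subscription, api_key]))
```

Ordering the list from cheapest to most expensive means you only pay API prices for the overflow.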

3. Reduce output tokens to cut your LLM bill

Did you notice that you don't actually pay directly for the inference of the model you call? You pay for the tokens that go in and the tokens that come out, not for the compute itself. (With reasoning models, the "thinking" tokens are typically billed as output tokens, so they count too.) What if we could tweak the system to produce the same value with less?

Output tokens cost on average 5 times more than input tokens, so it's well worth spending a bit more on input tokens to reduce the output ones.
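A quick back-of-the-envelope calculation shows why. This sketch assumes placeholder prices with the 5x ratio mentioned above and made-up token counts: 50 extra input tokens of instructions that cut the answer from 400 to 200 tokens.

```python
INPUT_PRICE = 1.0   # USD per million input tokens (placeholder)
OUTPUT_PRICE = 5.0  # USD per million output tokens (5x input, as in the text)


def call_cost(in_tokens: int, out_tokens: int) -> float:
    """Cost in USD of a single call at the placeholder prices above."""
    return (in_tokens * INPUT_PRICE + out_tokens * OUTPUT_PRICE) / 1_000_000


verbose = call_cost(in_tokens=1_000, out_tokens=400)
concise = call_cost(in_tokens=1_050, out_tokens=200)  # 50 more in, 200 fewer out

print(f"verbose: ${verbose:.5f}  concise: ${concise:.5f}")
print(f"saving per call: {100 * (1 - concise / verbose):.0f}%")
```

At a 5x ratio, an extra input token pays for itself as soon as it removes more than a fifth of an output token.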

The easiest way is to simply tell the LLM explicitly to "be concise" and even specify the format that you are expecting. Ask for structured JSON or CSV instead of prose. Prose tends to be long and adds verbosity to the outputs. If you need prose, Caveman (60k+ GH stars) reduces output tokens by 75% by… speaking like a caveman. Caveman tool real. Make agent talk short. 🪨

By the way, most providers accept a "max_tokens" parameter that caps the output length. This doesn't make the model concise (it just truncates), but it still prevents runaway outputs.
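Put together, a request that asks for a short structured answer and caps the output could look like the sketch below. It only builds the payload, so it runs without an API key. The prompt wording, the sentiment task and the `your-model` name are assumptions, and the cap is not always called `max_tokens` (some APIs use names such as `max_completion_tokens` or `max_output_tokens`), so check your provider's reference:

```python
import json

# Hypothetical instruction: concise, structured, no extra prose.
CONCISE_SYSTEM = (
    "Be concise. Reply with a single JSON object with keys "
    '"sentiment" (positive, neutral or negative) and "reason" (max 15 words). '
    "No extra prose."
)


def build_request(review: str, model: str = "your-model", max_tokens: int = 80) -> dict:
    """Build a chat-style payload that asks for short JSON and caps the output."""
    return {
        "model": model,
        "max_tokens": max_tokens,  # hard cap: truncates, does not shorten the style
        "messages": [
            {"role": "system", "content": CONCISE_SYSTEM},
            {"role": "user", "content": review},
        ],
    }


print(json.dumps(build_request("Great battery, terrible screen."), indent=2))
```

Keep the cap comfortably above the longest valid answer, otherwise you will receive truncated (and invalid) JSON.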

4. Use prompt caching when you can

Caching has been used for decades to reduce compute and thus latency. For example, when a website shows you a list of items, such as the latest news, it loads the list from the database only for the first user and then stores it somewhere. The next users receive that list directly from the cache, almost instantaneously.

It works the same way with LLM prompts: models do heavy work on every token you send them. If the beginning of your prompt doesn't change between requests, the provider can reuse that work and you pay less. Cached input generally costs between 50% and 90% less.

The #1 rule here is that your static content (the part that doesn't change) should come before the dynamic content (the volatile part). From the first changed character onward, you pay full price for all the following tokens. Structure your call so that system prompts and knowledge bases come first:

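from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# SYSTEM_PROMPT, KNOWLEDGE_BASE and user_question are defined elsewhere in your app
messages = [
    # Static prefix first, so it stays identical between requests
    {"role": "system", "content": SYSTEM_PROMPT + KNOWLEDGE_BASE},
    # Dynamic content last
    {"role": "user", "content": user_question},
]

resp = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=messages,
)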
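If you want to check how cache-friendly your prompts are, you can compare two consecutive prompts and measure how much of the beginning they share. This is a rough sketch working on characters, while providers match on tokens and often require a minimum prefix length before caching kicks in, so treat the number as an indicator only. The prompts are made up:

```python
import os


def shared_prefix_ratio(previous_prompt: str, current_prompt: str) -> float:
    """Fraction of the current prompt that matches the previous one from the start."""
    if not current_prompt:
        return 0.0
    common = os.path.commonprefix([previous_prompt, current_prompt])
    return len(common) / len(current_prompt)


STATIC = "You are a support agent. " * 50  # stand-in for system prompt + knowledge base

# Static first, dynamic last: almost the whole prompt is reusable.
good_a = STATIC + "Question: how do I reset my password?"
good_b = STATIC + "Question: where is my invoice?"

# Dynamic first: the prompts diverge after a few characters.
bad_a = "User: alice\n" + STATIC
bad_b = "User: bob\n" + STATIC

print(f"static first:  {shared_prefix_ratio(good_a, good_b):.0%} shared")
print(f"dynamic first: {shared_prefix_ratio(bad_a, bad_b):.0%} shared")
```

A timestamp or user name at the top of the system prompt is the classic way to end up in the second case without noticing.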

5. Use Batch API for nightly workflows

The Batch API is a straight 50% discount on inference if you can accept receiving the response within 24 hours.

You're not going to use it for a chat or for any action that needs a real-time answer, but any background process is a good candidate. It requires some changes to your code, but it's well worth it if you have a significant amount of inference in scheduled tasks, nightly workflows or routines.
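Most of the code change is about packaging your requests into a file instead of sending them one by one. Here is a minimal sketch that assumes an OpenAI-style batch input, where each line of a JSONL file is one request with a `custom_id`, a `method`, a `url` and a `body`. Other providers use different shapes, and the prompts and `your-model` name are placeholders:

```python
import json


def to_batch_lines(prompts: dict[str, str], model: str = "your-model") -> list[str]:
    """Turn {custom_id: prompt} into JSONL lines for an OpenAI-style batch file."""
    lines = []
    for custom_id, prompt in prompts.items():
        request = {
            "custom_id": custom_id,  # lets you match each result to its request
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return lines


nightly_jobs = {
    "ticket-1": "Summarize this support ticket: printer offline since Monday.",
    "ticket-2": "Summarize this support ticket: refund requested for order 42.",
}
print("\n".join(to_batch_lines(nightly_jobs)))
```

You would then upload that file, create the batch job, and collect the results once it completes, following your provider's batch documentation.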

Check your provider's documentation to see whether it offers a batch API.

6. Be FLEX-ible and accept slow tiers

Some providers even have a "flex" tier: a synchronous tier that is explicitly slower than the standard one. Unlike the Batch API, you still get the response in the same request, but it takes significantly longer. The discount is usually the same: 50% off.
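In code, this usually comes down to one extra request option that you enable only for calls nobody is waiting on. The sketch below assumes the option is named `service_tier` with the value `"flex"`, which is how OpenAI's API exposes it at the time of writing; treat the name as an assumption and check your provider's reference. The `interactive` flag is a hypothetical way of tagging your calls:

```python
def request_options(interactive: bool) -> dict:
    """Extra request options: standard tier for user-facing calls, flex otherwise."""
    if interactive:
        return {}  # keep the default tier when a user is waiting
    # Assumed option name and value; verify against your provider's docs.
    return {"service_tier": "flex"}


base_request = {
    "model": "your-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Tag this document."}],
}

# A background job can merge the flex option into its request.
background_request = {**base_request, **request_options(interactive=False)}
print(background_request)
```

Since flex calls take longer, it is also worth raising your client timeout and retrying on the standard tier if capacity is unavailable.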

Now it's up to you to decide if you can afford that extra latency on some calls. Check whether your provider offers that trade-off.

7. Don't use AI

This one can sound a bit provocative but the cheapest inference will always be the one you don't use. Are those AI features genuinely needed? Or are they just decorative nice-to-haves?

Limiting LLM calls and using algorithms (aka good old software) when possible is the biggest win on this list. Of course it's easier said than done, but many operations can be done programmatically: validation, regex, rules, heuristics…

Pro tip: go hybrid. Sometimes algorithms can't handle every case but work well on some of them. Use conditions in your code to rely on them when possible, and fall back on LLMs for the rest.
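Here is what going hybrid can look like: a regex handles the easy cases for free, and the LLM is only called when the pattern finds nothing. The order-number format and the `llm_extract` callable are hypothetical stand-ins for your own rules and model call:

```python
import re
from typing import Callable

ORDER_ID = re.compile(r"\b[A-Z]{2}-\d{6}\b")  # hypothetical order-number format


def extract_order_id(message: str, llm_extract: Callable[[str], str]) -> tuple[str, str]:
    """Return (order_id, source), calling the LLM only when the regex finds nothing."""
    match = ORDER_ID.search(message)
    if match:
        return match.group(0), "regex"  # free path
    return llm_extract(message), "llm"  # paid path, only for the hard cases


def fake_llm(message: str) -> str:
    """Stand-in for a real model call."""
    return "unknown"


print(extract_order_id("Where is my order AB-123456?", fake_llm))
print(extract_order_id("Where is the thing I bought last week?", fake_llm))
```

Logging the `source` value tells you what share of traffic still reaches the model, which is a good signal for where to add the next rule.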

8. Use free models and free tiers

If you are just hacking around, maybe you don't need to pay at all! Some providers offer free inference on some models. Of course the rate limits are quite low, but if your app consumes little AI inference, it may work. If it doesn't, you can still fall back on a paid model when you hit the limit.

And you know what? We prepared a very cool list of Awesome Free LLM APIs just for you 😎

9. Don't miss those Cloud provider credits

Building a startup? All major cloud providers can give you up to $300k in cloud credits, and most of them can be used for inference. The terms vary from one provider to another, and you will probably need to fill out applications and even meet with them.

The idea here is that the faster you burn through those credits, the more they'll offer you. It makes sense, as they are looking for potential big customers who will stick with their models.

You'll probably never get the full $300k on the first shot; credits are granted incrementally. One important thing: incubators, accelerators and startup programs tend to have agreements on credit packages with providers. Reach out to your program manager if you're in one of those!

10. Observe your AI costs and take back control

Last but not least, you can't optimize what you can't see. Maybe one single recurring LLM call is burning through your whole budget and you haven't identified it yet.
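Even before adopting a dedicated tool, a few lines over your own logs can reveal the culprit. This sketch assumes you record, for each call, the feature that triggered it, the model and the token counts. The log format, model names and prices are all made up for illustration:

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens: (input, output).
PRICE_PER_M = {"large-model": (5.00, 25.00), "small-model": (0.75, 3.75)}


def cost_by_feature(calls: list[dict]) -> list[tuple[str, float]]:
    """Sum the estimated cost per feature, most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        in_price, out_price = PRICE_PER_M[call["model"]]
        cost = (call["in"] * in_price + call["out"] * out_price) / 1_000_000
        totals[call["feature"]] += cost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


usage_log = [
    {"feature": "chat", "model": "small-model", "in": 1_000, "out": 200},
    {"feature": "chat", "model": "small-model", "in": 1_200, "out": 250},
    {"feature": "nightly-summary", "model": "large-model", "in": 50_000, "out": 4_000},
]

for feature, cost in cost_by_feature(usage_log):
    print(f"{feature:<16} ${cost:.4f}")
```

Most SDKs return token usage with each response, so tagging every call with a feature name is usually the only extra work.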

There are many tools (open source or proprietary) like Manifest that let you visualize your costs, analyse them, and even set budget limits. Some are simple and easy to understand, others are more complex and let you dig into the details. It's up to you to find your fit!

Conclusion

Reducing cloud AI inference costs is key for an app's profitability. Consider the few lines of each point in this post as a starting point, an idea to dig into. Not all of them are necessarily applicable in your case, but they are worth knowing and understanding.

At Manifest, we think that AI is an incredible technology, and that it deserves to be affordable. Our platform is open source and gives total control to our users. Check out our website and give us a star on GitHub to support us! Or simply share this post, as it helps others pay less for AI.

Happy hacking!

Top comments (1)

Harjot Singh • May 30

Good comprehensive list. If I had to rank the 10 by actual impact for most teams, the top three are almost always: prompt/response caching (huge on repetitive workloads), trimming context to what the task needs, and model routing - cheap model for the bulk, premium only for hard reasoning. Those three usually account for the lion's share of the savings; the rest are good hygiene around the edges.

The one people underuse is the routing one, because it requires admitting most of your calls don't need your best model - which feels wrong until you actually bucket your traffic by difficulty and see how lopsided it is. Pair routing with caching and you're often at 70-80% reduction before touching anything exotic. Solid roundup - bookmarking to share with people who think the only lever is "switch providers."
