Everyone watches the model bill.
You pick a model, check the pricing page, estimate your request volume, do the math. The number looks fine. You ship.
Six months later the bill is three times what you expected and nobody can explain exactly where it went.
I have been there. And after building multi-tenant LLM systems in production — serving enterprise clients across multiple services, two different LLM providers, an agentic orchestration layer, and a full retrieval pipeline — I can tell you exactly where the money goes.
It is not the model. Not directly.
It is five decisions you made before you ever called the model, most of which felt completely reasonable at the time.
1. You are using the same model for every task
This is the most common and most expensive mistake.
When you start a new service you pick a model. It works. You move on. That model ends up handling every request type — simple lookups, complex reasoning, structured output, ambiguous multi-turn conversations — all running through the same expensive inference endpoint.
We run two production services built on the same orchestration framework. One is a streaming Q&A chatbot over complex enterprise WMS documentation. Multi-turn, ambiguous queries, large documents, 20+ enterprise tenants. It needs a capable frontier model. It gets one.
The other is a structured flow bot. Narrow scope. Well-defined process. Predictable input shape. It runs on Amazon Nova Micro. A fraction of the cost. Faster. For what it actually does, it performs identically to the expensive alternative.
Same tool registry. Same orchestrator. Same prompt templates. Different model injected at startup. The cost difference, multiplied across every request, every tenant, every month, is not small.
The question most engineers never ask: what is the cheapest model that reliably does this specific job?
That question changes per use case. Per tenant. Sometimes per request type. The only way to act on it without a rewrite every time is to build the abstraction layer first.
```python
# Provider abstraction — same interface regardless of model or vendor
class LLMBase:
    def agent_completion_response(self, request): ...
    def agent_completion_response_streaming(self, request): ...

# OpenAI and Bedrock both implement this interface.
# The orchestrator never knows which one it's talking to.
llm = LLM.get_provider("bedrock")  # or "openai"
orchestrator = OpenaiOrchestratorAgent(llm_client=llm)
```
Swapping Nova Micro for GPT-4o-mini is a config change. If you are tightly coupled to one provider it is a rewrite. That flexibility is what lets you make cost decisions without architectural consequences.
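In practice, "a config change" can be as small as a per-service model registry that the startup code reads. A minimal sketch of the idea; the `MODEL_CONFIG` structure, service names, and model IDs here are illustrative, not the actual code behind the services above:

```python
# Hypothetical registry: each service maps to the cheapest model
# that reliably handles that service's workload.
MODEL_CONFIG = {
    "docs_chatbot": {"provider": "bedrock", "model": "frontier-model-id"},
    "flow_bot": {"provider": "bedrock", "model": "amazon.nova-micro-v1:0"},
}

def resolve_model(service_name: str) -> dict:
    """Look up the provider/model pair for a service; fail loudly on unknowns."""
    try:
        return MODEL_CONFIG[service_name]
    except KeyError:
        raise ValueError(f"No model configured for service: {service_name}")
```

Downgrading or upgrading a service is then an edit to one dict entry, and the orchestrator never changes.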
2. Your agentic loop has no ceiling
A single LLM call has a predictable cost. An agentic loop does not.
In a tool-use orchestrator the model calls a tool, gets a result, decides what to do next. Normal runs take two or three iterations. Edge cases can take more. Without a hard cap, one unexpected input becomes a runaway process.
We hit this during testing. A tool was returning output the model did not know how to handle. No ceiling in place. The model called the same tool seven times trying to recover. It never did. The API timed out. The cost of that single test run was a useful reminder.
```python
class OpenaiOrchestratorAgent:
    def __init__(self, llm_client, max_loops: int = 10):
        self.llm = llm_client
        self.max_loops = max_loops  # Hard ceiling. Not optional.

    async def _run_non_streaming(self, ...):
        loop_count = 0
        while loop_count < self.max_loops:
            loop_count += 1
            # ... call model, execute tools ...
        raise ValueError("Maximum iteration loops reached")
```
Ten is a generous ceiling for most agentic workflows. If your agent consistently needs more than ten iterations to answer a question, that is a design problem, not a reason to raise the cap.
The loop cap is your billing safety valve. Set it before you need it.
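The ceiling is also easy to verify in isolation, before an edge case finds it for you. A stripped-down sketch of the same loop with a stub model that never converges; `run_agent_loop` and the stub are hypothetical test scaffolding, not production code:

```python
def run_agent_loop(call_model, max_loops: int = 10):
    """Iterate tool-use steps until the model signals it is done,
    or the hard ceiling is hit."""
    loop_count = 0
    while loop_count < max_loops:
        loop_count += 1
        step = call_model()
        if step.get("done"):
            return step["answer"]
    raise ValueError("Maximum iteration loops reached")

# A stub model stuck retrying a tool it cannot recover from:
broken_model = lambda: {"done": False}
```

With the stub, the loop burns exactly `max_loops` iterations and raises, instead of burning tokens until an API timeout decides for you.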
3. Your retrieval is pulling too many chunks
Every retrieval result is input tokens. Every input token costs money.
If your knowledge base returns 10 chunks when 3 would answer the question, you are paying to process 7 unnecessary chunks' worth of input tokens on every single request. Across every user, every tenant, every conversation turn.
Most teams accept the default retrieval count because it is convenient. `numberOfResults` gets set once during initial setup and never revisited.
```python
# Bedrock KB search config
search_config = {
    "vectorSearchConfiguration": {
        "numberOfResults": 5,            # Deliberate. Not default.
        "overrideSearchType": "HYBRID",  # HYBRID vs SEMANTIC per use case
        "filter": tenant_filter,         # Hard tenant isolation
    }
}
```
The right number depends on your chunk size, your query complexity, and your context window budget. For a narrow structured Q&A flow, 3 is often enough. For complex multi-document reasoning, 8 might be appropriate. The point is to have a reason for the number — not to accept whatever the SDK defaults to.
Retrieval configuration is a cost decision. Treat it like one.
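A back-of-envelope calculation makes the stakes concrete. Every number below is an assumption chosen for illustration, not real pricing or real traffic:

```python
# Back-of-envelope: input-token cost of retrieval, per month.
# All constants are illustrative assumptions, not real figures.
CHUNK_TOKENS = 400          # assumed average tokens per retrieved chunk
PRICE_PER_1K_INPUT = 0.003  # assumed $ per 1K input tokens
REQUESTS_PER_MONTH = 500_000

def monthly_retrieval_cost(chunks_per_request: int) -> float:
    """Dollars per month spent feeding retrieved chunks to the model."""
    tokens = CHUNK_TOKENS * chunks_per_request * REQUESTS_PER_MONTH
    return tokens / 1000 * PRICE_PER_1K_INPUT

ten = monthly_retrieval_cost(10)   # default-ish retrieval count
three = monthly_retrieval_cost(3)  # deliberate retrieval count
```

Under these assumptions, 10 chunks per request costs about $6,000 a month in retrieval input tokens; 3 chunks costs $1,800. The $4,200 delta buys you nothing the user ever sees.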
4. Your tool schemas are carrying weight the model does not need
Every tool you register with the LLM becomes part of the input on every call where that tool is available. Every field in every tool schema is tokens. This multiplies across every iteration of every loop of every request.
The mistake is putting everything in the schema — including context that belongs to the system, not the model.
Tenant identifiers. Knowledge base names. Session IDs. Internal routing parameters. None of these are decisions the LLM needs to make. They are facts the system already knows. Putting them in the model-facing schema means the model reasons over them, the schema grows, and the token count on every call goes up.
The fix is to split tool parameters into two categories:
```python
@tool(
    name="knowledge_qna",
    description="Search the knowledge base and answer the question",
    params=[
        {"name": "query", "description": "The user's question", "type": "string"}
    ],
    # LLM provides: query
    # System injects: kb_name, filter_config, session_id
    internal_params=["kb_name", "filter_config", "session_id"],
    category="retrieval",
)
def knowledge_qna(query: str, kb_name: str, filter_config: dict, session_id: str):
    ...
```
The LLM sees one parameter: query. The service layer injects the rest at runtime. The model schema stays lean. The internal parameters never appear in the tool schema that gets sent to the model on every call.
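One way such a split can be implemented is a decorator that builds the model-facing schema from `params` alone and records `internal_params` as metadata for the service layer. A hypothetical sketch, not the article's actual framework:

```python
def tool(name, description, params, internal_params=(), category=None, direct=False):
    """Hypothetical @tool decorator: attaches metadata plus a model-facing
    schema that excludes anything the system injects at runtime."""
    def wrap(fn):
        fn.meta = {
            "name": name,
            "category": category,
            "direct": direct,
            "internal_params": set(internal_params),
        }
        # Only LLM-provided params reach the schema sent on every call.
        fn.llm_schema = {
            "name": name,
            "description": description,
            "parameters": [p for p in params if p["name"] not in internal_params],
        }
        return fn
    return wrap

@tool(
    name="knowledge_qna",
    description="Search the knowledge base and answer the question",
    params=[{"name": "query", "description": "The user's question", "type": "string"}],
    internal_params=["kb_name", "filter_config", "session_id"],
    category="retrieval",
)
def knowledge_qna(query, kb_name=None, filter_config=None, session_id=None):
    ...
```

The service layer reads `meta["internal_params"]` and fills those arguments itself; only `llm_schema` is ever serialised into the prompt.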
At 15 tools with bloated schemas, this is not a minor optimisation. It compounds.
5. You are synthesising when you don't need to
The default agentic pattern is: call a tool → get a result → pass the result back to the LLM → let the model compose a final answer. This is correct when the LLM needs to reason over the result, combine it with other context, or explain it to the user.
It is not correct when the tool already generated the final output.
If a tool produces a deterministic result — a formatted document, a structured XML payload, a direct lookup value — routing that back through the LLM for synthesis costs tokens, adds latency, and introduces a layer of unpredictability you do not need. The model might rephrase it. It might add hedging language. It might change the format. None of that is useful.
```python
@tool(
    name="modify_label_xml",
    description="Modify the label XML template",
    params=[...],
    direct=True,  # Result is the response. Bypasses LLM synthesis.
    category="labelgen",
)
def modify_label_xml(...):
    # Tool generates the final output directly
    return updated_xml
```

```python
# In the orchestrator — direct tools short-circuit the loop
if tool_entry["meta"].direct:
    return str(result)  # Done. No synthesis step.
```
The `direct` flag tells the orchestrator that the tool result is the response. The loop stops. No additional LLM call. No synthesis tokens. The result is exactly what the tool produced.
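The dispatch decision itself is tiny. A self-contained sketch of the pattern; the `ToolMeta` type and `dispatch` helper are hypothetical names, not the orchestrator's real API:

```python
from dataclasses import dataclass

@dataclass
class ToolMeta:
    direct: bool = False

def dispatch(tool_fn, meta: ToolMeta, args: dict, synthesize):
    """Run a tool. Direct tools return their result as-is;
    everything else goes back to the model for synthesis."""
    result = tool_fn(**args)
    if meta.direct:
        return str(result)  # short-circuit: no synthesis call, no extra tokens
    return synthesize(result)
```

The `synthesize` callable stands in for the extra LLM round trip; direct tools never pay for it.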
Not every tool should be direct. But every tool should have an explicit answer to the question: does the LLM need to do anything with this result, or is the result already the answer?
Putting it together
None of these are advanced optimisations. They are decisions that look small in isolation and expensive in aggregate.
| Decision | Cost impact |
|---|---|
| Using a frontier model for every task | 5-20x per request vs a smaller model |
| No loop cap | Unbounded. One edge case can cost more than a day of normal traffic |
| Default retrieval count | 2-3x token waste on every retrieval call |
| Bloated tool schemas | Multiplicative: every extra token across every tool across every call |
| Unnecessary synthesis | One extra LLM call per tool invocation that didn't need it |
The engineers whose AI systems stay affordable at scale are not the ones who picked the cheapest model. They are the ones who made deliberate decisions about all five of these things before the bill told them they had to.
Cost is not something you optimise later. By the time you are optimising it reactively, you have already explained the numbers to someone you did not want to.
If this was useful, I write about building production AI systems — RAG, agentic workflows, multi-tenant architecture, and the engineering decisions that don't show up in tutorials. Follow along.
