Everyone talks about token pricing. Nobody talks about the real costs that eat your budget.
I have been building with LLM APIs for months. Here is what I wish someone told me on day one.
1. Retries Are Silent Budget Killers
API calls fail. Timeouts happen. Your retry logic fires 3x on a 4000-token prompt, and suddenly one request cost you 12,000 tokens.
Fix: Implement exponential backoff with a retry budget. Cap total retries per request, not just per attempt.
```python
import time

MAX_RETRY_TOKENS = 50_000  # hard cap per logical request

def estimate_tokens(text):
    return len(text) // 4  # rough heuristic; use a real tokenizer in production

retry_tokens_used = 0
response = None
for attempt in range(3):
    if retry_tokens_used > MAX_RETRY_TOKENS:
        break  # stop retrying once the token budget is spent
    try:
        response = call_api(prompt)  # your API wrapper
        break
    except TimeoutError:
        retry_tokens_used += estimate_tokens(prompt)
        time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
```
2. System Prompts Are Repeated Every Call
That 2000-token system prompt? It is sent with every single API call. 100 calls per hour = 200,000 tokens just on system prompts.
Fix: Keep system prompts under 500 tokens. Move detailed instructions to the user message only when needed.
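One way to sketch this, assuming a hypothetical `build_messages` helper and made-up `CORE_SYSTEM` / `DETAILED_RULES` strings: keep the system prompt short and splice the detailed instructions into the user message only on the calls that need them.

```python
# Hypothetical example: short core system prompt, with detailed rules
# injected into the user message only when the task requires them.
CORE_SYSTEM = "You are a concise assistant for customer-support tickets."

DETAILED_RULES = (
    "When drafting a refund reply, cite the order ID, apologize once, "
    "and state the refund timeline."
)

def build_messages(user_text, needs_detail=False):
    # Pay for the long instructions only on the calls that use them.
    content = f"{DETAILED_RULES}\n\n{user_text}" if needs_detail else user_text
    return [
        {"role": "system", "content": CORE_SYSTEM},
        {"role": "user", "content": content},
    ]
```

Most calls now carry a ~50-token system prompt instead of a 2000-token one.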
3. Conversation History Grows Every Turn
Each turn adds both your message AND the response to the history. The payload grows linearly, so your cumulative spend grows quadratically: by turn 10, you are sending roughly 10x the tokens of turn 1.
Fix: Implement sliding window or summarization.
```python
def count_tokens(message):
    return len(message["content"]) // 4  # rough heuristic; use a real tokenizer in production

def trim_history(messages, max_tokens=8000):
    total = sum(count_tokens(m) for m in messages)
    while total > max_tokens and len(messages) > 2:
        removed = messages.pop(1)  # keep system prompt at index 0, drop oldest turn
        total -= count_tokens(removed)
    return messages
```
4. JSON Mode Doubles Your Tokens
Asking the model to respond in JSON? The structured output is typically 2-3x more tokens than plain text for the same information.
Fix: Only use JSON when you actually need to parse the output programmatically. For display purposes, plain text is fine.
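A minimal sketch of that idea, using hypothetical `format_instruction` and `handle_response` helpers: request JSON only when the caller will actually parse the result, and return plain text otherwise.

```python
import json

# Hypothetical helpers: ask for JSON only when the output will be parsed.
def format_instruction(parse_output=False):
    if parse_output:
        return 'Respond with a JSON object: {"answer": ..., "confidence": ...}'
    return "Respond in plain prose."

def handle_response(raw, parse_output=False):
    # Parse only when structured output was requested; otherwise pass through.
    return json.loads(raw) if parse_output else raw
```

Threading a `parse_output` flag through your call sites makes the plain-text path the default, so you only pay the JSON token overhead where a parser consumes it.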
5. The Model Choice Trap
GPT-4 is roughly 30x the price of GPT-3.5 Turbo; Claude Opus is many times the price of Haiku. Most tasks do not need the big model.
Fix: Route by complexity. Use a cheap model for classification, extraction, and simple QA. Reserve the expensive model for reasoning and generation.
```python
def choose_model(task_type):
    # Route routine tasks to the cheap tier; reserve the big model for the rest.
    cheap = ["classify", "extract", "summarize", "translate"]
    if task_type in cheap:
        return "gpt-3.5-turbo"  # or haiku
    return "gpt-4"  # or opus
```
TL;DR
- Cap retry budgets in tokens, not just attempts
- Keep system prompts under 500 tokens
- Trim conversation history aggressively
- Skip JSON mode when you do not need parsing
- Route cheap tasks to cheap models
Do this before you optimize your prompts. The infrastructure savings are 10x bigger than prompt engineering gains.
Building with LLM APIs? What surprised you about the costs?