Kai Norden
The Hidden Cost of LLM API Calls — What Nobody Tells You

Everyone talks about token pricing. Nobody talks about the real costs that eat your budget.

I have been building with LLM APIs for months. Here is what I wish someone told me on day one.

1. Retries Are Silent Budget Killers

API calls fail. Timeouts happen. Your retry logic fires 3x on a 4000-token prompt, and suddenly one request costs you 12,000 tokens.

Fix: Implement exponential backoff with a retry budget. Cap total retries per request, not just per attempt.

import time

MAX_RETRY_TOKENS = 50_000  # hard cap per logical request
retry_tokens_used = 0
response = None  # stays None if every attempt fails

for attempt in range(3):
    try:
        response = call_api(prompt)
        break
    except TimeoutError:
        retry_tokens_used += estimate_tokens(prompt)
        if retry_tokens_used > MAX_RETRY_TOKENS:
            break  # stop retrying once the token budget is spent
        time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s

2. System Prompts Are Repeated Every Call

That 2000-token system prompt? It is sent with every single API call. 100 calls per hour = 200,000 tokens just on system prompts.

Fix: Keep system prompts under 500 tokens. Move detailed instructions to the user message only when needed.
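One way to do that is to keep a short core system prompt and splice detailed instructions into the user message only for the requests that need them. A minimal sketch (the prompt text, topics, and `build_messages` helper are all hypothetical):

```python
CORE_SYSTEM = "You are a support assistant. Be concise."  # keep this well under 500 tokens

# Detailed instruction blocks, attached on demand instead of on every call
DETAIL_BLOCKS = {
    "refund": "Refund policy details: ...",
    "shipping": "Shipping rules: ...",
}

def build_messages(user_text, topic=None):
    """Build the messages list, adding detail blocks only when needed."""
    content = user_text
    if topic in DETAIL_BLOCKS:
        content = DETAIL_BLOCKS[topic] + "\n\n" + user_text
    return [
        {"role": "system", "content": CORE_SYSTEM},
        {"role": "user", "content": content},
    ]
```

Most calls now pay for the short core prompt only; the long instructions ride along just on the turns that hit those topics.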

3. Conversation History Grows Quadratically

Each turn appends both your message AND the response to the history, so the payload you send grows linearly with turn count: by turn 10 you are sending roughly 10x the tokens of turn 1. Summed across the whole conversation, the total tokens you are billed for grow quadratically.
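A toy calculation makes this concrete. Assume each exchange adds ~200 tokens of history and each new user message is ~100 tokens (both numbers are made up for illustration):

```python
PER_EXCHANGE = 200  # assumed: user message + assistant reply added each turn

def request_size(turn):
    """Tokens sent on a given turn: all prior history plus the new 100-token message."""
    return (turn - 1) * PER_EXCHANGE + 100

total = sum(request_size(t) for t in range(1, 11))
print(request_size(1), request_size(10), total)  # 100 1900 10000
```

Turn 10 alone costs 19x turn 1, and the ten-turn conversation costs 100x a single turn-1 request.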

Fix: Implement sliding window or summarization.

def trim_history(messages, max_tokens=8000):
    total = sum(count_tokens(m) for m in messages)
    while total > max_tokens and len(messages) > 2:
        removed = messages.pop(1)  # keep system, remove oldest
        total -= count_tokens(removed)
    return messages

4. JSON Mode Inflates Your Tokens

Asking the model to respond in JSON? Braces, quotes, and repeated keys mean the structured output is typically 2-3x more tokens than plain text carrying the same information.

Fix: Only use JSON when you actually need to parse the output programmatically. For display purposes, plain text is fine.
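When you do need structured output, you can still trim the overhead by asking for compact JSON with no indentation. A quick sketch of the difference, using character counts as a rough proxy for tokens (the payload is illustrative):

```python
import json

# Hypothetical payload for illustration
payload = {"user_name": "Ada Lovelace", "birth_year": 1815}

pretty = json.dumps(payload, indent=2)                # indented, what models often emit
compact = json.dumps(payload, separators=(",", ":"))  # no whitespace at all

print(len(pretty), len(compact))  # the compact form is meaningfully shorter
```

Shorter key names help too, since keys are repeated in every object of an array.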

5. The Model Choice Trap

Depending on tier, GPT-4 can cost 15-30x more per token than GPT-3.5 Turbo, and Claude Opus is a similar multiple of Haiku. Most tasks do not need the big model.

Fix: Route by complexity. Use a cheap model for classification, extraction, and simple QA. Reserve the expensive model for reasoning and generation.

def choose_model(task_type):
    cheap = ["classify", "extract", "summarize", "translate"]
    if task_type in cheap:
        return "gpt-3.5-turbo"  # or haiku
    return "gpt-4"  # or opus

TL;DR

  1. Cap retry budgets in tokens, not just attempts
  2. Keep system prompts under 500 tokens
  3. Trim conversation history aggressively
  4. Skip JSON mode when you do not need parsing
  5. Route cheap tasks to cheap models

Do this before you optimize your prompts. In my experience, these infrastructure savings dwarf anything you will get from prompt engineering.


Building with LLM APIs? What surprised you about the costs?
