Everyone talks about token pricing. Nobody talks about the real costs that eat your budget.
I have been building with LLM APIs for months. Here is what I wish someone told me on day one.
1. Retries Are Silent Budget Killers
API calls fail. Timeouts happen. Your retry logic fires 3x on a 4000-token prompt, and suddenly one request cost you 12,000 tokens.
Fix: Implement exponential backoff with a retry budget. Cap total retries per request, not just per attempt.
```python
import time

MAX_RETRY_TOKENS = 50_000  # hard cap per logical request

def estimate_tokens(text):
    return len(text) // 4  # rough heuristic; use a real tokenizer in production

retry_tokens_used = 0
response = None
for attempt in range(3):
    if retry_tokens_used > MAX_RETRY_TOKENS:
        break  # stop retrying once the token budget is spent
    try:
        response = call_api(prompt)  # your API wrapper
        break
    except TimeoutError:
        retry_tokens_used += estimate_tokens(prompt)
        time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
```
2. System Prompts Are Repeated Every Call
That 2000-token system prompt? It is sent with every single API call. 100 calls per hour = 200,000 tokens just on system prompts.
Fix: Keep system prompts under 500 tokens. Move detailed instructions to the user message only when needed.
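One way to sketch this, assuming a hypothetical `build_messages` helper and made-up `CORE_SYSTEM` / `DETAILED_RULES` strings: keep the system prompt short and splice the detailed instructions into the user message only on the calls that need them.

```python
# Hypothetical example: short core system prompt, with detailed rules
# injected into the user message only when the task requires them.
CORE_SYSTEM = "You are a concise assistant for customer-support tickets."

DETAILED_RULES = (
    "When drafting a refund reply, cite the order ID, apologize once, "
    "and state the refund timeline."
)

def build_messages(user_text, needs_detail=False):
    # Pay for the long instructions only on the calls that use them.
    content = f"{DETAILED_RULES}\n\n{user_text}" if needs_detail else user_text
    return [
        {"role": "system", "content": CORE_SYSTEM},
        {"role": "user", "content": content},
    ]
```

Most calls now carry a ~50-token system prompt instead of a 2000-token one.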
3. Conversation History Grows Every Turn
Each turn adds both your message AND the response to the history. The payload grows linearly, so your cumulative spend grows quadratically: by turn 10, you are sending roughly 10x the tokens of turn 1.
Fix: Implement sliding window or summarization.
```python
def count_tokens(message):
    return len(message["content"]) // 4  # rough heuristic; use a real tokenizer in production

def trim_history(messages, max_tokens=8000):
    total = sum(count_tokens(m) for m in messages)
    while total > max_tokens and len(messages) > 2:
        removed = messages.pop(1)  # keep system prompt at index 0, drop oldest turn
        total -= count_tokens(removed)
    return messages
```
4. JSON Mode Doubles Your Tokens
Asking the model to respond in JSON? The structured output is typically 2-3x more tokens than plain text for the same information.
Fix: Only use JSON when you actually need to parse the output programmatically. For display purposes, plain text is fine.
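A minimal sketch of that idea, using hypothetical `format_instruction` and `handle_response` helpers: request JSON only when the caller will actually parse the result, and return plain text otherwise.

```python
import json

# Hypothetical helpers: ask for JSON only when the output will be parsed.
def format_instruction(parse_output=False):
    if parse_output:
        return 'Respond with a JSON object: {"answer": ..., "confidence": ...}'
    return "Respond in plain prose."

def handle_response(raw, parse_output=False):
    # Parse only when structured output was requested; otherwise pass through.
    return json.loads(raw) if parse_output else raw
```

Threading a `parse_output` flag through your call sites makes the plain-text path the default, so you only pay the JSON token overhead where a parser consumes it.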
5. The Model Choice Trap
GPT-4 is roughly 30x the price of GPT-3.5 Turbo; Claude Opus is many times the price of Haiku. Most tasks do not need the big model.
Fix: Route by complexity. Use a cheap model for classification, extraction, and simple QA. Reserve the expensive model for reasoning and generation.
```python
def choose_model(task_type):
    # Route routine tasks to the cheap tier; reserve the big model for the rest.
    cheap = ["classify", "extract", "summarize", "translate"]
    if task_type in cheap:
        return "gpt-3.5-turbo"  # or haiku
    return "gpt-4"  # or opus
```
TL;DR
- Cap retry budgets in tokens, not just attempts
- Keep system prompts under 500 tokens
- Trim conversation history aggressively
- Skip JSON mode when you do not need parsing
- Route cheap tasks to cheap models
Do this before you optimize your prompts. The infrastructure savings are 10x bigger than prompt engineering gains.
Building with LLM APIs? What surprised you about the costs?