DEV Community

Eli Katz

Why your LLM bill spiked: 7 causes (and a way to fix them)

If you’re shipping LLM features, the invoice can jump before anyone knows why. Most cost blow-ups are predictable—and observable.

  1. Context bloat (prompts slowly grow) → Track input tokens p50/p95 → Add prompt budgets + summarize history

  2. Retry storms (1 action = N calls) → Track calls per workflow/session → Cap retries + backoff + fail fast

  3. Wrong model drift (expensive model becomes default) → Track model mix over time → Route: cheap by default, escalate on low confidence

  4. Agent/tool loops (runaway tool calls) → Track tool-call depth + trace length → Cap depth, limit tool output, add stop conditions

  5. Verbose outputs (paying for essays) → Track output token distribution → Set max response length + structured formats

  6. RAG overshoot (too many/too big chunks) → Track retrieved tokens/query → Reduce top-k, tighter chunks, retrieval budgets

  7. Abort + re-ask loops (stream cancel then repeat) → Track aborted generations + rapid repeats → Improve first response, add “continue?”, cache safely
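The p50/p95 tracking in #1 needs nothing more than the per-request input token counts your provider already reports in its usage metadata. A minimal sketch, assuming those counts have been collected into a plain list (the function name is illustrative):

```python
from statistics import quantiles

def token_percentiles(input_token_counts):
    """p50/p95 of per-request input tokens.

    A rising p95 with a flat p50 is the classic context-bloat signature:
    most requests are fine, but the long tail is quietly growing.
    """
    qs = quantiles(input_token_counts, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94]}
```

Alert when p95 drifts upward week over week, not on absolute values: every app has its own baseline.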
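For the retry cap in #2, the fix is a hard ceiling plus exponential backoff, so one user action can never fan out into an unbounded number of billed calls. A sketch under the assumption that your client raises on transient errors (`call_model` is a placeholder for your provider call, not a real SDK function):

```python
import time

def call_with_backoff(prompt, call_model, max_retries=3, base_delay=1.0):
    """Retry a flaky model call with exponential backoff, then fail fast.

    Hard cap: one user action costs at most max_retries + 1 calls.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return call_model(prompt)
        except Exception:
            if attempt == max_retries:
                raise  # fail fast: surface the error instead of burning tokens
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts
```

The cap matters more than the backoff for the bill: without it, a degraded upstream turns every request into a retry storm.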
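The routing fix in #3 can be as small as a single branch. This sketch assumes each model call returns an (answer, confidence) pair; the confidence signal is up to you (logprob-derived, a self-rating, a verifier score), and both model arguments are placeholders:

```python
def route(prompt, cheap_model, expensive_model, confidence_threshold=0.7):
    """Cheap by default, escalate on low confidence.

    Returns (answer, tier) so the tier can be logged -- tracking the
    model mix over time is what catches expensive-model drift.
    """
    answer, confidence = cheap_model(prompt)
    if confidence >= confidence_threshold:
        return answer, "cheap"
    answer, _ = expensive_model(prompt)  # escalate only when unsure
    return answer, "expensive"
```

Logging the tier on every call is the point: if "expensive" creeps from 5% to 40% of traffic, you see it before the invoice does.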
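All three controls from #4 (cap depth, limit tool output, add stop conditions) fit in one small loop driver. A sketch, where `step` is a placeholder for one turn of your agent: given the transcript so far, it returns either `("tool", output)` or `("final", answer)`:

```python
MAX_TOOL_OUTPUT = 2000  # chars of tool output fed back into context

def run_agent(task, step, max_tool_calls=5):
    """Drive a tool-using loop with explicit stop conditions.

    The loop bound is the stop condition: a runaway agent is cut off
    after max_tool_calls rather than billing tokens indefinitely.
    """
    transcript = [task]
    for _ in range(max_tool_calls):
        kind, payload = step(transcript)
        if kind == "final":
            return payload
        transcript.append(payload[:MAX_TOOL_OUTPUT])  # cap tool output size
    raise RuntimeError("tool-call budget exhausted")
```

Truncating tool output is the sleeper fix here: a single verbose tool result re-enters the context on every subsequent turn, so its cost compounds with depth.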

We've built ZenLLM (zenllm.io): read-only LLM cost observability + optimization recommendations—so it can’t break prod or become a single point of failure.

Launching with a limited number of free LLM Savings Assessments (attribution + top waste + prioritized roadmap). If you want one, comment with your stack + your biggest cost mystery.
