DEV Community

Liz Zhang

Posted on • Originally published at zhang-liz.com

Your LLM Bill Is Too High. Here's How to Fix It (Part 1)

The cheapest LLM call is the one you do not make.

Everyone building with LLMs eventually hits the same wall. The prototype
works, usage climbs, and suddenly the API bill starts doing things
nobody planned for. The problem is usually not that AI is expensive. The
problem is that teams are using models for work that should never have
touched a model in the first place.

Before you debate GPT versus Claude versus Gemini, ask a more basic
question: Do you need an LLM at all?

Rule: use an LLM when the task requires ambiguity handling,
judgment, synthesis, flexible natural-language generation, complex
reasoning, or tool use. Do not use one because the word AI looks good in
the architecture diagram.

The no-model audit

A shocking amount of production LLM spend is expensive glue around work
that deterministic code, dedicated APIs, or cheaper ML services already
handle well.

| Task | Start here before an LLM | Use an LLM when |
| --- | --- | --- |
| Meeting transcription | Dedicated speech-to-text service | You need synthesis, follow-up extraction, or action-item judgment. |
| Translation | Translation API or cheaper model | The task needs tone adaptation, context-aware rewriting, or multilingual reasoning. |
| Structured document extraction | OCR, document parser, AWS Textract-style pipeline | The document layout is messy, fields are ambiguous, or human-like interpretation is required. |
| Small taxonomy classification | Keyword rules, regex, small classifier | Categories overlap, labels are subjective, or confidence is low. |
| Formatting and validation | Schema validation, deterministic code | The output needs natural-language repair or explanation. |

Table 1. No-model audit: cheaper first-pass alternatives before using an LLM.
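The "small taxonomy classification" row is the easiest one to try today. Here is a minimal sketch of a deterministic first pass: keyword rules handle the common path, and only inputs that match no rule would fall through to a model call. The categories and patterns are hypothetical examples, not from any real system.

```python
import re
from typing import Optional

# Hypothetical support-ticket taxonomy. In practice these rules come from
# looking at real traffic and covering the high-volume, unambiguous cases.
RULES = {
    "billing": re.compile(r"\b(invoice|refund|charge|payment)\b", re.I),
    "auth": re.compile(r"\b(password|login|2fa|locked out)\b", re.I),
}

def classify(ticket: str) -> Optional[str]:
    """Return a category if a rule matches, else None.

    None is the signal to escalate: this is the only place an LLM
    call would happen, instead of on every request.
    """
    for label, pattern in RULES.items():
        if pattern.search(ticket):
            return label
    return None
```

Even if the rules only cover 60 to 70 percent of traffic, that is 60 to 70 percent of requests that never touch a model.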

Where teams waste money

Figure 1. A no-model-first audit prevents teams from paying frontier-model prices for deterministic work.

The common pattern is simple. A team builds a general-purpose prompt,
points every request at a strong model, and ships. It works, so nobody
questions the architecture until the bill arrives. By then, the model
has become the default path for classification, extraction, routing,
formatting, translation, rewriting, and exception handling.

That is backwards. The model should not be the default path. The model
should be the judgment path.

![Illustrative savings potential by optimization lever](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1rf8cty7ptfjfg0rvu39.png)
Figure 2. Illustrative savings potential by optimization lever. Actual savings vary by workload and traffic shape.

A better default architecture

  1. Validate inputs with code. Reject malformed payloads before spending tokens.
  2. Use deterministic tools first. Regex, parsers, lookup tables, and APIs are boring. That is why they are cheap and reliable.
  3. Use small models for fuzzy but routine tasks. Classification, extraction, and rewriting usually do not need a frontier model.
  4. Escalate only when confidence is low. Premium models should handle ambiguity, high-risk cases, and hard reasoning.
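The four steps above can be sketched as a single pipeline. This is an illustrative skeleton, not a production router: `small_model` and `frontier_model` are placeholders for real API calls, and the 0.8 confidence threshold is an arbitrary example you would tune against your own traffic.

```python
import json
from typing import Optional

def validate(payload: str) -> Optional[dict]:
    """Step 1: reject malformed payloads with plain code, spending zero tokens."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return None
    return data if "text" in data else None

def small_model(text: str) -> tuple:
    """Step 3 stand-in: a cheap model returning (label, confidence).

    A real call would hit a small hosted or local model; the keyword
    check here just makes the sketch runnable.
    """
    return ("routine", 0.95 if "help" in text else 0.4)

def frontier_model(text: str) -> str:
    """Step 4 stand-in: the expensive escalation path for ambiguous cases."""
    return "needs-review"

def handle(payload: str) -> str:
    data = validate(payload)
    if data is None:
        return "rejected"                      # step 1: deterministic code
    label, confidence = small_model(data["text"])
    if confidence >= 0.8:
        return label                           # step 3: cheap model suffices
    return frontier_model(data["text"])        # step 4: escalate on low confidence
```

The point of the structure is that the frontier model sits at the bottom, reached only when everything cheaper has declined the request.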

Practical checklist

  • Can the task be solved with deterministic code?
  • Can a dedicated API solve it more cheaply and consistently?
  • Can a small classifier handle the common path?
  • Are you sending repetitive context that could be cached?
  • Is the frontier model reserved for exception cases?
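The caching question in the checklist has a cheap first answer: memoize identical requests before they reach the network. A minimal sketch, assuming exact-match prompts (real systems often also use provider-side prompt caching or semantic caching); `expensive_model_call` is a placeholder with a call counter so the effect is visible.

```python
import functools

CALLS = {"n": 0}

def expensive_model_call(prompt: str) -> str:
    """Placeholder for a real API call; counts invocations for illustration."""
    CALLS["n"] += 1
    return f"answer:{len(prompt)}"

@functools.lru_cache(maxsize=1024)
def cached_completion(prompt: str) -> str:
    # Identical prompts hit the in-process cache instead of the API,
    # so repeated context costs nothing after the first call.
    return expensive_model_call(prompt)
```

This only helps when prompts repeat exactly, which is exactly the "repetitive context" case the checklist asks about.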

Bottom line

The first cost optimization step is not prompt compression. It is
architectural honesty. Most requests are boring. Treat them that way,
and the bill starts dropping before you even switch models.
