Nolan Vale

Posted on Jun 12

Token Cost Optimization: How to Cut LLM Inference Spend Without Cutting Quality

#ai #llm #performance #rag

There is a version of token cost optimization that I do not recommend: cutting token counts by reducing the quality of your system prompt, your retrieved context, or your response formatting. This approach reduces cost and reduces quality in equal measure. You have not optimized anything. You have just accepted worse outputs at a lower price.

The token cost optimization that is worth doing reduces cost by eliminating wasteful patterns while preserving or improving the quality of what the model actually receives and generates. This is an engineering problem, not a quality trade-off. And there is typically significant waste to eliminate before you need to make any quality trade-offs at all.

Here is where the waste usually is and how to address it.

Source 1: Redundant Context in Every Request

For a RAG system serving an organization, some context is constant across every request: the system prompt that defines the agent's role and behavior, organizational facts that are always relevant, formatting instructions. When this context is large and always included, it becomes a significant fraction of per-request token cost.

Prompt caching is the solution. Both Anthropic and OpenAI offer prompt caching for content that is repeated across requests. Cached content is charged at a significantly reduced rate, typically 90% less than standard input tokens on Anthropic's API. For a system prompt that represents 20% of average request size, prompt caching alone reduces the cost of that 20% by 90%, which translates to an 18% reduction in total input token cost.
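
The arithmetic is worth making explicit, because it is the number you will use to decide whether restructuring prompts is worth the effort. The sketch below is a simplification: the function name is hypothetical, it assumes every request hits the cache, and it ignores any surcharge a provider applies when content is first written to the cache.

```python
def input_cost_reduction(cached_fraction: float, cache_discount: float) -> float:
    """Fraction of total input token cost saved when `cached_fraction` of the
    input tokens are billed at a discount of `cache_discount`.

    Simplifying assumptions: every request is a cache hit, and there is no
    extra charge for writing content into the cache.
    """
    if not (0.0 <= cached_fraction <= 1.0 and 0.0 <= cache_discount <= 1.0):
        raise ValueError("both arguments must be between 0 and 1")
    return cached_fraction * cache_discount


# The example from the text: 20% of input is stable, cached tokens cost 90% less.
saving = input_cost_reduction(cached_fraction=0.20, cache_discount=0.90)
print(f"{saving:.0%}")  # prints 18%
```

Plug in your own measured cacheable fraction; the larger the stable prefix relative to the rest of the request, the larger the saving.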

The prerequisite for effective prompt caching is structural: the cacheable content must appear at the start of the prompt in a consistent position across requests. System prompts that are dynamically assembled with user-specific or session-specific content inserted before the stable content cannot be cached effectively. Restructure prompts to place stable content at the beginning and dynamic content at the end.
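
A minimal sketch of what "stable content first" looks like in prompt assembly. Everything here is illustrative: `STABLE_PREFIX`, `build_prompt`, and the prompt wording are hypothetical, and the exact mechanism for marking content as cacheable depends on your provider.

```python
import os

# Stable content: identical bytes on every request, so it can be cached.
STABLE_PREFIX = "\n\n".join([
    "You are a support assistant for Example Corp.",
    "Organizational facts: offices in Berlin and Austin; support hours 9-17 UTC.",
    "Formatting rules: answer briefly and cite the source document.",
])


def build_prompt(user_name: str, retrieved_chunks: list[str], question: str) -> str:
    """Assemble a prompt with all per-request content AFTER the stable prefix."""
    dynamic = "\n\n".join([
        f"User: {user_name}",
        "Context:\n" + "\n---\n".join(retrieved_chunks),
        f"Question: {question}",
    ])
    return STABLE_PREFIX + "\n\n" + dynamic


a = build_prompt("Ada", ["VPN guide, section 2"], "How do I reset my VPN token?")
b = build_prompt("Lin", ["Travel policy, page 4"], "What is the hotel rate cap?")

# Two unrelated requests still share the whole stable prefix.
shared = os.path.commonprefix([a, b])
print(len(shared) >= len(STABLE_PREFIX))
```

The anti-pattern is the reverse order: putting `User: ...` or a timestamp at the top changes the first bytes of every request, and nothing after that point can be reused.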

For self-hosted deployments using vLLM or similar serving infrastructure, prefix caching provides the same benefit without API-level caching. The key principle is identical: structure prompts to maximize the length of the stable prefix.

Source 2: Over-Retrieval

The most common retrieval pattern is to retrieve a fixed top-k chunks regardless of query type. A simple factual query retrieves the same number of chunks as a complex analytical query. A query with a clear, high-confidence answer in the top result retrieves the same amount of context as a query where the relevant information is scattered across multiple documents.

The waste is significant. For simple queries where one or two chunks contain the full answer, retrieving eight chunks and sending all of them to the model adds context that cannot improve the answer and almost certainly adds noise.

Adaptive retrieval reduces this waste. Rather than a fixed top-k, implement a threshold-based retrieval that retrieves chunks above a similarity threshold up to a maximum. For queries with a clear top result and diminishing returns at lower similarity scores, this pattern retrieves fewer chunks. For queries where relevant information is distributed, it retrieves more.
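
Here is a rough sketch of threshold-based selection, assuming your vector store already returns `(chunk, similarity)` pairs. The function name and the default values (`min_score=0.75`, `max_chunks=8`) are placeholders to tune against your own data, not recommendations.

```python
def adaptive_top_k(scored_chunks, min_score=0.75, max_chunks=8, min_chunks=1):
    """Keep chunks whose similarity is at least `min_score`, capped at
    `max_chunks`. Always return at least `min_chunks` so the model is never
    sent an empty context.

    `scored_chunks` is a list of (chunk, similarity) pairs in any order.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    selected = [pair for pair in ranked[:max_chunks] if pair[1] >= min_score]
    if len(selected) < min_chunks:
        selected = ranked[:min_chunks]
    return selected


# One clear hit, then a steep drop-off: only the top chunk is kept.
simple = [("a", 0.92), ("b", 0.61), ("c", 0.58)]
print(len(adaptive_top_k(simple)))  # 1

# Relevant information spread across several documents: more chunks are kept.
scattered = [("a", 0.81), ("b", 0.80), ("c", 0.78), ("d", 0.77), ("e", 0.40)]
print(len(adaptive_top_k(scattered)))  # 4
```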

Where the query type is predictable (keyword lookups for specific facts versus analytical questions requiring synthesis), query classification can direct different queries to different retrieval configurations. A query classifier at the front of the pipeline adds a small number of classifier tokens and saves a larger number of retrieval context tokens for the appropriate query types.
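
To show the shape of the routing step, here is a sketch with a deliberately naive keyword heuristic standing in for the classifier. In practice the classifier would more likely be a small model; the cue list, config values, and names below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    max_chunks: int
    min_score: float


# Hypothetical per-type settings; tune these against your own traffic.
CONFIGS = {
    "lookup": RetrievalConfig(max_chunks=2, min_score=0.80),
    "analytical": RetrievalConfig(max_chunks=10, min_score=0.60),
}

# Stand-in for a real classifier: phrases that suggest synthesis is needed.
ANALYTICAL_CUES = ("why", "compare", "summarize", "explain", "how does")


def classify_query(query: str) -> str:
    q = query.lower()
    return "analytical" if any(cue in q for cue in ANALYTICAL_CUES) else "lookup"


def retrieval_config_for(query: str) -> RetrievalConfig:
    return CONFIGS[classify_query(query)]


print(retrieval_config_for("What is the VPN hostname?"))
print(retrieval_config_for("Compare the 2023 and 2024 travel policies"))
```

The point of the structure is that the classifier can be swapped out without touching the retrieval code: only `classify_query` changes.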

Source 3: Response Length That Exceeds User Need

Generated response length is controllable. Without explicit length guidance, most language models default to responses that are longer than necessary: they elaborate on points that could be stated concisely, add caveats and qualifications that may not be relevant to the specific query, and provide context the user did not request.

For enterprise applications, explicit length guidance in the system prompt (specific instructions about response format and length, calibrated to actual user needs) reduces output token count substantially without reducing response quality. Users querying a knowledge base for a specific fact do not need a 500-word response. They need the fact and the source.

Structured output with defined schemas also reduces output waste. When the model generates a JSON object with defined fields rather than free-form prose, the output is bounded by the schema. Fields that are not relevant to a specific response are either empty or absent, rather than filled with generated prose that approximates their absence.
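
As an illustration, here is a sketch of a small answer schema and a parser that enforces it on the application side. The `Answer` fields and the `parse_answer` helper are hypothetical; how you ask the model to emit JSON in this shape depends on your provider's structured output features.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Answer:
    answer: str                   # the fact the user asked for
    source: str                   # where it came from
    caveat: Optional[str] = None  # absent unless genuinely needed


REQUIRED = {"answer", "source"}
ALLOWED = REQUIRED | {"caveat"}


def parse_answer(raw: str) -> Answer:
    """Parse model output and reject anything outside the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    extra = set(data) - ALLOWED
    missing = REQUIRED - set(data)
    if extra or missing:
        raise ValueError(f"unexpected={sorted(extra)} missing={sorted(missing)}")
    return Answer(**data)


parsed = parse_answer('{"answer": "14 days", "source": "HR policy, section 3"}')
print(parsed.caveat is None)  # True
```

Rejecting unknown fields is a design choice: it makes schema drift visible immediately instead of letting extra generated prose creep back into responses.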

Source 4: Model Tier Misalignment

Not all queries require the same model capability. A simple keyword extraction task does not require the same model as a complex multi-document synthesis. Using a frontier model for tasks that a smaller, faster, cheaper model handles equally well is the most expensive form of waste in high-volume AI deployments.

The pattern that works: a cascade architecture where queries are routed to the smallest model capable of handling them reliably. A fast, cheap model handles simple tasks (classification, extraction, formatting, lookup), and a query is escalated to a more capable model when the small model's confidence falls below a threshold.
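
A minimal sketch of the cascade, with stub functions in place of real model calls. How you obtain a confidence score is an assumption here (token log-probabilities, a verifier, or a self-rating are all options), and the `0.8` threshold and all names are placeholders.

```python
from typing import Callable, NamedTuple, Tuple


class ModelResult(NamedTuple):
    text: str
    confidence: float  # assumed to be in [0, 1]; how it is computed is up to you


Model = Callable[[str], ModelResult]


def cascade(query: str, small: Model, large: Model,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try the small model first; escalate only when it is not confident.
    Returns (answer_text, tier_used)."""
    first = small(query)
    if first.confidence >= threshold:
        return first.text, "small"
    return large(query).text, "large"


# Stubs standing in for real model calls.
def small_model(query: str) -> ModelResult:
    confident = len(query.split()) <= 6  # toy rule for the demo only
    return ModelResult("small answer", 0.95 if confident else 0.40)


def large_model(query: str) -> ModelResult:
    return ModelResult("large answer", 0.99)


print(cascade("Extract the invoice number", small_model, large_model))
print(cascade("Synthesize the findings of all five audit reports from last year",
              small_model, large_model))
```

Note that an escalated query pays for both calls, so the threshold has a direct cost: set it too high and the cascade becomes more expensive than calling the large model directly.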

Implementing this requires an evaluation step: running a sample of your query distribution against both model tiers and measuring quality on each task type. The result is a routing policy based on observed quality differences, not assumptions about which tasks are "simple" or "complex."
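
One way to turn that evaluation into a routing policy is sketched below. It assumes you already have a quality score in [0, 1] for each sampled query on both tiers; the function name, the tuple layout, and the `max_quality_gap` tolerance are hypothetical.

```python
from collections import defaultdict


def build_routing_policy(eval_rows, max_quality_gap=0.02):
    """Choose a model tier per task type from observed quality.

    `eval_rows` is an iterable of (task_type, small_score, large_score).
    A task type is routed to the small model when the large model's mean
    score beats the small model's by no more than `max_quality_gap`.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # small sum, large sum, count
    for task_type, small_score, large_score in eval_rows:
        entry = totals[task_type]
        entry[0] += small_score
        entry[1] += large_score
        entry[2] += 1

    policy = {}
    for task_type, (small_sum, large_sum, count) in totals.items():
        gap = (large_sum - small_sum) / count
        policy[task_type] = "small" if gap <= max_quality_gap else "large"
    return policy


rows = [
    ("extraction", 0.95, 0.96),
    ("extraction", 0.97, 0.97),
    ("synthesis", 0.60, 0.90),
    ("synthesis", 0.70, 0.88),
]
print(build_routing_policy(rows))
```

With the sample rows above, extraction routes to the small model (mean gap 0.005) and synthesis to the large one (mean gap 0.24).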

For organizations running self-hosted inference, model quantization provides a related optimization: a quantized version of a large model can handle most tasks with quality comparable to the full-precision model at significantly lower compute cost. The trade-off is worth evaluating empirically rather than assuming that quantization always degrades quality.

Source 5: Logging and Monitoring Overhead

For organizations using external AI APIs, logging full prompts and responses for debugging and compliance purposes creates a secondary cost: the storage and processing of token-volume data. For high-volume deployments, this can be significant.

Sampled logging, capturing full prompts and responses for a percentage of requests rather than all requests, reduces storage cost proportionally while maintaining sufficient data for debugging and quality monitoring. Compression of stored logs provides additional savings.
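
A sketch of both ideas using only the standard library. Sampling is done by hashing the request ID, which is an assumption on my part rather than a requirement: it makes the decision deterministic, so every service that sees the same request agrees on whether to log it. Function names are hypothetical.

```python
import gzip
import hashlib
import json


def should_log(request_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of requests."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def encode_record(request_id: str, prompt: str, response: str) -> bytes:
    """Serialize one log record as gzip-compressed JSON."""
    record = {"request_id": request_id, "prompt": prompt, "response": response}
    return gzip.compress(json.dumps(record).encode("utf-8"))


def decode_record(blob: bytes) -> dict:
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Log about 10% of requests.
ids = [f"req-{i}" for i in range(10_000)]
sampled = [rid for rid in ids if should_log(rid, 0.10)]
print(len(sampled))

# Prompts are repetitive text, so they tend to compress well.
prompt = "You are a support assistant. " * 200
blob = encode_record("req-1", prompt, "Reset it from the VPN portal.")
print(len(blob) < len(prompt))
```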

For compliance requirements that mandate full audit trails, there is a design option that eliminates the secondary cost entirely: keeping data on-premises. A self-hosted deployment logs inference data to internal storage, where the marginal cost of storage is substantially lower than cloud storage for high-volume log data, and where the compliance requirement is satisfied without third-party data transfer.

Putting It Together: A Cost Optimization Sequence

The sequence that produces the best results starts with the highest-leverage interventions.

Instrument your current spend by cost category before optimizing. Measure input tokens, output tokens, and model tier usage separately. Identify which of the five sources above represents the largest fraction of your current cost.
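
As a starting point for that instrumentation, here is a sketch that aggregates spend by model tier and token type. The prices and the usage record fields are made up for illustration; substitute your provider's real rates and whatever your request logs actually contain.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens. Replace with real rates.
PRICE_PER_MILLION = {
    ("small", "input"): 0.25,
    ("small", "output"): 1.25,
    ("large", "input"): 3.00,
    ("large", "output"): 15.00,
}


def spend_by_category(usage_records):
    """Sum cost per (tier, token kind) from per-request usage records.

    Each record is a dict with "tier", "input_tokens", and "output_tokens".
    """
    totals = defaultdict(float)
    for record in usage_records:
        tier = record["tier"]
        for kind in ("input", "output"):
            tokens = record.get(f"{kind}_tokens", 0)
            totals[(tier, kind)] += tokens * PRICE_PER_MILLION[(tier, kind)] / 1_000_000
    return dict(totals)


usage = [
    {"tier": "large", "input_tokens": 1_000_000, "output_tokens": 200_000},
    {"tier": "small", "input_tokens": 4_000_000, "output_tokens": 400_000},
]
report = spend_by_category(usage)
for category, cost in sorted(report.items(), key=lambda kv: kv[1], reverse=True):
    print(category, round(cost, 2))
```

Sorting by cost makes the first optimization target obvious: if input tokens on the large tier dominate, caching and retrieval come first; if output tokens dominate, start with response length.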

Implement prompt caching first if you have significant stable prompt content. This is high-leverage, low-risk, and requires only structural changes to prompt assembly.

Audit retrieval configuration and implement adaptive retrieval. Measure the reduction in average retrieved context per query.

Add response length guidance to the system prompt and measure output token reduction.

Implement model routing if query volume is high enough to justify the engineering investment. The routing logic and evaluation framework have non-trivial development cost that only pays back at sufficient scale.

Evaluate quantization for self-hosted deployments after other optimizations are in place.

The organizations that run this sequence systematically typically find 30 to 50 percent cost reduction available before any quality trade-offs are required. Quality trade-offs, when they are genuinely required, can then be evaluated against a cost baseline that has already been substantially reduced.
