
Fix Your Prompt Structure Before You Touch Your Infrastructure

Most engineering teams treat LLM inference costs as an infrastructure problem. They evaluate model quantization, shop for cheaper GPU rentals, debate whether to move from GPT-4o to Claude Sonnet, and benchmark open-source alternatives. I have watched teams spend weeks on this and save fifteen percent. The same teams were running their system prompts with a timestamp in the first line and paying full token price on every single request.

The optimization I am talking about is prompt caching. Anthropic charges $0.30 per million tokens for cache reads versus $3.00 per million for fresh input tokens — a 10x price difference for bytes the model already processed in the last hour. OpenAI applies automatic 50% discounts on cached tokens. The savings are not theoretical. They compound over every request your system makes, and most teams are not capturing them because they are breaking the cache themselves.

What cache-busting actually looks like

Caching works by hashing the prefix of your prompt. If the hash matches a recent request, you pay the cheap rate. If it does not, you pay full price on the entire prefix.

The failure mode is deceptively simple: any dynamic content in your system prompt breaks the hash. A current timestamp. A user ID. A session context block. A "today's date is April 30, 2026" string you added because the model kept getting dates wrong. If any of these changes between requests, the cache hit rate drops to near zero and you pay the full $3.00/M on every input token.
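Here is a toy sketch of the mechanism, with hashlib standing in for the provider's internal prefix lookup (which is not public): one dynamic line at the top is enough to make every request miss.

```python
import hashlib
from datetime import date

STATIC_PROMPT = "You are a support agent.\n<persona, rules, tool docs, examples...>"

def prefix_hash(prompt: str, prefix_len: int = 64) -> str:
    # Stand-in for the provider-side cache key: caching matches on
    # the leading span of the prompt, not the whole thing.
    return hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()[:12]

# Timestamp first: the leading bytes differ between requests, so the
# prefix never matches a cached entry and you pay $3.00/M every time.
busted = f"Today's date is {date.today()}.\n" + STATIC_PROMPT

# Timestamp last: the leading bytes are identical on every request,
# so the prefix stays cacheable at $0.30/M.
stable = STATIC_PROMPT + f"\nToday's date is {date.today()}."

print(prefix_hash(busted))  # changes every day
print(prefix_hash(stable))  # constant across requests
```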

I have seen this specific mistake in every LLM system I have audited that started as a quick prototype. The system prompt grows organically — someone adds a date, someone adds a user's account tier, someone adds a "recent conversation summary" block — and by the time the system is in production, the cacheable prefix is maybe the first hundred tokens of a twenty-thousand-token prompt. Cache hit rate: seven percent.

The ProjectDiscovery case

ProjectDiscovery's engineering team published a detailed breakdown of exactly this problem in early 2025. Their security agent Neo runs an average of 26 steps per task with roughly 40 tool calls. Each step sent a prompt that included a 20,000-token system prompt — 2,500 lines of YAML, tool definitions, and runtime state including working memory and skills context that changed every step.

Their initial cache hit rate: 7%.

The fix was structural. They moved dynamic content — working memory, runtime variables, skills context — out of the system prompt and into a user message appended at the tail of the conversation. The static system prompt stayed static. The dynamic state moved to the only place it should have been: after the stable prefix.
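In request form, the restructuring looks roughly like this (field names such as working_memory and skills_context are illustrative, not ProjectDiscovery's actual schema):

```python
import json

STATIC_INSTRUCTIONS = "<persona, rules, output format: ships with releases, not per step>"

def render(state: dict) -> str:
    # Deterministic serialization so identical state produces identical bytes.
    return json.dumps(state, sort_keys=True)

working_memory = {"open_ports": [22, 443]}   # changes every step
skills_context = {"active": "port-scan"}     # changes every step
task = "Continue the assessment."

# Before: per-step state baked into the system prompt. The prefix
# differs on every one of the agent's ~26 steps, so every step pays
# full input price.
busted_system = "\n".join([STATIC_INSTRUCTIONS, render(working_memory), render(skills_context)])
busted_messages = [{"role": "user", "content": task}]

# After: the system prompt is byte-identical across steps, and the
# state rides in the final user message, after the stable prefix.
stable_system = STATIC_INSTRUCTIONS
stable_messages = [
    {"role": "user", "content": "\n".join([render(working_memory), render(skills_context), task])},
]
```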

Cache hit rate after the change: 74% within the same deployment cycle, 84% by mid-March 2025. Total cost reduction: 59% compared to baseline. Over the six weeks following deployment, they served 9.8 billion input tokens from cache rather than paying full price for them.

That last number is worth sitting with. 9.8 billion tokens at $0.30/M instead of $3.00/M. The engineering work took days.

The structural rule

Everything static goes first. Everything dynamic goes last.

Your system prompt — instructions, persona, output format, tool definitions — is static. It changes when you ship a new version, not on every request. Mark it as cacheable and never mix runtime state into it. Your dynamic content — user context, current date, session variables, retrieved chunks from RAG — goes into the user message at the end of the conversation. This is where the model expects context to live anyway.
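With the Anthropic SDK, the rule translates into one cache_control breakpoint on the static system block, with everything dynamic in the user message. A minimal sketch; the model name and prompt contents are placeholders:

```python
from datetime import date

import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM = "<instructions, persona, output format, tool usage rules>"

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder; use your model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM,
            # Cache breakpoint: everything up to and including this
            # block is the stable, cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            # Dynamic content lives after the cached prefix.
            "content": f"Today's date is {date.today()}.\n"
                       "Account tier: pro\n"
                       "<retrieved RAG chunks>\n"
                       "<the actual question>",
        }
    ],
)
```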

Tool definitions deserve a specific note. If your tool list is partly static (the core tools your agent always has) and partly dynamic (tools you inject based on user permissions or task context), sort the static tools deterministically and place them before the dynamic ones. ProjectDiscovery made tool definitions their second cache breakpoint, keeping a 1-hour TTL on the stable portion even in conversations with changing tool sets. Sorting a list of tool names before your agent runs costs nothing; the cache savings compound over every task.
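A sketch of that ordering with the Anthropic tools parameter (tool entries abbreviated and invented; the 1-hour TTL is Anthropic's extended-TTL option, which sits behind a beta header at the time of writing):

```python
STATIC_TOOLS = [
    {"name": "read_file", "description": "...", "input_schema": {"type": "object"}},
    {"name": "run_scan", "description": "...", "input_schema": {"type": "object"}},
]

def build_tools(dynamic_tools: list[dict]) -> list[dict]:
    # Deterministic order keeps the serialized tool block
    # byte-identical across requests.
    tools = [dict(t) for t in sorted(STATIC_TOOLS, key=lambda t: t["name"])]
    # Cache breakpoint on the last static tool: everything up to and
    # including it stays cached even when the dynamic tail varies.
    tools[-1]["cache_control"] = {"type": "ephemeral", "ttl": "1h"}
    return tools + dynamic_tools
```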

How to tell if you are affected

Pull your Anthropic usage metrics for the last seven days. If your cache read token rate is below 40% of total input tokens, your prompts are almost certainly structured wrong. Below 20%, something dynamic is sitting in the system prompt itself: probably a date, a user attribute, or a context block that varies per session.
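The same check works per response from the usage block the Messages API returns; a rough sketch:

```python
def cache_read_rate(usage) -> float:
    # usage is response.usage from the Anthropic Messages API.
    # Total input = uncached tokens + tokens written to the cache
    # + tokens read from it; we want the read share.
    total = (
        usage.input_tokens
        + usage.cache_creation_input_tokens
        + usage.cache_read_input_tokens
    )
    return usage.cache_read_input_tokens / total if total else 0.0
```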

For OpenAI, automatic caching applies the 50% discount whenever a matching prefix exists, so the failure mode is less visible in billing. You are still missing cache hits; you just do not see the rate in your dashboard without explicitly measuring prefix stability.
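You can recover it per response, though: the Chat Completions usage object reports cached tokens under prompt_tokens_details (a sketch, assuming the current SDK shape):

```python
def openai_cached_fraction(usage) -> float:
    # usage is response.usage from a Chat Completions call; cached
    # tokens are the part of the prompt billed at the 50% discount.
    details = usage.prompt_tokens_details
    cached = details.cached_tokens if details else 0
    return cached / usage.prompt_tokens if usage.prompt_tokens else 0.0
```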

Where this sits in the optimization stack

Before re-ranking. Before switching embedding models. Before evaluating managed vector databases. Before any model fine-tuning. The ZenML survey of 1,200 production LLM deployments cites Care Access achieving 86% cost reduction through prompt caching, and Riskspan cutting per-deal processing costs by 90x through LLM optimization. Both numbers are large enough to sound inflated, but the mechanism is reliable: if you move from a 7% cache hit rate to a 74% cache hit rate on a 20,000-token prompt, you cut the effective price of those input tokens by roughly two-thirds.
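The arithmetic, ignoring the one-time cache-write premium:

```python
FRESH, CACHED = 3.00, 0.30  # USD per million input tokens (Anthropic)

def effective_price(hit_rate: float) -> float:
    # Blended price per million input tokens at a given cache hit rate.
    return hit_rate * CACHED + (1 - hit_rate) * FRESH

print(effective_price(0.07))  # ~2.81: near-zero caching barely helps
print(effective_price(0.74))  # ~1.00: ProjectDiscovery after the fix
print(effective_price(0.84))  # ~0.73: their mid-March rate
```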

The audit your system needs first is not an architecture review. It is reading your own system prompt and asking which lines change between requests. Move those lines to the bottom. The savings will show up in your billing dashboard before the end of the week.
