Qss Technosoft

Posted on May 18

Cut Your LLM Costs by 90% With Prompt Caching (And Why Most Developers Don't)

#ai #llm #claude #costoptimization

You're Building an AI Feature. Then the Bill Arrives.

You're building an AI-powered feature.

Your Claude API bill arrives.

It's $2,400/month higher than expected.

The problem isn't your code.

It's that you're recomputing the same system prompts, tool definitions, and context across thousands of API calls.

This is exactly the problem prompt caching solves — and it can cut LLM costs by up to 90%.

We learned this the hard way at QSS Technosoft while building healthcare AI systems.

Here's what you need to know.

The Problem: You're Paying for Repetition

When you call an LLM API, the entire prompt is processed token-by-token every time.

If you have:
2,000-token system prompt
500-token tool definitions
300-token context instructions

That's 2,800 tokens processed for every request, even if those tokens never change.

Now multiply that by 1,000 API calls per day.

You are processing:
2.8 million tokens per day just to repeat the same system prompt.

At Claude's per-token input pricing, this compounds to more than a thousand dollars a month.

The Math
2,800 cached tokens
× 1,000 requests per day
× 30 days

= 84 million input tokens per month

Without caching: ~$1,260/month (at $15 per million input tokens)
With caching: ~$126/month (cached reads billed at 10% of the input rate)

Savings: ~90%
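
The arithmetic above fits in a few lines of Python. This is a rough sketch, not a billing tool: the $15 per million input tokens rate and the 10% cached-read rate are assumptions inferred from the figures above, and the function name is illustrative. It ignores the one-time cache write premium and output tokens.

```python
def monthly_input_cost(tokens_per_request, requests_per_day, price_per_mtok,
                       days=30, cached_read_multiplier=1.0):
    """Estimate the monthly cost (USD) of sending the same prefix on every request."""
    total_tokens = tokens_per_request * requests_per_day * days
    return total_tokens / 1_000_000 * price_per_mtok * cached_read_multiplier


PRICE_PER_MTOK = 15.00  # assumed input price; check your model's current rate

uncached = monthly_input_cost(2_800, 1_000, PRICE_PER_MTOK)
cached = monthly_input_cost(2_800, 1_000, PRICE_PER_MTOK,
                            cached_read_multiplier=0.10)  # assumed read discount
savings_pct = (1 - cached / uncached) * 100

print(f"Without caching: ${uncached:,.2f}/month")
print(f"With caching:    ${cached:,.2f}/month")
print(f"Savings:         {savings_pct:.0f}%")
```

Swap in your own token counts and rates to see what the static part of your prompt costs you.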

What Prompt Caching Actually Is

Prompt caching (also called prefix caching) works like HTTP caching, but for LLM computation.

When you send a prompt to Claude with caching enabled:

First Request
Claude:

Processes the full prompt
Creates a cache key (hash of static content)

Subsequent Requests

Claude:
Recognizes the cached prefix
Skips recomputation
Processes only the new tokens

Result

Faster response times
Up to 90% cost reduction on cached tokens
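
To make the idea concrete, here is a toy model of the lookup described above. It is purely illustrative and not how Anthropic implements caching: the class name and the use of SHA-256 are assumptions. It only shows why an identical prefix hits and a changed one misses.

```python
import hashlib


class ToyPrefixCache:
    """Illustrative only: remembers a hash of every static prefix it has seen."""

    def __init__(self):
        self._seen = set()

    def lookup(self, static_prefix: str) -> str:
        key = hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()
        if key in self._seen:
            return "hit"    # prefix already processed, work can be reused
        self._seen.add(key)
        return "miss"       # first time: process fully and remember the key


cache = ToyPrefixCache()
prefix = "You are a clinical decision support AI."

first = cache.lookup(prefix)          # first request
second = cache.lookup(prefix)         # identical prefix
changed = cache.lookup(prefix + " ")  # one extra character

print(first, second, changed)  # prints: miss hit miss
```

The last call is the important one: a single extra space produces a different key, so the prefix is processed from scratch.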

How It Works (Code Example)

Setting Up Prompt Caching with Claude API
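    import anthropic

    client = anthropic.Anthropic(api_key="your-api-key")  # placeholder key

    # Note: prefixes shorter than the model's minimum cacheable length
    # are not cached, so a real system prompt needs to be longer than this.
    system_prompt = """You are a clinical decision support AI.
    You have access to patient records, lab results, and clinical history.
    Always cite source data when making recommendations.
    Follow HIPAA guidelines for all responses.
    Prioritize patient safety over speed.
    """

    tool_definitions = [
        {
            "name": "search_patient_records",
            "description": "Search patient medical history",
            "input_schema": {...},  # elided here; supply a real JSON Schema
        },
        {
            "name": "get_lab_results",
            "description": "Retrieve lab test results",
            "input_schema": {...},  # elided here; supply a real JSON Schema
            # A breakpoint on the last tool caches the whole tools array.
            "cache_control": {"type": "ephemeral"},
        },
    ]

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        # Pass tools through the tools parameter (not as prompt text)
        # so the model can actually call them.
        tools=tool_definitions,
        system=[
            {
                "type": "text",
                "text": system_prompt,
                # Caches everything up to and including the system prompt.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {
                "role": "user",
                "content": "Analyze patient ABC123's recent lab results",
            }
        ],
    )

    # Cache counters are reported on the usage object.
    print(response.usage)

What You Get Back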

First Request
Cache creation tokens: 2800
Cache read tokens: 0

Second Request
Cache creation tokens: 0
Cache read tokens: 2800
Regular input tokens: 42

Only the user query gets recomputed.
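
A small helper makes these counters easier to read in logs. This is a sketch: the function name is made up, and the `SimpleNamespace` objects are stand-ins for `response.usage` using the numbers above (the 42 input tokens on the first request are assumed). The field names match the usage fields the Anthropic SDK reports.

```python
from types import SimpleNamespace


def describe_cache_usage(usage):
    """Summarize whether a response wrote to, read from, or skipped the cache."""
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read > 0:
        return f"cache hit ({read} tokens read)"
    if created > 0:
        return f"cache write ({created} tokens stored)"
    return "no caching"


# Stand-ins for response.usage on the first and second request.
first_usage = SimpleNamespace(cache_creation_input_tokens=2800,
                              cache_read_input_tokens=0, input_tokens=42)
second_usage = SimpleNamespace(cache_creation_input_tokens=0,
                               cache_read_input_tokens=2800, input_tokens=42)

print(describe_cache_usage(first_usage))   # prints: cache write (2800 tokens stored)
print(describe_cache_usage(second_usage))  # prints: cache hit (2800 tokens read)
```

If every request reports "no caching", the prefix is probably changing between calls or is too short to cache.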

Why Most Developers Don't Use Prompt Caching

1. It's Not Enabled by Default

Developers must explicitly add this to a content block:

"cache_control": {"type": "ephemeral"}

Many developers don't know this feature exists.

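2. The Cache Lifecycle Confuses People

Claude's API has a single cache type, ephemeral, with two lifetimes:

Default lifetime

Lives for 5 minutes, refreshed each time the entry is read

Extended lifetime

Lives for 1 hour, set with "ttl": "1h" in cache_control, at a higher cache-write price

Developers often choose the wrong lifetime for their traffic pattern.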

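3. Cache Invalidation is Hard

If your system prompt changes, even by a single character, the cached prefix no longer matches and the next request pays for a fresh cache write.

There is no manual invalidation step:

The old entry simply expires
Hidden changes (a timestamp or request ID inside the prompt) silently defeat the cache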

Best Practices for Prompt Caching

1. Cache Static Content
Cache elements that never change, such as:

System prompts
Tool definitions
Instruction frameworks

Example:

    {
        "type": "text",
        "text": "You are a customer support AI...",
        "cache_control": {"type": "ephemeral"}
    }

2. Put Dynamic Content at the End
Prompt caching works using prefix matching.

Wrong Structure

User query
System prompt
Context

Correct Structure

System prompt (cached)
Context (cached if static)
User query (dynamic)
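
Here is one way to enforce that ordering in code. Treat it as a sketch: `build_request` is a hypothetical helper, and it assumes the static context really is identical across requests. The cache breakpoint sits on the last static block, so everything before it is part of the cached prefix.

```python
def build_request(system_prompt, static_context, user_query):
    """Put the stable prefix first and the per-request query last."""
    return {
        "system": [
            {"type": "text", "text": system_prompt},
            {
                "type": "text",
                "text": static_context,
                # Breakpoint on the last static block: the prefix up to
                # and including this block is what gets cached.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Dynamic content goes after the cached prefix.
        "messages": [{"role": "user", "content": user_query}],
    }


req_a = build_request("You are a support AI.", "Refund policy: ...", "Where is my order?")
req_b = build_request("You are a support AI.", "Refund policy: ...", "Cancel my plan.")

# The prefix is identical across requests; only the query differs.
print(req_a["system"] == req_b["system"])      # prints: True
print(req_a["messages"] == req_b["messages"])  # prints: False
```

The returned dict can be unpacked into `client.messages.create(model=..., max_tokens=..., **req_a)`.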

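3. Monitor Cache Hit Rates

Always track cache metrics.

    usage = response.usage
    cached = usage.cache_read_input_tokens
    # input_tokens excludes cached tokens, so count cache writes and reads too.
    total = cached + usage.cache_creation_input_tokens + usage.input_tokens
    cache_hit_rate = cached / total if total else 0.0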

Target:

60%+ hit rate on stable workloads

If you're under 30%, your caching strategy needs tuning.
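
A single response tells you little, so it helps to aggregate over many calls. The sketch below assumes you log each response's usage as a plain dict (the function name and sample data are illustrative). It counts cache-write tokens in the denominator, so cold requests pull the rate down as they should.

```python
def aggregate_hit_rate(usages):
    """Share of all input tokens that were served from the cache."""
    read = sum(u["cache_read_input_tokens"] for u in usages)
    created = sum(u["cache_creation_input_tokens"] for u in usages)
    uncached = sum(u["input_tokens"] for u in usages)
    total = read + created + uncached
    return read / total if total else 0.0


# Sample log: one cold request that writes the cache, then nine warm ones.
cold = {"cache_creation_input_tokens": 2800, "cache_read_input_tokens": 0, "input_tokens": 42}
warm = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2800, "input_tokens": 42}
log = [cold] + [warm] * 9

rate = aggregate_hit_rate(log)
print(f"Cache hit rate: {rate:.1%}")
if rate < 0.30:
    print("Under 30%: check for a changing prefix or requests spaced too far apart")
```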

4. Use the 5-Minute Lifetime for APIs, the 1-Hour Lifetime for Batch Jobs

5-minute lifetime (default)

Best for:

API endpoints
High-frequency requests

1-hour lifetime

Best for:

Batch processing
Long-running workflows where requests arrive more than 5 minutes apart
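
Which lifetime fits depends on how far apart your requests arrive. Here is a minimal simulation, assuming the entry's lifetime restarts on every request that touches it. The function name is made up, and the 300 and 3,600 second lifetimes are example values.

```python
def count_cache_hits(arrival_times, ttl_seconds):
    """Count cache hits for requests arriving at the given times (in seconds)."""
    hits = 0
    expires_at = None
    for t in sorted(arrival_times):
        if expires_at is not None and t <= expires_at:
            hits += 1
        # A hit refreshes the entry; a miss writes a new one.
        expires_at = t + ttl_seconds
    return hits


steady = [i * 60 for i in range(10)]   # one request per minute
sparse = [i * 600 for i in range(10)]  # one request every 10 minutes

print(count_cache_hits(steady, 300))   # prints: 9
print(count_cache_hits(sparse, 300))   # prints: 0
print(count_cache_hits(sparse, 3600))  # prints: 9
```

With a short lifetime the sparse workload never hits the cache, so it pays for a cache write on every request. A longer lifetime recovers the hits.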

Real-World Cost Example

Scenario

Healthcare AI agent processing 10,000 patient queries/day

Without Caching
Per request tokens:

System prompt: 2,000
Tool definitions: 500
Patient context: 1,500
User query: 50

Total: 4,050 tokens/request

Monthly cost (300,000 requests at $3 per million input tokens):

$3,645

With Caching

Cached tokens:

System prompt: 2,000
Tool definitions: 500

Total cached: 2,500 tokens

Remaining per request:

Patient context: 1,500
Query: 50

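Monthly cost (cached tokens billed at 10% of the $3 per million input rate):

Cached reads: $225
Uncached input: $1,395

Total: $1,620

Savings: $2,025/month (about 56%)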

When NOT to Use Prompt Caching

Prompt caching isn't always useful.

Avoid it for:

Highly dynamic prompts
Low-volume applications (<100 requests/day), where requests arrive too far apart to hit the cache before it expires
One-off tasks
Systems requiring extremely tight real-time responses
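
The low-volume case comes down to a break-even calculation, because writing to the cache costs more than a normal read. The sketch below assumes a 1.25x write multiplier and a 0.10x read multiplier, in line with Anthropic's published rates for the 5-minute cache at the time of writing. Check current pricing before relying on them. The function name is illustrative.

```python
def caching_cost_ratio(requests_in_window, write_multiplier=1.25, read_multiplier=0.10):
    """Cost of the cached prefix with caching, relative to no caching,
    for n requests that all land within one cache lifetime."""
    if requests_in_window < 1:
        raise ValueError("need at least one request")
    with_cache = write_multiplier + read_multiplier * (requests_in_window - 1)
    without_cache = requests_in_window
    return with_cache / without_cache


for n in (1, 2, 10):
    print(n, round(caching_cost_ratio(n), 3))
# prints:
# 1 1.25
# 2 0.675
# 10 0.215
```

Under these assumptions, a prefix that is written once and never read costs 25% more than not caching at all. Caching pays for itself from the second request inside the cache lifetime.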

The Bigger Lesson: Treat LLMs Like an API Gateway

Prompt caching isn't just a cost optimization trick.

It's a core infrastructure design principle.

Think of LLM calls like API requests:

Cache expensive static content
Recompute dynamic data
Monitor cache hit rates
Version prompt changes

This mindset becomes critical when building agentic workflows that orchestrate multiple LLM calls.

Tools That Help Implement Prompt Caching

If you want caching without building everything manually:

Helicone: drop-in proxy with LLM caching
Anthropic SDK: built-in cache control
LangChain: prompt caching in agent loops
Cloudflare Workers AI: server-side caching layer

Next Steps

If you're running LLM workloads today, start with these steps:

Audit your prompts — identify static tokens
Enable caching using cache_control
Monitor metrics like cache_read_input_tokens
Measure savings month-over-month

If you're processing 1,000+ LLM requests/day, prompt caching can save hundreds or thousands of dollars per month.

You just need to turn it on.

Have You Implemented Prompt Caching?

I'd love to hear from other developers:

What cache hit rate did you achieve?
How much did your LLM bill drop?
What challenges did you face?

Drop your experience in the comments.

About QSS Technosoft

QSS Technosoft builds production AI and healthcare systems at scale.

Our team has implemented Claude-based workflows across:

Clinical decision support
Diagnostic imaging systems
Enterprise healthcare integrations

One lesson we've learned repeatedly:
Prompt caching alone can save $50K+ annually on LLM infrastructure costs.
