Originally published on ScaleMind Blog
An AI gateway is the fastest path to LLM cost optimization, but you can start cutting costs in minutes with these five tested strategies for 2026.
TL;DR
If you're spending over $1k/month on LLM APIs without optimization, you're overpaying by at least 40%. This guide covers five strategies you can implement today:
| Strategy | Time to Implement | Expected Savings |
|---|---|---|
| Prompt Caching | 10 minutes | 50-90% on cached tokens |
| Model Routing | 2-4 hours | 20-60% |
| Semantic Caching | 1-2 hours | 15-30% |
| Batch Processing | 30 minutes | 50% on async workloads |
| AI Gateway | 5 minutes | 40-70% combined |
Why Are LLM Costs Spiraling Out of Control?
LLM costs scale linearly with usage, but most teams don't notice the bleeding until it's too late. The problem isn't model pricing; it's routing every request to the most expensive model regardless of task complexity.
Here's the current cost landscape for 1 million input tokens (December 2025):
| Model Family | Model | Cost / 1M Input | Best For |
|---|---|---|---|
| Frontier | GPT-4o | $2.50 | Complex reasoning, coding |
| Frontier | Claude 3.5 Sonnet | $3.00 | Nuanced writing, RAG |
| Efficient | GPT-4o Mini | $0.15 | Summarization, classification |
| Efficient | Claude 3.5 Haiku | $0.80 | Speed-critical tasks |
| Ultra-Low | Gemini 1.5 Flash-8B | $0.0375 | Bulk processing, extraction |
Sources: OpenAI Pricing, Anthropic Pricing, Google AI Pricing
The math is brutal: processing 100M tokens/month with Claude 3.5 Sonnet costs ~$300 in input tokens alone. Route 50% of that traffic to Gemini 1.5 Flash-8B and that portion drops from $150 to $1.88. The savings compound fast.
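A quick way to sanity-check the math for your own traffic, as a minimal sketch using the December 2025 input prices from the table above (swap in your own token volumes and routed fraction):

```python
# Input-token prices per 1M tokens (December 2025, from the table above)
PRICE_SONNET = 3.00      # Claude 3.5 Sonnet
PRICE_FLASH_8B = 0.0375  # Gemini 1.5 Flash-8B

monthly_tokens_m = 100   # 100M input tokens per month
routed_fraction = 0.5    # share of traffic moved to the cheaper model

before = monthly_tokens_m * PRICE_SONNET
after = (monthly_tokens_m * (1 - routed_fraction) * PRICE_SONNET
         + monthly_tokens_m * routed_fraction * PRICE_FLASH_8B)

print(f"Before:  ${before:.2f}/month")            # $300.00
print(f"After:   ${after:.2f}/month")             # $151.88
print(f"Savings: {(1 - after / before):.0%}")     # 49%
```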
This guide is for engineering teams ready to stop the bleeding today, not next quarter.
How Does Prompt Caching Reduce LLM Costs?
Prompt caching stores frequently-used context (system prompts, RAG documents, few-shot examples) so you don't pay full price every time you resend them. Both OpenAI and Anthropic offer native prompt caching with substantial discounts.
When it works best:
- Chatbots - System prompt + conversation history sent repeatedly
- RAG applications - Same documents analyzed across multiple queries
- Coding assistants - Full codebase context included with every request
Implementation: Anthropic Prompt Caching
```python
import anthropic

client = anthropic.Anthropic()
# large_context and query are placeholders for your long document and the user's question

# Before: paying full price for the same 10k-token context every request
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are a legal analyst assistant. " + large_context,
    messages=[{"role": "user", "content": query}]
)

# After: ~90% discount on cached tokens
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal analyst assistant.",
            "cache_control": {"type": "ephemeral"}  # Cache this block
        },
        {
            "type": "text",
            "text": large_context,  # Your 10k-token document
            "cache_control": {"type": "ephemeral"}  # Cached at ~90% discount
        }
    ],
    messages=[{"role": "user", "content": query}]
)

# Check your savings
print(f"Cache creation: {response.usage.cache_creation_input_tokens} tokens")
print(f"Cache read: {response.usage.cache_read_input_tokens} tokens (90% off)")
```
Savings breakdown:
| Provider | Cache Discount | Notes |
|---|---|---|
| Anthropic | ~90% on reads | Requires explicit cache_control blocks |
| OpenAI | ~50% on reads | Automatic for prompts > 1,024 tokens |
Time to implement: 10 minutes for existing prompts.
See the Anthropic Prompt Caching docs for minimum token requirements and TTL details.
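On the OpenAI side caching is automatic, but it's worth confirming it's actually kicking in. A minimal sketch, assuming a recent openai Python SDK that exposes prompt_tokens_details in the usage object, and reusing the large_context and query placeholders from the snippet above:

```python
from openai import OpenAI

client = OpenAI()

# Keep the long, shared context at the start of the prompt; OpenAI caches
# the common prefix automatically once it exceeds ~1,024 tokens.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a legal analyst assistant. " + large_context},
        {"role": "user", "content": query},
    ],
)

usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens  # 0 on the first call, then the cached prefix size
print(f"Prompt tokens: {usage.prompt_tokens}, cached (50% off): {cached}")
```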
How Does Model Routing Cut Costs Without Sacrificing Quality?
Model routing directs each request to the cheapest model capable of handling it. Using GPT-4o for password reset instructions is like hiring a PhD to answer the phone: expensive and unnecessary.
The routing principle:
| Task Type | Use This Model | Why |
|---|---|---|
| Creative writing, complex reasoning, coding | GPT-4o, Claude 3.5 Sonnet | Requires frontier intelligence |
| Summarization, classification, extraction | GPT-4o Mini, Haiku | 10-20x cheaper, same quality |
| Bulk data processing | Gemini Flash, open-source | Sub-penny per request |
Implementation: Basic Complexity Router
```python
from openai import OpenAI

client = OpenAI()

def classify_complexity(prompt: str) -> str:
    """
    Simple heuristic router. Production systems often use:
    - A small classifier model (BERT, DistilBERT)
    - Keyword/regex matching
    - Token count thresholds
    """
    complexity_signals = ["code", "reason", "analyze", "compare", "debug"]
    if len(prompt) > 2000:
        return "complex"
    if any(signal in prompt.lower() for signal in complexity_signals):
        return "complex"
    return "simple"

def route_request(prompt: str):
    complexity = classify_complexity(prompt)
    if complexity == "simple":
        # 94% cheaper than GPT-4o
        return client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}]
        )
    else:
        return client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )

# Real-world scenario:
# - 70% of requests are simple (FAQ, summaries, classification)
# - 30% require frontier models
#
# Cost calculation (per 1M input tokens):
# Before: 100% at $2.50 = $2.50 avg
# After:  (0.7 × $0.15) + (0.3 × $2.50) = $0.855 avg
# Savings: ~66%
```
Expected savings: 20-60% depending on your traffic mix. Most B2B apps see 60-70% of requests routable to efficient models.
What is Semantic Caching and How Does It Work?
Semantic caching uses vector embeddings to recognize that "How do I reset my password?" and "I forgot my password, help!" are the same question. Instead of hitting the LLM again, it returns the cached response at zero API cost.
Standard Redis caching only works on exact string matches. Semantic caching works on meaning.
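Before reaching for a framework, it helps to see how little is going on under the hood. A minimal in-memory sketch with hypothetical helper names, assuming the OpenAI embeddings API and numpy; a real deployment would use a vector store instead of a Python list:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
_cache: list[tuple[np.ndarray, str]] = []  # (prompt embedding, cached response)

def _embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cached_completion(prompt: str, threshold: float = 0.9) -> str:
    """Return a cached answer if a semantically similar prompt was seen before."""
    query_vec = _embed(prompt)
    for vec, answer in _cache:
        similarity = float(np.dot(query_vec, vec) /
                           (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        if similarity >= threshold:  # close enough in meaning: skip the LLM call
            return answer
    # Cache miss: pay for one LLM call, then store the result
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content
    _cache.append((query_vec, answer))
    return answer
```

This is essentially what the LangChain setup below does for you, with Redis as the vector store and the threshold expressed as a distance score.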
Implementation: LangChain + Redis
```python
from langchain.cache import RedisSemanticCache  # langchain_community.cache in newer LangChain versions
from langchain.globals import set_llm_cache
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini")

# One-time setup
set_llm_cache(RedisSemanticCache(
    redis_url="redis://localhost:6379",
    embedding=OpenAIEmbeddings(),
    score_threshold=0.2  # Lower = stricter matching
))

# First request: hits API, costs money, stores response
response_1 = llm.invoke("What's the refund policy?")

# Second request: semantically similar, returns from cache ($0)
response_2 = llm.invoke("How can I get my money back?")

# Third request: different enough, hits API
response_3 = llm.invoke("What products do you sell?")
```
When semantic caching shines:
- Customer support bots (20-40% query overlap typical)
- FAQ-style applications
- Search result explanations
- Any high-repetition use case
Expected savings: If 20% of your queries are semantically similar, you save 20% immediately. The embedding lookup cost is negligible (~$0.02/1M tokens with text-embedding-3-small).
For a deeper dive, see our Semantic Caching for LLMs guide.
When Should You Use Batch Processing for LLM Requests?
Batch processing is the right choice for any workload that doesn't need real-time responses. OpenAI's Batch API offers a flat 50% discount for requests that can wait up to 24 hours (jobs often finish well before the deadline).
Ideal batch candidates:
- Nightly content generation
- Sentiment analysis on yesterday's support tickets
- Bulk document summarization
- Evaluation and testing pipelines
Implementation: OpenAI Batch API
```python
from openai import OpenAI

client = OpenAI()

# Step 1: Create a JSONL file with requests
# batch_requests.jsonl:
# {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize: ..."}]}}
# {"custom_id": "req-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize: ..."}]}}

# Step 2: Upload the file
batch_file = client.files.create(
    file=open("batch_requests.jsonl", "rb"),
    purpose="batch"
)

# Step 3: Create the batch job (50% discount tier)
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch {batch_job.id} created. Status: {batch_job.status}")

# GPT-4o batch pricing:
# - Standard: $2.50/1M input, $10.00/1M output
# - Batch:    $1.25/1M input, $5.00/1M output
# Savings: 50% flat
```
Expected savings: 50% on all eligible workloads. If 30% of your LLM usage is async, that's 15% off your total bill.
See OpenAI's Batch API cookbook for error handling and retrieval patterns.
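The cookbook covers error handling in depth, but the retrieval side is short enough to sketch here: poll the job, then download the output file once it completes. A minimal sketch, assuming the client and batch_job from the snippet above:

```python
import time

# Poll until the batch reaches a terminal state
while True:
    job = client.batches.retrieve(batch_job.id)
    if job.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

if job.status == "completed":
    # Results come back as a JSONL file, one line per custom_id
    output = client.files.content(job.output_file_id)
    with open("batch_results.jsonl", "wb") as f:
        f.write(output.read())
```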
What is an AI Gateway and Why Does It Matter?
An AI gateway is a proxy layer between your application and LLM providers that handles routing, caching, fallbacks, and cost optimization automatically. Instead of implementing the four strategies above separately, a gateway gives you all of them out of the box.
What an AI gateway handles:
| Feature | DIY Effort | With Gateway |
|---|---|---|
| Model routing | Custom classifier + routing logic | Config file |
| Prompt caching | Provider-specific implementation | Automatic |
| Semantic caching | Redis + embeddings + maintenance | Built-in |
| Failover (OpenAI down → Anthropic) | Complex error handling | Automatic |
| Cost tracking | Custom logging + dashboards | Real-time UI |
The tradeoff: You're adding a dependency. The benefit is shipping faster and not maintaining infrastructure that isn't your core product.
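To make the failover row concrete, here is roughly what the DIY version looks like. A minimal sketch using the official OpenAI and Anthropic SDKs; real gateways layer retries, timeouts, and health checks on top:

```python
import anthropic
from openai import OpenAI, OpenAIError

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def complete_with_fallback(prompt: str) -> str:
    try:
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    except OpenAIError:
        # OpenAI outage or rate limit: fall back to Anthropic
        response = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
```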
```python
# Before: direct OpenAI call (no caching, no fallback, no cost tracking)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

# After: ScaleMind gateway (same API, automatic optimization)
from scalemind import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # Routes to cheapest capable model
    messages=[{"role": "user", "content": prompt}]
)
# Automatic: caching, fallback to Anthropic if OpenAI fails, cost logging
```
Expected savings: 40-70% combined, depending on workload characteristics.
Building an AI-powered frontend? Tools like Forge can generate the UI in minutes while an AI gateway handles your backend cost optimization.
Read more: What is an AI Gateway? and AI Gateway vs API Gateway: What's the Difference?
Case Studies: Real Companies, Real Savings
Jellypod: 88% Cost Reduction
Problem: Jellypod converts newsletters into podcasts. They were using GPT-4 for every summarization task, burning cash as usage scaled.
Solution: Implemented model routing (Strategy 2) and fine-tuned a smaller Mistral model for their specific summarization task.
Result: Inference costs dropped from ~$10/1M tokens to ~$1.20/1M tokens, an 88% reduction without quality loss for their use case.
Supernormal: 80% Cost Reduction
Problem: Supernormal's AI meeting note-taker faced spiraling costs as user growth exploded.
Solution: Moved to specialized fine-tuned infrastructure, optimized prompt context length, and implemented intelligent routing.
Result: 80% cost reduction, enabling them to scale to thousands of daily meetings without linear cost growth.
Source: Confident AI Case Study
The 24-Hour Implementation Checklist
Here's your action plan, prioritized by effort-to-impact ratio.
Hours 0-2: Audit and Baseline
- [ ] Export usage logs from OpenAI/Anthropic dashboards
- [ ] Identify your top 3 most expensive prompts (longest or most frequent)
- [ ] Calculate current cost-per-user or cost-per-request baseline
Hours 2-4: Quick Wins (No Code)
- [ ] Move all background jobs to Batch API (50% savings, 30 min work)
- [ ] Switch obvious low-stakes features to `gpt-4o-mini`
- [ ] Review system prompts: can any be shortened?
Hours 4-8: Code Changes
- [ ] Enable prompt caching on all system prompts > 1,024 tokens (Anthropic) or let it auto-enable (OpenAI)
- [ ] Set up semantic caching with Redis if you use LangChain
Hours 8-24: Routing Infrastructure
- [ ] Build and deploy the `classify_complexity()` router
- [ ] Start at 30% of traffic routed to cheaper models and monitor quality (see the sketch after this checklist)
- [ ] Increase routing percentage as confidence grows
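The sketch referenced above: a hypothetical percentage gate in front of the `classify_complexity()` router from earlier, so only a slice of "simple" traffic goes to the cheaper model while you watch quality.

```python
import random

CHEAP_ROLLOUT = 0.30  # start by routing 30% of eligible traffic to the cheap model

def pick_model(prompt: str) -> str:
    # classify_complexity() is the heuristic router defined in the routing section
    if classify_complexity(prompt) == "simple" and random.random() < CHEAP_ROLLOUT:
        return "gpt-4o-mini"
    return "gpt-4o"

# Ratchet CHEAP_ROLLOUT toward 1.0 as quality metrics hold up.
```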
What Results Can You Expect?
| Optimization Level | Strategies Implemented | Typical Savings |
|---|---|---|
| Basic | Prompt caching + batch API | 30-40% |
| Intermediate | + Model routing | 50-60% |
| Advanced | + Semantic caching + AI gateway | 60-70% |
ROI example: A startup spending $5,000/month on LLM APIs implements basic + routing optimizations. At 50% savings, that's $30,000/year back in the budget-from one day of engineering work.
Key Takeaways
- Start with prompt caching. It's 10 minutes of work for immediate savings on any repeated context.
- Route by complexity. Most production traffic doesn't need GPT-4o. Build a simple classifier and start at 30% routing.
- Batch everything async. If it can wait 24 hours, it should use the Batch API (50% off).
- Semantic caching compounds. High-repetition use cases (support, FAQ, search) see 20%+ savings.
- Gateways handle the complexity. If you don't want to maintain routing/caching infrastructure, tools like ScaleMind handle it automatically.
The tools to cut your bill in half exist right now. You don't need to wait for GPT-5 to lower prices.
Try ScaleMind for automated cost optimization →
Resources
- OpenAI API Pricing (Official)
- Anthropic Prompt Caching Documentation
- OpenAI Batch API Cookbook
- ScaleMind: What is an AI Gateway?
- ScaleMind: AI Gateway vs API Gateway
- ScaleMind: Semantic Caching for LLMs
- DesignRevision: Building AI-Powered Applications
Try ScaleMind for automated cost optimization → scalemind.ai