How to Reduce LLM Costs by 40% in 24 Hours (2025)

Originally published on ScaleMind Blog

An AI gateway is the fastest path to LLM cost optimization, but you can start cutting costs in minutes with these five tested strategies.

TL;DR

If you're spending over $1k/month on LLM APIs without optimization, you're overpaying by at least 40%. This guide covers five strategies you can implement today:

| Strategy | Time to Implement | Expected Savings |
| --- | --- | --- |
| Prompt caching | 10 minutes | 50-90% on cached tokens |
| Model routing | 2-4 hours | 20-60% |
| Semantic caching | 1-2 hours | 15-30% |
| Batch processing | 30 minutes | 50% on async workloads |
| AI gateway | 5 minutes | 40-70% combined |

Why Are LLM Costs Spiraling Out of Control?

LLM costs scale linearly with usage, but most teams don't notice the bleeding until it's too late. The problem isn't model pricing; it's routing every request to the most expensive model regardless of task complexity.

Here's the current cost landscape for 1 million input tokens (December 2025):

| Model Family | Model | Cost / 1M Input Tokens | Best For |
| --- | --- | --- | --- |
| Frontier | GPT-4o | $2.50 | Complex reasoning, coding |
| Frontier | Claude 3.5 Sonnet | $3.00 | Nuanced writing, RAG |
| Efficient | GPT-4o Mini | $0.15 | Summarization, classification |
| Efficient | Claude 3.5 Haiku | $0.80 | Speed-critical tasks |
| Ultra-low-cost | Gemini 1.5 Flash-8B | $0.0375 | Bulk processing, extraction |

Sources: OpenAI Pricing, Anthropic Pricing, Google AI Pricing

The math is brutal: processing 100M tokens/month with Claude 3.5 Sonnet costs ~$300 in input tokens alone. Route 50% of that to Gemini 1.5 Flash-8B and that portion drops from $150 to $1.88. The savings compound fast.
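
To sanity-check that blended-cost math against your own volumes, a quick back-of-the-envelope calculation does it (the 100M-token volume and the 50/50 split are the illustrative numbers from above):

# Blended input cost when half the volume is routed to a cheaper model
tokens_per_month = 100_000_000
sonnet_per_token = 3.00 / 1_000_000      # Claude 3.5 Sonnet, $ per input token
flash_8b_per_token = 0.0375 / 1_000_000  # Gemini 1.5 Flash-8B, $ per input token

all_sonnet = tokens_per_month * sonnet_per_token
blended = (0.5 * tokens_per_month * sonnet_per_token
           + 0.5 * tokens_per_month * flash_8b_per_token)

print(f"All Sonnet:  ${all_sonnet:.2f}/month")   # $300.00
print(f"50% routed:  ${blended:.2f}/month")      # $151.88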

This guide is for engineering teams ready to stop the bleeding today, not next quarter.


How Does Prompt Caching Reduce LLM Costs?

Prompt caching stores frequently used context (system prompts, RAG documents, few-shot examples) so you don't pay full price every time you resend it. Both OpenAI and Anthropic offer native prompt caching with substantial discounts.

When it works best:

  • Chatbots - System prompt + conversation history sent repeatedly
  • RAG applications - Same documents analyzed across multiple queries
  • Coding assistants - Full codebase context included with every request

Implementation: Anthropic Prompt Caching

import anthropic

client = anthropic.Anthropic()

# Before: Paying full price for the same 10k-token context every request
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are a legal analyst assistant. " + large_context,
    messages=[{"role": "user", "content": query}]
)

# After: 90% discount on cached tokens
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal analyst assistant.",
            "cache_control": {"type": "ephemeral"}  # Cache this block
        },
        {
            "type": "text",
            "text": large_context,  # Your 10k-token document
            "cache_control": {"type": "ephemeral"}  # Cached at 90% discount
        }
    ],
    messages=[{"role": "user", "content": query}]
)

# Check your savings
print(f"Cache Creation: {response.usage.cache_creation_input_tokens} tokens")
print(f"Cache Read: {response.usage.cache_read_input_tokens} tokens (90% off)")

Savings breakdown:

| Provider | Cache Discount | Notes |
| --- | --- | --- |
| Anthropic | ~90% on cache reads | Explicit cache_control markers required |
| OpenAI | ~50% on cache reads | Automatic for prompts longer than 1,024 tokens |

Time to implement: 10 minutes for existing prompts.

See the Anthropic Prompt Caching docs for minimum token requirements and TTL details.
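
On the OpenAI side there's nothing to configure, but it's worth verifying that caching is actually kicking in. A minimal check, assuming an OpenAI client and reusing the same large_context and query placeholders from the Anthropic example:

from openai import OpenAI

client = OpenAI()

# Prompts over 1,024 tokens are cached automatically; cache hits show up
# in the usage details and are billed at the discounted rate.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": large_context},  # stable prefix, reused across requests
        {"role": "user", "content": query},
    ],
)
print(f"Cached prompt tokens: {response.usage.prompt_tokens_details.cached_tokens}")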


How Does Model Routing Cut Costs Without Sacrificing Quality?

Model routing directs each request to the cheapest model capable of handling it. Using GPT-4o for password reset instructions is like hiring a PhD to answer the phone: expensive and unnecessary.

The routing principle:

| Task Type | Use This Model | Why |
| --- | --- | --- |
| Creative writing, complex reasoning, coding | GPT-4o, Claude 3.5 Sonnet | Requires frontier intelligence |
| Summarization, classification, extraction | GPT-4o Mini, Claude 3.5 Haiku | 10-20x cheaper with comparable quality |
| Bulk data processing | Gemini Flash, open-source models | Sub-penny per request |

Implementation: Basic Complexity Router

from openai import OpenAI

client = OpenAI()


def classify_complexity(prompt: str) -> str:
    """
    Simple heuristic router. Production systems often use:
    - A small classifier model (BERT, DistilBERT)
    - Keyword/regex matching
    - Token count thresholds
    """
    complexity_signals = ["code", "reason", "analyze", "compare", "debug"]

    if len(prompt) > 2000:
        return "complex"
    if any(signal in prompt.lower() for signal in complexity_signals):
        return "complex"
    return "simple"


def route_request(prompt: str):
    complexity = classify_complexity(prompt)

    if complexity == "simple":
        # 94% cheaper than GPT-4o
        return client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}]
        )
    else:
        return client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )


# Real-world scenario:
# - 70% of requests are simple (FAQ, summaries, classification)
# - 30% require frontier models
# 
# Cost calculation:
# Before: 100% at $2.50 = $2.50 avg
# After: (0.7 × $0.15) + (0.3 × $2.50) = $0.855 avg
# Savings: 66%

Expected savings: 20-60% depending on your traffic mix. Most B2B apps see 60-70% of requests routable to efficient models.


What is Semantic Caching and How Does It Work?

Semantic caching uses vector embeddings to recognize that "How do I reset my password?" and "I forgot my password, help!" are the same question. Instead of hitting the LLM again, it returns the cached response at zero API cost.

Standard Redis caching only works on exact string matches. Semantic caching works on meaning.
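
To make "works on meaning" concrete, here's a stripped-down sketch of what a semantic cache does under the hood: embed the incoming query, compare it to embeddings of previously answered queries, and reuse the stored answer when the similarity clears a configured threshold. This is a conceptual illustration, not the LangChain internals:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(result.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cached_question = embed("How do I reset my password?")
new_question = embed("I forgot my password, help!")

similarity = cosine_similarity(cached_question, new_question)
print(f"Similarity: {similarity:.2f}")
# Paraphrases land close together in embedding space; the cache returns the
# stored answer when the similarity clears its threshold, skipping the LLM call.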

Implementation: LangChain + Redis

from langchain.cache import RedisSemanticCache
from langchain.globals import set_llm_cache
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini")

# One-time setup: every llm call below checks the semantic cache first
set_llm_cache(RedisSemanticCache(
    redis_url="redis://localhost:6379",
    embedding=OpenAIEmbeddings(),
    score_threshold=0.2  # Distance threshold: lower = stricter matching
))

# First request: hits API, costs money, stores response
response_1 = llm.invoke("What's the refund policy?")

# Second request: semantically similar, returns from cache ($0)
response_2 = llm.invoke("How can I get my money back?")

# Third request: different enough, hits API
response_3 = llm.invoke("What products do you sell?")

When semantic caching shines:

  • Customer support bots (20-40% query overlap typical)
  • FAQ-style applications
  • Search result explanations
  • Any high-repetition use case

Expected savings: If 20% of your queries are semantically similar, you save 20% immediately. The embedding lookup cost is negligible (~$0.02/1M tokens with text-embedding-3-small).
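
The "negligible" claim is easy to check with rough numbers; the query length, hit rate, and per-request LLM cost below are illustrative assumptions, not measurements:

# Illustrative: 1M queries/month, ~30 tokens each, 20% semantic cache hit rate
queries = 1_000_000
embedding_cost = queries * 30 / 1_000_000 * 0.02   # text-embedding-3-small: ~$0.60 total
llm_cost_per_request = 0.002                        # assumed average API cost per request
savings = queries * 0.20 * llm_cost_per_request     # $400 of LLM spend avoided
print(f"Embedding cost: ${embedding_cost:.2f}, LLM spend avoided: ${savings:.2f}")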

For a deeper dive, see our Semantic Caching for LLMs guide.


When Should You Use Batch Processing for LLM Requests?

Batch processing is the right choice for any workload that doesn't need real-time responses. OpenAI's Batch API offers a flat 50% discount for requests that can wait up to 24 hours (jobs often finish well before the deadline).

Ideal batch candidates:

  • Nightly content generation
  • Sentiment analysis on yesterday's support tickets
  • Bulk document summarization
  • Evaluation and testing pipelines

Implementation: OpenAI Batch API

from openai import OpenAI
client = OpenAI()

# Step 1: Create JSONL file with requests
# batch_requests.jsonl:
# {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize: ..."}]}}
# {"custom_id": "req-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize: ..."}]}}

# Step 2: Upload the file
batch_file = client.files.create(
    file=open("batch_requests.jsonl", "rb"),
    purpose="batch"
)

# Step 3: Create the batch job (50% discount tier)
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch {batch_job.id} created. Status: {batch_job.status}")

# GPT-4o batch pricing:
# - Standard: $2.50/1M input, $10.00/1M output
# - Batch:    $1.25/1M input, $5.00/1M output
# Savings: 50% flat

Expected savings: 50% on all eligible workloads. If 30% of your LLM usage is async, that's 15% off your total bill.

See OpenAI's Batch API cookbook for error handling and retrieval patterns.
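
If you just need the happy path, polling the job and downloading results looks roughly like this (a minimal sketch that continues the example above and skips the error-file handling covered in the cookbook):

import time

# Poll until the batch reaches a terminal state
while True:
    batch = client.batches.retrieve(batch_job.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

if batch.status == "completed":
    # Results come back as JSONL, one line per request, matched by custom_id
    output = client.files.content(batch.output_file_id)
    for line in output.text.splitlines():
        print(line)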


What is an AI Gateway and Why Does It Matter?

An AI gateway is a proxy layer between your application and LLM providers that handles routing, caching, fallbacks, and cost optimization automatically. Instead of implementing the four strategies above separately, a gateway gives you all of them out of the box.

What an AI gateway handles:

| Feature | DIY Effort | With Gateway |
| --- | --- | --- |
| Model routing | Custom classifier + routing logic | Config file |
| Prompt caching | Provider-specific implementation | Automatic |
| Semantic caching | Redis + embeddings + maintenance | Built-in |
| Failover (OpenAI down → Anthropic) | Complex error handling | Automatic |
| Cost tracking | Custom logging + dashboards | Real-time UI |

The tradeoff: You're adding a dependency. The benefit is shipping faster and not maintaining infrastructure that isn't your core product.
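
For a sense of what the "complex error handling" row means in practice, here's a simplified sketch of DIY failover from OpenAI to Anthropic; a production version would also need retries, timeouts, and response-format normalization:

import anthropic
import openai

openai_client = openai.OpenAI()
anthropic_client = anthropic.Anthropic()

def complete_with_failover(prompt: str) -> str:
    try:
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    except openai.APIError:
        # Primary provider errored (outage, rate limit, 5xx): fall back to Anthropic
        response = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text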

# Before: Direct OpenAI call (no caching, no fallback, no cost tracking)
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

# After: ScaleMind gateway (same API, automatic optimization)
from scalemind import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # Routes to cheapest capable model
    messages=[{"role": "user", "content": prompt}]
)
# Automatic: caching, fallback to Anthropic if OpenAI fails, cost logging

Expected savings: 40-70% combined, depending on workload characteristics.

Building an AI-powered frontend? Tools like Forge can generate the UI in minutes while an AI gateway handles your backend cost optimization.

Read more: What is an AI Gateway? and AI Gateway vs API Gateway: What's the Difference?


Case Studies: Real Companies, Real Savings

Jellypod: 88% Cost Reduction

Problem: Jellypod converts newsletters into podcasts. They were using GPT-4 for every summarization task, burning cash as usage scaled.

Solution: Implemented model routing (Strategy 2) and fine-tuned a smaller Mistral model for their specific summarization task.

Result: Inference costs dropped from ~$10/1M tokens to ~$1.20/1M tokens, an 88% reduction without quality loss for their use case.

Supernormal: 80% Cost Reduction

Problem: Supernormal's AI meeting note-taker faced spiraling costs as user growth exploded.

Solution: Moved to specialized fine-tuned infrastructure, optimized prompt context length, and implemented intelligent routing.

Result: 80% cost reduction, enabling them to scale to thousands of daily meetings without linear cost growth.

Source: Confident AI Case Study


The 24-Hour Implementation Checklist

Here's your action plan, prioritized by effort-to-impact ratio.

Hours 0-2: Audit and Baseline

  • [ ] Export usage logs from OpenAI/Anthropic dashboards
  • [ ] Identify your top 3 most expensive prompts (longest or most frequent)
  • [ ] Calculate your current cost-per-user or cost-per-request baseline (see the sketch below)
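
A quick way to turn an exported usage log into that baseline. The CSV columns and file name here are hypothetical; adjust them to whatever your provider's export actually contains:

import csv

# December 2025 list prices, $ per 1M tokens (input, output)
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-5-sonnet-20241022": (3.00, 15.00),
}

total_cost = 0.0
requests = 0
with open("usage_export.csv") as f:  # hypothetical export file
    for row in csv.DictReader(f):
        input_price, output_price = PRICES[row["model"]]
        total_cost += int(row["prompt_tokens"]) / 1e6 * input_price
        total_cost += int(row["completion_tokens"]) / 1e6 * output_price
        requests += 1

print(f"Monthly spend: ${total_cost:.2f} across {requests} requests")
print(f"Cost per request: ${total_cost / requests:.4f}")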

Hours 2-4: Quick Wins (No Code)

  • [ ] Move all background jobs to Batch API (50% savings, 30 min work)
  • [ ] Switch obvious low-stakes features to gpt-4o-mini
  • [ ] Review system prompts: can any be shortened?

Hours 4-8: Code Changes

  • [ ] Enable prompt caching on all system prompts > 1,024 tokens (Anthropic) or let it auto-enable (OpenAI)
  • [ ] Set up semantic caching with Redis if you use LangChain

Hours 8-24: Routing Infrastructure

  • [ ] Build and deploy classify_complexity() router
  • [ ] Start at 30% traffic to cheaper models and monitor quality (see the rollout sketch below)
  • [ ] Increase routing percentage as confidence grows
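
A minimal way to do that gradual rollout, reusing classify_complexity() and the client from the routing section above (the 30% starting fraction mirrors the checklist item):

import random

ROLLOUT_FRACTION = 0.3  # raise this as quality metrics hold up

def route_with_rollout(prompt: str):
    # Only a fraction of "simple" traffic goes to the cheaper model at first,
    # so any routing mistakes affect a bounded share of requests.
    if classify_complexity(prompt) == "simple" and random.random() < ROLLOUT_FRACTION:
        model = "gpt-4o-mini"
    else:
        model = "gpt-4o"
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )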

What Results Can You Expect?

| Optimization Level | Strategies Implemented | Typical Savings |
| --- | --- | --- |
| Basic | Prompt caching + Batch API | 30-40% |
| Intermediate | + Model routing | 50-60% |
| Advanced | + Semantic caching + AI gateway | 60-70% |

ROI example: A startup spending $5,000/month on LLM APIs implements the basic and routing optimizations. At 50% savings, that's $30,000/year back in the budget, from one day of engineering work.


Key Takeaways

  1. Start with prompt caching. It's 10 minutes of work for immediate savings on any repeated context.
  2. Route by complexity. Most production traffic doesn't need GPT-4o. Build a simple classifier and start at 30% routing.
  3. Batch everything async. If it can wait 24 hours, it should use the Batch API (50% off).
  4. Semantic caching compounds. High-repetition use cases (support, FAQ, search) see 20%+ savings.
  5. Gateways handle the complexity. If you don't want to maintain routing/caching infrastructure, tools like ScaleMind handle it automatically.

The tools to cut your bill in half exist right now. You don't need to wait for GPT-5 to lower prices.

Try ScaleMind for automated cost optimization →


Resources


Read more on the ScaleMind Blog: scalemind.ai
