The Real Cost of Running an LLM in Production (I Did the Math)

Everyone talks about AI capabilities. Nobody talks about the bill.

I calculated the actual cost of running different LLMs in production for a typical SaaS app handling 10,000 requests per day. The numbers were eye-opening.

The Setup

A typical AI-powered SaaS feature:

  • 10,000 requests/day
  • Average input: 500 tokens
  • Average output: 200 tokens
  • 30 days/month

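That comes to 150M input tokens and 60M output tokens per month. Here's the arithmetic as a quick script you can rerun with your own numbers (rates are per 1K tokens; GPT-4o shown):

```python
# Back-of-envelope monthly cost for one model (GPT-4o rates shown;
# swap in any provider's per-1K-token prices).
REQS_PER_DAY = 10_000
IN_TOKENS, OUT_TOKENS = 500, 200
DAYS = 30

input_tokens = REQS_PER_DAY * IN_TOKENS * DAYS    # 150,000,000
output_tokens = REQS_PER_DAY * OUT_TOKENS * DAYS  # 60,000,000

price_in, price_out = 0.0025, 0.01                # $/1K tokens (GPT-4o)
monthly = input_tokens / 1000 * price_in + output_tokens / 1000 * price_out
print(f"${monthly:,.2f}/month")                   # $975.00
```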
Monthly Cost Comparison

| Model | Input ($/1K tokens) | Output ($/1K tokens) | Monthly total |
| --- | --- | --- | --- |
| GPT-4o | $0.0025 | $0.01 | $975 |
| Claude 3.5 Sonnet | $0.003 | $0.015 | $1,350 |
| GPT-4o mini | $0.00015 | $0.0006 | $58.50 |
| Claude 3.5 Haiku | $0.0008 | $0.004 | $360 |
| Llama 3 70B (self-hosted) | ~$1,500/mo GPU rental (flat) | n/a | $1,500 |
| Llama 3 8B (self-hosted) | ~$300/mo GPU rental (flat) | n/a | $300 |
| Llama 3 8B (4-bit quantized) | ~$150/mo GPU rental (flat) | n/a | $150 |

The Surprising Findings

1. GPT-4o mini is absurdly cheap

At $58.50/month for 10K requests/day, it's cheaper than most SaaS subscriptions. For 90% of use cases (classification, summarization, extraction), it's good enough.
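For a sense of what that looks like in code, here's a minimal classification call with the OpenAI Python SDK (the label set and prompt are just illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_ticket(text: str) -> str:
    # ~500 input tokens in, a handful of tokens out: fractions of a cent
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify the support ticket as: "
             "billing, bug, or feature_request. Reply with the label only."},
            {"role": "user", "content": text},
        ],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()
```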

2. Self-hosting only makes sense at scale

Self-hosting Llama 3 70B runs about $1,500/mo in GPU rental, which is more than calling GPT-4o via the API at this volume ($975/mo). Self-hosting only wins once you hit 50K+ requests/day or need data privacy.
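You can sanity-check that threshold. On raw compute alone, the crossover with GPT-4o at this workload is closer to 15K requests/day; the rest of the 50K figure is headroom for DevOps time, redundancy, and the fact that a rented GPU rarely runs at full utilization:

```python
# Rough break-even: flat GPU rental vs. pay-per-token API.
GPU_MONTHLY = 1_500                    # Llama 3 70B box, from the table above
IN_TOKENS, OUT_TOKENS, DAYS = 500, 200, 30

# GPT-4o cost per request at the rates above
api_per_request = IN_TOKENS / 1000 * 0.0025 + OUT_TOKENS / 1000 * 0.01  # $0.00325

break_even = GPU_MONTHLY / (api_per_request * DAYS)
print(f"{break_even:,.0f} requests/day")  # ~15,385 on raw compute alone
```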

3. The real cost isn't the model

The model API is maybe 30% of your actual cost. The rest:

  • Embedding storage: Vector DB for RAG = $100-500/mo
  • Prompt engineering/testing: Developer time = $$$$
  • Error handling & fallbacks: Retry logic, rate limit handling
  • Monitoring: Token usage tracking, quality monitoring (see the sketch below)
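The monitoring piece is the easiest to start on. Here's a minimal per-request cost tracker, assuming the OpenAI Python SDK (every chat completion returns a `usage` object with token counts):

```python
from openai import OpenAI

client = OpenAI()

# $/1K tokens; keep in sync with your providers' pricing pages
PRICES = {"gpt-4o-mini": (0.00015, 0.0006), "gpt-4o": (0.0025, 0.01)}

def tracked_call(model, messages):
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage
    price_in, price_out = PRICES[model]
    cost = (usage.prompt_tokens / 1000 * price_in
            + usage.completion_tokens / 1000 * price_out)
    # In production, ship this to your metrics backend instead of printing
    print(f"{model}: {usage.total_tokens} tokens, ${cost:.6f}")
    return response.choices[0].message.content, cost
```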

4. Quantized models are the sweet spot

Running Llama 3 8B with 4-bit quantization gives you:

  • $150/mo (single mid-range GPU)
  • Full data privacy
  • No rate limits
  • 90% of 70B quality for most tasks

With Google's new TurboQuant technique, expect even better quality at lower bit widths soon.
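If you want to try the quantized route, here's a minimal 4-bit setup with Hugging Face transformers and bitsandbytes. It assumes a CUDA GPU with roughly 6-8 GB of free VRAM and access to the gated Llama 3 weights; it's a sketch, not a production serving stack:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, the usual default
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

inputs = tokenizer("Summarize: ...", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```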

Cost Optimization Tricks

```python
# 1. Cache common responses
import hashlib

cache = {}

def cached_llm_call(prompt):
    # Hash the prompt so identical requests hit the cache instead of the API
    key = hashlib.md5(prompt.encode()).hexdigest()
    if key in cache:
        return cache[key]
    response = call_llm(prompt)  # your provider call goes here
    cache[key] = response
    return response

# 2. Use cheaper models for classification, expensive ones for generation
def smart_routing(task):
    if task.type == 'classify':
        return call_model('gpt-4o-mini', task)  # $0.06/day
    elif task.type == 'generate':
        return call_model('gpt-4o', task)  # only for complex tasks
    raise ValueError(f"unhandled task type: {task.type}")

# 3. Truncate inputs aggressively
def optimize_prompt(text, max_tokens=500):
    # Rough heuristic: ~4 characters per token for English text
    return text[:max_tokens * 4]
```
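Two caveats on the cache sketch: a plain in-process dict resets on every deploy and grows without bound, so in production you'd typically back it with Redis and a TTL; and it's worth normalizing prompts (trim whitespace, collapse formatting) before hashing so near-identical requests don't miss the cache.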

The Bottom Line

For most startups:

  1. Start with GPT-4o mini ($58/mo for 10K req/day)
  2. Add caching (reduces costs 30-50%)
  3. Use smart routing (cheap model for easy tasks, expensive for hard ones)
  4. Consider self-hosting only when you hit 50K+ req/day

What's your LLM infrastructure costing you? Are you tracking per-request costs?


More from me: 10 Dev Tools I Use Daily | 77 Scrapers on a Schedule | 150+ Free APIs


Need web scraping or data extraction? I've built 77+ production scrapers. Email spinov001@gmail.com — quote in 2 hours. Or try my ready-made Apify actors — no code needed.
