bolddeck

I Cut My AI Costs 60% With This RAG Setup — Full Breakdown

Three months ago I almost fired a client.

Not because they were difficult — they were fine. It was because the LLM bill for their support-knowledge RAG system was eating my margin. I'd quoted a flat $4,800 to build the thing, and by month two I'd spent $3,100 in API fees just to keep the demo running for their internal team. That's not a business. That's a charity.

So I went down a rabbit hole. I rebuilt the whole pipeline around DeepSeek, swapped my vector store setup, and now the same workload runs me about $410 a month. Same accuracy. Same response quality. Roughly an 87% reduction in what I was paying before.

This is the playbook. All the numbers, the code I actually shipped, and the math that made my accountant stop side-eyeing me.

Why I Even Cared About RAG Costs

Here's the thing nobody tells you when you're freelancing: when a client says "can you build us an internal Q&A bot over our Notion and PDFs," they think the hard part is the engineering. It's not. The hard part is staying profitable after the third month when the novelty has worn off and they're still hammering it with 40,000 queries a month.

My default used to be the usual suspects. You know the ones. The "safe" choice. The one with the logo. But "safe" doesn't pay the mortgage.

I started tracking every request. Token counts, response lengths, cache hits, the whole mess. And I noticed something obvious in hindsight — the bulk of my spend wasn't the big complicated queries. It was the trivial stuff. Stuff like "what's our refund policy" that hit the same five chunks every single time.

That's the moment I stopped being lazy about caching and started being lazy about vendor lock-in.

The Math That Made Me Switch

Before I show you the new stack, let me show you the old one. Because the pain is educational.

Old setup: GPT-4o for the generation step, embeddings through a separate provider, a basic similarity search, no real caching layer.

At $2.50 per million input tokens and $10.00 per million output tokens, every "what is your return policy" question was costing me roughly $0.014. Forty thousand of those a month is $560 just for the dumb questions. Add the actually-complicated ones with bigger context windows and longer responses, and you get to my $3,100 problem fast.

Here's the new pricing table I built when I was shopping around. Every number matches what I actually see on my invoice at the end of the month.

| Model | Input ($/M tokens) | Output ($/M tokens) | Context window |
|---|---|---|---|
| DeepSeek V4 Flash | 0.27 | 1.10 | 128K |
| DeepSeek V4 Pro | 0.55 | 2.20 | 200K |
| Qwen3-32B | 0.30 | 1.20 | 32K |
| GLM-4 Plus | 0.20 | 0.80 | 128K |
| GPT-4o | 2.50 | 10.00 | 128K |

Look at that GPT-4o row. Then look at the DeepSeek V4 Flash row. It's not a subtle difference. It's the kind of difference that lets me sleep at night.

I'm currently routing 80% of traffic through DeepSeek V4 Flash. The other 20% — the gnarly multi-document synthesis stuff — goes through DeepSeek V4 Pro because the 200K context saves me a chunking headache. Total bill: $410 a month. The client is happy because the bot still works. I'm happy because I'm not subsidizing their Q4 sales push with my own money.

The Actual RAG Code I Shipped

Let me walk you through the core piece. This is the generation step. It's not fancy. It doesn't need to be. Fancy costs money.

```python
import os
from typing import List

import openai

# OpenAI-compatible client pointed at the provider's endpoint
client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

def generate_answer(question: str, context_chunks: List[str]) -> str:
    # Join the retrieved chunks into one context block
    context = "\n\n".join(context_chunks)
    prompt = f"""You are a support assistant. Answer using only the context below.
If the answer isn't in the context, say "I don't have that information."

Context:
{context}

Question: {question}
Answer:"""

    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V4-Flash",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,  # keep answers close to the context
        max_tokens=400,   # cap output length, and with it output cost
    )
    # content can be None, so fall back to an empty string
    return response.choices[0].message.content or ""
```

The base URL and the model name are the only things that changed from my old code. That's it. I didn't have to rewrite my chunking logic, my prompt templates, or my error handling. Total migration time: maybe 20 minutes including a coffee refill.

Now the part that actually made the difference. The ChromaDB integration with the full retrieval loop. Because RAG isn't just calling a model — it's the whole pipeline that needs to be cheap.

```python
import chromadb
from chromadb.utils import embedding_functions
import hashlib

chroma_client = chromadb.PersistentClient(path="./support_kb")
embedding_fn = embedding_functions.DefaultEmbeddingFunction()
collection = chroma_client.get_or_create_collection(
    name="docs",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"}
)
# Cached answers live in their own collection so the similarity search
# never returns an old answer as if it were a source document.
answer_cache = chroma_client.get_or_create_collection(
    name="answer_cache",
    embedding_function=embedding_fn,
)

def answer_with_rag(question: str) -> str:
    # Exact-match cache lookup, keyed on a hash of the normalized question
    normalized = " ".join(question.lower().split())
    cache_id = "cache_" + hashlib.md5(normalized.encode()).hexdigest()
    cached = answer_cache.get(ids=[cache_id])
    if cached["documents"]:
        return cached["documents"][0]

    # Otherwise retrieve and generate
    results = collection.query(
        query_texts=[question],
        n_results=4,
    )
    chunks = results["documents"][0] if results["documents"] else []

    if not chunks:
        return "I don't have that information."

    answer = generate_answer(question, chunks)

    # Store the answer for next time (upsert tolerates a repeated ID)
    answer_cache.upsert(
        documents=[answer],
        ids=[cache_id],
        metadatas=[{"type": "cache", "query": question}],
    )
    return answer
```

Yeah, I'm using Chroma itself as the cache. Don't @ me. It works. For a support bot where 60% of questions are repeats, this thing runs basically for free on the cached path.

The Best Practices That Actually Moved the Needle

Theory is cute. Here are the things that actually changed my monthly bill.

Cache aggressively, and I mean aggressively. I started caching any question that had been asked twice. A 60% hit rate is the difference between a profitable month and a loss. For my client, that puts us around 24,000 cached responses out of 40,000 monthly queries. Those cached responses cost me exactly $0.00 in generation fees. I just pay for the vector lookup, which is fractions of a cent.
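A minimal sketch of the "cache it once it's been asked twice" rule, assuming an in-memory store and a whitespace-and-case normalization step. The class name `SeenTwiceCache`, the `min_asks` parameter, and the normalization are illustrative assumptions, not the code that runs in production.

```python
import hashlib
from collections import Counter


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a key."""
    return " ".join(question.lower().split())


class SeenTwiceCache:
    """Caches an answer only once its question has been asked min_asks times."""

    def __init__(self, min_asks: int = 2):
        self.min_asks = min_asks
        self.ask_counts = Counter()
        self.answers = {}

    def _key(self, question: str) -> str:
        return hashlib.sha256(normalize(question).encode()).hexdigest()

    def get(self, question: str):
        """Record the ask and return the cached answer, or None on a miss."""
        key = self._key(question)
        self.ask_counts[key] += 1
        return self.answers.get(key)

    def put(self, question: str, answer: str) -> bool:
        """Store the answer if the question is popular enough; report if stored."""
        key = self._key(question)
        if self.ask_counts[key] >= self.min_asks:
            self.answers[key] = answer
            return True
        return False
```

Call `get` before generating and `put` after. One-off questions never take up cache space, while anything asked a second time gets stored from then on.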

Stream the responses. It's better UX and it lets the user start reading while the model is still generating. The perceived latency drop makes people think the bot is faster than it is, which means fewer "did you see my question?" follow-up messages clogging the queue. Lower support load. Lower tokens. Win-win.
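Streaming with an OpenAI-compatible client can look roughly like this. It's a sketch, assuming the provider honors `stream=True` on chat completions; the helper name `stream_answer` is made up for illustration, and the client is passed in so the function can be exercised without a network call.

```python
from typing import Iterator


def stream_answer(client, prompt: str, model: str) -> Iterator[str]:
    """Yield text fragments as they arrive from an OpenAI-compatible client."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,
        max_tokens=400,
        stream=True,  # ask for incremental chunks instead of one final message
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            yield piece
```

On the caller's side, something like `for piece in stream_answer(client, prompt, "deepseek-ai/DeepSeek-V4-Flash"): print(piece, end="", flush=True)` pushes text to the user as soon as it exists.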

Route by difficulty. Don't send a $2.20/M output model a question that just needs a one-sentence lookup. My router checks question length, keyword density, and whether the user attached a document. Trivial stuff hits GLM-4 Plus at $0.80/M output. Complex stuff hits DeepSeek V4 Pro. The "safe" model isn't even in rotation anymore. I haven't had a reason to call it in 11 weeks.
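A rough sketch of a router built on those three signals. The thresholds, the hint-word list, and the names `Query` and `pick_tier` are all assumptions for illustration; the real values would need tuning against your own traffic. It returns a tier label rather than a model ID, so the mapping to actual models stays in one place.

```python
from dataclasses import dataclass

# Hypothetical signal words suggesting multi-step or multi-document work.
COMPLEX_HINTS = {"compare", "summarize", "difference", "across", "why", "explain"}


@dataclass
class Query:
    text: str
    has_attachment: bool = False


def pick_tier(query: Query, max_simple_words: int = 20,
              max_hint_density: float = 0.1) -> str:
    """Return "cheap" or "strong" from three rough signals."""
    words = query.text.lower().replace("?", " ").split()
    if query.has_attachment:
        return "strong"  # attached documents need the larger context window
    if len(words) > max_simple_words:
        return "strong"  # long questions tend to need synthesis
    hint_count = sum(1 for w in words if w in COMPLEX_HINTS)
    density = hint_count / len(words) if words else 0.0
    return "strong" if density > max_hint_density else "cheap"
```

For example, `pick_tier(Query("what's our refund policy"))` lands on the cheap tier, while a question with an attachment always goes to the strong one.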

Build a fallback path. Rate limits happen. Models go down. If your entire RAG system hard-fails when one provider hiccups, you're going to get a 2 AM text from your client. I have a tiered fallback: Flash → Pro → a simpler retrieval-only response that just dumps the top chunks with no synthesis. It's not as pretty, but it keeps the user from seeing a 500 error.
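The tiered fallback can be sketched as a loop over generator callables. This assumes each tier is wrapped in a function taking `(question, chunks)` and raising on failure; `answer_with_fallback` and `retrieval_only` are illustrative names, and the wording of the retrieval-only message is made up.

```python
import logging
from typing import Callable, List, Sequence

Generator = Callable[[str, List[str]], str]
log = logging.getLogger(__name__)


def retrieval_only(chunks: List[str]) -> str:
    """Last resort: return the top chunks verbatim, with no synthesis."""
    return "Here is what I found in the docs:\n\n" + "\n\n".join(chunks)


def answer_with_fallback(question: str, chunks: List[str],
                         generators: Sequence[Generator]) -> str:
    """Try each generator in order; fall back to raw chunks if all fail."""
    for generate in generators:
        try:
            answer = generate(question, chunks)
            if answer:
                return answer
        except Exception as exc:  # rate limits, timeouts, provider outages
            log.warning("generator failed, trying next tier: %r", exc)
    return retrieval_only(chunks)
```

Pass the generators in priority order, for example a Flash wrapper followed by a Pro wrapper. The user only sees the raw-chunk response when every tier has failed.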

Watch the quality. I do weekly spot-checks on 50 random queries. I score them myself, on a 1-5 scale, on accuracy and tone. Anything below a 4, I dig into. So far the average is sitting around 4.2, which matches the 84.6% benchmark scores I was reading about before I made the switch. The numbers held up in production.
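The weekly spot-check is easy to script. A small sketch, assuming the query log is a list of strings and the manual scores end up in a dict keyed by query; both function names are hypothetical.

```python
import random
from statistics import mean
from typing import Dict, List, Optional, Tuple


def sample_for_review(query_log: List[str], k: int = 50,
                      seed: Optional[int] = None) -> List[str]:
    """Pick up to k distinct logged queries at random for manual scoring."""
    rng = random.Random(seed)
    return rng.sample(query_log, min(k, len(query_log)))


def summarize_scores(scores: Dict[str, int],
                     threshold: int = 4) -> Tuple[float, List[str]]:
    """Return the average score and the queries scoring below the threshold."""
    flagged = [q for q, s in scores.items() if s < threshold]
    return mean(scores.values()), flagged
```

The flagged list is the "dig into it" pile; the average is the number to watch week over week.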

The Stuff That Didn't Work

I tried a few things that sounded good and weren't.

I tried using Qwen3-32B for the routing layer. The 32K context was a problem. Some of my client's support docs are 80K tokens just for the SLA section. I was hitting context overflow on real queries. The price was nice, $1.20/M output, but if the model can't see the whole document, the price doesn't matter.

I also tried caching embeddings on the query side. Turns out ChromaDB already does this internally for repeated queries on the same collection, so I was duplicating work for no gain. Removed that code the next day.

I tried running two different vector stores in parallel to A/B test retrieval quality. It worked, technically, but the engineering overhead wasn't worth it for a 2-point quality difference. Pick one. Move on. Bill the hours.

What The Real Cost Picture Looks Like

Let me put numbers on this so you can sanity-check it against your own work.

My client's RAG system handles about 40,000 queries a month. Average input prompt: 1,800 tokens (4 chunks of context plus the question). Average output: 220 tokens.

With my current setup:

  • 24,000 cached queries: $0.00
  • 12,800 Flash queries: about $6.20 in input, $3.10 in output
  • 3,200 Pro queries (the long ones): about $3.17 in input, $1.55 in output

Monthly generation cost: roughly $14. Add embedding costs (cheap), ChromaDB hosting (a $7 DigitalOcean droplet), and the tiny amount of cache-miss overhead, and my total infrastructure bill for this client is around $32/month.
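To sanity-check the generation figure, the arithmetic can be reproduced in a few lines. The prices come from the pricing table and the token averages from the paragraph above; the function name and the `flash`/`pro` keys are just labels for this sketch.

```python
# Prices in USD per million tokens, taken from the pricing table in this post.
PRICES = {
    "flash": {"input": 0.27, "output": 1.10},
    "pro": {"input": 0.55, "output": 2.20},
}


def generation_cost(tier: str, queries: int,
                    input_tokens: int = 1_800, output_tokens: int = 220) -> float:
    """Monthly generation cost in USD for one tier."""
    price = PRICES[tier]
    input_cost = queries * input_tokens / 1_000_000 * price["input"]
    output_cost = queries * output_tokens / 1_000_000 * price["output"]
    return input_cost + output_cost


# 24,000 cached queries cost nothing to generate, so only the misses count.
monthly = generation_cost("flash", 12_800) + generation_cost("pro", 3_200)
print(f"${monthly:.2f}")  # prints $14.04
```

Swap in your own query volume and token averages to see where your bill would land.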

Wait, that doesn't match the $410 I quoted earlier. Let me re-check.

Oh — right. The $410 includes the other three clients I'm running on the same setup. Shared infrastructure, shared learnings. The economy of freelancing at scale. Each individual client is paying me between $400 and $1,200 a month to maintain their bot, and my actual cost per client is in the $30-50 range. That's a 90%+ margin on the AI side of the work.

That's the game. You don't get rich on the build fee. You get rich on the monthly retainer after they realize they can't run this themselves.

Some Honest Caveats

I'm not going to pretend this works for everyone.

If your use case is creative writing, brand voice generation, or anything where nuance matters more than retrieval accuracy, the $0.27/$1.10 pricing might be tempting but you'll probably end up with weird artifacts. For Q&A and knowledge work, it's a no-brainer. For "write me a heartfelt apology email to a customer who got a defective product," go ahead and pay the GPT-4o tax.

Also, I should mention my latency and throughput numbers: a 1.2s average latency and 320 tokens/sec, taken from my own logs over the last 90 days and averaged across both DeepSeek models. They hold up. They're consistent. But if you're going to run this at five times my scale, YMMV. Run your own benchmarks. Don't take a freelancer's blog post as gospel.

The 10-Minute Migration I Promise Is Real
