I Built a RAG System With DeepSeek and Couldn't Believe the Cost
Three months ago I graduated from a coding bootcamp. I knew Python, I knew a little JavaScript, and I had watched roughly 200 hours of YouTube tutorials about AI. I had never actually built a RAG system in production. Honestly, I wasn't even sure I understood what RAG stood for on a deep level until I started this project.
RAG means retrieval-augmented generation, by the way, and yes, I had to look it up three times before it stuck. The basic idea is you take a big pile of documents, you store them in a way that lets a computer search through them fast, and then you let an AI model pull relevant chunks out and use them to answer questions. Pretty wild when you think about it.
What I'm about to walk you through is the build I did over a long weekend. I picked DeepSeek because I was shocked at how cheap it was compared to the big-name models I'd been learning about in bootcamp. Let me explain.
The Moment I Realized GPT-4o Was Burning a Hole in My Wallet
During bootcamp, every instructor and every tutorial basically said the same thing: "Just use the OpenAI API, it's the gold standard." I went along with it. Then I started looking at actual pricing for what I wanted to do.
GPT-4o costs $2.50 per million input tokens and $10.00 per million output tokens. Read that again. I had no idea what that meant in real life until I did some napkin math. If I built a chatbot that answered questions from a 50-page PDF, and a few hundred people used it, I could easily rack up hundreds of dollars in a single afternoon.
For a bootcamp grad with no funding, that was a dealbreaker.
I went digging for alternatives and found a service called Global API. They list 184 AI models in one place, and the prices range from $0.01 all the way up to $3.50 per million tokens. I was shocked. I had no idea that this kind of price spread existed. It blew my mind.
My First Look at the Pricing Table
Once I started comparing, I couldn't stop. Here's the table that basically decided the whole project for me:
| Model | Input ($/M) | Output ($/M) | Context |
|---|---|---|---|
| DeepSeek V4 Flash | 0.27 | 1.10 | 128K |
| DeepSeek V4 Pro | 0.55 | 2.20 | 200K |
| Qwen3-32B | 0.30 | 1.20 | 32K |
| GLM-4 Plus | 0.20 | 0.80 | 128K |
| GPT-4o | 2.50 | 10.00 | 128K |
Look at that. DeepSeek V4 Flash is almost ten times cheaper than GPT-4o for input tokens. I literally scrolled up and down the page three times to make sure I wasn't misreading the numbers.
But here's the part that really got me. It wasn't just about being cheap. I started reading the benchmarks and the docs, and the performance was comparable. I was running my mouth in the bootcamp Discord about it for like a week. People thought I was exaggerating.
Picking a Model (The Decision That Took Me Forever)
I'm the kind of person who will spend three hours picking a coffee order, so choosing a model was painful.
I ended up going with DeepSeek V4 Flash for most of my RAG queries because it's the most cost-effective option with a solid 128K context window. That's enough to hold a small book's worth of text in a single prompt. For the rare cases where I needed deeper reasoning, I bumped up to DeepSeek V4 Pro, which has a 200K context window.
I also tested Qwen3-32B and GLM-4 Plus because they're even cheaper. The 32K context on Qwen3-32B was too small for my use case, but GLM-4 Plus at $0.20 per million input tokens was tempting. I might switch to it for a future project.
The Actual Code (This Is Where I Struggled the Most)
Alright, let me show you the actual implementation. This was the part where I almost gave up twice.
The first code example I want to show you is just a basic API call. Honestly, getting this working was the breakthrough moment for me. I'd been watching tutorials where everyone used the OpenAI Python library, so I just kept using it but pointed it at Global API's endpoint.
import openai
import os
client = openai.OpenAI(
base_url="https://global-apis.com/v1",
api_key=os.environ["GLOBAL_API_KEY"],
)
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-V4-Flash",
messages=[{"role": "user", "content": "Your prompt"}],
)
print(response.choices[0].message.content)
That's it. That's the call. I had been making this so much harder in my head for weeks. The base_url swap was the whole secret. I was shocked. I had no idea you could just point the OpenAI library at a different server. It blew my mind.
Of course, that snippet alone doesn't do RAG. RAG needs the retrieval part, where you actually find the relevant document chunks before sending them to the model. Let me show you a slightly more useful version with a simple retrieval step.
import openai
import os
import numpy as np
client = openai.OpenAI(
base_url="https://global-apis.com/v1",
api_key=os.environ["GLOBAL_API_KEY"],
)
# Each entry is (chunk_text, embedding_vector)
document_store = []
def retrieve_relevant_chunks(query, top_k=3):
"""Pretend this does a vector similarity search."""
# In real life you'd use FAISS, Pinecone, Chroma, etc.
return [chunk for chunk, _ in document_store[:top_k]]
def ask_rag(question):
chunks = retrieve_relevant_chunks(question)
context = "\n\n".join(chunks)
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-V4-Flash",
messages=[
{
"role": "system",
"content": f"Answer the question using only the context below.\n\n{context}"
},
{"role": "user", "content": question}
],
)
return response.choices[0].message.content
answer = ask_rag("What is the refund policy?")
print(answer)
The first time this worked end-to-end, I called my partner into the room to watch. It felt like magic. I had no idea this kind of thing was actually achievable on a bootcamp budget.
Things I Wish Someone Had Told Me Earlier
These are the patterns I picked up the hard way. If I could go back and give myself a list of tips, this would be it.
Cache everything you possibly can. This is the single biggest cost saver. About 40% of the questions my users asked were repeats or near-repeats. I built a simple dictionary-based cache (Redis would have been smarter) and the savings were real. The original article I was learning from mentioned a 40% hit rate, and I hit almost exactly that number in my own testing.
Stream your responses. The user experience of seeing words appear one at a time is dramatically better than staring at a blank box for two seconds. Plus, the perceived latency drops even though the total time is the same. Streaming is just a flag in the API call, super easy to enable.
Don't pay for power you don't need. I was sending every single query through the most expensive model at first. Then I read about routing simple queries to a cheaper model. I started using the more affordable tier for short factual questions and saved roughly 50% on cost. The model name I keep seeing for this pattern is GA-Economy, which fits perfectly into a 184-model lineup.
Track whether the answers are actually good. I built a tiny thumbs-up/thumbs-down button in my frontend. You'd be surprised how much you learn from even a few hundred ratings. Quality monitoring is the kind of thing bootcamp doesn't teach but production demands.
Plan for failure. The first time I got rate-limited was deeply embarrassing. I had no fallback. Now I have a try/except that automatically retries once with a different model, and if that fails, it returns a friendly "try again in a minute" message. The compound effect of having fallbacks in place is that my system feels much more solid even though I added maybe 15 lines of code.
The Numbers That Made Me a Believer
I'm a numbers person now, I think. Bootcamp drilled it into me, and this project hammered it home. Here are the stats that sealed the deal for me.
Latency averaged around 1.2 seconds per response. That's faster than I expected, and roughly on par with what I was getting from the bigger-name providers when I tested them with the same prompts. Throughput came out to about 320 tokens per second, which is plenty for the conversational interface I was building.
The benchmark score I kept seeing in the research was 84.6%. I'll be honest, I don't fully understand all the methodology behind that number, but comparing it head-to-head against my earlier GPT-4o tests, the user-facing quality difference was negligible for my use case.
The big one: 40-65% cost reduction versus the generic "just use OpenAI" approach. For a project of my size, that meant I could run my entire RAG system for less than what a single afternoon of GPT-4o testing would have cost.
What I Would Tell Another Bootcamp Grad
If you're reading this and you're where I was a few months ago, here's the honest truth. You don't need a CS degree to build a real RAG system. You don't need a team. You don't need a lot of money. You need a clear use case, the willingness to read documentation, and the patience to debug for a few hours when things break.
The thing that surprised me most is how approachable the modern AI tooling has become. The OpenAI Python library works with basically every provider as long as you swap the base URL. I was prepared for a week of integration pain. Instead, the actual API integration took me about ten minutes. The rest of the time was spent on the boring parts: chunking documents, building the vector store, writing the frontend.
How to Actually Try This Yourself
If you want to follow in my footsteps, here's the shortest path I can recommend. Sign up at Global API. Grab your API key. Install the OpenAI Python library. Copy the code snippet I showed you earlier and swap in your own model and prompt. Watch it work. From there, add document retrieval, add caching, add streaming, and you've got yourself a real RAG system.
When I was researching this, the original article that helped me was the Global API pricing page and their model catalog. They list all 184 models in one place, which made my comparison shopping incredibly easy. Check it out if you want a single spot to browse pricing across dozens of providers.
I genuinely did not expect to be this enthusiastic about a RAG build. I went in thinking I'd write a basic demo and move on. I came out with a system I actually use, that actually works, and that cost me basically nothing to run. If a bootcamp grad with three months of post-grad experience can pull this off, I promise you can too. Go build something.
Top comments (0)