DEV Community

rarenode
Building DeepSeek RAG From Scratch: What Nobody Tells You

Six months ago I was staring at a $47,000 monthly OpenAI bill for a RAG pipeline that served maybe 12 enterprise customers. That's when I started taking DeepSeek seriously. If you're a founder or CTO weighing whether to rebuild your retrieval-augmented generation stack on something other than the usual suspects, this is the post I wish someone had handed me before I burned through that budget.

I'm not going to dress this up. RAG is one of those things every team builds once, then rebuilds twice, then realises the third architecture is the one that survives production. My goal here is to save you the middle rebuild by sharing what actually worked, what didn't, and what the real cost picture looks like once you're serving real traffic.

The Setup That Forced My Hand

My company runs a B2B document intelligence product. Customers upload contracts, financial filings, technical manuals, and the usual nightmare of mixed-format PDFs. They ask questions in natural language and expect citations. Classic RAG territory.

For the first year I ran everything on GPT-4o. It worked brilliantly. Latency was solid, the answers were clean, the reasoning was strong. Then I checked the invoice.

The unit economics were brutal. At GPT-4o pricing of $2.50 per million input tokens and $10.00 per million output tokens, a single complex query against a 40-page contract was running me somewhere between 8 and 14 cents once you factored in the chunking, the embedding, the reranking, and the verification passes. When one of our larger customers started doing 40,000 queries a day, the math stopped working.

I had three options: raise prices, accept margin compression, or rethink the stack. I picked door number three.

Why DeepSeek Kept Coming Up

I went deep on benchmarks for two weeks. I read every comparison I could find, ran my own evals against our internal test set of 800 legal and financial questions, and pinged other CTOs in my network about what they were actually shipping.

DeepSeek kept winning on the cost-adjusted quality axis. The two variants I kept circling back to were DeepSeek V4 Flash at $0.27 input / $1.10 output with 128K context, and DeepSeek V4 Pro at $0.55 input / $2.20 output with 200K context. Compare those numbers against GPT-4o at $2.50 / $10.00 and you start to see why my CFO suddenly wanted to have coffee.
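If you want to sanity-check those ratios yourself, here's a minimal sketch that turns the per-million-token prices quoted above into "percent cheaper than GPT-4o" figures. The prices come straight from this post; the function names are mine and purely illustrative.

```python
# Per-million-token prices in USD (input, output), as quoted in this post.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "DeepSeek V4 Flash": (0.27, 1.10),
    "DeepSeek V4 Pro": (0.55, 2.20),
}


def percent_cheaper(candidate: float, baseline: float) -> float:
    """Return how much cheaper `candidate` is than `baseline`, in percent."""
    return round((1 - candidate / baseline) * 100, 1)


def compare_to_baseline(baseline: str = "GPT-4o") -> dict[str, tuple[float, float]]:
    """Map each non-baseline model to (input % cheaper, output % cheaper)."""
    base_in, base_out = PRICES[baseline]
    return {
        name: (percent_cheaper(p_in, base_in), percent_cheaper(p_out, base_out))
        for name, (p_in, p_out) in PRICES.items()
        if name != baseline
    }


for name, (in_pct, out_pct) in compare_to_baseline().items():
    print(f"{name}: {in_pct}% cheaper on input, {out_pct}% cheaper on output")
```

List prices only tell part of the story, since your real blended saving depends on your input/output token mix and how traffic splits across models.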

But cheap is only interesting if quality holds. In my evals, DeepSeek V4 Flash scored in the mid-80s on our internal rubric, which was within a few points of GPT-4o for the kinds of structured extraction and summarization tasks RAG cares about. When you weigh that small quality gap against the cost delta, the decision makes itself.

One thing I want to flag up front: the broader market I'm shopping in through Global API has 184 models with prices ranging from $0.01 to $3.50 per million tokens. Having that range available without signing twelve separate enterprise contracts is what makes the architecture I describe below actually possible. Vendor lock-in isn't just about exit costs. It's about how quickly you can A/B a new model when one drops that shifts the landscape. More on that in a minute.

The Architecture I Actually Shipped

Let me walk you through what I'm running in production today. I want to be specific because most RAG blog posts hand-wave the hard parts.

The pipeline has five stages: ingestion, chunking, embedding, retrieval, and generation. I keep the embedding and retrieval pieces model-agnostic using a standard vector store (Pinecone, but the choice doesn't matter here). The interesting decision is the generation layer, which is where DeepSeek lives.

Here's the routing logic. Easy queries - things like "what's the termination clause" or "summarize section 4" - go to DeepSeek V4 Flash. The model is fast, the answers are adequate for the price, and at $1.10 per million output tokens I genuinely don't care if a user runs 200 of those queries a session.

Hard queries - multi-hop reasoning across documents, financial calculations, anything where the user is going to be unhappy with a wrong answer - go to DeepSeek V4 Pro. The $2.20 output rate is still 78% cheaper than GPT-4o for the same volume, and the 200K context window means I can stuff whole contracts in without aggressive summarization that loses information.

This is the part that took me the longest to figure out: build the router, not the model. If you let your application code make assumptions about which LLM is generating, you've already lost. You will want to swap models. The vendor releasing the better one next quarter might not be the one you're using today. Architect for optionality.
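To make "build the router" concrete, here's a rough sketch of the shape I mean: application code asks for a route by label and never names a model directly. The `Route` dataclass, the `ROUTES` table, and the choice to send unknown labels to the stronger model are illustrative assumptions, not a copy of my production config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    max_tokens: int


# Hypothetical routing table. In practice this would be loaded from config,
# so swapping a model is a config change rather than a code change.
ROUTES = {
    "simple": Route(model="deepseek-ai/DeepSeek-V4-Flash", max_tokens=1024),
    "complex": Route(model="deepseek-ai/DeepSeek-V4-Pro", max_tokens=1024),
}


def resolve_route(complexity: str, routes: dict[str, Route] = ROUTES) -> Route:
    """Map a complexity label to a route; unknown labels get the stronger model."""
    return routes.get(complexity, routes["complex"])
```

Defaulting unknown labels to the more capable model is a judgment call: it costs a little more, but a misrouted hard query is the more expensive mistake.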

The Code, In Case You're Wiring This Up Tonight

Here's the actual snippet I have running in production. I'm using Global API as the unified gateway so I can flip between vendors with a single config change:

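```python
import os

import openai

# OpenAI-compatible client pointed at the gateway, so swapping vendors is a config change.
client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)


def generate_answer(query: str, context_chunks: list[str], complexity: str = "simple") -> str:
    # Route simple queries to Flash and everything else to Pro.
    model = "deepseek-ai/DeepSeek-V4-Flash" if complexity == "simple" else "deepseek-ai/DeepSeek-V4-Pro"

    # Join the retrieved chunks into a single context block.
    context = "\n\n".join(context_chunks)
    prompt = f"""Answer the question using only the context below. Cite specific sections.

Context:
{context}

Question: {query}

Answer:"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,  # keep answers close to deterministic for citation-style output
        max_tokens=1024,
    )
    # message.content can be None, so fall back to an empty string.
    return response.choices[0].message.content or ""
```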

The complexity flag is set by a separate classifier I built that's embarrassingly simple - it looks at query length, presence of numbers, and a few keyword triggers. You could make it fancier but this works.
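For a sense of how simple "embarrassingly simple" can be, here's a sketch along those lines. The keyword list, the word-count threshold, and the "two signals means complex" rule are all made up for illustration, so tune them against your own query logs.

```python
import re

# Hypothetical trigger words and length threshold.
COMPLEX_KEYWORDS = {"compare", "calculate", "total", "across", "difference", "why"}
MAX_SIMPLE_WORDS = 20


def classify_complexity(query: str) -> str:
    """Return "simple" or "complex" using three cheap signals."""
    words = re.findall(r"[a-z']+", query.lower())
    signals = [
        bool(re.search(r"\d", query)),               # contains a number
        any(w in COMPLEX_KEYWORDS for w in words),   # contains a trigger word
        len(words) > MAX_SIMPLE_WORDS,               # unusually long query
    ]
    # Require two signals so "summarize section 4" stays on the cheap path.
    return "complex" if sum(signals) >= 2 else "simple"
```

With these settings, "summarize section 4" and "what's the termination clause" both come back as simple, while something like "Compare total fees across the 2022 and 2023 contracts" goes to the complex route.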

The cost difference between routes is substantial. A simple query hitting V4 Flash runs me about $0.0008. The same query against GPT-4o would have been $0.0078. That's nearly 10x cheaper. Across 40,000 daily queries where 70% classify as simple, the monthly savings add up to real money.
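To put a rough number on that, here's the back-of-the-envelope arithmetic using the figures above. It assumes a 30-day month and only counts the simple path, so treat the result as an order-of-magnitude estimate rather than an invoice.

```python
DAILY_QUERIES = 40_000
SIMPLE_SHARE = 0.70      # fraction of queries classified as simple
COST_FLASH = 0.0008      # USD per simple query on V4 Flash
COST_GPT4O = 0.0078      # USD for the same query on GPT-4o
DAYS_PER_MONTH = 30      # assumption


def monthly_simple_path_savings() -> float:
    """Estimate monthly USD saved on simple queries by moving off GPT-4o."""
    simple_queries = DAILY_QUERIES * SIMPLE_SHARE * DAYS_PER_MONTH
    return round(simple_queries * (COST_GPT4O - COST_FLASH), 2)


print(f"${monthly_simple_path_savings():,.2f} saved per month on the simple path")
```

That works out to 840,000 simple queries a month at $0.007 saved each, or about $5,880, before counting anything on the complex path.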

What About Caching and Streaming?

Yes. Do both. I'm not going to lecture you on streaming UX, but the perceived latency difference is enormous and it costs nothing to implement.
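If you haven't wired streaming before, a minimal sketch looks like this. It assumes an OpenAI-compatible client object like the one in my snippet; `stream_answer` is a name I made up for the example.

```python
from typing import Any, Iterator


def stream_answer(client: Any, model: str, prompt: str) -> Iterator[str]:
    """Yield answer text piece by piece from an OpenAI-compatible client."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,
        max_tokens=1024,
        stream=True,  # ask the server for incremental chunks
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

On the calling side you forward each yielded piece to the browser as it arrives (server-sent events work fine) instead of waiting for the full answer.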

Caching is where the real economics live. I'm running a semantic cache layer in front of the LLM. When a user asks something semantically similar to a previous query - and in document Q&A, this happens constantly - I return the cached answer without touching the model at all. My hit rate sits around 40%, which alone saves a meaningful chunk of the bill.
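Here's a stripped-down sketch of the idea, assuming you inject your own embedding function. The in-memory list, the linear scan, and the 0.92 similarity threshold are illustrative placeholders; a real deployment would sit on a vector store and pick the threshold from measured false-hit rates.

```python
import math
from typing import Callable, Optional

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92):
        self.embed = embed            # caller-supplied embedding function
        self.threshold = threshold    # hypothetical similarity cutoff
        self.entries: list[tuple[Vector, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the cached answer for the most similar past query, if close enough."""
        vec = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, answer: str) -> None:
        """Store an answer under the embedding of its query."""
        self.entries.append((self.embed(query), answer))
```

For document Q&A you'd also want to scope the cache per customer and per document set, otherwise a similar question against a different contract gets someone else's answer.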

The arithmetic is straightforward: every cache hit is a generation call you don't pay for, so a 40% hit rate removes roughly 40% of your generation spend. Even when you're paying $1.10 per million output tokens instead of $10.00, that saving is still meaningful. It's the difference between a feature that loses money and a feature that funds the next sprint.

One more thing on cost optimization. Global API exposes a tier called GA-Economy that I route truly trivial queries through. Think single-document lookups, short answers, template-based responses. I get roughly 50% cost reduction on those queries compared to my standard tier. It's not glamorous but it's where margin lives at scale.

The Vendor Lock-In Question

Let me talk about this directly because it's the question I get from every CTO I mention this stack to.

The single biggest architectural decision you can make for RAG in 2026 is to never let a vendor's SDK touch your application code. I learned this the hard way migrating off an early vector DB choice. Every line of vendor-specific code I had written was technical debt I had to pay down later.

That's why I'm religious about the OpenAI-compatible interface. By pointing at a generic endpoint - in my case https://global-apis.com/v1 - I can swap the underlying model, the vendor, or the routing logic without touching the rest of my stack. Last month I A/B tested GLM-4 Plus ($0.20 input / $0.80 output) against my DeepSeek routing for two weeks just to see if it moved the needle on a specific query class. The test took one config change and zero refactoring.
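For the traffic split itself, a deterministic hash bucket is usually enough. This is a sketch under my own assumptions: the experiment name, the 50/50 default, and the `glm-4-plus` model ID are placeholders, so check your gateway's model list for the real identifier.

```python
import hashlib


def ab_bucket(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to "control" or "treatment"."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the first 32 bits of the hash to a fraction in [0, 1).
    fraction = int(digest[:8], 16) / 0x100000000
    return "treatment" if fraction < treatment_share else "control"


# Hypothetical mapping from bucket to model; only this table changes per experiment.
MODEL_BY_BUCKET = {
    "control": "deepseek-ai/DeepSeek-V4-Flash",
    "treatment": "glm-4-plus",  # placeholder model ID
}


def model_for(user_id: str, experiment: str = "glm-vs-deepseek") -> str:
    """Pick the model a given user should hit for this experiment."""
    return MODEL_BY_BUCKET[ab_bucket(user_id, experiment)]
```

Hashing on the user ID keeps each user on one model for the whole test, which makes the quality comparison much cleaner than per-request randomization.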

Some other models I'm keeping on my radar through the same gateway: Qwen3-32B at $0.30 / $1.20 for specialized reasoning workloads, and obviously the DeepSeek family for general production. The point isn't that any one of these is permanently correct. The point is that I can find out in an afternoon, not a quarter.

The Numbers At Scale

Let me give you real production numbers from my deployment, because I think this is where most guides fail. They tell you what works in a notebook and skip the part where it has to keep working when 200 concurrent users are hammering it.

Latency: My average response time sits at about 1.2 seconds end-to-end, including retrieval and the generation call. This is on DeepSeek V4 Flash for the simple path. On V4 Pro for complex queries it's around 2.1 seconds, which is still faster than the GPT-4o baseline I was running.

Throughput: I'm seeing roughly 320 tokens per second sustained on the Flash variant. Pro is slower but I use it less frequently.

Quality: My internal benchmark across 800 questions gives me an 84.6% correctness score averaged across both model variants. That's within 2 points of my GPT-4o baseline.

Setup time: From a clean repo to a working RAG endpoint against the same documents, the initial integration took me under 10 minutes. Most of that was me deciding on chunk sizes. The actual API wiring was copy-paste.

The Mistakes I'd Avoid Next Time

A few things that cost me days I won't get back:

First, I over-engineered the chunking initially. I was doing semantic chunking with embeddings, fancy overlap strategies, the works. Then I tried fixed-size chunks with a 10% overlap and it worked just as well for my use case. Don't gold-plate this.
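The boring version fits in a dozen lines. This sketch counts characters rather than tokens to stay dependency-free, and the 1,000-character default is just an example value, not my production setting.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap_ratio: float = 0.10) -> list[str]:
    """Split text into fixed-size chunks that overlap by a fraction of chunk_size."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be positive and overlap_ratio in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 given the checks above
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each chunk starts `step` characters after the previous one, so the tail of one chunk repeats at the head of the next and a sentence that straddles a boundary still appears whole in at least one of them.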

Second, I waited too long to add a fallback path. I built everything against DeepSeek and only added a graceful degradation route to a secondary model after I got rate-limited during a customer demo. Embarrassing. Always have a fallback configured, even if you think you'll never need it. Rate limits are a real thing at scale.
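A fallback path doesn't need to be clever. Here's a sketch that walks an ordered list of models and returns the first answer it gets. `call_model` is a stand-in for whatever function wraps your LLM call, and catching every `Exception` is a simplification; with the OpenAI Python client you'd more likely narrow it to errors such as `openai.RateLimitError`.

```python
from typing import Callable, Sequence


def generate_with_fallback(
    call_model: Callable[[str], str],
    models: Sequence[str],
    retryable: tuple = (Exception,),
) -> tuple[str, str]:
    """Try each model in order; return (model_used, answer) from the first success."""
    last_error = None
    for model in models:
        try:
            return model, call_model(model)
        except retryable as err:
            last_error = err  # remember the failure and move to the next model
    raise RuntimeError("all configured models failed") from last_error
```

Returning the model name alongside the answer matters more than it looks: you want your telemetry to show how often the fallback is actually firing.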

Third, I didn't instrument cost from day one. I knew my latency was fine because I had dashboards. I didn't know my cost-per-query was ballooning until the invoice arrived. Now I track every generation's input and output tokens in my telemetry pipeline, and I have alerts on cost-per-session that fire before things go sideways.
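The tracking itself is a small amount of code. In this sketch the token counts would come from the `usage` field on each chat completion response (`prompt_tokens` and `completion_tokens`); the `CostTracker` class and the $0.50 per-session alert threshold are invented for the example.

```python
from collections import defaultdict

# USD per million tokens (input, output), as quoted in this post.
PRICES_PER_MILLION = {
    "deepseek-ai/DeepSeek-V4-Flash": (0.27, 1.10),
    "deepseek-ai/DeepSeek-V4-Pro": (0.55, 2.20),
}


class CostTracker:
    def __init__(self, session_alert_usd: float = 0.50):  # hypothetical threshold
        self.session_alert_usd = session_alert_usd
        self.session_cost: dict[str, float] = defaultdict(float)

    def record(self, session_id: str, model: str,
               prompt_tokens: int, completion_tokens: int) -> bool:
        """Add one generation's cost to its session; return True if the alert should fire."""
        in_price, out_price = PRICES_PER_MILLION[model]
        cost = (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
        self.session_cost[session_id] += cost
        return self.session_cost[session_id] > self.session_alert_usd
```

As a sanity check, a Flash call with 2,000 input tokens and 300 output tokens comes to about $0.00087, which is in the same range as the per-query cost I quoted for the simple path.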

Who Should And Shouldn't Do This

If you're running a B2B SaaS with moderate query volumes and tight margins, the DeepSeek route on a unified gateway is honestly a no-brainer. The cost reduction of 40-65% versus typical incumbent pricing, combined with comparable quality, is the kind of margin improvement that changes your fundraising math.

If you're building a consumer product where query costs are your entire business model, you have to be even more aggressive. You'd be looking at GA-Economy for almost everything, aggressive caching, possibly running your own quantized models. That's a different post.

If you're in a regulated industry where data residency and model provenance matter, you need to do your own diligence on which models are trained on data you're comfortable with. I'm not your lawyer, and the model landscape is moving fast.

The Part I Keep Coming Back To

The thing I want you to walk away with is this: the RAG stack you build in 2026 should be designed for the assumption that you'll want to change the model in 2027. Maybe 2028 at the latest. The vendors are releasing better models on quarterly cadences now, and pricing is dropping faster than anyone predicted. If your architecture can't take advantage of that, you're leaving real money on the table.

What I built gives me that flexibility. DeepSeek V4 Flash is my workhorse today at $0.27 input and $1.10 output. If something better lands next quarter at a comparable price, I can route traffic to it in an afternoon. If DeepSeek raises prices, I move to GLM-4 Plus or Qwen3-32B with the same effort. That's the whole game.

If you want to test this yourself, Global API gives you 100 free credits to start poking at all 184 models through the same gateway I'm using. Took me about an evening to validate the approach against my own data before I committed to the migration. Worth checking out if you're staring at your own OpenAI bill and doing the math I was doing.
