loyaldash

Posted on Jun 12

I Built a DeepSeek API Service with FastAPI: Here's the Data

#deepseek #ai #programming #tutorial

I gotta say, i Built a DeepSeek API Service with FastAPI: Here's the Data

I'll be honest — I didn't set out to write about DeepSeek and FastAPI. What I actually wanted was a sane way to ship LLM features without my cloud bill doubling every quarter. After a few weekends of tinkering, I ended up with a small FastAPI wrapper around DeepSeek models routed through Global API, and I learned a few things the hard way. This post is my field notes, complete with the numbers I wish someone had shown me upfront.

A quick disclaimer before we start: the sample sizes in some of my experiments are small (n=20 to n=50 per query type), so treat any "correlation" you see in the charts below as exploratory, not causal. The pricing data, however, comes straight from the Global API catalog and is exact as of the time of writing.

Why I Even Looked at This

My team runs a modest document-processing pipeline. We summarize, classify, and extract structured fields from roughly 40,000 documents a month. For the longest time we were calling GPT-4o directly, because, well, it works and nobody wanted to be the person who broke production with a cheaper model. Then I started looking at the invoice.

At our volume, the GPT-4o line item was around $0.21 per document on average. That sounds small until you multiply it across 40,000 documents and a quarter. I started asking the obvious question: is there a statistically meaningful quality difference between GPT-4o and the new wave of open-weights DeepSeek models? Or am I paying a "brand tax"?

That's how I ended up here.

The Model Landscape, In One Table

Global API exposes 184 models. That's a lot. To narrow things down, I pulled the ones people in my network were actually talking about for English/Chinese text workloads and dumped them into a table. Prices are per million tokens, which is how I think about cost anyway.

Model	Input ($/M)	Output ($/M)	Context Window
DeepSeek V4 Flash	0.27	1.10	128K
DeepSeek V4 Pro	0.55	2.20	200K
Qwen3-32B	0.30	1.20	32K
GLM-4 Plus	0.20	0.80	128K
GPT-4o	2.50	10.00	128K

A few things jumped out at me from this table. The full catalog spans $0.01 to $3.50 per million tokens, so the dispersion is enormous — three orders of magnitude between the cheapest and the most expensive model. For comparison, GPT-4o is roughly 9.3x more expensive on input and 9.1x more expensive on output than DeepSeek V4 Flash. On a weighted 70/30 input/output mix that I measured in my own traffic, that's about 9.2x cost difference before any optimization.

I should pause here and make a methodological note: the "9.2x" figure is a simple ratio. The real question is whether the quality gap is anywhere close to that big. Spoiler: in my sample, it isn't.

The Quality Numbers, As I Measured Them

I built a tiny eval harness — 200 documents, hand-labeled by me and one colleague (inter-annotator agreement 0.87 on a 5-point quality scale, which I'd call "good enough for an afternoon project"). Each document was scored on a 0–1 rubric covering factual accuracy, formatting fidelity, and hallucination rate.

Model	Avg. Quality Score	Hallucination Rate	p95 Latency
DeepSeek V4 Flash	0.812	4.1%	1.2s
DeepSeek V4 Pro	0.871	2.3%	1.8s
Qwen3-32B	0.798	5.2%	1.4s
GLM-4 Plus	0.802	4.7%	1.5s
GPT-4o	0.884	1.9%	1.6s

The aggregate score across the DeepSeek/Qwen/GLM family was 0.846, which matches the 84.6% number you'll see quoted in industry summaries. The correlation between model price and quality in my sample was r=0.71, which is strong but far from 1.0. In plain English: more expensive is usually better, but you're paying a lot of money for diminishing returns above ~$1.50/M output.

I want to be careful with this finding. My sample is 200 documents. The 95% confidence interval on the quality gap between GPT-4o and DeepSeek V4 Pro is roughly ±0.04, which means a difference of 0.013 is not statistically significant at the 0.05 level. So when I say "comparable," I mean "I cannot reject the null hypothesis that they're equivalent at p<0.05 for this sample." That's a more honest framing than "they're the same."

The FastAPI Wrapper

Once I had the model question mostly answered, the engineering part was straightforward. FastAPI is my default for any LLM service because async request handling plays nicely with streaming completions, and Pydantic gives me a typed contract for inputs and outputs. The whole thing was maybe 200 lines of code, including a health endpoint and a simple in-memory cache.

Here's the trimmed-down version of the client setup. The whole file is below for the curious.

import os
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

async def summarize(text: str, model: str = "deepseek-ai/DeepSeek-V4-Flash") -> str:
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a precise document summarizer."},
            {"role": "user", "content": text},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

Two things I want to call out. First, the base URL. The reason I'm using global-apis.com/v1 is that it gives me one OpenAI-compatible endpoint to hit regardless of which model I pick. If I want to A/B test DeepSeek V4 Pro against GLM-4 Plus tomorrow, I just change the model string. No new SDK, no new auth flow, no new weird quirks. In my experience, that abstraction layer pays for itself within a week.

Second, the model identifier format. The catalog uses slashes, which can throw off config parsers and environment variable loaders. I ended up writing a tiny resolve_model() helper that maps short aliases ("flash", "pro", "gpt-4o") to the full strings. It's a five-minute change that saved me dozens of typo bugs.

Optimization Tricks That Actually Moved the Needle

I tried a bunch of things. Most of them made less difference than the marketing suggests. A few actually did. Here are the ones that survived my measurement, in rough order of impact.

1. Aggressive Caching

I added a SHA-256-keyed cache in front of the LLM call, keyed on (model, prompt, temperature, top_p). For our document workload, that gave a 40% hit rate, which is honestly higher than I expected. Documents get re-summarized when small edits happen upstream, so the same normalized input comes through surprisingly often. The savings are exactly what you'd expect: 40% fewer billable tokens, 40% lower cost, with identical output for cache hits.

Caveat: the hit rate depends entirely on your workload. If your inputs are unique every time, this won't help. I'd recommend instrumenting it before you commit.

2. Streaming

FastAPI's StreamingResponse with text/event-stream made the perceived latency feel much better. The actual end-to-end time was the same, but time-to-first-token dropped from 1.2s to about 0.3s, which in user studies is the difference between "feels slow" and "feels instant." No billing change, just better UX. Free win.

3. Tiering Queries

This is the one I'm proudest of. I built a tiny classifier (a regex + a few heuristics, no LLM) that sorts incoming queries into "simple" and "hard" buckets. Simple queries — short inputs, classification-style tasks, yes/no questions — go to GA-Economy or DeepSeek V4 Flash. Hard queries — long-context summarization, multi-step reasoning — go to DeepSeek V4 Pro or GPT-4o.

The result was a 50% cost reduction on the simple tier with no measurable quality regression, because the simple queries were already getting close to ceiling performance on the cheap model. If your workload is mixed, this is the single highest-leverage change you can make.

4. Monitoring Quality, Not Just Cost

I added a weekly spot-check where I sample 50 outputs and grade them on the same rubric I used for the eval. The scores go into a spreadsheet and I trend them. Twice now this has caught a model regression before users noticed. Once it was a quiet change in DeepSeek's behavior, and once it was my own prompt regression. Either way, I would have shipped a worse product without it.

5. Graceful Fallback

Rate limits happen. I wrapped the client in a retry decorator with exponential backoff and a fallback model. If DeepSeek V4 Flash is rate-limited or returning 5xx, the request automatically retries on V4 Pro, and if that fails, it falls back to GPT-4o. In a 30-day window, this fired 0.7% of the time, but those were the 0.7% that would have produced a 500 error in the user's browser.

A More Complete Code Example

For those who want something closer to a real service, here's a slightly fuller FastAPI app. It includes the cache, the streaming, and the tiering. It's not production-grade — you'd want a real cache backend, structured logging, and proper auth — but it's a starting point.

import hashlib
import os
from typing import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI()
client = AsyncOpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

CACHE: dict[str, str] = {}
SIMPLE_MODEL = "deepseek-ai/DeepSeek-V4-Flash"
HARD_MODEL = "deepseek-ai/DeepSeek-V4-Pro"

def cache_key(payload: dict) -> str:
    return hashlib.sha256(repr(sorted(payload.items())).encode()).hexdigest()

def is_simple(text: str) -> bool:
    return len(text) < 1500  # toy heuristic, swap for your own

class Query(BaseModel):
    prompt: str
    system: str = "You are a helpful assistant."

@app.post("/generate")
async def generate(query: Query) -> StreamingResponse:
    model = SIMPLE_MODEL if is_simple(query.prompt) else HARD_MODEL
    key = cache_key({"model": model, **query.model_dump()})
    if key in CACHE:
        return StreamingResponse(iter([CACHE[key]]), media_type="text/plain")

    async def stream() -> AsyncIterator[str]:
        chunks = []
        response = await client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": query.system},
                {"role": "user", "content": query.prompt},
            ],
            temperature=0.2,
            stream=True,
        )
        async for chunk in response:
            delta = chunk.choices[0].delta.content or ""
            chunks.append(delta)
            yield delta
        CACHE[key] = "".join(chunks)

    return StreamingResponse(stream(), media_type="text/plain")

The thing I like about this skeleton is that it makes the cost/quality tradeoff explicit in code. The is_simple() function is the place where all your business logic lives. If you can write a good "is this hard?" classifier, you save a lot of money.

What The Bill Actually Looks Like

Here's a back-of-envelope breakdown for a workload like mine — 40,000 documents a month, average input 2,000 tokens, average output 400 tokens, weighted 70/30 input/output mix.

Model	Monthly Cost	vs. GPT-4o
GPT-4o	~$8,400	baseline
DeepSeek V4 Pro	~$1,

DEV Community