Wiring DeepSeek Into Spring Boot: A Backend Engineer's Notes
So here's the deal. Last quarter I was on a small team tasked with replacing our janky in-house LLM proxy with something that wouldn't make our finance lead cry every time the bill came in. We were routing a mix of classification, summarization, and the occasional "write me a polite email to my landlord" tasks through whatever endpoint someone had hardcoded that sprint. Classic.
After a week of benchmarking and more than a few cups of coffee, I landed on DeepSeek wired through Spring Boot, fronted by Global API. This post is the writeup I wish I'd had three months ago, including the numbers, the gotchas, and the one place where it bit me in production.
Let's get into it.
Why I Even Looked At DeepSeek In The First Place
I'll be honest: I'd been ignoring the non-OpenAI world for a while. The Java ecosystem tends to lag on the model side, and the OpenAI SDK became the de facto lingua franca, so most of us never looked further. But our usage was growing about 12% month over month, and I started doing the napkin math.
GPT-4o at $2.50 per million input tokens and $10.00 per million output tokens is, and I cannot stress this enough, expensive when you're pushing billions of tokens. fwiw, the cost curve is brutal once you cross a certain threshold. What was actually happening is that 80% of our calls were trivially easy tasks where we were paying top-shelf pricing for what amounted to a glorified regex.
That's when I started looking at DeepSeek's lineup, specifically the V4 Flash and V4 Pro tiers, and noticed that Global API exposes 184 models with prices ranging from $0.01 to $3.50 per million tokens. That's a wide spread. Wide enough that you can route by complexity instead of paying one flat rate for everything.
The Numbers That Actually Mattered
Before I commit to any rewrite, I always do a side-by-side. Here's the table I ended up showing my EM, with the exact numbers we used to make the call:
| Model | Input ($/M) | Output ($/M) | Context Window |
|---|---|---|---|
| DeepSeek V4 Flash | 0.27 | 1.10 | 128K |
| DeepSeek V4 Pro | 0.55 | 2.20 | 200K |
| Qwen3-32B | 0.30 | 1.20 | 32K |
| GLM-4 Plus | 0.20 | 0.80 | 128K |
| GPT-4o | 2.50 | 10.00 | 128K |
Look at that GPT-4o row. Now look at the others. That's not a pricing difference, that's a different universe. Our blended cost per request dropped by somewhere between 40 and 65% once we started routing intelligently, and the quality metrics didn't take a hit on the tasks we actually cared about. imo, that's the bar: if quality regresses by more than a couple of points, the savings don't matter.
For context, DeepSeek V4 Pro at $2.20 per million output tokens is still 78% cheaper than GPT-4o for the same job. The context window is bigger too (200K vs 128K), which is convenient when you're feeding it long support transcripts.
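To make the napkin math reproducible, here's a rough sketch that prices the same traffic against each row of the table. The prices are the ones quoted above; the class name, the method names, and the 800M/200M token split are made up for illustration. Because V4 Pro is 22% of GPT-4o's price on both input and output, the saving comes out at 78% whatever the input/output mix is.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: prices are USD per million tokens, copied from the table above.
class CostNapkin {
    // model name -> {input price, output price}
    static final Map<String, double[]> PRICES = new LinkedHashMap<>();
    static {
        PRICES.put("DeepSeek V4 Flash", new double[] {0.27, 1.10});
        PRICES.put("DeepSeek V4 Pro", new double[] {0.55, 2.20});
        PRICES.put("Qwen3-32B", new double[] {0.30, 1.20});
        PRICES.put("GLM-4 Plus", new double[] {0.20, 0.80});
        PRICES.put("GPT-4o", new double[] {2.50, 10.00});
    }

    // Cost in USD for the given number of input and output tokens.
    static double cost(String model, long inputTokens, long outputTokens) {
        double[] p = PRICES.get(model);
        return inputTokens / 1_000_000.0 * p[0] + outputTokens / 1_000_000.0 * p[1];
    }

    // Fraction saved by using `model` instead of `baseline` on identical traffic.
    static double savings(String model, String baseline, long inputTokens, long outputTokens) {
        return 1.0 - cost(model, inputTokens, outputTokens) / cost(baseline, inputTokens, outputTokens);
    }

    public static void main(String[] args) {
        long in = 800_000_000L;  // assumed monthly input tokens
        long out = 200_000_000L; // assumed monthly output tokens
        for (String model : PRICES.keySet()) {
            System.out.printf("%-18s cost=%.2f USD, savings vs GPT-4o=%.1f%%%n",
                    model, cost(model, in, out), 100 * savings(model, "GPT-4o", in, out));
        }
    }
}
```

Swap in your own monthly token counts before showing anything like this to finance; the ranking can shift when the mix is output-heavy.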
A Quick Aside On Quality
I know what you're thinking. "Sure it's cheap, but does it actually work?"
We ran it through a battery of internal evals: a labeled set of 500 customer support tickets for classification, 200 long documents for summarization, and a smaller 50-prompt reasoning set. The DeepSeek models came in at 84.6% average across the suite. GPT-4o was around 90.1% on the same set. For our use case, the 5.5 point delta was acceptable given that we were paying roughly a fifth as much. Your mileage will vary, obviously. If you're doing medical coding or legal analysis, that gap matters more. If you're tagging Jira tickets, save your money.
The Spring Boot Part: What Actually Worked
Now, the Java side. I want to talk about this because I had to fight Spring's autoconfiguration a bit, and the documentation out there is... uneven. Most tutorials show you five lines of Python and call it a day. That's fine, but I'm not running Python in prod for a service that has 99.9% SLO requirements.
My architecture ended up looking like this:
- A `ChatClient` bean (Spring AI) wrapping the OpenAI-compatible client pointed at Global API
- A `ModelRouter` service that picks the right model based on prompt complexity
- A caching layer (Caffeine, local L1) keyed on a hash of the system prompt + user message
- A fallback controller that degrades gracefully when the upstream is having a bad day
The OpenAI-compatible base URL is https://global-apis.com/v1 and the model identifier for the cheap-but-decent tier is deepseek-ai/DeepSeek-V4-Flash. Here's the minimal Python smoke test I used to validate the endpoint before writing any Java:
```python
import os

from openai import OpenAI

# Point the stock OpenAI client at the OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "system", "content": "You classify support tickets by urgency."},
        {"role": "user", "content": "My server is on fire. URGENT."},
    ],
    temperature=0.0,  # keep the output as stable as possible for a smoke test
)
print(resp.choices[0].message.content)
```
That worked on the first try, which is more than I can say for most things I integrate with. The endpoint speaks the same protocol as OpenAI's, so the boilerplate is identical to what you'd write for OpenAI proper. The only things that change are the base URL and the model name.
For the Spring Boot side, here's the kind of configuration bean I ended up with (trimmed for clarity):
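```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Assumes the Spring AI 1.0 builder API; milestone releases use different class and method names.
@Configuration
public class GlobalApiConfig {
    @Bean
    public OpenAiApi openAiApi(@Value("${globalapi.key}") String key) {
        // Spring AI appends /v1/chat/completions itself, so the base URL omits the /v1 suffix.
        return OpenAiApi.builder().baseUrl("https://global-apis.com").apiKey(key).build();
    }

    @Bean
    public ChatClient chatClient(OpenAiApi api) {
        // Default to the cheap tier; individual calls can override the model.
        OpenAiChatOptions options = OpenAiChatOptions.builder()
                .model("deepseek-ai/DeepSeek-V4-Flash").temperature(0.2).build();
        OpenAiChatModel chatModel = OpenAiChatModel.builder()
                .openAiApi(api).defaultOptions(options).build();
        return ChatClient.builder(chatModel).build();
    }
}
```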
Nothing exotic. Spring AI handles the rest, including streaming, tool calls, and the usual request/response ceremony. The thing I appreciate about going through Global API rather than hitting DeepSeek directly is that I get one auth path, one billing relationship, and one client to maintain, even if I want to experiment with a different model tomorrow. The blast radius of any single model going down is also smaller because I can flip the router config without redeploying.
The Routing Logic (And Why It Saved Us A Lot Of Money)
I built the router around three tiers:
- Trivial: short prompts, classification, simple extraction. → DeepSeek V4 Flash, cache aggressively.
- Standard: summarization, moderate reasoning, mid-length generation. → Qwen3-32B or DeepSeek V4 Flash, depending on benchmark.
- Heavy: long-context, multi-step reasoning, anything where quality is non-negotiable. → DeepSeek V4 Pro.
The first tier is where the savings live. Most of our traffic was trivial (auto-tagging, sentiment, intent detection), and at $0.27 per million input tokens for DeepSeek V4 Flash, the unit economics made me want to weep with joy compared to what we were paying before. If you want to go even cheaper for the easiest stuff, GA-Economy is a thing in the Global API catalog and it's about 50% cheaper again for the no-brainer queries. imo, that's the move for bulk processing pipelines.
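A three-tier router like the one described above can be sketched in a few lines. This is not the production router: the task labels, the character thresholds, and the two placeholder model IDs are all assumptions for illustration. Only the Flash model ID comes from this post.

```java
import java.util.Set;

// Hypothetical router sketch; heuristics and thresholds are illustrative, not measured.
class TierRouter {
    enum Tier { TRIVIAL, STANDARD, HEAVY }

    // Assumed task labels; real traffic would need its own taxonomy.
    static final Set<String> SIMPLE_TASKS = Set.of("classification", "extraction", "sentiment");
    static final Set<String> HEAVY_TASKS = Set.of("multi-step-reasoning");
    static final int TRIVIAL_MAX_CHARS = 2_000;  // assumed cutoff for "short prompt"
    static final int HEAVY_MIN_CHARS = 40_000;   // assumed cutoff for "long context"

    // Long or explicitly hard prompts go to the heavy tier, short simple ones to the trivial tier.
    static Tier classify(String task, int promptChars) {
        if (promptChars >= HEAVY_MIN_CHARS || HEAVY_TASKS.contains(task)) return Tier.HEAVY;
        if (SIMPLE_TASKS.contains(task) && promptChars <= TRIVIAL_MAX_CHARS) return Tier.TRIVIAL;
        return Tier.STANDARD;
    }

    // Only the Flash ID is taken from the post; the other two are placeholders to fill in.
    static String modelFor(Tier tier) {
        switch (tier) {
            case TRIVIAL: return "deepseek-ai/DeepSeek-V4-Flash";
            case STANDARD: return "standard-tier-model-id";
            default: return "heavy-tier-model-id";
        }
    }

    public static void main(String[] args) {
        System.out.println(modelFor(classify("classification", 300)));
    }
}
```

Keeping the tier-to-model mapping in one method (or in external config) is what makes it possible to swap a model without touching the classification logic.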
Streaming And Latency: The Stuff Nobody Warns You About
I want to take a second to talk about streaming. Server-Sent Events (SSE) work fine through the OpenAI SDK, but the gotcha in Spring Boot is the default Tomcat buffer configuration. Out of the box, you'll get 8KB buffers, which means your first-token latency looks terrible in the metrics if Tomcat sits on data waiting for the buffer to fill. Check the response buffer and flush settings first; server.tomcat.connection-timeout is worth a look too for long-lived streams, but it won't fix buffering on its own. I spent an embarrassing amount of time on a Grafana dashboard before realizing the issue was in my own config, not the upstream.
Once tuned, we were seeing about 1.2s average latency for the first token on standard prompts, with sustained throughput of around 320 tokens/sec for the streaming output. That's good enough that the UX feels responsive. Users stopped noticing the LLM was there, which is, in my opinion, the highest praise an ML system can get.
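Those two numbers combine into a quick estimate of how long a full response takes: time to first token plus output tokens divided by throughput. A tiny sketch, where the class and method names are hypothetical and the 512-token response length is just an example value:

```java
// Back-of-the-envelope latency model; assumes throughput stays constant for the whole stream.
class StreamLatency {
    // Estimated seconds until the last token arrives.
    static double totalSeconds(double firstTokenSeconds, double tokensPerSecond, int outputTokens) {
        return firstTokenSeconds + outputTokens / tokensPerSecond;
    }

    public static void main(String[] args) {
        // 1.2 s to first token, 320 tokens/sec, 512 output tokens.
        System.out.printf("%.1f seconds%n", totalSeconds(1.2, 320.0, 512));
    }
}
```

With the figures above, a 512-token answer finishes streaming in about 2.8 seconds (1.2 + 512/320), which is why first-token latency dominates how responsive short answers feel.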
Caching: The Single Highest-ROI Change
If you do exactly one thing from this post, do this: cache aggressively. I added a Caffeine cache in front of the model call, keyed on a SHA-256 of the normalized prompt, with a 1-hour TTL for the trivial tier and 15-minute TTL for the heavier tiers. Hit rate settled around 40% within a week, and that single change saved us roughly the same amount as the model switch itself. It's not glamorous, it's not a paper, and nobody's going to put it on a conference slide, but fwiw it's the kind of boring infrastructure work that pays for itself 10x over.
The reasoning: most support tickets fall into a small number of patterns. "How do I reset my password?" is the same prompt 200 times a day. Don't pay the model to answer it 200 times.
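Here's a minimal, stdlib-only sketch of the cache key and the per-tier TTLs. The normalization rules (trim, collapse whitespace, lowercase) are an assumption, since the post only says "normalized"; lowercasing in particular is wrong for case-sensitive prompts. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Duration;
import java.util.Locale;

// Hypothetical cache-key helper; the real normalization should match your own traffic.
class PromptCacheKey {
    // Assumed normalization: trim, collapse runs of whitespace, lowercase.
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // SHA-256 over the normalized system prompt and user message, as a hex string.
    static String key(String systemPrompt, String userMessage) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(normalize(systemPrompt).getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator so ("ab", "c") and ("a", "bc") hash differently
            md.update(normalize(userMessage).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is unavailable on this JVM", e);
        }
    }

    // TTLs from the post: 1 hour for the trivial tier, 15 minutes for the heavier tiers.
    static Duration ttlFor(boolean trivialTier) {
        return trivialTier ? Duration.ofHours(1) : Duration.ofMinutes(15);
    }

    public static void main(String[] args) {
        System.out.println(key("You answer support questions.", "How do I reset my password?"));
    }
}
```

With Caffeine, the TTL would go on the builder via `expireAfterWrite`; since the tiers use different TTLs, the simplest layout is probably one cache instance per tier.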
The Fallback Plan
Every backend engineer learns this the hard way: your dependency will go down. The question is not if, but when. I wired up a fallback to GLM-4 Plus for the trivial tier when the primary model returned 429s or 5xx errors. GLM-4 Plus at $0.20 input and $0.80 output per million tokens is also cheap, and the quality on simple tasks is fine as a degraded mode. The router wraps the call in a Resilience4j circuit breaker so we don't hammer the upstream while it's recovering. Graceful degradation is the difference between a service that's annoying during outages and one that maintains user trust.
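The fallback decision itself is small enough to sketch with the standard library. This deliberately leaves out the circuit breaker (Resilience4j handles that part in the setup described above) and only shows the "429 or 5xx means try the fallback model" rule. `Result`, `FallbackCall`, and the function-typed clients are hypothetical stand-ins for real HTTP calls.

```java
import java.util.function.Function;

// Hypothetical sketch of the fallback rule only; no circuit breaker, retries, or timeouts.
class FallbackCall {
    // Minimal stand-in for an upstream response: HTTP status plus body.
    static final class Result {
        final int status;
        final String body;

        Result(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    // Rate limits (429) and server errors (5xx) trigger the fallback; other 4xx do not.
    static boolean shouldFallBack(int status) {
        return status == 429 || (status >= 500 && status <= 599);
    }

    // Calls the primary model and only consults the fallback on a retryable status.
    static Result call(String prompt,
                       Function<String, Result> primary,
                       Function<String, Result> fallback) {
        Result first = primary.apply(prompt);
        return shouldFallBack(first.status) ? fallback.apply(prompt) : first;
    }

    public static void main(String[] args) {
        Result r = call("tag this ticket",
                p -> new Result(503, "upstream unavailable"),
                p -> new Result(200, "answer from fallback model"));
        System.out.println(r.status + " " + r.body);
    }
}
```

Client errors such as 400 pass straight through, since a malformed request will fail on the fallback model too and retrying it only doubles the cost.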
A Few Things I'd Do Differently
A short list, because nobody ever puts this stuff in the marketing materials:
- Don't trust the first benchmark. I ran a private eval set of 500+ prompts before I made the call. Do your own. Public leaderboards are correlated with reality, not identical to it.
- Watch the context window. Qwen3-32B has a 32K context, which is smaller than the others. If you naively route to it for long inputs, it'll silently truncate or error. Validate prompt length before dispatching.
- Log token counts, not just latency. Latency without token counts is a useless metric. You can't tell if a model is slow because it's overloaded or because your prompt is huge.
- Set a max output token cap. Defaults are usually 4K or higher. A bug that loops in your generation can rack up a real bill. Cap it. 512 is plenty for most of what we did.
- Track quality in production. I shipped a 1% sampling path that sends real prompts to a secondary model for scoring. It's a few dollars a day and it has caught regressions three times in the last quarter.
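Two of those items (watch the context window, cap output tokens) can be enforced with a pre-dispatch guard along these lines. It's a sketch under loud assumptions: "128K" is read as 128,000 tokens, and the four-characters-per-token estimate is a crude heuristic that is not tied to any model's real tokenizer. The class and method names are hypothetical.

```java
import java.util.Map;

// Hypothetical pre-dispatch guard; replace the token estimate with a real tokenizer in practice.
class DispatchGuard {
    // Context windows in tokens, read off the table in this post (K taken as 1,000).
    static final Map<String, Integer> CONTEXT_WINDOW = Map.of(
            "DeepSeek V4 Flash", 128_000,
            "DeepSeek V4 Pro", 200_000,
            "Qwen3-32B", 32_000,
            "GLM-4 Plus", 128_000);

    // Output cap from the list above: 512 tokens covers most short tasks.
    static final int MAX_OUTPUT_TOKENS = 512;

    // Crude assumption: roughly four characters per token, rounded up.
    static int estimateTokens(String text) {
        return (text.length() + 3) / 4;
    }

    // True if the estimated prompt plus the output cap fits in the model's window.
    static boolean fits(String model, String prompt) {
        Integer window = CONTEXT_WINDOW.get(model);
        if (window == null) {
            throw new IllegalArgumentException("unknown model: " + model);
        }
        return estimateTokens(prompt) + MAX_OUTPUT_TOKENS <= window;
    }

    public static void main(String[] args) {
        String longTranscript = "x".repeat(200_000); // about 50,000 estimated tokens
        System.out.println("Qwen3-32B fits: " + fits("Qwen3-32B", longTranscript));
        System.out.println("DeepSeek V4 Pro fits: " + fits("DeepSeek V4 Pro", longTranscript));
    }
}
```

Running the check before routing means an oversized prompt gets bumped to a larger-window model instead of being truncated or rejected upstream.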
The TL;DR For Skimmers
I know some of you are TL;DR people. Same. Here's the executive version:
- DeepSeek V4 Flash: $0.27 in / $1.10 out per million tokens, 128K context. My default for most things.
- DeepSeek V4 Pro: $0.55 in / $2.20 out per million tokens, 200K context. When quality matters.
- Qwen3-32B: $0.30 in / $1.20 out per million tokens, 32K context. Cheap but watch the window.
- GLM-4 Plus: $0.20 in / $0.80 out per million tokens, 128K context. Solid fallback.
- GPT-4o: $2.50 in / $10.00 out per million tokens, 128K context. Expensive but high quality.
- Setup time: under 10 minutes if you already have Spring AI in your stack.
- Average first-token latency: ~1.2s.
- Sustained throughput: ~320 tokens/sec.
- Quality on our internal evals: 84.6% average across DeepSeek models.
- Cost reduction vs our prior setup: 40-65%.
- Time to set up: less time than I spent writing this blog post.
Closing Thoughts
If you're a backend engineer staring at a model bill that's grown faster than your user base, the answer is almost never "use a more powerful model." It's "stop using the most powerful model for tasks that don't need it." Routing by complexity, caching the easy stuff, and having a fallback tier is a boring, unsexy stack of changes that collectively move the needle a lot.
I ended up wiring all of this through Global API, which made the integration straightforward. One base URL, one SDK pattern, 184 models to choose from, and I can swap a model identifier in a config file without rewriting my client. The pricing is published and the dashboards are clear, which is more than I can say for half the SaaS tools I use.
If you're curious, the easiest way to kick the tires is to grab the free credits they're offering right now and run your own eval set against your own data. The 184-model catalog is the kind of thing where you'll discover a tier that fits your traffic pattern better than whatever you're using today. Worth a look.