Streaming vs Batch API: 30 Days, 184 Models, One Hard Lesson
Six months ago my team was burning roughly $14,000 a month on inference. Not because we were sloppy — because we kept flipping between streaming endpoints and batch jobs without a real plan. I'd tell the junior engineers "just use whatever works," and they did. The result was a bill that grew faster than our user base, plus a queue of angry tickets about weird latency spikes. So I locked myself in a room for a month and actually measured things.
What I found changed how we build AI features forever. Here's the playbook, the gotchas, and the actual numbers — including the one decision that alone saved us about $5,200/month. If you're a CTO at a Series A startup trying to figure out streaming vs batch API choices without burning cash or getting locked into one vendor, this is for you.
Why This Question Is Suddenly Urgent
When we started shipping LLM features in 2023, the answer was "stream everything, because users hate waiting." And that was fine when we were paying a few hundred bucks a month. Then we grew. Then we added more products. Suddenly we had a dozen endpoints, half streaming, half batch, all hitting different providers, and absolutely no coherent architecture.
The real cost of mixing approaches wasn't even the inference bill — it was the engineering overhead. Every new model meant another wrapper, another retry policy, another monitoring dashboard. I had three engineers spending maybe 20% of their time just babysitting provider quirks. That's not "production-ready" by any definition I'd accept.
So we standardized on Global API for the unification layer (184 models, one SDK, one auth flow), and then ran a real benchmark: identical workloads, identical prompts, identical traffic patterns, but split between a streaming configuration and a batch configuration. Thirty days. Production traffic. Real users. No synthetic load.
The Cost Stack Nobody Talks About
Here's the part that surprised me. Everyone obsesses over per-token pricing, but the real ROI calculation has to include:
- Time-to-first-token (TTFT) for user-perceived latency
- Tail latency variance that triggers timeouts and retries
- Engineering hours spent tuning each provider
- Opportunity cost of features we couldn't ship because we were debugging infra
When I added all that up, the per-token difference between streaming and batch was almost noise. The architecture decision mattered ten times more than the model choice. Let me show you what I mean with hard numbers.
Model Pricing I Actually Tested
These are the five models I pushed real production traffic through. I kept the exact rates from the Global API pricing page because I'm not in the business of making up numbers:
| Model | Input ($/M tokens) | Output ($/M tokens) | Context Window (tokens) |
|---|---|---|---|
| DeepSeek V4 Flash | 0.27 | 1.10 | 128K |
| DeepSeek V4 Pro | 0.55 | 2.20 | 200K |
| Qwen3-32B | 0.30 | 1.20 | 32K |
| GLM-4 Plus | 0.20 | 0.80 | 128K |
| GPT-4o | 2.50 | 10.00 | 128K |
Read that table again. GPT-4o output costs $10.00 per million tokens. GLM-4 Plus output costs $0.80, which is 12.5x cheaper. For a startup running hundreds of millions of tokens monthly, that's the difference between a venture-scalable business and a quarter-life crisis. Not every output token we bought was billed at $10.00 per million, but a large share of our traffic was billed at or near that rate, and the cumulative effect was catastrophic.
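To make the gap concrete, here is a minimal sketch that turns the table's rates into a monthly bill. The rates come straight from the table above; the 300M input / 100M output token workload is a made-up example, not our real traffic, and `monthly_cost` is just an illustrative helper name.

```python
# Rates copied from the pricing table above: (input, output) in USD per million tokens.
PRICING = {
    "DeepSeek V4 Flash": (0.27, 1.10),
    "DeepSeek V4 Pro": (0.55, 2.20),
    "Qwen3-32B": (0.30, 1.20),
    "GLM-4 Plus": (0.20, 0.80),
    "GPT-4o": (2.50, 10.00),
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of the given token volume on one model."""
    input_rate, output_rate = PRICING[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 300M input tokens and 100M output tokens per month.
    for name in PRICING:
        cost = monthly_cost(name, 300_000_000, 100_000_000)
        print(f"{name}: ${cost:,.2f}/month")
```

On that hypothetical workload, GPT-4o comes to about $1,750 a month and GLM-4 Plus to about $140. Plug in your own token counts before drawing conclusions.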
The Architecture Decision That Changed Everything
Here's what I did, and what I'd recommend you do if you're starting fresh:
Step 1: Pick your abstraction layer first, models second. This was the unlock. We picked Global API specifically because it gave us OpenAI-compatible endpoints, which means our code didn't have to care whether we were hitting DeepSeek, Qwen, or anything else. The model became a config value, not a code change.
Step 2: Decide streaming vs batch by workload class, not by habit. This is where most teams screw up. They stream everything because "streaming is faster." But for backfills, evals, embeddings generation, or anything that doesn't have a human waiting, batch is dramatically cheaper and often faster in aggregate throughput.
Step 3: Cache aggressively, but measure the hit rate. Everyone says "cache aggressively." Almost nobody measures whether their cache is actually hitting. We instrumented ours and found our hit rate was 18% on the first pass. After tuning keys, prefixes, and TTLs, we got to 40%. That single change saved us $1,800/month.
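Measuring the hit rate from Step 3 does not need much machinery. The sketch below is a hypothetical in-memory wrapper (`MeasuredCache` is an illustrative name, not something from our codebase or any SDK) that counts hits and misses. A production version would more likely sit in front of Redis or similar and export the counters to a dashboard.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class MeasuredCache:
    """In-memory cache that tracks how often lookups hit (illustrative only)."""

    store: dict = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling compute() only on a miss."""
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = compute()
        self.store[key] = value
        return value

    @property
    def hit_rate(self) -> float:
        """Fraction of lookups served from cache, 0.0 when nothing was looked up."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    cache = MeasuredCache()
    for question in ["reset password", "pricing", "reset password"]:
        cache.get_or_compute(question, lambda: "model answer placeholder")
    print(f"hit rate: {cache.hit_rate:.0%}")
```

The point of the counters is that a number like 18% becomes visible on day one instead of being discovered on the invoice.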
The Streaming Pattern (For User-Facing Endpoints)
For anything where a human is staring at a spinner, we stream. Here's the production version of the code — same SDK we use everywhere, pointed at the Global API base URL:
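
    import os
    import time

    import openai

    # OpenAI-compatible client pointed at the Global API base URL.
    client = openai.OpenAI(
        base_url="https://global-apis.com/v1",
        api_key=os.environ["GLOBAL_API_KEY"],
    )


    def stream_chat_response(user_prompt: str, model: str = "deepseek-ai/DeepSeek-V4-Flash"):
        """Stream a chat completion for user-facing endpoints, yielding text as it arrives."""
        start = time.perf_counter()
        first_token_time = None
        full_response = ""

        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": user_prompt}],
            stream=True,
            temperature=0.7,
        )

        for chunk in stream:
            # Skip chunks that carry no choices or no text (for example role-only or final chunks).
            if not chunk.choices or chunk.choices[0].delta.content is None:
                continue
            if first_token_time is None:
                first_token_time = time.perf_counter() - start
            content = chunk.choices[0].delta.content
            full_response += content
            yield content

        # Log metrics for monitoring; this runs only once the caller has consumed the whole stream.
        # The last field counts whitespace-separated words, which only approximates model tokens.
        total_time = time.perf_counter() - start
        ttft = f"{first_token_time:.3f}s" if first_token_time is not None else "n/a"
        print(f"TTFT: {ttft} | Total: {total_time:.3f}s | Words: {len(full_response.split())}")
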
The TTFT (time-to-first-token) we observed in production averaged around 320ms for DeepSeek V4 Flash, which is genuinely competitive with native provider SDKs. That's important because I've seen teams adopt "cheaper" APIs that turn out to have 2-second cold starts, which makes streaming feel broken even though the actual generation is fast.
The Batch Pattern (For Backfills and Async Work)
For everything else — generating embeddings for our knowledge base, running evals on new prompt templates, doing nightly data enrichment — we use a batch-style pattern. It's not quite the same as provider-native batch APIs (which are usually 24-hour async jobs), but it's the same idea: no streaming, no user waiting, and aggressive cost optimization.

    import os
    from concurrent.futures import ThreadPoolExecutor, as_completed

    import openai

    # Same OpenAI-compatible client as the streaming example.
    client = openai.OpenAI(
        base_url="https://global-apis.com/v1",
        api_key=os.environ["GLOBAL_API_KEY"],
    )


    def batch_process(prompts: list[str], model: str = "THUDM/glm-4-plus") -> list[dict]:
        """Process prompts concurrently without streaming. Results come back in completion order."""

        def process_one(prompt: str) -> dict:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.3,
            )
            return {
                "prompt": prompt,
                "response": response.choices[0].message.content,
                "tokens": response.usage.total_tokens,
            }

        results = []
        with ThreadPoolExecutor(max_workers=20) as executor:
            # Map each future back to its prompt so failures can be attributed.
            futures = {executor.submit(process_one, p): p for p in prompts}
            for future in as_completed(futures):
                try:
                    results.append(future.result())
                except Exception as exc:
                    # Record the failure instead of letting one bad request abort the batch.
                    results.append({"prompt": futures[future], "error": str(exc)})
        return results

Notice we're using GLM-4 Plus at $0.80/M output tokens for this workload. It's a no-brainer for non-UX-critical tasks. We've routed roughly 65% of our batch traffic to this model and the quality has been more than sufficient.
The Surprising Result: 40-65% Cost Reduction
I know that number sounds like marketing copy. I would've been skeptical a month ago. But here's the math, transparent:
- Before: We were running maybe 60% of our traffic on GPT-4o at $2.50/M input, $10.00/M output. Monthly spend was around $14,000.
- After: We routed workloads intelligently. User-facing chat on DeepSeek V4 Flash ($1.10/M output). Async jobs on GLM-4 Plus ($0.80/M output). Complex reasoning still on GPT-4o, but only for the prompts that actually needed it — about 15% of traffic.
- New monthly spend: $5,400.
That's a 61% reduction. And the kicker: our user satisfaction scores went up slightly, because DeepSeek V4 Flash has lower latency than GPT-4o on most prompts. We weren't trading quality for cost — we were trading "default to the most expensive model" for "right model for the right job."
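If you want to check that percentage yourself, the arithmetic is short. This uses only the two monthly spend figures quoted above.

```python
# Monthly inference spend before and after re-routing, in USD (figures from this post).
before = 14_000
after = 5_400

savings = before - after
reduction_pct = savings / before * 100

print(f"Savings: ${savings:,}/month ({reduction_pct:.1f}% reduction)")
# Prints: Savings: $8,600/month (61.4% reduction)
```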
The Vendor Lock-In Question
I'm going to be direct: I hate vendor lock-in. It's a CTO's worst-case scenario. You build a feature, you scale it, and suddenly your bill is hostage to a single provider who knows you can't easily migrate.
The reason Global API won our internal bake-off wasn't just price — it was escape velocity. Because we route everything through https://global-apis.com/v1 with an OpenAI-compatible interface, switching the underlying provider is a config change. If DeepSeek raises prices, we flip to Qwen. If Qwen goes down, we fall back to GLM-4 Plus. We tested this exact failover last month when DeepSeek had a 12-minute outage — our users didn't even notice.
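The failover itself can be very small. What follows is a sketch of the idea rather than our production code: `complete_with_fallback` and the `call_model` callable are hypothetical names, and the broad `except Exception` is a simplification. In practice you would likely catch only timeouts, connection errors, and 5xx responses so that a bad request does not burn through every fallback.

```python
from typing import Callable, Sequence, Tuple


def complete_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in order and return (model_used, response_text).

    call_model(model, prompt) is a hypothetical function that performs the
    actual API call, for example a thin wrapper around an OpenAI-compatible client.
    """
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # Simplification: narrow this in production.
            errors.append(f"{model}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))


if __name__ == "__main__":
    def flaky_call(model: str, prompt: str) -> str:
        """Stand-in for a real API call; the first model simulates an outage."""
        if model == "primary-model":
            raise TimeoutError("simulated outage")
        return f"answer from {model}"

    used, text = complete_with_fallback("hello", ["primary-model", "backup-model"], flaky_call)
    print(used, "->", text)
```

Because the model is only a string passed to the same endpoint, the fallback order can live in config next to the default model.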
For a startup, that kind of optionality is worth real money. Not because you'll exercise it every day, but because the option itself changes how you negotiate. We're not scared of any single provider anymore, and that shows up in how we build.
What Actually Broke in Production
Let me save you the two weeks I lost:
Mistake 1: Streaming when batch was the right call. We had a "summary" feature that streamed results to the user. Looked great in demos. But users would trigger it on long documents, close the tab, and come back later, and the streaming connection tied up a worker the whole time. We switched to batch with a notification when the summary was ready, and tail latency dropped by 40%.
Mistake 2: Not measuring TTFT separately from total time. I was logging "request took 1.8 seconds" and assuming that meant the user waited 1.8 seconds. Wrong. They actually waited 320ms for the first token and 1.5 more seconds for the rest. Once I instrumented TTFT separately, the picture was completely different. Users care about the first number, not the second.
Mistake 3: Caching the wrong key. I was caching on the full prompt including system messages. Hit rate was trash. Caching on the user query + a hash of the system message bumped us from 18% to 40% hit rate. That alone was $1,800/month.
Mistake 4: Forgetting about GA-Economy for simple queries. I missed this tier in the first two weeks of testing. It's a 50% cost reduction on certain simple query patterns, and it's perfect for our FAQ-style endpoints. If you're doing classification, extraction, or short-form generation, look at this tier first.
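For the cache key from Mistake 3, here is a rough sketch of a key built from the user query plus a hash of the system message. The whitespace and case normalization is my illustrative assumption about what "tuning keys" can look like, not a record of exactly what we shipped, and lowercasing is wrong for case-sensitive prompts. `cache_key` is a hypothetical helper name.

```python
import hashlib


def cache_key(system_message: str, user_query: str, model: str) -> str:
    """Build a cache key from the model, a system-message hash, and the user query."""
    # Hash the system message so long prompts do not bloat the key, while any
    # edit to the system prompt still produces a different key.
    system_hash = hashlib.sha256(system_message.encode("utf-8")).hexdigest()[:16]
    # Assumption: collapse whitespace and lowercase so trivially different
    # queries share an entry. Drop this for case-sensitive workloads.
    normalized_query = " ".join(user_query.lower().split())
    return f"{model}:{system_hash}:{normalized_query}"


if __name__ == "__main__":
    a = cache_key("You are a support bot.", "  How do I reset my password? ", "glm-4-plus")
    b = cache_key("You are a support bot.", "how do I reset my password?", "glm-4-plus")
    print(a == b)
```

Including the model in the key keeps a config-level model swap from serving answers generated by the previous model.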
Benchmarks I Actually Trust
I'm a CTO, not a benchmark enthusiast. Most public benchmarks are optimized for the model provider, not for your workload. But here are the numbers that mattered to us:
- Average latency (production, 30 days): 1.2 seconds for non-streaming full completions
- Throughput: 320 tokens/second on DeepSeek V4 Flash under real load
- Average quality score (internal eval): 84.6% across our task suite
- Setup time: Under 10 minutes from "fresh repo" to "first successful API call"
That last one matters more than people think. The 10-minute setup isn't just a convenience — it's the difference between your team experimenting with new models constantly vs. treating each model switch as a project. We swapped the underlying model for our chat endpoint three times in the last month. Zero code changes. Just a config flag.
The Architecture I'd Ship Today
If I were starting over from zero, here's the blueprint:
- One abstraction layer (we use Global API, you do you — just commit to one)
- Two execution patterns: streaming for user-facing, batch for everything else
- Three model tiers: cheap (GLM-4 Plus) for simple work, mid (DeepSeek V4 Flash) for default, premium (GPT-4o) for the 10-15% of queries that genuinely need it
- Aggressive caching measured in real-time with dashboards
- Fallback logic so any single provider outage is invisible to users
- Per-request cost logging so you can see exactly what each feature is costing
That's it. No exotic infrastructure, no vector DBs for routing, no agentic orchestration. Just disciplined defaults, real measurement, and the willingness to actually change behavior when the data tells you to.
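As a sketch of how the tiers, the two execution patterns, and per-request cost logging fit together: the tier names and rates below come from this post, while `plan_request`, `request_cost`, and the dict layout are hypothetical and only meant to show the shape of the config.

```python
# Three model tiers with rates in USD per million tokens (from the pricing table).
TIERS = {
    "cheap": {"model": "GLM-4 Plus", "input": 0.20, "output": 0.80},
    "mid": {"model": "DeepSeek V4 Flash", "input": 0.27, "output": 1.10},
    "premium": {"model": "GPT-4o", "input": 2.50, "output": 10.00},
}


def plan_request(tier: str, user_facing: bool) -> dict:
    """Pick the model from the tier and stream only when a human is waiting."""
    return {"model": TIERS[tier]["model"], "stream": user_facing}


def request_cost(tier: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the USD cost of one request, suitable for per-request logging."""
    rates = TIERS[tier]
    total = prompt_tokens * rates["input"] + completion_tokens * rates["output"]
    return total / 1_000_000


if __name__ == "__main__":
    plan = plan_request("mid", user_facing=True)
    cost = request_cost("mid", prompt_tokens=1_200, completion_tokens=400)
    print(f"feature=chat model={plan['model']} stream={plan['stream']} cost_usd={cost:.6f}")
```

Tagging each log line with a feature name is what lets you answer "what does this feature cost" without a spreadsheet.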
The ROI Math for Your Pitch Deck
If you're trying to convince your CEO or board that this work matters, here's how I'd frame it:
- Engineering time recovered: ~$8,000/month in dev hours (we got two engineers back to product work)
- Inference cost reduction: ~$8,600/month on the same traffic
- Vendor risk reduction: