Harshvardhan Singh

Posted on Jul 5

The Hidden Cost of Every LLM API Call

#ai #backend #webdev #architecture

What actually happens after your app sends a prompt to an LLM?

~6 min read

You call client.messages.create(...). A few hundred ms later, tokens start streaming back.

Feels simple. Isn't. Here's the full path, broken into fast, skimmable sections.

1. Your SDK does work before anything leaves your laptop

Serializes your messages to JSON
Attaches headers (API key, content-type)
Decides HTTP/1.1 vs HTTP/2
Sets up retry/backoff logic
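
To make the retry/backoff item concrete, here is a minimal sketch of exponential backoff with jitter. It is not any SDK's actual implementation: the function name `call_with_retries`, its parameters, and the choice to retry only on `ConnectionError` are assumptions for illustration. Real SDKs typically also retry on specific HTTP status codes.

```python
import random
import time


def call_with_retries(send, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call send() and retry on ConnectionError with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay cap doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": sleep a random amount up to the cap so that
            # many clients do not all retry at the same instant.
            sleep(random.uniform(0, cap))
```

The `sleep` parameter is injectable only so the function can be exercised without actually waiting.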

💡 Common Mistake: Making a new client instance per request. You lose connection pooling and pay full TCP + TLS setup cost every time. Reuse the client.
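
One way to avoid the per-request client is to build it once and hand out the same object everywhere. The sketch below uses a hypothetical `LLMClient` class as a stand-in for a real SDK client, so it runs without any SDK installed; `get_client` and `handle_request` are illustrative names too.

```python
from functools import lru_cache


class LLMClient:
    """Hypothetical stand-in for an SDK client that owns a connection pool."""

    instances_created = 0

    def __init__(self):
        # In a real SDK, this is where the HTTP connection pool would be set up.
        LLMClient.instances_created += 1


@lru_cache(maxsize=None)
def get_client() -> LLMClient:
    """Build the client once per process, then return the same object every time."""
    return LLMClient()


def handle_request(prompt: str) -> LLMClient:
    # Each request handler asks for the shared client instead of constructing one.
    client = get_client()
    return client
```

However many requests come in, `LLMClient.instances_created` stays at 1, so the pooled connections behind the client get reused.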

2. DNS: finding the server

api.anthropic.com → Resolver → 203.0.113.42

Cold lookup: roughly 20–120ms. Cached: basically free. A reused connection skips the lookup entirely, which is one reason connection reuse is a real win at scale.
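
The cold-versus-cached difference can be sketched as a small TTL cache in front of a resolver. This is an illustration, not how your OS or HTTP library actually does it: `DnsCache`, the fixed 60-second TTL, and the injectable `resolve` function are all assumptions (real resolvers honor the TTL carried by each DNS record).

```python
import time


class DnsCache:
    """Remember lookups for `ttl` seconds so repeat calls skip the resolver."""

    def __init__(self, resolve, ttl=60.0, clock=time.monotonic):
        self.resolve = resolve  # function: hostname -> IP address string
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # hostname -> (ip, expiry time)

    def lookup(self, host):
        entry = self._entries.get(host)
        now = self.clock()
        if entry and entry[1] > now:
            return entry[0]  # cache hit: no network round trip
        ip = self.resolve(host)  # cache miss: pay the cold-lookup cost
        self._entries[host] = (ip, now + self.ttl)
        return ip
```

In practice you rarely write this yourself; keeping one client (and its open connections) alive gets you the same effect.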

3. TLS: locking the channel

Client → TCP handshake → TLS handshake → Encrypted request

TLS 1.3 trimmed the TLS handshake to 1 round trip (TLS 1.2 needed 2), on top of the round trip for the TCP handshake. Still not free, especially on mobile networks with higher latency.
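
A rough back-of-the-envelope for that setup cost, assuming one round trip for TCP plus the TLS handshake's round trips, and ignoring DNS and server processing time. The function name and the example RTT values are illustrative.

```python
def connection_setup_ms(rtt_ms: float, tls_round_trips: int = 1) -> float:
    """Estimated setup time before the first request byte can be sent.

    One round trip for the TCP handshake, plus the TLS handshake's round trips
    (1 for a full TLS 1.3 handshake, 2 for a full TLS 1.2 handshake).
    """
    tcp_round_trips = 1
    return rtt_ms * (tcp_round_trips + tls_round_trips)
```

With a 30ms round trip this gives 60ms of setup on TLS 1.3; with a 150ms mobile round trip it gives 300ms, all of it paid again whenever a new connection is opened.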

4. Load balancer: you're not hitting one server

Request → [Load Balancer] → Server A / B / C

The load balancer runs health checks, routes you to a nearby region, and absorbs traffic spikes. This is why one dead server rarely becomes your problem.
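
The health-check part can be sketched as a toy round-robin balancer that skips servers marked unhealthy. Production balancers are far more involved (weights, regions, connection draining); `RoundRobinBalancer` and its methods are hypothetical names for illustration.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy balancer: rotate through servers, skipping ones marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once per pick so we never loop forever.
        for _ in range(len(self.servers)):
            server = next(self._rotation)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")
```

After `mark_down("B")`, requests simply alternate between the remaining servers, and the caller never sees the failure.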

5. API Gateway: airport security for your request

Auth — is this API key valid, whose account is it?
Rate limiting — protects shared infra from noisy neighbors
Validation — malformed JSON or bad params get rejected here, before wasting GPU time downstream

💡 Engineering Insight: Rate limits aren't there to annoy you — they keep one client from degrading service for everyone sharing that hardware.
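
A token bucket is one common way to implement this kind of rate limiting; whether a given provider uses it is an assumption. In the sketch below, the class name, the parameters, and the word "credits" (used to avoid confusion with LLM tokens) are all illustrative.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.credits = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Add the credits earned since the last call, capped at capacity.
        self.credits = min(self.capacity, self.credits + (now - self.last) * self.rate)
        self.last = now
        if self.credits >= 1:
            self.credits -= 1
            return True
        return False  # a gateway would typically answer HTTP 429 here
```

One bucket per API key means a noisy client drains only its own credits, not everyone else's.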

6. Logging (async, non-blocking)

Request IDs, token counts, per-stage latency — feeds debugging, abuse detection, and your invoice. Doesn't block your request.

7. Tokenization: words become numbers

"Explain quantum entanglement" → [16350, 14294, 4776, 385, 1997] (illustrative IDs; the actual values depend on the model's tokenizer)

Two things this affects directly:

💰 Cost — billed per token, not per character
📏 Context limit — "200K context" = token budget, not word count

💡 Real Example: Code and non-English text often burn more tokens than plain English for the same "amount" of meaning, typically because those patterns were rarer in the data the tokenizer was trained on, so they get split into more, smaller pieces.

💡 Performance Tip: Trim repeated boilerplate/system prompts. Every token costs money and context space.
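
To see why unfamiliar text costs more tokens, here is a toy greedy longest-match tokenizer. It is not how production tokenizers work (those usually use learned byte-pair-encoding merges), and the four-entry `VOCAB` is made up for the example.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of entries.
VOCAB = {"explain": 0, " quantum": 1, " entangle": 2, "ment": 3}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split; unknown characters become one token each."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            pieces.append(text[i])  # fallback: a single character
            i += 1
    return pieces
```

With this vocabulary, "explain quantum entanglement" (28 characters) splits into 4 tokens, while the German word "erkläre" (7 characters) falls back to 7 single-character tokens because none of its pieces are in the vocabulary.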

8. Model routing

A routing layer picks which model + cluster serves your request based on capacity and region. Provider-specific, mostly undocumented in detail — but this general shape is common everywhere.

9. GPU scheduling: the real bottleneck

[User A][User B][User C][User D] → batched onto one GPU

GPU capacity can't be scaled up instantly the way a stateless web server can, so providers keep the GPUs they have busy by batching multiple requests together. Continuous batching (slotting new requests into an in-flight batch as others finish) is why modern serving gets so much more throughput than naive one-at-a-time processing.

💡 This is also why your latency varies call to call — you're sharing hardware.
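
A tiny simulation shows the difference between waiting for a whole batch to finish and refilling slots as they free up. It is a deliberately simplified model: each request needs a fixed, positive number of decode steps, one step produces one token for every active request, and the function names are illustrative.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes; then the next batch starts."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """Refill a slot from the queue as soon as its request finishes."""
    queue = deque(lengths)
    active = []  # remaining decode steps for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1  # one decode step produces one token for every active request
        # Requests with one step left have just finished; drop them, decrement the rest.
        active = [r - 1 for r in active if r > 1]
    return steps
```

For requests needing `[8, 2, 2, 2]` steps on 2 slots, static batching takes 10 steps while continuous batching takes 8, because the short requests slide into the slot next to the long one instead of waiting for it.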

10. KV Cache: the trick behind fast generation

Token 1 → compute + cache
Token 2 → reuse cache + compute new token only

Without this, every new token would mean reprocessing the whole conversation. With it, generation stays fast — but the cache grows with context length, eating GPU memory the whole time your request is active.

This is also the mechanism behind prompt caching — reusing cached state for a shared prefix (like a system prompt) across calls, cutting cost + latency.
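
Two pieces of arithmetic make the trade-off visible: how much work the cache saves, and how much memory it costs. Both functions are simplified sketches. The first counts token positions pushed through the model; the second uses the usual keys-plus-values size estimate. The model dimensions in the example afterwards are hypothetical, not any real model's.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions pushed through the model while generating."""
    if new_tokens == 0:
        return 0
    if use_cache:
        # The prompt is processed once; each later step handles only the newest token.
        return prompt_len + (new_tokens - 1)
    # No cache: every step reprocesses the prompt plus everything generated so far.
    return sum(prompt_len + i for i in range(new_tokens))


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (the factor of 2) stored per layer, per head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
```

For a 1,000-token prompt and 100 generated tokens, that is 1,099 positions with the cache versus 104,950 without. On the memory side, a hypothetical model with 32 layers, 8 KV heads of dimension 128, and 16-bit values needs 128 KiB per token, so an 8,192-token context holds 1 GiB of cache for a single request.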

11. Transformer inference (the part everyone pictures)

Per token: embed → run through N transformer layers (self-attention + feed-forward) → probability distribution over vocabulary → sample next token.

💡 Common Mistake: Higher temperature ≠ smarter model. It just changes sampling randomness.
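
The last step (probability distribution, then sample) can be sketched in plain Python. The logits in the example afterwards are made-up scores for a three-token vocabulary; a real model emits one score per vocabulary entry, and providers may layer further sampling controls on top.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales the scores first."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (treat 0 as greedy argmax instead)")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw one token index according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

For logits `[2.0, 1.0, 0.0]`, a temperature of 0.5 concentrates more probability on the top token than a temperature of 2.0 does. The scores themselves never change, only how adventurous the sampling is.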

12. Streaming: why it feels like typing

Prompt → [t1] → [t1,t2] → [t1,t2,t3] → ...

Tokens generate one at a time (autoregressive) and get streamed to you as each one is ready — usually via Server-Sent Events.

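💡 Performance Tip: Always stream user-facing responses longer than a sentence. Total generation time is about the same, but perceived latency drops massively: the first token shows up within a few hundred ms instead of the user staring at a blank screen until the whole response is done.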
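
On the wire, Server-Sent Events are plain text: `data:` lines grouped into events that end with a blank line. Below is a minimal parser sketch for just the `data:` field, fed from any iterable of lines. The function name is illustrative, and what each payload contains (usually JSON) is provider-specific, so it is left unparsed here.

```python
def iter_sse_data(lines):
    """Yield the data payload of each event from an iterable of SSE text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # One leading space after the colon is stripped, if present.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and ":" comment lines are ignored here.
    # An event that is never terminated by a blank line is discarded.
```

Because it is a generator, each payload can be rendered the moment its event arrives, which is what produces the typing effect.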

13. Billing, running in parallel

Input and output tokens are metered separately (cached input tokens are often cheaper). The counts feed your invoice and, sometimes, real-time quota checks in the rate limiter from step 5.

💡 A long repeated system prompt quietly becomes a big line item unless the provider discounts the repeated prefix via caching.
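
A quick estimate shows how much a cached prefix can matter. Every number here is an assumption for illustration: the per-million-token prices, the 90% discount on cached input, and the function name. Real price lists differ and may also charge extra to write to the cache.

```python
def estimate_cost_usd(input_tokens, output_tokens, cached_input_tokens=0,
                      input_price=3.00, output_price=15.00, cached_discount=0.9):
    """Estimate one call's cost. Prices are hypothetical USD per million tokens."""
    if cached_input_tokens > input_tokens:
        raise ValueError("cached tokens cannot exceed input tokens")
    fresh = input_tokens - cached_input_tokens
    cached_price = input_price * (1 - cached_discount)
    total = (fresh * input_price
             + cached_input_tokens * cached_price
             + output_tokens * output_price)
    return total / 1_000_000
```

With a 2,000-token system prompt, 200 tokens of user input, and 300 output tokens, the call costs about $0.0111 uncached and about $0.0057 when the system prompt is served from cache. That gap is multiplied by every call you make.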

The whole pipeline, one diagram

Your Code → SDK → DNS → TLS → Load Balancer → Gateway (auth/limit/validate)
   → Logging → Tokenization → Routing → GPU Scheduling → KV Cache
   → Inference → Generation → Streaming → (Billing, parallel) → Your Code

~15 systems, different teams, different hardware, all cooperating to get your first token back, often in under a second.

TL;DR

This is a distributed systems problem first, ML problem second
Reuse connections — DNS + TLS cost adds up
Tokens = cost + context budget, treat them as a resource
Latency variance = mostly shared-GPU batching and queueing, not "harder thinking"
KV cache = why long chats cost more server-side
Streaming = better perceived speed, not better actual speed

Discussion: As agent chains stack tool calls on tool calls, how much of this overhead gets duplicated at every hop — and what should get collapsed into one shared layer instead?

Provider-specific details (routing, scheduling, caching) vary — the patterns above are common across large-scale LLM serving systems, not any one provider's exact internals.

References: Anthropic Docs · OpenAI Docs · Vaswani et al., "Attention Is All You Need" (2017) · Cloudflare: How TLS Works

DEV Community