DEV Community

Cover image for The Hidden Cost of Every LLM API Call
Harshvardhan Singh
Harshvardhan Singh

Posted on

The Hidden Cost of Every LLM API Call

What actually happens after your app sends a prompt to an LLM?

~6 min read

You call client.messages.create(...). A few hundred ms later, tokens start streaming back.

Feels simple. Isn't. Here's the full path, broken into fast, skimmable sections.


1. Your SDK does work before anything leaves your laptop

  • Serializes your messages to JSON
  • Attaches headers (API key, content-type)
  • Decides HTTP/1.1 vs HTTP/2
  • Sets up retry/backoff logic

๐Ÿ’ก Common Mistake: Making a new client instance per request. You lose connection pooling and pay full TCP + TLS setup cost every time. Reuse the client.


2. DNS: finding the server

api.anthropic.com โ†’ Resolver โ†’ 203.0.113.42
Enter fullscreen mode Exit fullscreen mode

Cold lookup: 20โ€“120ms. Cached: basically free. This is why connection reuse (skip re-resolving DNS on every call) is a real win at scale.


3. TLS: locking the channel

Client โ†’ TCP handshake โ†’ TLS handshake โ†’ Encrypted request โ†’
Enter fullscreen mode Exit fullscreen mode

TLS 1.3 trimmed this to ~1 round trip. Still not free โ€” especially on mobile networks with higher latency.


4. Load balancer: you're not hitting one server

Request โ†’ [Load Balancer] โ†’ Server A / B / C
Enter fullscreen mode Exit fullscreen mode

Health checks, geographic routing, traffic spike absorption. This is why one dead server never becomes your problem.


5. API Gateway: airport security for your request

  • Auth โ€” is this API key valid, whose account is it?
  • Rate limiting โ€” protects shared infra from noisy neighbors
  • Validation โ€” malformed JSON or bad params get rejected here, before wasting GPU time downstream

๐Ÿ’ก Engineering Insight: Rate limits aren't there to annoy you โ€” they keep one client from degrading service for everyone sharing that hardware.


6. Logging (async, non-blocking)

Request IDs, token counts, per-stage latency โ€” feeds debugging, abuse detection, and your invoice. Doesn't block your request.


7. Tokenization: words become numbers

"Explain quantum entanglement" โ†’ [16350, 14294, 4776, 385, 1997]
Enter fullscreen mode Exit fullscreen mode

Two things this affects directly:

  • ๐Ÿ’ฐ Cost โ€” billed per token, not per character
  • ๐Ÿ“ Context limit โ€” "200K context" = token budget, not word count

๐Ÿ’ก Real Example: Code and non-English text often burn more tokens than plain English for the same "amount" of meaning โ€” the tokenizer saw those patterns less during training.

๐Ÿ’ก Performance Tip: Trim repeated boilerplate/system prompts. Every token costs money and context space.


8. Model routing

A routing layer picks which model + cluster serves your request based on capacity and region. Provider-specific, mostly undocumented in detail โ€” but this general shape is common everywhere.


9. GPU scheduling: the real bottleneck

[User A][User B][User C][User D] โ†’ batched onto one GPU
Enter fullscreen mode Exit fullscreen mode

GPUs can't spin up instantly like a web server. Batching multiple requests keeps them efficient. Continuous batching (slotting new requests into an in-flight batch) is why modern serving is so much faster than naive one-at-a-time processing.

๐Ÿ’ก This is also why your latency varies call to call โ€” you're sharing hardware.


10. KV Cache: the trick behind fast generation

Token 1 โ†’ compute + cache
Token 2 โ†’ reuse cache + compute new token only
Enter fullscreen mode Exit fullscreen mode

Without this, every new token would mean reprocessing the whole conversation. With it, generation stays fast โ€” but the cache grows with context length, eating GPU memory the whole time your request is active.

This is also the mechanism behind prompt caching โ€” reusing cached state for a shared prefix (like a system prompt) across calls, cutting cost + latency.


11. Transformer inference (the part everyone pictures)

Per token: embed โ†’ run through N transformer layers (self-attention + feed-forward) โ†’ probability distribution over vocabulary โ†’ sample next token.

๐Ÿ’ก Common Mistake: Higher temperature โ‰  smarter model. It just changes sampling randomness.


12. Streaming: why it feels like typing

Prompt โ†’ [t1] โ†’ [t1,t2] โ†’ [t1,t2,t3] โ†’ ...
Enter fullscreen mode Exit fullscreen mode

Tokens generate one at a time (autoregressive) and get streamed to you as each one is ready โ€” usually via Server-Sent Events.

๐Ÿ’ก Performance Tip: Always stream user-facing responses longer than a sentence. Total time is the same, but perceived latency drops massively โ€” first token in ms instead of a blank screen.


13. Billing, running in parallel

Input tokens + output tokens metered (cached tokens often cheaper). Feeds your invoice and sometimes real-time quota checks back into the rate limiter from step 5.

๐Ÿ’ก A long repeated system prompt quietly becomes a big line item unless the provider discounts the repeated prefix via caching.


The whole pipeline, one diagram

Your Code โ†’ SDK โ†’ DNS โ†’ TLS โ†’ Load Balancer โ†’ Gateway (auth/limit/validate)
   โ†’ Logging โ†’ Tokenization โ†’ Routing โ†’ GPU Scheduling โ†’ KV Cache
   โ†’ Inference โ†’ Generation โ†’ Streaming โ†’ (Billing, parallel) โ†’ Your Code
Enter fullscreen mode Exit fullscreen mode

~15 systems, different teams, different hardware โ€” cooperating in well under a second.


TL;DR

  • This is a distributed systems problem first, ML problem second
  • Reuse connections โ€” DNS + TLS cost adds up
  • Tokens = cost + context budget, treat them as a resource
  • Latency variance = GPU batching, not "harder thinking"
  • KV cache = why long chats cost more server-side
  • Streaming = better perceived speed, not better actual speed

Discussion: As agent chains stack tool calls on tool calls, how much of this overhead gets duplicated at every hop โ€” and what should get collapsed into one shared layer instead?


Provider-specific details (routing, scheduling, caching) vary โ€” the patterns above are common across large-scale LLM serving systems, not any one provider's exact internals.

References: Anthropic Docs ยท OpenAI Docs ยท Vaswani et al., "Attention Is All You Need" (2017) ยท Cloudflare: How TLS Works

Top comments (0)