What actually happens after your app sends a prompt to an LLM?
~6 min read
You call client.messages.create(...). A few hundred ms later, tokens start streaming back.
Feels simple. Isn't. Here's the full path, broken into fast, skimmable sections.
1. Your SDK does work before anything leaves your laptop
- Serializes your messages to JSON
- Attaches headers (API key, content-type)
- Decides HTTP/1.1 vs HTTP/2
- Sets up retry/backoff logic
💡 Common Mistake: Creating a new client instance per request. You lose connection pooling and pay the full TCP + TLS setup cost every time. Reuse the client.
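Here's one way to make "reuse the client" hard to get wrong. This is a minimal sketch: `LLMClient` is a hypothetical stand-in for whatever SDK client you use, and `get_client` / `handle_request` are made-up names for illustration.

```python
from functools import lru_cache


class LLMClient:
    """Hypothetical stand-in for an SDK client that owns a connection pool."""

    instances_created = 0  # how many times the setup cost was paid

    def __init__(self) -> None:
        LLMClient.instances_created += 1


@lru_cache(maxsize=1)
def get_client() -> LLMClient:
    # Built on the first call only; later calls return the same object,
    # so its underlying connections can be reused.
    return LLMClient()


def handle_request(prompt: str) -> LLMClient:
    client = get_client()  # instead of LLMClient() on every request
    # A real handler would send `prompt` through `client` here.
    return client


first = handle_request("hello")
second = handle_request("hello again")
```

Both calls get the same object, and `LLMClient.instances_created` stays at 1 no matter how many requests you handle.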
2. DNS: finding the server
api.anthropic.com → Resolver → 203.0.113.42
Cold lookup: 20 to 120 ms. Cached: basically free. This is one reason connection reuse is a real win at scale: a reused connection skips the DNS lookup entirely.
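To make the cold vs cached difference concrete, here's a rough sketch of a TTL cache in front of a resolver. `CachingResolver`, the 60-second TTL, and `fake_resolve` are assumptions for illustration; real resolvers honor the TTL on the DNS record, and your OS and HTTP library usually do this for you.

```python
import socket
import time
from typing import Callable, Dict, Tuple


def system_resolve(host: str) -> str:
    # Real lookup through the OS resolver; returns the first address found.
    return socket.getaddrinfo(host, 443)[0][4][0]


class CachingResolver:
    def __init__(self, resolve: Callable[[str], str] = system_resolve,
                 ttl_seconds: float = 60.0) -> None:
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._cache: Dict[str, Tuple[str, float]] = {}  # host -> (address, expiry)
        self.lookups = 0  # number of cold lookups actually performed

    def lookup(self, host: str) -> str:
        now = time.monotonic()
        hit = self._cache.get(host)
        if hit is not None and hit[1] > now:
            return hit[0]  # cached and not yet expired
        self.lookups += 1
        address = self._resolve(host)
        self._cache[host] = (address, now + self._ttl)
        return address


def fake_resolve(host: str) -> str:
    # Stand-in so the example runs without touching the network.
    return "203.0.113.42"


resolver = CachingResolver(resolve=fake_resolve)
addresses = [resolver.lookup("api.anthropic.com") for _ in range(3)]
```

Three lookups, one cold resolve: `resolver.lookups` ends at 1.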
3. TLS: locking the channel
Client → TCP handshake → TLS handshake → Encrypted request
TLS 1.3 trimmed the handshake to one round trip (TLS 1.2 needed two). Still not free, especially on mobile networks with higher latency.
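Back-of-the-envelope math helps here. The sketch below counts round trips for a fresh connection: one for TCP, then one (TLS 1.3) or two (TLS 1.2) for a full handshake. The function name and the RTT values are assumptions, and it ignores session resumption and 0-RTT.

```python
def connection_setup_ms(rtt_ms: float, tls_version: str = "1.3") -> float:
    """Approximate time before the first encrypted request can be sent."""
    tcp_round_trips = 1  # SYN / SYN-ACK before any data flows
    tls_round_trips = {"1.3": 1, "1.2": 2}[tls_version]  # full handshakes only
    return (tcp_round_trips + tls_round_trips) * rtt_ms


wired = connection_setup_ms(rtt_ms=30)                      # 60 ms
mobile = connection_setup_ms(rtt_ms=150)                    # 300 ms
mobile_old = connection_setup_ms(rtt_ms=150, tls_version="1.2")  # 450 ms
```

On a 150 ms mobile link that's 300 ms gone before your prompt even leaves, which is exactly the cost a reused connection avoids.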
4. Load balancer: you're not hitting one server
Request → [Load Balancer] → Server A / B / C
Health checks, geographic routing, traffic spike absorption. This is why one dead server never becomes your problem.
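A toy version of "skip the dead server" looks like this. It's a sketch only: `RoundRobinBalancer` is a made-up class, round-robin is just one of several policies, and real balancers run health checks themselves instead of being told.

```python
from itertools import cycle
from typing import Dict, List


class RoundRobinBalancer:
    def __init__(self, servers: List[str]) -> None:
        self._servers = servers
        self._healthy: Dict[str, bool] = {s: True for s in servers}
        self._ring = cycle(servers)

    def mark(self, server: str, healthy: bool) -> None:
        # In production this would be driven by periodic health checks.
        self._healthy[server] = healthy

    def pick(self) -> str:
        # Try each server at most once per pick, skipping unhealthy ones.
        for _ in range(len(self._servers)):
            server = next(self._ring)
            if self._healthy[server]:
                return server
        raise RuntimeError("no healthy servers")


balancer = RoundRobinBalancer(["A", "B", "C"])
balancer.mark("B", healthy=False)
picks = [balancer.pick() for _ in range(4)]  # ["A", "C", "A", "C"]
```

Server B is down, and the caller never notices: traffic just alternates between A and C.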
5. API Gateway: airport security for your request
- Auth: is this API key valid, and whose account is it?
- Rate limiting: protects shared infra from noisy neighbors
- Validation: malformed JSON or bad params get rejected here, before wasting GPU time downstream
💡 Engineering Insight: Rate limits aren't there to annoy you. They keep one client from degrading service for everyone sharing that hardware.
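Rate limiting is often explained with a token bucket, so here's a minimal sketch of one. This is an assumption about the general technique, not any provider's actual limiter; the capacity and refill numbers are invented, and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start full, which allows a short burst
        self.last_refill = 0.0

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Top the bucket up for the time that has passed, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # a gateway would typically answer HTTP 429 here


bucket = TokenBucket(capacity=2, refill_per_second=1)
decisions = [bucket.allow(now=t) for t in (0.0, 0.1, 0.2, 1.5)]
# [True, True, False, True]
```

The burst of two goes through, the third request is rejected, and by 1.5 s the bucket has refilled enough to let traffic in again.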
6. Logging (async, non-blocking)
Request IDs, token counts, per-stage latency: all of it feeds debugging, abuse detection, and your invoice. Doesn't block your request.
7. Tokenization: words become numbers
"Explain quantum entanglement" → [16350, 14294, 4776, 385, 1997] (illustrative IDs; real values depend on the tokenizer)
Two things this affects directly:
- Cost: billed per token, not per character
- Context limit: "200K context" is a token budget, not a word count
💡 Real Example: Code and non-English text often burn more tokens than plain English for the same "amount" of meaning, because the tokenizer saw those patterns less often in its training data.
💡 Performance Tip: Trim repeated boilerplate and system prompts. Every token costs money and context space.
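Treating tokens as a budget is easier once it's arithmetic. A rough sketch, with loud assumptions: the per-million-token prices are made up, the function names are hypothetical, and it assumes input and output tokens share one context window.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_mtok: float,
                     output_price_per_mtok: float) -> float:
    # Prices are quoted per million tokens; input and output are priced separately.
    total = (input_tokens * input_price_per_mtok
             + output_tokens * output_price_per_mtok)
    return total / 1_000_000


def fits_context(input_tokens: int, max_output_tokens: int,
                 context_window: int = 200_000) -> bool:
    # Assumes the prompt and the reply must both fit in the same window.
    return input_tokens + max_output_tokens <= context_window


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
cost = request_cost_usd(10_000, 1_000, 3.0, 15.0)   # about $0.045
ok = fits_context(10_000, 1_000)                    # True
too_big = fits_context(199_500, 1_000)              # False
```

Check your provider's pricing page for real numbers; the shape of the calculation is the part that carries over.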
8. Model routing
A routing layer picks which model + cluster serves your request based on capacity and region. Provider-specific and mostly undocumented in detail, but this general shape is common across providers.
9. GPU scheduling: the real bottleneck
[User A][User B][User C][User D] → batched onto one GPU
GPUs can't spin up instantly like a web server. Batching multiple requests keeps them efficient. Continuous batching (slotting new requests into an in-flight batch) is why modern serving is so much faster than naive one-at-a-time processing.
💡 This is also why your latency varies call to call: you're sharing hardware.
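A tiny simulation shows why slotting requests into an in-flight batch helps. This is a toy model under heavy assumptions: each "step" produces one token for every active request, every step costs the same, and the function names and request lengths are invented. Real schedulers juggle memory and prefill too.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    # Fixed groups: each group runs until its longest request finishes.
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    # Free slots are refilled from the queue before every decode step.
    waiting = deque(lengths)
    active: List[int] = []  # tokens still to generate per active request
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        # One token per active request; requests on their last token leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps


lengths = [8, 2, 2, 2]  # output tokens needed by four requests
static_steps = static_batching_steps(lengths, slots=2)          # 10
continuous_steps = continuous_batching_steps(lengths, slots=2)  # 8
```

With static batching the short requests wait behind the long one. With continuous batching they slip into the free slot while the long request keeps running.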
10. KV Cache: the trick behind fast generation
Token 1 → compute + cache
Token 2 → reuse cache + compute new token only
Without this, every new token would mean reprocessing the whole conversation. With it, generation stays fast, but the cache grows with context length, eating GPU memory the whole time your request is active.
This is also the mechanism behind prompt caching: reusing cached state for a shared prefix (like a system prompt) across calls, cutting cost and latency.
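Both halves of that trade-off can be counted. The sketch below compares how many token positions get pushed through the model with and without a cache, then estimates cache size. The model dimensions (32 layers, 8 KV heads, head size 128, 2-byte values) are hypothetical, and the function names are made up.

```python
def passes_without_cache(prompt_tokens: int, new_tokens: int) -> int:
    # Step i re-reads the prompt plus the i tokens generated so far.
    return sum(prompt_tokens + i for i in range(new_tokens))


def passes_with_cache(prompt_tokens: int, new_tokens: int) -> int:
    if new_tokens < 1:
        return 0
    # The prompt is processed once; each later step feeds in one new token.
    return prompt_tokens + (new_tokens - 1)


def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    # Factor of 2: one key vector and one value vector per layer and head.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


slow = passes_without_cache(1_000, 100)   # 104950
fast = passes_with_cache(1_000, 100)      # 1099
per_token = kv_cache_bytes(1, layers=32, kv_heads=8, head_dim=128)  # 131072
long_chat = kv_cache_bytes(100_000, layers=32, kv_heads=8, head_dim=128)
```

Roughly 100x less work for this example, paid for with about 128 KiB of GPU memory per token of context, which comes to around 12 GiB for a 100K-token conversation under these assumed dimensions.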
11. Transformer inference (the part everyone pictures)
Per token: embed → run through N transformer layers (self-attention + feed-forward) → probability distribution over vocabulary → sample next token.
💡 Common Mistake: Higher temperature ≠ smarter model. Temperature only rescales the probabilities before sampling: higher means more random, lower means more predictable.
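You can see what temperature does with a few lines of math. A minimal sketch, assuming the common formulation where logits are divided by the temperature before softmax; the three logits are invented.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
cool = softmax_with_temperature(logits, 0.5)
warm = softmax_with_temperature(logits, 1.5)
```

At 0.5 the top token gets most of the probability mass; at 1.5 the distribution flattens out. The ranking of the tokens never changes, only how often the less likely ones get picked.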
12. Streaming: why it feels like typing
Prompt → [t1] → [t1,t2] → [t1,t2,t3] → ...
Tokens are generated one at a time (autoregressive) and streamed to you as each one is ready, usually via Server-Sent Events.
💡 Performance Tip: Always stream user-facing responses longer than a sentence. Total time is about the same, but perceived latency drops massively: the first token shows up in a few hundred ms instead of a blank screen until the full response is done.
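Server-Sent Events are just text lines over a long-lived HTTP response, so a bare-bones parser is short. This sketch handles only `data:` fields, and the sample stream is invented: real APIs usually put JSON in each event, and your SDK already does this parsing for you.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in an SSE stream."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # One optional space after the colon is not part of the payload.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and comments are ignored here.


stream = ["data: Hel\n", "\n", "data: lo\n", "\n", "data: wor\n", "data: ld\n", "\n"]
chunks = list(iter_sse_data(stream))  # ["Hel", "lo", "wor\nld"]
```

Because it's a generator, your UI can render each chunk the moment it arrives instead of waiting for the stream to close.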
13. Billing, running in parallel
Input tokens + output tokens metered (cached tokens often cheaper). Feeds your invoice and sometimes real-time quota checks back into the rate limiter from step 5.
💡 A long repeated system prompt quietly becomes a big line item unless the provider discounts the repeated prefix via caching.
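Here's that line item as arithmetic. A rough sketch with invented numbers: the price, the 90% discount on cached prefix tokens, and the simplification that the first call pays full price with no extra cache-write fee are all assumptions. Real pricing rules differ by provider.

```python
def input_cost_usd(prefix_tokens: int, unique_tokens: int, calls: int,
                   price_per_mtok: float, cached_discount: float = 0.0) -> float:
    if calls < 1:
        return 0.0
    # First call pays full price for the shared prefix.
    billable = prefix_tokens + calls * unique_tokens
    # Later calls pay a discounted rate for the same prefix.
    billable += (calls - 1) * prefix_tokens * (1.0 - cached_discount)
    return billable * price_per_mtok / 1_000_000


# 5,000-token system prompt, 200 fresh tokens per call, 1,000 calls,
# hypothetical $3 per million input tokens.
no_cache = input_cost_usd(5_000, 200, 1_000, 3.0)                          # about $15.60
with_cache = input_cost_usd(5_000, 200, 1_000, 3.0, cached_discount=0.9)   # about $2.11
```

Same prompts, same answers, and most of the input bill disappears once the repeated prefix is billed at the cached rate.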
The whole pipeline, one diagram
Your Code → SDK → DNS → TLS → Load Balancer → Gateway (auth/limit/validate)
→ Logging → Tokenization → Routing → GPU Scheduling → KV Cache
→ Inference → Generation → Streaming → (Billing, parallel) → Your Code
~15 systems, different teams, different hardware, all cooperating in well under a second.
TL;DR
- This is a distributed systems problem first, ML problem second
- Reuse connections: DNS + TLS cost adds up
- Tokens = cost + context budget, treat them as a resource
- Latency variance = GPU batching, not "harder thinking"
- KV cache = why long chats cost more server-side
- Streaming = better perceived speed, not better actual speed
Discussion: As agent chains stack tool calls on tool calls, how much of this overhead gets duplicated at every hop, and what should get collapsed into one shared layer instead?
Provider-specific details (routing, scheduling, caching) vary. The patterns above are common across large-scale LLM serving systems, not any one provider's exact internals.
References: Anthropic Docs · OpenAI Docs · Vaswani et al., "Attention Is All You Need" (2017) · Cloudflare: How TLS Works