---
title: "What Happens in 400ms — Inside an LLM API Call"
published: true
description: "The 7 stages and 14+ layers between your LLM API call and the response, with latency breakdowns per stage."
tags: architecture, api, cloud, performance
canonical_url: https://blog.mvpfactory.co/what-happens-in-400ms-inside-an-llm-api-call
---
## What You Will Learn
By the end of this walkthrough, you will understand exactly what happens in the ~400ms between firing an LLM API call and getting the response back. We will trace a request through all 7 stages — API gateway, load balancer, tokenizer, model router, inference (prefill + decode), post-processing, and billing — with real latency numbers at each stop. Once you see where the time goes, you stop wasting effort optimizing the wrong things.
## Prerequisites
- Basic familiarity with REST APIs
- A general understanding of what LLMs do (prompt in, text out)
- No ML expertise required — we are focusing on the infrastructure, not the math
## Step by Step: The 7-Stage Journey
### Stages 1–4: The Fast Path (~11ms)
The **API gateway** (~5ms) terminates TLS, authenticates your key, enforces rate limits, validates the request schema, and starts the billing clock. If you have ever hit a `429 Too Many Requests`, this is where your request died — before a GPU ever saw it.
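You cannot tune the gateway itself, but you can handle its `429`s gracefully on your side. Here is a minimal retry sketch with exponential backoff and jitter; the endpoint URL, payload shape, and headers are placeholders for whatever your provider actually expects:

```python
import random
import time

import requests

def post_with_backoff(url, payload, headers, max_retries=5):
    """POST with retries on 429, honoring Retry-After when the gateway sends it."""
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload, headers=headers, timeout=30)
        if resp.status_code != 429:
            return resp
        # Back off exponentially (1s, 2s, 4s, ...) plus jitter, unless the
        # gateway tells us exactly how long to wait.
        wait = float(resp.headers.get("Retry-After", 2 ** attempt)) + random.random()
        time.sleep(wait)
    resp.raise_for_status()  # still rate limited after all retries
```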
The **load balancer** (~2ms) routes your request using geographic proximity and least-connections algorithms while checking backend cluster health. This explains why latency varies between identical calls: your request may land on a different node each time.
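To make "least-connections" concrete, here is a toy illustration (not any provider's actual code) of picking the healthy backend with the fewest in-flight requests:

```python
# Each backend advertises how many requests it is currently serving; the load
# balancer sends new work to the least-loaded healthy node.
backends = [
    {"host": "gpu-node-a", "active": 12, "healthy": True},
    {"host": "gpu-node-b", "active": 7,  "healthy": True},
    {"host": "gpu-node-c", "active": 3,  "healthy": False},  # failed its health check
]

def pick_backend(nodes):
    healthy = [n for n in nodes if n["healthy"]]
    return min(healthy, key=lambda n: n["active"])

print(pick_backend(backends)["host"])  # -> gpu-node-b, even though node-c is emptier
```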
**Tokenization** (~3ms) converts your text into tokens using algorithms like BPE, SentencePiece, or WordPiece. The rough conversion is ~4 characters per token. This is also where the context window check happens. Exceed the limit and your request gets rejected.
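If you want a cheap client-side guard, the ~4 characters per token heuristic is enough to catch obvious context-window overruns before the request ever leaves your machine. This is an approximation, not the provider's real tokenizer, and the window size below is illustrative:

```python
CONTEXT_WINDOW = 128_000   # illustrative limit; check your model's actual window
CHARS_PER_TOKEN = 4        # the rough rule of thumb from above

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)

prompt = "Summarize the following transcript:\n" + "transcript text " * 40_000
estimate = estimate_tokens(prompt)
if estimate > CONTEXT_WINDOW:
    raise ValueError(f"~{estimate:,} tokens likely exceeds the {CONTEXT_WINDOW:,}-token window")
```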
The **model router** (~1ms) decides where your request runs. Large models go to multi-GPU clusters, smaller models to single-GPU instances, embedding requests to dedicated clusters. Queue management happens here too — if all GPUs are saturated, you wait.
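Conceptually, the router is a lookup plus a queue check. A toy sketch, with made-up model names and pool sizes; real routers also weigh priority, region, and hardware availability:

```python
POOLS = {
    "big-chat-model":   {"cluster": "multi-GPU",      "queued": 12, "max_queue": 64},
    "small-chat-model": {"cluster": "single-GPU",     "queued": 3,  "max_queue": 256},
    "embedding-model":  {"cluster": "embedding-only", "queued": 0,  "max_queue": 512},
}

def route(model: str) -> str:
    pool = POOLS[model]
    if pool["queued"] >= pool["max_queue"]:
        raise RuntimeError("all GPUs saturated; request would sit in the queue")
    return pool["cluster"]

print(route("small-chat-model"))  # -> single-GPU
```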
### Stage 5: Inference (~300–800ms, ~95% of Total)
Here is where the real work happens. Inference runs in two distinct phases:
**Prefill phase:** Your entire input is processed in parallel. The model computes query-key (QK) attention scores across all input tokens and generates the KV cache — a stored representation of your prompt that avoids redundant computation during generation.
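A toy single-head attention pass in NumPy makes this concrete: process every prompt token at once, compute the QK scores under a causal mask, and keep K and V around as the cache. The shapes and random weights are illustrative; real models stack many layers and heads:

```python
import numpy as np

d, prompt_len = 64, 10
rng = np.random.default_rng(0)

x = rng.normal(size=(prompt_len, d))          # embedded prompt tokens
Wq = rng.normal(size=(d, d)) / np.sqrt(d)     # toy projection weights,
Wk = rng.normal(size=(d, d)) / np.sqrt(d)     # scaled to keep scores small
Wv = rng.normal(size=(d, d)) / np.sqrt(d)

# Prefill: every prompt token is processed in a single parallel pass.
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = (Q @ K.T) / np.sqrt(d)               # QK attention scores
causal = np.triu(np.full((prompt_len, prompt_len), -np.inf), k=1)
weights = np.exp(scores + causal)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over earlier tokens only
context = weights @ V                           # per-token context vectors

kv_cache = (K, V)  # kept on the accelerator so decode never re-processes the prompt
```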
**Decode phase:** This part is sequential: one token per forward pass. Each step reuses the KV cache from prefill, applies temperature and top-p sampling to select the next token, and (if streaming is enabled) sends each token to you immediately. This is why streaming feels faster — you see tokens as they are generated rather than waiting for the full response.
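The interesting per-step work in decode is the sampling. Here is a sketch of temperature scaling followed by nucleus (top-p) sampling over a single logits vector; the random logits stand in for what the model emits at each step:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=np.random.default_rng()):
    # Temperature: sharpen or flatten the distribution, then a stabilized softmax.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Top-p (nucleus): keep the smallest set of tokens covering top_p probability mass.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

fake_logits = np.random.default_rng(1).normal(size=32_000)  # stand-in for one decode step
print(sample_next_token(fake_logits))
```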
The hardware layer matters more than most people realize. GPUs are typically A100, H100, or H200 with 80GB+ HBM. Tensor parallelism splits a single model across multiple GPUs. Multiple requests get batched together to maximize utilization. Flash Attention reduces memory overhead; Grouped-Query Attention (GQA) cuts KV cache size. GPU compute runs about $2–3/hr per card.
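A quick back-of-the-envelope calculation shows why GQA matters so much for the KV cache. The layer and head counts below are illustrative, not any specific model's configuration:

```python
layers, head_dim, seq_len = 80, 128, 8_192
bytes_per_value = 2                      # fp16 / bf16

def kv_cache_bytes(kv_heads: int) -> int:
    # K and V (the factor of 2), one (seq_len, head_dim) slice per KV head, per layer.
    return 2 * layers * kv_heads * seq_len * head_dim * bytes_per_value

mha = kv_cache_bytes(kv_heads=64)        # classic multi-head attention: 64 KV heads
gqa = kv_cache_bytes(kv_heads=8)         # GQA: 8 shared KV heads

print(f"MHA: {mha / 2**30:.1f} GiB/request   GQA: {gqa / 2**30:.1f} GiB/request")
# -> MHA: 20.0 GiB/request   GQA: 2.5 GiB/request (with these illustrative numbers)
```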
### Stages 6–7: The Exit Path (~6ms)
**Post-processing** (~5ms) detokenizes the output back into text, runs a safety classifier, checks for stop sequences, and packages everything into JSON. **Billing** (<1ms) calculates your final cost — output tokens cost 3–5x more than input tokens because each one requires a full forward pass.
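The billing math itself is trivial, which is exactly why it fits in under a millisecond. A sketch with hypothetical per-token prices (substitute your provider's real rates):

```python
input_tokens, output_tokens = 3_200, 400
price_in, price_out = 3.00, 15.00   # USD per million tokens, hypothetical rates

cost = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
print(f"${cost:.4f}")  # $0.0156: 400 output tokens cost nearly as much as 3,200 input tokens
```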
## Where to Actually Optimize
Here is where the latency and cost numbers say your effort actually pays off:
| Optimization target | Action |
| --- | --- |
| Reduce input tokens | Shorter prompts, prompt caching |
| Reduce output tokens | Constrained output, `max_tokens` limits |
| Reduce latency | Streaming, smaller models, geographic routing |
| Reduce cost | Cache prefixes, batch requests, right-size models |
## Gotchas
**"I will optimize my gateway config for speed."** Do not bother. Stages 1–4 account for ~2.6% of total latency. Cutting 200 tokens from your system prompt or enabling prompt caching will do more than any amount of gateway tuning.
**"Input and output tokens cost the same."** They do not. Output tokens cost 3–5x more because each requires a full sequential forward pass. Controlling output length with `max_tokens`, structured output schemas, and precise instructions has the most direct impact on your bill.
**"I will wait for the full response before rendering."** The docs do not mention this, but the decode phase generates tokens one at a time. Streaming delivers each token as it is produced, making a 600ms response feel near-instant. If you are not streaming, you are making users stare at a spinner for no reason.
**"Latency is consistent across calls."** It is not. The load balancer routes to different nodes, GPU queues fluctuate, and batching behavior changes with traffic. Design for variance.
## Wrapping Up
~95% of latency lives in the prefill/decode cycle. Output tokens are your biggest cost lever. Streaming transforms perceived performance. Once you internalize this pipeline, every architectural decision — prompt design, model selection, caching strategy — gets sharper. You stop guessing and start targeting the stage that actually matters.