You write something simple like this:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    input="Explain backpressure in simple terms",
)
```
A few hundred milliseconds later, text begins streaming back.
It feels instant.
It feels simple.
But that single API call triggers a surprisingly complex distributed system involving:
- Global traffic routing
- Authentication and token-based quota enforcement
- Multi-tenant scheduling
- GPU memory management
- Continuous batching
- Autoregressive token decoding
- Streaming transport over persistent connections
An LLM API is not just “a model running on a server.”
It is a real-time scheduling and resource allocation system built on top of extremely expensive hardware.
Under the hood, your request is competing with thousands of others for:
- GPU compute
- GPU memory
- Context window capacity
- Batch slots
- Network bandwidth
Understanding this pipeline changes how you think about:
- Latency
- Rate limiting
- Prompt size
- Streaming
- Retries
- System reliability
In this article, we’ll walk through exactly what happens — step by step — from the moment your request hits the edge of the network to the moment tokens stream back to your client.
No hype.
No marketing language.
Just the infrastructure.
The Big Picture
Before diving into details, here’s the high-level flow of a typical LLM API request:
- Your request hits a global edge endpoint
- It passes authentication and quota checks
- It enters an inference queue
- A scheduler batches it with other requests
- The model performs a prefill pass over your prompt
- The model generates tokens one-by-one (decode phase)
- Tokens stream back over a persistent connection
- Resources are cleaned up and metrics are recorded
Each of these steps exists for a reason.
Each introduces tradeoffs.
And each can become a bottleneck under load.
Let’s break them down.
LLM API Request Lifecycle
Client
↓
Edge / Load Balancer
↓
API Gateway
↓
Auth & Quota
↓
Request Queue
↓
Scheduler
↓
GPU Worker
↓
Streaming Response
↓
Client
What each stage does:
- Edge → region routing
- Auth & Quota → token-based limits
- Queue → backpressure control
- Scheduler → continuous batching
- GPU Worker → prefill + decode
- Streaming → token-by-token output
Keep this diagram in mind as a mental map; we'll revisit each box below.
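To make the flow concrete, here is a toy sketch of those stages in Python. Everything in it — the tenant quota table, the queue capacity, the word-splitting "decoder" — is invented for illustration; it is not how any real inference stack is implemented.

```python
# Toy sketch of the request lifecycle: auth -> queue -> batch -> decode -> stream.
from collections import deque

QUOTA = {"team-a": 2}     # hypothetical per-tenant request budget
QUEUE_CAPACITY = 100      # hard cap: beyond this, reject instead of queueing
queue = deque()

def authenticate(request):
    # Auth & quota: reject unknown tenants or exhausted budgets.
    tenant = request["tenant"]
    if QUOTA.get(tenant, 0) <= 0:
        raise PermissionError("quota exhausted")
    QUOTA[tenant] -= 1

def enqueue(request):
    # Backpressure: a full queue becomes a fast 429, not unbounded latency.
    if len(queue) >= QUEUE_CAPACITY:
        raise RuntimeError("429: queue full")
    queue.append(request)

def decode(request):
    # Decode phase stand-in: emit "tokens" one at a time (here, just words).
    for token in request["prompt"].split():
        yield token

def handle(request):
    authenticate(request)
    enqueue(request)
    batch = [queue.popleft()]       # a real scheduler batches many requests
    for req in batch:
        yield from decode(req)      # stream tokens back as they are produced

tokens = list(handle({"tenant": "team-a", "prompt": "hello streaming world"}))
print(tokens)  # ['hello', 'streaming', 'world']
```

The point of the sketch is the ordering: quota is spent before any GPU work happens, and the queue — not the model — is where a request first waits.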
Why Latency Explodes Under Load
Key takeaway:
Arrival rate > processing rate → queue grows → latency explodes
That one inequality is queueing behavior in a nutshell: once arrivals outpace service, waiting time grows without bound.
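A five-line simulation shows the effect. The rates are made up; the only thing that matters is which one is bigger.

```python
# Queue depth over time: arrivals join, the "GPU" drains what it can each tick.
def simulate(arrival_rate, service_rate, ticks):
    depth, history = 0, []
    for _ in range(ticks):
        depth += arrival_rate                  # new requests join the queue
        depth = max(0, depth - service_rate)   # server drains its capacity
        history.append(depth)
    return history

stable = simulate(arrival_rate=8, service_rate=10, ticks=5)
overloaded = simulate(arrival_rate=12, service_rate=10, ticks=5)
print(stable)      # [0, 0, 0, 0, 0]  -- queue never forms
print(overloaded)  # [2, 4, 6, 8, 10] -- queue (and wait time) grows every tick
```

Below capacity, the queue stays empty no matter how long you run it. Just 20% over capacity, the queue — and therefore latency — grows linearly forever.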
Naive Batching vs Continuous Batching
Naive batching:
Time →
[ Batch 1 ] idle [ Batch 2 ] idle [ Batch 3 ]
- Fixed batch boundaries
- Idle GPU time
- Poor utilization
Continuous batching:
Time →
A ─────────▶
  B ──────▶
    C ────▶ E ──▶
      D ──▶ F ────▶
(requests join and finish independently; all decode together)
- Requests join dynamically
- GPU stays busy
- Higher throughput
This is why modern inference systems behave so differently from naive batch-at-a-time servers.
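A small simulation makes the throughput difference concrete. Each job is a request needing some number of decode steps, and the GPU runs up to `slots` requests per step; both scheduling policies are simplified toys, not real scheduler code.

```python
# Compare total decode steps under the two batching policies.
def naive_batching(jobs, slots):
    # Fixed batches: the whole batch waits for its slowest member,
    # and the GPU idles in the slots that finished early.
    steps = 0
    for i in range(0, len(jobs), slots):
        steps += max(jobs[i:i + slots])
    return steps

def continuous_batching(jobs, slots):
    # A new request joins the moment any slot frees up.
    pending, active, steps = list(jobs), [], 0
    while pending or active:
        while pending and len(active) < slots:
            active.append(pending.pop(0))
        steps += 1
        active = [j - 1 for j in active if j > 1]  # drop finished requests
    return steps

jobs = [5, 1, 1, 1, 1, 1]   # one long request, five short ones
print(naive_batching(jobs, slots=2))       # 7: short jobs wait behind the long one
print(continuous_batching(jobs, slots=2))  # 5: short jobs slot in as others finish
```

Same work, same hardware — the only difference is when requests are allowed to join, and that alone changes total runtime.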
Two Phases of LLM Inference
Prefill phase:
- Entire prompt processed
- KV cache created
- High GPU memory usage
- Expensive but parallel
Decode phase:
- One token at a time
- KV cache reused
- Lower per-step compute
- Enables streaming
Visual flow:
Prompt tokens → KV Cache
KV Cache → Token 1 → Token 2 → Token 3 → ...
This split is exactly what makes token-by-token streaming possible: once prefill has built the cache, each subsequent token is a cheap incremental step.
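In code, the two phases look roughly like this. The "KV cache" here is a plain list holding one entry per token — a real cache stores per-layer key/value tensors, and a real model would attend over the cache rather than fabricate tokens.

```python
# Toy prefill/decode split with a stand-in KV cache.
def prefill(prompt_tokens):
    # Prefill: process the whole prompt in parallel, one cache entry per token.
    return list(prompt_tokens)

def decode_step(kv_cache):
    # Decode: generate one token using the cache, then extend the cache with it.
    token = f"tok{len(kv_cache)}"   # fabricated token for illustration
    kv_cache.append(token)
    return token

cache = prefill(["Explain", "backpressure"])
generated = [decode_step(cache) for _ in range(3)]
print(generated)    # ['tok2', 'tok3', 'tok4']
print(len(cache))   # 5 = 2 prompt tokens + 3 generated tokens
```

Note that the cache only ever grows: this is why long prompts and long outputs both consume GPU memory for the entire lifetime of a request.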
Token Streaming Lifecycle
Request start
↓
Prefill
↓
Decode token 1 → sent
Decode token 2 → sent
Decode token 3 → sent
↓
Client disconnect?
├─ Yes → cancel → cleanup resources
└─ No → continue decoding
This highlights an often-overlooked detail:
cancellation must propagate through the system to avoid wasted GPU work.
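Python generators give a compact way to sketch this propagation: closing the stream (the disconnect) raises inside the decode loop, and a `finally` block plays the role of the server-side resource cleanup. The request id and token format are invented for the example.

```python
# Cancellation sketch: closing the token stream triggers resource cleanup.
cleaned_up = []

def decode_stream(request_id, max_tokens):
    try:
        for i in range(max_tokens):
            yield f"token-{i}"
    finally:
        # Runs on normal completion AND on early cancellation:
        # free the KV cache and batch slot for other requests.
        cleaned_up.append(request_id)

stream = decode_stream("req-42", max_tokens=100)
received = [next(stream) for _ in range(3)]
stream.close()                  # client disconnected after 3 tokens

print(received)    # ['token-0', 'token-1', 'token-2']
print(cleaned_up)  # ['req-42'] -- no further decode work is wasted
```

If the cleanup path is missing, a disconnected client's request keeps decoding — and keeps holding GPU memory — until it hits its token limit.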
Next, we’ll dive into what happens when your request enters the inference queue — and why that queue is where most latency problems begin.