You write something simple like this:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    input="Explain backpressure in simple terms",
)
```
A few hundred milliseconds later, text begins streaming back.
It feels instant.
It feels simple.
But that single API call triggers a surprisingly complex distributed system involving:
- Global traffic routing
- Authentication and token-based quota enforcement
- Multi-tenant scheduling
- GPU memory management
- Continuous batching
- Autoregressive token decoding
- Streaming transport over persistent connections
An LLM API is not just “a model running on a server.”
It is a real-time scheduling and resource allocation system built on top of extremely expensive hardware.
Under the hood, your request is competing with thousands of others for:
- GPU compute
- GPU memory
- Context window capacity
- Batch slots
- Network bandwidth
Understanding this pipeline changes how you think about:
- Latency
- Rate limiting
- Prompt size
- Streaming
- Retries
- System reliability
In this article, we’ll walk through exactly what happens — step by step — from the moment your request hits the edge of the network to the moment tokens stream back to your client.
No hype.
No marketing language.
Just the infrastructure.
The Big Picture
Before diving into details, here’s the high-level flow of a typical LLM API request:
- Your request hits a global edge endpoint
- It passes authentication and quota checks
- It enters an inference queue
- A scheduler batches it with other requests
- The model performs a prefill pass over your prompt
- The model generates tokens one-by-one (decode phase)
- Tokens stream back over a persistent connection
- Resources are cleaned up and metrics are recorded
Each of these steps exists for a reason.
Each introduces tradeoffs.
And each can become a bottleneck under load.
Let’s break them down.
LLM API Request Lifecycle
Client
↓
Edge / Load Balancer
↓
API Gateway
↓
Auth & Quota
↓
Request Queue
↓
Scheduler
↓
GPU Worker
↓
Streaming Response
↓
Client
What each stage does:
- Edge → region routing
- Auth & Quota → token-based limits
- Queue → backpressure control
- Scheduler → continuous batching
- GPU Worker → prefill + decode
- Streaming → token-by-token output
Keep this diagram in mind as a mental map; we'll revisit each box below.
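To make the flow concrete, here is a toy sketch of those stages in Python. Everything in it — the tenant quota table, the queue capacity, the word-splitting "decoder" — is invented for illustration; it is not how any real inference stack is implemented.

```python
# Toy sketch of the request lifecycle: auth -> queue -> batch -> decode -> stream.
from collections import deque

QUOTA = {"team-a": 2}     # hypothetical per-tenant request budget
QUEUE_CAPACITY = 100      # hard cap: beyond this, reject instead of queueing
queue = deque()

def authenticate(request):
    # Auth & quota: reject unknown tenants or exhausted budgets.
    tenant = request["tenant"]
    if QUOTA.get(tenant, 0) <= 0:
        raise PermissionError("quota exhausted")
    QUOTA[tenant] -= 1

def enqueue(request):
    # Backpressure: a full queue becomes a fast 429, not unbounded latency.
    if len(queue) >= QUEUE_CAPACITY:
        raise RuntimeError("429: queue full")
    queue.append(request)

def decode(request):
    # Decode phase stand-in: emit "tokens" one at a time (here, just words).
    for token in request["prompt"].split():
        yield token

def handle(request):
    authenticate(request)
    enqueue(request)
    batch = [queue.popleft()]       # a real scheduler batches many requests
    for req in batch:
        yield from decode(req)      # stream tokens back as they are produced

tokens = list(handle({"tenant": "team-a", "prompt": "hello streaming world"}))
print(tokens)  # ['hello', 'streaming', 'world']
```

The point of the sketch is the ordering: quota is spent before any GPU work happens, and the queue — not the model — is where a request first waits.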
Why Latency Explodes Under Load
Key takeaway:
Arrival rate > processing rate → queue grows → latency explodes
That one inequality is queueing behavior in a nutshell: once arrivals outpace service, waiting time grows without bound.
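A five-line simulation shows the effect. The rates are made up; the only thing that matters is which one is bigger.

```python
# Queue depth over time: arrivals join, the "GPU" drains what it can each tick.
def simulate(arrival_rate, service_rate, ticks):
    depth, history = 0, []
    for _ in range(ticks):
        depth += arrival_rate                  # new requests join the queue
        depth = max(0, depth - service_rate)   # server drains its capacity
        history.append(depth)
    return history

stable = simulate(arrival_rate=8, service_rate=10, ticks=5)
overloaded = simulate(arrival_rate=12, service_rate=10, ticks=5)
print(stable)      # [0, 0, 0, 0, 0]  -- queue never forms
print(overloaded)  # [2, 4, 6, 8, 10] -- queue (and wait time) grows every tick
```

Below capacity, the queue stays empty no matter how long you run it. Just 20% over capacity, the queue — and therefore latency — grows linearly forever.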
Naive Batching vs Continuous Batching
Naive batching:
Time →
[ Batch 1 ] idle [ Batch 2 ] idle [ Batch 3 ]
- Fixed batch boundaries
- Idle GPU time
- Poor utilization
Continuous batching:
Time →
A ─────────▶
  B ──────▶
    C ────▶ E ──▶
      D ──▶ F ────▶
(requests join and finish independently; all decode together)
- Requests join dynamically
- GPU stays busy
- Higher throughput
This is why modern inference systems behave so differently from naive batch-at-a-time servers.
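A small simulation makes the throughput difference concrete. Each job is a request needing some number of decode steps, and the GPU runs up to `slots` requests per step; both scheduling policies are simplified toys, not real scheduler code.

```python
# Compare total decode steps under the two batching policies.
def naive_batching(jobs, slots):
    # Fixed batches: the whole batch waits for its slowest member,
    # and the GPU idles in the slots that finished early.
    steps = 0
    for i in range(0, len(jobs), slots):
        steps += max(jobs[i:i + slots])
    return steps

def continuous_batching(jobs, slots):
    # A new request joins the moment any slot frees up.
    pending, active, steps = list(jobs), [], 0
    while pending or active:
        while pending and len(active) < slots:
            active.append(pending.pop(0))
        steps += 1
        active = [j - 1 for j in active if j > 1]  # drop finished requests
    return steps

jobs = [5, 1, 1, 1, 1, 1]   # one long request, five short ones
print(naive_batching(jobs, slots=2))       # 7: short jobs wait behind the long one
print(continuous_batching(jobs, slots=2))  # 5: short jobs slot in as others finish
```

Same work, same hardware — the only difference is when requests are allowed to join, and that alone changes total runtime.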
Two Phases of LLM Inference
Prefill phase:
- Entire prompt processed
- KV cache created
- High GPU memory usage
- Expensive but parallel
Decode phase:
- One token at a time
- KV cache reused
- Lower per-step compute
- Enables streaming
Visual flow:
Prompt tokens → KV Cache
KV Cache → Token 1 → Token 2 → Token 3 → ...
This split is exactly what makes token-by-token streaming possible: once prefill has built the cache, each subsequent token is a cheap incremental step.
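In code, the two phases look roughly like this. The "KV cache" here is a plain list holding one entry per token — a real cache stores per-layer key/value tensors, and a real model would attend over the cache rather than fabricate tokens.

```python
# Toy prefill/decode split with a stand-in KV cache.
def prefill(prompt_tokens):
    # Prefill: process the whole prompt in parallel, one cache entry per token.
    return list(prompt_tokens)

def decode_step(kv_cache):
    # Decode: generate one token using the cache, then extend the cache with it.
    token = f"tok{len(kv_cache)}"   # fabricated token for illustration
    kv_cache.append(token)
    return token

cache = prefill(["Explain", "backpressure"])
generated = [decode_step(cache) for _ in range(3)]
print(generated)    # ['tok2', 'tok3', 'tok4']
print(len(cache))   # 5 = 2 prompt tokens + 3 generated tokens
```

Note that the cache only ever grows: this is why long prompts and long outputs both consume GPU memory for the entire lifetime of a request.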
Token Streaming Lifecycle
Request start
↓
Prefill
↓
Decode token 1 → sent
Decode token 2 → sent
Decode token 3 → sent
↓
Client disconnect?
├─ Yes → cancel → cleanup resources
└─ No → continue decoding
This highlights an often-overlooked detail:
cancellation must propagate through the system to avoid wasted GPU work.
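Python generators give a compact way to sketch this propagation: closing the stream (the disconnect) raises inside the decode loop, and a `finally` block plays the role of the server-side resource cleanup. The request id and token format are invented for the example.

```python
# Cancellation sketch: closing the token stream triggers resource cleanup.
cleaned_up = []

def decode_stream(request_id, max_tokens):
    try:
        for i in range(max_tokens):
            yield f"token-{i}"
    finally:
        # Runs on normal completion AND on early cancellation:
        # free the KV cache and batch slot for other requests.
        cleaned_up.append(request_id)

stream = decode_stream("req-42", max_tokens=100)
received = [next(stream) for _ in range(3)]
stream.close()                  # client disconnected after 3 tokens

print(received)    # ['token-0', 'token-1', 'token-2']
print(cleaned_up)  # ['req-42'] -- no further decode work is wasted
```

If the cleanup path is missing, a disconnected client's request keeps decoding — and keeps holding GPU memory — until it hits its token limit.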
Next, we’ll dive into what happens when your request enters the inference queue — and why that queue is where most latency problems begin.