DEV Community

April
What Actually Happens When You Call an LLM API?

You write something simple like this:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    input="Explain backpressure in simple terms",
)

A few hundred milliseconds later, text begins streaming back.

It feels instant.
It feels simple.

But that single API call triggers a surprisingly complex distributed system involving:

  • Global traffic routing
  • Authentication and token-based quota enforcement
  • Multi-tenant scheduling
  • GPU memory management
  • Continuous batching
  • Autoregressive token decoding
  • Streaming transport over persistent connections

An LLM API is not just “a model running on a server.”
It is a real-time scheduling and resource allocation system built on top of extremely expensive hardware.

Under the hood, your request is competing with thousands of others for:

  • GPU compute
  • GPU memory
  • Context window capacity
  • Batch slots
  • Network bandwidth

Understanding this pipeline changes how you think about:

  • Latency
  • Rate limiting
  • Prompt size
  • Streaming
  • Retries
  • System reliability

In this article, we’ll walk through exactly what happens — step by step — from the moment your request hits the edge of the network to the moment tokens stream back to your client.

No hype.
No marketing language.
Just the infrastructure.


The Big Picture

Before diving into details, here’s the high-level flow of a typical LLM API request:

  1. Your request hits a global edge endpoint
  2. It passes authentication and quota checks
  3. It enters an inference queue
  4. A scheduler batches it with other requests
  5. The model performs a prefill pass over your prompt
  6. The model generates tokens one-by-one (decode phase)
  7. Tokens stream back over a persistent connection
  8. Resources are cleaned up and metrics are recorded

Each of these steps exists for a reason.
Each introduces tradeoffs.
And each can become a bottleneck under load.

Let’s break them down.


LLM API Request Lifecycle

Client
  ↓
Edge / Load Balancer
  ↓
API Gateway
  ↓
Auth & Quota
  ↓
Request Queue
  ↓
Scheduler
  ↓
GPU Worker
  ↓
Streaming Response
  ↓
Client

What each stage is responsible for:

  • Edge → region routing
  • Auth & Quota → token-based limits
  • Queue → backpressure control
  • Scheduler → continuous batching
  • GPU Worker → prefill + decode
  • Streaming → token-by-token output

Keep this map in mind; the sections below zoom into these stages one by one.
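As a rough sketch, the stages above can be strung together as plain Python functions. All of the names here are invented for illustration; real providers run each stage on separate infrastructure:

```python
# Illustrative end-to-end pipeline (all names are made up): each stage
# is a plain function so the flow of the eight steps is easy to follow.

def route_to_edge(prompt: str) -> dict:
    return {"prompt": prompt, "region": "nearest"}   # 1. edge routing

def check_auth_and_quota(request: dict) -> dict:
    request["authorized"] = True                     # 2. auth + quota
    return request

def enqueue_and_schedule(request: dict) -> dict:
    request["batch_slot"] = 0                        # 3-4. queue, then batch
    return request

def prefill(request: dict) -> list[str]:
    return request["prompt"].split()                 # 5. build the KV cache

def decode(kv_cache: list[str], n: int):
    for i in range(n):                               # 6. one token per step
        yield f"tok{i}"                              # 7. streamed out

request = check_auth_and_quota(route_to_edge("Explain backpressure"))
request = enqueue_and_schedule(request)
print(list(decode(prefill(request), 3)))  # ['tok0', 'tok1', 'tok2']
# 8. cleanup and metrics happen after the stream ends
```

The point is not the code itself but the shape: a request is transformed by a chain of stages, and any one of them can add latency.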


Why Latency Explodes Under Load

The key relationship:

Arrival rate > processing rate → queue grows → latency explodes

You don't need queueing theory to see the problem: any sustained imbalance between arrivals and service makes the backlog, and therefore the wait time, grow without bound.
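A toy discrete-time model makes this concrete. The rates here are made up, and `queue_depth` is an illustration rather than a real queueing formula:

```python
# Toy queue model: requests arrive at a fixed rate and are served at a
# fixed rate. When arrivals outpace service, the backlog grows every tick.

def queue_depth(arrival_rate: float, service_rate: float, ticks: int) -> list[float]:
    """Backlog size after each tick of a discrete-time queue."""
    depth = 0.0
    history = []
    for _ in range(ticks):
        depth += arrival_rate                   # new requests join the queue
        depth = max(0.0, depth - service_rate)  # server drains what it can
        history.append(depth)
    return history

# Stable: 8 arrivals/tick vs 10 served/tick -> the queue stays empty
print(queue_depth(8, 10, 5))   # [0.0, 0.0, 0.0, 0.0, 0.0]

# Overloaded: 12 arrivals/tick vs 10 served/tick -> backlog grows linearly
print(queue_depth(12, 10, 5))  # [2.0, 4.0, 6.0, 8.0, 10.0]
```

Note that the overloaded case never recovers: every tick adds two more requests than the server can clear, so latency grows linearly for as long as the overload lasts.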


Naive Batching vs Continuous Batching

Naive batching:

Time →
[ Batch 1 ]   idle   [ Batch 2 ]   idle   [ Batch 3 ]
  • Fixed batch boundaries
  • Idle GPU time
  • Poor utilization

Continuous batching:

Time →
A B C
  D E
    F
(all decoding together)
  • Requests join dynamically
  • GPU stays busy
  • Higher throughput

This is why modern inference servers sustain far higher GPU utilization than naive fixed-batch serving.
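A toy simulation shows the difference. The job lengths and the scheduler below are invented for illustration; production schedulers are far more sophisticated, but the utilization gap they exploit is the same:

```python
# Toy comparison: six requests needing 3, 5, 2, 4, 1, and 6 decode steps.
# Naive batching holds a batch open until its slowest member finishes;
# continuous batching frees a slot as soon as any request completes.

def naive_batch_steps(jobs: list[int], batch_size: int) -> int:
    """Total GPU steps when each batch runs until its longest job ends."""
    steps = 0
    for i in range(0, len(jobs), batch_size):
        steps += max(jobs[i:i + batch_size])  # batch is held open this long
    return steps

def continuous_batch_steps(jobs: list[int], batch_size: int) -> int:
    """Total GPU steps when finished jobs are replaced immediately."""
    pending = list(jobs)
    running = []
    steps = 0
    while pending or running:
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))    # a request joins mid-flight
        steps += 1                            # one decode step for the batch
        running = [r - 1 for r in running if r - 1 > 0]
    return steps

jobs = [3, 5, 2, 4, 1, 6]
print(naive_batch_steps(jobs, 2))       # 15 steps: 5 + 4 + 6
print(continuous_batch_steps(jobs, 2))  # 12 steps: slots never sit idle
```

Same work, same batch size, fewer total steps, purely because freed slots are refilled immediately instead of waiting for a batch boundary.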


Two Phases of LLM Inference

Prefill phase:

  • Entire prompt processed
  • KV cache created
  • High GPU memory usage
  • Expensive but parallel

Decode phase:

  • One token at a time
  • KV cache reused
  • Lower per-step compute
  • Enables streaming

Visual flow:

Prompt tokens → KV Cache
KV Cache → Token 1 → Token 2 → Token 3 → ...

This split is what makes streaming possible: tokens are produced one at a time, so they can be sent one at a time.
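Here is a minimal sketch of the two phases, with stand-in functions instead of real attention kernels. `prefill` and `decode_step` are illustrative names, not a real engine's API:

```python
# Minimal sketch of the two inference phases. Real engines do this on
# the GPU with attention kernels; here the "cache" is just a list.

def prefill(prompt_tokens: list[str]) -> list[str]:
    """Process the whole prompt in one parallel pass, building the KV cache."""
    return list(prompt_tokens)     # cache holds state for every prompt token

def decode_step(kv_cache: list[str]) -> str:
    """Generate one token, reusing (and extending) the cached state."""
    token = f"tok{len(kv_cache)}"  # stand-in for a real sampled token
    kv_cache.append(token)         # the new token's state joins the cache
    return token

cache = prefill(["Explain", "backpressure"])     # expensive, but parallel
stream = [decode_step(cache) for _ in range(3)]  # cheap, one token at a time
print(stream)  # ['tok2', 'tok3', 'tok4']
```

The asymmetry is the key point: prefill touches every prompt token at once, while each decode step only adds one entry to state that already exists.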


Token Streaming Lifecycle

Request start
   ↓
Prefill
   ↓
Decode token 1 → sent
Decode token 2 → sent
Decode token 3 → sent
   ↓
Client disconnect?
   ├─ Yes → cancel → cleanup resources
   └─ No  → continue decoding

This highlights an often-overlooked detail:
cancellation must propagate through the system to avoid wasted GPU work.
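One way to sketch this server-side (illustrative only, not a real provider's code): run decoding as a generator, so closing it when the client disconnects triggers cleanup immediately instead of finishing an answer nobody will read.

```python
# Cancellation sketch: decoding is a generator, so a client disconnect
# (generator close) stops GPU work and runs cleanup right away.

def decode_stream(max_tokens: int):
    produced = []
    try:
        for i in range(max_tokens):
            produced.append(f"tok{i}")
            yield f"tok{i}"  # each token is sent as soon as it exists
    finally:
        # Runs on normal completion AND on early close (disconnect):
        # release the KV cache, free the batch slot, record metrics.
        print(f"cleanup after {len(produced)} tokens")

stream = decode_stream(max_tokens=100)
for i, token in enumerate(stream):
    if i == 2:              # simulate the client disconnecting mid-response
        stream.close()      # propagates cancellation into the generator
        break
# prints "cleanup after 3 tokens" -- the other 97 were never computed
```

Without this propagation, a disconnected client would keep occupying a batch slot and KV-cache memory until the full response finished.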

Next, we’ll dive into what happens when your request enters the inference queue — and why that queue is where most latency problems begin.
