🧑‍🚀 LLM Engine Telemetry: How to Profile Models and See Where Performance is Lost

“Any LLM is an engine. It can be massive or compact, but if you don't look at the telemetry, you'll never understand where you're burning energy inefficiently.”
— Astronaut Engineer, Logbook #4


🌌 Introduction: Why LLMs Need Profiling

When engineers discuss LLM performance, three key metrics come up most often:

  • Tokenization latency
  • TTFT (Time To First Token)
  • tokens/sec

But it's easier to think of it this way:

An LLM is an engine, and the profiler is its dashboard. The rest is visible through the readings—and we're about to break them down.

Just like in real machinery:

  • Startup is always more expensive than the cruising phase,
  • Different engine components consume energy differently,
  • The true picture is only visible through telemetry.

👨‍🚀 Caption: "Before launch—rely only on the instruments"


🚀 Mission Plan

We are launching the GPT-2 model in three scenarios:

  • Short prompt
  • Medium prompt
  • Long prompt

Each test goes through three key phases:

  1. Tokenization — Preparing the input.
  2. Prefill - The initial prompt processing that establishes TTFT.
  3. Decode / Steady-State — The cruising phase of generation.

We measure:

  • Tokenization time,
  • TTFT,
  • Generation speed (ms/token, tokens/sec),
  • Memory usage (peakRSS),
  • The most expensive low-level operations.

All data is collected via torch.profiler and displayed in TensorBoard.
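
As a reference point, here is a minimal sketch of how such a trace can be collected and exported for TensorBoard. The prompt, log directory, and generation length are illustrative placeholders, not the exact mission_profiler.py setup.

```python
# Minimal sketch: wrap a GPT-2 generation in torch.profiler and write the
# trace where TensorBoard's profiler plugin can pick it up.
import torch
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("Telemetry check: all systems nominal.", return_tensors="pt")

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(
    activities=activities,
    record_shapes=True,
    profile_memory=True,
    on_trace_ready=tensorboard_trace_handler("./tb_logs/short_prompt"),
):
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=32)
```

After the run, `tensorboard --logdir ./tb_logs` (with the torch-tb-profiler plugin installed) shows the trace timeline and the operator breakdown.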


🧩 LLM Operation Phases: What Happens Under the Hood

1. Tokenization - Input Preparation

The text is converted into tokens using the chosen tokenizer. On short texts, measuring this phase can be highly susceptible to system noise (jitter), which is why tokenization is almost always measured separately.
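
A quick way to see that jitter is to time the tokenizer in isolation and repeat the measurement many times; the prompt string and repeat count below are arbitrary.

```python
# Sketch: timing tokenization in isolation. Repeating the measurement and
# reporting the median helps filter out system jitter on short inputs.
import statistics
import time

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompt = "Engine telemetry online."  # short prompt: single runs will be noisy

samples_ms = []
for _ in range(100):
    t0 = time.perf_counter()
    tokenizer(prompt)
    samples_ms.append((time.perf_counter() - t0) * 1000)

print(f"median: {statistics.median(samples_ms):.3f} ms, "
      f"min: {min(samples_ms):.3f} ms, max: {max(samples_ms):.3f} ms")
```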

2. Prefill - Prompt Processing and Model State Establishment

In this phase, the model:

  • Runs the entire prompt through all layers once,
  • Computes attention for the entire input sequence,
  • Populates the KV-Cache for subsequent generation,
  • Allocates temporary tensors and buffers.

Formally:

TTFT = Prefill time + first Decode step time

TTFT is the time required to complete prompt processing and produce the first token. Measured per output token, that first token is by far the most expensive, since the entire prompt has to be processed in a single pass before it can appear.
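
One way to approximate TTFT with the high-level HuggingFace API is to time a generation capped at a single new token; that bundles prefill and the first decode step together, matching the formula above. The prompt text here is a placeholder.

```python
# Sketch: approximating TTFT by timing a single-token generation
# (prefill + first decode step).
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("Long mission briefing goes here ...", return_tensors="pt")

with torch.no_grad():
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish GPU work before stopping the clock
    ttft_ms = (time.perf_counter() - t0) * 1000

print(f"TTFT ≈ {ttft_ms:.1f} ms")
```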

3. Decode — Generating New Tokens

After prefill, the model transitions to sequential generation. Each new token requires:

1 forward pass → 1 token

Decode characteristics:

  • Operations are repeated with the same structure,
  • The KV-Cache prevents re-computing attention for the entire prompt,
  • Metrics become stable: ms/token, tokens/sec.
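
These cruising metrics can be estimated even without a profiler: time a single-token run and a longer run, then attribute the difference to the extra decode steps. The sketch below uses placeholder prompts and token counts.

```python
# Sketch: estimating steady-state decode speed. A 1-token run costs roughly
# prefill + 1 decode step; a 32-token run adds 31 more decode steps.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tokenizer("Cruising phase telemetry:", return_tensors="pt")


def timed_generate_ms(max_new_tokens: int) -> float:
    t0 = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1000


ttft_ms = timed_generate_ms(1)    # prefill + first decode step
full_ms = timed_generate_ms(32)   # prefill + 32 decode steps
ms_per_token = (full_ms - ttft_ms) / 31  # ≈ cost of the 31 extra decode steps

print(f"steady-state: {ms_per_token:.2f} ms/token, {1000 / ms_per_token:.1f} tok/s")
```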

📡 Experimental Setup

The mission_profiler.py script:

  • Performs three launches (short / medium / long prompt),
  • Executes two generations for each: Prefill → TTFT, and a full generation → Steady-State,
  • Saves traces to TensorBoard,
  • Outputs a summary metrics table.

⚠️ We do not perform any warmup, so the first run (short_prompt) may be slower.


🛠️ Launch Telemetry Yourself!

You can replicate this "flight" and study the profiler logs on your own machine.

All the code, settings, and launch instructions are available in the mission repository: GitHub: LLM Profiler Mission - Engine Telemetry


📈 Mission Results

================= MISSION SUMMARY =================
tag            prompt_len   tokenize   TTFT(ms)   steady(ms)   actual_tok   ms/token   tok/s   peakRSS(MB)
--------------------------------------------------------------------------------------------------------
short_prompt           19        6.6      920.9        823.5         32      25.73     38.9       2541.2
medium_prompt          56        1.4       43.2       1047.4         32      32.73     30.6       2866.3
long_prompt           116        1.7       32.5        894.0         32      27.94     35.8       2886.8

How to Read These Numbers:

🔹 Tokenize. We can't reliably compare tokenizers using this data; a dedicated, large-scale benchmark is needed. Short strings are heavily affected by system noise, so tokenization performance is evaluated separately.

🔹 TTFT (Time-To-First-Token). The most interesting observation:

  • Short prompt → 921 ms
  • Medium → 43 ms
  • Long → 32 ms

Why the difference?

  • The first run (short_prompt) bore the full impact of the cold start: it includes CUDA/MPS warmup, allocations, and JIT compilation of kernels.
  • TTFT is sensitive to the very first execution.
  • In subsequent runs (medium, long prompt), after warmup, TTFT stabilizes, and the difference between the medium and long prompt becomes minimal.

Conclusion:

TTFT should be measured either after a dedicated warmup or averaged over several runs.
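
A sketch of that discipline: one discarded warmup generation, then several timed single-token runs whose mean and median are reported. The run count and prompt are arbitrary.

```python
# Sketch: measuring TTFT only after a dedicated warmup, then averaging over
# several runs to smooth out cold-start effects and jitter.
import statistics
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tokenizer("Pre-flight checklist complete.", return_tensors="pt")


def ttft_ms() -> float:
    t0 = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=1, do_sample=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1000


_ = ttft_ms()  # warmup: absorbs kernel compilation, allocator growth, caches

runs = [ttft_ms() for _ in range(5)]
print(f"TTFT after warmup: mean {statistics.mean(runs):.1f} ms, "
      f"median {statistics.median(runs):.1f} ms")
```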

🔹 Steady-State (ms/token). The cost per token remains relatively stable:

  • ~26–33 ms/token
  • ~30–39 tok/s

This is the engine's speed in cruising mode. As expected, the per-token latency shows almost no dependence on prompt length.

🔹 Peak RSS. 2541 → 2866 → 2886 MB. Memory usage jumps noticeably when going from the short to the medium prompt (due to the growth of the KV-Cache and general allocations), but further lengthening shows minimal increase. This confirms that the primary VRAM/RAM allocation is for the model itself, while the KV-cache consumes only a small fraction. Its size does, however, grow linearly with input length.
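
The peakRSS column can be reproduced with nothing more than the standard-library resource module (Unix-only); note that the unit of ru_maxrss differs between Linux and macOS.

```python
# Sketch: reading the peak resident set size (RSS) of the current process.
# Unix-only: ru_maxrss is kilobytes on Linux and bytes on macOS.
import resource
import sys


def peak_rss_mb() -> float:
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / (1024 * 1024)  # bytes -> MB
    return peak / 1024               # kilobytes -> MB


print(f"peak RSS: {peak_rss_mb():.1f} MB")
```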


📊 Who is Really Consuming Resources:

In the profiler, all operations fall into two camps:

🔥 1. Main Thrust (Useful Work)

On the GPU, these are:

  • addmm
  • mm / matmul
  • scaled_dot_product_attention

They consume the majority of the CUDA time. These are the matrix computation kernels—the operations that truly propel the LLM engine forward.
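
This split is easy to check in the operator table of a torch.profiler run: the sketch below ranks operators by accumulated device time, falling back to CPU time when no GPU is available. Prompt and generation length are placeholders.

```python
# Sketch: ranking operators by accumulated time inside a torch.profiler run.
# On a GPU, addmm / mm / scaled_dot_product_attention typically top this table.
import torch
from torch.profiler import ProfilerActivity, profile
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
if torch.cuda.is_available():
    model = model.cuda()

inputs = tokenizer("Main thrust check.", return_tensors="pt").to(model.device)

activities = [ProfilerActivity.CPU]
sort_key = "cpu_time_total"
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    sort_key = "cuda_time_total"

with profile(activities=activities) as prof:
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=32)

print(prof.key_averages().table(sort_by=sort_key, row_limit=15))
```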

⚙️ 2. Control Expenses (Overhead)

Utility operations:

  • _local_scalar_dense
  • item
  • cat
  • copy_
  • to
  • Mask checks (eq, all)

These are responsible for:

  • Data movement,
  • Synchronization between the CPU/host and the GPU/device,
  • Scalar extraction,
  • Utility logic for generation.

This is not the engine's thrust, but the cost of flight control.


🧭 The Big Picture

  • Core computational kernels (attention, matmuls, addmm)—these determine whether the model is fast or slow.
  • Overhead operations—these are non-productive costs that can be reduced through optimizations: minimizing synchronizations, using use_cache=True, and reducing the number of small tensor operations (see the sketch below).
  • On CUDA, matrix kernels dominate (as they should), but on MPS, utility operations often dominate.

The real profile of LLM performance is hidden in the balance between these two groups.
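
As an illustration, here is a minimal sketch of two of the cheapest overhead reducers when calling generate() directly: keep the KV-cache enabled and keep host-side scalar pulls out of the generation loop. Model and prompt are placeholders.

```python
# Sketch: trimming control overhead around HuggingFace generate().
# 1) use_cache=True reuses cached keys/values instead of re-running the prompt
#    (it is already the default; shown explicitly here for emphasis),
# 2) decoding/printing happens once at the end, so no per-token host syncs
#    (.item(), tensor prints) land inside the hot loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tokenizer("Flight control overhead check.", return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
        use_cache=True,  # keep the KV-Cache on for incremental decoding
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))
```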


⚙️ Why Profiling LLMs is Essential

The profiler turns:

❌ "The model is running slow" into ✔ "Here is the specific operation that's consuming energy."

It helps reveal:

  • Where the bottleneck is located,
  • The cost of prefill,
  • The cost of each token,
  • How memory behaves,
  • The overhead created by HuggingFace generate().

🏁 Conclusion

The LLM is an engine. Sometimes powerful, sometimes compact, but always complex and sensitive to overloads. And until you open the profiler:

  • You won't see the expensive matrix operations,
  • You won't see the synchronization overhead,
  • You won't know the cost of prefill,
  • You won't see the growth of the KV-Cache.

The profiler is our flight recorder. It shows:

  • Where the engine is pulling,
  • Where it's stalling,
  • And where the energy is going.

No one launches a rocket without a flight recorder.
