- Measuring the right signals: throughput, latency, utilization, and memory
- Using NVIDIA Nsight to map CPU–GPU timelines and find hotspots
- Profiling with PyTorch Profiler and TPU tools for LLM workloads
- Bottlenecks you'll see and surgical fixes
- Automating benchmarks and performance regression testing
Profiling LLM training and inference is a forensic exercise: you must prove which resource—compute, memory, or IO—is starving the rest, and then apply a narrowly scoped fix that moves the wall-clock needle. The combination of NVIDIA Nsight, torch.profiler, and TPU profiling tools gives you the instrumentation to do that with evidence instead of hunches.
The symptoms you see are predictable: training stalls despite “full” GPUs, inference p95 spikes during production, or throughput that refuses to scale with batch size. Those symptoms hide different root causes—data-loading stalls, memory-bandwidth saturation, or microkernel overhead—and the right profile pinpoints which one. The rest of this piece is a compact, operational playbook: what metrics to collect, concrete steps with nsys/ncu/torch.profiler/TPU tools, how to read the results, and exactly which mitigations move the numbers.
Measuring the right signals: throughput, latency, utilization, and memory
You must measure the right signals, in the right units, and across steady-state runs.
- Throughput (primary KPI for training & batched inference). Training: tokens/sec = steps/sec × batch_size × seq_len. Inference: samples/sec or tokens/sec depending on your scenario. Use a timed, reproducible loop and report steady-state throughput after warmup. MLPerf-style guidance on warmup and steady-state is a useful reference for run discipline.
- Latency (primary KPI for low-latency inference). Report p50, p95, p99 and tail latencies measured end-to-end (including CPU-side preprocessing and device transfer). Single-shot latency and batched latency are distinct metrics; measure both if you support dynamic batch sizing.
- GPU utilization and SM/Tensor Core activity. `nvidia-smi` gives a high-level view (`utilization.gpu`, `utilization.memory`); `nsys` and `ncu` give SM occupancy, Tensor Core usage, and instruction-level counters. Use those to separate idle GPUs from busy but memory-starved GPUs.
- Memory bandwidth and capacity. Look at achieved DRAM throughput in `ncu` reports and Nsight metrics; compare against the device peak using a roofline mindset (operational intensity → compute- vs memory-bound). The Roofline model helps you interpret whether compute optimizations will help at all.
- Host CPU, IO, and network metrics. Measure dataloader latency, disk throughput, and network/NCCL times to find host-side stalls that leave GPUs idle. `nsys` can visualize the CPU threads and system calls that align with GPU idle time.
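The warmup-then-steady-state discipline above can be sketched as a small timing harness. This is a minimal sketch: `measure_throughput` and `step_fn` are hypothetical names of our own, not part of any framework.

```python
import time
import statistics

def measure_throughput(step_fn, tokens_per_step, warmup=10, iters=100):
    """Report steady-state tokens/sec: discard warmup iterations, then
    time each measured step and use the median to resist outliers."""
    for _ in range(warmup):
        step_fn()                      # warmup: JIT, allocator, caches settle
    durations = []
    for _ in range(iters):
        t0 = time.perf_counter()
        step_fn()                      # for GPU work, step_fn must synchronize
        durations.append(time.perf_counter() - t0)
    return tokens_per_step / statistics.median(durations)
```

For CUDA workloads the step function should end with `torch.cuda.synchronize()`, so the timer measures completed device work rather than asynchronous kernel launches.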
Practical measurement checklist
- Warm up the model for a small number of iterations before measuring.
- Measure multiple runs, report median (or mean ± std) across runs.
- Record environment: driver, CUDA version, container digest, commit hash, and an `nvidia-smi` snapshot. MLPerf-style reproducibility rules are the right discipline for CI-grade measurements.
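A small run-metadata recorder along these lines keeps each measurement reproducible; `record_env` and its field names are an illustrative schema of our own, not a standard.

```python
import json
import platform
import subprocess

def record_env(path):
    """Snapshot reproducibility metadata (Python, platform, commit, GPU state)
    next to each benchmark run; missing tools are recorded as None."""
    env = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    probes = {
        "commit": ["git", "rev-parse", "HEAD"],
        "nvidia_smi": ["nvidia-smi",
                       "--query-gpu=name,driver_version,memory.total",
                       "--format=csv"],
    }
    for name, cmd in probes.items():
        try:
            out = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
            env[name] = out.stdout.strip() or None
        except Exception:
            env[name] = None           # tool not installed on this host
    with open(path, "w") as f:
        json.dump(env, f, indent=2)
    return env
```

Extend the probe list with your container digest and CUDA version; the point is that every `metrics.json` has a sibling file explaining exactly where it came from.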
Quick tool→metric map (short)
| Metric | Where to capture |
|---|---|
| Throughput / steps/sec, tokens/sec | In-script timers (Python) + torch.profiler logs |
| Tail latency (p95/p99) | Client-side timers for inference, or framework trace |
| SM utilization / Tensor Core activity | Nsight Systems / Nsight Compute (`nsys` / `ncu`) |
| Memory bandwidth (achieved) | Nsight Compute `--metrics` DRAM throughput counters |
| Data-prep latency / CPU stalls | `nsys` timeline, `torch.profiler` CPU events |
| TPU execution traces | TPU XProf / TensorBoard plugin, or torch_xla debug profiler. |
Using NVIDIA Nsight to map CPU–GPU timelines and find hotspots
Use Nsight Systems as your first stop: it gives a system-wide timeline that answers “where does time go?” and correlates CPU activity, kernel launches, and NVTX annotations.
Recommended workflow
- Add NVTX ranges to mark iteration boundaries and high-level stages (data load, forward, backward, optimizer). Use `torch.cuda.nvtx.range_push` or `torch.autograd.profiler.emit_nvtx` so the timeline maps directly to your code.
- Capture a focused window with `nsys` rather than trying to record the entire 24-hour job. Use capture-range hooks (NVTX, start/stop API) to limit trace size and overhead.
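One way to wire in those annotations is a small context-manager wrapper around the NVTX push/pop API. This is a sketch, not a library: `nvtx_range` is our own convenience name, and the fallback lets the same code run on hosts without a CUDA build of PyTorch.

```python
from contextlib import contextmanager

try:
    import torch
    torch.cuda.nvtx.range_push("probe")   # CPU-only builds raise here
    torch.cuda.nvtx.range_pop()
    _push, _pop = torch.cuda.nvtx.range_push, torch.cuda.nvtx.range_pop
except Exception:
    _push, _pop = (lambda name: None), (lambda: None)   # no-op fallback

@contextmanager
def nvtx_range(name):
    """Mark a named region; it appears as a colored range in the nsys timeline."""
    _push(name)
    try:
        yield
    finally:
        _pop()

# Inside the training loop (model/optimizer/loss_fn are your own objects):
#   with nvtx_range("PROFILE"):            # matches --nvtx-capture=PROFILE below
#       with nvtx_range("forward"):
#           out = model(batch)
#       with nvtx_range("backward"):
#           loss_fn(out, targets).backward()
#       with nvtx_range("optimizer"):
#           optimizer.step(); optimizer.zero_grad()
```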
Example: targeted nsys capture
# capture a single epoch region annotated with NVTX "PROFILE"
NSYS_NVTX_PROFILER_REGISTER_ONLY=0 \
nsys profile -o llm_profile \
--trace=cuda,cublas,cudnn,nvtx,osrt \
--gpu-metrics-devices=all \
--capture-range=nvtx --nvtx-capture=PROFILE \
python train.py --config=configs/large.yml
nsys generates a timeline you open in the Nsight UI; zoom to iterations, and look for gaps in the GPU HW lane where there is no kernel activity.
Drill down with Nsight Compute (ncu)
- When you find a heavy kernel in the timeline, right-click and launch `ncu` (Nsight Compute) to collect per-kernel metrics: achieved occupancy, instruction throughput, memory throughput, and cache hit ratios. `ncu` gives you the "what" at the instruction and register level.
Example ncu invocation (kernel-level):
ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active,sm__inst_executed.avg.per_cycle_active,dram__throughput.avg.pct_of_peak_sustained_elapsed \
    -o big_kernel_report python train.py --some-args
Interpretation tips
- Long CPU sections between kernel launches → data loader / serialization / Python-side overhead. Check `torch.profiler` CPU timings for the data pipeline.
- GPU active but low achieved FLOPS with high DRAM throughput → memory-bound kernel. Apply roofline thinking: increase operational intensity or reduce memory traffic.
- Many micro-kernels with very short durations → kernel-launch overhead; fuse ops, use custom kernels (Triton), or rely on compiler fusion.
Important callout
Sample small windows, then iterate. `nsys` trace files grow quickly and `ncu` replay has overhead; use capture-range and NVTX so traces are representative without being massive.
Profiling with PyTorch Profiler and TPU tools for LLM workloads
PyTorch Profiler (torch.profiler) is the fastest path to operator-level insights inside PyTorch and integrates with TensorBoard. For long-running training jobs, use schedule and on_trace_ready to collect a few representative cycles rather than tracing everything.
Representative torch.profiler setup
from torch.profiler import profile, record_function, ProfilerActivity, schedule, tensorboard_trace_handler

my_schedule = schedule(skip_first=10, wait=5, warmup=2, active=3, repeat=2)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=my_schedule,
    on_trace_ready=tensorboard_trace_handler("./profiler_runs"),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step, batch in enumerate(train_loader):
        with record_function("train_step"):
            optimizer.zero_grad(set_to_none=True)
            outputs = model(batch)
            loss = loss_fn(outputs, batch.targets)
            loss.backward()
            optimizer.step()
        prof.step()
Key PyTorch profiler outputs
- `key_averages().table()` for operator-level hot paths.
- `export_chrome_trace()` or the TensorBoard plugin for a timeline view.
- `export_memory_timeline()` for allocation patterns and peak usage.
TPU profiling (XProf / Torch XLA)
- For Cloud TPU VMs and PyTorch XLA, use the XProf tooling: start the profiler server, wrap the region with `xp.start_trace()` / `xp.stop_trace()`, and visualize in TensorBoard with the `tensorboard_plugin_profile`. The Cloud TPU docs include complete examples for `torch_xla.debug.profiler`.
TPU example (PyTorch XLA)
import torch_xla.debug.profiler as xp
server = xp.start_server(9012)
xp.start_trace('/root/logs/')
# run representative steps
xp.stop_trace()
Then run:
pip install tensorboard tensorboard_plugin_profile
tensorboard --logdir /root/logs/
This gives a timeline comparable to nsys for TPU workloads.
Bottlenecks you'll see and surgical fixes
Use this table as the first diagnostic map: read the symptom, confirm with the tool/counter, then apply the pointed fix.
| Symptom | How you confirm (tool / counter) | Surgical fix (what to change now) |
|---|---|---|
| Low GPU utilization (<50%), CPU busy | `nsys` timeline: long CPU-side ranges between kernel launches; `torch.profiler` dataloader timings high | Move costly transforms off the main process: increase `DataLoader(num_workers)`, set `pin_memory=True` and `persistent_workers=True`, prefetch, or use NVIDIA DALI. Copy with `.to(device, non_blocking=True)` |
| High memory-bandwidth utilization; low FLOPS | `ncu` memory throughput high; roofline shows low operational intensity | Reduce memory traffic: fuse pointwise ops (custom Triton kernels or fused CUDA/ATen kernels), use mixed precision to shrink the working set (`autocast`/`GradScaler`), or make algorithmic changes that increase compute per byte |
| Out-of-memory / fragmentation | Profiler memory timeline, OOM stack traces | Activation checkpointing (`torch.utils.checkpoint`), parameter partitioning (ZeRO), or offloading parameters to CPU/NVMe (ZeRO-Offload / ZeRO-Infinity). Flatten and allocate contiguous buffers to avoid fragmentation |
| High PCIe / host–device traffic | `nsys` GPU metrics: PCIe throughput spikes; `nvidia-smi` shows frequent transfers | Reduce host↔device transfers; batch transfers; keep tensors on device; use pinned memory to speed transfers. For multi-GPU, favor NVLink / CUDA P2P and reorder work to avoid host round trips |
| Communication stalls in distributed training | `nsys` and NCCL logs; long allreduce times shown in the timeline | Overlap communication with computation (reduce-scatter / async collectives); tune `NCCL_SOCKET_IFNAME`, `NCCL_BUFFSIZE`, and related env vars; ensure topology-aware NCCL config |
| Many small kernels (kernel-launch overhead) | `nsys` shows many short kernel bars; kernels run for only a few µs | Fuse operators or use graph compilation (`torch.compile`) / kernel generators (Triton) to reduce launches and increase kernel granularity |
Detailed notes on high-value fixes
- Mixed precision: `torch.cuda.amp.autocast` unlocks Tensor Cores and reduces memory traffic for matrix ops; it often yields a 1.5–3× throughput improvement depending on GPU generation. Profile after enabling it to confirm numerical stability and operator coverage.
- Operator fusion / custom kernels: when `ncu` shows expensive memory traffic per op, write fused kernels (Triton or custom CUDA) that keep data in registers/shared memory across ops. Nsight Compute will show the drop in DRAM throughput after a successful fusion.
- Memory partitioning for huge models: DeepSpeed ZeRO stages partition optimizer state, gradients, and parameters, enabling training of models that otherwise OOM. Offloading to CPU/NVMe is a pragmatic path for extremely large models where latency is less critical.
- Dataloader tuning: `num_workers`, `pin_memory`, and `prefetch_factor` are low-effort knobs for eliminating CPU-side stalls. Measure before you tune and prefer incremental changes (increase `num_workers` until the CPU saturates).
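A minimal sketch of the mixed-precision step, assuming the standard `autocast`/`GradScaler` workflow (the helper name `amp_train_step` is ours, not a PyTorch API):

```python
import torch

def amp_train_step(model, batch, targets, loss_fn, optimizer, scaler):
    """One training step under autocast: matmuls run in reduced precision
    on Tensor Cores, while GradScaler guards fp16 gradients from underflow."""
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
        loss = loss_fn(model(batch), targets)
    scaler.scale(loss).backward()      # scale loss up before backward
    scaler.step(optimizer)             # unscales grads; skips step on inf/nan
    scaler.update()                    # adapts the scale factor over time
    return loss.detach()
```

Construct the scaler once as `torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())`; with CUDA unavailable both autocast and the scaler degrade to no-ops, so the same step function runs everywhere.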
Important: never change multiple knobs at once. Measure, change one variable, re-measure. The profile is the experiment’s atomic record.
Automating benchmarks and performance regression testing
Automation is the difference between an optimization and a reproducible speedup you can ship. The automation strategy below is intentionally minimal and robust.
Canonical benchmark protocol (short)
- Decide on a canonical scenario: e.g., training for N steps on a fixed subset, or inference on 10k synthetic prompts matching production shape. Record inputs and seeds.
- Build an immutable artifact: a container image or pinned `requirements.txt` plus driver/kernel versions. Record the image digest.
- Warm up, then measure a steady window (e.g., 100 measured iterations after 10 warmup iterations). Capture metrics and traces as artifacts.
- Save the following per run: `metrics.json` (throughput, latencies p50/p95/p99, memory peak), an `nvidia-smi.csv` snapshot, an `nsys` trace (optional), the profiler trace folder, and environment metadata (commit, driver).
- Run the benchmark multiple times (≥3) and use the median or a robust estimator; store historical baselines.
Minimal automated runner (example)
- `run_bench.sh` — runs a short, reproducible workload and writes `metrics.json`.
#!/usr/bin/env bash
set -euo pipefail
OUTDIR=${1:-./bench_out}
mkdir -p "$OUTDIR"
# Start a lightweight nvidia-smi logger in the background
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used \
  --format=csv -l 1 > "$OUTDIR/nvidia-smi.csv" &
SMI_PID=$!
# Run a short training job instrumented with a torch.profiler schedule that writes to $OUTDIR/profiler
python run_small_bench.py --steps 120 --warmup 10 --outdir "$OUTDIR"
kill "$SMI_PID"
# Summarize metrics (the benchmark script produces metrics.json)
cat "$OUTDIR/metrics.json"
Example `run_small_bench.py` should:
- pin seeds and set deterministic flags (if appropriate),
- perform warmup and steady iterations,
- measure `steps/sec` and token throughput,
- optionally call `nsys` for a single representative capture, and
- emit `metrics.json` with fields `throughput`, `p50_ms`, `p95_ms`, `peak_mem_mb`, `commit`, `image`.
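A pure-Python skeleton of that script might look like the following. The synthetic `train_step` stands in for your real instrumented model step, and the emitted field names follow the `metrics.json` contract above.

```python
import argparse
import json
import random
import statistics
import time

def main(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--steps", type=int, default=120)
    p.add_argument("--warmup", type=int, default=10)
    p.add_argument("--outdir", default=".")
    args = p.parse_args(argv)

    random.seed(0)                       # pin seeds for reproducibility
    tokens_per_step = 8 * 2048           # batch_size x seq_len (placeholder)

    def train_step():                    # replace with the real model step
        time.sleep(0.001)

    for _ in range(args.warmup):         # warmup iterations are not measured
        train_step()
    times = []
    for _ in range(args.steps):
        t0 = time.perf_counter()
        train_step()
        times.append(time.perf_counter() - t0)

    lat_ms = sorted(t * 1000.0 for t in times)
    metrics = {
        "throughput": tokens_per_step / statistics.median(times),
        "p50_ms": lat_ms[len(lat_ms) // 2],
        "p95_ms": lat_ms[min(len(lat_ms) - 1, int(0.95 * len(lat_ms)))],
        "peak_mem_mb": None,             # fill from torch.cuda.max_memory_allocated
        "commit": "UNKNOWN",             # fill from `git rev-parse HEAD`
        "image": "UNKNOWN",              # fill from the container image digest
    }
    with open(f"{args.outdir}/metrics.json", "w") as f:
        json.dump(metrics, f, indent=2)

if __name__ == "__main__":
    main()
```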
CI / GitHub Actions snippet (self-hosted runner with GPU)
name: perf-bench
on:
  push:
    branches: [ main ]
jobs:
  bench:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v3
      - name: Run benchmark
        run: |
          ./ci/run_bench.sh ./bench_artifacts/${GITHUB_SHA}
      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: bench-${{ github.sha }}
          path: ./bench_artifacts/${{ github.sha }}
Regression detection strategy
- Keep a `baseline.json` with the canonical metrics for the current release.
- After a CI bench, load `metrics.json` and compare primary KPIs:
  - Fail if throughput drops by >X% (system-dependent; start with 5–10%).
  - Fail if p95/p99 latency increases by >Y ms (set by your SLA).
- For noisy workloads, require statistical significance (median across N runs) or use a sliding window of historical medians to avoid false positives. MLPerf-style run discipline is instructive here.
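The comparison step reduces to a few lines; `check_regression` and its default thresholds are illustrative choices of ours, not a standard API.

```python
import json

def check_regression(baseline_path, metrics_path,
                     max_throughput_drop_pct=5.0, max_p95_increase_ms=10.0):
    """Compare a fresh metrics.json against the stored baseline; return
    (ok, reasons) so CI can fail with a human-readable explanation."""
    with open(baseline_path) as f:
        base = json.load(f)
    with open(metrics_path) as f:
        cur = json.load(f)
    reasons = []
    drop_pct = 100.0 * (base["throughput"] - cur["throughput"]) / base["throughput"]
    if drop_pct > max_throughput_drop_pct:
        reasons.append(f"throughput dropped {drop_pct:.1f}% vs baseline")
    p95_delta = cur["p95_ms"] - base["p95_ms"]
    if p95_delta > max_p95_increase_ms:
        reasons.append(f"p95 latency up {p95_delta:.1f} ms vs baseline")
    return (not reasons, reasons)
```

In CI, call it after the bench run and exit non-zero when `ok` is false, printing `reasons` into the job log so the failure ticket explains itself.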
What traces to collect in CI
- Collect `nvidia-smi` CSV continuously (low overhead).
- Collect `torch.profiler` short cycles (low-to-moderate overhead) for operator regressions.
- Reserve `nsys`/`ncu` captures for triage runs only (high overhead, large files). Automate their collection only on benchmark failures or when a deeper investigation is triggered.
Automation checklist (artifact hygiene)
- Save: `metrics.json`, `nvidia-smi.csv`, `profiler_runs/*`, `nsys/*.qdrep` (if collected), the `Dockerfile` or image digest, the commit hash, and `git diff`.
- Store artifacts in an immutable store (object storage) and link them in your CI failure ticket.
- Record system topology: GPU model(s), PCIe/NVLink layout, NUMA layout, and `nvidia-smi` driver output. These explain many regressions.
Bottleneck debugging playbook (2-minute method)
- Measure simple throughput (tokens/sec) and latency baseline.
- Run `nvidia-smi` during the run to see GPU-level utilization and memory use.
- If GPU utilization is low → take a targeted `nsys` capture around steady state and inspect CPU lanes and NVTX ranges.
- If a kernel looks expensive → profile it with `ncu` and compare DRAM throughput vs compute; apply roofline logic.
- Apply one fix (e.g., `pin_memory=True` or enabling `autocast`) and re-run the same steps to validate impact.
Profile, fix, validate, repeat. Each iteration should have a recorded artifact that proves the impact.
Profile data is evidence. Treat it as such: annotate the code (NVTX), save the trace, attach it to your issue. Store baseline artifacts so you can compare later.
Sources:
NVIDIA Nsight Systems - Overview of Nsight Systems: system-wide timeline, GPU/CPU correlation, and recommended workflow for low-overhead traces and NVTX usage.
Nsight Systems User Guide (2025.6) - CLI nsys options, capture-range controls, GPU metrics sampling, and guidance for practical profiling.
Nsight Compute Profiling Guide - Kernel-level metrics, ncu --metrics reference and interpretation for occupancy, memory throughput, and instruction throughput.
PyTorch Profiler tutorial (recipes) - torch.profiler schedule usage, on_trace_ready and TensorBoard integration for long-running jobs.
torch.profiler API reference - export_chrome_trace, memory timeline exports, and profiler configuration options.
Profile your model on Cloud TPU VMs - XProf/TensorBoard profiling for Cloud TPU VMs and use of the tensorboard_plugin_profile.
Profile PyTorch XLA workloads (Cloud TPU guide) - torch_xla.debug.profiler examples (xp.start_trace, xp.stop_trace) and visualization with TensorBoard.
DeepSpeed ZeRO (documentation) - Memory partitioning strategies (ZeRO stages), offload options and configuration examples for training very large models.
Roofline model (Williams, Waterman, Patterson) - The Roofline performance model for reasoning about compute vs memory-bound kernels and operational intensity.
NVIDIA Hopper architecture (developer blog) - Tensor Core capabilities and mixed-precision benefits on modern NVIDIA GPUs.
Useful nvidia-smi queries (NVIDIA support) - nvidia-smi --query-gpu options and best-practice queries for logging GPU utilization and memory.
MLCommons / MLPerf inference guidance (reproducibility & run rules) - Example rules and run-discipline (warmup, steady-state, reproducibility) useful when building regression tests.
NCCL environment variables and tuning guide - Important NCCL env vars (NCCL_SOCKET_IFNAME, NCCL_BUFFSIZE, debug options) to tune collective performance.
torch.utils.checkpoint (activation checkpointing) - Activation checkpointing API and trade-offs (compute for memory).
PyTorch DataLoader documentation (pin_memory, num_workers, prefetch_factor) - DataLoader options and practical guidance for reducing host-side stalls.
Automatic Mixed Precision (torch.cuda.amp) - autocast, GradScaler and recommended usage patterns to use lower-precision compute safely.
Profile surgically, change one variable, and record the artifact that proves the change moved the needle; that discipline converts optimization work into reliable, repeatable throughput improvements.