beefed.ai

Posted on • Originally published at beefed.ai

Profiling and Benchmarking LLMs with Nsight and TPU Tools

  • Measuring the right signals: throughput, latency, utilization, and memory
  • Using NVIDIA Nsight to map CPU–GPU timelines and find hotspots
  • Profiling with PyTorch Profiler and TPU tools for LLM workloads
  • Bottlenecks you'll see and surgical fixes
  • Automating benchmarks and performance regression testing

Profiling LLM training and inference is a forensic exercise: you must prove which resource—compute, memory, or IO—is starving the rest, and then apply a narrowly scoped fix that moves the wall-clock needle. The combination of NVIDIA Nsight, torch.profiler, and TPU profiling tools gives you the instrumentation to do that with evidence instead of hunches.

The symptoms you see are predictable: training stalls despite “full” GPUs, inference p95 spikes during production, or throughput that refuses to scale with batch size. Those symptoms hide different root causes—data-loading stalls, memory-bandwidth saturation, or microkernel overhead—and the right profile pinpoints which one. The rest of this piece is a compact, operational playbook: what metrics to collect, concrete steps with nsys/ncu/torch.profiler/TPU tools, how to read the results, and exactly which mitigations move the numbers.

Measuring the right signals: throughput, latency, utilization, and memory

You must measure the right signals, in the right units, and across steady-state runs.

  • Throughput (primary KPI for training & batched inference). Training: tokens/sec = steps/sec × batch_size × seq_len. Inference: samples/sec or tokens/sec depending on your scenario. Use a timed, reproducible loop and report steady-state throughput after warmup. MLPerf-style guidance on warmup and steady-state is a useful reference for run discipline.
  • Latency (primary KPI for low-latency inference). Report p50, p95, p99 and tail latencies measured end-to-end (including CPU-side preprocessing and device transfer). Single-shot latency and batched latency are distinct metrics; measure both if you support dynamic batch sizing.
  • GPU utilization and SM/TensorCore activity. nvidia-smi gives a high-level view (utilization.gpu, utilization.memory); nsys and ncu give SM occupancy, TensorCore usage and instruction-level counters. Use those to separate idle GPUs from busy but memory-starved GPUs.
  • Memory bandwidth and capacity. Look at achieved DRAM throughput and achieved memory bandwidth in ncu reports and Nsight metrics; compare against the device peak using a roofline mindset (operational intensity → compute vs memory bound). The Roofline model helps you interpret whether adding compute optimizations will help.
  • Host CPU, IO and network metrics. Measure dataloader latency, disk throughput, and network/NCCL times to find host-side stalls that leave GPUs idle. nsys can visualize the CPU threads and system calls that align with GPU idle time.
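The tokens/sec arithmetic from the throughput bullet is worth encoding once and reusing; a minimal sketch, where step_fn is a placeholder for one training or inference step (it should block until device work completes, e.g. by calling torch.cuda.synchronize() internally):

```python
import time

def measure_throughput(step_fn, batch_size, seq_len, warmup=10, iters=100):
    """Report steady-state tokens/sec: run warmup iterations first so
    compilation and caching effects are excluded, then time a fixed
    number of steps and apply tokens/sec = steps/sec * batch * seq_len."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start
    steps_per_sec = iters / elapsed
    return steps_per_sec * batch_size * seq_len  # tokens/sec

# usage: tokens_per_sec = measure_throughput(lambda: train_step(batch), 8, 2048)
```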

Practical measurement checklist

  • Warm up the model for a small number of iterations before measuring.
  • Measure multiple runs, report median (or mean ± std) across runs.
  • Record environment: driver, CUDA, container digest, commit hash, nvidia-smi snapshot. MLPerf-style reproducibility rules are the right discipline for CI-grade measurements.

Quick tool→metric map (short)
| Metric | Where to capture |
|---|---|
| Throughput / steps/sec, tokens/sec | In-script timers (Python) + torch.profiler logs |
| Tail latency (p95/p99) | Client-side timers for inference, or framework trace |
| SM utilization / TensorCore activity | Nsight Systems / Nsight Compute (nsys / ncu) |
| Memory bandwidth (achieved) | Nsight Compute (ncu --metrics DRAM throughput counters) |
| Dataprep latency / CPU blocks | nsys timeline, torch.profiler CPU events |
| TPU execution traces | TPU XProf / TensorBoard plugin, or torch_xla debug profiler |

Using NVIDIA Nsight to map CPU–GPU timelines and find hotspots

Use Nsight Systems as your first stop: it gives a system-wide timeline that answers “where does time go?” and correlates CPU activity, kernel launches, and NVTX annotations.

Recommended workflow

  1. Add NVTX ranges to mark iteration boundaries and high-level stages (data load, forward, backward, optimizer). Use torch.cuda.nvtx.range_push or torch.autograd.profiler.emit_nvtx so the timeline maps directly to your code.
  2. Capture a focused window with nsys rather than trying to record the entire 24‑hour job. Use capture-range hooks (NVTX, start/stop API) to limit trace size and overhead.
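Step 1 can be wrapped in a small context manager so annotations stay tidy (a sketch: the range names are illustrative, and it degrades to a no-op when PyTorch or CUDA is unavailable):

```python
from contextlib import contextmanager

@contextmanager
def nvtx_range(name):
    """Mark a named NVTX range so the region shows up on the nsys
    timeline. Falls back to a no-op without PyTorch/CUDA so the same
    code runs anywhere."""
    try:
        import torch
        has_cuda = torch.cuda.is_available()
    except ImportError:
        has_cuda = False
    if has_cuda:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if has_cuda:
            torch.cuda.nvtx.range_pop()

# usage inside the training loop:
# with nvtx_range("PROFILE"):       # name matches --nvtx-capture=PROFILE
#     with nvtx_range("forward"):
#         loss = model(batch)
```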

Example: targeted nsys capture

# capture a single epoch region annotated with NVTX "PROFILE"
NSYS_NVTX_PROFILER_REGISTER_ONLY=0 \
nsys profile -o llm_profile \
  --trace=cuda,cublas,cudnn,nvtx,osrt \
  --gpu-metrics-devices=all \
  --capture-range=nvtx --nvtx-capture=PROFILE \
  python train.py --config=configs/large.yml

nsys generates a timeline you open in the Nsight UI; zoom to iterations, and look for gaps in the GPU HW lane where there is no kernel activity.

Drill down with Nsight Compute (ncu)

  • When you find a heavy kernel in the timeline, right-click it and launch ncu (Nsight Compute) to collect per-kernel metrics: achieved occupancy, instruction throughput, memory throughput and cache hit ratios. Where nsys tells you where time goes, ncu tells you what the kernel is doing at the instruction and register level.

Example ncu invocation (kernel-level):

ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active,dram__throughput.avg.pct_of_peak_sustained_elapsed \
    -o big_kernel_report python train.py --some-args

Interpretation tips

  • Long CPU sections between kernel launches → data loader / serialization / Python-side overhead. Check torch.profiler CPU timings for the data pipeline.
  • GPU active but low achieved FLOPS with high DRAM throughput → memory-bound kernel. Apply roofline thinking: increase operational intensity or reduce memory traffic.
  • High small-kernel overhead (many micro-kernels with short durations) → kernel-launch overhead; fuse ops or use custom kernels (Triton) or compiler fusion.
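For the many-small-kernels case, torch.compile is the lowest-effort fusion lever; a sketch with an illustrative pointwise chain (the specific ops are placeholders):

```python
import torch

def pointwise_chain(x, bias):
    # Eager mode launches a separate small kernel per op: add, mul, sigmoid, mul.
    y = x + bias
    y = y * torch.sigmoid(1.702 * y)   # tanh-free GELU approximation
    return y * 0.5

# torch.compile traces the function and fuses the pointwise chain into
# far fewer kernels -- in the nsys timeline this shows up as fewer,
# longer kernel bars instead of many sub-microsecond ones.
fused = torch.compile(pointwise_chain)
# first call triggers compilation:
# out = fused(torch.randn(1024, 1024), torch.zeros(1024))
```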

Important callout

Sample small windows, then iterate. nsys trace files grow quickly and ncu replay has overhead; use capture-range and NVTX so traces are representative without being massive.

Profiling with PyTorch Profiler and TPU tools for LLM workloads

PyTorch Profiler (torch.profiler) is the fastest path to operator-level insights inside PyTorch and integrates with TensorBoard. For long-running training jobs, use schedule and on_trace_ready to collect a few representative cycles rather than tracing everything.

Representative torch.profiler setup

from torch.profiler import profile, record_function, ProfilerActivity, schedule, tensorboard_trace_handler

my_schedule = schedule(skip_first=10, wait=5, warmup=2, active=3, repeat=2)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=my_schedule,
    on_trace_ready=tensorboard_trace_handler("./profiler_runs"),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step, batch in enumerate(train_loader):
        with record_function("train_step"):
            optimizer.zero_grad()
            outputs = model(batch)
            loss = loss_fn(outputs, batch.targets)
            loss.backward()
            optimizer.step()
        prof.step()  # advance the profiler schedule each iteration

Key PyTorch profiler outputs

  • key_averages().table() for operator-level hotpaths.
  • export_chrome_trace() or TensorBoard plugin for a timeline view.
  • export_memory_timeline() for allocation patterns and peak usage.
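The first two outputs in action (a minimal CPU-only sketch; the linear layer stands in for a real model):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(256, 256)
x = torch.randn(32, 256)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x)

# Operator-level hotpaths, sorted by self CPU time -- the usual first look
table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
print(table)

# Timeline view: open trace.json in chrome://tracing or Perfetto
prof.export_chrome_trace("trace.json")
```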

TPU profiling (XProf / Torch XLA)

  • For Cloud TPU VMs and PyTorch XLA, use the XProf tooling: start the profiler server, wrap the region with xp.start_trace() / xp.stop_trace(), and visualize in TensorBoard with the tensorboard_plugin_profile. The Cloud TPU docs include complete examples for torch_xla.debug.profiler.

TPU example (PyTorch XLA)

import torch_xla.debug.profiler as xp

server = xp.start_server(9012)
xp.start_trace('/root/logs/')
# run representative steps
xp.stop_trace()

Then run:

pip install tensorboard tensorboard_plugin_profile
tensorboard --logdir /root/logs/

This gives a timeline comparable to nsys for TPU workloads.

Bottlenecks you'll see and surgical fixes

Use this table as the first diagnostic map: read the symptom, confirm with the tool/counter, then apply the pointed fix.

| Symptom | How you confirm (tool / counter) | Surgical fix (what to change now) |
|---|---|---|
| Low GPU utilization (<50%), CPU busy | nsys timeline: long CPU-side ranges between kernel launches; torch.profiler dataloader timings high | Move costly transforms off the main thread: increase DataLoader num_workers, set pin_memory=True and persistent_workers=True, prefetch, or use NVIDIA DALI; copy with .to(device, non_blocking=True) |
| High memory bandwidth utilization, low FLOPS | ncu memory throughput high; roofline shows low operational intensity | Reduce memory traffic: fuse pointwise ops (custom Triton kernels or fused CUDA/ATen kernels), use mixed precision to shrink the working set (autocast/GradScaler), or make algorithmic changes that raise compute per byte |
| Out-of-memory / fragmentation | Profiler memory timeline, OOM stack traces | Activation checkpointing (torch.utils.checkpoint), parameter partitioning (ZeRO), or offloading parameters to CPU/NVMe (ZeRO-Offload / ZeRO-Infinity); flatten and allocate contiguous buffers to avoid fragmentation |
| High PCIe / host-device traffic | nsys GPU metrics: PCIe throughput spikes; nvidia-smi shows frequent transfers | Reduce host↔device transfers: batch them, keep tensors on device, use pinned memory to speed transfers; on multi-GPU, favor NVLink / CUDA P2P and reorder work to avoid host round trips |
| Communication stalls in distributed training | nsys and NCCL logs; long allreduce times in the timeline | Overlap communication with computation (reduce-scatter / async collectives); tune NCCL_SOCKET_IFNAME, NCCL_BUFFSIZE and related env vars; ensure topology-aware NCCL config |
| Many small kernels (kernel-launch overhead) | nsys shows many short kernel bars; kernels last only a few µs | Fuse operators or use graph compilation (torch.compile) / kernel generators (Triton) to reduce launches and increase kernel granularity |
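The dataloader fix from the first row, sketched in code (TensorDataset stands in for a real dataset, and the device copy is commented out so the snippet runs without a GPU):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 128))  # stand-in for a real dataset

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=2,            # raise until host CPUs saturate
    pin_memory=True,          # page-locked buffers speed host->device copies
    persistent_workers=True,  # avoid re-forking workers every epoch
    prefetch_factor=2,        # batches each worker keeps queued ahead
)

for (batch,) in loader:
    # non_blocking=True lets the H2D copy overlap with compute:
    # batch = batch.to("cuda", non_blocking=True)
    pass
```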

Detailed notes on high-value fixes

  • Mixed precision: Using torch.cuda.amp.autocast unlocks Tensor Cores and reduces memory traffic for matrix ops; it often produces a 1.5–3× throughput improvement depending on GPU generation. Profile after enabling to ensure numerical stability and operator coverage.
  • Operator fusion / custom kernels: When ncu shows expensive memory traffic per op, write fused kernels (Triton or custom CUDA) to keep data in registers/shared memory across ops. Nsight Compute will show the drop in DRAM throughput after a successful fusion.
  • Memory partitioning for huge models: DeepSpeed ZeRO stages partition optimizer state/gradients/parameters and enable training models that otherwise OOM. Offloading to CPU/NVMe is a pragmatic path for extremely large models where latency is less critical.
  • Dataloader tuning: num_workers, pin_memory, prefetch_factor are low-effort knobs to eliminate CPU-side stalls—measure before you tune and prefer incremental changes (increase num_workers until CPU saturates).
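The mixed-precision pattern end-to-end (a device-agnostic sketch: on CUDA it uses fp16 with GradScaler as described above, and it falls back to bfloat16 on CPU so it runs anywhere):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# GradScaler guards fp16 gradients against underflow; it no-ops when disabled
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(32, 512, device=device)
target = torch.randn(32, 512, device=device)

# autocast runs matmul-heavy ops in reduced precision (Tensor Cores on CUDA)
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.autocast(device_type=device, dtype=amp_dtype):
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```

Profile before and after enabling it: the win shows up as higher TensorCore activity in ncu and lower achieved DRAM traffic per step.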

Important: never change multiple knobs at once. Measure, change one variable, re-measure. The profile is the experiment’s atomic record.

Automating benchmarks and performance regression testing

Automation is the difference between an optimization and a reproducible speedup you can ship. The automation strategy below is intentionally minimal and robust.

Canonical benchmark protocol (short)

  1. Decide a canonical scenario: e.g., training for N steps on a fixed subset, or inference on 10k synthetic prompts matching production shape. Record inputs and seeds.
  2. Build an immutable artifact: container image or pinned requirements.txt + driver/kernel versions. Record image digest.
  3. Warmup then measure a steady window (e.g., run 100 measured iterations after 10 warmup iterations). Capture metrics and traces as artifacts.
  4. Save the following per run: metrics.json (throughput, latencies p50/p95/p99, memory_peak), nvidia-smi.csv snapshot, nsys trace (optional), profiler trace folder, and environment metadata (commit, driver).
  5. Run the benchmark multiple times (≥3) and use the median or a robust estimator; store historical baselines.

Minimal automated runner (example)

  • run_bench.sh — runs a short, reproducible workload and writes metrics.json.
#!/usr/bin/env bash
set -euo pipefail
OUTDIR=${1:-./bench_out}
mkdir -p "$OUTDIR"

# Start a lightweight nvidia-smi logger in the background
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used \
  --format=csv -l 1 > "$OUTDIR/nvidia-smi.csv" &
SMI_PID=$!
trap 'kill "$SMI_PID" 2>/dev/null || true' EXIT

# Run a short training job instrumented with a torch.profiler schedule
# that writes traces to $OUTDIR/profiler
python run_small_bench.py --steps 120 --warmup 10 --outdir "$OUTDIR"

# Summarize metrics (the benchmark script produces metrics.json)
cat "$OUTDIR/metrics.json"

Example run_small_bench.py should:

  • pin seeds, set deterministic flags (if appropriate),
  • perform warmup and steady iterations,
  • measure steps/sec and token throughput,
  • optionally call nsys for a single representative capture, and
  • emit metrics.json with fields throughput, p50_ms, p95_ms, peak_mem_mb, commit, image.
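A minimal run_small_bench.py along those lines (a sketch: the sleeping step function is a stand-in for a real train step, and the CLI is simplified to a single output-directory argument):

```python
import json
import subprocess
import sys
import time
from pathlib import Path

def bench(step_fn, warmup=10, steps=100, batch_size=8, seq_len=2048):
    """Warm up, then time `steps` iterations and return the metrics dict."""
    for _ in range(warmup):
        step_fn()
    latencies = []
    for _ in range(steps):
        t0 = time.perf_counter()
        step_fn()
        latencies.append((time.perf_counter() - t0) * 1e3)  # ms
    total_s = sum(latencies) / 1e3
    latencies.sort()
    return {
        "throughput": steps / total_s * batch_size * seq_len,  # tokens/sec
        "p50_ms": latencies[len(latencies) // 2],
        "p95_ms": latencies[min(int(len(latencies) * 0.95), len(latencies) - 1)],
    }

if __name__ == "__main__":
    outdir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("bench_out")
    outdir.mkdir(parents=True, exist_ok=True)
    metrics = bench(lambda: time.sleep(0.001))  # replace with a real step
    try:
        metrics["commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except (OSError, subprocess.CalledProcessError):
        metrics["commit"] = "unknown"
    (outdir / "metrics.json").write_text(json.dumps(metrics, indent=2))
```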

CI / GitHub Actions snippet (self-hosted runner with GPU)

name: perf-bench
on:
  push:
    branches: [ main ]
jobs:
  bench:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmark
        run: |
          ./ci/run_bench.sh ./bench_artifacts/${GITHUB_SHA}
      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: bench-${{ github.sha }}
          path: ./bench_artifacts/${{ github.sha }}

Regression detection strategy

  • Keep a JSON baseline.json with the canonical metrics for the current release.
  • After a CI bench, load metrics.json and compare primary KPIs:
    • Fail if throughput drops by >X% (system-dependent; start with 5–10%).
    • Fail if p95/p99 latency increases by >Y ms (set by SLA).
  • For noisy workloads, require statistical significance (median across N runs) or use a sliding window of historical medians to avoid false positives. MLPerf-style run discipline is instructive here.
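The comparison logic is small enough to sketch directly (field names follow the metrics.json schema above; the default thresholds are the suggested starting points):

```python
import json

def check_regression(baseline_path, current_path,
                     max_throughput_drop_pct=10.0, max_p95_increase_ms=5.0):
    """Compare a CI run against the stored baseline; return a list of
    human-readable failures (empty list means the run passed)."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)

    failures = []
    drop_pct = (1 - current["throughput"] / baseline["throughput"]) * 100
    if drop_pct > max_throughput_drop_pct:
        failures.append(f"throughput dropped {drop_pct:.1f}%")
    p95_delta = current["p95_ms"] - baseline["p95_ms"]
    if p95_delta > max_p95_increase_ms:
        failures.append(f"p95 latency up {p95_delta:.1f} ms")
    return failures

# in CI: sys.exit(1) when check_regression(...) returns a non-empty list
```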

What traces to collect in CI

  • Collect nvidia-smi CSV continuously (low overhead).
  • Collect torch.profiler short cycles (low-to-moderate overhead) for operator regressions.
  • Reserve nsys/ncu captures for triage runs only (high overhead, large files). Automate their collection only on benchmark failures or when a deeper investigation is triggered.

Automation checklist (artifact hygiene)

  • Save: metrics.json, nvidia-smi.csv, profiler_runs/*, nsys/*.qdrep (if collected), Dockerfile or image digest, commit and git diff.
  • Store artifacts in an immutable store (object storage) and link them in your CI failure ticket.
  • Record system topology: GPU model(s), PCIe/NVLink layout, NUMA layout, and nvidia-smi driver output. These explain many regressions.
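A small helper can capture most of that metadata automatically (a sketch; extend the nvidia-smi query fields to taste, and both external tools are probed before use so the function runs anywhere):

```python
import platform
import shutil
import subprocess

def capture_environment():
    """Snapshot the metadata that explains most 'mystery' regressions:
    Python version, host, GPU/driver info, and the current commit."""
    env = {
        "python": platform.python_version(),
        "hostname": platform.node(),
    }
    if shutil.which("nvidia-smi"):
        env["nvidia_smi"] = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout.strip()
    if shutil.which("git"):
        env["commit"] = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True,
        ).stdout.strip()
    return env

# attach to every benchmark run, e.g.:
# Path(outdir, "environment.json").write_text(json.dumps(capture_environment()))
```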

Bottleneck debugging playbook (2-minute method)

  1. Measure simple throughput (tokens/sec) and latency baseline.
  2. Run nvidia-smi while running to see GPU-level utilization and memory use.
  3. If GPU utilization low → nsys targeted capture around steady-state and inspect CPU lanes and NVTX ranges.
  4. If a kernel looks expensive → ncu the kernel and check DRAM throughput vs compute; use roofline logic.
  5. Apply one fix (e.g., pin_memory=True or enable autocast) and re-run the same steps to validate impact.

Profile, fix, validate, repeat. Each iteration should have a recorded artifact that proves the impact.

Profile data is evidence. Treat it as such: annotate the code (NVTX), save the trace, attach it to your issue. Store baseline artifacts so you can compare later.

Sources:
NVIDIA Nsight Systems - Overview of Nsight Systems: system-wide timeline, GPU/CPU correlation, and recommended workflow for low-overhead traces and NVTX usage.

Nsight Systems User Guide (2025.6) - CLI nsys options, capture-range controls, GPU metrics sampling, and guidance for practical profiling.

Nsight Compute Profiling Guide - Kernel-level metrics, ncu --metrics reference and interpretation for occupancy, memory throughput, and instruction throughput.

PyTorch Profiler tutorial (recipes) - torch.profiler schedule usage, on_trace_ready and TensorBoard integration for long-running jobs.

torch.profiler API reference - export_chrome_trace, memory timeline exports, and profiler configuration options.

Profile your model on Cloud TPU VMs - XProf/TensorBoard profiling for Cloud TPU VMs and use of the tensorboard_plugin_profile.

Profile PyTorch XLA workloads (Cloud TPU guide) - torch_xla.debug.profiler examples (xp.start_trace, xp.stop_trace) and visualization with TensorBoard.

DeepSpeed ZeRO (documentation) - Memory partitioning strategies (ZeRO stages), offload options and configuration examples for training very large models.

Roofline model (Williams, Waterman, Patterson) - The Roofline performance model for reasoning about compute vs memory-bound kernels and operational intensity.

NVIDIA Hopper architecture (developer blog) - Tensor Core capabilities and mixed-precision benefits on modern NVIDIA GPUs.

Useful nvidia-smi queries (NVIDIA support) - nvidia-smi --query-gpu options and best-practice queries for logging GPU utilization and memory.

MLCommons / MLPerf inference guidance (reproducibility & run rules) - Example rules and run-discipline (warmup, steady-state, reproducibility) useful when building regression tests.

NCCL environment variables and tuning guide - Important NCCL env vars (NCCL_SOCKET_IFNAME, NCCL_BUFFSIZE, debug options) to tune collective performance.

torch.utils.checkpoint (activation checkpointing) - Activation checkpointing API and trade-offs (compute for memory).

PyTorch DataLoader documentation (pin_memory, num_workers, prefetch_factor) - DataLoader options and practical guidance for reducing host-side stalls.

Automatic Mixed Precision (torch.cuda.amp) - autocast, GradScaler and recommended usage patterns to use lower-precision compute safely.

Profile surgically, change one variable, and record the artifact that proves the change moved the needle; that discipline converts optimization work into reliable, repeatable throughput improvements.
