beefed.ai

Posted on • Originally published at beefed.ai

Profiling and Benchmarking LLMs with Nsight and TPU Tools

  • Measuring the right signals: throughput, latency, utilization, and memory
  • Using NVIDIA Nsight to map CPU–GPU timelines and find hotspots
  • Profiling with PyTorch Profiler and TPU tools for LLM workloads
  • Bottlenecks you'll see and surgical fixes
  • Automating benchmarks and performance regression testing

Profiling LLM training and inference is a forensic exercise: you must prove which resource—compute, memory, or IO—is starving the rest, and then apply a narrowly scoped fix that moves the wall-clock needle. The combination of NVIDIA Nsight, torch.profiler, and TPU profiling tools gives you the instrumentation to do that with evidence instead of hunches.

The symptoms you see are predictable: training stalls despite “full” GPUs, inference p95 spikes during production, or throughput that refuses to scale with batch size. Those symptoms hide different root causes—data-loading stalls, memory-bandwidth saturation, or microkernel overhead—and the right profile pinpoints which one. The rest of this piece is a compact, operational playbook: what metrics to collect, concrete steps with nsys/ncu/torch.profiler/TPU tools, how to read the results, and exactly which mitigations move the numbers.

Measuring the right signals: throughput, latency, utilization, and memory

You must measure the right signals, in the right units, and across steady-state runs.

  • Throughput (primary KPI for training & batched inference). Training: tokens/sec = steps/sec × batch_size × seq_len. Inference: samples/sec or tokens/sec depending on your scenario. Use a timed, reproducible loop and report steady-state throughput after warmup. MLPerf-style guidance on warmup and steady-state is a useful reference for run discipline.
  • Latency (primary KPI for low-latency inference). Report p50, p95, p99 and tail latencies measured end-to-end (including CPU-side preprocessing and device transfer). Single-shot latency and batched latency are distinct metrics; measure both if you support dynamic batch sizing.
  • GPU utilization and SM/TensorCore activity. nvidia-smi gives a high-level view (utilization.gpu, utilization.memory); nsys and ncu give SM occupancy, TensorCore usage and instruction-level counters. Use those to separate idle GPUs from busy but memory-starved GPUs.
  • Memory bandwidth and capacity. Look at achieved DRAM throughput and achieved memory bandwidth in ncu reports and Nsight metrics; compare against the device peak using a roofline mindset (operational intensity → compute vs memory bound). The Roofline model helps you interpret whether adding compute optimizations will help.
  • Host CPU, IO and network metrics. Measure dataloader latency, disk throughput, and network/NCCL times to find host-side stalls that leave GPUs idle. nsys can visualize the CPU threads and system calls that align with GPU idle time.
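The tokens/sec arithmetic from the throughput bullet is worth encoding once and reusing; a minimal sketch, where step_fn is a placeholder for one training or inference step (it should block until device work completes, e.g. by calling torch.cuda.synchronize() internally):

```python
import time

def measure_throughput(step_fn, batch_size, seq_len, warmup=10, iters=100):
    """Report steady-state tokens/sec: run warmup iterations first so
    compilation and caching effects are excluded, then time a fixed
    number of steps and apply tokens/sec = steps/sec * batch * seq_len."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start
    steps_per_sec = iters / elapsed
    return steps_per_sec * batch_size * seq_len  # tokens/sec

# usage: tokens_per_sec = measure_throughput(lambda: train_step(batch), 8, 2048)
```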

Practical measurement checklist

  • Warm up the model for a small number of iterations before measuring.
  • Measure multiple runs, report median (or mean ± std) across runs.
  • Record environment: driver, CUDA, container digest, commit hash, nvidia-smi snapshot. MLPerf-style reproducibility rules are the right discipline for CI-grade measurements.

Quick tool→metric map (short)
| Metric | Where to capture |
|---|---|
| Throughput / steps/sec, tokens/sec | In-script timers (Python) + torch.profiler logs |
| Tail latency (p95/p99) | Client-side timers for inference, or framework trace |
| SM utilization / TensorCore activity | Nsight Systems / Nsight Compute (nsys / ncu) |
| Memory bandwidth (achieved) | Nsight Compute (ncu --metrics DRAM throughput counters) |
| Dataprep latency / CPU blocks | nsys timeline, torch.profiler CPU events |
| TPU execution traces | TPU XProf / TensorBoard plugin, or torch_xla debug profiler |

Using NVIDIA Nsight to map CPU–GPU timelines and find hotspots

Use Nsight Systems as your first stop: it gives a system-wide timeline that answers “where does time go?” and correlates CPU activity, kernel launches, and NVTX annotations.

Recommended workflow

  1. Add NVTX ranges to mark iteration boundaries and high-level stages (data load, forward, backward, optimizer). Use torch.cuda.nvtx.range_push or torch.autograd.profiler.emit_nvtx so the timeline maps directly to your code.
  2. Capture a focused window with nsys rather than trying to record the entire 24‑hour job. Use capture-range hooks (NVTX, start/stop API) to limit trace size and overhead.
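Step 1 can be wrapped in a small context manager so annotations stay tidy (a sketch: the range names are illustrative, and it degrades to a no-op when PyTorch or CUDA is unavailable):

```python
from contextlib import contextmanager

@contextmanager
def nvtx_range(name):
    """Mark a named NVTX range so the region shows up on the nsys
    timeline. Falls back to a no-op without PyTorch/CUDA so the same
    code runs anywhere."""
    try:
        import torch
        has_cuda = torch.cuda.is_available()
    except ImportError:
        has_cuda = False
    if has_cuda:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if has_cuda:
            torch.cuda.nvtx.range_pop()

# usage inside the training loop:
# with nvtx_range("PROFILE"):       # name matches --nvtx-capture=PROFILE
#     with nvtx_range("forward"):
#         loss = model(batch)
```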

Example: targeted nsys capture

# capture a single epoch region annotated with NVTX "PROFILE"
NSYS_NVTX_PROFILER_REGISTER_ONLY=0 \
nsys profile -o llm_profile \
  --trace=cuda,cublas,cudnn,nvtx,osrt \
  --gpu-metrics-devices=all \
  --capture-range=nvtx --nvtx-capture=PROFILE \
  python train.py --config=configs/large.yml

nsys generates a timeline you open in the Nsight UI; zoom to iterations, and look for gaps in the GPU HW lane where there is no kernel activity.

Drill down with Nsight Compute (ncu)

  • When you find a heavy kernel in the timeline, right-click it and launch ncu (Nsight Compute) to collect per-kernel metrics: achieved occupancy, instruction throughput, memory throughput and cache hit ratios. Where nsys tells you where time goes, ncu tells you what the kernel is doing at the instruction and register level.

Example ncu invocation (kernel-level):

ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active,dram__throughput.avg.pct_of_peak_sustained_elapsed \
    -o big_kernel_report python train.py --some-args

Interpretation tips

  • Long CPU sections between kernel launches → data loader / serialization / Python-side overhead. Check torch.profiler CPU timings for the data pipeline.
  • GPU active but low achieved FLOPS with high DRAM throughput → memory-bound kernel. Apply roofline thinking: increase operational intensity or reduce memory traffic.
  • High small-kernel overhead (many micro-kernels with short durations) → kernel-launch overhead; fuse ops or use custom kernels (Triton) or compiler fusion.
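For the many-small-kernels case, torch.compile is the lowest-effort fusion lever; a sketch with an illustrative pointwise chain (the specific ops are placeholders):

```python
import torch

def pointwise_chain(x, bias):
    # Eager mode launches a separate small kernel per op: add, mul, sigmoid, mul.
    y = x + bias
    y = y * torch.sigmoid(1.702 * y)   # tanh-free GELU approximation
    return y * 0.5

# torch.compile traces the function and fuses the pointwise chain into
# far fewer kernels -- in the nsys timeline this shows up as fewer,
# longer kernel bars instead of many sub-microsecond ones.
fused = torch.compile(pointwise_chain)
# first call triggers compilation:
# out = fused(torch.randn(1024, 1024), torch.zeros(1024))
```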

Important callout

Sample small windows, then iterate. nsys trace files grow quickly and ncu replay has overhead; use capture-range and NVTX so traces are representative without being massive.

Profiling with PyTorch Profiler and TPU tools for LLM workloads

PyTorch Profiler (torch.profiler) is the fastest path to operator-level insights inside PyTorch and integrates with TensorBoard. For long-running training jobs, use schedule and on_trace_ready to collect a few representative cycles rather than tracing everything.

Representative torch.profiler setup

from torch.profiler import profile, record_function, ProfilerActivity, schedule, tensorboard_trace_handler

my_schedule = schedule(skip_first=10, wait=5, warmup=2, active=3, repeat=2)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=my_schedule,
    on_trace_ready=tensorboard_trace_handler("./profiler_runs"),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step, batch in enumerate(train_loader):
        with record_function("train_step"):
            optimizer.zero_grad()
            outputs = model(batch)
            loss = loss_fn(outputs, batch.targets)
            loss.backward()
            optimizer.step()
        prof.step()  # advance the profiler schedule each iteration

Key PyTorch profiler outputs

  • key_averages().table() for operator-level hotpaths.
  • export_chrome_trace() or TensorBoard plugin for a timeline view.
  • export_memory_timeline() for allocation patterns and peak usage.
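The first two outputs in action (a minimal CPU-only sketch; the linear layer stands in for a real model):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(256, 256)
x = torch.randn(32, 256)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x)

# Operator-level hotpaths, sorted by self CPU time -- the usual first look
table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
print(table)

# Timeline view: open trace.json in chrome://tracing or Perfetto
prof.export_chrome_trace("trace.json")
```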

TPU profiling (XProf / Torch XLA)

  • For Cloud TPU VMs and PyTorch XLA, use the XProf tooling: start the profiler server, wrap the region with xp.start_trace() / xp.stop_trace(), and visualize in TensorBoard with the tensorboard_plugin_profile. The Cloud TPU docs include complete examples for torch_xla.debug.profiler.

TPU example (PyTorch XLA)

import torch_xla.debug.profiler as xp

server = xp.start_server(9012)
xp.start_trace('/root/logs/')
# run representative steps
xp.stop_trace()

Then run:

pip install tensorboard tensorboard_plugin_profile
tensorboard --logdir /root/logs/

This gives a timeline comparable to nsys for TPU workloads.

Bottlenecks you'll see and surgical fixes

Use this table as the first diagnostic map: read the symptom, confirm with the tool/counter, then apply the pointed fix.

| Symptom | How you confirm (tool / counter) | Surgical fix (what to change now) |
|---|---|---|
| Low GPU utilization (<50%), CPU busy | nsys timeline: long CPU-side ranges between kernel launches; torch.profiler dataloader timings high | Move costly transforms off the main thread: increase DataLoader num_workers, set pin_memory=True and persistent_workers=True, prefetch, or use NVIDIA DALI; copy with .to(device, non_blocking=True) |
| High memory bandwidth utilization, low FLOPS | ncu memory throughput high; roofline shows low operational intensity | Reduce memory traffic: fuse pointwise ops (custom Triton kernels or fused CUDA/ATen kernels), use mixed precision to shrink the working set (autocast/GradScaler), or make algorithmic changes that raise compute per byte |
| Out-of-memory / fragmentation | Profiler memory timeline, OOM stack traces | Activation checkpointing (torch.utils.checkpoint), parameter partitioning (ZeRO), or offloading parameters to CPU/NVMe (ZeRO-Offload / ZeRO-Infinity); flatten and allocate contiguous buffers to avoid fragmentation |
| High PCIe / host-device traffic | nsys GPU metrics: PCIe throughput spikes; nvidia-smi shows frequent transfers | Reduce host↔device transfers: batch them, keep tensors on device, use pinned memory to speed transfers; on multi-GPU, favor NVLink / CUDA P2P and reorder work to avoid host round trips |
| Communication stalls in distributed training | nsys and NCCL logs; long allreduce times in the timeline | Overlap communication with computation (reduce-scatter / async collectives); tune NCCL_SOCKET_IFNAME, NCCL_BUFFSIZE and related env vars; ensure topology-aware NCCL config |
| Many small kernels (kernel-launch overhead) | nsys shows many short kernel bars; kernels last only a few µs | Fuse operators or use graph compilation (torch.compile) / kernel generators (Triton) to reduce launches and increase kernel granularity |
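The dataloader fix from the first row, sketched in code (TensorDataset stands in for a real dataset, and the device copy is commented out so the snippet runs without a GPU):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 128))  # stand-in for a real dataset

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=2,            # raise until host CPUs saturate
    pin_memory=True,          # page-locked buffers speed host->device copies
    persistent_workers=True,  # avoid re-forking workers every epoch
    prefetch_factor=2,        # batches each worker keeps queued ahead
)

for (batch,) in loader:
    # non_blocking=True lets the H2D copy overlap with compute:
    # batch = batch.to("cuda", non_blocking=True)
    pass
```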

Detailed notes on high-value fixes

  • Mixed precision: Using torch.cuda.amp.autocast unlocks Tensor Cores and reduces memory traffic for matrix ops; it often produces a 1.5–3× throughput improvement depending on GPU generation. Profile after enabling to ensure numerical stability and operator coverage.
  • Operator fusion / custom kernels: When ncu shows expensive memory traffic per op, write fused kernels (Triton or custom CUDA) to keep data in registers/shared memory across ops. Nsight Compute will show the drop in DRAM throughput after a successful fusion.
  • Memory partitioning for huge models: DeepSpeed ZeRO stages partition optimizer state/gradients/parameters and enable training models that otherwise OOM. Offloading to CPU/NVMe is a pragmatic path for extremely large models where latency is less critical.
  • Dataloader tuning: num_workers, pin_memory, prefetch_factor are low-effort knobs to eliminate CPU-side stalls—measure before you tune and prefer incremental changes (increase num_workers until CPU saturates).
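The mixed-precision pattern end-to-end (a device-agnostic sketch: on CUDA it uses fp16 with GradScaler as described above, and it falls back to bfloat16 on CPU so it runs anywhere):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# GradScaler guards fp16 gradients against underflow; it no-ops when disabled
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(32, 512, device=device)
target = torch.randn(32, 512, device=device)

# autocast runs matmul-heavy ops in reduced precision (Tensor Cores on CUDA)
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.autocast(device_type=device, dtype=amp_dtype):
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```

Profile before and after enabling it: the win shows up as higher TensorCore activity in ncu and lower achieved DRAM traffic per step.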

Important: never change multiple knobs at once. Measure, change one variable, re-measure. The profile is the experiment’s atomic record.

Automating benchmarks and performance regression testing

Automation is the difference between an optimization and a reproducible speedup you can ship. The automation strategy below is intentionally minimal and robust.

Canonical benchmark protocol (short)

  1. Decide a canonical scenario: e.g., training for N steps on a fixed subset, or inference on 10k synthetic prompts matching production shape. Record inputs and seeds.
  2. Build an immutable artifact: container image or pinned requirements.txt + driver/kernel versions. Record image digest.
  3. Warmup then measure a steady window (e.g., run 100 measured iterations after 10 warmup iterations). Capture metrics and traces as artifacts.
  4. Save the following per run: metrics.json (throughput, latencies p50/p95/p99, memory_peak), nvidia-smi.csv snapshot, nsys trace (optional), profiler trace folder, and environment metadata (commit, driver).
  5. Run the benchmark multiple times (≥3) and use the median or a robust estimator; store historical baselines.

Minimal automated runner (example)

  • run_bench.sh — runs a short, reproducible workload and writes metrics.json.
#!/usr/bin/env bash
set -euo pipefail
OUTDIR=${1:-./bench_out}
mkdir -p "$OUTDIR"

# Start a lightweight nvidia-smi logger in the background
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used \
  --format=csv -l 1 > "$OUTDIR/nvidia-smi.csv" &
SMI_PID=$!
trap 'kill "$SMI_PID" 2>/dev/null || true' EXIT

# Run a short training job instrumented with a torch.profiler schedule
# that writes traces to $OUTDIR/profiler
python run_small_bench.py --steps 120 --warmup 10 --outdir "$OUTDIR"

# Summarize metrics (the benchmark script produces metrics.json)
cat "$OUTDIR/metrics.json"

Example run_small_bench.py should:

  • pin seeds, set deterministic flags (if appropriate),
  • perform warmup and steady iterations,
  • measure steps/sec and token throughput,
  • optionally call nsys for a single representative capture, and
  • emit metrics.json with fields throughput, p50_ms, p95_ms, peak_mem_mb, commit, image.
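A minimal run_small_bench.py along those lines (a sketch: the sleeping step function is a stand-in for a real train step, and the CLI is simplified to a single output-directory argument):

```python
import json
import subprocess
import sys
import time
from pathlib import Path

def bench(step_fn, warmup=10, steps=100, batch_size=8, seq_len=2048):
    """Warm up, then time `steps` iterations and return the metrics dict."""
    for _ in range(warmup):
        step_fn()
    latencies = []
    for _ in range(steps):
        t0 = time.perf_counter()
        step_fn()
        latencies.append((time.perf_counter() - t0) * 1e3)  # ms
    total_s = sum(latencies) / 1e3
    latencies.sort()
    return {
        "throughput": steps / total_s * batch_size * seq_len,  # tokens/sec
        "p50_ms": latencies[len(latencies) // 2],
        "p95_ms": latencies[min(int(len(latencies) * 0.95), len(latencies) - 1)],
    }

if __name__ == "__main__":
    outdir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("bench_out")
    outdir.mkdir(parents=True, exist_ok=True)
    metrics = bench(lambda: time.sleep(0.001))  # replace with a real step
    try:
        metrics["commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except (OSError, subprocess.CalledProcessError):
        metrics["commit"] = "unknown"
    (outdir / "metrics.json").write_text(json.dumps(metrics, indent=2))
```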

CI / GitHub Actions snippet (self-hosted runner with GPU)

name: perf-bench
on:
  push:
    branches: [ main ]
jobs:
  bench:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmark
        run: |
          ./ci/run_bench.sh ./bench_artifacts/${GITHUB_SHA}
      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: bench-${{ github.sha }}
          path: ./bench_artifacts/${{ github.sha }}

Regression detection strategy

  • Keep a JSON baseline.json with the canonical metrics for the current release.
  • After a CI bench, load metrics.json and compare primary KPIs:
    • Fail if throughput drops by >X% (system-dependent; start with 5–10%).
    • Fail if p95/p99 latency increases by >Y ms (set by SLA).
  • For noisy workloads, require statistical significance (median across N runs) or use a sliding window of historical medians to avoid false positives. MLPerf-style run discipline is instructive here.
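The comparison logic is small enough to sketch directly (field names follow the metrics.json schema above; the default thresholds are the suggested starting points):

```python
import json

def check_regression(baseline_path, current_path,
                     max_throughput_drop_pct=10.0, max_p95_increase_ms=5.0):
    """Compare a CI run against the stored baseline; return a list of
    human-readable failures (empty list means the run passed)."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)

    failures = []
    drop_pct = (1 - current["throughput"] / baseline["throughput"]) * 100
    if drop_pct > max_throughput_drop_pct:
        failures.append(f"throughput dropped {drop_pct:.1f}%")
    p95_delta = current["p95_ms"] - baseline["p95_ms"]
    if p95_delta > max_p95_increase_ms:
        failures.append(f"p95 latency up {p95_delta:.1f} ms")
    return failures

# in CI: sys.exit(1) when check_regression(...) returns a non-empty list
```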

What traces to collect in CI

  • Collect nvidia-smi CSV continuously (low overhead).
  • Collect torch.profiler short cycles (low-to-moderate overhead) for operator regressions.
  • Reserve nsys/ncu captures for triage runs only (high overhead, large files). Automate their collection only on benchmark failures or when a deeper investigation is triggered.

Automation checklist (artifact hygiene)

  • Save: metrics.json, nvidia-smi.csv, profiler_runs/*, nsys/*.qdrep (if collected), Dockerfile or image digest, commit and git diff.
  • Store artifacts in an immutable store (object storage) and link them in your CI failure ticket.
  • Record system topology: GPU model(s), PCIe/NVLink layout, NUMA layout, and nvidia-smi driver output. These explain many regressions.
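A small helper can capture most of that metadata automatically (a sketch; extend the nvidia-smi query fields to taste, and both external tools are probed before use so the function runs anywhere):

```python
import platform
import shutil
import subprocess

def capture_environment():
    """Snapshot the metadata that explains most 'mystery' regressions:
    Python version, host, GPU/driver info, and the current commit."""
    env = {
        "python": platform.python_version(),
        "hostname": platform.node(),
    }
    if shutil.which("nvidia-smi"):
        env["nvidia_smi"] = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout.strip()
    if shutil.which("git"):
        env["commit"] = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True,
        ).stdout.strip()
    return env

# attach to every benchmark run, e.g.:
# Path(outdir, "environment.json").write_text(json.dumps(capture_environment()))
```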

Bottleneck debugging playbook (2-minute method)

  1. Measure simple throughput (tokens/sec) and latency baseline.
  2. Run nvidia-smi while running to see GPU-level utilization and memory use.
  3. If GPU utilization low → nsys targeted capture around steady-state and inspect CPU lanes and NVTX ranges.
  4. If a kernel looks expensive → ncu the kernel and check DRAM throughput vs compute; use roofline logic.
  5. Apply one fix (e.g., pin_memory=True or enable autocast) and re-run the same steps to validate impact.

Profile, fix, validate, repeat. Each iteration should have a recorded artifact that proves the impact.

Profile data is evidence. Treat it as such: annotate the code (NVTX), save the trace, attach it to your issue. Store baseline artifacts so you can compare later.

Sources:
NVIDIA Nsight Systems - Overview of Nsight Systems: system-wide timeline, GPU/CPU correlation, and recommended workflow for low-overhead traces and NVTX usage.

Nsight Systems User Guide (2025.6) - CLI nsys options, capture-range controls, GPU metrics sampling, and guidance for practical profiling.

Nsight Compute Profiling Guide - Kernel-level metrics, ncu --metrics reference and interpretation for occupancy, memory throughput, and instruction throughput.

PyTorch Profiler tutorial (recipes) - torch.profiler schedule usage, on_trace_ready and TensorBoard integration for long-running jobs.

torch.profiler API reference - export_chrome_trace, memory timeline exports, and profiler configuration options.

Profile your model on Cloud TPU VMs - XProf/TensorBoard profiling for Cloud TPU VMs and use of the tensorboard_plugin_profile.

Profile PyTorch XLA workloads (Cloud TPU guide) - torch_xla.debug.profiler examples (xp.start_trace, xp.stop_trace) and visualization with TensorBoard.

DeepSpeed ZeRO (documentation) - Memory partitioning strategies (ZeRO stages), offload options and configuration examples for training very large models.

Roofline model (Williams, Waterman, Patterson) - The Roofline performance model for reasoning about compute vs memory-bound kernels and operational intensity.

NVIDIA Hopper architecture (developer blog) - Tensor Core capabilities and mixed-precision benefits on modern NVIDIA GPUs.

Useful nvidia-smi queries (NVIDIA support) - nvidia-smi --query-gpu options and best-practice queries for logging GPU utilization and memory.

MLCommons / MLPerf inference guidance (reproducibility & run rules) - Example rules and run-discipline (warmup, steady-state, reproducibility) useful when building regression tests.

NCCL environment variables and tuning guide - Important NCCL env vars (NCCL_SOCKET_IFNAME, NCCL_BUFFSIZE, debug options) to tune collective performance.

torch.utils.checkpoint (activation checkpointing) - Activation checkpointing API and trade-offs (compute for memory).

PyTorch DataLoader documentation (pin_memory, num_workers, prefetch_factor) - DataLoader options and practical guidance for reducing host-side stalls.

Automatic Mixed Precision (torch.cuda.amp) - autocast, GradScaler and recommended usage patterns to use lower-precision compute safely.

Profile surgically, change one variable, and record the artifact that proves the change moved the needle; that discipline converts optimization work into reliable, repeatable throughput improvements.
