In the previous posts I've been showing how to investigate GPU occupancy and optimize kernels that aren't fully using the hardware. That was just one case — I'll cover more occupancy scenarios in future posts.
Today, I want to go through how to use GPU Flight in Python, especially with PyTorch. GPU Flight is still in active development; the current version is v0.1.0.dev7. You can install it with:
pip install gpufl==0.1.0.dev7
However, I highly recommend building from source inside a CUDA container. There are two reasons:
- Prerequisite libraries — GPU Flight's backend needs CUPTI, the CUDA runtime, and NVML headers at compile time. Getting these right on a bare system is fiddly.
- NVML support — the pre-built PyPI wheel is compiled in a minimal CI environment that doesn't include NVML stubs. This means the wheel works for kernel profiling, but can't collect runtime GPU utilization or VRAM usage. Building from source inside the nvidia/cuda:*-devel image picks up NVML automatically.
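If you want to sanity-check whether NVML is actually loadable at runtime in whatever environment you end up with, a quick ctypes probe works independently of GPU Flight. This helper is my own sketch, not part of the gpufl API:

```python
import ctypes

def nvml_present() -> bool:
    """Return True if the NVML shared library can be loaded."""
    try:
        ctypes.CDLL("libnvidia-ml.so.1")
        return True
    except OSError:
        return False

print(nvml_present())
```

On a bare CI box this prints False, which is exactly why the PyPI wheel ships without the NVML collector.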
In this post, I'll show how to use Docker to set up an environment that's ready to go — with GPU Flight built from source, PyTorch, and Jupyter Lab all pre-installed.
The Dockerfile
Here's the full Dockerfile. It's straightforward — CUDA 13.1 base, PyTorch, GPU Flight, and Jupyter Lab:
FROM nvidia/cuda:13.1.0-devel-ubuntu24.04
ENV DEBIAN_FRONTEND=noninteractive
# System dependencies (Ubuntu 24.04 ships Python 3.12)
# NOTE: cmake/ninja come from pip (build-system.requires needs >=3.31, apt has 3.28)
RUN apt-get update && apt-get install -y \
python3 \
python3-venv \
python3-dev \
python3-pip \
git \
curl \
&& rm -rf /var/lib/apt/lists/*
# Create venv to avoid PEP 668 issues
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Upgrade pip
RUN pip install --upgrade pip
# Install PyTorch with CUDA 13.1 support
RUN pip install torch --index-url https://download.pytorch.org/whl/cu130
# Build gpufl from source so it picks up NVML from the CUDA devel image
ARG GPUFL_VERSION=main
RUN git clone --depth 1 --branch ${GPUFL_VERSION} \
https://github.com/gpu-flight/gpufl-client.git /tmp/gpufl-client \
&& CMAKE_ARGS="-DBUILD_TESTING=OFF" \
pip install -v "/tmp/gpufl-client[analyzer,viz]" \
&& rm -rf /tmp/gpufl-client
# Install Jupyter
RUN pip install jupyterlab
# Working directory for notebooks
WORKDIR /workspace
# Expose Jupyter port
EXPOSE 8888
# Start Jupyter Lab
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", \
"--allow-root", "--NotebookApp.token=''", "--NotebookApp.password=''"]
A few things to note:
- Ubuntu 24.04 — ships Python 3.12 natively, which is what GPU Flight requires. No PPA hacks needed.
- devel image — we use nvidia/cuda:13.1.0-devel-ubuntu24.04 because the devel variant includes CUPTI, CUDA headers, and NVML stubs that GPU Flight's backend needs at compile time.
- Building from source — we clone the repo and build with pip install rather than using the pre-built PyPI wheel. This is important: the devel image has NVML stubs at /usr/local/cuda/lib64/stubs/libnvidia-ml.so, so CMake detects them and compiles in the NVML collector. The pre-built wheel doesn't have this, which means no GPU utilization or VRAM monitoring.
- PyTorch cu130 — at the time of writing, PyTorch doesn't publish a cu131 wheel yet. The cu130 build is forward-compatible with the CUDA 13.1 runtime in the container, so this works fine.
- No token — Jupyter starts without authentication. This is fine for local development; don't expose this to the internet.
Building and Running
Prerequisites
You need two things on your host machine:
- Docker — any recent version
- NVIDIA Container Toolkit — this lets Docker containers access your GPU
Important: Having an NVIDIA driver installed on your host is not enough. Docker doesn't know how to talk to your GPU on its own — you need the NVIDIA Container Toolkit to bridge that gap. Without it, --gpus all will fail with:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
You can check if it's already installed by running nvidia-ctk --version. If not, here's how to set it up:
# Add the NVIDIA container toolkit repo
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install and configure
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
That last systemctl restart is easy to forget — Docker needs to be restarted after the runtime is configured, or it won't pick up the new GPU capability.
You can verify it worked with:
docker run --rm --gpus all nvidia/cuda:13.1.0-base-ubuntu24.04 nvidia-smi
If you see your GPU listed, you're good to go.
Build the Image
docker build -t gpufl-python .
This will take a few minutes the first time — mostly downloading PyTorch.
Run the Container
docker run --gpus all -p 8888:8888 -v $(pwd)/notebooks:/workspace gpufl-python
Breaking that down:
| Flag | What it does |
|---|---|
| --gpus all | Passes all GPUs into the container |
| -p 8888:8888 | Maps Jupyter's port to your host |
| -v $(pwd)/notebooks:/workspace | Mounts a local folder so your notebooks persist |
Connect
Open your browser and go to:
http://localhost:8888
You'll land in Jupyter Lab with GPU Flight, PyTorch, and a CUDA-capable GPU ready to go.
Quick Smoke Test
Create a new notebook and run this to verify everything is working:
import torch
import gpufl
from gpufl import ProfilingEngine

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device: {torch.cuda.get_device_name(0)}")

# Initialize GPU Flight
gpufl.init("smoke-test",
           log_path="./smoke_test",
           sampling_auto_start=True,
           enable_kernel_details=True,
           enable_stack_trace=True,
           profiling_engine=ProfilingEngine.RangeProfiler)

# Run a simple operation
with gpufl.Scope("RandomGeneration"):
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")

with gpufl.Scope("a @ b"):
    c = a @ b
    torch.cuda.synchronize()

gpufl.shutdown()
print("GPU Flight logs written!")
After running this, you should see *.log files in your working directory. These are your GPU Flight recordings — every kernel launch, memory copy, and timing event that happened during that matrix multiply.
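A quick way to confirm the recordings landed is to glob for the prefix we passed as log_path — nothing gpufl-specific here:

```python
from pathlib import Path

# List the GPU Flight log files written with log_path="./smoke_test"
logs = sorted(Path(".").glob("smoke_test*.log"))
for f in logs:
    print(f.name, f.stat().st_size, "bytes")
```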
Analyzing the Results
GPU Flight's Python analyzer can load those logs directly in the notebook:
from gpufl.analyzer import GpuFlightSession
session = GpuFlightSession(".", log_prefix="smoke_test")
session.print_summary()
GpuFlightSession takes two main arguments: the directory where logs live, and the log_prefix matching your log_path from init. It automatically finds and loads smoke_test.device.log, smoke_test.scope.log, and smoke_test.system.log.
print_summary() gives you a quick dashboard — total duration, kernel count, GPU busy time, average utilization, and peak VRAM.
Now let's look at the kernel hotspots:
session.inspect_hotspots(top_n=10)
This gives you a Rich-formatted table of your hottest kernels with occupancy, register usage, shared memory, and the per-resource occupancy breakdown showing exactly what's limiting each kernel.
Here's what that actually looks like — this is real output from the matrix multiply we just ran:
Now let's look at 33.3% occupancy. That sounds bad, right? Let's break it down.
The kernel is ampere_sgemm_128x64_nn — cuBLAS's single-precision matrix multiply. It uses 122 registers per thread. That's a lot. Let's trace through what happens on an Ampere SM:
- 128 threads per block = 4 warps per block
- 122 regs/thread × 32 threads/warp = 3,904 → rounded up to the hardware allocation granularity of 256 → 4,096 regs/warp
- 4 warps × 4,096 = 16,384 registers per block
- An Ampere SM has 65,536 registers total → 65,536 / 16,384 = 4 blocks max
- 4 blocks × 4 warps = 16 active warps out of 48 max = 33.3%
The breakdown confirms it: reg 33.3% is the bottleneck, while shared memory (66.7%), warps (100%), and block count (100%) all have headroom.
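The arithmetic above is mechanical enough to script. Here's a small sketch that reproduces the register-limited occupancy calculation for this kernel; the SM constants are the Ampere values used in the walkthrough, so adjust them for other architectures:

```python
import math

# Ampere SM limits (the values used in the walkthrough above)
REGS_PER_SM = 65536
MAX_WARPS_PER_SM = 48
REG_ALLOC_GRANULARITY = 256
WARP_SIZE = 32

def register_limited_occupancy(regs_per_thread: int, threads_per_block: int) -> float:
    warps_per_block = threads_per_block // WARP_SIZE
    # Per-warp register allocation rounds up to the hardware granularity
    regs_per_warp = math.ceil(regs_per_thread * WARP_SIZE / REG_ALLOC_GRANULARITY) * REG_ALLOC_GRANULARITY
    regs_per_block = warps_per_block * regs_per_warp
    max_blocks = REGS_PER_SM // regs_per_block
    active_warps = max_blocks * warps_per_block
    return active_warps / MAX_WARPS_PER_SM

# ampere_sgemm_128x64_nn: 122 regs/thread, 128 threads/block
occ = register_limited_occupancy(122, 128)
print(f"{occ:.1%}")  # 33.3%

# The limiting resource is simply the smallest per-resource ceiling
breakdown = {"reg": occ, "shared_mem": 2/3, "warps": 1.0, "blocks": 1.0}
print(min(breakdown, key=breakdown.get))  # reg
```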
But Is This Actually a Problem?
Not necessarily. If the algorithm itself doesn't require all those registers, high register usage might be a problem — but it could also be by design. This is a good example of why occupancy alone doesn't tell the whole story — you need to understand what's limiting it and whether that tradeoff makes sense for the workload.
If you saw 33% occupancy with limiting_resource: shared_mem on your own custom kernel, that might be worth investigating.
What's Next
Now that you have a working environment, you can start profiling your own models. The occupancy breakdown makes it easy to spot which kernels are underutilizing the GPU and — more importantly — why. Not every low-occupancy kernel is a problem, but when one is, you'll know exactly which resource to optimize.
In the next post, I'll cover GPU Flight's profiling engines — PC sampling, SASS metrics, and the range profiler — which let you go beyond kernel metadata and collect hardware-level data about what's happening inside the GPU while your kernels run.
GPU Flight is open source: github.com/gpu-flight

