Myoungho Shin

Posted on May 22

Profiling a CUDA Python Program with GPUFlight

#performance #python #cuda #gpu

In the previous post, I used a C++ CUDA example to look at memory coalescing and how memory access patterns affect GPU performance.

This time, I wanted to look at a similar performance problem from Python.

I usually write CUDA code in C++, but recently I have been spending more time with Python, especially PyTorch and Numba.

Numba is interesting because it lets you write a real GPU kernel directly in Python. You can decorate a function with @cuda.jit, launch it with kernel[grid, block](...), and Numba compiles it down to GPU machine code that runs on the actual hardware.

The good news is that GPUFlight can profile Python GPU programs as well.

In this post, I’ll profile a simple Numba matrix multiplication kernel with GPUFlight. Then I’ll read the report step by step and show how the report points to a real optimization: shared-memory tiling.

One important note before we start: this example uses GPUFlight’s deeper profiling mode with SASS-level metrics and PC sampling. So the duration numbers in the report should not be treated as clean baseline kernel timing. They include profiling overhead. The main goal here is not to benchmark Numba against an optimized library like cuBLAS. The goal is to show how GPUFlight helps explain what is happening inside the kernel.

Setup

Both GPUFlight and Numba can be installed from PyPI. On a fresh Linux machine:

sudo apt-get install -y python3.12-venv
python3 --version            # expect Python 3.12.x

python3 -m venv ~/gpufl-venv
source ~/gpufl-venv/bin/activate

pip install --upgrade pip
pip install gpufl "numba-cuda[cu13]"

python -c "import gpufl; print('gpufl', gpufl.__version__)"

You should see something like:

gpufl 1.x.x

At the time I am writing this, the version is 1.0.2.

Before using the profiler, it is a good idea to confirm that Numba can find your GPU:

python -c "from numba import cuda; print('cuda available:', cuda.is_available()); cuda.detect()"

Now we are ready to run a Python CUDA application with GPUFlight.

The sample kernel

Here is the sample code I am using:

import gpufl as gfl
from gpufl.report import generate_report
from numba import cuda
import numpy as np
import math
import os

@cuda.jit
def matmul_kernel(A, B, C):
    # cuda.grid(2) returns (x, y); this kernel uses x as the row index
    row, col = cuda.grid(2)

    if row < C.shape[0] and col < C.shape[1]:
        tmp = 0.0

        for k in range(A.shape[1]):
            tmp += A[row, k] * B[k, col]

        C[row, col] = tmp

LOG_PATH = "./gfl_logs"

gfl.init(
    app_name="matmul_sample",
    log_path=LOG_PATH,
    sampling_auto_start=True,
    system_sample_rate_ms=100,
    profiling_engine=gfl.ProfilingEngine.PcSamplingWithSass,
)

try:
    N = 2048

    A = cuda.to_device(np.random.rand(N, N).astype(np.float32))
    B = cuda.to_device(np.random.rand(N, N).astype(np.float32))
    C = cuda.to_device(np.zeros((N, N), dtype=np.float32))

    tpb = (16, 16)
    bpg = (math.ceil(N / tpb[0]), math.ceil(N / tpb[1]))

    with gfl.Scope("matrix_mul_compute", "math"):
        for _ in range(10):
            matmul_kernel[bpg, tpb](A, B, C)

    _ = C.copy_to_host()
    print("[OK] compute finished")

finally:
    gfl.shutdown()

    print(
        generate_report(
            os.path.dirname(LOG_PATH) or ".",
            log_prefix=os.path.basename(LOG_PATH),
            top_n=10,
        )
    )

This is a very simple matrix multiplication kernel.

Each thread computes one output element. For each element, the thread walks through one full row of A and one full column of B.
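To make the per-thread work concrete, here is a plain-Python sketch of what a single thread of matmul_kernel computes. It needs no GPU. The helper name one_thread and the tiny 2 x 2 inputs are made up for illustration; on the GPU, row and col come from cuda.grid(2).

```python
def one_thread(A, B, row, col):
    """Mimic the body of matmul_kernel for a single (row, col) pair."""
    tmp = 0.0
    for k in range(len(A[0])):          # A.shape[1] in the kernel
        tmp += A[row][k] * B[k][col]    # one element of A, one of B per step
    return tmp

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]

# One "thread" per output element, as in the kernel.
C = [[one_thread(A, B, r, c) for c in range(2)] for r in range(2)]
print(C)
```

Every call walks a full row of A and a full column of B, which is the access pattern the rest of the post looks at.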

This is intentionally not optimized. I want to start with a simple kernel, because it makes the profiling report easier to understand.

Let’s run it and see what GPUFlight tells us.

===============================================================================
                           GPU Flight Session Report
                       Generated: 2026-05-22 05:05:33 UTC
===============================================================================

===============================================================================
  Session Summary
===============================================================================
  Application:          matmul_sample
  Session ID:           565d3c32-86cc-415d-8642-9c140f856f2b
  Duration:             17.91 s
  GPU Device:           NVIDIA GeForce RTX 5060 Laptop GPU
    SMs:                26
    Registers/Block:    65536

===============================================================================
  Kernel Execution Summary
===============================================================================
  Total Kernels:        10
  Unique Kernels:       1
  Total GPU Time:       17.40 s
  GPU Busy:             97.2%
  Avg Duration:         1.74 s
  Median Duration:      1.74 s
  Min Duration:         1.71 s
  Max Duration:         1.78 s

===============================================================================
  Top 10 Kernels by Total GPU Time
===============================================================================
  #   Kernel                                   Calls       Total         Avg         Max
  --------------------------------------------------------------------------------------
  1   __main__::matmul_kernel                     10     17.40 s      1.74 s      1.78 s

===============================================================================
  Kernel Details (Top 10)
===============================================================================

  __main__::matmul_kernel
  =======================
    Grid:               (128,128,1)
    Block:              (16,16,1)
    Occupancy:          100.0%
    Reg Occupancy:      100.0%
    SMem Occupancy:     100.0%
    Warp Occupancy:     100.0%
    Block Occupancy:    100.0%
    Limiting Resource:  warps
    Registers/Thread:   40
    Shared Memory:      0 B dyn + 0 B static

===============================================================================
  Memory Transfer Summary
===============================================================================
  Total Transfers:      4
  Total Bytes:          64.0 MB

  Direction      Count     Total Bytes    Avg Throughput
  ------------------------------------------------------
  HtoD               3         48.0 MB        11.68 GB/s
  DtoH               1         16.0 MB         4.40 GB/s

===============================================================================
  System Metrics
===============================================================================
  GPU Metrics:
    Utilization:        avg 96.6%  peak 100%  min 0%
    Temperature:        avg 53.4 C  peak 58 C
    Power:              avg 71.0 W  peak 75.6 W
    VRAM Usage:         peak 1105 MiB
    SM Clock:           avg 2631 MHz  peak 2790 MHz

  Host Metrics:
    CPU Utilization:    avg 8.6%  peak 29.1%
    RAM Usage:          peak 27593 / 32189 MiB (85.7%)

===============================================================================
  Scope Summary
===============================================================================
  Scope Timing:
  Scope                          Calls       Total         Avg         Max
  ------------------------------------------------------------------------
  matrix_mul_compute                 1   195.21 ms   195.21 ms   195.21 ms

  GPU Time by Scope:
  Scope                          Kernels      GPU Time         Avg
  ----------------------------------------------------------------
  matrix_mul_compute                  10       17.40 s      1.74 s

===============================================================================
  Profile / SASS Analysis
===============================================================================

  SASS Metrics Summary:
  Metric                                                   Total
  --------------------------------------------------------------
  smsp__sass_thread_inst_executed                   2235815690240
  smsp__sass_inst_executed                           69869240320
  smsp__sass_sectors_mem_global                      45654999040
  smsp__sass_sectors_mem_global_ideal                13427015680

  Thread Divergence Analysis:
    Warp Instructions:    69869240320
    Thread Instructions:  2235815690240
    Avg Threads/Warp:     32.0 / 32
    Warp Efficiency:      100.0%

Now let’s read the report carefully.

A profiling report is only useful if we can turn it into a decision. So instead of just looking at numbers, I usually ask a few questions.

1. Is the GPU actually busy?

Yes.

The report shows:

GPU Busy:             97.2%
GPU Util avg:         96.6%
Total GPU Time:       17.40 s
Duration:             17.91 s

This means the GPU was working for almost the entire run. Out of 17.91 s of wall-clock time, 17.40 s were spent running GPU kernels.
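As a quick sanity check, the GPU Busy figure is consistent with simply dividing total kernel time by session duration. This is my reading of the numbers, not a statement about how GPUFlight computes the field internally.

```python
session_s = 17.91     # "Duration" in the Session Summary
gpu_time_s = 17.40    # "Total GPU Time" in the Kernel Execution Summary

gpu_busy_pct = 100 * gpu_time_s / session_s
not_in_kernels_s = session_s - gpu_time_s

print(f"GPU busy: {gpu_busy_pct:.1f}%")
print(f"Time outside kernels: {not_in_kernels_s:.2f} s")
```

About half a second of the session is spent outside kernels (startup, transfers, shutdown), which is small next to the kernel time.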

The SM clock also stays boosted (2631 MHz on average, 2790 MHz peak), and power averages 71.0 W, which is close to the laptop GPU’s power limit.

So this is not a case where the CPU is too slow, the input data is too small, or the GPU is waiting for work. The GPU is busy.

That means if we want to improve performance, we need to look inside the kernel.

2. How long did each profiled launch take?

The report shows:

Avg Duration:         1.74 s
Median Duration:      1.74 s
Min Duration:         1.71 s
Max Duration:         1.78 s

However, this number needs to be read carefully.

This run includes deeper profiling, including SASS-level metrics and sampling. That means the measured duration includes profiling overhead. So I should not treat 1.74 s as the clean baseline runtime of the kernel.

I would not use this number alone to claim how fast or slow the raw Numba kernel is. But it is still useful as the runtime under this profiling configuration.

3. Is the problem occupancy?

Probably not.

The report shows:

Occupancy:          100.0%
Reg Occupancy:      100.0%
SMem Occupancy:     100.0%
Warp Occupancy:     100.0%
Block Occupancy:    100.0%
Limiting Resource:  warps

This tells us the GPU has enough active warps. The SMs are not sitting empty because we launched too few threads.

Occupancy is not the same thing as performance, but in this case low occupancy does not look like the main problem.
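For readers who want to see where a 100% figure can come from, here is a rough occupancy estimate. The two per-SM limits are my assumptions for this GPU generation and are not printed in the report, so check them against your own hardware. The sketch also ignores register allocation granularity and shared memory.

```python
# Assumed per-SM limits (not from the report; verify for your GPU).
MAX_WARPS_PER_SM = 48
REGS_PER_SM = 65536

# Values taken from the report and the launch configuration.
threads_per_block = 16 * 16
regs_per_thread = 40

warps_per_block = threads_per_block // 32
blocks_by_warps = MAX_WARPS_PER_SM // warps_per_block
blocks_by_regs = REGS_PER_SM // (regs_per_thread * threads_per_block)

resident_blocks = min(blocks_by_warps, blocks_by_regs)
occupancy = resident_blocks * warps_per_block / MAX_WARPS_PER_SM

print(blocks_by_warps, blocks_by_regs, f"{occupancy:.0%}")
```

Under these assumptions both limits allow 6 blocks per SM, so neither registers nor warps cut occupancy below 100%.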

4. Is the problem thread divergence?

Also no.

The report shows:

Avg Threads/Warp:     32.0 / 32
Warp Efficiency:      100.0%

This means every warp is using all 32 threads. There is no meaningful branch divergence here.
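The two instruction counters from the SASS section give the same answer. This is a small sketch of the arithmetic; I am assuming the report derives Avg Threads/Warp this way, which matches the printed values.

```python
WARP_SIZE = 32

warp_inst = 69_869_240_320         # smsp__sass_inst_executed
thread_inst = 2_235_815_690_240    # smsp__sass_thread_inst_executed

# Average number of active threads per executed warp instruction.
avg_threads_per_warp = thread_inst / warp_inst
warp_efficiency = avg_threads_per_warp / WARP_SIZE

print(avg_threads_per_warp, f"{warp_efficiency:.1%}")
```

If some lanes were masked off by a branch, the thread-level count would be smaller than 32 times the warp-level count.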

That makes sense because the kernel is simple. The 16 x 16 block and 128 x 128 grid map cleanly to the 2048 x 2048 output matrix.
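Here is the launch arithmetic as a small sketch. The helper launch_shape is a name I made up for illustration, and the 2050 case is a hypothetical size added only for contrast.

```python
import math

def launch_shape(n, tpb=(16, 16)):
    """Return (blocks per grid, launched threads outside the n x n output)."""
    bpg = (math.ceil(n / tpb[0]), math.ceil(n / tpb[1]))
    launched = bpg[0] * tpb[0] * bpg[1] * tpb[1]
    return bpg, launched - n * n

bpg, idle = launch_shape(2048)          # the size used in this post
bpg_odd, idle_odd = launch_shape(2050)  # hypothetical size, for contrast

print(bpg, idle)
print(bpg_odd, idle_odd)
```

With N = 2048 no launched thread fails the bounds check. With a size that is not a multiple of 16, the edge blocks would contain threads that skip the loop, and warp efficiency would likely dip below 100%.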

So far, the report says:

The GPU is busy.
Occupancy is high.
Warp efficiency is perfect.

So now we need to look at memory behavior.

5. What do the memory sectors say?

This is the most useful part of the report:

SASS Metrics Summary:
Metric                                                   Total
--------------------------------------------------------------
smsp__sass_thread_inst_executed                   2235815690240
smsp__sass_inst_executed                           69869240320
smsp__sass_sectors_mem_global                      45654999040
smsp__sass_sectors_mem_global_ideal                13427015680

The important two numbers are:

smsp__sass_sectors_mem_global          45,654,999,040
smsp__sass_sectors_mem_global_ideal    13,427,015,680

The kernel is accessing about 45.7B global memory sectors, while the ideal number is about 13.4B.

That is roughly:

45.7 / 13.4 ≈ 3.4x

So the kernel is moving about 3.4x more global memory traffic than the ideal case.

Another way to read it:

13.4 / 45.7 ≈ 29%

The memory access efficiency is only around 29%.
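The same arithmetic with the exact counter values, as a short sketch:

```python
actual = 45_654_999_040    # smsp__sass_sectors_mem_global
ideal = 13_427_015_680     # smsp__sass_sectors_mem_global_ideal

overfetch = actual / ideal               # how many times more than ideal
efficiency_pct = 100 * ideal / actual    # share of sectors that were needed

print(f"{overfetch:.2f}x, {efficiency_pct:.1f}%")
```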

This is the real story.

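The actual-versus-ideal gap is a coalescing measure: it compares the sectors each warp really touched with the minimum needed to serve the same requests. In the naive kernel, row comes from the x dimension of cuda.grid(2), so neighboring threads in a warp read A[row, k] from different rows, 2048 floats apart, and each of those reads lands in its own sector. On top of that, nothing is reused. Every thread fetches its whole row of A and its whole column of B from global memory, even though many threads need overlapping data. So the same data crosses the memory system again and again.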
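A back-of-the-envelope model of one warp reproduces the reported sector count. This sketch assumes 32-byte sectors, row-major float32 arrays, and warps built from consecutive threads with threadIdx.x varying fastest. The helper names are mine, not part of Numba or GPUFlight.

```python
N = 2048
BLOCK_X = 16
SECTOR_BYTES = 32    # assumed sector size
ITEM_BYTES = 4       # float32
LAUNCHES = 10

def sectors(elements):
    """Distinct sectors touched by (i, j) indices into an N x N array."""
    return len({(i * N + j) * ITEM_BYTES // SECTOR_BYTES for i, j in elements})

# First warp of block (0, 0): 32 threads as (threadIdx.x, threadIdx.y).
threads = [(t % BLOCK_X, t // BLOCK_X) for t in range(32)]

# Naive kernel: row follows x, col follows y. Look at iteration k = 0.
k = 0
a_sectors = sectors([(tx, k) for tx, ty in threads])    # A[row, k]
b_sectors = sectors([(k, ty) for tx, ty in threads])    # B[k, col]
c_sectors = sectors([(tx, ty) for tx, ty in threads])   # C[row, col]

warps = N * N // 32
total = LAUNCHES * warps * (N * (a_sectors + b_sectors) + c_sectors)

print(a_sectors, b_sectors, c_sectors, total)
```

The model gives 16 sectors for every warp-level read of A, 1 for B, and a total of 45,654,999,040, which matches smsp__sass_sectors_mem_global in the report. That agreement suggests the strided reads of A account for most of the traffic.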

The GPU is busy, the warps are full, and the lanes are active. But the memory access pattern is wasteful.

6. The fix: shared-memory tiling

For this kind of matrix multiplication kernel, the classic fix is shared-memory tiling.

Instead of letting each thread repeatedly read everything from global memory, each block cooperatively loads a tile of A and a tile of B into shared memory. Then the threads reuse those values many times before loading the next tile.
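The reuse can be counted directly. This sketch compares how many global loads one thread issues in each version, assuming the 16 x 16 tile used in this post.

```python
N = 2048
TPB = 16

# Naive: one element of A and one of B for every k.
naive_loads_per_thread = 2 * N

# Tiled: one element of A and one of B per tile; the rest comes from shared memory.
n_tiles = (N + TPB - 1) // TPB
tiled_loads_per_thread = 2 * n_tiles

reuse_factor = naive_loads_per_thread // tiled_loads_per_thread
print(naive_loads_per_thread, tiled_loads_per_thread, reuse_factor)
```

Each value loaded into shared memory is used by 16 threads, so global loads per thread drop by a factor of 16.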

Here is the improved kernel:

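from numba import cuda, float32

# Tile width; must match the block size used at launch (16 x 16 threads).
TPB = 16

@cuda.jit
def matmul_kernel_perf(A, B, C):
    # One tile of A and one tile of B, shared by all threads in the block.
    sA = cuda.shared.array((TPB, TPB), dtype=float32)
    sB = cuda.shared.array((TPB, TPB), dtype=float32)

    # x is the column and y is the row of this thread's output element.
    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y

    tmp = float32(0.0)

    n_tiles = (A.shape[1] + TPB - 1) // TPB

    for i in range(n_tiles):
        # Zero-fill first so out-of-range elements add nothing to the sum.
        sA[ty, tx] = 0.0
        sB[ty, tx] = 0.0

        # Each thread loads one element of each tile from global memory.
        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]

        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until the whole tile is loaded.
        cuda.syncthreads()

        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until every thread is done before the tiles are overwritten.
        cuda.syncthreads()

    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp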
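To check the tile indexing without a GPU, here is a plain-Python emulation of the same loop structure. It is only a sketch: the function names are mine, the tile width is shrunk to 2, and the 3 x 3 inputs are chosen so the zero-padding path for partial tiles is exercised.

```python
TPB = 2  # tiny tile so the example runs instantly; the kernel above uses 16

def tiled_matmul(A, B):
    """Emulate matmul_kernel_perf one block at a time."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    n_tiles = (K + TPB - 1) // TPB
    for by in range((M + TPB - 1) // TPB):            # blockIdx.y
        for bx in range((N + TPB - 1) // TPB):        # blockIdx.x
            acc = [[0.0] * TPB for _ in range(TPB)]   # each thread's tmp
            for i in range(n_tiles):
                # Phase 1: every thread loads one element of each tile.
                sA = [[0.0] * TPB for _ in range(TPB)]
                sB = [[0.0] * TPB for _ in range(TPB)]
                for ty in range(TPB):
                    for tx in range(TPB):
                        y, x = by * TPB + ty, bx * TPB + tx
                        if y < M and tx + i * TPB < K:
                            sA[ty][tx] = A[y][tx + i * TPB]
                        if x < N and ty + i * TPB < K:
                            sB[ty][tx] = B[ty + i * TPB][x]
                # Phase 2 (after the first syncthreads): accumulate from tiles.
                for ty in range(TPB):
                    for tx in range(TPB):
                        for j in range(TPB):
                            acc[ty][tx] += sA[ty][j] * sB[j][tx]
            # Write back, skipping threads outside the output matrix.
            for ty in range(TPB):
                for tx in range(TPB):
                    y, x = by * TPB + ty, bx * TPB + tx
                    if y < M and x < N:
                        C[y][x] = acc[ty][tx]
    return C

def naive_matmul(A, B):
    """Reference result: plain triple loop."""
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
B = [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]
result = tiled_matmul(A, B)
print(result == naive_matmul(A, B))
```

The two phases correspond to the code before and after the first cuda.syncthreads() call in the kernel.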

Now let’s run the same profiling mode again.

===============================================================================
                           GPU Flight Session Report
                       Generated: 2026-05-22 05:20:40 UTC
===============================================================================

===============================================================================
  Session Summary
===============================================================================
  Application:          matmul_sample_perf
  Session ID:           d44e5478-ba19-4cd1-b3cf-f6d31ab8b0ca
  Duration:             2.90 s
  GPU Device:           NVIDIA GeForce RTX 5060 Laptop GPU
    SMs:                26
    Registers/Block:    65536

===============================================================================
  Kernel Execution Summary
===============================================================================
  Total Kernels:        10
  Unique Kernels:       1
  Total GPU Time:       2.22 s
  GPU Busy:             76.4%
  Avg Duration:         221.64 ms
  Median Duration:      216.89 ms
  Min Duration:         215.38 ms
  Max Duration:         250.06 ms

===============================================================================
  Top 10 Kernels by Total GPU Time
===============================================================================
  #   Kernel                                   Calls       Total         Avg         Max
  --------------------------------------------------------------------------------------
  1   __main__::matmul_kernel_perf                10      2.22 s   221.64 ms   250.06 ms

===============================================================================
  Kernel Details (Top 10)
===============================================================================

  __main__::matmul_kernel_perf
  ============================
    Grid:               (128,128,1)
    Block:              (16,16,1)
    Occupancy:          100.0%
    Reg Occupancy:      100.0%
    SMem Occupancy:     100.0%
    Warp Occupancy:     100.0%
    Block Occupancy:    100.0%
    Limiting Resource:  warps
    Registers/Thread:   37
    Shared Memory:      0 B dyn + 2.0 KB static

===============================================================================
  Memory Transfer Summary
===============================================================================
  Total Transfers:      4
  Total Bytes:          64.0 MB

  Direction      Count     Total Bytes    Avg Throughput
  ------------------------------------------------------
  HtoD               3         48.0 MB         9.87 GB/s
  DtoH               1         16.0 MB         4.45 GB/s

===============================================================================
  System Metrics
===============================================================================
  GPU Metrics:
    Utilization:        avg 74.9%  peak 100%  min 0%
    Temperature:        avg 43.0 C  peak 48 C
    Power:              avg 51.0 W  peak 76.1 W
    VRAM Usage:         peak 958 MiB
    SM Clock:           avg 2180 MHz  peak 2812 MHz

  Host Metrics:
    CPU Utilization:    avg 16.0%  peak 46.0%
    RAM Usage:          peak 27019 / 32189 MiB (83.9%)

===============================================================================
  Scope Summary
===============================================================================
  Scope Timing:
  Scope                          Calls       Total         Avg         Max
  ------------------------------------------------------------------------
  matrix_mul_compute_perf            1   330.58 ms   330.58 ms   330.58 ms

  GPU Time by Scope:
  Scope                          Kernels      GPU Time         Avg
  ----------------------------------------------------------------
  matrix_mul_compute_perf             10        2.22 s   221.64 ms

===============================================================================
  Profile / SASS Analysis
===============================================================================

  SASS Metrics Summary:
  Metric                                                   Total
  --------------------------------------------------------------
  smsp__sass_thread_inst_executed                   298005299200
  smsp__sass_inst_executed                            9312665600
  smsp__sass_sectors_mem_global                       1347420160
  smsp__sass_sectors_mem_global_ideal                 1347420160

  Thread Divergence Analysis:
    Warp Instructions:    9312665600
    Thread Instructions:  298005299200
    Avg Threads/Warp:     32.0 / 32
    Warp Efficiency:      100.0%

The result is much better under the same profiling configuration.

The full session duration goes down from 17.91 s to 2.90 s.

Total GPU time goes down from 17.40 s to 2.22 s.

The average profiled kernel duration goes down from 1.74 s to 221.64 ms.

Again, these are still profiled durations, not clean baseline timings. But because both runs use the same deep profiling mode, this comparison is still useful. It tells us the tiled version behaves much better under the same measurement setup.
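For reference, the ratios between the two profiled runs work out as follows. These are ratios of profiled durations taken from the two reports, not clean speedup numbers.

```python
naive = {"session_s": 17.91, "gpu_total_s": 17.40, "kernel_avg_s": 1.74}
tiled = {"session_s": 2.90, "gpu_total_s": 2.22, "kernel_avg_s": 0.22164}

# Ratio of naive to tiled for each measurement.
ratio = {name: naive[name] / tiled[name] for name in naive}

for name, value in ratio.items():
    print(f"{name}: {value:.2f}x")
```

Kernel time shrinks by a bit under 8x, while the whole session shrinks by about 6x because the fixed costs outside the kernels stay roughly the same.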

7. What changed?

The most important change is in the memory-sector metrics.

Naive version:

smsp__sass_sectors_mem_global          45,654,999,040
smsp__sass_sectors_mem_global_ideal    13,427,015,680

Tiled version:

smsp__sass_sectors_mem_global           1,347,420,160
smsp__sass_sectors_mem_global_ideal     1,347,420,160

In the naive kernel, actual global memory sectors were about 3.4x higher than ideal.

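In the tiled kernel, actual and ideal global memory sectors are the same. Note that the tiled kernel also maps x to the column index (C[y, x]), so neighboring threads in a warp read neighboring addresses and every access coalesces.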

That is exactly what we wanted to see.
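A small model of one warp in the tiled kernel reproduces this count as well. The sketch assumes 32-byte sectors, row-major float32 arrays, and warps built from consecutive threads with threadIdx.x varying fastest. The helper name is mine.

```python
N, TPB = 2048, 16
SECTOR_BYTES, ITEM_BYTES, LAUNCHES = 32, 4, 10   # sector size is an assumption

def sectors(elements):
    """Distinct sectors touched by (i, j) indices into an N x N array."""
    return len({(i * N + j) * ITEM_BYTES // SECTOR_BYTES for i, j in elements})

# First warp of block (0, 0): 32 threads as (threadIdx.x, threadIdx.y).
threads = [(t % TPB, t // TPB) for t in range(32)]

# Tiled kernel: x is the column, y is the row. Look at tile i = 0.
i = 0
a_sectors = sectors([(ty, tx + i * TPB) for tx, ty in threads])  # A[y, tx + i*TPB]
b_sectors = sectors([(ty + i * TPB, tx) for tx, ty in threads])  # B[ty + i*TPB, x]
c_sectors = sectors([(ty, tx) for tx, ty in threads])            # C[y, x]

n_tiles = N // TPB
warps = N * N // 32
total = LAUNCHES * warps * (n_tiles * (a_sectors + b_sectors) + c_sectors)

print(a_sectors, b_sectors, c_sectors, total)
```

Each warp-level access touches 4 sectors, which is the minimum for 32 float32 values, and the total comes out to 1,347,420,160, the same number the report shows for both the actual and the ideal counter.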

The optimized kernel also uses shared memory:

Shared Memory:      0 B dyn + 2.0 KB static

That means each block is now reusing data through shared memory instead of repeatedly pulling the same values from global memory.

Instruction count also drops a lot:

Naive thread instructions:  2,235,815,690,240
Tiled thread instructions:    298,005,299,200

So the optimized kernel is not only reducing memory traffic. It is also doing much less total instruction work.
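Dividing the counters by the number of launched threads gives a per-thread view. This is a sketch using only the numbers in the two reports and the launch configuration.

```python
naive_thread_inst = 2_235_815_690_240
tiled_thread_inst = 298_005_299_200

threads_total = 10 * 2048 * 2048    # 10 launches, one thread per output element

naive_per_thread = naive_thread_inst // threads_total
tiled_per_thread = tiled_thread_inst // threads_total
inst_ratio = naive_thread_inst / tiled_thread_inst

print(naive_per_thread, tiled_per_thread, f"{inst_ratio:.2f}x")
print(f"per k step: {naive_per_thread / 2048:.1f} vs {tiled_per_thread / 2048:.1f}")
```

Each naive thread executes 53,306 instructions, or about 26 per step of the k loop, against 7,105 (about 3.5 per step) in the tiled version.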

Summary

This example is not a full benchmark. I am not comparing Numba against cuBLAS, and I am not claiming these numbers are the raw kernel runtimes. The run uses SASS-level profiling and sampling, so there is overhead.

But the report is still useful because both versions were measured with the same profiling mode. More importantly, the report explains why the naive kernel is slow.

The first version had:

high GPU utilization,
100% occupancy,
100% warp efficiency,
but very inefficient global memory access.

That means the problem was not lack of work or branch divergence. The problem was the memory access pattern.

After changing the kernel to use shared-memory tiling:

total profiled GPU time dropped from 17.40 s to 2.22 s,
average profiled kernel time dropped from 1.74 s to 221.64 ms,
global memory sectors dropped from 45.65B to 1.35B,
and actual global memory sectors matched the ideal number.

So the main takeaway is not just “the optimized kernel is faster.”

The more important takeaway is that GPUFlight helped point to the right fix. The report showed that the naive kernel was wasting memory bandwidth, and the optimized version confirmed that shared-memory tiling reduced that waste.

That is the workflow I want GPUFlight to support:

Run your program normally, collect useful GPU metrics, and turn the report into a concrete optimization decision.
