Myoungho Shin

Posted on Jun 24 • Originally published at blog.gpuflight.com

Profiling Your CUDA App with GPUFlight Trace

#cuda #cpp #gpu #nvidia

In the previous GPUFlight posts, I focused on the embedded SDK approach: adding GPUFlight directly to a CUDA application so the app can create its own profiling scopes.

That workflow is useful when you own the code and want explicit instrumentation. But sometimes you just want to profile a CUDA program without editing it, rebuilding it, or adding a new dependency.

That is where GPUFlight Trace comes in.

What is GPUFlight Trace?

GPUFlight Trace is a launch-time profiling mode for CUDA applications. Instead of embedding GPUFlight into your program, you run the program through the gpufl trace command:

gpufl trace -- your_cuda_application

For example, you can profile a PyTorch training script like this:

gpufl trace -- python train.py

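Or a compiled CUDA executable like this (PowerShell syntax; $gpufl is a variable holding the path to the launcher, set up in the capture step later in this post):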

& $gpufl trace `
  --name tutorial-01-vector-add `
  --output .\gpufl-logs `
  -- .\build\Release\gpufl_tutorial_01.exe

The idea is simple: GPUFlight runs your program, records what the program does on the GPU side, and writes trace logs without touching your application code.

That makes it a good first profiling step when you already have an executable or script and want to answer:

I launched CUDA kernels. What actually happened on the GPU?

CUDA programs can quickly become difficult to reason about. You may have kernels, memory copies, synchronization points, CUDA streams, framework-generated kernels, and library calls all mixed together. The program output might say "success", but that does not tell you how the work moved through the GPU.

How does it work?

At a high level, gpufl trace uses a hook that the NVIDIA driver already exposes for CUDA tooling.

Before starting your program, gpufl trace sets CUDA_INJECTION64_PATH to point at GPUFlight's injection library. On Linux, it also sets LD_PRELOAD so GPUFlight is loaded early enough to cover more CUDA runtime cases.

Then the flow looks like this:

gpufl trace starts your program
  -> your program makes its first CUDA call
  -> the NVIDIA driver loads GPUFlight's injection library
  -> GPUFlight registers CUPTI callbacks
  -> CUDA kernels, copies, streams, and sync events are captured

You normally do not need to set CUDA_INJECTION64_PATH yourself. The gpufl trace launcher handles that for the target process.
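To make the mechanism concrete, here is a minimal sketch of the environment a launcher of this kind could prepare for the child process. It is not GPUFlight's actual code: the helper name injection_env, its parameters, and the choice to preload the same library and keep any existing LD_PRELOAD entry are all my assumptions for illustration. Only the two variable names come from the description above.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper (not part of GPUFlight): returns the extra environment
// variables a trace launcher could set before starting the target program.
// lib_path is a placeholder for wherever the injection library lives.
std::map<std::string, std::string> injection_env(const std::string& lib_path,
                                                 bool is_linux,
                                                 const std::string& existing_preload) {
    assert(!lib_path.empty());
    std::map<std::string, std::string> env;

    // The CUDA driver checks this variable when CUDA initializes and loads
    // the library it names into the process.
    env["CUDA_INJECTION64_PATH"] = lib_path;

    if (is_linux) {
        // Assumption: preload the same library, and keep whatever the user
        // already had in LD_PRELOAD by appending it after a ':' separator.
        env["LD_PRELOAD"] = existing_preload.empty()
                                ? lib_path
                                : lib_path + ":" + existing_preload;
    }
    return env;
}
```

A real launcher would then merge these entries into the child's environment and start the target command. The point of the sketch is only that the target program itself never has to know about either variable.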

The CUDA sample

Now let's look at a CUDA sample application.

The sample is intentionally small. It allocates three vectors, launches a vector-add kernel 50 times, copies the result back, and validates the output.
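The validation step is plain host-side C++. The tutorial's own check is not reproduced in this post, so treat the following as a rough sketch of what such a check can look like: the function name first_mismatch and the tolerance value are assumptions of mine. Because every launch writes c = a + b from the same inputs, the expected result is the same whether the kernel ran once or 50 times.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side check: compares the copied-back result c against
// a + b computed on the CPU. Returns the index of the first element that
// differs by more than tol, or -1 if every element matches.
long first_mismatch(const std::vector<float>& a,
                    const std::vector<float>& b,
                    const std::vector<float>& c,
                    float tol = 1e-5f) {
    assert(a.size() == b.size() && b.size() == c.size());
    for (std::size_t i = 0; i < c.size(); ++i) {
        if (std::fabs(c[i] - (a[i] + b[i])) > tol) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

Returning the failing index instead of a bare true/false makes a broken run easier to debug, since you can print the offending element.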

The runnable tutorial project is here:

gpu-flight/gpufl-tutorial/tutorial-01

Here is the kernel:

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

The CUDA source file is here:

vector_add.cu

This is not meant to be a performance benchmark. It is deliberately predictable so the captured trace is easy to inspect.
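The idx < n guard in the kernel matters because the grid is normally rounded up to a whole number of blocks. The arithmetic can be sketched on the host side as below. The block size of 256 threads used in the usage note is my assumption, not something taken from the tutorial source, and the helper names are hypothetical.

```cpp
#include <cassert>

// Ceiling division: the smallest number of blocks whose threads cover
// all n elements.
int blocks_for(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Number of launched threads that fail the `idx < n` check and do no work.
int idle_threads(int n, int threads_per_block) {
    return blocks_for(n, threads_per_block) * threads_per_block - n;
}
```

With the sample's 1048576 elements and an assumed 256 threads per block, blocks_for gives 4096 blocks and no idle threads. With 1000 elements it gives 4 blocks, so 24 threads in the last block skip the body of the kernel. These are the grid and block dimensions you would expect to see reported for each launch in a trace.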

Build the sample

On Windows, configure the project with CMake and Visual Studio:

cmake -S . -B build -G "Visual Studio 17 2022" -A x64

Then build the Release executable:

cmake --build build --config Release

Run the executable once without profiling:

.\build\Release\gpufl_tutorial_01.exe

Expected output:

Vector add completed successfully: 50 kernel launches, 1048576 elements

At this point, we know the CUDA app works normally. Now we can profile it.

Capture a trace

The capture step uses the gpufl launcher from gpufl-client. If you don't have it yet, see the getting-started guide. Then point a PowerShell variable at the launcher:

$gpufl = "C:\path\to\gpufl-client\build-windows\daemon\launcher\Release\gpufl.exe"

Run the CUDA app through gpufl trace:

& $gpufl trace `
  --name tutorial-01-vector-add `
  --output .\gpufl-logs `
  -- .\build\Release\gpufl_tutorial_01.exe

The command creates a gpufl-logs directory. Inside it, GPUFlight writes a generated session folder.

Inside the session folder, you will see one or more compressed log files per channel:

gpufl-logs/
  <session-id>/
    device.1.log.gz
    sass.1.log.gz
    scope.1.log.gz
    system.1.log.gz
    system.2.log.gz

Each channel is a separate stream of events:

device.* — kernel launches and memory copies: the core GPU activity of the trace.
system.* — device-level metrics such as utilization and memory usage, rotated into numbered windows (system.1, system.2, ...) over the run.
scope.* — user-defined profiling scopes (the embedded-SDK feature from the earlier posts). An uninstrumented app like this one defines none.
sass.* — SASS-level instruction data, which only a deeper profiling mode collects. The default trace does not, so this file is created but stays empty.

GPUFlight writes the standard channel files even when a channel has nothing to record, so an empty sass.1.log.gz on a plain trace is expected, not an error. Together, these files are the local source of truth for the captured run.
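If you want to script over a session folder, the file names in the listing above follow a channel.index.log.gz pattern. Here is a small sketch that splits a name into its channel and index. The pattern is inferred only from the names shown in this post, not from a documented format, and the LogFileName struct and parse_log_name function are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Result of parsing a name such as "system.2.log.gz".
struct LogFileName {
    bool valid = false;
    std::string channel;  // e.g. "system"
    int index = 0;        // e.g. 2
};

// Parses "<channel>.<index>.log.gz". Anything else yields valid == false.
LogFileName parse_log_name(const std::string& name) {
    static const std::string suffix = ".log.gz";
    LogFileName out;
    if (name.size() <= suffix.size() ||
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) != 0) {
        return out;  // wrong extension
    }
    const std::string stem = name.substr(0, name.size() - suffix.size());
    const std::size_t dot = stem.rfind('.');
    if (dot == std::string::npos || dot == 0 || dot + 1 == stem.size()) {
        return out;  // no channel or no index part
    }
    const std::string digits = stem.substr(dot + 1);
    assert(!digits.empty());
    if (digits.size() > 9) return out;  // keep std::stoi within int range
    for (char ch : digits) {
        if (ch < '0' || ch > '9') return out;  // index must be numeric
    }
    out.valid = true;
    out.channel = stem.substr(0, dot);
    out.index = std::stoi(digits);
    return out;
}
```

Grouping files by channel and sorting by index this way would, for example, put system.1.log.gz before system.2.log.gz when reading the rotated windows in order.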

Upload the trace manually

Open the GPUFlight dashboard:

https://app.gpuflight.com

If you do not have an account yet, register here:

https://app.gpuflight.com/register

Go to Uploads and drag the generated log files into the upload area.

For this run, I selected the compressed log files from the generated session folder.

GPUFlight detects the files and shows the upload plan.

During upload, the newest row appears with a Received status while the files stream to the backend.

When all files are sent, the upload panel marks the session complete.

The flow is:

Drop the generated log files.
GPUFlight detects the session.
Click the upload button.
The upload row moves from received to completed.
The session appears in Sessions.

Inspect the uploaded session

After processing starts, the session appears on the Sessions page.

Once processing completes, the session is ready to inspect.

For this vector-add app, the dashboard should show:

one GPU
50 kernel launches
the vector_add kernel
per-launch timing
grid and block dimensions
occupancy-related values
a timeline view showing launches on a wall-clock axis

The kernel view answers: what ran, and how long did it take?

The timeline answers: when did it happen?

That distinction becomes more important as an application gets more complex. A table is good for ranking kernels by cost. A timeline is better for understanding ordering, gaps, transfers, synchronization, and overlap.
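The difference between the two views can be shown with a few lines of arithmetic over launch records. This is only an illustration with a made-up Launch struct and timestamps in nanoseconds. It does not read GPUFlight's log format, which this post does not describe.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical launch record: start and end on a shared clock, in ns.
struct Launch {
    std::int64_t start_ns;
    std::int64_t end_ns;
};

// Table-style question: how much time did the launches take in total?
std::int64_t total_busy_ns(const std::vector<Launch>& launches) {
    std::int64_t sum = 0;
    for (const Launch& l : launches) {
        assert(l.end_ns >= l.start_ns);
        sum += l.end_ns - l.start_ns;
    }
    return sum;
}

// Timeline-style question: how much idle time sits between launches?
// Overlapping launches (for example on different streams) leave no gap.
std::int64_t total_gap_ns(std::vector<Launch> launches) {
    std::sort(launches.begin(), launches.end(),
              [](const Launch& a, const Launch& b) { return a.start_ns < b.start_ns; });
    std::int64_t gaps = 0;
    std::int64_t covered_until = launches.empty() ? 0 : launches.front().end_ns;
    for (std::size_t i = 1; i < launches.size(); ++i) {
        if (launches[i].start_ns > covered_until) {
            gaps += launches[i].start_ns - covered_until;
        }
        covered_until = std::max(covered_until, launches[i].end_ns);
    }
    return gaps;
}
```

Two runs can have the same total busy time and very different gap totals. The per-kernel table only reflects the first number, while the timeline makes the second one visible.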

Stream the upload with gpufl-agent

Drag-and-drop upload is useful for the first trace because it makes the generated files visible. For repeated profiling, I usually want the trace command to upload while it runs.

For that, create an API key in the dashboard under Settings > API keys and generate a key for the local uploader.

Copy the generated key immediately. The dashboard shows the full key only once.

Download the gpufl-agent JAR from the releases page:

https://github.com/gpu-flight/gpufl-agent/releases/download/v1.0.1/gpufl-agent.jar

Then set the upload environment:

$env:GPUFL_BACKEND_URL = "https://api.gpuflight.com"
$env:GPUFL_API_KEY = "gpfl_xxxxxxxxxxxx"
$agentJar = "C:\path\to\gpufl-agent.jar"

Run the trace with upload enabled:

& $gpufl trace `
  --name tutorial-01-vector-add `
  --output .\gpufl-logs `
  --upload `
  --agent-jar $agentJar `
  -- .\build\Release\gpufl_tutorial_01.exe

The agent tails the generated log files and streams them to the backend.

When the command completes, the session is already on its way to the dashboard.

For a local tutorial, starting the agent from gpufl trace is convenient. In a production environment, you may prefer to run the uploader separately, either as a Docker container or as a standalone Java application. That lets gpufl trace focus on generating trace files while a long-running agent watches and uploads them in the background. This can be useful when you do not want the profiled command to wait at the end of the run for every generated file to finish uploading.

When to use gpufl trace

Use gpufl trace when:

you have an existing CUDA program
you can launch it from a terminal
you want a first activity trace without changing source code
you want to inspect kernel events and a timeline in the dashboard

Use the embedded GPUFlight SDK when:

you own the application source
you want explicit GFL_SCOPE regions
you want GPUFlight initialized directly inside the process
you want tighter control over capture boundaries

For the first look at an executable, gpufl trace is the lower-friction path. For richer application context, embedded instrumentation gives more control.
