Myoungho Shin

Originally published at blog.gpuflight.com

Profiling Your CUDA App with GPUFlight Trace

In the previous GPUFlight posts, I focused on the embedded SDK approach: adding GPUFlight directly to a CUDA application so the app can create its own profiling scopes.

That workflow is useful when you own the code and want explicit instrumentation. But sometimes you just want to profile a CUDA program without editing it, rebuilding it, or adding a new dependency.

That is where GPUFlight Trace comes in.

What is GPUFlight Trace?

GPUFlight Trace is a launch-time profiling mode for CUDA applications. Instead of embedding GPUFlight into your program, you run the program through the gpufl trace command:

gpufl trace -- your_cuda_application

For example, you can profile a PyTorch training script like this:

gpufl trace -- python train.py

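Or a compiled CUDA executable, here from PowerShell with $gpufl holding the path to the launcher (it is defined in the capture step later in this post):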

& $gpufl trace `
  --name tutorial-01-vector-add `
  --output .\gpufl-logs `
  -- .\build\Release\gpufl_tutorial_01.exe

The idea is simple: GPUFlight runs your program, records what the program does on the GPU side, and writes trace logs without touching your application code.

That makes it a good first profiling step when you already have an executable or script and want to answer:

I launched CUDA kernels. What actually happened on the GPU?

CUDA programs can become difficult to reason about quickly. You may have kernels, memory copies, synchronization points, CUDA streams, framework-generated kernels, and library calls all mixed together. The program output might say "success", but that does not tell you how the work moved through the GPU.

How does it work?

At a high level, gpufl trace uses a hook that the NVIDIA driver already exposes for CUDA tooling.

Before starting your program, gpufl trace sets CUDA_INJECTION64_PATH to point at GPUFlight's injection library. On Linux, it also sets LD_PRELOAD, which loads the library at process start so GPUFlight can attach early enough to cover more CUDA runtime cases.

Then the flow looks like this:

gpufl trace starts your program
  -> your program makes its first CUDA call
  -> the NVIDIA driver loads GPUFlight's injection library
  -> GPUFlight registers CUPTI callbacks
  -> CUDA kernels, copies, streams, and sync events are captured

You normally do not need to set CUDA_INJECTION64_PATH yourself. The gpufl trace launcher handles that for the target process.
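
To make the launch step concrete, here is a minimal Python sketch of what a launcher of this kind might do before starting the target process. It is an illustration only, not GPUFlight's actual implementation: the function names and the library path passed in are hypothetical, and only the CUDA_INJECTION64_PATH and LD_PRELOAD variable names come from the description above.

```python
import os
import subprocess
import sys


def build_injection_env(base_env, injection_lib, preload=False):
    """Return a copy of base_env prepared for CUDA injection.

    injection_lib is a hypothetical path to an injection library.
    The caller's environment mapping is never modified.
    """
    env = dict(base_env)
    # The CUDA driver reads this variable when the process first initializes CUDA.
    env["CUDA_INJECTION64_PATH"] = injection_lib
    if preload:
        # Linux only: ask the dynamic loader to load the library at process
        # start, keeping any LD_PRELOAD entries that were already set.
        existing = env.get("LD_PRELOAD", "")
        env["LD_PRELOAD"] = f"{injection_lib}:{existing}" if existing else injection_lib
    return env


def launch(command, injection_lib):
    """Run command as a child process with the injection environment applied."""
    env = build_injection_env(
        os.environ, injection_lib, preload=sys.platform.startswith("linux")
    )
    return subprocess.run(command, env=env).returncode
```

Because only the child's environment is changed, the shell you launch from is left untouched, which is why you normally never have to set the variable by hand.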

The CUDA sample

Now let's look at a CUDA sample application.

The sample is intentionally small. It allocates three vectors, launches a vector-add kernel 50 times, copies the result back, and validates the output.

The runnable tutorial project is here:

gpu-flight/gpufl-tutorial/tutorial-01

Here is the kernel:

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

The CUDA source file is here:

vector_add.cu

This is not meant to be a performance benchmark. It is deliberately predictable so the captured trace is easy to inspect.

Build the sample

On Windows, configure the project with CMake and Visual Studio:

cmake -S . -B build -G "Visual Studio 17 2022" -A x64

CMake configure output for the CUDA vector-add sample on Windows

Then build the Release executable:

cmake --build build --config Release

Release build output for the CUDA vector-add sample on Windows

Run the executable once without profiling:

.\build\Release\gpufl_tutorial_01.exe

Expected output:

Vector add completed successfully: 50 kernel launches, 1048576 elements

At this point, we know the CUDA app works normally. Now we can profile it.

Capture a trace

The capture step uses the gpufl launcher from gpufl-client. If you don't have it yet, see the getting-started guide. Then point a PowerShell variable at the launcher:

$gpufl = "C:\path\to\gpufl-client\build-windows\daemon\launcher\Release\gpufl.exe"

Run the CUDA app through gpufl trace:

& $gpufl trace `
  --name tutorial-01-vector-add `
  --output .\gpufl-logs `
  -- .\build\Release\gpufl_tutorial_01.exe

Running the CUDA sample through gpufl trace on Windows

The command creates a gpufl-logs directory. Inside it, GPUFlight writes a generated session folder.

Generated gpufl-logs directory with one session folder

Inside the session folder, you will see compressed log files, grouped by channel:

gpufl-logs/
  <session-id>/
    device.1.log.gz
    sass.1.log.gz
    scope.1.log.gz
    system.1.log.gz
    system.2.log.gz

Each channel is a separate stream of events:

  • device.* — kernel launches and memory copies: the core GPU activity of the trace.
  • system.* — device-level metrics such as utilization and memory usage, rotated into numbered windows (system.1, system.2, ...) over the run.
  • scope.* — user-defined profiling scopes (the embedded-SDK feature from the earlier posts). An uninstrumented app like this one defines none.
  • sass.* — SASS-level instruction data, which only a deeper profiling mode collects. The default trace does not, so this file is created but stays empty.

GPUFlight writes the standard channel files even when a channel has nothing to record, so an empty sass.1.log.gz on a plain trace is expected, not an error. Together, these files are the local source of truth for the captured run.
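
If you want a quick local look before uploading, a short Python sketch like the one below can summarize a session folder by channel. It assumes only what the listing above shows: files named `<channel>.<n>.log.gz` that are gzip-compressed. The record format inside the logs is not covered in this post, so the sketch measures decompressed size and does not parse events. The function name is hypothetical.

```python
import gzip
from collections import defaultdict
from pathlib import Path


def summarize_session(session_dir):
    """Map each channel name to (file_count, total_decompressed_bytes)."""
    totals = defaultdict(lambda: [0, 0])
    for path in sorted(Path(session_dir).glob("*.log.gz")):
        # "device.1.log.gz" -> channel "device"
        channel = path.name.split(".", 1)[0]
        size = 0
        with gzip.open(path, "rb") as stream:
            # Read in chunks so large logs are not held in memory at once.
            while chunk := stream.read(1 << 16):
                size += len(chunk)
        totals[channel][0] += 1
        totals[channel][1] += size
    return {name: tuple(counts) for name, counts in totals.items()}
```

Calling `summarize_session` on the generated session folder should report two files for the system channel and a zero-byte total for sass on a plain trace, matching the channel descriptions above.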

Upload the trace manually

Open the GPUFlight dashboard:

https://app.gpuflight.com

If you do not have an account yet, register here:

https://app.gpuflight.com/register

Go to Uploads and drag the generated log files into the upload area.

Uploads page with the drop zone for GPUFlight logs

For this run, I selected the compressed log files from the generated session folder.

Selecting compressed GPUFlight log files from the generated session folder

GPUFlight detects the files and shows the upload plan.

Upload page showing one discovered session before upload

During upload, the newest row appears with a Received status while the files stream to the backend.

Upload in progress with received status and progress bar

When all files are sent, the upload panel marks the session complete.

Upload complete with all files sent

The flow is:

  1. Drop the generated log files.
  2. GPUFlight detects the session.
  3. Click the upload button.
  4. The upload row moves from received to completed.
  5. The session appears in Sessions.

Inspect the uploaded session

After processing starts, the session appears on the Sessions page.

Sessions page showing the uploaded tutorial session while it is still processing

Once processing completes, the session is ready to inspect.

Sessions page showing the tutorial session completed

For this vector-add app, the dashboard should show:

  • one GPU
  • 50 kernel launches
  • the vector_add kernel
  • per-launch timing
  • grid and block dimensions
  • occupancy-related values
  • a timeline view showing launches on a wall-clock axis
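
The grid and block dimensions in that list can be sanity-checked with a little arithmetic. The Python sketch below computes how many blocks a one-dimensional launch needs for the sample's 1048576 elements. The block size of 256 threads is an assumption for illustration; check vector_add.cu for the value the tutorial actually uses.

```python
def blocks_needed(n, threads_per_block):
    """Ceiling division: the smallest block count whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


N = 1_048_576            # element count printed by the sample
THREADS_PER_BLOCK = 256  # assumption; the tutorial source may use a different value

grid = blocks_needed(N, THREADS_PER_BLOCK)
spare_threads = grid * THREADS_PER_BLOCK - N
print(grid, spare_threads)  # prints "4096 0"
```

With these numbers the grid covers the vectors exactly. For a size that is not a multiple of the block size, such as 1,000,000 elements, the same formula gives 3907 blocks and 192 spare threads, which is the case the `if (idx < n)` guard in the kernel exists for.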

The kernel view answers: what ran, and how long did it take?

Kernel events view for the uploaded vector-add trace

The timeline answers: when did it happen?

Timeline view for the uploaded vector-add trace

That distinction becomes more important as an application gets more complex. A table is good for ranking kernels by cost. A timeline is better for understanding ordering, gaps, transfers, synchronization, and overlap.

Stream the upload with gpufl-agent

Drag-and-drop upload is useful for the first trace because it makes the generated files visible. For repeated profiling, I usually want the trace command to upload while it runs.

For that, create an API key in the dashboard under Settings > API keys.

API keys settings page in the GPUFlight dashboard

Generate a key for the local uploader.

Generate API key dialog with a tutorial-uploader key name

Copy the generated key immediately. The dashboard only shows the full key once.

Generated API key dialog with copy button

Download the gpufl-agent JAR from the releases page:

https://github.com/gpu-flight/gpufl-agent/releases/download/v1.0.1/gpufl-agent.jar

Then set the upload environment:

$env:GPUFL_BACKEND_URL = "https://api.gpuflight.com"
$env:GPUFL_API_KEY = "gpfl_xxxxxxxxxxxx"
$agentJar = "C:\path\to\gpufl-agent.jar"

Run the trace with upload enabled:

& $gpufl trace `
  --name tutorial-01-vector-add `
  --output .\gpufl-logs `
  --upload `
  --agent-jar $agentJar `
  -- .\build\Release\gpufl_tutorial_01.exe

PowerShell command setting GPUFlight upload environment variables and running gpufl trace with agent upload

The agent tails the generated log files and streams them to the backend.

Streaming upload output from gpufl-agent while gpufl trace finishes

When the command completes, the session is already on its way to the dashboard.

For a local tutorial, starting the agent from gpufl trace is convenient. In a production environment, you may prefer to run the uploader separately, either as a Docker container or as a standalone Java application. That lets gpufl trace focus on generating trace files while a long-running agent watches and uploads them in the background. This can be useful when you do not want the profiled command to wait at the end of the run for every generated file to finish uploading.
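
For repeated profiling runs, it can help to assemble the trace command in a small script instead of retyping it. The Python sketch below is one possible wrapper, not part of GPUFlight: the function names are hypothetical, and it uses only the flags that appear in the commands in this post (--name, --output, --upload, --agent-jar).

```python
import subprocess


def build_trace_command(gpufl, name, output_dir, target, agent_jar=None):
    """Assemble the argument list for one `gpufl trace` run.

    target is the command to profile, given as a list of arguments.
    """
    command = [gpufl, "trace", "--name", name, "--output", output_dir]
    if agent_jar is not None:
        # Streaming upload also expects GPUFL_BACKEND_URL and GPUFL_API_KEY
        # to be set in the environment before the command runs.
        command += ["--upload", "--agent-jar", agent_jar]
    # Everything after "--" is the profiled program and its own arguments.
    return command + ["--", *target]


def run_trace(gpufl, name, output_dir, target, agent_jar=None):
    """Run the assembled command and return its exit code."""
    command = build_trace_command(gpufl, name, output_dir, target, agent_jar)
    return subprocess.run(command).returncode
```

Passing the arguments as a list avoids shell quoting problems when the profiled program has arguments with spaces.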

When to use gpufl trace

Use gpufl trace when:

  • you have an existing CUDA program
  • you can launch it from a terminal
  • you want a first activity trace without changing source code
  • you want to inspect kernel events and a timeline in the dashboard

Use the embedded GPUFlight SDK when:

  • you own the application source
  • you want explicit GFL_SCOPE regions
  • you want GPUFlight initialized directly inside the process
  • you want tighter control over capture boundaries

For the first look at an executable, gpufl trace is the lower-friction path. For richer application context, embedded instrumentation gives more control.
