Last year, I took a GPU programming course at Johns Hopkins University as part of my graduate studies, where I learned CUDA programming. For my final project, I built a lightweight GPU monitoring and profiling tool focused on CUDA.
I enjoyed the process so much that I decided to continue developing it beyond the course.
In this post, I’d like to briefly introduce the project:
GPU Flight — a 100% open-source GPU observability tool
GitHub: https://github.com/gpu-flight/gpufl-client
Why I Started GPU Flight
When profiling a CUDA application, you typically:
- Install profiling tools such as Nsight
- Or manually integrate CUPTI into your application, which often makes the code complex and difficult to manage
- Deal with additional complexity in cloud or containerized environments
This workflow can be inconvenient — especially in production systems.
I wanted something lighter.
Something that works more like a flight recorder for GPUs.
So I built GPU Flight.
Instead of requiring heavy tooling at runtime, GPU Flight writes structured profiling logs directly on the host machine. A separate component (GPUFL Agent) crawls these log files and forwards them to a backend service or other destinations.
This makes GPU observability more flexible and easier to integrate into distributed systems.
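To make the flow concrete, here is a rough sketch of the agent-side crawling idea — not the actual GPUFL Agent code; the file path, record shape, and offset handling are all hypothetical — assuming the client writes newline-delimited JSON records:

```python
import json


def read_new_records(log_path, offset):
    """Read newline-delimited JSON records appended since `offset`.

    Returns (records, new_offset) so a crawler can persist the offset
    and resume after a restart without re-forwarding old records.
    """
    records = []
    with open(log_path, "r") as f:
        f.seek(offset)
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
        new_offset = f.tell()
    return records, new_offset
```

In a real agent this would run periodically, batching the returned records and forwarding them to the configured backend.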
What is GPU Flight?
GPU Flight is designed to be lightweight and modular.
- If you only need monitoring, the overhead is minimal.
- Enabling deeper profiling provides more detailed metrics.
The goal is to expose useful GPU metrics so you can clearly understand:
- How the GPU manages resources
- How your program utilizes GPU resources
- Where performance bottlenecks occur
Project Structure
GPU Flight currently consists of several components:
1️⃣ gpufl-client
https://github.com/gpu-flight/gpufl-client
The client library that users embed into their applications for monitoring and profiling.
2️⃣ gpufl-agent
https://github.com/gpu-flight/gpufl-agent
Despite the name, this is not an AI agent 🙂
It tracks log files and forwards profiling data to the configured destination.
3️⃣ gpufl-desktop
https://github.com/gpu-flight/gpufl-desktop
Originally, I planned to build a desktop viewer.
Due to time constraints, I’m currently focusing on a web-based frontend.
Some repositories are still private because they are not yet production-ready. I plan to open them once the core functionality stabilizes.
What Metrics Does GPU Flight Support?
GPU Flight captures observability data at multiple layers.
1️⃣ System & GPU Monitoring (NVML)
- Host memory usage
- GPU memory usage (used/free/total)
- GPU utilization
- Memory utilization
- Temperature
- Power consumption
- Clock speeds (Graphics / SM / Memory)
- PCIe RX/TX bandwidth
- Power and thermal throttling flags
Example JSON snippet:
{
  "type": "system_sample",
  "util_gpu": 57,
  "temp_c": 39,
  "power_mw": 54415,
  "clk_sm": 1740
}
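The units follow NVML conventions (milliwatts for power, degrees Celsius for temperature, MHz for clocks). A small consumer might normalize a sample like this — field names are taken from the snippet above; the function itself is just an illustrative sketch:

```python
import json


def summarize_sample(line):
    """Convert a raw system_sample JSON line into human-friendly units."""
    s = json.loads(line)
    return {
        "gpu_util_pct": s["util_gpu"],
        "temp_c": s["temp_c"],
        "power_w": s["power_mw"] / 1000.0,  # NVML reports power in milliwatts
        "sm_clock_mhz": s["clk_sm"],
    }


sample = ('{"type": "system_sample", "util_gpu": 57, '
          '"temp_c": 39, "power_mw": 54415, "clk_sm": 1740}')
print(summarize_sample(sample))
```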
2️⃣ CUDA Device Capabilities
Static architectural information:
- Compute capability
- L2 cache size
- Shared memory per block
- Registers per block
- SM count
- Warp size
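These static values are the inputs to occupancy estimates. As a simplified sketch — ignoring register and shared-memory limits, which also cap occupancy in practice — the warp-slot occupancy of an SM can be approximated from the device report's warp size and warp capacity (the default of 48 warps per SM is just a common value used for illustration):

```python
def warp_occupancy(threads_per_block, blocks_per_sm,
                   warp_size=32, max_warps_per_sm=48):
    """Fraction of an SM's warp slots filled by resident blocks.

    warp_size and max_warps_per_sm would come from the device
    capability report; the defaults here are illustrative.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    active_warps = warps_per_block * blocks_per_sm
    return min(1.0, active_warps / max_warps_per_sm)
```

For example, 256-thread blocks occupy 8 warps each, so 6 resident blocks fill all 48 warp slots.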
3️⃣ CUDA API & Kernel Events (CUPTI)
- API enter/exit timestamps
- Kernel execution start/end timestamps
- Grid/block dimensions
- Shared memory usage
- Register usage
- Occupancy
- Correlation IDs
- Memory copy events (HtoD, DtoH)
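Correlation IDs are what tie a runtime API call (such as a kernel launch) to the device-side kernel execution it triggered. A minimal sketch of that pairing — the record fields below are illustrative, not GPU Flight's exact log schema:

```python
def pair_by_correlation(events):
    """Match API records to kernel records sharing a correlation_id,
    and compute each kernel's device-side duration."""
    api = {e["correlation_id"]: e for e in events if e["type"] == "api"}
    pairs = []
    for e in events:
        if e["type"] != "kernel":
            continue
        launch = api.get(e["correlation_id"])
        pairs.append({
            "api_name": launch["name"] if launch else None,
            "kernel": e["name"],
            "duration_ns": e["end_ns"] - e["start_ns"],
        })
    return pairs


events = [
    {"type": "api", "name": "cudaLaunchKernel", "correlation_id": 7},
    {"type": "kernel", "name": "vectorAdd", "correlation_id": 7,
     "start_ns": 1000, "end_ns": 45000},
]
```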
Python Support
GPU Flight is also being extended to support Python applications that use CUDA (e.g., PyTorch).
Example:
https://github.com/gpu-flight/gpufl-client/blob/main/example/python/03_pytorch_benchmark.py
This allows profiling GPU-heavy ML workloads without deeply modifying existing code.
What’s Next?
In the next post, I’ll walk through a minimal CUDA example and show how to:
- Integrate gpufl-client
- Run a kernel
- Inspect generated profiling logs
- Interpret stall reasons and metrics
Thanks for reading — this is just the beginning.