<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Myoungho Shin</title>
    <description>The latest articles on DEV Community by Myoungho Shin (@codinginavan).</description>
    <link>https://dev.to/codinginavan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3788007%2F09e840db-b49e-41e5-9eca-f28d4329e1d4.png</url>
      <title>DEV Community: Myoungho Shin</title>
      <link>https://dev.to/codinginavan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/codinginavan"/>
    <language>en</language>
    <item>
      <title>GPU Flight — System Architecture</title>
      <dc:creator>Myoungho Shin</dc:creator>
      <pubDate>Mon, 16 Mar 2026 05:37:39 +0000</pubDate>
      <link>https://dev.to/codinginavan/gpu-flight-system-architecture-coe</link>
      <guid>https://dev.to/codinginavan/gpu-flight-system-architecture-coe</guid>
      <description>&lt;p&gt;The &lt;a href="https://dev.to/codinginavan/detecting-thread-divergence-with-sass-metrics-and-gpu-flight"&gt;previous post&lt;/a&gt; was about thread divergence at the SASS level. Before I move on to other optimization strategies, I think it makes sense to first review GPU Flight’s architecture. Understanding the overall structure will make the upcoming topics easier to follow, and it should also be helpful for anyone trying to run the full system locally or in the cloud.&lt;/p&gt;

&lt;p&gt;In this post, I want to step back and show the bigger picture: what each component does, why I separated them this way, and what kind of deployment setup makes the most sense in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Picture
&lt;/h2&gt;

&lt;p&gt;This is how all the components fit together:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff9c1ue5x80md164g2mxa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff9c1ue5x80md164g2mxa.png" alt=" " width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gpufl-client&lt;/strong&gt; — runs on the GPU machine, hooks into CUDA activity, and writes structured logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gpufl-agent&lt;/strong&gt; — watches those logs and sends them to the backend through HTTP or Kafka&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gpufl-backend&lt;/strong&gt; — receives the events, stores them, and provides APIs for querying&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gpufl-front&lt;/strong&gt; — React UI for browsing sessions and checking GPU behavior&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  gpufl-client — Capturing GPU Events
&lt;/h2&gt;

&lt;p&gt;The client is a C++ library, also available through the &lt;code&gt;gpufl&lt;/code&gt; Python package. You link it into your application, call &lt;code&gt;gpufl::init()&lt;/code&gt; when the program starts, and from there GPU Flight begins collecting GPU-related events automatically.&lt;/p&gt;

&lt;p&gt;It can capture things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kernel launches&lt;/strong&gt; — kernel name, grid/block dimensions, register count, shared memory, occupancy, timing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory copies&lt;/strong&gt; — copy direction, size, duration, throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SASS metrics&lt;/strong&gt; — warp-level per-instruction counters when the SassMetrics engine is enabled&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PC samples&lt;/strong&gt; — warp stall reason distribution when the PcSampling engine is enabled&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System metrics&lt;/strong&gt; — periodic NVML snapshots such as GPU utilization, VRAM usage, temperature, power draw, and fan speed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are written into three JSONL log files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;prefix&amp;gt;.device.log   ← kernel launches, memory copies, timing
&amp;lt;prefix&amp;gt;.scope.log    ← GFL_SCOPE blocks with optional SASS/PC data
&amp;lt;prefix&amp;gt;.system.log   ← periodic NVML snapshots
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log rotation is handled automatically. The agent tracks file offsets so events are never re-sent across rotations.&lt;/p&gt;
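&lt;p&gt;The offset-tracking idea is easy to sketch. This is not the agent's actual code, just a minimal Python illustration of resuming from a persisted byte offset so a restart never re-sends events (file names and fields are hypothetical, and rotation handling is omitted):&lt;/p&gt;

```python
import json
import os

def tail_jsonl(path, cursor_path):
    """Read new JSONL events from `path`, resuming at the saved byte offset."""
    offset = 0
    if os.path.exists(cursor_path):
        offset = int(open(cursor_path).read() or 0)
    events = []
    with open(path, "r") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line:
                break
            if not line.endswith("\n"):
                break  # partial write in progress; pick it up on the next pass
            events.append(json.loads(line))
            offset = f.tell()
    with open(cursor_path, "w") as f:
        f.write(str(offset))  # persist the cursor so a restart resumes here
    return events
```

Calling this in a loop yields each event exactly once; the real agent additionally follows renamed files across rotations.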




&lt;h2&gt;
  
  
  gpufl-agent — The Flexible Shipper
&lt;/h2&gt;

&lt;p&gt;This part is important because it reflects one of the main design choices in GPU Flight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The client just writes files.&lt;/strong&gt; It does not know anything about HTTP, Kafka, or where the data is going. That responsibility belongs entirely to the agent.&lt;/p&gt;

&lt;p&gt;The agent is a separate process that tails the log files and forwards the events to the backend. Because it is separated from the client, it can run anywhere as long as it can access the log folder.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────┐
│         GPU Machine          │
│                              │
│  gpufl-client → .log files   │
│                    │         │
│             gpufl-agent      │   ← simplest: agent on the same machine
└────────────────────┼─────────┘
                     │  (or NFS / shared volume)
                     ▼
              gpufl-agent           ← or: agent on a separate machine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wanted to keep profiling logic as isolated as possible. Once the client is mixed with transport logic, the instrumentation library becomes harder to maintain and harder to use in different environments. By splitting them, I can change how data is shipped without changing the application binary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tails all three log streams concurrently&lt;/li&gt;
&lt;li&gt;Tracks file positions via a cursor file — survives restarts without re-sending events&lt;/li&gt;
&lt;li&gt;Follows log rotation, picking up renamed files automatically&lt;/li&gt;
&lt;li&gt;Publishes events via HTTP or Kafka&lt;/li&gt;
&lt;li&gt;Optionally archives the compressed raw log files to S3-compatible object storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last part is also useful. The raw &lt;code&gt;.log&lt;/code&gt; files are the full original event stream. Keeping them in object storage helps later if you want long-term retention, auditability, or the ability to reprocess old runs with updated analysis logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration examples:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# HTTP — simplest, good for dev or small deployments&lt;/span&gt;
gpufl-agent &lt;span class="nt"&gt;--folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/log/gpuflight &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://backend:8080/api/v1/events/

&lt;span class="c"&gt;# Kafka — recommended for production&lt;/span&gt;
gpufl-agent &lt;span class="nt"&gt;--folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/log/gpuflight &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;kafka &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--brokers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;kafka:9092

&lt;span class="c"&gt;# Kafka + S3 archival&lt;/span&gt;
gpufl-agent &lt;span class="nt"&gt;--folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/log/gpuflight &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;kafka &lt;span class="nt"&gt;--brokers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;kafka:9092 &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--archiver-endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://s3.amazonaws.com &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--archiver-bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpu-traces &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--archiver-access-key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;KEY &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--archiver-secret-key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;SECRET
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why a message queue in production?
&lt;/h3&gt;

&lt;p&gt;When multiple GPU machines are sending data simultaneously, publishing directly to the backend via HTTP couples the producer rate to the backend's processing speed. If the backend is slow, restarting, or temporarily down, events back up or are lost.&lt;/p&gt;

&lt;p&gt;A message queue decouples the two sides:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  GPU node 1 ──┐
  GPU node 2 ──┼── gpufl-agent ──► Kafka ──► gpufl-backend
  GPU node 3 ──┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Buffering&lt;/strong&gt; — Kafka holds events if the backend restarts or lags behind&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Burst handling&lt;/strong&gt; — GPU workloads produce spiky event rates; Kafka absorbs the peaks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple consumers&lt;/strong&gt; — other services can read from the same topic independently: alerting, analytics, archival&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a single machine or a development setup, HTTP is simpler and perfectly adequate. The agent supports both modes with a single flag.&lt;/p&gt;
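&lt;p&gt;A toy model makes the buffering argument concrete. The numbers here are illustrative, not measurements: a queue absorbs a spiky producer while the backend drains at a steady rate:&lt;/p&gt;

```python
from collections import deque

def simulate(bursts, drain_rate):
    """Toy model: a queue (standing in for Kafka) absorbs spiky producer
    bursts while the backend drains at a steady rate per tick.
    Returns (events_delivered, peak_backlog)."""
    queue = deque()
    delivered = 0
    peak_backlog = 0
    for burst in bursts:
        queue.extend([None] * burst)      # events arriving this tick
        for _ in range(drain_rate):       # backend consumes at its own pace
            if queue:
                queue.popleft()
                delivered += 1
        peak_backlog = max(peak_backlog, len(queue))
    return delivered, peak_backlog

# A burst of 100 events against a backend that handles 30 per tick:
# nothing is lost, the backlog peaks at 70 and drains over later ticks.
delivered, peak = simulate([100, 0, 0, 0], drain_rate=30)
```

With direct HTTP there is no backlog to drain into: the same burst either blocks the producer or drops the 70 events the backend cannot absorb.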




&lt;h2&gt;
  
  
  gpufl-backend — Ingestion and API
&lt;/h2&gt;

&lt;p&gt;The backend is a Spring Boot REST service. It has two responsibilities: ingesting events and answering queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingestion&lt;/strong&gt; happens via &lt;code&gt;POST /api/v1/events/{eventType}&lt;/code&gt;, where &lt;code&gt;eventType&lt;/code&gt; is &lt;code&gt;device&lt;/code&gt;, &lt;code&gt;scope&lt;/code&gt;, or &lt;code&gt;system&lt;/code&gt;. In a Kafka deployment, a consumer bridge reads from the topic and calls the same endpoint internally.&lt;/p&gt;
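&lt;p&gt;For anything that publishes to this API, the routing is just the event type in the path. A minimal sketch (the base URL is an example, and the helper itself is illustrative, not part of GPU Flight):&lt;/p&gt;

```python
VALID_EVENT_TYPES = {"device", "scope", "system"}

def ingest_url(base, event_type):
    """Build the ingestion endpoint for one of the three event streams."""
    if event_type not in VALID_EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    return f"{base.rstrip('/')}/api/v1/events/{event_type}"
```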

&lt;p&gt;&lt;strong&gt;The three event types:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;device&lt;/code&gt; — kernel launches, memory copies, precise timing from CUPTI&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scope&lt;/code&gt; — &lt;code&gt;GFL_SCOPE&lt;/code&gt; timing blocks, optionally enriched with SASS metrics and PC samples&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;system&lt;/code&gt; — periodic NVML snapshots (GPU util%, VRAM, temperature, power draw, fan speed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Query API&lt;/strong&gt; — sessions list, system metrics over time, kernel events per session, profile samples per scope. The frontend is built entirely on top of this API; there's no special internal interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auth&lt;/strong&gt; — Bearer tokens for direct HTTP ingestion from the agent; API keys for programmatic access from the frontend or external tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data retention&lt;/strong&gt; — configurable cleanup policy so disk usage stays bounded. Old sessions are pruned automatically once they exceed the configured retention window.&lt;/p&gt;
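&lt;p&gt;The pruning rule itself is simple. A sketch of the policy in Python (the function and field names are illustrative, not the backend's actual code):&lt;/p&gt;

```python
from datetime import datetime, timedelta

def prune_sessions(sessions, retention_days, now=None):
    """Return the session ids to delete: those whose last activity falls
    outside the configured retention window.
    `sessions` maps session id to its last-seen datetime."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=retention_days)
    return [sid for sid, last_seen in sessions.items() if cutoff > last_seen]
```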




&lt;h2&gt;
  
  
  gpufl-front — The Dashboard
&lt;/h2&gt;

&lt;p&gt;The frontend is a React + TypeScript SPA with a dark theme, connected to the backend REST API.&lt;/p&gt;

&lt;p&gt;The main view is a &lt;strong&gt;three-tab dashboard&lt;/strong&gt; scoped to a single profiling session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────┐
│  Session: my_training_run  [host: gpu-node-01]     │
├──────────────┬─────────────┬───────────────────────┤
│   Kernels    │   Scopes    │        System         │
└──────────────┴─────────────┴───────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kernels tab&lt;/strong&gt; — timeline of kernel launches. Click any kernel to open the Inspector: occupancy, register count, shared memory, grid and block dimensions, and the CPU call stack that triggered the launch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scopes tab&lt;/strong&gt; — hierarchical tree of &lt;code&gt;GFL_SCOPE&lt;/code&gt; blocks with timing breakdown. For sessions using the SASS Metrics engine, each scope shows the divergence data from the previous post.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;System tab&lt;/strong&gt; — time-series charts for GPU util%, VRAM usage, temperature, power draw, and fan speed across the full session duration.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The session selector lets you browse by host and application name, so you can compare runs across machines or across time.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Next post&lt;/strong&gt; is a local setup guide: a Docker Compose file that brings up the full stack — client example, agent, backend, and frontend — on a single machine. If you want to try GPU Flight without configuring anything manually, that post will cover it end-to-end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud demo coming soon.&lt;/strong&gt; I'm currently working on a live deployment. Once it's up, I'll post a link so you can browse real GPU profiling data in the UI without setting up anything yourself.&lt;/p&gt;

</description>
      <category>cuda</category>
      <category>gpu</category>
      <category>performance</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Detecting Thread Divergence with SASS Metrics and GPU Flight</title>
      <dc:creator>Myoungho Shin</dc:creator>
      <pubDate>Tue, 10 Mar 2026 06:54:51 +0000</pubDate>
      <link>https://dev.to/codinginavan/detecting-thread-divergence-with-sass-metrics-and-gpu-flight-kfc</link>
      <guid>https://dev.to/codinginavan/detecting-thread-divergence-with-sass-metrics-and-gpu-flight-kfc</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/codinginavan/profiling-gpu-cuda-getting-started-with-gpu-flights-python-package-1pl8"&gt;previous post&lt;/a&gt; I showed how to set up GPU Flight with Python and read kernel-level profiling data — occupancy, register counts, and resource bottlenecks. That tells you &lt;em&gt;how well&lt;/em&gt; a kernel uses the hardware. But it doesn't tell you what's happening &lt;em&gt;inside&lt;/em&gt; the kernel.&lt;/p&gt;

&lt;p&gt;Today I want to look at one specific problem: &lt;strong&gt;thread divergence&lt;/strong&gt;. When threads within a warp take different code paths, the GPU serializes execution — it runs one branch, then the other, while idle threads wait. If half the threads branch left and half branch right, you're running at 50% efficiency on those instructions.&lt;/p&gt;

&lt;p&gt;GPU Flight's &lt;strong&gt;SASS Metrics engine&lt;/strong&gt; gives you a direct way to measure this. It instruments the GPU at the assembly (SASS) level and reports two key counters per instruction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;smsp__sass_inst_executed&lt;/code&gt;&lt;/strong&gt; — the number of warp-level instruction executions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;smsp__sass_thread_inst_executed&lt;/code&gt;&lt;/strong&gt; — the total number of thread-level instruction executions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ratio &lt;code&gt;thread_executed / inst_executed&lt;/code&gt; tells you the average number of active threads per warp at each instruction. If it's 32.0, every thread was active. If it's 16.0, half were diverged. If it's 8.0, only a quarter were doing useful work.&lt;/p&gt;
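&lt;p&gt;In code, the efficiency calculation is a one-liner. The counter names match the SASS metrics above; the function itself is just an illustration:&lt;/p&gt;

```python
def warp_efficiency(inst_executed, thread_inst_executed, warp_size=32):
    """Average active threads per warp for one SASS instruction, plus the
    fraction of the warp doing useful work."""
    avg_active = thread_inst_executed / inst_executed
    return avg_active, avg_active / warp_size
```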




&lt;h2&gt;
  
  
  The Demo: Five Divergence Patterns
&lt;/h2&gt;

&lt;p&gt;I wrote a small CUDA program with five kernels, each demonstrating a different divergence pattern. The full source is in the GPU Flight repo at &lt;code&gt;example/cuda/sass_divergence_demo.cu&lt;/code&gt;. Here's a summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kernel&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Expected Active Threads&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;uniformWork&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No divergence (baseline)&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;branchByWarpLane&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;if (threadIdx.x % 2)&lt;/code&gt; — even/odd split&lt;/td&gt;
&lt;td&gt;16 in each branch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;branchByWarpQuad&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;if (threadIdx.x % 4 == 0)&lt;/code&gt; — 1-in-4&lt;/td&gt;
&lt;td&gt;8 in hot path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;earlyExit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Data-dependent early return&lt;/td&gt;
&lt;td&gt;Varies (~16)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;indirectBranch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4-way switch on random data&lt;/td&gt;
&lt;td&gt;Varies (~8)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each kernel is wrapped in a &lt;code&gt;GFL_SCOPE&lt;/code&gt; so GPU Flight can attribute the SASS metrics to the right section.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kernel 1: Uniform Work (Baseline)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;uniformWork&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.01&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every thread does the same thing. No branches inside the loop, no divergence. This is the baseline — you should see &lt;code&gt;thread_executed / inst_executed&lt;/code&gt; close to 32 for the loop body instructions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kernel 2: Even/Odd Divergence
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;branchByWarpLane&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.01&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the classic divergence example. Within every warp, 16 threads go left, 16 go right. The GPU executes both paths sequentially with half the threads masked off each time. The SASS metrics will show ~16 active threads for instructions inside each branch.&lt;/p&gt;
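&lt;p&gt;The arithmetic behind that ~16 is worth spelling out. A schematic Python sketch (counting one instruction per loop iteration is an assumption for illustration, not a real SASS instruction count):&lt;/p&gt;

```python
# Toy bookkeeping for one warp running branchByWarpLane's branch bodies.
# With divergence, the warp issues BOTH 512-iteration loops, with half
# of its 32 lanes masked off each time.

LOOP_ITERS = 512
WARP_SIZE = 32

inst_executed = 2 * LOOP_ITERS              # both branch bodies, serialized
thread_inst_executed = inst_executed * 16   # 16 active lanes per issue

avg_active = thread_inst_executed / inst_executed
# avg_active comes out to 16.0: every issued instruction carried only
# half a warp's worth of useful work.
```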

&lt;h3&gt;
  
  
  Kernel 3: Quad Divergence
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;branchByWarpQuad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.001&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.0001&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only every 4th thread enters the loop. That's 8 out of 32 threads doing the heavy work while 24 sit idle. Worse than 50/50 — 75% of the warp is wasted during the loop body.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kernel 4: Early Exit
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;earlyExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.01&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.005&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is data-dependent. Threads whose input is below the threshold return early, while the rest do the expensive computation. With random inputs in [0, 1) and a threshold of 0.5, roughly half the threads will exit early. But unlike Kernel 2, the split isn't uniform across warps — some warps might have 20 threads exit, others might have 10.&lt;/p&gt;
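&lt;p&gt;The per-warp spread is easy to reproduce with a quick host-side simulation. The following is a plain Python sketch, not part of GPU Flight: it draws random inputs, groups them into warps of 32, and counts how many threads in each warp would survive the threshold and run the expensive loop:&lt;/p&gt;

```python
import random

random.seed(42)
WARP = 32
vals = [random.random() for _ in range(1024 * WARP)]  # inputs in [0, 1)
threshold = 0.5

# Threads with val >= threshold stay active for the expensive loop;
# the rest return early and sit idle for the remaining instructions.
survivors = [
    sum(1 for v in vals[w * WARP:(w + 1) * WARP] if v >= threshold)
    for w in range(len(vals) // WARP)
]

avg = sum(survivors) / len(survivors)
print(f"avg active threads in loop: {avg:.1f}")
print(f"min/max per warp: {min(survivors)}/{max(survivors)}")
```

&lt;p&gt;The average lands near 16, matching the 50% threshold, but individual warps range widely in either direction, which is exactly the non-uniform split described above: the average can look fine while the worst warps diverge heavily.&lt;/p&gt;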

&lt;h3&gt;
  
  
  Kernel 5: Data-Dependent Switch
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;indirectBranch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;4.0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.01&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.005&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A 4-way branch driven by random data. On average, each case gets ~8 threads per warp, but the GPU must execute all 4 paths sequentially. This is the worst case — 4x the instruction count for the branch body.&lt;/p&gt;
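&lt;p&gt;A rough way to see why all four paths run in nearly every warp: with uniformly random inputs, a 32-thread warp almost always contains every category. This small Python sketch (illustrative only, mirroring the kernel's category computation) counts distinct categories per warp:&lt;/p&gt;

```python
import random

random.seed(0)
WARP = 32
vals = [random.random() for _ in range(1024 * WARP)]

# Same mapping as the kernel: category = (int)(val * 4.0f) % 4
cats = [int(v * 4.0) % 4 for v in vals]

distinct_per_warp = [
    len(set(cats[w * WARP:(w + 1) * WARP]))
    for w in range(len(cats) // WARP)
]
all_four = sum(1 for d in distinct_per_warp if d == 4)
print(f"{all_four}/{len(distinct_per_warp)} warps contain all 4 categories")
```

&lt;p&gt;The probability that a warp misses a category is about 4 &amp;times; (3/4)&lt;sup&gt;32&lt;/sup&gt; &amp;asymp; 0.0004, so essentially every warp must serialize all four loop bodies.&lt;/p&gt;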




&lt;h2&gt;
  
  
  Running the Demo
&lt;/h2&gt;

&lt;p&gt;Build and run from the GPU Flight repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/gpu-flight/gpufl-client.git
&lt;span class="nb"&gt;cd &lt;/span&gt;gpufl-client
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--target&lt;/span&gt; sass_divergence_demo
./build/example/cuda/sass_divergence_demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key part is in &lt;code&gt;main()&lt;/code&gt; — initializing GPU Flight with the SASS Metrics engine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;InitOptions&lt;/span&gt; &lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"sass_divergence_demo"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"sass_divergence"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enable_kernel_details&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sampling_auto_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;profiling_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ProfilingEngine&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SassMetrics&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;profiling_engine&lt;/code&gt; to &lt;code&gt;SassMetrics&lt;/code&gt; tells GPU Flight to instrument every kernel at the SASS level. Each &lt;code&gt;GFL_SCOPE&lt;/code&gt; block then collects per-instruction counters for the kernels launched inside it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results: RTX 3090
&lt;/h2&gt;

&lt;p&gt;Here's what I got running on an NVIDIA GeForce RTX 3090 (Ampere, SM 8.6, 82 SMs) with 1M elements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kernel                    Weighted Avg Active Threads    Instructions
----------------------------------------------------------------------
uniformWork                                      32.0             277
branchByWarpLane                                 16.3             796
branchByWarpQuad                                  8.2             281
earlyExit                                        16.2             280
indirectBranch                                    1.5            1062
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "Weighted Avg Active Threads" is &lt;code&gt;thread_inst_executed / inst_executed&lt;/code&gt; across all SASS instructions in each kernel, weighted by execution count. "Instructions" is the number of unique PC offsets (SASS instructions) instrumented.&lt;/p&gt;

&lt;p&gt;Let's walk through what this tells us:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;uniformWork&lt;/code&gt; — 32.0 active threads.&lt;/strong&gt; Perfect. Every warp runs at full width. This is the expected baseline for a kernel with no divergence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;branchByWarpLane&lt;/code&gt; — 16.3 active threads.&lt;/strong&gt; Very close to the theoretical 16. The slight overshoot comes from instructions outside the branch (the &lt;code&gt;if (idx &amp;lt; n)&lt;/code&gt; guard, loop control, and the final store) where all 32 threads are active. The 796 unique instructions — nearly 3x the baseline — show the cost: the compiler generates separate code for each branch, and both paths must be executed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;branchByWarpQuad&lt;/code&gt; — 8.2 active threads.&lt;/strong&gt; Again close to the theoretical 8 (only 1 in 4 threads enters the loop). Similar instruction count to the baseline since there's only one branch path — but every instruction in the hot loop runs with 75% of threads idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;earlyExit&lt;/code&gt; — 16.2 active threads.&lt;/strong&gt; Matches the expectation for a 50% threshold with random data. Threads that exit early become inactive for the remaining instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;indirectBranch&lt;/code&gt; — 1.5 active threads, 1062 instructions.&lt;/strong&gt; This is the most striking result. A 4-way switch on random data drops the weighted average to just 1.5 active threads per warp — far worse than the other kernels. It also generates the highest instruction count at 1062, nearly 4x the baseline. This is a crucial insight: &lt;strong&gt;divergence doesn't just halve your throughput — multi-way branching on random data can drop you below 5%&lt;/strong&gt; when measured at the instruction level.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means in Practice
&lt;/h2&gt;

&lt;p&gt;Thread divergence is easy to create and hard to notice. Your kernel still produces correct results. But you might be leaving 50-95% of your GPU's compute on the table.&lt;/p&gt;

&lt;p&gt;Here are the common patterns to watch for:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lane-based branching&lt;/strong&gt; — &lt;code&gt;if (threadIdx.x % N)&lt;/code&gt;. This is almost always unintentional. Consider rearranging your data so that threads within a warp take the same path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data-dependent branches&lt;/strong&gt; — like the &lt;code&gt;earlyExit&lt;/code&gt; kernel. If your input distribution is skewed, some warps diverge heavily while others don't. The average might look okay, but the worst warps are bottlenecks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Switch statements on computed values&lt;/strong&gt; — like &lt;code&gt;indirectBranch&lt;/code&gt;. This was the worst offender in our test — each additional case multiplies the predicated instruction overhead.&lt;/p&gt;

&lt;p&gt;The fix depends on the situation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sort or bin your data&lt;/strong&gt; so threads in the same warp hit the same branch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replace branches with predicated arithmetic&lt;/strong&gt; — branchless code runs all threads at full width&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Restructure your algorithm&lt;/strong&gt; so the branch happens at the warp or block level, not the thread level&lt;/li&gt;
&lt;/ul&gt;
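&lt;p&gt;The first fix can be sketched quickly. Sorting makes each warp's threshold decision uniform: before sorting, nearly every warp mixes both branch outcomes; after sorting, at most one warp straddles the threshold. A host-side Python illustration (using the &lt;code&gt;earlyExit&lt;/code&gt; threshold as the branch condition):&lt;/p&gt;

```python
import random

random.seed(1)
WARP = 32
vals = [random.random() for _ in range(256 * WARP)]
threshold = 0.5

def mixed_warps(data):
    """Count warps whose threads would take different branches."""
    count = 0
    for w in range(len(data) // WARP):
        warp = data[w * WARP:(w + 1) * WARP]
        below = sum(1 for v in warp if v < threshold)
        if 0 < below < WARP:  # warp contains both outcomes
            count += 1
    return count

print(f"mixed warps before sorting: {mixed_warps(vals)}")
print(f"mixed warps after sorting:  {mixed_warps(sorted(vals))}")
```

&lt;p&gt;The same idea applies to the 4-way switch: binning elements by category before launch lets each warp run a single case at full width. Whether the sort pays for itself depends on how expensive the divergent branch body is.&lt;/p&gt;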




</description>
      <category>gpu</category>
      <category>cpp</category>
      <category>cuda</category>
      <category>performance</category>
    </item>
    <item>
      <title>Profiling GPU (CUDA) — Getting Started with GPU Flight's Python Package</title>
      <dc:creator>Myoungho Shin</dc:creator>
      <pubDate>Mon, 09 Mar 2026 03:59:53 +0000</pubDate>
      <link>https://dev.to/codinginavan/profiling-gpu-cuda-getting-started-with-gpu-flights-python-package-1pl8</link>
      <guid>https://dev.to/codinginavan/profiling-gpu-cuda-getting-started-with-gpu-flights-python-package-1pl8</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/codinginavan/profiling-gpu-cuda-what-is-actually-limiting-your-kernel-211e"&gt;previous posts&lt;/a&gt; I've been showing how to investigate GPU occupancy utilization and optimize kernels that aren't using the hardware fully. That was just one case — I'll cover more occupancy scenarios in future posts.&lt;/p&gt;

&lt;p&gt;Today, I want to go through how to use GPU Flight in Python, especially with PyTorch. Since GPU Flight is still in active development, the current version is &lt;code&gt;v0.1.0.dev7&lt;/code&gt;. You can install it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;gpufl&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.1.0.dev7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, I highly recommend &lt;strong&gt;building from source inside a CUDA container&lt;/strong&gt;. There are two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prerequisite libraries&lt;/strong&gt; — GPU Flight's backend needs CUPTI, the CUDA runtime, and NVML headers at compile time. Getting these right on a bare system is fiddly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVML support&lt;/strong&gt; — the pre-built PyPI wheel is compiled in a minimal CI environment that doesn't include NVML stubs. This means the wheel works for kernel profiling, but can't collect runtime GPU utilization or VRAM usage. Building from source inside the &lt;code&gt;nvidia/cuda:*-devel&lt;/code&gt; image picks up NVML automatically.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this post, I'll show how to use Docker to set up an environment that's ready to go — with GPU Flight built from source, PyTorch, and Jupyter Lab all pre-installed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dockerfile
&lt;/h2&gt;

&lt;p&gt;Here's the full Dockerfile. It's straightforward — CUDA 13.1 base, PyTorch, GPU Flight, and Jupyter Lab:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; nvidia/cuda:13.1.0-devel-ubuntu24.04&lt;/span&gt;

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; DEBIAN_FRONTEND=noninteractive&lt;/span&gt;

&lt;span class="c"&gt;# System dependencies (Ubuntu 24.04 ships Python 3.12)&lt;/span&gt;
&lt;span class="c"&gt;# NOTE: cmake/ninja come from pip (build-system.requires needs &amp;gt;=3.31, apt has 3.28)&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    python3 &lt;span class="se"&gt;\
&lt;/span&gt;    python3-venv &lt;span class="se"&gt;\
&lt;/span&gt;    python3-dev &lt;span class="se"&gt;\
&lt;/span&gt;    python3-pip &lt;span class="se"&gt;\
&lt;/span&gt;    git &lt;span class="se"&gt;\
&lt;/span&gt;    curl &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="c"&gt;# Create venv to avoid PEP 668 issues&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv /opt/venv
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="/opt/venv/bin:$PATH"&lt;/span&gt;

&lt;span class="c"&gt;# Upgrade pip&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip

&lt;span class="c"&gt;# Install PyTorch with CUDA 13.1 support&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;torch &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu130

&lt;span class="c"&gt;# Build gpufl from source so it picks up NVML from the CUDA devel image&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; GPUFL_VERSION=main&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 &lt;span class="nt"&gt;--branch&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GPUFL_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;        https://github.com/gpu-flight/gpufl-client.git /tmp/gpufl-client &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;CMAKE_ARGS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-DBUILD_TESTING=OFF"&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;       pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"/tmp/gpufl-client[analyzer,viz]"&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /tmp/gpufl-client

&lt;span class="c"&gt;# Install Jupyter&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;jupyterlab

&lt;span class="c"&gt;# Working directory for notebooks&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /workspace&lt;/span&gt;

&lt;span class="c"&gt;# Expose Jupyter port&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8888&lt;/span&gt;

&lt;span class="c"&gt;# Start Jupyter Lab&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", \&lt;/span&gt;
     "--allow-root", "--NotebookApp.token=''", "--NotebookApp.password=''"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; — ships Python 3.12 natively, which is what GPU Flight requires. No PPA hacks needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;devel&lt;/code&gt; image&lt;/strong&gt; — we use &lt;code&gt;nvidia/cuda:13.1.0-devel-ubuntu24.04&lt;/code&gt; because the &lt;code&gt;devel&lt;/code&gt; variant includes CUPTI, CUDA headers, and &lt;strong&gt;NVML stubs&lt;/strong&gt; that GPU Flight's backend needs at compile time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building from source&lt;/strong&gt; — we clone the repo and build with &lt;code&gt;pip install&lt;/code&gt; rather than using the pre-built PyPI wheel. This is important: the &lt;code&gt;devel&lt;/code&gt; image has NVML stubs at &lt;code&gt;/usr/local/cuda/lib64/stubs/libnvidia-ml.so&lt;/code&gt;, so CMake detects them and compiles in the NVML collector. The pre-built wheel doesn't have this, which means no GPU utilization or VRAM monitoring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyTorch &lt;code&gt;cu130&lt;/code&gt;&lt;/strong&gt; — at the time of writing, PyTorch doesn't publish a &lt;code&gt;cu131&lt;/code&gt; wheel yet. The &lt;code&gt;cu130&lt;/code&gt; build is forward-compatible with the CUDA 13.1 runtime in the container, so this works fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No token&lt;/strong&gt; — Jupyter starts without authentication. This is fine for local development; don't expose this to the internet.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Building and Running
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;You need two things on your host machine:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; — any recent version&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA Container Toolkit&lt;/strong&gt; — this lets Docker containers access your GPU&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Having an NVIDIA driver installed on your host is not enough. Docker doesn't know how to talk to your GPU on its own — you need the NVIDIA Container Toolkit to bridge that gap. Without it, &lt;code&gt;--gpus all&lt;/code&gt; will fail with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check if it's already installed by running &lt;code&gt;nvidia-ctk --version&lt;/code&gt;. If not, here's how to set it up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add the NVIDIA container toolkit repo&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://nvidia.github.io/libnvidia-container/gpgkey &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sudo &lt;/span&gt;gpg &lt;span class="nt"&gt;--dearmor&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-L&lt;/span&gt; https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/nvidia-container-toolkit.list

&lt;span class="c"&gt;# Install and configure&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nvidia-container-toolkit
&lt;span class="nb"&gt;sudo &lt;/span&gt;nvidia-ctk runtime configure &lt;span class="nt"&gt;--runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last &lt;code&gt;systemctl restart&lt;/code&gt; is easy to forget — Docker needs to be restarted after the runtime is configured, or it won't pick up the new GPU capability.&lt;/p&gt;

&lt;p&gt;You can verify it worked with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--gpus&lt;/span&gt; all nvidia/cuda:13.1.0-base-ubuntu24.04 nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see your GPU listed, you're good to go.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build the Image
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; gpufl-python &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will take a few minutes the first time — mostly downloading PyTorch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run the Container
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="nt"&gt;-p&lt;/span&gt; 8888:8888 &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/notebooks:/workspace gpufl-python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breaking that down:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--gpus all&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Passes all GPUs into the container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-p 8888:8888&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Maps Jupyter's port to your host&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-v $(pwd)/notebooks:/workspace&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Mounts a local folder so your notebooks persist&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Connect
&lt;/h3&gt;

&lt;p&gt;Open your browser and go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8888
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll land in Jupyter Lab with GPU Flight, PyTorch, and a CUDA-capable GPU ready to go.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Smoke Test
&lt;/h2&gt;

&lt;p&gt;Create a new notebook and run this to verify everything is working:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;gpufl&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;gpufl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProfilingEngine&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PyTorch: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CUDA available: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Device: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_device_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize GPU Flight
&lt;/span&gt;&lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;smoke-test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./smoke_test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;sampling_auto_start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;enable_kernel_details&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;enable_stack_trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;profiling_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ProfilingEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RangeProfiler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run a simple operation
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Scope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RandomGeneration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Scope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a @ b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPU Flight logs written!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this, you should see &lt;code&gt;*.log&lt;/code&gt; files in your working directory. These are your GPU Flight recordings — every kernel launch, memory copy, and timing event that happened during that matrix multiply.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9h1s6nst0bym86n8g6bs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9h1s6nst0bym86n8g6bs.png" alt="Running Test"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Analyzing the Results
&lt;/h2&gt;

&lt;p&gt;GPU Flight's Python analyzer can load those logs directly in the notebook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;gpufl.analyzer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GpuFlightSession&lt;/span&gt;

&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GpuFlightSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;smoke_test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;GpuFlightSession&lt;/code&gt; takes two main arguments: the directory where logs live, and the &lt;code&gt;log_prefix&lt;/code&gt; matching your &lt;code&gt;log_path&lt;/code&gt; from init. It automatically finds and loads &lt;code&gt;smoke_test.device.log&lt;/code&gt;, &lt;code&gt;smoke_test.scope.log&lt;/code&gt;, and &lt;code&gt;smoke_test.system.log&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;print_summary()&lt;/code&gt; gives you a quick dashboard — total duration, kernel count, GPU busy time, average utilization, and peak VRAM.&lt;/p&gt;

&lt;p&gt;Now let's look at the kernel hotspots:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inspect_hotspots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a Rich-formatted table of your hottest kernels with occupancy, register usage, shared memory, and the per-resource occupancy breakdown showing exactly what's limiting each kernel.&lt;/p&gt;

&lt;p&gt;Here's what that actually looks like — this is real output from the matrix multiply we just ran:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbh2ra05j0eh83xg0dym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbh2ra05j0eh83xg0dym.png" alt="Analyzing Logs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, about that 33.3% occupancy. That sounds bad, right? Let's break it down.&lt;/p&gt;

&lt;p&gt;The kernel is &lt;code&gt;ampere_sgemm_128x64_nn&lt;/code&gt; — cuBLAS's single-precision matrix multiply. It uses &lt;strong&gt;122 registers per thread&lt;/strong&gt;. That's a lot. Let's trace through what happens on an Ampere SM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;128 threads per block&lt;/strong&gt; = 4 warps per block&lt;/li&gt;
&lt;li&gt;122 regs/thread × 32 threads/warp = 3,904 → rounded up to the hardware allocation granularity of 256 → &lt;strong&gt;4,096 regs/warp&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;4 warps × 4,096 = &lt;strong&gt;16,384 registers per block&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;An Ampere SM has 65,536 registers total → 65,536 / 16,384 = &lt;strong&gt;4 blocks max&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;4 blocks × 4 warps = 16 active warps out of 48 max = &lt;strong&gt;33.3%&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The breakdown confirms it: &lt;code&gt;reg 33.3%&lt;/code&gt; is the bottleneck, while shared memory (66.7%), warps (100%), and block count (100%) all have headroom.&lt;/p&gt;
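&lt;p&gt;If you want to sanity-check that arithmetic yourself, it fits in a few lines of Python. This is only a sketch of the calculation above; the 65,536-register file, 48-warp budget, and 256-register allocation granularity are the Ampere figures assumed in this post, not values queried from a device:&lt;/p&gt;

```python
# Register-limited occupancy for ampere_sgemm_128x64_nn (sketch).
# Hardware figures below are assumed Ampere values, not queried from a GPU.
REGS_PER_SM = 65536        # register file per SM
MAX_WARPS_PER_SM = 48      # warp slots per SM
REG_GRANULARITY = 256      # per-warp register allocation granularity

def round_up(value, granularity):
    # Round up to the hardware allocation granularity
    return -(-value // granularity) * granularity

threads_per_block = 128
regs_per_thread = 122

warps_per_block = threads_per_block // 32                        # 4
regs_per_warp = round_up(regs_per_thread * 32, REG_GRANULARITY)  # 3904 rounds up to 4096
regs_per_block = warps_per_block * regs_per_warp                 # 16384
blocks_per_sm = REGS_PER_SM // regs_per_block                    # 4
active_warps = blocks_per_sm * warps_per_block                   # 16
occupancy = active_warps / MAX_WARPS_PER_SM

print(f"occupancy = {occupancy:.3f}")  # 0.333
```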

&lt;h3&gt;
  
  
  But Is This Actually a Problem?
&lt;/h3&gt;

&lt;p&gt;Not necessarily. High register usage is only a problem when the algorithm doesn't actually need those registers; for a heavily tuned library kernel like cuBLAS's sgemm, it is likely by design. This is a good example of why occupancy alone doesn't tell the whole story: you need to understand &lt;em&gt;what's limiting it&lt;/em&gt; and &lt;em&gt;whether that tradeoff makes sense for the workload&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you saw 33% occupancy with &lt;code&gt;limiting_resource: shared_mem&lt;/code&gt; on your own custom kernel, that might be worth investigating.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Now that you have a working environment, you can start profiling your own models. The occupancy breakdown makes it easy to spot which kernels are underutilizing the GPU and — more importantly — &lt;em&gt;why&lt;/em&gt;. Not every low-occupancy kernel is a problem, but when one is, you'll know exactly which resource to optimize.&lt;/p&gt;

&lt;p&gt;In the next post, I'll cover GPU Flight's &lt;strong&gt;profiling engines&lt;/strong&gt; — PC sampling, SASS metrics, and the range profiler — which let you go beyond kernel metadata and collect hardware-level data about what's happening &lt;em&gt;inside&lt;/em&gt; the GPU while your kernels run.&lt;/p&gt;

</description>
      <category>cuda</category>
      <category>cpp</category>
      <category>gpu</category>
      <category>python</category>
    </item>
    <item>
      <title>Profiling GPU (CUDA) — What Is Actually Limiting Your Kernel?</title>
      <dc:creator>Myoungho Shin</dc:creator>
      <pubDate>Mon, 02 Mar 2026 01:19:03 +0000</pubDate>
      <link>https://dev.to/codinginavan/profiling-gpu-cuda-what-is-actually-limiting-your-kernel-211e</link>
      <guid>https://dev.to/codinginavan/profiling-gpu-cuda-what-is-actually-limiting-your-kernel-211e</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/codinginavan/profiling-gpu-cuda-introducing-gpu-flight-4p67"&gt;last post&lt;/a&gt; I introduced &lt;strong&gt;GPU Flight&lt;/strong&gt; — a lightweight CUDA observability tool that acts like a flight recorder for your GPU. We covered what it collects: system metrics, device capabilities, and per-kernel events.&lt;/p&gt;

&lt;p&gt;Today I want to talk about one specific metric that GPU Flight captures: &lt;strong&gt;occupancy&lt;/strong&gt;. It's one of the most important numbers for understanding GPU performance, and also one of the most misunderstood.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Occupancy?
&lt;/h2&gt;

&lt;p&gt;A GPU is organized around &lt;strong&gt;Streaming Multiprocessors (SMs)&lt;/strong&gt;. Each SM keeps many threads resident at once, and unlike a CPU it can switch between them essentially for free, because every resident thread's registers stay on the SM. The unit of scheduling on an SM is a &lt;strong&gt;warp&lt;/strong&gt;: a group of 32 threads that execute the same instruction in lockstep.&lt;/p&gt;

&lt;p&gt;An SM has a fixed warp budget — say, 48 warps on a typical Ampere GPU. When you launch a kernel with blocks of 256 threads (8 warps each), the SM can hold up to 6 blocks concurrently to fill those 48 warp slots. If something prevents that — too many registers, too much shared memory — fewer blocks fit, and some warp slots sit idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Occupancy&lt;/strong&gt; measures how well those warp slots are filled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;occupancy = active warps / maximum warps per SM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A value of 1.0 means every slot is in use. A value of 0.5 means half the SM's warp slots sit empty while your kernel runs.&lt;/p&gt;
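&lt;p&gt;The warp-budget arithmetic from the example above can be sketched directly. The 48-warp budget is the assumed Ampere figure, not a universal constant:&lt;/p&gt;

```python
# How many 256-thread blocks it takes to fill an SM's warp slots (sketch).
MAX_WARPS_PER_SM = 48                      # assumed Ampere warp budget
threads_per_block = 256
warps_per_block = threads_per_block // 32  # 8 warps per block

blocks_to_fill = MAX_WARPS_PER_SM // warps_per_block  # 6 concurrent blocks
occupancy = (blocks_to_fill * warps_per_block) / MAX_WARPS_PER_SM
print(blocks_to_fill, occupancy)  # 6 1.0
```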




&lt;h2&gt;
  
  
  How GPU Flight Captures It
&lt;/h2&gt;

&lt;p&gt;GPU Flight records occupancy automatically for every kernel launch. No code changes needed — just initialize with &lt;code&gt;enableKernelDetails: true&lt;/code&gt; and it shows up in the log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kernel_event"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"_Z18block_reduce_naivePKfPfi"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.833333&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"num_regs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"static_shared_bytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dyn_shared_bytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"block"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"(256,1,1)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"grid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"(16384,1,1)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"max_active_blocks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, GPU Flight calls &lt;code&gt;cudaOccupancyMaxActiveBlocksPerMultiprocessor&lt;/code&gt; at kernel launch time to get &lt;code&gt;max_active_blocks&lt;/code&gt;, then divides by the SM's warp budget to compute &lt;code&gt;occupancy&lt;/code&gt;. This happens inside the CUPTI callback — zero overhead to your kernel execution.&lt;/p&gt;

&lt;p&gt;That &lt;code&gt;0.833333&lt;/code&gt; immediately tells you something is off. This kernel only fills 5 out of 6 possible concurrent blocks on each SM. Some compute is being left on the table.&lt;/p&gt;
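&lt;p&gt;That final division is easy to reproduce. Here is a minimal sketch of the conversion, using &lt;code&gt;max_active_blocks&lt;/code&gt; and the block size from the log entry above, and again assuming a 48-warp budget:&lt;/p&gt;

```python
# occupancy from max_active_blocks (sketch; 48-warp Ampere budget assumed)
MAX_WARPS_PER_SM = 48
block_threads = 256       # from "block": "(256,1,1)"
max_active_blocks = 5     # from the log entry

warps_per_block = block_threads // 32               # 8
active_warps = max_active_blocks * warps_per_block  # 40
occupancy = active_warps / MAX_WARPS_PER_SM
print(f"{occupancy:.6f}")  # 0.833333
```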




&lt;h2&gt;
  
  
  But What Is Actually Causing It?
&lt;/h2&gt;

&lt;p&gt;Here's where a single number hits its limit.&lt;/p&gt;

&lt;p&gt;Is it registers? Shared memory? The hardware block count cap? Looking at the log fields, you can make an educated guess — &lt;code&gt;static_shared_bytes: 16384&lt;/code&gt; is 16 KB of shared memory per block, which is pretty large. But you still have to do the math yourself against your specific GPU's properties to confirm.&lt;/p&gt;

&lt;p&gt;That manual detective work is exactly what I wanted to eliminate. So GPU Flight now also computes a &lt;strong&gt;per-resource occupancy breakdown&lt;/strong&gt; and identifies the limiting resource automatically. Let me show what this looks like with a concrete kernel.&lt;/p&gt;

&lt;h3&gt;
  
  
  The kernel
&lt;/h3&gt;

&lt;p&gt;Here's a simple parallel block reduction — it sums an array by having all 256 threads in a block cooperate through shared memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;block_reduce_naive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;__shared__&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// 16 KB — statically reserved&lt;/span&gt;

    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;gid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Load one element per thread into shared memory&lt;/span&gt;
    &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gid&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;__syncthreads&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Reduce in shared memory — each step halves the active threads&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="n"&gt;__syncthreads&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Thread 0 writes the block's result&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Launched with 256 threads per block across 4M elements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;BLOCK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;GRID&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;BLOCK&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;BLOCK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// ~16384 blocks&lt;/span&gt;
&lt;span class="n"&gt;block_reduce_naive&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;GRID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BLOCK&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d_in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d_out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing unusual here — this is a textbook reduction. But GPU Flight flags a problem immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  What GPU Flight sees
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="mf"&gt;0.833333&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reg_occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"smem_occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;0.833333&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"warp_occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"block_occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"limiting_resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"shared_mem"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;*_occupancy&lt;/code&gt; field answers: &lt;em&gt;"if only this constraint existed, what would occupancy be?"&lt;/em&gt; The &lt;code&gt;limiting_resource&lt;/code&gt; field names the one that's actually binding. Here — &lt;code&gt;smem_occupancy&lt;/code&gt; matches &lt;code&gt;occupancy&lt;/code&gt; and everything else is 1.0 — shared memory is definitively the culprit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why
&lt;/h3&gt;

&lt;p&gt;The problem is &lt;code&gt;__shared__ float smem[4096]&lt;/code&gt;. Static shared memory is sized at compile time and reserved in full for the block's entire lifetime — even if the kernel only uses part of it. With 256 threads per block, this reduction only ever touches &lt;code&gt;smem[0]&lt;/code&gt; through &lt;code&gt;smem[255]&lt;/code&gt;, but all 4096 floats (16 KB) are locked up on the SM regardless. Every block is paying a 16 KB reservation it doesn't actually need, and that prevents the SM from scheduling as many concurrent blocks as the warp budget would otherwise allow.&lt;/p&gt;
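&lt;p&gt;To make the reservation math concrete, here is a sketch of how a 16 KB static allocation caps concurrent blocks. The per-SM shared memory capacity below is an assumed figure chosen to match the &lt;code&gt;0.833333&lt;/code&gt; in the log; real values vary by GPU and carveout configuration:&lt;/p&gt;

```python
# Shared-memory occupancy limit (sketch; per-SM capacity is an assumed figure)
SMEM_PER_SM = 80 * 1024      # assumed usable shared memory per SM, in bytes
MAX_WARPS_PER_SM = 48
smem_per_block = 4096 * 4    # __shared__ float smem[4096] reserves 16 KB
warps_per_block = 256 // 32  # 8

blocks_by_smem = SMEM_PER_SM // smem_per_block         # only 5 blocks fit
blocks_by_warps = MAX_WARPS_PER_SM // warps_per_block  # 6 would fit otherwise
blocks = min(blocks_by_smem, blocks_by_warps)          # shared memory binds
occupancy = blocks * warps_per_block / MAX_WARPS_PER_SM
print(f"{occupancy:.6f}")  # 0.833333
```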

&lt;h3&gt;
  
  
  The fix
&lt;/h3&gt;

&lt;p&gt;Switch to &lt;strong&gt;dynamic shared memory&lt;/strong&gt;, which is sized at launch time rather than compiled in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;block_reduce_optimized&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;extern&lt;/span&gt; &lt;span class="k"&gt;__shared__&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="c1"&gt;// size comes from the launch call&lt;/span&gt;

    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;gid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gid&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;__syncthreads&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="n"&gt;__syncthreads&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kernel body is completely unchanged. The only differences are &lt;code&gt;extern __shared__&lt;/code&gt; instead of a fixed-size array, and passing the size as the third launch argument:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;smem_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BLOCK&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// 256 × 4 = 1 KB&lt;/span&gt;
&lt;span class="n"&gt;block_reduce_optimized&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;GRID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BLOCK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;smem_bytes&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d_in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d_out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shared memory footprint drops from 16 KB to 1 KB per block — 16× smaller — and now the SM can fit all 6 concurrent blocks instead of 5.&lt;/p&gt;
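&lt;p&gt;You can sanity-check that number without any profiler: the CUDA runtime reports it directly through &lt;code&gt;cudaOccupancyMaxActiveBlocksPerMultiprocessor&lt;/code&gt;. A minimal sketch, reusing &lt;code&gt;BLOCK&lt;/code&gt; and &lt;code&gt;smem_bytes&lt;/code&gt; from the launch above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;// Ask the runtime how many blocks of this kernel can be resident
// on one SM, given the block size and dynamic shared memory request.
int max_blocks = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &amp;amp;max_blocks, block_reduce_optimized, BLOCK, smem_bytes);
printf("resident blocks per SM: %d\n", max_blocks);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;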

&lt;p&gt;GPU Flight confirms the fix worked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"limiting_resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warps"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;"warps"&lt;/code&gt; as the limiting resource means full occupancy — every SM warp slot is filled and shared memory is no longer in the way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full Sample Code&lt;/strong&gt;: &lt;a href="https://github.com/gpu-flight/gpufl-client/blob/main/example/cuda/occupancy_demo.cu" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;




</description>
      <category>performance</category>
      <category>cuda</category>
      <category>gpu</category>
      <category>cpp</category>
    </item>
    <item>
      <title>Profiling GPU (CUDA) — Introducing GPU Flight</title>
      <dc:creator>Myoungho Shin</dc:creator>
      <pubDate>Tue, 24 Feb 2026 01:32:26 +0000</pubDate>
      <link>https://dev.to/codinginavan/profiling-gpu-cuda-introducing-gpu-flight-4p67</link>
      <guid>https://dev.to/codinginavan/profiling-gpu-cuda-introducing-gpu-flight-4p67</guid>
      <description>&lt;p&gt;Last year, I took a GPU programming course at Johns Hopkins University as part of my graduate studies, where I learned CUDA programming. For my final project, I built a lightweight GPU monitoring and profiling tool focused on CUDA.&lt;/p&gt;

&lt;p&gt;I enjoyed the process so much that I decided to continue developing it beyond the course.&lt;/p&gt;

&lt;p&gt;In this post, I’d like to briefly introduce the project:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPU Flight&lt;/strong&gt; — a 100% open-source GPU observability tool&lt;br&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/gpu-flight/gpufl-client" rel="noopener noreferrer"&gt;https://github.com/gpu-flight/gpufl-client&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Why I Started GPU Flight
&lt;/h2&gt;

&lt;p&gt;When profiling a CUDA application, you typically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install profiling tools such as &lt;strong&gt;Nsight&lt;/strong&gt;, or manually integrate &lt;strong&gt;CUPTI&lt;/strong&gt; into your application, which often makes the code complex and difficult to maintain&lt;/li&gt;
&lt;li&gt;Deal with additional complexity in cloud or containerized environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This workflow can be inconvenient — especially in production systems.&lt;/p&gt;

&lt;p&gt;I wanted something lighter.&lt;br&gt;&lt;br&gt;
Something that works more like a &lt;strong&gt;flight recorder for GPUs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I built GPU Flight.&lt;/p&gt;

&lt;p&gt;Instead of requiring heavy tooling at runtime, GPU Flight writes structured profiling logs directly on the host machine. A separate component (&lt;strong&gt;GPUFL Agent&lt;/strong&gt;) crawls these log files and forwards them to a backend service or other destinations.&lt;/p&gt;

&lt;p&gt;This makes GPU observability more flexible and easier to integrate into distributed systems.&lt;/p&gt;


&lt;h2&gt;
  
  
  What is GPU Flight?
&lt;/h2&gt;

&lt;p&gt;GPU Flight is designed to be &lt;strong&gt;lightweight and modular&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you only need monitoring, the overhead is minimal.&lt;/li&gt;
&lt;li&gt;Enabling deeper profiling provides more detailed metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to expose useful GPU metrics so you can clearly understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How the GPU manages resources&lt;/li&gt;
&lt;li&gt;How your program utilizes GPU resources&lt;/li&gt;
&lt;li&gt;Where performance bottlenecks occur&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;

&lt;p&gt;GPU Flight currently consists of several components:&lt;/p&gt;
&lt;h3&gt;
  
  
  1️⃣ gpufl-client
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/gpu-flight/gpufl-client" rel="noopener noreferrer"&gt;https://github.com/gpu-flight/gpufl-client&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;The client library that users embed into their applications for monitoring and profiling.&lt;/p&gt;


&lt;h3&gt;
  
  
  2️⃣ gpufl-agent
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/gpu-flight/gpufl-agent" rel="noopener noreferrer"&gt;https://github.com/gpu-flight/gpufl-agent&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Despite the name, this is &lt;strong&gt;not an AI agent 🙂&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
It tracks log files and forwards profiling data to the configured destination.&lt;/p&gt;


&lt;h3&gt;
  
  
  3️⃣ gpufl-desktop
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/gpu-flight/gpufl-desktop" rel="noopener noreferrer"&gt;https://github.com/gpu-flight/gpufl-desktop&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Originally, I planned to build a desktop viewer. Due to time constraints and the need for better cross-platform accessibility, I pivoted to a web-based frontend.&lt;br&gt;
I am currently keeping the web frontend and backend repositories private as I develop them into a hosted cloud platform. To ensure the open-source community can still easily parse and utilize the trace logs locally, I am providing a lightweight Python viewer alongside the open-source C++ client.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Metrics Does GPU Flight Support?
&lt;/h2&gt;

&lt;p&gt;GPU Flight captures metrics at multiple layers.&lt;/p&gt;
&lt;h3&gt;
  
  
  1️⃣ System &amp;amp; GPU Monitoring (NVML)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Host memory usage&lt;/li&gt;
&lt;li&gt;GPU memory usage (used/free/total)&lt;/li&gt;
&lt;li&gt;GPU utilization&lt;/li&gt;
&lt;li&gt;Memory utilization&lt;/li&gt;
&lt;li&gt;Temperature&lt;/li&gt;
&lt;li&gt;Power consumption&lt;/li&gt;
&lt;li&gt;Clock speeds (GFX / SM / Memory)&lt;/li&gt;
&lt;li&gt;PCIe RX/TX bandwidth&lt;/li&gt;
&lt;li&gt;Power and thermal throttling flags&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example JSON snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system_sample"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"util_gpu"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;57&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"temp_c"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"power_mw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;54415&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"clk_sm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1740&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
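

&lt;p&gt;These fields map onto standard NVML queries. For reference, a minimal standalone sampler along these lines (error handling omitted; this is an illustrative sketch, not GPU Flight's actual collector):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;#include &amp;lt;nvml.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

int main(void) {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &amp;amp;dev);

    nvmlUtilization_t util;                      // .gpu / .memory in %
    unsigned int temp_c, power_mw, clk_sm;
    nvmlDeviceGetUtilizationRates(dev, &amp;amp;util);
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &amp;amp;temp_c);
    nvmlDeviceGetPowerUsage(dev, &amp;amp;power_mw);        // milliwatts
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &amp;amp;clk_sm);  // MHz

    printf("{\"util_gpu\": %u, \"temp_c\": %u, \"power_mw\": %u, \"clk_sm\": %u}\n",
           util.gpu, temp_c, power_mw, clk_sm);

    nvmlShutdown();
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;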






&lt;h3&gt;
  
  
  2️⃣ CUDA Device Capabilities
&lt;/h3&gt;

&lt;p&gt;Static architectural information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compute capability&lt;/li&gt;
&lt;li&gt;L2 cache size&lt;/li&gt;
&lt;li&gt;Shared memory per block&lt;/li&gt;
&lt;li&gt;Registers per block&lt;/li&gt;
&lt;li&gt;SM count&lt;/li&gt;
&lt;li&gt;Warp size&lt;/li&gt;
&lt;/ul&gt;
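
&lt;p&gt;All of these come from a single &lt;code&gt;cudaGetDeviceProperties&lt;/code&gt; call; a quick sketch of the mapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;cudaDeviceProp prop;
cudaGetDeviceProperties(&amp;amp;prop, 0);  // device 0

printf("compute capability: %d.%d\n", prop.major, prop.minor);
printf("L2 cache size:      %d bytes\n", prop.l2CacheSize);
printf("smem per block:     %zu bytes\n", prop.sharedMemPerBlock);
printf("regs per block:     %d\n", prop.regsPerBlock);
printf("SM count:           %d\n", prop.multiProcessorCount);
printf("warp size:          %d\n", prop.warpSize);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;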




&lt;h3&gt;
  
  
  3️⃣ CUDA API &amp;amp; Kernel Events (CUPTI)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;API enter/exit timestamps&lt;/li&gt;
&lt;li&gt;Kernel execution start/end timestamps&lt;/li&gt;
&lt;li&gt;Grid/block dimensions&lt;/li&gt;
&lt;li&gt;Shared memory usage&lt;/li&gt;
&lt;li&gt;Register usage&lt;/li&gt;
&lt;li&gt;Occupancy&lt;/li&gt;
&lt;li&gt;Correlation IDs&lt;/li&gt;
&lt;li&gt;Memory copy events (HtoD, DtoH)&lt;/li&gt;
&lt;/ul&gt;
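
&lt;p&gt;For a sense of what GPU Flight abstracts away, this is roughly the boilerplate that raw CUPTI activity tracing requires. An abridged sketch; the exact kernel record struct (here &lt;code&gt;CUpti_ActivityKernel4&lt;/code&gt;) varies across CUPTI versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;#include &amp;lt;cupti.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;

// CUPTI asks us for a buffer to fill with activity records...
static void CUPTIAPI buf_requested(uint8_t **buf, size_t *size, size_t *max_recs) {
    *size = 8 * 1024 * 1024;
    *buf = (uint8_t *)malloc(*size);
    *max_recs = 0;  // 0 = as many records as fit
}

// ...and hands it back once records are ready to be parsed.
static void CUPTIAPI buf_completed(CUcontext ctx, uint32_t stream_id,
                                   uint8_t *buf, size_t size, size_t valid) {
    CUpti_Activity *rec = NULL;
    while (cuptiActivityGetNextRecord(buf, valid, &amp;amp;rec) == CUPTI_SUCCESS) {
        if (rec-&amp;gt;kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
            // start/end timestamps, grid/block dims, shared memory,
            // registers, and the correlation id all live on this record
            CUpti_ActivityKernel4 *k = (CUpti_ActivityKernel4 *)rec;
            printf("%s: %llu ns\n", k-&amp;gt;name,
                   (unsigned long long)(k-&amp;gt;end - k-&amp;gt;start));
        }
    }
    free(buf);
}

// setup: register callbacks, enable kernel + memcpy records,
// run the workload, then flush
cuptiActivityRegisterCallbacks(buf_requested, buf_completed);
cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);
/* ... launch kernels ... */
cuptiActivityFlushAll(0);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;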




&lt;h2&gt;
  
  
  Python Support
&lt;/h2&gt;

&lt;p&gt;GPU Flight is also being extended to support Python applications that use CUDA (e.g., PyTorch).&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/gpu-flight/gpufl-client/blob/main/example/python/03_pytorch_benchmark.py" rel="noopener noreferrer"&gt;https://github.com/gpu-flight/gpufl-client/blob/main/example/python/03_pytorch_benchmark.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This allows profiling GPU-heavy ML workloads without deeply modifying existing code.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;In the next post, I’ll walk through a minimal CUDA example and show how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integrate &lt;code&gt;gpufl-client&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run a kernel&lt;/li&gt;
&lt;li&gt;Inspect generated profiling logs&lt;/li&gt;
&lt;li&gt;Interpret stall reasons and metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks for reading — this is just the beginning.&lt;/p&gt;

</description>
      <category>cuda</category>
      <category>gpu</category>
      <category>cpp</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
