<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dang Manh Truong</title>
    <description>The latest articles on DEV Community by Dang Manh Truong (@dangmanhtruong1995).</description>
    <link>https://dev.to/dangmanhtruong1995</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F239267%2Ff4733a92-7a58-4439-b163-9daff65c654e.jpeg</url>
      <title>DEV Community: Dang Manh Truong</title>
      <link>https://dev.to/dangmanhtruong1995</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dangmanhtruong1995"/>
    <language>en</language>
    <item>
      <title>A GPU-accelerated implementation of Forman-Ricci curvature-based graph clustering in CUDA.</title>
      <dc:creator>Dang Manh Truong</dc:creator>
      <pubDate>Sat, 17 Jan 2026 17:15:32 +0000</pubDate>
      <link>https://dev.to/dangmanhtruong1995/a-gpu-accelerated-implementation-of-forman-ricci-curvature-based-graph-clustering-in-cuda-3ob1</link>
      <guid>https://dev.to/dangmanhtruong1995/a-gpu-accelerated-implementation-of-forman-ricci-curvature-based-graph-clustering-in-cuda-3ob1</guid>
      <description>&lt;p&gt;Hi everyone! I've been trying to level up my CUDA skills (hoping it helps with job search), and I figured reimplementing existing GPU code wouldn't show much novelty. So I looked for algorithms that haven't been done on GPU yet.&lt;br&gt;
Luckily, I was trying to read this paper in a machine learning journal, and while most of its contents escaped me, the topic was about something called Ricci curvature-based clustering. The core idea is elegant: edges within clusters have high Ricci curvature, while edges between clusters ("bridges") have low curvature. By running a discrete Ricci flow and thresholding, you can reveal community structure. (Fun fact: This is related to the Ricci flow that Perelman used to prove the Poincaré conjecture!)&lt;br&gt;
There are two variants: Ollivier-Ricci and Forman-Ricci. I implemented Forman-Ricci since I couldn't find any existing GPU implementation (and also because it's simpler; I may do Ollivier-Ricci in the future).&lt;br&gt;
How it works&lt;br&gt;
Graph generation: Uses Stochastic Block Model (SBM) with P_in (intra-cluster edge probability) and P_out (inter-cluster probability)&lt;br&gt;
Phase 1 — Ricci Flow: Iteratively compute curvature, update edge weights, normalize. Bridges end up with higher weights than intra-cluster edges.&lt;br&gt;
Phase 2 — Threshold search: Find optimal threshold using modularity. Cut edges above threshold, find connected components, pick the partition with highest modularity.&lt;br&gt;
The code is not heavily optimised, but I think it will still help with large networks. I was basically learning every step of the way during the implementation, so you'll see my own CUDA functions for prefix sums, connected components (a GPU version using BFS, not very efficient I know, but I reused BFS code I had written earlier), bitonic sort, etc., and they are not the best versions, since I was still learning.&lt;br&gt;
As an aside, I naturally used AI (Claude) to debug, check for errors, etc., while working on this. It was helpful for basic tasks (e.g. "Check if I missed any free or cudaFree"), but it also introduced bugs of its own, which was rather annoying. Firstly, there was this line:&lt;br&gt;
if (w &amp;lt; threshold &amp;amp;&amp;amp; component_id[neighbor] == -1)&lt;br&gt;
which the AI was rather obsessed with: it kept periodically asking me to fix it, changing the comparison to &amp;gt;, then after a while to &amp;lt;=, and so on. Of course I knew it was leading me by the nose, so I went back to the papers and did it my own way. Secondly, somewhere during the debugging process, the AI replaced the Forman-Ricci equation with the local clustering coefficient, which is not what I wanted, and it took me some time to fix. Thirdly, I figured that if I asked the AI to create a Python version implementing the same thing, and the Python version ran fine while the CUDA version failed, the problem would likely be numerical stability. So I did, and the Python code worked okay; however, during threshold selection the AI sneaked in a second loop that chose the threshold using another metric, NMI (Normalized Mutual Information). The problem is that NMI depends on the ground truth, so the AI was making the algorithm overfit to the data, which is completely wrong. Luckily I was able to fix it in time.&lt;br&gt;
Results (tested on an RTX 4090 with 24GB VRAM, 64GB RAM):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Nodes&lt;/th&gt;
&lt;th&gt;Clusters&lt;/th&gt;
&lt;th&gt;Edges&lt;/th&gt;
&lt;th&gt;P_in&lt;/th&gt;
&lt;th&gt;P_out&lt;/th&gt;
&lt;th&gt;Iterations&lt;/th&gt;
&lt;th&gt;NMI&lt;/th&gt;
&lt;th&gt;GPU Time (s)&lt;/th&gt;
&lt;th&gt;CPU Time (s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;~3M&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;7.03&lt;/td&gt;
&lt;td&gt;15,189.21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;~25M&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;0.001&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;74.39&lt;/td&gt;
&lt;td&gt;162,401.93&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;~102M&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;0.001&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;625.46&lt;/td&gt;
&lt;td&gt;TBA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500,000&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;~126M&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;0.00001&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;0.89&lt;/td&gt;
&lt;td&gt;1086.25&lt;/td&gt;
&lt;td&gt;TBA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
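&lt;p&gt;For readers who want the gist without reading CUDA, here is a minimal pure-Python sketch of the pipeline. It is my own simplification, not the repository code: it uses the unweighted combinatorial form of augmented Forman-Ricci curvature, a schematic flow update, and union-find (instead of my BFS kernel) for connected components:&lt;/p&gt;

```python
# Simplified sketch of the two phases (assumptions noted inline); the CUDA
# version parallelizes each of these steps across edges.

def augmented_forman_curvature(adj, u, v):
    """Unweighted combinatorial form: F(e) = 4 - deg(u) - deg(v) + 3*T(e),
    where T(e) is the number of triangles containing edge e = (u, v)."""
    triangles = len(adj[u] & adj[v])  # common neighbors close triangles
    return 4 - len(adj[u]) - len(adj[v]) + 3 * triangles

def ricci_flow_step(adj, weights, step=0.1):
    """Phase 1 (schematic): shrink positively curved intra-cluster edges,
    grow negatively curved bridges, then renormalize the total weight.
    The exact update rule follows the paper, not this sketch."""
    new = {}
    for (u, v), w in weights.items():
        f = augmented_forman_curvature(adj, u, v)
        new[(u, v)] = max(w * (1.0 - step * f), 1e-9)
    scale = len(new) / sum(new.values())  # keep the average weight at 1
    return {e: w * scale for e, w in new.items()}

def components(nodes, edges):
    """Connected-component labels via union-find (the CUDA code uses BFS)."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return {n: find(n) for n in nodes}

def modularity(edges, community):
    """Newman modularity of a partition, on the original unweighted graph."""
    m = len(edges)
    intra, deg = {}, {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        deg[cu] = deg.get(cu, 0) + 1
        deg[cv] = deg.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (deg[c] / (2 * m)) ** 2 for c in deg)

def best_partition(nodes, weighted_edges):
    """Phase 2: for each candidate threshold, cut edges above it (bridges
    end up heavy after the flow), label the components, and keep the
    partition with the highest modularity."""
    all_edges = [(u, v) for u, v, _ in weighted_edges]
    best_q, best_comm = float("-inf"), {n: 0 for n in nodes}
    for t in sorted({w for _, _, w in weighted_edges}):
        kept = [(u, v) for u, v, w in weighted_edges if w <= t]
        comm = components(nodes, kept)
        q = modularity(all_edges, comm)
        if q > best_q:
            best_q, best_comm = q, comm
    return best_comm, best_q
```

&lt;p&gt;On a toy graph of two triangles joined by a single bridge, one flow step already makes the bridge the heaviest edge, and the threshold search recovers the two triangles.&lt;/p&gt;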

&lt;p&gt;I hope you can provide feedback, comments, and suggestions on my code. Suggestions for next steps in upskilling in ML, job searching, etc. would also be welcome. Thank you very much.&lt;br&gt;
GitHub link: &lt;a href="https://github.com/dangmanhtruong1995/Ricci-curvature-clustering-CUDA" rel="noopener noreferrer"&gt;https://github.com/dangmanhtruong1995/Ricci-curvature-clustering-CUDA&lt;/a&gt;&lt;br&gt;
Update: Here is the NCU report:&lt;br&gt;
array_sum_blockwise(double *, int, double *) (986034, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.9&lt;br&gt;
Section: GPU Speed Of Light Throughput&lt;/p&gt;




&lt;p&gt;Metric Name             Metric Unit  Metric Value&lt;/p&gt;




&lt;p&gt;DRAM Frequency                  Ghz         10.24&lt;br&gt;
SM Frequency                    Ghz          2.22&lt;br&gt;
Elapsed Cycles                cycle    10,272,695&lt;br&gt;
Memory Throughput                 %         49.94&lt;br&gt;
DRAM Throughput                   %         49.94&lt;br&gt;
Duration                         ms          4.61&lt;br&gt;
L1/TEX Cache Throughput           %         32.78&lt;br&gt;
L2 Cache Throughput               %         19.64&lt;br&gt;
SM Active Cycles              cycle 10,245,038.77&lt;br&gt;
Compute (SM) Throughput           %         86.58&lt;/p&gt;




&lt;p&gt;INF   This workload is utilizing greater than 80.0% of the available compute or memory performance of the device.&lt;br&gt;
To further improve performance, work will likely need to be shifted from the most utilized to another unit.&lt;br&gt;
Start by analyzing workloads in the Compute Workload Analysis section.&lt;/p&gt;

&lt;p&gt;Section: Launch Statistics&lt;/p&gt;




&lt;p&gt;Metric Name                          Metric Unit    Metric Value&lt;/p&gt;




&lt;p&gt;Block Size                                                   256&lt;br&gt;
Function Cache Configuration                     CachePreferNone&lt;br&gt;
Grid Size                                                986,034&lt;br&gt;
Registers Per Thread             register/thread              16&lt;br&gt;
Shared Memory Configuration Size           Kbyte           65.54&lt;br&gt;
Driver Shared Memory Per Block       Kbyte/block            1.02&lt;br&gt;
Dynamic Shared Memory Per Block       byte/block               0&lt;br&gt;
Static Shared Memory Per Block       Kbyte/block            2.05&lt;/p&gt;

&lt;p&gt;SMs                                         SM             128&lt;br&gt;
Stack Size                                                 1,024&lt;br&gt;
Threads                                   thread     252,424,704&lt;br&gt;
TPCs                                                          64&lt;br&gt;
Enabled TPC IDs                                              all&lt;br&gt;
Uses Green Context                                             0&lt;br&gt;
Waves Per SM                                            1,283.90&lt;/p&gt;




&lt;p&gt;Section: Occupancy&lt;/p&gt;




&lt;p&gt;Metric Name                     Metric Unit Metric Value&lt;/p&gt;




&lt;p&gt;Block Limit SM                        block           24&lt;br&gt;
Block Limit Registers                 block           16&lt;br&gt;
Block Limit Shared Mem                block           21&lt;br&gt;
Block Limit Warps                     block            6&lt;br&gt;
Theoretical Active Warps per SM        warp           48&lt;br&gt;
Theoretical Occupancy                     %          100&lt;br&gt;
Achieved Occupancy                        %        94.58&lt;br&gt;
Achieved Active Warps Per SM           warp        45.40&lt;/p&gt;




&lt;p&gt;Section: GPU and Memory Workload Distribution&lt;/p&gt;




&lt;p&gt;Metric Name                Metric Unit  Metric Value&lt;/p&gt;




&lt;p&gt;Average DRAM Active Cycles       cycle 23,572,417.33&lt;br&gt;
Total DRAM Elapsed Cycles        cycle   566,390,784&lt;br&gt;
Average L1 Active Cycles         cycle 10,245,038.77&lt;br&gt;
Total L1 Elapsed Cycles          cycle 1,311,918,402&lt;br&gt;
Average L2 Active Cycles         cycle  8,924,399.94&lt;br&gt;
Total L2 Elapsed Cycles          cycle   321,527,232&lt;br&gt;
Average SM Active Cycles         cycle 10,245,038.77&lt;br&gt;
Total SM Elapsed Cycles          cycle 1,311,918,402&lt;br&gt;
Average SMSP Active Cycles       cycle 10,244,765.62&lt;br&gt;
Total SMSP Elapsed Cycles        cycle 5,247,673,608&lt;/p&gt;




&lt;p&gt;normalize_weights(double *, double, int, int) (986034, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.9&lt;br&gt;
Section: GPU Speed Of Light Throughput&lt;/p&gt;




&lt;p&gt;Metric Name             Metric Unit  Metric Value&lt;/p&gt;




&lt;p&gt;DRAM Frequency                  Ghz         10.24&lt;br&gt;
SM Frequency                    Ghz          2.23&lt;br&gt;
Elapsed Cycles                cycle    12,282,906&lt;br&gt;
Memory Throughput                 %         73.96&lt;br&gt;
DRAM Throughput                   %         73.96&lt;br&gt;
Duration                         ms          5.50&lt;br&gt;
L1/TEX Cache Throughput           %          9.04&lt;br&gt;
L2 Cache Throughput               %         18.41&lt;br&gt;
SM Active Cycles              cycle 12,269,948.88&lt;br&gt;
Compute (SM) Throughput           %         88.37&lt;/p&gt;




&lt;p&gt;INF   This workload is utilizing greater than 80.0% of the available compute or memory performance of the device.&lt;br&gt;
To further improve performance, work will likely need to be shifted from the most utilized to another unit.&lt;br&gt;
Start by analyzing workloads in the Compute Workload Analysis section.&lt;/p&gt;

&lt;p&gt;Section: Launch Statistics&lt;/p&gt;




&lt;p&gt;Metric Name                          Metric Unit    Metric Value&lt;/p&gt;




&lt;p&gt;Block Size                                                   256&lt;br&gt;
Function Cache Configuration                     CachePreferNone&lt;br&gt;
Grid Size                                                986,034&lt;br&gt;
Registers Per Thread             register/thread              24&lt;br&gt;
Shared Memory Configuration Size           Kbyte           16.38&lt;br&gt;
Driver Shared Memory Per Block       Kbyte/block            1.02&lt;br&gt;
Dynamic Shared Memory Per Block       byte/block               0&lt;br&gt;
Static Shared Memory Per Block        byte/block               0&lt;/p&gt;

&lt;p&gt;SMs                                         SM             128&lt;br&gt;
Stack Size                                                 1,024&lt;br&gt;
Threads                                   thread     252,424,704&lt;br&gt;
TPCs                                                          64&lt;br&gt;
Enabled TPC IDs                                              all&lt;br&gt;
Uses Green Context                                             0&lt;br&gt;
Waves Per SM                                            1,283.90&lt;/p&gt;




&lt;p&gt;Section: Occupancy&lt;/p&gt;




&lt;p&gt;Metric Name                     Metric Unit Metric Value&lt;/p&gt;




&lt;p&gt;Block Limit SM                        block           24&lt;br&gt;
Block Limit Registers                 block           10&lt;br&gt;
Block Limit Shared Mem                block           16&lt;br&gt;
Block Limit Warps                     block            6&lt;br&gt;
Theoretical Active Warps per SM        warp           48&lt;br&gt;
Theoretical Occupancy                     %          100&lt;br&gt;
Achieved Occupancy                        %        90.44&lt;br&gt;
Achieved Active Warps Per SM           warp        43.41&lt;/p&gt;




&lt;p&gt;Section: GPU and Memory Workload Distribution&lt;/p&gt;




&lt;p&gt;Metric Name                Metric Unit  Metric Value&lt;/p&gt;




&lt;p&gt;Average DRAM Active Cycles       cycle 41,691,337.33&lt;br&gt;
Total DRAM Elapsed Cycles        cycle   676,417,536&lt;br&gt;
Average L1 Active Cycles         cycle 12,269,948.88&lt;br&gt;
Total L1 Elapsed Cycles          cycle 1,570,992,776&lt;br&gt;
Average L2 Active Cycles         cycle 10,694,785.44&lt;br&gt;
Total L2 Elapsed Cycles          cycle   385,577,928&lt;br&gt;
Average SM Active Cycles         cycle 12,269,948.88&lt;br&gt;
Total SM Elapsed Cycles          cycle 1,570,992,776&lt;br&gt;
Average SMSP Active Cycles       cycle 12,269,202.65&lt;br&gt;
Total SMSP Elapsed Cycles        cycle 6,283,971,104&lt;/p&gt;




&lt;p&gt;calc_augmented_forman_ricci_curvature(int *, int *, int *, int *, int *, double *, double *, int) (493017, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.9&lt;br&gt;
Section: GPU Speed Of Light Throughput&lt;/p&gt;




&lt;p&gt;Metric Name             Metric Unit      Metric Value&lt;/p&gt;




&lt;p&gt;DRAM Frequency                  Ghz             10.24&lt;br&gt;
SM Frequency                    Ghz              2.23&lt;br&gt;
Elapsed Cycles                cycle    27,686,168,816&lt;br&gt;
Memory Throughput                 %             15.03&lt;br&gt;
DRAM Throughput                   %              0.12&lt;br&gt;
Duration                          s             12.39&lt;br&gt;
L1/TEX Cache Throughput           %              7.32&lt;br&gt;
L2 Cache Throughput               %             15.03&lt;br&gt;
SM Active Cycles              cycle 27,676,378,817.82&lt;br&gt;
Compute (SM) Throughput           %             88.25&lt;/p&gt;




&lt;p&gt;INF   This workload is utilizing greater than 80.0% of the available compute or memory performance of the device.&lt;br&gt;
To further improve performance, work will likely need to be shifted from the most utilized to another unit.&lt;br&gt;
Start by analyzing workloads in the Compute Workload Analysis section.&lt;/p&gt;

&lt;p&gt;Section: Launch Statistics&lt;/p&gt;




&lt;p&gt;Metric Name                          Metric Unit    Metric Value&lt;/p&gt;




&lt;p&gt;Block Size                                                   256&lt;br&gt;
Function Cache Configuration                     CachePreferNone&lt;br&gt;
Grid Size                                                493,017&lt;br&gt;
Registers Per Thread             register/thread              38&lt;br&gt;
Shared Memory Configuration Size           Kbyte           16.38&lt;br&gt;
Driver Shared Memory Per Block       Kbyte/block            1.02&lt;br&gt;
Dynamic Shared Memory Per Block       byte/block               0&lt;br&gt;
Static Shared Memory Per Block        byte/block               0&lt;/p&gt;

&lt;p&gt;SMs                                         SM             128&lt;br&gt;
Stack Size                                                 1,024&lt;br&gt;
Threads                                   thread     126,212,352&lt;br&gt;
TPCs                                                          64&lt;br&gt;
Enabled TPC IDs                                              all&lt;br&gt;
Uses Green Context                                             0&lt;br&gt;
Waves Per SM                                              641.95&lt;/p&gt;




&lt;p&gt;Section: Occupancy&lt;/p&gt;




&lt;p&gt;Metric Name                     Metric Unit Metric Value&lt;/p&gt;




&lt;p&gt;Block Limit SM                        block           24&lt;br&gt;
Block Limit Registers                 block            6&lt;br&gt;
Block Limit Shared Mem                block           16&lt;br&gt;
Block Limit Warps                     block            6&lt;br&gt;
Theoretical Active Warps per SM        warp           48&lt;br&gt;
Theoretical Occupancy                     %          100&lt;br&gt;
Achieved Occupancy                        %        78.85&lt;br&gt;
Achieved Active Warps Per SM           warp        37.85&lt;/p&gt;




&lt;p&gt;OPT   Est. Local Speedup: 21.15%&lt;br&gt;
The difference between calculated theoretical (100.0%) and measured achieved occupancy (78.9%) can be the&lt;br&gt;
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can&lt;br&gt;
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices&lt;br&gt;
Guide (&lt;a href="https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy" rel="noopener noreferrer"&gt;https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy&lt;/a&gt;) for more details on&lt;br&gt;
optimizing occupancy.&lt;/p&gt;

&lt;p&gt;Section: GPU and Memory Workload Distribution&lt;/p&gt;




&lt;p&gt;Metric Name                Metric Unit       Metric Value&lt;/p&gt;




&lt;p&gt;Average DRAM Active Cycles       cycle     150,666,781.33&lt;br&gt;
Total DRAM Elapsed Cycles        cycle  1,522,442,628,096&lt;br&gt;
Average L1 Active Cycles         cycle  27,676,378,817.82&lt;br&gt;
Total L1 Elapsed Cycles          cycle  3,543,769,283,906&lt;br&gt;
Average L2 Active Cycles         cycle  23,831,160,793.03&lt;br&gt;
Total L2 Elapsed Cycles          cycle    869,605,685,676&lt;br&gt;
Average SM Active Cycles         cycle  27,676,378,817.82&lt;br&gt;
Total SM Elapsed Cycles          cycle  3,543,769,283,906&lt;br&gt;
Average SMSP Active Cycles       cycle  27,672,031,239.79&lt;br&gt;
Total SMSP Elapsed Cycles        cycle 14,175,077,135,624&lt;/p&gt;




&lt;p&gt;update_weight(int *, int *, int *, int *, int *, double *, double *, double, int) (493017, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.9&lt;br&gt;
Section: GPU Speed Of Light Throughput&lt;/p&gt;




&lt;p&gt;Metric Name             Metric Unit   Metric Value&lt;/p&gt;




&lt;p&gt;DRAM Frequency                  Ghz          10.24&lt;br&gt;
SM Frequency                    Ghz           2.23&lt;br&gt;
Elapsed Cycles                cycle    452,955,045&lt;br&gt;
Memory Throughput                 %          71.00&lt;br&gt;
DRAM Throughput                   %           3.10&lt;br&gt;
Duration                         ms         202.87&lt;br&gt;
L1/TEX Cache Throughput           %          35.02&lt;br&gt;
L2 Cache Throughput               %          71.00&lt;br&gt;
SM Active Cycles              cycle 455,399,514.55&lt;br&gt;
Compute (SM) Throughput           %           8.85&lt;/p&gt;




&lt;p&gt;OPT   Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the L2&lt;br&gt;
bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes&lt;br&gt;
transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or&lt;br&gt;
whether there are values you can (re)compute.&lt;/p&gt;

&lt;p&gt;Section: Launch Statistics&lt;/p&gt;




&lt;p&gt;Metric Name                          Metric Unit    Metric Value&lt;/p&gt;




&lt;p&gt;Block Size                                                   256&lt;br&gt;
Function Cache Configuration                     CachePreferNone&lt;br&gt;
Grid Size                                                493,017&lt;br&gt;
Registers Per Thread             register/thread              16&lt;br&gt;
Shared Memory Configuration Size           Kbyte           16.38&lt;br&gt;
Driver Shared Memory Per Block       Kbyte/block            1.02&lt;br&gt;
Dynamic Shared Memory Per Block       byte/block               0&lt;br&gt;
Static Shared Memory Per Block        byte/block               0&lt;/p&gt;

&lt;p&gt;SMs                                         SM             128&lt;br&gt;
Stack Size                                                 1,024&lt;br&gt;
Threads                                   thread     126,212,352&lt;br&gt;
TPCs                                                          64&lt;br&gt;
Enabled TPC IDs                                              all&lt;br&gt;
Uses Green Context                                             0&lt;br&gt;
Waves Per SM                                              641.95&lt;/p&gt;




&lt;p&gt;Section: Occupancy&lt;/p&gt;




&lt;p&gt;Metric Name                     Metric Unit Metric Value&lt;/p&gt;




&lt;p&gt;Block Limit SM                        block           24&lt;br&gt;
Block Limit Registers                 block           16&lt;br&gt;
Block Limit Shared Mem                block           16&lt;br&gt;
Block Limit Warps                     block            6&lt;br&gt;
Theoretical Active Warps per SM        warp           48&lt;br&gt;
Theoretical Occupancy                     %          100&lt;br&gt;
Achieved Occupancy                        %        82.44&lt;br&gt;
Achieved Active Warps Per SM           warp        39.57&lt;/p&gt;




&lt;p&gt;OPT   Est. Local Speedup: 17.56%&lt;br&gt;
The difference between calculated theoretical (100.0%) and measured achieved occupancy (82.4%) can be the&lt;br&gt;
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can&lt;br&gt;
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices&lt;br&gt;
Guide (&lt;a href="https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy" rel="noopener noreferrer"&gt;https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy&lt;/a&gt;) for more details on&lt;br&gt;
optimizing occupancy.&lt;/p&gt;

&lt;p&gt;Section: GPU and Memory Workload Distribution&lt;/p&gt;




&lt;p&gt;Metric Name                Metric Unit    Metric Value&lt;/p&gt;




&lt;p&gt;Average DRAM Active Cycles       cycle   64,409,714.67&lt;br&gt;
Total DRAM Elapsed Cycles        cycle  24,932,431,872&lt;br&gt;
Average L1 Active Cycles         cycle  455,399,514.55&lt;br&gt;
Total L1 Elapsed Cycles          cycle  57,739,469,562&lt;br&gt;
Average L2 Active Cycles         cycle  396,726,153.17&lt;br&gt;
Total L2 Elapsed Cycles          cycle  14,226,411,996&lt;br&gt;
Average SM Active Cycles         cycle  455,399,514.55&lt;br&gt;
Total SM Elapsed Cycles          cycle  57,739,469,562&lt;br&gt;
Average SMSP Active Cycles       cycle  454,534,130.16&lt;br&gt;
Total SMSP Elapsed Cycles        cycle 230,957,878,248&lt;/p&gt;




</description>
      <category>programming</category>
      <category>cuda</category>
      <category>clustering</category>
      <category>ai</category>
    </item>
    <item>
      <title>Using a local LLM AI agent to solve the N puzzle</title>
      <dc:creator>Dang Manh Truong</dc:creator>
      <pubDate>Mon, 18 Aug 2025 16:01:03 +0000</pubDate>
      <link>https://dev.to/dangmanhtruong1995/using-a-local-llm-ai-agent-to-solve-the-n-puzzle-38d2</link>
      <guid>https://dev.to/dangmanhtruong1995/using-a-local-llm-ai-agent-to-solve-the-n-puzzle-38d2</guid>
      <description>&lt;p&gt;Hi everyone, I have just made some program to make an AI agent solve the N puzzle.&lt;/p&gt;

&lt;p&gt;Github link: &lt;a href="https://github.com/dangmanhtruong1995/N-puzzle-Agent/tree/main" rel="noopener noreferrer"&gt;https://github.com/dangmanhtruong1995/N-puzzle-Agent/tree/main&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Youtube link: &lt;a href="https://www.youtube.com/watch?v=Ntol4F4tilg" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=Ntol4F4tilg&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;qwen3:latest&lt;/code&gt; model in the Ollama library was used as the agent, while I chose a simple N puzzle as the problem for it to solve.&lt;/p&gt;

&lt;p&gt;Experiments were done on an ASUS Vivobook Pro 15 laptop with an NVIDIA GeForce RTX 4060 (8GB of VRAM).&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This project demonstrates an AI agent solving the classic N-puzzle (sliding tile puzzle) by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Analyzing and planning optimal moves using the Qwen3 language model&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Executing moves through automated mouse clicks on the GUI&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;The LLM is given a prompt with instructions that it can control the following functions: &lt;code&gt;move_up, move_down, move_left, move_right&lt;/code&gt;. At each turn, the LLM chooses one of those functions, and the corresponding move is then made. The code is inspired by the following tutorials on function calling and building a ReAct agent from scratch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.philschmid.de/gemma-function-calling" rel="noopener noreferrer"&gt;https://www.philschmid.de/gemma-function-calling&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.philschmid.de/langgraph-gemini-2-5-react-agent" rel="noopener noreferrer"&gt;https://www.philschmid.de/langgraph-gemini-2-5-react-agent&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
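&lt;p&gt;The loop itself is simple. Below is a minimal, hypothetical Python sketch of the tool-dispatch architecture: the board lives in memory instead of a GUI, and any callable can stand in for the LLM (the real program queries &lt;code&gt;qwen3&lt;/code&gt; via Ollama and clicks buttons with &lt;code&gt;pyautogui&lt;/code&gt;). The convention that each named move shifts the blank tile is my assumption, not necessarily what the GUI buttons do:&lt;/p&gt;

```python
# Hypothetical sketch: the LLM's chosen tool name is dispatched to a board
# update. Here "move_up" etc. shift the blank tile (0); the real program
# clicks GUI buttons instead.

MOVES = {
    "move_up": (-1, 0),
    "move_down": (1, 0),
    "move_left": (0, -1),
    "move_right": (0, 1),
}

def find_blank(board):
    for r, row in enumerate(board):
        for c, value in enumerate(row):
            if value == 0:
                return r, c

def apply_move(board, name):
    """Apply a named tool call; returns False (board unchanged) if illegal."""
    r, c = find_blank(board)
    dr, dc = MOVES[name]
    nr, nc = r + dr, c + dc
    n = len(board)
    if 0 <= nr < n and 0 <= nc < n:
        board[r][c], board[nr][nc] = board[nr][nc], board[r][c]
        return True
    return False

def run_agent(board, goal, choose_move, max_turns=50):
    """Agent loop: `choose_move` stands in for one LLM turn (board state in,
    tool name out); in the real program it is an Ollama chat call."""
    for _ in range(max_turns):
        if board == goal:
            return True
        apply_move(board, choose_move(board))
    return board == goal
```

&lt;p&gt;Swapping the stub for a real model call (and &lt;code&gt;apply_move&lt;/code&gt; for a mouse click) gives the structure of &lt;code&gt;demo_1_agent.py&lt;/code&gt;.&lt;/p&gt;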

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;To install the necessary libraries, type the following (assuming you are using &lt;code&gt;conda&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
conda create &lt;span class="nt"&gt;--name&lt;/span&gt; aiagent &lt;span class="nv"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3.13

conda activate aiagent

pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to run
&lt;/h2&gt;

&lt;p&gt;There are two files, &lt;code&gt;demo_1_n_puzzle_gui.py&lt;/code&gt; (the GUI) and &lt;code&gt;demo_1_agent.py&lt;/code&gt; (the AI agent). First, run the GUI file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
python demo_1_n_puzzle_gui.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The N puzzle GUI will show up. Now, move the window to a position of your choosing (I used the top left corner). This is needed because the AI agent will control the mouse to click the move up, down, left, and right buttons to interact with the GUI.&lt;/p&gt;

&lt;p&gt;Next, we use the &lt;code&gt;pyautogui&lt;/code&gt; library to make the agent program aware of the button locations. Follow the quickstart tutorial to get the coordinates: &lt;a href="https://pyautogui.readthedocs.io/en/latest/quickstart.html" rel="noopener noreferrer"&gt;link&lt;/a&gt;. An example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="o"&gt;(&lt;/span&gt;aiagent&lt;span class="o"&gt;)&lt;/span&gt; C:&lt;span class="se"&gt;\T&lt;/span&gt;RUONG&lt;span class="se"&gt;\C&lt;/span&gt;ode_tu_hoc&lt;span class="se"&gt;\A&lt;/span&gt;I_agent_tutorials&lt;span class="se"&gt;\N&lt;/span&gt;_puzzle_agent&lt;span class="se"&gt;\d&lt;/span&gt;emo1&amp;gt;python

Python 3.13.5 | packaged by Anaconda, Inc. | &lt;span class="o"&gt;(&lt;/span&gt;main, Jun 12 2025, 16:37:03&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;MSC v.1929 64 bit &lt;span class="o"&gt;(&lt;/span&gt;AMD64&lt;span class="o"&gt;)]&lt;/span&gt; on win32

Type &lt;span class="s2"&gt;"help"&lt;/span&gt;, &lt;span class="s2"&gt;"copyright"&lt;/span&gt;, &lt;span class="s2"&gt;"credits"&lt;/span&gt; or &lt;span class="s2"&gt;"license"&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;more information.

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; import pyautogui

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; pyautogui.position&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="c"&gt;# current mouse x and y. Move the mouse into position before enter&lt;/span&gt;

&lt;span class="o"&gt;(&lt;/span&gt;968, 56&lt;span class="o"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you get the coordinates, please populate the following fields in the &lt;code&gt;demo_1_agent.py&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
MOVE_UP_BUTTON_POS &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;285, 559&lt;span class="o"&gt;)&lt;/span&gt;

MOVE_DOWN_BUTTON_POS &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;279, 718&lt;span class="o"&gt;)&lt;/span&gt;

MOVE_LEFT_BUTTON_POS &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;195, 646&lt;span class="o"&gt;)&lt;/span&gt;

MOVE_RIGHT_BUTTON_POS &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;367, 647&lt;span class="o"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, open another Anaconda Prompt and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
ollama run qwen3:latest

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, open yet another Anaconda Prompt and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
python demo_1_agent.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should start seeing the model's thinking trace. Be patient; it takes a while for the AI agent to find the solution.&lt;/p&gt;

&lt;p&gt;However, a limitation of this code is that when I tried it on bigger problems (the 4x4 puzzle), the AI agent failed to solve them. Perhaps if I ran models that fit in 24GB of VRAM it might work, but then I would need to do additional experiments.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
