<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dang Manh Truong</title>
    <description>The latest articles on DEV Community by Dang Manh Truong (@dangmanhtruong1995).</description>
    <link>https://dev.to/dangmanhtruong1995</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F239267%2Ff4733a92-7a58-4439-b163-9daff65c654e.jpeg</url>
      <title>DEV Community: Dang Manh Truong</title>
      <link>https://dev.to/dangmanhtruong1995</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dangmanhtruong1995"/>
    <language>en</language>
    <item>
      <title>A GPU-accelerated implementation of Forman-Ricci curvature-based graph clustering in CUDA.</title>
      <dc:creator>Dang Manh Truong</dc:creator>
      <pubDate>Sat, 17 Jan 2026 17:15:32 +0000</pubDate>
      <link>https://dev.to/dangmanhtruong1995/a-gpu-accelerated-implementation-of-forman-ricci-curvature-based-graph-clustering-in-cuda-3ob1</link>
      <guid>https://dev.to/dangmanhtruong1995/a-gpu-accelerated-implementation-of-forman-ricci-curvature-based-graph-clustering-in-cuda-3ob1</guid>
      <description>&lt;p&gt;Hi everyone! I've been trying to level up my CUDA skills (hoping it helps with job search), and I figured reimplementing existing GPU code wouldn't show much novelty. So I looked for algorithms that haven't been done on GPU yet.&lt;br&gt;
Luckily, I was trying to read this paper in a machine learning journal, and while most of its contents escaped me, the topic was about something called Ricci curvature-based clustering. The core idea is elegant: edges within clusters have high Ricci curvature, while edges between clusters ("bridges") have low curvature. By running a discrete Ricci flow and thresholding, you can reveal community structure. (Fun fact: This is related to the Ricci flow that Perelman used to prove the Poincaré conjecture!)&lt;br&gt;
There are two variants: Ollivier-Ricci and Forman-Ricci. I implemented Forman-Ricci since I couldn't find any existing GPU implementation (and also because it's simpler; I may do Ollivier-Ricci in the future).&lt;br&gt;
How it works&lt;br&gt;
Graph generation: Uses Stochastic Block Model (SBM) with P_in (intra-cluster edge probability) and P_out (inter-cluster probability)&lt;br&gt;
Phase 1 — Ricci Flow: Iteratively compute curvature, update edge weights, normalize. Bridges end up with higher weights than intra-cluster edges.&lt;br&gt;
Phase 2 — Threshold search: Find optimal threshold using modularity. Cut edges above threshold, find connected components, pick the partition with highest modularity.&lt;br&gt;
The code is not heavily optimised, but I think it will still help with large networks. I was basically learning every step of the way during the implementation, so you'll see my own CUDA functions for prefix sums, connected components (a GPU version using BFS, not very efficient I know, but I reused BFS code I had written earlier), bitonic sort, etc., and they are not the best versions, since I was still learning.&lt;br&gt;
As an aside, I naturally used AI (Claude) to debug, check for errors, etc., while working on this. It was helpful for basic tasks (e.g. "Check if I missed any free or cudaFree"), but it also introduced bugs of its own, which was rather annoying. Firstly, there was this line:&lt;br&gt;
if (w &amp;lt; threshold &amp;amp;&amp;amp; component_id[neighbor] == -1)&lt;br&gt;
which the AI was rather obsessed with: it kept periodically asking me to fix it, changing the comparison to &amp;gt;, then after a while to &amp;lt;=, and so on. Of course I knew it was leading me by the nose, so I went back to the papers and did it my own way. Secondly, somewhere during the debugging process, the AI replaced the Forman-Ricci equation with the local clustering coefficient, which is not what I wanted, and it took me some time to fix. Thirdly, I figured that if I asked the AI to create a Python version implementing the same thing, and the Python version ran fine while the CUDA version failed, the problem would likely be numerical stability. So I did, and the Python code worked okay; however, during threshold selection the AI sneaked in a second loop that chose the threshold using another metric, NMI (Normalized Mutual Information). The problem is that NMI depends on the ground truth, so the AI was making the algorithm overfit to the data, which is completely wrong. Luckily I was able to fix it in time.&lt;br&gt;
Results (tested on an RTX 4090 with 24GB VRAM, 64GB RAM):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Nodes&lt;/th&gt;
&lt;th&gt;Clusters&lt;/th&gt;
&lt;th&gt;Edges&lt;/th&gt;
&lt;th&gt;P_in&lt;/th&gt;
&lt;th&gt;P_out&lt;/th&gt;
&lt;th&gt;Iterations&lt;/th&gt;
&lt;th&gt;NMI&lt;/th&gt;
&lt;th&gt;GPU Time (s)&lt;/th&gt;
&lt;th&gt;CPU Time (s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;~3M&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;7.03&lt;/td&gt;
&lt;td&gt;15,189.21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;~25M&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;0.001&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;74.39&lt;/td&gt;
&lt;td&gt;162,401.93&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;~102M&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;0.001&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;625.46&lt;/td&gt;
&lt;td&gt;TBA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500,000&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;~126M&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;0.00001&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;0.89&lt;/td&gt;
&lt;td&gt;1086.25&lt;/td&gt;
&lt;td&gt;TBA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
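&lt;p&gt;For readers who want the gist without reading CUDA, here is a minimal pure-Python sketch of the pipeline. It is my own simplification, not the repository code: it uses the unweighted combinatorial form of augmented Forman-Ricci curvature, a schematic flow update, and union-find (instead of my BFS kernel) for connected components:&lt;/p&gt;

```python
# Simplified sketch of the two phases (assumptions noted inline); the CUDA
# version parallelizes each of these steps across edges.

def augmented_forman_curvature(adj, u, v):
    """Unweighted combinatorial form: F(e) = 4 - deg(u) - deg(v) + 3*T(e),
    where T(e) is the number of triangles containing edge e = (u, v)."""
    triangles = len(adj[u] & adj[v])  # common neighbors close triangles
    return 4 - len(adj[u]) - len(adj[v]) + 3 * triangles

def ricci_flow_step(adj, weights, step=0.1):
    """Phase 1 (schematic): shrink positively curved intra-cluster edges,
    grow negatively curved bridges, then renormalize the total weight.
    The exact update rule follows the paper, not this sketch."""
    new = {}
    for (u, v), w in weights.items():
        f = augmented_forman_curvature(adj, u, v)
        new[(u, v)] = max(w * (1.0 - step * f), 1e-9)
    scale = len(new) / sum(new.values())  # keep the average weight at 1
    return {e: w * scale for e, w in new.items()}

def components(nodes, edges):
    """Connected-component labels via union-find (the CUDA code uses BFS)."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return {n: find(n) for n in nodes}

def modularity(edges, community):
    """Newman modularity of a partition, on the original unweighted graph."""
    m = len(edges)
    intra, deg = {}, {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        deg[cu] = deg.get(cu, 0) + 1
        deg[cv] = deg.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (deg[c] / (2 * m)) ** 2 for c in deg)

def best_partition(nodes, weighted_edges):
    """Phase 2: for each candidate threshold, cut edges above it (bridges
    end up heavy after the flow), label the components, and keep the
    partition with the highest modularity."""
    all_edges = [(u, v) for u, v, _ in weighted_edges]
    best_q, best_comm = float("-inf"), {n: 0 for n in nodes}
    for t in sorted({w for _, _, w in weighted_edges}):
        kept = [(u, v) for u, v, w in weighted_edges if w <= t]
        comm = components(nodes, kept)
        q = modularity(all_edges, comm)
        if q > best_q:
            best_q, best_comm = q, comm
    return best_comm, best_q
```

&lt;p&gt;On a toy graph of two triangles joined by a single bridge, one flow step already makes the bridge the heaviest edge, and the threshold search recovers the two triangles.&lt;/p&gt;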

&lt;p&gt;I hope you can provide feedback, comments, and suggestions on my code. Suggestions for next steps in upskilling in ML, job searching, etc. would also be welcome. Thank you very much.&lt;br&gt;
GitHub link: &lt;a href="https://github.com/dangmanhtruong1995/Ricci-curvature-clustering-CUDA" rel="noopener noreferrer"&gt;https://github.com/dangmanhtruong1995/Ricci-curvature-clustering-CUDA&lt;/a&gt;&lt;br&gt;
Update: Here is the NCU report:&lt;br&gt;
array_sum_blockwise(double *, int, double *) (986034, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.9&lt;br&gt;
Section: GPU Speed Of Light Throughput&lt;/p&gt;




&lt;p&gt;Metric Name             Metric Unit  Metric Value&lt;/p&gt;




&lt;p&gt;DRAM Frequency                  Ghz         10.24&lt;br&gt;
SM Frequency                    Ghz          2.22&lt;br&gt;
Elapsed Cycles                cycle    10,272,695&lt;br&gt;
Memory Throughput                 %         49.94&lt;br&gt;
DRAM Throughput                   %         49.94&lt;br&gt;
Duration                         ms          4.61&lt;br&gt;
L1/TEX Cache Throughput           %         32.78&lt;br&gt;
L2 Cache Throughput               %         19.64&lt;br&gt;
SM Active Cycles              cycle 10,245,038.77&lt;br&gt;
Compute (SM) Throughput           %         86.58&lt;/p&gt;




&lt;p&gt;INF   This workload is utilizing greater than 80.0% of the available compute or memory performance of the device.&lt;br&gt;
To further improve performance, work will likely need to be shifted from the most utilized to another unit.&lt;br&gt;
Start by analyzing workloads in the Compute Workload Analysis section.&lt;/p&gt;

&lt;p&gt;Section: Launch Statistics&lt;/p&gt;




&lt;p&gt;Metric Name                          Metric Unit    Metric Value&lt;/p&gt;




&lt;p&gt;Block Size                                                   256&lt;br&gt;
Function Cache Configuration                     CachePreferNone&lt;br&gt;
Grid Size                                                986,034&lt;br&gt;
Registers Per Thread             register/thread              16&lt;br&gt;
Shared Memory Configuration Size           Kbyte           65.54&lt;br&gt;
Driver Shared Memory Per Block       Kbyte/block            1.02&lt;br&gt;
Dynamic Shared Memory Per Block       byte/block               0&lt;br&gt;
Static Shared Memory Per Block       Kbyte/block            2.05&lt;/p&gt;

&lt;p&gt;SMs                                         SM             128&lt;br&gt;
Stack Size                                                 1,024&lt;br&gt;
Threads                                   thread     252,424,704&lt;br&gt;
TPCs                                                          64&lt;br&gt;
Enabled TPC IDs                                              all&lt;br&gt;
Uses Green Context                                             0&lt;br&gt;
Waves Per SM                                            1,283.90&lt;/p&gt;




&lt;p&gt;Section: Occupancy&lt;/p&gt;




&lt;p&gt;Metric Name                     Metric Unit Metric Value&lt;/p&gt;




&lt;p&gt;Block Limit SM                        block           24&lt;br&gt;
Block Limit Registers                 block           16&lt;br&gt;
Block Limit Shared Mem                block           21&lt;br&gt;
Block Limit Warps                     block            6&lt;br&gt;
Theoretical Active Warps per SM        warp           48&lt;br&gt;
Theoretical Occupancy                     %          100&lt;br&gt;
Achieved Occupancy                        %        94.58&lt;br&gt;
Achieved Active Warps Per SM           warp        45.40&lt;/p&gt;




&lt;p&gt;Section: GPU and Memory Workload Distribution&lt;/p&gt;




&lt;p&gt;Metric Name                Metric Unit  Metric Value&lt;/p&gt;




&lt;p&gt;Average DRAM Active Cycles       cycle 23,572,417.33&lt;br&gt;
Total DRAM Elapsed Cycles        cycle   566,390,784&lt;br&gt;
Average L1 Active Cycles         cycle 10,245,038.77&lt;br&gt;
Total L1 Elapsed Cycles          cycle 1,311,918,402&lt;br&gt;
Average L2 Active Cycles         cycle  8,924,399.94&lt;br&gt;
Total L2 Elapsed Cycles          cycle   321,527,232&lt;br&gt;
Average SM Active Cycles         cycle 10,245,038.77&lt;br&gt;
Total SM Elapsed Cycles          cycle 1,311,918,402&lt;br&gt;
Average SMSP Active Cycles       cycle 10,244,765.62&lt;br&gt;
Total SMSP Elapsed Cycles        cycle 5,247,673,608&lt;/p&gt;




&lt;p&gt;normalize_weights(double *, double, int, int) (986034, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.9&lt;br&gt;
Section: GPU Speed Of Light Throughput&lt;/p&gt;




&lt;p&gt;Metric Name             Metric Unit  Metric Value&lt;/p&gt;




&lt;p&gt;DRAM Frequency                  Ghz         10.24&lt;br&gt;
SM Frequency                    Ghz          2.23&lt;br&gt;
Elapsed Cycles                cycle    12,282,906&lt;br&gt;
Memory Throughput                 %         73.96&lt;br&gt;
DRAM Throughput                   %         73.96&lt;br&gt;
Duration                         ms          5.50&lt;br&gt;
L1/TEX Cache Throughput           %          9.04&lt;br&gt;
L2 Cache Throughput               %         18.41&lt;br&gt;
SM Active Cycles              cycle 12,269,948.88&lt;br&gt;
Compute (SM) Throughput           %         88.37&lt;/p&gt;




&lt;p&gt;INF   This workload is utilizing greater than 80.0% of the available compute or memory performance of the device.&lt;br&gt;
To further improve performance, work will likely need to be shifted from the most utilized to another unit.&lt;br&gt;
Start by analyzing workloads in the Compute Workload Analysis section.&lt;/p&gt;

&lt;p&gt;Section: Launch Statistics&lt;/p&gt;




&lt;p&gt;Metric Name                          Metric Unit    Metric Value&lt;/p&gt;




&lt;p&gt;Block Size                                                   256&lt;br&gt;
Function Cache Configuration                     CachePreferNone&lt;br&gt;
Grid Size                                                986,034&lt;br&gt;
Registers Per Thread             register/thread              24&lt;br&gt;
Shared Memory Configuration Size           Kbyte           16.38&lt;br&gt;
Driver Shared Memory Per Block       Kbyte/block            1.02&lt;br&gt;
Dynamic Shared Memory Per Block       byte/block               0&lt;br&gt;
Static Shared Memory Per Block        byte/block               0&lt;/p&gt;

&lt;p&gt;SMs                                         SM             128&lt;br&gt;
Stack Size                                                 1,024&lt;br&gt;
Threads                                   thread     252,424,704&lt;br&gt;
TPCs                                                          64&lt;br&gt;
Enabled TPC IDs                                              all&lt;br&gt;
Uses Green Context                                             0&lt;br&gt;
Waves Per SM                                            1,283.90&lt;/p&gt;




&lt;p&gt;Section: Occupancy&lt;/p&gt;




&lt;p&gt;Metric Name                     Metric Unit Metric Value&lt;/p&gt;




&lt;p&gt;Block Limit SM                        block           24&lt;br&gt;
Block Limit Registers                 block           10&lt;br&gt;
Block Limit Shared Mem                block           16&lt;br&gt;
Block Limit Warps                     block            6&lt;br&gt;
Theoretical Active Warps per SM        warp           48&lt;br&gt;
Theoretical Occupancy                     %          100&lt;br&gt;
Achieved Occupancy                        %        90.44&lt;br&gt;
Achieved Active Warps Per SM           warp        43.41&lt;/p&gt;




&lt;p&gt;Section: GPU and Memory Workload Distribution&lt;/p&gt;




&lt;p&gt;Metric Name                Metric Unit  Metric Value&lt;/p&gt;




&lt;p&gt;Average DRAM Active Cycles       cycle 41,691,337.33&lt;br&gt;
Total DRAM Elapsed Cycles        cycle   676,417,536&lt;br&gt;
Average L1 Active Cycles         cycle 12,269,948.88&lt;br&gt;
Total L1 Elapsed Cycles          cycle 1,570,992,776&lt;br&gt;
Average L2 Active Cycles         cycle 10,694,785.44&lt;br&gt;
Total L2 Elapsed Cycles          cycle   385,577,928&lt;br&gt;
Average SM Active Cycles         cycle 12,269,948.88&lt;br&gt;
Total SM Elapsed Cycles          cycle 1,570,992,776&lt;br&gt;
Average SMSP Active Cycles       cycle 12,269,202.65&lt;br&gt;
Total SMSP Elapsed Cycles        cycle 6,283,971,104&lt;/p&gt;




&lt;p&gt;calc_augmented_forman_ricci_curvature(int *, int *, int *, int *, int *, double *, double *, int) (493017, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.9&lt;br&gt;
Section: GPU Speed Of Light Throughput&lt;/p&gt;




&lt;p&gt;Metric Name             Metric Unit      Metric Value&lt;/p&gt;




&lt;p&gt;DRAM Frequency                  Ghz             10.24&lt;br&gt;
SM Frequency                    Ghz              2.23&lt;br&gt;
Elapsed Cycles                cycle    27,686,168,816&lt;br&gt;
Memory Throughput                 %             15.03&lt;br&gt;
DRAM Throughput                   %              0.12&lt;br&gt;
Duration                          s             12.39&lt;br&gt;
L1/TEX Cache Throughput           %              7.32&lt;br&gt;
L2 Cache Throughput               %             15.03&lt;br&gt;
SM Active Cycles              cycle 27,676,378,817.82&lt;br&gt;
Compute (SM) Throughput           %             88.25&lt;/p&gt;




&lt;p&gt;INF   This workload is utilizing greater than 80.0% of the available compute or memory performance of the device.&lt;br&gt;
To further improve performance, work will likely need to be shifted from the most utilized to another unit.&lt;br&gt;
Start by analyzing workloads in the Compute Workload Analysis section.&lt;/p&gt;

&lt;p&gt;Section: Launch Statistics&lt;/p&gt;




&lt;p&gt;Metric Name                          Metric Unit    Metric Value&lt;/p&gt;




&lt;p&gt;Block Size                                                   256&lt;br&gt;
Function Cache Configuration                     CachePreferNone&lt;br&gt;
Grid Size                                                493,017&lt;br&gt;
Registers Per Thread             register/thread              38&lt;br&gt;
Shared Memory Configuration Size           Kbyte           16.38&lt;br&gt;
Driver Shared Memory Per Block       Kbyte/block            1.02&lt;br&gt;
Dynamic Shared Memory Per Block       byte/block               0&lt;br&gt;
Static Shared Memory Per Block        byte/block               0&lt;/p&gt;

&lt;p&gt;SMs                                         SM             128&lt;br&gt;
Stack Size                                                 1,024&lt;br&gt;
Threads                                   thread     126,212,352&lt;br&gt;
TPCs                                                          64&lt;br&gt;
Enabled TPC IDs                                              all&lt;br&gt;
Uses Green Context                                             0&lt;br&gt;
Waves Per SM                                              641.95&lt;/p&gt;




&lt;p&gt;Section: Occupancy&lt;/p&gt;




&lt;p&gt;Metric Name                     Metric Unit Metric Value&lt;/p&gt;




&lt;p&gt;Block Limit SM                        block           24&lt;br&gt;
Block Limit Registers                 block            6&lt;br&gt;
Block Limit Shared Mem                block           16&lt;br&gt;
Block Limit Warps                     block            6&lt;br&gt;
Theoretical Active Warps per SM        warp           48&lt;br&gt;
Theoretical Occupancy                     %          100&lt;br&gt;
Achieved Occupancy                        %        78.85&lt;br&gt;
Achieved Active Warps Per SM           warp        37.85&lt;/p&gt;




&lt;p&gt;OPT   Est. Local Speedup: 21.15%&lt;br&gt;
The difference between calculated theoretical (100.0%) and measured achieved occupancy (78.9%) can be the&lt;br&gt;
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can&lt;br&gt;
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices&lt;br&gt;
Guide (&lt;a href="https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy" rel="noopener noreferrer"&gt;https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy&lt;/a&gt;) for more details on&lt;br&gt;
optimizing occupancy.&lt;/p&gt;

&lt;p&gt;Section: GPU and Memory Workload Distribution&lt;/p&gt;




&lt;p&gt;Metric Name                Metric Unit       Metric Value&lt;/p&gt;




&lt;p&gt;Average DRAM Active Cycles       cycle     150,666,781.33&lt;br&gt;
Total DRAM Elapsed Cycles        cycle  1,522,442,628,096&lt;br&gt;
Average L1 Active Cycles         cycle  27,676,378,817.82&lt;br&gt;
Total L1 Elapsed Cycles          cycle  3,543,769,283,906&lt;br&gt;
Average L2 Active Cycles         cycle  23,831,160,793.03&lt;br&gt;
Total L2 Elapsed Cycles          cycle    869,605,685,676&lt;br&gt;
Average SM Active Cycles         cycle  27,676,378,817.82&lt;br&gt;
Total SM Elapsed Cycles          cycle  3,543,769,283,906&lt;br&gt;
Average SMSP Active Cycles       cycle  27,672,031,239.79&lt;br&gt;
Total SMSP Elapsed Cycles        cycle 14,175,077,135,624&lt;/p&gt;




&lt;p&gt;update_weight(int *, int *, int *, int *, int *, double *, double *, double, int) (493017, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.9&lt;br&gt;
Section: GPU Speed Of Light Throughput&lt;/p&gt;




&lt;p&gt;Metric Name             Metric Unit   Metric Value&lt;/p&gt;




&lt;p&gt;DRAM Frequency                  Ghz          10.24&lt;br&gt;
SM Frequency                    Ghz           2.23&lt;br&gt;
Elapsed Cycles                cycle    452,955,045&lt;br&gt;
Memory Throughput                 %          71.00&lt;br&gt;
DRAM Throughput                   %           3.10&lt;br&gt;
Duration                         ms         202.87&lt;br&gt;
L1/TEX Cache Throughput           %          35.02&lt;br&gt;
L2 Cache Throughput               %          71.00&lt;br&gt;
SM Active Cycles              cycle 455,399,514.55&lt;br&gt;
Compute (SM) Throughput           %           8.85&lt;/p&gt;




&lt;p&gt;OPT   Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the L2&lt;br&gt;
bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes&lt;br&gt;
transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or&lt;br&gt;
whether there are values you can (re)compute.&lt;/p&gt;

&lt;p&gt;Section: Launch Statistics&lt;/p&gt;




&lt;p&gt;Metric Name                          Metric Unit    Metric Value&lt;/p&gt;




&lt;p&gt;Block Size                                                   256&lt;br&gt;
Function Cache Configuration                     CachePreferNone&lt;br&gt;
Grid Size                                                493,017&lt;br&gt;
Registers Per Thread             register/thread              16&lt;br&gt;
Shared Memory Configuration Size           Kbyte           16.38&lt;br&gt;
Driver Shared Memory Per Block       Kbyte/block            1.02&lt;br&gt;
Dynamic Shared Memory Per Block       byte/block               0&lt;br&gt;
Static Shared Memory Per Block        byte/block               0&lt;/p&gt;

&lt;p&gt;SMs                                         SM             128&lt;br&gt;
Stack Size                                                 1,024&lt;br&gt;
Threads                                   thread     126,212,352&lt;br&gt;
TPCs                                                          64&lt;br&gt;
Enabled TPC IDs                                              all&lt;br&gt;
Uses Green Context                                             0&lt;br&gt;
Waves Per SM                                              641.95&lt;/p&gt;




&lt;p&gt;Section: Occupancy&lt;/p&gt;




&lt;p&gt;Metric Name                     Metric Unit Metric Value&lt;/p&gt;




&lt;p&gt;Block Limit SM                        block           24&lt;br&gt;
Block Limit Registers                 block           16&lt;br&gt;
Block Limit Shared Mem                block           16&lt;br&gt;
Block Limit Warps                     block            6&lt;br&gt;
Theoretical Active Warps per SM        warp           48&lt;br&gt;
Theoretical Occupancy                     %          100&lt;br&gt;
Achieved Occupancy                        %        82.44&lt;br&gt;
Achieved Active Warps Per SM           warp        39.57&lt;/p&gt;




&lt;p&gt;OPT   Est. Local Speedup: 17.56%&lt;br&gt;
The difference between calculated theoretical (100.0%) and measured achieved occupancy (82.4%) can be the&lt;br&gt;
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can&lt;br&gt;
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices&lt;br&gt;
Guide (&lt;a href="https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy" rel="noopener noreferrer"&gt;https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy&lt;/a&gt;) for more details on&lt;br&gt;
optimizing occupancy.&lt;/p&gt;

&lt;p&gt;Section: GPU and Memory Workload Distribution&lt;/p&gt;




&lt;p&gt;Metric Name                Metric Unit    Metric Value&lt;/p&gt;




&lt;p&gt;Average DRAM Active Cycles       cycle   64,409,714.67&lt;br&gt;
Total DRAM Elapsed Cycles        cycle  24,932,431,872&lt;br&gt;
Average L1 Active Cycles         cycle  455,399,514.55&lt;br&gt;
Total L1 Elapsed Cycles          cycle  57,739,469,562&lt;br&gt;
Average L2 Active Cycles         cycle  396,726,153.17&lt;br&gt;
Total L2 Elapsed Cycles          cycle  14,226,411,996&lt;br&gt;
Average SM Active Cycles         cycle  455,399,514.55&lt;br&gt;
Total SM Elapsed Cycles          cycle  57,739,469,562&lt;br&gt;
Average SMSP Active Cycles       cycle  454,534,130.16&lt;br&gt;
Total SMSP Elapsed Cycles        cycle 230,957,878,248&lt;/p&gt;




</description>
      <category>programming</category>
      <category>cuda</category>
      <category>clustering</category>
      <category>ai</category>
    </item>
    <item>
      <title>Using a local LLM AI agent to solve the N puzzle</title>
      <dc:creator>Dang Manh Truong</dc:creator>
      <pubDate>Mon, 18 Aug 2025 16:01:03 +0000</pubDate>
      <link>https://dev.to/dangmanhtruong1995/using-a-local-llm-ai-agent-to-solve-the-n-puzzle-38d2</link>
      <guid>https://dev.to/dangmanhtruong1995/using-a-local-llm-ai-agent-to-solve-the-n-puzzle-38d2</guid>
      <description>&lt;p&gt;Hi everyone, I have just made some program to make an AI agent solve the N puzzle.&lt;/p&gt;

&lt;p&gt;Github link: &lt;a href="https://github.com/dangmanhtruong1995/N-puzzle-Agent/tree/main" rel="noopener noreferrer"&gt;https://github.com/dangmanhtruong1995/N-puzzle-Agent/tree/main&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Youtube link: &lt;a href="https://www.youtube.com/watch?v=Ntol4F4tilg" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=Ntol4F4tilg&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;qwen3:latest&lt;/code&gt; model in the Ollama library was used as the agent, while I chose a simple N puzzle as the problem for it to solve.&lt;/p&gt;

&lt;p&gt;Experiments were done on an ASUS Vivobook Pro 15 laptop with an NVIDIA GeForce RTX 4060 (8GB of VRAM).&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This project demonstrates an AI agent solving the classic N-puzzle (sliding tile puzzle) by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Analyzing and planning optimal moves using the Qwen3 language model&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Executing moves through automated mouse clicks on the GUI&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;The LLM is given a prompt with instructions that it can control the following functions: &lt;code&gt;move_up, move_down, move_left, move_right&lt;/code&gt;. At each turn, the LLM chooses one of those functions, and the corresponding move is then made. The code is inspired by the following tutorials on function calling and building a ReAct agent from scratch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.philschmid.de/gemma-function-calling" rel="noopener noreferrer"&gt;https://www.philschmid.de/gemma-function-calling&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.philschmid.de/langgraph-gemini-2-5-react-agent" rel="noopener noreferrer"&gt;https://www.philschmid.de/langgraph-gemini-2-5-react-agent&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
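&lt;p&gt;The loop itself is simple. Below is a minimal, hypothetical Python sketch of the tool-dispatch architecture: the board lives in memory instead of a GUI, and any callable can stand in for the LLM (the real program queries &lt;code&gt;qwen3&lt;/code&gt; via Ollama and clicks buttons with &lt;code&gt;pyautogui&lt;/code&gt;). The convention that each named move shifts the blank tile is my assumption, not necessarily what the GUI buttons do:&lt;/p&gt;

```python
# Hypothetical sketch: the LLM's chosen tool name is dispatched to a board
# update. Here "move_up" etc. shift the blank tile (0); the real program
# clicks GUI buttons instead.

MOVES = {
    "move_up": (-1, 0),
    "move_down": (1, 0),
    "move_left": (0, -1),
    "move_right": (0, 1),
}

def find_blank(board):
    for r, row in enumerate(board):
        for c, value in enumerate(row):
            if value == 0:
                return r, c

def apply_move(board, name):
    """Apply a named tool call; returns False (board unchanged) if illegal."""
    r, c = find_blank(board)
    dr, dc = MOVES[name]
    nr, nc = r + dr, c + dc
    n = len(board)
    if 0 <= nr < n and 0 <= nc < n:
        board[r][c], board[nr][nc] = board[nr][nc], board[r][c]
        return True
    return False

def run_agent(board, goal, choose_move, max_turns=50):
    """Agent loop: `choose_move` stands in for one LLM turn (board state in,
    tool name out); in the real program it is an Ollama chat call."""
    for _ in range(max_turns):
        if board == goal:
            return True
        apply_move(board, choose_move(board))
    return board == goal
```

&lt;p&gt;Swapping the stub for a real model call (and &lt;code&gt;apply_move&lt;/code&gt; for a mouse click) gives the structure of &lt;code&gt;demo_1_agent.py&lt;/code&gt;.&lt;/p&gt;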

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;To install the necessary libraries, type the following (assuming you are using &lt;code&gt;conda&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
conda create &lt;span class="nt"&gt;--name&lt;/span&gt; aiagent &lt;span class="nv"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3.13

conda activate aiagent

pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to run
&lt;/h2&gt;

&lt;p&gt;There are two files, &lt;code&gt;demo_1_n_puzzle_gui.py&lt;/code&gt; (the GUI) and &lt;code&gt;demo_1_agent.py&lt;/code&gt; (the AI agent). First, run the GUI file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
python demo_1_n_puzzle_gui.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The N puzzle GUI will show up. Now, move the window to a position of your choosing (I used the top left corner). This is needed because the AI agent will control the mouse to click the move up, down, left, and right buttons to interact with the GUI.&lt;/p&gt;

&lt;p&gt;Next, we use the &lt;code&gt;pyautogui&lt;/code&gt; library to make the agent program aware of the button locations. Follow the quickstart tutorial to get the coordinates: &lt;a href="https://pyautogui.readthedocs.io/en/latest/quickstart.html" rel="noopener noreferrer"&gt;link&lt;/a&gt;. An example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="o"&gt;(&lt;/span&gt;aiagent&lt;span class="o"&gt;)&lt;/span&gt; C:&lt;span class="se"&gt;\T&lt;/span&gt;RUONG&lt;span class="se"&gt;\C&lt;/span&gt;ode_tu_hoc&lt;span class="se"&gt;\A&lt;/span&gt;I_agent_tutorials&lt;span class="se"&gt;\N&lt;/span&gt;_puzzle_agent&lt;span class="se"&gt;\d&lt;/span&gt;emo1&amp;gt;python

Python 3.13.5 | packaged by Anaconda, Inc. | &lt;span class="o"&gt;(&lt;/span&gt;main, Jun 12 2025, 16:37:03&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;MSC v.1929 64 bit &lt;span class="o"&gt;(&lt;/span&gt;AMD64&lt;span class="o"&gt;)]&lt;/span&gt; on win32

Type &lt;span class="s2"&gt;"help"&lt;/span&gt;, &lt;span class="s2"&gt;"copyright"&lt;/span&gt;, &lt;span class="s2"&gt;"credits"&lt;/span&gt; or &lt;span class="s2"&gt;"license"&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;more information.

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; import pyautogui

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; pyautogui.position&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="c"&gt;# current mouse x and y. Move the mouse into position before enter&lt;/span&gt;

&lt;span class="o"&gt;(&lt;/span&gt;968, 56&lt;span class="o"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you get the coordinates, please populate the following fields in the &lt;code&gt;demo_1_agent.py&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
MOVE_UP_BUTTON_POS &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;285, 559&lt;span class="o"&gt;)&lt;/span&gt;

MOVE_DOWN_BUTTON_POS &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;279, 718&lt;span class="o"&gt;)&lt;/span&gt;

MOVE_LEFT_BUTTON_POS &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;195, 646&lt;span class="o"&gt;)&lt;/span&gt;

MOVE_RIGHT_BUTTON_POS &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;367, 647&lt;span class="o"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, open another Anaconda Prompt and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
ollama run qwen3:latest

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, open yet another Anaconda Prompt and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
python demo_1_agent.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should start seeing the model's thinking trace. Be patient; it takes a while for the AI agent to find the solution.&lt;/p&gt;

&lt;p&gt;However, a limitation of this code is that when I tried it on bigger problems (the 4x4 puzzle), the AI agent failed to solve them. Perhaps if I ran models that fit in 24GB of VRAM it might work, but then I would need to do additional experiments.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
