Short version: we went from 8/70 passing CUDA tests to a stable, auditable path by fixing NVRTC name resolution, argument marshaling, and unified-memory sync in DotCompute. No mysticism—just careful pointers and fewer foot-guns.
TL;DR
- NVRTC will happily mangle your kernel names. Resolve them explicitly.
- CUDA expects pointers to values for kernel params (yes, even device pointers).
- Unified memory needs synchronization before CPU access.
- Track every unmanaged allocation; free it.
- Tests climbed from 8/70 → 41/70 → aiming 95%+.
The Starting Point
The hardware test suite was a parade of classics:
- “named symbol not found” at launch
- GCHandle pinning failures (non-blittable types)
- CUDA 700 (illegal address) on kernel calls
- Unified memory access violations
- A demoralizing 8/70 green checks
What Actually Fixed Things
1) NVRTC name mangling (resolved)
Problem: extern "C" didn't save us; NVRTC produced mangled names while our loader looked for unmangled ones.
Fix: Register name expressions before compile, then retrieve lowered names after.
// Register before compilation
foreach (var funcName in functionNames)
{
    var nameExpr = $"&{funcName}";
    NvrtcInterop.nvrtcAddNameExpression(program, nameExpr);
}

// After compilation, retrieve the lowered (mangled) name for each function
foreach (var funcName in functionNames)
{
    var lowered = NvrtcInterop.GetLoweredName(program, $"&{funcName}");
    mangledNames[funcName] = lowered;
}
Impact: jumped us to 41+/70 passing tests.
2) Marshaling unified memory arguments
Problem: CudaUnifiedMemoryBuffer<T> isn't blittable; direct pinning fails.
Fix: Reflect out the device pointer and pass that (see §3 for the pointer-to-value nuance).
var argType = argValue.GetType();
if (argType.Name.StartsWith("CudaUnifiedMemoryBuffer"))
{
    var prop = argType.GetProperty("DevicePointer",
        BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance);
    if (prop?.GetValue(argValue) is IntPtr devicePtr)
    {
        unsafe
        {
            var storage = Marshal.AllocHGlobal(sizeof(IntPtr));
            *(IntPtr*)storage = devicePtr;      // store value
            unmanagedAllocations.Add(storage);  // remember to free
            return storage;                     // return pointer to value
        }
    }
}
3) The critical param passing bug
Problem: We passed device pointer values directly to cuLaunchKernel. CUDA wants an array of pointers to values.
Fix: Allocate space, write the value, pass a pointer to that space.
unsafe
{
    var storage = Marshal.AllocHGlobal(sizeof(IntPtr));
    *(IntPtr*)storage = devicePtr; // write the value
    unmanagedAllocations.Add(storage);
    return storage; // CUDA reads the value from here
}
Symptom this kills: persistent CUDA 700 on matrix-mult tests.
4) Unmanaged memory hygiene
Problem: Leaks from tiny per-arg allocations.
Fix: Track and free religiously.
var unmanagedAllocations = new List<IntPtr>();
try
{
    var argPtr = PrepareKernelArgument(arg, handles, unmanagedAllocations);
    // ... launch ...
}
finally
{
    foreach (var p in unmanagedAllocations)
        if (p != IntPtr.Zero) Marshal.FreeHGlobal(p);
}
5) Unified memory: sync before you touch
Problem: CPU reading stale/remote pages.
Fix: Gate host spans behind an explicit sync.
public override unsafe Span<T> GetSpan()
{
    EnsureOnHost(); // synchronize first
    return new Span<T>((void*)_hostPtr, Length);
}
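EnsureOnHost itself isn't shown above, so here's a minimal sketch of the idea, assuming a simple dirty flag and a direct P/Invoke to the CUDA runtime; the field, type, and library names are illustrative, not DotCompute's actual implementation.

// Assumed P/Invoke; the native library name varies by platform and CUDA
// version ("cudart64_12" on Windows, "libcudart" on Linux).
internal static class CudaRuntime
{
    [DllImport("cudart")]
    public static extern int cudaDeviceSynchronize();
}

// Inside the buffer: block until any outstanding GPU work that may touch
// this allocation has finished, then let the CPU read through the span.
private void EnsureOnHost()
{
    if (!_gpuMayHaveWritten)
        return;

    var status = CudaRuntime.cudaDeviceSynchronize();
    if (status != 0)
        throw new InvalidOperationException($"cudaDeviceSynchronize failed with code {status}");

    _gpuMayHaveWritten = false;
}

A per-buffer flag like this is coarse; when you know which stream touched the buffer, a stream-scoped cudaStreamSynchronize would be cheaper than synchronizing the whole device.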
Deep Dives (the “why”)
How CUDA reads kernel arguments
- Wrong: argPointers[i] = devicePtr;
- Right: argPointers[i] = &devicePtr;

cuLaunchKernel dereferences your argPointers to fetch the actual values. Device pointers are values too; treat them like any other scalar.
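To make the double indirection concrete, here's a hedged sketch of the launch-side shape. WriteToSlot is a hypothetical helper mirroring the AllocHGlobal pattern above, CudaDriver.cuLaunchKernel is an assumed simplified binding of the driver API (not DotCompute's actual one), and devicePtrA, devicePtrB, elementCount, function, gridX, and blockX stand in for values prepared earlier.

// Each slot holds one parameter VALUE (device pointers included); the array
// handed to cuLaunchKernel holds the slots' addresses.
static unsafe IntPtr WriteToSlot<T>(T value, List<IntPtr> allocations) where T : unmanaged
{
    var slot = Marshal.AllocHGlobal(sizeof(T));
    *(T*)slot = value;        // the slot holds the parameter's value
    allocations.Add(slot);    // freed after the launch (see §4)
    return slot;              // this address goes into the kernelParams array
}

var allocations = new List<IntPtr>();
IntPtr[] argPointers =
{
    WriteToSlot(devicePtrA, allocations),   // device pointers are scalars too
    WriteToSlot(devicePtrB, allocations),
    WriteToSlot(elementCount, allocations)
};

unsafe
{
    fixed (IntPtr* kernelParams = argPointers)
    {
        // CUDA dereferences each kernelParams[i] to read the actual value.
        var result = CudaDriver.cuLaunchKernel(
            function, gridX, 1, 1, blockX, 1, 1,
            0, IntPtr.Zero, (IntPtr)kernelParams, IntPtr.Zero);
        // result must be checked (0 == CUDA_SUCCESS); see the checklist below.
    }
}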
C# overload resolution trap
Without the generic arg, a params overload may win by accident:
LaunchAsync(config, buf1, buf2, buf3, size); // may hit wrong overload
LaunchAsync<float>(config, buf1, buf2, buf3, size); // forces intended path
Lessons (written on a sticky note)
- 700 ≠ haunted memory — often just wrong argument plumbing.
- P/Invoke is sharp — mind blittability, lifetimes, and double-indirection.
- GPU is async by default — sync before the CPU peeks.
- Reflection is a tool, not a lifestyle — but it saved us here.
- Iterate mercilessly — fix → run → commit → repeat.
Status & Performance
- Start: 8 / 70 (11.4%)
- After NVRTC fix: 41 / 70 (58.6%)
- Targeting 95%+ with the remaining stragglers (edge cases & perf polish).
Production Checklist
- [ ] Every CUDA/NVRTC call checks return codes (see the sketch after this list)
- [ ] All unmanaged allocations tracked & freed
- [ ] Timers/metrics around compile, HtoD/DtoH, and kernel time
- [ ] Launch config validated against device caps
- [ ] Tests for edge cases, stress, and multi-GPU
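On the first checklist item, a small guard keeps return-code checks from drowning the call sites. A sketch, assuming the common 0 == success convention shared by driver, runtime, and NVRTC status values, and that the bindings surface the raw status as an int; the helper name is illustrative.

internal static class Check
{
    // status: raw CUresult / cudaError_t / nvrtcResult value from the P/Invoke layer.
    public static void Cuda(int status, string call)
    {
        if (status != 0)
            throw new InvalidOperationException($"{call} failed with error code {status}");
    }
}

// Usage at every P/Invoke boundary (assuming the bindings return the raw status):
// Check.Cuda(NvrtcInterop.nvrtcAddNameExpression(program, nameExpr), "nvrtcAddNameExpression");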
What’s Next
- Dynamic parallelism (flag plumbing + tests)
- Faster transfers for bidirectional workloads
- CUDA Graphs to amortize launch overhead
- Texture/constant memory where patterns fit
- Nsight-based profiling in CI
Appendix: File-level changes
- CudaKernelCompiler.cs: NVRTC name resolution
- CudaKernelLauncher.cs: argument marshaling, lifetime fixes
- CudaUnifiedMemoryBuffer.cs: host-access synchronization
- CudaKernelExecutionTests.cs: overload resolution fix
Key APIs we leaned on
- nvrtcAddNameExpression, nvrtcGetLoweredName
- Marshal.AllocHGlobal, Marshal.FreeHGlobal, GCHandle.Alloc
- cudaDeviceSynchronize (and friends)
Handy error codes
- NVRTC_ERROR_COMPILATION — syntax/flags/headers
- 700 — illegal address (often arg passing)
- 716 — misaligned address
- 719 — launch failure
Credit where due: ClaudeCode helped pattern-match the error soup, surface the right docs, and keep the loop tight. The wins are still boring engineering ones—our favorite kind.