I started gocudrv with one constraint:
I wanted Go code to call CUDA without making every build depend on CUDA headers, a C compiler, or cgo.
That means loading the NVIDIA driver at runtime instead of linking against CUDA at build time.
Why avoid cgo?
cgo is the normal way to call C from Go, and often the right tool. But it also makes builds heavier.
A package that uses cgo needs:
- a C compiler
- platform-specific toolchains
- cross-compilers for cross-platform builds
For this project, that was exactly the setup I wanted to avoid.
The goal was a normal Go build:
```shell
CGO_ENABLED=0 go build ./...
```
The binary still requires an NVIDIA driver on the machine where it runs.
It just does not require the CUDA toolkit on the machine where it is built.
Why the Driver API?
CUDA exposes two major APIs:
- the higher-level Runtime API
- the lower-level Driver API
gocudrv uses the Driver API because it is exposed directly by the NVIDIA driver itself:
- `libcuda.so.1` on Linux/WSL
- `nvcuda.dll` on Windows
That allows the program to:
- load the driver dynamically at startup
- bind only the symbols it needs
- fail gracefully if the driver is missing
The Driver API is also backward compatible, which makes it a better fit for a thin binding layer.
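To make the "fail gracefully" point concrete, here is a small sketch of choosing the driver library name per platform and reporting a clear error when there is no known library. The library names come from above; the `driverLibName` helper itself is hypothetical:

```go
package main

import "fmt"

// driverLibName maps an operating system name to the shared library the
// NVIDIA driver exposes the Driver API under. Returns an error for
// platforms with no known CUDA driver library.
func driverLibName(goos string) (string, error) {
	switch goos {
	case "linux":
		return "libcuda.so.1", nil
	case "windows":
		return "nvcuda.dll", nil
	default:
		return "", fmt.Errorf("no known CUDA driver library for %s", goos)
	}
}

func main() {
	for _, goos := range []string{"linux", "windows", "darwin"} {
		name, err := driverLibName(goos)
		if err != nil {
			fmt.Println(goos+":", err)
			continue
		}
		fmt.Println(goos+":", name)
	}
}
```

In a real binding this decision would feed directly into the dynamic-load step, so a missing driver surfaces as an ordinary Go error rather than a crash at startup.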
Where purego fits
gocudrv uses purego to open shared libraries and bind native functions without cgo.
At the top level, initialization looks pretty ordinary:
```go
package main

import (
	"fmt"
	"log"

	"github.com/eitamring/gocudrv/cuda"
)

func main() {
	if err := cuda.Init(); err != nil {
		log.Fatal(err)
	}
	v, err := cuda.DriverVersion()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("CUDA driver: %d.%d\n", v/1000, (v%1000)/10)
}
```
Underneath that small API, the package:
- locates the driver library
- binds functions like `cuInit` and `cuDriverGetVersion`
- calls `cuInit(0)`
- maps CUDA result codes into Go errors
What this does not buy
Skipping cgo does not remove the C boundary.
It just makes the boundary more manual.
The library still has to define:
- function signatures
- pointer types
- struct layouts
- alignment and padding
exactly as the CUDA ABI expects them.
If a native function expects a pointer to a struct, the Go side must pass memory with the exact same layout. The compiler will not rescue a bad binding.
That tradeoff is worth it for this project, but it is still a tradeoff.
Next steps
Loading the driver is only the first step.
A machine may have:
- zero GPUs
- one GPU
- several GPUs
The next step is handling devices, contexts, memory, and eventually streams and async execution cleanly from Go.
The project is still very early, but the vector-add example already works.