Every Go developer who has worked with C libraries knows the pain: CGO requires a C compiler, breaks cross-compilation, bloats binaries, and adds ~200ns overhead per call. For our WebGPU bindings and ML framework, calling wgpu-native through CGO was a non-starter — we needed to ship a single static binary across Windows, Linux, and macOS without requiring users to install gcc.
So we built goffi — a pure Go FFI library that calls C functions through hand-written assembly, with zero C dependencies and zero per-call allocations. It now powers an entire ecosystem: go-webgpu/webgpu bindings, born-ml/born ML framework, and the GoGPU GPU computing platform with dual Rust and pure Go backends.
This article explains the architecture, the hard problems we solved, how goffi compares to purego, and how you can use it in your own projects.
The Problem
Our stack looks like this:
Go application (gogpu)
└─ wgpu bindings (gogpu/wgpu) ← needs to call C functions
└─ goffi ← this library
└─ wgpu-native (.dll/.so/.dylib)
We need to call hundreds of WebGPU functions from Go: create GPU devices, submit command buffers, handle async callbacks from Metal/Vulkan threads. The requirements were clear:
- No C compiler at build time — users run go get and it works
- Cross-compilation — GOOS=linux GOARCH=arm64 go build must work from Windows
- Callbacks from C threads — wgpu-native fires callbacks from internal GPU threads
- Struct passing — the WebGPU API passes structs by value (descriptors, extents, colors)
- Low overhead — GPU command encoding happens at 60+ FPS
CGO fails requirements 1 and 2. purego covers 1-2 but had gaps in 3-5 when we started. So we built goffi.
Architecture: 4 Layers Deep
Every goffi call traverses four layers to safely transition from Go's managed runtime to raw C code:
Go Code
│ ffi.CallFunction()
▼
runtime.cgocall ← Go runtime: switch to system stack, tell GC
│
▼
Assembly Wrapper ← Our code: load registers per ABI
│ RDI=arg0 RSI=arg1 ... XMM0=float0 ...
▼
C Function ← Target library
Layer 1: The Call Interface (CIF)
Unlike purego, which dispatches every call through reflection, goffi pre-computes everything once:
// Prepare once at init time
cif := &types.CallInterface{}
ffi.PrepareCallInterface(cif, types.DefaultCall,
types.UInt64TypeDescriptor, // return: size_t
[]*types.TypeDescriptor{types.PointerTypeDescriptor}, // arg: const char*
)
// Call many times — zero reflection, zero allocation
// args = []unsafe.Pointer{unsafe.Pointer(&myPtr)} — pointers TO arg values
ffi.CallFunction(cif, strlenPtr, unsafe.Pointer(&result), args)
PrepareCallInterface classifies each argument (integer register? SSE register? stack?), computes stack layout, and stores everything in a reusable CallInterface struct. The cif.Flags bitmask tells the assembly exactly what to do — no decisions at call time.
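To make that classification step concrete, here is a simplified sketch of System V argument classification for scalar arguments. This is illustrative, not goffi's actual code — the real rules are recursive for structs and cover more type classes:

```go
package main

import "fmt"

// Simplified System V AMD64 argument classes.
type argClass int

const (
	classInteger argClass = iota // passed in RDI/RSI/RDX/RCX/R8/R9
	classSSE                     // passed in XMM0-XMM7
	classMemory                  // spilled to the stack
)

type kind int

const (
	kindInt kind = iota
	kindPointer
	kindFloat
	kindDouble
)

// classify assigns one scalar argument to a register class, consuming
// a register counter; when registers run out, the argument goes to
// memory. Real classification also walks struct fields recursively.
func classify(k kind, gpUsed, sseUsed *int) argClass {
	switch k {
	case kindFloat, kindDouble:
		if *sseUsed < 8 {
			*sseUsed++
			return classSSE
		}
	default: // integers and pointers
		if *gpUsed < 6 {
			*gpUsed++
			return classInteger
		}
	}
	return classMemory
}

func main() {
	gp, sse := 0, 0
	for i, k := range []kind{kindPointer, kindInt, kindDouble, kindInt} {
		fmt.Printf("arg %d -> class %d\n", i, classify(k, &gp, &sse))
	}
}
```

Running the classifier once per signature at prepare time is exactly what lets the call path itself make zero decisions.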
Layer 2: Platform Assembly
We write assembly by hand for each ABI. Here's the core of our System V AMD64 implementation (syscall_unix_amd64.s). The function receives a pointer to a syscallArgs struct in DI, loads registers from it, calls the target, and writes return values back:
// syscall_unix_amd64.s — System V AMD64 ABI
TEXT syscallN(SB), NOSPLIT|NOFRAME, $0
PUSHQ BP
MOVQ SP, BP
SUBQ $STACK_SIZE, SP
MOVQ DI, R11 // R11 = args struct pointer
// Load 8 SSE registers from struct offsets 128-184
MOVQ 128(R11), X0 // XMM0
MOVQ 136(R11), X1 // XMM1
// ... XMM2-XMM7
// Push stack-spill args (a7-a15) from struct offsets 56-120
MOVQ 56(R11), R12
MOVQ R12, 0(SP) // stack slot 0
// Load 6 GP registers from struct offsets 8-48
MOVQ 8(R11), DI // RDI = arg 1
MOVQ 16(R11), SI // RSI = arg 2
MOVQ 24(R11), DX // RDX = arg 3
MOVQ 32(R11), CX // RCX = arg 4
MOVQ 40(R11), R8 // R8 = arg 5
MOVQ 48(R11), R9 // R9 = arg 6
MOVQ 0(R11), R10 // function pointer
CALL R10
// Save returns: RAX → r1, RDX → r2, XMM0 → f1
MOVQ PTR_ADDRESS(BP), DI
MOVQ AX, 192(DI) // integer return
MOVQ DX, 200(DI) // second return (9-16B structs)
MOVQ X0, 128(DI) // float return
// ... restore stack, RET
We maintain separate assembly for three ABIs:
| ABI | GP Registers | SSE Registers | Stack |
|---|---|---|---|
| System V AMD64 (Linux/macOS) | RDI, RSI, RDX, RCX, R8, R9 | XMM0-XMM7 | 16-byte aligned |
| Win64 (Windows) | RCX, RDX, R8, R9 | XMM0-XMM3 | 32-byte shadow space |
| AAPCS64 (ARM64) | X0-X7 | D0-D7 | 16-byte aligned |
Layer 3: Struct Returns
This is where it gets interesting. When a C function returns a struct, the ABI rules depend on size:
- <= 8 bytes: returned in RAX
- 9-16 bytes: split across RAX (low 8) + RDX (high 8)
- > 16 bytes: caller passes a hidden pointer as the first argument (sret)
Our handleReturn function assembles the result:
case types.StructType:
size := cif.ReturnType.Size
switch {
case size <= 8:
*(*uint64)(rvalue) = retVal
case size <= 16:
*(*uint64)(rvalue) = retVal // RAX → bytes 0-7
remaining := size - 8
src := (*[8]byte)(unsafe.Pointer(&retVal2))
dst := (*[8]byte)(unsafe.Add(rvalue, 8))
copy(dst[:remaining], src[:remaining]) // RDX → bytes 8-15
}
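The three size buckets reduce to a few lines. This is a hypothetical helper for illustration, not goffi's API — and it ignores the case where a small struct is all floating-point fields, which would return via XMM registers instead:

```go
package main

import "fmt"

// retStrategy describes how a returned struct travels back to the
// caller under the System V AMD64 ABI (simplified).
type retStrategy int

const (
	inRAX    retStrategy = iota // fits in one integer register
	inRAXRDX                    // low 8 bytes in RAX, high bytes in RDX
	viaSret                     // caller passes a hidden result pointer
)

func classifyStructReturn(size uintptr) retStrategy {
	switch {
	case size <= 8:
		return inRAX
	case size <= 16:
		return inRAXRDX
	default:
		return viaSret
	}
}

func main() {
	for _, sz := range []uintptr{4, 16, 24} {
		fmt.Printf("%d-byte struct -> strategy %d\n", sz, classifyStructReturn(sz))
	}
}
```

The sret case is the one that changes the call itself: before dispatching, the argument list must be shifted right by one slot so the hidden result pointer rides in the first integer register.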
Layer 4: Callbacks (C → Go)
WebGPU fires async callbacks from internal threads — Metal threads, Vulkan threads, threads goffi never created. These threads have no goroutine (G = nil), so calling Go code directly would crash.
We solve this with Go's crosscall2 mechanism:
C thread (wgpu-native internal)
│ calls our trampoline (1 of 2000 pre-compiled entries)
▼
Assembly dispatcher
│ saves registers, loads callback index
▼
crosscall2 → runtime.load_g → runtime.cgocallback
│ sets up goroutine, switches to Go stack
▼
Your Go callback function
On AMD64, each trampoline is a 5-byte CALL instruction. On ARM64, each entry is 8 bytes — two 4-byte instructions:
// ARM64 (callback_arm64.s) — 8 bytes per entry
MOVD $0, R12 // load callback index
B ·callbackDispatcher // branch (no link — preserves LR)
MOVD $1, R12
B ·callbackDispatcher
// ... 2000 entries
Usage is simple:
cb := ffi.NewCallback(func(status uint32, adapter uintptr, msg uintptr, ud uintptr) {
// This runs safely even when called from a C thread
result.adapter = adapter
close(result.done)
})
ffi.CallFunction(cif, wgpuRequestAdapter, nil, args)
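Behind NewCallback sits a fixed pool of trampoline entries. A minimal sketch of the bookkeeping — hypothetical names, goffi's internals differ — looks like this:

```go
package main

import (
	"fmt"
	"sync"
)

const maxCallbacks = 2000 // one slot per pre-compiled trampoline entry

var (
	mu    sync.Mutex
	slots [maxCallbacks]func(args []uintptr) uintptr
)

// register claims the next free trampoline slot for fn and returns its
// index; the real library would return the trampoline's code address,
// which C code can then call as a plain function pointer.
func register(fn func(args []uintptr) uintptr) (int, error) {
	mu.Lock()
	defer mu.Unlock()
	for i, s := range slots {
		if s == nil {
			slots[i] = fn
			return i, nil
		}
	}
	return -1, fmt.Errorf("all %d callback slots in use", maxCallbacks)
}

// dispatch is what the assembly dispatcher conceptually does after
// crosscall2 has attached a goroutine: look up the slot by the index
// the trampoline loaded, then invoke the Go function.
func dispatch(idx int, args []uintptr) uintptr {
	mu.Lock()
	fn := slots[idx]
	mu.Unlock()
	return fn(args)
}

func main() {
	idx, _ := register(func(args []uintptr) uintptr { return args[0] + 1 })
	fmt.Println(dispatch(idx, []uintptr{41}))
}
```

The fixed table is why each trampoline entry only needs to encode an index — all the interesting state lives on the Go side.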
goffi vs purego: Honest Comparison
Both libraries are pure Go, no CGO. But they make different trade-offs:
| Aspect | goffi | purego |
|---|---|---|
| API model | libffi-style: prepare CIF once, call many times | reflect-style: RegisterFunc, closure per call |
| Per-call cost | Zero allocations (CIF reused) | sync.Pool.Get() for syscall args |
| Callback float returns | Supported (asm writes XMM0) | panic("unsupported return type") |
| ARM64 HFA | Recursive struct walk (nested HFAs) | Top-level fields only |
| Type system | Explicit TypeDescriptor (Size/Alignment/Kind) | Go reflect.Type introspection |
| Ergonomics | Raw — you manage unsafe.Pointer yourself | High-level — auto string null-termination, bool, slice |
| Platforms | 5 (amd64 x 3 + arm64 x 2) | 9+ architectures |
| Context support | CallFunctionContext(ctx, ...) | None |
| Typed errors | 5 error types with errors.As() | Generic errors |
Choose goffi when: you need struct passing, zero per-call overhead, callback float returns, or you're building GPU/real-time bindings where every nanosecond counts.
Choose purego when: you need string auto-marshaling, broad architecture support (386, ppc64le, riscv64...), or quick one-off C library bindings with minimal boilerplate.
We use both in gogpu — goffi for the hot-path WebGPU calls, purego patterns as reference for platform edge cases.
The Ecosystem: Where goffi Came From and Where It Led
goffi wasn't built in isolation. It was born from a real need — and it enabled an entire ecosystem of pure Go GPU libraries.
The Origin: Proprietary Roots
goffi started as an internal tool. For over a year it lived inside a proprietary codebase — a GPU computing stack we built for our own projects. It worked well enough for us: a handful of platforms, a known set of functions, predictable usage patterns.
In 2025, we decided to open-source everything. Not just goffi, but the entire ecosystem — WebGPU bindings, the ML framework, the shader compiler, the GPU platform. We believed the Go community needed a real alternative to CGO for native library bindings.
What we didn't expect: the gap between "works for us internally" and "production-ready open source" was enormous.
Our internal version handled our specific use cases. Open source means every use case. Users on platforms we never tested. Struct layouts we never considered. Calling conventions with edge cases we'd never hit. The list of things that "just worked" internally but broke in the wild was humbling:
- ABI compliance — our AMD64 assembly didn't handle struct returns >8 bytes correctly. Internally we never returned large structs by value. Open source users did, immediately. We had to implement RAX+RDX split returns and sret hidden pointers.
- ARM64 — we had AMD64 only. Open source meant Apple Silicon support was day one priority, not a nice-to-have.
- Callbacks from C threads — internally we controlled which threads called back into Go. In the wild, wgpu-native fires callbacks from Metal and Vulkan threads we never created. We had to integrate crosscall2 for proper C→Go transitions.
- Error handling — our internal code used generic errors. Open source users needed errors.As() with typed errors to build robust applications. We added 5 error types.
- Testing — our internal coverage was ~40%. Getting to 89% meant writing hundreds of test cases for edge cases we'd never encountered ourselves.
- Documentation — internally, we knew how the code worked. For open source, every assembly file needed comments explaining the ABI, every public function needed godoc, every platform quirk needed documentation.
We essentially rebuilt goffi from scratch while keeping the core idea intact. The architecture is the same — CIF pre-computation, assembly dispatch, zero allocations — but the implementation is production-grade now, not prototype-grade.
go-webgpu/webgpu
It started with go-webgpu/webgpu — our zero-CGO WebGPU bindings for Go. We wanted to call wgpu-native (Rust-based Vulkan/Metal/DX12 backend) from Go without requiring a C compiler. Every existing approach had a deal-breaker:
- CGO: requires gcc, breaks go get, no cross-compilation
- purego: at the time, no struct passing, no callback float returns, no HFA support — things WebGPU needs
So we built goffi as the FFI layer for go-webgpu/webgpu. The bindings wrap 180+ wgpu-native functions — device creation, buffer allocation, render passes, compute dispatches, async adapter requests.
born-ml: Machine Learning on GPU
The second consumer was born-ml/born — a production-ready ML framework for Go with a PyTorch-like API. born needs GPU compute for tensor operations: matrix multiplication, convolution, automatic differentiation. The WebGPU compute pipeline powered by goffi gives born GPU acceleration while shipping as a single static binary.
born (ML framework)
└─ go-webgpu/webgpu (WebGPU bindings)
└─ goffi (FFI layer)
└─ wgpu-native (Vulkan/Metal/DX12)
This stack lets you go get github.com/born-ml/born, write a neural network, and run it on GPU — no Python, no CUDA, no C compiler.
GoGPU: The Full Ecosystem
As the projects matured, we realized we could go further. GoGPU grew into a complete GPU computing ecosystem with dual backends — a high-performance Rust backend (wgpu-native via goffi) and a pure Go backend:
| Package | What it does | Uses goffi |
|---|---|---|
| gogpu/gogpu | GPU framework — windowing, input, event loop, dual backends (Rust wgpu-native or Pure Go) | Yes |
| gogpu/wgpu | WebGPU implementation in pure Go — calls Vulkan, Metal, EGL/GLES natively via goffi | Yes |
| gogpu/naga | Shader compiler in pure Go — WGSL to SPIR-V, MSL, GLSL, HLSL | No |
| gogpu/gg | 2D graphics library — SDF rendering, MSDF text, Vello compute pipeline | Indirect |
| gogpu/gpucontext | Shared interfaces for GPU context, windowing, and surface creation | No |
Both gogpu/gogpu and gogpu/wgpu depend directly on goffi. The "pure Go" backend (gogpu/wgpu) is pure Go in the sense of zero CGO — no C compiler needed — but it still calls native Vulkan, Metal, and EGL APIs through goffi. That's the whole point: goffi replaces CGO, not the native graphics drivers.
Real-World Performance
At 60 FPS, a typical frame makes ~30-50 FFI calls through goffi:
Frame budget: 16.6 ms
GPU work: ~15 ms
FFI overhead (50 calls): 50 × 100ns = 5 us = 0.03%
In profiling, the FFI overhead disappears into the noise.
Callback-Heavy Async APIs
WebGPU is heavily async. Device creation, adapter requests, buffer mapping — all callback-based:
// Request GPU adapter (async) — simplified pattern
cb := ffi.NewCallback(func(status uint32, adapter uintptr, msg uintptr, ud uintptr) {
result.status = status
result.handle = adapter
result.done <- struct{}{}
})
ffi.CallFunction(&requestAdapterCIF, wgpuInstanceRequestAdapter, nil,
[]unsafe.Pointer{
unsafe.Pointer(&instance),
unsafe.Pointer(&options),
unsafe.Pointer(&cb),
unsafe.Pointer(&userdata),
})
<-result.done // Wait for GPU driver callback
This works even when wgpu-native fires the callback from an internal Metal/Vulkan thread, thanks to our crosscall2 integration.
How to Use goffi in Your Project
Installation
go get github.com/go-webgpu/goffi
Minimal Example: Calling strlen
package main
import (
"fmt"
"runtime"
"unsafe"
"github.com/go-webgpu/goffi/ffi"
"github.com/go-webgpu/goffi/types"
)
func main() {
// 1. Load library
libName := "libc.so.6"
if runtime.GOOS == "windows" {
libName = "msvcrt.dll"
}
handle, _ := ffi.LoadLibrary(libName)
defer ffi.FreeLibrary(handle)
strlen, _ := ffi.GetSymbol(handle, "strlen")
// 2. Prepare call interface (once)
cif := &types.CallInterface{}
ffi.PrepareCallInterface(cif, types.DefaultCall,
types.UInt64TypeDescriptor,
[]*types.TypeDescriptor{types.PointerTypeDescriptor},
)
// 3. Call (many times, zero overhead)
str := "Hello, goffi!\x00"
strPtr := uintptr(unsafe.Pointer(unsafe.StringData(str)))
var length uint64
ffi.CallFunction(cif, strlen, unsafe.Pointer(&length),
[]unsafe.Pointer{unsafe.Pointer(&strPtr)})
fmt.Printf("strlen = %d\n", length) // 13
}
Passing Structs
// Define struct layout matching C struct
pointType := &types.TypeDescriptor{
Size: 16, // sizeof(Point)
Alignment: 8, // alignof(Point)
Kind: types.StructType,
Members: []*types.TypeDescriptor{
types.DoubleTypeDescriptor, // x
types.DoubleTypeDescriptor, // y
},
}
// Use in CIF
ffi.PrepareCallInterface(cif, types.DefaultCall,
types.DoubleTypeDescriptor, // returns double (distance)
[]*types.TypeDescriptor{pointType, pointType}, // two Point args
)
Registering Callbacks
cb := ffi.NewCallback(func(eventType uint32, data uintptr) {
fmt.Printf("Event %d received\n", eventType)
})
// Pass cb (uintptr) to C function expecting a function pointer
ffi.CallFunction(&cif, registerCallback, nil,
[]unsafe.Pointer{unsafe.Pointer(&cb)})
The Hard Lessons
Building a production FFI taught us things no documentation covers:
1. Stack alignment kills silently. A single byte of misalignment on AMD64 causes SIGSEGV — but only sometimes, depending on whether the callee uses SSE instructions. We spent days debugging crashes that only reproduced under specific GPU driver versions.
2. Windows shadow space is non-negotiable. Win64 ABI requires 32 bytes of "shadow space" on every call, even if the function takes zero arguments. Miss it and the callee corrupts your stack.
3. ARM64 HFA rules are recursive. A struct containing a struct containing 4 floats is still an HFA (Homogeneous Floating-point Aggregate) and must be passed in D0-D3. purego only checks top-level fields; we had to walk the full type tree.
4. C threads have no goroutine. When wgpu-native calls your callback from an internal Metal thread, getg() returns nil. You must go through crosscall2 → load_g → cgocallback or the runtime panics.
5. float32 encoding matters. On Windows, syscall.SyscallN passes args as uintptr. Widening float32 to float64 then stuffing into a register corrupts the bit pattern — you need math.Float32bits to preserve the exact IEEE-754 representation.
Numbers
| Metric | Value |
|---|---|
| FFI overhead | 88-114 ns/op |
| Test coverage | 89% |
| Platforms | 5 (Win/Linux/macOS x AMD64 + Linux/macOS x ARM64) |
| Assembly files | 17 files, ~900 lines of logic + 6200 lines of trampoline entries |
| Callback slots | 2000 per process |
| Dependencies | 0 (only Go stdlib) |
| CGO required | No |
What About Go 1.26 CGO Improvements?
Go 1.26 (released February 2026) reduced cgo call overhead by ~30% by removing the dedicated syscall P state. Benchmarks on Apple M1 show CgoCall is 33% faster, CgoCallWithCallback is 21% faster.
This is great news — but it doesn't change our calculus:
- CGO still requires a C compiler at build time. Our users go get and ship.
- Cross-compilation with CGO still requires cross-toolchains. GOOS=linux GOARCH=arm64 go build just works with goffi.
- Static binaries — CGO often pulls in libc. goffi produces fully static Go binaries.
- Go 1.26 also benefits goffi — our runtime.cgocall path gets the same 30% speedup, because goffi uses the same runtime machinery internally.
The gap between CGO and pure-Go FFI is narrowing from both directions. We welcome it.
What's Next
v0.5.0 is focused on usability:
- Variadic function support (printf, sprintf)
- Builder pattern API for less boilerplate
- Platform-specific struct alignment (Windows #pragma pack)
v1.0.0 targets API stability with SemVer guarantees, security audit, and published benchmarks vs CGO/purego.
The long-term goal: make GPU programming in Go as natural as it is in Rust or C++, with the ergonomics Go developers expect — go get, go build, done.
Links
goffi (FFI layer):
- github.com/go-webgpu/goffi — the library this article is about
- pkg.go.dev/github.com/go-webgpu/goffi — Go documentation
- Performance guide — benchmarks, optimization strategies
Projects built on goffi:
- go-webgpu/webgpu — zero-CGO WebGPU bindings (wgpu-native)
- born-ml/born — ML framework for Go, GPU-accelerated, PyTorch-like API
GoGPU ecosystem (pure Go GPU):
- gogpu/gogpu — GPU framework, dual backends (Rust + Pure Go)
- gogpu/wgpu — WebGPU in pure Go (Vulkan, Metal, DX12, GLES, Software)
- gogpu/naga — shader compiler in pure Go (WGSL to SPIR-V/MSL/GLSL/HLSL)
- gogpu/gg — 2D graphics library (SDF, MSDF text, Vello compute)
Acknowledgments
goffi wouldn't exist without purego.
When we first faced the CGO problem, the conventional wisdom was simple: "you can't call C from Go without a C compiler." purego proved that wrong. The ebitengine team — and specifically @AJ and @TotallyGamerJet — demonstrated that runtime.cgocall, cgo_import_dynamic, and hand-written assembly could replace CGO entirely. They showed the community that pure Go FFI was not just theoretically possible, but practical enough to ship a production game engine.
We studied purego's source code extensively. The crosscall2 callback mechanism, the fakecgo approach, the assembly trampoline pattern — purego pioneered all of these in the Go ecosystem. Without that foundation to learn from, goffi would have taken years longer to build, if we'd attempted it at all.
goffi took a different path — libffi-style CIF pre-computation instead of reflect-based dispatch, explicit type descriptors instead of Go type introspection, struct passing and callback float returns for GPU workloads — but the path only existed because purego cleared it first.
To the purego maintainers: thank you for proving it was possible. The entire pure-Go FFI ecosystem stands on your work.
goffi is MIT-licensed and open to contributions. If you're building Go bindings for C libraries and want zero-CGO with full ABI compliance — give it a try and let us know how it goes.