Every Go developer who has worked with C libraries knows the pain: CGO requires a C compiler, breaks cross-compilation, bloats binaries, and adds ~200ns overhead per call. For our WebGPU bindings and ML framework, calling wgpu-native through CGO was a non-starter — we needed to ship a single static binary across Windows, Linux, and macOS without requiring users to install gcc.
So we built goffi — a pure Go FFI library that calls C functions through hand-written assembly, with zero C dependencies and zero per-call allocations. It now powers an entire ecosystem: go-webgpu/webgpu bindings, born-ml/born ML framework, and the GoGPU GPU computing platform with dual Rust and pure Go backends.
This article explains the architecture, the hard problems we solved, how goffi compares to purego, and how you can use it in your own projects.
The Problem
Our stack looks like this:
Go application (gogpu)
└─ wgpu bindings (gogpu/wgpu) ← needs to call C functions
└─ goffi ← this library
└─ wgpu-native (.dll/.so/.dylib)
We need to call hundreds of WebGPU functions from Go: create GPU devices, submit command buffers, handle async callbacks from Metal/Vulkan threads. The requirements were clear:
- No C compiler at build time — users run go get and it works
- Cross-compilation — GOOS=linux GOARCH=arm64 go build must work from Windows
- Callbacks from C threads — wgpu-native fires callbacks from internal GPU threads
- Struct passing — the WebGPU API passes structs by value (descriptors, extents, colors)
- Low overhead — GPU command encoding happens at 60+ FPS
CGO fails requirements 1 and 2. purego covers 1-2 but had gaps in 3-5 when we started. So we built goffi.
Architecture: 4 Layers Deep
Every goffi call traverses four layers to safely transition from Go's managed runtime to raw C code:
Go Code
│ ffi.CallFunction()
▼
runtime.cgocall ← Go runtime: switch to system stack, tell GC
│
▼
Assembly Wrapper ← Our code: load registers per ABI
│ RDI=arg0 RSI=arg1 ... XMM0=float0 ...
▼
C Function ← Target library
Layer 1: The Call Interface (CIF)
Unlike purego, which dispatches every call through reflection, goffi pre-computes everything once:
// Prepare once at init time
cif := &types.CallInterface{}
ffi.PrepareCallInterface(cif, types.DefaultCall,
types.UInt64TypeDescriptor, // return: size_t
[]*types.TypeDescriptor{types.PointerTypeDescriptor}, // arg: const char*
)
// Call many times — zero reflection, zero allocation
// args = []unsafe.Pointer{unsafe.Pointer(&myPtr)} — pointers TO arg values
ffi.CallFunction(cif, strlenPtr, unsafe.Pointer(&result), args)
PrepareCallInterface classifies each argument (integer register? SSE register? stack?), computes stack layout, and stores everything in a reusable CallInterface struct. The cif.Flags bitmask tells the assembly exactly what to do — no decisions at call time.
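To make that classification step concrete, here is a simplified sketch of System V argument classification for scalar arguments. This is illustrative, not goffi's actual code — the real rules are recursive for structs and cover more type classes:

```go
package main

import "fmt"

// Simplified System V AMD64 argument classes.
type argClass int

const (
	classInteger argClass = iota // passed in RDI/RSI/RDX/RCX/R8/R9
	classSSE                     // passed in XMM0-XMM7
	classMemory                  // spilled to the stack
)

type kind int

const (
	kindInt kind = iota
	kindPointer
	kindFloat
	kindDouble
)

// classify assigns one scalar argument to a register class, consuming
// a register counter; when registers run out, the argument goes to
// memory. Real classification also walks struct fields recursively.
func classify(k kind, gpUsed, sseUsed *int) argClass {
	switch k {
	case kindFloat, kindDouble:
		if *sseUsed < 8 {
			*sseUsed++
			return classSSE
		}
	default: // integers and pointers
		if *gpUsed < 6 {
			*gpUsed++
			return classInteger
		}
	}
	return classMemory
}

func main() {
	gp, sse := 0, 0
	for i, k := range []kind{kindPointer, kindInt, kindDouble, kindInt} {
		fmt.Printf("arg %d -> class %d\n", i, classify(k, &gp, &sse))
	}
}
```

Running the classifier once per signature at prepare time is exactly what lets the call path itself make zero decisions.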
Layer 2: Platform Assembly
We write assembly by hand for each ABI. Here's the core of our System V AMD64 implementation (syscall_unix_amd64.s). The function receives a pointer to a syscallArgs struct in DI, loads registers from it, calls the target, and writes return values back:
// syscall_unix_amd64.s — System V AMD64 ABI
TEXT syscallN(SB), NOSPLIT|NOFRAME, $0
PUSHQ BP
MOVQ SP, BP
SUBQ $STACK_SIZE, SP
MOVQ DI, R11 // R11 = args struct pointer
// Load 8 SSE registers from struct offsets 128-184
MOVQ 128(R11), X0 // XMM0
MOVQ 136(R11), X1 // XMM1
// ... XMM2-XMM7
// Push stack-spill args (a7-a15) from struct offsets 56-120
MOVQ 56(R11), R12
MOVQ R12, 0(SP) // stack slot 0
// Load 6 GP registers from struct offsets 8-48
MOVQ 8(R11), DI // RDI = arg 1
MOVQ 16(R11), SI // RSI = arg 2
MOVQ 24(R11), DX // RDX = arg 3
MOVQ 32(R11), CX // RCX = arg 4
MOVQ 40(R11), R8 // R8 = arg 5
MOVQ 48(R11), R9 // R9 = arg 6
MOVQ 0(R11), R10 // function pointer
CALL R10
// Save returns: RAX → r1, RDX → r2, XMM0 → f1
MOVQ PTR_ADDRESS(BP), DI
MOVQ AX, 192(DI) // integer return
MOVQ DX, 200(DI) // second return (9-16B structs)
MOVQ X0, 128(DI) // float return
// ... restore stack, RET
We maintain separate assembly for three ABIs:
| ABI | GP Registers | SSE Registers | Stack |
|---|---|---|---|
| System V AMD64 (Linux/macOS) | RDI, RSI, RDX, RCX, R8, R9 | XMM0-XMM7 | 16-byte aligned |
| Win64 (Windows) | RCX, RDX, R8, R9 | XMM0-XMM3 | 32-byte shadow space |
| AAPCS64 (ARM64) | X0-X7 | D0-D7 | 16-byte aligned |
Layer 3: Struct Returns
This is where it gets interesting. When a C function returns a struct, the ABI rules depend on size:
- <= 8 bytes: returned in RAX
- 9-16 bytes: split across RAX (low 8) + RDX (high 8)
- > 16 bytes: caller passes a hidden pointer as the first argument (sret)
Our handleReturn function assembles the result:
case types.StructType:
size := cif.ReturnType.Size
switch {
case size <= 8:
*(*uint64)(rvalue) = retVal
case size <= 16:
*(*uint64)(rvalue) = retVal // RAX → bytes 0-7
remaining := size - 8
src := (*[8]byte)(unsafe.Pointer(&retVal2))
dst := (*[8]byte)(unsafe.Add(rvalue, 8))
copy(dst[:remaining], src[:remaining]) // RDX → bytes 8-15
}
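The three size buckets reduce to a few lines. This is a hypothetical helper for illustration, not goffi's API — and it ignores the case where a small struct is all floating-point fields, which would return via XMM registers instead:

```go
package main

import "fmt"

// retStrategy describes how a returned struct travels back to the
// caller under the System V AMD64 ABI (simplified).
type retStrategy int

const (
	inRAX    retStrategy = iota // fits in one integer register
	inRAXRDX                    // low 8 bytes in RAX, high bytes in RDX
	viaSret                     // caller passes a hidden result pointer
)

func classifyStructReturn(size uintptr) retStrategy {
	switch {
	case size <= 8:
		return inRAX
	case size <= 16:
		return inRAXRDX
	default:
		return viaSret
	}
}

func main() {
	for _, sz := range []uintptr{4, 16, 24} {
		fmt.Printf("%d-byte struct -> strategy %d\n", sz, classifyStructReturn(sz))
	}
}
```

The sret case is the one that changes the call itself: before dispatching, the argument list must be shifted right by one slot so the hidden result pointer rides in the first integer register.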
Layer 4: Callbacks (C → Go)
WebGPU fires async callbacks from internal threads — Metal threads, Vulkan threads, threads goffi never created. These threads have no goroutine (G = nil), so calling Go code directly would crash.
We solve this with Go's crosscall2 mechanism:
C thread (wgpu-native internal)
│ calls our trampoline (1 of 2000 pre-compiled entries)
▼
Assembly dispatcher
│ saves registers, loads callback index
▼
crosscall2 → runtime.load_g → runtime.cgocallback
│ sets up goroutine, switches to Go stack
▼
Your Go callback function
On AMD64, each trampoline is a 5-byte CALL instruction. On ARM64, each entry is 8 bytes — two 4-byte instructions:
// ARM64 (callback_arm64.s) — 8 bytes per entry
MOVD $0, R12 // load callback index
B ·callbackDispatcher // branch (no link — preserves LR)
MOVD $1, R12
B ·callbackDispatcher
// ... 2000 entries
Usage is simple:
cb := ffi.NewCallback(func(status uint32, adapter uintptr, msg uintptr, ud uintptr) {
// This runs safely even when called from a C thread
result.adapter = adapter
close(result.done)
})
ffi.CallFunction(cif, wgpuRequestAdapter, nil, args)
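Behind NewCallback sits a fixed pool of trampoline entries. A minimal sketch of the bookkeeping — hypothetical names, goffi's internals differ — looks like this:

```go
package main

import (
	"fmt"
	"sync"
)

const maxCallbacks = 2000 // one slot per pre-compiled trampoline entry

var (
	mu    sync.Mutex
	slots [maxCallbacks]func(args []uintptr) uintptr
)

// register claims the next free trampoline slot for fn and returns its
// index; the real library would return the trampoline's code address,
// which C code can then call as a plain function pointer.
func register(fn func(args []uintptr) uintptr) (int, error) {
	mu.Lock()
	defer mu.Unlock()
	for i, s := range slots {
		if s == nil {
			slots[i] = fn
			return i, nil
		}
	}
	return -1, fmt.Errorf("all %d callback slots in use", maxCallbacks)
}

// dispatch is what the assembly dispatcher conceptually does after
// crosscall2 has attached a goroutine: look up the slot by the index
// the trampoline loaded, then invoke the Go function.
func dispatch(idx int, args []uintptr) uintptr {
	mu.Lock()
	fn := slots[idx]
	mu.Unlock()
	return fn(args)
}

func main() {
	idx, _ := register(func(args []uintptr) uintptr { return args[0] + 1 })
	fmt.Println(dispatch(idx, []uintptr{41}))
}
```

The fixed table is why each trampoline entry only needs to encode an index — all the interesting state lives on the Go side.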
goffi vs purego: Honest Comparison
Both libraries are pure Go, no CGO. But they make different trade-offs:
| Aspect | goffi | purego |
|---|---|---|
| API model | libffi-style: prepare CIF once, call many times | reflect-style: RegisterFunc, closure per call |
| Per-call cost | Zero allocations (CIF reused) | sync.Pool.Get() for syscall args |
| Callback float returns | Supported (asm writes XMM0) | panic("unsupported return type") |
| ARM64 HFA | Recursive struct walk (nested HFAs) | Top-level fields only |
| Type system | Explicit TypeDescriptor (Size/Alignment/Kind) | Go reflect.Type introspection |
| Ergonomics | Raw — you manage unsafe.Pointer yourself | High-level — auto string null-termination, bool, slice |
| Platforms | 5 (amd64 x 3 + arm64 x 2) | 9+ architectures |
| Context support | CallFunctionContext(ctx, ...) | None |
| Typed errors | 5 error types with errors.As() | Generic errors |
Choose goffi when: you need struct passing, zero per-call overhead, callback float returns, or you're building GPU/real-time bindings where every nanosecond counts.
Choose purego when: you need string auto-marshaling, broad architecture support (386, ppc64le, riscv64...), or quick one-off C library bindings with minimal boilerplate.
We use both in gogpu — goffi for the hot-path WebGPU calls, purego patterns as reference for platform edge cases.
The Ecosystem: Where goffi Came From and Where It Led
goffi wasn't built in isolation. It was born from a real need — and it enabled an entire ecosystem of pure Go GPU libraries.
The Origin: Proprietary Roots
goffi started as an internal tool. For over a year it lived inside a proprietary codebase — a GPU computing stack we built for our own projects. It worked well enough for us: a handful of platforms, a known set of functions, predictable usage patterns.
In 2025, we decided to open-source everything. Not just goffi, but the entire ecosystem — WebGPU bindings, the ML framework, the shader compiler, the GPU platform. We believed the Go community needed a real alternative to CGO for native library bindings.
What we didn't expect: the gap between "works for us internally" and "production-ready open source" was enormous.
Our internal version handled our specific use cases. Open source means every use case. Users on platforms we never tested. Struct layouts we never considered. Calling conventions with edge cases we'd never hit. The list of things that "just worked" internally but broke in the wild was humbling:
- ABI compliance — our AMD64 assembly didn't handle struct returns >8 bytes correctly. Internally we never returned large structs by value. Open source users did, immediately. We had to implement RAX+RDX split returns and sret hidden pointers.
- ARM64 — we had AMD64 only. Open source meant Apple Silicon support was day one priority, not a nice-to-have.
- Callbacks from C threads — internally we controlled which threads called back into Go. In the wild, wgpu-native fires callbacks from Metal and Vulkan threads we never created. We had to integrate crosscall2 for proper C→Go transitions.
- Error handling — our internal code used generic errors. Open source users needed errors.As() with typed errors to build robust applications. We added 5 error types.
- Testing — our internal coverage was ~40%. Getting to 89% meant writing hundreds of test cases for edge cases we'd never encountered ourselves.
- Documentation — internally, we knew how the code worked. For open source, every assembly file needed comments explaining the ABI, every public function needed godoc, every platform quirk needed documentation.
We essentially rebuilt goffi from scratch while keeping the core idea intact. The architecture is the same — CIF pre-computation, assembly dispatch, zero allocations — but the implementation is production-grade now, not prototype-grade.
go-webgpu/webgpu
It started with go-webgpu/webgpu — our zero-CGO WebGPU bindings for Go. We wanted to call wgpu-native (Rust-based Vulkan/Metal/DX12 backend) from Go without requiring a C compiler. Every existing approach had a deal-breaker:
- CGO: requires gcc, breaks go get, no cross-compilation
- purego: at the time, no struct passing, no callback float returns, no HFA support — things WebGPU needs
So we built goffi as the FFI layer for go-webgpu/webgpu. The bindings wrap 180+ wgpu-native functions — device creation, buffer allocation, render passes, compute dispatches, async adapter requests.
born-ml: Machine Learning on GPU
The second consumer was born-ml/born — a production-ready ML framework for Go with a PyTorch-like API. born needs GPU compute for tensor operations: matrix multiplication, convolution, automatic differentiation. The WebGPU compute pipeline powered by goffi gives born GPU acceleration while shipping as a single static binary.
born (ML framework)
└─ go-webgpu/webgpu (WebGPU bindings)
└─ goffi (FFI layer)
└─ wgpu-native (Vulkan/Metal/DX12)
This stack lets you go get github.com/born-ml/born, write a neural network, and run it on GPU — no Python, no CUDA, no C compiler.
GoGPU: The Full Ecosystem
As the projects matured, we realized we could go further. GoGPU grew into a complete GPU computing ecosystem with dual backends — a high-performance Rust backend (wgpu-native via goffi) and a pure Go backend:
| Package | What it does | Uses goffi |
|---|---|---|
| gogpu/gogpu | GPU framework — windowing, input, event loop, dual backends (Rust wgpu-native or Pure Go) | Yes |
| gogpu/wgpu | WebGPU implementation in pure Go — calls Vulkan, Metal, EGL/GLES natively via goffi | Yes |
| gogpu/naga | Shader compiler in pure Go — WGSL to SPIR-V, MSL, GLSL, HLSL | No |
| gogpu/gg | 2D graphics library — SDF rendering, MSDF text, Vello compute pipeline | Indirect |
| gogpu/gpucontext | Shared interfaces for GPU context, windowing, and surface creation | No |
Both gogpu/gogpu and gogpu/wgpu depend directly on goffi. The "pure Go" backend (gogpu/wgpu) is pure Go in the sense of zero CGO — no C compiler needed — but it still calls native Vulkan, Metal, and EGL APIs through goffi. That's the whole point: goffi replaces CGO, not the native graphics drivers.
Real-World Performance
At 60 FPS, a typical frame makes ~30-50 FFI calls through goffi:
Frame budget: 16.6 ms
GPU work: ~15 ms
FFI overhead (50 calls): 50 × 100ns = 5 us = 0.03%
In profiling, the FFI overhead disappears into the noise.
Callback-Heavy Async APIs
WebGPU is heavily async. Device creation, adapter requests, buffer mapping — all callback-based:
// Request GPU adapter (async) — simplified pattern
cb := ffi.NewCallback(func(status uint32, adapter uintptr, msg uintptr, ud uintptr) {
result.status = status
result.handle = adapter
result.done <- struct{}{}
})
ffi.CallFunction(&requestAdapterCIF, wgpuInstanceRequestAdapter, nil,
[]unsafe.Pointer{
unsafe.Pointer(&instance),
unsafe.Pointer(&options),
unsafe.Pointer(&cb),
unsafe.Pointer(&userdata),
})
<-result.done // Wait for GPU driver callback
This works even when wgpu-native fires the callback from an internal Metal/Vulkan thread, thanks to our crosscall2 integration.
How to Use goffi in Your Project
Installation
go get github.com/go-webgpu/goffi
Minimal Example: Calling strlen
package main
import (
"fmt"
"runtime"
"unsafe"
"github.com/go-webgpu/goffi/ffi"
"github.com/go-webgpu/goffi/types"
)
func main() {
// 1. Load library
libName := "libc.so.6"
if runtime.GOOS == "windows" {
libName = "msvcrt.dll"
}
handle, _ := ffi.LoadLibrary(libName)
defer ffi.FreeLibrary(handle)
strlen, _ := ffi.GetSymbol(handle, "strlen")
// 2. Prepare call interface (once)
cif := &types.CallInterface{}
ffi.PrepareCallInterface(cif, types.DefaultCall,
types.UInt64TypeDescriptor,
[]*types.TypeDescriptor{types.PointerTypeDescriptor},
)
// 3. Call (many times, zero overhead)
str := "Hello, goffi!\x00"
strPtr := uintptr(unsafe.Pointer(unsafe.StringData(str)))
var length uint64
ffi.CallFunction(cif, strlen, unsafe.Pointer(&length),
[]unsafe.Pointer{unsafe.Pointer(&strPtr)})
fmt.Printf("strlen = %d\n", length) // 13
}
Passing Structs
// Define struct layout matching C struct
pointType := &types.TypeDescriptor{
Size: 16, // sizeof(Point)
Alignment: 8, // alignof(Point)
Kind: types.StructType,
Members: []*types.TypeDescriptor{
types.DoubleTypeDescriptor, // x
types.DoubleTypeDescriptor, // y
},
}
// Use in CIF
ffi.PrepareCallInterface(cif, types.DefaultCall,
types.DoubleTypeDescriptor, // returns double (distance)
[]*types.TypeDescriptor{pointType, pointType}, // two Point args
)
Registering Callbacks
cb := ffi.NewCallback(func(eventType uint32, data uintptr) {
fmt.Printf("Event %d received\n", eventType)
})
// Pass cb (uintptr) to C function expecting a function pointer
ffi.CallFunction(&cif, registerCallback, nil,
[]unsafe.Pointer{unsafe.Pointer(&cb)})
The Hard Lessons
Building a production FFI taught us things no documentation covers:
1. Stack alignment kills silently. A single byte of misalignment on AMD64 causes SIGSEGV — but only sometimes, depending on whether the callee uses SSE instructions. We spent days debugging crashes that only reproduced under specific GPU driver versions.
2. Windows shadow space is non-negotiable. Win64 ABI requires 32 bytes of "shadow space" on every call, even if the function takes zero arguments. Miss it and the callee corrupts your stack.
3. ARM64 HFA rules are recursive. A struct containing a struct containing 4 floats is still an HFA (Homogeneous Floating-point Aggregate) and must be passed in D0-D3. purego only checks top-level fields; we had to walk the full type tree.
4. C threads have no goroutine. When wgpu-native calls your callback from an internal Metal thread, getg() returns nil. You must go through crosscall2 → load_g → cgocallback or the runtime panics.
5. float32 encoding matters. On Windows, syscall.SyscallN passes args as uintptr. Widening float32 to float64 then stuffing into a register corrupts the bit pattern — you need math.Float32bits to preserve the exact IEEE-754 representation.
Numbers
| Metric | Value |
|---|---|
| FFI overhead | 88-114 ns/op |
| Test coverage | 89% |
| Platforms | 5 (Win/Linux/macOS x AMD64 + Linux/macOS x ARM64) |
| Assembly files | 17 files, ~900 lines of logic + 6200 lines of trampoline entries |
| Callback slots | 2000 per process |
| Dependencies | 0 (only Go stdlib) |
| CGO required | No |
What About Go 1.26 CGO Improvements?
Go 1.26 (released February 2026) reduced cgo call overhead by ~30% by removing the dedicated syscall P state. Benchmarks on Apple M1 show CgoCall is 33% faster, CgoCallWithCallback is 21% faster.
This is great news — but it doesn't change our calculus:
- CGO still requires a C compiler at build time. Our users go get and ship.
- Cross-compilation with CGO still requires cross-toolchains. GOOS=linux GOARCH=arm64 go build just works with goffi.
- Static binaries — CGO often pulls in libc. goffi produces fully static Go binaries.
- Go 1.26 also benefits goffi — our runtime.cgocall path gets the same 30% speedup, because goffi uses the same runtime machinery internally.
The gap between CGO and pure-Go FFI is narrowing from both directions. We welcome it.
What's Next
v0.5.0 is focused on usability:
- Variadic function support (printf, sprintf)
- Builder pattern API for less boilerplate
- Platform-specific struct alignment (Windows #pragma pack)
v1.0.0 targets API stability with SemVer guarantees, security audit, and published benchmarks vs CGO/purego.
The long-term goal: make GPU programming in Go as natural as it is in Rust or C++, with the ergonomics Go developers expect — go get, go build, done.
Links
goffi (FFI layer):
- github.com/go-webgpu/goffi — the library this article is about
- pkg.go.dev/github.com/go-webgpu/goffi — Go documentation
- Performance guide — benchmarks, optimization strategies
Projects built on goffi:
- go-webgpu/webgpu — zero-CGO WebGPU bindings (wgpu-native)
- born-ml/born — ML framework for Go, GPU-accelerated, PyTorch-like API
GoGPU ecosystem (pure Go GPU):
- gogpu/gogpu — GPU framework, dual backends (Rust + Pure Go)
- gogpu/wgpu — WebGPU in pure Go (Vulkan, Metal, DX12, GLES, Software)
- gogpu/naga — shader compiler in pure Go (WGSL to SPIR-V/MSL/GLSL/HLSL)
- gogpu/gg — 2D graphics library (SDF, MSDF text, Vello compute)
Acknowledgments
goffi wouldn't exist without purego.
When we first faced the CGO problem, the conventional wisdom was simple: "you can't call C from Go without a C compiler." purego proved that wrong. The ebitengine team — and specifically @AJ and @TotallyGamerJet — demonstrated that runtime.cgocall, cgo_import_dynamic, and hand-written assembly could replace CGO entirely. They showed the community that pure Go FFI was not just theoretically possible, but practical enough to ship a production game engine.
We studied purego's source code extensively. The crosscall2 callback mechanism, the fakecgo approach, the assembly trampoline pattern — purego pioneered all of these in the Go ecosystem. Without that foundation to learn from, goffi would have taken years longer to build, if we'd attempted it at all.
goffi took a different path — libffi-style CIF pre-computation instead of reflect-based dispatch, explicit type descriptors instead of Go type introspection, struct passing and callback float returns for GPU workloads — but the path only existed because purego cleared it first.
To the purego maintainers: thank you for proving it was possible. The entire pure-Go FFI ecosystem stands on your work.
goffi is MIT-licensed and open to contributions. If you're building Go bindings for C libraries and want zero-CGO with full ABI compliance — give it a try and let us know how it goes.