⚠️ Experimental hack: Use on non-critical systems. Ensure you have backups. This patches a proprietary binary at the instruction level — no warranty, no support.
The Story: Defying Obsolescence
Kepler GPUs (2012–2014) are e-waste by NVIDIA's timeline, but they are perfectly capable hardware for inference workloads. The GTX 770 has 1536 CUDA cores and 2 GB of GDDR5, enough for small quantized LLMs in the 1.5B to 3B parameter range. This project shows that with a five-byte fix and some kernel backports, these GPUs can be kept useful on modern Linux systems, reducing e-waste and teaching real systems engineering along the way.
Goal
Keep an NVIDIA GeForce GTX 770 (GK104, sm_30) — a Kepler GPU abandoned by NVIDIA's driver stack after driver 470.256.02 and CUDA 10.2 — running CUDA workloads on a modern Linux kernel (6.15 → 7.x, Ubuntu 26.04).
Two problems made stock software a dead end:
- Kernel module won't compile: the 470.256.02 driver source doesn't build against kernels ≥6.15 due to dozens of removed/renamed APIs.
- cuInit returns error 802: even after the module loads and nvidia-smi works, every CUDA program fails with CUDA_ERROR_SYSTEM_NOT_READY ("system not yet initialized").
Technical Deep-Dive
1. Kernel Module Patching
The proprietary 470.256.02 driver source does not build against kernels ≥6.15 due to removed/renamed APIs. I used community-sourced patch sets (primarily from Fedora/Debian packaging by Joan Bruguera Mico and Andreas Beckmann) to resolve issues like:
- screen_info → sysfb_primary_display.screen
- del_timer_sync → timer_delete_sync
- follow_pfn → unsafe_follow_pfn
- dma_fence_signal now returns void
- GCC 14 efi_enabled cast and UBSAN mismatches
After these backports, nvidia-smi reports the GTX 770 correctly. But cuInit still fails.
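Before applying a patch set, it can help to see where the old API names still appear in the driver source. The following is a minimal sketch, not part of the original workflow: the function name find_old_api_uses and the default source path are hypothetical, and the identifier list only covers the renames named above.

```python
#!/usr/bin/env python3
"""Sketch: list driver source lines that still use renamed or removed kernel APIs."""
import re
import sys
from pathlib import Path

# Old identifiers from the rename list above; extend as needed.
OLD_APIS = ("screen_info", "del_timer_sync", "follow_pfn")


def find_old_api_uses(root, names=OLD_APIS):
    """Return (path, line_number, name) for each whole-word match in .c/.h files."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in {".c", ".h"}:
            continue
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in pattern.finditer(line):
                hits.append((str(path), lineno, match.group(1)))
    return hits


if __name__ == "__main__":
    # Hypothetical default; pass the real unpacked source directory as argv[1].
    root = sys.argv[1] if len(sys.argv) > 1 else "/usr/src/nvidia-470.256.02"
    for path, lineno, name in find_old_api_uses(root):
        print(f"{path}:{lineno}: {name}")
```

Because the match is on whole words, a line that already calls unsafe_follow_pfn is not reported as a use of follow_pfn.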
2. Resolving the cuInit Error 802
All rm_ioctl kernel calls return NV_OK — the kernel module is fine. The failure lives in userspace. With gdb, I traced cuInit calling rm_ioctl(0x2a) twice; both calls succeed at the kernel level, yet the library still returns 802.
Disassembly of the RM response handler in libcuda.so.470.256.02:
3436a0: mov 0xc(%rsp),%eax ; load status from RM response
3436a4: cmp $0x2,%eax ; status == 2?
3436a7: je 3436f0 ; → return 802
3436a9: jbe 3436e0 ; status <= 1?
3436e0: cmp $0x1,%eax
3436e3: jne 3436c5 ; status != 1 → return 999
3436e5: xor %eax,%eax ; cuInit: 0 (CUDA_SUCCESS)
...
3436f0: add $0x18,%rsp
3436f4: mov $0x322,%eax ; return 802
3436f9: pop; ret
Root cause: The Resource Manager firmware on Kepler returns internal status code 2 (NV_ERR_BUFFER_TOO_SMALL) for the second initialization rm_ioctl. The library interprets RM status 1 and 4 as successful init and eventually returns 0 (CUDA_SUCCESS) from cuInit. Status 2 is treated as fatal, so cuInit returns 802 to the caller. This is likely a buffer-size negotiation mismatch between the GTX 770's VBIOS firmware and the final 470.x userspace library. NVIDIA presumably never fixed it because Kepler was already on legacy support.
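The branch logic is easier to follow as a small model. This is a sketch reconstructed from the disassembly excerpt and the description above, not NVIDIA's code: the function name cuinit_result and the constants are made up for illustration, status 4 is taken from the prose only, and the handling of any other status above 2 is an assumption because that part of the listing is elided.

```python
RESULT_OK = 0      # what cuInit returns on the success path
RESULT_802 = 802   # 0x322, the value loaded at 3436f4
RESULT_999 = 999   # the fallback error path at 3436c5


def cuinit_result(rm_status, patched=False):
    """Model of how the RM response status maps to cuInit's return value."""
    if rm_status == 2:
        # je 3436f0: the branch the five-byte patch rewrites.
        return RESULT_OK if patched else RESULT_802
    if rm_status in (1, 4):
        # Status 1 is visible at 3436e0..3436e5; status 4 comes from the prose.
        return RESULT_OK
    # Status 0 reaches 3436c5 in the excerpt; other values are assumed to as well.
    return RESULT_999


if __name__ == "__main__":
    for status in (0, 1, 2, 4):
        print(status, cuinit_result(status), cuinit_result(status, patched=True))
```

In this model the patched flag changes the outcome only for status 2, which matches the intent of touching nothing but the 802 return.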
The fix: At offset 0x3436f4, when RM returns status 2, skip the error path. Instead of mov $0x322, %eax (return 802 to the caller), use xor %eax, %eax (return 0 — same as the successful init path). The patch does not change what the RM returns; it bypasses a false-positive error branch:
| | Bytes | Instruction |
|---|---|---|
| Before | b8 22 03 00 00 | mov $0x322, %eax |
| After | 31 c0 90 90 90 | xor %eax, %eax; nop; nop; nop |
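The byte values follow directly from the x86 encodings: 0xb8 is the opcode for loading a 32-bit immediate into %eax, and the immediate 802 (0x322) is stored little-endian. A short sketch that rebuilds both sequences and confirms they have the same length, so no surrounding instruction moves (the helper names are illustrative only):

```python
import struct

MOV_EAX_IMM32 = 0xB8               # opcode: mov $imm32, %eax
XOR_EAX_EAX = bytes([0x31, 0xC0])  # xor %eax, %eax
NOP = 0x90


def encode_mov_eax(value):
    """Encode mov $value, %eax as opcode plus little-endian 32-bit immediate."""
    return bytes([MOV_EAX_IMM32]) + struct.pack("<I", value)


def encode_zero_eax(length):
    """Encode xor %eax, %eax padded with NOPs to exactly `length` bytes."""
    if length < len(XOR_EAX_EAX):
        raise ValueError("not enough room for xor %eax, %eax")
    return XOR_EAX_EAX + bytes([NOP]) * (length - len(XOR_EAX_EAX))


before = encode_mov_eax(802)
after = encode_zero_eax(len(before))
print(before.hex(" "))  # b8 22 03 00 00
print(after.hex(" "))   # 31 c0 90 90 90
```

The NOP padding matters: the replacement must occupy the same five bytes so that execution continues at the instruction that originally followed the mov.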
Subsequent rm_ioctl calls succeed — only this specific init ioctl is broken. Patch script:
#!/usr/bin/env python3
"""Patch libcuda.so so the RM status-2 branch returns 0 instead of 802."""
import os
import shutil
import sys

libpath = "/usr/lib/x86_64-linux-gnu/libcuda.so.470.256.02"
backup_path = libpath + ".bak"
offset = 0x3436f4
expected = bytes([0xb8, 0x22, 0x03, 0x00, 0x00])  # mov $0x322, %eax
patched = bytes([0x31, 0xc0, 0x90, 0x90, 0x90])   # xor %eax, %eax; nop; nop; nop

with open(libpath, "rb") as f:
    data = bytearray(f.read())

actual = bytes(data[offset:offset + 5])
if actual == patched:
    print("Already patched!")
    sys.exit(0)
if actual != expected:
    # Different build or wrong offset: refuse to touch the file.
    print(f"UNEXPECTED at 0x{offset:x}: {actual.hex()}")
    sys.exit(1)

# Keep a pristine copy before the first write.
if not os.path.exists(backup_path):
    shutil.copy2(libpath, backup_path)

data[offset:offset + 5] = patched
with open(libpath, "wb") as f:
    f.write(data)
print(f"Patched: {actual.hex()} -> {patched.hex()}")
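A quick way to check the result without compiling anything is to call cuInit through ctypes. This is a minimal sketch: the helper name try_cuinit is made up, and it assumes the loader resolves libcuda.so.1 to the patched 470.256.02 library. cuInit takes a single unsigned flags argument, which must be 0.

```python
import ctypes


def try_cuinit(libname="libcuda.so.1"):
    """Return cuInit(0)'s integer result, or None if the library cannot be loaded."""
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return lib.cuInit(0)


if __name__ == "__main__":
    result = try_cuinit()
    print("libcuda not found" if result is None else f"cuInit={result}")
```

A result of 0 means the init path now succeeds; 802 means the unpatched branch is still being hit, for example because a different copy of the library is being loaded.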
3. Toolchain & Compilation
sm_30 support was dropped in CUDA 11, so we need CUDA 10.2's ptxas. But nvcc rejects GCC 15 (Ubuntu 26.04 default). clang++ bridges legacy CUDA 10.2 headers and modern system libraries.
llama.cpp uses cg::this_grid() (CUDA 11+). Patched softmax.cu for CUDA 10.2:
// Before (CUDA >= 11.0):
const cg::grid_group g = cg::this_grid();
// After (CUDA < 11.0):
const cg::thread_block g = cg::this_thread_block();
Build flags:
cmake .. -DLLAMA_CUDA=ON \
-DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
-DCUDAToolkit_ROOT=/usr/local/cuda-10.2 \
-DCMAKE_CUDA_COMPILER=clang++ \
-DCMAKE_CUDA_ARCHITECTURES=30 \
-DGGML_CUDA_GRAPHS=OFF
-DGGML_CUDA_GRAPHS=OFF is critical — CUDA graph capture requires sm_35+ and crashes on sm_30.
Performance Benchmarks
Hardware: GTX 770 (2 GB VRAM), Ubuntu 26.04, kernel 7.0.0-27, llama.cpp c16c35b81.
Qwen 2.5 1.5B — fully offloaded (ngl=99)
| Quant | Test | t/s |
|---|---|---|
| Q4_K_M | pp64 | 69.50±0.95 |
| Q4_K_M | tg512 | 25.84±0.20 |
Qwen 2.5 1.5B — CPU only (ngl=0)
| Quant | Test | t/s |
|---|---|---|
| Q4_K_M | pp64 | 39.03±1.09 |
GPU offload gives ~1.8× speedup on prompt processing for this model.
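The speedup figure is the ratio of the two pp64 results. As a rough sketch, the run-to-run spread can be carried through the division as well, assuming the two ± values are independent standard deviations (llama-bench does not state this here, so treat the error bar as indicative only):

```python
import math


def ratio_with_error(a, a_err, b, b_err):
    """Return a/b and its uncertainty, combining relative errors in quadrature."""
    ratio = a / b
    rel_err = math.sqrt((a_err / a) ** 2 + (b_err / b) ** 2)
    return ratio, ratio * rel_err


# pp64 throughput in tokens/s: GPU (ngl=99) vs CPU (ngl=0), from the tables above.
speedup, err = ratio_with_error(69.50, 0.95, 39.03, 1.09)
print(f"pp64 speedup: {speedup:.2f} +/- {err:.2f}")
```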
Qwen 2.5 3B — fully offloaded (ngl=99)
| Quant | Test | t/s |
|---|---|---|
| Q3_K_M | pp64 | 36.18±0.33 |
| Q3_K_M | tg256 | 10.11±0.11 |
Qwen 2.5 3B at Q4_K_M (1.95 GiB) leaves essentially no room in 2 GB of VRAM for the KV cache and compute buffers, so Q3_K_M (1.60 GiB) is required for full offloading.
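The arithmetic behind that choice, as a small sketch. It uses the 1998 MiB that llama-bench reports for this card and the rounded model sizes quoted above, so the results are approximate; the helper name headroom_mib is illustrative.

```python
MIB_PER_GIB = 1024
VRAM_MIB = 1998  # usable VRAM as reported by llama-bench --list-devices


def headroom_mib(model_gib, vram_mib=VRAM_MIB):
    """VRAM left over after loading the model weights, in MiB."""
    return vram_mib - model_gib * MIB_PER_GIB


for name, size_gib in (("Q4_K_M", 1.95), ("Q3_K_M", 1.60)):
    print(f"{name}: about {headroom_mib(size_gib):.0f} MiB left for KV cache and buffers")
```

With Q4_K_M the weights alone consume nearly all usable memory, while Q3_K_M leaves a few hundred MiB for everything else.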
It Works
$ nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 770 (UUID: GPU-3a93c548-...)
$ /tmp/test_cuinit
cuInit=0
$ llama-bench --list-devices
CUDA0: NVIDIA GeForce GTX 770 (1998 MiB, ...)
Full working stack: kernel module → patched libcuda.so → CUDA 10.2 runtime → llama.cpp CUDA backend — all on Linux 7.x with a 2013 Kepler GPU.
Surviving Kernel Upgrades (DKMS)
Register the patched driver with DKMS so module rebuilds happen automatically:
sudo apt install dkms
sudo dkms add nvidia/470.256.02
sudo dkms build nvidia/470.256.02 -k $(uname -r)
sudo dkms install nvidia/470.256.02 -k $(uname -r)
Full Technical Write-up
For the complete debugging log, kernel patch table, patch scripts, and build instructions, see the GitHub Gist.