skyne

Posted on Jun 27

Resurrecting Kepler: Getting Modern LLMs Running on a GTX 770 (Kernel 7.x)

#cuda #linux #llm #gpu

⚠️ Experimental hack: Use on non-critical systems. Ensure you have backups. This patches a proprietary binary at the instruction level — no warranty, no support.

The Story: Defying Obsolescence

Kepler GPUs (2012–2014) are e-waste by NVIDIA's timeline, but they remain capable hardware for inference workloads. The GTX 770 has 1536 CUDA cores and 2 GB of GDDR5, enough for small quantized LLMs in the 1.5B to 3B range. This project shows that a five-byte fix and some kernel backports can keep these GPUs useful on modern Linux systems, reducing e-waste and teaching real systems engineering along the way.

Goal

Keep an NVIDIA GeForce GTX 770 (GK104, sm_30) — a Kepler GPU abandoned by NVIDIA's driver stack after driver 470.256.02 and CUDA 10.2 — running CUDA workloads on a modern Linux kernel (6.15 → 7.x, Ubuntu 26.04).

Two problems made stock software a dead end:

Kernel module won't compile — the 470.256.02 driver source doesn't build against kernels ≥6.15 due to dozens of removed/renamed APIs.
cuInit returns error 802: even after the module loads and nvidia-smi works, every CUDA program fails with CUDA_ERROR_SYSTEM_NOT_READY ("system not yet initialized").

Technical Deep-Dive

1. Kernel Module Patching

To get the proprietary 470.256.02 module building again, I applied community-sourced patch sets (primarily from Fedora/Debian packaging by Joan Bruguera Mico and Andreas Beckmann). They resolve issues such as:

screen_info → sysfb_primary_display.screen
del_timer_sync → timer_delete_sync
follow_pfn → unsafe_follow_pfn
dma_fence_signal now returns void
GCC 14 efi_enabled cast and UBSAN mismatches

After these backports, nvidia-smi reports the GTX 770 correctly. But cuInit still fails.
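Before applying patch sets, it can help to see which of the old identifiers a source tree still uses. The following is a minimal sketch, not part of the original project: the rename table is copied from the list above, while `find_removed_apis` and the sample C lines are hypothetical and only illustrate the idea.

```python
import re

# Renames taken from the list above; the scanner itself is a hypothetical helper.
RENAMED_APIS = {
    "screen_info": "sysfb_primary_display.screen",
    "del_timer_sync": "timer_delete_sync",
    "follow_pfn": "unsafe_follow_pfn",
}


def find_removed_apis(source_text):
    """Return {old_name: [line numbers]} for each old identifier still present."""
    hits = {}
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        for old in RENAMED_APIS:
            # \b keeps follow_pfn from matching inside unsafe_follow_pfn.
            if re.search(r"\b" + re.escape(old) + r"\b", line):
                hits.setdefault(old, []).append(lineno)
    return hits


# Made-up sample lines, not quoted from the driver source.
sample = "del_timer_sync(&t);\nret = unsafe_follow_pfn(vma, addr, &pfn);\n"
print(find_removed_apis(sample))
```

A plain text scan like this only flags candidates. Signature changes such as dma_fence_signal returning void still need a real compile to surface.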

2. Resolving the `cuInit` Error 802

All rm_ioctl kernel calls return NV_OK — the kernel module is fine. The failure lives in userspace. With gdb, I traced cuInit calling rm_ioctl(0x2a) twice; both calls succeed at the kernel level, yet the library still returns 802.

Disassembly of the RM response handler in libcuda.so.470.256.02:

3436a0: mov   0xc(%rsp),%eax      ; load status from RM response
3436a4: cmp   $0x2,%eax           ; status == 2?
3436a7: je    3436f0              ; → return 802
3436a9: jbe   3436e0              ; status <= 1?
3436e0: cmp   $0x1,%eax
3436e3: jne   3436c5              ; status != 1 → return 999
3436e5: xor   %eax,%eax           ; cuInit: 0 (CUDA_SUCCESS)
...
3436f0: add   $0x18,%rsp
3436f4: mov   $0x322,%eax         ; return 802
3436f9: pop; ret
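The branch structure in the listing can be restated as a small Python model. This is only a sketch of the branches visible above: the function name is hypothetical, the path for status values above 2 is elided in the listing and is therefore reported as `None`, and the constant names follow the CUDA driver API's names for 802 and 999.

```python
CUDA_SUCCESS = 0
CUDA_ERROR_SYSTEM_NOT_READY = 802  # 0x322, the value loaded at 3436f4
CUDA_ERROR_UNKNOWN = 999


def cuinit_result(rm_status):
    """Model only the branches shown in the listing; None means 'not shown'."""
    if rm_status == 2:   # cmp $0x2 / je 3436f0
        return CUDA_ERROR_SYSTEM_NOT_READY
    if rm_status <= 1:   # jbe 3436e0 (equality was already handled)
        # cmp $0x1 / jne 3436c5: only status 1 reaches the xor %eax,%eax path.
        return CUDA_SUCCESS if rm_status == 1 else CUDA_ERROR_UNKNOWN
    return None          # status > 2: handled in code elided from the listing


for status in (0, 1, 2):
    print(status, "->", cuinit_result(status))
```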

Root cause: The Resource Manager firmware on Kepler returns internal status code 2 (NV_ERR_BUFFER_TOO_SMALL) for the second initialization rm_ioctl. The library treats RM status 1 and 4 as successful init and eventually returns 0 (CUDA_SUCCESS) from cuInit, but it treats status 2 as fatal, so cuInit returns 802 to the caller. The likely cause is a buffer-size negotiation mismatch between the GTX 770's VBIOS firmware and the final 470.x userspace library, which NVIDIA presumably never fixed because Kepler was already on legacy support.

The fix: At offset 0x3436f4, when RM returns status 2, skip the error path. Instead of mov $0x322, %eax (return 802 to the caller), use xor %eax, %eax (return 0 — same as the successful init path). The patch does not change what the RM returns; it bypasses a false-positive error branch:

	Bytes	Instruction
Before	`b8 22 03 00 00`	`mov $0x322, %eax`
After	`31 c0 90 90 90`	`xor %eax, %eax; nop; nop; nop`
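As a sanity check on the table, the two byte sequences can be rebuilt from their x86-64 encodings. This sketch assumes nothing beyond the standard encodings: `mov $imm32, %eax` is opcode `b8` followed by a little-endian 32-bit immediate, `xor %eax, %eax` is `31 c0`, and `90` is a one-byte `nop`.

```python
import struct

ERROR_802 = 0x322  # 802 decimal

# mov $imm32, %eax: opcode 0xb8, then the immediate in little-endian order.
before = struct.pack("<BI", 0xB8, ERROR_802)

# xor %eax, %eax is two bytes; three nops pad it to the original five.
after = bytes([0x31, 0xC0]) + b"\x90" * 3

print(before.hex(" "), "->", after.hex(" "))
```

Keeping the replacement the same length matters because the next instruction at 0x3436f9 must stay where it is.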

Subsequent rm_ioctl calls succeed — only this specific init ioctl is broken. Patch script:

#!/usr/bin/env python3
"""Patch libcuda.so.470.256.02 so the RM status-2 branch of cuInit returns 0."""
import os
import shutil
import sys

libpath = "/usr/lib/x86_64-linux-gnu/libcuda.so.470.256.02"
backup_path = libpath + ".bak"

offset = 0x3436f4
expected = bytes([0xb8, 0x22, 0x03, 0x00, 0x00])  # mov $0x322, %eax
patched = bytes([0x31, 0xc0, 0x90, 0x90, 0x90])   # xor %eax, %eax; nop; nop; nop

with open(libpath, "rb") as f:
    data = bytearray(f.read())

actual = bytes(data[offset:offset + 5])

if actual == patched:
    print("Already patched!")
    sys.exit(0)
if actual != expected:
    # Different build or layout: refuse to touch the file.
    print(f"UNEXPECTED at 0x{offset:x}: {actual.hex()}")
    sys.exit(1)

# Keep a pristine copy before the first modification.
if not os.path.exists(backup_path):
    shutil.copy2(libpath, backup_path)

data[offset:offset + 5] = patched
with open(libpath, "wb") as f:
    f.write(data)
print(f"Patched: {actual.hex()} -> {patched.hex()}")
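The script uses the address from the disassembly directly as a file offset. The two coincide only when the containing PT_LOAD segment has the same file offset and virtual address, which the expected-bytes check confirms in effect for this build. For other binaries, a sketch like the following could translate one into the other. The helper name and the synthetic one-segment image are assumptions for illustration, and only little-endian ELF64 is handled.

```python
import struct

PT_LOAD = 1


def vaddr_to_file_offset(elf_bytes, vaddr):
    """Map a virtual address to a file offset via the ELF64 PT_LOAD segments."""
    if elf_bytes[:4] != b"\x7fELF" or elf_bytes[4] != 2 or elf_bytes[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")
    # e_phoff sits at byte 32; e_phentsize and e_phnum at bytes 54 and 56.
    (e_phoff,) = struct.unpack_from("<Q", elf_bytes, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf_bytes, 54)
    for i in range(e_phnum):
        p_type, _flags, p_offset, p_vaddr, _paddr, p_filesz = struct.unpack_from(
            "<IIQQQQ", elf_bytes, e_phoff + i * e_phentsize)
        if p_type == PT_LOAD and p_vaddr <= vaddr < p_vaddr + p_filesz:
            return p_offset + (vaddr - p_vaddr)
    return None


# Synthetic single-segment image for demonstration (not a real library).
header = bytearray(64)
header[:6] = b"\x7fELF\x02\x01"
struct.pack_into("<Q", header, 32, 64)       # program headers follow the header
struct.pack_into("<HH", header, 54, 56, 1)   # e_phentsize=56, e_phnum=1
phdr = struct.pack("<IIQQQQQQ", PT_LOAD, 5, 0x1000, 0x401000, 0x401000,
                   0x200, 0x200, 0x1000)
image = bytes(header) + phdr

print(hex(vaddr_to_file_offset(image, 0x401010)))
```

On a real library, the same function would be called with the file's bytes and the address shown by the disassembler.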

3. Toolchain & Compilation

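sm_30 support was dropped in CUDA 11, so we need CUDA 10.2's ptxas. But nvcc from CUDA 10.2 rejects GCC 15 (the Ubuntu 26.04 default) as a host compiler. Building the CUDA sources with clang++ instead avoids that check while still using the CUDA 10.2 headers and ptxas alongside modern system libraries.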

llama.cpp uses cg::this_grid() (CUDA 11+). Patched softmax.cu for CUDA 10.2:

// Before (CUDA >= 11.0):
const cg::grid_group g = cg::this_grid();

// After (CUDA < 11.0):
const cg::thread_block g = cg::this_thread_block();

Build flags:

cmake .. -DLLAMA_CUDA=ON \
  -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
  -DCUDAToolkit_ROOT=/usr/local/cuda-10.2 \
  -DCMAKE_CUDA_COMPILER=clang++ \
  -DCMAKE_CUDA_ARCHITECTURES=30 \
  -DGGML_CUDA_GRAPHS=OFF

-DGGML_CUDA_GRAPHS=OFF is critical — CUDA graph capture requires sm_35+ and crashes on sm_30.

Performance Benchmarks

Hardware: GTX 770 (2 GB VRAM), Ubuntu 26.04, kernel 7.0.0-27, llama.cpp c16c35b81.

Qwen 2.5 1.5B — fully offloaded (ngl=99)

Quant	Test	t/s
Q4_K_M	pp64	69.50±0.95
Q4_K_M	tg512	25.84±0.20

Qwen 2.5 1.5B — CPU only (ngl=0)

Quant	Test	t/s
Q4_K_M	pp64	39.03±1.09

GPU offload gives ~1.8× speedup on prompt processing for this model.
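The speedup figure follows directly from the two pp64 rows above. A short check, using only the reported means and ignoring the run-to-run spread:

```python
gpu_pp64 = 69.50  # t/s, Q4_K_M, ngl=99
cpu_pp64 = 39.03  # t/s, Q4_K_M, ngl=0

speedup = gpu_pp64 / cpu_pp64
print(f"{speedup:.2f}x")
```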

Qwen 2.5 3B — fully offloaded (ngl=99)

Quant	Test	t/s
Q3_K_M	pp64	36.18±0.33
Q3_K_M	tg256	10.11±0.11

Qwen 3B at Q4_K_M (1.95 GiB) does not fit in 2 GB of VRAM once the KV cache and compute buffers are added, so Q3_K_M (1.60 GiB) is required for full offloading.
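A rough headroom calculation makes the constraint concrete. It takes the 1998 MiB that llama-bench reports for this card in the next section and the model sizes quoted above. The sizes are rounded to two decimals, so the result is approximate, and the memory needed for the KV cache and compute buffers is not modeled here.

```python
MIB_PER_GIB = 1024
vram_mib = 1998  # as reported by llama-bench --list-devices
model_sizes_gib = {"Q4_K_M": 1.95, "Q3_K_M": 1.60}

for quant, size_gib in model_sizes_gib.items():
    weights_mib = size_gib * MIB_PER_GIB
    headroom_mib = vram_mib - weights_mib
    print(f"{quant}: {weights_mib:.1f} MiB weights, {headroom_mib:.1f} MiB left")
```

With Q4_K_M the weights alone consume nearly all reported memory, while Q3_K_M leaves a few hundred MiB for everything else.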

It Works

$ nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 770 (UUID: GPU-3a93c548-...)

$ /tmp/test_cuinit
cuInit=0

$ llama-bench --list-devices
CUDA0: NVIDIA GeForce GTX 770 (1998 MiB, ...)

Full working stack: kernel module → patched libcuda.so → CUDA 10.2 runtime → llama.cpp CUDA backend — all on Linux 7.x with a 2013 Kepler GPU.
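The source of /tmp/test_cuinit is not shown here. A comparable check can be sketched in Python with ctypes, assuming the driver library is installed under its usual soname `libcuda.so.1`. The function name and the small error-name table are illustrative and cover only the codes discussed in this post.

```python
import ctypes

# Names from the CUDA driver API for the result codes discussed in this post.
CUDA_ERROR_NAMES = {
    0: "CUDA_SUCCESS",
    802: "CUDA_ERROR_SYSTEM_NOT_READY",
    999: "CUDA_ERROR_UNKNOWN",
}


def try_cuinit(libname="libcuda.so.1"):
    """Call cuInit(0); return the CUresult, or None if the library won't load."""
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return lib.cuInit(0)


result = try_cuinit()
if result is None:
    print("libcuda.so.1 could not be loaded")
else:
    print(f"cuInit={result} ({CUDA_ERROR_NAMES.get(result, 'other')})")
```

On an unpatched library this would be expected to print 802, and 0 after the patch.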

Surviving Kernel Upgrades (DKMS)

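# dkms add expects the patched source tree, including a dkms.conf, in /usr/src/nvidia-470.256.02
sudo apt install dkms
sudo dkms add nvidia/470.256.02
sudo dkms build nvidia/470.256.02 -k "$(uname -r)"
sudo dkms install nvidia/470.256.02 -k "$(uname -r)"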

Full Technical Write-up

For the complete debugging log, kernel patch table, patch scripts, and build instructions, see the GitHub Gist.
