<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: skyne</title>
    <description>The latest articles on DEV Community by skyne (@skyne).</description>
    <link>https://dev.to/skyne</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4005340%2F866f3e02-f99e-4d85-a6e6-467b3fb8b7ef.png</url>
      <title>DEV Community: skyne</title>
      <link>https://dev.to/skyne</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/skyne"/>
    <language>en</language>
    <item>
      <title>Resurrecting Kepler: Getting Modern LLMs Running on a GTX 770 (Kernel 7.x)</title>
      <dc:creator>skyne</dc:creator>
      <pubDate>Sat, 27 Jun 2026 13:29:09 +0000</pubDate>
      <link>https://dev.to/skyne/resurrecting-kepler-getting-modern-llms-running-on-a-gtx-770-kernel-7x-4na</link>
      <guid>https://dev.to/skyne/resurrecting-kepler-getting-modern-llms-running-on-a-gtx-770-kernel-7x-4na</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Experimental hack&lt;/strong&gt;: Use on non-critical systems. Ensure you have backups. This patches a proprietary binary at the instruction level — no warranty, no support.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Story: Defying Obsolescence
&lt;/h2&gt;

&lt;p&gt;Kepler GPUs (2012–2014) are e-waste by NVIDIA's timeline, but they remain capable hardware for small inference workloads. The GTX 770 has 1536 CUDA cores and 2 GB of GDDR5, enough for quantized LLMs in the 1.5B–3B range. This project shows that with a &lt;strong&gt;five-byte fix&lt;/strong&gt; and some kernel backports, these GPUs can stay useful on modern Linux systems, reducing e-waste and teaching real systems engineering along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Goal
&lt;/h2&gt;

&lt;p&gt;Keep an &lt;strong&gt;NVIDIA GeForce GTX 770 (GK104, sm_30)&lt;/strong&gt; — a Kepler GPU abandoned by NVIDIA's driver stack after driver 470.256.02 and CUDA 10.2 — running CUDA workloads on a modern Linux kernel (6.15 → 7.x, Ubuntu 26.04).&lt;/p&gt;

&lt;p&gt;Two problems made stock software a dead end:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Kernel module won't compile&lt;/strong&gt; — the 470.256.02 driver source doesn't build against kernels ≥6.15 due to dozens of removed/renamed APIs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cuInit&lt;/code&gt; returns error 802&lt;/strong&gt;: even after the module loads and &lt;code&gt;nvidia-smi&lt;/code&gt; works, every CUDA program fails with &lt;code&gt;CUDA_ERROR_SYSTEM_NOT_READY&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Technical Deep-Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Kernel Module Patching
&lt;/h3&gt;

&lt;p&gt;To get the proprietary 470.256.02 kernel module building on kernels ≥6.15, I used community-sourced patch sets (primarily from &lt;a href="https://src.fedoraproject.org/rpms/nvidia-kmod" rel="noopener noreferrer"&gt;Fedora/Debian packaging&lt;/a&gt; by Joan Bruguera Mico and Andreas Beckmann). They cover changes such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;screen_info&lt;/code&gt; → &lt;code&gt;sysfb_primary_display.screen&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;del_timer_sync&lt;/code&gt; → &lt;code&gt;timer_delete_sync&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;follow_pfn&lt;/code&gt; → &lt;code&gt;unsafe_follow_pfn&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dma_fence_signal&lt;/code&gt; now returns void&lt;/li&gt;
&lt;li&gt;GCC 14 &lt;code&gt;efi_enabled&lt;/code&gt; cast and UBSAN mismatches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After these backports, &lt;code&gt;nvidia-smi&lt;/code&gt; reports the GTX 770 correctly. But &lt;code&gt;cuInit&lt;/code&gt; still fails.&lt;/p&gt;
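&lt;p&gt;Before applying a patch set, it can help to see where the old identifiers still appear in the driver source. The following is a minimal sketch and not part of the original workflow: &lt;code&gt;find_legacy_api_uses&lt;/code&gt; is a hypothetical helper, the directory passed to it is an assumption, and it only covers the three pure renames listed above.&lt;/p&gt;

```python
import os
import re

# Old kernel identifiers from the list above (pure renames only).
LEGACY_NAMES = ("screen_info", "del_timer_sync", "follow_pfn")


def find_legacy_api_uses(root, names=LEGACY_NAMES):
    """Return sorted (path, line_number, name) hits for whole-word matches."""
    patterns = {name: re.compile(r"\b" + re.escape(name) + r"\b") for name in names}
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for filename in files:
            if not filename.endswith((".c", ".h")):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    for name, pattern in patterns.items():
                        if pattern.search(line):
                            hits.append((path, lineno, name))
    return sorted(hits)
```

&lt;p&gt;Because the match is whole-word, an already renamed call such as &lt;code&gt;unsafe_follow_pfn&lt;/code&gt; is not reported as a use of &lt;code&gt;follow_pfn&lt;/code&gt;. An empty result suggests these three renames are already applied.&lt;/p&gt;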

&lt;h3&gt;
  
  
  2. Resolving the &lt;code&gt;cuInit&lt;/code&gt; Error 802
&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;rm_ioctl&lt;/code&gt; call returns &lt;code&gt;NV_OK&lt;/code&gt; at the kernel level, so the kernel module is fine and the failure lives in userspace. With &lt;code&gt;gdb&lt;/code&gt;, I traced &lt;code&gt;cuInit&lt;/code&gt; calling &lt;code&gt;rm_ioctl(0x2a)&lt;/code&gt; twice; both ioctls succeed, yet the library still returns 802.&lt;/p&gt;

&lt;p&gt;Disassembly of the RM response handler in &lt;code&gt;libcuda.so.470.256.02&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3436a0: mov   0xc(%rsp),%eax      ; load status from RM response
3436a4: cmp   $0x2,%eax           ; status == 2?
3436a7: je    3436f0              ; → return 802
3436a9: jbe   3436e0              ; status &amp;lt;= 1?
3436e0: cmp   $0x1,%eax
3436e3: jne   3436c5              ; status != 1 → return 999
3436e5: xor   %eax,%eax           ; cuInit: 0 (CUDA_SUCCESS)
...
3436f0: add   $0x18,%rsp
3436f4: mov   $0x322,%eax         ; return 802
3436f9: pop; ret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; The Resource Manager firmware on Kepler returns internal status code &lt;code&gt;2&lt;/code&gt; (&lt;code&gt;NV_ERR_BUFFER_TOO_SMALL&lt;/code&gt;) in its response to the second initialization &lt;code&gt;rm_ioctl&lt;/code&gt;. The library interprets RM status &lt;code&gt;1&lt;/code&gt; and &lt;code&gt;4&lt;/code&gt; as successful init and eventually returns &lt;code&gt;0&lt;/code&gt; (&lt;code&gt;CUDA_SUCCESS&lt;/code&gt;) from &lt;code&gt;cuInit&lt;/code&gt;, but it treats status &lt;code&gt;2&lt;/code&gt; as fatal and returns &lt;code&gt;802&lt;/code&gt; to the caller. This is likely a buffer-size negotiation mismatch between the GTX 770's VBIOS firmware and the final 470.x userspace library, presumably never fixed because Kepler was already on legacy support.&lt;/p&gt;
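&lt;p&gt;The branch structure in the listing can be restated as a small Python model. This is only a sketch of the instructions shown above: the function name is hypothetical, and any status above &lt;code&gt;2&lt;/code&gt; takes a path that the listing elides, so the model returns &lt;code&gt;None&lt;/code&gt; there rather than guessing.&lt;/p&gt;

```python
RETURN_SUCCESS = 0
RETURN_802 = 0x322  # the immediate loaded at 3436f4
RETURN_999 = 999


def cuinit_result_for_rm_status(status):
    """Mirror only the branches visible in the disassembly listing."""
    if status == 2:           # cmp $0x2,%eax ; je 3436f0
        return RETURN_802
    if status in (0, 1):      # jbe 3436e0 (unsigned status below 2)
        if status != 1:       # cmp $0x1,%eax ; jne 3436c5
            return RETURN_999
        return RETURN_SUCCESS  # xor %eax,%eax
    return None               # status above 2: path not shown in the listing


for status in (0, 1, 2, 4):
    print(status, cuinit_result_for_rm_status(status))
```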

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; At offset &lt;code&gt;0x3436f4&lt;/code&gt;, when RM returns status &lt;code&gt;2&lt;/code&gt;, skip the error path. Instead of &lt;code&gt;mov $0x322, %eax&lt;/code&gt; (return 802 to the caller), use &lt;code&gt;xor %eax, %eax&lt;/code&gt; (return 0 — same as the successful init path). The patch does not change what the RM returns; it bypasses a false-positive error branch:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Bytes&lt;/th&gt;
&lt;th&gt;Instruction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Before&lt;/td&gt;
&lt;td&gt;&lt;code&gt;b8 22 03 00 00&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;mov $0x322, %eax&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After&lt;/td&gt;
&lt;td&gt;&lt;code&gt;31 c0 90 90 90&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;xor %eax, %eax; nop; nop; nop&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
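&lt;p&gt;The byte values in the table can be sanity-checked without touching the library. A minimal sketch; the names &lt;code&gt;OLD&lt;/code&gt; and &lt;code&gt;NEW&lt;/code&gt; are only illustrative:&lt;/p&gt;

```python
OLD = bytes.fromhex("b822030000")  # mov $0x322, %eax
NEW = bytes.fromhex("31c0909090")  # xor %eax, %eax; nop; nop; nop

# b8 is "mov imm32, %eax"; the 32-bit immediate follows in little-endian order.
assert OLD == b"\xb8" + (802).to_bytes(4, "little")
assert 0x322 == 802

# The replacement must be exactly as long as the original so that no
# following instruction shifts position.
assert len(NEW) == len(OLD) == 5

print(OLD.hex(" "), "->", NEW.hex(" "))
```

&lt;p&gt;The three &lt;code&gt;nop&lt;/code&gt; bytes exist only to pad the two-byte &lt;code&gt;xor&lt;/code&gt; out to the five bytes of the original &lt;code&gt;mov&lt;/code&gt;.&lt;/p&gt;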

&lt;p&gt;Subsequent &lt;code&gt;rm_ioctl&lt;/code&gt; calls succeed — only this specific init ioctl is broken. Patch script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;libpath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/usr/lib/x86_64-linux-gnu/libcuda.so.470.256.02&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;backup_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;libpath&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.bak&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backup_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;libpath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;backup_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;libpath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bytearray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0x3436f4&lt;/span&gt;
&lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mh"&gt;0xb8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x03&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mh"&gt;0x31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xc0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mh"&gt;0x90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x90&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Patched: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mh"&gt;0x31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xc0&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Already patched!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UNEXPECTED at 0x&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;libpath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
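&lt;p&gt;The script above trusts a hard-coded offset. As an extra check, the patch site can also be located by searching for the surrounding instruction bytes. This is a sketch under assumptions: &lt;code&gt;find_patch_sites&lt;/code&gt; is a hypothetical helper, and the signature assumes &lt;code&gt;add $0x18,%rsp&lt;/code&gt; is encoded as &lt;code&gt;48 83 c4 18&lt;/code&gt;, which matches the four-byte gap between &lt;code&gt;3436f0&lt;/code&gt; and &lt;code&gt;3436f4&lt;/code&gt; in the listing but was not taken from the binary itself.&lt;/p&gt;

```python
# add $0x18,%rsp (48 83 c4 18) followed by mov $0x322,%eax (b8 22 03 00 00)
SIGNATURE = bytes.fromhex("4883c418" "b822030000")
MOV_OFFSET_IN_SIGNATURE = 4


def find_patch_sites(data, signature=SIGNATURE):
    """Return the offset of every mov instruction that follows the add."""
    sites = []
    start = 0
    while True:
        index = data.find(signature, start)
        if index == -1:
            return sites
        sites.append(index + MOV_OFFSET_IN_SIGNATURE)
        start = index + 1
```

&lt;p&gt;If the search over the library's bytes returns exactly one offset and it equals &lt;code&gt;0x3436f4&lt;/code&gt;, the hard-coded value agrees with the file. Zero or several matches suggest a different build or an ambiguous site, in which case patching is best skipped.&lt;/p&gt;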



&lt;h3&gt;
  
  
  3. Toolchain &amp;amp; Compilation
&lt;/h3&gt;

&lt;p&gt;sm_30 support was dropped in CUDA 11, so the build needs CUDA 10.2's &lt;code&gt;ptxas&lt;/code&gt;. However, CUDA 10.2's &lt;code&gt;nvcc&lt;/code&gt; rejects GCC 15 (the Ubuntu 26.04 default) as a host compiler. Using &lt;strong&gt;clang++&lt;/strong&gt; as the CUDA compiler instead bridges the legacy CUDA 10.2 headers and modern system libraries.&lt;/p&gt;

&lt;p&gt;llama.cpp uses &lt;code&gt;cg::this_grid()&lt;/code&gt; (CUDA 11+), so I patched &lt;code&gt;softmax.cu&lt;/code&gt; for CUDA 10.2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before (CUDA &amp;gt;= 11.0):&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;cg&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;grid_group&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cg&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;this_grid&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// After (CUDA &amp;lt; 11.00):&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;cg&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;thread_block&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cg&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;this_thread_block&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cmake .. &lt;span class="nt"&gt;-DLLAMA_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCMAKE_C_COMPILER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;clang &lt;span class="nt"&gt;-DCMAKE_CXX_COMPILER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;clang++ &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCUDAToolkit_ROOT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cuda-10.2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCMAKE_CUDA_COMPILER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;clang++ &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCMAKE_CUDA_ARCHITECTURES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;30 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DGGML_CUDA_GRAPHS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;-DGGML_CUDA_GRAPHS=OFF&lt;/code&gt; is critical — CUDA graph capture requires sm_35+ and crashes on sm_30.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance Benchmarks
&lt;/h2&gt;

&lt;p&gt;Hardware: &lt;strong&gt;GTX 770 (2 GB VRAM)&lt;/strong&gt;, &lt;strong&gt;Ubuntu 26.04&lt;/strong&gt;, &lt;strong&gt;kernel 7.0.0-27&lt;/strong&gt;, &lt;strong&gt;llama.cpp c16c35b81&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Qwen 2.5 1.5B — fully offloaded (ngl=99)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;t/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;pp64&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;69.50±0.95&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;tg512&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;25.84±0.20&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Qwen 2.5 1.5B — CPU only (ngl=0)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;t/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;pp64&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;39.03±1.09&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GPU offload gives a ~1.8× speedup on prompt processing for this model (69.50 vs. 39.03 t/s at pp64).&lt;/p&gt;

&lt;h3&gt;
  
  
  Qwen 2.5 3B — fully offloaded (ngl=99)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;t/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q3_K_M&lt;/td&gt;
&lt;td&gt;pp64&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;36.18±0.33&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q3_K_M&lt;/td&gt;
&lt;td&gt;tg256&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10.11±0.11&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Qwen 2.5 3B at Q4_K_M (1.95 GiB) fills essentially all of the card's 2 GB of VRAM before any KV cache or compute buffers are allocated, so Q3_K_M (1.60 GiB) is required for full offloading.&lt;/p&gt;
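&lt;p&gt;The model sizes quoted above can be compared against the 1998 MiB that &lt;code&gt;llama-bench --list-devices&lt;/code&gt; reports for this card. A rough sketch: &lt;code&gt;headroom_mib&lt;/code&gt; is a hypothetical helper, and it counts model weights only, ignoring the KV cache and compute buffers that also need VRAM.&lt;/p&gt;

```python
VRAM_MIB = 1998  # total reported by llama-bench --list-devices


def headroom_mib(model_gib, vram_mib=VRAM_MIB):
    """MiB left over after the model weights alone are placed in VRAM."""
    return vram_mib - model_gib * 1024


for quant, size_gib in (("Q4_K_M", 1.95), ("Q3_K_M", 1.60)):
    print(f"{quant}: {headroom_mib(size_gib):.1f} MiB left")
```

&lt;p&gt;Q4_K_M leaves only about 1 MiB spare, while Q3_K_M leaves roughly 360 MiB for everything else.&lt;/p&gt;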




&lt;h2&gt;
  
  
  It Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;nvidia-smi &lt;span class="nt"&gt;-L&lt;/span&gt;
&lt;span class="go"&gt;GPU 0: NVIDIA GeForce GTX 770 (UUID: GPU-3a93c548-...)

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/tmp/test_cuinit
&lt;span class="go"&gt;cuInit=0

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;llama-bench &lt;span class="nt"&gt;--list-devices&lt;/span&gt;
&lt;span class="go"&gt;CUDA0: NVIDIA GeForce GTX 770 (1998 MiB, ...)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
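&lt;p&gt;The source of &lt;code&gt;/tmp/test_cuinit&lt;/code&gt; is not shown here. A rough Python equivalent could look like the sketch below, assuming the patched library is reachable under the usual &lt;code&gt;libcuda.so.1&lt;/code&gt; soname; &lt;code&gt;cu_init_status&lt;/code&gt; and its &lt;code&gt;loader&lt;/code&gt; parameter are hypothetical names added for illustration and testing.&lt;/p&gt;

```python
import ctypes


def cu_init_status(lib_name="libcuda.so.1", loader=ctypes.CDLL):
    """Return cuInit(0)'s result code, or None if the library cannot be loaded."""
    try:
        lib = loader(lib_name)
    except OSError:
        return None
    # cuInit takes a flags argument that must be 0 and returns a CUresult code.
    return int(lib.cuInit(0))


if __name__ == "__main__":
    print(f"cuInit={cu_init_status()}")
```

&lt;p&gt;On an unpatched system with this driver, the same call would be expected to report 802 instead of 0.&lt;/p&gt;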



&lt;p&gt;Full working stack: kernel module → patched &lt;code&gt;libcuda.so&lt;/code&gt; → CUDA 10.2 runtime → llama.cpp CUDA backend — all on Linux 7.x with a 2013 Kepler GPU.&lt;/p&gt;




&lt;h2&gt;
  
  
  Surviving Kernel Upgrades (DKMS)
&lt;/h2&gt;

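&lt;p&gt;Register the patched driver with DKMS so the module is rebuilt automatically on kernel upgrades. &lt;code&gt;dkms add&lt;/code&gt; expects the patched source tree, including its &lt;code&gt;dkms.conf&lt;/code&gt;, under &lt;code&gt;/usr/src/nvidia-470.256.02&lt;/code&gt;:&lt;br&gt;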
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;dkms
&lt;span class="nb"&gt;sudo &lt;/span&gt;dkms add nvidia/470.256.02
&lt;span class="nb"&gt;sudo &lt;/span&gt;dkms build nvidia/470.256.02 &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;dkms &lt;span class="nb"&gt;install &lt;/span&gt;nvidia/470.256.02 &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Full Technical Write-up
&lt;/h2&gt;

&lt;p&gt;For the complete debugging log, kernel patch table, patch scripts, and build instructions, see the &lt;a href="https://gist.github.com/skyne/fa150c6e4b025903a2dc0d34d1d9065f" rel="noopener noreferrer"&gt;GitHub Gist&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cuda</category>
      <category>linux</category>
      <category>llm</category>
      <category>gpu</category>
    </item>
  </channel>
</rss>
