<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Haru-neo</title>
    <description>The latest articles on DEV Community by Haru-neo (@haru-neo).</description>
    <link>https://dev.to/haru-neo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3909821%2F7cfc949c-400a-450d-a489-767b95a00e6f.jpg</url>
      <title>DEV Community: Haru-neo</title>
      <link>https://dev.to/haru-neo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/haru-neo"/>
    <language>en</language>
    <item>
      <title>I wrote a custom CUDA inference engine to run Qwen3.5-27B on $130 mining cards</title>
      <dc:creator>Haru-neo</dc:creator>
      <pubDate>Sun, 03 May 2026 04:42:30 +0000</pubDate>
      <link>https://dev.to/haru-neo/i-wrote-a-custom-cuda-inference-engine-to-run-qwen35-27b-on-130-mining-cards-732</link>
      <guid>https://dev.to/haru-neo/i-wrote-a-custom-cuda-inference-engine-to-run-qwen35-27b-on-130-mining-cards-732</guid>
      <description>&lt;p&gt;I bought four NVIDIA CMP 100-210 cards off the secondhand market for about $130 each. They are ex-mining cards based on the&lt;br&gt;
   Volta GV100 die — same silicon as the V100 — with 16 GB of HBM2 each. On paper, four of them give me 64 GB of HBM2 for&lt;br&gt;
  the price of a single used 3090.&lt;/p&gt;

&lt;p&gt;In practice, NVIDIA had crippled them in hardware.&lt;/p&gt;

&lt;p&gt;The throttle&lt;/p&gt;

&lt;p&gt;The CMP 100-210 has its tensor cores throttled 64×. HMMA latency is stretched from 8 cycles to 512. cuBLAS WMMA caps out&lt;br&gt;
at about 5 TFLOPS per card. PCIe is locked to Gen1 x1, no P2P, no NVLink. CUPTI is blocked, so you can't even use NVIDIA's&lt;br&gt;
  own profiler.&lt;/p&gt;

&lt;p&gt;The throttle is enforced by an e-fuse + PMU bootrom double-lock on the die. This isn't a firmware switch — it's blown into&lt;br&gt;
   the silicon. There is no software unlock. (Yes, I tried.)&lt;/p&gt;

&lt;p&gt;The result: anything that goes through cuBLAS tensor cores runs at 1/64 speed or fails outright. That's vLLM, llama.cpp's&lt;br&gt;
  default cuBLAS path, FlashAttention, bitsandbytes, PyTorch's default matmul. The standard LLM inference stack is unusable&lt;br&gt;
  on this hardware.&lt;/p&gt;

&lt;p&gt;So I wrote my own.&lt;/p&gt;

&lt;p&gt;The workaround&lt;/p&gt;

&lt;p&gt;It turns out NVIDIA only throttled tensor cores. Two other paths on the same chip are full speed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DP4A (4-way packed int8 dot product): ~17 TOPS (int8), no throttle&lt;/li&gt;
&lt;li&gt;HFMA2 (2-way packed fp16 fused multiply-add): ~24 TFLOPS, no throttle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither is as fast as a healthy V100's tensor cores, but both are far above the 5 TFLOPS cuBLAS WMMA ceiling. Routing all&lt;br&gt;
  inference through these two paths gets you back to roughly half of what an unthrottled V100 would do, which is still&lt;br&gt;
  vastly better than nothing.&lt;/p&gt;
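
&lt;p&gt;For a flavour of the two paths, here is a minimal sketch, not qengine's actual kernels: __dp4a and __hfma2 are the real CUDA intrinsics, everything else (kernel names, launch shapes) is illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#include &amp;lt;cuda_fp16.h&amp;gt;

// DP4A path: each 32-bit word of a and b packs four int8 values;
// __dp4a multiplies them pairwise and accumulates into an int32 (sm_61+).
// Sketch assumes a single-block launch and *out zero-initialised.
__global__ void dot_int8(const int* a, const int* b, int* out, int n4) {
    int acc = 0;
    for (int i = threadIdx.x; i &amp;lt; n4; i += blockDim.x)
        acc = __dp4a(a[i], b[i], acc);
    atomicAdd(out, acc);
}

// HFMA2 path: one __hfma2 does two fp16 fused multiply-adds at once.
__global__ void axpy_fp16x2(__half2 a, const __half2* x, __half2* y, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &amp;lt; n2) y[i] = __hfma2(a, x[i], y[i]);
}
&lt;/code&gt;&lt;/pre&gt;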

&lt;p&gt;Building on that, qengine is a from-scratch CUDA inference engine for Qwen3.5 / Qwen3.6 hybrid models. (Worth noting:&lt;br&gt;
  Qwen3.5 / 3.6 are a different architecture from Qwen3 — they are dense GDN (Gated DeltaNet) + Attention hybrids, not pure&lt;br&gt;
  transformers. The kernels look quite different.)&lt;/p&gt;

&lt;p&gt;The engine has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A hand-written Q8_0 GEMM tile path for prefill, all DP4A&lt;/li&gt;
&lt;li&gt;A fused FlashAttention kernel (score + softmax + value online)&lt;/li&gt;
&lt;li&gt;Split-K FlashAttention for long context (more on this below)&lt;/li&gt;
&lt;li&gt;A 3-bit Walsh-Hadamard + Lloyd-Max KV cache (sketched after this list) so 27B fits 256K context on three 16 GB cards&lt;/li&gt;
&lt;li&gt;An OpenAI-compatible HTTP API with streaming, tool calls, vision, continuous batching, and per-slot prefix caching&lt;/li&gt;
&lt;/ul&gt;
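
&lt;p&gt;The sketch promised above, for the KV cache: rotate each head vector with a fast Walsh-Hadamard transform so outliers are spread across dimensions, then map each value to the nearest of 8 Lloyd-Max levels (3 bits). This is illustrative, not qengine's exact code; the per-block scale factors and the 1/√n normalisation are omitted.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#include &amp;lt;math.h&amp;gt;

// In-place fast Walsh-Hadamard transform; n must be a power of two.
// The rotation spreads outlier magnitude across the whole vector.
__host__ __device__ void fwht(float* v, int n) {
    for (int h = 1; h &amp;lt; n; h &amp;lt;&amp;lt;= 1)
        for (int i = 0; i &amp;lt; n; i += 2 * h)
            for (int j = i; j &amp;lt; i + h; ++j) {
                float x = v[j], y = v[j + h];
                v[j] = x + y;       // butterfly
                v[j + h] = x - y;
            }
}

// Nearest level in an 8-entry Lloyd-Max codebook gives a 3-bit index.
__host__ __device__ unsigned quant3(float x, const float* codebook) {
    unsigned best = 0;
    float bestd = fabsf(x - codebook[0]);
    for (unsigned k = 1; k &amp;lt; 8; ++k) {
        float d = fabsf(x - codebook[k]);
        if (d &amp;lt; bestd) { bestd = d; best = k; }
    }
    return best;
}
&lt;/code&gt;&lt;/pre&gt;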

&lt;p&gt;It's not a fork. Every kernel is written for sm_70 + CMP constraints.&lt;/p&gt;

&lt;p&gt;Honest benchmarks&lt;/p&gt;

&lt;p&gt;I'm comparing against llama.cpp build 8462 with -fa 1, the same Q8_0 GGUFs, on the same hardware. Bigger numbers are&lt;br&gt;
  better.&lt;/p&gt;

&lt;p&gt;Qwen3.5-9B, single-GPU prefill, tokens/sec (qengine vs llama.cpp):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;297-token prompt: 594 vs 199 (2.99×)&lt;/li&gt;
&lt;li&gt;1.16K: 683 vs 316 (2.16×)&lt;/li&gt;
&lt;li&gt;4.62K: 584 vs 361 (1.62×)&lt;/li&gt;
&lt;li&gt;18K: 393 vs 324 (1.22×)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;qengine leads at all four lengths, with the margin narrowing as the prompt grows, from 2.99× at 297 tokens to 1.22× at 18K.&lt;/p&gt;

&lt;p&gt;Generation: qengine wins by +48–51% on both sizes (9B: ~70 t/s vs 46.6; 27B: 26.3 vs 17.7).&lt;/p&gt;

&lt;p&gt;The honest weak point: 9B dual-GPU at 18K still trails llama.cpp (~0.48×). Their layer pipeline overlaps activation&lt;br&gt;
  transfer with compute; mine does the transfers sequentially through pinned host memory, because there is no P2P. Single-GPU 9B is&lt;br&gt;
  faster than either dual-GPU run anyway, so it's mostly a theoretical gap, but it's there.&lt;/p&gt;

&lt;p&gt;What was hard&lt;/p&gt;

&lt;p&gt;A few things that took real time to get right:&lt;/p&gt;

&lt;p&gt;Multi-GPU without P2P. With CMP cards there's no peer-to-peer, no NVLink. Hidden state has to bounce through pinned host&lt;br&gt;
  memory between GPUs. I keep a pinned-host buffer per cross-GPU edge and a worker thread per GPU. It works, it's just&lt;br&gt;
  sequential.&lt;/p&gt;
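
&lt;p&gt;In code, one hop looks roughly like this (illustrative signature; qengine's real version runs on per-GPU worker threads with one pinned buffer per cross-GPU edge):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#include &amp;lt;cuda_runtime.h&amp;gt;

// One hidden-state hop between GPUs without P2P: device to pinned host,
// then pinned host to the next device. The synchronize in the middle is
// exactly the serialisation described above; with P2P this would collapse
// to a single cudaMemcpyPeerAsync.
void hop(const void* d_src, void* d_dst, void* h_pinned, size_t bytes,
         int dev_src, int dev_dst, cudaStream_t s_src, cudaStream_t s_dst) {
    cudaSetDevice(dev_src);
    cudaMemcpyAsync(h_pinned, d_src, bytes, cudaMemcpyDeviceToHost, s_src);
    cudaStreamSynchronize(s_src);
    cudaSetDevice(dev_dst);
    cudaMemcpyAsync(d_dst, h_pinned, bytes, cudaMemcpyHostToDevice, s_dst);
    cudaStreamSynchronize(s_dst);
}
&lt;/code&gt;&lt;/pre&gt;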

&lt;p&gt;Numerical drift killing Korean output. Qwopus3.5-9B distill has weak Korean circuits to begin with — small fp16 reorder&lt;br&gt;
  noise shifts argmax decisions and the model starts producing garbled Korean. I learned this the hard way after a&lt;br&gt;
  chunked-prefill kernel optimisation that "passed" my English greedy-argmax tests broke Korean entirely. Now every kernel&lt;br&gt;
  that touches the attention reduction order gets a Korean argmax-stability check before it ships.&lt;/p&gt;
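
&lt;p&gt;The gate itself is tiny. A sketch against a hypothetical engine interface (greedy_decode is illustrative, not a real qengine symbol):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#include &amp;lt;vector&amp;gt;

// Regression gate: decode the same Korean prompt greedily with the
// reference kernels and the optimised kernels, and require
// token-for-token identity over n_steps generated tokens.
template &amp;lt;typename Engine&amp;gt;
bool argmax_stable(Engine&amp;amp; ref, Engine&amp;amp; opt,
                   const std::vector&amp;lt;int&amp;gt;&amp;amp; korean_prompt, int n_steps) {
    std::vector&amp;lt;int&amp;gt; a = ref.greedy_decode(korean_prompt, n_steps);  // hypothetical method
    std::vector&amp;lt;int&amp;gt; b = opt.greedy_decode(korean_prompt, n_steps);
    return a == b;  // a single flipped argmax fails the gate
}
&lt;/code&gt;&lt;/pre&gt;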

&lt;p&gt;Split-K FA without breaking determinism. The 64-block FA grid was under-utilising the SMs at long context (only 64 blocks&lt;br&gt;
  across 3 GPUs × 68 SMs = 204 SMs), so each block was running a 575-iteration K/V tile loop in isolation. I added a split-K variant&lt;br&gt;
  that maps each (kv_head, t_idx) pair to N independent blocks, each handling a contiguous tile range, and merges the partials&lt;br&gt;
  with the standard log-sum-exp identity:&lt;/p&gt;

&lt;p&gt;m_global = max_s m_s&lt;br&gt;
  l_global = Σ_s exp(m_s − m_global) · l_s&lt;br&gt;
  o_global = Σ_s exp(m_s − m_global) · acc_o_s&lt;/p&gt;

&lt;p&gt;The first version stored the partial o accumulators as half. That truncation caused a small drift after about 31 generated tokens&lt;br&gt;
  at 4.6K prefill: no longer bit-exact with the base FA path, and the Korean argmax flipped. Storing partials as fp32 brings drift down&lt;br&gt;
  to fp32-reordering noise (~1e-7 per add), and greedy argmax is stable across 32+ generated tokens. That's the version I&lt;br&gt;
  shipped. 18K prefill went from 270 → 393 t/s on 9B and 104 → 139 t/s on 27B.&lt;/p&gt;
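
&lt;p&gt;The final output is then o_global / l_global. A sketch of the merge with an illustrative memory layout (in the real kernel one block handles one (kv_head, t_idx) pair; partials are fp32 as shipped):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Merge S split-K partials: per-split running max m, softmax
// denominator l, and unnormalised output accumulator o.
__global__ void merge_splitk(const float* m_part,  // [S]
                             const float* l_part,  // [S]
                             const float* o_part,  // [S * D], row-major
                             float* o_out,         // [D]
                             int S, int D) {
    float m_g = m_part[0];
    for (int s = 1; s &amp;lt; S; ++s) m_g = fmaxf(m_g, m_part[s]);

    float l_g = 0.f;
    for (int s = 0; s &amp;lt; S; ++s) l_g += expf(m_part[s] - m_g) * l_part[s];

    // Each thread merges a slice of the head dimension.
    for (int d = threadIdx.x; d &amp;lt; D; d += blockDim.x) {
        float acc = 0.f;
        for (int s = 0; s &amp;lt; S; ++s)
            acc += expf(m_part[s] - m_g) * o_part[s * D + d];
        o_out[d] = acc / l_g;  // normalise once after the merge
    }
}
&lt;/code&gt;&lt;/pre&gt;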

&lt;p&gt;Speculative decoding I never got working. I have DFlash + DDTree code in the repo for the eventual fine-tuned drafter.&lt;br&gt;
  Right now the pretrained drafter (lucebox-hub/dflash) is trained on stock Qwen3.5, and the Qwopus distill output&lt;br&gt;
  distribution doesn't match — accept rate is roughly 0% and the chains degenerate. Listed in the README as broken on&lt;br&gt;
  purpose. MTP K=1 single-token spec works fine.&lt;/p&gt;
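
&lt;p&gt;For contrast, the K=1 case that does work is the simple one. A sketch against a hypothetical model interface (forward_with and the field names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#include &amp;lt;utility&amp;gt;
#include &amp;lt;vector&amp;gt;

// Greedy K=1 speculative step: one main-model forward over (ctx + draft
// token) gives the main model's greedy choice at the draft position and
// at the position after it, so a matching draft emits two tokens for a
// single main-model pass.
template &amp;lt;typename Model&amp;gt;
std::pair&amp;lt;int, int&amp;gt; verify_k1(Model&amp;amp; main_model,
                              const std::vector&amp;lt;int&amp;gt;&amp;amp; ctx,
                              int draft_tok) {
    auto lg = main_model.forward_with(ctx, draft_tok);  // hypothetical method
    if (lg.token_at_draft == draft_tok)                 // draft accepted
        return {draft_tok, lg.token_after_draft};       // two tokens this step
    return {lg.token_at_draft, -1};                     // rejected: emit main's token
}
&lt;/code&gt;&lt;/pre&gt;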

&lt;p&gt;What this is and isn't for&lt;/p&gt;

&lt;p&gt;If you have an RTX 30/40-series, A100, or H100, you should be using vLLM or SGLang. They are far more optimised for those&lt;br&gt;
  targets and have actual test coverage. qengine would be slower and weirder.&lt;/p&gt;

&lt;p&gt;If you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ex-mining cards (CMP 100-210, ex-mining V100, P104-100, etc.)&lt;/li&gt;
&lt;li&gt;Older Volta workstations (V100 16/32 GB, Titan V, Quadro GV100)&lt;/li&gt;
&lt;li&gt;A T4 or an RTX 20-series card where the standard stacks have been disappointing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;— then qengine might be useful. It targets sm_70 specifically. sm_75 should work but isn't tuned. sm_60 won't work (no&lt;br&gt;
  DP4A). AMD and Apple Silicon definitely won't work.&lt;/p&gt;

&lt;p&gt;Repo&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Haru-neo/qengine" rel="noopener noreferrer"&gt;https://github.com/Haru-neo/qengine&lt;/a&gt; — Apache 2.0.&lt;/p&gt;

&lt;p&gt;The benchmarks in this post are reproducible with the bench_curl.sh script in the repo. The 27B 3-GPU numbers were&lt;br&gt;
  measured 2026-05-03 on my machine. If you have the hardware and try it, I'd love to know what you see.&lt;/p&gt;

&lt;p&gt;Solo project. Heavy AI assist on the CUDA — I drove the architecture, profiling, and debugging across many sessions;&lt;br&gt;
  Claude did most of the kernel implementation. I'm a Korean high school student. Slow PR turnaround.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
