<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SHARVESWAR .M</title>
    <description>The latest articles on DEV Community by SHARVESWAR .M (@sharveswar_m_f522ce132b1).</description>
    <link>https://dev.to/sharveswar_m_f522ce132b1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3825283%2Fba5892cc-bfa1-480c-b0c6-abf60705a81c.jpg</url>
      <title>DEV Community: SHARVESWAR .M</title>
      <link>https://dev.to/sharveswar_m_f522ce132b1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sharveswar_m_f522ce132b1"/>
    <language>en</language>
    <item>
      <title>I built the first open-source FP8 linear solver in Python — 2-3x faster than cuBLAS</title>
      <dc:creator>SHARVESWAR .M</dc:creator>
      <pubDate>Sun, 15 Mar 2026 12:11:54 +0000</pubDate>
      <link>https://dev.to/sharveswar_m_f522ce132b1/i-built-the-first-open-source-fp8-linear-solver-in-python-2-3x-faster-than-cublas-5he4</link>
      <guid>https://dev.to/sharveswar_m_f522ce132b1/i-built-the-first-open-source-fp8-linear-solver-in-python-2-3x-faster-than-cublas-5he4</guid>
      <description>&lt;h1&gt;
  
  
  I Built the First Open-Source FP8 Linear
&lt;/h1&gt;

&lt;h1&gt;
  
  
  Solver in Python
&lt;/h1&gt;

&lt;p&gt;I'm a second-year CS student. Last week I&lt;br&gt;
published ssBlast — an open-source Python&lt;br&gt;
library that solves large linear systems&lt;br&gt;
2-3x faster than CuBLAS using FP8 precision&lt;br&gt;
on consumer NVIDIA GPUs.&lt;/p&gt;

&lt;p&gt;Here's exactly how it works and why it's fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Solving Ax = b (where A is a huge matrix)&lt;br&gt;
is one of the most common operations in&lt;br&gt;
scientific computing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weather prediction: 1,000,000 unknowns&lt;/li&gt;
&lt;li&gt;Airplane simulation: 500,000 unknowns&lt;/li&gt;
&lt;li&gt;Drug discovery: 100,000 unknowns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CPU solvers take hours. GPU solvers are&lt;br&gt;
faster, but existing tools either don't&lt;br&gt;
support FP8 or require C++ expertise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why FP8 is Faster
&lt;/h2&gt;

&lt;p&gt;Floating point numbers store digits:&lt;/p&gt;

&lt;p&gt;FP64 = 8 bytes per number (very precise)&lt;br&gt;
FP32 = 4 bytes per number&lt;br&gt;
FP16 = 2 bytes per number&lt;br&gt;
FP8  = 1 byte  per number (rough)&lt;/p&gt;

&lt;p&gt;Less bytes = less memory to read from GPU&lt;br&gt;
= faster computation.&lt;/p&gt;

&lt;p&gt;FP64: 128 MB for 4000×4000 matrix&lt;br&gt;
FP8:   16 MB for same matrix (8x less!)&lt;/p&gt;

&lt;p&gt;RTX 4050 FP8 Tensor Cores = ~330 TFLOPS&lt;br&gt;
RTX 4050 FP64 Cores       = ~20 TFLOPS&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With FP8
&lt;/h2&gt;

&lt;p&gt;FP8 can only store values from -448 to +448.&lt;/p&gt;

&lt;p&gt;Real matrix values can be 95,000 or -200,000.&lt;br&gt;
Storing them directly in FP8 = overflow = garbage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Existing solution (bad):
&lt;/h3&gt;

&lt;p&gt;Pick one global scale factor for entire matrix.&lt;br&gt;
Problem: tiles with small values lose precision.&lt;br&gt;
Tiles with large values still overflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  ssBlast solution (novel):
&lt;/h3&gt;

&lt;p&gt;Per-tile scaling. Each 32×32 tile gets its OWN&lt;br&gt;
scale factor:&lt;/p&gt;

&lt;p&gt;scale = max(abs(tile)) / 447.0&lt;br&gt;
scaled_tile = tile / scale&lt;/p&gt;

&lt;h1&gt;
  
  
  now all values fit in ±447
&lt;/h1&gt;

&lt;p&gt;After multiply:&lt;br&gt;
result = dot(scaled_A, scaled_B) * scale_A * scale_B&lt;/p&gt;

&lt;p&gt;This means:&lt;br&gt;
✅ Every tile uses full FP8 range&lt;br&gt;
✅ No global clipping&lt;br&gt;
✅ Computed in-kernel (zero CPU overhead)&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5-Layer Architecture
&lt;/h2&gt;

&lt;p&gt;ssBlast has 5 layers, each with one job:&lt;/p&gt;

&lt;p&gt;Layer 1: Detect GPU&lt;br&gt;
  cp.cuda.runtime.getDeviceProperties(0)&lt;br&gt;
  RTX 4050 → cc=8.9 → FP8 tier&lt;/p&gt;

&lt;p&gt;Layer 2: Select precision plan&lt;br&gt;
  FP8 tier → use Triton kernel&lt;br&gt;
  FP16 tier → use CuPy cuBLAS&lt;br&gt;
  FP32 tier → use CuPy cuBLAS&lt;/p&gt;

&lt;p&gt;Layer 3: Dispatch to correct path&lt;/p&gt;

&lt;p&gt;Layer 4: FP8 Triton kernel (THE NOVEL PART)&lt;br&gt;
  Per-tile scaling + tl.dot + Tensor Cores&lt;/p&gt;

&lt;p&gt;Layer 5: Iterative refinement&lt;br&gt;
  Corrects FP8 rough answer → FP64 accuracy&lt;/p&gt;

&lt;h2&gt;
  
  
  The Triton Kernel (~80 lines)
&lt;/h2&gt;

&lt;p&gt;@triton.autotune(&lt;br&gt;
  configs=[&lt;br&gt;
    triton.Config({BLOCK_M:128,BLOCK_N:128},&lt;br&gt;
                  num_warps=8),&lt;br&gt;
    triton.Config({BLOCK_M:64,BLOCK_N:64},&lt;br&gt;
                  num_warps=4),&lt;br&gt;
  ], key=["M","N","K"]&lt;br&gt;
)&lt;br&gt;
@triton.jit&lt;br&gt;
def _fp8_scaled_gemm_kernel(...):&lt;br&gt;
  # Each GPU block handles one output tile&lt;br&gt;
  acc = tl.zeros((BLOCK_M, BLOCK_N),&lt;br&gt;
                  dtype=tl.float32)&lt;/p&gt;

&lt;p&gt;for k in range(0, K, BLOCK_K):&lt;br&gt;
    # Load tiles&lt;br&gt;
    a_tile = tl.load(A_ptr + ...)&lt;br&gt;
    b_tile = tl.load(B_ptr + ...)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Per-tile scale (THE NOVEL PART)
a_scale = tl.max(tl.abs(a_tile)) / 447.0
b_scale = tl.max(tl.abs(b_tile)) / 447.0

# Safety: avoid divide by zero
a_scale = tl.where(a_scale==0, 1.0, a_scale)
b_scale = tl.where(b_scale==0, 1.0, b_scale)

# Scale to FP8 range
a_scaled = a_tile / a_scale
b_scaled = b_tile / b_scale

# Tensor Core multiply (auto FP8 on RTX 40xx)
product = tl.dot(a_scaled.to(tl.float16),
                 b_scaled.to(tl.float16),
                 out_dtype=tl.float32)

# Unscale
acc += product * a_scale * b_scale
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;# Store FP64&lt;br&gt;
  tl.store(C_ptr + ..., acc.to(tl.float64))&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;Tested on RTX 4050 Laptop, CUDA 12.6, WSL2:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Matrix&lt;/th&gt;
&lt;th&gt;SciPy CPU&lt;/th&gt;
&lt;th&gt;CuPy FP64&lt;/th&gt;
&lt;th&gt;ssBlast&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1000×1000&lt;/td&gt;
&lt;td&gt;0.025s&lt;/td&gt;
&lt;td&gt;0.026s&lt;/td&gt;
&lt;td&gt;0.020s&lt;/td&gt;
&lt;td&gt;1.3x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2000×2000&lt;/td&gt;
&lt;td&gt;0.128s&lt;/td&gt;
&lt;td&gt;0.121s&lt;/td&gt;
&lt;td&gt;0.050s&lt;/td&gt;
&lt;td&gt;2.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4000×4000&lt;/td&gt;
&lt;td&gt;0.713s&lt;/td&gt;
&lt;td&gt;0.542s&lt;/td&gt;
&lt;td&gt;0.188s&lt;/td&gt;
&lt;td&gt;2.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8000×8000&lt;/td&gt;
&lt;td&gt;4.041s&lt;/td&gt;
&lt;td&gt;2.066s&lt;/td&gt;
&lt;td&gt;1.021s&lt;/td&gt;
&lt;td&gt;2.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10000×10000&lt;/td&gt;
&lt;td&gt;6.701s&lt;/td&gt;
&lt;td&gt;4.026s&lt;/td&gt;
&lt;td&gt;1.920s&lt;/td&gt;
&lt;td&gt;2.1x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All FP64-accurate (error &amp;lt; 1e-11).&lt;br&gt;
Best for n ≥ 2000.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;p&gt;pip install ssblast&lt;/p&gt;

&lt;p&gt;from ssblast import solve&lt;br&gt;
import cupy as cp&lt;/p&gt;

&lt;p&gt;A = cp.random.randn(4000, 4000)&lt;br&gt;
b = cp.random.randn(4000)&lt;/p&gt;

&lt;p&gt;x = solve(A, b)&lt;/p&gt;

&lt;h1&gt;
  
  
  FP64 accurate ✅
&lt;/h1&gt;

&lt;h1&gt;
  
  
  2.9x faster than CuPy ✅
&lt;/h1&gt;

&lt;p&gt;Works on any NVIDIA GPU:&lt;br&gt;
RTX 40xx → FP8 (fastest)&lt;br&gt;
RTX 30xx → FP16&lt;br&gt;
RTX 20xx → FP16&lt;br&gt;
GTX 10xx → FP32&lt;br&gt;
No GPU   → scipy CPU&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Triton requires Linux (WSL2 works!)&lt;/li&gt;
&lt;li&gt;Small matrices (&amp;lt;2000) = overhead &amp;gt; benefit&lt;/li&gt;
&lt;li&gt;Iterative refinement is the key to FP64 accuracy&lt;/li&gt;
&lt;li&gt;Per-tile scaling is simple but powerful&lt;/li&gt;
&lt;li&gt;Publishing to PyPI is easier than I thought&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;p&gt;GitHub: github.com/Sharveswar007/SSBLAST&lt;br&gt;
PyPI:   pypi.org/project/ssblast&lt;/p&gt;

&lt;p&gt;43/43 tests passing. MIT license.&lt;/p&gt;

&lt;p&gt;Questions welcome in comments! 🚀&lt;/p&gt;

</description>
      <category>python</category>
      <category>cuda</category>
      <category>gpu</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
