<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DevOnBike</title>
    <description>The latest articles on DEV Community by DevOnBike (@devonbike_1a21fc85096f434).</description>
    <link>https://dev.to/devonbike_1a21fc85096f434</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862885%2Fb424da62-1cf0-46fd-bb05-249c5b915f06.jpg</url>
      <title>DEV Community: DevOnBike</title>
      <link>https://dev.to/devonbike_1a21fc85096f434</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devonbike_1a21fc85096f434"/>
    <language>en</language>
    <item>
      <title>🚀 8x Faster Than ONNX Runtime: Zero-Allocation AI Inference in Pure C#</title>
      <dc:creator>DevOnBike</dc:creator>
      <pubDate>Sun, 05 Apr 2026 23:32:09 +0000</pubDate>
      <link>https://dev.to/devonbike_1a21fc85096f434/8x-faster-than-onnx-runtime-zero-allocation-ai-inference-in-pure-c-31i5</link>
      <guid>https://dev.to/devonbike_1a21fc85096f434/8x-faster-than-onnx-runtime-zero-allocation-ai-inference-in-pure-c-31i5</guid>
      <description>&lt;h1&gt;
  
  
  The Myth: "C# is too slow for AI"
&lt;/h1&gt;

&lt;p&gt;For years, the narrative has been the same: if you want high-performance AI, you must use C++, or Python frameworks (like PyTorch or ONNX Runtime) that wrap native kernels. The common belief is that the Garbage Collector (GC) and the overhead of the "managed" environment make C# unsuitable for ultra-low-latency inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I decided to challenge that.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By leveraging the latest features in &lt;strong&gt;.NET 10&lt;/strong&gt;, &lt;strong&gt;AVX-512&lt;/strong&gt; instructions, and strict &lt;strong&gt;Zero-Allocation&lt;/strong&gt; patterns, I built &lt;strong&gt;Overfit&lt;/strong&gt;, an inference engine that outperforms ONNX Runtime by roughly &lt;strong&gt;8x&lt;/strong&gt; on micro-inference tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 The Results: 432 Nanoseconds
&lt;/h2&gt;

&lt;p&gt;The following benchmark compares &lt;strong&gt;Overfit&lt;/strong&gt; against &lt;strong&gt;Microsoft.ML.OnnxRuntime&lt;/strong&gt;. While ONNX Runtime is a powerhouse for large models, its per-call overhead becomes a bottleneck for micro-inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Environment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU:&lt;/strong&gt; AMD Ryzen 9 9950X3D (Zen 5, AVX-512)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime:&lt;/strong&gt; .NET 10.0 (X64 RyuJIT x86-64-v4)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task:&lt;/strong&gt; Linear Layer Inference (784 -&amp;gt; 10 units)&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Mean Latency&lt;/th&gt;
&lt;th&gt;Allocated&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overfit (ZeroAlloc)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;432.0 ns&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0 B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.12&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONNX Runtime (Pre-allocated)&lt;/td&gt;
&lt;td&gt;3,571.8 ns&lt;/td&gt;
&lt;td&gt;912 B&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONNX Runtime (Full-alloc)&lt;/td&gt;
&lt;td&gt;3,581.0 ns&lt;/td&gt;
&lt;td&gt;1,128 B&lt;/td&gt;
&lt;td&gt;1.24&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In the time it takes ONNX Runtime to complete &lt;strong&gt;one&lt;/strong&gt; prediction, Overfit completes &lt;strong&gt;eight&lt;/strong&gt;. More importantly, Overfit does it with &lt;strong&gt;zero bytes allocated&lt;/strong&gt; on the heap.&lt;/p&gt;
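
&lt;p&gt;For reference, here is a minimal sketch of the kind of BenchmarkDotNet harness behind a comparison like this. It is not the exact benchmark source: the model path, input name, and the commented-out Overfit call are placeholders, and only the ONNX Runtime side uses real library APIs (&lt;code&gt;InferenceSession&lt;/code&gt;, &lt;code&gt;DenseTensor&amp;lt;float&amp;gt;&lt;/code&gt;). The &lt;code&gt;[MemoryDiagnoser]&lt;/code&gt; attribute is what produces the "Allocated" column in the table above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

[MemoryDiagnoser]
public class LinearInferenceBenchmarks
{
    private InferenceSession _session;
    private NamedOnnxValue[] _onnxInputs; // pre-allocated 1x784 input, reused on every call

    [GlobalSetup]
    public void Setup()
    {
        _session = new InferenceSession("linear_784x10.onnx"); // placeholder model path
        var input = new DenseTensor&amp;lt;float&amp;gt;(new[] { 1, 784 });
        _onnxInputs = new[] { NamedOnnxValue.CreateFromTensor("input", input) };
    }

    [Benchmark(Baseline = true)]
    public float OnnxRuntimePreallocated()
    {
        using var results = _session.Run(_onnxInputs);
        return results.First().AsTensor&amp;lt;float&amp;gt;()[0, 0];
    }

    // The Overfit side would call the Forward() method shown later in this post:
    // [Benchmark]
    // public float Overfit() =&amp;gt; _layer.Forward(null, _inputNode).Data[0];
}

public static class Program
{
    public static void Main() =&amp;gt; BenchmarkRunner.Run&amp;lt;LinearInferenceBenchmarks&amp;gt;();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;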




&lt;h2&gt;
  
  
  🛠️ Optimization #1: Persistent Buffers (The Death of GC)
&lt;/h2&gt;

&lt;p&gt;The biggest killer of tail latency (P99.9) in .NET is the Garbage Collector. Even a small allocation of ~1 KB per call adds up to frequent Gen-0 collections under heavy load. In high-frequency trading (HFT) or real-time game engines, a GC pause is a disaster.&lt;/p&gt;

&lt;p&gt;In Overfit, we use &lt;strong&gt;Persistent Inference Buffers&lt;/strong&gt;. When the model is switched to &lt;code&gt;Eval()&lt;/code&gt; mode, all necessary tensors are pre-allocated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;AutogradNode&lt;/span&gt; &lt;span class="nf"&gt;Forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ComputationGraph&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutogradNode&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// In Eval mode, we skip building the computation graph&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="n"&gt;IsTraining&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Compute directly into the pre-allocated persistent buffer&lt;/span&gt;
        &lt;span class="nf"&gt;LinearInferenceSimd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsReadOnlySpan&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;_weightsTransposed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsReadOnlySpan&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;Biases&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsReadOnlySpan&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;_inferenceOutputNode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsSpan&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_inferenceOutputNode&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Zero-allocation return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;TensorMath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Biases&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
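
&lt;p&gt;The persistent buffer itself is created once, outside the hot path. Here is a minimal sketch of that setup, reusing the field names from the snippet above; the &lt;code&gt;Eval()&lt;/code&gt; body and the &lt;code&gt;AutogradNode&lt;/code&gt;/&lt;code&gt;FastTensor&lt;/code&gt; constructors are illustrative, not the exact Overfit source.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Illustrative sketch: switch to inference mode and allocate the output node once.
// Forward() then writes into _inferenceOutputNode on every call, allocating nothing.
public void Eval()
{
    IsTraining = false;

    // One-time allocations (constructor shapes assumed for illustration)
    _inferenceOutputNode ??= new AutogradNode(new FastTensor&amp;lt;float&amp;gt;(1, _outputSize));
    RebuildTransposedWeights(); // see Optimization #3 below
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;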






&lt;h2&gt;
  
  
  ⚡ Optimization #2: SIMD &amp;amp; AVX-512
&lt;/h2&gt;

&lt;p&gt;Modern CPUs are vector machines. The Ryzen 9 9950X3D exposes 512-bit AVX-512 registers. Using .NET's &lt;code&gt;Vector&amp;lt;float&amp;gt;&lt;/code&gt;, we can process &lt;strong&gt;16 floats in a single CPU instruction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For a layer with 784 inputs, a standard scalar loop performs 784 multiply-accumulate steps per output neuron. Our SIMD kernel needs only &lt;strong&gt;49 vector iterations&lt;/strong&gt; (784 / 16 = 49), drastically reducing the CPU cycles spent on the same mathematical operation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Core SIMD loop using Vector&amp;lt;float&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;inputSize&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="n"&gt;vCount&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;vCount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;vIn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Vector&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;vW&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Vector&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;wRow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="c1"&gt;// Fused Multiply-Add equivalent in registers&lt;/span&gt;
    &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;vIn&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;vW&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Sum reduction and tail loop follow to handle remaining elements...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
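
&lt;p&gt;For completeness, here is a self-contained version of such a kernel, including the horizontal reduction and the scalar tail loop the snippet above elides. It is a stand-alone sketch rather than the exact Overfit kernel; on hardware where &lt;code&gt;Vector&amp;lt;float&amp;gt;&lt;/code&gt; maps to 256-bit registers, the same code simply runs with &lt;code&gt;vCount = 8&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Numerics;

public static class SimdKernels
{
    // Minimal linear-layer kernel: output[o] = dot(input, row o of W_T) + bias[o].
    // Assumes the weights are already transposed to [output, input] (see Optimization #3).
    public static void LinearInferenceSimd(
        ReadOnlySpan&amp;lt;float&amp;gt; input,
        ReadOnlySpan&amp;lt;float&amp;gt; weightsT, // flattened [outputSize * inputSize], row-major
        ReadOnlySpan&amp;lt;float&amp;gt; biases,
        Span&amp;lt;float&amp;gt; output)
    {
        int inputSize = input.Length;
        int vCount = Vector&amp;lt;float&amp;gt;.Count; // 16 with 512-bit vectors, 8 with AVX2

        for (int o = 0; o &amp;lt; output.Length; o++)
        {
            ReadOnlySpan&amp;lt;float&amp;gt; wRow = weightsT.Slice(o * inputSize, inputSize);
            var sum = Vector&amp;lt;float&amp;gt;.Zero;
            int i = 0;

            // Vectorized main loop: vCount multiply-adds per iteration
            for (; i &amp;lt;= inputSize - vCount; i += vCount)
            {
                var vIn = new Vector&amp;lt;float&amp;gt;(input.Slice(i));
                var vW = new Vector&amp;lt;float&amp;gt;(wRow.Slice(i));
                sum += vIn * vW;
            }

            // Horizontal reduction of the vector accumulator
            float acc = Vector.Sum(sum);

            // Scalar tail loop for the remaining (inputSize % vCount) elements
            for (; i &amp;lt; inputSize; i++)
            {
                acc += input[i] * wRow[i];
            }

            output[o] = acc + biases[o];
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;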






&lt;h2&gt;
  
  
  🧠 Optimization #3: Weight Transposition
&lt;/h2&gt;

&lt;p&gt;Memory access patterns matter just as much as CPU cycles. Weight matrices are typically stored as [Input, Output]. For a single inference (Batch = 1), this results in "strided" memory access, which kills CPU cache performance.&lt;/p&gt;

&lt;p&gt;During the &lt;code&gt;Eval()&lt;/code&gt; setup, Overfit &lt;strong&gt;pre-transposes&lt;/strong&gt; the weights to [Output, Input]. This ensures that when we calculate a neuron's output, we read its weights &lt;strong&gt;sequentially&lt;/strong&gt;. Sequential reads maximize L1/L2 cache hits and let the CPU's hardware prefetcher keep data flowing ahead of the computation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;RebuildTransposedWeights&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;_weightsTransposed&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;FastTensor&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;_outputSize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_inputSize&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Transpose: W[inputSize, outputSize] -&amp;gt; W_T[outputSize, inputSize]&lt;/span&gt;
    &lt;span class="c1"&gt;// Resulting in sequential rows for high-speed Vector.Dot operations&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
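
&lt;p&gt;The transpose itself is a plain nested copy. A sketch of the body the stub leaves out, assuming &lt;code&gt;FastTensor&amp;lt;float&amp;gt;&lt;/code&gt; exposes &lt;code&gt;AsSpan()&lt;/code&gt;/&lt;code&gt;AsReadOnlySpan()&lt;/code&gt; as in the earlier snippets, could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Illustrative transpose body: copy W[i, o] into W_T[o, i] so that each output
// neuron's weights become one contiguous row in memory.
Span&amp;lt;float&amp;gt; dst = _weightsTransposed.AsSpan();
ReadOnlySpan&amp;lt;float&amp;gt; src = Weights.Data.AsReadOnlySpan(); // [inputSize, outputSize], row-major

for (int o = 0; o &amp;lt; _outputSize; o++)
{
    for (int i = 0; i &amp;lt; _inputSize; i++)
    {
        dst[o * _inputSize + i] = src[i * _outputSize + o];
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;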






&lt;h2&gt;
  
  
  🎯 Why This Matters
&lt;/h2&gt;

&lt;p&gt;This isn't just about winning a benchmark. It’s about &lt;strong&gt;predictability&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;When you eliminate heap allocations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Jitter disappears:&lt;/strong&gt; Your P99.9 latency becomes almost identical to your P50.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero GC Pressure:&lt;/strong&gt; You can run millions of inferences per second without ever triggering a Garbage Collection cycle (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware Saturation:&lt;/strong&gt; You are finally using the hardware you paid for (AVX-512) instead of wasting cycles on marshaling data between managed and unmanaged memory.&lt;/li&gt;
&lt;/ol&gt;
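
&lt;p&gt;These points are easy to sanity-check on your own machine. As a rough sketch (where &lt;code&gt;layer&lt;/code&gt; and &lt;code&gt;input&lt;/code&gt; are placeholders for whatever model you are running), you can count GC cycles around a burst of inferences:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Rough sanity check: run a burst of inferences and confirm that no Gen-0
// collections occur. "layer" and "input" are placeholders for your own objects.
int gen0Before = GC.CollectionCount(0);

for (int n = 0; n &amp;lt; 1_000_000; n++)
{
    layer.Forward(null, input); // zero-allocation inference path
}

int gen0After = GC.CollectionCount(0);
Console.WriteLine($"Gen-0 collections during 1M inferences: {gen0After - gen0Before}"); // expected: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;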

&lt;h3&gt;
  
  
  Check out the Project
&lt;/h3&gt;

&lt;p&gt;Overfit is open-source (AGPLv3) and designed for developers who need extreme performance in .NET. Whether you are in &lt;strong&gt;FinTech&lt;/strong&gt;, &lt;strong&gt;GameDev&lt;/strong&gt;, or &lt;strong&gt;Edge AI&lt;/strong&gt;, it’s time to stop settling for "managed overhead."&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/DevOnBike/Overfit" rel="noopener noreferrer"&gt;https://github.com/DevOnBike/Overfit&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What’s your experience with micro-optimizations in .NET? Let’s discuss in the comments!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>performance</category>
      <category>ai</category>
      <category>benchmark</category>
    </item>
  </channel>
</rss>
