<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: wricheek84</title>
    <description>The latest articles on DEV Community by wricheek84 (@wricheek84).</description>
    <link>https://dev.to/wricheek84</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3851112%2F478533c9-3b46-4b13-a255-4b2db26cad3c.png</url>
      <title>DEV Community: wricheek84</title>
      <link>https://dev.to/wricheek84</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wricheek84"/>
    <language>en</language>
    <item>
      <title>Squeezing 2,240 TPS out of a 2019 Laptop: Building a C++ Inference Engine</title>
      <dc:creator>wricheek84</dc:creator>
      <pubDate>Mon, 30 Mar 2026 09:48:47 +0000</pubDate>
      <link>https://dev.to/wricheek84/squeezing-2240-tps-out-of-a-2019-laptop-building-a-c-inference-engine-33bm</link>
      <guid>https://dev.to/wricheek84/squeezing-2240-tps-out-of-a-2019-laptop-building-a-c-inference-engine-33bm</guid>
      <description>&lt;p&gt;In current times, for running AI models, the NVIDIA H100 is a top-tier choice. It has almost 16,000+ CUDA cores, a huge 80 GB High Bandwidth Memory (HBM), and a raw computing power of around 10¹⁵ operations/sec or 1000+ TFLOPS.&lt;/p&gt;

&lt;p&gt;I built my project on a 2019 HP 15 series laptop with an AMD Ryzen 5 3500U (4 cores / 8 threads) and integrated Radeon Vega 8 graphics, paired with 8 GB of DDR4 RAM, of which 5.92 GB is usable.&lt;/p&gt;

&lt;p&gt;The aim was not to try and beat industry standards, but to squeeze the very best out of limited hardware by relying on batching, threading, and system design.&lt;/p&gt;

&lt;p&gt;At their core, AI models are massive chains of linear algebra:&lt;/p&gt;

&lt;p&gt;Matrix × Vector = Result&lt;/p&gt;

&lt;p&gt;High-end GPUs use specialized cores to do this in parallel; I had to make my CPU threads do it as efficiently as possible given the constraints. In high-end systems, the GPU performs the math, ultra-fast VRAM stores the model, and thousands of operations run in parallel.&lt;/p&gt;

&lt;p&gt;I attempted to work within these constraints by batching requests, using thread pools efficiently without causing race conditions, and optimizing CPU usage.&lt;/p&gt;

&lt;p&gt;Large models need 16 GB, 32 GB, or even 80 GB of memory just to load. When RAM runs out, the OS starts paging to the SSD, which is orders of magnitude slower than RAM.&lt;/p&gt;

&lt;p&gt;So the focus was simple: keep everything in RAM and push the CPU as hard as possible.&lt;/p&gt;

&lt;p&gt;Most AI uses Python as its primary language, and even though it has its positives, it uses a Global Interpreter Lock (GIL), which limits true parallel execution for CPU-bound tasks. Combined with garbage collection pauses, this can impact performance in high-throughput systems.&lt;/p&gt;

&lt;p&gt;On my laptop, C++ allowed for maximum efficiency, letting me squeeze as much computation as possible from my 4 Ryzen cores.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Architecture diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdg6c1vue5bikr223wlu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdg6c1vue5bikr223wlu.png" alt=" " width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To make this work efficiently, I structured the system into multiple layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Communication (gRPC)&lt;/li&gt;
&lt;li&gt;Orchestration (Thread Pool + Batching)&lt;/li&gt;
&lt;li&gt;Inference (ONNX Runtime)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Communication Layer (gRPC)
&lt;/h2&gt;

&lt;p&gt;In a standard AI setup, you would more likely see a REST API sending JSON. While JSON is easy to debug, it’s "heavy": every request requires the CPU to parse text strings into usable data — cycles I simply couldn't afford to waste on a 4-core Ryzen processor.&lt;/p&gt;

&lt;p&gt;I chose gRPC because it treats a remote server method as if it were a local object. More importantly, it uses Protocol Buffers, a binary serialization format. Instead of "reading" sentences, my server receives a compact binary stream.&lt;/p&gt;

&lt;p&gt;Here is the "contract" I defined. I used &lt;code&gt;repeated int32&lt;/code&gt; so the server receives pre-processed IDs, avoiding a heavy tokenizer on the backend:&lt;/p&gt;

&lt;p&gt;Protocol Buffers&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;message InferenceRequest {
  repeated int32 tokens = 1; 
}

message InferenceResponse {
  repeated int32 output_tokens = 1;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the moment a packet hits the server, it’s ready for the math engine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Thread Pool (Orchestration Layer)
&lt;/h2&gt;

&lt;p&gt;Simple apps often use a "thread-per-request" model.&lt;/p&gt;

&lt;p&gt;On a Ryzen 3500U, that breaks pretty quickly. Under heavy load (say 7000 requests), most of the CPU ends up managing threads instead of doing actual computation.&lt;/p&gt;

&lt;p&gt;So I used a fixed thread pool with 8 workers — matching the 8 logical threads available.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kr"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;hardware_concurrency&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;cout&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s"&gt;"Starting "&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s"&gt;" worker threads."&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;endl&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;ThreadPool&lt;/span&gt; &lt;span class="nf"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_queue&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Efficient wait and batching logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;InferenceRequest&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 32-token limit: modern CPUs are most efficient when doing vectorized math. Processing 32 tokens together allows ONNX Runtime to leverage SIMD instructions instead of running 1 token 32 times.&lt;/p&gt;

&lt;p&gt;The 25 ms wait: under low load, I didn’t want requests waiting forever just to fill a batch. This caps latency — after 25 ms, whatever is available gets processed.&lt;/p&gt;

&lt;p&gt;There’s a trade-off here: better throughput at the cost of slightly higher latency under load.&lt;/p&gt;




&lt;h2&gt;
  
  
  Non-Busy Waiting (Resource Efficiency)
&lt;/h2&gt;

&lt;p&gt;Inside that &lt;code&gt;pop_batch&lt;/code&gt; call is a &lt;code&gt;std::condition_variable&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If the queue is empty, workers don’t spin and waste CPU cycles. They sleep.&lt;/p&gt;

&lt;p&gt;The moment gRPC pushes new work, one of them wakes up instantly.&lt;/p&gt;

&lt;p&gt;So the system stays quiet at idle and jumps straight to 100% CPU when a burst hits.&lt;/p&gt;




&lt;h2&gt;
  
  
  Inference Layer (Architecture over Algorithm)
&lt;/h2&gt;

&lt;p&gt;The engine itself is model-agnostic. I used a generic &lt;code&gt;Ort::Session&lt;/code&gt;, so the system can run any ONNX model.&lt;/p&gt;

&lt;p&gt;DistilBERT was just a reference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;make_unique&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Ort&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;L"C:&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;Users&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;wrich&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;inference-server-cpp&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;onnx&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;model_quantized.onnx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_options&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Choosing C++ wasn't just about language preference; it was about removing the 'middleman' overhead found in AI frameworks. Most Python AI libraries are built on top of C++ backends. Data is typically wrapped in Python objects (PyObject), which then need to be converted into native types before being passed to the underlying C++ engine. The results are then converted back into Python objects.&lt;/p&gt;

&lt;p&gt;This abstraction is convenient, but it introduces overhead. Given my hardware constraints, I chose to bypass this layer entirely and work directly in C++, using native types (int64_t) and interacting directly with ONNX Runtime’s shared libraries (.dll / .so).&lt;/p&gt;




&lt;h2&gt;
  
  
  INT8 Quantization
&lt;/h2&gt;

&lt;p&gt;A standard FP32 model is heavy (~260 MB). I used an INT8-quantized version (~67 MB).&lt;/p&gt;

&lt;p&gt;This helped in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fits comfortably in RAM&lt;/li&gt;
&lt;li&gt;integer ops are faster on CPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More importantly, it avoids spilling over to swap on the SSD, which would kill performance.&lt;/p&gt;

&lt;p&gt;There is a small accuracy drop, but it was a trade-off I had to make.&lt;/p&gt;




&lt;h2&gt;
  
  
  Avoiding Oversubscription
&lt;/h2&gt;

&lt;p&gt;ONNX Runtime tries to use all CPU cores by default.&lt;/p&gt;

&lt;p&gt;But I already had 8 worker threads. Letting ONNX spawn more threads would just create contention.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;ThreadPool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;num_threads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SimpleQueue&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;InferenceRequest&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Ort&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SessionOptions&lt;/span&gt; &lt;span class="n"&gt;session_options&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;session_options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetIntraOpNumThreads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This forces ONNX Runtime to use a single intra-op thread per session.&lt;/p&gt;

&lt;p&gt;That way, my thread pool stays in control, and the CPU spends time on computation instead of context switching.&lt;/p&gt;

&lt;p&gt;This is what gave me stable ~100% CPU usage under load.&lt;/p&gt;




&lt;h2&gt;
  
  
  Telemetry Challenge: “Ghost Metrics”
&lt;/h2&gt;

&lt;p&gt;One unexpected problem: the system was too fast.&lt;/p&gt;

&lt;p&gt;Even after firing thousands of tokens, by the time I opened the dashboard, everything showed 0 — no load, no active workers.&lt;/p&gt;

&lt;p&gt;The work finished faster than the UI could refresh.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Not Mutex?
&lt;/h2&gt;

&lt;p&gt;Using &lt;code&gt;std::mutex&lt;/code&gt; for counters created contention.&lt;/p&gt;

&lt;p&gt;If multiple workers finished at the same time, they’d line up waiting for the lock. That slows everything down.&lt;/p&gt;




&lt;h2&gt;
  
  
  Atomic-Based Telemetry
&lt;/h2&gt;

&lt;p&gt;So I switched to &lt;code&gt;std::atomic&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead of tracking current values, I tracked peak values over a time window.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;br&gt;
“How many workers are active right now?”&lt;/p&gt;

&lt;p&gt;I tracked:&lt;br&gt;
“What was the max number of active workers in the last interval?”&lt;/p&gt;

&lt;p&gt;This made short bursts visible.&lt;/p&gt;


&lt;h2&gt;
  
  
  Compare-And-Swap (CAS)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;cur_peak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;telemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_worker_count_peak&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_active&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;cur_peak&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; 
       &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;telemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_worker_count_peak&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compare_exchange_weak&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
           &lt;span class="n"&gt;cur_peak&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Basic idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read current value&lt;/li&gt;
&lt;li&gt;compare with new one&lt;/li&gt;
&lt;li&gt;update if higher&lt;/li&gt;
&lt;li&gt;retry if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;compare_exchange_weak&lt;/code&gt; can fail occasionally, but inside a loop it retries immediately.&lt;/p&gt;


&lt;h2&gt;
  
  
  Atomic Snapshot Trick
&lt;/h2&gt;

&lt;p&gt;For dashboard polling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;TelemetrySnapshot&lt;/span&gt; &lt;span class="nf"&gt;get_and_reset_telemetry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;telemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_queue_depth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exchange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;telemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks_processed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exchange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;telemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_worker_count_peak&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exchange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;telemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;worker_active_time_ns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exchange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;.exchange(0)&lt;/code&gt; reads and resets in one atomic step, so no updates are lost.&lt;/p&gt;

&lt;p&gt;I used &lt;code&gt;std::memory_order_relaxed&lt;/code&gt; since these are independent counters and don’t need strict ordering.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results &amp;amp; Benchmarks
&lt;/h2&gt;

&lt;p&gt;The engine was stress tested with bursts (~7000 tokens at peak), and performance stayed stable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faush5v4ynlot35laenue.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faush5v4ynlot35laenue.jpeg" alt=" " width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;~2,240 tokens/sec&lt;br&gt;&lt;br&gt;
100% CPU utilization&lt;br&gt;&lt;br&gt;
No SSD paging  &lt;/p&gt;




&lt;h2&gt;
  
  
  The “Perfect” 100% Load
&lt;/h2&gt;

&lt;p&gt;Seeing CPU usage sit at 100% consistently was probably the most satisfying part.&lt;/p&gt;

&lt;p&gt;In many systems, utilization hovers around 70–80% because cores stall on I/O waits, lock contention, or scheduler overhead.&lt;/p&gt;

&lt;p&gt;Here it stayed pinned.&lt;/p&gt;

&lt;p&gt;That basically means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no I/O bottlenecks&lt;/li&gt;
&lt;li&gt;no thread contention&lt;/li&gt;
&lt;li&gt;CPU fully used for computation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Throughput Details
&lt;/h2&gt;

&lt;p&gt;This wasn’t a one-off burst.&lt;/p&gt;

&lt;p&gt;I tested multiple patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;large 7000-token bursts&lt;/li&gt;
&lt;li&gt;smaller rapid batches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total processed: 38,392 tokens&lt;/p&gt;

&lt;p&gt;The queue handled backpressure well, and batching stayed efficient.&lt;/p&gt;

&lt;p&gt;Even with 5.92 GB RAM, nothing spilled to SSD.&lt;/p&gt;

&lt;p&gt;At that point, the system wasn’t I/O-bound or thread-bound anymore — just compute-bound.&lt;/p&gt;




&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c5a78jk40g312zn534h.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c5a78jk40g312zn534h.jpeg" alt=" " width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All 8 logical cores pinned, queue handling load without issues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax6v6cd8kldom6p3tthm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax6v6cd8kldom6p3tthm.jpeg" alt=" " width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Latency stays controlled even during ramp-up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9mgnjamwce35jmzzyrp.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9mgnjamwce35jmzzyrp.jpeg" alt=" " width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Flat latency under sustained load.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This wasn’t about competing with GPUs.&lt;/p&gt;

&lt;p&gt;It was about understanding where resources were being used, maximizing efficiency, and extracting the best performance from limited hardware.&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>machinelearning</category>
      <category>backend</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
