<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: plasmon</title>
    <description>The latest articles on DEV Community by plasmon (@plasmon_imp).</description>
    <link>https://dev.to/plasmon_imp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3838326%2Fc1f964cb-23af-4996-957d-2b4aaa5e86ce.jpg</url>
      <title>DEV Community: plasmon</title>
      <link>https://dev.to/plasmon_imp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/plasmon_imp"/>
    <language>en</language>
    <item>
      <title>Why Local LLM JSON Output Breaks — Failure Patterns and How to Fix Them in Code</title>
      <dc:creator>plasmon</dc:creator>
      <pubDate>Thu, 23 Apr 2026 23:38:38 +0000</pubDate>
      <link>https://dev.to/plasmon_imp/why-local-llm-json-output-breaks-failure-patterns-and-how-to-fix-them-in-code-4gkc</link>
      <guid>https://dev.to/plasmon_imp/why-local-llm-json-output-breaks-failure-patterns-and-how-to-fix-them-in-code-4gkc</guid>
      <description>&lt;h2&gt;
  
  
  API Gets One Line. Local Gets a Minefield.
&lt;/h2&gt;

&lt;p&gt;OpenAI has &lt;code&gt;response_format={"type": "json_object"}&lt;/code&gt;. Claude has an equivalent. Set it, and the output is guaranteed to be valid JSON. Parse errors simply don't happen.&lt;/p&gt;

&lt;p&gt;Local LLMs don't have this. llama.cpp offers &lt;code&gt;--grammar&lt;/code&gt; to constrain output to valid JSON syntax, but that only forces the &lt;strong&gt;format&lt;/strong&gt; to be JSON. Whether the &lt;strong&gt;content&lt;/strong&gt; makes sense is a completely different problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;API&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;output:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;intended&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Qwen2.5-14B"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"speed_tps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;31.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"vram_gb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;7.3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Local&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;LLM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(grammar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;enabled):&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;valid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JSON,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;broken&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;content&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Qwen2.5-14B"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"speed_tps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"vram_gb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"enough"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Types&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;are&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;wrong.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Numbers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;became&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;strings.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This "format is correct but content is broken" problem gets worse with smaller models. On RTX 4060 8GB, the model size constraint (7B-14B) directly impacts JSON output reliability.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 3 Failure Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: Failed — JSON Itself Is Broken
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Typical:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;explanation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;text&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;wraps&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;around&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JSON&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Here&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JSON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;output:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Qwen2.5-14B"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;I&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;hope&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;helps!&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;json.loads()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;parse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;error&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When it happens:&lt;/strong&gt; Frequent with 7B models without grammar. Enabling &lt;code&gt;--grammar&lt;/code&gt; eliminates this completely. 14B+ rarely has this issue even without grammar.&lt;/p&gt;
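
&lt;p&gt;If grammar isn't available in your stack, a defensive extraction step recovers most of these wrapped-output cases. A minimal sketch; the brace matching assumes a single top-level object and will be fooled by braces inside string values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def extract_json_object(raw: str) -&amp;gt; dict:
    """Pull the first balanced {...} block out of chatty output and parse it."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in output")
    depth = 0
    for i, ch in enumerate(raw[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(raw[start:i + 1])
    raise ValueError("unbalanced braces in output")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;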

&lt;h3&gt;
  
  
  Pattern 2: Broken — Valid JSON, Wrong Content
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Expected:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"speed_tps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;31.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"vram_gb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;7.3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Actual:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"speed_tps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"vram_gb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"7.3GB"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Wrong&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;types.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Strings&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;where&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;numbers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;should&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;be.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Units&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;leak&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;into&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;values.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When it happens:&lt;/strong&gt; Frequent with 7B, sporadic with 14B. Including a JSON Schema in the prompt dramatically improves this. Grammar alone can't prevent it — the format is valid, just the values are wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Nested Structure Collapse
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Expected:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"a"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"a"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Actual:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"a"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"b"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;field&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;changed&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Or:&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"a"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;collapsed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;mid-array&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When it happens:&lt;/strong&gt; The nastiest pattern. When generating multiple objects inside an array, the first object is correct but subsequent ones drift — field names change, types collapse. This happens even with larger models. &lt;strong&gt;The best approach is to not ask the model to generate nested structures at all&lt;/strong&gt; (see two-stage generation below).&lt;/p&gt;




&lt;h2&gt;
  
  
  Grammar: Necessary but Not Sufficient
&lt;/h2&gt;

&lt;p&gt;llama.cpp's &lt;code&gt;--grammar&lt;/code&gt; guarantees Pattern 1 (parse errors) goes to zero. But it can't prevent Pattern 2 or 3. Grammar constrains the token sequence format, not semantic correctness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grammar is a prerequisite, not a solution.&lt;/strong&gt;&lt;/p&gt;
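
&lt;p&gt;For reference, here is one way to pass a grammar to a llama.cpp server from Python. Treat it as a sketch, not canonical usage: it assumes a &lt;code&gt;llama-server&lt;/code&gt; instance on localhost:8080 and that your build's &lt;code&gt;/completion&lt;/code&gt; endpoint accepts a &lt;code&gt;grammar&lt;/code&gt; field (check the server README linked in the references for your version).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import requests  # assumed available: pip install requests

# JSON grammar that ships with llama.cpp (grammars/json.gbnf in the repo)
with open("grammars/json.gbnf", "r", encoding="utf-8") as f:
    json_grammar = f.read()

resp = requests.post(
    "http://localhost:8080/completion",   # assumed local llama-server endpoint
    json={
        "prompt": "Summarize this benchmark run as JSON: ...",
        "n_predict": 256,
        "grammar": json_grammar,          # constrains the *format*, not the content
    },
    timeout=120,
)
raw = resp.json()["content"]
print(json.loads(raw))  # parses, but field types can still be wrong (Pattern 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;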




&lt;h2&gt;
  
  
  3 Fixes That Actually Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fix 1: Explicit Schema in Prompt
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed_tps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vram_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed_tps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vram_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Output JSON following this schema:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Input: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Giving the model the exact structure upfront dramatically improves field name consistency and type correctness. This works because the schema is in the context when the model generates each token.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix 2: Grammar + Retry
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reliable_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;validate_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;
        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JSON generation failed (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; attempts)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Allowing 3 retries massively improves effective success rate. The cost is up to 3x latency — a reasonable tradeoff for pipelines where reliability matters. Measure how many retries YOUR model needs on YOUR tasks.&lt;/p&gt;
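
&lt;p&gt;The &lt;code&gt;call_llm&lt;/code&gt;, &lt;code&gt;validate_schema&lt;/code&gt; and &lt;code&gt;ValidationError&lt;/code&gt; names above are placeholders. One concrete way to fill in the validation half is the &lt;code&gt;jsonschema&lt;/code&gt; package, reusing the schema dict from Fix 1; a minimal sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from jsonschema import validate, ValidationError  # pip install jsonschema

def validate_schema(parsed: dict) -&amp;gt; bool:
    """Check the parsed object against the schema from Fix 1; raises ValidationError on mismatch."""
    validate(instance=parsed, schema=schema)
    return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;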

&lt;h3&gt;
  
  
  Fix 3: Two-Stage Generation (for Nested Structures)
&lt;/h3&gt;

&lt;p&gt;Don't ask the model to build nested JSON in one shot. Generate flat JSON twice and merge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: Extract metadata
&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Output model name only: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Extract array separately  
&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Output each quantization as JSON array: [{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: ..., &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;size_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: ..., &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed_tps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: ...}]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Merge in code
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantizations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two flat generations merged in code is dramatically more stable than one nested generation. For 7B models needing nested output, this is effectively the only practical option.&lt;/p&gt;




&lt;h2&gt;
  
  
  Model Size Decision Guide
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[JSON Output System Design Guide]

High reliability (payments, medical):  32B + grammar + retry
                                       → Doesn't fit 8GB. Use an API.

Standard (RAG, analysis):              14B + grammar + schema + retry
                                       → Optimal for RTX 4060 8GB

Lightweight (log extraction, classification): 7B + grammar + two-stage
                                       → Practical if you stick to flat JSON

Nested structures required:            14B+ with two-stage generation
                                       → 7B can't do this reliably
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Specific success rates depend on YOUR environment. Copy the test code above, run it with YOUR model and YOUR tasks. Those numbers are your real reliability. Don't trust anyone else's benchmarks — including this article.&lt;/p&gt;
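
&lt;p&gt;If you want a starting point for that measurement, a loop like the one below is enough. It assumes the &lt;code&gt;call_llm&lt;/code&gt; and &lt;code&gt;validate_schema&lt;/code&gt; helpers from the fixes above and simply counts how often a single-shot generation parses and validates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def measure_json_reliability(prompt: str, n_runs: int = 50) -&amp;gt; float:
    """Rough single-shot success rate for JSON generation on your model and task."""
    ok = 0
    for _ in range(n_runs):
        raw = call_llm(prompt)             # your model, your backend
        try:
            if validate_schema(json.loads(raw)):
                ok += 1
        except Exception:
            pass                           # any failure counts the same
    return ok / n_runs

# print(f"single-shot success rate: {measure_json_reliability(prompt):.0%}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;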




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;llama.cpp grammar documentation: &lt;a href="https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md" rel="noopener noreferrer"&gt;https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;llama.cpp server API: &lt;a href="https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md" rel="noopener noreferrer"&gt;https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>INT8 Hits 58x, Voltage Underscaling Saves 36% — Semiconductor Physics Limits Are Being Bypassed by Software in 2026</title>
      <dc:creator>plasmon</dc:creator>
      <pubDate>Thu, 23 Apr 2026 01:42:24 +0000</pubDate>
      <link>https://dev.to/plasmon_imp/int8-hits-58x-voltage-underscaling-saves-36-semiconductor-physics-limits-are-being-bypassed-by-28ck</link>
      <guid>https://dev.to/plasmon_imp/int8-hits-58x-voltage-underscaling-saves-36-semiconductor-physics-limits-are-being-bypassed-by-28ck</guid>
      <description>&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;Last week, Tesla started recruiting engineers in Taiwan for "Terafab" — Elon Musk's vision for an in-house AI semiconductor fab. Around the same time, IBM Japan announced development of a 2nm neuromorphic accelerator led from Japan.&lt;/p&gt;

&lt;p&gt;Read these headlines individually and they're just more semiconductor noise. But overlay them with three ArXiv papers published this month, and a very different picture emerges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2026's semiconductor industry is quietly shifting from "push physics harder" to "bypass physics with software."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is my reading, but the evidence isn't thin.&lt;/p&gt;




&lt;h2&gt;
  
  
  Paper 1: INT8 Achieves 58x — DEEP-GAP Measures Where GPU Inference Actually Stands
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DEEP-GAP: Deep-learning Evaluation of Execution Parallelism&lt;/strong&gt; (arXiv:2604.14552) systematically benchmarks datacenter inference accelerator performance.&lt;/p&gt;

&lt;p&gt;Key findings:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Results show that reduced precision significantly improves performance, with INT8 achieving up to 58x throughput improvement over CPU baselines. L4 achieves up to 4.4x higher throughput than T4 while reaching peak efficiency at smaller batch sizes between 16 and 32."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Comparison&lt;/th&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;INT8 vs CPU baseline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;up to 58x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA L4 vs T4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;up to 4.4x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L4 peak efficiency batch size&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16-32&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Latency-throughput tradeoff&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;58x is provocative, but note this compares FP32 CPU inference against INT8 GPU inference. Still, the implication is massive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One generation of process node advancement yields maybe 20-30% performance improvement.&lt;/strong&gt; Simply reducing precision (quantizing) delivers 58x. The optimization direction for hardware design is clearly "precision hierarchy" — deciding which computations need which precision, dynamically.&lt;/p&gt;
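
&lt;p&gt;For intuition on what "reducing precision" means mechanically, here is a toy symmetric INT8 quantization of a weight matrix in numpy. It is illustrative only; real inference stacks fold this into the matmul kernels and usually quantize per-channel or per-block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: 4x smaller than FP32, integer math on device."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(w)
print(w.nbytes / q.nbytes)                              # 4.0: memory footprint shrinks 4x
print(np.abs(w - q.astype(np.float32) * scale).max())   # worst-case rounding error, about scale/2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;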




&lt;h2&gt;
  
  
  Paper 2: Run It Broken on Purpose — DRIFT's Fault-Tolerant Inference
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Resilient Inference&lt;/strong&gt; (arXiv:2604.09073) takes a contrarian approach.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"DRIFT can achieve on average 36% energy savings through voltage underscaling or 1.7x speedup via overclocking while maintaining generation quality."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;"Voltage underscaling" means running chips below rated voltage. Normally this introduces memory errors and computational mistakes — fatal for numerical computing. But generative AI models tolerate a degree of bit errors without degrading output quality. DRIFT exploits this "soft fault tolerance" to intentionally lower voltage and save energy.&lt;/p&gt;

&lt;p&gt;The reverse works too: overclocking for 1.7x speedup "while maintaining quality."&lt;/p&gt;

&lt;p&gt;This is fundamental. &lt;strong&gt;The hardware's imperfections are being absorbed by the AI model's inherent tolerance.&lt;/strong&gt; The design philosophy has flipped from "make hardware perfect" to "make software tolerant of imperfect hardware."&lt;/p&gt;

&lt;p&gt;My prediction: this class of error-tolerant design becomes mainstream for NPU and edge AI chips by 2027-2028.&lt;/p&gt;




&lt;h2&gt;
  
  
  Paper 3: Spiking Neural Compute at 4.2mW — L-SPINE Shows Another Direction
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;L-SPINE: A Low-Precision SIMD Spiking Neural Compute Engine&lt;/strong&gt; (arXiv:2604.03626), implemented on AMD VC707 FPGA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Critical delay: &lt;strong&gt;0.39 ns&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Power: &lt;strong&gt;4.2 mW&lt;/strong&gt; (neuron-level)&lt;/li&gt;
&lt;li&gt;System total: &lt;strong&gt;0.54 W&lt;/strong&gt;, latency &lt;strong&gt;2.38 ms&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare 0.54W to RTX 4060's ~150W inference consumption. Two orders of magnitude less. "Different use case" is the correct objection — but that's exactly the point.&lt;/p&gt;

&lt;p&gt;SNNs compute only on spike events. Idle time costs near-zero power. This is devastatingly efficient for sparse sensor inputs: drone LiDAR, factory vibration sensors, medical wearables. Using GPUs for these tasks is absurd overkill.&lt;/p&gt;

&lt;p&gt;I expect the first mass-produced SNN chips for robotics/industrial edge sensor fusion by 2027-2028. L-SPINE being on FPGA means the prototyping phase is active now.&lt;/p&gt;




&lt;h2&gt;
  
  
  Measuring It on Real Hardware: RTX 4060 vs M4
&lt;/h2&gt;

&lt;p&gt;Here's what I observe on my setup (Ryzen 7 7845HS + RTX 4060 + Windows / Apple M4):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RTX 4060&lt;/strong&gt;: Running Qwen on llama.cpp, monitoring with &lt;code&gt;nvidia-smi&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;timestamp,name,temperature.gpu,power.draw,clocks.gr,clocks.mem &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv,noheader,nounits &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--loop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thermal throttling visibly drops clock speeds. "Process node miniaturization limits" aren't an abstraction; laptop GPUs run into them daily.&lt;/p&gt;
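
&lt;p&gt;To turn that observation into numbers rather than eyeballing the terminal, a few lines of Python over the logged CSV are enough. This assumes the command above was redirected to a hypothetical &lt;code&gt;gpu_log.csv&lt;/code&gt;; column order follows the query fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv

# columns: timestamp, name, temperature.gpu, power.draw, clocks.gr, clocks.mem
with open("gpu_log.csv", newline="") as f:
    rows = [r for r in csv.reader(f) if len(r) == 6]

clocks = [float(r[4]) for r in rows]   # graphics clock, MHz
temps  = [float(r[2]) for r in rows]   # GPU temperature, C

peak = max(clocks)
sustained = sum(clocks[-60:]) / len(clocks[-60:])   # average over the last ~minute of samples
print(f"peak {peak:.0f} MHz, sustained {sustained:.0f} MHz "
      f"({sustained / peak:.0%} of peak), max temp {max(temps):.0f}C")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;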

&lt;p&gt;&lt;strong&gt;M4 comparison&lt;/strong&gt;: Same workload, and the fan doesn't spin. In sustained workloads where RTX 4060 throttles, M4 maintains equivalent tokens/sec silently. &lt;strong&gt;Sustained performance, not peak performance, is the real metric&lt;/strong&gt; — exactly what DEEP-GAP's "peak efficiency at batch 16-32" is saying from a different angle.&lt;/p&gt;




&lt;h2&gt;
  
  
  2026-2030: My Predictions (All Personal Analysis)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prediction 1: "Precision Hierarchy" Becomes the Next Design Axis
&lt;/h3&gt;

&lt;p&gt;CPUs, GPUs, and NPUs will all dynamically control which operations run at which precision. The era of universal FP32 is over.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prediction 2: DRIFT's "Tolerate Broken State" Design Goes Mainstream
&lt;/h3&gt;

&lt;p&gt;Semiconductor design orthodoxy was "prevent errors." DRIFT reverses this to "design models that maintain quality despite errors." Impact on production chip power design: 2028-2029 at earliest.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prediction 3: Terafab Symbolizes "Vertical Integration" as Industry Trend
&lt;/h3&gt;

&lt;p&gt;Tesla's Terafab and OpenAI's semiconductor investments ($20B+ reported) are driven by the desire to co-design models and hardware. Apple Silicon already demonstrated the performance-per-watt advantage of that kind of co-design.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prediction 4: SNN Reaches Production in Robotics/Sensor Fusion by 2028
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Prediction 5: NPU Architecture Becomes the Next Differentiator
&lt;/h3&gt;

&lt;p&gt;Intel's Core Ultra NPU and Qualcomm's Hexagon have fundamentally different design philosophies. By 2027, "which NPU you have" will determine which AI apps can run — Android-style fragmentation chaos.&lt;/p&gt;




&lt;h2&gt;
  
  
  What 8GB Users Should Watch
&lt;/h2&gt;

&lt;p&gt;Every approach described above is a &lt;strong&gt;bypass&lt;/strong&gt;, not a breakthrough. DEEP-GAP bypasses through precision reduction. DRIFT bypasses through error tolerance. L-SPINE bypasses through architecture change. Terafab bypasses through vertical integration.&lt;/p&gt;

&lt;p&gt;The physics wall is real. But the ways around it are multiplying.&lt;/p&gt;

&lt;p&gt;If I had to summarize 2026's architecture trend in one phrase: &lt;strong&gt;"intelligence that tolerates imperfection."&lt;/strong&gt; The obsession with perfect precision, perfect error tolerance, perfect yield — these become constraints, not goals, at the next design frontier.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://arxiv.org/abs/2604.14552v1" rel="noopener noreferrer"&gt;DEEP-GAP: Deep-learning Evaluation of Execution Parallelism (arXiv:2604.14552)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://arxiv.org/abs/2604.03626v1" rel="noopener noreferrer"&gt;L-SPINE: A Low-Precision SIMD Spiking Neural Compute Engine (arXiv:2604.03626)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://arxiv.org/abs/2604.09073v1" rel="noopener noreferrer"&gt;DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Resilient Inference (arXiv:2604.09073)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>semiconductor</category>
      <category>ai</category>
      <category>hardware</category>
      <category>gpu</category>
    </item>
    <item>
      <title>I Ran an LLM Agent on 8GB VRAM — It Broke After 5 Tool Calls</title>
      <dc:creator>plasmon</dc:creator>
      <pubDate>Tue, 21 Apr 2026 05:03:07 +0000</pubDate>
      <link>https://dev.to/plasmon_imp/i-ran-an-llm-agent-on-8gb-vram-it-broke-after-5-tool-calls-1d38</link>
      <guid>https://dev.to/plasmon_imp/i-ran-an-llm-agent-on-8gb-vram-it-broke-after-5-tool-calls-1d38</guid>
      <description>&lt;h2&gt;
  
  
  The "Long-Term Memory" Agent Is a Fantasy on 8GB
&lt;/h2&gt;

&lt;p&gt;2026's LLMs are expected to run as agents by default. Call tools, receive results, decide next action, call again. Claude Code, Cursor, Devin — all built on "long-running loop" strategies.&lt;/p&gt;

&lt;p&gt;This strategy physically cannot work on 8GB local VRAM.&lt;/p&gt;

&lt;p&gt;I tested a llama.cpp-based tool-calling agent with RTX 4060 Laptop (8GB) + Qwen2.5-7B Q4_K_M. The result is simple: &lt;strong&gt;beyond ~5 tool calls, response quality visibly degrades.&lt;/strong&gt; Past 10 calls, the model starts ignoring results from tools it just called.&lt;/p&gt;

&lt;p&gt;This article breaks down why this happens from KV cache and Context Rot perspectives, then examines 3 viable workarounds for 8GB.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Much KV Cache Does Each Tool Call Eat?
&lt;/h2&gt;

&lt;p&gt;Consider the token cost of one tool call cycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;One tool call cycle:
  System prompt              : ~500 tok (fixed)
  User instruction           : ~200 tok (fixed)
  Conversation history       : variable (accumulates)
  Tool definitions (schemas) : ~300 tok × number of tools
  LLM response (tool_call)   : ~100 tok
  Tool execution result      : ~500-2000 tok (variable)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 5 tools defined and average 800 tokens per result, KV cache accumulation per step:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Cumulative Tokens&lt;/th&gt;
&lt;th&gt;KV Cache (fp16)&lt;/th&gt;
&lt;th&gt;VRAM Remaining (7B Q4_K_M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0 (initial)&lt;/td&gt;
&lt;td&gt;~2,200&lt;/td&gt;
&lt;td&gt;0.12 GB&lt;/td&gt;
&lt;td&gt;2.60 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;~4,900&lt;/td&gt;
&lt;td&gt;0.26 GB&lt;/td&gt;
&lt;td&gt;2.46 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;~6,700&lt;/td&gt;
&lt;td&gt;0.36 GB&lt;/td&gt;
&lt;td&gt;2.36 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;~11,200&lt;/td&gt;
&lt;td&gt;0.60 GB&lt;/td&gt;
&lt;td&gt;2.12 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;~20,200&lt;/td&gt;
&lt;td&gt;1.08 GB&lt;/td&gt;
&lt;td&gt;1.64 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;~29,200&lt;/td&gt;
&lt;td&gt;1.56 GB&lt;/td&gt;
&lt;td&gt;1.16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_vram_estimate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens_per_step&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;model_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;4.68&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overhead_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;n_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_kv_heads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype_bytes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Estimate VRAM consumption for agent loop&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tokens_per_step&lt;/span&gt;
    &lt;span class="n"&gt;kv_gb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_layers&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_kv_heads&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;head_dim&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dtype_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total_gb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_gb&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;kv_gb&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;overhead_gb&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kv_cache_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kv_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_vram_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remaining_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;8.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;total_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent_vram_estimate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Step &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tokens&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tok, KV=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kv_cache_gb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;GB, remaining=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;remaining_gb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By the mid-30s in step count, remaining VRAM drops below 1GB, and at 50 steps you are on the edge of OOM. Q4 KV cache quantization (&lt;code&gt;--cache-type-k q4_0 --cache-type-v q4_0&lt;/code&gt;) compresses the cache by ~3.5x, but even then, 100+ step loops are unrealistic.&lt;/p&gt;
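
&lt;p&gt;That ~3.5x figure can be sanity-checked with the estimator above: q4_0 stores roughly 4.5 bits per value (block scales included) instead of 16, so pass a fractional &lt;code&gt;dtype_bytes&lt;/code&gt;. This is an approximation; exact block overhead varies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Reuse agent_vram_estimate with an approximate q4_0 cache: ~4.5 bits/value vs 16 for fp16
fp16 = agent_vram_estimate(30)
q4   = agent_vram_estimate(30, dtype_bytes=4.5 / 8)
print(fp16["kv_cache_gb"], q4["kv_cache_gb"])              # ~1.56 GB vs ~0.44 GB
print(round(fp16["kv_cache_gb"] / q4["kv_cache_gb"], 1))   # ~3.5x compression
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;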

&lt;p&gt;But &lt;strong&gt;a more serious problem hits before OOM.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Context Rot — Long Context Kills Quality
&lt;/h2&gt;

&lt;p&gt;Even when everything fits in VRAM, response quality collapses as context grows. This is known as "Context Rot."&lt;/p&gt;

&lt;p&gt;Chroma Research reports that an LLM's ability to accurately recall information from its context degrades as token count grows. Degradation is especially pronounced when intermediate results keep accumulating in context — exactly what agents do.&lt;/p&gt;

&lt;p&gt;Microsoft and Salesforce's joint research "LLMs Get Lost In Multi-Turn Conversation" (arXiv:2505.06120) provides specific numbers. When benchmark prompts are converted into multi-turn conversations (much like an agent workflow), they report an &lt;strong&gt;average 39% performance drop across 6 generative tasks.&lt;/strong&gt; Even reasoning-specialized models like o3 and DeepSeek-R1 weren't immune.&lt;/p&gt;

&lt;p&gt;With 7B models, degradation starts earlier. What I observed with Qwen2.5-7B:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Steps 3-5:&lt;/strong&gt; Normal operation. Accurately references tool results, selects appropriate next action&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Steps 5-8:&lt;/strong&gt; Begins forgetting initial instructions. Redundantly re-calls the same tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Steps 8-10:&lt;/strong&gt; Ignores recent tool results. Hallucination rate climbs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Steps 10+:&lt;/strong&gt; Loses conversational direction. Tool calls become unrelated to the objective&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the same structure as "Lost in the Middle" (Liu et al., TACL 2024). In agent scenarios, tool results from steps 3-4 get pushed to the "middle," and only the system prompt (beginning) and latest results (end) get referenced.&lt;/p&gt;




&lt;h2&gt;
  
  
  Do Larger Models Solve This?
&lt;/h2&gt;

&lt;p&gt;Important counter-evidence:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-4.1 showed no degradation in tool-heavy conversations.&lt;/strong&gt; Parloa's testing confirms large models maintain stable performance in long conversations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MemAgent extrapolates from 8K context to 3.5M token tasks with under 10% performance loss&lt;/strong&gt; (OpenReview). RLM (Recursive Language Model) maintains 91.33% accuracy across 1000 documents and 10M+ tokens.&lt;/p&gt;

&lt;p&gt;However, these all involve large models with tens to hundreds of GB of memory, or cloud inference.&lt;/p&gt;

&lt;p&gt;For 7B models running on 8GB VRAM:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The context window itself is physically limited (as shown above)&lt;/li&gt;
&lt;li&gt;Fewer Attention heads means weaker long-range dependency retention&lt;/li&gt;
&lt;li&gt;GQA (Grouped Query Attention) saves KV cache, but doesn't improve the model's actual "memory capacity"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;"The problem is mitigated with sufficient model size" is true. "On 8GB, you must engineer around it" is equally true.&lt;/p&gt;




&lt;h2&gt;
  
  
  Workaround 1: Short Loops × Context Reset
&lt;/h2&gt;

&lt;p&gt;The simplest and most effective approach. Cut the agent loop short and reset context at each loop boundary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;short_loop_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_steps_per_loop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Short-loop × reset strategy agent&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  &lt;span class="c1"&gt;# Only carry summaries between loops
&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;is_task_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Rebuild context with minimum necessary info
&lt;/span&gt;        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;memory_summary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]),&lt;/span&gt;  &lt;span class="c1"&gt;# Only last 3 summaries
&lt;/span&gt;            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Execute short loop
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_steps_per_loop&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;summarize_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="c1"&gt;# End of loop: reset context, carry only summaries
&lt;/span&gt;        &lt;span class="c1"&gt;# KV cache is freed
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is &lt;code&gt;memory_summary&lt;/code&gt;. What passes between loops isn't raw tool results — it's &lt;strong&gt;summaries&lt;/strong&gt;. This prevents KV cache accumulation while retaining necessary information.&lt;/p&gt;

&lt;p&gt;5 steps × 6 loops = 30-step equivalent task, processed at ~6,700 tokens per loop (0.36GB KV cache). Compared to 1.56GB for running 30 steps straight, VRAM consumption is less than a quarter.&lt;/p&gt;
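&lt;p&gt;A quick check of that claim, using only the figures above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sanity check of the article's figures (illustrative arithmetic, not a measurement).
kv_gb_per_loop = 0.36           # Q4 KV cache at ~6,700 tokens per 5-step loop
kv_gb_30_steps_straight = 1.56  # one uninterrupted 30-step context
print(kv_gb_per_loop / kv_gb_30_steps_straight)   # ~0.23, i.e. under a quarter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;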




&lt;h2&gt;
  
  
  Workaround 2: Persistent Q4 KV Cache
&lt;/h2&gt;

&lt;p&gt;arXiv:2603.04428 "Agent Memory Below the Prompt" (2026) proposes persisting agent KV cache to disk with Q4 quantization, loading directly into Attention layers when needed.&lt;/p&gt;

&lt;p&gt;Validated on Apple M4 Pro:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;FP16 KV cache budget of 10.2GB holds only 3 agent contexts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Q4 quantization fits 4x more agent contexts in the same memory&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTFT improvement from cache restoration: up to 136x&lt;/strong&gt; (22–136x for Gemma, 11–76x for DeepSeek)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core insight: "avoid recomputation." Normally, restoring context requires recalculating prefill for all tokens. Persistent KV Cache skips this entirely by loading pre-saved KV states directly.&lt;/p&gt;

&lt;p&gt;The paper's validation was on M4 Pro, but the principle applies equally to an RTX 4060. llama.cpp supports KV cache save/restore (the &lt;code&gt;llama_state_save_file&lt;/code&gt; / &lt;code&gt;llama_state_load_file&lt;/code&gt; API; llama-cli exposes it via &lt;code&gt;--prompt-cache&lt;/code&gt;). Saving per-agent KV snapshots to an NVMe SSD and loading them on task switch avoids prefill recomputation. On 8GB, where only one agent context fits at a time, this "swap" strategy pays off even more than on M4 Pro.&lt;/p&gt;
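&lt;p&gt;A hedged sketch of that swap idea using the llama-cpp-python bindings. &lt;code&gt;Llama.save_state()&lt;/code&gt; and &lt;code&gt;Llama.load_state()&lt;/code&gt; exist there; persisting the state with pickle and swapping it per agent is my extrapolation (as is the model path), not a documented workflow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch only: per-agent KV "swap" via llama-cpp-python state snapshots.
# save_state()/load_state() are real methods; pickling the state object to
# NVMe and swapping per agent is an assumption layered on top.
import pickle
from llama_cpp import Llama

llm = Llama(model_path="qwen2.5-7b-instruct-q4_k_m.gguf", n_ctx=8192)

def save_agent_state(agent_id):
    state = llm.save_state()                 # snapshot of the current KV cache + context
    with open(f"agent_{agent_id}.state", "wb") as f:
        pickle.dump(state, f)

def load_agent_state(agent_id):
    with open(f"agent_{agent_id}.state", "rb") as f:
        state = pickle.load(f)
    llm.load_state(state)                    # restore without re-running prefill
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;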




&lt;h2&gt;
  
  
  Workaround 3: Dynamic Tool Selection (Tool Loadout)
&lt;/h2&gt;

&lt;p&gt;More tool definitions means worse selection accuracy. Berkeley's function-calling leaderboard confirms that &lt;strong&gt;as tools increase, description overlap makes correct selection harder.&lt;/strong&gt; Empirically, 5–10 tools is the practical ceiling for 7B models. Tool definitions themselves consume context and pressure KV cache.&lt;/p&gt;

&lt;p&gt;Solution: "don't define all tools at all times."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dynamic_tool_selection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;all_tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Dynamically select tools based on query&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Lightweight classifier determines query category
&lt;/span&gt;    &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "search", "code", "data", etc.
&lt;/span&gt;
    &lt;span class="c1"&gt;# Select tool subset based on category
&lt;/span&gt;    &lt;span class="n"&gt;tool_groups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv_parse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chart_generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_groups&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;all_tools&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;max_tools&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;selected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loading all 20 tool definitions costs ~6,000 tokens. Narrowing to 5 tools: ~1,500 tokens. The 4,500-token difference saves 0.04GB per step in Q4 KV cache. Looks small, but over 30 steps this accumulates to 1.2GB+ difference.&lt;/p&gt;
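&lt;p&gt;The same arithmetic, spelled out. The token counts are the article's estimates; the per-token Q4 KV cost is an assumed figure chosen to be consistent with them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative arithmetic for the tool-loadout saving (estimates, not measurements).
tokens_all_tools = 6_000      # ~20 tool definitions
tokens_selected = 1_500       # ~5 tool definitions
kv_gb_per_1k_tokens = 0.009   # assumed Q4 KV cost, consistent with ~0.04GB per 4,500 tokens

saved_tokens_per_step = tokens_all_tools - tokens_selected              # 4,500
saved_gb_per_step = saved_tokens_per_step / 1000 * kv_gb_per_1k_tokens  # ~0.04 GB
print(saved_gb_per_step * 30)   # ~1.2 GB over a 30-step task
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;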




&lt;h2&gt;
  
  
  8GB Agent Design Principles
&lt;/h2&gt;

&lt;p&gt;Combining all three workarounds:&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 1: Loops Stay Under 5 Steps
&lt;/h3&gt;

&lt;p&gt;7B models maintain context quality up to ~6,000–8,000 tokens. At ~900 tokens per tool call, 5 steps is the limit.&lt;/p&gt;
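&lt;p&gt;As a sketch, the step budget falls out of those two numbers. The quality ceiling and per-call cost are the article's estimates; the base-prompt allowance is my assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative helper: derive the per-loop step budget from a context-quality ceiling.
def max_steps_per_loop(quality_ceiling_tokens=7000, base_prompt_tokens=2000,
                       tokens_per_tool_call=900):
    """How many tool-call steps fit before a 7B model's quality ceiling is reached.
    base_prompt_tokens (system prompt + task + tool defs) is an assumed allowance."""
    budget = quality_ceiling_tokens - base_prompt_tokens
    return max(1, budget // tokens_per_tool_call)

print(max_steps_per_loop())   # 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;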

&lt;h3&gt;
  
  
  Principle 2: Memory Carries as "Summaries"
&lt;/h3&gt;

&lt;p&gt;Never leave raw tool results in context. Summarize at each loop boundary. Next loop only sees summaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 3: Maximum 5 Tool Definitions
&lt;/h3&gt;

&lt;p&gt;Dynamic tool selection loads only what's needed per step. "Universal agents" don't work on 8GB.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 4: Monitor "Context Quality"
&lt;/h3&gt;

&lt;p&gt;Track the tool-call "hit rate" (whether each called tool actually matched the objective) and treat a sustained drop as an automatic trigger to reset the loop, as in the sketch below.&lt;/p&gt;
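&lt;p&gt;A minimal sketch of that trigger, assuming the caller can judge per-call relevance; the window size and threshold are arbitrary choices, not values from the article:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative context-quality monitor: reset the loop when the recent
# tool-call hit rate drops. Window and threshold are assumptions.
from collections import deque

class ContextQualityMonitor:
    def __init__(self, window=5, threshold=0.6):
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def record(self, tool_matched_objective):
        """Record True/False for whether the called tool matched the objective."""
        self.recent.append(bool(tool_matched_objective))

    def should_reset(self):
        """True once the window is full and the hit rate has fallen below threshold."""
        if len(self.recent) != self.recent.maxlen:
            return False
        hit_rate = sum(self.recent) / len(self.recent)
        return hit_rate &amp;lt; self.threshold

# Inside the agent loop (sketch):
#   monitor.record(result_was_relevant)
#   if monitor.should_reset(): break   # end the loop early, carry summaries forward
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;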




&lt;h2&gt;
  
  
  The 8GB Constraint Improves Agent Design
&lt;/h2&gt;

&lt;p&gt;As I wrote in the 128K context article — the 8GB constraint isn't a handicap. It's a design forcing function.&lt;/p&gt;

&lt;p&gt;Cloud-scale models can brute-force 100-step agent loops. But as Microsoft and Salesforce's research shows, &lt;strong&gt;being able to run it and maintaining quality are separate problems.&lt;/strong&gt; Even o3 degrades by 39%.&lt;/p&gt;

&lt;p&gt;On 8GB, you can't hide the fact that quality drops around 5 steps. That's precisely why the constraint pushes you toward fundamentally correct design: short loops, summary carry-over, dynamic tool selection. These design principles apply directly to cloud environments too, and arguably &lt;em&gt;should&lt;/em&gt; be applied there.&lt;/p&gt;

&lt;p&gt;What determines agent performance isn't context length. It's context quality.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;"Context Rot: How Increasing Input Tokens Impacts LLM Performance" (Chroma Research): &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;https://research.trychroma.com/context-rot&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;"Lost in the Middle: How Language Models Use Long Contexts" (Liu et al., TACL 2024): &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2307.03172&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;arXiv:2603.04428 — Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices&lt;/li&gt;
&lt;li&gt;"LLMs Get Lost In Multi-Turn Conversation" (Microsoft Research &amp;amp; Salesforce, arXiv:2505.06120): &lt;a href="https://arxiv.org/abs/2505.06120" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2505.06120&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Parloa Labs — Long Conversations and LLM Performance: &lt;a href="https://www.parloa.com/labs/insights/long-calls-LLM-performance/" rel="noopener noreferrer"&gt;https://www.parloa.com/labs/insights/long-calls-LLM-performance/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Berkeley Function-Calling Leaderboard: &lt;a href="https://gorilla.cs.berkeley.edu/leaderboard.html" rel="noopener noreferrer"&gt;https://gorilla.cs.berkeley.edu/leaderboard.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;llama.cpp: &lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;https://github.com/ggerganov/llama.cpp&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>gpu</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Memory Wall Can't Be Killed — 3 Papers Proving Every Architecture Hits It</title>
      <dc:creator>plasmon</dc:creator>
      <pubDate>Tue, 21 Apr 2026 02:54:11 +0000</pubDate>
      <link>https://dev.to/plasmon_imp/the-memory-wall-cant-be-killed-3-papers-proving-every-architecture-hits-it-1i8c</link>
      <guid>https://dev.to/plasmon_imp/the-memory-wall-cant-be-killed-3-papers-proving-every-architecture-hits-it-1i8c</guid>
      <description>&lt;h2&gt;
  
  
  I Tested 3 "Escape Routes" from the Memory Wall
&lt;/h2&gt;

&lt;p&gt;The GPU memory wall — computation sitting idle because memory bandwidth can't keep up — is something anyone who's run a local LLM knows viscerally. With 8GB on an RTX 4060, both model size and context length are memory-bound.&lt;/p&gt;

&lt;p&gt;Neuromorphic chips. Edge NPUs. Processing-in-Memory. These architectures all wave the banner of "escaping the von Neumann bottleneck." The memory wall is a GPU-specific problem, they say. Change the architecture, solve the problem — or at least, that's been the promise.&lt;/p&gt;

&lt;p&gt;In April 2026, three papers put this promise to the test. The conclusion: the wall is still there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test 1: Neuromorphic's "New Wall"
&lt;/h2&gt;

&lt;p&gt;Yousefzadeh et al.'s "Memory Wall is not gone" (arXiv:2604.08774) is blunt from the title.&lt;/p&gt;

&lt;p&gt;Neuromorphic chip design philosophy centers on "distributed memory." Each neuron core has local SRAM, synaptic weights sit right next to the core. The distance between compute and memory approaches zero — structurally bypassing the GPU memory wall (the bandwidth gap between DRAM and compute units).&lt;/p&gt;

&lt;p&gt;The paper identifies the cost of this bypass:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"On-chip memory systems (SRAM and STT-MRAM variants) have become the primary consumers of area and energy, forming a new memory wall."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In distributed architectures, on-chip SRAM must scale with the number of neurons and their synapses. By bringing computation close to memory, the chip area fills up with memory. And SRAM requires constant power, leaking energy even during idle periods with no spikes.&lt;/p&gt;

&lt;p&gt;Replacing SRAM with non-volatile STT-MRAM (Spin-Transfer Torque MRAM) is being explored, but write energy is high and endurance is limited. Change the memory technology, and the structure remains: "memory area and energy are the bottleneck."&lt;/p&gt;

&lt;p&gt;On GPUs, bandwidth was the bottleneck. On neuromorphic chips, area and leakage current are the bottleneck. The wall just changed shape.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test 2: KV Cache Dominates Even on Edge NPUs
&lt;/h2&gt;

&lt;p&gt;SHIELD (arXiv:2604.07396) targets LLM inference on Edge NPUs. The paper's opening states the problem plainly: &lt;strong&gt;"LLM inference on Edge NPUs is fundamentally constrained by limited on-chip memory capacity."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Edge NPUs are designed to maximize memory efficiency for inference. But LLM inference requires a KV cache — the memory region that holds past Keys and Values for Attention computation. This grows linearly with context length, pressuring memory.&lt;/p&gt;
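&lt;p&gt;The linear growth is easy to make concrete with the standard KV-cache size formula. The default parameters below sketch a Qwen2.5-7B-class GQA model and are illustrative, not taken from the paper:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# KV cache size grows linearly with context length:
#   bytes = 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens
def kv_cache_bytes(n_tokens, n_layers=28, n_kv_heads=4, head_dim=128, bytes_per_value=2):
    """Defaults sketch a Qwen2.5-7B-class GQA model in FP16 (illustrative)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

for n in (2_000, 8_000, 32_000):
    print(n, round(kv_cache_bytes(n) / 1024**3, 2), "GiB")
# prints roughly 0.11, 0.43, and 1.71 GiB: 16x the tokens means 16x the memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;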

&lt;p&gt;SHIELD focuses on the refresh energy of eDRAM (embedded DRAM) that holds the KV cache. DRAM stores data as charge in capacitors, requiring periodic refresh (recharging).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BF16 (bfloat16) bit fields:
  Sign (1 bit) + Exponent (8 bits) = determines magnitude
  Mantissa (7 bits) = determines precision

SHIELD's approach:
  KV cache (persistent): relax mantissa refresh
  Query/Attention output (transient): skip mantissa refresh entirely
  Sign + exponent: always full refresh (critical for correctness)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By separating refresh strategies based on data "lifetime" and "bit sensitivity," SHIELD achieves 35% eDRAM refresh energy reduction. Accuracy is maintained on WikiText-2, PIQA, and ARC-Easy.&lt;/p&gt;

&lt;p&gt;SHIELD is simultaneously a "solution" and "evidence of the problem." When a dedicated NPU paper makes "memory refresh energy" its optimization target, it proves that even inference-specialized chips are memory-bottlenecked.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test 3: GQA Only Shrinks the Wall to One-Third
&lt;/h2&gt;

&lt;p&gt;TRAPTI (arXiv:2604.06955, IJCNN 2026) analyzes on-chip memory occupancy over time for embedded Transformer inference.&lt;/p&gt;

&lt;p&gt;Comparing GPT-2 XL (MHA: Multi-Head Attention) and DeepSeek-R1-Distill-Qwen-1.5B (GQA: Grouped-Query Attention) on the same accelerator configuration: GQA-based DeepSeek uses &lt;strong&gt;2.72x less peak on-chip memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;GQA compresses KV cache size by reducing the number of Key/Value heads. A 2.72x reduction is certainly significant. But flip this number around: "even with GQA — the latest compression technique — KV cache remains the largest on-chip memory consumer."&lt;/p&gt;
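&lt;p&gt;Where the reduction comes from, in one line: KV-cache size is proportional to the number of K/V heads, so GQA shrinks it by exactly the head-count ratio. The head counts below are illustrative, not the two papers' configurations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative MHA vs GQA comparison (head counts are assumptions).
mha_kv_heads = 32   # classic MHA: every attention head keeps its own K/V
gqa_kv_heads = 8    # GQA: heads share K/V in groups
print(mha_kv_heads / gqa_kv_heads)   # 4.0x smaller KV cache, all else being equal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;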

&lt;p&gt;The paper states clearly: &lt;strong&gt;"Performance and efficiency are increasingly dominated by the KV cache."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GQA, MQA (Multi-Query Attention), quantized KV cache — techniques for narrowing the bandwidth gap keep evolving. But they all "thin the wall." None of them "erase the wall." The structure where context length dominates memory through KV cache remains unchanged as long as attention mechanisms are used.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wall Morphology Map: Architecture Changes, Wall Persists
&lt;/h2&gt;

&lt;p&gt;Organizing the three papers alongside existing architectures reveals the complete picture of memory bottlenecks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Wall Form&lt;/th&gt;
&lt;th&gt;What's Bottlenecked&lt;/th&gt;
&lt;th&gt;2026 Countermeasure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;Memory bandwidth&lt;/td&gt;
&lt;td&gt;DRAM ⇔ compute data transfer&lt;/td&gt;
&lt;td&gt;HBM, GDDR7, cache hierarchy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neuromorphic&lt;/td&gt;
&lt;td&gt;Memory area/leakage&lt;/td&gt;
&lt;td&gt;SRAM dominates chip area and energy&lt;/td&gt;
&lt;td&gt;STT-MRAM replacement (issues remain)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge NPU&lt;/td&gt;
&lt;td&gt;Memory refresh&lt;/td&gt;
&lt;td&gt;eDRAM KV cache maintenance cost&lt;/td&gt;
&lt;td&gt;SHIELD: lifecycle-based refresh&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedded Transformer&lt;/td&gt;
&lt;td&gt;Memory occupancy&lt;/td&gt;
&lt;td&gt;KV cache on-chip footprint&lt;/td&gt;
&lt;td&gt;GQA, power gating&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PIM&lt;/td&gt;
&lt;td&gt;Compute precision/flexibility&lt;/td&gt;
&lt;td&gt;Analog compute SNR limits&lt;/td&gt;
&lt;td&gt;Mixed precision, digital PIM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Look at the "Wall Form" column. Bandwidth, area, refresh energy, occupancy, compute precision — all different. But every single one is "a bottleneck originating from memory."&lt;/p&gt;

&lt;p&gt;Change the architecture, and the wall changes shape. But the wall itself never disappears.&lt;/p&gt;




&lt;h2&gt;
  
  
  Only Optical Computing Has a Different Underlying Principle
&lt;/h2&gt;

&lt;p&gt;Every architecture above assumes electronic data transfer. Moving electrons requires energy, and wires have RC delay.&lt;/p&gt;

&lt;p&gt;Optical computing changes this premise. Photons are massless and suffer no resistive loss, so propagation costs almost no energy. PRISM (arXiv:2603.21576) reduced KV cache block selection from O(n) to O(1) because optical similarity computation doesn't depend on context length.&lt;/p&gt;

&lt;p&gt;Photonics research in 2026 is also advancing steadily:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-volatile photonics&lt;/strong&gt; (arXiv:2604.08637): Nanostructured Sb₂Se₃ phase-change material achieving 94% insertion loss suppression and &lt;strong&gt;100M+ write cycle endurance&lt;/strong&gt;. "Storing data with light" is approaching practicality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Photonic KAN&lt;/strong&gt; (arXiv:2604.08432): Optical neural networks built from standard telecom components (MZI, SOA, VOA). 4 modules achieve 98.4% accuracy on nonlinear classification. Optical AI without custom chips.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But light has walls too. Nonlinear operations require electro-optical conversion, and photons can't stand still — "memory" needs a material mechanism. Light can fundamentally bypass the "transfer wall" but cannot escape the "storage wall."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Wall Transforms but Never Dies
&lt;/h2&gt;

&lt;p&gt;"Memory wall" was coined by Wulf and McKee in 1995, originally referring to the widening speed gap between processors and DRAM. Thirty years later, the definition itself has expanded.&lt;/p&gt;

&lt;p&gt;The 2026 reality: constraints manifest not just as bandwidth, but as area, refresh energy, occupancy, and compute precision — different forms for different architectures. What all three papers consistently show is that &lt;strong&gt;no architecture escapes "memory-originated bottlenecks."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The wall couldn't be killed. But its anatomy is becoming visible. Understanding which form the wall takes in each architecture reveals the optimal countermeasure. SHIELD's lifecycle-based refresh, TRAPTI's temporal memory analysis, GQA's KV cache compression — not erasing the wall, but using tools shaped to fit it. That's the most realistic approach as of 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;"Memory Wall is not gone: A Critical Outlook on Memory Architecture in Digital Neuromorphic Computing" (Yousefzadeh et al., arXiv:2604.08774)&lt;/li&gt;
&lt;li&gt;"SHIELD: A Segmented Hierarchical Memory Architecture for Energy-Efficient LLM Inference on Edge NPUs" (Zhang &amp;amp; Fong, arXiv:2604.07396)&lt;/li&gt;
&lt;li&gt;"TRAPTI: Time-Resolved Analysis for SRAM Banking and Power Gating Optimization in Embedded Transformer Inference" (Klhufek et al., arXiv:2604.06955, IJCNN 2026)&lt;/li&gt;
&lt;li&gt;"PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection" (arXiv:2603.21576)&lt;/li&gt;
&lt;li&gt;"Increased endurance of nonvolatile photonics enabled by nanostructured phase-change materials" (arXiv:2604.08637)&lt;/li&gt;
&lt;li&gt;"Small-scale photonic Kolmogorov-Arnold networks using standard telecom nonlinear modules" (arXiv:2604.08432)&lt;/li&gt;
&lt;li&gt;"Hitting the Memory Wall: Implications of the Obvious" (Wulf &amp;amp; McKee, ACM SIGARCH, 1995)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>semiconductor</category>
      <category>ai</category>
      <category>hardware</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Physics Wall in 2026: 3 Papers That Show Why Node Shrinks Won't Save Us</title>
      <dc:creator>plasmon</dc:creator>
      <pubDate>Tue, 21 Apr 2026 02:52:27 +0000</pubDate>
      <link>https://dev.to/plasmon_imp/the-physics-wall-in-2026-3-papers-that-show-why-node-shrinks-wont-save-us-4p0e</link>
      <guid>https://dev.to/plasmon_imp/the-physics-wall-in-2026-3-papers-that-show-why-node-shrinks-wont-save-us-4p0e</guid>
      <description>&lt;h2&gt;
  
  
  "2nm Will Fix Everything" Is a Fantasy — Let's Drop It
&lt;/h2&gt;

&lt;p&gt;From late 2024 through 2025, semiconductor press releases have been drowning in buzzwords: "2nm," "3nm," "Gate-All-Around," "CFET." Reading them makes you feel like GPUs will be 10x faster in a few years.&lt;/p&gt;

&lt;p&gt;They won't.&lt;/p&gt;

&lt;p&gt;More precisely: "simple die shrinks no longer guarantee linear performance or power efficiency gains from transistor density improvements." This isn't my opinion; it's what multiple arXiv papers from 2025–2026 consistently demonstrate.&lt;/p&gt;

&lt;p&gt;I have a Ryzen 7 7845HS + RTX 4060 and an Apple M4 sitting on my desk, connected via a KVM switch. Running local LLM inference benchmarks on both, I've noticed something: &lt;strong&gt;the gap between spec sheet numbers and real-world performance per watt is widening with each generation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This article dissects 3 recent papers, measures where the "physics wall" stands today, and offers my predictions toward 2030. Predictions are personal analysis — not fact.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real Hardware Tells the Story: RTX 4060 vs M4 Power Efficiency
&lt;/h2&gt;

&lt;p&gt;First, look at these numbers. Measured on my setup running Qwen2.5-7B-Instruct (Q4_K_M) with llama.cpp:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;RTX 4060 (CUDA)&lt;/th&gt;
&lt;th&gt;Apple M4 (Metal)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Token generation (tg)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.8 t/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;52.4 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference GPU power (measured)&lt;/td&gt;
&lt;td&gt;~68W&lt;/td&gt;
&lt;td&gt;~18W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tokens/Watt&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.07&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.91&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory bandwidth&lt;/td&gt;
&lt;td&gt;~272 GB/s (GDDR6)&lt;/td&gt;
&lt;td&gt;~120 GB/s (LPDDR5)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Look at the tokens/Watt column. &lt;strong&gt;M4 achieves ~2.7x the power efficiency&lt;/strong&gt; of the RTX 4060 for this workload. The RTX 4060 pays for a discrete GDDR6 memory subsystem (~272GB/s, and the power to drive it), while M4's unified memory at ~120GB/s has structurally lower data-transfer overhead because CPU, GPU, and NPU share it directly.&lt;/p&gt;
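&lt;p&gt;The efficiency column is just the two measured rows divided, shown here so the derivation is explicit:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# tokens/Watt from the measured rows above.
rtx_4060 = 72.8 / 68     # ~1.07 tokens/Watt
apple_m4 = 52.4 / 18     # ~2.91 tokens/Watt
print(apple_m4 / rtx_4060)   # ~2.7x efficiency advantage for M4 on this workload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;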

&lt;p&gt;I'm not saying "NVIDIA is worse than Apple." The RTX 4060 is designed as a general-purpose rendering/training/inference machine — different comparison target. The point is: &lt;strong&gt;architecture differences have already surpassed process node differences in determining real-world efficiency.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Paper 1: DRIFT — "Break Things on Purpose" Voltage Optimization for 36% Energy Savings
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference&lt;/strong&gt; (arXiv:2604.09073, DAC 2026)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What makes this paper interesting is its contrarian approach. Normally, semiconductor design pushes toward "zero errors." DRIFT does the opposite — it exploits the fact that &lt;strong&gt;diffusion models inherently tolerate a certain level of bit errors&lt;/strong&gt;, and intentionally underscales voltage to slash energy consumption.&lt;/p&gt;

&lt;p&gt;Reported numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Average 36% energy reduction&lt;/strong&gt; through voltage underscaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.7x throughput improvement&lt;/strong&gt; via overclocking (while maintaining image generation quality)&lt;/li&gt;
&lt;li&gt;Fine-grained voltage/frequency scaling strategy that prioritizes protection for error-sensitive components&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The paper targets diffusion models for image generation, not LLMs. But the "tolerate errors" approach itself is applicable to neural network inference broadly. It's on the same continuum as quantization (INT4/INT8) — a 4-bit quantized LLM drops information from original weights yet maintains inference quality. DRIFT pushes this principle down to the hardware voltage control layer.&lt;/p&gt;

&lt;p&gt;My personal read: &lt;strong&gt;this class of error-tolerant design will become mainstream for NPU/edge AI chips around 2027–2028.&lt;/strong&gt; This aligns with smartphone AI chips already moving toward "dynamic quality vs. power consumption tradeoff control."&lt;/p&gt;




&lt;h2&gt;
  
  
  Paper 2: Trilinear Compute-in-Memory — Running Full Transformer Attention in NVM Cores
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Trilinear Compute-in-Memory Architecture for Energy-Efficient Transformer Acceleration&lt;/strong&gt; (arXiv:2604.07628, 2026)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The claim is simple and provocative:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"To the best of our knowledge, this is the first architecture to complete entire Transformer Attention computation solely within NVM cores without runtime reprogramming."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Compute-in-Memory (CiM) isn't new. Computing near memory to reduce data transfer energy has been around since the 2010s. The problem was practical: "can you actually handle full Transformer Attention?" &lt;/p&gt;

&lt;p&gt;TrilinearCIM uses a Double-Gate FeFET (DG-FeFET) architecture with back-gate modulation to achieve 3-operand multiply-accumulate operations within memory. Evaluated on BERT-base (GLUE benchmark) and ViT-base (ImageNet/CIFAR), achieving &lt;strong&gt;up to 46.6% energy reduction&lt;/strong&gt; and &lt;strong&gt;20.4% latency improvement&lt;/strong&gt; compared to conventional FeFET CiM.&lt;/p&gt;

&lt;p&gt;The evaluation targets are BERT and ViT — not large generative models — but they share the Transformer architecture structurally. The bottleneck of current LLM inference being memory-bandwidth-bound is well established. For a 7B parameter model, most of token generation time goes to weight transfer from memory, not GPU computation. Attention computation accounts for an estimated 30–40% of token generation time, so if Trilinear CiM can complete this entirely on NVM cores, it could fundamentally slash power costs.&lt;/p&gt;

&lt;p&gt;However, current CiM architectures have clear constraints. &lt;strong&gt;Latency spikes whenever weights need NVM write operations.&lt;/strong&gt; The paper's precondition of "no runtime reprogramming" limits it to inference-only, fixed-model use cases. Not viable for general-purpose training yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  Paper 3: L-SPINE — Spiking Neural Network Running at 0.54W on FPGA
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;L-SPINE: A Low-Precision SIMD Spiking Neural Compute Engine&lt;/strong&gt; (arXiv:2604.03626, 2026)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An SNN implementation paper. Deployed on AMD VC707 FPGA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;System-level: 46.37K LUT, 30.4K FF, 2.38ms latency, 0.54W power&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Claims "significant reduction" compared to CPU/GPU platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;0.54W. Compare that to RTX 4060's ~68W inference consumption — two orders of magnitude different. "Different use cases" is the correct objection, but that's precisely why it matters.&lt;/p&gt;

&lt;p&gt;SNNs compute only when a spike fires. Idle time is near-zero power. This is &lt;strong&gt;terrifyingly well-suited for sparse sensor inputs&lt;/strong&gt;: drone LiDAR, factory vibration sensors, medical wearable biosignals. Using GPUs for these tasks is absurd overkill.&lt;/p&gt;

&lt;p&gt;My prediction: SNNs won't "replace general-purpose AI chips" before the 2030s. But &lt;strong&gt;first mass-produced SNN chips for sensor fusion in robotics/drones/industrial edge sensors by 2027–2028&lt;/strong&gt; is entirely plausible. The increasing volume of FPGA implementation papers like L-SPINE signals that the prototyping phase is actively underway.&lt;/p&gt;




&lt;h2&gt;
  
  
  Rapidus 1.4nm Domestic Fabrication — Don't Misread the Numbers
&lt;/h2&gt;

&lt;p&gt;In April 2026, Fujitsu announced it will commission Rapidus for 1.4nm AI semiconductor manufacturing. Combined private projects reportedly total ¥200B (~$1.3B) in scale.&lt;/p&gt;

&lt;p&gt;Honestly, it's too early to get excited about those numbers.&lt;/p&gt;

&lt;p&gt;TSMC's N2 (≈2nm) is still at the stage where even Apple and NVIDIA struggle with yield and unit cost. Rapidus achieving a production-stable 1.4nm line won't happen before 2028–2029 at the earliest, however smoothly things go.&lt;/p&gt;

&lt;p&gt;But "it won't reach production scale so it's meaningless" is also a shallow read. Rapidus is aiming for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Supply chain risk diversification within Japan&lt;/strong&gt; (geopolitical value)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domestic accumulation of cutting-edge process design/manufacturing know-how&lt;/strong&gt; (long-term technical foundation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Physical AI-specialized chips through IBM partnership&lt;/strong&gt; (competing in niche applications)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Rather than challenging TSMC+NVIDIA head-on in the general-purpose GPU market, pursuing low-volume, high-value specialty chips is a realistic survival strategy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Semiconductor Manufacturing Ecosystem — Current State]

General / High Volume: TSMC (N3/N2) → Apple, NVIDIA, AMD
General / Mid Volume:  Samsung, Intel Foundry → Various
Specialized / Low Volume: Rapidus (1.4nm) → Fujitsu, IBM Physical AI, ...
Edge / FPGA-based:     AMD, Intel → SNN &amp;amp; ultra-low-power applications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2026–2030: My Predictions (Bold Personal Analysis)
&lt;/h2&gt;

&lt;p&gt;Synthesizing the above papers, news, and measured data:&lt;/p&gt;

&lt;h3&gt;
  
  
  Prediction 1: Consumer 2nm Won't Arrive Until 2029+
&lt;/h3&gt;

&lt;p&gt;TSMC N2 yields and pricing will be consumed by Apple and NVIDIA first. 2nm in Ryzen-class CPUs won't appear before 2028–2029 at earliest. The 3nm optimization cycle continues for now.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prediction 2: Compute-in-Memory Becomes Mainstream for Inference Accelerators (~2028)
&lt;/h3&gt;

&lt;p&gt;The "dissolve the boundary between compute and memory" direction shown by Trilinear CiM is fundamentally different from GPU design philosophy. Combined with DRIFT's error-tolerant design, power can be cut further. I predict CiM architectures reaching mass production in inference-dedicated edge AI chips around 2028.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prediction 3: SNNs Reach Production First in Sensor Fusion (2027–2028)
&lt;/h3&gt;

&lt;p&gt;Not competing with general-purpose LLMs — coexisting through specialization. FPGA prototypes like L-SPINE appearing now suggest ASIC migration is 3–4 years out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prediction 4: TFLOPS Race Is Over. TOPS/W Race Has Begun
&lt;/h3&gt;

&lt;p&gt;When the RTX 5000 series launches, the first metric I'll check is TFLOPS/W, not TFLOPS. Ongoing real-world measurements against the M4 only strengthen this conviction. NVIDIA recognizes this too: BlueField-4 pushing AI-native storage infrastructure operates on the same principle of "putting data near computation."&lt;/p&gt;

&lt;h3&gt;
  
  
  Prediction 5: MATCHA Points to the Heterogeneous SoC Era (2027+ Mass Production)
&lt;/h3&gt;

&lt;p&gt;The MATCHA paper (arXiv:2604.09124) proposes a framework for efficiently deploying DNNs on SoCs with multiple heterogeneous acceleration engines. Smartphones already have CPU+GPU+NPU+DSP coexisting as heterogeneous SoCs. This is descending to PC-level APU/SoC design. Rather than a single powerful GPU, "orchestrating purpose-specific accelerators" will become the main design battlefield.&lt;/p&gt;




&lt;h2&gt;
  
  
  Measuring the "Wall" on Your Own Hardware
&lt;/h2&gt;

&lt;p&gt;Here's a simple tool for measuring power efficiency on your own setup. It reads real-time RTX power through NVML (the &lt;code&gt;pynvml&lt;/code&gt; bindings) on Windows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
GPU Power Efficiency Measurement Script
Dependencies: pynvml, psutil
pip install pynvml psutil
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pynvml&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PowerSample&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;gpu_power_w&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;cpu_power_w&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;  &lt;span class="c1"&gt;# estimated via psutil
&lt;/span&gt;    &lt;span class="n"&gt;gpu_util_pct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;mem_used_mb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PowerProfiler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_interval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;pynvml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nvmlInit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pynvml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nvmlDeviceGetHandleByIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample_interval&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PowerSample&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_sample_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_running&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;gpu_power&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pynvml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nvmlDeviceGetPowerUsage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1000.0&lt;/span&gt;
            &lt;span class="n"&gt;util&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pynvml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nvmlDeviceGetUtilizationRates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pynvml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nvmlDeviceGetMemoryInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;cpu_pct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cpu_percent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;cpu_tdp_w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;54.0&lt;/span&gt;  &lt;span class="c1"&gt;# Ryzen 7 7845HS
&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;PowerSample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="n"&gt;gpu_power_w&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gpu_power&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;cpu_power_w&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cpu_tdp_w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cpu_pct&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;gpu_util_pct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;util&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;mem_used_mb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;used&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_sample_loop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stop_and_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_thread&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

        &lt;span class="n"&gt;avg_gpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gpu_power_w&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;peak_gpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gpu_power_w&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;avg_cpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpu_power_w&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration_s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_gpu_w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_gpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;peak_gpu_w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;peak_gpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_cpu_w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_cpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_energy_wh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;avg_gpu&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;avg_cpu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__del__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;pynvml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nvmlShutdown&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;pass&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;profiler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PowerProfiler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Measuring... (run your inference task here)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Replace with subprocess call to llama-cli
&lt;/span&gt;
    &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop_and_report&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- Power Report ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Duration      : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;duration_s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avg GPU Power : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_gpu_w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;W&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Peak GPU Power: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;peak_gpu_w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;W&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Est CPU Power : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_cpu_w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;W&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total Energy  : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_energy_wh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; Wh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reference values (Qwen2.5-7B Q4_K_M, 30-second inference):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Avg GPU Power&lt;/td&gt;
&lt;td&gt;68.3 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak GPU Power&lt;/td&gt;
&lt;td&gt;89.7 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Est CPU Power&lt;/td&gt;
&lt;td&gt;11.4 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Energy per 30s&lt;/td&gt;
&lt;td&gt;0.664 Wh&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg tokens/s&lt;/td&gt;
&lt;td&gt;72.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;tokens/Wh&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3,289&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This tokens/Wh metric is what I'm tracking across generations and architectures to measure "real performance improvement." It's also how I'll decide whether to buy the next-gen chip — not by TFLOPS.&lt;/p&gt;
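
&lt;p&gt;The arithmetic itself is trivial: total tokens divided by total energy. A minimal sketch using the measurements from the table (the &lt;code&gt;tokens_per_wh&lt;/code&gt; helper name is mine; token counts come from llama.cpp's own timing output, not the profiler):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: tokens/Wh from the measurements above.
# Energy comes from the profiler report; the token count from llama.cpp's timing output.

def tokens_per_wh(total_tokens, total_energy_wh):
    """Tokens generated per watt-hour of wall power (GPU + estimated CPU)."""
    return total_tokens / total_energy_wh

total_tokens = 72.8 * 30                     # 72.8 t/s for 30 s, about 2184 tokens
print(tokens_per_wh(total_tokens, 0.664))    # about 3289 tokens/Wh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;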




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Stop chasing process node numbers. Whether it's 2nm or 1.4nm, &lt;strong&gt;architecture must change for the power wall to break.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DRIFT's "intentional error tolerance" for diffusion models, Trilinear CiM's "complete computation within memory" for BERT/ViT, and L-SPINE's "ultra-low-power engine for sparse signals" — all three papers say the same thing in different voices: &lt;strong&gt;bypass the von Neumann bottleneck.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What you can do today:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Measure tokens/Watt on your own hardware&lt;/strong&gt; — the script above works as-is&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When choosing your next chip, check TFLOPS/W&lt;/strong&gt; — the era of prioritizing efficiency over absolute performance is here&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When following Rapidus news, evaluate "production timeline" and "application specificity" as a pair&lt;/strong&gt; — reading "domestic fab" as "a head-on challenger to general-purpose GPUs" is the wrong takeaway&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The physics wall exists. But the teams that survive in 2030 won't be the ones that "broke through" it — they'll be the ones that &lt;strong&gt;routed around&lt;/strong&gt; it. That's where we are.&lt;/p&gt;

</description>
      <category>semiconductor</category>
      <category>hardware</category>
      <category>ai</category>
      <category>gpu</category>
    </item>
    <item>
      <title>5 llama.cpp Settings That Turn 8GB VRAM From Sluggish to 5x Faster — Every Option Benchmarked</title>
      <dc:creator>plasmon</dc:creator>
      <pubDate>Sat, 18 Apr 2026 12:41:23 +0000</pubDate>
      <link>https://dev.to/plasmon_imp/20260325llamacppoptions8gben-3jjg</link>
      <guid>https://dev.to/plasmon_imp/20260325llamacppoptions8gben-3jjg</guid>
      <description>&lt;h1&gt;
  
  
  5 llama.cpp Settings That Turn 8GB VRAM From Sluggish to 5x Faster — Every Option Benchmarked
&lt;/h1&gt;

&lt;p&gt;llama.cpp has over 50 launch options. Most of them are fine at their defaults. But on 8GB VRAM, misconfiguring just 5 of them will cut your inference speed in half.&lt;/p&gt;

&lt;p&gt;What follows is a settings guide based on actual measurements on an RTX 4060 8GB (GDDR6 272 GB/s).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Most Important: &lt;code&gt;-ngl&lt;/code&gt; (GPU Layer Count)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# -ngl: How many model layers to offload to GPU
&lt;/span&gt;&lt;span class="n"&gt;ngl_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meaning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Number of Transformer layers loaded into GPU VRAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# All layers on CPU = slowest possible
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total layers in the model (Qwen2.5-7B = 28, Llama-3-8B = 32, Qwen2.5-32B = 64)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;999&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All layers on GPU (fastest, if it fits in VRAM)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Optimal values for 8GB VRAM
&lt;/span&gt;&lt;span class="n"&gt;ngl_optimal_8gb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen2.5-7B Q4_K_M (4.7GB)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-ngl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;999&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Full GPU offload possible
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VRAM usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~5.4 GB (weights 4.7 + KV 0.44 + overhead 0.3 at 8K context)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~32 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mistral-Nemo-12B Q4_K_M (7.2GB)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-ngl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;999&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Barely fits entirely on GPU
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VRAM usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~7.5 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~20 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KV cache may cause OOM. Use -c 2048&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen2.5-32B Q4_K_M (18.5GB)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-ngl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 25 of 64 layers on GPU
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VRAM usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~7.4 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~10.8 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remaining 39 layers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CPU (via DDR5)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Changing &lt;code&gt;-ngl&lt;/code&gt; by just 1 shifts speed by a few percent. The optimal value is the one that squeezes VRAM usage right to the limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Finding the optimal -ngl (binary search)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_optimal_ngl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_layers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vram_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;8.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    1. Launch with -ngl 999 -&amp;gt; if OOM, move on
    2. Launch with -ngl {total_layers // 2}
    3. No OOM -&amp;gt; increase; OOM -&amp;gt; decrease
    4. The sweet spot is where VRAM sits at 7.0-7.5 GB
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# On RTX 4060 8GB, ~0.5 GB goes to CUDA context + framework overhead
&lt;/span&gt;    &lt;span class="c1"&gt;# The remaining 7.5 GB is available for model layers
&lt;/span&gt;    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="c1"&gt;# Tips for tuning:
# Monitor VRAM with nvidia-smi while adjusting -ngl
# 7.0-7.5 GB usage is the sweet spot. Above 7.8 GB risks OOM during inference
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
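
&lt;p&gt;As a complement to the binary search sketched in &lt;code&gt;find_optimal_ngl&lt;/code&gt;, you can get a first guess from the file size alone: divide the usable VRAM budget by the per-layer weight size. A minimal sketch (the helper name is mine; the 0.5 GB overhead and KV allowance are this article's working assumptions, so confirm the result with nvidia-smi):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def estimate_ngl(model_gb, total_layers, kv_gpu_gb=0.4, vram_gb=8.0, overhead_gb=0.5):
    """Rough first guess for -ngl: how many layers fit in the VRAM budget.

    model_gb    : GGUF file size (weights)
    kv_gpu_gb   : expected GPU-side KV cache (depends on -c and cache type)
    overhead_gb : CUDA context + framework overhead (about 0.5 GB on RTX 4060)
    Confirm with nvidia-smi; aim for 7.0-7.5 GB actual usage.
    """
    per_layer_gb = model_gb / total_layers
    budget_gb = vram_gb - overhead_gb - kv_gpu_gb
    return min(total_layers, max(0, math.floor(budget_gb / per_layer_gb)))

print(estimate_ngl(4.7, 28))     # Qwen2.5-7B Q4_K_M: 28, all layers fit
print(estimate_ngl(18.5, 64))    # Qwen2.5-32B Q4_K_M: 24, close to the -ngl 25 used above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;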






&lt;h2&gt;
  
  
  &lt;code&gt;-c&lt;/code&gt; (Context Length)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# -c: Maximum context length (in tokens)
&lt;/span&gt;&lt;span class="n"&gt;context_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meaning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Upper limit of tokens the model can reference during inference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# llama.cpp default (as of b8233)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Directly determines KV cache VRAM consumption&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# KV cache VRAM consumption calculation
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;kv_cache_vram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_layers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype_bytes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    KV cache = 2 × n_layers × n_heads × head_dim × context_len × dtype_bytes
    (K cache + V cache = 2x)
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;bytes_total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_layers&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_heads&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;head_dim&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;context_len&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dtype_bytes&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bytes_total&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# GB
&lt;/span&gt;
&lt;span class="c1"&gt;# Qwen2.5-7B (28 layers, 4 KV heads (GQA), 128 head_dim)
&lt;/span&gt;&lt;span class="n"&gt;kv_7b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4096 tokens (FP16)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;kv_cache_vram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# 0.22 GB
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8192 tokens (FP16)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;kv_cache_vram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# 0.44 GB
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32768 tokens (FP16)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;kv_cache_vram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 1.75 GB
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;131072 tokens (FP16)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;kv_cache_vram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;131072&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# 7.00 GB
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Qwen2.5-32B (64 layers, 8 KV heads, 128 head_dim)
&lt;/span&gt;&lt;span class="n"&gt;kv_32b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4096 tokens (FP16)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;kv_cache_vram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# 1.00 GB
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8192 tokens (FP16)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;kv_cache_vram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# 2.00 GB
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32768 tokens (FP16)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;kv_cache_vram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 8.00 GB
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Note: with partial offload (-ngl), KV cache is also split across CPU/GPU per layer
# -ngl 25 means GPU holds KV for 25/64 layers only
&lt;/span&gt;
&lt;span class="c1"&gt;# Recommendations for 8GB VRAM:
# 7B model: -c 8192 (KV 0.44GB, safe), -c 32768 (KV 1.75GB, use flash-attn)
# 32B model (partial offload -ngl 25): -c 4096 (GPU KV ~0.39GB), anything higher requires KV quantization
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Doubling the context length doubles the KV cache VRAM. On 8GB, your &lt;code&gt;-c&lt;/code&gt; setting directly determines what model size you can load.&lt;/p&gt;
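
&lt;p&gt;You can also invert the formula: given the VRAM left over after weights and overhead, solve for the largest context that still fits. A minimal sketch assuming full GPU offload (the helper name and the 2.8 GB budget are mine, derived from 8.0 minus 0.5 overhead minus 4.7 weights):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def max_context_for_budget(free_kv_gb, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Largest -c whose KV cache fits in free_kv_gb (full GPU offload assumed)."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return int(free_kv_gb * (1024 ** 3) / bytes_per_token)

# Qwen2.5-7B Q4_K_M on 8 GB: 8.0 - 0.5 overhead - 4.7 weights = 2.8 GB free for KV
print(max_context_for_budget(2.8, 28, 4, 128))      # about 52K tokens at f16
print(max_context_for_budget(2.8, 28, 4, 128, 1))   # about 105K tokens at q8_0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;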




&lt;h2&gt;
  
  
  &lt;code&gt;--cache-type-k&lt;/code&gt; / &lt;code&gt;--cache-type-v&lt;/code&gt; (KV Cache Quantization)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# KV cache quantization options
&lt;/span&gt;&lt;span class="n"&gt;kv_quant_options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;f16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Default. FP16 (2 bytes/element)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q8_0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8-bit quantization (1 byte/element) -&amp;gt; VRAM halved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q4_0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4-bit quantization (0.5 bytes/element) -&amp;gt; VRAM quartered&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Recommended combinations
&lt;/span&gt;&lt;span class="n"&gt;kv_quant_recommendations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quality first&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;K&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;f16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;V&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;f16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VRAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1x (baseline)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality loss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Balanced (recommended)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;K&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q8_0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;V&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q8_0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VRAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.5x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality loss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Negligible for general tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Capacity first&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;K&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q4_0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;V&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q8_0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VRAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.375x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality loss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Degradation on math/reasoning tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;note&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;V cache is more sensitive to quantization than K cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Maximum compression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;K&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q4_0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;V&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q4_0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VRAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.25x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality loss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Significant. Especially bad on long contexts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Example: Qwen2.5-32B + -ngl 25 + 8K context on 8GB VRAM
# -ngl 25 -&amp;gt; 25/64 layers on GPU, KV also splits 25/64 on GPU
&lt;/span&gt;&lt;span class="n"&gt;example_32b_8k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total KV (f16, 8K)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.00 GB (all 64 layers)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPU KV (f16, -ngl 25)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.00 * 25/64 = 0.78 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPU total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weights 7.4GB + KV 0.78GB + overhead 0.3GB = 8.48GB -&amp;gt; OOM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;With KV q8_0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.78 * 0.5 = 0.39 GB -&amp;gt; 7.4 + 0.39 + 0.3 = 8.09GB -&amp;gt; works&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conclusion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32B at 8K context fits on 8GB with KV quantization (q8_0)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32K context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPU KV (f16) = 8.0 * 25/64 = 3.13 GB -&amp;gt; impossible. q4_0 = 0.78GB -&amp;gt; 8.48GB -&amp;gt; borderline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
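
&lt;p&gt;To combine the two knobs (the &lt;code&gt;-ngl&lt;/code&gt; split and the cache type), the per-layer arithmetic from the example above generalizes into one small helper. A sketch (the helper name is mine; bytes-per-element values are approximate and ignore quantization block overhead):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def gpu_kv_gb(context_len, n_layers, n_kv_heads, head_dim, ngl, cache_bytes=2):
    """GPU-side KV cache in GB when only ngl of n_layers are offloaded.

    cache_bytes: 2 for f16, 1 for q8_0, 0.5 for q4_0 (approximate, per element)
    """
    gpu_layers = min(ngl, n_layers)
    bytes_total = 2 * gpu_layers * n_kv_heads * head_dim * context_len * cache_bytes
    return bytes_total / (1024 ** 3)

# Qwen2.5-32B, -ngl 25, 8K context (matches the example above)
print(gpu_kv_gb(8192, 64, 8, 128, 25))       # f16:  about 0.78 GB, OOM next to 7.4 GB of weights
print(gpu_kv_gb(8192, 64, 8, 128, 25, 1))    # q8_0: about 0.39 GB, fits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;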



&lt;p&gt;Launch command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; model.gguf &lt;span class="nt"&gt;-ngl&lt;/span&gt; 25 &lt;span class="nt"&gt;-c&lt;/span&gt; 8192 &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; q8_0 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; q8_0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;code&gt;--flash-attn&lt;/code&gt; (Flash Attention)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Flash Attention
&lt;/span&gt;&lt;span class="n"&gt;flash_attn_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meaning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory-efficient attention computation algorithm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effects&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VRAM reduction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Eliminates intermediate attention buffers -&amp;gt; saves hundreds of MB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed boost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Faster on long contexts (~10% at 32K, scales with context length)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;short_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Minimal effect below 4K tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requirements&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CUDA backend + compatible GPU (RTX 20xx or newer)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compatibility&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Works alongside KV cache quantization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Benchmarks on 8GB
&lt;/span&gt;&lt;span class="n"&gt;flash_attn_benchmark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen2.5-7B Q4_K_M, -c 8192&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;without_flash_attn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;31.8 t/s, VRAM 5.5 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;with_flash_attn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32.1 t/s, VRAM 5.2 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed +1%, VRAM -0.3 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen2.5-7B Q4_K_M, -c 32768&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;without_flash_attn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;28.5 t/s, VRAM 6.3 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;with_flash_attn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;31.5 t/s, VRAM 5.8 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed +10.5%, VRAM -0.5 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Verdict: Always enable it. There is no downside.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--flash-attn&lt;/code&gt; has zero downsides. Always include it.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;-b&lt;/code&gt; (Batch Size) and &lt;code&gt;-t&lt;/code&gt; (Thread Count)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Batch size
&lt;/span&gt;&lt;span class="n"&gt;batch_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-b (batch size)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meaning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Number of tokens processed at once during prompt evaluation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8GB recommendation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Large batches cause VRAM spikes during prompt eval, risking OOM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-ub (micro batch)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meaning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Further subdivides batches for processing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usually&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No need to change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Thread count
&lt;/span&gt;&lt;span class="n"&gt;thread_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-t (threads)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meaning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Number of threads for CPU computation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All cores&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommendation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Physical core count (no HT)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example_i7_13700H&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-t 6 (6 P-cores)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HT logical threads just compete for memory bandwidth. Physical core count is optimal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Benchmark: thread count impact (Qwen2.5-32B Q4_K_M, -ngl 25)
&lt;/span&gt;&lt;span class="n"&gt;thread_benchmark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-t 6 (P-core count)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.8 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-t 8 (P+E cores)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.5 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-t 14 (all physical cores P+E)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9.8 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-t 20 (all threads incl. HT)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9.2 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conclusion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;More threads = slower. Physical P-core count is optimal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The intuition that more threads means faster inference is wrong here. Hyper-threaded logical threads share L1/L2 cache and memory bandwidth with their physical cores, so in memory-bound LLM inference the extra threads add contention instead of throughput.&lt;/p&gt;
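
&lt;p&gt;The cheapest way to find your own optimum is to sweep &lt;code&gt;-t&lt;/code&gt; with &lt;code&gt;llama-bench&lt;/code&gt; and read the t/s it prints. A minimal sketch (assumes the llama-bench binary from your llama.cpp build is on PATH; the model path and thread values are the ones from the benchmark above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

MODEL = "qwen2.5-32b-instruct-q4_k_m.gguf"   # adjust to your model path

# Run llama-bench once per thread count and let it print prompt/generation t/s.
for t in (6, 8, 14, 20):
    print(f"### -t {t}")
    subprocess.run(
        ["llama-bench", "-m", MODEL, "-ngl", "25", "-t", str(t), "-n", "128"],
        check=True,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;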




&lt;h2&gt;
  
  
  Server Options (&lt;code&gt;llama-server&lt;/code&gt;)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# llama-server: set up an OpenAI-compatible API
&lt;/span&gt;&lt;span class="n"&gt;server_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;basic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-server -m model.gguf -ngl 999 -c 4096 --host 0.0.0.0 --port 8080&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommended extras&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--flash-attn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory efficiency (always ON)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Expose Prometheus-format metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--parallel 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Concurrent request count (keep at 1 for 8GB)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--cont-batching&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Continuous batching (useful when --parallel &amp;gt;= 2)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Function calling setup
&lt;/span&gt;&lt;span class="n"&gt;function_calling_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--chat-template&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Auto-detected (uses template embedded in GGUF)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;note&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The tools parameter for function calling depends on the model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s chat template&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommended models&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3.5-4B-Instruct (3.4GB, function calling 97.5%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen2.5-7B-Instruct (4.7GB, function calling 95%+)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Enforce structured output with GBNF grammar
&lt;/span&gt;&lt;span class="n"&gt;grammar_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--grammar-file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Force output format via GBNF grammar file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use case&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Guarantees valid JSON output. Syntax errors drop to 0%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caveat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inference can slow down when the model tries to generate output that doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t match the grammar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alternative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--json-schema to specify JSON Schema directly (llama.cpp b7000+)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
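
&lt;p&gt;Once llama-server is up, any OpenAI-style client can talk to it. A minimal stdlib-only request against the chat completions endpoint (host and port match the basic command above; llama-server serves whichever model it was launched with):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import urllib.request

# POST to llama-server's OpenAI-compatible chat endpoint (port 8080 as launched above).
payload = {
    "messages": [{"role": "user", "content": "Reply with one word: ready?"}],
    "max_tokens": 16,
    "temperature": 0.2,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=120) as resp:
    body = json.load(resp)

print(body["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;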






&lt;h2&gt;
  
  
  Configuration Templates
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Template 1: 7B Model, Chat Use (Maximum Speed)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; qwen2.5-7b-instruct-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; 6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;span class="c"&gt;# Expected speed: ~32 t/s, VRAM: ~5.2 GB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Template 2: 32B Model, Quality Focus (Partial Offload)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; qwen2.5-32b-instruct-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 25 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; q8_0 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; q8_0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; 6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-b&lt;/span&gt; 512 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;span class="c"&gt;# Expected speed: ~10.8 t/s, VRAM: ~7.4 GB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Template 3: 7B Model, Long Context (32K)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; qwen2.5-7b-instruct-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; q8_0 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; q8_0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; 6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-b&lt;/span&gt; 512 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;span class="c"&gt;# Expected speed: ~31 t/s, VRAM: ~5.7 GB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Template 4: 4B Model, Function Calling (Maximum Reliability)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5-4b-instruct-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; 6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;span class="c"&gt;# Expected speed: ~50 t/s, VRAM: ~3.8 GB&lt;/span&gt;
&lt;span class="c"&gt;# Function calling accuracy: 97.5%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
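
&lt;p&gt;All four templates expose the same HTTP interface on 127.0.0.1:8080. As a quick smoke test that a freshly started server actually responds, here is a minimal client sketch; it assumes the OpenAI-compatible &lt;code&gt;/v1/chat/completions&lt;/code&gt; endpoint that llama-server serves and uses the third-party &lt;code&gt;requests&lt;/code&gt; package.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal smoke test against a llama-server started with one of the templates above.
# Assumes the OpenAI-compatible /v1/chat/completions endpoint and the third-party requests package.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Reply with a single word: ready"}
        ],
        "temperature": 0.2,
        "max_tokens": 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;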






&lt;h2&gt;
  
  
  Common Mistakes and Fixes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;common_mistakes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-ngl 0 (not using GPU)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symptom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inference speed stuck at 3-5 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All layers running on CPU. DDR5 ~50 GB/s is the bottleneck&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Try -ngl 999. If OOM, decrease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c set too high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symptom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OOM immediately after inference starts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KV cache eating all VRAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Lower to -c 4096, or add --cache-type-k q8_0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-t set too high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symptom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CPU at 100% but inference is slow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HT logical threads fighting over cache and memory bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Set -t to physical core count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Using --mlock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symptom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory error on startup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Locks entire model in RAM -&amp;gt; physical memory exhausted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remove --mlock (especially unnecessary on Windows)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Batch size too large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symptom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OOM when feeding long prompts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VRAM spike during prompt evaluation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Lower to -b 512&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
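
&lt;p&gt;The "-t set too high" fix assumes you know your physical core count, which is not what &lt;code&gt;os.cpu_count()&lt;/code&gt; reports. A minimal sketch for deriving it, using the third-party &lt;code&gt;psutil&lt;/code&gt; package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Find a sensible -t value: physical cores, not logical (hyper-threaded) ones.
# os.cpu_count() reports logical cores; psutil (third-party) can separate the two.
import os
import psutil

logical = os.cpu_count()                    # e.g. 12 on a 6-core / 12-thread CPU
physical = psutil.cpu_count(logical=False)  # e.g. 6, which is what -t should be
print(f"logical={logical}, physical={physical}; use -t {physical}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;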






&lt;h2&gt;
  
  
  Summary: Speed Impact by Setting
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Setting Change                        Speed Impact    VRAM Impact
──────────────────────────────────────────────────────────────────
-ngl 0 -&amp;gt; 999 (full GPU)             +5-10x          +4-7 GB
-ngl fine-tuning (±5)                +10-20%         ±0.5 GB
--flash-attn enabled                  +1-10%          -0.3 GB
--cache-type q8_0                     ±0%             -50%
-t all threads -&amp;gt; physical cores      +5-15%          ±0
-c 32K -&amp;gt; 4K                         +5%             -0.7 GB
-b 2048 -&amp;gt; 512                       ±0%*            -0.2 GB**

* No effect on generation speed (only prompt eval time)
** Suppresses temporary VRAM spikes during prompt eval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The biggest lever is &lt;code&gt;-ngl&lt;/code&gt;. Next is &lt;code&gt;-t&lt;/code&gt;. Everything else is fine-tuning. On 8GB VRAM, the core strategy is: maximize &lt;code&gt;-ngl&lt;/code&gt;, then use &lt;code&gt;-c&lt;/code&gt; and KV cache quantization to claw back enough VRAM to make it fit.&lt;/p&gt;
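
&lt;p&gt;That strategy can be checked with arithmetic before touching any flags. A back-of-the-envelope sketch; the KV cache figures below are rough assumptions for a 7B-class model, not measurements from the templates above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The 8GB strategy as arithmetic: weights are fixed, so headroom is bought from
# context length and KV cache precision. KV figures are rough assumptions for a 7B-class model.
VRAM_GB = 8.0
WEIGHTS_GB = 4.7       # 7B Q4_K_M
OVERHEAD_GB = 0.3      # rough allowance for compute buffers

configs = {
    "-c 8192, fp16 cache": 0.5,
    "-c 32768, fp16 cache": 1.9,
    "-c 32768, q8_0 cache": 1.0,
}

for name, kv_gb in configs.items():
    used = WEIGHTS_GB + kv_gb + OVERHEAD_GB
    print(f"{name:22s}  total ~{used:.1f} GB, headroom ~{VRAM_GB - used:.1f} GB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;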




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;llama.cpp — &lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;github.com/ggerganov/llama.cpp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;llama.cpp Server documentation — &lt;a href="https://github.com/ggerganov/llama.cpp/tree/master/examples/server" rel="noopener noreferrer"&gt;github.com/ggerganov/llama.cpp/tree/master/examples/server&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GGUF format specification — &lt;a href="https://github.com/ggerganov/ggml/blob/master/docs/gguf.md" rel="noopener noreferrer"&gt;github.com/ggerganov/ggml/blob/master/docs/gguf.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Flash Attention — "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (2023) &lt;a href="https://arxiv.org/abs/2307.08691" rel="noopener noreferrer"&gt;arXiv:2307.08691&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>20260325_vram_expansion_physics_en</title>
      <dc:creator>plasmon</dc:creator>
      <pubDate>Fri, 17 Apr 2026 12:41:22 +0000</pubDate>
      <link>https://dev.to/plasmon_imp/20260325vramexpansionphysicsen-5e9i</link>
      <guid>https://dev.to/plasmon_imp/20260325vramexpansionphysicsen-5e9i</guid>
      <description>&lt;h1&gt;
  
  
  Adding More VRAM Won't Fix It — The Physics That HBM, CXL, and Unified Memory Can't Escape
&lt;/h1&gt;

&lt;p&gt;The RTX 4060's 8GB VRAM caps out at 7B models. Even when the RTX 5060 doubles that to 16GB, a full 70B won't fit. "If VRAM's not enough, just add more" — the idea is sound, but the execution hits three distinct physical tradeoffs.&lt;/p&gt;

&lt;p&gt;HBM, CXL, Unified Memory. These three technologies attack the VRAM wall from different angles. Where each one sits on the triangle of bandwidth, capacity, and cost fundamentally changes how LLM inference performs.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Memory Trilemma: Bandwidth, Capacity, Cost
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Physical tradeoffs across memory technologies
&lt;/span&gt;&lt;span class="n"&gt;memory_trilemma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HBM3E&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4.8 TB/s (H200, 6 stacks)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capacity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;141 GB (H200)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost_per_GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~$10-15/GB (HBM3E, 2025 market price)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interface&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TSV (Through-Silicon Via), 1024-bit wide per stack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;physics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Vertically stacked via through-silicon vias. High bandwidth, but eats die area&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GDDR6 (RTX 4060)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;272 GB/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capacity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost_per_GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~$2.5-4/GB (GDDR6 spot price, 2025)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interface&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128-bit bus, 2125 MHz (17 Gbps effective)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;physics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Solder-bonded on PCB. Cheap, but bandwidth-limited&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CXL 3.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64 GB/s per link (x16 PCIe 6.0, unidirectional)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capacity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Theoretically TB-class (memory pooling)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost_per_GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~$3-5/GB (DDR5-based)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interface&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PCIe 6.0 physical layer (64 GT/s) + CXL protocol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;physics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reuses existing PCIe infrastructure. 1/75 the bandwidth of HBM3E&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unified Memory (M4 Max)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;546 GB/s (LPDDR5X)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capacity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost_per_GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Depends on Apple pricing (LPDDR5X itself ~$3-5/GB, but SoC integration makes direct comparison impossible)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interface&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LPDDR5X, 512-bit bus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;physics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CPU/GPU/NPU share one memory pool. Shared bandwidth = contention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These three technologies occupy different vertices of the bandwidth-capacity-cost triangle. HBM chose bandwidth, CXL chose capacity, Unified Memory chose balance. None of them can claim all three.&lt;/p&gt;




&lt;h2&gt;
  
  
  HBM: King of Bandwidth, Slave to Capacity
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Physical constraints of HBM
&lt;/span&gt;&lt;span class="n"&gt;hbm_constraints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TSV_per_stack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~5,000+ through-silicon vias&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bus_width&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1024 bit per stack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stacks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;H200: 6 stacks → 6144 bit total bus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4.8 TB/s — 18x GDDR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capacity_wall&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;die_per_stack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Current: 8-Hi (8 layers stacked), next-gen: 12-Hi/16-Hi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;die_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;24 Gbit per die (3GB) for HBM3E&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Hi_capacity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8 × 3GB = 24 GB per stack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12Hi_capacity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12 × 3GB = 36 GB per stack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_H200&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6 stacks × 24GB = 144 GB raw (NVIDIA-rated 141 GB, some reserved)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;One HBM3E stack: estimated $240-360 (24GB x $10-15/GB, 2025 market price)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;area_problem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interposer_area&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Each HBM stack occupies ~100 mm2 of interposer area&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPU_die + 6_stacks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPU ~800 mm2 + HBM ~600 mm2 = ~1400 mm2 interposer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CoWoS_reticle_limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~1700 mm2 (TSMC lithography limit)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;implication&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fitting 8+ HBM stacks exceeds the reticle limit → chiplet design required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# HBM has the best bandwidth, but capacity is physically capped by layer count × stack count × interposer area
# "Just add more HBM" → the interposer doesn't have room
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason HBM can't "just be scaled up" is area. The GPU die and HBM stacks must sit side by side on an interposer, and the CoWoS reticle limit (~1700 mm²) is the ceiling. The H200 is already close to that limit.&lt;/p&gt;
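
&lt;p&gt;The same point as rough arithmetic (figures taken from the block above; routing and keep-out margins are ignored, which is exactly what makes the practical limit lower):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Interposer area budget, using the rough figures above.
# Routing and keep-out margins are ignored, so this is an optimistic upper bound.
RETICLE_LIMIT_MM2 = 1700   # CoWoS limit cited above
GPU_DIE_MM2 = 800
STACK_MM2 = 100            # per HBM stack

hbm_budget = RETICLE_LIMIT_MM2 - GPU_DIE_MM2
print(f"Area left for HBM: {hbm_budget} mm2, {hbm_budget // STACK_MM2} stacks as a pure-area bound")
# 900 mm2: ~9 stacks before any routing margin. With realistic margins, 6 stacks is already tight.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;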

&lt;p&gt;Impact on LLM inference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# How HBM's capacity ceiling affects LLM inference
&lt;/span&gt;&lt;span class="n"&gt;hbm_llm_impact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;H200 (141GB HBM3E)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_model_fp16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~70B parameters (140GB)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_model_q4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~280B parameters (70GB) + KV cache headroom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;70B_kv_cache_room&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;141 - 140 = 1 GB → even 32K context is tight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;solution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quantization or Tensor Parallelism (multi-GPU)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RTX 4060 (8GB GDDR6)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_model_q4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~13B parameters (7.2GB usable)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;272 GB/s → 7B Q4_K_M at ~32 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;problem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;13B+ requires CPU offload → 1/10 speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RTX 5060 (expected 16GB GDDR7)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;448 GB/s (RTX 5060 Ti confirmed; RTX 5060 non-Ti TBD)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_model_q4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~30B parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;implication&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2x capacity ≠ 2x model size (KV cache eats the difference)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Doubling VRAM doesn't double the model size you can run. KV cache is the reason. A 70B-class FP16 model's KV cache at 32K context runs to roughly 8-11 GB, depending on layer and KV-head counts. The "leftover" VRAM gets consumed by KV cache.&lt;/p&gt;
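
&lt;p&gt;The KV cache figure is a straightforward product. A minimal sketch of the calculation; the 80-layer / 8-KV-head / 128-head-dim shape is an assumption for a typical 70B-class GQA model, so the exact GB value moves with the real architecture:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# KV cache bytes = 2 (K and V) x layers x KV heads x head_dim x context x bytes per element.
# The 70B-class shape below is an assumption for illustration, not a specific model's spec.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx, bytes_per_el):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_el / 1e9

ctx = 32_768
fp16 = kv_cache_gb(80, 8, 128, ctx, 2)   # roughly 10.7 GB at FP16
q8   = kv_cache_gb(80, 8, 128, ctx, 1)   # roughly 5.4 GB with a q8_0-style cache
print(f"70B-class KV cache @ 32K: fp16 ~{fp16:.1f} GB, q8 ~{q8:.1f} GB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;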




&lt;h2&gt;
  
  
  CXL: Capacity Unleashed, Bandwidth Sacrificed
&lt;/h2&gt;

&lt;p&gt;CXL (Compute Express Link) is a memory expansion protocol built on the PCIe physical layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CXL bandwidth and capacity
&lt;/span&gt;&lt;span class="n"&gt;cxl_specs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CXL 3.1 (2024)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;physical_layer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PCIe 6.0 (64 GT/s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth_x16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64 GB/s (unidirectional, PCIe 6.0 x16)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~170-400 ns (measured, varies by device/config; 2-4x local DDR5)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capacity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Theoretically unlimited (memory pooling + switching)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Servers / data centers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth_comparison&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HBM3E (H200)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4,800 GB/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GDDR6_RTX4060&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;272 GB/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CXL_3.1_x16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64 GB/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CXL is 1/75 of HBM3E, 1/4 of GDDR6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# What happens when you run LLM inference over CXL bandwidth
&lt;/span&gt;&lt;span class="n"&gt;cxl_inference&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7B_Q4_K_M (4.7GB weights)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reads_per_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4.7 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cxl_speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64 / 4.7 = ~13.6 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hbm3e_speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4800 / 4.7 = ~1021 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gddr6_speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;272 / 4.7 = ~57.9 t/s → measured 32 t/s (55% efficiency)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verdict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Weight reads from CXL are barely usable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;70B_Q4_K_M (40GB weights)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reads_per_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;40 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cxl_speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64 / 40 = 1.6 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verdict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reading speed. Unusable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loading 70B Q4 weights over CXL's 64 GB/s gives you 1.6 t/s. That's about human reading speed.&lt;/p&gt;
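
&lt;p&gt;That 1.6 t/s, like every other speed in this article, comes from the same memory-bound ceiling. A minimal sketch that reproduces the numbers in the block above; the ceiling is theoretical, and measured speeds land below it (about 55% of it in the GDDR6 case):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Decode speed ceiling for memory-bound inference: tokens/s = bandwidth / bytes read per token.
# These are theoretical ceilings; measured speeds land below them (GDDR6 case above: ~55%).
def ceiling_tps(bandwidth_gb_s, weights_gb):
    return bandwidth_gb_s / weights_gb

print(f"7B Q4  over GDDR6 (272 GB/s): {ceiling_tps(272, 4.7):5.1f} t/s ceiling")  # ~57.9, measured ~32
print(f"7B Q4  over CXL   (64 GB/s) : {ceiling_tps(64, 4.7):5.1f} t/s ceiling")   # ~13.6
print(f"70B Q4 over CXL   (64 GB/s) : {ceiling_tps(64, 40):5.1f} t/s ceiling")    # ~1.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;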

&lt;p&gt;But CXL's real value isn't as a place to store weights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The right way to use CXL: tiered memory architecture
&lt;/span&gt;&lt;span class="n"&gt;cxl_tiered_architecture&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tier 0 (GPU SRAM)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purpose&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Activations, work buffers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capacity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;24 MB (RTX 4060 L2)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~4 TB/s (on-chip)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tier 1 (HBM/GDDR)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purpose&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model weights, active KV cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capacity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8-141 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;272-4800 GB/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tier 2 (CXL Memory)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purpose&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KV cache overflow, inactive layers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capacity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TB-class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64 GB/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;170-400 ns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tier 3 (NVMe SSD)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purpose&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Persistent model storage, swap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capacity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TB-class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7 GB/s (PCIe 4.0 x4)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~10,000 ns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# CXL fills the gap between Tier 1 and Tier 3
# As a KV cache overflow target, it's 9x faster than NVMe
# The correct split: weights in VRAM, stale KV cache entries in CXL
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CXL's essence isn't "VRAM replacement" — it's "a new tier between VRAM and NVMe." If you evict stale KV cache tokens (the early portion of a 128K context) to CXL memory, VRAM only needs to hold the recent attention window. That's a viable architecture.&lt;/p&gt;

&lt;p&gt;This tiering is orthogonal to techniques like optical memory readout (physically reducing KV cache transfer volume) or KV cache quantization (numerically reducing data volume). They compose.&lt;/p&gt;
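
&lt;p&gt;There is no off-the-shelf llama.cpp flag for this today, so the sketch below is purely illustrative: a two-tier KV store that evicts everything outside the recent attention window to a slower tier, with plain Python dicts standing in for VRAM and CXL-backed memory. All names are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Purely illustrative two-tier KV cache: "hot" stands in for VRAM, "cold" for CXL memory.
# No real CXL or llama.cpp API is used; class and method names are made up for this sketch.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hot_window_tokens=4096):
        self.hot_window = hot_window_tokens
        self.hot = OrderedDict()   # recent tokens: fast tier (VRAM stand-in)
        self.cold = {}             # stale tokens: capacity tier (CXL stand-in)

    def append(self, position, kv_block):
        self.hot[position] = kv_block
        # Evict the oldest positions once the hot tier exceeds the attention window.
        while len(self.hot) &amp;gt; self.hot_window:
            old_pos, old_block = self.hot.popitem(last=False)
            self.cold[old_pos] = old_block   # 64 GB/s tier: fine for rarely touched tokens

    def get(self, position):
        if position in self.hot:
            return self.hot[position]
        return self.cold.get(position)       # slow path: pulled back over CXL

cache = TieredKVCache(hot_window_tokens=4096)
for pos in range(131_072):                   # a 128K-token context
    cache.append(pos, kv_block=b"kv")
print(len(cache.hot), "positions hot,", len(cache.cold), "positions cold")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;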




&lt;h2&gt;
  
  
  Unified Memory: The Balance Trap
&lt;/h2&gt;

&lt;p&gt;Apple Silicon's Unified Memory lets the CPU, GPU, and NPU share a single physical memory pool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Apple Unified Memory in practice
&lt;/span&gt;&lt;span class="n"&gt;unified_memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;M4 Max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capacity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;546 GB/s (LPDDR5X)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bus_width&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512-bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shared_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CPU (12 cores) + GPU (40 cores) + NPU (16 cores) + media engine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;M4 (base)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capacity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16-32 GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;120 GB/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bus_width&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128-bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;note&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Less than half of RTX 4060 (272 GB/s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Reality of LLM inference
&lt;/span&gt;&lt;span class="n"&gt;unified_memory_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;M4 Max 128GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advantage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;70B Q4_K_M (40GB) fits entirely without GPU memory management&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;70B_speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;546 / 40 = 13.7 t/s (theoretical ceiling) → measured 8-10 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason_for_gap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bandwidth shared with CPU/NPU/IO. GPU doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t get exclusive access&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;M4 32GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32B_Q4_speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;120 / 18 = 6.7 t/s (theoretical) → measured 4-5 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;note&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RTX 4060 has exclusive 272 GB/s GDDR6 → 10.8 t/s on the same model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth_contention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CPU still accesses memory during GPU inference → they fight for bandwidth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OS_overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;macOS memory management, UI rendering consume bandwidth in the background&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;worst_case&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running inference while Safari has a heavy page open → noticeable speed drop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unified Memory's advantage is eliminating GPU memory management overhead. No cudaMalloc/cudaMemcpy. The data is already there. Zero copy cost.&lt;/p&gt;

&lt;p&gt;But bandwidth is a shared resource — you can't monopolize it. The RTX 4060's GDDR6 gives 272 GB/s effectively exclusive to the GPU. The base M4 splits 120 GB/s across the entire system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bandwidth efficiency comparison
&lt;/span&gt;&lt;span class="n"&gt;bandwidth_efficiency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RTX 4060 (8GB GDDR6)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_bw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;272&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpu_share&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~95% (only DisplayPort output competing)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effective_for_llm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~258 GB/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7B_Q4_speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;258 / 4.7 = 54.9 t/s (theoretical) → 32 t/s (58% effective)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;M4 Max (128GB LPDDR5X)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_bw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;546&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpu_share&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Majority during inference (contention with CPU/NPU/IO)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effective_for_llm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~400 GB/s (back-calculated: 8-10 t/s x 40GB = 320-400 GB/s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;70B_Q4_speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;400 / 40 = 10 t/s → measured 8-10 t/s (roughly matches)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;M4 base (16GB LPDDR5X)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_bw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpu_share&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shared with entire system during inference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effective_for_llm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~78 GB/s (back-calculated: 14-16 t/s x 4.7GB = 66-75 GB/s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7B_Q4_speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;78 / 4.7 = 16.6 t/s → measured 14-16 t/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# RTX 4060: Lower bandwidth but GPU-exclusive → fast on small models
# M4 Max: Higher bandwidth but shared → fits large models, at lower efficiency per GB/s of bandwidth
# M4 base: Mediocre bandwidth and capacity → loses to RTX 4060 for LLM workloads
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Comparing the Three Approaches
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                Bandwidth    Capacity    Cost        Role in LLM Inference
─────────────────────────────────────────────────────────────────
HBM3E          4,800 GB/s    141 GB     $10-15/GB   Read weights + KV at full speed
GDDR6         272 GB/s      8-24 GB    $2.5-4/GB   Run small models fast
CXL 3.1        64 GB/s       TB-class   $3-5/GB     KV cache overflow tier
Unified (Max)  546 GB/s      128 GB     Apple-set   Fit large models with zero-copy
NVMe SSD       7 GB/s        TB-class   $0.1/GB     Persistent model storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Optimal use case for each technology
&lt;/span&gt;&lt;span class="n"&gt;optimal_scenarios&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HBM (H100/H200)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Batch inference, concurrent request processing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;why&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bandwidth is amortized across multiple requests, making per-request cost efficient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limitation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;For a single request, most of 700W is wasted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GDDR (RTX 4060/5060)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Personal use, single request, small-to-mid models&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;why&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exclusive GPU bandwidth maximizes efficiency. 32 t/s at ~70W (0.46 t/s/W). Beats H100 single-request power efficiency for small models&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limitation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Capacity wall. 8GB means 7B is the ceiling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CXL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ultra-long context inference (128K+), shared memory pools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;why&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Solves VRAM exhaustion when KV cache balloons to tens of GB with long contexts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limitation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1/75 bandwidth. Too slow for weight storage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Server-side 2025-26, consumer 2028+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unified Memory (Apple)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running large models with minimal setup. Development and experimentation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;why&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;70B Q4 runs without any memory management. Ease of setup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limitation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shared bandwidth means lower speed efficiency vs exclusive GDDR. Hard to share with gaming workloads&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Practical Implications for 8GB VRAM Users
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Strategies for breaking through the memory wall with 8GB VRAM today
&lt;/span&gt;&lt;span class="n"&gt;practical_8gb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Layer 1: Quantization (immediate impact)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q4_K_M quantization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effect&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7B model weights: 14GB → 4.7GB (3x capacity efficiency)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;how&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Standard support in llama.cpp / Ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Layer 2: KV cache quantization (experimental)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--cache-type-k q4_0 --cache-type-v q8_0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effect&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KV cache at 1/3 of FP16 → enables longer contexts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;how&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama.cpp launch flags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Layer 3: CPU offload (bandwidth tradeoff)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--n-gpu-layers to partially load onto GPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effect&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32B models run (slow, but they run)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.8 t/s (32B on RTX 4060, optimal offload)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bandwidth_bottleneck&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CPU↔GPU via PCIe 4.0 x8 = 16 GB/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Layer 4: CXL (future)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CXL memory modules&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effect&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add memory via PCIe → Tier 2 storage for KV cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Consumer availability 2028+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;note&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Similar in principle to today&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s CPU offload (PCIe 16 GB/s), but CXL allows memory-semantic access (load/store, directly addressable by GPU)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# What you can do today: combine Layers 1-3
# Q4 quantization + KV cache Q4 + optimal GPU offload = 32B model × 32K context on 8GB
# Future: CXL adds Layer 4, making 128K+ contexts realistic
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
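

&lt;p&gt;For reference, combining Layers 1-3 looks like this as an actual launch command. A minimal sketch: the model file, layer count, and context size are illustrative placeholders rather than values measured for this article, and llama.cpp generally requires flash attention to be enabled before the V cache can be quantized.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: Q4_K_M weights + KV cache quantization + partial GPU offload.
# Model filename and --n-gpu-layers value are placeholders; tune the layer count
# until VRAM is nearly full without overflowing.
import subprocess

cmd = [
    "llama-server",
    "-m", "qwen2.5-32b-instruct-q4_k_m.gguf",  # Layer 1: Q4_K_M weights (hypothetical path)
    "-c", "32768",                             # long context is where KV quantization pays off
    "--flash-attn",                            # V-cache quantization needs flash attention
    "--cache-type-k", "q4_0",                  # Layer 2: quantized K cache
    "--cache-type-v", "q8_0",                  # Layer 2: quantized V cache
    "--n-gpu-layers", "24",                    # Layer 3: partial offload (placeholder value)
]
subprocess.run(cmd, check=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;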



&lt;p&gt;Here's the key insight: the "memory expansion" CXL promises travels over fundamentally the same PCIe bus as today's CPU offload. The bandwidth ceiling is identical. CXL's advantage is memory semantics — load/store access where the GPU can address memory directly — not bandwidth improvement.&lt;/p&gt;
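
&lt;p&gt;A quick back-of-envelope check, using the same speed = bandwidth ÷ bytes-per-token model as the rest of this article, shows why that division of labor makes sense: a 64 GB/s tier is tolerable for occasionally-touched KV blocks but painful if weights had to stream through it. The numbers below are illustrative, not new measurements.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Roofline-style estimate: tokens/s ceiling = effective bandwidth / bytes read per token.
# Illustrative inputs taken from earlier in this article: 7B Q4_K_M weights = 4.7 GB,
# GDDR6 effective = 258 GB/s, CXL 3.1 link = 64 GB/s.

def tps_ceiling(bytes_per_token_gb, bandwidth_gbps):
    return bandwidth_gbps / bytes_per_token_gb

print(tps_ceiling(4.7, 258))  # ~55 t/s theoretical if weights stay in GDDR6
print(tps_ceiling(4.7, 64))   # ~14 t/s theoretical if weights streamed over a CXL link
# Stale KV blocks are read far less often than weights, so parking them on the
# 64 GB/s tier mostly stays off the per-token critical path.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;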




&lt;h2&gt;
  
  
  The Physics That Decides Memory's Future
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Question: "Does adding more VRAM solve the problem?"

Answer: Only partially.

The bandwidth-capacity-cost triangle is governed by physics,
and no technology can claim all three vertices.

HBM chose bandwidth, sacrificing capacity and cost.
CXL chose capacity, sacrificing bandwidth.
Unified Memory chose balance, sacrificing exclusive bandwidth.
GDDR chose exclusive bandwidth, sacrificing capacity.

The optimal answer for LLM inference isn't "pick one technology" —
it's combining multiple technologies in a tiered hierarchy.

Best strategy for today's RTX 4060:
  Weights → VRAM (Q4 quantization to fit 7-13B entirely)
  KV cache → VRAM (Q4/Q8 quantization to save capacity)
  Extra layers → RAM (CPU offload, PCIe bandwidth)
  Persistent storage → NVMe SSD

Best strategy for future CXL-equipped consumer PCs:
  Weights → VRAM (Q4 quantization)
  Active KV → VRAM
  Stale KV → CXL memory (64 GB/s is fast enough for this)
  Persistent storage → NVMe SSD

The memory wall isn't something you break through —
it's something you route around with tiers.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;CXL Consortium — "Compute Express Link Specification 3.1" (2024)&lt;/li&gt;
&lt;li&gt;Samsung — "CMM-D: CXL Memory Module for Data Centers" (2024)&lt;/li&gt;
&lt;li&gt;SK hynix — HBM3E specifications, 12-Hi stack architecture&lt;/li&gt;
&lt;li&gt;NVIDIA H200 specifications — 141GB HBM3E, 4.8 TB/s&lt;/li&gt;
&lt;li&gt;Apple M4 Max specifications — 128GB Unified Memory, 546 GB/s&lt;/li&gt;
&lt;li&gt;"Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023) &lt;a href="https://arxiv.org/abs/2309.06180" rel="noopener noreferrer"&gt;arXiv:2309.06180&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>20260325_llm_framework_comparison_en</title>
      <dc:creator>plasmon</dc:creator>
      <pubDate>Thu, 16 Apr 2026 12:37:29 +0000</pubDate>
      <link>https://dev.to/plasmon_imp/20260325llmframeworkcomparisonen-27jd</link>
      <guid>https://dev.to/plasmon_imp/20260325llmframeworkcomparisonen-27jd</guid>
      <description>&lt;h1&gt;
  
  
  Ollama, LM Studio, and GPT4All Are All Just llama.cpp — Here's Why Performance Still Differs
&lt;/h1&gt;

&lt;p&gt;When running local LLMs on an RTX 4060 8GB, the first decision isn't the model. It's the framework.&lt;/p&gt;

&lt;p&gt;llama.cpp, Ollama, LM Studio, vLLM, GPT4All — plenty of options. But under an 8GB VRAM constraint, the framework choice directly affects inference speed. A 0.5GB difference in overhead changes which models you can load at all. One extra API abstraction layer adds a few ms of latency.&lt;/p&gt;

&lt;p&gt;What follows is a comparison on identical hardware with identical models.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frameworks and Evaluation Criteria
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Framework Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;frameworks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama.cpp (CLI)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b8233 (2026-03)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CUDA + Metal + CPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GGUF (Q2_K ~ FP16)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLI / llama-server (OpenAI-compatible)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strength&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Minimal overhead, maximum control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.6.x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama.cpp (bundled)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GGUF (via Ollama Hub)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REST API + CLI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strength&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Docker-like simplicity, easy model management&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LM Studio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.3.x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama.cpp (bundled)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GGUF (GUI search)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OpenAI-compatible API + GUI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strength&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GUI, beginner-friendly, function calling support&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vLLM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.7.x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Custom CUDA kernels + PagedAttention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWQ, GPTQ, FP8, GGUF (v0.4.2+)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OpenAI-compatible API&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strength&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Batch processing optimization, server-oriented&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPT4All&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama.cpp (bundled)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GGUF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GUI + Python SDK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strength&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Simplest setup, offline-first&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical fact: &lt;strong&gt;Ollama, LM Studio, and GPT4All all use llama.cpp internally&lt;/strong&gt;. The differences are purely in wrapper design. Only vLLM has its own CUDA kernels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluation Axes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;evaluation_axes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inference speed (t/s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generation speed with identical model and quantization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VRAM overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VRAM consumed by the framework itself, excluding the model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cold start time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Time to complete model loading&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API compatibility&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OpenAI API compatibility and quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Function calling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool-use support and accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Setup difficulty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Steps from install to first inference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Inference Speed Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test Conditions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;test_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RTX 4060 Laptop (8GB VRAM)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen2.5-7B-Instruct Q4_K_M (4.7GB)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain the difference between TCP and UDP in 200 words&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;measurement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Median of 3 runs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
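

&lt;p&gt;For transparency, the sketch below shows roughly how a generation-speed figure like the ones in the next table can be collected against any OpenAI-compatible endpoint (llama-server, Ollama, LM Studio). It is not the exact harness used here; the URL and model name are placeholders, and timing the whole request slightly understates pure generation speed because it includes prompt evaluation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough benchmark sketch against an OpenAI-compatible endpoint (placeholder URL/model).
import time
import statistics
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def generation_tps():
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="qwen2.5-7b-instruct-q4_k_m",
        messages=[{"role": "user",
                   "content": "Explain the difference between TCP and UDP in 200 words"}],
        max_tokens=256,
        temperature=0.7,
    )
    elapsed = time.perf_counter() - start
    # Includes prompt eval time, so this slightly understates pure generation t/s.
    return resp.usage.completion_tokens / elapsed

print(statistics.median(generation_tps() for _ in range(3)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;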



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Framework            Prompt eval  Generation  TTFT    VRAM overhead
                     (t/s)        (t/s)       (ms)    (excl. model)
────────────────────────────────────────────────────────────────
llama.cpp (CLI)       ~800        32.1        120     ~0.3 GB
llama-server          ~780        31.5        135     ~0.4 GB
Ollama                ~750        30.2        180     ~0.5 GB
LM Studio             ~720        29.8        250     ~0.6 GB
GPT4All               ~680        28.5        300     ~0.7 GB
vLLM                  N/A*        N/A*        N/A*    ~1.5 GB+

* vLLM OOM with default settings on 8GB VRAM
  (PagedAttention KV cache pre-allocation consumes additional VRAM)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Analysis
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;speed_analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama.cpp vs Ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32.1 vs 30.2 = 5.9%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ollama&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s REST API layer + model management daemon overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;practical_impact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Negligible. Convenience offsets the difference.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama.cpp vs LM Studio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32.1 vs 29.8 = 7.2%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GUI + additional API abstraction layers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;practical_impact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GUI benefits outweigh speed loss for most use cases&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama.cpp vs GPT4All&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32.1 vs 28.5 = 11.2%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python SDK overhead + non-optimized default settings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;practical_impact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Acceptable for beginners, room for optimization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vLLM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cannot run 7B models on 8GB VRAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PagedAttention KV cache pre-allocation consumes additional VRAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_case&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tunable via gpu_memory_utilization, but practically needs 16GB+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Bottom line: llama.cpp is fastest, but the gap is 6-11%
# On 8GB VRAM, the real differentiator is overhead (0.3GB vs 1.5GB)
# That overhead gap determines your maximum model size
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
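

&lt;p&gt;If you still want to try vLLM on 8GB, the tuning knob mentioned above looks like this. A hedged sketch, not a recommendation: the AWQ checkpoint name is a placeholder, and even with tuning a 7B model leaves little headroom.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch only: shrink vLLM's KV cache pre-allocation and context to try to fit 8GB.
# Model name is a placeholder quantized checkpoint; expect OOM if other apps hold VRAM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",   # placeholder AWQ checkpoint
    quantization="awq",
    gpu_memory_utilization=0.85,            # default is 0.9; lower it when the CUDA context is tight
    max_model_len=2048,                     # small context keeps the pre-allocated KV cache small
)

print(llm.generate(["Explain PagedAttention in one sentence."],
                   SamplingParams(max_tokens=64))[0].outputs[0].text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;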






&lt;h2&gt;
  
  
  When VRAM Overhead Becomes Fatal on 8GB
&lt;/h2&gt;

&lt;p&gt;On 8GB VRAM, framework overhead directly dictates your maximum model size.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Maximum model size per framework
&lt;/span&gt;&lt;span class="n"&gt;max_model_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama.cpp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;available_for_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;8.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 7.4 GB
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen2.5-32B Q4_K_M (18GB) -&amp;gt; 7.4GB on GPU + 10.6GB CPU offload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_full_gpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mistral-Nemo-12B Q4_K_M (7.2GB) -&amp;gt; barely fits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;available_for_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;8.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 7.2 GB
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_full_gpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7B Q4_K_M (4.7GB) -&amp;gt; comfortable, 12B -&amp;gt; tight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LM Studio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;available_for_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;8.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 7.1 GB
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_full_gpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7B Q4_K_M (4.7GB) -&amp;gt; comfortable, 12B -&amp;gt; difficult&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vLLM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;available_for_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;8.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 6.2 GB
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_full_gpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Even 7B models have no headroom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;note&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Not recommended for 8GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The overhead difference between llama.cpp and vLLM is 1.2GB. That 1.2GB could buy you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Additional KV cache allocation to extend context length&lt;/li&gt;
&lt;li&gt;Room to co-locate a BGE-M3 embedding model alongside your LLM&lt;/li&gt;
&lt;li&gt;Higher GPU offload ratio for the model, speeding up inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On 8GB VRAM, framework selection isn't a preference. It's an architectural decision.&lt;/p&gt;
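
&lt;p&gt;The budgeting above reduces to one line of arithmetic. A minimal helper, using the overhead figures measured in this comparison; the model and KV-cache sizes are whatever you plug in.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# VRAM budget check: model + KV cache must fit in what the framework leaves free.
def fits_fully_on_gpu(model_gb, kv_cache_gb, framework_overhead_gb,
                      total_vram_gb=8.0, cuda_context_gb=0.3):
    available = total_vram_gb - framework_overhead_gb - cuda_context_gb
    return model_gb + kv_cache_gb &lt;= available

# 7B Q4_K_M (4.7 GB) plus ~2 GB of KV cache for a longer context:
print(fits_fully_on_gpu(4.7, 2.0, 0.3))  # llama.cpp: True  (7.4 GB budget)
print(fits_fully_on_gpu(4.7, 2.0, 1.5))  # vLLM:      False (6.2 GB budget)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;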




&lt;h2&gt;
  
  
  Function Calling Support
&lt;/h2&gt;

&lt;p&gt;As covered in my separate article on function calling, tool use is the killer feature for local LLMs. Here's where each framework stands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;function_calling_support&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama.cpp (llama-server)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;supported&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OpenAI-compatible tools parameter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GBNF_grammar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Enforces JSON output grammatically
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model-dependent. High accuracy with Qwen2.5-7B-Instruct + GBNF grammar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limitation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Requires manual server startup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;supported&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OpenAI-compatible tools parameter (v0.4+)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GBNF_grammar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# No raw GBNF, but format parameter supports JSON Schema
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Same as llama.cpp (identical backend)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limitation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No GBNF grammar, but structured output via format parameter with JSON Schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LM Studio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;supported&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OpenAI-compatible tools parameter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GBNF_grammar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# JSON Schema enforcement
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Testable through GUI, which is the main advantage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limitation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Backend equivalent to llama.cpp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vLLM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;supported&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OpenAI-compatible tools + Guided Decoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;High accuracy via Guided Decoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limitation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Needs gpu_memory_utilization tuning on 8GB, practically 16GB+ recommended&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPT4All&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;supported&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;note&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No function calling support. Chat only.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPT4All doesn't support function calling, which makes it unusable for agentic workflows. vLLM's Guided Decoding is powerful but impractical on 8GB. For function calling on 8GB VRAM, you're limited to the llama.cpp family: llama.cpp directly, Ollama, or LM Studio.&lt;/p&gt;
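
&lt;p&gt;From the client side, a tools request looks the same against any of those three, because they all expose an OpenAI-compatible endpoint. A minimal sketch: the base URL points at Ollama's default port, and the model name and weather tool are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal tools-call sketch against a local OpenAI-compatible endpoint (placeholders).
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # Ollama default port

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen2.5:7b-instruct",  # placeholder model tag
    messages=[{"role": "user", "content": "What's the weather in Osaka?"}],
    tools=tools,
)

# Assumes the model decided to call the tool; check for None in real code.
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;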




&lt;h2&gt;
  
  
  Recommendations by Use Case
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;recommendations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Maximum performance (developers)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama.cpp (CLI / llama-server)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasons&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Minimal overhead (0.3GB)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GBNF grammar enforces structured output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Direct control over all parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Per-layer GPU/CPU offload granularity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;downside&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Requires technical knowledge, no GUI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Convenient daily use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasons&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Docker-pull simplicity (ollama pull model)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Background daemon, always available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OpenAI-compatible API for drop-in replacement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Within 6% of llama.cpp speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;downside&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No GBNF grammar (JSON Schema via format param available), slightly larger overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GUI-driven experimentation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LM Studio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasons&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model search and download entirely in GUI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chat UI for real-time testing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Function calling testable through the interface&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;downside&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Higher memory footprint due to GUI layer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Easiest possible start (non-engineers)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPT4All&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasons&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Install -&amp;gt; launch -&amp;gt; chat in minimal steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fully offline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No unnecessary configuration options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;downside&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No function calling, slowest speed, limited customization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Production / server deployment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vLLM (16GB+ GPU recommended) or llama-server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasons&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vLLM: PagedAttention for efficient batch processing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-server: Lightweight server that works on 8GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;downside&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vLLM impractical on 8GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Verdict for 8GB
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Question: What's the optimal framework for 8GB VRAM?

Answer: Depends on use case. But the technically optimal choice is raw llama.cpp.

Why:
1. Minimum overhead (0.3GB) -&amp;gt; maximum usable VRAM
2. Fastest speed (+6-11% over other frameworks)
3. GBNF grammar enforces structured output -&amp;gt; highest function calling reliability
4. Per-layer GPU/CPU offload control

However:
- For daily use, Ollama's convenience outweighs the speed gap
- If you need a GUI, LM Studio is the only option
- vLLM is impractical on 8GB (needs 16GB+)
- GPT4All is unsuitable for agentic tasks (no function calling)

The total speed spread across all frameworks is within 11%.
Model selection matters far more than framework selection.
The gap between Qwen2.5-3B (2.0GB) and Qwen2.5-7B (4.7GB)
dwarfs the gap between llama.cpp and GPT4All.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're spending time agonizing over frameworks, spend it benchmarking models instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;llama.cpp -- &lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;github.com/ggerganov/llama.cpp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Ollama -- &lt;a href="https://ollama.ai" rel="noopener noreferrer"&gt;ollama.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LM Studio -- &lt;a href="https://lmstudio.ai" rel="noopener noreferrer"&gt;lmstudio.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;vLLM -- &lt;a href="https://vllm.ai" rel="noopener noreferrer"&gt;vllm.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GPT4All -- &lt;a href="https://gpt4all.io" rel="noopener noreferrer"&gt;gpt4all.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;"Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023) &lt;a href="https://arxiv.org/abs/2309.06180" rel="noopener noreferrer"&gt;arXiv:2309.06180&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>20260324_ai_bubble_8gb_en</title>
      <dc:creator>plasmon</dc:creator>
      <pubDate>Tue, 14 Apr 2026 09:54:04 +0000</pubDate>
      <link>https://dev.to/plasmon_imp/20260324aibubble8gben-325p</link>
      <guid>https://dev.to/plasmon_imp/20260324aibubble8gben-325p</guid>
      <description>&lt;h2&gt;
  
  
  What the Bubble Doomsayers Are Actually Looking At
&lt;/h2&gt;

&lt;p&gt;Q1 2026, and AI bubble collapse discourse is back with a vengeance. VC pullback headlines, startup consolidation reports, pundits drawing dot-com parallels on every platform. The takes are everywhere.&lt;/p&gt;

&lt;p&gt;Their arguments boil down to three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI stock valuations are detached from reality&lt;/strong&gt; — NVIDIA's P/E ratio peaked above 60. If revenue growth stalls, correction is inevitable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monetization isn't keeping up&lt;/strong&gt; — Is GPT-4o's $20/month subscription actually profitable? Per-call inference costs remain high&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hype fatigue&lt;/strong&gt; — Markets are going numb to weekly model announcements&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And honestly? They're right. Slowing VC inflows and AI startup consolidation are practically guaranteed at this point.&lt;/p&gt;

&lt;p&gt;But this argument has a fatal blind spot. &lt;strong&gt;The entire bubble narrative is scoped to data-center-scale economics.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  API-Dependent Engineers Will Absolutely Feel the Pain
&lt;/h2&gt;

&lt;p&gt;Let me be upfront. I don't think bubble fallout will be zero.&lt;/p&gt;

&lt;p&gt;If you're building products on top of APIs, these scenarios are real risks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API price spikes&lt;/strong&gt;: OpenAI may not be able to sustain GPT-4o at $2.50/1M input tokens forever. When investor subsidies dry up, pricing corrects to actual cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service shutdowns and consolidation&lt;/strong&gt;: Anthropic, Mistral, Cohere — there's no guarantee all of them survive through 2026. The API you depend on could vanish&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model quality stagnation&lt;/strong&gt;: Frontier models that cost hundreds of millions to train may see slower development cycles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The third point is the one you can't hand-wave away. There's a real quality gap between frontier models like Claude 4 and local 8B-32B models. Training data scale, RLHF investment, evaluation pipeline budgets — these differ by orders of magnitude. I don't honestly believe local models will close that gap entirely. Not with the current Transformer architecture, anyway.&lt;/p&gt;

&lt;p&gt;That's the scope where bubble collapse arguments hold water.&lt;/p&gt;




&lt;h2&gt;
  
  
  Now Let's Talk About Life in 8GB VRAM Territory
&lt;/h2&gt;

&lt;p&gt;RTX 4060 8GB. M4 Mac mini 16GB. The machine I'm writing this on is the counter-argument.&lt;/p&gt;

&lt;p&gt;In the local LLM world, a bubble bursting is &lt;strong&gt;a capital flow problem upstream&lt;/strong&gt;, not &lt;strong&gt;a problem with our inference pipeline&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's why. Three structural reasons.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reason 1: Model Weights Are Downloaded Physical Files
&lt;/h3&gt;

&lt;p&gt;Qwen3.5-9B-Q4_K_M.gguf. That's a 5.3GB binary file downloaded from Hugging Face. It exists on my local disk.&lt;/p&gt;

&lt;p&gt;If Alibaba Cloud disbands the Qwen team tomorrow, this file doesn't disappear.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Local model inventory&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-lh&lt;/span&gt; ~/models/&lt;span class="k"&gt;*&lt;/span&gt;.gguf

&lt;span class="c"&gt;# Actual output (RTX 4060 8GB setup)&lt;/span&gt;
&lt;span class="c"&gt;# -rw-r--r-- 1 user 5.3G qwen3.5-9b-q4_k_m.gguf&lt;/span&gt;
&lt;span class="c"&gt;# -rw-r--r-- 1 user  21G qwen3.5-35b-a3b-q4_k_m.gguf  (MoE: 3B active)&lt;/span&gt;
&lt;span class="c"&gt;# -rw-r--r-- 1 user 4.6G llama-3.1-8b-instruct-q4_k_m.gguf&lt;/span&gt;
&lt;span class="c"&gt;# -rw-r--r-- 1 user 2.4G phi-4-mini-q4_k_m.gguf&lt;/span&gt;

&lt;span class="c"&gt;# Total: 33GB — fits on a 64GB microSD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An API endpoint disappears when a company makes a business decision. A GGUF file disappears when your SSD dies. That difference is decisive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reason 2: The Inference Engine Is Open Source and Community-Driven
&lt;/h3&gt;

&lt;p&gt;llama.cpp's GitHub repo has over 700 contributors. Even if Meta, Google, or Microsoft gut their AI divisions, as long as Georgi Gerganov keeps writing code on his MacBook, llama.cpp isn't going anywhere.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# llama.cpp release cadence (2025-2026)&lt;/span&gt;
&lt;span class="c"&gt;# b8233 (2026-03) — Qwen3.5 MoE optimization&lt;/span&gt;
&lt;span class="c"&gt;# b8102 (2026-03) — Flash Attention v2 improvements&lt;/span&gt;
&lt;span class="c"&gt;# b7955 (2026-02) — KV cache compression improvements&lt;/span&gt;
&lt;span class="c"&gt;# b7811 (2026-02) — INT4 GEMM kernel optimization&lt;/span&gt;

&lt;span class="c"&gt;# Releases every two weeks or less&lt;/span&gt;
&lt;span class="c"&gt;# This development velocity has nothing to do with corporate funding&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What matters most: llama.cpp improvements keep &lt;strong&gt;boosting performance on the same hardware&lt;/strong&gt;. No new GPU needed. When I first ran Qwen2.5-32B on my RTX 4060 8GB, I got 8.2 tok/s at ngl=20. After llama.cpp's Flash Attention improvements, same config hit 10.8 tok/s. Same hardware. Free software upgrade.&lt;/p&gt;
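
&lt;p&gt;If you want to verify that kind of software-only gain on your own box, &lt;code&gt;llama-bench&lt;/code&gt; (bundled with llama.cpp) is enough. A minimal sketch; the model path and &lt;code&gt;-ngl&lt;/code&gt; value are examples from my setup, adjust for yours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Measure prompt-processing and generation speed for the current build
./build/bin/llama-bench -m ~/models/qwen2.5-32b-q4_k_m.gguf -ngl 20 -p 512 -n 128

# Pull a newer release, rebuild, run the exact same command,
# and compare the t/s columns: same hardware, new number
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;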

&lt;h3&gt;
  
  
  Reason 3: Quantization Is Math, Not a License
&lt;/h3&gt;

&lt;p&gt;Q4_K_M, Q5_K_S, IQ4_XS — these are algorithms. Not proprietary tech locked behind patents. Published in papers, implemented in open source.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Quantization impact in hard numbers
&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3.5-9B FP16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;size_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;18.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fits_8gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3.5-9B Q4_K_M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;size_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fits_8gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3.5-27B FP16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;size_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;54.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fits_8gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3.5-27B Q4_K_M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;size_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;16.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fits_8gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# Runs with CPU offload
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3.5-35B-A3B Q4_K_M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;size_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;21.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fits_8gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# MoE: 3B active, runs via CPU offload
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPU only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fits_8gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CPU offload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;size_gb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;5.1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;GB  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# FP16 → Q4_K_M ≈ 3.5x compression
# This has nothing to do with Alibaba's balance sheet
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even if half of all AI companies go bankrupt, the Q4_K_M quantization algorithm doesn't vanish. The GGML format spec doesn't vanish. The llama.cpp binary doesn't vanish.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Risks for Local LLM Are Elsewhere
&lt;/h2&gt;

&lt;p&gt;I've been optimistic so far, but local LLM has weak spots too. Just not the ones bubble discourse is about.&lt;/p&gt;

&lt;h3&gt;
  
  
  Risk 1: New Model Training Slows Down
&lt;/h3&gt;

&lt;p&gt;The model weights running on your machine were trained on massive GPU clusters owned by corporations. Qwen3.5 came from Alibaba's compute. The next Llama version depends on Meta's infrastructure.&lt;/p&gt;

&lt;p&gt;If the bubble pops and these companies slash AI investment, new models stop appearing. Existing models keep running, but &lt;strong&gt;evolution stalls&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In practice though, Meta, Alibaba, and Google all treat their AI divisions as core infrastructure, not pure VC plays. Startups may die, but big tech's open model development won't stop overnight. Meta uses Llama internally for Instagram and WhatsApp inference. As long as internal demand exists, development continues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Risk 2: CUDA Lock-in
&lt;/h3&gt;

&lt;p&gt;llama.cpp supports CPU, Metal, Vulkan, and CUDA backends, but &lt;strong&gt;peak performance on an RTX 4060 requires CUDA&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There's a nonzero chance NVIDIA changes CUDA licensing. But ROCm (AMD) and Vulkan backends are maturing as real alternatives. The M4 Mac mini's Metal backend already delivers practical speeds comparable to CUDA. Single-point-of-failure risk on CUDA is meaningfully lower than it was three years ago.&lt;/p&gt;

&lt;h3&gt;
  
  
  Risk 3: Semiconductor Supply Chain Fragmentation
&lt;/h3&gt;

&lt;p&gt;This is the most realistic threat. A Taiwan Strait crisis that halts TSMC fabs would cut off GPU supply. Your existing RTX 4060 keeps running, but &lt;strong&gt;if it breaks, there's no replacement&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The hedge is straightforward: watch Intel Arc improve, and diversify toward Apple Silicon. Intel Arc uses Intel's own fabs (Intel Foundry), while Apple Silicon is shifting toward TSMC's Arizona facility. Not a perfect hedge, but better than being entirely dependent on NVIDIA + TSMC Taiwan.&lt;/p&gt;




&lt;h2&gt;
  
  
  Making Your Personal AI Stack Bubble-Proof
&lt;/h2&gt;

&lt;p&gt;Theory's done. What do you actually do?&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Local Model Backups
&lt;/h3&gt;

&lt;p&gt;Copy your GGUF files to a NAS or external SSD. If a Hugging Face repo gets taken down, you've still got the weights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Backup to external SSD&lt;/span&gt;
rsync &lt;span class="nt"&gt;-av&lt;/span&gt; &lt;span class="nt"&gt;--progress&lt;/span&gt; ~/models/&lt;span class="k"&gt;*&lt;/span&gt;.gguf /mnt/backup_ssd/llm_models/

&lt;span class="c"&gt;# Or just copy&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; ~/models/qwen3.5-9b-q4_k_m.gguf /mnt/backup_ssd/llm_models/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;33GB of models. Fits on a 64GB microSD card. That's the entire cost of your bubble insurance policy.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Pin Your Runtime
&lt;/h3&gt;

&lt;p&gt;Save a known-good llama.cpp build as a static binary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build and save a verified version&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
git checkout b8233
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j8&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;build/bin/llama-cli ~/stable_bins/llama-cli-b8233

&lt;span class="c"&gt;# This binary has no external service dependencies&lt;/span&gt;
&lt;span class="c"&gt;# Just needs CUDA Toolkit 12.x and an NVIDIA driver&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Audit Your API Dependency
&lt;/h3&gt;

&lt;p&gt;Map out which parts of your workflow rely on API calls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Dependency Checklist]
□ Code completion → Copilot (API) or local FIM?
□ Writing/editing → GPT-4o (API) or local 9B?
□ RAG embeddings → OpenAI Embeddings (API) or BGE-M3 (local)?
□ Image generation → DALL-E (API) or SDXL (local)?
□ Speech-to-text → Whisper API or whisper.cpp (local)?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You don't need to eliminate all API usage. For tasks that genuinely need frontier capabilities — deep chain-of-thought reasoning, multimodal analysis — use the API. But &lt;strong&gt;know whether a fallback path exists&lt;/strong&gt; for when that API disappears.&lt;/p&gt;
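
&lt;p&gt;If the checklist feels abstract, a few lines of Python make it concrete. This is a hypothetical helper (the client package names are just the common ones) that lists which files in a project import hosted-API SDKs, i.e. where a local fallback would have to slot in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical helper: find which files import hosted-API client SDKs.
# Package names below are common examples; extend the tuple for your stack.
import pathlib
import re

API_CLIENTS = ("openai", "anthropic", "cohere", "mistralai")

for path in pathlib.Path(".").rglob("*.py"):
    text = path.read_text(errors="ignore")
    hits = [name for name in API_CLIENTS
            if re.search(rf"^\s*(import|from)\s+{name}\b", text, re.MULTILINE)]
    if hits:
        print(f"{path}: depends on {', '.join(hits)}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;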




&lt;h2&gt;
  
  
  Proving It with Numbers on 8GB
&lt;/h2&gt;

&lt;p&gt;Let's ground the bubble debate in actual measurements. How far can an RTX 4060 8GB go as an API replacement?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[RTX 4060 8GB Local Inference Benchmark — 2026-03]

Task                    Model               tok/s   Quality (subjective /5)
─────────────────────────────────────────────────────────────
Code completion (Python) Qwen3.5-9B Q4_K_M   33.0    ★★★★☆
Technical doc summary    Qwen3.5-9B Q4_K_M   37.1    ★★★☆☆
Mathematical reasoning   Qwen3.5-35B-A3B     8.6     ★★★★☆
Paper reading (RAG)      BGE-M3 + Qwen3.5-9B 28.5    ★★★☆☆
Chat / dialogue          Qwen3.5-9B Q4_K_M   33.0    ★★★★☆

Ref: Claude Sonnet 4.6 API                    ~80     ★★★★★
Ref: GPT-4o API                               ~60     ★★★★★

Power draw: ~95W × usage hours (no API fees, $0/month fixed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I won't pretend local quality beats frontier APIs. Claude Sonnet and GPT-4o are in a different league from a local 9B model for reasoning tasks. That's just honest.&lt;/p&gt;

&lt;p&gt;But 33 tok/s code completion at $0/month, works offline, no rate limits, data never leaves your machine — that structural advantage holds whether the bubble bursts or not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bubble Is a Data Center Problem
&lt;/h2&gt;

&lt;p&gt;Strip it all down, and nearly every AI bubble take is about the same thing: return on massive capital investment. Billions in training clusters, thousands of H100s, millions per year in power costs — whether that scale of business is sustainable.&lt;/p&gt;

&lt;p&gt;Your personal 8GB VRAM is not in that blast radius.&lt;/p&gt;

&lt;p&gt;An RTX 4060 costs around $350. An M4 Mac mini runs about $700. Model weights are free to download. llama.cpp is free to use. Quantization algorithms are in published papers.&lt;/p&gt;

&lt;p&gt;All of this exists independently of VC capital flows.&lt;/p&gt;

&lt;p&gt;When the bubble pops, the people in trouble are companies running products on API subscriptions and investors holding NVIDIA stock. Not the individual engineer running Qwen3.5 on 8GB of VRAM.&lt;/p&gt;

&lt;p&gt;If anything, a bubble collapse might accelerate migration from API-dependent products to local inference. If API prices climb, the relative appeal of local goes up. For those of us in 8GB territory, a bubble burst could be a tailwind.&lt;/p&gt;

&lt;p&gt;One caveat though. The risk of frontier model stagnation is real. Getting complacent about your local 9B being "good enough" and ignoring cutting-edge reasoning capabilities only available via API — that's a different kind of danger. Don't get comfortable just because you're outside the bubble. Keep both tools in your belt. That's the optimal play at individual scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;llama.cpp: &lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;https://github.com/ggerganov/llama.cpp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hugging Face GGUF Models: &lt;a href="https://huggingface.co/models?library=gguf" rel="noopener noreferrer"&gt;https://huggingface.co/models?library=gguf&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Qwen3.5 Model Family: &lt;a href="https://huggingface.co/Qwen" rel="noopener noreferrer"&gt;https://huggingface.co/Qwen&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GGML Quantization Methods: &lt;a href="https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md" rel="noopener noreferrer"&gt;https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>discuss</category>
      <category>news</category>
      <category>startup</category>
    </item>
    <item>
      <title>20260324_snn_vs_gpu_en</title>
      <dc:creator>plasmon</dc:creator>
      <pubDate>Tue, 14 Apr 2026 09:54:01 +0000</pubDate>
      <link>https://dev.to/plasmon_imp/20260324snnvsgpuen-p0h</link>
      <guid>https://dev.to/plasmon_imp/20260324snnvsgpuen-p0h</guid>
      <description>&lt;h2&gt;
  
  
  GPU Dominance in AI Inference Is Getting Challenged
&lt;/h2&gt;

&lt;p&gt;Running llama.cpp on an RTX 4060, the fans scream. 95W. 38 tok/s. The results are fine, but the moment you talk power efficiency, things get awkward. An M4 Mac mini pulls the same speed at 30W, and CUDA's brute-force approach becomes hard to defend.&lt;/p&gt;

&lt;p&gt;Meanwhile, the biological brain runs on 20W. And most of that goes to maintaining membrane potentials and keeping synapses on standby — the incremental cost of "conscious thought" is less than 5% above baseline (Raichle, &lt;em&gt;Science&lt;/em&gt;, 2006). That puts actual thinking at under 1W.&lt;/p&gt;

&lt;p&gt;The human brain has roughly 86 billion neurons, and only 1-2% fire at any given moment (Lennie, &lt;em&gt;Current Biology&lt;/em&gt;, 2003). Only the neurons that need to spike do so, only when needed. This is fundamentally different from Transformer inference, where every parameter is active on every token.&lt;/p&gt;

&lt;p&gt;Spiking Neural Networks (SNNs) and neuromorphic computing are trying to bring this biological design principle into hardware. Three interesting papers dropped in Q1 2026. I read them, and thought about where GPUs are headed.&lt;/p&gt;




&lt;h2&gt;
  
  
  SPARQ: 330x Energy Savings, With Caveats
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2603.14380" rel="noopener noreferrer"&gt;SPARQ&lt;/a&gt;, published on arXiv in March 2026, integrates quantization-aware training and reinforcement-learning-based early exit into a unified SNN framework.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;dynamically deciding spike propagation depth per input&lt;/strong&gt;. Easy inputs get classified at shallow layers; only hard inputs propagate to deeper layers. Close to what biological brains actually do.&lt;/p&gt;
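
&lt;p&gt;To make the control flow concrete, here's a toy sketch of the early-exit idea in plain Python. This is not SPARQ's training scheme or spiking dynamics, just the shape of per-input depth selection: attach a small classifier head to every layer and stop as soon as one is confident.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy sketch of input-dependent early exit (not SPARQ's actual algorithm).
import numpy as np

def early_exit_forward(x, layers, heads, threshold=0.9):
    """Propagate layer by layer; return (predicted class, depth used)."""
    h = x
    for depth, (layer, head) in enumerate(zip(layers, heads), start=1):
        h = layer(h)
        probs = head(h)
        if probs.max() &gt;= threshold:   # easy input: exit at a shallow layer
            return int(probs.argmax()), depth
    return int(probs.argmax()), depth  # hard input: used the full depth

# Tiny runnable demo with random weights (illustrative only)
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((8, 8)) for _ in range(4)]
layers = [lambda h, W=W: np.tanh(h @ W) for W in Ws]
heads = [lambda h: np.exp(h[:3]) / np.exp(h[:3]).sum() for _ in range(4)]
print(early_exit_forward(rng.standard_normal(8), layers, heads))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;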

&lt;p&gt;The numbers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SPARQ Benchmark Results — from paper Table 2/3]

MLP on MNIST:
  Baseline SNN: 95.00%    QSNN: 94.50%    SPARQ (QDSNN): 97.80%

LeNet-5 on MNIST:
  Baseline SNN: 97.76%    QSNN: 93.09%    SPARQ (QDSNN): 98.24%

AlexNet on CIFAR-10:
  Baseline SNN: 77.01%    QSNN: 74.30%    SPARQ (QDSNN): 78.00%

Energy consumption: SPARQ achieves 330x+ reduction vs baseline
Synaptic operations: 90%+ reduction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;330x energy savings. Looks stunning at first glance. But read carefully.&lt;/p&gt;

&lt;p&gt;The evaluated models are MLP, LeNet, AlexNet — MLP is a classic, LeNet is from 1998, AlexNet from 2012. Not even ResNet-50. Let alone billion-parameter Transformers. SPARQ's achievement is &lt;strong&gt;excellent optimization within the SNN paradigm, but it's not yet a story about replacing GPU-based Transformer inference&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One more thing: that 330x figure is relative to a baseline SNN, not a GPU. The SNN baseline itself hasn't been compared under identical conditions to GPU inference.&lt;/p&gt;




&lt;h2&gt;
  
  
  FPGA + RISC-V SoC: Neuromorphic You Can Actually Touch
&lt;/h2&gt;

&lt;p&gt;Another March 2026 paper, the &lt;a href="https://arxiv.org/abs/2603.18054" rel="noopener noreferrer"&gt;FPGA SNN study&lt;/a&gt;, takes a different approach.&lt;/p&gt;

&lt;p&gt;It's a SoC architecture integrating a RISC-V controller with an event-driven SNN core. Multipliers are replaced with bitwise operations (binary weights), using spike-timing-based temporal coding. Implemented on FPGA — hardware you can actually buy.&lt;/p&gt;
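
&lt;p&gt;"Multipliers replaced with bitwise operations" sounds abstract, so here's what it means at the smallest scale. With weights and activations constrained to ±1, a dot product collapses to XNOR plus popcount. A toy sketch, not the paper's implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy sketch: a ±1 dot product done with XNOR + popcount instead of multiplies.
import numpy as np

def binary_dot(a_bits, w_bits):
    """a_bits, w_bits: arrays of 0/1 encoding -1/+1. Returns the signed dot product."""
    n = a_bits.size
    matches = np.count_nonzero(~(a_bits ^ w_bits) &amp; 1)  # XNOR, then count the 1s
    return 2 * matches - n                               # map match count back to a ±1 sum

a = np.array([1, 0, 1, 1], dtype=np.uint8)  # encodes +1, -1, +1, +1
w = np.array([1, 1, 0, 1], dtype=np.uint8)  # encodes +1, +1, -1, +1
print(binary_dot(a, w))                      # -&gt; 0  (= +1 - 1 - 1 + 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;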

&lt;p&gt;This is where it gets interesting. Intel Loihi 2 and IBM NorthPole are research-institution-only chips. You can't just buy one. But FPGAs (Xilinx Artix-7, Intel Cyclone V) cost a few hundred dollars. RISC-V is open source. &lt;strong&gt;The path to running neuromorphic experiments at individual scale is opening up.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The paper validates on image classification tasks (MNIST/Fashion-MNIST), but the architectural design is general-purpose. Event-driven processing, binary weights, temporal coding — these are foundational technologies for ultra-low-power inference on edge devices.&lt;/p&gt;




&lt;h2&gt;
  
  
  Loihi 2 and Hala Point: Intel's Serious Bet, and the Quiet Slowdown
&lt;/h2&gt;

&lt;p&gt;Intel Labs has delivered &lt;strong&gt;Hala Point&lt;/strong&gt;, a massive neuromorphic system based on Loihi 2, to Sandia National Laboratories.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Hala Point Specs]

Processors:          1,152 × Loihi 2
Neurons:             1.15 billion
Synapses:            128 billion
Neuromorphic Cores:  140,544
Power Consumption:   Up to 2,600W
Form Factor:         6 rack units
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1.15 billion neurons. Roughly 1.3% of the human brain. Running at 2,600W. Compare that to an H100 at 700W TDP × thousands of GPUs in an AI cluster — the per-neuron power efficiency is orders of magnitude better.&lt;/p&gt;

&lt;p&gt;But let's be honest about something.&lt;/p&gt;

&lt;p&gt;Intel has over 200 neuromorphic research community partners, but &lt;strong&gt;no clear commercial product roadmap has been published&lt;/strong&gt;. Loihi 2 remains a research chip. Hala Point is a proof-of-concept system, not a product flowing through the market like NVIDIA's GPUs.&lt;/p&gt;

&lt;p&gt;Given that Intel hasn't officially announced a Loihi 3 tape-out, a future where neuromorphic immediately replaces GPUs isn't visible. Innatera demoing real-world neuromorphic edge AI at CES 2026 is encouraging, but that's an edge-specific story.&lt;/p&gt;




&lt;h2&gt;
  
  
  Spike Sparsity at 0.1 Gets You 3.6x; Above 0.5, You Lose
&lt;/h2&gt;

&lt;p&gt;Under what conditions can SNNs beat GPUs? A &lt;a href="https://cea.hal.science/cea-03852141" rel="noopener noreferrer"&gt;hardware-aware comparative study&lt;/a&gt; from CEA (the French Alternative Energies and Atomic Energy Commission) provides clear numbers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SNN vs ANN Energy Efficiency — Variation by Spike Sparsity]

Spike Sparsity (spikes/synapse/inference)
  0.1  → SNN is 3.6x more energy-efficient than ANN
  0.3  → SNN is 1.5x more energy-efficient than ANN
  0.5  → SNN ≈ ANN (roughly equivalent)
  0.7  → ANN is more energy-efficient
  1.0  → ANN wins by a wide margin

Conclusion: Lower spike sparsity favors SNN
           Above 0.5 spikes/synapse, SNN advantage disappears
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Spike sparsity of 0.1&lt;/strong&gt; — meaning only 10% of all synapses fire per inference — gets you 3.6x energy savings. This is a condition close to how biological brains actually operate.&lt;/p&gt;

&lt;p&gt;The problem: achieving this level of sparsity reliably with current SNN training algorithms is hard. SPARQ's early exit approach is attacking this, but large-scale model validation is still ahead.&lt;/p&gt;

&lt;p&gt;There's an even more interesting data point. Knight &amp;amp; Nowotny (2018)'s &lt;a href="https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2018.00941" rel="noopener noreferrer"&gt;benchmark study in Frontiers in Neuroscience&lt;/a&gt; showed that &lt;strong&gt;running SNN simulations on a GPU was 14x more energy-efficient than SpiNNaker, a dedicated neuromorphic chip&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Ironic. The SNN that was supposed to run on neuromorphic hardware turns out to be more efficient on a GPU. Hardware maturity gaps are eating the architectural advantage alive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the GPU Won't Die: Software Ecosystem Inertia
&lt;/h2&gt;

&lt;p&gt;Technical potential alone doesn't win. Look at how massive the CUDA ecosystem is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Software Ecosystem Comparison — March 2026]

                    GPU (CUDA)            SNN (Neuromorphic)
─────────────────────────────────────────────────────
Major Frameworks:   PyTorch, TF,          Lava (Intel), Norse,
                    llama.cpp, vLLM       snnTorch, SpikingJelly
GitHub Stars:       ~98K (PyTorch)        ~2K (snnTorch)
Commercial HW:      RTX/A100/H100 etc.    Loihi 2 (research),
                    Buy today             Innatera (CES 2026 demo)
Programming         Medium                High (spike encoding,
Difficulty:         (Python + CUDA)       timing design required)
Pretrained Models:  HuggingFace 1M+       Hundreds (research)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PyTorch's 98K stars vs snnTorch's 2K stars. That 50x gap is a developer community gap, a bug-fix velocity gap, a StackOverflow answer count gap.&lt;/p&gt;

&lt;p&gt;llama.cpp ships releases every two weeks, improving performance on the same RTX 4060 for free. No SNN framework matches that development velocity.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Left at Individual Scale
&lt;/h2&gt;

&lt;p&gt;Datacenter power problems (H100 at 700W × thousands of units) are where SNN's energy efficiency matters. Acknowledged.&lt;/p&gt;

&lt;p&gt;But at individual scale with an RTX 4060 at 95W, power isn't the bottleneck. One wall outlet covers it.&lt;/p&gt;

&lt;p&gt;Where SNNs matter for individuals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Always-on edge inference&lt;/strong&gt; — 24/7 inference on battery-powered devices. Wearables, IoT sensors, robotic vision processing. SNN could own this space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FPGA experimentation&lt;/strong&gt; — The era of running neuromorphic experiments on a few-hundred-dollar FPGA board is arriving. RISC-V + SNN SoC is realistic for education and research&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ultra-low-latency processing&lt;/strong&gt; — Event-driven by nature, processing fires only when input arrives. Fundamentally lower latency than frame-based GPU processing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Conversely, LLM inference — pushing massive parameters at high throughput — is GPU territory. Transformer attention is dense matrix math, and it's a bad match for sparse-firing SNNs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At least with current algorithms&lt;/strong&gt;, there's no incentive to port LLM inference to SNNs. The possibility of sparse inference and SNN convergence in the future isn't zero, but that's a next-generation story.&lt;/p&gt;




&lt;h2&gt;
  
  
  SNNs Won't Kill the GPU — But They'll Take the Seat Next to It
&lt;/h2&gt;

&lt;p&gt;Time for an answer. Can SNNs kill the GPU? &lt;strong&gt;No. But they'll coexist.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPUs remain the kings of dense matrix computation. LLM inference, image generation, large-scale training — these are GPU territory. Running Qwen3.5 at 33 tok/s in 8GB VRAM on an RTX 4060 isn't something SNNs can replace.&lt;/p&gt;

&lt;p&gt;Where SNNs win is the edge. Battery-powered, always-on, ultra-low-latency. Sensor fusion, anomaly detection, robotic control. SPARQ's 330x energy savings means something in this context.&lt;/p&gt;

&lt;p&gt;Looking at Intel's quiet roadmap and Innatera's entry at CES 2026, neuromorphic computing is transitioning from research phase to edge deployment phase. Encroachment on general-purpose computing is still 5+ years out.&lt;/p&gt;

&lt;p&gt;If there's one thing worth doing as an individual engineer right now — grab an FPGA board and play with snnTorch. A few hundred dollars gets you to the doorstep of the next computing paradigm. You don't have to give up your GPU. Keep both.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2603.14380" rel="noopener noreferrer"&gt;SPARQ: Spiking Early-Exit Neural Networks for Energy-Efficient Edge AI&lt;/a&gt; — SNN + quantization + early exit integrated framework&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2603.18054" rel="noopener noreferrer"&gt;An FPGA-Based SoC Architecture with a RISC-V Controller for Energy-Efficient Temporal-Coding SNNs&lt;/a&gt; — Neuromorphic SoC accessible to individuals&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2602.13261" rel="noopener noreferrer"&gt;A feedback control optimizer for online and hardware-aware training of SNNs&lt;/a&gt; — Hardware-aware learning for neuromorphic devices&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2018.00941" rel="noopener noreferrer"&gt;GPUs Outperform Current HPC and Neuromorphic Solutions in Terms of Speed and Energy When Simulating a Highly Connected Cortical Model&lt;/a&gt; — Knight &amp;amp; Nowotny (2018), GPU vs SpiNNaker energy efficiency comparison&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cea.hal.science/cea-03852141" rel="noopener noreferrer"&gt;Are SNNs really more energy-efficient than ANNs?&lt;/a&gt; — CEA's hardware-aware comparative study&lt;/li&gt;
&lt;li&gt;Innatera CES 2026 Demo (PR Newswire, 2026-01) — Real-world neuromorphic edge AI&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>performance</category>
    </item>
    <item>
      <title>"Just Add More VRAM" Is Physically Wrong: The Corners HBM, CXL, and Unified Memory Couldn't Take</title>
      <dc:creator>plasmon</dc:creator>
      <pubDate>Tue, 14 Apr 2026 09:53:58 +0000</pubDate>
      <link>https://dev.to/plasmon_imp/vramwozeng-yasebajie-jue-suru-hawu-li-de-nijian-wei-tuteiru-hbmcxlunified-memorygaqu-renakatutamono-14ha</link>
      <guid>https://dev.to/plasmon_imp/vramwozeng-yasebajie-jue-suru-hawu-li-de-nijian-wei-tuteiru-hbmcxlunified-memorygaqu-renakatutamono-14ha</guid>
      <description>&lt;h1&gt;
  
  
  "Just Add More VRAM" Is Physically Wrong: The Corners HBM, CXL, and Unified Memory Couldn't Take
&lt;/h1&gt;

&lt;p&gt;Increase HBM sixfold and the model size you can load only doubles. Double the RTX 5060's VRAM to 16GB and a 70B still doesn't fit in full. "If VRAM is short, just add more" ignores the physical trade-off between bandwidth, capacity, and cost.&lt;/p&gt;

&lt;p&gt;HBM, CXL, Unified Memory. These are three different approaches to the VRAM wall. Where each one sits on the triangle of bandwidth, capacity, and cost fundamentally changes LLM inference performance.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Memory Triangle: Bandwidth, Capacity, Cost
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;技術              帯域          容量       コスト/GB    インターフェース
────────────────────────────────────────────────────────────────────
HBM3E (H200)     4,800 GB/s    141 GB     $10-15       TSV 1024-bit × 6 stacks
GDDR6 (RTX4060)    272 GB/s      8 GB     $2.5-4       128-bit, 17 Gbps
CXL 3.1             64 GB/s*    TB級      $3-5         PCIe 6.0 x16
Unified (M4 Max)   546 GB/s    128 GB     Apple依存    LPDDR5X 512-bit

* per direction (128 GB/s bidirectional)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The physical character of each technology:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HBM3E&lt;/strong&gt;: Dies stacked vertically with through-silicon vias (TSVs). Overwhelming bandwidth, but it eats interposer area and cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GDDR6&lt;/strong&gt;: Soldered onto the board. Cheap, and the GPU has it all to itself, but capacity is limited&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CXL 3.1&lt;/strong&gt;: Reuses existing PCIe infrastructure. TB-class capacity, but read bandwidth is 1/75 of HBM3E&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Memory&lt;/strong&gt;: CPU/GPU/NPU share the same memory pool. Zero copy cost, but shared bandwidth means contention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;HBM chose bandwidth, CXL chose capacity, Unified Memory chose balance. None of them gets all three corners of the triangle.&lt;/p&gt;




&lt;h2&gt;
  
  
  HBM: King of Bandwidth, Slave to Capacity
&lt;/h2&gt;

&lt;p&gt;HBM's bandwidth comes from a 1024-bit-wide bus carried over roughly 5,000+ through-silicon vias (TSVs). The H200 carries 6 stacks for a combined 6144-bit bus width, delivering 4.8 TB/s. That's 18x GDDR.&lt;/p&gt;

&lt;p&gt;But capacity has a physical ceiling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HBM3E スタック構成:
  1ダイ = 24 Gbit (3GB)
  8-Hi (8枚積層) = 24 GB/stack
  12-Hi (次世代)  = 36 GB/stack
  H200: 6 stacks × 24GB = 144 GB raw (公称141 GB)
  スタック単価: $240-360推定 ($10-15/GB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Why not just add more stacks?" This is where area becomes the problem. Each HBM stack occupies ~100 mm² on the interposer. GPU die (~800 mm²) + 6 stacks (~600 mm²) = ~1,400 mm². The current CoWoS-S limit is about 2,831 mm² (3.3x reticle), so the H200 still has headroom, but a bigger interposer directly worsens cost and yield.&lt;/p&gt;

&lt;h3&gt;
  
  
  Impact on LLM Inference
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU                     最大モデル (Q4)   帯域       備考
─────────────────────────────────────────────────────────────
H200 (141GB HBM3E)     ~280B             4,800 GB/s  70B FP16だとKV cache余裕1GB
RTX 4060 (8GB GDDR6)   ~13B              272 GB/s    13B以上はCPUオフロード必須
RTX 5060 (16GB GDDR7)  ~30B              448 GB/s    容量2倍でもモデルは2倍にならない
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Doubling VRAM does not double the model you can load. The KV cache gets in the way. For a 70B FP16 model at 32K context, the KV cache is roughly 8GB. Your VRAM "headroom" gets eaten by the KV cache.&lt;/p&gt;




&lt;h2&gt;
  
  
  CXL: Capacity Unlocked, Bandwidth Sacrificed
&lt;/h2&gt;

&lt;p&gt;CXL (Compute Express Link) is a memory expansion protocol built on top of the PCIe physical layer.&lt;/p&gt;

&lt;p&gt;CXL 3.1 rides on the PCIe 6.0 physical layer and provides 64 GB/s per direction (x16 lanes). Latency is 170-400 ns (2-4x local DDR5). Capacity is theoretically unlimited via memory pooling, but for now it's a server/datacenter technology.&lt;/p&gt;

&lt;p&gt;What happens if you run LLM inference over CXL bandwidth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;モデル                 CXL (64 GB/s)    GDDR6 (272 GB/s)    HBM3E (4,800 GB/s)
──────────────────────────────────────────────────────────────────────────
7B Q4_K_M (4.7GB)     ~13.6 t/s        ~32 t/s (実効)       ~1021 t/s (理論)
70B Q4_K_M (40GB)     1.6 t/s          N/A (載らない)       ~120 t/s (理論)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reading 70B Q4 weights out of CXL gets you 1.6 t/s. About the speed a human reads.&lt;/p&gt;
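
&lt;p&gt;The table above is just memory-bound arithmetic: to generate one token, the full set of quantized weights has to stream through the memory interface once, so tokens/s ≈ bandwidth ÷ model size. A small sketch reproducing the figures; the 58% GDDR6 efficiency factor is an assumption taken from the effective numbers used elsewhere in this article.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Memory-bound estimate: tokens/s ≈ effective bandwidth / bytes read per token.
def tok_per_s(bandwidth_gb_s, model_gb, efficiency=1.0):
    return bandwidth_gb_s * efficiency / model_gb

print(f"7B Q4 over CXL (64 GB/s):      {tok_per_s(64, 4.7):.1f} t/s")
print(f"70B Q4 over CXL (64 GB/s):     {tok_per_s(64, 40):.1f} t/s")
print(f"7B Q4 over GDDR6 (272 GB/s):   {tok_per_s(272, 4.7, efficiency=0.58):.1f} t/s")
print(f"70B Q4 over HBM3E (4800 GB/s): {tok_per_s(4800, 40):.0f} t/s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;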

&lt;p&gt;But CXL's real value is not as a place to store weights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;階層型メモリアーキテクチャ:

Tier    メモリ         用途                       容量       帯域        レイテンシ
────────────────────────────────────────────────────────────────────────────
 0      GPU SRAM       アクティベーション          24 MB     ~4 TB/s     ~1 ns
 1      HBM/GDDR      重み、アクティブKVキャッシュ 8-141 GB  272-4800    ~10 ns
 2      CXL Memory     KVキャッシュのオーバーフロー TB級      64 GB/s     170-400 ns
 3      NVMe SSD       永続ストレージ              TB級      7 GB/s      ~10,000 ns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CXL's essence is not "a VRAM substitute" but "a new tier that fills the gap between VRAM and NVMe". Evict the older KV-cache tokens (the early part of a 128K context) to CXL memory and you can keep only the recent attention window in VRAM. As a KV-cache overflow target it's 9x faster than NVMe.&lt;/p&gt;

&lt;p&gt;This tiering is orthogonal to optimizations like optical memory readout (physically reducing how much KV cache has to move) and KV-cache quantization (numerically shrinking the data). They can be combined.&lt;/p&gt;




&lt;h2&gt;
  
  
  Unified Memory: The Balance Trap
&lt;/h2&gt;

&lt;p&gt;Apple Silicon's Unified Memory has the CPU, GPU, and NPU share the same physical memory pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;チップ          容量      帯域         バス幅     共有先
───────────────────────────────────────────────────────────────────
M4 Max         128 GB    546 GB/s     512-bit    CPU 12コア + GPU 40コア + NPU 16コア + メディアエンジン
M4 (base)      16-32 GB  120 GB/s     128-bit    同上（RTX 4060の272 GB/sの半分以下）
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Reality for LLM Inference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;M4 Max 128GB&lt;/strong&gt;: A 70B Q4_K_M (40GB) fits entirely with no memory juggling. The theoretical ceiling is 546/40 = 13.7 t/s, but measured speed is 8-10 t/s. Bandwidth shared with CPU/NPU/IO is the bottleneck&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;M4 32GB&lt;/strong&gt;: A 32B Q4 gives a theoretical 120/18 = 6.7 t/s and a measured 4-5 t/s. The RTX 4060 monopolizes its 272 GB/s of GDDR6 and pushes the same model to 10.8 t/s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bandwidth-sharing problem is structural. Even while the GPU is running inference, the CPU is hitting the same memory and competing for bandwidth. macOS memory management and UI rendering burn bandwidth in the background. Run inference while Safari has a heavy page open and you can feel the slowdown.&lt;/p&gt;

&lt;p&gt;Unified Memory's advantage is eliminating GPU memory management. No CUDA cudaMalloc/cudaMemcpy. The data is already there. Zero copy cost.&lt;/p&gt;

&lt;p&gt;But bandwidth is a shared resource; you can't monopolize it. On the RTX 4060, the GPU effectively has its 272 GB/s of GDDR6 to itself. On the base M4, the whole system splits 120 GB/s:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU                      総帯域     GPU占有率              LLM実効帯域   推論速度
─────────────────────────────────────────────────────────────────────────────────
RTX 4060 (8GB GDDR6)     272 GB/s   ~95% (DP出力程度)      ~258 GB/s    7B Q4: 32 t/s (実効率58%)
M4 Max (128GB LPDDR5X)   546 GB/s   大半 (CPU/NPU/IOと競合) ~400 GB/s    70B Q4: 8-10 t/s
M4 base (16GB LPDDR5X)   120 GB/s   システム全体と共有      ~78 GB/s     7B Q4: 14-16 t/s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The RTX 4060 has less bandwidth but owns it outright, so it's fastest for small models. The M4 Max has more bandwidth but shares it: it can hold big models, at lower efficiency per unit of bandwidth. The base M4 is middling on both bandwidth and capacity, and loses to the RTX 4060 for LLM work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparing the Three Approaches
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                帯域         容量        コスト     LLM推論での位置
─────────────────────────────────────────────────────────────────
HBM3E          4,800 GB/s    141 GB     $10-15/GB   重み+KVを高速に読む
GDDR6         272 GB/s      8-24 GB    $2.5-4/GB   小モデルを高速に回す
CXL 3.1        64 GB/s       TB級       $3-5/GB     KVキャッシュのオーバーフロー先
Unified (Max)  546 GB/s      128 GB     Apple依存   大モデルをゼロコピーで載せる
NVMe SSD       7 GB/s        TB級       $0.1/GB     モデルの永続ストレージ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HBM (H100/H200)&lt;/strong&gt;: Batch inference, many concurrent requests. Bandwidth is shared across requests, so per-request cost efficiency is high. For a single request, though, most of the 700W TDP is wasted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GDDR (RTX 4060/5060)&lt;/strong&gt;: Personal use, single requests, small-to-mid models. Exclusive GPU bandwidth means maximum efficiency. 32 t/s at 115W TDP (0.28 t/s/W) beats a single-request H100 (700W) on power efficiency for small models. But the capacity wall is real: 8GB tops out around 7B&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CXL&lt;/strong&gt;: Ultra-long-context inference (128K+) and shared memory pools. Relieves VRAM pressure when the KV cache balloons to tens of GB. But bandwidth is 1/75 of HBM3E, far too slow to hold weights. Servers in 2025-26; consumers 2028 or later&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Memory (Apple)&lt;/strong&gt;: Development and experimentation where you want big models running with no fuss. A 70B Q4 just works, no memory juggling. But shared bandwidth makes it less speed-efficient than exclusive GDDR, and it coexists poorly with gaming workloads&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Practical Takeaways for 8GB VRAM Users
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Quantization (works today)&lt;/strong&gt;&lt;br&gt;
Q4_K_M quantization shrinks 7B weights from 14GB → 4.7GB (3x capacity efficiency). Supported out of the box in llama.cpp/Ollama.&lt;/p&gt;
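
&lt;p&gt;If all you have is an FP16 GGUF, the conversion is a single command. A sketch; the file names are examples, and recent llama.cpp builds ship the tool as &lt;code&gt;llama-quantize&lt;/code&gt; (older builds call it &lt;code&gt;quantize&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# FP16 GGUF -&gt; Q4_K_M (file names are examples)
./build/bin/llama-quantize qwen2.5-7b-instruct-f16.gguf qwen2.5-7b-instruct-q4_k_m.gguf Q4_K_M

# Or skip the conversion and download a pre-quantized Q4_K_M GGUF from Hugging Face
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;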

&lt;p&gt;&lt;strong&gt;Layer 2: KV cache quantization (experimental)&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;--cache-type-k q4_0 --cache-type-v q8_0&lt;/code&gt; compresses the KV cache to about a third of FP16. The key to long contexts. I verified the details in "&lt;a href="https://qiita.com/plasmon/items/44baacc8c2459dcd31ed" rel="noopener noreferrer"&gt;Dropping the KV cache to Q4 fit a 32K context into 8GB&lt;/a&gt;".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: CPU offload (bandwidth trade-off)&lt;/strong&gt;&lt;br&gt;
Put part of the model on the GPU with &lt;code&gt;--n-gpu-layers&lt;/code&gt; and a 32B model runs (slowly, but it runs). On an RTX 4060, a 32B at the optimal offload hits 10.8 t/s. The bottleneck is the CPU↔GPU PCIe 4.0 x8 link at 16 GB/s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: CXL (future)&lt;/strong&gt;&lt;br&gt;
CXL memory modules add memory over PCIe and can serve as Tier 2 storage for the KV cache. Consumer availability is 2028 or later. The principle resembles today's CPU offload (PCIe, 16 GB/s), but CXL differentiates itself with memory semantics: load/store access and direct GPU addressing.&lt;/p&gt;

&lt;p&gt;What you can do today is combine Layers 1-3. Q4 quantization + Q4 KV cache + optimal GPU offload = a 32B model with a 32K context running in 8GB. When CXL arrives as Layer 4, 128K+ contexts become realistic.&lt;/p&gt;
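
&lt;p&gt;As a concrete sketch of that Layer 1-3 combination on one command line: the model path and the &lt;code&gt;-ngl&lt;/code&gt; value are examples, lower &lt;code&gt;-ngl&lt;/code&gt; if the quantized KV cache pushes VRAM over the limit, and note that recent llama.cpp builds require flash attention (&lt;code&gt;-fa&lt;/code&gt;) for a quantized V cache.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Layer 1 (Q4_K_M weights) + Layer 2 (quantized KV cache) + Layer 3 (partial GPU offload)
./llama-server \
  -m ~/models/qwen2.5-32b-instruct-q4_k_m.gguf \
  -ngl 25 \
  -c 32768 \
  -fa \
  --cache-type-k q4_0 \
  --cache-type-v q8_0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;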

&lt;p&gt;Worth noting: the "extra memory" CXL promises travels over essentially the same PCIe bus as today's CPU offload. The bandwidth ceiling is the same. CXL's advantage is memory semantics (load/store access, direct GPU addressing), not more bandwidth.&lt;/p&gt;




&lt;h2&gt;
  
  
  Physics Decides the Future of Memory
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Does adding VRAM solve the problem?"&lt;/strong&gt; No. Add capacity and you sacrifice bandwidth or cost.&lt;/p&gt;

&lt;p&gt;The bandwidth-capacity-cost triangle is governed by physics, and no technology takes all three corners. HBM took bandwidth and gave up capacity and cost. CXL took capacity and gave up bandwidth. Unified Memory took balance and gave up exclusive bandwidth. GDDR took exclusive bandwidth and gave up capacity.&lt;/p&gt;

&lt;p&gt;The optimal answer for LLM inference is not picking one technology; it's layering several of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The best you can do on an RTX 4060 today:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weights → VRAM (Q4 quantization fits 7-13B entirely)&lt;/li&gt;
&lt;li&gt;KV cache → VRAM (Q4/Q8 quantization to save capacity)&lt;/li&gt;
&lt;li&gt;Overflow layers → RAM (CPU offload over PCIe bandwidth)&lt;/li&gt;
&lt;li&gt;Persistent storage → NVMe SSD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The best you'll be able to do on a future CXL-equipped consumer PC:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weights → VRAM (Q4 quantization)&lt;/li&gt;
&lt;li&gt;Active KV → VRAM&lt;/li&gt;
&lt;li&gt;Old KV → CXL memory (64 GB/s is fast enough for this)&lt;/li&gt;
&lt;li&gt;Persistent storage → NVMe SSD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The memory wall is not something you break through; it is something you route around with a hierarchy.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;CXL Consortium — "Compute Express Link Specification 3.1" (2024)&lt;/li&gt;
&lt;li&gt;Samsung — "CMM-D: CXL Memory Module for Data Centers" (2024)&lt;/li&gt;
&lt;li&gt;SK hynix — HBM3E specifications, 12-Hi stack architecture&lt;/li&gt;
&lt;li&gt;NVIDIA H200 SXM specifications — 141GB HBM3E, 4.8 TB/s&lt;/li&gt;
&lt;li&gt;Apple M4 Max specifications — 128GB Unified Memory, 546 GB/s&lt;/li&gt;
&lt;li&gt;"Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023) &lt;a href="https://arxiv.org/abs/2309.06180" rel="noopener noreferrer"&gt;arXiv:2309.06180&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>llm</category>
      <category>gpu</category>
      <category>vram</category>
    </item>
    <item>
      <title>llama.cpp Settings Can Change 8GB Performance by 5x: Optimal Values for the Key Options</title>
      <dc:creator>plasmon</dc:creator>
      <pubDate>Tue, 14 Apr 2026 09:53:55 +0000</pubDate>
      <link>https://dev.to/plasmon_imp/llamacppnoshe-ding-de8gbnoxing-neng-ga5bei-bian-waru-zhu-yao-opusiyonnozui-shi-zhi-wochu-sita-3fgp</link>
      <guid>https://dev.to/plasmon_imp/llamacppnoshe-ding-de8gbnoxing-neng-ga5bei-bian-waru-zhu-yao-opusiyonnozui-shi-zhi-wochu-sita-3fgp</guid>
      <description>&lt;h1&gt;
  
  
  llama.cpp Settings Can Change 8GB Performance by 5x: Optimal Values for the Key Options
&lt;/h1&gt;

&lt;p&gt;llama.cpp has more than 50 launch options. Most of them are fine at their defaults. But on 8GB of VRAM, misconfiguring five of them can cut inference speed in half.&lt;/p&gt;

&lt;p&gt;What follows is a configuration guide based on estimates for an RTX 4060 8GB (GDDR6, 272 GB/s), derived from public benchmarks, official documentation, and theoretical VRAM calculations. Numbers will vary in your environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Most Important: &lt;code&gt;-ngl&lt;/code&gt; (Number of GPU Layers)
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;-ngl&lt;/code&gt; sets how many Transformer layers are placed in GPU VRAM. The default is 0 (all layers on the CPU, the slowest case). Passing 999 puts every layer on the GPU (the fastest, if it fits in VRAM). Total layer counts per model: Qwen2.5-7B = 28, Llama-3-8B = 32, Qwen2.5-32B = 64.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimal Values at 8GB VRAM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model                              -ngl   VRAM used   Speed      Notes
────────────────────────────────────────────────────────────────────────────
Qwen2.5-7B Q4_K_M (4.7GB)          999    ~5.4 GB    ~32 t/s    All 28 layers on GPU
Mistral-Nemo-12B Q4_K_M (7.2GB)    999    ~7.5 GB    ~20 t/s    KV may OOM; -c 2048 recommended
Qwen2.5-32B Q4_K_M (18.5GB)         25    ~7.4 GB    ~10.8 t/s  25 of 64 layers on GPU, rest on CPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Changing &lt;code&gt;-ngl&lt;/code&gt; by just 1 shifts speed by a few percent. The optimal value is the one that uses VRAM right up to the limit.&lt;/p&gt;

&lt;p&gt;How to find the optimum (binary search; a scripted version follows the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Launch with &lt;code&gt;-ngl 999&lt;/code&gt;. If it OOMs, go to the next step&lt;/li&gt;
&lt;li&gt;Launch with &lt;code&gt;-ngl {total layers / 2}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;No OOM: increase it. OOM: decrease it&lt;/li&gt;
&lt;li&gt;The optimum is where VRAM usage settles at 7.0-7.5GB&lt;/li&gt;
&lt;/ol&gt;
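
&lt;p&gt;The same search, scripted (a sketch: it drives &lt;code&gt;llama-bench&lt;/code&gt; with short runs and assumes the process exits non-zero when GPU allocation fails, which you should verify on your build; the model path and layer count are examples).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Binary search for the largest -ngl that does not OOM (sketch).
import subprocess

MODEL = "qwen2.5-32b-instruct-q4_k_m.gguf"  # example path
TOTAL_LAYERS = 64                           # Qwen2.5-32B

def fits(ngl):
    """Short llama-bench run; True if it did not fail (e.g. CUDA OOM)."""
    r = subprocess.run(
        ["llama-bench", "-m", MODEL, "-ngl", str(ngl), "-p", "64", "-n", "16"],
        capture_output=True,
    )
    return r.returncode == 0

if fits(TOTAL_LAYERS):
    best = TOTAL_LAYERS
else:
    lo, hi = 0, TOTAL_LAYERS      # invariant: lo fits, hi does not
    while hi - lo &gt; 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    best = lo
print("optimal -ngl:", best)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;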

&lt;p&gt;On an RTX 4060 8GB, about 0.5GB goes to the CUDA context and framework, leaving roughly 7.5GB usable for the model. The reliable way to tune is to watch VRAM usage with &lt;code&gt;nvidia-smi&lt;/code&gt; while adjusting. Above about 7.8GB there is a real risk of OOM during inference.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;-c&lt;/code&gt; (Context Length)
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;-c&lt;/code&gt; is the maximum number of tokens the model can attend to during inference. The default is 4096 (as of llama.cpp build b8233). It translates directly into KV-cache VRAM consumption.&lt;/p&gt;

&lt;p&gt;KV-cache formula: &lt;code&gt;KV cache = 2 × n_layers × n_kv_heads × head_dim × context_len × dtype_bytes&lt;/code&gt;&lt;/p&gt;
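
&lt;p&gt;The formula is easy to turn into a calculator that reproduces the table below (a sketch; &lt;code&gt;head_dim=128&lt;/code&gt; is assumed for the Qwen2.5 family, since it is not listed in the table itself).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# KV-cache size straight from the formula above, in GiB.
def kv_cache_gib(n_layers, n_kv_heads, context_len, dtype_bytes=2, head_dim=128):
    size_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * dtype_bytes
    return size_bytes / 2**30

for ctx in (4096, 8192, 32768, 131072):
    qwen_7b = kv_cache_gib(28, 4, ctx)    # Qwen2.5-7B: 28 layers, 4 KV heads
    qwen_32b = kv_cache_gib(64, 8, ctx)   # Qwen2.5-32B: 64 layers, 8 KV heads
    print(f"{ctx:7} tokens: 7B {qwen_7b:.2f} GiB   32B {qwen_32b:.2f} GiB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;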

&lt;h3&gt;
  
  
  KV-Cache VRAM Consumption (FP16)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context length   Qwen2.5-7B (28 layers, 4 KV heads)   Qwen2.5-32B (64 layers, 8 KV heads)
─────────────────────────────────────────────────────────────────────────────────
4,096 tokens     0.22 GB                               1.00 GB
8,192 tokens     0.44 GB                               2.00 GB
32,768 tokens    1.75 GB                               8.00 GB
131,072 tokens   7.00 GB                               —
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With partial offload via &lt;code&gt;-ngl&lt;/code&gt;, the KV cache is also split layer by layer between CPU and GPU. At &lt;code&gt;-ngl 25&lt;/code&gt;, the GPU-resident KV is 25/64 of the values above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendations at 8GB VRAM:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;7B models: &lt;code&gt;-c 8192&lt;/code&gt; (KV 0.44GB, safe) or &lt;code&gt;-c 32768&lt;/code&gt; (KV 1.75GB, flash-attn recommended)&lt;/li&gt;
&lt;li&gt;32B models (-ngl 25): &lt;code&gt;-c 4096&lt;/code&gt; (GPU-resident KV ~0.39GB); anything larger requires KV quantization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Doubling the context length doubles the KV cache's VRAM. At 8GB, the &lt;code&gt;-c&lt;/code&gt; setting directly determines how large a model you can load.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;--cache-type-k&lt;/code&gt; / &lt;code&gt;--cache-type-v&lt;/code&gt; (KV-Cache Quantization)
&lt;/h2&gt;

&lt;p&gt;Quantization options: &lt;code&gt;f16&lt;/code&gt; (default, 2 bytes/element), &lt;code&gt;q8_0&lt;/code&gt; (1 byte, half the VRAM), &lt;code&gt;q4_0&lt;/code&gt; (0.5 bytes, a quarter of the VRAM).&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommended Combinations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Profile                  K cache    V cache    VRAM ratio   Quality loss
──────────────────────────────────────────────────────────────────
Quality first            f16        f16        1x           None
Balanced (recommended)   q8_0       q8_0       0.5x         Nearly none (general tasks)
Capacity first           q4_0       q8_0       0.375x       Degrades on math/reasoning*
Maximum compression      q4_0       q4_0       0.25x        Noticeable; worse at long context

* The V cache is more sensitive to quantization than the K cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Worked Example: Qwen2.5-32B + -ngl 25 + 8K Context on 8GB
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;-ngl 25&lt;/code&gt;, 25 of the 64 layers sit on the GPU, and so does 25/64 of the KV cache. The arithmetic (reproduced as a small script after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total KV (f16, 8K): 2.00 GB → on GPU: 2.00 × 25/64 = 0.78 GB&lt;/li&gt;
&lt;li&gt;GPU total (f16): weights 7.4 + KV 0.78 + overhead 0.3 = &lt;strong&gt;8.48 GB → OOM&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;With KV q8_0: 0.78 × 0.5 = 0.39 GB → 7.4 + 0.39 + 0.3 = &lt;strong&gt;8.09 GB → runs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;32K context (f16): GPU-resident KV = 3.13 GB → impossible. Even q4_0 gives 8.48 GB → borderline&lt;/li&gt;
&lt;/ul&gt;
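
&lt;p&gt;The same arithmetic as a reusable check (a sketch: the 7.4 GB weights-on-GPU figure and the 0.3 GB overhead are the estimates used above, and the KV total comes from the formula in the previous section).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# GPU-side VRAM budget for partial offload (sketch).
KV_BYTES = {"f16": 2.0, "q8_0": 1.0, "q4_0": 0.5}   # bytes per KV element

def gpu_kv_gib(kv_total_f16_gib, ngl, total_layers, cache_type="f16"):
    """KV resident on GPU: layer fraction times the per-dtype size ratio."""
    return kv_total_f16_gib * (ngl / total_layers) * (KV_BYTES[cache_type] / 2.0)

def gpu_total_gib(weights_on_gpu_gib, kv_total_f16_gib, ngl, total_layers,
                  cache_type="f16", overhead_gib=0.3):
    kv = gpu_kv_gib(kv_total_f16_gib, ngl, total_layers, cache_type)
    return weights_on_gpu_gib + kv + overhead_gib

# Qwen2.5-32B, -ngl 25 of 64, 8K context (KV total 2.00 GB at f16)
print(gpu_total_gib(7.4, 2.00, 25, 64, "f16"))    # ~8.48 GB, OOM
print(gpu_total_gib(7.4, 2.00, 25, 64, "q8_0"))   # ~8.09 GB, runs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;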

&lt;p&gt;Launch command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; model.gguf &lt;span class="nt"&gt;-ngl&lt;/span&gt; 25 &lt;span class="nt"&gt;-c&lt;/span&gt; 8192 &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; q8_0 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; q8_0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;code&gt;--flash-attn&lt;/code&gt; (Flash Attention)
&lt;/h2&gt;

&lt;p&gt;Flash Attention is a memory-efficient attention algorithm. It removes the intermediate attention buffers, saving a few hundred MB, and speeds up long-context runs (about 10% at 32K). Below 4K tokens the effect is small. Requirements: the CUDA backend and an RTX 20xx or newer. It can be combined with KV-cache quantization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Setting (Qwen2.5-7B Q4_K_M)    Speed         VRAM       Delta
──────────────────────────────────────────────────────────────
-c 8192, flash-attn OFF        31.8 t/s      5.6 GB     —
-c 8192, flash-attn ON         32.1 t/s      5.3 GB     +1%, -0.3 GB
-c 32768, flash-attn OFF       28.5 t/s      7.2 GB     —
-c 32768, flash-attn ON        31.5 t/s      6.5 GB     +10.5%, -0.7 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--flash-attn&lt;/code&gt; has no downside. It should always be enabled.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;-b&lt;/code&gt; (Batch Size) and &lt;code&gt;-t&lt;/code&gt; (Thread Count)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;-b&lt;/code&gt; (batch size)&lt;/strong&gt;: the number of tokens processed at once during prompt evaluation. Default 2048. On 8GB, 512 is recommended, because a large batch risks OOM from the VRAM spike during prompt eval. &lt;code&gt;-ub&lt;/code&gt; (micro batch) defaults to 512 and does not need changing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;-t&lt;/code&gt; (threads)&lt;/strong&gt;: the number of threads used for CPU compute. The default is all cores. The recommendation is the physical core count, without hyper-threading: HT's logical threads only fight over memory bandwidth. Example: on an i7-13700H, use &lt;code&gt;-t 6&lt;/code&gt; (its six P-cores).&lt;/p&gt;
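
&lt;p&gt;If you want to pass the physical core count automatically, something like the following works (a sketch using the third-party &lt;code&gt;psutil&lt;/code&gt; package; on a hybrid CPU it still counts E-cores, so cap the value at the P-core count if you want to match the recommendation above).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
import psutil  # third-party: pip install psutil

# Physical core count excludes hyper-threading siblings, but on hybrid
# CPUs (P + E cores) it still includes E-cores.
threads = psutil.cpu_count(logical=False)

subprocess.run([
    "llama-server", "-m", "qwen2.5-7b-instruct-q4_k_m.gguf",
    "-ngl", "999", "-c", "8192", "--flash-attn",
    "-t", str(threads),
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;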

&lt;h3&gt;
  
  
  Effect of Thread Count (Qwen2.5-32B Q4_K_M, -ngl 25)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-t setting                 Speed
──────────────────────────────
-t 6  (P-cores)            10.8 t/s
-t 8  (P + E cores)        10.5 t/s
-t 14 (all physical, P+E)   9.8 t/s
-t 20 (all threads, HT)     9.2 t/s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The intuition that more threads means more speed is wrong here. HT's logical threads share L1/L2 cache and memory bandwidth, so in LLM inference they are pure overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  Server Options (&lt;code&gt;llama-server&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Base command: &lt;code&gt;llama-server -m model.gguf -ngl 999 -c 4096 --host 0.0.0.0 --port 8080&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Recommended extra options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--flash-attn&lt;/code&gt;: better memory efficiency (always on)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--metrics&lt;/code&gt;: exposes Prometheus-format metrics (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--parallel 1&lt;/code&gt;: number of concurrent requests (1 recommended at 8GB)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--cont-batching&lt;/code&gt;: continuous batching (useful with &lt;code&gt;--parallel 2&lt;/code&gt; or higher)&lt;/li&gt;
&lt;/ul&gt;
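
&lt;p&gt;A quick way to confirm the server is up and scrape those metrics from Python (a sketch: it assumes the standard &lt;code&gt;/health&lt;/code&gt; endpoint and the Prometheus-format &lt;code&gt;/metrics&lt;/code&gt; endpoint, the latter only served when &lt;code&gt;--metrics&lt;/code&gt; was passed).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import urllib.request

BASE = "http://127.0.0.1:8080"

# /health returns 200 once the model has finished loading.
with urllib.request.urlopen(BASE + "/health", timeout=5) as r:
    print("health:", r.status)

# /metrics is only available when llama-server was started with --metrics.
with urllib.request.urlopen(BASE + "/metrics", timeout=5) as r:
    metrics = r.read().decode()

for line in metrics.splitlines():
    if "tokens" in line and not line.startswith("#"):
        print(line)  # e.g. prompt / generation token counters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;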

&lt;h3&gt;
  
  
  Function calling
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;--chat-template&lt;/code&gt; auto-detects the template embedded in the GGUF. Whether the &lt;code&gt;tools&lt;/code&gt; parameter for function calling works depends on the model's chat template. Recommended models: Qwen2.5-3B-Instruct Q4_K_M (2.0GB, light and fast) and Qwen2.5-7B-Instruct Q4_K_M (4.7GB, a balance of quality and speed).&lt;/p&gt;
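
&lt;p&gt;A minimal tools request against llama-server's OpenAI-compatible endpoint looks like this (a sketch: it assumes the loaded model's chat template supports tool calls, as noted above, and on recent builds the server may additionally need to be launched with &lt;code&gt;--jinja&lt;/code&gt;; the &lt;code&gt;get_weather&lt;/code&gt; tool is hypothetical).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import urllib.request

payload = {
    "model": "qwen2.5-7b-instruct",  # informational; the server uses the loaded GGUF
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",   # hypothetical tool for illustration
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=120) as r:
    reply = json.loads(r.read())
# The assistant message should contain a tool_calls entry for get_weather.
print(json.dumps(reply["choices"][0]["message"], indent=2, ensure_ascii=False))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;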

&lt;h3&gt;
  
  
  Structured Output
&lt;/h3&gt;

&lt;p&gt;Passing a GBNF grammar file with &lt;code&gt;--grammar-file&lt;/code&gt; forces the output format, and JSON syntax errors drop to 0%. The trade-off is that inference can slow down when the model keeps trying to generate output the grammar rejects. From llama.cpp b7000 onward there is also &lt;code&gt;--json-schema&lt;/code&gt;, which takes a JSON Schema directly.&lt;/p&gt;
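
&lt;p&gt;The constraint can also be set per request rather than at launch (a sketch against the server's native &lt;code&gt;/completion&lt;/code&gt; endpoint; it assumes the request body accepts a &lt;code&gt;json_schema&lt;/code&gt; field mirroring the &lt;code&gt;--json-schema&lt;/code&gt; CLI flag, so check your build's server documentation).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import urllib.request

# Schema the output must conform to (syntax only; content is still on the model).
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "speed_tps": {"type": "number"},
    },
    "required": ["name", "speed_tps"],
}
payload = {
    "prompt": "Summarize this benchmark as JSON: Qwen2.5-7B ran at 32 t/s.",
    "n_predict": 128,
    "json_schema": schema,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=120) as r:
    print(json.loads(r.read())["content"])  # syntactically valid JSON text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;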




&lt;h2&gt;
  
  
  Configuration Templates
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Template 1: 7B Model, Chat (Fastest)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; qwen2.5-7b-instruct-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; 6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;span class="c"&gt;# Expected speed: ~32 t/s, VRAM: ~5.4 GB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Template 2: 32B Model, Quality First (Partial Offload)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; qwen2.5-32b-instruct-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 25 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; q8_0 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; q8_0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; 6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-b&lt;/span&gt; 512 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;span class="c"&gt;# Expected speed: ~10.8 t/s, VRAM: ~7.4 GB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Template 3: 7B Model, Long Context (32K)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; qwen2.5-7b-instruct-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; q8_0 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; q8_0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; 6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-b&lt;/span&gt; 512 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;span class="c"&gt;# Expected speed: ~31 t/s, VRAM: ~6.9 GB (with flash-attn enabled)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Template 4: 3B Model, for Function Calling (Light and Fast)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; qwen2.5-3b-instruct-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; 6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;span class="c"&gt;# Expected speed: ~50 t/s, VRAM: ~2.5 GB&lt;/span&gt;
&lt;span class="c"&gt;# Can run alongside the 7B model (total VRAM ~8GB)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Common Mistakes and Fixes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Problem                Symptom                      Cause                                        Fix
──────────────────────────────────────────────────────────────────────────────────────────────
-ngl 0 (GPU unused)    Inference at 3-5 t/s         All layers on CPU; DDR5 is the bottleneck    -ngl 999, reduce on OOM
-c too large           OOM right after start        KV cache squeezes VRAM                       -c 4096 or --cache-type-k q8_0
-t too high            CPU at 100% yet slow         HT logical threads fight over bandwidth      Set -t to physical core count
--mlock enabled        Memory error at startup      Whole model locked in RAM → out of memory    Drop --mlock (rarely needed on Windows)
Batch size too large   OOM on long prompts          VRAM spike during prompt eval                -b 512
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary of Speed Differences by Setting
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Setting change                     Speed impact    VRAM impact
──────────────────────────────────────────────────────────
-ngl 0 → 999 (all GPU)             +5-10x          +4-7 GB
-ngl tuned to optimum (±5)         +10-20%         ±0.5 GB
--flash-attn enabled               +1-10%          -0.3 GB
--cache-type q8_0                  ±0%             -50%
-t all threads → physical cores    +5-15%          ±0
-c 32K → 4K (7B model)             +5%             -1.5 GB
-b 2048 → 512                      ±0%*            -0.2 GB**

* No effect on generation speed (prompt eval time only)
** Suppresses the transient VRAM spike during prompt eval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;-ngl&lt;/code&gt; has the biggest impact, followed by &lt;code&gt;-t&lt;/code&gt;; the rest are fine-tuning. On 8GB VRAM, the basic strategy is to maximize -ngl and then free up VRAM with -c and KV-cache quantization.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;llama.cpp — &lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;github.com/ggerganov/llama.cpp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;llama.cpp Server documentation — &lt;a href="https://github.com/ggerganov/llama.cpp/tree/master/examples/server" rel="noopener noreferrer"&gt;github.com/ggerganov/llama.cpp/tree/master/examples/server&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GGUF format specification — &lt;a href="https://github.com/ggerganov/ggml/blob/master/docs/gguf.md" rel="noopener noreferrer"&gt;github.com/ggerganov/ggml/blob/master/docs/gguf.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Flash Attention — "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (2023) &lt;a href="https://arxiv.org/abs/2307.08691" rel="noopener noreferrer"&gt;arXiv:2307.08691&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>llm</category>
      <category>llamacpp</category>
      <category>gpu</category>
    </item>
  </channel>
</rss>
