<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Federico "SpeederX" Piana</title>
    <description>The latest articles on DEV Community by Federico "SpeederX" Piana (@speederxlab).</description>
    <link>https://dev.to/speederxlab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3991282%2F0fa5680f-2ec3-4880-baff-3ca2a6345dd4.png</url>
      <title>DEV Community: Federico "SpeederX" Piana</title>
      <link>https://dev.to/speederxlab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/speederxlab"/>
    <language>en</language>
    <item>
      <title>What secretly eats your local LLMs' speed as your context fills up - Part 1</title>
      <dc:creator>Federico "SpeederX" Piana</dc:creator>
      <pubDate>Thu, 25 Jun 2026 13:49:24 +0000</pubDate>
      <link>https://dev.to/speederxlab/what-secretly-eats-your-local-llms-speed-as-your-context-fills-up-part-1-13o0</link>
      <guid>https://dev.to/speederxlab/what-secretly-eats-your-local-llms-speed-as-your-context-fills-up-part-1-13o0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fdl7sfdzacwavbutem3ui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fdl7sfdzacwavbutem3ui.png" alt=" " width="799" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Have you ever noticed that, while running a model locally, performance sometimes drops suddenly?&lt;/p&gt;

&lt;p&gt;Today I want to talk about that.&lt;/p&gt;

&lt;p&gt;I'm building an open source tool that aims to help determine the best configuration for a local LLM on a given machine, and this issue had me scratching my head, because it seems simple but it's really tricky.&lt;/p&gt;

&lt;p&gt;First of all, you have to determine how much of your VRAM budget the model takes. For ease of explanation, I'm going to use &lt;strong&gt;Qwen 3.5 9B Q4_K_M&lt;/strong&gt;, which is the model I've been using to battle-test this specific problem. My hardware: an &lt;strong&gt;RTX 2070 with 8GB VRAM&lt;/strong&gt; and &lt;strong&gt;24GB of RAM&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I loaded Qwen and it fit in my VRAM, but I was limited to a really restricted 16k to 32k maximum context, even with some memory left free. I asked myself: why does this happen?&lt;/p&gt;

&lt;p&gt;The apps we use to run local models try to make them "work" with the current conditions on our computer. The heavy lifting would be determining the best configuration and then scaling down from that.&lt;/p&gt;

&lt;p&gt;The problem is users are humans, and humans forget things. Imagine you are playing Skyrim or GTA, or watching a YouTube video. You're locking down VRAM with that. VRAM that Qwen is really eager to use to be faster and have more context for your next prompts. &lt;strong&gt;GRRRRR&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;When you load Qwen, VRAM usage bumps up to 6.8GB with full offload of those holy layers. Then you unleash the KV cache preallocation (llama.cpp preallocates this memory at startup), which is roughly &lt;strong&gt;22MB per 1k tokens&lt;/strong&gt; in my empirical tests.&lt;/p&gt;

&lt;p&gt;So if you choose &lt;strong&gt;16k&lt;/strong&gt;, you get &lt;strong&gt;352MB&lt;/strong&gt; of VRAM, &lt;strong&gt;32k&lt;/strong&gt; is &lt;strong&gt;704MB&lt;/strong&gt;, and so on.&lt;/p&gt;
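
&lt;p&gt;As a rough sketch of that arithmetic, assuming the cost stays linear at the ~22MB per 1k tokens measured above (the function name and its default are just for illustration, and the figure will likely differ for other models or KV cache settings):&lt;/p&gt;

```python
def kv_cache_mb(context_k, mb_per_1k=22):
    """Estimate the preallocated KV cache in MB for `context_k` thousand tokens.

    Assumes a linear cost per 1k tokens, based on the empirical figure above.
    """
    return context_k * mb_per_1k


for context_k in (16, 32, 131):
    print(f"{context_k}k context: ~{kv_cache_mb(context_k)}MB of KV cache")
```

&lt;p&gt;The same multiplication at 131k gives a much bigger number than the 16k and 32k cases, which matters for what comes next.&lt;/p&gt;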

&lt;p&gt;Doing some math: &lt;strong&gt;8GB&lt;/strong&gt; is &lt;strong&gt;8192MB&lt;/strong&gt;. Let's say you're aware that the YouTube podcast you're listening to in the background is using &lt;strong&gt;500-800MB&lt;/strong&gt; of the GPU, so you close it.&lt;/p&gt;

&lt;p&gt;The system reserves &lt;strong&gt;0.5 to 1GB&lt;/strong&gt; (we're talking Windows now), so to be safe... you have &lt;strong&gt;7000MB&lt;/strong&gt; available?&lt;/p&gt;

&lt;p&gt;Qwen uses &lt;strong&gt;6.8GB&lt;/strong&gt;, so it's fine. You load it with &lt;strong&gt;131k&lt;/strong&gt; of context, start using the chat interface, and everything is fine!&lt;/p&gt;

&lt;p&gt;It works! You bypassed that ugly problem and now you can use the model with its full context.&lt;/p&gt;

&lt;p&gt;You start using it seriously, and the context goes up to &lt;strong&gt;30, 40, 50k&lt;/strong&gt;. At some point you reach 60k and it starts to feel a bit slower. At &lt;strong&gt;70k&lt;/strong&gt; it's slower still, and not a normal kind of slower: a really sharp drop in generation and also in prompt processing. You reach &lt;strong&gt;90k&lt;/strong&gt; and generation is down from &lt;strong&gt;32 tok/sec&lt;/strong&gt; to &lt;strong&gt;16 tok/sec&lt;/strong&gt;, and prompt processing takes an even harder hit, going from the initial &lt;strong&gt;488 tok/sec&lt;/strong&gt; to &lt;strong&gt;41.01 tok/sec&lt;/strong&gt;. You start a new chat and it feels great again, but at &lt;strong&gt;80-90k&lt;/strong&gt; you have the same problem. What's happening? Why does it work fine until it doesn't?&lt;/p&gt;

&lt;p&gt;That's the KV cache spilling from VRAM to RAM. As the context grows, at some point part of the cache holding your prompts and responses ends up in system RAM instead of on the GPU. For that reason, most applications constrain the context to avoid this kind of issue completely. Windows is magic sometimes: instead of running out of memory, it uses shared memory to manage critical situations.&lt;/p&gt;

&lt;p&gt;The part of the cache that sits in VRAM responds really fast, but once you reach a certain amount of context, the eval speed falls drastically and you end up running your model about &lt;strong&gt;50% slower&lt;/strong&gt;.&lt;/p&gt;
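
&lt;p&gt;To get a feel for where that point might be, here is a back-of-the-envelope sketch using the numbers from this post: 8192MB of VRAM, about 6.8GB for the model, and roughly 22MB of KV cache per 1k tokens. The function and its parameters are hypothetical names for illustration, and the 500MB figure is just the low end of the background usage mentioned earlier. It ignores the system reserve and anything else the driver does, so treat the result as an optimistic estimate rather than a prediction.&lt;/p&gt;

```python
def spill_threshold_k(total_vram_mb, model_mb, other_usage_mb=0, kv_mb_per_1k=22):
    """Estimate how many thousand tokens of KV cache fit in VRAM before spilling."""
    free_mb = total_vram_mb - model_mb - other_usage_mb
    # If nothing is left, the cache spills from the very first token.
    return max(free_mb, 0) / kv_mb_per_1k


model_mb = 6.8 * 1024  # 6.8GB for the model, as reported above
best_case = spill_threshold_k(8192, model_mb)
with_background = spill_threshold_k(8192, model_mb, other_usage_mb=500)

print(f"Nothing else on the GPU: ~{best_case:.0f}k tokens")
print(f"With 500MB used elsewhere: ~{with_background:.0f}k tokens")
```

&lt;p&gt;The point is less the exact number than how quickly the threshold shrinks when something else is holding VRAM.&lt;/p&gt;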

&lt;p&gt;In the next part I will share how I started to notice this and what was not working, and in part 3 I will share the fixes I put in place to manage it.&lt;/p&gt;

&lt;p&gt;These are the runs used to build the chart above:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen 3.5 9B with 131k context&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Used KV&lt;/th&gt;
&lt;th&gt;Eval t/s&lt;/th&gt;
&lt;th&gt;Delta from 8k&lt;/th&gt;
&lt;th&gt;Prompt t/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8k&lt;/td&gt;
&lt;td&gt;42.30&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;488.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;65k&lt;/td&gt;
&lt;td&gt;32.51&lt;/td&gt;
&lt;td&gt;−23.1%&lt;/td&gt;
&lt;td&gt;87.66&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;90k&lt;/td&gt;
&lt;td&gt;16.61&lt;/td&gt;
&lt;td&gt;−60.7%&lt;/td&gt;
&lt;td&gt;41.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105k&lt;/td&gt;
&lt;td&gt;15.66&lt;/td&gt;
&lt;td&gt;−63.0%&lt;/td&gt;
&lt;td&gt;36.89&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;120k&lt;/td&gt;
&lt;td&gt;14.81&lt;/td&gt;
&lt;td&gt;−65.0%&lt;/td&gt;
&lt;td&gt;45.13&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Qwen 3.5 2B with 131k context&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Used KV&lt;/th&gt;
&lt;th&gt;Eval t/s&lt;/th&gt;
&lt;th&gt;Delta from 8k&lt;/th&gt;
&lt;th&gt;Prompt t/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8k&lt;/td&gt;
&lt;td&gt;103.49&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;3902.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;65k&lt;/td&gt;
&lt;td&gt;72.62&lt;/td&gt;
&lt;td&gt;−29.8%&lt;/td&gt;
&lt;td&gt;3011.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;90k&lt;/td&gt;
&lt;td&gt;67.49&lt;/td&gt;
&lt;td&gt;−34.8%&lt;/td&gt;
&lt;td&gt;2702.58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105k&lt;/td&gt;
&lt;td&gt;64.87&lt;/td&gt;
&lt;td&gt;−37.3%&lt;/td&gt;
&lt;td&gt;2498.70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;120k&lt;/td&gt;
&lt;td&gt;60.82&lt;/td&gt;
&lt;td&gt;−41.2%&lt;/td&gt;
&lt;td&gt;2326.17&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The green line in the chart comes from a control test on generation speed with a model, Qwen 3.5 2B Q4_K_M, that I knew would stay entirely in VRAM at the same context.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
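
&lt;p&gt;For reference, the "Delta from 8k" column is the relative change in eval speed against the 8k run. A minimal sketch of that calculation, using the 9B eval numbers from the table (the helper name is just for illustration):&lt;/p&gt;

```python
def delta_from_baseline(value, baseline):
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100


# Eval t/s for Qwen 3.5 9B, keyed by used KV in thousands of tokens.
eval_tps = {8: 42.30, 65: 32.51, 90: 16.61, 105: 15.66, 120: 14.81}
baseline = eval_tps[8]

for used_k, tps in eval_tps.items():
    if used_k == 8:
        continue  # the 8k run is the baseline itself
    print(f"{used_k}k: {delta_from_baseline(tps, baseline):+.1f}%")
```

&lt;p&gt;Swapping in the 2B numbers gives the deltas for the control run.&lt;/p&gt;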

</description>
      <category>llamacpp</category>
      <category>locallm</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
