DEV Community: Federico "SpeederX" Piana

What secretly eats your local LLMs' speed as your context fills up - Part 3

Federico "SpeederX" Piana — Tue, 21 Jul 2026 16:09:32 +0000

It’s been 3 weeks since the first article of this series.

A lot has changed for me.

In the first article there was a lot of excitement, because I found evidence, data and something new.

The whole thing seemed a lucky coincidence.

Over time I had the chance to revise how I identify that situation, and more than that.

How do you prevent - or at least recognize - when a model will degrade for a given context?

My answer: there's no rule, no magic formula - it has to be grounded in data. You cannot predict it. **It’s a sum of different signals you read from your hardware.**

You start to see shared memory going up.

RAM usage grows too. Generation takes a heavy hit - at least 50%. These numbers are measured across real runs, over and over, with different models and contexts.

In calibr, I’ve defined how you can tackle the issue - the strategy differs from the one provided here, since the one below is the logic I’ve used to narrow down the cliff token range.

First you run a baseline with llama.cpp.

You get the numbers:

Check VRAM shared memory:
- shared memory usage = shared memory peak - shared memory baseline, where peak is the highest value you record during the model run and baseline is the VRAM usage before starting the run.

Check RAM delta:
- RAM delta usage = RAM peak usage - RAM baseline. This is important because it shows that, as the shared memory grows, it goes into RAM.

Check eval tok/sec:
- A baseline run shows what the model's eval speed can be - say, 17 tok/sec.
- A long run - deep prefill, or filling the KV cache directly - shows how it evolves over time: the same model might drop to 2 tok/sec (real numbers again: Gemma 4 12B, at a deep prefill).
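The three checks above can be sketched as one small helper. This is a minimal sketch, not the logic calibr uses: the function name `run_deltas` and the tuple layout of the samples are assumptions made for illustration, and the memory readings in the example are made up. Only the 17 and 2 tok/sec figures come from the text.

```python
def run_deltas(shared_baseline_mb, ram_baseline_mb, baseline_eval_tps, samples):
    """Summarize one long run against its baselines.

    samples: list of (shared_mb, ram_mb, eval_tps) readings taken during the run.
    """
    if not samples:
        raise ValueError("need at least one sample from the run")
    shared_peak = max(s[0] for s in samples)   # highest shared memory seen
    ram_peak = max(s[1] for s in samples)      # highest RAM usage seen
    slowest_eval = min(s[2] for s in samples)  # worst generation speed seen
    return {
        "shared_delta_mb": shared_peak - shared_baseline_mb,
        "ram_delta_mb": ram_peak - ram_baseline_mb,
        "eval_drop_pct": round(100 * (1 - slowest_eval / baseline_eval_tps), 1),
    }


# Hypothetical memory readings; eval speeds end at the 2 tok/sec quoted above.
samples = [(150, 9200, 16.5), (900, 9900, 6.0), (1400, 10400, 2.0)]
print(run_deltas(100, 9000, 17.0, samples))
```

If both deltas grow together while the eval drop gets large, that is the combination of signals described above, rather than any single number on its own.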

These points show why a shallow check might give you a false-positive result.

To be clear: test only the baseline and you get happy results with high evals. Dig deeper with prefill and KV fill, and you get the real picture.

There's no way to prevent this. There are workarounds to squeeze more into memory, but they usually cost precision or eval speed.

The blanket is a fixed size - you can't stretch it. Pull it up to your shoulders and your feet stick out.

You can quantize the KV cache from f16 (llama.cpp's default) down to q8_0: it halves the cache footprint with near-lossless quality - PPL within run-to-run noise, mean precision above 99.8% against the bf16 baseline; the only measurable dip is in the worst 0.1% of token positions (source).

This might fit more into VRAM, but it doesn't solve the issue.
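To put a number on "halves", here is a rough sketch of the footprint arithmetic. It assumes the ggml q8_0 layout of 32 one-byte values plus one 16-bit scale per block, and it ignores any other buffers llama.cpp allocates; the function names are hypothetical.

```python
F16_BYTES_PER_VALUE = 2.0
Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 32 + 2  # 32 int8 values plus one 16-bit scale (assumed layout)


def q8_0_bytes_per_value():
    """Average storage cost of one cached value in q8_0."""
    return Q8_0_BLOCK_BYTES / Q8_0_BLOCK_VALUES


def q8_0_footprint_mb(f16_footprint_mb):
    """Estimate the q8_0 cache size from the size of the same cache in f16."""
    return f16_footprint_mb * q8_0_bytes_per_value() / F16_BYTES_PER_VALUE


print(q8_0_footprint_mb(1000.0))  # a 1000MB f16 cache, as a round example
```

So the saving is slightly under half. In llama.cpp the cache types are selected with `--cache-type-k` and `--cache-type-v`; check the help output of your build for the values it accepts.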

To identify the cliff precisely, you can approach it this way:

- Use the numbers from VRAM shared memory and RAM delta.
- Set up at least 3 runs with different context sizes - for instance 16k, 32k, 65k. Two points give you a slope; a third is what tells you the slope is real and not just noise from that one run. The delta between consecutive runs tells you how much VRAM roughly 1,000 tokens of context allocate.

From here, you extrapolate: take your last two clean points, keep that slope, and project forward until you hit your card's total VRAM.

model total load in VRAM + (MB per 1k tokens * context tokens / 1024) = theoretical total VRAM usage
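The slope-and-project step can be sketched like this. It is a minimal sketch under assumptions: the runs are given as (context tokens, VRAM used in MB) pairs, growth is treated as linear past the last two points, and the example readings are invented so the arithmetic stays round. The function names are hypothetical.

```python
def mb_per_1k_tokens(runs):
    """Slope between each pair of consecutive runs, in MB per 1024 tokens."""
    pts = sorted(runs)
    return [
        (v1 - v0) / ((c1 - c0) / 1024)
        for (c0, v0), (c1, v1) in zip(pts, pts[1:])
    ]


def project_cliff(runs, total_vram_mb):
    """Extend the slope of the last two runs until VRAM would be full."""
    pts = sorted(runs)
    (c0, v0), (c1, v1) = pts[-2], pts[-1]
    mb_per_token = (v1 - v0) / (c1 - c0)
    if mb_per_token <= 0:
        raise ValueError("VRAM usage did not grow between the last two runs")
    return int(c1 + (total_vram_mb - v1) / mb_per_token)


# Hypothetical readings: (context tokens, VRAM used in MB)
runs = [(16384, 5512), (32768, 6024), (65536, 7048)]
print(mb_per_1k_tokens(runs))     # similar slopes suggest the trend is real
print(project_cliff(runs, 8192))  # context size where an 8192MB card fills up
```

If the slopes from `mb_per_1k_tokens` disagree a lot, the projection is not worth much and more runs are needed. On hybrid architectures the linear assumption may simply not hold.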

Now you can see your hypothesis getting shaped and run some more numbers:

- baseline VRAM usage
- system VRAM reserved
- theoretical total VRAM
- available VRAM = theoretical total VRAM - baseline VRAM usage - system VRAM reserved
- model total load in VRAM + (MB per 1k tokens * context tokens / 1024) = theoretical total VRAM usage
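Put together, those numbers give a quick fit check. This is a sketch under assumptions: the bracketed term is read as "MB per 1k tokens times context tokens, divided by 1024", all values are in MB, and every figure in the example is hypothetical. The function names are made up for illustration.

```python
def available_vram_mb(total_vram_mb, baseline_usage_mb, system_reserved_mb):
    """VRAM left for the model once the desktop and the system take their share."""
    return total_vram_mb - baseline_usage_mb - system_reserved_mb


def theoretical_usage_mb(model_load_mb, mb_per_1k_tokens, context_tokens):
    """Model weights plus the context allocation, using the measured slope."""
    return model_load_mb + mb_per_1k_tokens * context_tokens / 1024


def fits(model_load_mb, mb_per_1k_tokens, context_tokens, available_mb):
    """True when the projected usage stays within the available VRAM."""
    return theoretical_usage_mb(model_load_mb, mb_per_1k_tokens, context_tokens) <= available_mb


# Hypothetical machine: 8192MB card, 500MB already in use, 700MB reserved.
budget = available_vram_mb(8192, 500, 700)
print(budget)
print(fits(5000, 32, 32768, budget))  # 32k context
print(fits(5000, 32, 65536, budget))  # 65k context
```

The slope and the model load should come from your own runs, which is the whole point of measuring instead of guessing.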

Does it seem like overkill?

Think about the architectures of LLMs.

We now have hybrid models whose memory usage doesn't scale with a simple formula, and you cannot make up a different formula for each model.

Why not just use a formula to compute it?

Because a formula doesn't account for:

- buffer allocation
- real available VRAM
- real reserved VRAM
- shared memory VRAM

These are all things that only show up when you actually run the model on your PC.

Different architectures imply different attention strategies and layers, which don't grow in the same way as a full-attention or otherwise predictable architecture.

What secretly eats your local LLMs' speed as your context fills up - Part 2

Federico "SpeederX" Piana — Sat, 04 Jul 2026 17:21:10 +0000

So, we left off in the previous chapter with some data and insights.

This time we talk about the story behind this.
I was serving my model through llama-server.exe locally. While chatting, I noticed a sudden drop in performance.

My idea was initially that other applications that were open were causing the issue.
I closed everything, started the task manager and put it side by side with the browser page.
I noticed my VRAM was stable and… so was the RAM.

At this point I searched on Google and Reddit, where users were saying it was pretty common for the generation speed to drop as the context filled up. Not being satisfied with this first answer, even though it seemed reasonable, I decided to dig deeper.

My first impression was that the model itself was the bottleneck with its architecture and higher context. Still it felt like speculation and I decided to find out the answer by comparing other models.
Since I had 30 GB of models just sitting on my hard drive - **who doesn't, am I right? lol** - I decided to build a PowerShell script that would take all the models I had and run them against different configurations.

Why?
Simple. To answer 2 questions:

- What's the best model to run on my machine?
- What's the best configuration for each model?

Very naive I was. I couldn't imagine how complex things were going to be.

I started with a simple PowerShell script indeed. 4 steps: scan the folder for models (.gguf files), build a plan to test them out with various configs, run llama-server.exe with the specific configuration and generate a log. The log held all the stats gathered from nvidia-smi, a very useful command that you can run from the command prompt to track the status of the GPU and the workload of processes. On Windows you cannot see each process, unfortunately; we will leave more details about this for another day lol.
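The first steps of that script can be sketched in a few lines. The original was PowerShell; this is a Python sketch of the same idea, not the author's script. The helper names are hypothetical, and using the context size as the only config dimension is an assumption. `-m` and `-c` are llama-server's model and context size flags.

```python
from itertools import product
from pathlib import Path


def find_models(folder):
    """Step 1: collect every .gguf file under the folder."""
    return sorted(Path(folder).rglob("*.gguf"))


def build_plan(models, context_sizes):
    """Step 2: one plan entry per (model, context size) combination."""
    return [
        {"model": str(model), "ctx": ctx}
        for model, ctx in product(models, context_sizes)
    ]


def server_command(entry, exe="llama-server.exe"):
    """Step 3: the command line for one plan entry (built here, not launched)."""
    return [exe, "-m", entry["model"], "-c", str(entry["ctx"])]
```

A run would then loop over `build_plan(find_models(folder), [16384, 32768])`, start each command, sample the GPU while it runs and write the readings to a log.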

This PowerShell script gave me some insight!
Little models didn't have a dramatic drop in performance, even with high context.
Suspicious.

I soon realized one of the things I didn't notice initially in the task manager itself, was the shared memory.

As the context size increased (from 16k to 32k), so did the shared memory.
As stated in the first part, that means you end up storing a portion of the LLM's tokens in RAM, which is going to slow you down.
From my tests on this empirical use case I decided to set up a rule for my PowerShell script: shared_memory growing by more than 500MB meant the model was not safe to run with my configuration, as the KV cache would spill.

If a model exceeds 500MB in shared_memory, mark that configuration as not safe. I did download plenty of different models.
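As a sketch, the gate is a one-line comparison. This is written in Python rather than the original PowerShell, the function name is hypothetical, and it reads the rule as growth over the pre-run baseline.

```python
SHARED_GROWTH_LIMIT_MB = 500  # threshold from the rule above


def label_config(shared_baseline_mb, shared_peak_mb, limit_mb=SHARED_GROWTH_LIMIT_MB):
    """Label a configuration by how much shared memory grew during the run."""
    growth = shared_peak_mb - shared_baseline_mb
    return "not safe" if growth > limit_mb else "safe"


print(label_config(100, 450))   # grew by 350MB
print(label_config(100, 1300))  # grew by 1200MB
```

A single threshold like this knows nothing about why the shared memory grew, which matters for what follows.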
Running the script against a bigger sample of models led me to another pattern: a lot of models were labelled as not safe. Too many of them.

The rule was working - but it was also flagging the wrong cases. There are some kinds of models - MoE - that are usually spread between GPU and RAM.

In that case I was misinterpreting the shared_memory as a happy path solution to my problem. The logs were just filled with models labelled as not safe that were running fine on my PC, throughput-wise.
MoE models sometimes used 8-10GB of RAM and ended up with increased shared_memory. And that was not the only thing I was missing: the llama-server.exe process itself was also allocating a part of the shared_memory.

Naive again, lol.

The pieces allocated by llama-server.exe were affecting the results, even though they are meant to be on the CPU. These pieces were not going to grow as the model was used, meaning a fixed amount of memory was simply assigned up front and gave false-positive results.
Relying on a single gate rule against memory increase came from a hands-on approach. It was data-driven, but not precise enough. It was a "sort of", rather than being able to say: when the model reaches 70k it breaks into RAM and we have this huge problem.

My approach has been:
Take a model that's close to the edge - 0.5-1GB from the maximum dedicated VRAM.
How? Load it with just the default configs. Look at eval, shared_memory usage, RAM and VRAM.
Fill the context with some random document or copy and paste. No drop in performance?

Ok. That's a good candidate.

Take the candidate, look at the logs of llama-server.exe, identify the context size of the default config and... double it!
Repeat the previous step until you notice a severe - BOOM! - drop from 32 tok/sec to 16 tok/sec.
In my case, the candidate was Qwen 3.5 9B with Q4_K_M quantization and a default context of 16k. I noticed it was not using plenty of VRAM and for this reason it seemed conservative, so I decided to increase it. I hit the performance drop close to 90k-ish but, as mentioned above, I was completely unaware of what was happening at first.
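The doubling loop above can be sketched as follows. This is a minimal sketch: `measure_eval_tps` is a hypothetical callback that would load the model at a given context size, fill it and return the eval speed, and the 50% threshold is an assumption taken from the 32 to 16 tok/sec example. The speeds in the fake measurements are made up.

```python
from typing import Callable, Optional, Tuple


def find_drop(
    measure_eval_tps: Callable[[int], float],
    start_ctx: int,
    max_ctx: int,
    drop_ratio: float = 0.5,
) -> Optional[Tuple[int, float]]:
    """Double the context until eval speed falls to drop_ratio of the first reading."""
    baseline = measure_eval_tps(start_ctx)
    ctx = start_ctx
    while ctx * 2 <= max_ctx:
        ctx *= 2
        tps = measure_eval_tps(ctx)
        if tps <= baseline * drop_ratio:
            return ctx, tps  # first context size where the drop shows up
    return None  # no severe drop found up to max_ctx


# Stand-in for real runs: context size -> eval tok/sec (made-up readings)
fake_runs = {16384: 32.0, 32768: 31.0, 65536: 30.0, 131072: 16.0}
print(find_drop(fake_runs.__getitem__, 16384, 131072))
```

Doubling only brackets the drop between the last good size and the first bad one. Narrowing it down inside that range takes more runs.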

In the next part we will see a hands-on approach and formulas to derive exactly when and where the spill happens.

*sorry, I just couldn't help myself from posting llama-server eating gpus

What secretly eats your local LLMs' speed as your context fills up - Part 1

Federico "SpeederX" Piana — Thu, 25 Jun 2026 13:49:24 +0000

Did you ever notice that sometimes while you use a model locally, you run into a sudden drop in performance?

Today I want to talk about that.

I'm building an open source tool that aims to help determine the best configuration for a local llm for a given machine, and I scratched my head about this issue, because it seems simple but it's really tricky.

First of all you have to determine the allocation that the model takes in your VRAM budget. For ease of explanation, I'm going to use Qwen 3.5 9B Q4_K_M, which is the model I've been using to battle test this specific problem. My hardware specification: I have an RTX 2070 with 8GB VRAM, 24GB of RAM.

I loaded Qwen, it sat on my VRAM but I had a really restricted 16k to 32k context max, also leaving some memory free. I asked myself: but why does this happen?

These apps we use to run local models try to make them "work" with the current conditions we have on our computer. The heavy lifting would be determining the best configuration and then scale down from that.

The problem is users are humans, and humans forget things. Imagine you are playing Skyrim or GTA, or watching a YouTube video. You're locking down VRAM with that. VRAM that the next Qwen is really eager to use to be faster and have more context for your next prompts. GRRRRR!

As you load Qwen in VRAM, the VRAM usage bumps up to 6.8GB with full offload of those holy layers. Then you unleash the KV preallocation - llama.cpp does preallocate the memory as you start it - which is roughly 22MB per 1k tokens, from my empirical tests.

So if you choose 16k you get 352MB of VRAM, 32k is 704MB, and so on.
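The arithmetic is a single multiplication. A minimal sketch, assuming the 22MB per 1k tokens figure holds across context sizes for this model and quantization (it is an empirical number from one machine, not a constant); the function name is hypothetical.

```python
KV_MB_PER_1K_TOKENS = 22  # empirical figure quoted above


def kv_prealloc_mb(context_k):
    """KV preallocation in MB for a context given in thousands of tokens."""
    return context_k * KV_MB_PER_1K_TOKENS


for context_k in (16, 32, 64):
    print(f"{context_k}k -> {kv_prealloc_mb(context_k)}MB")
```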

Doing some math: 8GB is 8192MB. Let's say you're aware that the YouTube podcast you're listening to in the background is using 500-800MB of the GPU, so you close it.

The system reserves 0.5 to 1GB - we're talking Windows now - so, to be safe... you have 7000MB available?

Qwen uses 6.8GB, so it's fine. You load 131k of context and start using the chat interface and everything is fine!

It works! You bypassed that ugly problem and now you can use the model with its full context.

You start using it seriously, the context goes up to 30, 40, 50k. At some point you reach 60k and it starts to feel a bit slower. At 70k it's even slower, and not a normal kind of slower: a really strong drawdown in generation and also in prompt processing. You reach 90k and you're down from 32 tok/sec to 16 tok/sec, and prompt processing takes an even harder hit, going from the initial 488 tok/sec to 41.01 tok/sec. You start a new chat, it feels great again, and at 80-90k you have the same problem. What's happening? Why does it work fine until it doesn't?

That's the KV cache spilling from the VRAM to the RAM. Once the context grows, at some point the prompts and responses will be moved from GPU to RAM. For that reason, most applications use constrained context to completely avoid this kind of issue. Windows is magic sometimes because it doesn't go out of memory, it uses shared memory to manage critical situations.

The first part of the memory, which is in VRAM, will respond really fast, but once you reach some specific amount of context the eval will drastically fall and you end up using your model with about 50% less speed.

In the next part I will share how I started to notice this, what was not working and in part 3 I will share the fixes I put in place to manage that.

These are the runs used to build the chart above

Qwen 3.5 9B with 131k context

Used KV	Eval t/s	Delta from 8k	Prompt t/s
8k	42.30	Baseline	488.25
65k	32.51	−23.1%	87.66
90k	16.61	−60.7%	41.01
105k	15.66	−63.0%	36.89
120k	14.81	−65.0%	45.13

Qwen 3.5 2B with 131k context

Used KV	Eval t/s	Delta from 8k	Prompt t/s
8k	103.49	Baseline	3902.12
65k	72.62	−29.8%	3011.60
90k	67.49	−34.8%	2702.58
105k	64.87	−37.3%	2498.70
120k	60.82	−41.2%	2326.17

The data in the image - the green line in the chart - is from a control test on generation speed with a model - Qwen 3.5 2B Q4_K_M - that I knew would stay entirely in VRAM at the same context.
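The "Delta from 8k" column in the tables above is each eval speed compared with the 8k row. A small sketch that recomputes it from the eval t/s values in the tables (the function name is hypothetical):

```python
def delta_from_baseline(baseline_tps, tps):
    """Percent change of an eval speed against the 8k baseline, one decimal."""
    return round(100 * (tps - baseline_tps) / baseline_tps, 1)


# Eval t/s values copied from the tables above
qwen_9b = {"8k": 42.30, "65k": 32.51, "90k": 16.61, "105k": 15.66, "120k": 14.81}
qwen_2b = {"8k": 103.49, "65k": 72.62, "90k": 67.49, "105k": 64.87, "120k": 60.82}

for name, rows in (("9B", qwen_9b), ("2B", qwen_2b)):
    for used_kv, tps in rows.items():
        if used_kv != "8k":
            print(name, used_kv, delta_from_baseline(rows["8k"], tps))
```

The 9B loses most of its speed in one step between 65k and 90k, while the 2B control slides down gradually, which is the difference between a spill and ordinary slowdown from a longer context.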