What secretly eats your local LLMs' speed as your context fills up - Part 2

#llamacpp #locallm #ai #machinelearning

So, in the previous part we left off with some data and insights.

This time we talk about the story behind them.
I was serving my model locally through llama-server.exe. While chatting, I noticed a sudden drop in performance.

My first guess was that other open applications were causing the issue.
I closed everything, opened the Task Manager and put it side by side with the browser page.
I noticed my VRAM was stable and… so was the RAM.

At this point I searched Google and Reddit, where users were saying it is pretty common for generation speed to drop as the context fills up. That first answer seemed reasonable, but it didn't satisfy me, so I decided to dig deeper.

My first impression was that the model itself was the bottleneck, because of its architecture and the larger context. Still, that felt like speculation, so I decided to find the answer by comparing it with other models.
Since I had 30 GB of models just sitting on my hard drive - **who doesn't, am I right? lol** - I decided to build a PowerShell script that would take all the models I had and run them against different configurations.

Why?
Simple. To answer two questions:

What's the best model to run on my machine?
What's the best configuration for each model?

Very naive I was. I couldn't imagine how complex things were going to get.

I did start with a simple PowerShell script. Four steps: scan the folder for models (.gguf files), build a plan to test them with various configs, run llama-server.exe with each configuration, and generate a log. The log held all the stats gathered from nvidia-smi, a very useful command you can run from the command prompt to track the status of the GPU and the workload of its processes. On Windows you unfortunately cannot see each process; we will leave more details about this for another day lol.
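
To make the four steps concrete, here is a minimal sketch of them. It is written in Python rather than PowerShell and is not my original script: every function name is made up for illustration, and the port and context sizes are placeholders. The only things taken from the real tools are the llama-server options `-m`, `-c` and `--port`, and the nvidia-smi `--query-gpu` / `--format` options.

```python
import itertools
from pathlib import Path


def find_models(folder):
    """Step 1: collect every .gguf file under the folder, sorted by path."""
    return sorted(Path(folder).rglob("*.gguf"))


def build_plan(models, ctx_sizes):
    """Step 2: pair every model with every context size to test."""
    return list(itertools.product(models, ctx_sizes))


def server_command(model, ctx_size, exe="llama-server.exe", port=8080):
    """Step 3: build the command line for one run (model path, context size, port)."""
    return [exe, "-m", str(model), "-c", str(ctx_size), "--port", str(port)]


# Step 4: the GPU side of the log. This nvidia-smi call prints plain CSV numbers (MiB).
NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_line):
    """Turn one output line such as '7421, 8192' into (used_mib, total_mib)."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total
```

Each entry of the plan would be turned into a command with `server_command`, started with `subprocess.Popen`, and sampled by running `NVIDIA_SMI_QUERY` through `subprocess.run(..., capture_output=True, text=True)` while the model is answering.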

This PowerShell script gave me some insight!
Small models didn't have a dramatic drop in performance, even with high context.
Suspicious.

I soon realized that one thing I hadn't noticed in the Task Manager at first was the shared memory.

As the context size increased (from 16k to 32k), so did the shared memory.
As stated in the first part, that means a portion of the LLM's tokens ends up stored in RAM, which is going to slow you down.
Based on these empirical tests, I decided to set up a rule in my PowerShell script: shared_memory growing by more than 500MB meant the configuration was not safe to run on my machine, because the KV cache would spill.

In short: if a model exceeds 500MB in shared_memory, mark that configuration as not safe. I then downloaded plenty of different models.
Running the script against a bigger sample of models led me to another pattern: a lot of models were labelled as not safe. Too many of them.
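
The rule itself fits in a few lines. This is a Python sketch of the idea, not the PowerShell I actually ran; the function names and the dictionary keys (`model`, `ctx`, `shared_mb`) are hypothetical, and only the 500MB threshold comes from the rule above.

```python
SHARED_LIMIT_MB = 500  # threshold from the rule described above


def is_safe(shared_memory_mb, limit_mb=SHARED_LIMIT_MB):
    """A configuration is 'safe' while shared memory stays at or under the limit."""
    return shared_memory_mb <= limit_mb


def label_runs(runs):
    """Add a 'safe' flag to each logged run without modifying the input list.

    Each run is a dict with the hypothetical keys 'model', 'ctx' and 'shared_mb'.
    """
    return [{**run, "safe": is_safe(run["shared_mb"])} for run in runs]
```

A single hard threshold like this is easy to log and easy to read, which is exactly why it was tempting.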

The rule was working, but it was also flagging the wrong cases. The catch: some kinds of models, the MoE ones, are usually spread between GPU and RAM.

In those cases I was wrong to treat shared_memory as a one-size-fits-all signal for my problem. The logs were filled with models labelled as not safe that were running fine on my PC, throughput-wise.
MoE models sometimes kept 8-10GB in RAM and ended up with increased shared_memory. And that was not the only thing I had missed: the llama-server.exe process itself was also allocating part of the shared_memory.

Naive again, lol.

The pieces allocated by llama-server.exe were affecting the results, even though they are meant to live on the CPU. They were not going to grow as the model was used, so a fixed amount of memory was assigned up front and produced false positives.
Relying on a single gate rule on memory increase came from a hands-on approach. It was data-driven, but not precise enough. It gave a "sort of" answer, rather than saying: when the context reaches 70k the model spills into RAM and we have this huge problem.
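
One way to picture the false positives is to separate the fixed allocation from the part that grows. The Python sketch below is only an illustration of that distinction, under the assumption that you take one shared-memory reading right after the model loads and more readings while the context fills. It is not the rule my script used, the function names are hypothetical, and the numbers in the comments are invented.

```python
def shared_growth_mb(baseline_mb, samples_mb):
    """How far shared memory rose above the reading taken right after load."""
    return max(samples_mb) - baseline_mb


def looks_like_spill(baseline_mb, samples_mb, limit_mb=500):
    """Flag a run only when the growth, not the absolute value, passes the limit."""
    return shared_growth_mb(baseline_mb, samples_mb) > limit_mb


# Invented readings: a large but flat allocation versus a small one that keeps growing.
flat_run = looks_like_spill(9000, [9000, 9050, 9100])    # growth of 100 MB
growing_run = looks_like_spill(200, [200, 900, 1800])    # growth of 1600 MB
```

With the absolute rule both runs would be labelled not safe; looking at growth, only the second one would.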

My approach has been:
Take a model that's close to the edge: 0.5-1GB away from the maximum dedicated VRAM.
How? Load it with just the default configs. Look at eval speed, shared_memory usage, RAM and VRAM.
Fill the context with some random document or copy-pasted text. No drop in performance?

Ok. That's a good candidate.

Take the candidate, look at the logs of llama-server.exe, identify the context size of the default config and... double it!
Repeat the previous step until you notice a severe - BOOM! - drop from 32 tok/sec to 16 tok/sec.
In my case, the candidate was Qwen 3.5 9B with Q4_K_M quantization and a default context of 16k. It was not using much VRAM, so the default seemed conservative and I decided to increase it. I hit the performance drop at around 90k but, as mentioned above, I was completely unaware of what was happening at first.
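
The doubling procedure can be sketched as a small loop. This is a Python illustration under assumptions, not my script: `measure_tok_per_sec` is a hypothetical callback that would restart llama-server.exe with the given context size, fill the context and return the generation speed, and `drop_ratio=0.5` encodes a "speed halved" drop like the 32 to 16 tok/sec one above.

```python
def find_drop(measure_tok_per_sec, start_ctx, max_ctx, drop_ratio=0.5):
    """Double the context until speed falls to drop_ratio of the starting speed.

    Returns (last_good_ctx, first_bad_ctx), or (last_tested_ctx, None) when no
    drop shows up before max_ctx.
    """
    baseline = measure_tok_per_sec(start_ctx)
    ctx = start_ctx
    while ctx * 2 <= max_ctx:
        candidate = ctx * 2
        speed = measure_tok_per_sec(candidate)
        if speed <= baseline * drop_ratio:
            return ctx, candidate
        ctx = candidate
    return ctx, None


def fake_measure(ctx):
    """Stand-in for a real benchmark: full speed below 90k tokens, half above."""
    return 32.0 if ctx < 90_000 else 16.0


bracket = find_drop(fake_measure, start_ctx=16_384, max_ctx=262_144)
```

Doubling only brackets the drop between two context sizes. To land on a figure like "around 90k" you still have to test a few sizes inside that bracket.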

In the next part we will see a hands-on approach and formulas to derive exactly when and where the spill happens.
