DEV Community: darkpenguin

The age of local LLMs is here

darkpenguin — Sat, 04 Jul 2026 00:24:41 +0000

Half a year ago, I wanted to see for myself what we can currently get out of local LLMs. I went down the rabbit hole, learned quite a lot in the process, and shared my results in an article.

The results were pretty discouraging: even with 32 GB of VRAM, the best models I could run were both too slow and too dumb. At the same time, what you could get for free from inference providers was actually decent - and much faster. I remember my conclusion: "Let's wait for the next generation of models, which looks very promising. If we can run something comparable to full-size Qwen3-Coder-480B locally, that would be the ~~year of the Linux Desktop~~ age of fully capable local LLMs."

And now this day has arrived.

Models

Half a year later, I'm revisiting this question. And this time, the whole situation has turned upside-down.

Almost none of the providers still have a free tier, and anything that's still free is barely good enough even for the simplest tasks, and is heavily rate-limited on top of that. And on the local side, the next Qwen lineup is out. So, that's what I'm going to be looking at.

Once again, I have two RX6800's, 16 GB each, and 64 GB RAM. On one hand, this is more VRAM than any "normal person" can have with one GPU - unless you've got something specifically for AI, like a unified-memory Mac or a DGX Spark. On the other hand, the RX6800 is "pre-AI" - anything newer will have much better performance thanks to tensor processors.

Qwen3.6-27B: This is a dense model, so basically you can't run it at all on anything less than 32 GB VRAM. It's the slowest one, but also the best one if you can run it. Its accuracy is claimed to be on par with Claude 4.5 Opus, and better than Qwen3.5-397B-A17B. This is what I've been waiting for. It runs reasonably fast on my setup, so it's very much usable both in terms of performance and accuracy.
Qwen3.6-35B-A3B: This one is MoE, and it's pretty small, so it's the fastest one. It's good for anything that doesn't require too much (i.e. for agentic tasks that don't need a lot of reasoning), and apparently better than GLM-4.7-Flash or Gemini-3.1-Flash-Lite (which is basically all you can get for free nowadays). So, we don't need all that anymore. And it's FAST!
Qwen-Coder-Next-80B: It's big, but it's also MoE, so you can offload some experts to the CPU. On my setup, its performance is somewhere between 3.6-dense and 3.6-MoE. Its accuracy is claimed to be near the full-sized GLM-4.7, or Kimi K2.5, or DeepSeek-V3.2. It's based on Qwen-Next, fine-tuned specifically for coding.
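A rough back-of-the-envelope estimate shows why the dense 27B model wants both cards while the 80B MoE model has to spill some experts onto the CPU. This is only a sketch under stated assumptions: the 4.5 bits per weight figure is an assumed average for a 4-bit quantization, and the 4 GiB of headroom for context and runtime overhead is a placeholder, not a measured value.

```python
# Rough estimate of quantized weight size versus available VRAM.
BITS_PER_WEIGHT = 4.5  # assumption: approximate average for a 4-bit quantization
HEADROOM_GIB = 4.0     # assumption: placeholder for KV cache and runtime overhead


def weight_gib(params_billion: float, bits_per_weight: float = BITS_PER_WEIGHT) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def placement(params_billion: float, vram_gib: float) -> str:
    """Say whether weights plus the assumed headroom fit in the given VRAM."""
    needed = weight_gib(params_billion) + HEADROOM_GIB
    return "fits" if needed <= vram_gib else "does not fit"


if __name__ == "__main__":
    models = {"Qwen3.6-27B": 27, "Qwen3.6-35B-A3B": 35, "Qwen-Coder-Next-80B": 80}
    for name, params in models.items():
        print(f"{name}: ~{weight_gib(params):.1f} GiB of weights, "
              f"16 GiB: {placement(params, 16)}, 32 GiB: {placement(params, 32)}")
```

Under these assumptions the 27B weights alone come to roughly 14 GiB, which leaves too little room on a single 16 GB card, and the 80B model overshoots 32 GB, which is where offloading experts to the CPU comes in.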

Other than the new models, there are quite a lot of other improvements. Last time, REAP was just appearing. This time, there is a REAM variant of Qwen-Coder-Next-80B - that's when they merge the weights instead of simply pruning them. And based on the benchmarks, its accuracy is within the margin of error of the full model!

In other news: we no longer need llama-swap - llama.cpp now has an experimental "router mode", where it loads and unloads models itself. You can specify different parameters per model in a config file. The config file format is less robust than what llama-swap had, but there is a very good reason to use it instead. Read on.

What else?

In real-life usage, you will probably want to switch models a lot. With a good enough SSD, loading a new model can be brought down to under a minute, and if you have enough RAM, then both smaller models fit into 64 GB of page cache nicely. But here comes another problem: context cache. If you switch to a faster model and then back to a smarter one, then you'll have to reprocess your whole conversation. And that can easily take minutes.
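To put rough numbers on the cost of a model switch, here is a minimal sketch that splits it into two parts: reading the weights from disk and reprocessing the conversation. All inputs in the example (file size, SSD read speed, context length, prompt-processing speed) are hypothetical placeholders, so plug in your own measurements.

```python
def load_seconds(model_gb: float, read_gb_per_s: float) -> float:
    """Time to read the model file from disk at a sustained read speed."""
    return model_gb / read_gb_per_s


def reprocess_seconds(context_tokens: int, prompt_tokens_per_s: float) -> float:
    """Time to re-ingest the whole conversation when no context cache survives."""
    return context_tokens / prompt_tokens_per_s


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    load = load_seconds(model_gb=20, read_gb_per_s=2)
    reprocess = reprocess_seconds(context_tokens=60_000, prompt_tokens_per_s=300)
    print(f"load: {load:.0f} s, reprocess: {reprocess:.0f} s")
```

With numbers like these, the reprocessing step dominates the switch, which is why keeping the context cache around matters more than a faster SSD.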

It would be really nice if there was a way to save that context cache, and restore it after switching back, wouldn't it? Turns out there is an option for this in llama.cpp: when you switch models in router mode, it saves your slots to disk, then restores them! But this option does not work anymore, because for the newer models' attention mechanisms, you also have checkpoints, so only restoring slots is not enough. But there is a PR to save checkpoints as well! But it's not merged yet (last time I checked). So, I've created my own fork of llama.cpp with this and some other yet-unmerged PRs included, which you are welcome to try!
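For reference, slot saving can also be triggered by hand. The llama.cpp server documents a `POST /slots/{id}?action=save` (and `action=restore`) endpoint that is enabled by starting it with `--slot-save-path`. The sketch below only builds those requests; the host, port, slot id and filename are hypothetical, and it is not how router mode or my fork does it internally. The caveat above still applies: for the newer models, restoring slots alone is not enough.

```python
import json
import urllib.request


def slot_request(base_url: str, slot_id: int, action: str, filename: str) -> urllib.request.Request:
    """Build a save/restore request for one server slot (does not send it)."""
    if action not in ("save", "restore"):
        raise ValueError("action must be 'save' or 'restore'")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical server address and filename.
    req = slot_request("http://localhost:8080", 0, "save", "conversation.bin")
    print(req.get_method(), req.full_url)
    # To actually send it against a running server:
    # urllib.request.urlopen(req)
```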

Pi

Now that we've got usable local models, it's time to choose a harness. And there's a lot of choice! Not only coding - lately, there is talk about OpenClaw all around, and Hermes is emerging as a better alternative with some interesting features.

I've spent some time tinkering with Hermes, and quickly noticed two things. First, it feels very vibe-coded - even some basic features don't work. 8000+ open issues on its GitHub tell me that this is unlikely to be fixed any time soon. And second... Sometimes I have no idea what it's doing. I've tried turning on maximum verbosity in reasoning and tool calls, but that's one of the things that are bugged and don't work. I even developed a tiny proxy to intercept the requests that it doesn't show me, but that's a really janky solution. I can literally hear my LLMs working (apparently AMD GPUs are famous for their coil whine, which I consider a great feedback feature). I don't want to interrupt if it's taking so much time to think about something useful, but I don't even know what it is thinking about.

And then I saw Mario Zechner's talk about Pi. Specifically, how he was annoyed with the same things, and created his own minimalistic harness. His mindset sounds very close to what I'm looking for, so the next thing I'm going to do is tinker with Pi to my heart's content. I've listened to his other talks, and in one of them, he also said that he is very impressed with the current state of local LLMs; if all frontier models were turned off today - which in my opinion we can say is already happening with all the price changes - he would still be very happy with what we have now. There is also a lot of insight into AI in general.

So, here it is - the moment when AI becomes available to hardcore libre software people who don't want to rely on software running on someone else's machine. Which also means we can experiment for free and without fear of it changing or disappearing. So, let's experiment a lot! And let's remember to use it to learn, and not to contribute to the slop-pocalypse of libre software.

Local LLMs: state of the art

darkpenguin — Mon, 08 Dec 2025 10:56:44 +0000

With all the local LLMs available by now, you might get curious about what's the best we can have running locally and how that compares against what you can get with free-tier inference providers. And the first question you'll have is: what model do I use?

I've set out to answer those questions for myself. Here is what I've learned from this journey.

Goals and hardware

My use case is agentic coding. Specifically, KiloCode. That's pretty important because broadly speaking, there are two main use cases for LLMs, and some of their requirements are opposite:

Creative writing/roleplay: you want the model to be creative - to be able to tell an interesting and unexpected story, rather than sticking to what you say.
Agentic coding (or other agentic use): you want the model to do exactly what you say - the less "creativity", the better.

Of course, you can control this with temperature, but generally some models are best for creative writing, and some are best for following instructions.
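As a quick illustration of what temperature does, here is a minimal sketch of temperature-scaled softmax over a made-up set of token scores. The logits are invented for the example; real samplers usually combine this with other settings such as top-k or top-p.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
    for t in (0.2, 1.0, 2.0):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all of the probability lands on the top token, which is what you want for instruction following; at a high temperature the alternatives get a real chance, which is what makes the output more "creative".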

I'm running those models on 64 GB RAM and two RX6800's with 16 GB VRAM each. This gives me 32 GB VRAM in total. It's not as fast as the latest NVIDIA graphics cards, but on the other hand, I can fit quite a lot in VRAM, and that makes a much more noticeable difference than the raw speed advantage of NVIDIA cards.

So, what are the most popular models for agentic coding, and what experience did I have with them?

The Journey

At first, I tried Qwen3-Coder-30B-A3B-Instruct - the obvious choice when you look for "the latest model" and "for coding". First of all, it was pretty slow. The first request from KiloCode is around 10k tokens, and processing that much input takes about one minute. It then gets cached, so subsequent requests will be faster, but if you switch models - or if your model crashes and you have to restart it, which does happen pretty often - you'll have to wait for one minute again. And the more context you already have filled, the slower it gets, so you're looking at similarly long wait times after each request. What's more, very often it couldn't even use tools correctly, so using it was mostly fighting with the model rather than the model helping me. I've tried running the model at Q6 and Q8, but that didn't make it much better than Q4_K_XL.

Then I tried something similar: Qwen3-30B-A3B-Instruct-2507. It's about as fast as the previous one, but this time it could use tools more consistently. It did pretty well with creating a basic project skeleton to begin iterating on, but it was pretty bad at finding and fixing bugs. I asked it to add a few features, and it succeeded after a few attempts, but then the program could no longer run due to a (pretty obvious) null pointer dereference bug, and the model could not figure it out, no matter how many times I tried. Instead, it tried to fix nonexistent issues, and messed up the code more and more.

This bug was pretty obvious and easy to fix, but it gave a lot of trouble to the models, so I decided to use it as an exam. Is there any model that can fix this? And there was: gpt-oss-120b, which I managed to run now that I had gained enough experience in this. And it was FAST! But it is famous for being somewhat "derpy", occasionally failing to use tools, and thinking a lot. It figured out the bug, but instead of fixing it properly, it slapped a proxy function in between to substitute the null pointer. And as my project progressed further, most tasks became too complex even for gpt-oss-120b to handle. And it is still much slower than what you can get with a free-tier cloud inference provider.

So, I had hit the ceiling in terms of smartness, but there was still room for improvement in terms of speed. I suddenly realized that ROCm v7 is WAY WAY faster than ROCm v6, and even faster than Vulkan, especially in prompt processing - what I needed the most! So I migrated from Vulkan to ROCm v7. Also, MXFP4 models really are (a little) faster even on AMD graphics cards.

It was at this point that the first REAP models appeared. And this made a lot of difference! But obviously only in speed, not "smartness". Qwen3 and Qwen3-Coder got much faster, even though they did already fit completely in my VRAM before. gpt-oss (now 58b) got very very fast! At least compared to everything I've seen before. With all that, I got that 10k prompt that initially took about a minute down to only about 10 seconds!
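Expressed as prompt-processing throughput, the improvement looks like this. A small worked example using only the rounded figures from the text (a prompt of about 10k tokens, about a minute before, about 10 seconds after):

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Prompt-processing throughput for a prompt of the given size."""
    return tokens / seconds


PROMPT_TOKENS = 10_000  # approximate size of the first KiloCode request

before = tokens_per_second(PROMPT_TOKENS, 60)  # about 167 tokens/s
after = tokens_per_second(PROMPT_TOKENS, 10)   # 1000 tokens/s
speedup = after / before                       # about 6x

if __name__ == "__main__":
    print(f"before: {before:.0f} tok/s, after: {after:.0f} tok/s, speedup: {speedup:.1f}x")
```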

REAP also made it possible to try GLM-4.5-Air, which wouldn't even fit in my RAM before. At first, it did not work at all - it was only outputting "???????". The solution to this is to disable its thinking. But either way, it is SLOW. Even with all those improvements. Like 5-10 minutes to ingest that initial 10k prompt slow. It's obviously completely unusable.

Tools

I've learned a lot while doing this research. I tried ollama, then llama.cpp, then llama-swap. When I saw that some models have trouble with tool calls, I noticed that there is no way in KiloCode (and I assume most other similar software) to see the raw model output, so I wanted a tool for that. I also wanted a tool to benchmark the models' performance in a consistent way that would be easy to compare. Apparently no established benchmark exists; everyone simply writes their own. I've seen multiple benchmarks like this, but wanted something more feature-complete. I also saw llama-bench, but it turns out it's not very representative or consistent, as it uses literally random tokens as input.

At the same time, I needed something to write - some simple project to use as an example, simple enough that local models could handle it. And then it turned a little recursive: why don't I write that toolkit? And that's what I did. It has two tools. One is a benchmark that can test multiple models in multiple configurations at once, with up to 100k context, and output results in a table you can easily save in a simple text file. The other one is a proxy that dumps raw model output in real-time.
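To give an idea of what the proxy has to deal with, here is a minimal sketch of the parsing step for an OpenAI-compatible streaming response: each chunk arrives as a `data: {...}` line, and the stream ends with `data: [DONE]`. This is not the code from my toolkit, just an illustration; the function name is made up, and it only pulls out the text deltas, leaving tool-call deltas aside.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments from OpenAI-style streaming chunk lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and anything that is not a data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


if __name__ == "__main__":
    sample = [
        'data: {"choices":[{"delta":{"content":"Hel"}}]}',
        "",
        'data: {"choices":[{"delta":{"content":"lo"}}]}',
        "data: [DONE]",
    ]
    for fragment in iter_stream_text(sample):
        print(fragment, end="", flush=True)
    print()
```

A real proxy would forward every line to the client unchanged and only tee a copy into something like this, so the harness never notices it is being watched.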

Thanks to the KiloCode Discord community for helping me a lot every step of the way!

Conclusion

"Small" models - ones you can fit completely in 32 GB VRAM at Q4 - are basically completely impotent.
"Big guns" models - ones you can run on 64GB RAM (but very little VRAM as it's only used for the KV cache!) - can only help with pretty basic things, and are still way slower than free-tier inference providers.
Full-sized models running in the cloud - Qwen3-235b-A22b-Instruct-2507, Qwen3-Coder-480b, GLM-4.5-Air - can do way better. And way faster. They are actually usable and helpful. And when they are not enough, asking ChatGPT the old-fashioned way seems to be the best you can get.

With MoE and REAP, the state of local LLMs has advanced profoundly. Let's hope that in the near future, we'll get more technologies like that. Once we can run something comparable to already existing full-size models locally at reasonable speed, I'd call that the "day of the LLM desktop".