Digression
It's been a while since AI hit the public in such a way that some users plan their vacations and do basic addition with it. That's pretty much the same time that we (developers) discovered a new tool that can help us (to ruin the market) work a little bit faster.
I have a bit of a love/hate relationship with this tool, mainly because I wonder where we're going (as humanity) with our "pollution" problem. But I also love this tool because, you know, it has removed many steps for smart people to be able to build stuff. And I love the idea that people, whatever their country and knowledge, are able to build something for fun or otherwise. That, for me, is also the main purpose of the internet: sharing knowledge, being able to learn something that is not accessible where we live or because of money.
Another subject is privacy.
And what a good privacy job we get with AI! All the data is sent somewhere, and nobody (at least honest people) has read the terms of use and other boring stuff. But that makes me think: right now, we have all the tools we need to run a local model to help us work with code, so why not?
What we're doing
So the plan was pretty basic: we get llama.cpp to run a local model as an API (basically), and we get pi (or opencode) to get the "chat"-like experience.
Setting up
I'm using Arch (btw) but all the steps here are pretty much the same for any distro or OS.
The first step is to install all the tools we need:
yay -S llama.cpp-vulkan
npm install -g @mariozechner/pi-coding-agent
The first command installs the version of llama.cpp that is optimized for my machine; on the AUR you can find a huge variety of optimized versions.
The second one installs the pi harness.
Running the model
Like the rest of this little experiment, running the model with llama.cpp is actually very simple; the only "hard" step is to actually choose the model.
Sorry in advance, no big reveal here, no "blowing my socks off" website: I chose this model because other people have already used it, and it works somewhat great. Some websites propose rankings for models and I think that's great, but we also know for a fact that rating LLMs is a pretty hard task, mainly because of the nature of how LLMs work.
Also, this is a little fun experiment, so if we really need to we can always switch to another model later on :)
Let's get back to action:
llama-server -hf AaryanK/Qwen3.6-27B-GGUF:Q4_K_M --port 8080 --host 127.0.0.1 -c 8192 -t 8 -ngl 40
I guess you already know what most of the command does, but let me explain the weird ones:

- `-c` controls the context size
- `-t` controls the number of CPU threads
- `-ngl` controls how many layers are offloaded to the GPU

These are the most important knobs here, because those are the values we need to tweak and test to get the most performance out of our "model/PC" couple.
If everything is working correctly you should now see something like this in your console:
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv update_slots: all slots are idle
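Before wiring anything into pi, we can sanity-check the server directly with curl. A quick sketch, assuming the default OpenAI-compatible endpoint llama-server exposes (with a single model loaded, the model name in the request is mostly ignored):

```shell
# Smoke test: ask the local server for a chat completion
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'
```

If you get a JSON response back with a `choices` array, the server side is done.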
LLM talks to Pi
Ok, right after that we have to actually get the model "linked" to our chat. With pi that's pretty simple:
mkdir -p ~/.pi/agent
nvim ~/.pi/agent/models.json
# in ~/.pi/agent/models.json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://127.0.0.1:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Qwen3.6-27B",
          "contextWindow": 8192,
          "maxTokens": 4096
        }
      ]
    }
  }
}
Here, the two most important values are:
- contextWindow: how much the model can remember
- maxTokens: response size limit
We can now check if the model is actually linked by launching pi and selecting it via the /model command.
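If the model doesn't show up in pi, you can check what the server actually advertises; a sketch, assuming the default llama-server endpoints:

```shell
# llama-server exposes an OpenAI-compatible model listing
curl -s http://127.0.0.1:8080/v1/models
```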
Say "hello" and go for a walk
Okay, my machine is a TUXEDO notebook (InfinityBook Pro something) without a graphics card, and certainly not a MacBook Pro chipset.
So here's the reality check for me: asking the model "what's the day?" takes a really long time to actually get a response. And a pretty damn good chunk of this PC's compute capacity:
So what next?
For me, I guess this is the end. Without a good PC for this kind of compute I'm pretty much not able to use a local model, or only a much smaller one, so the results are not going to be as good compared to Claude Code or ChatGPT. And I don't want to set up a VPS for that (maybe another time?).
Mellum
Another task where a local model can help: code completion.
I'm going to use Mellum from JetBrains to test that.
Setup is as easy as before:
llama-server -hf JetBrains/Mellum-4b-base-gguf --port 8080 --host 127.0.0.1 -c 8192 -t 8 -ngl 40
and for pi we just add another entry to the models list:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://127.0.0.1:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Qwen3.6-27B",
          "contextWindow": 8192,
          "maxTokens": 4096
        },
        {
          "id": "Mellum-4B",
          "contextWindow": 8192,
          "maxTokens": 4096
        }
      ]
    }
  }
}
If needed you can always run the two models at the same time on different ports, but if I do that I guess my PC is going to melt.
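For the record, a sketch of what that split could look like (ports and thread counts here are assumptions; pi would also need a second provider entry with a baseUrl pointing at the second port):

```shell
# Hypothetical: chat model on 8080, completion model on 8989
llama-server -hf AaryanK/Qwen3.6-27B-GGUF:Q4_K_M --port 8080 --host 127.0.0.1 -c 8192 -t 8 -ngl 40 &
llama-server -hf JetBrains/Mellum-4b-base-gguf --port 8989 --host 127.0.0.1 -c 8192 -t 4 &
```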
We can select it via the same command:
The response to "hey" was much quicker than the other model's, for the same amount of fan noise:
So what's the deal? This is a code completion model, so it can't answer questions I guess:
But! Let's try code completion then:
Even on my machine this is quick enough to use in Neovim, for example with a code completion plugin.
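You can get a rough feel for the raw completion speed with curl before touching any editor config; a sketch, assuming the Mellum server is listening on port 8080 (adjust to whatever port you started llama-server on):

```shell
# Time a plain completion request against the local server
time curl -s http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "def fibonacci(n):", "max_tokens": 64, "temperature": 0.2}'
```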
You know what, let's try it in a real project
Ok, the model is rapid enough (it seems), so let's try it in a real project.
For that I'm going to add this to my Neovim config (with lazy.nvim):
{
  {
    "milanglacier/minuet-ai.nvim",
    config = function()
      require("minuet").setup({
        provider = "openai_fim_compatible",
        n_completions = 1, -- use 1 for local models to save resources
        context_window = 4096, -- adjust based on your machine's capability
        throttle = 500, -- minimum time between requests in ms
        debounce = 300, -- wait time after typing stops before requesting
        provider_options = {
          openai_fim_compatible = {
            api_key = "TERM", -- llama.cpp doesn't need a real key; any non-empty env var works
            name = "llama.cpp",
            end_point = "http://localhost:8080/v1/completions",
            model = "JetBrains/Mellum-4b-base-gguf",
            optional = {
              max_tokens = 256, -- maximum tokens to generate
              stop = { "\n\n" }, -- stop at double newlines
              top_p = 0.9, -- nucleus sampling parameter
            },
          },
        },
        -- virtual text display settings
        virtualtext = {
          auto_trigger_ft = { "*" }, -- enable for all filetypes
          keymap = {
            accept = "<Tab>",
            accept_line = "<C-y>",
            next = "<C-n>",
            prev = "<C-p>",
            dismiss = "<C-e>",
          },
        },
      })
    end,
  },
  {
    "Saghen/blink.cmp",
  },
}
After reloading that in a project of mine (postier) I can try to use it:
Hey, that's not bad! After some experimenting, the results are not always very good, but I guess the model is maybe not the most efficient, and the settings can also be tweaked to allow for better results.
But hey, you get a mostly free AI auto-completion in your Neovim, that's cool, and it can spare some token usage I guess ^^
And this works when you're offline too, so it can be a good option for certain use cases.
So… is it worth it?
Yeah I guess, just maybe not on this machine.
What I’ve built here does work. It proves the point. You can run your own local models, wire them into tools like pi, plug them into Neovim, and get something that feels very close to the "big AI experience"… without sending your data somewhere else.
But hardware matters. A lot. (ouch).
Right now I'm running this on a CPU-only laptop, and honestly, it shows. The experience is slow, sometimes frustrating, and clearly not comparable to cloud models. So it's not a great experience to use every day for work.
But if you take the exact same setup and drop it onto a more modern machine, things can change drastically.
Take something like a MacBook Pro with Apple Silicon. These chips come with powerful integrated GPUs and unified memory, which are insanely good for this kind of workload. You can offload a large part of the model to the GPU, increase context size, and suddenly responses go from "go grab a coffee" to "this is actually usable."
Same story on the PC side. A desktop or laptop with a decent GPU will run circles around a CPU-only setup. With proper GPU offloading (-ngl), quantized models, and a bit of tuning, you can get real-time or near real-time responses, even with larger models.
What I like most about this whole thing isn’t just performance, it’s control:
- code stays local
- prompts stay private
- you can tweak, break, and rebuild everything
- you’re not tied to an API or pricing model
That feels very close to what I want: owning your tools and understanding how they work.
So yeah, my poor CPU-only laptop is struggling.
But the experiment itself? Totally worth it.
And honestly… it kind of makes me want to upgrade my hardware (if I win the lottery).
So I'll leave you there, have fun!






