The Shift to Local Models
Running local LLMs for software development is getting increasingly popular, especially as commercial providers continue to charge by the token. It finally makes economic sense to run models locally to avoid cost overruns.
I have personally spent a lot of time trying to figure out the best configuration. After experimenting with LM Studio, Ollama, and RooCode, I finally found a setup that consistently works for my workflow: Llama.cpp running Qwen Coder via GitHub Copilot with Open Spec SSD.
Here is a breakdown of my experience and the exact configuration I use.
The Hardware Reality
To get decent results locally, hardware is the primary constraint. I was fortunate enough to recently purchase the latest MacBook Pro M5 with 128GB of RAM.
Initially, I had some buyer's remorse spending that much on a machine, but it has proven essential. I tend to consume a lot of memory — I regularly run VS Code with multiple workspaces, React Native servers and simulators, a mail client, and around 100 Google Chrome tabs simultaneously.
Even with all of this running alongside the local LLM, my system rarely swaps more than 2GB of memory to the disk. Performance stays smooth, and I avoid the severe degradation that happens when swap usage climbs higher.
The Software Stack
While wrappers like Ollama and LM Studio are convenient, I found the best results come from running Llama.cpp directly.
Installation on macOS is straightforward:
brew install llama.cpp
For the models, Hugging Face is the best source.
To use Llama.cpp models from Hugging Face, you need files in the GGUF format. These models are optimized for local inference on both CPUs and GPUs. https://huggingface.co/docs/hub/en/gguf-llamacpp
My model of choice is Qwen3-Coder-Next.
The Quantization Goldilocks Zone
Quantization makes a massive difference in performance and stability. I went through a bit of trial and error to find the right balance:
- 4-bit: I tried this first, but it lacked precision. For complex coding tasks, the model would frequently get stuck in infinite loops (pretty much always for me).
- 16-bit: I attempted to run the 32B parameter model at 16-bit, but it was simply too large for my hardware to handle.
- 8-bit: This was the sweet spot. It fits within my memory constraints while executing complex reasoning flawlessly. To download and run the server with the given model on your local environment, run:
llama-server -hf Qwen/Qwen3-Coder-Next-GGUF:Q8_0
When you installed llama.cpp via brew, you can run the above command, which will take a while the first time -- depending on your internet speed up to an hour or so. Running it again will access the cached version, so it will take only few seconds to load into the memory and start. You can check if it's running by going to http://localhost:8080/
My Copilot Configuration
If you run a model locally, GitHub Copilot does not charge you for tokens (not yet anyways), meaning you can stay on the standard plan while running complex code analysis.
In VSCode, copilot chat window, click on the model picker and select a gear next to the "Other Models":

Then select "add models" and "custom end point",
type "llama.cpp" for group name, hit "enter" for the API key, and "Enter" again for the API type (the value does not really matter).
After you save the initial config, you should be able to see the llama.cpp in the list of models when you click the gear next to the "Other Models" selection again. Then, click a gear next to "llama.cpp" and open the config as JSON. To save you time -- here is the final JSON I have, you may want to copy paste it in your config:
{
"name": "llama.cpp",
"vendor": "customendpoint",
"apiKey": "${input:chat.lm.secret.6d112807}",
"models": [
{
"id": "Local Llama, qwen-Q8_0",
"name": "qwen-Q8_0",
"url": "http://localhost:8080",
"toolCalling": true,
"vision": false,
"reasoning": true,
"thinking": true,
"maxInputTokens": 131072,
"maxOutputTokens": 131072,
"contextWindowSize": 262144,
"parameters": {
"top_k": 40,
"top_p": 0.9,
"min_p": 0.1,
"repetition_penalty": 1.15,
"temperature": 0.1,
"max_new_tokens": 1500,
"num_ctx": 16384,
"num_gpu": -1,
"num_thread": 12
}
}
]
},
Let's go over some of these parameters:
-
Endpoint:
localhost:8080(or your specific local port) -
Tool Calling:
true -
Vision:
false(I tried enabling this, but Qwen kept returning errors that vision is unsupported) -
Reasoning & Thinking:
true -
Context Window:
256ktotal (128kmax input /128kmax output) -
Temperature:
0.1 -
GPU Offload (
num_gpu):-1(Uses all available GPUs) -
Threads:
12(Maps to the performance cores on my Mac)
A Note on Reasoning and Temperature
Some documentation suggests turning "Reasoning" and "Thinking" off for coding tasks, but my experiments proved otherwise. I use spec-driven development with OpenSpec, which requires heavy analysis and planning before any code is written. Leaving reasoning set to true yielded significantly better results for this workflow.
Additionally, keep your temperature at 0.1 rather than strict 0.0. A purely deterministic 0.0 temperature can cause the model to get permanently stuck if it hits a logic loop. That slight 0.1 bump gives it just enough variance to diverge and find a solution.
Few words on agentic coding
No matter the model you use -- Vibe coding, as we know it, always going to suffer from the Architecture Entropy. Read more on this topic here in this post The End of Vibe Coding
This is why SpecDrivenDevelopment (SDD) is a must.
Also, on the topic of why it's essential to keep your Architecture "as simple as possible, but not simpler", read the following post From Vibe Coding to SDD: Why the Future of Engineering is Architecture
The Payoff
The results I am getting from this local Qwen setup are remarkably close to top-tier remote models like Claude Opus 4.6 (3x).
Between the high-quality output and the fact that I am saving a couple of hundred dollars a month on API costs, the heavy upfront investment in the MacBook Pro will pay for itself within a couple of years. If you have the hardware, I highly recommend giving this stack a try.
Top comments (0)