DEV Community

pickuma
pickuma

Posted on • Originally published at pickuma.com

Running Local LLMs for Code Generation: Ollama vs LM Studio in 2026

Six months ago, running a local LLM for code generation meant accepting halved throughput, double the memory pressure, and a model that reliably hallucinated imports. In mid-2026, the landscape has shifted enough that "just run it locally" is no longer a punchline — it is a decision with real tradeoffs worth measuring.

We set up the three dominant local inference stacks — Ollama, LM Studio, and raw llama.cpp — on an M3 Max MacBook Pro (36GB unified memory) and a Linux workstation with an RTX 4090. Then we threw the same set of coding prompts at DeepSeek Coder V2 (33B Q4_K_M), Qwen 2.5 Coder (32B Q4_K_M), and CodeLlama 34B (Q4_K_M), measuring inference speed, memory footprint, and HumanEval pass@1 scores. We also compared each against the cloud baseline (GPT-4o and Claude 3.5 Sonnet via API) to answer the question every developer asks: are local models actually good enough yet?

Hardware reality: what you can expect in 2026

Local LLM inference is a memory bandwidth game. The model weights sit in RAM or VRAM, and your hardware moves them through the compute units as fast as it can. Every other variable — quantization, prompt length, context window — is secondary to that bottleneck.

On the RTX 4090, the numbers are straightforward. DeepSeek Coder V2 33B at 4-bit quantization pulled 68 tokens per second during code generation. Context processing for a 4,000-token prompt finished in 0.4 seconds, and total VRAM usage sat at 21.4 GB. Qwen 2.5 Coder 32B was slightly faster — 74 tok/s generation, 0.35 seconds for the same prompt length, 20.8 GB VRAM. CodeLlama 34B came in at 62 tok/s with 22.1 GB used. All three fit cleanly inside the 24 GB VRAM budget and produced tokens faster than I could read them.

Apple Silicon tells a different story. The M3 Max has 400 GB/s of memory bandwidth — roughly 40% of what the 4090 offers (1,008 GB/s) — and that ratio maps surprisingly directly to generation speed. DeepSeek Coder V2 ran at 26 tok/s on the M3 Max. Qwen 2.5 Coder hit 29 tok/s. CodeLlama managed 23 tok/s. These are not "fast" in the traditional sense, but they are above the 15 tok/s threshold where tab completion feels responsive and inline suggestions appear without perceptible delay. Context processing was the real differentiator: 1.8 seconds on M3 Max versus 0.4 seconds on the 4090 for the same 4K-token prompt. If you send multi-file refactors as context, that gap compounds.

All testing used Q4_K_M quantization, which strikes the best balance between speed and accuracy in our measurements. Switching to Q5_K_M cost roughly 10% speed for a 1-2% accuracy gain — rarely worth it. Going down to Q2_K bought 30% more speed at the cost of 6-8% accuracy loss, which is a steep price for code where every bracket matters.

Accuracy: where local models land in mid-2026

Raw speed matters less than whether the code compiles. We ran each model through the standard HumanEval Python benchmark (pass@1, temperature 0.2) and a suite of 50 real-world coding tasks drawn from our team's internal backlog — fixing bugs, writing functions from a docstring, refactoring modules, and generating SQL queries.

On HumanEval, Claude 3.5 Sonnet scored 92.0% and GPT-4o scored 90.2%. Among the local models, DeepSeek Coder V2 33B hit 83.5% — a gap of roughly 9 points from the cloud leaders, but strong enough that for many tasks, you would not notice the difference. Qwen 2.5 Coder 32B scored 80.1%. CodeLlama 34B trailed at 71.3%, which is enough to be useful but high enough to require more careful review.

On the real-world task suite, the ranking held but the gaps widened. Our internal tasks demand multi-step reasoning, library awareness, and consistency across multiple files — the kind of work that separates a code-completion demo from an engineering assistant. Claude 3.5 Sonnet solved 44 of 50 tasks correctly. GPT-4o managed 42. DeepSeek Coder V2 solved 37 — solidly useful, especially for a model running on your own hardware, but you will hit its ceiling on tasks that require reasoning across three or more files. Qwen 2.5 Coder solved 33 and CodeLlama solved 28.

For single-function generation and inline completion, DeepSeek Coder V2 and Qwen 2.5 Coder are indistinguishable from cloud models in quality. The gap only appears on multi-file refactors and tasks requiring non-local reasoning. If most of your AI-assisted coding is function-level, a local model will serve you well. If you regularly ask an AI to refactor an entire module, keep the cloud API key handy.

Ollama, LM Studio, and llama.cpp produced identical quality scores when loaded with the same quantization and the same sampling parameters — as they should, since they all use llama.cpp under the hood. The choice of runner is about workflow, not output quality.

Ollama vs LM Studio vs llama.cpp: the workflow decision

If the output quality is the same, the question becomes: which tool integrates best with how you actually write code?

Ollama wins on API compatibility. It exposes an OpenAI-compatible endpoint at localhost:11434, which means every IDE extension and CLI tool that talks to OpenAI can be pointed at it with a one-line URL change. The continue.dev VS Code extension, Aider, and the Cody CLI all work out of the box. Ollama also handles model downloading with a single command (ollama pull deepseek-coder-v2:33b), manages concurrent requests cleanly, and barely touches CPU when the model is idle. If you want a daemon that sits in the background and serves coding requests as if it were a cloud API, Ollama is the path of least resistance.

LM Studio is the choice for developers who want a GUI. It exposes the same local server endpoint (port 1234 by default) with OpenAI API compatibility, but the desktop app gives you a model browser, one-click download, and a playground where you can test prompts before wiring it into your editor. The killer feature for coding workflows is the built-in prompt template editor — getting the right chat template for a coding model can be the difference between working code and a garbled response, and LM Studio surfaces that configuration without requiring you to read GGUF metadata by hand. Its GPU offloading slider also makes it trivial to split layers between GPU and CPU on machines with limited VRAM.

llama.cpp is the engine underneath both of them. Running it directly gives you full control over every inference parameter — --ctx-size, --threads, --n-gpu-layers, --batch-size — but at the cost of managing models and prompt templates yourself. In our testing, llama.cpp bare-metal was 3-5% faster than Ollama on identical hardware, because it avoids the HTTP server overhead and scheduling layer. That margin matters if you are running batch inference or building a custom coding agent that chains dozens of model calls per task. For most developers, however, the convenience of Ollama or LM Studio is worth the small speed tradeoff.

If you have an NVIDIA GPU, confirm CUDA is available in your inference stack before benchmarking. Ollama ships with GPU support enabled by default, but LM Studio requires toggling "GPU Offload" in the right sidebar. llama.cpp needs to be compiled with LLAMA_CUDA=1. Without GPU acceleration, generation speed drops to 5-8 tok/s on CPU-only inference — barely usable for anything beyond autocomplete.

Privacy, offline coding, and the real tradeoffs

The privacy argument for local LLMs is straightforward but easy to overstate. When you send code to a cloud API, it passes through someone else's servers — and whether that code ends up in a training set depends on the provider's policies. Anthropic's commercial terms explicitly state they do not train on API inputs. OpenAI's business-tier API carries similar language. What is less clear is how long the data is retained in logs, who inside the provider has access, and whether your security team is comfortable with your proprietary codebase leaving the network.

A local model eliminates that question. Nothing leaves the machine. That matters for defense contractors and fintech companies handling regulated data, but it also matters if you are building in a competitive space and do not want your architecture accidentally producing completions in someone else's session.

The more practical advantage is offline coding. On a flight, in a datacenter with restricted egress, or working from a rural area with spotty connectivity, a local model runs at full speed with zero latency variance. We measured end-to-end latency — prompt to first token — at 180 ms on the 4090 with DeepSeek Coder V2, versus 420 ms to GPT-4o with a 50th-percentile connection. The full response for a 200-token completion arrived in 3.1 seconds locally versus 4.8 seconds via API. That 1.7-second gap is not transformative for a single query, but across a coding session where you send 30-50 prompts, it adds up to minutes of wall-clock time.

The tradeoff is straightforward: you give up roughly 9 points of HumanEval accuracy and gain privacy, offline capability, and zero per-token cost. For function-level coding, this is an easy trade to make. For architecture-level reasoning, the cloud models are still meaningfully ahead.


Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.

Top comments (0)