If you've been paying attention to the open-source LLM space lately, you've probably noticed something: models like Kimi K2.5 are getting absurdly good at code generation. Good enough that even commercial tools are quietly acknowledging them as top-tier. And that means running a capable coding model locally is no longer a pipe dream — it's a real option.
But here's the problem. You download a model, hook it up to your editor, and... it's painfully slow. Completions take 3-5 seconds. Your fan sounds like a jet engine. You give up and go back to a hosted API.
I've been there. Multiple times. After spending way too many hours benchmarking and tweaking local setups, I finally have a workflow that's genuinely usable. Here's how to get there.
The Root Cause: It's Not (Just) Your Hardware
The first instinct is to blame your GPU. And sure, VRAM matters. But the real bottleneck for most people is a combination of three things:
- Wrong quantization level — running a full FP16 model when a Q5_K_M would be nearly identical in quality
- No KV cache optimization — the model is recomputing context from scratch on every keystroke
- Bad batching configuration — your inference server isn't tuned for the interactive, low-latency pattern that code completion needs
Let me walk through each one.
Step 1: Pick the Right Quantization
Running a 70B parameter model at full precision requires ~140GB of VRAM. Nobody has that on a workstation. But here's the thing — for code completion specifically, you can quantize aggressively without meaningful quality loss.
```bash
# Download a quantized model using huggingface-cli
pip install huggingface-hub
huggingface-cli download TheBloke/CodeLlama-34B-GGUF \
  codellama-34b.Q5_K_M.gguf \
  --local-dir ./models

# Q5_K_M is the sweet spot for most setups
# Q4_K_M if you're VRAM-constrained
# Q6_K if you have headroom and want max quality
```
The general rule I follow: Q5_K_M is the default. I've tested code generation quality across quantization levels, and Q5_K_M sits right at the knee of the curve — you lose maybe 1-2% on HumanEval benchmarks compared to FP16, but you cut memory usage by ~60%.
If you're running a smaller model (7B-14B range), you can afford Q6_K or even FP16. For anything 30B+, Q5_K_M or Q4_K_M is the practical choice.
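To make the sizing concrete, here's a rough back-of-envelope calculator. The bits-per-weight figures are approximate averages for GGUF k-quants (real files vary slightly because different tensors get different quant types), and `model_gb` is a hypothetical helper, not part of any tool:

```python
# Approximate average bits per weight for common GGUF quantization levels.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q6_K": 6.56,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.8,
}

def model_gb(params_billion: float, quant: str) -> float:
    """Approximate model size in GB (weights only, no KV cache)."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"34B @ {quant}: ~{model_gb(34, quant):.0f} GB")
```

Run it for a 70B model at FP16 and you get the ~140GB figure above; at Q5_K_M a 34B model drops to roughly 23GB of weights, which fits on a single 24GB card with the KV cache tricks below.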
Step 2: Configure Your Inference Server Properly
This is where most people leave performance on the table. The default configs for llama.cpp, vLLM, or Ollama are tuned for chatbot-style interactions — long prompts, long responses. Code completion is the opposite: you need low latency on short outputs.
Here's my llama-server config for code completion:
```bash
# Start llama.cpp server optimized for code completion.
# Flags: -ngl 99 offloads all layers to GPU; -c 4096 caps the context
# window (4K is enough for completions); -np 1 uses a single slot since
# we only need one concurrent request; --cache-type-k/v q8_0 quantizes
# the KV cache to save VRAM; -fa enables flash attention; and
# --cont-batching enables continuous batching for faster generation.
./llama-server \
  -m ./models/codellama-34b.Q5_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 4096 \
  -np 1 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -fa \
  --cont-batching
```
The two flags that make the biggest difference:
- `--cache-type-k q8_0` and `--cache-type-v q8_0` — This quantizes the KV cache itself, which can reduce VRAM usage by 30-50% during inference. The quality impact is negligible for code.
- `-c 4096` — Don't set this to 32K or 128K just because the model supports it. Larger context windows consume more VRAM and slow down attention computation. For code completions, 4K is plenty.
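To see why both flags matter, here's a rough KV cache size estimate. The model dimensions used (48 layers, 8 KV heads via GQA, head_dim 128) are illustrative numbers for a CodeLlama-34B-class architecture; check your model's card for the real values:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    # K and V each store n_ctx * n_kv_heads * head_dim elements per layer,
    # hence the factor of 2 up front.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / 1e9

# Illustrative dimensions for a 34B-class model with grouped-query attention.
for ctx in (4096, 32768):
    f16 = kv_cache_gb(48, 8, 128, ctx, 2.0)     # f16 cache: 2 bytes/element
    q8 = kv_cache_gb(48, 8, 128, ctx, 1.0625)   # q8_0: ~8.5 bits/element
    print(f"ctx={ctx}: f16 {f16:.2f} GB vs q8_0 {q8:.2f} GB")
```

At a 4K context the cache is under a gigabyte either way, but at 32K the f16 cache alone costs several gigabytes, which is exactly the VRAM you'd rather spend on a better quant of the weights.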
Step 3: Set Up Speculative Decoding
This is the trick that took my completions from "usable" to "actually fast." Speculative decoding uses a small draft model to predict tokens, then the main model verifies them in a single forward pass. When the draft model guesses correctly (which happens a lot with predictable code patterns), you get multiple tokens for the cost of one.
```bash
# Use a small model as a draft for speculative decoding.
# -md points at the draft model; --draft-max 8 predicts up to 8 tokens
# ahead; --draft-min 1 always drafts at least one; -ngld 99 offloads the
# draft model's layers to the GPU as well.
./llama-server \
  -m ./models/codellama-34b.Q5_K_M.gguf \
  -md ./models/codellama-7b.Q4_K_M.gguf \
  --draft-max 8 \
  --draft-min 1 \
  -ngl 99 \
  -ngld 99 \
  -c 4096 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -fa
```
In my testing, speculative decoding with a 7B draft model gives roughly 1.8-2.5x speedup on code completion tasks. The key is that code is highly predictable — boilerplate, closing brackets, common patterns — so the draft model nails most of it.
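A simplified model shows where that speedup comes from. Assuming each drafted token is accepted independently with probability `p` (a toy assumption; real acceptance is bursty, and `p = 0.7` here is made up for illustration), the expected number of tokens you get per expensive target-model pass is a truncated geometric series:

```python
def tokens_per_target_pass(p: float, k: int) -> float:
    # The target pass keeps draft tokens until the first rejection, then
    # emits one token of its own, so the expected yield per pass is
    # 1 + p + p^2 + ... + p^k = (1 - p**(k+1)) / (1 - p).
    return sum(p**i for i in range(k + 1))

# If the 7B draft guesses ~70% of tokens correctly with 8-token drafts:
print(tokens_per_target_pass(0.7, 8))   # ~3.2 tokens per target pass
```

The ~3.2x per-pass yield shrinks to the observed 1.8-2.5x wall-clock speedup once you pay for the draft model's own forward passes, but the shape of the win is clear: the more predictable the text, the higher `p`, and code is about as predictable as text gets.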
Step 4: Connect It to Your Editor
Most editors that support AI code completion can point to a local OpenAI-compatible API. llama.cpp's server exposes one at /v1/completions by default.
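If you want to sanity-check the endpoint before touching editor config, here's a sketch of the request body. The field names follow the common OpenAI completions schema; the exact set of fields your server version accepts may vary, and `completion_request` is just a hypothetical helper:

```python
import json

def completion_request(prompt: str) -> dict:
    # Request body for an OpenAI-compatible /v1/completions endpoint.
    # POST as JSON to http://localhost:8080/v1/completions.
    return {
        "model": "local",       # name is ignored by a single-model server
        "prompt": prompt,
        "max_tokens": 128,      # keep completions short and fast
        "temperature": 0.1,     # low temp for predictable code
        "stop": ["\n\n"],       # stop at logical boundaries
        "stream": True,         # stream tokens for lower perceived latency
    }

print(json.dumps(completion_request("def fibonacci(n):"), indent=2))
```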
For Neovim users, here's a minimal setup with a generic completion plugin:
```lua
-- Example config pointing to local llama.cpp server
local completion_config = {
  api_endpoint = "http://localhost:8080/v1/completions",
  model = "local",        -- name doesn't matter for local server
  max_tokens = 128,       -- keep completions short and fast
  temperature = 0.1,      -- low temp for predictable code
  stop = { "\n\n" },      -- stop at logical boundaries
  debounce_ms = 300,      -- don't fire on every keystroke
}
```
The `debounce_ms` setting is crucial. Without it, you're sending a request on every single keystroke, which queues up inference jobs and makes everything feel laggy. 300ms is a good starting point — fast enough to feel responsive, slow enough to avoid wasted computation.
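Conceptually, the debounce logic amounts to this (a toy sketch of the idea, not any plugin's actual implementation):

```python
class Debouncer:
    """Fire only after the keystroke stream has been quiet for wait_s."""

    def __init__(self, wait_s: float = 0.3):
        self.wait_s = wait_s
        self.last_keystroke = None

    def keystroke(self, now: float) -> None:
        # Every keystroke resets the quiet-period timer.
        self.last_keystroke = now

    def should_fire(self, now: float) -> bool:
        # Only request a completion once typing has paused for wait_s.
        return (self.last_keystroke is not None
                and now - self.last_keystroke >= self.wait_s)

d = Debouncer(wait_s=0.3)
d.keystroke(0.0)
print(d.should_fire(0.1))   # too soon after the last keystroke
print(d.should_fire(0.35))  # quiet long enough, fire the request
```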
Monitoring Your Setup
Once it's running, keep an eye on a few things:
- Tokens per second — llama.cpp logs this. For code completions to feel snappy, you want at least 30 tok/s on the generation side.
- Time to first token (TTFT) — this matters more than raw throughput. If TTFT is over 500ms, your context might be too large or your prompt processing is bottlenecked.
- VRAM usage — run `nvidia-smi` (or `sudo powermetrics --samplers gpu_power` on Mac) and make sure you're not swapping to system RAM. The moment you spill to CPU, latency goes through the roof.
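If your client logs token arrival timestamps from the streaming response, both of the first two numbers fall out of simple arithmetic. This is a hypothetical helper with made-up timestamps, just to show the calculation:

```python
def completion_stats(request_t: float, token_times: list[float]) -> dict:
    """Compute TTFT and generation tok/s from token arrival timestamps."""
    ttft = token_times[0] - request_t
    gen_span = token_times[-1] - token_times[0]
    # Throughput counts the gaps between tokens, not the first token itself.
    tok_per_s = (len(token_times) - 1) / gen_span if gen_span > 0 else 0.0
    return {"ttft_s": ttft, "tok_per_s": tok_per_s}

# e.g. timestamps logged from a streaming /v1/completions response:
stats = completion_stats(0.0, [0.25, 0.28, 0.31, 0.34, 0.37])
print(stats)   # TTFT 0.25 s, ~33 tok/s generation
```

Against the thresholds above: this hypothetical run is comfortably under 500ms TTFT but only borderline on the 30 tok/s target, so you'd look at the generation side (quant level, speculative decoding) rather than prompt processing.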
The Bigger Picture
The reason this matters right now is that the open-source model landscape has shifted dramatically. A year ago, local models were a novelty — fun to play with but not production-ready for code work. Today, models like Kimi K2.5, DeepSeek-Coder V3, and Qwen2.5-Coder are legitimately competitive with hosted options for many coding tasks.
The gap isn't in model quality anymore. It's in the infrastructure layer — getting inference fast enough that it doesn't break your flow. The steps above won't give you cloud-grade latency on commodity hardware, but they'll get you to a place where local completions are genuinely useful rather than a frustrating demo.
And honestly? There's something satisfying about your entire AI coding stack running on your own machine, no API keys required, no usage limits, no data leaving your network. For proprietary codebases especially, that last point matters more than any benchmark score.