DEV Community

Chappie

Posted on

Weekend Project: Run a Local LLM for Coding (Zero Cloud, Zero API Keys)

I spent last weekend ditching cloud AI for coding. No more API rate limits, no more sending proprietary code to external servers, no more surprise bills. Just a local LLM running on my machine, integrated with my editor.

Here's exactly how to set it up in an afternoon.

Why Local LLMs for Coding?

Three reasons I made the switch:

  1. Privacy — My client code never leaves my machine
  2. Cost — $0/month after initial setup
  3. Speed — No network latency, works offline

The trade-off? You need decent hardware and the models aren't quite GPT-4 level. But for code completion, refactoring, and explaining code? They're surprisingly good.

What You'll Need

  • RAM: 16GB minimum, 32GB recommended
  • GPU: Optional but helps (NVIDIA with 8GB+ VRAM ideal)
  • Storage: 10-50GB depending on models
  • OS: Linux, macOS, or Windows with WSL2

No GPU? CPU inference works fine — just slower. I ran this on a 2-year-old laptop with no dedicated GPU and it was usable.

Step 1: Install Ollama

Ollama is the easiest way to run local LLMs. One binary, no Python environment hell.

# Linux/WSL
curl -fsSL https://ollama.ai/install.sh | sh

# macOS
brew install ollama

# Start the service
ollama serve

That's it. Ollama runs as a local API server on port 11434.
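Every integration below — the editor plugins, the shell helpers — ultimately talks to that same HTTP API. Here's a minimal sketch of a raw call using only the Python standard library; the `/api/generate` endpoint and its `model`, `prompt`, and `stream` fields come from the Ollama API docs, while `build_payload` and `ask` are just hypothetical helper names:

```python
# Minimal Ollama API client using only the Python standard library.
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    # stream=False asks for one complete JSON response instead of
    # newline-delimited streaming chunks.
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The non-streaming response carries the full completion
        # under the "response" key.
        return json.loads(resp.read())["response"]
```

With the server running, `ask("deepseek-coder:6.7b", "Write a haiku about Go")` returns the completion as a plain string.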

Step 2: Pull a Coding Model

Not all models are equal for code. Here's what actually works:

# Best balance of speed and quality (7B params, ~4GB)
ollama pull deepseek-coder:6.7b

# Faster, smaller, good for completions (3B params, ~2GB)
ollama pull starcoder2:3b

# Heavy hitter if you have the RAM (34B params, ~20GB)
ollama pull codellama:34b

I use deepseek-coder:6.7b daily. It handles Python, TypeScript, Go, and Rust well. For quick completions, starcoder2:3b is snappier.

Test that it works:

ollama run deepseek-coder:6.7b "Write a Python function to merge two sorted lists"

Step 3: Editor Integration

VS Code with Continue

Continue is my pick. Open source, actively maintained, works offline.

  1. Install the Continue extension from VS Code marketplace
  2. Open Continue settings (Cmd/Ctrl + Shift + P → "Continue: Open config.json")
  3. Add your Ollama model:
{
  "models": [
    {
      "title": "DeepSeek Coder",
      "provider": "ollama",
      "model": "deepseek-coder:6.7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "StarCoder",
    "provider": "ollama", 
    "model": "starcoder2:3b"
  }
}

Now you have:

  • Chat with code context (highlight code → ask questions)
  • Tab completions as you type
  • Inline edits (Cmd+I to refactor selected code)

Neovim with Ollama.nvim

-- lazy.nvim
{
  "nomnivore/ollama.nvim",
  dependencies = { "nvim-lua/plenary.nvim" },
  cmd = { "Ollama", "OllamaModel" },
  opts = {
    model = "deepseek-coder:6.7b",
    url = "http://127.0.0.1:11434",
  }
}

Map it to a key:

vim.keymap.set("v", "<leader>oo", ":<c-u>lua require('ollama').prompt()<cr>")

Step 4: Terminal Integration

Sometimes I just want to ask a quick question without leaving the terminal.

# Add to .bashrc/.zshrc
ask() {
  ollama run deepseek-coder:6.7b "$*"
}

# Usage
ask "What's the time complexity of Python's sorted()?"

For piping code:

cat broken_script.py | ollama run deepseek-coder:6.7b "Fix the bugs in this code"

Performance Tuning

GPU Acceleration (NVIDIA)

Ollama auto-detects CUDA. Verify it's actually running on your GPU:

# The PROCESSOR column shows GPU vs CPU for each loaded model
ollama ps

# --verbose prints tokens/sec after each response —
# single-digit numbers usually mean you're on CPU
ollama run deepseek-coder:6.7b --verbose

If not detected, ensure you have NVIDIA drivers installed (plus nvidia-container-toolkit if you run Ollama in Docker).

Reduce Memory Usage

Loading multiple models eats RAM. Ollama keeps models in memory by default. To unload:

# List currently loaded models
# (/api/tags lists *installed* models; loaded ones show in `ollama ps`)
ollama ps

# Ollama auto-unloads after 5 min idle
# Or restart the service to clear everything
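If you don't want to wait out the idle timeout, the API's documented `keep_alive` field lets you evict a model on demand. A small sketch (endpoint and field names are from the Ollama API docs; `unload` is a hypothetical helper):

```python
# Force-unload a model by sending a generate request with keep_alive=0.
import json
import urllib.request

def unload_payload(model: str) -> dict:
    # keep_alive=0 tells the server to evict the model immediately
    # after handling this (promptless) request.
    return {"model": model, "keep_alive": 0}

def unload(model: str) -> None:
    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/generate",
        data=json.dumps(unload_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```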

Speed vs Quality

For faster responses with slight quality drop, use quantized models:

# q4 = 4-bit quantization, faster, less accurate
ollama pull deepseek-coder:6.7b-instruct-q4_0

I use full precision for complex refactoring, quantized for quick completions.

Real-World Usage

After a month with this setup, here's what works well:

Great for:

  • Code completion and boilerplate
  • Explaining unfamiliar code
  • Writing tests for existing functions
  • Regex and SQL generation
  • Git commit messages

Still use cloud AI for:

  • Complex architectural decisions
  • Multi-file refactoring
  • Debugging truly weird issues

The local setup handles 80% of my daily AI coding needs. That's a win.

Troubleshooting

"Model not found" — Run ollama list to see installed models. Pull again if missing.

Slow responses — Try a smaller model or quantized version. Check if it's using GPU with --verbose.

Out of memory — Close other apps, use a smaller model, or add swap space.

Connection refused — Ensure ollama serve is running. Check nothing else is on port 11434.

What's Next

Once you're comfortable:

  1. Try different models — Mistral, Phi-3, Llama 3 all have coding variants
  2. Fine-tune on your codebase — Ollama supports custom Modelfiles
  3. Build custom tools — The Ollama API is dead simple to script against
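As an example of point 3, here's a sketch of the commit-message helper mentioned earlier, built on the same `/api/generate` endpoint. The prompt wording, the 4000-character diff cap, and the function names are my own choices, not anything the Ollama docs prescribe:

```python
# Draft a one-line commit message from the staged git diff via Ollama.
import json
import subprocess
import urllib.request

def commit_prompt(diff: str, limit: int = 4000) -> str:
    # Cap huge diffs so the prompt stays within the model's context window.
    return ("Write a one-line git commit message for this diff:\n\n"
            + diff[:limit])

def suggest_commit_message(model: str = "deepseek-coder:6.7b") -> str:
    # Grab only the staged changes, same as what `git commit` would record.
    diff = subprocess.run(
        ["git", "diff", "--staged"], capture_output=True, text=True
    ).stdout
    body = json.dumps(
        {"model": model, "prompt": commit_prompt(diff), "stream": False}
    ).encode()
    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()
```

Wire it into a git alias or a shell function and you get local, offline commit-message drafts for free.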

The local LLM ecosystem is moving fast. Models that needed 64GB RAM two years ago now run on laptops. It's only getting better.


More at dev.to/cumulus
