DEV Community

Chappie

Posted on

Weekend Project: Run a Local LLM for Coding (Zero Cloud, Zero API Keys)

I spent last weekend ditching cloud AI for coding. No more API rate limits, no more sending proprietary code to external servers, no more surprise bills. Just a local LLM running on my machine, integrated with my editor.

Here's exactly how to set it up in an afternoon.

Why Local LLMs for Coding?

Three reasons I made the switch:

  1. Privacy — My client code never leaves my machine
  2. Cost — $0/month after initial setup
  3. Speed — No network latency, works offline

The trade-off? You need decent hardware and the models aren't quite GPT-4 level. But for code completion, refactoring, and explaining code? They're surprisingly good.

What You'll Need

  • RAM: 16GB minimum, 32GB recommended
  • GPU: Optional but helps (NVIDIA with 8GB+ VRAM ideal)
  • Storage: 10-50GB depending on models
  • OS: Linux, macOS, or Windows with WSL2

No GPU? CPU inference works fine — just slower. I ran this on a 2-year-old laptop with no dedicated GPU and it was usable.

Step 1: Install Ollama

Ollama is the easiest way to run local LLMs. One binary, no Python environment hell.

# Linux/WSL
curl -fsSL https://ollama.ai/install.sh | sh

# macOS
brew install ollama

# Start the service
ollama serve

That's it. Ollama runs as a local API server on port 11434.
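Every integration below — the editor plugins, the shell helpers — ultimately talks to that same HTTP API. Here's a minimal sketch of a raw call using only the Python standard library; the `/api/generate` endpoint and its `model`, `prompt`, and `stream` fields come from the Ollama API docs, while `build_payload` and `ask` are just hypothetical helper names:

```python
# Minimal Ollama API client using only the Python standard library.
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    # stream=False asks for one complete JSON response instead of
    # newline-delimited streaming chunks.
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The non-streaming response carries the full completion
        # under the "response" key.
        return json.loads(resp.read())["response"]
```

With the server running, `ask("deepseek-coder:6.7b", "Write a haiku about Go")` returns the completion as a plain string.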

Step 2: Pull a Coding Model

Not all models are equal for code. Here's what actually works:

# Best balance of speed and quality (7B params, ~4GB)
ollama pull deepseek-coder:6.7b

# Faster, smaller, good for completions (3B params, ~2GB)
ollama pull starcoder2:3b

# Heavy hitter if you have the RAM (34B params, ~20GB)
ollama pull codellama:34b

I use deepseek-coder:6.7b daily. It handles Python, TypeScript, Go, and Rust well. For quick completions, starcoder2:3b is snappier.

Test that it works:

ollama run deepseek-coder:6.7b "Write a Python function to merge two sorted lists"

Step 3: Editor Integration

VS Code with Continue

Continue is my pick. Open source, actively maintained, works offline.

  1. Install the Continue extension from VS Code marketplace
  2. Open Continue settings (Cmd/Ctrl + Shift + P → "Continue: Open config.json")
  3. Add your Ollama model:
{
  "models": [
    {
      "title": "DeepSeek Coder",
      "provider": "ollama",
      "model": "deepseek-coder:6.7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "StarCoder",
    "provider": "ollama", 
    "model": "starcoder2:3b"
  }
}

Now you have:

  • Chat with code context (highlight code → ask questions)
  • Tab completions as you type
  • Inline edits (Cmd+I to refactor selected code)

Neovim with Ollama.nvim

-- lazy.nvim
{
  "nomnivore/ollama.nvim",
  dependencies = { "nvim-lua/plenary.nvim" },
  cmd = { "Ollama", "OllamaModel" },
  opts = {
    model = "deepseek-coder:6.7b",
    url = "http://127.0.0.1:11434",
  }
}

Map it to a key:

vim.keymap.set("v", "<leader>oo", ":<c-u>lua require('ollama').prompt()<cr>")

Step 4: Terminal Integration

Sometimes I just want to ask a quick question without leaving the terminal.

# Add to .bashrc/.zshrc
ask() {
  ollama run deepseek-coder:6.7b "$*"
}

# Usage
ask "What's the time complexity of Python's sorted()?"

For piping code:

cat broken_script.py | ollama run deepseek-coder:6.7b "Fix the bugs in this code"

Performance Tuning

GPU Acceleration (NVIDIA)

Ollama auto-detects CUDA. Verify it's actually running on your GPU:

# The PROCESSOR column shows GPU vs CPU for each loaded model
ollama ps

# --verbose prints tokens/sec after each response —
# single-digit numbers usually mean you're on CPU
ollama run deepseek-coder:6.7b --verbose

If not detected, ensure you have NVIDIA drivers installed (plus nvidia-container-toolkit if you run Ollama in Docker).

Reduce Memory Usage

Loading multiple models eats RAM. Ollama keeps models in memory by default. To unload:

# List currently loaded models
# (/api/tags lists *installed* models; loaded ones show in `ollama ps`)
ollama ps

# Ollama auto-unloads after 5 min idle
# Or restart the service to clear everything
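If you don't want to wait out the idle timeout, the API's documented `keep_alive` field lets you evict a model on demand. A small sketch (endpoint and field names are from the Ollama API docs; `unload` is a hypothetical helper):

```python
# Force-unload a model by sending a generate request with keep_alive=0.
import json
import urllib.request

def unload_payload(model: str) -> dict:
    # keep_alive=0 tells the server to evict the model immediately
    # after handling this (promptless) request.
    return {"model": model, "keep_alive": 0}

def unload(model: str) -> None:
    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/generate",
        data=json.dumps(unload_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```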

Speed vs Quality

For faster responses with slight quality drop, use quantized models:

# q4 = 4-bit quantization, faster, less accurate
ollama pull deepseek-coder:6.7b-instruct-q4_0

I use full precision for complex refactoring, quantized for quick completions.

Real-World Usage

After a month with this setup, here's what works well:

Great for:

  • Code completion and boilerplate
  • Explaining unfamiliar code
  • Writing tests for existing functions
  • Regex and SQL generation
  • Git commit messages

Still use cloud AI for:

  • Complex architectural decisions
  • Multi-file refactoring
  • Debugging truly weird issues

The local setup handles 80% of my daily AI coding needs. That's a win.

Troubleshooting

"Model not found" — Run ollama list to see installed models. Pull again if missing.

Slow responses — Try a smaller model or quantized version. Check if it's using GPU with --verbose.

Out of memory — Close other apps, use a smaller model, or add swap space.

Connection refused — Ensure ollama serve is running. Check nothing else is on port 11434.

What's Next

Once you're comfortable:

  1. Try different models — Mistral, Phi-3, Llama 3 all have coding variants
  2. Fine-tune on your codebase — Ollama supports custom Modelfiles
  3. Build custom tools — The Ollama API is dead simple to script against
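As an example of point 3, here's a sketch of the commit-message helper mentioned earlier, built on the same `/api/generate` endpoint. The prompt wording, the 4000-character diff cap, and the function names are my own choices, not anything the Ollama docs prescribe:

```python
# Draft a one-line commit message from the staged git diff via Ollama.
import json
import subprocess
import urllib.request

def commit_prompt(diff: str, limit: int = 4000) -> str:
    # Cap huge diffs so the prompt stays within the model's context window.
    return ("Write a one-line git commit message for this diff:\n\n"
            + diff[:limit])

def suggest_commit_message(model: str = "deepseek-coder:6.7b") -> str:
    # Grab only the staged changes, same as what `git commit` would record.
    diff = subprocess.run(
        ["git", "diff", "--staged"], capture_output=True, text=True
    ).stdout
    body = json.dumps(
        {"model": model, "prompt": commit_prompt(diff), "stream": False}
    ).encode()
    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()
```

Wire it into a git alias or a shell function and you get local, offline commit-message drafts for free.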

The local LLM ecosystem is moving fast. Models that needed 64GB RAM two years ago now run on laptops. It's only getting better.


More at dev.to/cumulus
