Chappie

How to Run Local LLMs for Coding (No Cloud, No API Keys)

I stopped sending my code to external APIs six months ago. Not for privacy reasons—though that's a nice bonus—but because local LLMs for coding have gotten genuinely good.

Here's how to set up a complete local AI coding assistant in under 20 minutes. No subscriptions. No rate limits. No sending your proprietary code to someone else's servers.

Why Local LLMs Actually Make Sense Now

The gap between cloud models and local ones has shrunk dramatically. For most coding tasks—autocomplete, explaining code, writing tests, refactoring—a well-tuned 7B or 14B model running locally performs within 80-90% of GPT-4.

That remaining 10-20%? It's usually in complex multi-file reasoning or obscure language edge cases. For daily coding, local models handle it fine.

The real wins:

  • Zero latency dependency — Works offline, on planes, in cafes with garbage wifi
  • No token costs — Run it 1000 times a day, costs nothing
  • Privacy — Your code stays on your machine
  • Customization — Fine-tune on your codebase if you want

Step 1: Install Ollama

Ollama is the easiest way to run local LLMs: a single binary that handles model downloads and exposes a local API.

macOS/Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:
Download from ollama.com and run the installer.

Verify it's running:

ollama --version

Step 2: Pull a Coding Model

Not all models are created equal for code. Here's what actually works:

Best all-rounder (7B, runs on 8GB RAM):

ollama pull deepseek-coder:6.7b-instruct

Better quality, needs 16GB RAM:

ollama pull codellama:13b-instruct

Best local coding model (needs 32GB RAM):

ollama pull deepseek-coder:33b-instruct

My daily driver is deepseek-coder:6.7b-instruct. Fast, accurate, fits in memory alongside my IDE and browser.

Step 3: Test It Works

ollama run deepseek-coder:6.7b-instruct "Write a Python function to validate email addresses using regex"

You should see it generate code within seconds. If it's slow, you're either memory-constrained or need to close some Chrome tabs.

Step 4: Connect to Your Editor

VS Code with Continue

Continue is the best free extension for local LLM integration.

  1. Install Continue from VS Code marketplace
  2. Open settings (Ctrl+Shift+P → "Continue: Open Config")
  3. Add this config:
{
  "models": [
    {
      "title": "DeepSeek Local",
      "provider": "ollama",
      "model": "deepseek-coder:6.7b-instruct"
    }
  ],
  "tabAutocompleteModel": {
    "title": "DeepSeek Autocomplete",
    "provider": "ollama",
    "model": "deepseek-coder:6.7b-instruct"
  }
}

Now you have:

  • Inline autocomplete (like Copilot)
  • Chat sidebar for questions
  • Cmd+L to explain selected code

Neovim with gen.nvim

-- In your lazy.nvim config
{
  "David-Kunz/gen.nvim",
  opts = {
    model = "deepseek-coder:6.7b-instruct",
    host = "localhost",
    port = "11434",
  }
}

Step 5: API Integration for Scripts

Ollama exposes a REST API on port 11434. Use it in your tooling:

import requests

def ask_llm(prompt: str) -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "deepseek-coder:6.7b-instruct",
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]

# Generate a test
code = open("my_module.py").read()
tests = ask_llm(f"Write pytest tests for this code:\n\n{code}")
print(tests)
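The snippet above waits for the whole response. The same /api/generate endpoint also streams: leave out "stream": false and Ollama returns one JSON object per line, each with a "response" fragment and a final "done": true marker. A sketch of collecting a stream (the collect_stream helper is mine, not part of Ollama):

```python
import json
import requests

def collect_stream(lines) -> str:
    """Concatenate the "response" fragments from Ollama's line-delimited JSON stream."""
    parts = []
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

def stream_llm(prompt: str) -> str:
    # stream=True lets requests hand us the response line by line
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "deepseek-coder:6.7b-instruct", "prompt": prompt},
        stream=True,
    ) as response:
        return collect_stream(response.iter_lines())
```

Swap the print in collect_stream's loop if you want tokens to appear as they're generated, e.g. for a CLI tool.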

I use this for:

  • Pre-commit hooks that generate test stubs
  • Documentation generators
  • Code review bots in CI
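As an example of the first one, a pre-commit hook can ask for test stubs for each staged Python file. This is a sketch: build_prompt is an illustrative template of mine, and the final step would feed the prompt to the ask_llm function from the API snippet above.

```python
import subprocess

def staged_python_files() -> list[str]:
    """Ask git for staged .py files (empty list if git is unavailable)."""
    try:
        out = subprocess.run(
            ["git", "diff", "--cached", "--name-only", "--diff-filter=AM"],
            capture_output=True, text=True,
        ).stdout
    except FileNotFoundError:
        return []
    return [f for f in out.splitlines() if f.endswith(".py")]

def build_prompt(path: str, source: str) -> str:
    """Prompt template for a test-stub request (wording is illustrative)."""
    return f"Write pytest test stubs for {path}:\n\n{source}"

if __name__ == "__main__":
    for path in staged_python_files():
        with open(path) as f:
            prompt = build_prompt(path, f.read())
        # Feed `prompt` to ask_llm() and write the result somewhere
        # like tests/test_<module>.py, then `git add` it if you want
        # the stubs in the same commit.
        print(f"Generating stubs for {path}")
```

Drop it in .git/hooks/pre-commit (or wire it up via the pre-commit framework) and it runs on every commit.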

Performance Tuning

If responses are slow:

Check memory usage:

ollama ps

Use a smaller context window. `ollama run` doesn't take a context flag; set it from inside the REPL:

ollama run deepseek-coder:6.7b-instruct
>>> /set parameter num_ctx 2048

Enable GPU acceleration (if you have NVIDIA):

# Should auto-detect, but verify
nvidia-smi

Most 7B models run fine on CPU with 16GB RAM. For 13B+, you really want a GPU.

Model Recommendations by Use Case

| Task | Model | RAM needed |
| --- | --- | --- |
| Autocomplete | deepseek-coder:1.3b | 4GB |
| General coding | deepseek-coder:6.7b-instruct | 8GB |
| Complex refactoring | codellama:13b-instruct | 16GB |
| Architecture decisions | deepseek-coder:33b-instruct | 32GB |

Start small. The 6.7B model handles 90% of daily tasks. Scale up when you hit limits.
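If you script your setup (dotfiles, provisioning), the table above maps cleanly to a small helper. The model names and RAM thresholds are the ones from the table, not anything Ollama provides:

```python
# (model, minimum RAM in GB) pairs from the table, largest first
MODELS = [
    ("deepseek-coder:33b-instruct", 32),
    ("codellama:13b-instruct", 16),
    ("deepseek-coder:6.7b-instruct", 8),
    ("deepseek-coder:1.3b", 4),
]

def pick_model(ram_gb: int) -> str:
    """Return the largest model from the table that fits in ram_gb."""
    for name, min_ram in MODELS:
        if ram_gb >= min_ram:
            return name
    raise ValueError(f"{ram_gb}GB is below the table's 4GB minimum")
```

Pipe the result into `ollama pull` and you've automated the sizing decision.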

What Local LLMs Won't Do

Be realistic about limitations:

  • Large codebase understanding — They can't hold 50 files in context
  • Cutting-edge frameworks — Training data has a cutoff
  • Complex debugging — Claude and GPT-4 still win here

For those cases, I keep a cloud API as backup. But 80% of my AI-assisted coding now runs locally.

Wrapping Up

The setup takes 15 minutes. The models are free. The privacy is a bonus.

If you're still paying for Copilot and only use it for autocomplete and simple explanations, try this for a week. You might not go back.

More at dev.to/cumulus
