I stopped sending my code to external APIs six months ago. Not for privacy reasons—though that's a nice bonus—but because local LLMs for coding have gotten genuinely good.
Here's how to set up a complete local AI coding assistant in under 20 minutes. No subscriptions. No rate limits. No sending your proprietary code to someone else's servers.
## Why Local LLMs Actually Make Sense Now
The gap between cloud models and local ones has shrunk dramatically. For most coding tasks—autocomplete, explaining code, writing tests, refactoring—a well-tuned 7B or 14B model running locally performs at 80-90% of GPT-4's level.
That remaining 10-20%? It's usually in complex multi-file reasoning or obscure language edge cases. For daily coding, local models handle it fine.
The real wins:

- **Zero latency dependency** — works offline, on planes, in cafes with garbage wifi
- **No token costs** — run it 1,000 times a day, it costs nothing
- **Privacy** — your code stays on your machine
- **Customization** — fine-tune on your codebase if you want
## Step 1: Install Ollama
Ollama is the easiest way to run local LLMs. One binary, handles model downloads, provides an API.
**macOS/Linux:**

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows:** Download from ollama.com and run the installer.

Verify it's running:

```shell
ollama --version
```
## Step 2: Pull a Coding Model
Not all models are created equal for code. Here's what actually works:
**Best all-rounder (7B, runs on 8GB RAM):**

```shell
ollama pull deepseek-coder:6.7b-instruct
```

**Better quality, needs 16GB RAM:**

```shell
ollama pull codellama:13b-instruct
```

**Best local coding model (needs 32GB RAM):**

```shell
ollama pull deepseek-coder:33b-instruct
```
My daily driver is `deepseek-coder:6.7b-instruct`: fast, accurate, and it fits in memory alongside my IDE and browser.
## Step 3: Test It Works

```shell
ollama run deepseek-coder:6.7b-instruct "Write a Python function to validate email addresses using regex"
```
You should see it generate code within seconds. If it's slow, you're either memory-constrained or need to close some Chrome tabs.
## Step 4: Connect to Your Editor

### VS Code with Continue

Continue is the best free extension for local LLM integration.

1. Install Continue from the VS Code marketplace
2. Open settings (Ctrl+Shift+P → "Continue: Open Config")
3. Add this config:
```json
{
  "models": [
    {
      "title": "DeepSeek Local",
      "provider": "ollama",
      "model": "deepseek-coder:6.7b-instruct"
    }
  ],
  "tabAutocompleteModel": {
    "title": "DeepSeek Autocomplete",
    "provider": "ollama",
    "model": "deepseek-coder:6.7b-instruct"
  }
}
```
Now you have:
- Inline autocomplete (like Copilot)
- Chat sidebar for questions
- Cmd+L to explain selected code
### Neovim with gen.nvim

```lua
-- In your lazy.nvim config
{
  "David-Kunz/gen.nvim",
  opts = {
    model = "deepseek-coder:6.7b-instruct",
    host = "localhost",
    port = "11434",
  },
}
```
## Step 5: API Integration for Scripts
Ollama exposes a REST API on port 11434. Use it in your tooling:
```python
import requests

def ask_llm(prompt: str) -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "deepseek-coder:6.7b-instruct",
            "prompt": prompt,
            "stream": False,
        },
    )
    response.raise_for_status()  # fail loudly if Ollama isn't running
    return response.json()["response"]

# Generate a test
code = open("my_module.py").read()
tests = ask_llm(f"Write pytest tests for this code:\n\n{code}")
print(tests)
```
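If you drop `"stream": False`, the API streams its answer as newline-delimited JSON chunks, each carrying a piece of the output in its `response` field. A small helper can reassemble them (the name `join_stream` is mine, not part of any library):

```python
import json

def join_stream(ndjson_lines):
    """Stitch together the 'response' fields from Ollama's streaming NDJSON output."""
    return "".join(
        json.loads(line)["response"] for line in ndjson_lines if line.strip()
    )

# Simulated stream chunks, shaped like the lines the API sends
chunks = ['{"response": "def add(a, b):"}', '{"response": " return a + b"}']
print(join_stream(chunks))  # def add(a, b): return a + b
```

Streaming is worth the extra parsing when you pipe output into a terminal and want to see tokens as they arrive.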
I use this for:
- Pre-commit hooks that generate test stubs
- Documentation generators
- Code review bots in CI
## Performance Tuning
If responses are slow:

**Check memory usage:**

```shell
ollama ps
```

**Use a smaller context window.** Set it from inside an interactive session (or pass `num_ctx` in the API's `options`):

```shell
ollama run deepseek-coder:6.7b-instruct
>>> /set parameter num_ctx 2048
```

**Enable GPU acceleration (if you have NVIDIA):**

```shell
# Ollama should auto-detect the GPU, but verify the driver sees it
nvidia-smi
```
Most 7B models run fine on CPU with 16GB RAM. For 13B+, you really want a GPU.
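Those RAM figures follow from a simple rule of thumb: a quantized model needs roughly `parameters × bytes-per-weight` for its weights, plus headroom for the KV cache and runtime. A back-of-the-envelope sketch (the 1.2 overhead factor is my assumption, not an official Ollama number):

```python
def est_ram_gb(params_billions: float, bits_per_weight: int = 4, overhead: float = 1.2) -> float:
    """Rough RAM for a quantized model: weight bytes plus ~20% runtime overhead."""
    return round(params_billions * (bits_per_weight / 8) * overhead, 1)

print(est_ram_gb(6.7))  # roughly 4 GB for a 4-bit 6.7B model
print(est_ram_gb(33))   # roughly 20 GB for the 33B model
```

That's why the 6.7B model runs comfortably on an 8GB machine even after the OS and your editor take their share.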
## Model Recommendations by Use Case

| Task | Model | RAM Needed |
|---|---|---|
| Autocomplete | deepseek-coder:1.3b | 4GB |
| General coding | deepseek-coder:6.7b-instruct | 8GB |
| Complex refactoring | codellama:13b-instruct | 16GB |
| Architecture decisions | deepseek-coder:33b-instruct | 32GB |
Start small. The 6.7B model handles 90% of daily tasks. Scale up when you hit limits.
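If you script your setup, the table collapses into a tiny picker function (illustrative only; the thresholds simply mirror the table):

```python
def pick_model(ram_gb: int) -> str:
    """Map available RAM to the recommended model tag from the table above."""
    if ram_gb >= 32:
        return "deepseek-coder:33b-instruct"
    if ram_gb >= 16:
        return "codellama:13b-instruct"
    if ram_gb >= 8:
        return "deepseek-coder:6.7b-instruct"
    return "deepseek-coder:1.3b"

print(pick_model(16))  # codellama:13b-instruct
```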
## What Local LLMs Won't Do
Be realistic about limitations:
- **Large codebase understanding** — they can't hold 50 files in context
- **Cutting-edge frameworks** — training data has a cutoff
- **Complex debugging** — Claude and GPT-4 still win here
For those cases, I keep a cloud API as backup. But 80% of my AI-assisted coding now runs locally.
## Wrapping Up
The setup takes under 20 minutes. The models are free. The privacy is a bonus.
If you're still paying for Copilot and only use it for autocomplete and simple explanations, try this for a week. You might not go back.
More at dev.to/cumulus