DEV Community

Learn AI Resource

Run LLMs Locally Without Losing Your Mind: A Dev Workflow Guide

So you want to use AI in your development workflow but don't want to send every code snippet to the cloud? I get it. Privacy concerns, latency headaches, API costs adding up: all valid. Here's how I set this up and what actually works.

Why Local LLMs Matter Right Now

Cloud APIs are great until:

  • You're debugging sensitive code and don't want it in someone's training data
  • You're on spotty wifi and waiting 10 seconds for a response kills your flow
  • Your team burns through API budgets faster than expected
  • You need the LLM to just... stay offline

Local LLMs fix most of this. On decent hardware they respond quickly, they cost nothing per request once set up, and your code never leaves your machine.

The Honest Assessment: What Works, What Doesn't

Local models that actually help:

  • Llama 2 (7B) - Surprisingly useful for code explanations and simple refactoring
  • Code Llama - Specifically trained on code. Better at completions and bug spotting than general models
  • Mistral 7B - Fast, decent reasoning for middleware and architecture questions
  • Phi 3 - Tiny but effective for quick debugging hunches

What doesn't work great:

  • Asking them to debug genuinely complex issues (they struggle with context beyond a few hundred lines)
  • Expecting them to learn your codebase unless you feed them docs explicitly
  • Using them for system design when you need real creativity (they tend to suggest textbook solutions)

The Setup (Real Talk)

You'll need either Ollama or LM Studio. I recommend Ollama because it's dead simple and has good integration options.

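# macOS (Homebrew)
brew install ollama

# Linux (official install script)
curl -fsSL https://ollama.com/install.sh | sh

# Start the server if it isn't already running (listens on localhost:11434)
ollama serve

# In another terminal: download a model and chat with it
ollama run llama2

# Or grab Code Llama directly
ollama pull codellama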

That's it. You now have a local API running on http://localhost:11434.
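A quick way to confirm the server is up is to ask it which models it has pulled. This is a minimal sketch, assuming the default port and Ollama's GET /api/tags endpoint; the helper names `model_names` and `list_local_models` are mine, not part of Ollama. It uses only the standard library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if you changed it


def model_names(tags_payload):
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url=OLLAMA_URL, timeout=5):
    """Ask the running Ollama server which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


# Offline illustration with a hand-written payload shaped like the real response.
sample = {"models": [{"name": "codellama:latest"}, {"name": "llama2:latest"}]}
print(model_names(sample))
```

With the server running, `list_local_models()` should return whatever you have pulled so far; a connection error usually means `ollama serve` isn't running.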

Integrating It Into Your Workflow

Option 1: CLI (Fastest for quick questions)

# Using curl to hit your local LLM
curl http://localhost:11434/api/generate -d '{
  "model": "codellama",
  "prompt": "Why would this break? function merge(a, b) { return {...a, ...b} }",
  "stream": false
}'
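If you leave out "stream": false, the endpoint streams its answer as one JSON object per line, each carrying a fragment in "response" and the last one marked "done": true. Here is a rough sketch of stitching those lines back together, assuming that newline-delimited format; `join_stream` and `stream_generate` are hypothetical helper names.

```python
import json
import urllib.request


def join_stream(lines):
    """Concatenate the "response" fragments from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip empty lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def stream_generate(prompt, model="codellama",
                    url="http://localhost:11434/api/generate"):
    """POST a prompt with streaming enabled and return the assembled text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return join_stream(resp)  # the response object yields one line at a time


# Offline illustration using two hand-written chunks.
demo = ['{"response": "Shallow ", "done": false}',
        '{"response": "merge.", "done": true}']
print(join_stream(demo))
```

Streaming mostly matters for interactive use, where seeing the first words early feels faster than waiting for the whole answer.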

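Option 2: VS Code Extension
Install "Continue" and set Ollama as its model provider, pointing at localhost:11434. You get autocomplete without leaving your editor. Game changer for repetitive patterns. Codeium, as far as I know, talks to its own hosted backend rather than a local Ollama server, so check its docs before counting on it for offline use.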

Option 3: Custom Scripts
I wrote a small Python wrapper that pipes code snippets to my local LLM and formats responses as comments. Keeps everything in my editor flow.

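import sys

import requests  # third-party: pip install requests

def ask_model(code_snippet, question):
    """Send a code snippet plus a question to the local model and return its reply."""
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "codellama",
        "prompt": f"{code_snippet}\n\nQuestion: {question}",
        "stream": False,
    }, timeout=120)
    resp.raise_for_status()  # fail loudly on HTTP errors (e.g. model not pulled)
    return resp.json()["response"]

# Usage: python ask.py "your code here" "what's wrong"
if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit('Usage: python ask.py "your code here" "what\'s wrong"')
    print(ask_model(sys.argv[1], sys.argv[2]))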
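The "formats responses as comments" part isn't shown in the script above, so here is one guess at what that step could look like. It's a small sketch, not the actual wrapper: `as_comment` is a hypothetical helper, and the `# ` prefix and 78-column width are assumptions you'd swap for your language and style.

```python
import textwrap


def as_comment(text, prefix="# ", width=78):
    """Wrap model output and prefix every line so it can be pasted as a comment."""
    out = []
    for paragraph in text.splitlines():
        if not paragraph.strip():
            out.append(prefix.rstrip())  # keep blank lines as bare comment markers
            continue
        wrapped = textwrap.wrap(paragraph, width=width - len(prefix))
        out.extend(prefix + line for line in wrapped)
    return "\n".join(out)


print(as_comment("Spread is a shallow merge.\n\nNested objects are overwritten."))
```

For JavaScript you'd pass `prefix="// "` instead.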

Real-World Scenarios Where This Shines

Scenario 1: Code Review on Your Terms
You're reviewing a PR at 2 AM and don't want to wait for cloud latency. Run codellama locally, paste the diff, get feedback in seconds. No API calls logged anywhere.

Scenario 2: Learning Someone Else's Codebase
Feed the LLM a module's README and some key files. Ask it to explain the data flow. You get better explanations than you'd get from a context-free question because the model is working with your actual code.
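"Feeding" files to the model just means pasting them into the prompt with labels. A minimal sketch of that, assuming a character budget as a crude stand-in for the context limit; `build_context_prompt`, `read_files`, and the 12,000-character default are illustrative choices of mine, not anything Ollama prescribes.

```python
from pathlib import Path


def build_context_prompt(files, question, max_chars=12000):
    """Join labelled file contents into one prompt, stopping at a size budget.

    `files` maps a display name to its text. Files that would exceed the
    budget are skipped and listed so you know what the model did not see.
    """
    sections, skipped, used = [], [], 0
    for name, text in files.items():
        block = f"### {name}\n{text}\n"
        if used + len(block) > max_chars:
            skipped.append(name)
            continue
        sections.append(block)
        used += len(block)
    prompt = "\n".join(sections) + f"\nQuestion: {question}"
    return prompt, skipped


def read_files(paths):
    """Load files from disk into the mapping build_context_prompt expects."""
    return {str(p): Path(p).read_text(encoding="utf-8") for p in paths}


# Offline illustration with made-up file contents.
prompt, skipped = build_context_prompt(
    {"README.md": "Orders flow from api/ to worker/.", "big.py": "x" * 20000},
    "Explain the data flow.",
)
print(skipped)
```

Put the README first in the mapping: it's usually the cheapest way to give the model the big picture before the budget runs out.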

Scenario 3: Rapid Prototyping
Building a small CLI tool and want to brainstorm patterns? Local models are fast enough that you can iterate quickly. No rate limits, no costs, just feedback.

Gotchas You'll Hit

Memory: Even the 7B models need 8GB+ of RAM to run smoothly. If you've got 16GB+, you're golden. If you're maxing out memory, Phi 3 Mini (3.8B) is smaller but still useful.
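To see where numbers like that come from, you can estimate the size of the weights alone: parameter count times bits per weight, divided by eight. This is back-of-the-envelope arithmetic, assuming a 4-bit quantized build (a common format for local models) and Phi 3 Mini's published 3.8B parameter count; it ignores the context cache and runtime overhead, so real usage is higher. `weight_size_gb` is a hypothetical helper.

```python
def weight_size_gb(params_billions, bits_per_weight):
    """Approximate size of the model weights alone, in gigabytes (10^9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9


for label, params in [("7B", 7), ("3.8B", 3.8)]:
    print(label, "at 4-bit:", round(weight_size_gb(params, 4), 1), "GB;",
          "at 16-bit:", round(weight_size_gb(params, 16), 1), "GB")
```

A 7B model at 4 bits comes to about 3.5 GB of weights, which is why it fits on an 8GB machine with room left for your editor and browser, while the same model at 16 bits (about 14 GB) would not.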

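GPU Acceleration: The first time you run a model it's slow, mostly because it has to be downloaded and loaded. After that, if you have a supported GPU, Ollama will use it automatically. In my experience this is the difference between roughly 30 seconds and 3 seconds per response.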
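Rather than eyeballing it, you can read the speed off the response itself. A non-streaming /api/generate reply includes timing fields such as eval_count (tokens generated), eval_duration and load_duration (both in nanoseconds). The sketch below assumes those fields are present; the helper names and the sample numbers are made up for illustration.

```python
def tokens_per_second(result):
    """Generation speed from the timing fields in a non-streaming response.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def load_seconds(result):
    """Time spent loading the model for this request, in seconds."""
    return result.get("load_duration", 0) / 1e9


# Hand-written numbers for illustration, not a benchmark.
sample = {"eval_count": 120, "eval_duration": 4_000_000_000,
          "load_duration": 2_500_000_000}
print(tokens_per_second(sample), load_seconds(sample))
```

Run the same prompt a few times and compare tokens per second across machines or models; that tells you more than a stopwatch does.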

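Context Window: The models above are typically run with around 4K-8K tokens of context, and Ollama's default setting can be lower than what a model supports. You can't feed them your entire codebase. Work around it by being specific: "Here's the function, here's how it's called, here's the error."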
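A cheap guard against silently truncated prompts is to estimate the size before sending and to set the window explicitly through the request's "options" field (num_ctx). This sketch assumes roughly four characters per token, which is only a rule of thumb, and a 512-token reserve for the reply; `rough_token_count` and `generate_payload` are hypothetical helpers.

```python
def rough_token_count(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers vary by model and content."""
    return -(-len(text) // chars_per_token)  # ceiling division


def generate_payload(prompt, model="codellama", num_ctx=4096, reply_budget=512):
    """Build an /api/generate body, refusing prompts that likely will not fit."""
    needed = rough_token_count(prompt) + reply_budget
    if needed > num_ctx:
        raise ValueError(f"about {needed} tokens needed, window is {num_ctx}")
    return {"model": model, "prompt": prompt, "stream": False,
            "options": {"num_ctx": num_ctx}}


payload = generate_payload("def add(a, b): return a + b  # why is this slow?")
print(payload["options"])
```

Raising num_ctx costs memory, so on a tight machine it's usually better to trim the prompt than to widen the window.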

Cold Starts: If your system hasn't used the model in a while, the first request will load it into memory. Annoying but quick once it's loaded.
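Ollama unloads idle models after a while (about five minutes by default, as far as I know), which is where the cold start comes from. You can warm the model up ahead of time and ask for it to stay resident using the keep_alive field of /api/generate. A minimal sketch, assuming a request with no prompt just loads the model; `preload_request` and the "30m" value are illustrative.

```python
import json
import urllib.request


def preload_request(model="codellama", keep_alive="30m",
                    url="http://localhost:11434/api/generate"):
    """Build a request that loads `model` and asks Ollama to keep it resident.

    A generate call with no prompt loads the model without producing text;
    keep_alive sets how long it stays in memory after the request.
    """
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode()
    return urllib.request.Request(url, data=body,
                                  headers={"Content-Type": "application/json"})


req = preload_request()
print(req.get_method(), json.loads(req.data))
# To actually send it: urllib.request.urlopen(req, timeout=120).read()
```

Firing this from a shell startup script or an editor hook means the first real question of the day doesn't pay the load time.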

When to Use Cloud APIs Instead

Be honest with yourself:

  • If you need state-of-the-art reasoning (Claude, GPT-4), local models won't compete
  • If you're working on ML/data science and need sophisticated analysis, cloud LLMs are better
  • If your internet is solid and you trust your provider with your code, the convenience might be worth it

I use local for everyday development and cloud for the hard problems. Best of both worlds.

Quick Wins This Week

  1. Install Ollama, run ollama pull codellama
  2. Try one code snippet: ask it to explain something confusing in your current project
  3. If you like it, integrate it into your editor next week
  4. Measure your own experience—don't take my word for it

The whole setup takes about 15 minutes, plus however long the model download takes on your connection. Even if you never use it regularly, you'll know what's possible.


Want to stay current on AI tools and developer productivity? Check out LearnAI Weekly—practical resources and tool roundups delivered every week, no fluff.
