DEV Community

Alan West
How to Run LLMs Locally When Cloud AI Gets Too Invasive

If you've been paying attention to the AI space lately, you've probably noticed a trend: cloud AI providers are tightening the screws on identity verification. We're talking government IDs, facial recognition scans, the works. For a lot of developers, that's a hard no.

I'm not here to debate whether these policies are justified. What I am here to do is walk you through a practical setup for running capable LLMs on your own hardware, so your workflow doesn't depend on any single provider's terms of service.

The actual problem

Here's the scenario. You've built scripts, tooling, maybe even internal apps that rely on a cloud-hosted LLM API. Then one morning you wake up to find you need to hand over a passport photo and a face scan just to keep using it. Your options are:

  1. Comply and hope your biometric data is handled responsibly
  2. Switch to a different cloud provider (until they do the same thing)
  3. Run models locally

Option 3 is the only one that puts you fully in control. Let's make it work.

What you actually need (hardware reality check)

Before we dive in, let's be honest about hardware requirements. Running a quantized 7B-parameter model comfortably requires about 8GB of RAM (or VRAM if you're using a GPU). A 13B model wants 16GB. The bigger models, 70B and up, need serious hardware or aggressive quantization.

For most development tasks like code completion, summarization, and chat-based debugging, a 7B-13B model running on a decent laptop is genuinely useful. You don't need a data center.
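The sizing rule of thumb above is just parameter count times bytes per weight, plus some runtime headroom. Here's a rough back-of-the-envelope calculator; the 20% overhead factor is my own assumption for KV cache and runtime buffers, so treat the numbers as ballpark, not gospel:

```python
def estimate_model_memory_gb(params_billions: float,
                             bits_per_weight: int = 4,
                             overhead_factor: float = 1.2) -> float:
    """Rough memory estimate for running a quantized model.

    bits_per_weight: 16 for fp16, 8 or 4 for common quantizations.
    overhead_factor: assumed ~20% headroom for KV cache and buffers.
    """
    weight_bytes = params_billions * 1e9 * (bits_per_weight / 8)
    return weight_bytes * overhead_factor / 1e9

# 4-bit quantization is why a 7B model fits in an 8GB machine
print(f"7B  @ 4-bit: ~{estimate_model_memory_gb(7):.1f} GB")
print(f"13B @ 4-bit: ~{estimate_model_memory_gb(13):.1f} GB")
print(f"70B @ 4-bit: ~{estimate_model_memory_gb(70):.1f} GB")
```

Run the numbers at 16-bit instead and you'll see why unquantized models need roughly four times the memory.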

Step 1: Install Ollama

Ollama is the fastest way to get a local LLM running. It handles model downloads, quantization, and serves a local API that's compatible with the OpenAI chat completions format.

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Verify it's running
ollama --version

On macOS, you can also grab it from the official site as a .dmg. Windows support is available too.

Once installed, pull a model:

# Grab a solid general-purpose model
ollama pull llama3.1:8b

# Or if you want something smaller and faster
ollama pull phi3:mini

# Check what you've got locally
ollama list

That's it. You now have a local LLM. Test it:

ollama run llama3.1:8b "Explain dependency injection in two sentences"

Step 2: Use the local API in your code

Ollama exposes a REST API on localhost:11434 by default. Here's how to hit it from your existing code with minimal changes:

import requests

def query_local_llm(prompt, model="llama3.1:8b"):
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False  # set True if you want streaming
        }
    )
    response.raise_for_status()  # surface HTTP errors instead of a cryptic KeyError
    return response.json()["message"]["content"]

# Use it exactly like you'd use any LLM API
result = query_local_llm("Write a Python function to retry HTTP requests with exponential backoff")
print(result)
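About that stream flag: with "stream": True, Ollama sends one JSON object per line, each carrying a partial message, until a final object with "done": true. Here's a sketch of consuming that stream; the function names are mine, and it assumes the default local endpoint:

```python
import json

def parse_stream_chunk(raw_line: bytes) -> str:
    """Extract the content fragment from one line of Ollama's NDJSON stream."""
    chunk = json.loads(raw_line)
    return chunk.get("message", {}).get("content", "")

def iter_local_llm_stream(prompt, model="llama3.1:8b",
                          url="http://localhost:11434/api/chat"):
    """Yield content chunks as the model generates them."""
    import requests  # local import so the parser above works standalone
    with requests.post(url, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                yield parse_stream_chunk(line)

# Usage: print tokens as they arrive
# for piece in iter_local_llm_stream("Explain the GIL in one sentence"):
#     print(piece, end="", flush=True)
```

Streaming matters more locally than in the cloud: on modest hardware, seeing tokens immediately beats staring at a spinner for thirty seconds.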

If your existing code uses the OpenAI Python SDK, Ollama supports a compatibility endpoint. You can point the OpenAI client at your local server:

from openai import OpenAI

# Point at local Ollama instead of OpenAI's servers
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"  # Ollama doesn't require auth
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Refactor this function to use async/await"}]
)

print(response.choices[0].message.content)

That base_url swap is the whole migration. If you've been using the OpenAI SDK format, your existing code barely needs to change.

Step 3: For the power users — llama.cpp directly

If you want more control over memory usage, quantization, and GPU layer offloading, you can go a level deeper: llama.cpp is the engine Ollama is built on top of, and you can run it directly:

# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Download a GGUF model from Hugging Face (example)
# Then run the server
./llama-server -m ./models/your-model.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35  # number of layers to offload to GPU

This gives you fine-grained control over things like context length, batch size, and exactly how much work goes to the GPU versus CPU. For most people Ollama is enough, but if you're optimizing for a specific deployment scenario, going direct is worth it.

Choosing the right model for dev work

Not all models are equal for coding tasks. Here's what I've found actually works in practice:

  • Code generation and completion: CodeLlama or DeepSeek Coder models tend to produce cleaner output for pure coding tasks
  • General dev Q&A: Llama 3.1 8B is a solid all-rounder that runs on modest hardware
  • Quick lookups and small tasks: Phi-3 Mini is surprisingly capable for its size and runs fast on CPUs

The key insight: you don't need one model for everything. Pull two or three and use the right one for the job. Storage is cheap.
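That "right model for the job" idea can be as simple as a lookup table in your tooling. A minimal sketch, where the task categories and model tags are my own example choices (swap in whatever you've actually pulled with ollama pull):

```python
# Map task categories to locally pulled models.
# These tags are examples, not the only valid picks.
TASK_MODELS = {
    "code": "deepseek-coder:6.7b",
    "general": "llama3.1:8b",
    "quick": "phi3:mini",
}

def pick_model(task: str, default: str = "llama3.1:8b") -> str:
    """Choose a local model based on the kind of task at hand."""
    return TASK_MODELS.get(task, default)
```

Then your query function takes a task hint instead of a hardcoded model name, and switching models becomes a one-line config edit.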

The tradeoffs (being honest)

Look, I'm not going to pretend local models are a perfect replacement for frontier cloud models. Here's where it gets real:

  • Quality gap: A local 8B model won't match the output quality of the largest cloud models. For complex reasoning or long-context tasks, you'll notice the difference.
  • Speed: First-token latency is fine, but generation speed depends heavily on your hardware. GPU acceleration makes a huge difference here.
  • Context window: Local models typically handle shorter contexts than cloud APIs. Plan your prompts accordingly.
  • No training on your data: This is actually a benefit. Your code never leaves your machine.

For 80% of day-to-day dev tasks — writing boilerplate, explaining code, generating tests, rubber-duck debugging — a local model does the job. For the remaining 20% where you genuinely need frontier-level capability, you can make that decision on a case-by-case basis.

Making it stick: a fallback setup

The pragmatic approach is to build a fallback into your workflow:

def query_llm(prompt, prefer_local=True):
    if prefer_local:
        try:
            return query_local_llm(prompt)
        except requests.exceptions.RequestException:
            # catches connection errors and timeouts alike
            print("Local model unavailable, falling back to cloud")
            # fall through to the cloud call below

    # Your cloud API call here as backup
    return query_cloud_llm(prompt)

This way you default to local, keep your data on your machine, and only hit the cloud when you actually need to. No biometric scan required for your localhost.

Prevention: building vendor-independent tooling

The bigger lesson here goes beyond any single provider's policy change. If your developer tooling breaks when one company changes their terms of service, that's a fragility problem.

  • Abstract your LLM calls behind a simple interface so swapping providers is a config change, not a rewrite
  • Keep local models as a first-class option in your architecture, not an afterthought
  • Pin model versions in your dev environment so behavior is reproducible
  • Test with local models in CI where possible — it's faster and doesn't burn API credits
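The first point above, abstracting your LLM calls behind an interface, takes only a few lines to set up. This is a minimal sketch; the class and factory names are my own, and the Ollama call mirrors the endpoint used earlier in this post:

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Tiny interface: swapping providers becomes a config change."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class OllamaProvider(LLMProvider):
    def __init__(self, model="llama3.1:8b",
                 base_url="http://localhost:11434"):
        self.model = model
        self.base_url = base_url

    def complete(self, prompt: str) -> str:
        import requests  # local import keeps the interface importable anywhere
        r = requests.post(f"{self.base_url}/api/chat", json={
            "model": self.model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        })
        r.raise_for_status()
        return r.json()["message"]["content"]

def get_provider(name: str) -> LLMProvider:
    """Factory driven by config; add cloud providers as more entries."""
    providers = {"ollama": OllamaProvider}
    return providers[name]()
```

Your application code only ever touches LLMProvider.complete(), so the day a provider changes its terms, you edit a config value, not your codebase.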

The trend toward heavier identity requirements isn't going away. If anything, regulation is going to push more providers in this direction. Building your workflow to be provider-agnostic isn't paranoia — it's just good engineering.

The models are good enough. The tooling is mature enough. And your passport can stay in the drawer where it belongs.
