DEV Community

Charlie

Posted on • Originally published at charlieseay.com

How to Audit Your Stack for Offline AI Readiness

Every API has a free tier until it doesn't. Every cloud service is reliable until it isn't. And every AI provider is affordable until the pricing page changes.

This isn't about paranoia. It's about optionality. If Anthropic raises prices, Google kills Gemini's free tier, or you just want to work from a cabin with no signal — do you have a playbook?

I built one. Here's the framework.

The audit

For every cloud dependency in your stack, document four things:

  1. What it does — the actual function, not the product name
  2. What local replacement exists — specific tool, not "something open source"
  3. What hardware it needs — RAM, VRAM, storage, with specific quantities
  4. What it costs — real pricing, verified, not "about $2K"

Here's what that looks like for an AI-heavy stack running on a Mac Mini M4 Pro:

AI services

| Function | Cloud Provider | Local Alternative | RAM Needed |
| --- | --- | --- | --- |
| Coding assistant | Claude Code | Ollama + Aider + Qwen 2.5 Coder 32B | 48GB+ |
| App LLM (formatting) | Gemini 2.0 Flash | Ollama + Llama 3.3 70B Q4 | 48GB+ |
| App LLM (fallback) | Groq / Llama 3.3 70B | Same local Ollama instance | (same) |
| Image generation | Pollinations / Stable Horde | FLUX.1 or SDXL via ComfyUI | 16GB+ |
| Streaming story gen | Gemini 2.0 Flash | Ollama + Llama 3.3 70B Q4 | 48GB+ |

Infrastructure

| Function | Cloud Provider | Local Alternative | Effort |
| --- | --- | --- | --- |
| Git hosting | GitHub | Gitea or Forgejo (Docker) | Low |
| DNS + routing | Cloudflare Tunnel | dnsmasq + mDNS | Medium |
| SSL certificates | Cloudflare (auto) | mkcert (local CA) | Low |
| Auth (SSO) | Google OAuth | Authentik or local passwords | Low |
| Container registry | Docker Hub | Local registry:2 + pre-pulled images | Low |
| Package manager | npm / Homebrew | Verdaccio + cached bottles | Low |

What's already offline

This is the part most people skip. Before buying anything, check what's already local:

  • Docker, containers, reverse proxy — already running on your machine
  • IDE — VSCode, Xcode, everything that matters is local
  • IaC tools — OpenTofu, Terraform, Ansible — all local binaries
  • Media server — Plex/Jellyfin playback is local (metadata calls aside)

In my case, about 80% of the infrastructure stack is already offline-capable. The 20% that isn't is almost entirely AI and DNS.

What fits in your RAM

This is the question. Everything else is details.

24GB (M4 Pro base)

You can run today — no upgrades needed:

  • Qwen 2.5 Coder 7B (Q8) — ~5GB, good for single-file edits and autocomplete
  • Qwen 3 14B (Q4) — ~9GB, strong reasoning with /think mode
  • SDXL 1.0 — ~8GB, mature ecosystem, 4-12s per image

The catch: one model at a time. Running a coding model and an image generator simultaneously will swap.

48GB (upgrade sweet spot)

  • Qwen 2.5 Coder 32B (Q4) — ~20GB, 92.7% HumanEval, matches GPT-4o on code benchmarks
  • Gemma 3 40B (Q4) — ~24GB, 128K context, great for content generation
  • FLUX.1 Schnell — ~16GB, high-quality image gen in 30-60s

You can run a coding model or a creative model with headroom. Not both simultaneously.

64GB (the real sweet spot)

  • Llama 3.3 70B (Q4) — ~40GB, with ~20GB headroom for OS, apps, and a second model
  • Two models loaded at once — coding + creative, no swapping
  • FLUX.1 Dev alongside an active LLM

The jump from 48GB to 64GB is only ~$400 on Apple's configurator but unlocks 70B models and multi-model workflows. This is the tier where local AI stops feeling like a compromise.
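The RAM figures above follow from a simple back-of-envelope rule: a Q4-quantized model needs roughly 4.5 bits per weight once quantization overhead is folded in (my own rough average — actual GGUF files vary by quant variant). A quick sketch:

```python
def q4_footprint_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Rough RAM estimate for a Q4-quantized model.

    4.5 bits/weight is an assumed average that includes quantization
    overhead; real files vary by variant (Q4_K_M, Q4_0, ...).
    """
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("Qwen 2.5 Coder 32B", 32), ("Llama 3.3 70B", 70)]:
    print(f"{name}: ~{q4_footprint_gb(params):.0f} GB")
```

Run the numbers and you land near the figures in the tiers above: roughly 18-20GB for a 32B model, roughly 39-40GB for a 70B — which is exactly why 70B is a 64GB-tier workload once you leave headroom for the OS.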

The models that matter in 2026

For coding

Qwen 2.5 Coder 32B is the answer for most people. 128K context window, 92.7% on HumanEval, 73.7% on the Aider benchmark. It handles multi-file edits, refactoring, and test generation well.

Qwen3 Coder 30B-A3B is the wildcard — a Mixture of Experts model where only 3.3B parameters are active per token. It needs ~12GB of RAM despite being a "30B" model. If you're RAM-constrained, this is the one to watch.

For autocomplete specifically, Qwen 2.5 Coder 7B at Q8 quantization is fast enough for tab completion and fits alongside larger models.
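The small-model/big-model split can be approximated with a tiny router: autocomplete traffic goes to the 7B model, everything else to the 32B. The Ollama tag names below are assumptions — check `ollama list` for what you've actually pulled:

```python
# Hypothetical model router: fast small model for tab completion,
# big model for chat and refactoring. Tags assume `ollama pull` defaults.
MODELS = {
    "autocomplete": "qwen2.5-coder:7b",
    "chat": "qwen2.5-coder:32b",
}

def pick_model(task: str) -> str:
    # Anything that isn't autocomplete gets the big model.
    return MODELS.get(task, MODELS["chat"])

print(pick_model("autocomplete"))
```

This is essentially what Continue.dev's model routing does for you, configured per role rather than hand-rolled.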

For creative text

Llama 3.3 70B (Q4) for maximum quality if you have the RAM. Gemma 3 40B for 128K context at lower memory cost. Both handle structured JSON output — critical if your app needs parseable responses, not just prose.

Ollama supports constrained JSON output natively now. You can pass a JSON schema in the API call and the model's output will conform to it. This matters more than benchmark scores for production use.
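Here's a sketch of what that looks like against Ollama's /api/chat endpoint — the `format` field takes a JSON schema directly. The payload below is built but not sent, and the model tag is an assumption; swap in whatever you run:

```python
import json

# Build (but don't send) an Ollama /api/chat request whose output is
# constrained to a JSON schema via the "format" field.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "tags"],
}

payload = {
    "model": "llama3.3:70b",  # assumed tag; use whatever `ollama list` shows
    "messages": [{"role": "user", "content": "Summarize this post as JSON."}],
    "format": schema,  # Ollama constrains the model's output to this schema
    "stream": False,
}

print(json.dumps(payload)[:60])
# Send with: requests.post("http://localhost:11434/api/chat", json=payload)
```

Your app code then parses the response like any other JSON, with no regex cleanup pass.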

For image generation

On Apple Silicon, Draw Things is the fastest runtime — 25% faster than mflux for FLUX models, with optimized Metal FlashAttention 2.0. For Stable Diffusion, Mochi Diffusion uses Core ML and the Neural Engine, running in about 150MB of memory.

Reality check: Apple Silicon is 2-4x slower than NVIDIA GPUs for image generation. If you're generating dozens of images per session, this is where a Linux GPU box pays for itself.

The tools that wire it together

The model is only half the equation. You need the tooling layer:

| Layer | Tool | What it does |
| --- | --- | --- |
| Model runtime | Ollama | Serves models via OpenAI-compatible API. One command to download and run any model. |
| CLI coding agent | Aider | Git-native AI pair programmer. Applies diffs, understands repo context. Connects to Ollama. |
| VSCode integration | Continue.dev | Model routing — small fast model for autocomplete, big model for chat/reasoning. |
| Image generation | Draw Things or ComfyUI | Native macOS app or node-based workflow. Both support FLUX and SDXL. |
| Chat interface | Open WebUI | ChatGPT-style web UI for any Ollama model. Docker one-liner. |

The key insight: Ollama's OpenAI-compatible API means your code barely changes. If you're already calling https://api.groq.com/openai/v1/chat/completions, switching to http://localhost:11434/v1/chat/completions is a one-line change. Same request format, same streaming SSE response format.
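Concretely, here's a sketch of the switch, building the request rather than sending it. The model IDs are assumptions — Groq's and Ollama's tags differ even when the underlying weights are the same:

```python
import json

GROQ = ("https://api.groq.com/openai/v1", "llama-3.3-70b-versatile")
LOCAL = ("http://localhost:11434/v1", "llama3.3:70b")

def chat_request(offline: bool, prompt: str):
    """Return (url, body) for an OpenAI-style streaming chat call.

    Only the base URL and model tag change between cloud and local;
    the request shape and the SSE stream coming back are the same.
    """
    base, model = LOCAL if offline else GROQ
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    })
    return f"{base}/chat/completions", body

url, _ = chat_request(offline=True, prompt="hello")
print(url)
```

If you use the official OpenAI client library, the same idea applies: point `base_url` at localhost and pass any placeholder API key, since Ollama ignores it.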

Hardware costs (verified March 2026)

| Option | Config | Price | Best for |
| --- | --- | --- | --- |
| Mac Mini M4 Pro 48GB | 14C/20G, 1TB | $1,999 | Running 32B coding models comfortably |
| Mac Mini M4 Pro 64GB | 14C/20G, 1TB | ~$2,399 | 70B models + multi-model workflows |
| Used RTX 3090 | 24GB VRAM | $650-840 | Cheapest path to serious VRAM ($33/GB) |
| Linux GPU box | Workstation + 3090 | $1,200-2,000 | Fast inference, image gen |
| Mac Studio M3 Ultra | 192GB unified | $5,499 | Overkill, but no compromises |

If you already have a 24GB Mac, selling it covers $400-500 toward the upgrade. Net cost for the 64GB sweet spot: around $1,900-2,000.

Note on used GPU pricing: tariffs are expected to push used RTX 3090 prices up 10-20% in Q1-Q2 2026. If you're going the Linux route, sooner is cheaper.

What's not ready yet

Honest assessment. Skip this section if you only want good news.

Local coding assistants are at maybe 40-60% of Claude Code capability for complex tasks. Single-file edits, refactoring, debugging, test writing — fine. "Build me a full authentication system across 12 files in one session" — not fine. Qwen 2.5 Coder 32B matches GPT-4o on benchmarks, but benchmarks aren't multi-file architectural reasoning.

Image generation on Apple Silicon is slow. FLUX.1 Schnell takes 30-60 seconds per image on M4 Pro. If your workflow generates 20+ images per session, you'll feel it. A $700 used RTX 3090 cuts that to 5-10 seconds.

Package managers need internet. npm, pip, Homebrew — they all phone home. You can cache with Verdaccio (npm) or pre-download bottles (Homebrew), but it's maintenance overhead you don't have today.

Documentation and search are the silent dependency. Stack Overflow, MDN, Apple Developer docs — you don't realize how often you reach for them until you can't. Pre-downloading docs is possible but tedious. This might be the hardest thing to replace.

The framework, not the answer

The specific models and prices in this post will age. The framework won't:

  1. Audit every cloud dependency
  2. Identify the local replacement with specific hardware requirements
  3. Price the hardware honestly
  4. Be honest about what doesn't work yet
  5. Update the audit every time you add a new dependency

I keep a living document that gets updated every time I touch the stack. When a dependency changes, the offline alternative gets re-evaluated. It's not a one-time exercise — it's a habit.

The goal isn't to go offline tomorrow. It's to know that you could.


This is Part 1 of the Off the Grid series. Next up: actually running the dev workflow offline for a week and documenting what breaks.

