DEV Community

Charlie

Posted on • Originally published at charlieseay.com

How to Audit Your Stack for Offline AI Readiness

Every API has a free tier until it doesn't. Every cloud service is reliable until it isn't. And every AI provider is affordable until the pricing page changes.

This isn't about paranoia. It's about optionality. If Anthropic raises prices, Google kills Gemini's free tier, or you just want to work from a cabin with no signal — do you have a playbook?

I built one. Here's the framework.

The audit

For every cloud dependency in your stack, document four things:

  1. What it does — the actual function, not the product name
  2. What local replacement exists — specific tool, not "something open source"
  3. What hardware it needs — RAM, VRAM, storage, with specific quantities
  4. What it costs — real pricing, verified, not "about $2K"

Here's what that looks like for an AI-heavy stack running on a Mac Mini M4 Pro:

AI services

| Function | Cloud Provider | Local Alternative | RAM Needed |
| --- | --- | --- | --- |
| Coding assistant | Claude Code | Ollama + Aider + Qwen 2.5 Coder 32B | 48GB+ |
| App LLM (formatting) | Gemini 2.0 Flash | Ollama + Llama 3.3 70B Q4 | 48GB+ |
| App LLM (fallback) | Groq / Llama 3.3 70B | Same local Ollama instance | (same) |
| Image generation | Pollinations / Stable Horde | FLUX.1 or SDXL via ComfyUI | 16GB+ |
| Streaming story gen | Gemini 2.0 Flash | Ollama + Llama 3.3 70B Q4 | 48GB+ |

Infrastructure

| Function | Cloud Provider | Local Alternative | Effort |
| --- | --- | --- | --- |
| Git hosting | GitHub | Gitea or Forgejo (Docker) | Low |
| DNS + routing | Cloudflare Tunnel | dnsmasq + mDNS | Medium |
| SSL certificates | Cloudflare (auto) | mkcert (local CA) | Low |
| Auth (SSO) | Google OAuth | Authentik or local passwords | Low |
| Container registry | Docker Hub | Local registry:2 + pre-pulled images | Low |
| Package manager | npm / Homebrew | Verdaccio + cached bottles | Low |

What's already offline

This is the part most people skip. Before buying anything, check what's already local:

  • Docker, containers, reverse proxy — already running on your machine
  • IDE — VSCode, Xcode, everything that matters is local
  • IaC tools — OpenTofu, Terraform, Ansible — all local binaries
  • Media server — Plex/Jellyfin playback is local (metadata calls aside)

In my case, about 80% of the infrastructure stack is already offline-capable. The 20% that isn't is almost entirely AI and DNS.

What fits in your RAM

This is the question. Everything else is details.

24GB (M4 Pro base)

You can run today — no upgrades needed:

  • Qwen 2.5 Coder 7B (Q8) — ~5GB, good for single-file edits and autocomplete
  • Qwen 3 14B (Q4) — ~9GB, strong reasoning with /think mode
  • SDXL 1.0 — ~8GB, mature ecosystem, 4-12s per image

The catch: one model at a time. Running a coding model and an image generator simultaneously will swap.

48GB (upgrade sweet spot)

  • Qwen 2.5 Coder 32B (Q4) — ~20GB, 92.7% HumanEval, matches GPT-4o on code benchmarks
  • Gemma 3 40B (Q4) — ~24GB, 128K context, great for content generation
  • FLUX.1 Schnell — ~16GB, high-quality image gen in 30-60s

You can run a coding model or a creative model with headroom. Not both simultaneously.

64GB (the real sweet spot)

  • Llama 3.3 70B (Q4) — ~40GB, with ~20GB headroom for OS, apps, and a second model
  • Two models loaded at once — coding + creative, no swapping
  • FLUX.1 Dev alongside an active LLM

The jump from 48GB to 64GB is only ~$400 on Apple's configurator but unlocks 70B models and multi-model workflows. This is the tier where local AI stops feeling like a compromise.
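The RAM figures above follow from a simple back-of-envelope rule: a Q4-quantized model needs roughly 4.5 bits per weight once quantization overhead is folded in (my own rough average — actual GGUF files vary by quant variant). A quick sketch:

```python
def q4_footprint_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Rough RAM estimate for a Q4-quantized model.

    4.5 bits/weight is an assumed average that includes quantization
    overhead; real files vary by variant (Q4_K_M, Q4_0, ...).
    """
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("Qwen 2.5 Coder 32B", 32), ("Llama 3.3 70B", 70)]:
    print(f"{name}: ~{q4_footprint_gb(params):.0f} GB")
```

Run the numbers and you land near the figures in the tiers above: roughly 18-20GB for a 32B model, roughly 39-40GB for a 70B — which is exactly why 70B is a 64GB-tier workload once you leave headroom for the OS.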

The models that matter in 2026

For coding

Qwen 2.5 Coder 32B is the answer for most people. 128K context window, 92.7% on HumanEval, 73.7% on the Aider benchmark. It handles multi-file edits, refactoring, and test generation well.

Qwen3 Coder 30B-A3B is the wildcard — a Mixture of Experts model where only 3.3B parameters are active per token. It needs ~12GB of RAM despite being a "30B" model. If you're RAM-constrained, this is the one to watch.

For autocomplete specifically, Qwen 2.5 Coder 7B at Q8 quantization is fast enough for tab completion and fits alongside larger models.
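The small-model/big-model split can be approximated with a tiny router: autocomplete traffic goes to the 7B model, everything else to the 32B. The Ollama tag names below are assumptions — check `ollama list` for what you've actually pulled:

```python
# Hypothetical model router: fast small model for tab completion,
# big model for chat and refactoring. Tags assume `ollama pull` defaults.
MODELS = {
    "autocomplete": "qwen2.5-coder:7b",
    "chat": "qwen2.5-coder:32b",
}

def pick_model(task: str) -> str:
    # Anything that isn't autocomplete gets the big model.
    return MODELS.get(task, MODELS["chat"])

print(pick_model("autocomplete"))
```

This is essentially what Continue.dev's model routing does for you, configured per role rather than hand-rolled.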

For creative text

Llama 3.3 70B (Q4) for maximum quality if you have the RAM. Gemma 3 40B for 128K context at lower memory cost. Both handle structured JSON output — critical if your app needs parseable responses, not just prose.

Ollama supports constrained JSON output natively now. You can pass a JSON schema in the API call and the model's output will conform to it. This matters more than benchmark scores for production use.
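Here's a sketch of what that looks like against Ollama's /api/chat endpoint — the `format` field takes a JSON schema directly. The payload below is built but not sent, and the model tag is an assumption; swap in whatever you run:

```python
import json

# Build (but don't send) an Ollama /api/chat request whose output is
# constrained to a JSON schema via the "format" field.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "tags"],
}

payload = {
    "model": "llama3.3:70b",  # assumed tag; use whatever `ollama list` shows
    "messages": [{"role": "user", "content": "Summarize this post as JSON."}],
    "format": schema,  # Ollama constrains the model's output to this schema
    "stream": False,
}

print(json.dumps(payload)[:60])
# Send with: requests.post("http://localhost:11434/api/chat", json=payload)
```

Your app code then parses the response like any other JSON, with no regex cleanup pass.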

For image generation

On Apple Silicon, Draw Things is the fastest runtime — 25% faster than mflux for FLUX models, with optimized Metal FlashAttention 2.0. For Stable Diffusion, Mochi Diffusion uses Core ML and the Neural Engine, running in about 150MB of memory.

Reality check: Apple Silicon is 2-4x slower than NVIDIA GPUs for image generation. If you're generating dozens of images per session, this is where a Linux GPU box pays for itself.

The tools that wire it together

The model is only half the equation. You need the tooling layer:

| Layer | Tool | What it does |
| --- | --- | --- |
| Model runtime | Ollama | Serves models via OpenAI-compatible API. One command to download and run any model. |
| CLI coding agent | Aider | Git-native AI pair programmer. Applies diffs, understands repo context. Connects to Ollama. |
| VSCode integration | Continue.dev | Model routing — small fast model for autocomplete, big model for chat/reasoning. |
| Image generation | Draw Things or ComfyUI | Native macOS app or node-based workflow. Both support FLUX and SDXL. |
| Chat interface | Open WebUI | ChatGPT-style web UI for any Ollama model. Docker one-liner. |

The key insight: Ollama's OpenAI-compatible API means your code barely changes. If you're already calling https://api.groq.com/openai/v1/chat/completions, switching to http://localhost:11434/v1/chat/completions is a one-line change. Same request format, same streaming SSE response format.
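Concretely, here's a sketch of the switch, building the request rather than sending it. The model IDs are assumptions — Groq's and Ollama's tags differ even when the underlying weights are the same:

```python
import json

GROQ = ("https://api.groq.com/openai/v1", "llama-3.3-70b-versatile")
LOCAL = ("http://localhost:11434/v1", "llama3.3:70b")

def chat_request(offline: bool, prompt: str):
    """Return (url, body) for an OpenAI-style streaming chat call.

    Only the base URL and model tag change between cloud and local;
    the request shape and the SSE stream coming back are the same.
    """
    base, model = LOCAL if offline else GROQ
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    })
    return f"{base}/chat/completions", body

url, _ = chat_request(offline=True, prompt="hello")
print(url)
```

If you use the official OpenAI client library, the same idea applies: point `base_url` at localhost and pass any placeholder API key, since Ollama ignores it.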

Hardware costs (verified March 2026)

| Option | Config | Price | Best for |
| --- | --- | --- | --- |
| Mac Mini M4 Pro 48GB | 14C/20G, 1TB | $1,999 | Running 32B coding models comfortably |
| Mac Mini M4 Pro 64GB | 14C/20G, 1TB | ~$2,399 | 70B models + multi-model workflows |
| Used RTX 3090 | 24GB VRAM | $650-840 | Cheapest path to serious VRAM ($33/GB) |
| Linux GPU box | Workstation + 3090 | $1,200-2,000 | Fast inference, image gen |
| Mac Studio M3 Ultra | 192GB unified | $5,499 | Overkill, but no compromises |

If you already have a 24GB Mac, selling it covers $400-500 toward the upgrade. Net cost for the 64GB sweet spot: around $1,900-2,000.

Note on used GPU pricing: tariffs are expected to push used RTX 3090 prices up 10-20% in Q1-Q2 2026. If you're going the Linux route, sooner is cheaper.

What's not ready yet

Honest assessment. Skip this section if you only want good news.

Local coding assistants are at maybe 40-60% of Claude Code capability for complex tasks. Single-file edits, refactoring, debugging, test writing — fine. "Build me a full authentication system across 12 files in one session" — not fine. Qwen 2.5 Coder 32B matches GPT-4o on benchmarks, but benchmarks aren't multi-file architectural reasoning.

Image generation on Apple Silicon is slow. FLUX.1 Schnell takes 30-60 seconds per image on M4 Pro. If your workflow generates 20+ images per session, you'll feel it. A $700 used RTX 3090 cuts that to 5-10 seconds.

Package managers need internet. npm, pip, Homebrew — they all phone home. You can cache with Verdaccio (npm) or pre-download bottles (Homebrew), but it's maintenance overhead you don't have today.

Documentation and search are the silent dependency. Stack Overflow, MDN, Apple Developer docs — you don't realize how often you reach for them until you can't. Pre-downloading docs is possible but tedious. This might be the hardest thing to replace.

The framework, not the answer

The specific models and prices in this post will age. The framework won't:

  1. Audit every cloud dependency
  2. Identify the local replacement with specific hardware requirements
  3. Price the hardware honestly
  4. Be honest about what doesn't work yet
  5. Update the audit every time you add a new dependency

I keep a living document that gets updated every time I touch the stack. When a dependency changes, the offline alternative gets re-evaluated. It's not a one-time exercise — it's a habit.

The goal isn't to go offline tomorrow. It's to know that you could.


This is Part 1 of the Off the Grid series. Next up: actually running the dev workflow offline for a week and documenting what breaks.

