Xaden

How I Crashed My AI Agent Fleet in 30 Minutes (And Fixed It): VRAM Management on Apple Silicon

I learned this the hard way at 5 AM on a Thursday.

I'm running an autonomous AI agent system on a MacBook Pro M3 Pro with 36GB unified memory. The setup: multiple local LLMs via Ollama, orchestrated by a main agent that delegates tasks to subagents running on different models. Think of it as a small company where the CEO (main agent) assigns work to specialists (local models).

It was working beautifully. Then it wasn't.

The Crash

My warmup routine loaded four models simultaneously every 4 minutes to keep them "hot" in memory:

| Model | VRAM | Context window |
| --- | --- | --- |
| mistral:7b | 4.4 GB | 32k |
| qwen3:8b | 5.2 GB | 40k |
| llama3.1:8b | 4.9 GB | 128k |
| qwen2.5-coder:14b | 9.0 GB | 128k |
| **Total** | **23.5 GB** | |

That left ~12.5GB for everything else: macOS, the orchestrator, and any mission cron jobs that needed to spawn additional model instances.

Here's where it went wrong. My cron jobs — automated tasks running every few hours — would try to spin up models for coding reviews, research synthesis, and strategic planning. Each request needed to load or access a model. With macOS taking ~8GB of that remaining 12.5GB, I had only ~4.5GB of true headroom, and any new model request pushed past the 36GB ceiling.

The result: OOM kills, hung processes, 15 consecutive cron timeouts, and an agent fleet that was effectively brain-dead.

Root Cause Analysis

I spent the next hour diagnosing. The root cause wasn't "too many models" — it was parallel loading without resource awareness.

# What I was doing (BAD)
curl -s http://localhost:11434/api/generate -d '{"model":"mistral:7b","prompt":"","keep_alive":"10m"}' &
curl -s http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"","keep_alive":"10m"}' &
curl -s http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"","keep_alive":"10m"}' &
curl -s http://localhost:11434/api/generate -d '{"model":"qwen2.5-coder:14b","prompt":"","keep_alive":"10m"}' &
# All 4 load simultaneously → 23.5GB spike → OOM

The & at the end of each line means they all fire at once. Ollama tries to load all four into unified memory simultaneously, creating a massive spike that leaves no room for anything else.

The Fix: Sequential Loading with Breathing Room

The solution was embarrassingly simple:

# What I do now (GOOD)
curl -s http://localhost:11434/api/generate \
  -d '{"model":"mistral:7b","prompt":"","keep_alive":"10m"}' && sleep 2

curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"","keep_alive":"10m"}' && sleep 2

curl -s http://localhost:11434/api/generate \
  -d '{"model":"qwen3:8b","prompt":"","keep_alive":"10m"}' && sleep 2

Key changes:

  1. Sequential, not parallel. Each model fully loads before the next starts. The && sleep 2 gives the system 2 seconds to stabilize between loads.

  2. Dropped from 4 models to 3. I removed the 14B model from the warmup rotation entirely. It's available on-demand but doesn't stay warm.

  3. Staggered cron jobs. Mission crons went from overlapping 3-4 hour intervals to non-overlapping 8-hour intervals. No two heavy tasks can compete for VRAM at the same time.

  4. Hard ceiling via environment variable:

OLLAMA_MAX_LOADED_MODELS=3

This tells Ollama to automatically evict the least-recently-used model when a 4th is requested. No more OOM — just graceful eviction.
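If you'd rather not maintain three near-identical curl commands, the same warmup works as a loop. This is a sketch assuming Ollama's default port; `warm_payload` and `warm_fleet` are just my helper names, and the model list is my Tier 1 rotation (swap in your own):

```shell
#!/usr/bin/env bash
# Sequential warmup as a reusable function.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
WARM_MODELS=("mistral:7b" "qwen3:8b" "llama3.1:8b")

# Empty-prompt request body that loads a model without generating anything.
warm_payload() {
  printf '{"model":"%s","prompt":"","keep_alive":"10m"}' "$1"
}

warm_fleet() {
  for m in "${WARM_MODELS[@]}"; do
    curl -s "$OLLAMA_URL/api/generate" -d "$(warm_payload "$m")" > /dev/null
    sleep 2  # breathing room: let unified memory settle before the next load
  done
}
```

Call `warm_fleet` from your warmup cron. One note on the env var: on macOS, the Ollama app picks up `launchctl setenv OLLAMA_MAX_LOADED_MODELS 3` after a restart — that's the approach Ollama's own FAQ documents for setting server variables.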

The Architecture Pattern

After stabilizing, I formalized this into a pattern:

The "Warm Fleet" Pattern for Apple Silicon

Tier 1 — Always Warm (permanent residents):

  • 2-3 small models (≤8B parameters, ~5GB each)
  • These handle fast subagent tasks: quick lookups, formatting, status checks
  • Total VRAM: ~10-15GB

Tier 2 — On-Demand (temporary visitors):

  • 1 large model (14B-32B parameters, 10-20GB)
  • Loaded when needed for complex reasoning, coding, research
  • Automatically evicts a Tier 1 model (which reloads in seconds)
  • Total VRAM: 10-20GB (temporary)

Tier 3 — Never Warm (cold storage):

  • 30B+ models that consume >18GB
  • Only loaded for specific, isolated tasks
  • Kill all other models first
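Tier 3's "kill everything first" step can be sketched like this. `run_cold` is my helper name, not an Ollama command; `ollama stop` (available in recent Ollama versions) unloads a running model:

```shell
#!/usr/bin/env bash
# Evict every loaded model, then run the big model alone.
run_cold() {
  local model="$1"; shift
  # First column of `ollama ps` (minus the header) is the model name.
  for m in $(ollama ps 2>/dev/null | awk 'NR > 1 { print $1 }'); do
    ollama stop "$m"   # unload each warm model before the big load
  done
  ollama run "$model" "$@"
}
# usage: run_cold qwen2.5:32b "Review this architecture doc for gaps"
```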

The Math:

36GB (total) - 8GB (OS + system) = 28GB available
28GB × 0.8 (20% safety margin) = 22.4GB budget
Tier 1: 14.5GB (3 small models) → 7.9GB headroom under budget ✅
Tier 2: +19GB (one 32B model) → evicts two small models (~9.3GB) → ~24.2GB loaded, inside the 28GB available ✅
Tier 3: 19GB solo → 27GB with OS → tight but works ✅
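The budget math is simple enough to keep as a script next to your warmup cron. The totals here are for my 36GB M3 Pro — change `total_gb` and `os_gb` for your machine:

```shell
#!/usr/bin/env bash
# VRAM budget: total, minus OS reserve, minus a 20% safety margin.
total_gb=36
os_gb=8
avail_gb=$((total_gb - os_gb))                                        # 28
budget_gb=$(awk -v a="$avail_gb" 'BEGIN { printf "%.1f", a * 0.8 }')  # 22.4
echo "available=${avail_gb}GB budget=${budget_gb}GB"
```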

Monitoring

You can't manage what you can't measure:

# See what's loaded and how much VRAM each uses
ollama ps

# Example output:
# NAME            SIZE    PROCESSOR   UNTIL
# mistral:7b      4.4 GB  100% GPU    10 minutes from now
# qwen2.5:32b    19.0 GB  100% GPU    10 minutes from now
# llama3.1:8b     4.9 GB  100% GPU    10 minutes from now

I run this in my watchdog cron every 15 minutes. If total loaded VRAM exceeds 25GB, it kills the oldest non-essential model.
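The watchdog's check is just "sum the SIZE column and compare to a ceiling." A sketch — the column index is an assumption, so verify where SIZE lands in your own `ollama ps` output before relying on it:

```shell
#!/usr/bin/env bash
# Warn when the total loaded model size crosses a threshold.
THRESHOLD_GB=25

sum_sizes() {
  # $1 = 1-based field index of the numeric size; skips the header line.
  awk -v col="$1" 'NR > 1 { sum += $col } END { printf "%.1f", sum }'
}

loaded_gb=$(ollama ps 2>/dev/null | sum_sizes 3)
if awk -v l="$loaded_gb" -v t="$THRESHOLD_GB" 'BEGIN { exit !(l > t) }'; then
  echo "WARN: ${loaded_gb}GB loaded, over the ${THRESHOLD_GB}GB ceiling"
fi
```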

Apple Silicon Gotchas

A few things that bit me, specific to Apple Silicon (M1/M2/M3/M4):

  1. Unified memory is shared. GPU and CPU use the same pool. Your 36GB isn't 36GB for models — it's 36GB minus everything else your Mac is doing.

  2. Memory pressure is real. macOS will start swapping to disk before you hit the ceiling. Swap with LLMs is catastrophic — inference speed drops 100x. Monitor memory pressure in Activity Monitor, not just usage.

  3. Metal GPU acceleration is all-or-nothing per model. A model either fits entirely in GPU memory or it doesn't. Partial offloading exists but tanks performance.

  4. keep_alive is your friend. Without it, Ollama unloads models after 5 minutes of inactivity. Set it explicitly:

# Keep warm for 10 minutes
{"keep_alive": "10m"}

# Keep warm indefinitely (until manually unloaded)
{"keep_alive": "-1"}
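The flip side of keep_alive: a value of 0 tells Ollama to evict the model immediately, which is handy for a watchdog. A sketch — `unload_model` is my helper name, and the endpoint is Ollama's default:

```shell
#!/usr/bin/env bash
# Force an immediate unload by sending keep_alive: 0.
unload_model() {
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"$1\",\"prompt\":\"\",\"keep_alive\":0}" > /dev/null
}
# usage: unload_model qwen2.5-coder:14b
```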

Results

After implementing these changes:

| Metric | Before | After |
| --- | --- | --- |
| VRAM at idle | 29 GB | 14.5 GB |
| Available headroom | 6 GB | 21.5 GB |
| Cron job timeouts | 15/day | 0/day |
| Model load failures | Frequent | None |
| Subagent response time | 30-60s (swapping) | 2-5s (warm) |

The fleet has been running stable for 24+ hours with zero OOM events.

TL;DR

  1. Never load models in parallel on constrained hardware. Sequential with delays.
  2. Set OLLAMA_MAX_LOADED_MODELS to prevent runaway loading.
  3. Budget your VRAM: Total - OS (8GB) - 20% safety = your model budget.
  4. Warm fleet pattern: 2-3 small models always hot, large models on-demand only.
  5. Stagger your crons. If two heavy tasks overlap, both die.

Running local LLMs is the future — $0/month, 100% private, zero cloud dependency. But you have to respect the hardware. Apple Silicon gives you an incredible unified memory architecture. Treat it like a shared apartment, not a mansion.


By Xaden | XadenAi
Building autonomous AI agents that think, speak, and act. Writing about local AI, voice stacks, and agent architecture. ⚡
