How I Crashed My AI Agent Fleet in 30 Minutes (And Fixed It): VRAM Management on Apple Silicon
I learned this the hard way at 5 AM on a Thursday.
I'm running an autonomous AI agent system on a MacBook Pro M3 Pro with 36GB unified memory. The setup: multiple local LLMs via Ollama, orchestrated by a main agent that delegates tasks to subagents running on different models. Think of it as a small company where the CEO (main agent) assigns work to specialists (local models).
It was working beautifully. Then it wasn't.
## The Crash
My warmup routine loaded four models simultaneously every 4 minutes to keep them "hot" in memory:
| Model | VRAM | Context Window |
|---|---|---|
| mistral:7b | 4.4 GB | 32k |
| qwen3:8b | 5.2 GB | 40k |
| llama3.1:8b | 4.9 GB | 128k |
| qwen2.5-coder:14b | 9.0 GB | 128k |
| Total | 23.5 GB | — |
That left ~12.5GB for everything else: macOS, the orchestrator, and any mission cron jobs that needed to spawn additional model instances.
Here's where it went wrong. My cron jobs (automated tasks running every few hours) would try to spin up models for coding reviews, research synthesis, and strategic planning. Each request needed to load or access a model. With macOS itself taking ~8GB, true headroom was only ~4.5GB, so any new model request pushed total demand past 36GB.
The result: OOM kills, hung processes, 15 consecutive cron timeouts, and an agent fleet that was effectively brain-dead.
## Root Cause Analysis
I spent the next hour diagnosing. The root cause wasn't "too many models" — it was parallel loading without resource awareness.
```shell
# What I was doing (BAD)
curl -s http://localhost:11434/api/generate -d '{"model":"mistral:7b","prompt":"","keep_alive":"10m"}' &
curl -s http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"","keep_alive":"10m"}' &
curl -s http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"","keep_alive":"10m"}' &
curl -s http://localhost:11434/api/generate -d '{"model":"qwen2.5-coder:14b","prompt":"","keep_alive":"10m"}' &
# All 4 load simultaneously → 23.5GB spike → OOM
```
The `&` at the end of each line backgrounds the request, so all four fire at once. Ollama tries to load all four models into unified memory simultaneously, creating a massive spike that leaves no room for anything else.
## The Fix: Sequential Loading with Breathing Room
The solution was embarrassingly simple:
```shell
# What I do now (GOOD)
curl -s http://localhost:11434/api/generate \
  -d '{"model":"mistral:7b","prompt":"","keep_alive":"10m"}' && sleep 2
curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"","keep_alive":"10m"}' && sleep 2
curl -s http://localhost:11434/api/generate \
  -d '{"model":"qwen2.5:32b","prompt":"","keep_alive":"10m"}' && sleep 2
```
Key changes:

- **Sequential, not parallel.** Each model fully loads before the next starts, and `&& sleep 2` gives the system two seconds to stabilize between loads.
- **Dropped from four models to three.** I removed the 14B model from the warmup rotation entirely. It's available on demand but doesn't stay warm.
- **Staggered cron jobs.** Mission crons went from overlapping 3-4 hour intervals to non-overlapping 8-hour intervals, so no two heavy tasks compete for VRAM at the same time.
- **A hard ceiling via environment variable.** Setting `OLLAMA_MAX_LOADED_MODELS=3` tells Ollama to evict the least-recently-used model when a fourth is requested. No more OOM kills, just graceful eviction.
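On macOS, where Ollama often runs as the menu-bar app rather than from a terminal, the variable has to reach that process. Ollama's FAQ recommends `launchctl setenv` for this; a plain `export` works if you run `ollama serve` yourself:

```shell
# For the Ollama macOS app: set the variable system-wide, then restart Ollama
launchctl setenv OLLAMA_MAX_LOADED_MODELS 3

# For a manually started server, a plain export is enough:
export OLLAMA_MAX_LOADED_MODELS=3
ollama serve
```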
## The Architecture Pattern
After stabilizing, I formalized this into a pattern:
### The "Warm Fleet" Pattern for Apple Silicon

**Tier 1 — Always Warm (permanent residents):**
- 2-3 small models (≤8B parameters, ~5GB each)
- These handle fast subagent tasks: quick lookups, formatting, status checks
- Total VRAM: ~10-15GB

**Tier 2 — On-Demand (temporary visitors):**
- 1 large model (14B-32B parameters, 10-20GB)
- Loaded when needed for complex reasoning, coding, research
- Automatically evicts a Tier 1 model (which reloads in seconds)
- Total VRAM: 10-20GB (temporary)

**Tier 3 — Never Warm (cold storage):**
- 30B+ models that consume >18GB
- Only loaded for specific, isolated tasks
- Kill all other models first
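The Tier 3 "kill all other models first" step can be sketched as a small wrapper, assuming Ollama's `ollama ps` and `ollama stop` CLI commands; the `resident_models` and `cold_run` helper names are mine:

```shell
# Hypothetical helpers for Tier 3 loads. `ollama ps` lists resident models
# and `ollama stop <model>` unloads one; nothing here is an Ollama built-in.
resident_models() {
  # Parse model names out of `ollama ps` output (skip the header row)
  tail -n +2 | awk 'NF { print $1 }'
}

cold_run() {
  big_model="$1"; shift
  ollama ps | resident_models | while read -r m; do
    ollama stop "$m"   # free VRAM before the big model loads
  done
  ollama run "$big_model" "$@"
}

# Usage: cold_run qwen2.5:32b "Review this architecture doc for risks."
```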
**The math:**

```
36GB (total) - 8GB (OS + system) = 28GB available
28GB × 0.8 (20% safety margin)   = 22.4GB budget

Tier 1: 14.5GB (3 small models)               → 7.9GB headroom ✅
Tier 2: +19GB (one 32B model), evicts 2 small → ~23GB loaded, just over budget
                                                but well inside the 28GB available ✅
Tier 3: 19GB solo                             → ~27GB with OS, tight but works ✅
```
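The budget line is easy to recompute for other machines. A tiny helper (hypothetical, not part of any tool):

```shell
# vram_budget: safe model budget from total RAM, OS overhead, and a safety
# fraction. (Hypothetical helper, not an Ollama feature.)
vram_budget() {
  # usage: vram_budget TOTAL_GB OS_GB SAFETY_FRACTION
  awk -v t="$1" -v o="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (t - o) * (1 - s) }'
}

vram_budget 36 8 0.2   # → 22.4
```

The same call tells you, for example, that a 16GB machine with ~6GB of OS overhead has only about 8GB of model budget, i.e. one small model at a time.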
## Monitoring
You can't manage what you can't measure:
```shell
# See what's loaded and how much VRAM each uses
ollama ps

# Example output:
# NAME           SIZE      PROCESSOR    UNTIL
# mistral:7b     4.4 GB    100% GPU     10 minutes from now
# qwen2.5:32b    19.0 GB   100% GPU     10 minutes from now
# llama3.1:8b    4.9 GB    100% GPU     10 minutes from now
```
I run this in my watchdog cron every 15 minutes. If total loaded VRAM exceeds 25GB, it kills the oldest non-essential model.
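A sketch of that watchdog check, assuming the `ollama ps` column layout shown above. The eviction target is hardcoded here for illustration; the real script picks the oldest non-essential model:

```shell
# Sketch of the 15-minute watchdog check (helper names are mine).
total_loaded_gb() {
  # Sum the SIZE column (GB), skipping the header row
  tail -n +2 | awk '$2 ~ /^[0-9.]+$/ { sum += $2 } END { printf "%.1f\n", sum + 0 }'
}

watchdog_check() {
  loaded=$(ollama ps | total_loaded_gb)
  if awk -v l="$loaded" 'BEGIN { exit !(l > 25) }'; then
    # keep_alive:0 tells Ollama to unload the model immediately
    curl -s http://localhost:11434/api/generate \
      -d '{"model":"llama3.1:8b","keep_alive":0}'
  fi
}
```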
## Apple Silicon Gotchas
A few things that bit me, specific to Apple Silicon (M1/M2/M3/M4):
**Unified memory is shared.** GPU and CPU use the same pool. Your 36GB isn't 36GB for models; it's 36GB minus everything else your Mac is doing.

**Memory pressure is real.** macOS will start swapping to disk before you hit the ceiling. Swap with LLMs is catastrophic: inference speed drops ~100x. Monitor memory pressure in Activity Monitor, not just usage.

**Metal GPU acceleration is all-or-nothing per model.** A model either fits entirely in GPU memory or it doesn't. Partial offloading exists but tanks performance.

**`keep_alive` is your friend.** Without it, Ollama unloads models after 5 minutes of inactivity. Set it explicitly:

```
# Keep warm for 10 minutes
{"keep_alive": "10m"}

# Keep warm indefinitely (until manually unloaded)
{"keep_alive": "-1"}
```
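For the memory-pressure gotcha, macOS also ships a `memory_pressure` CLI that's handy in scripts. On my machine its output ends with a `System-wide memory free percentage:` line (the format may vary across macOS versions); a small parser (the `free_pct` name is mine) makes it usable in a cron check:

```shell
# Extract the free-memory percentage from `memory_pressure` output.
# (`memory_pressure` ships with macOS; line format assumed from my machine.)
free_pct() {
  awk -F': ' '/free percentage/ { gsub(/%/, "", $2); print $2 + 0 }'
}

# e.g. in the watchdog cron:
# memory_pressure | free_pct    # prints something like 42
```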
## Results
After implementing these changes:
| Metric | Before | After |
|---|---|---|
| VRAM at idle | 29 GB | 14.5 GB |
| Available headroom | 6 GB | 21.5 GB |
| Cron job timeouts | 15/day | 0/day |
| Model load failures | Frequent | None |
| Subagent response time | 30-60s (swap) | 2-5s (warm) |
The fleet has been running stably for 24+ hours with zero OOM events.
## TL;DR

- Never load models in parallel on constrained hardware. Go sequential, with delays.
- Set `OLLAMA_MAX_LOADED_MODELS` to prevent runaway loading.
- Budget your VRAM: total, minus OS (~8GB), minus a 20% safety margin is your model budget.
- Warm fleet pattern: 2-3 small models always hot, large models on-demand only.
- Stagger your crons. If two heavy tasks overlap, both die.
Running local LLMs is the future — $0/month, 100% private, zero cloud dependency. But you have to respect the hardware. Apple Silicon gives you an incredible unified memory architecture. Treat it like a shared apartment, not a mansion.
By Xaden | XadenAi
Building autonomous AI agents that think, speak, and act. Writing about local AI, voice stacks, and agent architecture. ⚡