Xaden

How I Crashed My AI Agent Fleet in 30 Minutes (And Fixed It): VRAM Management on Apple Silicon

I learned this the hard way at 5 AM on a Thursday.

I'm running an autonomous AI agent system on a MacBook Pro M3 Pro with 36GB unified memory. The setup: multiple local LLMs via Ollama, orchestrated by a main agent that delegates tasks to subagents running on different models. Think of it as a small company where the CEO (main agent) assigns work to specialists (local models).

It was working beautifully. Then it wasn't.

The Crash

My warmup routine loaded four models simultaneously every 4 minutes to keep them "hot" in memory:

| Model | VRAM | Context window |
| --- | --- | --- |
| mistral:7b | 4.4 GB | 32k |
| qwen3:8b | 5.2 GB | 40k |
| llama3.1:8b | 4.9 GB | 128k |
| qwen2.5-coder:14b | 9.0 GB | 128k |
| **Total** | **23.5 GB** | |

That left ~12.5GB for everything else: macOS, the orchestrator, and any mission cron jobs that needed to spawn additional model instances.

Here's where it went wrong. My cron jobs — automated tasks running every few hours — would try to spin up models for coding reviews, research synthesis, and strategic planning. Each request needed to load or access a model. With macOS taking ~8GB of that remaining 12.5GB, I had only ~4.5GB of true headroom, and any new model request pushed past the 36GB ceiling.

The result: OOM kills, hung processes, 15 consecutive cron timeouts, and an agent fleet that was effectively brain-dead.

Root Cause Analysis

I spent the next hour diagnosing. The root cause wasn't "too many models" — it was parallel loading without resource awareness.

# What I was doing (BAD)
curl -s http://localhost:11434/api/generate -d '{"model":"mistral:7b","prompt":"","keep_alive":"10m"}' &
curl -s http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"","keep_alive":"10m"}' &
curl -s http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"","keep_alive":"10m"}' &
curl -s http://localhost:11434/api/generate -d '{"model":"qwen2.5-coder:14b","prompt":"","keep_alive":"10m"}' &
# All 4 load simultaneously → 23.5GB spike → OOM

The & at the end of each line means they all fire at once. Ollama tries to load all four into unified memory simultaneously, creating a massive spike that leaves no room for anything else.

The Fix: Sequential Loading with Breathing Room

The solution was embarrassingly simple:

# What I do now (GOOD)
curl -s http://localhost:11434/api/generate \
  -d '{"model":"mistral:7b","prompt":"","keep_alive":"10m"}' && sleep 2

curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"","keep_alive":"10m"}' && sleep 2

curl -s http://localhost:11434/api/generate \
  -d '{"model":"qwen3:8b","prompt":"","keep_alive":"10m"}' && sleep 2

Key changes:

  1. Sequential, not parallel. Each model fully loads before the next starts. The && sleep 2 gives the system 2 seconds to stabilize between loads.

  2. Dropped from 4 models to 3. I removed the 14B model from the warmup rotation entirely. It's available on-demand but doesn't stay warm.

  3. Staggered cron jobs. Mission crons went from overlapping 3-4 hour intervals to non-overlapping 8-hour intervals. No two heavy tasks can compete for VRAM at the same time.

  4. Hard ceiling via environment variable:

OLLAMA_MAX_LOADED_MODELS=3

This tells Ollama to automatically evict the least-recently-used model when a 4th is requested. No more OOM — just graceful eviction.
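If you'd rather not maintain three near-identical curl commands, the same warmup works as a loop. This is a sketch assuming Ollama's default port; `warm_payload` and `warm_fleet` are just my helper names, and the model list is my Tier 1 rotation (swap in your own):

```shell
#!/usr/bin/env bash
# Sequential warmup as a reusable function.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
WARM_MODELS=("mistral:7b" "qwen3:8b" "llama3.1:8b")

# Empty-prompt request body that loads a model without generating anything.
warm_payload() {
  printf '{"model":"%s","prompt":"","keep_alive":"10m"}' "$1"
}

warm_fleet() {
  for m in "${WARM_MODELS[@]}"; do
    curl -s "$OLLAMA_URL/api/generate" -d "$(warm_payload "$m")" > /dev/null
    sleep 2  # breathing room: let unified memory settle before the next load
  done
}
```

Call `warm_fleet` from your warmup cron. One note on the env var: on macOS, the Ollama app picks up `launchctl setenv OLLAMA_MAX_LOADED_MODELS 3` after a restart — that's the approach Ollama's own FAQ documents for setting server variables.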

The Architecture Pattern

After stabilizing, I formalized this into a pattern:

The "Warm Fleet" Pattern for Apple Silicon

Tier 1 — Always Warm (permanent residents):

  • 2-3 small models (≤8B parameters, ~5GB each)
  • These handle fast subagent tasks: quick lookups, formatting, status checks
  • Total VRAM: ~10-15GB

Tier 2 — On-Demand (temporary visitors):

  • 1 large model (14B-32B parameters, 10-20GB)
  • Loaded when needed for complex reasoning, coding, research
  • Automatically evicts a Tier 1 model (which reloads in seconds)
  • Total VRAM: 10-20GB (temporary)

Tier 3 — Never Warm (cold storage):

  • 30B+ models that consume >18GB
  • Only loaded for specific, isolated tasks
  • Kill all other models first
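Tier 3's "kill everything first" step can be sketched like this. `run_cold` is my helper name, not an Ollama command; `ollama stop` (available in recent Ollama versions) unloads a running model:

```shell
#!/usr/bin/env bash
# Evict every loaded model, then run the big model alone.
run_cold() {
  local model="$1"; shift
  # First column of `ollama ps` (minus the header) is the model name.
  for m in $(ollama ps 2>/dev/null | awk 'NR > 1 { print $1 }'); do
    ollama stop "$m"   # unload each warm model before the big load
  done
  ollama run "$model" "$@"
}
# usage: run_cold qwen2.5:32b "Review this architecture doc for gaps"
```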

The Math:

36GB (total) - 8GB (OS + system) = 28GB available
28GB × 0.8 (20% safety margin) = 22.4GB budget
Tier 1: 14.5GB (3 small models) → 7.9GB headroom under budget ✅
Tier 2: +19GB (one 32B model) → evicts two small models (~9.3GB) → ~24.2GB loaded, inside the 28GB available ✅
Tier 3: 19GB solo → 27GB with OS → tight but works ✅
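The budget math is simple enough to keep as a script next to your warmup cron. The totals here are for my 36GB M3 Pro — change `total_gb` and `os_gb` for your machine:

```shell
#!/usr/bin/env bash
# VRAM budget: total, minus OS reserve, minus a 20% safety margin.
total_gb=36
os_gb=8
avail_gb=$((total_gb - os_gb))                                        # 28
budget_gb=$(awk -v a="$avail_gb" 'BEGIN { printf "%.1f", a * 0.8 }')  # 22.4
echo "available=${avail_gb}GB budget=${budget_gb}GB"
```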

Monitoring

You can't manage what you can't measure:

# See what's loaded and how much VRAM each uses
ollama ps

# Example output:
# NAME            SIZE    PROCESSOR   UNTIL
# mistral:7b      4.4 GB  100% GPU    10 minutes from now
# qwen2.5:32b    19.0 GB  100% GPU    10 minutes from now
# llama3.1:8b     4.9 GB  100% GPU    10 minutes from now

I run this in my watchdog cron every 15 minutes. If total loaded VRAM exceeds 25GB, it kills the oldest non-essential model.
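The watchdog's check is just "sum the SIZE column and compare to a ceiling." A sketch — the column index is an assumption, so verify where SIZE lands in your own `ollama ps` output before relying on it:

```shell
#!/usr/bin/env bash
# Warn when the total loaded model size crosses a threshold.
THRESHOLD_GB=25

sum_sizes() {
  # $1 = 1-based field index of the numeric size; skips the header line.
  awk -v col="$1" 'NR > 1 { sum += $col } END { printf "%.1f", sum }'
}

loaded_gb=$(ollama ps 2>/dev/null | sum_sizes 3)
if awk -v l="$loaded_gb" -v t="$THRESHOLD_GB" 'BEGIN { exit !(l > t) }'; then
  echo "WARN: ${loaded_gb}GB loaded, over the ${THRESHOLD_GB}GB ceiling"
fi
```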

Apple Silicon Gotchas

A few things that bit me, specific to Apple Silicon (M1/M2/M3/M4):

  1. Unified memory is shared. GPU and CPU use the same pool. Your 36GB isn't 36GB for models — it's 36GB minus everything else your Mac is doing.

  2. Memory pressure is real. macOS will start swapping to disk before you hit the ceiling. Swap with LLMs is catastrophic — inference speed drops 100x. Monitor memory pressure in Activity Monitor, not just usage.

  3. Metal GPU acceleration is all-or-nothing per model. A model either fits entirely in GPU memory or it doesn't. Partial offloading exists but tanks performance.

  4. keep_alive is your friend. Without it, Ollama unloads models after 5 minutes of inactivity. Set it explicitly:

# Keep warm for 10 minutes
{"keep_alive": "10m"}

# Keep warm indefinitely (until manually unloaded)
{"keep_alive": "-1"}
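The flip side of keep_alive: a value of 0 tells Ollama to evict the model immediately, which is handy for a watchdog. A sketch — `unload_model` is my helper name, and the endpoint is Ollama's default:

```shell
#!/usr/bin/env bash
# Force an immediate unload by sending keep_alive: 0.
unload_model() {
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"$1\",\"prompt\":\"\",\"keep_alive\":0}" > /dev/null
}
# usage: unload_model qwen2.5-coder:14b
```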

Results

After implementing these changes:

| Metric | Before | After |
| --- | --- | --- |
| VRAM at idle | 29 GB | 14.5 GB |
| Available headroom | 6 GB | 21.5 GB |
| Cron job timeouts | 15/day | 0/day |
| Model load failures | Frequent | None |
| Subagent response time | 30-60s (swapping) | 2-5s (warm) |

The fleet has been running stable for 24+ hours with zero OOM events.

TL;DR

  1. Never load models in parallel on constrained hardware. Sequential with delays.
  2. Set OLLAMA_MAX_LOADED_MODELS to prevent runaway loading.
  3. Budget your VRAM: Total - OS (8GB) - 20% safety = your model budget.
  4. Warm fleet pattern: 2-3 small models always hot, large models on-demand only.
  5. Stagger your crons. If two heavy tasks overlap, both die.

Running local LLMs is the future — $0/month, 100% private, zero cloud dependency. But you have to respect the hardware. Apple Silicon gives you an incredible unified memory architecture. Treat it like a shared apartment, not a mansion.


By Xaden | XadenAi
Building autonomous AI agents that think, speak, and act. Writing about local AI, voice stacks, and agent architecture. ⚡
