From Cloud-First to Local-First: Migrating My AI Agent to a 32B Open-Source Model ($3/day → $0/day)
Yesterday my AI agent cost me $3 to run. Today it costs $0.
Not because I stopped using it — I use it more than ever. I migrated from a cloud-hosted model (Anthropic's Claude Haiku 4-5) to a locally-running open-source model (Qwen 2.5-32B via Ollama) on my MacBook Pro M3 Pro.
This is the full story: what I tried, what failed, what worked, and the gotchas nobody warns you about.
The Starting Point
Before migration:
- Main agent: Claude Haiku 4-5 (Anthropic cloud)
- Context window: 200,000 tokens
- Cost: ~$3/day for active use ($0.80/M input, $4/M output)
- Privacy: Every prompt, every file read, every tool output → sent to Anthropic's servers
- Latency: 200-500ms per request (network round-trip)
- Uptime: Dependent on Anthropic's API availability
The agent runs 24/7, handling orchestration, file management, cron jobs, subagent delegation, and memory management. At $3/day, that's $90/month just for the main agent — not counting subagent calls to Claude Opus for complex tasks.
The Motivation
Three drivers pushed me to go local:
Cost. $90/month for a glorified orchestrator felt wrong when open-source models can run the same workload for free.
Privacy. My agent reads my files, my memory, my daily journals. Every tool output — including file contents, git diffs, and system diagnostics — gets sent to the cloud as context. That's a lot of private data flowing to a third party.
Independence. When Anthropic has an outage, my agent goes down. When they deprecate a model (Claude 3 Haiku → Haiku 4-5), my config breaks. I wanted zero external dependencies for core operations.
The Evaluation: 5 Candidates, 5 Failures
I started by evaluating every local model I had installed:
Round 1: The Small Models
| Model | Size | Context | Score | Verdict |
|---|---|---|---|---|
| mistral:7b | 4.4 GB | 32k | 4/10 | Too shallow for orchestration |
| qwen3:8b | 5.2 GB | 40k | 6.5/10 | Best small model, but 40k context too small |
| llama3.1:8b | 4.9 GB | 128k | 5/10 | Good context, slow startup, mediocre reasoning |
| qwen2.5-coder:14b | 9.0 GB | 128k | 3/10 | Coding specialist, poor general orchestration |
| qwen3:30b | 18.0 GB | 128k | 3/10 | Excellent quality, but 18GB VRAM = no room for subagents |
None of them worked as a main agent.
The small models (7B-8B) couldn't handle the reasoning complexity of orchestrating subagents, managing memory, and making architectural decisions. The 14B was a coding specialist that struggled with general tasks. The 30B was smart enough but consumed so much VRAM that nothing else could run alongside it.
Round 2: The Big Candidates
I needed something bigger. The requirements:
- 128k+ context window (agent sessions routinely hit 50-100k tokens)
- ≤22GB VRAM (leaving headroom for subagents on a 36GB machine)
- Strong reasoning (orchestration requires planning, delegation, error recovery)
Three candidates emerged:
| Model | Active Params | VRAM (w/ context) | Context | Quality |
|---|---|---|---|---|
| Mixtral 8x7B | 12.5B (MoE) | 29-32 GB | 32k (native) | Good |
| Llama 3.1 70B | 70B | 36-39 GB | 128k | Excellent |
| Qwen 2.5-32B | 32B | 19-22 GB | 128k | Very Good |
Mixtral 8x7B: Sparse mixture-of-experts. Only 12.5B parameters active per token, but the full 46.7B model needs to be in memory. At 29-32GB, it would leave only 4-7GB headroom. Too tight.
Llama 3.1 70B: The quality king. But at 36-39GB with context, it doesn't fit in 36GB of unified memory. Dead on arrival.
Qwen 2.5-32B: The Goldilocks model. 19GB base, ~22GB with full context, leaving 14GB of headroom. Strong reasoning benchmarks (MMLU 83.3, HumanEval 80+). 128k context window. Available on Ollama.
Winner: Qwen 2.5-32B. Not even close.
The Migration
Step 1: Pull the Model
```bash
ollama pull qwen2.5:32b
# Downloaded 19GB in ~10 minutes
```
Gotcha #1: I initially tried to pull qwen2.5:32b-instruct-q4_K_M because my research said that was the optimal quantization. Ollama returned 400 Bad Request: invalid model name. Ollama doesn't use quantization suffixes in pull commands — the default tag already uses an appropriate quantization. Just use qwen2.5:32b.
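A reliable way to avoid this class of error is to take the tag from Ollama's own listing rather than typing it from memory. A minimal sketch — the listing below is simulated sample output so the snippet is self-contained; in a real shell you'd use `list_output=$(ollama list)` instead:

```bash
# Simulated `ollama list` output (header row + model rows).
# In a real script, replace with: list_output=$(ollama list)
list_output="NAME          SIZE     MODIFIED
qwen2.5:32b   19 GB    2 hours ago
mistral:7b    4.4 GB   3 weeks ago"

# Extract the exact installed tag for the model family you want,
# skipping the header row:
tag=$(printf '%s\n' "$list_output" | awk 'NR>1 && $1 ~ /^qwen2\.5/ {print $1}')
echo "$tag"
```

Whatever string comes out of this is the string that goes in your config, character for character.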
Step 2: Update the Config
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen2.5:32b"
      }
    }
  }
}
```
Gotcha #2: My config management system auto-touches the config file on certain events (model reloads, heartbeat cycles). If you edit the file and something triggers a reload before your changes are picked up, your edits get overwritten. I had to verify my changes persisted by checking the file after a full restart cycle.
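A low-tech way to catch the overwrite is to checksum the file before and after a restart cycle. A sketch using POSIX cksum and a throwaway file (substitute your orchestrator's real config path):

```bash
# Illustrative config path: a temp file stands in for the real config.
cfg=$(mktemp)
echo '{"primary":"ollama/qwen2.5:32b"}' > "$cfg"

before=$(cksum "$cfg")
# ... trigger a full restart cycle here, then re-check ...
after=$(cksum "$cfg")

if [ "$before" = "$after" ]; then
  echo "config persisted"
else
  echo "config was overwritten; re-apply your edit"
fi
```

If the checksums differ, an auto-writer clobbered your edit and you need to re-apply it after whatever event triggered the rewrite.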
Step 3: Update the Warmup Rotation
Old warmup (4 small models):
```bash
# mistral:7b → qwen3:8b → llama3.1:8b → qwen2.5-coder:14b
```
New warmup (2 small + 1 large):
```bash
curl -s http://localhost:11434/api/generate \
  -d '{"model":"mistral:7b","prompt":"","keep_alive":"10m"}' && sleep 2
curl -s http://localhost:11434/api/generate \
  -d '{"model":"qwen2.5:32b","prompt":"","keep_alive":"10m"}' && sleep 2
curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"","keep_alive":"10m"}' && sleep 2
```
VRAM budget: mistral (4.4GB) + qwen2.5:32b (19GB) + llama3.1 (4.9GB) = 28.3GB — leaves 7.7GB headroom.
Gotcha #3: I deleted the old models (qwen3:8b, qwen2.5-coder:14b) to free disk space, but forgot to update the warmup cron. The cron kept trying to load deleted models every 4 minutes, generating errors that polluted my logs for an hour before I noticed. Always update your crons when you change your model lineup.
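A guard in the warmup script would have caught this immediately: warm only models that are actually installed. The sketch below simulates the installed list so it runs standalone; in the real cron you'd build it with `installed=$(ollama list | awk 'NR>1 {print $1}')` and put the curl warmup call where the echo is:

```bash
# Simulated installed-model list; in the real script, derive it
# from `ollama list` as noted above.
installed="mistral:7b
qwen2.5:32b
llama3.1:8b"

# qwen3:8b is deliberately in the rotation but not installed,
# to show the guard firing.
for model in mistral:7b qwen2.5:32b qwen3:8b; do
  if printf '%s\n' "$installed" | grep -qx "$model"; then
    echo "warming $model"
  else
    echo "skipping $model (not installed)" >&2
  fi
done
```

A skipped model produces one log line instead of an API error every 4 minutes.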
Step 4: Update Delegate Routing
My subagent routing table maps task types to models:
```json
{
  "bookkeeping": "ollama/mistral:7b",
  "formatting": "ollama/mistral:7b",
  "status": "ollama/mistral:7b",
  "writing": "ollama/llama3.1:8b",
  "coding": "ollama/qwen2.5:32b",
  "research": "ollama/qwen2.5:32b",
  "strategy": "ollama/qwen2.5:32b",
  "quick": "ollama/mistral:7b"
}
```
Heavy tasks (coding, research, strategy) go to the 32B model. Light tasks (formatting, status checks) go to mistral:7b for speed.
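The same table expressed as a shell lookup, in case your dispatcher is a script rather than a JSON-reading orchestrator (the function name is mine; the mappings mirror the table above, with unknown task types falling through to the fast default):

```bash
# Map a task type to its model; anything unrecognized gets the
# fast small model, same as the "quick" row in the table.
route_model() {
  case "$1" in
    coding|research|strategy) echo "ollama/qwen2.5:32b" ;;
    writing)                  echo "ollama/llama3.1:8b" ;;
    *)                        echo "ollama/mistral:7b" ;;
  esac
}

route_model coding   # ollama/qwen2.5:32b
route_model status   # ollama/mistral:7b
```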
Step 5: Keep Cloud as Emergency Fallback
I didn't delete my Anthropic credentials. Haiku 4-5 is still configured as a fallback:
```json
{
  "models": {
    "anthropic/claude-haiku-4-5": {},
    "anthropic/claude-opus-4-6": {}
  }
}
```
If the local model fails, the system can fall back to cloud. This has happened zero times in 24 hours, but the safety net exists.
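The decision logic is simple enough to sketch. Here the health status is passed in as an argument so the example is self-contained; in practice you'd derive it from a quick probe of the local server, e.g. `curl -sf --max-time 2 http://localhost:11434/api/tags`:

```bash
# Pick the local primary when the Ollama server is healthy,
# otherwise fall back to the configured cloud model.
pick_model() {
  if [ "$1" = "up" ]; then
    echo "ollama/qwen2.5:32b"
  else
    echo "anthropic/claude-haiku-4-5"
  fi
}

pick_model up     # ollama/qwen2.5:32b
pick_model down   # anthropic/claude-haiku-4-5
```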
Performance Comparison
After running both setups for a full day each:
| Metric | Cloud (Haiku 4-5) | Local (Qwen 2.5-32B) |
|---|---|---|
| Cost per day | ~$3.00 | $0.00 |
| Cost per month | ~$90 | $0 |
| Latency (first token) | 200-500ms | 50-100ms (warm) |
| Throughput | 50-80 t/s | 15-25 t/s |
| Context window | 200k | 128k |
| Privacy | Cloud-processed | 100% local |
| Uptime dependency | Anthropic API | Local hardware |
| Reasoning quality | 9/10 | 7.5/10 |
Trade-offs:
- Throughput is lower (15-25 t/s vs 50-80 t/s) — acceptable for orchestration tasks
- Context window is smaller (128k vs 200k) — manageable with context hygiene
- Reasoning quality dropped slightly — compensated by using Opus for complex subagent tasks
- Latency actually improved — no network round-trip
Lessons Learned
1. Model Tags Matter
qwen2.5:32b ≠ qwen2.5:32b-instruct-q4_K_M. Ollama has its own tag system. Always check ollama list to see exactly what's installed and use that exact tag in your config.
2. GPU Contention is Real
Running 3 subagent requests on qwen2.5:32b simultaneously caused all 3 to stall for 23+ minutes. Large models must process requests sequentially, not in parallel. Queue your subagent tasks.
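If your Ollama version supports it, you can also push the serialization down into the server itself. My understanding (worth verifying against your Ollama release) is that the OLLAMA_NUM_PARALLEL environment variable caps concurrent requests per loaded model:

```bash
# Server-side setting, read at startup: restart the server
# after changing it.
OLLAMA_NUM_PARALLEL=1 ollama serve
```

Client-side queuing still helps, since serialized requests will sit waiting on the server and time out if your client gives up too early.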
3. Config Persistence is Not Guaranteed
If your orchestration system auto-writes config files, your manual edits may be overwritten. Use version control (git) for your config and verify changes persist after restart cycles.
4. Delete Models Last
I deleted qwen3:8b before updating every reference to it. Crons, warmup scripts, delegate routing tables — all broke simultaneously. Update all references first, verify, then delete.
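This lesson condenses into a pre-delete check: grep every place a model name can hide before removing it. A sketch against a throwaway directory (in practice, point it at your config, cron, and routing files):

```bash
# Illustrative setup: a temp directory stands in for your
# config/cron/routing files.
dir=$(mktemp -d)
echo 'warmup: qwen3:8b' > "$dir/warmup.cron"

model="qwen3:8b"
if grep -rq "$model" "$dir"; then
  echo "still referenced: update configs first"
else
  echo "safe to run: ollama rm $model"
fi
```

Only when the grep comes back empty is it safe to run the actual `ollama rm`.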
5. The 32B Sweet Spot
On 36GB Apple Silicon, 32B parameter models hit a sweet spot: smart enough for real reasoning, small enough to leave room for subagents. Anything larger (70B) doesn't fit. Anything smaller (8B) can't reason well enough for orchestration.
The Bottom Line
| | Before | After |
|---|---|---|
| Monthly cost | $90 | $0 |
| Annual cost | $1,080 | $0 |
| Privacy | Cloud-dependent | 100% local |
| External dependencies | Anthropic API | None |
| Quality | Excellent | Very Good (with Opus fallback for complex tasks) |
The migration took about 3 hours of active work (including all the mistakes documented above). It will save $1,080/year and keep every byte of data on my machine.
Is the local model as smart as Claude Haiku? No. Is it smart enough to orchestrate a fleet of AI agents, manage memory, run cron jobs, and delegate tasks? Absolutely.
For $0/month, that's more than enough.
By Xaden | XadenAi
Running a fully local AI agent fleet for $0/month. Follow along to learn how. ⚡