
From Cloud-First to Local-First: Migrating My AI Agent to a 32B Open-Source Model ($3/day → $0/day)

Yesterday my AI agent cost me $3 to run. Today it costs $0.

Not because I stopped using it — I use it more than ever. I migrated from a cloud-hosted model (Anthropic's Claude Haiku 4.5) to a locally running open-source model (Qwen 2.5-32B via Ollama) on my MacBook Pro M3 Pro.

This is the full story: what I tried, what failed, what worked, and the gotchas nobody warns you about.

The Starting Point

Before migration:

  • Main agent: Claude Haiku 4.5 (Anthropic cloud)
  • Context window: 200,000 tokens
  • Cost: ~$3/day for active use ($0.80/M input, $4/M output)
  • Privacy: Every prompt, every file read, every tool output → sent to Anthropic's servers
  • Latency: 200-500ms per request (network round-trip)
  • Uptime: Dependent on Anthropic's API availability

The agent runs 24/7, handling orchestration, file management, cron jobs, subagent delegation, and memory management. At $3/day, that's $90/month just for the main agent — not counting subagent calls to Claude Opus for complex tasks.

The Motivation

Three drivers pushed me to go local:

  1. Cost. $90/month for a glorified orchestrator felt wrong when open-source models can run the same workload for free.

  2. Privacy. My agent reads my files, my memory, my daily journals. Every tool output — including file contents, git diffs, and system diagnostics — gets sent to the cloud as context. That's a lot of private data flowing to a third party.

  3. Independence. When Anthropic has an outage, my agent goes down. When they deprecate a model (Claude 3 Haiku → Haiku 4.5), my config breaks. I wanted zero external dependencies for core operations.

The Evaluation: 5 Candidates, 5 Failures

I started by evaluating every local model I had installed:

Round 1: The Small Models

| Model | Size | Context | Score | Verdict |
|---|---|---|---|---|
| mistral:7b | 4.4 GB | 32k | 4/10 | Too shallow for orchestration |
| qwen3:8b | 5.2 GB | 40k | 6.5/10 | Best small model, but 40k context too small |
| llama3.1:8b | 4.9 GB | 128k | 5/10 | Good context, slow startup, mediocre reasoning |
| qwen2.5-coder:14b | 9.0 GB | 128k | 3/10 | Coding specialist, poor general orchestration |
| qwen3:30b | 18.0 GB | 128k | 3/10 | Excellent quality, but 18GB VRAM = no room for subagents |

None of them worked as a main agent.

The small models (7B-8B) couldn't handle the reasoning complexity of orchestrating subagents, managing memory, and making architectural decisions. The 14B was a coding specialist that struggled with general tasks. The 30B was smart enough but consumed so much VRAM that nothing else could run alongside it.

Round 2: The Big Candidates

I needed something bigger. The requirements:

  • 128k+ context window (agent sessions routinely hit 50-100k tokens)
  • ≤22GB VRAM (leaving headroom for subagents on a 36GB machine; quick math below)
  • Strong reasoning (orchestration requires planning, delegation, error recovery)
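
On Apple Silicon the GPU shares unified memory, so the headroom check is simple arithmetic. A quick sanity check, mirroring the numbers above:

# Unified memory: "VRAM" headroom is just total RAM minus the loaded model
TOTAL_GB=$(( $(sysctl -n hw.memsize) / 1073741824 ))   # 36 on this machine
MODEL_GB=22                                            # 32B model with full context
echo "subagent headroom: $(( TOTAL_GB - MODEL_GB )) GB"   # prints 14 GB here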

Three candidates emerged:

| Model | Active Params | VRAM (w/ context) | Context | Quality |
|---|---|---|---|---|
| Mixtral 8x7B | 12.5B (MoE) | 29-32 GB | 32k (native) | Good |
| Llama 3.1 70B | 70B | 36-39 GB | 128k | Excellent |
| Qwen 2.5-32B | 32B | 19-22 GB | 128k | Very Good |

  • Mixtral 8x7B: Sparse mixture-of-experts. Only 12.5B parameters active per token, but the full 46.7B model needs to be in memory. At 29-32GB, it would leave only 4-7GB headroom. Too tight.

  • Llama 3.1 70B: The quality king. But at 36-39GB with context, it literally doesn't fit in 36GB. Dead on arrival.

  • Qwen 2.5-32B: The Goldilocks model. 19GB base, ~22GB with full context, leaving 14GB of headroom. Strong reasoning benchmarks (MMLU 83.3, HumanEval 80+). 128k context window. Available on Ollama.

Winner: Qwen 2.5-32B. Not even close.

The Migration

Step 1: Pull the Model

ollama pull qwen2.5:32b
# Downloaded 19GB in ~10 minutes

Gotcha #1: I initially tried to pull qwen2.5:32b-instruct-q4_K_M because my research said that was the optimal quantization. Ollama returned 400 Bad Request: invalid model name. That exact tag wasn't available in the registry, and it didn't matter: the default tag already uses an appropriate quantization. Just use qwen2.5:32b.
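
To see exactly what a tag ships, ask Ollama directly (on recent versions, ollama show includes the quantization):

# Check what actually got installed and which quantization the default tag uses
ollama list               # exact installed tags and their sizes
ollama show qwen2.5:32b   # model details, including quantization (typically Q4_K_M here)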

Step 2: Update the Config

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen2.5:32b"
      }
    }
  }
}

Gotcha #2: My config management system auto-touches the config file on certain events (model reloads, heartbeat cycles). If you edit the file and something triggers a reload before your changes are picked up, your edits get overwritten. I had to verify my changes persisted by checking the file after a full restart cycle.
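
The verification can be a one-liner. The config path here is hypothetical, so adjust it to wherever your agent keeps its config:

# After a full restart cycle, confirm the model switch is still in the file
grep -n '"primary"' ~/.agent/config.json
# expected: "primary": "ollama/qwen2.5:32b"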

Step 3: Update the Warmup Rotation

Old warmup (4 small models):

# mistral:7b → qwen3:8b → llama3.1:8b → qwen2.5-coder:14b

New warmup (2 small + 1 large):

curl -s http://localhost:11434/api/generate \
  -d '{"model":"mistral:7b","prompt":"","keep_alive":"10m"}' && sleep 2
curl -s http://localhost:11434/api/generate \
  -d '{"model":"qwen2.5:32b","prompt":"","keep_alive":"10m"}' && sleep 2
curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"","keep_alive":"10m"}' && sleep 2

VRAM budget: mistral (4.4GB) + qwen2.5:32b (19GB) + llama3.1 (4.9GB) = 28.3GB — leaves 7.7GB headroom.

Gotcha #3: I deleted the old models (qwen3:8b, qwen2.5-coder:14b) to free disk space, but forgot to update the warmup cron. The cron kept trying to load deleted models every 4 minutes, generating errors that polluted my logs for an hour before I noticed. Always update your crons when you change your model lineup.
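
A guard in the warmup script would have caught this: warm only what ollama list actually reports. A sketch:

# Warm only models that are actually installed; log and skip anything missing
for m in mistral:7b qwen2.5:32b llama3.1:8b; do
  if ollama list | awk 'NR>1 {print $1}' | grep -qx "$m"; then
    curl -s http://localhost:11434/api/generate \
      -d "{\"model\":\"$m\",\"prompt\":\"\",\"keep_alive\":\"10m\"}" > /dev/null
    sleep 2
  else
    echo "warmup: $m not installed, skipping" >&2
  fi
done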

Step 4: Update Delegate Routing

My subagent routing table maps task types to models:

{
  "bookkeeping": "ollama/mistral:7b",
  "formatting": "ollama/mistral:7b",
  "status": "ollama/mistral:7b",
  "writing": "ollama/llama3.1:8b",
  "coding": "ollama/qwen2.5:32b",
  "research": "ollama/qwen2.5:32b",
  "strategy": "ollama/qwen2.5:32b",
  "quick": "ollama/mistral:7b"
}

Heavy tasks (coding, research, strategy) go to the 32B model. Light tasks (formatting, status checks) go to mistral:7b for speed.
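
For illustration, a dispatcher consuming that table might look like this; the routing.json filename and the prompt are made up:

# Hypothetical dispatch: resolve a task type to its model, then call Ollama
TASK="coding"
MODEL=$(jq -r --arg t "$TASK" '.[$t]' routing.json)   # -> ollama/qwen2.5:32b
curl -s http://localhost:11434/api/generate \
  -d "{\"model\":\"${MODEL#ollama/}\",\"prompt\":\"Refactor utils.sh\",\"stream\":false}"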

Step 5: Keep Cloud as Emergency Fallback

I didn't delete my Anthropic credentials. Haiku 4.5 is still configured as a fallback:

{
  "models": {
    "anthropic/claude-haiku-4-5": {},
    "anthropic/claude-opus-4-6": {}
  }
}

If the local model fails, the system can fall back to cloud. This has happened zero times in 24 hours, but the safety net exists.
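
The fallback decision itself can be a simple health probe against the local server. A minimal sketch, assuming the default Ollama port:

# If the local endpoint answers, use it; otherwise route to the cloud fallback
if curl -sf --max-time 2 http://localhost:11434/api/tags > /dev/null; then
  MODEL="ollama/qwen2.5:32b"
else
  MODEL="anthropic/claude-haiku-4-5"   # emergency cloud fallback
fi
echo "primary model for this session: $MODEL"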

Performance Comparison

After running both setups for a full day each:

| Metric | Cloud (Haiku 4.5) | Local (Qwen 2.5-32B) |
|---|---|---|
| Cost per day | ~$3.00 | $0.00 |
| Cost per month | ~$90 | $0 |
| Latency (first token) | 200-500ms | 50-100ms (warm) |
| Throughput | 50-80 t/s | 15-25 t/s |
| Context window | 200k | 128k |
| Privacy | Cloud-processed | 100% local |
| Uptime dependency | Anthropic API | Local hardware |
| Reasoning quality | 9/10 | 7.5/10 |

Trade-offs:

  • Throughput is lower (15-25 t/s vs 50-80 t/s) — acceptable for orchestration tasks
  • Context window is smaller (128k vs 200k) — manageable with context hygiene
  • Reasoning quality dropped slightly — compensated by using Opus for complex subagent tasks
  • Latency actually improved — no network round-trip

Lessons Learned

1. Model Tags Matter

qwen2.5:32b ≠ qwen2.5:32b-instruct-q4_K_M. Ollama has its own tag system. Always check ollama list to see exactly what's installed and use that exact tag in your config.

2. GPU Contention is Real

Running 3 subagent requests on qwen2.5:32b simultaneously caused all 3 to stall for 23+ minutes. Large models must process requests sequentially, not in parallel. Queue your subagent tasks.
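
Two ways to do that queuing, sketched below: server-side via Ollama's OLLAMA_NUM_PARALLEL setting (needs an Ollama restart), or client-side in whatever fires your subagents. The task prompts are placeholders:

# Server-side: limit Ollama to one request at a time per model.
# On the macOS app, set the env var, then restart Ollama.
launchctl setenv OLLAMA_NUM_PARALLEL 1

# Client-side: fire subagent requests one at a time instead of concurrently
for task in "review the PR" "summarize logs" "draft a plan"; do
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"qwen2.5:32b\",\"prompt\":\"$task\",\"stream\":false}" > /dev/null
done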

3. Config Persistence is Not Guaranteed

If your orchestration system auto-writes config files, your manual edits may be overwritten. Use version control (git) for your config and verify changes persist after restart cycles.
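
A minimal version of that workflow, assuming the config lives in a hypothetical ~/.agent directory:

# Track the config, commit your intended change, then verify after a restart
cd ~/.agent && git init -q
git add config.json && git commit -qm "switch primary model to qwen2.5:32b"
# ...full restart cycle, then:
git status --short config.json   # no output = nothing rewrote your edit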

4. Delete Models Last

I deleted qwen3:8b before updating every reference to it. Crons, warmup scripts, delegate routing tables — all broke simultaneously. Update all references first, verify, then delete.
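
The safe order, sketched with example paths:

# 1) Find every reference to the model before touching it
grep -rn "qwen3:8b" ~/.agent ~/scripts 2>/dev/null
crontab -l | grep "qwen3:8b"
# 2) Update those references, re-run the greps until clean
# 3) Only then remove the model
ollama rm qwen3:8b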

5. The 32B Sweet Spot

On 36GB Apple Silicon, 32B parameter models hit a sweet spot: smart enough for real reasoning, small enough to leave room for subagents. Anything larger (70B) doesn't fit. Anything smaller (8B) can't reason well enough for orchestration.

The Bottom Line

| | Before | After |
|---|---|---|
| Monthly cost | $90 | $0 |
| Annual cost | $1,080 | $0 |
| Privacy | Cloud-dependent | 100% local |
| External dependencies | Anthropic API | None |
| Quality | Excellent | Very Good (with Opus fallback for complex tasks) |

The migration took about 3 hours of active work (including all the mistakes documented above). It will save $1,080/year and keep every byte of data on my machine.

Is the local model as smart as Claude Haiku? No. Is it smart enough to orchestrate a fleet of AI agents, manage memory, run cron jobs, and delegate tasks? Absolutely.

For $0/month, that's more than enough.


By Xaden | XadenAi
Running a fully local AI agent fleet for $0/month. Follow along to learn how. ⚡
