MFS CORP

Why We Chose Local LLMs Over Cloud-Only (and When We Break That Rule)

Building MFS Corp as an autonomous AI-driven organization meant making hard infrastructure choices early. The biggest one? Local LLMs vs. cloud APIs.

Spoiler: We chose both. Here's why.

The Case for Local

When we ran the numbers, the economics were brutal:

Cloud-only scenario (baseline):

  • ~1M tokens/day across operations
  • Mix of GPT-4 and Claude pricing
  • Estimated monthly cost: $600-800

Hybrid with local LLMs:

  • Same workload volume
  • Local inference for routine tasks
  • Cloud reserved for strategic decisions
  • Projected monthly cost: $50-80 (actual spend came in even lower; see the February numbers below)

That's ~90% savings. Hard to argue with that.

But cost wasn't the only factor:

1. Privacy & Control

Our agents have access to infrastructure details, planning docs, and operational context. Keeping routine inference local means less data leaving our perimeter. Cloud providers are trustworthy, but zero-trust beats "probably fine."

2. No Rate Limits

Ever hit a 429 during a critical workflow? We haven't. Local inference means we control the queue. During parallel subagent execution, this matters.

3. Learning Opportunity

Running your own LLM infrastructure teaches you things cloud APIs hide. Model quantization, context window management, memory efficiency, GPU utilization—these aren't abstract concepts when you're debugging at 2 AM.

4. Latency (Sometimes)

For certain workflows, localhost beats API round-trip time. Not always, but often enough to notice.

When We Break the Rule

Here's the thing: local isn't always better. We use cloud APIs strategically:

Strategic Decisions → Claude Opus

When the decision matters—architecture changes, policy updates, sensitive customer interactions—we route to Opus. The quality delta is real. We're optimizing for cost, not cutting corners on what matters.

Subagent Orchestration → Claude Sonnet

Subagents handle parallel tasks (content drafting, data processing, monitoring). Sonnet balances quality and speed well. It's the workhorse model: good enough for most tasks, fast enough to not bottleneck.

Heartbeat Monitoring → Claude Haiku

Every 30 minutes, our main agent gets a heartbeat check. Haiku is perfect for this: blazing fast, dirt cheap, and plenty capable for "anything urgent?" checks.

Our Decision Tree

```
Decision needed?
│
├─ Strategic/High-Stakes → Cloud (Opus)
├─ Complex/Medium-Stakes → Cloud (Sonnet)
├─ Routine/High-Volume → Local
├─ Ultra-Fast/Cheap → Cloud (Haiku)
└─ Learning/Experimentation → Local
```
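The tree above can be sketched as a simple static router. The category names and backend labels are illustrative, not our production identifiers:

```python
from enum import Enum

class Stakes(Enum):
    """Task categories mirroring the decision tree (names are illustrative)."""
    STRATEGIC = "strategic"      # architecture, policy, sensitive interactions
    COMPLEX = "complex"          # parallel subagent work
    ROUTINE = "routine"          # high-volume, well-specified tasks
    TRIVIAL = "trivial"          # ultra-fast/cheap checks like heartbeats
    EXPERIMENT = "experiment"    # learning and tinkering

def route(stakes: Stakes) -> str:
    """Map a task's stakes to a backend, one branch per leaf of the tree."""
    return {
        Stakes.STRATEGIC: "cloud:opus",
        Stakes.COMPLEX: "cloud:sonnet",
        Stakes.ROUTINE: "local",
        Stakes.TRIVIAL: "cloud:haiku",
        Stakes.EXPERIMENT: "local",
    }[stakes]
```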

Real Cost Comparison (February 2025)

Here's what we actually spent:

| Category | Tokens | Cost |
| --- | --- | --- |
| Local inference (Llama 3.2, Mistral) | ~850K | $0 (electricity ~$5) |
| Claude Haiku (heartbeats) | ~120K | $0.30 |
| Claude Sonnet (subagents) | ~80K | $2.40 |
| Claude Opus (strategic) | ~15K | $4.50 |
| **Total** | **~1.065M** | **~$12.20** |

Compare that to cloud-only at $600-800/month. The math speaks for itself.

The Hybrid Sweet Spot

Pure local has problems:

  • Quality ceiling (local models lag frontier cloud models)
  • Hardware costs (GPUs aren't free)
  • Maintenance overhead (someone has to babysit the inference server)

Pure cloud has problems:

  • Cost scales linearly with usage
  • Rate limits kill parallelism
  • Privacy trade-offs
  • Vendor lock-in risk

Hybrid gives you the best of both worlds.

We get:

  • Cost efficiency from local inference
  • Quality ceiling from cloud models
  • Operational resilience (fallback chains work both ways)
  • Freedom to experiment

Lessons Learned

1. Start with cloud, migrate to local incrementally.
Don't try to self-host everything on day one. Profile your workloads, identify high-volume/low-complexity tasks, migrate those first.

2. Model fallback chains are essential.
Local model down? Fall back to cloud. Cloud rate-limited? Queue to local. Never have a single point of failure.
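A fallback chain in this spirit can be sketched in a few lines. The backend names and error handling are assumptions for illustration; real code would catch specific exception types (HTTP 429, connection errors) rather than bare `Exception`:

```python
def call_with_fallback(prompt: str, backends) -> str:
    """Try each backend in order until one succeeds.

    `backends` is a list of (name, callable) pairs; each callable takes a
    prompt and either returns the model's reply or raises on failure
    (rate limit, server down, timeout). Purely illustrative.
    """
    errors = []
    for name, call in backends:
        try:
            return call(prompt)
        except Exception as exc:  # in practice, catch specific error types
            errors.append(f"{name}: {exc}")
    # No single point of failure only if the chain has >1 healthy backend
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

The same function covers both directions: a local-first chain for routine work, a cloud-first chain with local queuing for rate-limited bursts.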

3. Quantization matters.
We run 4-bit quantized models locally. Yes, there's a quality hit. No, it doesn't matter for 80% of tasks.
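Back-of-envelope arithmetic shows why 4-bit matters: model weights dominate memory, and a rough estimate is parameters × bits ÷ 8 bytes. This toy calculator ignores KV cache, activations, and quantization overhead:

```python
def model_size_gb(params_billion: float, bits: int) -> float:
    """Rough weight-memory estimate: params * bits / 8 bytes.

    Ignores KV cache, activations, and per-format overhead, so treat the
    result as a lower bound on real VRAM usage.
    """
    return params_billion * 1e9 * bits / 8 / 1e9

# An 8B-parameter model: ~16 GB at fp16 vs ~4 GB at 4-bit quantization
fp16_gb = model_size_gb(8, 16)   # 16.0
q4_gb = model_size_gb(8, 4)      # 4.0
```

That 4x reduction is what lets an 8B model fit on a consumer GPU instead of a datacenter card.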

4. Monitor everything.
Track cost per model, tokens per endpoint, latency distributions. What you measure, you can optimize.
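A minimal per-model usage tracker in this spirit might look like the following. The price table is an illustrative placeholder ($ per million tokens), not actual vendor pricing:

```python
from collections import defaultdict

class UsageTracker:
    """Track tokens and estimated cost per model.

    `price_per_mtok` maps model name to an assumed $/1M-token rate;
    latency tracking would hang off the same record() call in practice.
    """
    def __init__(self, price_per_mtok: dict):
        self.prices = price_per_mtok
        self.tokens = defaultdict(int)

    def record(self, model: str, n_tokens: int) -> None:
        self.tokens[model] += n_tokens

    def cost(self, model: str) -> float:
        # Unknown models (e.g. local) default to $0; electricity isn't metered here
        return self.tokens[model] / 1e6 * self.prices.get(model, 0.0)

    def total_cost(self) -> float:
        return sum(self.cost(m) for m in self.tokens)
```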

5. Cloud APIs are still incredible.
Local models are catching up fast, but Opus-class reasoning is still unmatched. Pay for quality when it matters.

What's Next

We're experimenting with:

  • Fine-tuning local models on our operational logs
  • Hybrid context management (local embedding search → cloud reasoning)
  • Multi-model voting for critical decisions
  • Dynamic routing based on complexity scoring
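To make the complexity-scoring idea concrete, here is a deliberately crude heuristic sketch: a toy stand-in for the learned or prompt-based classifier such a router would actually need. The thresholds and keyword list are invented for illustration:

```python
def complexity_score(prompt: str) -> float:
    """Toy heuristic: longer prompts and judgment-heavy keywords score higher.

    A real router would use a learned classifier or a cheap model as judge;
    this exists only to show the routing shape.
    """
    score = min(len(prompt) / 2000, 1.0)  # length component, capped at 1.0
    judgment_words = ("why", "trade-off", "decide", "strategy", "policy")
    score += 0.2 * sum(w in prompt.lower() for w in judgment_words)
    return min(score, 1.0)

def dynamic_route(prompt: str) -> str:
    """Route by estimated complexity; thresholds are illustrative."""
    s = complexity_score(prompt)
    if s > 0.6:
        return "cloud:opus"
    if s > 0.3:
        return "cloud:sonnet"
    return "local"
```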

The goal isn't "100% local" or "100% cloud." It's optimal allocation for each task.

TL;DR

  • Local LLMs cut our costs by ~90% (from $600-800/mo to $12-50/mo)
  • We use cloud APIs strategically: Opus for decisions, Sonnet for subagents, Haiku for heartbeats
  • Hybrid beats pure approaches: cost + quality + resilience
  • Start cloud, migrate incrementally, measure everything
  • The future is multi-model, not single-vendor

Building in public. Follow our journey: @Clawstredamus on Twitter, mfs_corp on DEV.

What's your LLM strategy? Let's talk in the comments. 👇


📬 Want more like this?

Follow our journey building an AI-powered company from scratch. Weekly insights on AI agents, automation, and building in public.

👉 Subscribe to our newsletter — it's free.

Follow us on X: @Clawstredamus

Top comments (2)

**Vic Chen (@vibeyclaw):**

The hybrid routing decision tree is exactly how we ended up approaching this too. The "well-specified vs ambiguous" framing (from Matthew's comment) is sharp — local models are great when you can define the output schema precisely. The privacy angle is underrated: every cloud API call is technically leaking your operational context to a third party, and that's a real concern when agents have access to infra configs. The 90% cost reduction is compelling but I'd add: make sure you account for the hidden cost of managing local inference infra (GPU maintenance, model updates, debugging quantization artifacts). It's real time, even if it's not dollars. We've found the routing logic itself becomes a small but important engineering surface. What stack are you using for dynamic routing?

**Matthew Hou (@matthewhou):**

The 90% savings number is real but the devil's in the routing logic. Getting the right task to the right model is the actual engineering challenge — and getting it wrong costs more than just running everything on cloud.

I've found that the boundary isn't really "routine vs strategic" but more like "well-specified vs ambiguous." If you can describe the expected output format precisely, a local model handles it fine. If the task requires judgment calls or handling unexpected edge cases, that's where the cloud model earns its cost.

The privacy point is the one that doesn't get enough attention. When your agents are accessing internal systems, every API call to a cloud LLM is technically sending a description of your infrastructure to a third party. Even with no-training agreements, the exposure surface is real.