The pitch for running LLMs on your own Mac is seductive: no rate limits, no API keys, no data leaving the machine. Then you put the actual numbers in a spreadsheet and the cloud wins on cost alone — usually by 30x or more.
A Hacker News thread on offline LLM energy use this week ran the arithmetic, and the gap between "feels free" and "actually free" is wider than most developers expect. The framing matters: when developers compare local vs cloud they usually mean "free vs metered." That mental model is wrong. Local has a fixed cost (hardware plus electricity over time) and cloud has a variable cost (per token). The question isn't which is free; it's which one has a lower total cost for your specific usage pattern.
The hardware and per-token math
To run a 70B-parameter model with reasonable quality at usable speeds, you need 48GB of unified memory minimum, ideally more. The configurations actually capable of holding Llama 3.3 70B or Qwen 2.5 72B without aggressive quantization that degrades output:
- M-series Max MacBook Pro, 64GB: ~$3,999
- M-series Ultra Mac Studio, 128GB: ~$4,799
- M-series Ultra Mac Studio, 192GB: ~$6,599
Drop below 32GB of unified memory and you're running 8B-class models — fine for autocomplete, not fine for anything you'd otherwise call OpenRouter for.
Assume a three-year useful life. A $6,599 Mac Studio depreciates at $6.03/day before electricity. If you use it for inference 4 hours a day, you're amortizing $1.51/hour of hardware cost before the GPU produces a single token.
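The amortization step above is straight-line depreciation. A minimal sketch, assuming zero resale value and the article's three-year life and four hours of daily use (the function name is just illustrative):

```python
def amortized_hardware_cost(price_usd: float,
                            life_years: float = 3.0,
                            hours_per_day: float = 4.0) -> tuple[float, float]:
    """Return (cost per calendar day, cost per hour of use).

    Assumes straight-line depreciation to zero over life_years.
    """
    per_day = price_usd / (life_years * 365)
    per_hour = per_day / hours_per_day
    return per_day, per_hour


per_day, per_hour = amortized_hardware_cost(6599)
print(f"${per_day:.2f}/day, ${per_hour:.2f}/hour of inference")
```

Changing `hours_per_day` is the quickest way to see how strongly the hourly figure depends on utilization.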
A maxed Ultra running Llama 3.3 70B in 4-bit quantization produces roughly 10-15 tokens per second on a typical prompt. Call it 13 tokens/sec sustained. Under inference load, the Studio draws 150-220W from the wall. Run those numbers for one hour:
- Tokens produced: ~47,000
- Energy: ~0.2 kWh
- Electricity at $0.20/kWh: $0.04
- Hardware amortization: $1.51
- All-in cost per million tokens: ~$33
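The one-hour tally can be reproduced in a few lines. This is a sketch using the figures quoted above; the 200W draw is an assumed point inside the stated 150-220W range, chosen because it matches the ~0.2 kWh line.

```python
def local_cost_per_million_tokens(
    hardware_usd_per_hour: float = 1.51,  # amortization figure quoted above
    tokens_per_sec: float = 13.0,         # sustained generation speed
    watts: float = 200.0,                 # assumed wall draw under load
    usd_per_kwh: float = 0.20,
) -> float:
    """All-in local cost (hardware + electricity) per million tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    electricity_usd_per_hour = (watts / 1000) * usd_per_kwh
    total_usd_per_hour = hardware_usd_per_hour + electricity_usd_per_hour
    return total_usd_per_hour / tokens_per_hour * 1_000_000


cost = local_cost_per_million_tokens()
print(f"~${cost:.0f} per million tokens")
```

Electricity is a small slice of the total at this utilization; the hardware term dominates.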
Now price the same workload on OpenRouter:
- DeepSeek V3.1: $0.27/MTok input, $1.10/MTok output
- Llama 3.3 70B: $0.40-0.80/MTok blended depending on provider
- Qwen 2.5 72B Instruct: $0.40/MTok blended
For a 70%-input/30%-output mix, you'll pay $0.50-$0.80 per million tokens on OpenRouter for the same models you would run on your Mac. Against roughly $33 per million locally, that's about a 40-65x cost advantage for the cloud, and the cloud is 5-10x faster per token thanks to H100s and B200s on the other end. Higher utilization narrows the gap but does not close it: at 13 tokens/sec, even running 24 hours a day for the full three years yields about 1.2 billion tokens, which works out to roughly $5.40 per million in hardware plus about $0.85 per million in electricity. That is still around 8-12x the cloud price.
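A blended price is just a weighted average of the input and output rates. The sketch below applies the 70/30 mix to the DeepSeek V3.1 rates listed above and compares the cloud range with the ~$33 local figure; the helper name is illustrative, not part of any API.

```python
def blended_price(input_usd_per_mtok: float,
                  output_usd_per_mtok: float,
                  input_share: float = 0.7) -> float:
    """Weighted-average price per million tokens for a given input share."""
    return (input_share * input_usd_per_mtok
            + (1 - input_share) * output_usd_per_mtok)


deepseek = blended_price(0.27, 1.10)
print(f"DeepSeek V3.1 blended: ${deepseek:.2f}/MTok")

LOCAL_USD_PER_MTOK = 33.0  # all-in local figure from the one-hour tally
for cloud in (0.50, 0.80):
    ratio = LOCAL_USD_PER_MTOK / cloud
    print(f"cloud at ${cloud:.2f}/MTok -> local costs {ratio:.0f}x more")
```

Real workloads rarely sit at exactly 70/30, so it is worth rerunning this with your own `input_share`.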
The "free local inference" framing assumes the hardware cost is already sunk. If you own the Mac for other reasons and only use it occasionally for LLM work, the marginal cost really is close to electricity. But buying a $6,000+ machine specifically for LLM workloads almost never pencils out on cost alone.
Where local actually wins
The math flips in four specific scenarios:
Privacy-constrained workloads. Healthcare records, internal source code under NDA, financial data with regulatory exposure — these can't legally or contractually go to a third-party API. Local inference isn't competing on cost; it's competing with "you can't do this at all."
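High-volume team autocomplete. A team running a self-hosted Continue.dev or local Codestral instance with 10+ engineers hitting it constantly can saturate a Mac Studio's throughput in a way that beats per-token billing. The break-even arrives around 4-5 million tokens/day of sustained traffic per machine. That volume is well above the roughly 1.1 million tokens/day that a single 13 tokens/sec stream can produce, so it only applies to smaller, faster models such as Codestral, not to the 70B figures above.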
Latency-bound interactive use. OpenRouter routes through public internet, often with 200-500ms before the first token. A local M-series produces time-to-first-token under 100ms. For agentic loops with many small calls, that overhead compounds.
Offline reliability. Plane wifi, conference networks, oncall in a basement. The Mac doesn't care.
Outside those scenarios, the cloud math is brutal.
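For the latency-bound case, here is a rough sketch of how time-to-first-token adds up across sequential calls. The 50-call loop is a hypothetical workload, and the 100ms local figure is the upper bound quoted above.

```python
def ttft_overhead_seconds(num_calls: int, ttft_ms: float) -> float:
    """Total time spent waiting for first tokens across sequential calls."""
    return num_calls * ttft_ms / 1000


CALLS = 50  # hypothetical agentic loop with 50 sequential small calls

cloud_low = ttft_overhead_seconds(CALLS, 200)
cloud_high = ttft_overhead_seconds(CALLS, 500)
local_max = ttft_overhead_seconds(CALLS, 100)

print(f"cloud: {cloud_low:.0f}-{cloud_high:.0f}s waiting, "
      f"local: under {local_max:.0f}s")
```

This counts only the wait before the first token. For longer outputs, the cloud's faster generation speed can offset that head start.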
What the numbers don't show
Raw cost-per-token is only one axis. A few things the math obscures:
- Model quality. OpenRouter exposes Claude Sonnet 4.5, GPT-4o, Gemini 2.5 Pro. A local 70B is roughly competitive with GPT-4o-mini on most benchmarks and meaningfully worse on hard reasoning. If output quality matters, the cloud option isn't substitutable.
- Concurrency. Your Mac runs one inference at a time at full speed. OpenRouter scales to whatever load you throw at it.
- Tail latency. Cloud APIs occasionally hang for 30+ seconds; a local instance is more predictable.
- Heat and noise. A Studio under sustained inference load runs hot enough that you hear the fans. In a quiet home office, that matters.
The decision framework
Before speccing a Mac Studio for inference work, run this checklist:
- How many tokens per day will you actually generate? Most developers writing code with AI use 50K-500K tokens/day. At OpenRouter prices, that's $0.05-$2/day. At those rates, the $6,599 purchase price alone equals about 9 years of cloud spend at $2/day and far longer at the low end, well past the three-year useful life assumed earlier.
- Do you need a frontier model? If the work involves complex reasoning, multi-step planning, or production-quality writing, you need Claude or GPT-4-class output, not a local 70B.
- Do you have a compliance reason? This is the only category where cost analysis doesn't apply.
- Are you running an inference workload, not a development workload? If you're serving end users from the Mac, the math changes. If you're just coding faster, it usually doesn't.
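The first checklist question reduces to one division. A minimal sketch, assuming the purchase price is the only local cost (no electricity, no resale value) and using the upper-end $2/day cloud figure as the example input:

```python
def daily_cloud_cost(tokens_per_day: float, usd_per_mtok: float) -> float:
    """Cloud spend per day for a given volume and blended price."""
    return tokens_per_day / 1_000_000 * usd_per_mtok


def years_to_match_hardware(price_usd: float, cloud_usd_per_day: float) -> float:
    """Years of cloud spend needed to equal the hardware purchase price."""
    return price_usd / cloud_usd_per_day / 365


years = years_to_match_hardware(6599, cloud_usd_per_day=2.0)
print(f"{years:.1f} years of cloud usage to equal the purchase price")
```

Plug your own volume and price into `daily_cloud_cost` and pass the result to `years_to_match_hardware` to get a figure for your workload.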
The Mac Studio is an excellent machine. The case for buying one specifically because you want to run LLMs locally is much weaker than the YouTube benchmarks suggest. For most developers, $20-50/month on OpenRouter paired with hardware you already own beats a fresh purchase on every measurable axis except sovereignty.
Originally published at pickuma.com.