DEV Community

pickuma
pickuma

Posted on • Originally published at pickuma.com

Mac Mini as AI Agent Infrastructure: Why Apple Silicon Powers Local LLM Inference

The Mac Mini is sitting on desks in a growing number of engineering teams not as a workstation, but as a dedicated inference node — a small, silent box that answers API calls from coding agents, internal chatbots, and document-processing pipelines. The reason is mostly architectural, and worth understanding before you spend money on hardware you will later regret.

Why Unified Memory Changes the Arithmetic

LLM inference is a memory-bandwidth-bound workload. At each token generation step, the model weights — which can be 4–40 GB depending on model size and quantization — get read from memory into compute units. The faster that data transfer happens, the faster tokens appear. This is why GPU memory bandwidth is the metric that actually predicts inference throughput, not raw FLOPS.

Apple Silicon's unified memory architecture means the CPU, GPU, and Neural Engine all share a single physical memory pool behind a single high-bandwidth interconnect. There is no PCIe bus between system RAM and GPU VRAM, because there is no separate VRAM. The M4 delivers 120 GB/s of memory bandwidth; the M4 Pro steps that up to 273 GB/s. Contrast that with a desktop CPU, which typically reads from DRAM at 50–80 GB/s, and the key point becomes clear: when you run a model on an M-series chip, the GPU accesses model weights at roughly the same bandwidth as a mid-range discrete GPU — and it does so from memory that is also available to the CPU and system, with no copy overhead.

This matters practically. A Mac Mini M4 Pro with 48 GB of unified memory can hold a 32B parameter model quantized to 4-bit (roughly 18–20 GB) in memory that the inference engine uses directly. On a conventional PC, 48 GB of GPU VRAM costs significantly more than the entire Mac Mini system, because high-capacity VRAM is found only on professional cards like the NVIDIA A6000 or H100 NVL.

What You Actually Get Per Dollar

The M4 Mac Mini starts at $599 (16 GB) and goes to around $2,199 for the M4 Pro with 64 GB. Here is how that translates to real inference throughput, based on benchmarks collected from multiple community evaluations with Ollama and the MLX backend:

Config Price Model Tok/s (decode)
M4 / 16 GB $599 Llama 3.1 8B ~21
M4 Pro / 24 GB $1,399 DeepSeek-R1 14B ~18
M4 Pro / 48 GB $1,799 Qwen 2.5 32B ~12
M4 Pro / 64 GB $2,199 Llama 3.1 70B (Q3) ~6–8

For comparison, an RTX 4090 desktop running the same 8B model at 4-bit delivers roughly 75 tokens/second — about 3.5x faster. A cloud H100 at 8-bit can produce several thousand tokens/second at full batch utilization. The Mac Mini is not competing at those speeds.

What it does compete on is cost per watt and cost per idle hour. The M4 Mac Mini Pro draws about 30–40 W under full inference load; at idle it drops to roughly 5–12 W. An RTX 4090 desktop pulls 350–450 W under load. Running either system 8 hours a day at $0.15/kWh, the Mac Mini's annual electricity cost is around $14–16; the 4090 desktop runs closer to $160. The breakeven against cloud GPU rental is also fast: an on-demand RTX 4090 instance costs roughly $0.29/hour from budget providers, which means a $599 Mac Mini pays for itself in raw rental equivalence in about 85 days of 8-hour use — before factoring in any latency, data-egress, or privacy considerations.

The 16 GB M4 is the version most likely to disappoint you. After macOS takes roughly 4 GB and a running Ollama process reserves headroom, you have about 12 GB available for model weights. That is tight for a 7B model in full Q4 quantization, and it leaves no room to run the host OS normally alongside inference. If you are buying for anything beyond personal experimentation, the M4 Pro with 48 GB is the minimum config that gives you genuine flexibility: it holds a 32B model comfortably and leaves enough headroom to run other processes without thrashing.

Setting Up the Inference Stack

The two tools most teams reach for are Ollama and the MLX framework, and as of March 2026 they are converging: Ollama 0.19 shipped an MLX backend (currently in preview) that replaces the previous Metal backend for Apple Silicon. The performance improvement is substantial — benchmarks on M5 Max hardware showed decode throughput jump from 58 to 134 tokens/second for the same model with no configuration change beyond setting OLLAMA_USE_MLX=1 before starting ollama serve. The MLX backend currently requires 32 GB or more of unified memory; the 16 GB models stay on the older Metal path.

Once Ollama is running, it exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. Any tool that speaks the OpenAI API format — Claude Code, Cursor, LangChain, custom agent loops — can point at your Mac Mini without modification beyond swapping the base URL.

For networking the box as a shared inference node rather than a personal machine, the practical approach is a mesh VPN like Tailscale. You assign the Mac Mini a stable Tailscale address, configure Ollama to listen on 0.0.0.0 rather than localhost, and any device on your Tailnet — including CI runners and remote dev machines — can reach the inference endpoint over an encrypted tunnel without exposing it to the public internet.

Memory Planning and Model Selection

The rule of thumb for 4-bit quantized models is roughly 0.6 GB per billion parameters, so a 7B model needs about 4–5 GB and a 70B model needs about 40–42 GB. You should subtract 4–6 GB from whatever your unified memory total is for macOS and system processes. With that math:

  • 24 GB systems: Comfortable for 13B models; can run 14B models without much headroom.
  • 48 GB systems: Solid for 32B–34B models at Q4, or 70B models at very aggressive Q2/Q3 quantization (noticeably degraded quality at Q2).
  • 64 GB systems: Fits Llama 3.1 70B at Q4 with minimal headroom; Qwen 72B workable at Q3.

If you need multiple models resident simultaneously — for example, a coding model and a general-purpose model that swap based on request type — factor each into the budget separately.

Honest Limits

The Mac Mini is not appropriate for every inference workload, and the gap with datacenter hardware is real.

Throughput under concurrent load is the main problem. A single Mac Mini M4 Pro serving multiple users simultaneously will queue requests rather than batch them efficiently. vLLM's continuous batching, which allows an H100 to process dozens of concurrent requests with near-linear throughput scaling, has no equivalent on Apple Silicon today. If you are building a product that needs to serve more than one or two simultaneous users at acceptable latency, you will hit this ceiling quickly.

Training and fine-tuning are not its job. Apple's MPS (Metal Performance Shaders) backend in PyTorch remains incomplete for gradient-based training workloads. Key CUDA libraries — flash attention, bitsandbytes, and several others — have no MPS equivalents. If you are doing any fine-tuning, keep a cloud GPU instance for that job.

The CUDA ecosystem gap is real. Many research-grade inference optimizations are written against CUDA and NCCL. If your stack depends on libraries that have not published MPS or MLX backends, you will either wait for the port or work around it.

Model quality is bounded by what fits. The largest practical model on a 64 GB Mac Mini is roughly 70B parameters at degraded quantization. The leading frontier models — GPT-4-class and above — are not available as open weights in that parameter range at the time of writing. For tasks that genuinely require that capability tier, the Mac Mini is not a substitute for the API.

Where it does work well: a solo developer or small team with privacy requirements, a home office that wants persistent local inference without a monthly bill, a CI pipeline that runs evals against a stable local model without network latency variance, or an AI agent loop that needs low-latency iteration on a 7B–14B model. In those scenarios, the combination of low power draw, zero egress cost, and the OpenAI-compatible API makes it a sensible infrastructure choice — not because it outruns cloud GPUs, but because the tradeoffs favor it for that specific workload profile.


Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.

Top comments (0)