Sizing a Mac mini M4 for Local AI: An Architect's Breakdown by Task

#ai #llm #machinelearning #performance

Every few weeks someone asks me the same question: "Should I buy a Mac mini M4 to run AI locally?" And every time, my answer is the same: that's the wrong question to lead with. The right question is: which task, at what quality, on how much memory? Hardware is the last decision, not the first.

I've been chasing the same goal a lot of practitioners have: becoming self-sufficient on local AI so I'm less dependent on cloud LLM subscriptions, without sacrificing output quality. My current Windows machine has no usable GPU, which makes tools like Ollama and LM Studio frustrating at best. The Mac mini M4 is an obvious candidate. But "is it good?" is meaningless until you define what you're asking it to do. So let's do this the way we'd plan any piece of infrastructure: start from the workload and work backward to the spec.

The One Constraint That Governs Everything: Unified Memory

On Apple Silicon, the instinct from the PC world ("I need a bigger GPU") leads you astray. The Mac mini M4 doesn't have a discrete GPU with its own VRAM. It has unified memory, a single pool shared by the CPU and GPU. For local inference, this is actually a strength: there's no copying model weights across a PCIe bus, and the model can draw on most of that pool rather than a small, fixed slice of VRAM.

The catch is the part people underestimate. Your maximum usable model size is, to a first approximation, a function of how much unified memory you have. A quantized model's weights, its context window, and the OS overhead all have to fit in that one pool. And on a Mac mini, you cannot upgrade the memory after purchase, because it's part of the chip package. So the single most important architectural decision happens at the configurator screen, before the box ever ships.

That reframes the whole buying decision. The CPU tier and core counts matter far less than the memory you select. Spend there.
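
The fit constraint described above is back-of-envelope arithmetic, and it can be sketched in a few lines. This is an illustration, not a measurement: the function names are hypothetical, and the context and OS allowances are placeholder assumptions to replace with numbers observed on your own machine. Real quantized model files also tend to run somewhat larger than the raw bits-per-weight estimate because of format overhead.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def total_needed_gb(params_billions, bits_per_weight, context_gb, os_gb):
    """Weights + context window + OS overhead: everything shares one pool."""
    return weights_gb(params_billions, bits_per_weight) + context_gb + os_gb


# Placeholder allowances (assumptions, not measurements).
CONTEXT_GB = 1.0
OS_GB = 4.0
MACHINE_GB = 16

for params in (8, 32):
    needed = total_needed_gb(params, 4, CONTEXT_GB, OS_GB)
    verdict = "fits" if needed <= MACHINE_GB else "does not fit"
    print(f"{params}B at 4-bit: ~{needed:.1f} GB needed, {verdict} in {MACHINE_GB} GB")
```

The check deliberately leaves no safety margin; it only shows how the three terms add up against a single pool.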

Mapping Tasks to Memory Tiers

Let's break the workloads into tiers, because the memory requirement scales dramatically with task complexity.

Tier 1: Q&A and chat. Running a 7-8B parameter model (think Llama or Qwen at 4-bit quantization) for conversational Q&A, summarization, or general assistant work is comfortable on 16GB of unified memory. This is the base Mac mini M4's sweet spot. If your goal is to learn the tooling, run a personal assistant, or do light text work offline, the base model is genuinely enough. Don't over-buy for this.

Tier 2: Document processing and RAG. This is where memory pressure jumps, because you're no longer running one thing. A retrieval-augmented setup runs an embedding model, a mid-size generation model, and a vector store concurrently. They all compete for the same unified pool. I'd configure 24-32GB here so the model and the index aren't evicting each other. This is the tier most enterprise practitioners actually need, and it's the one most often under-specced.

Tier 3: Local coding assistants. Useful local coding help means 14B to 32B class models. Plan for 32-64GB, and note that the standard M4 chip tops out at 32GB as currently sold, so anything above that means stepping up to the M4 Pro configuration. Below 32GB, you're forced into aggressive quantization, which costs you code quality, and your tokens-per-second drops to the point where the assistant is something you demo rather than something you actually work inside all day.
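
One way to sanity-check these tiers is to list what each workload keeps resident at the same time and pick the smallest configuration that holds it with headroom. Everything in this sketch is an assumption for illustration: the component sizes are rough guesses, the 75% usable fraction is an arbitrary safety margin, and the list of memory options (including 48GB) should be checked against Apple's current configurator.

```python
# Memory options in GB. Assumed for illustration; verify before buying.
MEMORY_OPTIONS_GB = (16, 24, 32, 48, 64)

# Rough, hypothetical resident sizes in GB for each workload.
WORKLOADS = {
    "chat": {"8B model @ 4-bit": 4.0, "context": 1.0, "OS and apps": 4.0},
    "rag": {
        "embedding model": 0.5,
        "14B model @ 4-bit": 7.0,
        "vector store": 2.0,
        "context": 2.0,
        "OS and apps": 4.0,
    },
    "coding": {"32B model @ 4-bit": 16.0, "context": 4.0, "OS, IDE, browser": 6.0},
}


def smallest_fit(components, usable_fraction=0.75):
    """Return the smallest option whose usable share covers all components."""
    needed = sum(components.values())
    for option in MEMORY_OPTIONS_GB:
        if needed <= option * usable_fraction:
            return option
    return None  # nothing in the list is large enough


for name, components in WORKLOADS.items():
    total = sum(components.values())
    print(f"{name}: ~{total:.1f} GB resident -> {smallest_fit(components)} GB")
```

With these guesses the three workloads land on 16, 24, and 48GB, which is consistent with the tiers above. Change the inputs and the answer moves, which is the point of sizing from the workload.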

What a Local Setup Actually Requires

Hardware is only one layer. A working local AI stack has a few components worth naming explicitly, because each is a decision:

A runtime to serve the model - Ollama or LM Studio are the common choices, and both run cleanly on Apple Silicon.
The model itself, at an appropriate quantization. 4-bit (Q4) is the usual quality/size compromise; more aggressive quantization (3-bit and below) saves memory at a real quality cost.
For RAG, an embedding model plus a vector store (Chroma, LanceDB, or similar) and an orchestration layer.
Headroom. Never size to 100% of memory - the OS and context window need room, and a 32K-token context isn't free.
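
To see why a long context isn't free, here is a rough estimate of the key/value cache a transformer keeps for every token of context. The formula is the usual one for models with grouped-query attention. The shape values are assumed to match the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128), and the 16-bit cache precision is also an assumption. Runtimes can quantize or otherwise shrink the cache, so read this as an upper-end sketch.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes held by the KV cache: one key and one value vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Assumed shape for an 8B-class model with grouped-query attention.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

for context in (4096, 32768):
    gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, context) / 2**30
    print(f"{context:>6} tokens of context: ~{gib:.1f} GiB of KV cache")
```

Under these assumptions a full 32K context costs about 4 GiB, which is comparable to the 4-bit weights of the 8B model itself.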

Here's a minimal example of standing up a local model with Ollama, the kind of thing you'd run on day one:

```bash
# Pull a quantized 8B model (assumes Ollama is already installed)
ollama pull llama3.1:8b

# Run it interactively
ollama run llama3.1:8b

# Or call it as a local API for your app
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize the attached design doc in 5 bullets.",
  "stream": false
}'
```
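
The same API call can be made from application code, and the response can double as a quick throughput check. The sketch below uses only the Python standard library. The helper names are hypothetical, and the response fields it reads (`response`, `eval_count`, and `eval_duration` in nanoseconds) are taken from Ollama's API documentation as I understand it, so verify them against the version you have installed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def tokens_per_second(result: dict) -> float:
    """Generation speed, assuming eval_duration is reported in nanoseconds."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


# Usage, with the Ollama server running and the model already pulled:
#   result = generate("llama3.1:8b", "Explain unified memory in two sentences.")
#   print(result["response"])
#   print(f"{tokens_per_second(result):.1f} tokens/s")
```

Measuring tokens per second this way, on your own prompts, is a more useful sizing signal than any published benchmark.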

So, Should You Buy One?

For local AI development, the Mac mini M4 is a genuinely strong choice: it's silent, sips power compared to a GPU tower, and the unified memory architecture is well suited to inference. The honest nuance is in the configuration. The base 16GB unit is an excellent, affordable learning and chat rig. But if your real work is document processing, RAG, or local coding, treat the base model as a starting point and configure the memory up. That's where your budget delivers the most return.

The Windows-with-no-GPU situation many of us are in is exactly the gap the mini fills well, not because it's the most powerful machine, but because it makes the whole local inference experience frictionless at a low running cost.

Three Key Takeaways

Size from the workload, not the spec sheet. Q&A wants 16GB, RAG wants 24-32GB, local coding wants 32-64GB. Decide what you're running before you decide what you're buying.
Unified memory is the ceiling, and it's permanent. You can't upgrade it later, so buy for what you'll run in 18 months, not what you're testing this week.
Spend on RAM, not the CPU tier. On Apple Silicon, memory is the spec that unlocks bigger models; the rest is secondary for inference workloads.

If you've been running local models on a base Mac mini, I'd genuinely like to know where it stopped being enough; that boundary is the most useful data point for anyone sizing their first machine.

Top comments (2)

Alex Shev • Jun 27

Starting from task shape is the right way to size local AI hardware. People often ask whether a machine can run AI, but the real variables are model size, context length, concurrency, latency tolerance, and whether the job is chat, coding, embeddings, image, or batch inference.
