Jovan Chan

Posted on • Originally published at runaihome.com

Running 100B+ Parameter Models on Mac Studio: What Actually Works (2026)

This article was originally published on runaihome.com

Most guides on running large models locally gloss over the part that actually determines whether it works: memory capacity, memory bandwidth, and the brutal math of what happens when a dense 100B-parameter model has to fit into a single machine. This guide covers all three, with specific numbers for the Mac Studio hardware that exists today — and a frank assessment of what Apple's 2026 memory supply crisis means for anyone planning to run these models.

The short version: If you already own a Mac Studio M3 Ultra with 192GB or 256GB of unified memory, you can run Llama 3.1 405B locally, at roughly 2.5–5 tokens per second depending on quantization. If you own the discontinued 512GB config, DeepSeek-V3/R1 runs at 20+ tok/s. If you're buying new today, you cannot buy a Mac Studio capable of running any 100B+ model. All high-memory configurations were pulled from Apple's store by May 2026.

That context matters before you go further.

The Hardware That Disappeared

The Mac Studio with M3 Ultra launched in March 2025 with five memory configurations: 96GB, 128GB, 192GB, 256GB, and 512GB. The M3 Ultra's 819 GB/s memory bandwidth and unified memory architecture made it, briefly, the most practical consumer system for running the largest open-weight models.

Then the DRAM shortage arrived. In March 2026, Apple removed the 512GB upgrade option and raised the price of the 256GB option by $400. By May 2026, Apple removed the 192GB and 256GB options as well, citing ongoing memory supply constraints and AI hardware demand. The only Mac Studio M3 Ultra you can buy new today ships with 96GB of unified memory and costs $3,999.

The Mac Studio M4 Max is also available, with up to 128GB and 546 GB/s of memory bandwidth. Neither of these configurations can run any 100B+ model.

If you're buying new, the path to 100B+ models requires either waiting for the M5 Ultra (expected WWDC 2026 or fall 2026, rumored max 256GB) or finding a used M3 Ultra with 192GB or 256GB on the secondary market.

Dense vs. MoE: Two Very Different 100B+ Experiences

"100B+ parameter model" covers two fundamentally different architectures, and they perform very differently at inference time.

Dense models (Llama 3.1 405B) activate every parameter on every forward pass. At inference, the GPU must stream all ~242GB of Q4-quantized weights through memory for each output token. Memory bandwidth is the hard ceiling: 819 GB/s divided by ~242GB of weights allows at most about 3.4 tok/s at Q4, before any overhead. At 2–5 tok/s across quantizations, these models are below comfortable interactive reading speed.

Mixture-of-Experts models (DeepSeek-V3 and DeepSeek-R1, both 671B total parameters with only 37B activated per token) activate only a small subset of expert layers per forward pass. Despite storing 671B parameters in memory, the GPU only streams roughly 37B worth of weights per output token. This is why DeepSeek-V3 on a 512GB M3 Ultra runs at over 20 tok/s with MLX-LM, while Llama 3.1 405B on the same hardware tops out around 5 tok/s.

Both are "100B+ parameter models." Only one is practically interactive.
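
The gap between the two architectures can be sketched with back-of-the-envelope arithmetic. The snippet below is a rough estimate, not a benchmark: it assumes decode speed is purely bandwidth-bound, and it assumes DeepSeek's active weights take the same bytes per parameter as the Llama Q4_K_M file, which is only an approximation.

```python
# Rough decode ceiling: each output token streams the active weights through
# memory once, so tok/s <= bandwidth / active weight size.

M3_ULTRA_BANDWIDTH_GBPS = 819.0  # GB/s, figure quoted in this article
LLAMA_405B_Q4_GB = 242.0         # approximate Q4_K_M size quoted in this article


def decode_ceiling_tok_s(active_weights_gb: float, bandwidth_gbps: float) -> float:
    """Upper bound on tokens per second for a bandwidth-bound model."""
    return bandwidth_gbps / active_weights_gb


# Dense: all 405B parameters are read for every token.
dense_ceiling = decode_ceiling_tok_s(LLAMA_405B_Q4_GB, M3_ULTRA_BANDWIDTH_GBPS)

# MoE: only ~37B parameters are active per token. Assumption: same
# bytes-per-parameter as the Llama Q4_K_M file above.
gb_per_billion_params = LLAMA_405B_Q4_GB / 405.0
moe_active_gb = 37.0 * gb_per_billion_params
moe_ceiling = decode_ceiling_tok_s(moe_active_gb, M3_ULTRA_BANDWIDTH_GBPS)

print(f"Llama 3.1 405B (dense, Q4) ceiling: {dense_ceiling:.1f} tok/s")
print(f"DeepSeek-V3 (37B active, Q4) ceiling: {moe_ceiling:.1f} tok/s")
```

Under these assumptions the ceilings come out near 3.4 tok/s for the dense model and about 37 tok/s for the MoE model. Real throughput lands below both, since attention, KV cache reads, and kernel overhead all take their share.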

Memory Requirements by Model and Quantization

The table below shows approximate GGUF sizes for the key 100B+ open-weight models at each quantization level. Actual RAM required adds KV cache on top — budget 5–20GB depending on context length.

| Model | Q2_K | Q3_K_M | Q4_K_M | Q8_0 |
| --- | --- | --- | --- | --- |
| Llama 3.1 405B | ~150 GB | ~195 GB | ~242 GB | ~430 GB |
| DeepSeek-V3/R1 671B | ~180 GB | ~280 GB | ~370–405 GB | ~670 GB |

(Sizes estimated from verified 70B GGUF sizes scaled proportionally, confirmed against Hugging Face GGUF repository and community data.)
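
To see where a 5–20GB KV cache budget comes from, here is a small sketch for Llama 3.1 405B. The architecture values (126 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published model configuration rather than from this article, and the cache is assumed to be stored in 16-bit precision with no cache quantization.

```python
def kv_cache_gb(
    context_tokens: int,
    n_layers: int = 126,       # assumed: Llama 3.1 405B layer count
    n_kv_heads: int = 8,       # assumed: grouped-query attention KV heads
    head_dim: int = 128,       # assumed: 16384 hidden size / 128 attention heads
    bytes_per_value: int = 2,  # assumed: fp16/bf16 cache
) -> float:
    """Approximate KV cache size in GB (decimal) for a given context length."""
    # The leading 2 covers one key and one value vector per layer per KV head.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1e9


for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} tokens: ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```

With these assumptions the cache costs about half a megabyte per token, so 32K tokens needs roughly 17GB. Other models will differ, so treat the defaults as placeholders to replace with the values from the model's own config.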

What Runs on Which Hardware

| Mac Studio config | Available new (May 2026) | Llama 405B | DeepSeek V3/R1 671B |
| --- | --- | --- | --- |
| M4 Max 128GB | ✅ Available | ❌ Won't fit any quant | ❌ Won't fit |
| M3 Ultra 96GB | ✅ Available | ❌ Won't fit | ❌ Won't fit |
| M3 Ultra 192GB | ❌ Discontinued | ✅ Q2_K (Q3_K_M at ~195 GB exceeds capacity) | ❌ Won't fit |
| M3 Ultra 256GB | ❌ Discontinued | ✅ Q4_K_M (tight) | ❌ Q4 won't fit |
| M3 Ultra 512GB | ❌ Discontinued | ✅ Any quant | ✅ Q4_K_M |

The 256GB M3 Ultra running Llama 405B at Q4_K_M has about 14GB headroom for KV cache — enough for short-to-medium contexts but limiting at 32K+. If you have a 256GB machine, Q3_K_M is the safer pick: ~195GB for weights leaves ~61GB for cache and system overhead, giving you comfortable room at long contexts.
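
The headroom arithmetic is easy to script for your own configuration. This is a minimal sketch using the approximate weight sizes from the table in the previous section. The `reserve_gb` default of 10 is a hypothetical cushion for KV cache and system overhead, not a measured requirement, and the upper end of the DeepSeek Q4_K_M range is used to stay conservative.

```python
# Approximate weight sizes in GB, copied from the size table in this article.
WEIGHTS_GB = {
    "Llama 3.1 405B": {"Q2_K": 150, "Q3_K_M": 195, "Q4_K_M": 242, "Q8_0": 430},
    "DeepSeek-V3/R1 671B": {"Q2_K": 180, "Q3_K_M": 280, "Q4_K_M": 405, "Q8_0": 670},
}


def headroom_gb(memory_gb: int, model: str, quant: str) -> int:
    """Unified memory left over after loading the weights."""
    return memory_gb - WEIGHTS_GB[model][quant]


def fits(memory_gb: int, model: str, quant: str, reserve_gb: int = 10) -> bool:
    """True if the weights fit with at least reserve_gb to spare (hypothetical cushion)."""
    return headroom_gb(memory_gb, model, quant) >= reserve_gb


for quant in ("Q3_K_M", "Q4_K_M"):
    spare = headroom_gb(256, "Llama 3.1 405B", quant)
    print(f"256GB, Llama 3.1 405B {quant}: {spare} GB headroom")
```

For the 256GB machine this prints 61 GB of headroom at Q3_K_M and 14 GB at Q4_K_M, matching the figures above. Raise `reserve_gb` if you plan to run long contexts.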

The Real Benchmark Numbers

These numbers come from systematic MLX inference benchmarks on M3 Ultra, published by the MLX team in early 2026. Testing used a 512GB M3 Ultra, but since performance is bandwidth-limited (not capacity-limited), the numbers apply equally to a 192GB or 256GB system running compatible quantizations.

Llama 3.1 405B on M3 Ultra (MLX-LM):

| Quantization | 1K context | 4K context | 16K context | 32K context |
| --- | --- | --- | --- | --- |
| Q2_K | 5.1 tok/s | 4.9 tok/s | 4.4 tok/s | OOM on 512GB |
| Q3_K_M | 3.6 tok/s | 3.6 tok/s | 3.3 tok/s | 3.0 tok/s |
| Q4_K_M | 2.9 tok/s | 2.9 tok/s | 2.7 tok/s | 2.5 tok/s |

The MLX team's own conclusion in that thread: "dense models >100B are impractical for interactive use." Five tokens per second is roughly a third of the rate at which a reader can comfortably follow streaming output. Three tok/s is closer to what you'd get on a well-specced laptop running a 7B model.

DeepSeek-V3 671B on M3 Ultra 512GB (MLX-LM, Q4_K_M):

Over 20 tok/s at short context lengths, as independently verified by Apple researcher Awni Hannun in March 2025. The MoE architecture's sparse activation makes the 819 GB/s bandwidth dramatically more effective. Prefill (prompt ingestion) remains a bottleneck — community reports for llama.cpp Metal on this hardware describe very long prefill times for multi-thousand token prompts, which is the main practical limitation for long-document workflows. MLX's native Metal kernels meaningfully reduce prefill time versus llama.cpp.

Reference comparison — Llama 3.3 70B on M3 Ultra 96GB (MLX):

Approximately 25–30 tok/s at short context (the M3 Ultra's 819 GB/s bandwidth outpaces the M4 Max's 546 GB/s for models of this size). This is the model that actually makes the M3 Ultra shine: well above interactive threshold, comfortably in working memory, and available on the $3,999 base config.

How to Actually Run These on Apple Silicon

For Mac, the right inference stack is MLX-LM, not Ollama or plain llama.cpp. Ollama uses llama.cpp internally, and on Apple Silicon that leaves 30–50% performance on the table compared to MLX's Metal-native implementation. For a machine that costs $4,000+, the difference matters.

Install MLX-LM (requires macOS 14+, Python 3.10+):

python3 -m venv mlx_env
source mlx_env/bin/activate
pip install mlx-lm
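
Once the install finishes, a quick smoke test from Python can confirm that the stack works and give you a rough throughput number. The sketch below assumes the `load` and `generate` functions exposed by the `mlx_lm` package. The model ID is passed on the command line rather than hard-coded, because the right repository depends on your memory budget; the prompt and the 128-token limit are arbitrary choices.

```python
import sys
import time


def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """End-to-end throughput; returns 0.0 if no time elapsed."""
    return n_tokens / seconds if seconds > 0 else 0.0


def smoke_test(model_id: str, prompt: str = "Explain unified memory in one paragraph.") -> None:
    # Imported lazily so the helper above works on machines without MLX.
    from mlx_lm import load, generate

    model, tokenizer = load(model_id)  # downloads the weights on first use
    start = time.perf_counter()
    text = generate(model, tokenizer, prompt=prompt, max_tokens=128)
    elapsed = time.perf_counter() - start

    # Re-tokenize the output to get an approximate token count.
    n_tokens = len(tokenizer.encode(text))
    print(text)
    print(f"~{tokens_per_second(n_tokens, elapsed):.1f} tok/s (includes prefill)")


if __name__ == "__main__" and len(sys.argv) > 1:
    smoke_test(sys.argv[1])
```

Saved under a name of your choosing (say, `smoke_test.py`), it runs as `python smoke_test.py <model-id>`. Starting with a small model keeps the first download short. The timing includes prompt processing, so it will read slightly lower than a pure decode benchmark.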

Run Llama 3.1 405B (256GB+ M3 Ultra, Q3_K_M; the ~195GB of weights does not fit in 192GB):


# Find the exact model ID at https://huggingface.co/mlx-community
m