Alan West

Posted on Apr 4

How to Get Gemma 4 26B Running on a Mac Mini with Ollama

#ollama #llm #machinelearning #apple

So you picked up a Mac mini with the idea of running local LLMs, pulled Gemma 4 26B through Ollama, and... it either crawls at 2 tokens per second or just refuses to load. I've been there. Let me walk you through what's actually going on and how to fix it.

The Problem: "Why Is This So Slow?"

The Mac mini with Apple Silicon is genuinely great hardware for local inference. Unified memory means the GPU can access the same RAM pool as the CPU, with no separate VRAM needed. But out of the box, macOS caps how much of that memory the GPU may wire down, and the default cap can be too low for a 26B parameter model. On top of that, Ollama's defaults aren't tuned for your specific hardware.

The result? The model either fails to load, gets killed when macOS runs out of memory, or runs painfully slowly because half the layers are falling back to CPU inference.

Step 0: Check Your Hardware

Before anything else, verify what you're working with:

# Check your chip and memory
sysctl -n machdep.cpu.brand_string
sysctl -n hw.memsize | awk '{print $1/1024/1024/1024 " GB"}'

# Check the GPU wired-memory limit (0 means macOS is using its built-in default)
sysctl iogpu.wired_limit_mb

For Gemma 4 26B, you realistically need a Mac mini with at least 32GB of unified memory. The 16GB models won't cut it — even with aggressive quantization, you'll be swapping constantly. If you've got 24GB, it's possible with a Q4 quantization but it'll be tight.
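If the GPU limit reported above turns out to be the bottleneck, it can usually be raised by writing to the same sysctl key. The sketch below only works out a candidate value and prints the command rather than running it. The helper name and the 8GB reserve for macOS are my own assumptions, and whether your macOS version accepts writes to this key is worth verifying first. A value set this way generally does not survive a reboot.

```shell
# Hypothetical helper: suggest a GPU wired-memory limit in MB.
# Usage: suggest_wired_limit_mb <total_ram_gb> <reserve_for_os_gb>
suggest_wired_limit_mb() {
  local total_gb=$1 reserve_gb=$2
  echo $(( (total_gb - reserve_gb) * 1024 ))
}

# Example: a 32GB machine, keeping 8GB back for macOS and apps (assumed reserve).
limit_mb=$(suggest_wired_limit_mb 32 8)

# Print the command instead of executing it, so nothing changes by accident.
echo "sudo sysctl iogpu.wired_limit_mb=${limit_mb}"
```

Leaving too small a reserve can make the rest of the system unresponsive, so start conservative and watch memory pressure.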

Step 1: Install Ollama Properly

If you haven't already:

# Install via the official method
curl -fsSL https://ollama.com/install.sh | sh

# Verify it's running
ollama --version
ollama list

A common mistake is running an outdated build, such as a stale Homebrew install, that doesn't have the latest Metal optimizations. Grab Ollama directly from ollama.com, or use brew install ollama and keep it upgraded so you're on the latest release. Ollama's Metal backend has improved significantly in recent versions, so this actually matters.

Step 2: Pull the Right Model Variant

This is where people trip up. Don't just run ollama pull gemma4:26b without thinking about quantization.

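# Inspect the Modelfile of a model you have already pulled
# (available tags are listed on the model's page at ollama.com, not by this command)
ollama show gemma4:26b --modelfile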

# For 32GB Mac mini, the Q4_K_M quantization is the sweet spot
ollama run gemma4:26b-q4_K_M

# For 48GB+ machines, Q8_0 becomes practical (FP16 needs 64GB; see the table below)
ollama run gemma4:26b-q8_0

Here's the rough math on memory requirements:

Q4_K_M (~15GB): Fits comfortably on 32GB with room for context
Q6_K (~20GB): Needs 32GB, tight on context window
Q8_0 (~27GB): Needs 48GB+ realistically
FP16 (~52GB): You need the 64GB Mac mini or a Mac Studio
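The figures in this list follow from simple arithmetic: parameter count times bits per weight, divided by 8. Here is a minimal sketch of that calculation. The bits-per-weight values (4.85 for Q4_K_M, 8.5 for Q8_0) are approximate numbers commonly quoted for llama.cpp quantizations and are assumptions on my part, not measurements of this model. The estimate covers weights only, not the context cache.

```shell
# Rough weight-size estimate in GB: parameters (billions) x bits per weight / 8.
# Usage: estimate_gb <params_in_billions> <bits_per_weight>
estimate_gb() {
  awk -v params_b="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", params_b * bpw / 8 }'
}

# Assumed bits-per-weight values; adjust if your quantization reports differently.
echo "Q4_K_M: $(estimate_gb 26 4.85) GB"
echo "Q8_0:   $(estimate_gb 26 8.5) GB"
echo "FP16:   $(estimate_gb 26 16) GB"
```

The results land close to the rough sizes above, which is a quick way to sanity-check any other quantization before pulling it.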

Q4_K_M is genuinely good. I've compared outputs side-by-side with Q8 and the quality difference is minimal for most coding and writing tasks. Don't let quantization snobbery push you into configs your hardware can't handle.

Step 3: The Environment Variables That Actually Matter

This is the part most guides skip. Ollama respects several environment variables that dramatically affect performance on Apple Silicon.

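# Ask for as many layers as possible on the GPU (for 26B on 32GB, start here)
export OLLAMA_NUM_GPU=99

# Keep the model loaded longer (default is 5 minutes)
# Set to -1 for "keep loaded forever"; 0 unloads right after each response
export OLLAMA_KEEP_ALIVE=-1

# If you're running this as a server, bind to all interfaces
# (this exposes the unauthenticated API to your whole network)
export OLLAMA_HOST=0.0.0.0:11434

# Control parallel request handling
export OLLAMA_NUM_PARALLEL=1

# The exports above only reach an `ollama serve` started from this shell.
# For the menu bar app, set them with launchctl, then quit and reopen Ollama
launchctl setenv OLLAMA_NUM_GPU 99
launchctl setenv OLLAMA_KEEP_ALIVE "-1"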

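OLLAMA_NUM_GPU=99 is meant to tell Ollama to offload as many layers as possible to the GPU. On a Mac with unified memory, this is almost always what you want. Setting it lower forces layers onto the CPU, which tanks your token generation speed. Ollama also accepts num_gpu as a model parameter (for example, PARAMETER num_gpu 99 in a Modelfile), so if the environment variable has no effect on your version, set the parameter instead.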

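The OLLAMA_KEEP_ALIVE=-1 trick is important if you're using this as a dev server. Without it, Ollama unloads the model after 5 minutes of inactivity, and reloading a 26B model takes 15-30 seconds. Be careful with the value: a negative number keeps the model loaded indefinitely, while 0 does the opposite and unloads it immediately after each response.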
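Keep-alive can also be set per request through the API's keep_alive field, which is handy for preloading a model without changing the server's environment. A minimal sketch follows, assuming the default server address localhost:11434 and the model tag used in this article. The two function names are hypothetical helpers, not part of Ollama.

```shell
# Build the JSON body for a preload request (no prompt, only a keep_alive value).
# A negative keep_alive asks Ollama to keep the model resident; 0 unloads it.
build_preload_payload() {
  local model=$1 keep_alive=$2
  printf '{"model": "%s", "keep_alive": %s}\n' "$model" "$keep_alive"
}

# Send the payload to a local Ollama server (assumes the default address).
preload_model() {
  curl -s http://localhost:11434/api/generate -d "$(build_preload_payload "$1" "$2")"
}

# Show the payload; the actual request needs a running server:
#   preload_model gemma4:26b-q4_K_M -1
build_preload_payload gemma4:26b-q4_K_M -1
```

Sending the same request with keep_alive set to 0 is a quick way to free the memory again when you're done.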

Making It Persistent

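How you make these stick depends on how Ollama runs. For the menu bar app, launchctl setenv values are picked up when the app restarts, but they are lost at reboot. If Ollama runs as a launchd service (for example, through brew services), put the variables in the service's plist so they survive every restart.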

# Create or edit the Ollama plist to include environment variables
# Location: ~/Library/LaunchAgents/ or /Library/LaunchDaemons/
# Add these keys to the EnvironmentVariables dict:
#   OLLAMA_NUM_GPU = 99
#   OLLAMA_KEEP_ALIVE = -1

# After editing, reload the service
# (the plist filename depends on how Ollama was installed; adjust to match yours)
launchctl unload ~/Library/LaunchAgents/com.ollama.ollama.plist
launchctl load ~/Library/LaunchAgents/com.ollama.ollama.plist

Step 4: Tuning the Context Window

Ollama's default context window is small: 2048 tokens in older releases, higher in recent ones. For a 26B model on 32GB, you can safely push this higher, but going too far will eat into your available memory and cause swapping.

# Create a custom Modelfile with a larger context
cat << 'EOF' > Modelfile
FROM gemma4:26b-q4_K_M
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
EOF

# Build your custom model
ollama create gemma4-custom -f Modelfile

# Run it
ollama run gemma4-custom

On a 32GB machine with Q4_K_M, I've found 8192 to be a good balance. You can push to 16384 on 48GB+. Watch Activity Monitor → Memory Pressure while you test — if it goes yellow, dial back.
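To see why context length eats memory, it helps to estimate the key/value cache, which grows linearly with num_ctx. The sketch below uses the standard formula for plain full attention with a 16-bit cache. The layer count, KV head count, and head dimension are hypothetical placeholders, not Gemma's real architecture, and models that use sliding-window attention or a quantized cache will need less, so treat the result as an upper bound.

```shell
# Upper-bound KV cache size in MiB for plain full attention:
#   2 (keys + values) x layers x kv_heads x head_dim x context x bytes per element
# Usage: kv_cache_mib <layers> <kv_heads> <head_dim> <num_ctx> <bytes_per_element>
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 bytes=$5
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 ))
}

# Hypothetical architecture: 48 layers, 8 KV heads, head dimension 128, 2-byte cache.
for ctx in 2048 8192 16384; do
  echo "num_ctx=${ctx}: $(kv_cache_mib 48 8 128 "$ctx" 2) MiB"
done
```

Doubling the context doubles the cache, which is why a jump from 8192 to 16384 can push a 32GB machine into swapping.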

Step 5: Verify It's Actually Using the GPU

Here's the sanity check most people forget:

# While a model is loaded, check GPU utilization
sudo powermetrics --samplers gpu_power -i 1000 -n 1

# Or watch memory allocation
vm_stat 1

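If powermetrics shows near-zero GPU activity while you're generating tokens, something's wrong: the model is running on CPU. Running ollama ps gives a quicker read, since its PROCESSOR column shows the CPU/GPU split for each loaded model. If it isn't all on the GPU, go back and check your OLLAMA_NUM_GPU setting.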
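For a scriptable version of this check, the output of ollama ps can be filtered for a fully offloaded model. This is a sketch under the assumption that your Ollama version prints "100% GPU" in the PROCESSOR column for a fully offloaded model. The helper name is hypothetical, and the sample listing uses a made-up ID and size purely to demonstrate the function in isolation.

```shell
# Succeeds if the `ollama ps` listing on stdin shows the given model as "100% GPU".
# Usage: ollama ps | fully_on_gpu <model_name>
fully_on_gpu() {
  local model=$1
  grep -F "$model" | grep -q "100% GPU"
}

# Illustrative listing (made-up ID and size), standing in for real `ollama ps` output.
sample='NAME             ID              SIZE     PROCESSOR    UNTIL
gemma4-custom    0123456789ab    17 GB    100% GPU     Forever'

if printf '%s\n' "$sample" | fully_on_gpu gemma4-custom; then
  echo "all layers on GPU"
else
  echo "some layers on CPU"
fi
```

On a real machine, replace the sample with a live pipe: ollama ps | fully_on_gpu gemma4-custom.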

Common Gotchas

"model requires more system memory": You're trying to load a quantization that's too large. Drop down to Q4_K_M or Q4_K_S.
Extremely slow first response: This is normal, since the model is loading into memory. Subsequent responses should be faster. Use OLLAMA_KEEP_ALIVE=-1 to avoid reloading.
Ollama process killed: macOS memory pressure killed it. Close browser tabs, Electron apps (yes, Slack and VS Code count), and try again. Alternatively, use a smaller quantization.
Garbled or poor quality output: You might have pulled a broken quantization. Delete it with ollama rm and pull again. Also check you're using the correct prompt template — Gemma models are particular about their chat format, and Ollama handles this automatically if you use the official model tag.

Performance Expectations

Let's be honest about what to expect. On a Mac mini M4 Pro with 48GB:

Q4_K_M: ~15-20 tokens/sec generation
Q8_0: ~8-12 tokens/sec generation

On a base M4 Mac mini with 32GB:

Q4_K_M: ~10-15 tokens/sec generation

These numbers are for generation (output tokens). Prompt processing is faster. Your mileage will vary based on context length, what else is running, and thermal conditions (the Mac mini does throttle under sustained load).
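To measure your own numbers instead of trusting mine, run the model with ollama run --verbose, which prints an eval rate after each response, or compute the rate from the eval_count and eval_duration fields of an API response (the duration is reported in nanoseconds). A minimal sketch of that calculation, with a hypothetical helper name and made-up example values:

```shell
# tokens/sec = eval_count / eval_duration, with eval_duration in nanoseconds.
# Usage: tokens_per_sec <eval_count> <eval_duration_ns>
tokens_per_sec() {
  awk -v count="$1" -v duration_ns="$2" 'BEGIN { printf "%.1f\n", count * 1e9 / duration_ns }'
}

# Made-up example: 150 tokens generated in 10 seconds.
tokens_per_sec 150 10000000000
```

Measure a few runs at the context length you actually use, since longer prompts and a warm machine both pull the number down.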

The Bottom Line

Running Gemma 4 26B locally on a Mac mini is absolutely viable — you just need to pick the right quantization for your RAM and actually configure Ollama's environment variables. The defaults assume you might be running on a potato, so they're conservative.

The key takeaways:

32GB minimum, 48GB+ recommended
Q4_K_M quantization is the sweet spot for most setups
Set OLLAMA_NUM_GPU=99 to maximize GPU layer offloading
Set OLLAMA_KEEP_ALIVE=-1 if you're using it as a persistent dev server (0 unloads the model after every response)
Use a Modelfile to bump the context window beyond the small default

Once it's dialed in, having a 26B model running locally with zero API costs and full privacy is genuinely useful for day-to-day development work. It's not GPT-4 fast, but it's yours.

DEV Community