Gemma 4 on Apple Silicon: 85 tok/s with a pip install

#gemma4 #applesilicon #mlx #localai

Last week Google released Gemma 4, their most capable open-weight model family. Within hours I had it running locally on my Mac at 85 tokens/second, with tool calling, streaming, and an OpenAI-compatible API that worked with every AI framework I pointed at it.

Here's how, and what the benchmarks actually look like.

Setup: 2 commands

pip install rapid-mlx
rapid-mlx serve gemma-4-26b

That's it. The server downloads the 4-bit MLX-quantized model (~14 GB) and starts an OpenAI-compatible API on http://localhost:8000/v1.
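A quick way to confirm the server is answering is a standard-library client. This is a minimal sketch that assumes the usual OpenAI-style `/chat/completions` route under the base URL above and the `"default"` model name used later in this post; `build_chat_request` and `ask` are illustrative helper names, not part of Rapid-MLX.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # endpoint reported by the server


def build_chat_request(prompt: str, model: str = "default") -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send one prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Say hello in five words."))` should print a short reply. The first call can be slow while the model loads.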

Benchmarks: Gemma 4 26B on M3 Ultra

I benchmarked three engines on the same machine (M3 Ultra, 192GB), same model (Gemma 4 26B-A4B 4-bit), same prompt:

Engine	Decode (tok/s)	TTFT (s)	Notes
Rapid-MLX	85	0.26	MLX-native, prompt cache
mlx-vlm	84	0.31	VLM library (no tool calling)
Ollama	75	0.08	llama.cpp backend

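Rapid-MLX is 13% faster than Ollama on decode. Ollama has faster TTFT (it uses llama.cpp's Metal kernels for prefill), but TTFT is paid once per response while decode speed applies to every token. If you model response time as TTFT plus tokens divided by decode rate, the 0.18 s head start is used up after roughly 115 generated tokens, so for anything longer than a short reply the decode speed is what you feel.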

On smaller models the gap is wider — Rapid-MLX hits 168 tok/s on Qwen3.5-4B vs Ollama's ~70 tok/s (2.4x).
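To put numbers on the TTFT-versus-decode trade-off for Gemma 4, here is a small worked calculation using the figures from the table. It assumes the simplest possible model (total time = TTFT + tokens / decode rate) and ignores prompt length, which would change TTFT in practice.

```python
def response_time(ttft_s: float, decode_tok_s: float, n_tokens: int) -> float:
    """Seconds until the last token arrives under a TTFT + decode model."""
    return ttft_s + n_tokens / decode_tok_s


def break_even_tokens(ttft_a: float, rate_a: float, ttft_b: float, rate_b: float) -> float:
    """Reply length at which engine A (slower start, faster decode) ties engine B."""
    return (ttft_a - ttft_b) / (1 / rate_b - 1 / rate_a)


RAPID_MLX = (0.26, 85.0)  # (TTFT in seconds, decode tok/s) from the table
OLLAMA = (0.08, 75.0)

for n in (50, 500):
    a = response_time(*RAPID_MLX, n)
    b = response_time(*OLLAMA, n)
    print(f"{n} tokens: Rapid-MLX {a:.2f}s, Ollama {b:.2f}s")

print(f"break-even: {break_even_tokens(*RAPID_MLX, *OLLAMA):.0f} tokens")
```

Under these assumptions Ollama finishes a 50-token reply slightly sooner, while a 500-token reply takes about 6.1 s on Rapid-MLX versus about 6.7 s on Ollama.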

Tool Calling That Actually Works

This is where it gets interesting. Most local inference servers either don't support tool calling, or support it for one model family. Rapid-MLX ships 18 built-in tool call parsers covering:

Qwen 3 / 3.5 (hermes format)
Gemma 4 (native <|tool_call> format)
GLM-4.7, MiniMax, GPT-OSS
Llama 3, Mistral, DeepSeek
And more

Tool calling works out of the box — no extra flags needed for supported models:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'

Response:

{
  "choices": [{
    "message": {
      "tool_calls": [{
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"Tokyo\"}"
        }
      }]
    }
  }]
}
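Once you get that response back, the client still has to run the tool and send the result to the model. Here is a minimal sketch of that dispatch step, assuming the standard OpenAI message shape; the `get_weather` body is a stand-in and `run_tool_calls` is an illustrative helper name.

```python
import json


def get_weather(city: str) -> str:
    """Stand-in implementation; a real one would call a weather service."""
    return f"Sunny in {city}"


TOOLS = {"get_weather": get_weather}


def run_tool_calls(message: dict) -> list:
    """Execute each tool call in an assistant message and build the tool replies."""
    replies = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        result = TOOLS[fn["name"]](**args)
        replies.append({
            "role": "tool",
            "tool_call_id": call.get("id", ""),  # empty if the response carries no id
            "content": result,
        })
    return replies


message = {
    "tool_calls": [{
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Tokyo\"}"}
    }]
}
print(run_tool_calls(message))
```

Appending the assistant message and these tool replies to `messages` and calling the endpoint again lets the model write its final answer.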

The tool call arguments come back as valid JSON, even when Gemma 4 emits them in a looser form such as {a: 3, b: 4}, with bare values and no JSON quotes.
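To make the {a: 3, b: 4} case concrete, here is one way a client could repair that kind of output itself. This is an illustrative sketch only, not Rapid-MLX's parser: it tries strict JSON first and falls back to quoting bare keys with a regex, which would misfire on string values that happen to contain `, word:` patterns.

```python
import json
import re

# A bare identifier that follows "{" or "," and precedes ":" is treated as a key.
_BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_]\w*)\s*:')


def parse_lenient_args(text: str) -> dict:
    """Parse tool arguments, tolerating unquoted object keys."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        repaired = _BARE_KEY.sub(r'\1"\2":', text)
        return json.loads(repaired)


print(parse_lenient_args("{a: 3, b: 4}"))
print(parse_lenient_args('{"city": "Tokyo"}'))
```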

Works With Everything

Because it's OpenAI-compatible, you can point any AI framework at it:

PydanticAI

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

model = OpenAIChatModel(
    model_name="default",
    provider=OpenAIProvider(
        base_url="http://localhost:8000/v1",
        api_key="not-needed",
    ),
)

agent = Agent(model)
result = agent.run_sync("What is 2+2?")
print(result.output)  # e.g. "4"

I've verified this end-to-end with structured output (output_type=BaseModel), streaming, multi-turn conversations, and multi-tool workflows. Test suite here.
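If you want to check streaming without any framework, the wire format is simple enough to parse by hand. The sketch below assumes the server follows the standard OpenAI streaming convention (`data: {...}` lines ending with `data: [DONE]`); `iter_content` is an illustrative helper name, and the sample lines are made up for the demo.

```python
import json
from typing import Iterable, Iterator


def iter_content(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from OpenAI-style server-sent event lines."""
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_content(sample)))  # Hello
```

Against a live server, you would pass in the decoded lines of a response to a request sent with `"stream": true`.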

LangChain

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="default",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

# Tool calling works: declare a tool, bind it, and inspect the model's call
@tool
def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

result = llm.bind_tools([multiply]).invoke("What is 6 * 7?")
print(result.tool_calls)  # e.g. [{"name": "multiply", "args": {"a": 6, "b": 7}, ...}]

Aider (AI pair programming)

export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=not-needed
aider --model openai/gemma-4-26b

Aider's full edit-and-commit workflow works — I tested it modifying a Python file with Gemma 4. Test script here.

Full Compatibility List

Client	Status	Notes
PydanticAI	Tested (6/6)	Streaming, structured output, multi-tool
LangChain	Tested (6/6)	Tools, streaming, structured output
smolagents	Tested (4/4)	CodeAgent + ToolCallingAgent
Anthropic SDK	Tested (5/5)	Via `/v1/messages` endpoint
Aider	Tested	CLI edit-and-commit workflow
LibreChat	Tested (4/4)	Docker E2E with `librechat.yaml`
Open WebUI	Tested (3/4)	Docker, model fetch, streaming
Cursor	Compatible	Settings UI config
Claude Code	Compatible	`OPENAI_BASE_URL` env var
Continue.dev	Compatible	YAML config

Every "Tested" entry has an automated test script in the repo — not just "I tried it once."

What Model Should I Run?

Depends on your Mac's RAM:

Mac	Model	Speed	Use Case
16 GB MacBook Air	Qwen3.5-4B	168 tok/s	Chat, coding, tools
32 GB MacBook Pro	Gemma 4 26B-A4B	85 tok/s	General purpose, tool calling
64 GB Mac Mini/Studio	Qwen3.5-35B	83 tok/s	Smart + fast balance
96+ GB Mac Studio/Pro	Qwen3.5-122B	57 tok/s	Frontier intelligence
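The pairings in the table roughly track how much memory the weights need. A back-of-the-envelope estimate is parameters × bits per weight / 8. The sketch below assumes every model is served at 4 bits and reads the parameter counts off the model names. It ignores quantization metadata, the KV cache, and whatever else is running on your Mac, so treat the output as a lower bound.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Parameter counts (in billions) taken from the model names in the table.
MODELS = {
    "Qwen3.5-4B": 4,
    "Gemma 4 26B-A4B": 26,
    "Qwen3.5-35B": 35,
    "Qwen3.5-122B": 122,
}

for name, params in MODELS.items():
    print(f"{name}: ~{weights_gb(params):.1f} GB of weights")
```

For Gemma 4 26B this gives about 13 GB, in the same ballpark as the ~14 GB download mentioned earlier.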

Quick alias lookup:

rapid-mlx models

Under the Hood

A few things that make this work well:

Prompt cache — Repeated system prompts (common in agent frameworks) are cached. On multi-turn conversations, only new tokens are processed. This cuts TTFT by 2-10x on follow-up messages.
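The idea behind prefix caching can be shown in a few lines. This is a toy sketch, not Rapid-MLX's implementation: a real cache stores the attention key/value state for the cached tokens, while this one only tracks token IDs to show how many tokens still need processing. `PrefixCache` is an illustrative name.

```python
from typing import List


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Number of leading tokens shared by two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Remembers the last prompt so a shared prefix is not processed again."""

    def __init__(self) -> None:
        self.cached_tokens: List[int] = []

    def tokens_to_process(self, prompt_tokens: List[int]) -> List[int]:
        """Return the tokens not covered by the cached prefix, then update the cache."""
        reused = common_prefix_len(self.cached_tokens, prompt_tokens)
        self.cached_tokens = list(prompt_tokens)
        return prompt_tokens[reused:]


cache = PrefixCache()
turn1 = [1, 2, 3, 4, 5]   # system prompt + first user message
turn2 = turn1 + [6, 7]    # same history plus a follow-up
print(len(cache.tokens_to_process(turn1)))  # 5
print(len(cache.tokens_to_process(turn2)))  # 2
```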

OutputRouter — A token-level state machine that separates model output into channels (content / reasoning / tool calls) in real-time. No regex post-processing, no leakage of <think> tags or tool markup into the content stream.
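A stripped-down version of that kind of state machine looks like this. It is a sketch of the general technique only, not the actual OutputRouter: it handles two channels, assumes `<think>` and `</think>` each arrive as a single token, and leaves out tool calls, which would add more states.

```python
from typing import Iterable, Iterator, Tuple

OPEN_TAG, CLOSE_TAG = "<think>", "</think>"


def route(tokens: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Tag each token with its channel, dropping the marker tokens themselves."""
    channel = "content"
    for tok in tokens:
        if tok == OPEN_TAG:
            channel = "reasoning"
        elif tok == CLOSE_TAG:
            channel = "content"
        else:
            yield channel, tok


stream = ["<think>", "2+2", " is 4", "</think>", "The answer", " is 4."]
for channel, tok in route(stream):
    print(channel, repr(tok))
```

Because the decision is made per token as it arrives, each piece can be streamed to the right field immediately, with no need to wait for the full text and clean it up afterwards.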

Auto-detection — Model family, tool parser, and reasoning parser are auto-detected from the model name. No manual --tool-parser hermes flags needed (though you can override).
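Name-based detection with an override can be as simple as a lookup table. The sketch below is hypothetical: only the Qwen-to-hermes pairing comes from this post, and the other parser names and the `detect_tool_parser` helper are invented for illustration.

```python
from typing import Optional

# (substring of model name, parser name); the first match wins.
PARSER_BY_FAMILY = [
    ("qwen", "hermes"),     # Qwen 3 / 3.5 use the hermes format
    ("gemma-4", "gemma4"),  # hypothetical parser name
    ("llama", "llama3"),    # hypothetical parser name
]


def detect_tool_parser(model_name: str, override: Optional[str] = None) -> Optional[str]:
    """Pick a tool parser from the model name unless the user forces one."""
    if override:
        return override
    name = model_name.lower()
    for marker, parser in PARSER_BY_FAMILY:
        if marker in name:
            return parser
    return None  # unknown family: no tool parsing


print(detect_tool_parser("gemma-4-26b"))
print(detect_tool_parser("Qwen3.5-4B"))
print(detect_tool_parser("gemma-4-26b", override="hermes"))
```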

Try It

# Homebrew
brew install raullenchai/rapid-mlx/rapid-mlx

# or pip
pip install rapid-mlx

# Serve Gemma 4
rapid-mlx serve gemma-4-26b

# Point any OpenAI-compatible app at http://localhost:8000/v1