Raullen Chai


Gemma 4 on Apple Silicon: 85 tok/s with a pip install

Last week Google released Gemma 4, its most capable open-weight model family to date. Within hours I had it running locally on my Mac at 85 tokens/second, with full tool calling, streaming, and an OpenAI-compatible API that works with the AI frameworks and clients listed in the compatibility table below.

Here's how, and what the benchmarks actually look like.

Setup: 2 commands

pip install rapid-mlx
rapid-mlx serve gemma-4-26b

That's it. The server downloads the 4-bit MLX-quantized model (~14 GB) and starts an OpenAI-compatible API on http://localhost:8000/v1.
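
For readers who would rather poke the endpoint from Python than from curl, the standard library is enough. This is a minimal sketch, assuming the server started above is listening on http://localhost:8000/v1 and accepts the model name "default" (the name used in the curl example later in this post). `build_chat_request` and `ask` are illustrative helper names, not part of Rapid-MLX.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # address given in the post; adjust if yours differs


def build_chat_request(prompt: str, model: str = "default") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (does not send it)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Say hello in five words"))` should print the model's reply. Building the request separately from sending it keeps the payload easy to inspect when something goes wrong.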


Benchmarks: Gemma 4 26B on M3 Ultra

I benchmarked three engines on the same machine (M3 Ultra, 192GB), same model (Gemma 4 26B-A4B 4-bit), same prompt:

| Engine | Decode (tok/s) | TTFT (s) | Notes |
|---|---|---|---|
| Rapid-MLX | 85 | 0.26 | MLX-native, prompt cache |
| mlx-vlm | 84 | 0.31 | VLM library (no tool calling) |
| Ollama | 75 | 0.08 | llama.cpp backend |

Rapid-MLX is 13% faster than Ollama on decode (85 vs. 75 tok/s). Ollama has the faster TTFT (0.08 s vs. 0.26 s; it uses llama.cpp's Metal kernels for prefill), but for interactive use, decode speed is what you feel.

On smaller models the gap is wider — Rapid-MLX hits 168 tok/s on Qwen3.5-4B vs Ollama's ~70 tok/s (2.4x).
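
The post does not spell out how the two metrics were measured, so the following is only one common way to derive them: record when the request was sent and when each streamed token arrived, then compute TTFT and decode throughput from those timestamps. The names `StreamStats`, `summarize_stream`, and `speedup_percent` are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class StreamStats:
    ttft_s: float        # time from request start to the first token
    decode_tok_s: float  # tokens per second after the first token


def summarize_stream(request_start: float, token_times: list[float]) -> StreamStats:
    """Compute TTFT and decode throughput from token arrival timestamps (seconds)."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    ttft = token_times[0] - request_start
    decode_window = token_times[-1] - token_times[0]
    # The first token is attributed to prefill, so count only the tokens after it.
    return StreamStats(ttft, (len(token_times) - 1) / decode_window)


def speedup_percent(fast_tok_s: float, slow_tok_s: float) -> float:
    """Relative decode speedup of `fast` over `slow`, in percent."""
    return (fast_tok_s / slow_tok_s - 1.0) * 100.0
```

Plugging in the table's numbers, `speedup_percent(85, 75)` comes out at about 13.3, which matches the 13% quoted above, and 168 / 70 gives the 2.4x figure for Qwen3.5-4B.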

Tool Calling That Actually Works

This is where it gets interesting. Most local inference servers either don't support tool calling, or support it for one model family. Rapid-MLX ships 18 built-in tool call parsers covering:

  • Qwen 3 / 3.5 (hermes format)
  • Gemma 4 (native <|tool_call> format)
  • GLM-4.7, MiniMax, GPT-OSS
  • Llama 3, Mistral, DeepSeek
  • And more

Tool calling works out of the box — no extra flags needed for supported models:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'

Response:

{
  "choices": [{
    "message": {
      "tool_calls": [{
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"Tokyo\"}"
        }
      }]
    }
  }]
}

The tool call arguments come back as valid JSON even when the model's raw output is not: Gemma 4 emits bare values like {a: 3, b: 4} without JSON quoting, and the parser normalizes them.
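
Two details of the response are easy to miss on the client side: `arguments` is a JSON-encoded string rather than an object, and it has to be decoded before the tool is called. The sketch below dispatches the tool calls from a response shaped like the one above and, purely as an illustration of the normalization idea, tolerates unquoted keys. It is not Rapid-MLX's parser. `parse_arguments`, `run_tool_calls`, and the canned `get_weather` implementation are assumptions for the example.

```python
import json
import re

# Matches an unquoted key such as `a` in `{a: 3, b: 4}`.
_BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)')


def parse_arguments(raw: str) -> dict:
    """Decode tool-call arguments, tolerating unquoted object keys."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Naive repair: quote bare keys. Not safe if a string value itself
        # contains text like ", word:"; a real parser would tokenize instead.
        return json.loads(_BARE_KEY.sub(r'\1"\2"\3', raw))


def get_weather(city: str) -> str:
    """Stand-in tool implementation returning canned data."""
    return f"Sunny in {city}"


TOOLS = {"get_weather": get_weather}


def run_tool_calls(response: dict) -> list[str]:
    """Execute every tool call found in a chat completion response."""
    message = response["choices"][0]["message"]
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        results.append(TOOLS[fn["name"]](**parse_arguments(fn["arguments"])))
    return results
```

In a full agent loop, each result would be sent back to the model as a `tool` role message so it can compose the final answer.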

Works With Everything

Because it's OpenAI-compatible, you can point any AI framework at it:

PydanticAI

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

model = OpenAIChatModel(
    model_name="default",
    provider=OpenAIProvider(
        base_url="http://localhost:8000/v1",
        api_key="not-needed",
    ),
)

agent = Agent(model)
result = agent.run_sync("What is 2+2?")
print(result.output)  # "4"

I've verified this end-to-end with structured output (output_type set to a Pydantic BaseModel subclass), streaming, multi-turn conversations, and multi-tool workflows. The test suite is in the repo.

LangChain

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

# Point the OpenAI client at the local server; the key is a placeholder.
llm = ChatOpenAI(
    model="default",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)


@tool
def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b


# bind_tools sends the function schema to the model; the reply carries the
# requested call, not the product itself.
result = llm.bind_tools([multiply]).invoke("What is 6 * 7?")
print(result.tool_calls)  # e.g. [{"name": "multiply", "args": {"a": 6, "b": 7}, "id": "...", ...}]

Aider (AI pair programming)

export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=not-needed
aider --model openai/gemma-4-26b

Aider's full edit-and-commit workflow works: I tested it modifying a Python file with Gemma 4. The test script is in the repo.

Full Compatibility List

| Client | Status | Notes |
|---|---|---|
| PydanticAI | Tested (6/6) | Streaming, structured output, multi-tool |
| LangChain | Tested (6/6) | Tools, streaming, structured output |
| smolagents | Tested (4/4) | CodeAgent + ToolCallingAgent |
| Anthropic SDK | Tested (5/5) | Via /v1/messages endpoint |
| Aider | Tested | CLI edit-and-commit workflow |
| LibreChat | Tested (4/4) | Docker E2E with librechat.yaml |
| Open WebUI | Tested (3/4) | Docker, model fetch, streaming |
| Cursor | Compatible | Settings UI config |
| Claude Code | Compatible | OPENAI_BASE_URL env var |
| Continue.dev | Compatible | YAML config |

Every "Tested" entry has an automated test script in the repo — not just "I tried it once."

What Model Should I Run?

Depends on your Mac's RAM:

| Mac | Model | Speed | Use Case |
|---|---|---|---|
| 16 GB MacBook Air | Qwen3.5-4B | 168 tok/s | Chat, coding, tools |
| 32 GB MacBook Pro | Gemma 4 26B-A4B | 85 tok/s | General purpose, tool calling |
| 64 GB Mac Mini/Studio | Qwen3.5-35B | 83 tok/s | Smart + fast balance |
| 96+ GB Mac Studio/Pro | Qwen3.5-122B | 57 tok/s | Frontier intelligence |
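
If you want to script the choice, the table reduces to a small lookup. This sketch assumes the RAM figures act as minimums, so a 48 GB machine falls back to the 32 GB row; that reading is mine, and `recommend_model` is a hypothetical helper rather than a Rapid-MLX command.

```python
# (minimum RAM in GB, model, decode speed in tok/s as listed in the table)
RECOMMENDATIONS = [
    (96, "Qwen3.5-122B", 57),
    (64, "Qwen3.5-35B", 83),
    (32, "Gemma 4 26B-A4B", 85),
    (16, "Qwen3.5-4B", 168),
]


def recommend_model(ram_gb: int) -> str:
    """Return the model from the highest RAM tier the machine satisfies."""
    for min_ram, model, _tok_s in RECOMMENDATIONS:
        if ram_gb >= min_ram:
            return model
    raise ValueError("the table has no recommendation below 16 GB")
```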

Quick alias lookup:

rapid-mlx models

Under the Hood

A few things that make this work well:

Prompt cache — Repeated system prompts (common in agent frameworks) are cached. On multi-turn conversations, only new tokens are processed. This cuts TTFT by 2-10x on follow-up messages.
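
The core of the idea can be sketched in a few lines: compare the new prompt's token ids with the previously seen ones and run prefill only on the part that differs. This is a toy model of prefix reuse under my own assumptions, not Rapid-MLX's cache; `PrefixCache` and `common_prefix_len` are invented names, and a real engine also has to keep the KV state for the reused prefix.

```python
def common_prefix_len(cached: list[int], new: list[int]) -> int:
    """Number of leading token ids shared by the cached and the new prompt."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


class PrefixCache:
    """Remembers the last prompt so only the unseen suffix needs prefill."""

    def __init__(self) -> None:
        self.tokens: list[int] = []

    def tokens_to_process(self, prompt: list[int]) -> list[int]:
        reused = common_prefix_len(self.tokens, prompt)
        self.tokens = list(prompt)
        # Everything before `reused` is assumed to be served from cached state.
        return prompt[reused:]
```

On a follow-up turn that appends two tokens to a four-token conversation, `tokens_to_process` returns just those two tokens, which is where the TTFT savings on multi-turn chats come from.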

OutputRouter — A token-level state machine that separates model output into channels (content / reasoning / tool calls) in real-time. No regex post-processing, no leakage of <think> tags or tool markup into the content stream.
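
To make the state-machine idea concrete, here is a deliberately small sketch. It assumes every marker arrives as a single token and uses `<think>` / `<tool_call>` style markers for illustration; the actual OutputRouter and each model family's markers will differ, and `route_tokens` is a name I made up.

```python
from collections import defaultdict

# Marker tokens are assumptions for illustration; real models differ.
TRANSITIONS = {
    "<think>": "reasoning",
    "</think>": "content",
    "<tool_call>": "tool_calls",
    "</tool_call>": "content",
}


def route_tokens(tokens: list[str]) -> dict[str, str]:
    """Split a token stream into content / reasoning / tool_calls channels."""
    channels: dict[str, list[str]] = defaultdict(list)
    state = "content"
    for tok in tokens:
        if tok in TRANSITIONS:
            state = TRANSITIONS[tok]  # a marker switches channel and is dropped
            continue
        channels[state].append(tok)
    return {name: "".join(parts) for name, parts in channels.items()}
```

Because the decision is made per token as it arrives, each channel can be streamed to the client immediately, and the markers never reach the content channel.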

Auto-detection — Model family, tool parser, and reasoning parser are auto-detected from the model name. No manual --tool-parser hermes flags needed (though you can override).
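
Name-based detection can be as simple as a rule table checked in order, with an explicit override taking precedence. The sketch below is a guess at the shape of such a lookup, not the project's code: only the "hermes" parser name appears in this post, while "gemma4" and "llama3" are placeholder names, as is `detect_tool_parser`.

```python
from typing import Optional

# (substring of the model name, tool parser). Only "hermes" is named in the
# post; the other parser names are placeholders.
PARSER_RULES = [
    ("qwen", "hermes"),
    ("gemma-4", "gemma4"),
    ("llama-3", "llama3"),
]


def detect_tool_parser(model_name: str, override: Optional[str] = None) -> Optional[str]:
    """Pick a tool parser from the model name unless the caller overrides it."""
    if override is not None:
        return override
    lowered = model_name.lower()
    for needle, parser in PARSER_RULES:
        if needle in lowered:
            return parser
    return None  # unknown family: no tool parsing
```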

Try It

# Homebrew
brew install raullenchai/rapid-mlx/rapid-mlx

# or pip
pip install rapid-mlx

# Serve Gemma 4
rapid-mlx serve gemma-4-26b

# Point any OpenAI-compatible app at http://localhost:8000/v1

Repo: github.com/raullenchai/Rapid-MLX


Built on Apple's MLX framework and mlx-lm. Licensed Apache 2.0.
