Last week Google released Gemma 4 — their most capable open-weight model family. Within hours I had it running locally on my Mac at 85 tokens/second, with full tool calling, streaming, and an OpenAI-compatible API that works with every major AI framework.
Here's how, and what the benchmarks actually look like.
## Setup: 2 Commands

```bash
pip install rapid-mlx
rapid-mlx serve gemma-4-26b
```
That's it. The server downloads the 4-bit MLX-quantized model (~14 GB) and starts an OpenAI-compatible API on http://localhost:8000/v1.
## Benchmarks: Gemma 4 26B on M3 Ultra
I benchmarked three engines on the same machine (M3 Ultra, 192GB), same model (Gemma 4 26B-A4B 4-bit), same prompt:
| Engine | Decode (tok/s) | TTFT | Notes |
|---|---|---|---|
| Rapid-MLX | 85 tok/s | 0.26s | MLX-native, prompt cache |
| mlx-vlm | 84 tok/s | 0.31s | VLM library (no tool calling) |
| Ollama | 75 tok/s | 0.08s | llama.cpp backend |
Rapid-MLX is 13% faster than Ollama on decode. Ollama has faster TTFT (it uses llama.cpp's Metal kernels for prefill), but for interactive use the decode speed is what you feel.
On smaller models the gap is wider — Rapid-MLX hits 168 tok/s on Qwen3.5-4B vs Ollama's ~70 tok/s (2.4x).
## Tool Calling That Actually Works
This is where it gets interesting. Most local inference servers either don't support tool calling, or support it for one model family. Rapid-MLX ships 18 built-in tool call parsers covering:
- Qwen 3 / 3.5 (hermes format)
- Gemma 4 (native `<|tool_call>` format)
- GLM-4.7, MiniMax, GPT-OSS
- Llama 3, Mistral, DeepSeek
- And more
Tool calling works out of the box — no extra flags needed for supported models:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```
Response:

```json
{
  "choices": [{
    "message": {
      "tool_calls": [{
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"Tokyo\"}"
        }
      }]
    }
  }]
}
```
The tool call arguments are properly parsed, including near-JSON like `{a: 3, b: 4}` with unquoted keys, which Gemma 4 sometimes emits instead of strict JSON.
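For the curious, that leniency can be sketched in a few lines. This is not Rapid-MLX's actual parser, just an illustration of the idea: try strict JSON first, and if that fails, quote bare identifier keys before retrying.

```python
import json
import re

def repair_tool_args(raw: str) -> dict:
    """Best-effort repair of near-JSON tool arguments.

    Handles unquoted object keys like {a: 3, b: 4}, which some models
    emit instead of strict JSON. A sketch, not Rapid-MLX's real parser.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Quote bare identifiers used as keys: {a: 3} -> {"a": 3}
        repaired = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', raw)
        return json.loads(repaired)

print(repair_tool_args('{"city": "Tokyo"}'))  # strict JSON passes through
print(repair_tool_args('{a: 3, b: 4}'))       # bare keys are repaired
```

A real parser also has to handle truncated output and nested strings, but the two-phase "strict first, repair second" shape is the common pattern.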
## Works With Everything
Because it's OpenAI-compatible, you can point any AI framework at it:
### PydanticAI

```python
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

model = OpenAIChatModel(
    model_name="default",
    provider=OpenAIProvider(
        base_url="http://localhost:8000/v1",
        api_key="not-needed",
    ),
)

agent = Agent(model)
result = agent.run_sync("What is 2+2?")
print(result.output)  # "4"
```
I've verified this end-to-end with structured output (output_type=BaseModel), streaming, multi-turn conversations, and multi-tool workflows. Test suite here.
### LangChain

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="default",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

# Tool calling works
from langchain_core.tools import tool

@tool
def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

result = llm.bind_tools([multiply]).invoke("What is 6 * 7?")
print(result.tool_calls)  # [{"name": "multiply", "args": {"a": 6, "b": 7}}]
```
### Aider (AI pair programming)

```bash
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=not-needed
aider --model openai/gemma-4-26b
```
Aider's full edit-and-commit workflow works — I tested it modifying a Python file with Gemma 4. Test script here.
## Full Compatibility List
| Client | Status | Notes |
|---|---|---|
| PydanticAI | Tested (6/6) | Streaming, structured output, multi-tool |
| LangChain | Tested (6/6) | Tools, streaming, structured output |
| smolagents | Tested (4/4) | CodeAgent + ToolCallingAgent |
| Anthropic SDK | Tested (5/5) | Via /v1/messages endpoint |
| Aider | Tested | CLI edit-and-commit workflow |
| LibreChat | Tested (4/4) | Docker E2E with librechat.yaml |
| Open WebUI | Tested (3/4) | Docker, model fetch, streaming |
| Cursor | Compatible | Settings UI config |
| Claude Code | Compatible | OPENAI_BASE_URL env var |
| Continue.dev | Compatible | YAML config |
Every "Tested" entry has an automated test script in the repo — not just "I tried it once."
## What Model Should I Run?
Depends on your Mac's RAM:
| Mac | Model | Speed | Use Case |
|---|---|---|---|
| 16 GB MacBook Air | Qwen3.5-4B | 168 tok/s | Chat, coding, tools |
| 32 GB MacBook Pro | Gemma 4 26B-A4B | 85 tok/s | General purpose, tool calling |
| 64 GB Mac Mini/Studio | Qwen3.5-35B | 83 tok/s | Smart + fast balance |
| 96+ GB Mac Studio/Pro | Qwen3.5-122B | 57 tok/s | Frontier intelligence |
Quick alias lookup:

```bash
rapid-mlx models
```
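Internally, an alias table like the one below is a plausible way to map short names to full quantized repos. The repo IDs here are hypothetical illustrations, not Rapid-MLX's actual registry:

```python
# Hypothetical alias registry: short names -> full quantized repo IDs.
# The repo IDs are illustrative, not the library's real mapping.
MODEL_ALIASES = {
    "gemma-4-26b": "mlx-community/gemma-4-26b-4bit",
    "qwen3.5-4b": "mlx-community/qwen3.5-4b-4bit",
}

def resolve_alias(name: str) -> str:
    # Unknown names pass through unchanged, so users can serve any
    # repo ID directly without registering an alias first.
    return MODEL_ALIASES.get(name, name)

print(resolve_alias("gemma-4-26b"))
print(resolve_alias("some-org/custom-model"))
```

The pass-through default is the important design choice: aliases are a convenience layer, not a gatekeeper.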
## Under the Hood
A few things that make this work well:
**Prompt cache** — Repeated system prompts (common in agent frameworks) are cached. On multi-turn conversations, only new tokens are processed. This cuts TTFT by 2-10x on follow-up messages.
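The core of any prompt cache is a longest-common-prefix check over token IDs: everything before the first mismatch reuses cached KV state, and only the suffix needs a forward pass. A simplified sketch (not Rapid-MLX's implementation):

```python
def reusable_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Length of the shared prefix between the cached prompt and a new one.

    The model only needs to prefill tokens after this point; everything
    before it reuses cached KV state. A sketch of the idea, not the
    library's actual cache logic.
    """
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# A multi-turn follow-up shares the system prompt + earlier turns:
cached = [1, 2, 3, 4, 5]        # tokens processed on turn 1
follow = [1, 2, 3, 4, 5, 6, 7]  # turn 2 appends new user tokens
print(len(follow) - reusable_prefix_len(cached, follow))  # only 2 tokens to prefill
```

This is why the TTFT win shows up on follow-up messages specifically: the first request still pays the full prefill cost.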
**OutputRouter** — A token-level state machine that separates model output into channels (content / reasoning / tool calls) in real time. No regex post-processing, and no leakage of `<think>` tags or tool markup into the content stream.
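A toy version of the idea, with illustrative marker strings rather than whatever tag set the library actually recognizes:

```python
class OutputRouter:
    """Minimal sketch of a token-level router: stream tokens into
    'content', 'reasoning', or 'tool' channels based on marker tokens.
    The marker strings here are illustrative, not the real tag set."""

    MARKERS = {
        "<think>": "reasoning",
        "</think>": "content",
        "<tool_call>": "tool",
        "</tool_call>": "content",
    }

    def __init__(self):
        self.state = "content"
        self.channels = {"content": [], "reasoning": [], "tool": []}

    def feed(self, token: str) -> None:
        if token in self.MARKERS:
            self.state = self.MARKERS[token]  # switch channel, drop the marker
        else:
            self.channels[self.state].append(token)

router = OutputRouter()
for tok in ["<think>", "plan", "</think>", "Hello", "<tool_call>", "{...}", "</tool_call>"]:
    router.feed(tok)
print(router.channels["content"])    # ['Hello']
print(router.channels["reasoning"])  # ['plan']
```

Because routing happens per token during decode, each channel can be streamed to the client immediately; there is no buffer-then-regex pass at the end.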
**Auto-detection** — The model family, tool parser, and reasoning parser are auto-detected from the model name. No manual `--tool-parser hermes` flag needed (though you can override).
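A minimal sketch of how name-based detection might look; the substring-to-parser mapping below is illustrative, not the library's actual table:

```python
# Hypothetical detection rules: first matching substring of the model
# name wins. Illustrative only -- an override flag would bypass this.
PARSER_RULES = [
    ("qwen", "hermes"),
    ("gemma", "gemma"),
    ("llama", "llama3"),
    ("mistral", "mistral"),
]

def detect_tool_parser(model_name: str, default: str = "hermes") -> str:
    name = model_name.lower()
    for substring, parser in PARSER_RULES:
        if substring in name:
            return parser
    return default

print(detect_tool_parser("gemma-4-26b"))   # gemma
print(detect_tool_parser("Qwen3.5-4B"))    # hermes
```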
## Try It

```bash
# Homebrew
brew install raullenchai/rapid-mlx/rapid-mlx

# or pip
pip install rapid-mlx

# Serve Gemma 4
rapid-mlx serve gemma-4-26b

# Point any OpenAI-compatible app at http://localhost:8000/v1
```
Repo: github.com/raullenchai/Rapid-MLX
Built on Apple's MLX framework and mlx-lm. Licensed Apache 2.0.
