<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Raullen Chai</title>
    <description>The latest articles on DEV Community by Raullen Chai (@raullen_chai_76e18e9705b0).</description>
    <link>https://dev.to/raullen_chai_76e18e9705b0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2056219%2F75d4cd58-24b8-49e1-b45e-b0aed1819c29.jpg</url>
      <title>DEV Community: Raullen Chai</title>
      <link>https://dev.to/raullen_chai_76e18e9705b0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/raullen_chai_76e18e9705b0"/>
    <language>en</language>
    <item>
      <title>Gemma 4 on Apple Silicon: 85 tok/s with a pip install</title>
      <dc:creator>Raullen Chai</dc:creator>
      <pubDate>Tue, 07 Apr 2026 21:45:09 +0000</pubDate>
      <link>https://dev.to/raullen_chai_76e18e9705b0/gemma-4-on-apple-silicon-85-toks-with-a-pip-install-299a</link>
      <guid>https://dev.to/raullen_chai_76e18e9705b0/gemma-4-on-apple-silicon-85-toks-with-a-pip-install-299a</guid>
      <description>&lt;p&gt;Last week Google released Gemma 4 — their most capable open-weight model family. Within hours I had it running locally on my Mac at 85 tokens/second, with full tool calling, streaming, and an OpenAI-compatible API that works with every major AI framework.&lt;/p&gt;

&lt;p&gt;Here's how, and what the benchmarks actually look like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup: 2 commands
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;rapid-mlx
rapid-mlx serve gemma-4-26b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The server downloads the 4-bit MLX-quantized model (~14 GB) and starts an OpenAI-compatible API on &lt;code&gt;http://localhost:8000/v1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8o44i5lgw06li1a2fmy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8o44i5lgw06li1a2fmy.gif" alt="Rapid-MLX demo" width="643" height="694"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks: Gemma 4 26B on M3 Ultra
&lt;/h2&gt;

&lt;p&gt;I benchmarked three engines on the same machine (M3 Ultra, 192GB), same model (Gemma 4 26B-A4B 4-bit), same prompt:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Decode (tok/s)&lt;/th&gt;
&lt;th&gt;TTFT&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rapid-MLX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.26s&lt;/td&gt;
&lt;td&gt;MLX-native, prompt cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mlx-vlm&lt;/td&gt;
&lt;td&gt;84 tok/s&lt;/td&gt;
&lt;td&gt;0.31s&lt;/td&gt;
&lt;td&gt;VLM library (no tool calling)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;75 tok/s&lt;/td&gt;
&lt;td&gt;0.08s&lt;/td&gt;
&lt;td&gt;llama.cpp backend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rapid-MLX is 13% faster than Ollama on decode. Ollama has faster TTFT (it uses llama.cpp's Metal kernels for prefill), but for interactive use the decode speed is what you feel.&lt;/p&gt;

&lt;p&gt;On smaller models the gap is wider — Rapid-MLX hits 168 tok/s on Qwen3.5-4B vs Ollama's ~70 tok/s (2.4x).&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool Calling That Actually Works
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Most local inference servers either don't support tool calling, or support it for one model family. Rapid-MLX ships &lt;strong&gt;18 built-in tool call parsers&lt;/strong&gt; covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Qwen 3 / 3.5 (hermes format)&lt;/li&gt;
&lt;li&gt;Gemma 4 (native &lt;code&gt;&amp;lt;|tool_call&amp;gt;&lt;/code&gt; format)&lt;/li&gt;
&lt;li&gt;GLM-4.7, MiniMax, GPT-OSS&lt;/li&gt;
&lt;li&gt;Llama 3, Mistral, DeepSeek&lt;/li&gt;
&lt;li&gt;And more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tool calling works out of the box — no extra flags needed for supported models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "default",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"tool_calls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;city&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Tokyo&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool call arguments are properly parsed — including bare numeric values like &lt;code&gt;{a: 3, b: 4}&lt;/code&gt; that Gemma 4 emits without JSON quotes.&lt;/p&gt;
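The shipped parser is more elaborate, but the recovery trick for bare keys can be sketched in a few lines of Python (the helper name is mine, not the library's API):

```python
import json
import re

def parse_lenient_json(text):
    """Parse tool-call arguments that strict JSON rejects, e.g. the
    bare-key form '{a: 3, b: 4}' that some models emit."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Quote bare identifiers that appear in key position, then retry.
        fixed = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', text)
        return json.loads(fixed)
```

Valid JSON passes through untouched; only the fallback path rewrites keys.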

&lt;h2&gt;
  
  
  Works With Everything
&lt;/h2&gt;

&lt;p&gt;Because it's OpenAI-compatible, you can point any AI framework at it:&lt;/p&gt;

&lt;h3&gt;
  
  
  PydanticAI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic_ai.models.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIChatModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic_ai.providers.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIProvider&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIChatModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAIProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 2+2?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "4"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've verified this end-to-end with structured output (&lt;code&gt;output_type=BaseModel&lt;/code&gt;), streaming, multi-turn conversations, and multi-tool workflows. &lt;a href="https://github.com/raullenchai/Rapid-MLX/blob/main/tests/integrations/test_pydantic_ai_full.py" rel="noopener noreferrer"&gt;Test suite here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangChain
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Tool calling works
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Multiply two numbers.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 6 * 7?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# [{"name": "multiply", "args": {"a": 6, "b": 7}}]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Aider (AI pair programming)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_BASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8000/v1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;not-needed
aider &lt;span class="nt"&gt;--model&lt;/span&gt; openai/gemma-4-26b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aider's full edit-and-commit workflow works — I tested it modifying a Python file with Gemma 4. &lt;a href="https://github.com/raullenchai/Rapid-MLX/blob/main/tests/integrations/test_aider.sh" rel="noopener noreferrer"&gt;Test script here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Full Compatibility List
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Client&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PydanticAI&lt;/td&gt;
&lt;td&gt;Tested (6/6)&lt;/td&gt;
&lt;td&gt;Streaming, structured output, multi-tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangChain&lt;/td&gt;
&lt;td&gt;Tested (6/6)&lt;/td&gt;
&lt;td&gt;Tools, streaming, structured output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;smolagents&lt;/td&gt;
&lt;td&gt;Tested (4/4)&lt;/td&gt;
&lt;td&gt;CodeAgent + ToolCallingAgent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic SDK&lt;/td&gt;
&lt;td&gt;Tested (5/5)&lt;/td&gt;
&lt;td&gt;Via &lt;code&gt;/v1/messages&lt;/code&gt; endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;Tested&lt;/td&gt;
&lt;td&gt;CLI edit-and-commit workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LibreChat&lt;/td&gt;
&lt;td&gt;Tested (4/4)&lt;/td&gt;
&lt;td&gt;Docker E2E with &lt;code&gt;librechat.yaml&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open WebUI&lt;/td&gt;
&lt;td&gt;Tested (3/4)&lt;/td&gt;
&lt;td&gt;Docker, model fetch, streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;Compatible&lt;/td&gt;
&lt;td&gt;Settings UI config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Compatible&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; env var&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continue.dev&lt;/td&gt;
&lt;td&gt;Compatible&lt;/td&gt;
&lt;td&gt;YAML config&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every "Tested" entry has an automated test script in the repo — not just "I tried it once."&lt;/p&gt;

&lt;h2&gt;
  
  
  What Model Should I Run?
&lt;/h2&gt;

&lt;p&gt;Depends on your Mac's RAM:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mac&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;16 GB MacBook Air&lt;/td&gt;
&lt;td&gt;Qwen3.5-4B&lt;/td&gt;
&lt;td&gt;168 tok/s&lt;/td&gt;
&lt;td&gt;Chat, coding, tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32 GB MacBook Pro&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B&lt;/td&gt;
&lt;td&gt;85 tok/s&lt;/td&gt;
&lt;td&gt;General purpose, tool calling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64 GB Mac Mini/Studio&lt;/td&gt;
&lt;td&gt;Qwen3.5-35B&lt;/td&gt;
&lt;td&gt;83 tok/s&lt;/td&gt;
&lt;td&gt;Smart + fast balance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;96+ GB Mac Studio/Pro&lt;/td&gt;
&lt;td&gt;Qwen3.5-122B&lt;/td&gt;
&lt;td&gt;57 tok/s&lt;/td&gt;
&lt;td&gt;Frontier intelligence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Quick alias lookup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rapid-mlx models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Under the Hood
&lt;/h2&gt;

&lt;p&gt;A few things that make this work well:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt cache&lt;/strong&gt; — Repeated system prompts (common in agent frameworks) are cached. On multi-turn conversations, only new tokens are processed. This cuts TTFT by 2-10x on follow-up messages.&lt;/p&gt;
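The cache behaves like a longest-common-prefix check over token IDs; a toy illustration of the idea, not the MLX implementation:

```python
def reusable_prefix(cached_tokens, new_tokens):
    """How many leading tokens the new prompt shares with the cached
    one -- only tokens past this point need a fresh forward pass."""
    shared = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        shared += 1
    return shared

# A follow-up turn re-sends the whole system prompt plus history,
# so almost everything is shared and TTFT drops accordingly.
```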

&lt;p&gt;&lt;strong&gt;OutputRouter&lt;/strong&gt; — A token-level state machine that separates model output into channels (content / reasoning / tool calls) in real-time. No regex post-processing, no leakage of &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags or tool markup into the content stream.&lt;/p&gt;
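Conceptually the router is a tiny state machine keyed on the model's special tags; a simplified sketch (the marker strings here are stand-ins for the real think / tool-call tags):

```python
def route_tokens(tokens, markers=None):
    """Append each ordinary token to the currently active channel;
    marker tokens only switch the channel and are never emitted."""
    if markers is None:
        markers = {
            "THINK_START": "reasoning", "THINK_END": "content",
            "TOOL_START": "tool_calls", "TOOL_END": "content",
        }
    channels = {"content": [], "reasoning": [], "tool_calls": []}
    active = "content"
    for tok in tokens:
        if tok in markers:
            active = markers[tok]
        else:
            channels[active].append(tok)
    return channels
```

Because routing happens per token, each channel can be streamed to the client the moment a token lands in it.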

&lt;p&gt;&lt;strong&gt;Auto-detection&lt;/strong&gt; — Model family, tool parser, and reasoning parser are auto-detected from the model name. No manual &lt;code&gt;--tool-parser hermes&lt;/code&gt; flags needed (though you can override).&lt;/p&gt;
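Detection amounts to a substring lookup from model name to parser; a hypothetical sketch (the parser names and table entries are illustrative, not the shipped list):

```python
def detect_tool_parser(model_name):
    """Pick a tool-call parser from the model name; fall back to a
    generic parser when nothing matches."""
    name = model_name.lower()
    table = [
        ("gemma", "gemma"),
        ("qwen", "hermes"),
        ("llama", "llama3"),
        ("mistral", "mistral"),
    ]
    for needle, parser in table:
        if needle in name:
            return parser
    return "generic"
```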

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Homebrew&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;raullenchai/rapid-mlx/rapid-mlx

&lt;span class="c"&gt;# or pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;rapid-mlx

&lt;span class="c"&gt;# Serve Gemma 4&lt;/span&gt;
rapid-mlx serve gemma-4-26b

&lt;span class="c"&gt;# Point any OpenAI-compatible app at http://localhost:8000/v1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Repo: &lt;a href="https://github.com/raullenchai/Rapid-MLX" rel="noopener noreferrer"&gt;github.com/raullenchai/Rapid-MLX&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built on Apple's &lt;a href="https://github.com/ml-explore/mlx" rel="noopener noreferrer"&gt;MLX framework&lt;/a&gt; and &lt;a href="https://github.com/ml-explore/mlx-lm" rel="noopener noreferrer"&gt;mlx-lm&lt;/a&gt;. Licensed Apache 2.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gemma4</category>
      <category>applesilicon</category>
      <category>mlx</category>
      <category>localai</category>
    </item>
    <item>
      <title>Stop pasting 5,000 lines of logs into Claude. Use a secure context tunnel instead</title>
      <dc:creator>Raullen Chai</dc:creator>
      <pubDate>Sun, 25 Jan 2026 05:12:19 +0000</pubDate>
      <link>https://dev.to/raullen_chai_76e18e9705b0/stop-pasting-5000-lines-of-logs-into-claude-use-a-secure-context-tunnel-instead-5559</link>
      <guid>https://dev.to/raullen_chai_76e18e9705b0/stop-pasting-5000-lines-of-logs-into-claude-use-a-secure-context-tunnel-instead-5559</guid>
      <description>&lt;h1&gt;
  
  
  Stop pasting 5,000 lines of logs into Claude. Use a secure context tunnel instead.
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #ai #productivity #cli #security&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: The "Wall of Text" Friction
&lt;/h2&gt;

&lt;p&gt;We've all been there. You're debugging a nasty crash. You have a 2MB log file. You try to paste it into ChatGPT or Claude.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;em&gt;The UI freezes.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;❌ &lt;em&gt;The text gets truncated.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;❌ &lt;em&gt;You realize you just pasted your API keys into a cloud chat history.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution: vnsh (Vanish)
&lt;/h2&gt;

&lt;p&gt;I built an open-source tool called &lt;strong&gt;vnsh&lt;/strong&gt;. Think of it as an ephemeral "Dropbox" designed specifically for AI agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; You pipe data in your terminal: &lt;code&gt;cat error.log | vn&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; It encrypts it locally (&lt;strong&gt;AES-256-CBC&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt; It gives you a secure link.&lt;/li&gt;
&lt;li&gt; You give that link to Claude.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Because I built a &lt;strong&gt;native Model Context Protocol (MCP)&lt;/strong&gt; server for it, Claude can actually "see" inside the encrypted link and read the file directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;If you are on Mac/Linux (Homebrew):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap raullenchai/vnsh
brew &lt;span class="nb"&gt;install &lt;/span&gt;vnsh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or via NPM (Node.js):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; vnsh-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "Magic" Workflow&lt;br&gt;
Next time you have a git diff that is too long to explain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git diff | vn
# Output: [https://vnsh.dev/v/abc...#k=](https://vnsh.dev/v/abc...#k=)...
Paste that URL to Claude. It stays fast, the server (me) can't read your code, and the data self-destructs in 24 hours.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Self-Hosting
&lt;/h2&gt;

&lt;p&gt;Since it deals with sensitive data, I made it host-blind: the decryption key lives in the URL hash fragment and is never sent to the server. But if you are paranoid (like me), you can self-host the whole stack on your own Cloudflare account.&lt;/p&gt;
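The host-blind property falls out of how URLs work: HTTP clients strip the fragment before sending a request, so the key never leaves the client. A stdlib sketch of the link format (the blob id and helper name are illustrative):

```python
import base64
import secrets
from urllib.parse import urlsplit

def build_share_url(blob_id):
    """Generate a fresh 256-bit key locally and put it in the URL
    fragment, which browsers never transmit to the server."""
    key = secrets.token_bytes(32)
    k = base64.urlsafe_b64encode(key).decode().rstrip("=")
    return "https://vnsh.dev/v/" + blob_id + "#k=" + k

url = build_share_url("abc123")
parts = urlsplit(url)
# parts.path is all the server ever sees; parts.fragment holds the key.
```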

&lt;p&gt;Check it out on GitHub: &lt;a href="https://github.com/raullenchai/vnsh" rel="noopener noreferrer"&gt;https://github.com/raullenchai/vnsh&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cli</category>
      <category>productivity</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Control Claude Code from Your Phone with Claw</title>
      <dc:creator>Raullen Chai</dc:creator>
      <pubDate>Sun, 18 Jan 2026 22:03:15 +0000</pubDate>
      <link>https://dev.to/raullen_chai_76e18e9705b0/control-claude-code-from-your-phone-with-claw-b8f</link>
      <guid>https://dev.to/raullen_chai_76e18e9705b0/control-claude-code-from-your-phone-with-claw-b8f</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;You're deep in a Claude Code session. It's working through a complex task.&lt;/p&gt;

&lt;p&gt;But you need to step away - grab coffee, take a call, pick up kids.&lt;/p&gt;

&lt;p&gt;What do you do? Leave it running and hope nothing goes wrong?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Claw
&lt;/h2&gt;

&lt;p&gt;I built &lt;strong&gt;Claw&lt;/strong&gt; (CLaude AnyWhere), a zero-dependency Python tool that lets you monitor and control Claude Code from any device with a browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv0ur0janrsw4vm9ek5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv0ur0janrsw4vm9ek5x.png" alt="Claw Screenshot" width="800" height="786"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;👀 &lt;strong&gt;Live terminal view&lt;/strong&gt; - see what Claude is doing in real-time&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Quick actions&lt;/strong&gt; - tap &lt;code&gt;yes&lt;/code&gt;, &lt;code&gt;no&lt;/code&gt;, &lt;code&gt;continue&lt;/code&gt;, or &lt;code&gt;Ctrl+C&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;📱 &lt;strong&gt;Mobile-first&lt;/strong&gt; - designed for phones with pull-to-refresh&lt;/li&gt;
&lt;li&gt;🌐 &lt;strong&gt;Access anywhere&lt;/strong&gt; - &lt;code&gt;--share&lt;/code&gt; flag creates instant public URL&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
bash
  # Install
  pip install claw-cli

  # Run with remote access
  claw --share

  That's it. Open the URL on your phone. You're in control.

  How It Works

  Claw is a lightweight HTTP server that:
  1. Captures tmux pane content in real-time
  2. Sends keystrokes via tmux send-keys
  3. Serves a mobile-optimized dashboard

  No dependencies beyond Python stdlib. Works on macOS, Linux, and Windows
  (WSL).

  Try It Out

  GitHub: https://github.com/raullenchai/claw
  PyPI: https://pypi.org/project/claw-cli/

  Contributions welcome! Check out our good first issue labels.

  ---
  Built for developers who got tired of walking back to their desks 🦞
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
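The capture/control loop boils down to two tmux invocations; a minimal sketch of the pattern (the session name is illustrative):

```python
import subprocess

def tmux_capture_cmd(session):
    # '-p' prints the pane contents to stdout.
    return ["tmux", "capture-pane", "-p", "-t", session]

def tmux_send_cmd(session, text):
    # Types `text` into the pane and presses Enter.
    return ["tmux", "send-keys", "-t", session, text, "Enter"]

def capture_pane(session="claude"):
    """What the dashboard polls and streams to your phone."""
    done = subprocess.run(tmux_capture_cmd(session),
                          capture_output=True, text=True)
    return done.stdout

def send_keys(session, text):
    """Backs a quick-action button: forward 'yes', 'no', etc."""
    subprocess.run(tmux_send_cmd(session, text), check=True)
```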

</description>
      <category>opensource</category>
      <category>python</category>
      <category>cli</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
