A real-world, copy-paste guide to running a personal WhatsApp AI agent entirely on-device on Apple Silicon, with zero per-token API billing. Two agents from one config (a full-access private assistant and a sandboxed public one), swappable local LLM backends (Ollama and MLX), local voice (TTS + STT), and LaunchAgents so everything survives reboots.
Tested on a Mac Studio M3 Ultra (96 GB unified memory), OpenClaw 2026.5.20, Ollama 0.24.0, mlx-lm 0.31.3.
TL;DR findings
- You don't need the cloud. A 26B-class model (Gemma 4 26B-A4B, a 4B-active MoE) is plenty for a chatty personal agent and runs comfortably in well under half of 96 GB.
- Ollama and MLX can coexist as two OpenClaw providers; flip the agent's primary model with a one-line config change.
- Benchmark one model at a time. Two large models resident at once throttle each other on memory bandwidth: it nearly halved throughput and produced a totally wrong "Ollama is faster" conclusion until I unloaded the idle one. In isolation the MLX OptiQ-4bit build hit ~73 tok/s vs a contended ~35.
- MLX has no separate "warm-up" problem. mlx_lm.server loads the model at process start and holds it; the LaunchAgent keeps it alive. Ollama lazily unloads, so it needs OLLAMA_KEEP_ALIVE + a tiny warm-up ping.
0. Prerequisites
# OpenClaw (the agent gateway)
npm install -g openclaw@latest
# Ollama (llama.cpp backend)
brew install ollama
# MLX (Apple-silicon-native inference) in an isolated venv
python3 -m venv ~/mlx-env
~/mlx-env/bin/pip install -U mlx-lm
# ffmpeg (audio transcode for STT)
brew install ffmpeg
Throughout, replace these placeholders with your own values:
| Placeholder | Meaning |
|---|---|
| <YOUR_NUMBER_E164> | your WhatsApp number, e.g. +15551234567 |
| <YOUR_GATEWAY_TOKEN> | a random secret (openssl rand -hex 24) |
| <YOUR_GROUP_ID>@g.us | a WhatsApp group id (optional) |
| you@example.com | your provider OAuth email (optional cloud fallback) |
| /Users/you | your home directory |
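If you prefer to generate and sanity-check these values from Python, a minimal sketch follows. The helper names are hypothetical. `secrets.token_hex(24)` produces 48 hex characters, the same shape as the output of openssl rand -hex 24, and the pattern encodes the usual E.164 rule (a plus sign, then up to 15 digits with no leading zero).

```python
import re
import secrets

# E.164: "+" followed by 2-15 digits, first digit non-zero.
E164_RE = re.compile(r"\+[1-9]\d{1,14}")


def new_gateway_token(n_bytes: int = 24) -> str:
    """Random hex secret; 24 bytes become 48 hex characters."""
    return secrets.token_hex(n_bytes)


def is_e164(number: str) -> bool:
    """True if the whole string looks like an E.164 phone number."""
    return E164_RE.fullmatch(number) is not None
```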
1. Architecture
              ┌───────────────────────────────────────────┐
WhatsApp ───► │  OpenClaw gateway (loopback :18789)       │
              │                                           │
              │  agent "private" ── full tools            │
              │  agent "public"  ── sandboxed (no bash)   │
              └───────┬─────────────────────┬─────────────┘
                      │   model providers   │
         ┌────────────▼─────────┐ ┌─────────▼──────────────┐
         │ ollama :11434        │ │ mlx :8080              │
         │ (llama.cpp / Metal)  │ │ mlx_lm.server (OpenAI- │
         │                      │ │ compatible endpoint)   │
         └──────────────────────┘ └────────────────────────┘
         voice: OmniVoice TTS :17494 · mlx-whisper STT :17495
One config defines two agents: a full-access private agent bound to your DM, and a locked-down public agent for everyone else.
2. The two local LLM providers
Both providers live under models.providers in ~/.openclaw/openclaw.json. Ollama speaks its native API; MLX is exposed as an OpenAI-compatible server, so OpenClaw talks to it with api: openai-completions.
{
"models": {
"providers": {
"ollama": {
"api": "ollama",
"apiKey": "ollama-local",
"baseUrl": "http://127.0.0.1:11434",
"models": [
{ "id": "gemma4:26b-a4b-it-q8_0", "name": "Gemma 4 26B (Q8_0)",
"contextWindow": 131072, "input": ["text","image"], "reasoning": true,
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 } }
]
},
"mlx": {
"api": "openai-completions",
"apiKey": "mlx",
"baseUrl": "http://127.0.0.1:8080/v1",
"models": [
{ "id": "mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
"api": "openai-completions", "name": "Gemma 4 26B-A4B OptiQ-4bit (MLX)",
"contextWindow": 131072, "input": ["text"], "reasoning": true, "maxTokens": 4096,
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 } }
]
}
}
}
}
cost: 0 everywhere, because these models are local and free. It also keeps OpenClaw's usage accounting honest.
3. Wiring models into the agent + aliases
Under agents.defaults, register the models (with short aliases for quick swaps) and pick the primary:
{
"agents": {
"defaults": {
"workspace": "/Users/you/.openclaw/workspace",
"model": {
"primary": "mlx/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit"
},
"models": {
"ollama/gemma4:26b-a4b-it-q8_0": { "alias": "gemma4-26b-q8", "params": { "think": true } },
"mlx/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit": { "alias": "gemma4-26b-optiq", "params": { "think": true } }
}
}
}
}
Swap Mika's brain (Mika is the agent's name, set in §4) with a one-liner, then restart the gateway:
openclaw config set agents.defaults.model.primary 'ollama/gemma4:26b-a4b-it-q8_0'
openclaw gateway restart
openclaw config get agents.defaults.model.primary # verify
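Before flipping primary, it can help to confirm that the reference resolves to a model declared under models.providers. The sketch below assumes, from the examples above, that a reference has the form provider/model-id with the provider being everything before the first slash, and that openclaw.json is plain JSON. `split_model_ref`, `is_declared`, and `load_config` are hypothetical helpers, not part of OpenClaw.

```python
import json
from pathlib import Path


def split_model_ref(ref: str) -> tuple[str, str]:
    """Split 'provider/model-id' on the FIRST slash (model ids may contain slashes)."""
    provider, sep, model_id = ref.partition("/")
    if not sep or not provider or not model_id:
        raise ValueError(f"not a provider/model reference: {ref!r}")
    return provider, model_id


def is_declared(config: dict, ref: str) -> bool:
    """True if the reference names a model listed under models.providers."""
    provider, model_id = split_model_ref(ref)
    providers = config.get("models", {}).get("providers", {})
    models = providers.get(provider, {}).get("models", [])
    return any(m.get("id") == model_id for m in models)


def load_config(path: Path = Path.home() / ".openclaw" / "openclaw.json") -> dict:
    """Read the config file (assumed to be plain JSON)."""
    return json.loads(path.read_text())
```

For example, `is_declared(load_config(), "ollama/gemma4:26b-a4b-it-q8_0")` should be true for the config in §2; a typo in the model id would make it false before you restart the gateway.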
4. Two agents from one config (private vs public)
This is the underrated trick: define a locked-down public persona alongside the full-access private one. The public agent denies the dangerous tools.
{
"agents": {
"list": [
{
"id": "private",
"name": "Mika",
"workspace": "/Users/you/.openclaw/workspace"
},
{
"id": "public",
"name": "Mika (Public)",
"workspace": "/Users/you/.openclaw/workspace/public",
"tools": { "deny": ["bash", "process", "web_search"], "exec": {} }
}
]
},
"bindings": [
{ "agentId": "private", "match": { "channel": "whatsapp", "peer": { "id": "<YOUR_NUMBER_E164>", "kind": "dm" } } },
{ "agentId": "public", "match": { "channel": "whatsapp" } }
]
}
Your DM hits private (can run shell, browse, etc.); everyone else hits public (chat + safe tools only).
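To make the routing concrete, here is a toy model of how the two bindings could resolve. It assumes first-match-wins in list order and that a binding without a peer matches any peer on that channel. That agrees with the behavior described above, but OpenClaw's real matcher may work differently, and `resolve_agent` is a hypothetical function.

```python
from __future__ import annotations


def binding_matches(match: dict, channel: str, peer_id: str, kind: str) -> bool:
    """Channel must agree; if a peer is specified, its id and kind must agree too."""
    if match.get("channel") != channel:
        return False
    peer = match.get("peer")
    if peer is None:
        return True  # no peer constraint: catch-all for this channel
    return peer.get("id") == peer_id and peer.get("kind") == kind


def resolve_agent(bindings: list[dict], channel: str, peer_id: str, kind: str) -> str | None:
    """Return the agentId of the first matching binding, or None if nothing matches."""
    for binding in bindings:
        if binding_matches(binding["match"], channel, peer_id, kind):
            return binding["agentId"]
    return None


# Same shape as the bindings above, with the example number from the placeholder table.
BINDINGS = [
    {"agentId": "private", "match": {"channel": "whatsapp",
                                     "peer": {"id": "+15551234567", "kind": "dm"}}},
    {"agentId": "public", "match": {"channel": "whatsapp"}},
]
```

Under this assumption, order matters: if the catch-all public binding came first, it would also swallow your own DM.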
5. Ollama: tuning + keeping the model warm
Ollama lazily unloads models after OLLAMA_KEEP_ALIVE (default 5 min), so the next request pays a cold-start. Two fixes: tune the service env, and pre-warm on boot.
5a. Service env (LaunchAgent: ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist)
<key>EnvironmentVariables</key>
<dict>
<key>OLLAMA_FLASH_ATTENTION</key> <string>1</string>
<key>OLLAMA_KEEP_ALIVE</key> <string>24h</string>
<key>OLLAMA_KV_CACHE_TYPE</key> <string>q8_0</string>
</dict>
launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist
launchctl load ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist
5b. Warm-up script — ~/ollama-warmup.sh
#!/bin/bash
# Pre-warm an Ollama model into GPU after service start.
MODEL="${1:-gemma4:26b-a4b-it-q8_0}"
MAX_RETRIES=30; RETRY_INTERVAL=2
echo "[warmup] waiting for Ollama..."
for i in $(seq 1 $MAX_RETRIES); do
if curl -s http://localhost:11434/api/tags >/dev/null 2>&1; then
echo "[warmup] loading $MODEL..."
curl -s http://localhost:11434/api/generate \
-d "{\"model\":\"$MODEL\",\"prompt\":\"hi\",\"stream\":false,\"keep_alive\":\"24h\"}" >/dev/null 2>&1
echo "[warmup] $MODEL warm."; exit 0
fi
sleep $RETRY_INTERVAL
done
echo "[warmup] Ollama did not start in time"; exit 1
5c. Warm-up LaunchAgent — ~/Library/LaunchAgents/com.ollama.warmup.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
<key>Label</key><string>com.ollama.warmup</string>
<key>ProgramArguments</key>
<array>
<string>/Users/you/ollama-warmup.sh</string>
<string>gemma4:26b-a4b-it-q8_0</string>
</array>
<key>RunAtLoad</key><true/>
<key>StandardOutPath</key><string>/tmp/ollama-warmup.log</string>
<key>StandardErrorPath</key><string>/tmp/ollama-warmup.log</string>
</dict></plist>
chmod +x ~/ollama-warmup.sh
launchctl load ~/Library/LaunchAgents/com.ollama.warmup.plist
# disable later (keeps the file): launchctl unload -w ~/Library/LaunchAgents/com.ollama.warmup.plist
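To check that the warm-up actually left the model resident, you can query Ollama's /api/ps endpoint, which lists currently loaded models. A minimal sketch, assuming the response has a top-level "models" list whose entries carry "name" and "size_vram" (field names as I understand the API; verify against your Ollama version). The function names are hypothetical.

```python
import json
import urllib.request


def resident_models(ps_response: dict) -> dict[str, int]:
    """Map model name -> bytes reported in GPU memory, from an /api/ps response body."""
    return {m["name"]: m.get("size_vram", 0) for m in ps_response.get("models", [])}


def is_warm(model: str, ps_response: dict) -> bool:
    """True if the model appears in the list of loaded models."""
    return model in resident_models(ps_response)


def fetch_ps(base_url: str = "http://127.0.0.1:11434") -> dict:
    """GET /api/ps from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)
```

With the server running, `is_warm("gemma4:26b-a4b-it-q8_0", fetch_ps())` tells you whether the next request will pay a cold start.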
6. MLX: a persistent OpenAI-compatible server
mlx_lm.server loads the model at startup and holds it for the life of the process — so the LaunchAgent is the warm-up. Pull a model once (it caches under ~/.cache/huggingface):
~/mlx-env/bin/hf download mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit
LaunchAgent — ~/Library/LaunchAgents/com.mlx-lm.server.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
<key>Label</key><string>com.mlx-lm.server</string>
<key>ProgramArguments</key>
<array>
<string>/Users/you/mlx-env/bin/mlx_lm.server</string>
<string>--model</string><string>mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit</string>
<string>--port</string><string>8080</string>
</array>
<key>EnvironmentVariables</key>
<dict><key>PATH</key><string>/Users/you/mlx-env/bin:/opt/homebrew/bin:/usr/bin:/bin</string></dict>
<key>RunAtLoad</key><true/>
<key>KeepAlive</key><true/>
<key>StandardOutPath</key><string>/tmp/mlx-lm-server.log</string>
<key>StandardErrorPath</key><string>/tmp/mlx-lm-server.log</string>
</dict></plist>
launchctl load ~/Library/LaunchAgents/com.mlx-lm.server.plist
curl -s http://127.0.0.1:8080/v1/models | python3 -m json.tool # confirm it's serving
Why OptiQ?
OptiQ-4bit is a mixed-precision MLX quant tuned for MoE: it keeps the router/gating layers at 8-bit and the experts at 4-bit, so you get near-8-bit quality at ~4-bit size (~16 GB on disk).
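A back-of-the-envelope check of that size claim, as a rough sketch. It assumes a round 26 billion parameters (the exact count is not stated here) and ignores quantization scales, embeddings, and non-weight files, so treat the results as estimates only.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits(size_gb: float, n_params: float) -> float:
    """Average bits per weight implied by a given on-disk size."""
    return size_gb * 1e9 * 8 / n_params


N_PARAMS = 26e9  # assumption: "26B" taken as a round number

pure_4bit = weight_gb(N_PARAMS, 4)    # everything at 4-bit
pure_8bit = weight_gb(N_PARAMS, 8)    # everything at 8-bit
implied = effective_bits(16, N_PARAMS)  # what ~16 GB on disk implies
```

Under these assumptions, pure 4-bit comes to about 13 GB and pure 8-bit to about 26 GB, so ~16 GB implies a bit under 5 bits per weight on average. That is consistent with mostly 4-bit experts plus some layers kept at higher precision.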
7. Benchmarking — and the gotcha that almost fooled me
Same model family, same prompt, 200-token generation, steady-state, wall-clock:
| Backend / build | Decode speed | Resident |
|---|---|---|
| MLX gemma-4-26B-A4B-it-OptiQ-4bit (isolated) | ~73 tok/s | ~17 GB |
| Ollama gemma4:26b-a4b-it-q8_0 | ~60 tok/s | ~33 GB |
| MLX OptiQ — while Ollama Q8_0 also resident | ~35 tok/s | (contended) |
# Ollama — exact decode rate from the API (excludes prompt + load)
curl -s http://localhost:11434/api/generate -d '{
"model":"gemma4:26b-a4b-it-q8_0","prompt":"Count from 1 to 100 slowly.",
"stream":false,"think":false,"options":{"num_predict":200,"temperature":0}}' \
| python3 -c 'import json,sys;d=json.load(sys.stdin);print(round(d["eval_count"]/(d["eval_duration"]/1e9),1),"tok/s")'
# MLX — wall-clock over completion_tokens (short prompt ⇒ prompt time negligible)
time curl -s http://127.0.0.1:8080/v1/completions -d '{
"model":"mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
"prompt":"Count from 1 to 100 slowly.","max_tokens":200,"temperature":0}' >/dev/null
⚠️ The lesson: the first time I ran this, MLX clocked ~35 tok/s and I "concluded" Ollama was 1.7× faster. Wrong. The 33 GB Ollama model was still resident and the two were fighting over memory bandwidth. Unload everything but the model under test (ollama stop <model>), then measure. In its real deployed condition (only the MLX model resident) OptiQ runs ~73 tok/s, which is both faster and lighter.
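The two measurements can also be scripted so both backends report the same unit. The sketch below mirrors the commands above: Ollama's rate comes from its own eval_count and eval_duration counters (nanoseconds), while the MLX rate is wall-clock. It assumes the MLX server returns an OpenAI-style usage.completion_tokens field and falls back to max_tokens if it does not; `bench_mlx` and the other names are hypothetical.

```python
import json
import time
import urllib.request


def ollama_tok_s(resp: dict) -> float:
    """Decode rate from Ollama's counters; eval_duration is in nanoseconds."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)


def wall_clock_tok_s(completion_tokens: int, seconds: float) -> float:
    """Tokens per second over total request time (includes prompt processing)."""
    return completion_tokens / seconds


def bench_mlx(model: str, prompt: str, max_tokens: int = 200,
              base_url: str = "http://127.0.0.1:8080/v1") -> float:
    """Time one /completions request against the MLX server and return tok/s."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": max_tokens, "temperature": 0}).encode()
    req = urllib.request.Request(f"{base_url}/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=300) as r:
        resp = json.load(r)
    elapsed = time.perf_counter() - start
    # Assumption: usage.completion_tokens is present; otherwise assume the full budget was used.
    tokens = resp.get("usage", {}).get("completion_tokens", max_tokens)
    return wall_clock_tok_s(tokens, elapsed)
```

Reading the token count from the response guards against a run that stopped early: dividing a fixed 200 by the elapsed time would then overstate the speed.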
8. Bonus: fully-local voice (TTS + STT)
OpenClaw treats both as OpenAI-compatible/CLI endpoints, so no cloud and no keys.
TTS — OmniVoice on :17494 (messages.tts)
{
"messages": {
"tts": {
"provider": "openai",
"auto": "always",
"providers": {
"openai": {
"enabled": true,
"baseUrl": "http://127.0.0.1:17494/v1",
"apiKey": "omnivoice-local",
"model": "omnivoice",
"voice": "female-young-pt"
}
}
}
}
}
(OmniVoice runs behind a small Python wrapper that exposes /v1/audio/speech, kept alive by its own LaunchAgent — same pattern as the STT server below.)
STT — mlx-whisper on :17495
~/whisper-server.py (FastAPI wrapper around mlx_whisper, loads whisper-large-v3-turbo once):
#!/usr/bin/env python3
"""HTTP wrapper for mlx-whisper STT. Loads model once, serves many requests."""
import os, tempfile, time
from contextlib import asynccontextmanager
import mlx_whisper
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import JSONResponse
PORT = int(os.environ.get("WHISPER_PORT", "17495"))
MODEL_REPO = os.environ.get("WHISPER_MODEL", "mlx-community/whisper-large-v3-turbo")
@asynccontextmanager
async def lifespan(_app: FastAPI):
    # warm the model on a moment of silence so the first real request is fast
    silent = os.path.join(tempfile.gettempdir(), "whisper-warmup.wav")
    if not os.path.exists(silent):
        import wave
        with wave.open(silent, "wb") as w:
            w.setnchannels(1); w.setsampwidth(2); w.setframerate(16000)
            w.writeframes(b"\x00\x00" * 16000)
    mlx_whisper.transcribe(silent, path_or_hf_repo=MODEL_REPO)
    yield
app = FastAPI(lifespan=lifespan)
@app.get("/health")
async def health(): return {"status": "ok", "model": MODEL_REPO, "port": PORT}
@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...), language: str | None = Form(None), model: str | None = Form(None)):
    started = time.time()
    suffix = os.path.splitext(file.filename or "")[1] or ".wav"
    # write the upload to a temp file so mlx_whisper can read it by path
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(await file.read()); path = tmp.name
    try:
        kw = {"path_or_hf_repo": model or MODEL_REPO}
        if language and language.lower() not in ("auto", ""): kw["language"] = language
        r = mlx_whisper.transcribe(path, **kw)
        return JSONResponse({"text": (r.get("text") or "").strip(),
                             "language": r.get("language"),
                             "duration": round(time.time() - started, 3)})
    finally:
        try: os.unlink(path)
        except OSError: pass
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=PORT, log_level="info")
Install + LaunchAgent (~/Library/LaunchAgents/com.mlx-whisper.server.plist):
~/mlx-env/bin/pip install mlx-whisper fastapi 'uvicorn[standard]' python-multipart
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
<key>Label</key><string>com.mlx-whisper.server</string>
<key>ProgramArguments</key>
<array><string>/Users/you/mlx-env/bin/python3</string><string>/Users/you/whisper-server.py</string></array>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key><string>/opt/homebrew/bin:/usr/bin:/bin</string> <!-- mlx-whisper shells out to ffmpeg -->
<key>WHISPER_PORT</key><string>17495</string>
<key>WHISPER_MODEL</key><string>mlx-community/whisper-large-v3-turbo</string>
</dict>
<key>RunAtLoad</key><true/><key>KeepAlive</key><true/>
<key>StandardOutPath</key><string>/tmp/mlx-whisper-server.log</string>
<key>StandardErrorPath</key><string>/tmp/mlx-whisper-server.log</string>
</dict></plist>
Hook it into OpenClaw as the audio model (tools.media) — transcode to WAV, POST, return text:
{
"tools": {
"media": {
"audio": { "enabled": true },
"models": [{
"type": "cli", "command": "bash", "provider": "mlx-whisper", "model": "whisper-large-v3-turbo",
"capabilities": ["audio"],
"args": ["-c",
"TMP=$(mktemp -t stt) && /opt/homebrew/bin/ffmpeg -y -i \"$1\" -f wav \"$TMP\" 2>/dev/null && curl -s -X POST http://localhost:17495/transcribe -F \"file=@$TMP;filename=stt.wav;type=audio/wav\" -F 'language=pt' | python3 -c \"import json,sys;d=json.load(sys.stdin);print(d.get('text',''))\" ; rm -f \"$TMP\"",
"--"]
}]
}
}
}
Tip: whisper-base hallucinates a trailing phantom phrase on short clips. whisper-large-v3-turbo (≈809M, ~1.5 GB) fixes it at ~0.3 s/clip on an M3 Ultra.
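For testing the STT server on its own, without the bash pipeline, a small standard-library client is enough. This is a sketch: it posts a file to the /transcribe endpoint defined above using a hand-built multipart body, and it assumes the input is already a WAV file. `encode_multipart` and `transcribe` are hypothetical helper names.

```python
import json
import urllib.request
import uuid


def encode_multipart(fields: dict[str, str], filename: str, file_bytes: bytes,
                     content_type: str = "audio/wav") -> tuple[bytes, str]:
    """Build a multipart/form-data body; returns (body, Content-Type header value)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # The file part uses the field name "file", matching the server's parameter.
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: {content_type}\r\n\r\n'.encode()
    )
    parts.append(file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(wav_path: str, language: str = "pt",
               url: str = "http://127.0.0.1:17495/transcribe") -> str:
    """POST a WAV file to the local STT server and return the transcribed text."""
    with open(wav_path, "rb") as f:
        body, ctype = encode_multipart({"language": language}, "stt.wav", f.read())
    req = urllib.request.Request(url, data=body, headers={"Content-Type": ctype})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp).get("text", "")
```

Calling `transcribe("/tmp/clip.wav")` (a hypothetical path) against the running server is a quick way to separate server problems from problems in the ffmpeg/curl one-liner.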
9. Gotchas worth knowing
- Co-residency throttles throughput (see §7). Keep one big model resident; ollama stop <model> frees it instantly.
- Slower model ⇒ blown cron timeouts. A research-heavy scheduled job that finished on a fast quant can time out on a slower one. Bump the job's timeoutSeconds, or give the cron its own faster model.
- Stale "typing…" indicator. If WhatsApp's connection drops mid-turn, the "composing" presence may never get its "stop." It's cosmetic; an openclaw gateway restart clears it.
- Reasoning models eat your token budget. With think: true and a tiny max_tokens, the model can spend the whole budget "thinking" and return empty content. Give it headroom.
- A cloud fallback can stay configured (e.g. an OpenAI/Anthropic OAuth profile) without ever spending API credits; just don't make it primary. OAuth access tokens expire and re-auth is interactive, so don't rely on it for unattended jobs.
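The token-budget gotcha is easy to detect programmatically. A minimal sketch, assuming an OpenAI-style chat completion response in which a truncated generation reports finish_reason "length"; `thinking_ate_budget` is a hypothetical helper, and the exact response shape from your backend may differ.

```python
def thinking_ate_budget(resp: dict) -> bool:
    """True if the first choice hit the token limit without producing visible content."""
    choice = (resp.get("choices") or [{}])[0]
    message = choice.get("message") or {}
    content = (message.get("content") or "").strip()
    return choice.get("finish_reason") == "length" and not content
```

When this returns true, retrying with a larger max_tokens is more useful than retrying the same request.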
10. Service map (what autostarts)
| LaunchAgent | Purpose | Port |
|---|---|---|
| ai.openclaw.gateway | the agent gateway | 18789 (loopback) |
| homebrew.mxcl.ollama | Ollama backend | 11434 |
| com.ollama.warmup | pre-warm Ollama model (optional) | none |
| com.mlx-lm.server | MLX LLM (OpenAI-compatible) | 8080 |
| com.mlx-whisper.server | STT | 17495 |
| com.omnivoice.server | TTS | 17494 |
launchctl list | grep -E 'openclaw|ollama|mlx|omnivoice'
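launchctl only shows that a job is registered, not that the service is accepting connections. As a complement, this sketch probes the loopback ports from the table with plain TCP connects. The `SERVICES` mapping and function names are hypothetical, and an open port does not prove the model behind it has finished loading.

```python
import socket

# Ports taken from the service map above.
SERVICES = {
    "ai.openclaw.gateway": 18789,
    "homebrew.mxcl.ollama": 11434,
    "com.mlx-lm.server": 8080,
    "com.mlx-whisper.server": 17495,
    "com.omnivoice.server": 17494,
}


def port_open(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(services: dict[str, int] = SERVICES) -> dict[str, bool]:
    """Map each service label to whether its port currently accepts connections."""
    return {name: port_open(port) for name, port in services.items()}
```

Printing `report()` after a reboot gives a quick view of which services came back.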
Built and debugged interactively with Claude Code. Numbers are from a single M3 Ultra (96 GB); your mileage will vary with chip, RAM, and quant.