A real-world, copy-paste guide to running a personal WhatsApp AI agent entirely on-device on Apple Silicon, with zero per-token API billing. Two agents from one config (a full-access private assistant and a sandboxed public one), swappable local LLM backends (Ollama and MLX), local voice (TTS + STT), and LaunchAgents so everything survives reboots.
Tested on a Mac Studio M3 Ultra (96 GB unified memory), OpenClaw 2026.5.20, Ollama 0.24.0, mlx-lm 0.31.3.
TL;DR findings
- You don't need the cloud. A 26B-class model (Gemma 4 26B-A4B, a 4B-active MoE) is plenty for a chatty personal agent and runs comfortably in well under half of 96 GB.
- Ollama and MLX can coexist as two OpenClaw providers; flip the agent's primary model with a one-line config change.
- Benchmark one model at a time. Two large models resident at once throttle each other on memory bandwidth: it nearly halved throughput and produced a totally wrong "Ollama is faster" conclusion until I unloaded the idle one. In isolation the MLX OptiQ-4bit build hit ~73 tok/s vs a contended ~35.
- MLX has no separate "warm-up" problem. mlx_lm.server loads the model at process start and holds it; the LaunchAgent keeps it alive. Ollama lazily unloads, so it needs OLLAMA_KEEP_ALIVE + a tiny warm-up ping.
0. Prerequisites
# OpenClaw (the agent gateway)
npm install -g openclaw@latest
# Ollama (llama.cpp backend)
brew install ollama
# MLX (Apple-silicon-native inference) in an isolated venv
python3 -m venv ~/mlx-env
~/mlx-env/bin/pip install -U mlx-lm
# ffmpeg (audio transcode for STT)
brew install ffmpeg
Throughout, replace these placeholders with your own values:
| Placeholder | Meaning |
|---|---|
| <YOUR_NUMBER_E164> | your WhatsApp number, e.g. +15551234567 |
| <YOUR_GATEWAY_TOKEN> | a random secret (openssl rand -hex 24) |
| <YOUR_GROUP_ID>@g.us | a WhatsApp group id (optional) |
| you@example.com | your provider OAuth email (optional cloud fallback) |
| /Users/you | your home directory |
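If you would rather generate and sanity-check these values from Python, here is a minimal sketch. The helper names (`is_e164`, `new_gateway_token`) are mine, not part of OpenClaw; the regex encodes the usual E.164 shape (a plus sign, then up to 15 digits with a non-zero first digit).

```python
import re
import secrets

# E.164: "+", then 2 to 15 digits, the first of which is non-zero.
E164 = re.compile(r"\+[1-9]\d{1,14}")


def is_e164(number: str) -> bool:
    """Return True if `number` looks like an E.164 phone number."""
    return bool(E164.fullmatch(number))


def new_gateway_token(n_bytes: int = 24) -> str:
    """Random hex secret; 24 bytes gives 48 hex characters, like `openssl rand -hex 24`."""
    return secrets.token_hex(n_bytes)


print(is_e164("+15551234567"))    # the example number from the table
print(len(new_gateway_token()))   # length of the generated secret
```

A number without the leading "+" fails the check, which is the most common way to get the DM binding in §4 silently not matching.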
1. Architecture
              ┌───────────────────────────────────────────┐
WhatsApp ───► │ OpenClaw gateway (loopback :18789)        │
              │                                           │
              │ agent "private" ── full tools             │
              │ agent "public"  ── sandboxed (no bash)    │
              └───────┬─────────────────────┬─────────────┘
                      │   model providers   │
         ┌────────────▼─────────┐ ┌─────────▼──────────────┐
         │ ollama :11434        │ │ mlx :8080              │
         │ (llama.cpp / Metal)  │ │ mlx_lm.server (OpenAI- │
         │                      │ │ compatible endpoint)   │
         └──────────────────────┘ └────────────────────────┘
         voice: OmniVoice TTS :17494 · mlx-whisper STT :17495
One config defines two agents: a full-access private agent bound to your DM, and a locked-down public agent for everyone else.
2. The two local LLM providers
Both providers live under models.providers in ~/.openclaw/openclaw.json. Ollama speaks its native API; MLX is exposed as an OpenAI-compatible server, so OpenClaw talks to it with api: openai-completions.
{
"models": {
"providers": {
"ollama": {
"api": "ollama",
"apiKey": "ollama-local",
"baseUrl": "http://127.0.0.1:11434",
"models": [
{ "id": "gemma4:26b-a4b-it-q8_0", "name": "Gemma 4 26B (Q8_0)",
"contextWindow": 131072, "input": ["text","image"], "reasoning": true,
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 } }
]
},
"mlx": {
"api": "openai-completions",
"apiKey": "mlx",
"baseUrl": "http://127.0.0.1:8080/v1",
"models": [
{ "id": "mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
"api": "openai-completions", "name": "Gemma 4 26B-A4B OptiQ-4bit (MLX)",
"contextWindow": 131072, "input": ["text"], "reasoning": true, "maxTokens": 4096,
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 } }
]
}
}
}
}
Every cost field is 0 because these models are local and free, which also keeps OpenClaw's usage accounting honest.
3. Wiring models into the agent + aliases
Under agents.defaults, register the models (with short aliases for quick swaps) and pick the primary:
{
"agents": {
"defaults": {
"workspace": "/Users/you/.openclaw/workspace",
"model": {
"primary": "mlx/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit"
},
"models": {
"ollama/gemma4:26b-a4b-it-q8_0": { "alias": "gemma4-26b-q8", "params": { "think": true } },
"mlx/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit": { "alias": "gemma4-26b-optiq", "params": { "think": true } }
}
}
}
}
Swap Mika's brain (the agent's primary model) with a one-liner, then restart the gateway:
openclaw config set agents.defaults.model.primary 'ollama/gemma4:26b-a4b-it-q8_0'
openclaw gateway restart
openclaw config get agents.defaults.model.primary # verify
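A typo in the model ref is easy to make when swapping by hand, so a quick consistency check can help. This is a minimal sketch, assuming the config file is plain JSON and that a model ref is always "<provider name>/<model id>" as in the snippets above; `provider_refs`, `check_primary`, and `load_config` are hypothetical helpers, not OpenClaw APIs.

```python
import json
from pathlib import Path


def provider_refs(config: dict) -> set[str]:
    """All "<provider>/<model id>" strings defined under models.providers."""
    providers = config.get("models", {}).get("providers", {})
    return {
        f"{name}/{model['id']}"
        for name, provider in providers.items()
        for model in provider.get("models", [])
    }


def check_primary(config: dict) -> str:
    """Return the primary model ref, or raise if no provider defines it."""
    primary = config["agents"]["defaults"]["model"]["primary"]
    if primary not in provider_refs(config):
        raise ValueError(f"primary {primary!r} is not defined by any provider")
    return primary


def load_config(path: Path = Path.home() / ".openclaw" / "openclaw.json") -> dict:
    """Read the config; assumes strict JSON (no comments or trailing commas)."""
    return json.loads(path.read_text())
```

Running `check_primary(load_config())` after `openclaw config set` either echoes the ref or tells you which one has no matching provider entry.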
4. Two agents from one config (private vs public)
This is the underrated trick: define a locked-down public persona alongside the full-access private one. The public agent denies the dangerous tools.
{
"agents": {
"list": [
{
"id": "private",
"name": "Mika",
"workspace": "/Users/you/.openclaw/workspace"
},
{
"id": "public",
"name": "Mika (Public)",
"workspace": "/Users/you/.openclaw/workspace/public",
"tools": { "deny": ["bash", "process", "web_search"], "exec": {} }
}
]
},
"bindings": [
{ "agentId": "private", "match": { "channel": "whatsapp", "peer": { "id": "<YOUR_NUMBER_E164>", "kind": "dm" } } },
{ "agentId": "public", "match": { "channel": "whatsapp" } }
]
}
Your DM hits private (can run shell, browse, etc.); everyone else hits public (chat + safe tools only).
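To make the routing concrete, here is a small model of how I read those bindings. It assumes bindings are tried in order and the first match wins, and that a binding without a peer matches anyone on the channel; that is my interpretation of the config, not a confirmed description of OpenClaw's matcher. `matches` and `route` are illustrative names.

```python
BINDINGS = [
    {"agentId": "private",
     "match": {"channel": "whatsapp", "peer": {"id": "+15551234567", "kind": "dm"}}},
    {"agentId": "public", "match": {"channel": "whatsapp"}},
]


def matches(match: dict, message: dict) -> bool:
    """A binding matches when every field it specifies equals the message's."""
    if "channel" in match and match["channel"] != message["channel"]:
        return False
    peer = match.get("peer")
    if peer is None:  # no peer constraint: anyone on the channel matches
        return True
    return (peer.get("id") == message["peer"]["id"]
            and peer.get("kind") == message["peer"]["kind"])


def route(bindings: list, message: dict):
    """Return the agentId of the first matching binding, or None."""
    for binding in bindings:
        if matches(binding["match"], message):
            return binding["agentId"]
    return None


owner = {"channel": "whatsapp", "peer": {"id": "+15551234567", "kind": "dm"}}
stranger = {"channel": "whatsapp", "peer": {"id": "+15557654321", "kind": "dm"}}
print(route(BINDINGS, owner), route(BINDINGS, stranger))
```

Under that first-match assumption the order matters: if the catch-all public binding came first, your own DM would land on the sandboxed agent too.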
5. Ollama: tuning + keeping the model warm
Ollama lazily unloads models after OLLAMA_KEEP_ALIVE (default 5 min), so the next request pays a cold-start. Two fixes: tune the service env, and pre-warm on boot.
5a. Service env (LaunchAgent: ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist)
<key>EnvironmentVariables</key>
<dict>
<key>OLLAMA_FLASH_ATTENTION</key> <string>1</string>
<key>OLLAMA_KEEP_ALIVE</key> <string>24h</string>
<key>OLLAMA_KV_CACHE_TYPE</key> <string>q8_0</string>
</dict>
launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist
launchctl load ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist
5b. Warm-up script — ~/ollama-warmup.sh
#!/bin/bash
# Pre-warm an Ollama model into GPU memory after the service starts.
MODEL="${1:-gemma4:26b-a4b-it-q8_0}"
MAX_RETRIES=30; RETRY_INTERVAL=2
echo "[warmup] waiting for Ollama..."
for i in $(seq 1 "$MAX_RETRIES"); do
  # /api/tags answers as soon as the server is up
  if curl -sf http://localhost:11434/api/tags >/dev/null 2>&1; then
    echo "[warmup] loading $MODEL..."
    # a tiny prompt forces the load; -f makes HTTP errors (e.g. unknown model) fail the step
    if curl -sf http://localhost:11434/api/generate \
      -d "{\"model\":\"$MODEL\",\"prompt\":\"hi\",\"stream\":false,\"keep_alive\":\"24h\"}" >/dev/null 2>&1; then
      echo "[warmup] $MODEL warm."; exit 0
    fi
    echo "[warmup] failed to load $MODEL"; exit 1
  fi
  sleep "$RETRY_INTERVAL"
done
echo "[warmup] Ollama did not start in time"; exit 1
5c. Warm-up LaunchAgent — ~/Library/LaunchAgents/com.ollama.warmup.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
<key>Label</key><string>com.ollama.warmup</string>
<key>ProgramArguments</key>
<array>
<string>/Users/you/ollama-warmup.sh</string>
<string>gemma4:26b-a4b-it-q8_0</string>
</array>
<key>RunAtLoad</key><true/>
<key>StandardOutPath</key><string>/tmp/ollama-warmup.log</string>
<key>StandardErrorPath</key><string>/tmp/ollama-warmup.log</string>
</dict></plist>
chmod +x ~/ollama-warmup.sh
launchctl load ~/Library/LaunchAgents/com.ollama.warmup.plist
# disable later (keeps the file): launchctl unload -w ~/Library/LaunchAgents/com.ollama.warmup.plist
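To confirm the warm-up actually left the model resident, you can ask Ollama which models are loaded. A minimal sketch using Ollama's /api/ps endpoint (the same data `ollama ps` prints); the helper names are mine, and the payload shape is assumed to be a "models" list whose entries carry a "name".

```python
import json
import urllib.request


def loaded_models(ps_payload: dict) -> list[str]:
    """Names of the models Ollama reports as currently loaded."""
    return [m["name"] for m in ps_payload.get("models", [])]


def is_warm(model: str, ps_payload: dict) -> bool:
    """True if `model` appears in the loaded list."""
    return model in loaded_models(ps_payload)


def fetch_ps(base_url: str = "http://127.0.0.1:11434") -> dict:
    """GET /api/ps from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)

# Usage, with Ollama running:
#   is_warm("gemma4:26b-a4b-it-q8_0", fetch_ps())
```

The same check is handy before benchmarking in §7: an unexpectedly non-empty list means something else is still resident.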
6. MLX: a persistent OpenAI-compatible server
mlx_lm.server loads the model at startup and holds it for the life of the process — so the LaunchAgent is the warm-up. Pull a model once (it caches under ~/.cache/huggingface):
~/mlx-env/bin/hf download mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit
LaunchAgent — ~/Library/LaunchAgents/com.mlx-lm.server.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
<key>Label</key><string>com.mlx-lm.server</string>
<key>ProgramArguments</key>
<array>
<string>/Users/you/mlx-env/bin/mlx_lm.server</string>
<string>--model</string><string>mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit</string>
<string>--port</string><string>8080</string>
</array>
<key>EnvironmentVariables</key>
<dict><key>PATH</key><string>/Users/you/mlx-env/bin:/opt/homebrew/bin:/usr/bin:/bin</string></dict>
<key>RunAtLoad</key><true/>
<key>KeepAlive</key><true/>
<key>StandardOutPath</key><string>/tmp/mlx-lm-server.log</string>
<key>StandardErrorPath</key><string>/tmp/mlx-lm-server.log</string>
</dict></plist>
launchctl load ~/Library/LaunchAgents/com.mlx-lm.server.plist
curl -s http://127.0.0.1:8080/v1/models | python3 -m json.tool # confirm it's serving
Why OptiQ?
OptiQ-4bit is a mixed-precision MLX quant tuned for MoE: it keeps the router/gating layers at 8-bit and the experts at 4-bit, so you get near-8-bit quality at ~4-bit size (~16 GB on disk).
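A back-of-envelope check on those sizes, weights only. Everything here is an assumption for illustration: "26B" is taken as exactly 26e9 parameters, 4-bit group quantization is costed at roughly 4.5 bits per weight (4 bits plus per-group scale and bias overhead), and Q8_0 at 8.5 bits per weight (32 int8 values plus one 16-bit scale per block). The real files differ because some layers are kept at higher precision.

```python
def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 26e9  # assumption: "26B" taken literally

print(f"4-bit groups: {weights_gb(PARAMS, 4.5):.1f} GB")
print(f"Q8_0:         {weights_gb(PARAMS, 8.5):.1f} GB")
```

That puts the 4-bit build in the mid-teens of GB and the Q8_0 build in the high twenties, in line with the ~16 GB on disk quoted above once the 8-bit router layers are added, and roughly half the footprint of Q8_0.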
7. Benchmarking — and the gotcha that almost fooled me
Same model family, same prompt, 200-token generation, steady-state, wall-clock:
| Backend / build | Decode speed | Resident |
|---|---|---|
| MLX gemma-4-26B-A4B-it-OptiQ-4bit (isolated) | ~73 tok/s | ~17 GB |
| Ollama gemma4:26b-a4b-it-q8_0 | ~60 tok/s | ~33 GB |
| MLX OptiQ, while Ollama Q8_0 also resident | ~35 tok/s | (contended) |
# Ollama — exact decode rate from the API (excludes prompt + load)
curl -s http://localhost:11434/api/generate -d '{
"model":"gemma4:26b-a4b-it-q8_0","prompt":"Count from 1 to 100 slowly.",
"stream":false,"think":false,"options":{"num_predict":200,"temperature":0}}' \
| python3 -c 'import json,sys;d=json.load(sys.stdin);print(round(d["eval_count"]/(d["eval_duration"]/1e9),1),"tok/s")'
# MLX — wall-clock over completion_tokens (short prompt ⇒ prompt time negligible)
time curl -s http://127.0.0.1:8080/v1/completions -d '{
"model":"mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
"prompt":"Count from 1 to 100 slowly.","max_tokens":200,"temperature":0}' >/dev/null
⚠️ The lesson: the first time I ran this, MLX clocked ~35 tok/s and I "concluded" Ollama was 1.7× faster. Wrong. The 33 GB Ollama model was still resident and the two were fighting over memory bandwidth. Unload everything but the model under test (ollama stop <model>), then measure. In its real deployed condition (only the MLX model resident) OptiQ runs ~73 tok/s, which is both faster and lighter.
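The `time curl` approach means dividing by hand and trusting that exactly 200 tokens came back. Here is a minimal sketch that does the division for you, assuming the server returns an OpenAI-style `usage.completion_tokens` field in the response; `bench_completions` and `tokens_per_second` are hypothetical helper names.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode rate from a token count and a wall-clock duration."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def bench_completions(base_url: str, model: str, prompt: str, max_tokens: int = 200) -> float:
    """POST one completion and return wall-clock tok/s (includes prompt time)."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": max_tokens, "temperature": 0}).encode()
    req = urllib.request.Request(f"{base_url}/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=300) as resp:
        payload = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(payload["usage"]["completion_tokens"], elapsed)

# Usage, with only the model under test resident:
#   bench_completions("http://127.0.0.1:8080/v1",
#                     "mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
#                     "Count from 1 to 100 slowly.")
```

Because the timer wraps the whole request, this slightly understates pure decode speed; with a short prompt the gap is small, which matches the caveat in the shell comment above.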
8. Bonus: fully-local voice (TTS + STT)
OpenClaw treats both as OpenAI-compatible/CLI endpoints, so no cloud and no keys.
TTS — OmniVoice on :17494 (messages.tts)
{
"messages": {
"tts": {
"provider": "openai",
"auto": "always",
"providers": {
"openai": {
"enabled": true,
"baseUrl": "http://127.0.0.1:17494/v1",
"apiKey": "omnivoice-local",
"model": "omnivoice",
"voice": "female-young-pt"
}
}
}
}
}
(OmniVoice runs behind a small Python wrapper that exposes /v1/audio/speech, kept alive by its own LaunchAgent — same pattern as the STT server below.)
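For a quick smoke test of the TTS endpoint outside OpenClaw, the sketch below builds the request by hand. It assumes the wrapper accepts the standard OpenAI speech payload (model, voice, input) and a bearer token; whether OmniVoice's wrapper checks the key at all is not something this guide confirms. `speech_request` and `synthesize` are illustrative names.

```python
import json
import urllib.request


def speech_request(text: str,
                   base_url: str = "http://127.0.0.1:17494/v1",
                   model: str = "omnivoice",
                   voice: str = "female-young-pt",
                   api_key: str = "omnivoice-local") -> urllib.request.Request:
    """Build a POST to /audio/speech with an OpenAI-style JSON body."""
    body = json.dumps({"model": model, "voice": voice, "input": text}).encode()
    return urllib.request.Request(
        f"{base_url}/audio/speech", data=body, method="POST",
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"})


def synthesize(text: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to `out_path`."""
    with urllib.request.urlopen(speech_request(text), timeout=60) as resp, \
            open(out_path, "wb") as out:
        out.write(resp.read())

# Usage, with the TTS server running:
#   synthesize("Olá, tudo bem?", "/tmp/tts-test.audio")
```

The defaults mirror the messages.tts block above, so a failure here points at the server rather than at the OpenClaw config.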
STT — mlx-whisper on :17495
~/whisper-server.py (FastAPI wrapper around mlx_whisper, loads whisper-large-v3-turbo once):
#!/usr/bin/env python3
"""HTTP wrapper for mlx-whisper STT. Loads model once, serves many requests."""
import os, tempfile, time
from contextlib import asynccontextmanager

import mlx_whisper
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import JSONResponse

PORT = int(os.environ.get("WHISPER_PORT", "17495"))
MODEL_REPO = os.environ.get("WHISPER_MODEL", "mlx-community/whisper-large-v3-turbo")


@asynccontextmanager
async def lifespan(_app: FastAPI):
    # warm the model on a moment of silence so the first real request is fast
    silent = os.path.join(tempfile.gettempdir(), "whisper-warmup.wav")
    if not os.path.exists(silent):
        import wave
        with wave.open(silent, "wb") as w:
            w.setnchannels(1); w.setsampwidth(2); w.setframerate(16000)
            w.writeframes(b"\x00\x00" * 16000)  # one second of 16-bit mono silence
    mlx_whisper.transcribe(silent, path_or_hf_repo=MODEL_REPO)
    yield


app = FastAPI(lifespan=lifespan)


@app.get("/health")
async def health():
    return {"status": "ok", "model": MODEL_REPO, "port": PORT}


@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...), language: str | None = Form(None), model: str | None = Form(None)):
    started = time.time()
    suffix = os.path.splitext(file.filename or "")[1] or ".wav"
    # mlx_whisper wants a path, so spool the upload to a temp file first
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(await file.read())
        path = tmp.name
    try:
        kw = {"path_or_hf_repo": model or MODEL_REPO}
        if language and language.lower() not in ("auto", ""):
            kw["language"] = language
        r = mlx_whisper.transcribe(path, **kw)
        return JSONResponse({"text": (r.get("text") or "").strip(),
                             "language": r.get("language"),
                             "duration": round(time.time() - started, 3)})
    finally:
        try:
            os.unlink(path)
        except OSError:
            pass


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=PORT, log_level="info")
Install + LaunchAgent (~/Library/LaunchAgents/com.mlx-whisper.server.plist):
~/mlx-env/bin/pip install mlx-whisper fastapi 'uvicorn[standard]' python-multipart
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
<key>Label</key><string>com.mlx-whisper.server</string>
<key>ProgramArguments</key>
<array><string>/Users/you/mlx-env/bin/python3</string><string>/Users/you/whisper-server.py</string></array>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key><string>/opt/homebrew/bin:/usr/bin:/bin</string> <!-- mlx-whisper shells out to ffmpeg -->
<key>WHISPER_PORT</key><string>17495</string>
<key>WHISPER_MODEL</key><string>mlx-community/whisper-large-v3-turbo</string>
</dict>
<key>RunAtLoad</key><true/><key>KeepAlive</key><true/>
<key>StandardOutPath</key><string>/tmp/mlx-whisper-server.log</string>
<key>StandardErrorPath</key><string>/tmp/mlx-whisper-server.log</string>
</dict></plist>
Hook it into OpenClaw as the audio model (tools.media) — transcode to WAV, POST, return text:
{
"tools": {
"media": {
"audio": { "enabled": true },
"models": [{
"type": "cli", "command": "bash", "provider": "mlx-whisper", "model": "whisper-large-v3-turbo",
"capabilities": ["audio"],
"args": ["-c",
"TMP=$(mktemp -t stt) && /opt/homebrew/bin/ffmpeg -y -i \"$1\" -f wav \"$TMP\" 2>/dev/null && curl -s -X POST http://localhost:17495/transcribe -F \"file=@$TMP;filename=stt.wav;type=audio/wav\" -F 'language=pt' | python3 -c \"import json,sys;d=json.load(sys.stdin);print(d.get('text',''))\" ; rm -f \"$TMP\"",
"--"]
}]
}
}
}
Tip: whisper-base hallucinates a trailing phantom phrase on short clips. whisper-large-v3-turbo (≈809M, ~1.5 GB) fixes it at ~0.3 s/clip on an M3 Ultra.
9. Gotchas worth knowing
- Co-residency throttles throughput: see §7. Keep one big model resident; ollama stop <model> frees it instantly.
- Slower model ⇒ blown cron timeouts. A research-heavy scheduled job that finished on a fast quant can time out on a slower one. Bump the job's timeoutSeconds, or give the cron its own faster model.
- Stale "typing…" indicator. If WhatsApp's connection drops mid-turn, the "composing" presence may never get its "stop." It's cosmetic; an openclaw gateway restart clears it.
- Reasoning models eat your token budget. With think: true and a tiny max_tokens, the model can spend the whole budget "thinking" and return empty content. Give it headroom.
- A cloud fallback can stay configured (e.g. an OpenAI/Anthropic OAuth profile) without ever spending API credits; just don't make it primary. OAuth access tokens expire and re-auth is interactive, so don't rely on it for unattended jobs.
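Two of these gotchas are easy to turn into checks. The sketch below assumes an OpenAI-style chat completion response (a `choices` list with `finish_reason` and `message.content`) and a safety margin of 2× on timeouts; both the margin and the helper names are my own choices, not OpenClaw settings.

```python
import math


def thinking_ate_budget(response: dict) -> bool:
    """True when generation stopped at the token limit with no visible answer."""
    choice = response["choices"][0]
    content = (choice.get("message", {}).get("content") or "").strip()
    return choice.get("finish_reason") == "length" and not content


def min_timeout_seconds(max_tokens: int, tok_per_s: float, margin: float = 2.0) -> int:
    """Worst-case generation time for `max_tokens`, padded by `margin`."""
    return math.ceil(max_tokens / tok_per_s * margin)


# A 4096-token budget at the contended vs isolated speeds from §7:
print(min_timeout_seconds(4096, 35))
print(min_timeout_seconds(4096, 73))
```

The first helper tells you when to raise max_tokens instead of debugging an "empty" reply; the second shows how far a timeoutSeconds value tuned for the fast build can fall short on a slower one.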
10. Service map (what autostarts)
| LaunchAgent | Purpose | Port |
|---|---|---|
| ai.openclaw.gateway | the agent gateway | 18789 (loopback) |
| homebrew.mxcl.ollama | Ollama backend | 11434 |
| com.ollama.warmup | pre-warm Ollama model (optional) | n/a |
| com.mlx-lm.server | MLX LLM (OpenAI-compatible) | 8080 |
| com.mlx-whisper.server | STT | 17495 |
| com.omnivoice.server | TTS | 17494 |
launchctl list | grep -E 'openclaw|ollama|mlx|omnivoice'
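`launchctl list` shows that a job is registered, not that its port is answering. As a complement, here is a minimal sketch that tries a TCP connect to each port from the table; the `SERVICES` labels and helper names are illustrative.

```python
import socket

# Ports from the service map above (all on loopback).
SERVICES = {
    "openclaw gateway": 18789,
    "ollama": 11434,
    "mlx-lm server": 8080,
    "mlx-whisper STT": 17495,
    "omnivoice TTS": 17494,
}


def port_open(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def status(services: dict = SERVICES) -> dict:
    """Map each service label to whether its port is accepting connections."""
    return {name: port_open(port) for name, port in services.items()}

# Usage:
#   for name, up in status().items():
#       print(f"{name:18} {'up' if up else 'DOWN'}")
```

A listening port only proves the process is alive; the /v1/models and /health calls earlier in the guide are still the way to confirm a model is actually loaded.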
Built and debugged interactively with Claude Code. Numbers are from a single M3 Ultra (96 GB); your mileage will vary with chip, RAM, and quant.