Bruno Mello

Running a Fully-Local AI Agent on a Mac Studio — OpenClaw + Ollama + MLX

A real-world, copy-paste guide to running a personal WhatsApp AI agent entirely on-device on Apple Silicon, with zero per-token API billing. Two agents from one config (a full-access private assistant and a sandboxed public one), swappable local LLM backends (Ollama and MLX), local voice (TTS + STT), and LaunchAgents so everything survives reboots.

Tested on a Mac Studio M3 Ultra (96 GB unified memory), OpenClaw 2026.5.20, Ollama 0.24.0, mlx-lm 0.31.3.


TL;DR findings

  • You don't need the cloud. A 26B-class model (Gemma 4 26B-A4B, a 4B-active MoE) is plenty for a chatty personal agent and runs comfortably in well under half of 96 GB.
  • Ollama and MLX can coexist as two OpenClaw providers; flip the agent's primary model with a one-line config change.
  • Benchmark one model at a time. Two large models resident at once throttle each other on memory bandwidth. Co-residency roughly halved MLX throughput and produced a totally wrong "Ollama is faster" conclusion until I unloaded the idle one. In isolation the MLX OptiQ-4bit build hit ~73 tok/s vs a contended ~35.
  • MLX has no separate "warm-up" problem. mlx_lm.server loads the model at process start and holds it; the LaunchAgent keeps it alive. Ollama lazily unloads, so it needs OLLAMA_KEEP_ALIVE + a tiny warm-up ping.

0. Prerequisites

# OpenClaw (the agent gateway)
npm install -g openclaw@latest

# Ollama (llama.cpp backend)
brew install ollama

# MLX (Apple-silicon-native inference) in an isolated venv
python3 -m venv ~/mlx-env
~/mlx-env/bin/pip install -U mlx-lm

# ffmpeg (audio transcode for STT)
brew install ffmpeg

Throughout, replace these placeholders with your own values:

Placeholder              Meaning
<YOUR_NUMBER_E164>       your WhatsApp number, e.g. +15551234567
<YOUR_GATEWAY_TOKEN>     a random secret (openssl rand -hex 24)
<YOUR_GROUP_ID>@g.us     a WhatsApp group id (optional)
you@example.com          your provider OAuth email (optional cloud fallback)
/Users/you               your home directory
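
If you would rather not shell out to openssl, Python's standard library can produce a token of the same shape. This is a minimal sketch; new_gateway_token is a hypothetical helper name, not part of OpenClaw.

```python
import secrets


def new_gateway_token(num_bytes: int = 24) -> str:
    """Return a random hex secret, the same shape as `openssl rand -hex 24`."""
    # token_hex draws from the OS CSPRNG and emits two hex characters per byte.
    return secrets.token_hex(num_bytes)


print(new_gateway_token())
```

The default of 24 bytes yields a 48-character hex string, matching the openssl command in the table.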

1. Architecture

                 ┌───────────────────────────────────────────┐
   WhatsApp ───► │  OpenClaw gateway (loopback :18789)       │
                 │                                           │
                 │  agent "private"  ── full tools           │
                 │  agent "public"   ── sandboxed (no bash)  │
                 └───────┬─────────────────────┬─────────────┘
                         │ model providers       │
            ┌────────────▼─────────┐   ┌─────────▼──────────────┐
            │ ollama  :11434       │   │ mlx  :8080             │
            │ (llama.cpp / Metal)  │   │ mlx_lm.server (OpenAI- │
            │                      │   │  compatible endpoint)  │
            └──────────────────────┘   └────────────────────────┘
            voice:  OmniVoice TTS :17494   ·   mlx-whisper STT :17495

One config defines two agents: a full-access private agent bound to your DM, and a locked-down public agent for everyone else.


2. The two local LLM providers

Both providers live under models.providers in ~/.openclaw/openclaw.json. Ollama speaks its native API; MLX is exposed as an OpenAI-compatible server, so OpenClaw talks to it with api: openai-completions.

{
  "models": {
    "providers": {
      "ollama": {
        "api": "ollama",
        "apiKey": "ollama-local",
        "baseUrl": "http://127.0.0.1:11434",
        "models": [
          { "id": "gemma4:26b-a4b-it-q8_0", "name": "Gemma 4 26B (Q8_0)",
            "contextWindow": 131072, "input": ["text","image"], "reasoning": true,
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 } }
        ]
      },
      "mlx": {
        "api": "openai-completions",
        "apiKey": "mlx",
        "baseUrl": "http://127.0.0.1:8080/v1",
        "models": [
          { "id": "mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
            "api": "openai-completions", "name": "Gemma 4 26B-A4B OptiQ-4bit (MLX)",
            "contextWindow": 131072, "input": ["text"], "reasoning": true, "maxTokens": 4096,
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 } }
        ]
      }
    }
  }
}

Every cost field is 0 because these models run locally and cost nothing per token. Zeroing them also keeps OpenClaw's usage accounting honest.
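
The agent config in the next section refers to models as provider/model-id. As a quick sanity check, the sketch below lists every such reference that the providers block declares. It assumes the config file is plain JSON (if yours contains comments, json will reject it); model_refs and load_config are hypothetical helper names.

```python
import json
from pathlib import Path


def model_refs(config: dict) -> list:
    """List every configured model as 'provider/model-id'."""
    providers = config.get("models", {}).get("providers", {})
    return [
        f"{name}/{model['id']}"
        for name, provider in providers.items()
        for model in provider.get("models", [])
    ]


def load_config(path: Path = Path.home() / ".openclaw" / "openclaw.json") -> dict:
    """Read the OpenClaw config. Assumption: the file parses as strict JSON."""
    return json.loads(path.read_text())
```

Calling model_refs(load_config()) on the config above should list one ollama/... entry and one mlx/... entry.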


3. Wiring models into the agent + aliases

Under agents.defaults, register the models (with short aliases for quick swaps) and pick the primary:

{
  "agents": {
    "defaults": {
      "workspace": "/Users/you/.openclaw/workspace",
      "model": {
        "primary": "mlx/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit"
      },
      "models": {
        "ollama/gemma4:26b-a4b-it-q8_0":          { "alias": "gemma4-26b-q8",   "params": { "think": true } },
        "mlx/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit": { "alias": "gemma4-26b-optiq", "params": { "think": true } }
      }
    }
  }
}

Swap the agent's brain (mine is named Mika, see §4) with a one-liner, then restart the gateway:

openclaw config set agents.defaults.model.primary 'ollama/gemma4:26b-a4b-it-q8_0'
openclaw gateway restart
openclaw config get agents.defaults.model.primary   # verify
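
A typo in the primary model string is easy to make when swapping by hand. The sketch below checks that the primary is declared under models.providers and registered under agents.defaults.models. Whether OpenClaw itself requires both is my assumption from the config shown here; primary_problems is a hypothetical helper.

```python
def primary_problems(config: dict) -> list:
    """Return human-readable problems with agents.defaults.model.primary."""
    providers = config.get("models", {}).get("providers", {})
    declared = {
        f"{name}/{model['id']}"
        for name, body in providers.items()
        for model in body.get("models", [])
    }
    defaults = config.get("agents", {}).get("defaults", {})
    primary = defaults.get("model", {}).get("primary")

    problems = []
    if primary is None:
        problems.append("agents.defaults.model.primary is not set")
    elif primary not in declared:
        problems.append(f"{primary!r} is not declared under models.providers")
    elif primary not in defaults.get("models", {}):
        problems.append(f"{primary!r} is not registered under agents.defaults.models")
    return problems
```

An empty list means the reference lines up with the rest of the config; anything else is worth fixing before restarting the gateway.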

4. Two agents from one config (private vs public)

This is the underrated trick: define a locked-down public persona alongside the full-access private one. The public agent denies the dangerous tools.

{
  "agents": {
    "list": [
      {
        "id": "private",
        "name": "Mika",
        "workspace": "/Users/you/.openclaw/workspace"
      },
      {
        "id": "public",
        "name": "Mika (Public)",
        "workspace": "/Users/you/.openclaw/workspace/public",
        "tools": { "deny": ["bash", "process", "web_search"], "exec": {} }
      }
    ]
  },
  "bindings": [
    { "agentId": "private", "match": { "channel": "whatsapp", "peer": { "id": "<YOUR_NUMBER_E164>", "kind": "dm" } } },
    { "agentId": "public",  "match": { "channel": "whatsapp" } }
  ]
}

Your DM hits private (can run shell, browse, etc.); everyone else hits public (chat + safe tools only).
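
To make the routing concrete, here is a toy model of how the two bindings above could resolve. It assumes first-match-wins in list order, with a binding matching when every field it names equals the incoming message's field. That is my reading of the config, not OpenClaw's documented resolver, which may rank bindings differently; matches and route are hypothetical names.

```python
from typing import Optional


def matches(match: dict, message: dict) -> bool:
    """True if every key in `match` equals the same key in `message` (nested dicts recurse)."""
    for key, want in match.items():
        got = message.get(key)
        if isinstance(want, dict):
            if not isinstance(got, dict) or not matches(want, got):
                return False
        elif got != want:
            return False
    return True


def route(bindings: list, message: dict) -> Optional[str]:
    """Return the agentId of the first binding that matches, or None."""
    for binding in bindings:
        if matches(binding["match"], message):
            return binding["agentId"]
    return None
```

Under this model the specific DM binding has to come before the catch-all WhatsApp binding, which is the order used in the config above.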


5. Ollama: tuning + keeping the model warm

Ollama lazily unloads models after OLLAMA_KEEP_ALIVE (default 5 min), so the next request pays a cold-start. Two fixes: tune the service env, and pre-warm on boot.

5a. Service env (LaunchAgent: ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist)

<key>EnvironmentVariables</key>
<dict>
    <key>OLLAMA_FLASH_ATTENTION</key>  <string>1</string>
    <key>OLLAMA_KEEP_ALIVE</key>       <string>24h</string>
    <key>OLLAMA_KV_CACHE_TYPE</key>    <string>q8_0</string>
</dict>

Reload the service so the new environment takes effect:

launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist
launchctl load   ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist

5b. Warm-up script — ~/ollama-warmup.sh

#!/bin/bash
# Pre-warm an Ollama model into GPU memory after the service starts.
MODEL="${1:-gemma4:26b-a4b-it-q8_0}"
MAX_RETRIES=30; RETRY_INTERVAL=2
echo "[warmup] waiting for Ollama..."
for i in $(seq 1 "$MAX_RETRIES"); do
  # -f makes curl exit non-zero on HTTP errors, so an erroring server is not treated as ready
  if curl -sf http://localhost:11434/api/tags >/dev/null 2>&1; then
    echo "[warmup] loading $MODEL..."
    # Only report success if the generate call itself succeeded
    if curl -sf http://localhost:11434/api/generate \
      -d "{\"model\":\"$MODEL\",\"prompt\":\"hi\",\"stream\":false,\"keep_alive\":\"24h\"}" >/dev/null 2>&1; then
      echo "[warmup] $MODEL warm."; exit 0
    fi
    echo "[warmup] failed to load $MODEL"; exit 1
  fi
  sleep "$RETRY_INTERVAL"
done
echo "[warmup] Ollama did not start in time"; exit 1

5c. Warm-up LaunchAgent — ~/Library/LaunchAgents/com.ollama.warmup.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
  <key>Label</key><string>com.ollama.warmup</string>
  <key>ProgramArguments</key>
  <array>
    <string>/Users/you/ollama-warmup.sh</string>
    <string>gemma4:26b-a4b-it-q8_0</string>
  </array>
  <key>RunAtLoad</key><true/>
  <key>StandardOutPath</key><string>/tmp/ollama-warmup.log</string>
  <key>StandardErrorPath</key><string>/tmp/ollama-warmup.log</string>
</dict></plist>

Make the script executable and load the agent:

chmod +x ~/ollama-warmup.sh
launchctl load ~/Library/LaunchAgents/com.ollama.warmup.plist
# disable later (keeps the file):  launchctl unload -w ~/Library/LaunchAgents/com.ollama.warmup.plist
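
To confirm the warm-up worked, Ollama's /api/ps endpoint reports which models are currently resident. The sketch below summarizes that response; loaded_models and fetch_ps are hypothetical helper names, and only the name and size fields of the response are used.

```python
import json
import urllib.request


def loaded_models(ps_payload: dict) -> dict:
    """Map each resident model name to its size in GB from an /api/ps response."""
    return {
        m["name"]: round(m.get("size", 0) / 1e9, 1)
        for m in ps_payload.get("models", [])
    }


def fetch_ps(base_url: str = "http://127.0.0.1:11434") -> dict:
    """GET /api/ps from a local Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)
```

Running loaded_models(fetch_ps()) after boot should show the warmed model; an empty dict means nothing is resident and the next request will pay the cold start.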

6. MLX: a persistent OpenAI-compatible server

mlx_lm.server loads the model at startup and holds it for the life of the process — so the LaunchAgent is the warm-up. Pull a model once (it caches under ~/.cache/huggingface):

~/mlx-env/bin/hf download mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit

LaunchAgent — ~/Library/LaunchAgents/com.mlx-lm.server.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
  <key>Label</key><string>com.mlx-lm.server</string>
  <key>ProgramArguments</key>
  <array>
    <string>/Users/you/mlx-env/bin/mlx_lm.server</string>
    <string>--model</string><string>mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit</string>
    <string>--port</string><string>8080</string>
  </array>
  <key>EnvironmentVariables</key>
  <dict><key>PATH</key><string>/Users/you/mlx-env/bin:/opt/homebrew/bin:/usr/bin:/bin</string></dict>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
  <key>StandardOutPath</key><string>/tmp/mlx-lm-server.log</string>
  <key>StandardErrorPath</key><string>/tmp/mlx-lm-server.log</string>
</dict></plist>

Load the agent and confirm the server responds:

launchctl load ~/Library/LaunchAgents/com.mlx-lm.server.plist
curl -s http://127.0.0.1:8080/v1/models | python3 -m json.tool   # confirm it's serving
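
Hand-written plist XML is easy to break. As an alternative, Python's plistlib can generate the same LaunchAgent from a dict. This is a sketch that mirrors the keys in the plist above; mlx_server_plist is a hypothetical helper, and you would still write the bytes to ~/Library/LaunchAgents/com.mlx-lm.server.plist yourself.

```python
import plistlib


def mlx_server_plist(home: str, model: str, port: int = 8080) -> bytes:
    """Build the com.mlx-lm.server LaunchAgent as XML plist bytes."""
    agent = {
        "Label": "com.mlx-lm.server",
        "ProgramArguments": [
            f"{home}/mlx-env/bin/mlx_lm.server",
            "--model", model,
            "--port", str(port),  # launchd arguments are always strings
        ],
        "EnvironmentVariables": {
            "PATH": f"{home}/mlx-env/bin:/opt/homebrew/bin:/usr/bin:/bin",
        },
        "RunAtLoad": True,
        "KeepAlive": True,
        "StandardOutPath": "/tmp/mlx-lm-server.log",
        "StandardErrorPath": "/tmp/mlx-lm-server.log",
    }
    return plistlib.dumps(agent)  # XML format is the default
```

Swapping the model then becomes a one-argument change instead of an XML edit.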

Why OptiQ? OptiQ-4bit is a mixed-precision MLX quant tuned for MoE: it keeps the router/gating layers at 8-bit and the experts at 4-bit, so you get near-8-bit quality at ~4-bit size (~16 GB on disk).
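
A back-of-envelope check on that size: dividing the on-disk bits by the parameter count gives the average bits per weight. The sketch takes the nominal 26B figure and decimal gigabytes as assumptions; the true parameter count and any non-weight data in the files will shift the result a little.

```python
def effective_bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter, from file size (GB, 1e9 bytes) and parameter count (billions)."""
    return size_gb * 8 / params_billion


print(round(effective_bits_per_weight(16, 26), 2))  # 4.92
```

An average just under 5 bits is consistent with mostly 4-bit experts plus a smaller share of 8-bit layers and quantization metadata.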


7. Benchmarking — and the gotcha that almost fooled me

Same model family, same prompt, 200-token generation, steady-state, wall-clock:

Backend / build                                  Decode speed             Resident
MLX gemma-4-26B-A4B-it-OptiQ-4bit (isolated)     ~73 tok/s                ~17 GB
Ollama gemma4:26b-a4b-it-q8_0                    ~60 tok/s                ~33 GB
MLX OptiQ, while Ollama Q8_0 also resident       ~35 tok/s (contended)

Commands used for each measurement:

# Ollama — exact decode rate from the API (excludes prompt + load)
curl -s http://localhost:11434/api/generate -d '{
  "model":"gemma4:26b-a4b-it-q8_0","prompt":"Count from 1 to 100 slowly.",
  "stream":false,"think":false,"options":{"num_predict":200,"temperature":0}}' \
| python3 -c 'import json,sys;d=json.load(sys.stdin);print(round(d["eval_count"]/(d["eval_duration"]/1e9),1),"tok/s")'

# MLX: divide the 200 generated tokens by the "real" time that `time` prints (short prompt, so prompt time is negligible)
time curl -s http://127.0.0.1:8080/v1/completions -d '{
  "model":"mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
  "prompt":"Count from 1 to 100 slowly.","max_tokens":200,"temperature":0}' >/dev/null

⚠️ The lesson: the first time I ran this, MLX clocked ~35 tok/s and I "concluded" Ollama was 1.7× faster. Wrong. The 33 GB Ollama model was still resident and the two were fighting over memory bandwidth. Unload everything but the model under test (ollama stop <model>), then measure. In its real deployed condition (only the MLX model resident) OptiQ runs ~73 tok/s — faster and lighter.
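
The two measurements can also be computed from the responses themselves rather than by eye. The sketch below assumes the Ollama response carries eval_count and eval_duration (nanoseconds), and that the MLX server returns an OpenAI-style usage.completion_tokens field. The function names are hypothetical.

```python
import json
import time
import urllib.request


def ollama_decode_rate(resp: dict) -> float:
    """tok/s from Ollama's own counters; eval_duration is in nanoseconds."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)


def wallclock_rate(resp: dict, elapsed_s: float) -> float:
    """tok/s from usage.completion_tokens over wall-clock seconds."""
    return resp["usage"]["completion_tokens"] / elapsed_s


def timed_post(url: str, payload: dict) -> tuple:
    """POST JSON and return (parsed response, elapsed seconds)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body, time.perf_counter() - start
```

For MLX, pass the result of timed_post("http://127.0.0.1:8080/v1/completions", {...}) to wallclock_rate. Using the reported token count avoids overstating the rate if generation stops before max_tokens.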


8. Bonus: fully-local voice (TTS + STT)

OpenClaw treats both as OpenAI-compatible/CLI endpoints, so no cloud and no keys.

TTS — OmniVoice on :17494 (messages.tts)

{
  "messages": {
    "tts": {
      "provider": "openai",
      "auto": "always",
      "providers": {
        "openai": {
          "enabled": true,
          "baseUrl": "http://127.0.0.1:17494/v1",
          "apiKey": "omnivoice-local",
          "model": "omnivoice",
          "voice": "female-young-pt"
        }
      }
    }
  }
}

(OmniVoice runs behind a small Python wrapper that exposes /v1/audio/speech, kept alive by its own LaunchAgent — same pattern as the STT server below.)
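
To test the TTS endpoint outside OpenClaw, you can send it an OpenAI-style speech request directly. The sketch assumes the wrapper accepts the standard model, input, and voice fields and returns audio bytes; speech_request is a hypothetical helper, and the defaults are taken from the config above.

```python
import json
import urllib.request


def speech_request(
    text: str,
    base_url: str = "http://127.0.0.1:17494/v1",
    model: str = "omnivoice",
    voice: str = "female-young-pt",
    api_key: str = "omnivoice-local",
) -> urllib.request.Request:
    """Build a POST to /audio/speech in the OpenAI request shape."""
    body = json.dumps({"model": model, "input": text, "voice": voice}).encode()
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
```

Pass the request to urllib.request.urlopen and write the response bytes to a file to hear the result.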

STT — mlx-whisper on :17495

~/whisper-server.py (FastAPI wrapper around mlx_whisper, loads whisper-large-v3-turbo once):

#!/usr/bin/env python3
"""HTTP wrapper for mlx-whisper STT. Loads model once, serves many requests."""
import os, tempfile, time
from contextlib import asynccontextmanager
import mlx_whisper
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import JSONResponse

PORT = int(os.environ.get("WHISPER_PORT", "17495"))
MODEL_REPO = os.environ.get("WHISPER_MODEL", "mlx-community/whisper-large-v3-turbo")

@asynccontextmanager
async def lifespan(_app: FastAPI):
    # warm the model on a moment of silence so the first real request is fast
    silent = os.path.join(tempfile.gettempdir(), "whisper-warmup.wav")
    if not os.path.exists(silent):
        import wave
        with wave.open(silent, "wb") as w:
            w.setnchannels(1); w.setsampwidth(2); w.setframerate(16000)
            w.writeframes(b"\x00\x00" * 16000)
    mlx_whisper.transcribe(silent, path_or_hf_repo=MODEL_REPO)
    yield

app = FastAPI(lifespan=lifespan)

@app.get("/health")
async def health(): return {"status": "ok", "model": MODEL_REPO, "port": PORT}

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...), language: str | None = Form(None), model: str | None = Form(None)):
    started = time.time()
    suffix = os.path.splitext(file.filename or "")[1] or ".wav"
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(await file.read()); path = tmp.name
    try:
        kw = {"path_or_hf_repo": model or MODEL_REPO}
        if language and language.lower() not in ("auto", ""): kw["language"] = language
        r = mlx_whisper.transcribe(path, **kw)
        return JSONResponse({"text": (r.get("text") or "").strip(),
                             "language": r.get("language"),
                             "duration": round(time.time() - started, 3)})
    finally:
        try: os.unlink(path)
        except OSError: pass

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=PORT, log_level="info")

Install + LaunchAgent (~/Library/LaunchAgents/com.mlx-whisper.server.plist):

~/mlx-env/bin/pip install mlx-whisper fastapi 'uvicorn[standard]' python-multipart

Then the LaunchAgent itself:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
  <key>Label</key><string>com.mlx-whisper.server</string>
  <key>ProgramArguments</key>
  <array><string>/Users/you/mlx-env/bin/python3</string><string>/Users/you/whisper-server.py</string></array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>PATH</key><string>/opt/homebrew/bin:/usr/bin:/bin</string>   <!-- mlx-whisper shells out to ffmpeg -->
    <key>WHISPER_PORT</key><string>17495</string>
    <key>WHISPER_MODEL</key><string>mlx-community/whisper-large-v3-turbo</string>
  </dict>
  <key>RunAtLoad</key><true/><key>KeepAlive</key><true/>
  <key>StandardOutPath</key><string>/tmp/mlx-whisper-server.log</string>
  <key>StandardErrorPath</key><string>/tmp/mlx-whisper-server.log</string>
</dict></plist>

Hook it into OpenClaw as the audio model (tools.media) — transcode to WAV, POST, return text:

{
  "tools": {
    "media": {
      "audio": { "enabled": true },
      "models": [{
        "type": "cli", "command": "bash", "provider": "mlx-whisper", "model": "whisper-large-v3-turbo",
        "capabilities": ["audio"],
        "args": ["-c",
          "TMP=$(mktemp -t stt) && /opt/homebrew/bin/ffmpeg -y -i \"$1\" -f wav \"$TMP\" 2>/dev/null && curl -s -X POST http://localhost:17495/transcribe -F \"file=@$TMP;filename=stt.wav;type=audio/wav\" -F 'language=pt' | python3 -c \"import json,sys;d=json.load(sys.stdin);print(d.get('text',''))\" ; rm -f \"$TMP\"",
          "--"]
      }]
    }
  }
}

Tip: whisper-base hallucinates a trailing phantom phrase on short clips. whisper-large-v3-turbo (≈809M, ~1.5 GB) fixes it at ~0.3 s/clip on an M3 Ultra.
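
For testing the STT server without curl, a standard-library client needs to build the multipart body by hand. The sketch below does that for the /transcribe route defined above, using its file and language form fields. encode_multipart and transcribe are hypothetical helper names, and the input is assumed to already be a WAV file.

```python
import json
import urllib.request
import uuid


def encode_multipart(fields: dict, file_field: str, filename: str, data: bytes,
                     mime: str = "audio/wav") -> tuple:
    """Return (body, content_type) for a multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        text = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'
        parts.append(text.encode())
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        f'Content-Type: {mime}\r\n\r\n'
    )
    parts.append(file_header.encode() + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())  # closing boundary
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, url: str = "http://127.0.0.1:17495/transcribe",
               language: str = "pt") -> str:
    """Upload a WAV file and return the transcribed text."""
    with open(path, "rb") as fh:
        body, content_type = encode_multipart({"language": language}, "file", "clip.wav", fh.read())
    req = urllib.request.Request(url, data=body, headers={"Content-Type": content_type}, method="POST")
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["text"]
```

This skips the ffmpeg transcode step from the OpenClaw hook, so convert other formats to WAV first.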


9. Gotchas worth knowing

  • Co-residency throttles throughput — see §7. Keep one big model resident; ollama stop <model> frees it instantly.
  • Slower model ⇒ blown cron timeouts. A research-heavy scheduled job that finished on a fast quant can time out on a slower one. Bump the job's timeoutSeconds, or give the cron its own faster model.
  • Stale "typing…" indicator. If WhatsApp's connection drops mid-turn, the "composing" presence may never get its "stop." It's cosmetic; an openclaw gateway restart clears it.
  • Reasoning models eat your token budget. With think: true and a tiny max_tokens, the model can spend the whole budget "thinking" and return empty content. Give it headroom.
  • A cloud fallback can stay configured (e.g. an OpenAI/Anthropic OAuth profile) without ever spending API credits — just don't make it primary. OAuth access tokens expire and re-auth is interactive, so don't rely on it for unattended jobs.
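
The empty-reply failure from the reasoning bullet is detectable in the response itself. The sketch below assumes an OpenAI-style response where a truncated generation reports finish_reason "length"; thinking_ate_budget is a hypothetical helper, and it accepts both chat-style (message.content) and completion-style (text) choices.

```python
def thinking_ate_budget(resp: dict) -> bool:
    """True when the response hit the token limit and returned no visible text."""
    choice = resp["choices"][0]
    text = (choice.get("message") or {}).get("content") or choice.get("text") or ""
    return choice.get("finish_reason") == "length" and not text.strip()
```

When this returns True, raising the token limit is the first thing to try.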

10. Service map (what autostarts)

LaunchAgent              Purpose                              Port
ai.openclaw.gateway      the agent gateway                    18789 (loopback)
homebrew.mxcl.ollama     Ollama backend                       11434
com.ollama.warmup        pre-warm Ollama model (optional)
com.mlx-lm.server        MLX LLM (OpenAI-compatible)          8080
com.mlx-whisper.server   STT                                  17495
com.omnivoice.server     TTS                                  17494

To see which of these are loaded:

launchctl list | grep -E 'openclaw|ollama|mlx|omnivoice'
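
launchctl only tells you an agent is loaded, not that its server is accepting connections. A quick port probe covers that gap. The ports come from the service map above; is_listening and report are hypothetical helper names.

```python
import socket

# Ports from the service map; the names are labels for the report only.
SERVICES = {
    "openclaw gateway": 18789,
    "ollama": 11434,
    "mlx-lm": 8080,
    "mlx-whisper": 17495,
    "omnivoice": 17494,
}


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0


def report() -> dict:
    """Probe every service port and return name -> reachable."""
    return {name: is_listening(port) for name, port in SERVICES.items()}
```

Printing report() after a reboot gives a one-glance view of which services came back.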

Built and debugged interactively with Claude Code. Numbers are from a single M3 Ultra (96 GB); your mileage will vary with chip, RAM, and quant.
