Building a Local AI Agent Architecture with OpenClaw and Ollama

By Xaden


Most AI agent setups fall into one of two camps: fully cloud-dependent (expensive, latency-bound, rate-limited) or fully local (limited capability, no access to frontier models). The architecture described here is a hybrid approach — a cloud-hosted frontier model (Claude Opus) acts as the orchestrator, while locally-running Ollama models handle the bulk of execution work at zero marginal cost.

The stack:

  • OpenClaw — an open-source agent gateway that manages sessions, channels, tools, and subagent lifecycle
  • Ollama — local LLM inference server, optimized for Apple Silicon via Metal GPU acceleration
  • Claude Opus 4 — frontier model for orchestration, complex reasoning, and user interaction
  • 4× local models — specialized workers running on-device for free, unlimited inference

Hardware baseline for this guide: MacBook Pro with M3 Pro (12 cores, 36GB unified memory, macOS arm64).


OpenClaw Gateway Architecture

OpenClaw runs as a persistent gateway daemon on the host machine. It's the central nervous system — managing WebSocket connections, routing messages, spawning agents, and orchestrating tool calls.

How the Gateway Works

The gateway is a Node.js process that binds to a local port and exposes a WebSocket + HTTP interface:

┌──────────────────────────────────────────────┐
│               OpenClaw Gateway               │
│              (127.0.0.1:18789)               │
│                                              │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐   │
│  │ Webchat  │  │  Agents  │  │   Tools   │   │
│  │ Channel  │  │ Sessions │  │  Runtime  │   │
│  └──────────┘  └──────────┘  └───────────┘   │
│                                              │
│  ┌──────────────────────────────────────┐    │
│  │            Model Router              │    │
│  │  anthropic/* → Anthropic API         │    │
│  │  ollama/*    → localhost:11434       │    │
│  └──────────────────────────────────────┘    │
└──────────────────────────────────────────────┘

Gateway Configuration

The gateway is configured via ~/.openclaw/openclaw.json. Here's the gateway section from a working local-only deployment:

{
  "gateway": {
    "port": 18789,
    "mode": "local",
    "bind": "loopback",
    "auth": {
      "mode": "token",
      "token": "<your-gateway-token>"
    },
    "tailscale": {
      "mode": "off",
      "resetOnExit": false
    },
    "nodes": {
      "denyCommands": [
        "camera.list",
        "screen.record",
        "contacts.add",
        "calendar.add",
        "reminders.list",
        "sms.search"
      ]
    }
  }
}

Key design decisions:

  • bind: "loopback" — gateway only accepts connections from 127.0.0.1. No network exposure.
  • auth.mode: "token" — every WebSocket connection must present a valid token. Even locally, auth is enforced.
  • nodes.denyCommands — explicit blocklist for sensitive device capabilities. Defense in depth.
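Both enforcement points reduce to small, testable checks. Here is a sketch in Python of the *shape* of these decisions — illustrative only, not OpenClaw's actual implementation; all names are mine:

```python
import hmac

# Placeholder token, as in the config above.
GATEWAY_TOKEN = "<your-gateway-token>"

# Mirror of nodes.denyCommands from the config.
DENY_COMMANDS = {
    "camera.list", "screen.record", "contacts.add",
    "calendar.add", "reminders.list", "sms.search",
}

def token_valid(presented: str, expected: str = GATEWAY_TOKEN) -> bool:
    """Constant-time comparison, the shape of auth.mode = "token"."""
    return hmac.compare_digest(presented, expected)

def command_allowed(command: str) -> bool:
    """Blocklist check, the shape of nodes.denyCommands."""
    return command not in DENY_COMMANDS
```

The constant-time comparison matters even on loopback: any process on the machine can reach 127.0.0.1, so the token is the only thing separating a local browser tab from the gateway's tool surface.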

Managing the Gateway

# Check status (includes PID, bind address, probe result)
openclaw gateway status

# Lifecycle
openclaw gateway start
openclaw gateway stop
openclaw gateway restart

# The gateway runs as a macOS LaunchAgent:
# ~/Library/LaunchAgents/ai.openclaw.gateway.plist
# Auto-starts at login, restarts on crash

Deploying Ollama on Apple Silicon

Ollama provides a lightweight HTTP inference server that leverages Apple's Metal GPU framework for hardware-accelerated inference on M-series chips.

Installation

brew install ollama

On macOS, Ollama installs as a background service that auto-starts at login. The inference server listens on http://localhost:11434.
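From Python, the same server can be queried with the standard library alone. The `/api/tags` route and its `models` array are part of Ollama's documented REST API; the helper names here are mine:

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"

def model_names(tags_payload: dict) -> list[str]:
    """Pull the installed model names out of an /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]

def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask the running Ollama server which models are installed."""
    with urlopen(f"{base_url}/api/tags") as resp:
        return model_names(json.load(resp))

# With the server running and models pulled:
# list_local_models()  → ["qwen3:8b", "mistral:7b", ...]
```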

Pulling Models

Choose models based on your RAM budget. With 36GB unified memory, you can comfortably run multiple 7-8B models simultaneously, or one 14B model alongside smaller ones.

# Code specialist — 14B parameter, 128k context
ollama pull qwen2.5-coder:14b    # 9.0 GB

# General reasoning — 8B parameter, 40k context  
ollama pull qwen3:8b              # 5.2 GB

# Long-context generalist — 8B, 128k context
ollama pull llama3.1:8b           # 4.9 GB

# Fast inference — 7B, 32k context
ollama pull mistral:7b            # 4.4 GB

Total disk: ~23.5 GB for four models covering code, reasoning, writing, and fast tasks.

Tuning for Multi-Model Concurrency

Set environment variables in the OpenClaw config to control Ollama's behavior:

{
  "env": {
    "OLLAMA_API_KEY": "ollama-local",
    "OLLAMA_MAX_LOADED_MODELS": "3",
    "OLLAMA_KEEP_ALIVE": "-1"
  }
}
  • OLLAMA_MAX_LOADED_MODELS: "3" — allows up to 3 models loaded in memory simultaneously. On 36GB unified memory, this means ~18-20GB for models + headroom for the OS and other processes.
  • OLLAMA_KEEP_ALIVE: "-1" — models stay loaded indefinitely once used. Eliminates cold-start latency for frequently-used models.
  • OLLAMA_API_KEY — Ollama doesn't require authentication by default, but OpenClaw's model router expects an API key for every provider. Setting a dummy key satisfies the router.

Memory Budget (36GB Unified)

  • 3× 8B (quantized): ~15 GB used, ~21 GB remaining for OS
  • 1× 14B + 1× 8B: ~14 GB used, ~22 GB remaining
  • 1× 14B + 2× 8B: ~24 GB used, ~12 GB remaining
  • All 4 models: ~28 GB used, ~8 GB remaining (tight)

The sweet spot is 2-3 models loaded concurrently with OLLAMA_MAX_LOADED_MODELS=3. Ollama evicts the least-recently-used model when the limit is hit.
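The budget table above is simple arithmetic over model footprints. A quick sketch, using the approximate on-disk quantized sizes from the pulls above as a proxy for resident memory (actual usage runs higher once KV cache and runtime overhead are counted):

```python
# Approximate quantized sizes in GB (on-disk figure as a resident-memory proxy).
MODEL_GB = {
    "qwen2.5-coder:14b": 9.0,
    "qwen3:8b": 5.2,
    "llama3.1:8b": 4.9,
    "mistral:7b": 4.4,
}

def memory_budget(loaded: list[str], total_gb: float = 36.0) -> tuple[float, float]:
    """Return (used_gb, free_gb) for a set of concurrently loaded models."""
    used = sum(MODEL_GB[m] for m in loaded)
    return used, total_gb - used

used, free = memory_budget(["qwen2.5-coder:14b", "qwen3:8b"])
# 14B + 8B: ~14 GB used, ~22 GB headroom
```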


Configuring Auth for Local Models

OpenClaw's model router uses a unified auth system — every model provider goes through the same auth pipeline. For local Ollama models, this means you need to configure credentials even though Ollama itself doesn't require them.

{
  "env": {
    "ANTHROPIC_API_KEY": "sk-ant-...",
    "OLLAMA_API_KEY": "ollama-local"
  },
  "auth": {
    "profiles": {
      "anthropic:default": {
        "provider": "anthropic",
        "mode": "token"
      }
    }
  }
}

Without OLLAMA_API_KEY set, subagent spawns targeting ollama/qwen3:8b will fail at the auth layer before they ever reach the local inference server. The error surfaces as an auth/credentials failure, not a model availability error — which can be confusing to debug.

Quick Validation

# Verify Ollama is running
curl http://localhost:11434/api/tags

# Test inference directly
curl http://localhost:11434/api/generate \
  -d '{"model": "mistral:7b", "prompt": "Hello", "stream": false}'

# Verify OpenClaw can route to Ollama
openclaw gateway status  # Should show "RPC probe: ok"

Subagent Orchestration Patterns

OpenClaw's subagent system allows any agent session to spawn child sessions on different models. The orchestrator (main agent on Claude Opus) delegates tasks to specialized workers running on local Ollama models.

The Orchestration Model

┌───────────────────────────────────────────┐
│         Main Agent (Claude Opus)          │
│     Orchestration + User Interaction      │
└──────┬──────┬──────┬──────┬───────────────┘
       │      │      │      │
       ▼      ▼      ▼      ▼
   ┌──────┐┌──────┐┌──────┐┌────────────┐
   │qwen3 ││llama ││mistr.││qwen-coder  │
   │ :8b  ││3.1:8b││ :7b  ││   :14b     │
   │Resrch││Write ││Quick ││   Code     │
   │Tasks ││Tasks ││Tasks ││  Tasks     │
   └──────┘└──────┘└──────┘└────────────┘

   All local. All free. All parallel.

Spawning a Subagent

sessions_spawn:
  label: "research-task"
  model: "ollama/qwen3:8b"
  task: "Research the top 5 Rust web frameworks and summarize pros/cons"
  runTimeoutSeconds: 300

Key parameters:

  • model — which model handles this subagent session
  • task — the prompt/instruction for the subagent
  • runTimeoutSeconds — hard timeout to prevent runaway sessions
  • label — human-readable identifier for tracking

Parallel Execution

Multiple subagents can run simultaneously. On 36GB RAM with OLLAMA_MAX_LOADED_MODELS=3, you can run 2-3 local subagents in parallel.

Results auto-announce back to the orchestrator. No polling required — OpenClaw uses push-based completion.
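The fan-out shape — spawn several workers, enforce a hard timeout on each, collect results as they complete — can be sketched generically with Python's asyncio. This illustrates the concurrency pattern, not OpenClaw's actual API; `run_subagent` and the payloads are hypothetical stand-ins for local-model inference calls:

```python
import asyncio

async def run_subagent(label: str, model: str, task: str, delay: float) -> dict:
    """Hypothetical worker: stands in for a local-model inference call."""
    await asyncio.sleep(delay)  # simulate inference time
    return {"label": label, "model": model, "result": f"done: {task}"}

async def spawn(label, model, task, delay, timeout):
    """Enforce a hard per-subagent timeout, like runTimeoutSeconds."""
    try:
        return await asyncio.wait_for(run_subagent(label, model, task, delay), timeout)
    except asyncio.TimeoutError:
        return {"label": label, "model": model, "result": "timeout"}

async def orchestrate():
    # Fan out three workers; each result is pushed back as its coroutine completes.
    return await asyncio.gather(
        spawn("research", "ollama/qwen3:8b", "summarize frameworks", 0.01, 1.0),
        spawn("write", "ollama/llama3.1:8b", "draft intro", 0.01, 1.0),
        spawn("slow", "ollama/mistral:7b", "huge task", 5.0, 0.05),
    )

results = asyncio.run(orchestrate())
```

The third worker exceeds its timeout and is cancelled — the orchestrator gets a clean timeout result instead of a hung session, which is exactly what `runTimeoutSeconds` buys you.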

Timeout Strategy

  • Quick lookups, simple edits: 120s
  • Research, multi-step analysis: 300s
  • Installations, builds, large codegen: 600s

Delegation vs Direct Execution

The main agent session (running on a frontier model like Claude Opus) is the most expensive resource in the system. Every second the orchestrator spends executing a task is a second it can't respond to the user.

Rule: If a task takes more than ~30 seconds of execution time, delegate it.

The Cost Equation

Cloud API (Claude Opus):
  - Input:  ~$15/M tokens
  - Output: ~$75/M tokens
  - Every orchestrator turn costs real money

Local Ollama:
  - Input:  $0
  - Output: $0  
  - Electricity cost only (~5-15W during inference on M3)
  - No rate limits, no quotas

A single Claude Opus response that takes 2 minutes of tool-calling might cost $0.50-2.00. The same work delegated to a local 8B model costs effectively nothing.
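The per-turn cost is easy to put numbers on. A quick calculation at the rates quoted above (the token counts are an illustrative tool-heavy turn, not measured values):

```python
OPUS_INPUT_PER_M = 15.0   # USD per million input tokens
OPUS_OUTPUT_PER_M = 75.0  # USD per million output tokens

def opus_turn_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one Claude Opus turn at the quoted rates."""
    return (input_tokens * OPUS_INPUT_PER_M
            + output_tokens * OPUS_OUTPUT_PER_M) / 1_000_000

# A tool-heavy turn: 40k tokens of context in, 4k tokens out
cost = opus_turn_cost(40_000, 4_000)  # 0.60 + 0.30 = $0.90
```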


Model Selection Strategy

The Tier System

Tier 1: Frontier (Claude Opus 4) — Complex multi-step reasoning, nuanced user interaction, synthesis across multiple information sources. Cost: ~$15-75/M tokens.

Tier 2: Local Specialist (qwen2.5-coder:14b) — Code generation and refactoring, debugging with full file context (128k window). Cost: $0 (~9GB unified memory).

Tier 3: Local Generalist (qwen3:8b, llama3.1:8b) — Research summarization, content writing, data extraction. Cost: $0 (~5GB unified memory each).

Tier 4: Local Fast (mistral:7b) — Quick classification, simple Q&A, text reformatting. Cost: $0 (4.4GB unified memory, fastest inference).
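In practice the tier system collapses to a routing table. A minimal sketch — the task categories and fallback behavior are my assumptions, not OpenClaw configuration:

```python
# Hypothetical routing table mirroring the tier system above.
TIER_ROUTES = {
    "orchestration": "anthropic/claude-opus-4",   # Tier 1: frontier
    "code": "ollama/qwen2.5-coder:14b",           # Tier 2: specialist
    "research": "ollama/qwen3:8b",                # Tier 3: generalist
    "writing": "ollama/llama3.1:8b",              # Tier 3: generalist
    "quick": "ollama/mistral:7b",                 # Tier 4: fast
}

def route(task_kind: str) -> str:
    """Pick a model for a task kind; unknown kinds escalate to the frontier tier."""
    return TIER_ROUTES.get(task_kind, TIER_ROUTES["orchestration"])
```

Escalating unknown task kinds to the frontier tier is the conservative default: a misrouted task costs a few cents on Opus, but costs wasted minutes on a local model that can't handle it.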

What Local Models Cannot Do

Small local models (7-8B parameters) have real limitations: they struggle with deep research, complex planning, nuanced writing, and multi-step tool orchestration. Assigning these tasks to a local model doesn't save money — it wastes time when the output is unusable.


Putting It All Together

Full Configuration Example

{
  "env": {
    "ANTHROPIC_API_KEY": "sk-ant-...",
    "OLLAMA_API_KEY": "ollama-local",
    "OLLAMA_MAX_LOADED_MODELS": "3",
    "OLLAMA_KEEP_ALIVE": "-1"
  },
  "auth": {
    "profiles": {
      "anthropic:default": {
        "provider": "anthropic",
        "mode": "token"
      }
    }
  },
  "agents": {
    "defaults": {
      "workspace": "~/.openclaw/workspace",
      "subagents": {
        "runTimeoutSeconds": 300
      }
    }
  },
  "gateway": {
    "port": 18789,
    "mode": "local",
    "bind": "loopback",
    "auth": {
      "mode": "token",
      "token": "<generate-a-random-token>"
    }
  }
}

Startup Checklist

# 1. Install OpenClaw
npm install -g openclaw
openclaw onboard

# 2. Install Ollama
brew install ollama

# 3. Pull your model fleet
ollama pull qwen2.5-coder:14b
ollama pull qwen3:8b
ollama pull llama3.1:8b
ollama pull mistral:7b

# 4. Verify Ollama is serving
curl -s http://localhost:11434/api/tags | jq '.models[].name'

# 5. Start the gateway
openclaw gateway start

# 6. Verify everything
openclaw gateway status

# 7. Open the webchat dashboard
open http://127.0.0.1:18789/

The Economics

Running this architecture on an M3 Pro MacBook:

  • Cloud cost: Only the orchestrator (Claude Opus) hits the API. With aggressive delegation, this might be 10-20% of total inference.
  • Local cost: Electricity only. M3 Pro draws ~5-15W during inference. At $0.30/kWh, that's roughly $0.004/hour.
  • Effective multiplier: 4 simultaneous worker models + 1 orchestrator = up to 5× throughput at roughly 20% of the cost of running everything on a frontier model.
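The electricity figure checks out with straightforward arithmetic (the 15W draw is the top of the range quoted above):

```python
def electricity_cost_per_hour(watts: float, usd_per_kwh: float = 0.30) -> float:
    """Hourly electricity cost of sustained inference at a given power draw."""
    return watts / 1000 * usd_per_kwh

high = electricity_cost_per_hour(15)  # $0.0045/hour at peak draw
low = electricity_cost_per_hour(5)    # $0.0015/hour at idle-ish inference
```

Even at the top of the range, a full month of continuous inference costs about $3 in electricity — less than a handful of Opus turns.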

The key insight: frontier models should think, not work. Let the cloud model make decisions and coordinate. Let the local models execute. The architecture pays for itself within days of heavy use.


By Xaden — March 2026. Stack: OpenClaw 2026.3.24, Ollama on macOS arm64, Claude Opus 4, Apple M3 Pro 36GB.
