Series goal: After reading the full series, you'll be able to do custom development on OpenClaw and build a similar system from scratch.
Core question in this article: What is the Gateway? Why does it exist? How was it designed, step by step?
Start With What You Experience as a User
Imagine you've set up OpenClaw. Your typical day looks like this:
- In the morning you ask it on WhatsApp: "Summarize my meeting notes from today." It opens folders on your computer, generates a document, and sends you the link.
- In the afternoon on Slack you ask: "What's the server status right now?" It SSHes into the server, runs commands, and reports back.
- Meanwhile you have the Web UI open in a browser, watching the task progress in real time as logs scroll by.
- Your iPhone is nearby, ready to wake the assistant with your voice at any moment.
There's something subtle happening here: four different entry points (WhatsApp, Slack, Web UI, iPhone) are all using the same AI assistant simultaneously, and they all see synchronized state.
This creates an engineering challenge.
The Problem: Who Coordinates All of This?
Imagine there was no central hub at all:
WhatsApp connects → AI process A
Slack connects → AI process B
Web UI connects → AI process C
iPhone connects → AI process D
Four independent processes. What you asked on WhatsApp, Slack can't see. The status in the Web UI isn't in sync with what's actually executing. You say "stop" in Slack but the AI on WhatsApp keeps running.
This doesn't work.
All entry points must share the same AI's same state. That means you need a single coordination center — something that connects to all messaging channels, manages the one AI execution process, and synchronizes state in real time to all connected clients.
That is the fundamental reason OpenClaw's Gateway exists.
Gateway's Nature: A Control Plane
OpenClaw's code comments describe the Gateway as:
Gateway WebSocket control plane
Control plane is a networking term: the layer responsible for decisions and coordination, as opposed to the layer that carries data.
In plain terms, the Gateway does three things:
① Message hub: All messaging channels (WhatsApp, Telegram, Slack...) funnel into the Gateway. It decides which AI session handles each message, then routes the AI's reply back.
② Command center: CLI, macOS App, Web UI, and phone apps all control the AI through the Gateway — starting and stopping sessions, checking status, changing configuration, triggering tasks.
③ State broadcast station: While the AI executes, the Gateway broadcasts real-time state to all connected clients. A question you asked on your phone can be watched streaming in the Web UI on your laptop.
Understanding these three roles makes every subsequent design decision fall naturally into place.
First Design Decision: Why WebSocket?
Once we know "the Gateway needs to synchronize state to multiple clients in real time," the next question is: which protocol?
The most common choice is HTTP. But HTTP has a fundamental limitation: it's request-response only — the client must ask before the server can answer. The server has no way to push messages proactively.
The Gateway has a strong requirement: as the AI generates a reply, each token must be streamed to all clients immediately — not waiting for a complete sentence, but character by character, like a typewriter (this is LLM "streaming output").
Two ways to approximate this with HTTP:
- Long polling: client keeps asking "any new content?", server answers when it has something. High latency, heavy connection overhead.
- SSE (Server-Sent Events): the server can push, but only in one direction; the client can't send commands back over the same connection.
Neither works well. What OpenClaw needs is: both client and server can send messages at any time, over a persistent connection with no repeated handshake overhead.
This is exactly what WebSocket was designed for. Once established, both sides can send at any time with very low latency.
Regular HTTP:
Client →→→ request →→→ Server
Client ←←← response ←←← Server
(connection closes, start over next time)
WebSocket:
After one connection setup, both sides send freely:
Client →→→ "run this command" →→→ Server
Client ←←← "AI is thinking..." ←←← Server (proactive push)
Client ←←← "AI says: ..." ←←← Server (continues pushing)
Client →→→ "stop" →→→ Server
(connection stays open)
That's why the Gateway chose WebSocket as its primary protocol — not because it's trendy, but because the business requirements demanded it.
HTTP doesn't disappear. The Gateway also listens on HTTP for: browser access to the Web UI (must be HTTP), Slack/webhook callbacks (third parties only speak HTTP), OpenAI-compatible endpoints (to integrate with existing SDKs). These are all auxiliary scenarios.
Second Design Decision: How Does the Server Know Who You Are?
The Gateway is now accepting WebSocket connections. Clients that connect could be:
- Your own CLI (fully trusted, can do anything)
- Your Web UI (you use it yourself, but ideally read-only to prevent accidents)
- Your iPhone node (it can report camera frames, but shouldn't change config)
- A webhook call (external trigger, minimal permissions)
These four client types need different permissions. How do you tell them apart?
The simplest approach: different tokens for different clients. But that's high maintenance overhead, and too coarse — you can't express "Web UI can view session list but can't delete sessions."
OpenClaw's solution is a three-layer auth model, where each layer solves a different problem:
Layer 1: Are You Legitimate? (HTTP-level Token)
When establishing the WebSocket connection, the HTTP Upgrade request must carry a token:
GET /ws HTTP/1.1
Authorization: Bearer your-token-here
This layer answers one question: is this a valid Gateway token? Valid = allow the connection; invalid = disconnect. It's the doorman — only cares about "can you come in."
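As a minimal sketch of this doorman check, here is what validating the Bearer token during the HTTP Upgrade might look like. The function name `checkUpgradeAuth` and the constant `GATEWAY_TOKEN` are illustrative assumptions, not OpenClaw's actual API:

```typescript
// Hypothetical sketch of the Layer-1 doorman check during the HTTP Upgrade.
// Names here are illustrative; the real check lives in the Gateway's upgrade handler.
const GATEWAY_TOKEN = "your-token-here";

function checkUpgradeAuth(headers: Record<string, string>): boolean {
  const auth = headers["authorization"] ?? "";
  const match = /^Bearer (.+)$/.exec(auth);
  if (!match) return false; // no Bearer token at all: reject the upgrade
  // Valid token = allow the connection; anything else = disconnect.
  return match[1] === GATEWAY_TOKEN;
}
```

Note the layer is deliberately binary: it answers "in or out," nothing about what you may do once inside.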
Layer 2: What Are You? (Role at connect handshake)
Once through the door, the client sends its first message — the connect message:
{
"method": "connect",
"params": {
"token": "...",
"role": "operator",
"clientId": "macos-app"
}
}
The role field has exactly two values:
- operator: human operator. CLI, macOS App, Web UI are all operators.
- node: device node. iPhone, Android, macOS node mode are all nodes.
The methods available to these two roles are completely disjoint:
// src/gateway/role-policy.ts
export function isRoleAuthorizedForMethod(role, method) {
if (isNodeRoleMethod(method)) {
return role === "node"; // node-only methods: only devices can call
}
return role === "operator"; // everything else: only human operators can call
}
An iPhone (node role) cannot call config.apply to modify configuration — even with a valid token, the wrong role means no access. Conversely, CLI (operator role) can't call node.invoke.result (that's used by device nodes to report execution results).
Why put role in the connect message instead of the HTTP layer?
Because HTTP only handles "can you come in," while role determines "which rooms can you enter." Separating the layers means you can use one token but get different permissions depending on the role — very convenient for testing and debugging.
Layer 3: What Exactly Can You Do? (Fine-grained Scopes)
For operator-role clients, there's an even finer level of control:
// src/gateway/method-scopes.ts
const READ_SCOPE = "operator.read"; // read-only: view status, read config
const WRITE_SCOPE = "operator.write"; // write: trigger Agent, change config
const ADMIN_SCOPE = "operator.admin"; // full access
This solves a real use case: the Web UI can be exposed externally (e.g., for team members to view AI execution logs), but you don't want them triggering Agent runs or changing configuration. Just give their connection only READ_SCOPE, and permission isolation is achieved without maintaining separate tokens.
Three layers together:
HTTP Token → Can you connect at all?
Role → Human operator or device node?
Scope → Within your role, which specific operations can you perform?
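The three layers can be condensed into one hedged sketch. The function `authorize`, the node-method set, and the method-to-scope mapping below are invented for illustration; in OpenClaw the logic is split across role-policy.ts and method-scopes.ts:

```typescript
// Illustrative sketch of the full three-layer decision; not OpenClaw's actual code.
type Role = "operator" | "node";

const NODE_METHODS = new Set(["node.invoke.result"]); // assumed example of a node-only method
const METHOD_SCOPES: Record<string, string> = {       // assumed method-to-scope mapping
  "sessions.list": "operator.read",
  "config.apply": "operator.write",
};

function authorize(tokenOk: boolean, role: Role, scopes: string[], method: string): boolean {
  if (!tokenOk) return false;                           // Layer 1: is the token valid?
  const isNodeMethod = NODE_METHODS.has(method);
  if (isNodeMethod !== (role === "node")) return false; // Layer 2: role must match the method family
  if (role === "node") return true;                     // nodes have no scope layer in this sketch
  const needed = METHOD_SCOPES[method] ?? "operator.admin"; // Layer 3: fine-grained scope
  return scopes.includes(needed) || scopes.includes("operator.admin");
}
```

Each layer can short-circuit to a rejection, so a request only reaches the fine-grained check after passing the coarser ones.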
Third Design Decision: Why Must connect Be the First Message?
Now that you understand the three-layer auth, a natural question arises:
Role and Scope are in the connect message, but the token is in the HTTP header. Why not put everything in the HTTP header and skip the connect step?
Because after the WebSocket upgrade, the server no longer has access to the HTTP headers — they're only transmitted once during the HTTP handshake, and subsequent WebSocket frames don't carry them.
So WebSocket-layer authentication must happen through a WebSocket message. The connect message is that handshake.
Client → Server: HTTP Upgrade (with Bearer Token)
[Layer 1: can you come in?]
WebSocket connection established
Client → Server: { method: "connect", params: { role, scopes, clientId, ... } }
[Layers 2+3: who are you and what can you do?]
Server → Client: { type: "hello-ok", gatewayMethods: [...], events: [...], ... }
[Handshake complete; here's what this Gateway supports]
What happens if you send connect again after the handshake?
// src/gateway/server-methods/connect.ts
export const connectHandlers = {
connect: ({ respond }) => {
respond(false, undefined, errorShape("connect is only valid as the first request"));
},
};
Immediate error. This 12-line file is just a fallback — the actual connect logic is in ws-connection/message-handler.ts, intercepted before message routing reaches the handler map. Under normal operation you never hit this fallback.
What's in the hello-ok response?
More than just "auth successful" — it's a full capability manifest:
{
type: "hello-ok",
gatewayMethods: ["health", "agent", "sessions.list", ...], // all supported RPC methods
events: ["agent", "presence", "tick", ...], // all events that will be pushed
healthSnapshot: { ... }, // current system health snapshot
presenceSnapshot: { ... }, // current presence state snapshot
}
Note that gatewayMethods is dynamically generated:
// src/gateway/server-methods-list.ts
export function listGatewayMethods(): string[] {
const channelMethods = listChannelPlugins()
.flatMap((plugin) => plugin.gatewayMethods ?? []);
return Array.from(new Set([...BASE_METHODS, ...channelMethods]));
}
If you install the MS Teams plugin, it can register its own RPC methods, and this list grows accordingly. Clients learn what the server supports at handshake time — no guessing from documentation, no version-number-based compatibility checks.
Fourth Design Decision: Managing 90 Methods
The Gateway supports around 90 RPC methods (health, agent, sessions.list, config.set...).
How are these registered? OpenClaw's solution is surprisingly simple:
// src/gateway/server-methods.ts
export const coreGatewayHandlers = {
...connectHandlers, // connect
...healthHandlers, // health
...agentHandlers, // agent, agent.wait
...sessionsHandlers, // sessions.list, sessions.patch, sessions.reset ...
...configHandlers, // config.get, config.set, config.apply ...
...cronHandlers, // cron.list, cron.add, cron.run ...
...skillsHandlers, // skills.status, skills.install ...
...nodeHandlers, // node.list, node.invoke ...
// ... ~30 handler groups total
};
A flat JavaScript object: the key is the method name string, the value is the handler function. No routing tree, no middleware chain. Just one lookup table.
When a message arrives:
// look up handler → call it
const handler = extraHandlers?.[req.method] ?? coreGatewayHandlers[req.method];
if (!handler) { respond(error("unknown method")); return; }
handler({ req, respond, client, context });
Why not a "proper" routing framework?
Because for 90 methods, a routing tree is overkill — hash table lookup is O(1), and a routing tree only adds parsing overhead and code complexity.
How do plugins add new methods?
Notice extraHandlers?.[req.method] comes first — plugin-registered handlers have higher priority than core handlers. A plugin just exports an object of the same type, spreads it in at load time, and it can register new methods or even override built-in behavior.
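That priority rule is easy to demonstrate with a tiny sketch. The handler names and return strings below are hypothetical:

```typescript
// Hypothetical handlers illustrating the lookup order: plugin (extra) wins over core.
type Handler = () => string;

const coreHandlers: Record<string, Handler> = {
  health: () => "core health",
  agent: () => "core agent",
};

const pluginHandlers: Record<string, Handler> = {
  "teams.send": () => "plugin method", // a brand-new method registered by a plugin
  health: () => "plugin health",       // overrides the built-in behavior
};

function dispatch(method: string): string {
  // Same lookup shape as the Gateway: check plugin handlers first, core second.
  const handler = pluginHandlers[method] ?? coreHandlers[method];
  if (!handler) return "unknown method";
  return handler();
}
```

Because the lookup is just two object accesses, extension costs nothing at dispatch time: an unmatched plugin map falls straight through to core.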
Fifth Design Decision: Real-Time Sync Across Multiple Clients
The Gateway maintains a set of all currently connected clients:
const clients = new Set<GatewayWsClient>();
When the AI produces new output, the Gateway calls broadcast, sending an event to every client in the set:
broadcast("agent", {
phase: "streaming",
sessionKey: "agent:main:dm:alice",
text: "Analyzing your file...",
})
All connected clients — CLI, Web UI, iPhone — receive this simultaneously and display the AI's output in real time.
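A minimal sketch of such a broadcast, using a fake client object in place of the real WebSocket-backed `GatewayWsClient`:

```typescript
// Minimal broadcast sketch: every connected client gets the same event frame.
// FakeClient stands in for the real WebSocket-backed client object.
interface FakeClient { received: string[]; }

const clients = new Set<FakeClient>();

function broadcast(event: string, payload: unknown): void {
  const frame = JSON.stringify({ type: "event", event, payload });
  for (const client of clients) {
    client.received.push(frame); // real code would do something like client.ws.send(frame)
  }
}
```

The important property is that serialization happens once and the identical frame fans out to every member of the set, so all clients see exactly the same event.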
What happens when a client disconnects and reconnects?
The broadcast function has a stateVersion parameter:
broadcast("presence", payload, {
stateVersion: { presence: currentPresenceVersion }
})
Every state change increments the version. When a client reconnects, it includes the last version it remembers. If the server's version is newer, the server sends a complete state snapshot rather than an incremental update.
This solves a classic distributed systems problem: how do you recover state changes missed during a disconnection? The answer: don't try to recover them — just send the latest full state. Simple, reliable, no risk of missed-update inconsistency.
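The reconnect decision can be sketched as follows. The state shape and function names are illustrative, not OpenClaw's real implementation:

```typescript
// Sketch of the reconnect rule: if the client's remembered version is stale,
// send a complete snapshot instead of trying to replay missed increments.
interface PresenceState { version: number; data: string[]; }

const server: PresenceState = { version: 7, data: ["cli", "webui", "iphone"] };

function onReconnect(clientVersion: number):
  { kind: "up-to-date" } | { kind: "snapshot"; version: number; data: string[] } {
  if (clientVersion >= server.version) return { kind: "up-to-date" };
  // The client missed some updates while disconnected. Don't replay them one
  // by one; just ship the latest full state with its current version number.
  return { kind: "snapshot", version: server.version, data: [...server.data] };
}
```

This trades a little bandwidth (a full snapshot instead of deltas) for a guarantee: the client can never end up in a state the server never had.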
Putting It All Together
Here's the complete Gateway workflow:
1. User sends a message on WhatsApp
↓
2. WhatsApp channel receives it, passes to Gateway via internal event
↓
3. Gateway routing layer decides which Agent's session handles it
(this is the next article's topic: Channel & Routing System)
↓
4. Agent begins executing, producing streaming output
↓
5. Gateway calls broadcast("agent", { text: "..." })
↓
6. All connected clients receive simultaneously:
- Web UI shows real-time progress
- iPhone shows a notification
- CLI prints output
↓
7. Agent finishes; reply is sent back through Gateway to WhatsApp
Every design choice in this flow has a clear rationale:
| Problem | Solution | Why |
|---|---|---|
| Multiple clients sharing one AI's state | Gateway as single hub | No coordination without a hub |
| Real-time bidirectional communication | WebSocket | HTTP can't do server-initiated push |
| Different clients need different permissions | Three-layer auth (Token/Role/Scope) | Granularity from coarse to fine, each layer has clear responsibility |
| Managing 90 methods | Flat handler Map | Simple, fast, trivial for plugins to extend |
| State recovery after reconnect | Version numbers + full snapshots | Simplest reliable approach; immune to missed-update bugs |
The Startup Sequence: What Happens When Gateway Powers On
With the design understood, the startup sequence makes natural sense. startGatewayServer() initializes in this order:
① Read config file; auto-migrate if it's an older format
② Pre-check all secret references — one missing = immediate error exit (Fail-Fast)
③ Generate or verify Gateway token
④ Load all plugins (channel plugins, feature plugins)
⑤ Connect to all messaging channels (WhatsApp, Telegram, Slack...)
⑥ Attach WebSocket handlers and start listening
⑦ Start Cron scheduler, heartbeat monitor, local network discovery
Step ② — "Fail-Fast" — is worth highlighting. If the config references a non-existent API key, many systems handle this by "starting anyway, error when you actually need it." OpenClaw doesn't — check at startup, refuse to start if something is wrong, fail with a clear message.
For a personal AI assistant this matters especially: an assistant running with bad configuration causes "messages sent with no response" — the hardest kind of failure to debug. Better to refuse to start with a clear error message from the very beginning.
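A hedged sketch of that fail-fast pre-check. The config shape, secret store, and function name are invented for illustration and are not OpenClaw's real API:

```typescript
// Illustrative fail-fast check: resolve every secret reference before anything starts.
// The Config shape and secretStore are assumptions for this sketch.
type Config = { secretRefs: string[] };

const secretStore: Record<string, string> = { OPENAI_API_KEY: "example-key" };

function precheckSecrets(config: Config): void {
  const missing = config.secretRefs.filter((ref) => !(ref in secretStore));
  if (missing.length > 0) {
    // Refuse to start with a clear message instead of failing later mid-task.
    throw new Error(`missing secret reference(s): ${missing.join(", ")}`);
  }
}
```

The design choice is the same one the article describes: a thrown error at step ② is loud and actionable, while a lazily discovered missing key surfaces as a silent non-response hours later.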
Summary
This article derived Gateway's core design from user experience:
- Why Gateway exists: multiple channels + multiple clients need a single coordination hub
- Why WebSocket: real-time bidirectional communication is a hard requirement; HTTP can't deliver it
- Why the connect handshake: the WebSocket layer needs its own authentication mechanism
- Why three-layer auth: different clients have different trust levels; permissions need layering
- Why a flat handler Map: simple, sufficient, zero friction for plugin extension
- Why version numbers + full snapshots: most reliable approach for reconnect scenarios
Next article dives into the most complex internal logic of the Gateway — the Channel & Routing System:
When a WhatsApp message arrives, how does OpenClaw know which Agent it should go to? If you've configured multiple Agents for different groups and contacts, how does it route to the right place?
Behind this is an 8-level priority routing system, elegantly designed.
Source paths: src/gateway/ | Key files: server.impl.ts, server-methods.ts, role-policy.ts, method-scopes.ts