Most tutorials tell you how to set up a tool. This article is about why it's designed the way it is.
OpenClaw is an open-source AI agent gateway — a self-hosted system that connects chat platforms to AI models. When I first looked at its architecture, several design decisions stood out as non-obvious. They reflect trade-offs that anyone building AI infrastructure will eventually face.
Let me unpack the ones that matter.
The Core Constraint: One Gateway Per Host
The first thing you notice about OpenClaw's architecture is a hard constraint: one Gateway process per host. No horizontal scaling. No load balancer in front of multiple instances.
This seems limiting until you understand why.
The Gateway maintains stateful connections to chat platforms. A WhatsApp session is tied to a specific device pairing — you scan a QR code, and that session is bound to this process on this machine. A Telegram bot runs a long-polling connection that expects exactly one consumer. Running two Gateway instances against the same WhatsApp session would cause message duplication, state corruption, and dropped connections.
This isn't a bug. It's a reflection of reality: chat platforms are not stateless APIs. They're persistent, bidirectional connections with identity semantics. The architecture acknowledges this rather than abstracting it away.
The implication for deployment is clear: you scale vertically, not horizontally. One powerful machine with a well-configured Gateway, not a cluster of lightweight instances.
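The constraint also suggests a simple enforcement mechanism. As a minimal sketch (illustrative, not OpenClaw's actual code), a Gateway could take an exclusive lock file at startup and fail fast if a second instance tries to start on the same host:

```typescript
import { openSync, closeSync, unlinkSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Single-instance guard: a second Gateway on the same host fails fast
// instead of corrupting shared channel sessions. The lock path and
// error message are hypothetical.
function acquireSingletonLock(path = join(tmpdir(), "gateway.lock")): () => void {
  try {
    // The "wx" flag fails with EEXIST if the lock file already exists.
    const fd = openSync(path, "wx");
    return () => {
      // Release: close the descriptor and remove the lock file.
      closeSync(fd);
      unlinkSync(path);
    };
  } catch {
    throw new Error(`another Gateway appears to be running (lock: ${path})`);
  }
}
```

The usual wrinkle with this pattern is a stale lock after an unclean shutdown; real implementations typically record the PID in the lock file and check whether that process is still alive.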
Embedded Runtime, Not RPC
The AI agent doesn't run as a separate process. It's embedded directly inside the Gateway.
Most multi-service architectures would put the AI agent behind an API boundary — a separate microservice that the Gateway calls via gRPC or HTTP. OpenClaw takes the opposite approach: the agent runtime (built on pi-mono) is imported as a library and instantiated in-process.
The trade-off is explicit:
What you gain: Zero-latency communication between the Gateway and the agent. Full control over session lifecycle. The ability to inject custom tools, intercept events, and modify context mid-stream without network overhead.
What you give up: Process isolation. If the agent crashes, the Gateway crashes. If the agent leaks memory, the Gateway leaks memory.
For a personal assistant running on your own hardware, this trade-off makes sense. You're not running a multi-tenant service where one user's agent failure should be isolated from another's. You're running a single-operator system where tight integration delivers better performance and simpler operations.
This is a design choice that wouldn't survive in a SaaS product. But for self-hosted infrastructure, it's the right call.
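The embedded shape is easy to sketch. The types below are hypothetical stand-ins for pi-mono's actual API; the point is that the Gateway passes plain callbacks into the runtime, with no serialization or network boundary in between:

```typescript
// Hypothetical event and runtime types, illustrating the embedded
// (in-process) shape rather than pi-mono's real interface.
interface AgentEvent {
  type: "reply" | "tool";
  data: string;
}

class AgentRuntime {
  // The Gateway hands in a direct callback: no RPC schema, no
  // versioned wire format, just a function call.
  constructor(private onEvent: (e: AgentEvent) => void) {}

  handleMessage(text: string): void {
    // Model inference and tool execution would happen here; this stub
    // just emits a reply event synchronously.
    this.onEvent({ type: "reply", data: `echo: ${text}` });
  }
}

// The Gateway instantiates the runtime in-process and wires agent
// events straight into its own channel-delivery code.
const delivered: AgentEvent[] = [];
const runtime = new AgentRuntime((e) => delivered.push(e));
runtime.handleMessage("hello");
```

The cost side of the trade-off is visible here too: an exception thrown inside handleMessage propagates straight into the Gateway's call stack, which is exactly the loss of process isolation described above.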
The Agent Loop
Understanding how the agent processes a message reveals the system's priorities.
Receive input → Assemble context → Model inference → Execute tools → Stream reply → Persist
What makes this interesting is what happens at each stage.
Context assembly is where the system prompt gets built. OpenClaw doesn't use any default prompts from the underlying model runtime. It constructs a custom prompt from workspace files (personality, instructions, memory, tool descriptions), safety guardrails, skills metadata, and runtime information. This happens every turn — meaning you can modify your agent's behavior by editing a Markdown file, and the change takes effect on the next message.
Tool execution follows a loop pattern: the model generates a response that may include tool calls, tools execute and return results, and the model continues. This loop repeats until the model produces a final response with no tool calls. The agent can read files, execute commands, browse the web, send messages to other channels, and manage scheduled tasks — all within a single turn.
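The loop itself fits in a few lines. The model and tool signatures below are hypothetical; what matters is the structure: call the model, execute any requested tools, feed results back, and stop only when a turn contains no tool calls:

```typescript
// Hypothetical types for a single agent turn.
interface ModelTurn {
  text: string;
  toolCalls: { name: string; args: string }[];
}
type Model = (context: string[]) => ModelTurn;
type Tools = Record<string, (args: string) => string>;

function runAgentTurn(model: Model, tools: Tools, input: string): string {
  const context = [input];
  for (;;) {
    const turn = model(context);
    // No tool calls means the model produced its final response.
    if (turn.toolCalls.length === 0) return turn.text;
    for (const call of turn.toolCalls) {
      // Execute each requested tool and append its result to the
      // context the model sees on the next iteration.
      context.push(`${call.name} → ${tools[call.name](call.args)}`);
    }
  }
}
```

Real implementations also cap the number of iterations so a model that keeps requesting tools cannot loop forever; that guard is omitted here for brevity.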
Streaming deserves mention because it's channel-aware. On Telegram, streaming works by editing the bot's message in real-time as tokens arrive. On Slack, it uses the native Agents and AI Apps API for real-time output. On WhatsApp, streaming isn't supported, so the response arrives as a complete message. The Gateway handles these differences transparently.
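A sketch of the channel-aware dispatch (channel names follow the article; the mode labels are illustrative):

```typescript
// Pick a delivery strategy per channel. Telegram streams by editing
// the bot's message as tokens arrive; Slack has a native streaming
// API; channels without streaming get one complete message.
type DeliveryMode = "edit-stream" | "native-stream" | "complete";

function deliveryModeFor(channel: string): DeliveryMode {
  switch (channel) {
    case "telegram":
      return "edit-stream";
    case "slack":
      return "native-stream";
    default:
      return "complete"; // e.g. WhatsApp: send once, when generation finishes
  }
}
```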
Persistence means every conversation is saved to disk as JSONL files. Sessions survive Gateway restarts. Memory is just Markdown files in the workspace directory. There's no database — the file system is the database.
The Memory Architecture
This is perhaps the most opinionated part of the design.
OpenClaw's memory system has two layers:
Working memory (MEMORY.md): A curated Markdown file that gets injected into every conversation turn. Think of it as the agent's always-available notepad.
Daily memory (memory/YYYY-MM-DD.md): Daily log files that are not automatically injected. The agent accesses them on-demand through the memory_search and memory_get tools.
The distinction is deliberate. Working memory costs tokens every turn because it's always in context. Daily memory is free until accessed. This forces a natural curation process: important, frequently-needed information goes in working memory. Everything else goes in daily logs where it can be searched when needed.
The entire memory system is just files on disk. No vector database. No embeddings. No RAG pipeline. Just Markdown that the model reads.
This feels almost primitively simple compared to the memory architectures being published in research papers. But it works. The model is good enough at reading and writing text that a file-based system covers most personal assistant use cases. And it has a massive operational advantage: you can read, edit, and version-control your agent's memory with standard tools.
Multi-Agent as Routing
OpenClaw's approach to multi-agent systems is surprisingly pragmatic.
Instead of complex orchestration frameworks, it uses a binding system: routing rules that map incoming messages to specific agents based on channel, sender, group, or thread.
WhatsApp messages → Agent "casual" (Claude Sonnet)
Telegram messages → Agent "work" (Claude Opus)
Discord server #coding → Agent "code" (with full tool access)
Discord server #general → Agent "chat" (messaging tools only)
Each agent is a fully independent brain: separate workspace, separate memory, separate session history, separate auth credentials. The Gateway routes messages deterministically based on the bindings. No agent decides which other agent to delegate to — the routing is configured, not emergent.
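The whole mechanism reduces to a first-match lookup over ordered rules. Field names and the fallback are illustrative, but the shape mirrors the bindings above:

```typescript
// A binding matches when every field it specifies equals the incoming
// message's field; unspecified fields act as wildcards. First match wins.
interface Binding {
  channel?: string;
  group?: string;
  agent: string;
}
interface Inbound {
  channel: string;
  group?: string;
}

function route(bindings: Binding[], msg: Inbound, fallback = "default"): string {
  for (const b of bindings) {
    if (b.channel && b.channel !== msg.channel) continue;
    if (b.group && b.group !== msg.group) continue;
    return b.agent; // configured, deterministic: no agent chose this
  }
  return fallback;
}
```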
This is a deliberate rejection of the "agentic orchestration" pattern where agents dynamically decide to spawn sub-agents and coordinate among themselves. That pattern introduces non-determinism and debugging complexity that's inappropriate for a personal assistant handling real messages from real people.
The routing approach is boring. It's also predictable, debuggable, and operationally simple.
Security as Concentric Circles
The security model follows a pattern I'd describe as concentric circles:
Outermost: Channel access control. Who can message the agent? Pairing codes, allowlists, group policies. This determines who gets in the door.
Middle: Tool policies. What can the agent do? Tool profiles (minimal, coding, messaging, full), per-agent overrides, per-group restrictions. A group chat might only have messaging tools; your DM session gets full access.
Innermost: Sandboxing. When enabled, tool execution runs in Docker containers. The non-main sandbox mode is clever: your main DM session runs on the host with full access (you trust yourself), while non-main sessions such as group chats run sandboxed (you don't trust everyone in the group).
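The middle circle, in particular, is just allowlist filtering applied before the model ever sees a tool. The profile contents below are illustrative, not OpenClaw's actual tool lists:

```typescript
// Tool profiles as allowlists: structural enforcement happens by
// filtering the tool set handed to the model, not by asking nicely
// in the prompt. Profile names echo the article; contents are made up.
const profiles: Record<string, string[]> = {
  minimal: ["send_message"],
  messaging: ["send_message", "list_channels"],
  full: ["send_message", "list_channels", "read_file", "exec_command", "browse_web"],
};

function toolsFor(profile: string, denied: string[] = []): string[] {
  // Per-group restrictions subtract from the profile; a tool not in
  // the resulting list simply does not exist for that session.
  return (profiles[profile] ?? []).filter((t) => !denied.includes(t));
}
```

A tool removed this way is structurally unavailable, which is the kind of hard constraint that prompt-level guardrails cannot provide on their own.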
The system prompt includes safety guardrails, but these are explicitly labeled as advisory. The documentation is honest about this: prompt-based safety doesn't enforce constraints, it suggests them. Hard constraints come from the structural layers — tool policies, sandboxing, and allowlists.
What This Architecture Tells Us
OpenClaw's design is full of choices that optimize for the single-operator, self-hosted use case at the expense of multi-tenant scalability. Embedded runtime over RPC. File system over database. Deterministic routing over emergent orchestration. Process-level trust over per-request isolation.
These aren't the right choices for building a cloud AI platform. But they're arguably the right choices for building personal AI infrastructure — systems where you are both the operator and the user, where operational simplicity matters more than horizontal scale, and where deep integration with your local environment is a feature, not a security risk.
As AI moves from cloud-hosted services to personal infrastructure, I expect we'll see more architectures that make these kinds of trade-offs. The patterns that work for SaaS don't automatically transfer to self-hosted systems, and vice versa.
Understanding where on that spectrum a system's architecture sits, from multi-tenant cloud service to single-operator self-hosted system, is more useful than judging whether each individual choice is "right."
Full documentation: OpenClaw Docs
GitHub: openclaw/openclaw
This is Part 2 of a series on AI agent infrastructure. Follow for more.