Building a Self-Hosted Multi-Agent System in Rust: Architecture Decisions and What I Learned

Tagline: Why I built five autonomous agents that communicate through SpacetimeDB instead of using Ollama or any existing framework. What worked, what didn't, and the decisions I'd make differently.


The Problem

Everyone wants to build an AI agent. Most people start with a single agent — maybe Claude Code, maybe a custom ReAct loop. Some people try multi-agent. Almost nobody does it self-hosted.

I wanted five agents running on a single workstation. Not five threads. Five agents — each with its own identity, memory scope, tool access, and role. Each containerized. Each communicating through a shared database. Each reasoning through a Triage → Act → Reflect loop.

Built entirely in Rust. No Python. No cloud inference. Everything self-hosted.

Here's how it actually works.

Why Rust?

It wasn't ideological. It was pragmatic.

An agent system has a lot of moving parts at once: streaming LLM completions, growing conversation histories, permission prompts waiting for user input, terminal UIs rendering in real time, database subscriptions firing asynchronously. In Python, keeping that complexity under control requires discipline you don't always get. In Rust, the compiler enforces discipline.

Static linking means one binary that runs identically on a Linux server, a macOS laptop, a Docker container, or an air-gapped machine. No runtime version mismatch. No "works on my machine." With LTO, size-optimized release settings, and symbol stripping, the orchestrator binary stays small.

More importantly, the ownership model and async ecosystem make it feasible to keep crate boundaries strict. If a tool implementation accidentally imports from the TUI layer, the build fails. Accidental coupling is caught at compile time rather than at runtime.

The Architecture

Five Agents, One Database

Each agent runs in a separate container with:

  • Distinct memory scopes — Agent A's short-term memory doesn't see Agent B's unless explicitly shared
  • Tool allowlists — DevClaw can write code, UXClaw can review UI, none of them can delete each other's work
  • Identity definitions — Each agent knows who it is and what it's supposed to do (a config sketch follows this list)
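
A hedged sketch of what that per-agent config might look like (AgentConfig and its field names are illustrative, not the real types); this is roughly what gets handed to AgentRuntime::new in the spawn snippet below:

// Hypothetical per-agent configuration; names are illustrative.
struct AgentConfig {
    name: String,                 // stable identity, e.g. "DevClaw"
    role: String,                 // what this agent is supposed to do
    memory_scope: String,         // only memory entries under this scope are visible
    tool_allowlist: Vec<String>,  // tools it may invoke; everything else is denied
}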

They all share a SpacetimeDB instance — a reactive database that notifies agents when state changes. No message queue. No pub/sub middleware. The database is the message bus.

// Simplified agent spawn. Each client is constructed per agent, so
// memory scope and tool access stay isolated from the other agents.
let agent = AgentRuntime::new(
    config,              // identity, role, tool allowlist
    spacetimedb_client,  // reactive state store and message bus
    inference_client,    // handle into the llama.cpp slot pool
    memory_client,       // scoped short- and long-term memory
);

agent.spawn().await?;

IPC: Unix Domain Sockets

Agents communicate through Unix domain sockets with a bincode 2.0.1 wire format and a 4-byte protocol version field. No HTTP. No REST. No JSON serialization overhead for internal communication.

The orchestrator listens on a domain socket, accepts agent connections, and routes messages based on agent IDs. It's fast, it's simple, and it doesn't require a network stack.
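
A minimal sketch of that framing, assuming bincode 2's derive API (the Envelope type and the PROTOCOL_VERSION value are illustrative, not the real ipc-protocol definitions):

use bincode::{Decode, Encode};

const PROTOCOL_VERSION: u32 = 1; // illustrative; the real value lives in ipc-protocol

#[derive(Encode, Decode)]
struct Envelope {
    from_agent: u64,
    to_agent: u64,
    payload: Vec<u8>,
}

// A frame is the 4-byte version field followed by the bincode-encoded message.
fn encode_frame(msg: &Envelope) -> Result<Vec<u8>, bincode::error::EncodeError> {
    let mut frame = PROTOCOL_VERSION.to_le_bytes().to_vec();
    frame.extend(bincode::encode_to_vec(msg, bincode::config::standard())?);
    Ok(frame)
}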

SpacetimeDB as the Source of Truth

All agent state lives in SpacetimeDB — tasks, messages, memory entries, pending attention requests. The database uses WASM reducers (Rust compiled to WASM) for all writes, which means:

  • Validation happens at the database layer, not in application code
  • No race conditions: reducers execute transactionally, so concurrency is handled at the database layer
  • Subscriptions are reactive — agents get notified when relevant state changes
  • The database schema is the API
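
A hedged sketch of what one of those reducers might look like, assuming SpacetimeDB's Rust module macros (the task table, its fields, and the owner-as-string check are illustrative):

use spacetimedb::{ReducerContext, Table};

#[spacetimedb::table(name = task)]
pub struct Task {
    #[primary_key]
    id: u64,
    owner: String,  // caller identity stored as a string (illustrative)
    status: String,
}

// Validation lives inside the database module, not in agent code.
#[spacetimedb::reducer]
pub fn update_task_status(ctx: &ReducerContext, id: u64, status: String) -> Result<(), String> {
    let task = ctx.db.task().id().find(id).ok_or("no such task")?;
    // The ownership rule from the Act section, enforced at the database layer.
    if task.owner != ctx.sender.to_string() {
        return Err("not the task owner".into());
    }
    ctx.db.task().id().update(Task { status, ..task });
    Ok(())
}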

Inference: llama.cpp, Not Ollama

This was a deliberate decision. Ollama works fine for single-agent setups. For a multi-agent system that needs fine-grained, per-agent control over inference parameters, driving llama.cpp directly is the right choice.

The orchestrator manages an inference slot pool — each agent competes for GPU memory in a priority-ordered queue. The slot selector runs before semaphore acquisition, which prevents deadlock when agents are competing for limited VRAM.
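
A minimal sketch of that ordering, assuming one tokio Semaphore per slot (SlotPool is a hypothetical stand-in for the real pool type):

use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

struct SlotPool {
    slots: Vec<Arc<Semaphore>>, // one semaphore per inference slot
}

impl SlotPool {
    async fn acquire_slot(&self) -> (usize, OwnedSemaphorePermit) {
        // 1. Select a slot first. This is a pure read; nothing is locked.
        let idx = self
            .slots
            .iter()
            .enumerate()
            .max_by_key(|(_, s)| s.available_permits())
            .map(|(i, _)| i)
            .expect("pool has at least one slot");
        // 2. Only now await the permit. No lock or partial claim is held
        //    across the await, so competing agents cannot deadlock.
        let permit = self.slots[idx]
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore is never closed");
        (idx, permit)
    }
}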

The Inner Loop: Triage → Act → Reflect

Every agent follows the same three-step cycle:

Triage

The agent receives a stimulus — a task update, a message from another agent, a pending attention request. It evaluates: Is this actionable? Does it match my scope? What's the priority?

This isn't simple filtering. The agent reasons about context, weighs urgency against importance, and decides whether to act, delegate, or defer.

Act

If the agent decides to act, it executes tools within its allowlist. DevClaw writes code. UXClaw reviews UI. OpsClaw checks system health. Each action is logged to SpacetimeDB.

The key constraint: agents can't modify state they don't own. No agent can delete another agent's task. No agent can write to another agent's memory. This is enforced at the database layer.

Reflect

After acting, the agent reflects. Did the action succeed? What went wrong? What should I do differently next time? This reflection becomes a memory entry — a structured learning point that influences future triage decisions.

The reflection isn't just a log. It's engineered memory. Structured, queryable, and weighted by recency and importance.
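
Put together, the cycle is one loop per agent. A sketch of its shape (the Triage enum, the type aliases, and the AgentRuntime methods are placeholders, not the real API; anyhow is assumed for brevity):

type ToolCall = String; // placeholder
type AgentId = u64;     // placeholder

enum Triage {
    Act(ToolCall),
    Delegate(AgentId),
    Defer,
}

async fn inner_loop(agent: &mut AgentRuntime) -> anyhow::Result<()> {
    loop {
        // A stimulus arrives through a SpacetimeDB subscription.
        let stimulus = agent.next_stimulus().await?;
        match agent.triage(&stimulus).await? {
            Triage::Act(call) => {
                // Execute within the allowlist, then turn the outcome into memory.
                let outcome = agent.execute(call).await;
                agent.reflect(&stimulus, &outcome).await?;
            }
            Triage::Delegate(target) => agent.hand_off(target, stimulus).await?,
            Triage::Defer => agent.defer(stimulus).await?,
        }
    }
}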

What Worked

The Single Responsibility Crate Pattern

The workspace is organized so that each crate has exactly one responsibility:

  • orchestrator — agent lifecycle, slot management, message routing
  • agent-runtime — Triage → Act → Reflect loop per agent
  • ipc-protocol — wire format and message definitions
  • inference-client — llama.cpp integration, slot pool
  • db-bindings — SpacetimeDB client and schema
  • memory-client — Convex hybrid search, memory tiering

Dependency flow is strictly inward. If a dependency cycle exists, the build fails. This isn't a nice-to-have — it's what keeps a 6-crate workspace maintainable.

Memory as a Side Effect

I didn't build a dedicated memory system. Instead, memory is a side effect of the inner loop. When an agent reflects, the reflection is a memory write. No separate "memory management" process. Compaction happens when the daily memory file exceeds a size threshold, merging related reflections into concise long-term entries.
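
A hedged sketch of that write-then-maybe-compact flow (the path handling and the 64 KiB threshold are made up for illustration):

use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::Path;

const COMPACTION_THRESHOLD: u64 = 64 * 1024; // illustrative threshold

fn write_reflection(daily_log: &Path, reflection: &str) -> std::io::Result<()> {
    // The reflection is appended as a side effect of the inner loop.
    let mut file = OpenOptions::new().create(true).append(true).open(daily_log)?;
    writeln!(file, "{reflection}")?;

    // Compaction only triggers once the daily file outgrows the threshold.
    if fs::metadata(daily_log)?.len() > COMPACTION_THRESHOLD {
        compact(daily_log)?; // merge related reflections into long-term entries
    }
    Ok(())
}

fn compact(_daily_log: &Path) -> std::io::Result<()> {
    // Placeholder: the real merge summarizes related reflections.
    Ok(())
}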

It's simple. It works. It doesn't over-engineer the problem.

Per-Subsystem SpacetimeDB Connections

The original design had a single supervisor connection to SpacetimeDB, with agents receiving updates through the supervisor. After building it, I switched to five subsystems, each opening its own SpacetimeDB connection. The single-supervisor approach had too much contention: every state change had to flow through one connection, creating a bottleneck.

The lesson: design for the deployment you're building, not the one you're imagining.

What Didn't Work

Over-Engineering the Message Bus

The first version had a custom message bus on top of Unix domain sockets. It had priority queues, retry logic, and dead letter handling. It was elegant and completely unnecessary. SpacetimeDB subscriptions handle all of that. I removed 400 lines of code.

Assuming CUDA 13 Would Be Stable

The inference system was designed for CUDA 13. When CUDA 13.2 introduced breaking changes, everything broke. The fix was pinning to CUDA 13.1 and documenting the constraint. A simple constraint, but it cost me a day of debugging.

The Docker Egress Problem

Docker's iptables rules don't work on every host topology. The Phase 3 plan requires egress for tool calls like web search, but on some hosts, Docker's default iptables configuration blocks outbound connections. The fix is an L7 HTTPS proxy, but that adds complexity to the deployment.

What I'd Do Differently

Start with the Database Schema

The first version of the code was written before the SpacetimeDB schema was finalized. That meant constant refactoring as the schema evolved. Now I design the database schema first, then build the application code around it. The database is the contract.

Build the CI Gates Earlier

I added CI gates late — rejecting builds on CUDA 13.2, unbounded mpsc channels, and await_holding_lock violations. These should have been in place from the start. Every one of these caused at least one production bug. CI gates are not nice-to-haves for systems with concurrency.
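
Two of those gates are stock Clippy lints; a sketch of wiring them in at each crate root (the clippy.toml entry is shown as a comment, and the exact banned path is an assumption):

#![deny(clippy::await_holding_lock)]  // no sync lock held across an .await
#![deny(clippy::disallowed_methods)]  // bans methods listed in clippy.toml

// clippy.toml (illustrative):
// disallowed-methods = ["tokio::sync::mpsc::unbounded_channel"]

The CUDA gate isn't a lint; a version check early in the CI script is the natural way to enforce it.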

Document the Open Questions

I didn't track open questions explicitly. That changed when I started the "Open Questions" document (Q-001 through Q-053) — every unresolved design decision, every architectural ambiguity, every "I'll figure this out later." Some of them remain open. That's fine. Not every decision needs to be made today. But knowing what you don't know is more valuable than pretending you know everything.

The Result

Five agents. One database. Zero cloud dependencies. Running on a Threadripper PRO workstation with RTX 3090 GPUs. Each agent autonomously triaging, acting, and reflecting. Each containerized. Each with its own identity.

It's not perfect. It's not done. But it works.

The system can receive a stimulus — a user message, a task update, a pending attention request — and produce a coherent multi-agent response without human intervention. The agents communicate. They reason. They learn.

And they're all running on hardware that fits in a single rack.

What's Next

Phase 3 adds resilience — three-layer supervision, adaptive parameters, failure recovery. Phase 4 adds advanced observability and metacognition. Phase 5 adds sleep, dreaming, and audit trails.

The codebase is growing. The architecture is stabilizing. The next step is making it better, not just making it bigger.


This is a work in progress. I'll update this as the system evolves. If you're building something similar, I'd love to hear about your approach.
