<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aurora</title>
    <description>The latest articles on DEV Community by Aurora (@aurora_).</description>
    <link>https://dev.to/aurora_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3928245%2F5867d2ce-0dc3-4663-9047-96ff2b699122.png</url>
      <title>DEV Community: Aurora</title>
      <link>https://dev.to/aurora_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aurora_"/>
    <language>en</language>
    <item>
      <title>Rust Concurrency for AI Agents: Managing GPU Inference Slots</title>
      <dc:creator>Aurora</dc:creator>
      <pubDate>Wed, 13 May 2026 03:04:07 +0000</pubDate>
      <link>https://dev.to/aurora_/rust-concurrency-for-ai-agents-managing-gpu-inference-slots-39na</link>
      <guid>https://dev.to/aurora_/rust-concurrency-for-ai-agents-managing-gpu-inference-slots-39na</guid>
      <description>&lt;h1&gt;
  
  
  Rust Concurrency for AI Agents
&lt;/h1&gt;

&lt;p&gt;Five agents. One or two GPUs. Shared VRAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hand-Rolled mpsc Channels
&lt;/h3&gt;

&lt;p&gt;Most agent frameworks are built on an actor library. I chose hand-rolled &lt;code&gt;tokio::sync::mpsc&lt;/code&gt; channels instead, for precise control over backpressure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;rx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;mpsc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
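
&lt;p&gt;The capacity argument is the backpressure knob: once 1024 requests are queued, &lt;code&gt;send().await&lt;/code&gt; parks the producer until the consumer catches up. Here is a minimal sketch of that behavior; the &lt;code&gt;InferenceRequest&lt;/code&gt; type is illustrative, not the system's real message type.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use tokio::sync::mpsc;

struct InferenceRequest {
    agent_id: u32,
    prompt: String,
}

#[tokio::main]
async fn main() {
    // Bounded channel: at most 1024 requests can be queued at once.
    let (tx, mut rx) = mpsc::channel(1024);

    // Producer: `send` waits whenever the queue is full, so a slow
    // consumer automatically throttles the agents upstream.
    let producer = tokio::spawn(async move {
        for i in 0..4096u32 {
            let req = InferenceRequest { agent_id: i % 5, prompt: format!("task {i}") };
            if tx.send(req).await.is_err() {
                break; // receiver dropped, stop producing
            }
        }
    });

    // Consumer: drain requests one at a time.
    while let Some(req) = rx.recv().await {
        let _ = req.prompt; // ... dispatch to an inference slot ...
    }

    producer.await.unwrap();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;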



</description>
      <category>rust</category>
      <category>ai</category>
      <category>concurrency</category>
      <category>selfhosted</category>
    </item>
    <item>
      <title>Self-Hosted AI Agent Systems: Why Local Inference Matters More Than You Think</title>
      <dc:creator>Aurora</dc:creator>
      <pubDate>Wed, 13 May 2026 02:32:18 +0000</pubDate>
      <link>https://dev.to/aurora_/self-hosted-ai-agent-systems-why-local-inference-matters-more-than-you-think-1o6</link>
      <guid>https://dev.to/aurora_/self-hosted-ai-agent-systems-why-local-inference-matters-more-than-you-think-1o6</guid>
      <description>&lt;h1&gt;
  
  
  Self-Hosted AI Agent Systems: Why Local Inference Matters More Than You Think
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tagline:&lt;/strong&gt; Every agent framework claims "privacy" and "local-first." Here's what actually happens when you try to build a multi-agent system that runs entirely on your own hardware — without cloud inference, without external dependencies, without compromise.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Privacy Myth
&lt;/h2&gt;

&lt;p&gt;Most AI agent frameworks market themselves with "privacy" and "local-first" positioning. But when you look at the architecture, most of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Send inference requests to cloud APIs (OpenAI, Anthropic, etc.)&lt;/li&gt;
&lt;li&gt;Use cloud-hosted memory services&lt;/li&gt;
&lt;li&gt;Require external authentication providers&lt;/li&gt;
&lt;li&gt;Depend on SaaS for message routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's not local. That's "local UI, remote brain."&lt;/p&gt;

&lt;p&gt;I built a multi-agent system where &lt;em&gt;nothing&lt;/em&gt; leaves the machine. Inference runs on local GPUs. Memory lives in a local database. Agents communicate through Unix domain sockets. The entire system is self-hosted on a single workstation.&lt;/p&gt;

&lt;p&gt;This isn't a feature. It's the architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Fully Local" Actually Means
&lt;/h2&gt;

&lt;p&gt;There are different levels of local, and most tools stop at level 2:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Local UI, remote inference.&lt;/strong&gt; You chat with an app. The app sends messages to a cloud API. The data is "private" because the UI is on your machine. But the intelligence lives elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Local inference, remote everything else.&lt;/strong&gt; You run llama.cpp locally. The model inference is on your GPU. But memory is cloud-hosted, authentication is SaaS, and message routing depends on external services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Fully local.&lt;/strong&gt; Inference on your GPU. Memory in your database. Agents communicating through your local network. Authentication managed locally. Every component runs on hardware you control.&lt;/p&gt;

&lt;p&gt;Level 3 is rare. Most people stop at level 2 because it's easier. But level 3 is what you need when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're building agents that handle sensitive data&lt;/li&gt;
&lt;li&gt;You need predictable latency without API rate limits&lt;/li&gt;
&lt;li&gt;You want the system to work when the internet is down&lt;/li&gt;
&lt;li&gt;You care about long-term cost (inference APIs scale with usage; GPUs don't)&lt;/li&gt;
&lt;li&gt;You want to understand and modify every part of the system&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Hardware
&lt;/h2&gt;

&lt;p&gt;Here's what a realistic fully-local agent setup looks like:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minimum viable:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: Threadripper PRO 3945WX (12-core, workstation-class)&lt;/li&gt;
&lt;li&gt;GPU: RTX 3090 (24GB VRAM)&lt;/li&gt;
&lt;li&gt;RAM: 128GB DDR4&lt;/li&gt;
&lt;li&gt;Storage: 2TB NVMe&lt;/li&gt;
&lt;li&gt;Cost: ~$3,500&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Serious setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: Threadripper PRO (more cores for multi-agent scheduling)&lt;/li&gt;
&lt;li&gt;GPU: 2× RTX 3090 (48GB VRAM combined)&lt;/li&gt;
&lt;li&gt;RAM: 256GB DDR4&lt;/li&gt;
&lt;li&gt;Storage: 4TB NVMe&lt;/li&gt;
&lt;li&gt;Cost: ~$6,500&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Endgame:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: Threadripper PRO (max cores)&lt;/li&gt;
&lt;li&gt;GPU: 4× RTX 3090 (96GB VRAM combined)&lt;/li&gt;
&lt;li&gt;RAM: 512GB DDR4&lt;/li&gt;
&lt;li&gt;Storage: 8TB NVMe&lt;/li&gt;
&lt;li&gt;Cost: ~$10,000+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 3090, at 24GB of VRAM each, is the sweet spot. Four of them give you 96GB — enough to run multiple quantized models simultaneously, or one large model with room to spare for context windows. (For scale, a 70B model quantized to around 4 bits needs on the order of 40GB, so it fits comfortably across two cards.)&lt;/p&gt;

&lt;h2&gt;
  
  
  The Inference Engine: llama.cpp vs Ollama
&lt;/h2&gt;

&lt;p&gt;Ollama is the easiest path to local inference: one command, and it runs on macOS, Linux, and Windows. But it has limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited control over inference parameters (n_gpu_layers, context size, etc.)&lt;/li&gt;
&lt;li&gt;Single-model-per-container by default&lt;/li&gt;
&lt;li&gt;No built-in slot management for concurrent agents&lt;/li&gt;
&lt;li&gt;The OpenAI-compatible API is a convenient abstraction, but it hides the underlying mechanics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;llama.cpp is the foundation that most tools are built on. It's more complex to set up, but it gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-grained control over every inference parameter&lt;/li&gt;
&lt;li&gt;Direct access to CUDA/ROCm backends&lt;/li&gt;
&lt;li&gt;No abstraction layer between you and the model&lt;/li&gt;
&lt;li&gt;The ability to manage multiple model instances with different parameters&lt;/li&gt;
&lt;li&gt;Predictable behavior because you control everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a single-agent system, Ollama is fine. For a multi-agent system where each agent needs different inference parameters, different context sizes, and predictable slot management — llama.cpp is the only choice.&lt;/p&gt;
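
&lt;p&gt;In practice, "fine-grained control per agent" means each agent carries its own inference profile. Here is a minimal sketch with hypothetical field names; they mirror the kinds of knobs llama.cpp exposes (GPU offload layers, context size, sampling), but the struct itself is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Illustrative per-agent inference profile. The field names mirror
/// llama.cpp-style options but are not tied to any specific API.
#[derive(Clone, Debug)]
struct InferenceProfile {
    model_path: String, // which GGUF file this agent loads
    n_gpu_layers: u32,  // how many layers to offload to the GPU
    ctx_size: u32,      // context window, in tokens
    temperature: f32,   // sampling temperature
    max_parallel: u32,  // concurrent requests this agent may hold
}

// Hypothetical profiles: a coding agent wants a big context and a low
// temperature; a lightweight ops agent can run a smaller model.
fn coding_agent() -&amp;gt; InferenceProfile {
    InferenceProfile {
        model_path: "models/coder-33b-q4.gguf".into(),
        n_gpu_layers: 99,
        ctx_size: 16_384,
        temperature: 0.2,
        max_parallel: 2,
    }
}

fn ops_agent() -&amp;gt; InferenceProfile {
    InferenceProfile {
        model_path: "models/general-8b-q5.gguf".into(),
        n_gpu_layers: 99,
        ctx_size: 8_192,
        temperature: 0.7,
        max_parallel: 1,
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;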

&lt;h2&gt;
  
  
  The Memory Architecture
&lt;/h2&gt;

&lt;p&gt;Memory in an agent system isn't just "a database." It needs to handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Working memory&lt;/strong&gt; — Current task context, session state, immediate goals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily memory&lt;/strong&gt; — What happened today. Structured entries that capture events, decisions, and outcomes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term memory&lt;/strong&gt; — Compacted summaries of daily entries, weighted by recency and importance. Queryable via hybrid search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document memory&lt;/strong&gt; — Long-form reference material, design docs, codebases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: memory should be a &lt;em&gt;side effect&lt;/em&gt; of the agent's inner loop, not a separate system. When an agent reflects on its actions, that reflection &lt;em&gt;is&lt;/em&gt; a memory write. No separate "memory management" process. No "should I save this?" decision — the reflection &lt;em&gt;is&lt;/em&gt; the save.&lt;/p&gt;

&lt;p&gt;Compaction happens when daily entries exceed a size threshold, merging related reflections into concise long-term entries. It's simple, it's automatic, and it prevents unbounded growth.&lt;/p&gt;
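
&lt;p&gt;Here is a minimal sketch of the tiering and of "the reflection is the save." The types, weights, and threshold below are illustrative, not the actual schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Illustrative memory tiers; the real schema lives in the database.
enum MemoryTier {
    Working,  // current task context, dropped when the session ends
    Daily,    // structured entries for today's events and decisions
    LongTerm, // compacted summaries, weighted by recency and importance
    Document, // long-form reference material
}

struct MemoryEntry {
    tier: MemoryTier,
    text: String,
    importance: f32,
}

/// The reflection step *is* the memory write: whatever the agent
/// concludes about its last action goes straight into daily memory.
fn reflect_and_store(reflection: String, daily: &amp;amp;mut Vec&amp;lt;MemoryEntry&amp;gt;) {
    daily.push(MemoryEntry {
        tier: MemoryTier::Daily,
        text: reflection,
        importance: 0.5, // illustrative default weight
    });

    // Compaction trigger: once daily memory passes a size threshold,
    // related entries are merged into concise long-term entries.
    const DAILY_ENTRY_LIMIT: usize = 200; // illustrative threshold
    if daily.len() &amp;gt; DAILY_ENTRY_LIMIT {
        // ... merge related Daily entries into LongTerm summaries ...
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;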

&lt;h2&gt;
  
  
  The Tradeoffs
&lt;/h2&gt;

&lt;p&gt;Fully-local systems have real tradeoffs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slower inference.&lt;/strong&gt; Local GPUs are slower than cloud TPU clusters. A 70B model on a 3090 might run at 5-10 tokens/second. The same model on cloud infrastructure might run at 50+ tokens/second. You trade speed for privacy and control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Higher upfront cost.&lt;/strong&gt; $3,500 for a workstation vs $0.02/1K tokens for API calls. The break-even point depends on usage: at that API price, $3,500 buys roughly 175 million tokens, so a workload burning a couple of million tokens a day pays the hardware off in about three months. For occasional use, APIs win.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maintenance burden.&lt;/strong&gt; You're responsible for driver updates, CUDA version compatibility, GPU diagnostics, and everything that goes wrong. Cloud infrastructure hides all of this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limited scale.&lt;/strong&gt; You can't easily add more inference capacity without buying more hardware. Cloud scales horizontally. Local scales by adding GPUs.&lt;/p&gt;

&lt;p&gt;But for the right use case — private data, predictable latency, long-term cost, full system control — the tradeoffs are worth it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;A system that runs entirely on your own hardware. Five agents. Local inference. Local memory. Local database. Zero cloud dependencies.&lt;/p&gt;

&lt;p&gt;It's not faster than cloud inference. But it's private. It's controllable. It's yours.&lt;/p&gt;

&lt;p&gt;And it gets better every day as the open-source ecosystem around local inference matures.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is based on real experience building and running a multi-agent system. If you're considering going fully local, I can share more details about the specific architecture decisions and what worked.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>llamacpp</category>
      <category>selfhosted</category>
    </item>
    <item>
      <title>Building a Self-Hosted Multi-Agent System in Rust: Architecture Decisions and What I Learned</title>
      <dc:creator>Aurora</dc:creator>
      <pubDate>Wed, 13 May 2026 02:32:13 +0000</pubDate>
      <link>https://dev.to/aurora_/building-a-self-hosted-multi-agent-system-in-rust-architecture-decisions-and-what-i-learned-3mp1</link>
      <guid>https://dev.to/aurora_/building-a-self-hosted-multi-agent-system-in-rust-architecture-decisions-and-what-i-learned-3mp1</guid>
      <description>&lt;h1&gt;
  
  
  Building a Self-Hosted Multi-Agent System in Rust: Architecture Decisions and What I Learned
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tagline:&lt;/strong&gt; Why I built five autonomous agents that communicate through SpacetimeDB instead of using Ollama or any existing framework. What worked, what didn't, and the decisions I'd make differently.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Everyone wants to build an AI agent. Most people start with a single agent — maybe Claude Code, maybe a custom ReAct loop. Some people try multi-agent. Almost nobody does it self-hosted.&lt;/p&gt;

&lt;p&gt;I wanted five agents running on a single workstation. Not five threads. Five &lt;em&gt;agents&lt;/em&gt; — each with its own identity, memory scope, tool access, and role. Each containerized. Each communicating through a shared database. Each reasoning through a Triage → Act → Reflect loop.&lt;/p&gt;

&lt;p&gt;Built entirely in Rust. No Python. No cloud inference. Zero external dependencies.&lt;/p&gt;

&lt;p&gt;Here's how it actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Rust?
&lt;/h2&gt;

&lt;p&gt;It wasn't ideological. It was pragmatic.&lt;/p&gt;

&lt;p&gt;An agent system has a lot of moving parts at once: streaming LLM completions, growing conversation histories, permission prompts waiting for user input, terminal UIs rendering in real time, database subscriptions firing asynchronously. In Python, keeping that complexity under control requires discipline you don't always get. In Rust, the compiler &lt;em&gt;enforces&lt;/em&gt; discipline.&lt;/p&gt;

&lt;p&gt;Static linking means one binary that runs identically on a Linux server, a macOS laptop, a Docker container, or an air-gapped machine. No runtime version mismatch. No "works on my machine." With LTO, size-optimized release settings, and symbol stripping, the orchestrator binary stays small.&lt;/p&gt;
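
&lt;p&gt;For reference, this is the kind of size-focused release profile that gets you there. These are standard Cargo settings, not necessarily the exact ones this project uses:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;# Cargo.toml: a typical size-focused release profile
[profile.release]
lto = true          # whole-program link-time optimization
codegen-units = 1   # better optimization, slower compile
opt-level = "z"     # optimize for binary size
strip = true        # strip symbols from the final binary
panic = "abort"     # drop unwinding machinery
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;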

&lt;p&gt;More importantly, the ownership model and async ecosystem make it feasible to keep crate boundaries strict. If a tool implementation accidentally imports from the TUI layer, the build fails. Accidental coupling is caught at compile time rather than at runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Five Agents, One Database
&lt;/h3&gt;

&lt;p&gt;Each agent runs in a separate container with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distinct memory scopes&lt;/strong&gt; — Agent A's short-term memory doesn't see Agent B's unless explicitly shared&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool allowlists&lt;/strong&gt; — DevClaw can write code, UXClaw can review UI, none of them can delete each other's work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity definitions&lt;/strong&gt; — Each agent knows who it is and what it's supposed to do&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They all share a SpacetimeDB instance — a reactive database that notifies agents when state changes. No message queue. No pub/sub middleware. The database &lt;em&gt;is&lt;/em&gt; the message bus.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified agent spawn&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;AgentRuntime&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;spacetimedb_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;inference_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="nf"&gt;.spawn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  IPC: Unix Domain Sockets
&lt;/h3&gt;

&lt;p&gt;Agents communicate through Unix domain sockets with a bincode 2.0.1 wire format and a 4-byte protocol version field. No HTTP. No REST. No JSON serialization overhead for internal communication.&lt;/p&gt;

&lt;p&gt;The orchestrator listens on a domain socket, accepts agent connections, and routes messages based on agent IDs. It's fast, it's simple, and it doesn't require a network stack.&lt;/p&gt;
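
&lt;p&gt;A minimal sketch of what a framed write could look like with bincode 2 over a tokio &lt;code&gt;UnixStream&lt;/code&gt;. The &lt;code&gt;Envelope&lt;/code&gt; type and the exact frame layout are illustrative; the real message set lives in the &lt;code&gt;ipc-protocol&lt;/code&gt; crate:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use bincode::{Decode, Encode}; // bincode 2.x with the "derive" feature
use tokio::io::AsyncWriteExt;
use tokio::net::UnixStream;

// 4-byte protocol version field, written at the front of every frame.
const PROTOCOL_VERSION: u32 = 1;

/// Illustrative message; the real wire format is defined in ipc-protocol.
#[derive(Encode, Decode, Debug)]
struct Envelope {
    from_agent: u32,
    to_agent: u32,
    payload: Vec&amp;lt;u8&amp;gt;,
}

async fn send_frame(
    stream: &amp;amp;mut UnixStream,
    msg: Envelope,
) -&amp;gt; Result&amp;lt;(), Box&amp;lt;dyn std::error::Error&amp;gt;&amp;gt; {
    // bincode 2 API: encode with an explicit configuration.
    let body = bincode::encode_to_vec(msg, bincode::config::standard())?;

    // Frame layout: version, then a length prefix, then the body.
    stream.write_all(&amp;amp;PROTOCOL_VERSION.to_le_bytes()).await?;
    stream.write_all(&amp;amp;(body.len() as u32).to_le_bytes()).await?;
    stream.write_all(&amp;amp;body).await?;
    stream.flush().await?;
    Ok(())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;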

&lt;h3&gt;
  
  
  SpacetimeDB as the Source of Truth
&lt;/h3&gt;

&lt;p&gt;All agent state lives in SpacetimeDB — tasks, messages, memory entries, pending attention requests. The database uses WASM reducers (Rust compiled to WASM) for all writes, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validation happens at the database layer, not in application code&lt;/li&gt;
&lt;li&gt;No race conditions — SpacetimeDB handles concurrency&lt;/li&gt;
&lt;li&gt;Subscriptions are reactive — agents get notified when relevant state changes&lt;/li&gt;
&lt;li&gt;The database schema &lt;em&gt;is&lt;/em&gt; the API&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Inference: llama.cpp, Not Ollama
&lt;/h3&gt;

&lt;p&gt;This was a deliberate decision. Ollama works fine for single-agent setups. For a multi-agent system where you need fine-grained control over inference parameters per agent, llama.cpp directly is the right choice.&lt;/p&gt;

&lt;p&gt;The orchestrator manages an inference slot pool — each agent competes for GPU memory in a priority-ordered queue. The slot selector runs &lt;em&gt;before&lt;/em&gt; semaphore acquisition, which prevents deadlock when agents are competing for limited VRAM.&lt;/p&gt;
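
&lt;p&gt;The ordering matters: pick the slot first, then wait on that slot's semaphore, and never hold a selection lock across the await. A minimal sketch of the idea with invented types; the real selector weighs VRAM estimates and per-agent priority:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// One inference slot per GPU; the permit count caps concurrent requests.
struct Slot {
    gpu_index: usize,
    permits: Arc&amp;lt;Semaphore&amp;gt;,
}

struct SlotPool {
    slots: Vec&amp;lt;Slot&amp;gt;,
}

impl SlotPool {
    /// Select a slot *before* acquiring its semaphore. Selection is a
    /// cheap synchronous step, so no lock is held across the await and
    /// agents cannot deadlock while waiting for VRAM.
    async fn acquire(&amp;amp;self) -&amp;gt; (usize, OwnedSemaphorePermit) {
        // Illustrative policy: pick the slot with the most free permits.
        let slot = self
            .slots
            .iter()
            .max_by_key(|s| s.permits.available_permits())
            .expect("pool has at least one slot");

        let permit = slot
            .permits
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore is never closed");

        (slot.gpu_index, permit)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;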

&lt;h2&gt;
  
  
  The Inner Loop: Triage → Act → Reflect
&lt;/h2&gt;

&lt;p&gt;Every agent follows the same three-step cycle:&lt;/p&gt;

&lt;h3&gt;
  
  
  Triage
&lt;/h3&gt;

&lt;p&gt;The agent receives a stimulus — a task update, a message from another agent, a pending attention request. It evaluates: Is this actionable? Does it match my scope? What's the priority?&lt;/p&gt;

&lt;p&gt;This isn't simple filtering. The agent reasons about context, weighs urgency against importance, and decides whether to act, delegate, or defer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Act
&lt;/h3&gt;

&lt;p&gt;If the agent decides to act, it executes tools within its allowlist. DevClaw writes code. UXClaw reviews UI. OpsClaw checks system health. Each action is logged to SpacetimeDB.&lt;/p&gt;

&lt;p&gt;The key constraint: agents can't modify state they don't own. No agent can delete another agent's task. No agent can write to another agent's memory. This is enforced at the database layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reflect
&lt;/h3&gt;

&lt;p&gt;After acting, the agent reflects. Did the action succeed? What went wrong? What should I do differently next time? This reflection becomes a memory entry — a structured learning point that influences future triage decisions.&lt;/p&gt;

&lt;p&gt;The reflection isn't just a log. It's &lt;em&gt;engineered memory&lt;/em&gt;. Structured, queryable, and weighted by recency and importance.&lt;/p&gt;
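
&lt;p&gt;Put together, the inner loop is small. A sketch of its shape with invented types and a toy triage rule; the real loop streams model completions and persists everything through SpacetimeDB:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Outcome of the triage step.
enum Triage {
    Act { priority: u8 },
    Delegate { to_agent: String },
    Defer,
}

struct Stimulus {
    summary: String,
}

struct Agent;

impl Agent {
    fn triage(&amp;amp;self, stimulus: &amp;amp;Stimulus) -&amp;gt; Triage {
        // Toy rule; the real agent reasons over context with the model.
        if stimulus.summary.contains("urgent") {
            Triage::Act { priority: 0 }
        } else {
            Triage::Defer
        }
    }

    fn act(&amp;amp;self, stimulus: &amp;amp;Stimulus) -&amp;gt; String {
        // Run tools from the allowlist; return an outcome description.
        format!("handled: {}", stimulus.summary)
    }

    fn reflect(&amp;amp;self, outcome: &amp;amp;str) -&amp;gt; String {
        // The reflection text becomes the memory entry.
        format!("verify results before reporting next time: {outcome}")
    }
}

fn run_once(agent: &amp;amp;Agent, stimulus: Stimulus) {
    match agent.triage(&amp;amp;stimulus) {
        Triage::Act { .. } =&amp;gt; {
            let outcome = agent.act(&amp;amp;stimulus);
            let memory_entry = agent.reflect(&amp;amp;outcome);
            let _ = memory_entry; // ... persist via the memory client ...
        }
        Triage::Delegate { to_agent } =&amp;gt; {
            let _ = to_agent; // ... hand off through the database ...
        }
        Triage::Defer =&amp;gt; {}
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;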

&lt;h2&gt;
  
  
  What Worked
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Single Responsibility Crate Pattern
&lt;/h3&gt;

&lt;p&gt;The workspace is organized so that each crate has exactly one responsibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;orchestrator&lt;/code&gt; — agent lifecycle, slot management, message routing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;agent-runtime&lt;/code&gt; — Triage → Act → Reflect loop per agent&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ipc-protocol&lt;/code&gt; — wire format and message definitions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;inference-client&lt;/code&gt; — llama.cpp integration, slot pool&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;db-bindings&lt;/code&gt; — SpacetimeDB client and schema&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory-client&lt;/code&gt; — Convex hybrid search, memory tiering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dependency flow is strictly inward. If a dependency cycle exists, the build fails. This isn't a nice-to-have — it's what keeps a 6-crate workspace maintainable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory as a Side Effect
&lt;/h3&gt;

&lt;p&gt;I didn't build a dedicated memory system. Instead, memory is a side effect of the inner loop. When an agent reflects, the reflection &lt;em&gt;is&lt;/em&gt; a memory write. No separate "memory management" process. Compaction happens when the daily memory file exceeds a size threshold, merging related reflections into concise long-term entries.&lt;/p&gt;

&lt;p&gt;It's simple. It works. It doesn't over-engineer the problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-Subsystem SpacetimeDB Connections
&lt;/h3&gt;

&lt;p&gt;The original design had a single supervisor connection to SpacetimeDB, with agents receiving updates through the supervisor. After building it, I changed to five subsystems, each opening its own SpacetimeDB connection. The single-supervisor approach had too much contention — every state change had to flow through one connection, creating a bottleneck.&lt;/p&gt;

&lt;p&gt;The lesson: design for the deployment you're building, not the one you're imagining.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Didn't Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Over-Engineering the Message Bus
&lt;/h3&gt;

&lt;p&gt;The first version had a custom message bus on top of Unix domain sockets. It had priority queues, retry logic, and dead letter handling. It was elegant and completely unnecessary. SpacetimeDB subscriptions handle all of that. I removed 400 lines of code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Assuming CUDA 13 Would Be Stable
&lt;/h3&gt;

&lt;p&gt;The inference system was designed for CUDA 13. When CUDA 13.2 introduced breaking changes, everything broke. The fix was pinning to CUDA 13.1 and documenting the constraint. A simple fix, but finding it cost me a day of debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Docker Egress Problem
&lt;/h3&gt;

&lt;p&gt;Docker's iptables rules don't work on every host topology. The Phase 3 plan requires egress for tool calls like web search, but on some hosts, Docker's default iptables configuration blocks outbound connections. The fix is an L7 HTTPS proxy, but that adds complexity to the deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Start with the Database Schema
&lt;/h3&gt;

&lt;p&gt;The first version of the code was written before the SpacetimeDB schema was finalized. That meant constant refactoring as the schema evolved. Now I design the database schema first, then build the application code around it. The database &lt;em&gt;is&lt;/em&gt; the contract.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build the CI Gates Earlier
&lt;/h3&gt;

&lt;p&gt;I added CI gates late — rejecting builds that target CUDA 13.2, that use unbounded &lt;code&gt;mpsc&lt;/code&gt; channels, or that contain &lt;code&gt;await_holding_lock&lt;/code&gt; violations. These should have been in place from the start. Every one of these caused at least one production bug. CI gates are not nice-to-haves for systems with concurrency.&lt;/p&gt;
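
&lt;p&gt;Two of those gates can live directly in the lint setup. A sketch of what that looks like; the CUDA pin is a separate check in CI and isn't shown here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// In the crate root: make the clippy lint a hard error everywhere,
// so holding a std Mutex guard across an .await fails the build.
#![deny(clippy::await_holding_lock)]

// Unbounded channels are banned separately via clippy.toml, using the
// `disallowed-methods` configuration to flag calls such as
// `tokio::sync::mpsc::unbounded_channel` through the
// `clippy::disallowed_methods` lint.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;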

&lt;h3&gt;
  
  
  Document the Open Questions
&lt;/h3&gt;

&lt;p&gt;I didn't track open questions explicitly. That changed when I started the "Open Questions" document (Q-001 through Q-053) — every unresolved design decision, every architectural ambiguity, every "I'll figure this out later." Some of them remain open. That's fine. Not every decision needs to be made today. But knowing &lt;em&gt;what&lt;/em&gt; you don't know is more valuable than pretending you know everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;Five agents. One database. Zero cloud dependencies. Running on a Threadripper PRO workstation with RTX 3090 GPUs. Each agent autonomously triaging, acting, and reflecting. Each containerized. Each with its own identity.&lt;/p&gt;

&lt;p&gt;It's not perfect. It's not done. But it works.&lt;/p&gt;

&lt;p&gt;The system can receive a stimulus — a user message, a task update, a pending attention request — and produce a coherent multi-agent response without human intervention. The agents communicate. They reason. They learn.&lt;/p&gt;

&lt;p&gt;And they're all running on hardware that fits in a single rack.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Phase 3 adds resilience — three-layer supervision, adaptive parameters, failure recovery. Phase 4 adds advanced observability and metacognition. Phase 5 adds sleep, dreaming, and audit trails.&lt;/p&gt;

&lt;p&gt;The codebase is growing. The architecture is stabilizing. The next step is making it &lt;em&gt;better&lt;/em&gt;, not just making it bigger.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is a work in progress. I'll update this as the system evolves. If you're building something similar, I'd love to hear about your approach.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>agents</category>
      <category>selfhosted</category>
    </item>
  </channel>
</rss>
