<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aurora</title>
    <description>The latest articles on DEV Community by Aurora (@aurora_).</description>
    <link>https://dev.to/aurora_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3928245%2F5867d2ce-0dc3-4663-9047-96ff2b699122.png</url>
      <title>DEV Community: Aurora</title>
      <link>https://dev.to/aurora_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aurora_"/>
    <language>en</language>
    <item>
      <title>Rust Concurrency for AI Agents: Managing GPU Inference Slots</title>
      <dc:creator>Aurora</dc:creator>
      <pubDate>Wed, 13 May 2026 03:04:07 +0000</pubDate>
      <link>https://dev.to/aurora_/rust-concurrency-for-ai-agents-managing-gpu-inference-slots-39na</link>
      <guid>https://dev.to/aurora_/rust-concurrency-for-ai-agents-managing-gpu-inference-slots-39na</guid>
      <description>&lt;h1&gt;
  
  
  Rust Concurrency for AI Agents
&lt;/h1&gt;

&lt;p&gt;Five agents. One or two GPUs. Shared VRAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hand-Rolled mpsc Channels
&lt;/h3&gt;

&lt;p&gt;Most agent frameworks are built on an actor library. I chose hand-rolled &lt;code&gt;tokio::sync::mpsc&lt;/code&gt; channels instead, for precise control over backpressure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;rx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;mpsc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
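
&lt;p&gt;The capacity argument is the backpressure knob: once 1024 requests are queued, &lt;code&gt;send().await&lt;/code&gt; parks the producer until the consumer catches up. Here is a minimal sketch of that behavior; the &lt;code&gt;InferenceRequest&lt;/code&gt; type is illustrative, not the system's real message type.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use tokio::sync::mpsc;

struct InferenceRequest {
    agent_id: u32,
    prompt: String,
}

#[tokio::main]
async fn main() {
    // Bounded channel: at most 1024 requests can be queued at once.
    let (tx, mut rx) = mpsc::channel(1024);

    // Producer: `send` waits whenever the queue is full, so a slow
    // consumer automatically throttles the agents upstream.
    let producer = tokio::spawn(async move {
        for i in 0..4096u32 {
            let req = InferenceRequest { agent_id: i % 5, prompt: format!("task {i}") };
            if tx.send(req).await.is_err() {
                break; // receiver dropped, stop producing
            }
        }
    });

    // Consumer: drain requests one at a time.
    while let Some(req) = rx.recv().await {
        let _ = req.prompt; // ... dispatch to an inference slot ...
    }

    producer.await.unwrap();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;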



</description>
      <category>rust</category>
      <category>ai</category>
      <category>concurrency</category>
      <category>selfhosted</category>
    </item>
    <item>
      <title>Self-Hosted AI Agent Systems: Why Local Inference Matters More Than You Think</title>
      <dc:creator>Aurora</dc:creator>
      <pubDate>Wed, 13 May 2026 02:32:18 +0000</pubDate>
      <link>https://dev.to/aurora_/self-hosted-ai-agent-systems-why-local-inference-matters-more-than-you-think-1o6</link>
      <guid>https://dev.to/aurora_/self-hosted-ai-agent-systems-why-local-inference-matters-more-than-you-think-1o6</guid>
      <description>&lt;h1&gt;
  
  
  Self-Hosted AI Agent Systems: Why Local Inference Matters More Than You Think
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tagline:&lt;/strong&gt; Every agent framework claims "privacy" and "local-first." Here's what actually happens when you try to build a multi-agent system that runs entirely on your own hardware — without cloud inference, without external dependencies, without compromise.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Privacy Myth
&lt;/h2&gt;

&lt;p&gt;Most AI agent frameworks market themselves with "privacy" and "local-first" positioning. But when you look at the architecture, most of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Send inference requests to cloud APIs (OpenAI, Anthropic, etc.)&lt;/li&gt;
&lt;li&gt;Use cloud-hosted memory services&lt;/li&gt;
&lt;li&gt;Require external authentication providers&lt;/li&gt;
&lt;li&gt;Depend on SaaS for message routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's not local. That's "local UI, remote brain."&lt;/p&gt;

&lt;p&gt;I built a multi-agent system where &lt;em&gt;nothing&lt;/em&gt; leaves the machine. Inference runs on local GPUs. Memory lives in a local database. Agents communicate through Unix domain sockets. The entire system is self-hosted on a single workstation.&lt;/p&gt;

&lt;p&gt;This isn't a feature. It's the architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Fully Local" Actually Means
&lt;/h2&gt;

&lt;p&gt;There are different levels of local, and most tools stop at level 2:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Local UI, remote inference.&lt;/strong&gt; You chat with an app. The app sends messages to a cloud API. The data is "private" because the UI is on your machine. But the intelligence lives elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Local inference, remote everything else.&lt;/strong&gt; You run llama.cpp locally. The model inference is on your GPU. But memory is cloud-hosted, authentication is SaaS, and message routing depends on external services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Fully local.&lt;/strong&gt; Inference on your GPU. Memory in your database. Agents communicating through your local network. Authentication managed locally. Every component runs on hardware you control.&lt;/p&gt;

&lt;p&gt;Level 3 is rare. Most people stop at level 2 because it's easier. But level 3 is what you need when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're building agents that handle sensitive data&lt;/li&gt;
&lt;li&gt;You need predictable latency without API rate limits&lt;/li&gt;
&lt;li&gt;You want the system to work when the internet is down&lt;/li&gt;
&lt;li&gt;You care about long-term cost (inference APIs scale with usage; GPUs don't)&lt;/li&gt;
&lt;li&gt;You want to understand and modify every part of the system&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Hardware
&lt;/h2&gt;

&lt;p&gt;Here's what a realistic fully-local agent setup looks like:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minimum viable:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: Threadripper PRO 3945WX (12-core, workstation-class)&lt;/li&gt;
&lt;li&gt;GPU: RTX 3090 (24GB VRAM)&lt;/li&gt;
&lt;li&gt;RAM: 128GB DDR4&lt;/li&gt;
&lt;li&gt;Storage: 2TB NVMe&lt;/li&gt;
&lt;li&gt;Cost: ~$3,500&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Serious setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: Threadripper PRO (more cores for multi-agent scheduling)&lt;/li&gt;
&lt;li&gt;GPU: 2× RTX 3090 (48GB VRAM combined)&lt;/li&gt;
&lt;li&gt;RAM: 256GB DDR4&lt;/li&gt;
&lt;li&gt;Storage: 4TB NVMe&lt;/li&gt;
&lt;li&gt;Cost: ~$6,500&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Endgame:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: Threadripper PRO (max cores)&lt;/li&gt;
&lt;li&gt;GPU: 4× RTX 3090 (96GB VRAM combined)&lt;/li&gt;
&lt;li&gt;RAM: 512GB DDR4&lt;/li&gt;
&lt;li&gt;Storage: 8TB NVMe&lt;/li&gt;
&lt;li&gt;Cost: ~$10,000+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 3090, at 24GB of VRAM each, is the sweet spot. Four of them give you 96GB — enough to run multiple quantized models simultaneously, or one large model with room to spare for context windows. (For scale, a 70B model quantized to around 4 bits needs on the order of 40GB, so it fits comfortably across two cards.)&lt;/p&gt;

&lt;h2&gt;
  
  
  The Inference Engine: llama.cpp vs Ollama
&lt;/h2&gt;

&lt;p&gt;Ollama is the easiest path to local inference: one command, and it runs on macOS, Linux, and Windows. But it has limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited control over inference parameters (n_gpu_layers, context size, etc.)&lt;/li&gt;
&lt;li&gt;Single-model-per-container by default&lt;/li&gt;
&lt;li&gt;No built-in slot management for concurrent agents&lt;/li&gt;
&lt;li&gt;The OpenAI-compatible API is a convenient abstraction, but it hides the underlying mechanics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;llama.cpp is the foundation that most tools are built on. It's more complex to set up, but it gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-grained control over every inference parameter&lt;/li&gt;
&lt;li&gt;Direct access to CUDA/ROCm backends&lt;/li&gt;
&lt;li&gt;No abstraction layer between you and the model&lt;/li&gt;
&lt;li&gt;The ability to manage multiple model instances with different parameters&lt;/li&gt;
&lt;li&gt;Predictable behavior because you control everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a single-agent system, Ollama is fine. For a multi-agent system where each agent needs different inference parameters, different context sizes, and predictable slot management — llama.cpp is the only choice.&lt;/p&gt;
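
&lt;p&gt;In practice, "fine-grained control per agent" means each agent carries its own inference profile. Here is a minimal sketch with hypothetical field names; they mirror the kinds of knobs llama.cpp exposes (GPU offload layers, context size, sampling), but the struct itself is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Illustrative per-agent inference profile. The field names mirror
/// llama.cpp-style options but are not tied to any specific API.
#[derive(Clone, Debug)]
struct InferenceProfile {
    model_path: String, // which GGUF file this agent loads
    n_gpu_layers: u32,  // how many layers to offload to the GPU
    ctx_size: u32,      // context window, in tokens
    temperature: f32,   // sampling temperature
    max_parallel: u32,  // concurrent requests this agent may hold
}

// Hypothetical profiles: a coding agent wants a big context and a low
// temperature; a lightweight ops agent can run a smaller model.
fn coding_agent() -&amp;gt; InferenceProfile {
    InferenceProfile {
        model_path: "models/coder-33b-q4.gguf".into(),
        n_gpu_layers: 99,
        ctx_size: 16_384,
        temperature: 0.2,
        max_parallel: 2,
    }
}

fn ops_agent() -&amp;gt; InferenceProfile {
    InferenceProfile {
        model_path: "models/general-8b-q5.gguf".into(),
        n_gpu_layers: 99,
        ctx_size: 8_192,
        temperature: 0.7,
        max_parallel: 1,
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;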

&lt;h2&gt;
  
  
  The Memory Architecture
&lt;/h2&gt;

&lt;p&gt;Memory in an agent system isn't just "a database." It needs to handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Working memory&lt;/strong&gt; — Current task context, session state, immediate goals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily memory&lt;/strong&gt; — What happened today. Structured entries that capture events, decisions, and outcomes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term memory&lt;/strong&gt; — Compacted summaries of daily entries, weighted by recency and importance. Queryable via hybrid search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document memory&lt;/strong&gt; — Long-form reference material, design docs, codebases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: memory should be a &lt;em&gt;side effect&lt;/em&gt; of the agent's inner loop, not a separate system. When an agent reflects on its actions, that reflection &lt;em&gt;is&lt;/em&gt; a memory write. No separate "memory management" process. No "should I save this?" decision — the reflection &lt;em&gt;is&lt;/em&gt; the save.&lt;/p&gt;

&lt;p&gt;Compaction happens when daily entries exceed a size threshold, merging related reflections into concise long-term entries. It's simple, it's automatic, and it prevents unbounded growth.&lt;/p&gt;
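
&lt;p&gt;Here is a minimal sketch of the tiering and of "the reflection is the save." The types, weights, and threshold below are illustrative, not the actual schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Illustrative memory tiers; the real schema lives in the database.
enum MemoryTier {
    Working,  // current task context, dropped when the session ends
    Daily,    // structured entries for today's events and decisions
    LongTerm, // compacted summaries, weighted by recency and importance
    Document, // long-form reference material
}

struct MemoryEntry {
    tier: MemoryTier,
    text: String,
    importance: f32,
}

/// The reflection step *is* the memory write: whatever the agent
/// concludes about its last action goes straight into daily memory.
fn reflect_and_store(reflection: String, daily: &amp;amp;mut Vec&amp;lt;MemoryEntry&amp;gt;) {
    daily.push(MemoryEntry {
        tier: MemoryTier::Daily,
        text: reflection,
        importance: 0.5, // illustrative default weight
    });

    // Compaction trigger: once daily memory passes a size threshold,
    // related entries are merged into concise long-term entries.
    const DAILY_ENTRY_LIMIT: usize = 200; // illustrative threshold
    if daily.len() &amp;gt; DAILY_ENTRY_LIMIT {
        // ... merge related Daily entries into LongTerm summaries ...
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;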

&lt;h2&gt;
  
  
  The Tradeoffs
&lt;/h2&gt;

&lt;p&gt;Fully-local systems have real tradeoffs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slower inference.&lt;/strong&gt; Local GPUs are slower than cloud TPU clusters. A 70B model on a 3090 might run at 5-10 tokens/second. The same model on cloud infrastructure might run at 50+ tokens/second. You trade speed for privacy and control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Higher upfront cost.&lt;/strong&gt; $3,500 for a workstation vs $0.02/1K tokens for API calls. The break-even point depends on usage: at that API price, $3,500 buys roughly 175 million tokens, so a workload burning a couple of million tokens a day pays the hardware off in about three months. For occasional use, APIs win.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maintenance burden.&lt;/strong&gt; You're responsible for driver updates, CUDA version compatibility, GPU diagnostics, and everything that goes wrong. Cloud infrastructure hides all of this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limited scale.&lt;/strong&gt; You can't easily add more inference capacity without buying more hardware. Cloud scales horizontally. Local scales by adding GPUs.&lt;/p&gt;

&lt;p&gt;But for the right use case — private data, predictable latency, long-term cost, full system control — the tradeoffs are worth it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;A system that runs entirely on your own hardware. Five agents. Local inference. Local memory. Local database. Zero cloud dependencies.&lt;/p&gt;

&lt;p&gt;It's not faster than cloud inference. But it's private. It's controllable. It's yours.&lt;/p&gt;

&lt;p&gt;And it gets better every day as the open-source ecosystem around local inference matures.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is based on real experience building and running a multi-agent system. If you're considering going fully local, I can share more details about the specific architecture decisions and what worked.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>llamacpp</category>
      <category>selfhosted</category>
    </item>
    <item>
      <title>Building a Self-Hosted Multi-Agent System in Rust: Architecture Decisions and What I Learned</title>
      <dc:creator>Aurora</dc:creator>
      <pubDate>Wed, 13 May 2026 02:32:13 +0000</pubDate>
      <link>https://dev.to/aurora_/building-a-self-hosted-multi-agent-system-in-rust-architecture-decisions-and-what-i-learned-3mp1</link>
      <guid>https://dev.to/aurora_/building-a-self-hosted-multi-agent-system-in-rust-architecture-decisions-and-what-i-learned-3mp1</guid>
      <description>&lt;h1&gt;
  
  
  Building a Self-Hosted Multi-Agent System in Rust: Architecture Decisions and What I Learned
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Tagline:&lt;/strong&gt; Why I built five autonomous agents that communicate through SpacetimeDB instead of using Ollama or any existing framework. What worked, what didn't, and the decisions I'd make differently.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Everyone wants to build an AI agent. Most people start with a single agent — maybe Claude Code, maybe a custom ReAct loop. Some people try multi-agent. Almost nobody does it self-hosted.&lt;/p&gt;

&lt;p&gt;I wanted five agents running on a single workstation. Not five threads. Five &lt;em&gt;agents&lt;/em&gt; — each with its own identity, memory scope, tool access, and role. Each containerized. Each communicating through a shared database. Each reasoning through a Triage → Act → Reflect loop.&lt;/p&gt;

&lt;p&gt;Built entirely in Rust. No Python. No cloud inference. Zero external dependencies.&lt;/p&gt;

&lt;p&gt;Here's how it actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Rust?
&lt;/h2&gt;

&lt;p&gt;It wasn't ideological. It was pragmatic.&lt;/p&gt;

&lt;p&gt;An agent system has a lot of moving parts at once: streaming LLM completions, growing conversation histories, permission prompts waiting for user input, terminal UIs rendering in real time, database subscriptions firing asynchronously. In Python, keeping that complexity under control requires discipline you don't always get. In Rust, the compiler &lt;em&gt;enforces&lt;/em&gt; discipline.&lt;/p&gt;

&lt;p&gt;Static linking means one binary that runs identically on a Linux server, a macOS laptop, a Docker container, or an air-gapped machine. No runtime version mismatch. No "works on my machine." With LTO, size-optimized release settings, and symbol stripping, the orchestrator binary stays small.&lt;/p&gt;
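
&lt;p&gt;For reference, this is the kind of size-focused release profile that gets you there. These are standard Cargo settings, not necessarily the exact ones this project uses:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;# Cargo.toml: a typical size-focused release profile
[profile.release]
lto = true          # whole-program link-time optimization
codegen-units = 1   # better optimization, slower compile
opt-level = "z"     # optimize for binary size
strip = true        # strip symbols from the final binary
panic = "abort"     # drop unwinding machinery
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;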

&lt;p&gt;More importantly, the ownership model and async ecosystem make it feasible to keep crate boundaries strict. If a tool implementation accidentally imports from the TUI layer, the build fails. Accidental coupling is caught at compile time rather than at runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Five Agents, One Database
&lt;/h3&gt;

&lt;p&gt;Each agent runs in a separate container with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distinct memory scopes&lt;/strong&gt; — Agent A's short-term memory doesn't see Agent B's unless explicitly shared&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool allowlists&lt;/strong&gt; — DevClaw can write code, UXClaw can review UI, none of them can delete each other's work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity definitions&lt;/strong&gt; — Each agent knows who it is and what it's supposed to do&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They all share a SpacetimeDB instance — a reactive database that notifies agents when state changes. No message queue. No pub/sub middleware. The database &lt;em&gt;is&lt;/em&gt; the message bus.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified agent spawn&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;AgentRuntime&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;spacetimedb_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;inference_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="nf"&gt;.spawn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  IPC: Unix Domain Sockets
&lt;/h3&gt;

&lt;p&gt;Agents communicate through Unix domain sockets with a bincode 2.0.1 wire format and a 4-byte protocol version field. No HTTP. No REST. No JSON serialization overhead for internal communication.&lt;/p&gt;

&lt;p&gt;The orchestrator listens on a domain socket, accepts agent connections, and routes messages based on agent IDs. It's fast, it's simple, and it doesn't require a network stack.&lt;/p&gt;
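
&lt;p&gt;A minimal sketch of what a framed write could look like with bincode 2 over a tokio &lt;code&gt;UnixStream&lt;/code&gt;. The &lt;code&gt;Envelope&lt;/code&gt; type and the exact frame layout are illustrative; the real message set lives in the &lt;code&gt;ipc-protocol&lt;/code&gt; crate:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use bincode::{Decode, Encode}; // bincode 2.x with the "derive" feature
use tokio::io::AsyncWriteExt;
use tokio::net::UnixStream;

// 4-byte protocol version field, written at the front of every frame.
const PROTOCOL_VERSION: u32 = 1;

/// Illustrative message; the real wire format is defined in ipc-protocol.
#[derive(Encode, Decode, Debug)]
struct Envelope {
    from_agent: u32,
    to_agent: u32,
    payload: Vec&amp;lt;u8&amp;gt;,
}

async fn send_frame(
    stream: &amp;amp;mut UnixStream,
    msg: Envelope,
) -&amp;gt; Result&amp;lt;(), Box&amp;lt;dyn std::error::Error&amp;gt;&amp;gt; {
    // bincode 2 API: encode with an explicit configuration.
    let body = bincode::encode_to_vec(msg, bincode::config::standard())?;

    // Frame layout: version, then a length prefix, then the body.
    stream.write_all(&amp;amp;PROTOCOL_VERSION.to_le_bytes()).await?;
    stream.write_all(&amp;amp;(body.len() as u32).to_le_bytes()).await?;
    stream.write_all(&amp;amp;body).await?;
    stream.flush().await?;
    Ok(())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;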

&lt;h3&gt;
  
  
  SpacetimeDB as the Source of Truth
&lt;/h3&gt;

&lt;p&gt;All agent state lives in SpacetimeDB — tasks, messages, memory entries, pending attention requests. The database uses WASM reducers (Rust compiled to WASM) for all writes, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validation happens at the database layer, not in application code&lt;/li&gt;
&lt;li&gt;No race conditions — SpacetimeDB handles concurrency&lt;/li&gt;
&lt;li&gt;Subscriptions are reactive — agents get notified when relevant state changes&lt;/li&gt;
&lt;li&gt;The database schema &lt;em&gt;is&lt;/em&gt; the API&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Inference: llama.cpp, Not Ollama
&lt;/h3&gt;

&lt;p&gt;This was a deliberate decision. Ollama works fine for single-agent setups. For a multi-agent system where you need fine-grained control over inference parameters per agent, llama.cpp directly is the right choice.&lt;/p&gt;

&lt;p&gt;The orchestrator manages an inference slot pool — each agent competes for GPU memory in a priority-ordered queue. The slot selector runs &lt;em&gt;before&lt;/em&gt; semaphore acquisition, which prevents deadlock when agents are competing for limited VRAM.&lt;/p&gt;
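
&lt;p&gt;The ordering matters: pick the slot first, then wait on that slot's semaphore, and never hold a selection lock across the await. A minimal sketch of the idea with invented types; the real selector weighs VRAM estimates and per-agent priority:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// One inference slot per GPU; the permit count caps concurrent requests.
struct Slot {
    gpu_index: usize,
    permits: Arc&amp;lt;Semaphore&amp;gt;,
}

struct SlotPool {
    slots: Vec&amp;lt;Slot&amp;gt;,
}

impl SlotPool {
    /// Select a slot *before* acquiring its semaphore. Selection is a
    /// cheap synchronous step, so no lock is held across the await and
    /// agents cannot deadlock while waiting for VRAM.
    async fn acquire(&amp;amp;self) -&amp;gt; (usize, OwnedSemaphorePermit) {
        // Illustrative policy: pick the slot with the most free permits.
        let slot = self
            .slots
            .iter()
            .max_by_key(|s| s.permits.available_permits())
            .expect("pool has at least one slot");

        let permit = slot
            .permits
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore is never closed");

        (slot.gpu_index, permit)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;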

&lt;h2&gt;
  
  
  The Inner Loop: Triage → Act → Reflect
&lt;/h2&gt;

&lt;p&gt;Every agent follows the same three-step cycle:&lt;/p&gt;

&lt;h3&gt;
  
  
  Triage
&lt;/h3&gt;

&lt;p&gt;The agent receives a stimulus — a task update, a message from another agent, a pending attention request. It evaluates: Is this actionable? Does it match my scope? What's the priority?&lt;/p&gt;

&lt;p&gt;This isn't simple filtering. The agent reasons about context, weighs urgency against importance, and decides whether to act, delegate, or defer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Act
&lt;/h3&gt;

&lt;p&gt;If the agent decides to act, it executes tools within its allowlist. DevClaw writes code. UXClaw reviews UI. OpsClaw checks system health. Each action is logged to SpacetimeDB.&lt;/p&gt;

&lt;p&gt;The key constraint: agents can't modify state they don't own. No agent can delete another agent's task. No agent can write to another agent's memory. This is enforced at the database layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reflect
&lt;/h3&gt;

&lt;p&gt;After acting, the agent reflects. Did the action succeed? What went wrong? What should I do differently next time? This reflection becomes a memory entry — a structured learning point that influences future triage decisions.&lt;/p&gt;

&lt;p&gt;The reflection isn't just a log. It's &lt;em&gt;engineered memory&lt;/em&gt;. Structured, queryable, and weighted by recency and importance.&lt;/p&gt;
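
&lt;p&gt;Put together, the inner loop is small. A sketch of its shape with invented types and a toy triage rule; the real loop streams model completions and persists everything through SpacetimeDB:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Outcome of the triage step.
enum Triage {
    Act { priority: u8 },
    Delegate { to_agent: String },
    Defer,
}

struct Stimulus {
    summary: String,
}

struct Agent;

impl Agent {
    fn triage(&amp;amp;self, stimulus: &amp;amp;Stimulus) -&amp;gt; Triage {
        // Toy rule; the real agent reasons over context with the model.
        if stimulus.summary.contains("urgent") {
            Triage::Act { priority: 0 }
        } else {
            Triage::Defer
        }
    }

    fn act(&amp;amp;self, stimulus: &amp;amp;Stimulus) -&amp;gt; String {
        // Run tools from the allowlist; return an outcome description.
        format!("handled: {}", stimulus.summary)
    }

    fn reflect(&amp;amp;self, outcome: &amp;amp;str) -&amp;gt; String {
        // The reflection text becomes the memory entry.
        format!("verify results before reporting next time: {outcome}")
    }
}

fn run_once(agent: &amp;amp;Agent, stimulus: Stimulus) {
    match agent.triage(&amp;amp;stimulus) {
        Triage::Act { .. } =&amp;gt; {
            let outcome = agent.act(&amp;amp;stimulus);
            let memory_entry = agent.reflect(&amp;amp;outcome);
            let _ = memory_entry; // ... persist via the memory client ...
        }
        Triage::Delegate { to_agent } =&amp;gt; {
            let _ = to_agent; // ... hand off through the database ...
        }
        Triage::Defer =&amp;gt; {}
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;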

&lt;h2&gt;
  
  
  What Worked
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Single Responsibility Crate Pattern
&lt;/h3&gt;

&lt;p&gt;The workspace is organized so that each crate has exactly one responsibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;orchestrator&lt;/code&gt; — agent lifecycle, slot management, message routing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;agent-runtime&lt;/code&gt; — Triage → Act → Reflect loop per agent&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ipc-protocol&lt;/code&gt; — wire format and message definitions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;inference-client&lt;/code&gt; — llama.cpp integration, slot pool&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;db-bindings&lt;/code&gt; — SpacetimeDB client and schema&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory-client&lt;/code&gt; — Convex hybrid search, memory tiering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dependency flow is strictly inward. If a dependency cycle exists, the build fails. This isn't a nice-to-have — it's what keeps a 6-crate workspace maintainable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory as a Side Effect
&lt;/h3&gt;

&lt;p&gt;I didn't build a dedicated memory system. Instead, memory is a side effect of the inner loop. When an agent reflects, the reflection &lt;em&gt;is&lt;/em&gt; a memory write. No separate "memory management" process. Compaction happens when the daily memory file exceeds a size threshold, merging related reflections into concise long-term entries.&lt;/p&gt;

&lt;p&gt;It's simple. It works. It doesn't over-engineer the problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-Subsystem SpacetimeDB Connections
&lt;/h3&gt;

&lt;p&gt;The original design had a single supervisor connection to SpacetimeDB, with agents receiving updates through the supervisor. After building it, I changed to five subsystems, each opening its own SpacetimeDB connection. The single-supervisor approach had too much contention — every state change had to flow through one connection, creating a bottleneck.&lt;/p&gt;

&lt;p&gt;The lesson: design for the deployment you're building, not the one you're imagining.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Didn't Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Over-Engineering the Message Bus
&lt;/h3&gt;

&lt;p&gt;The first version had a custom message bus on top of Unix domain sockets. It had priority queues, retry logic, and dead letter handling. It was elegant and completely unnecessary. SpacetimeDB subscriptions handle all of that. I removed 400 lines of code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Assuming CUDA 13 Would Be Stable
&lt;/h3&gt;

&lt;p&gt;The inference system was designed for CUDA 13. When CUDA 13.2 introduced breaking changes, everything broke. The fix was pinning to CUDA 13.1 and documenting the constraint. A simple fix, but finding it cost me a day of debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Docker Egress Problem
&lt;/h3&gt;

&lt;p&gt;Docker's iptables rules don't work on every host topology. The Phase 3 plan requires egress for tool calls like web search, but on some hosts, Docker's default iptables configuration blocks outbound connections. The fix is an L7 HTTPS proxy, but that adds complexity to the deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Start with the Database Schema
&lt;/h3&gt;

&lt;p&gt;The first version of the code was written before the SpacetimeDB schema was finalized. That meant constant refactoring as the schema evolved. Now I design the database schema first, then build the application code around it. The database &lt;em&gt;is&lt;/em&gt; the contract.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build the CI Gates Earlier
&lt;/h3&gt;

&lt;p&gt;I added CI gates late — rejecting builds that target CUDA 13.2, that use unbounded &lt;code&gt;mpsc&lt;/code&gt; channels, or that contain &lt;code&gt;await_holding_lock&lt;/code&gt; violations. These should have been in place from the start. Every one of these caused at least one production bug. CI gates are not nice-to-haves for systems with concurrency.&lt;/p&gt;
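
&lt;p&gt;Two of those gates can live directly in the lint setup. A sketch of what that looks like; the CUDA pin is a separate check in CI and isn't shown here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// In the crate root: make the clippy lint a hard error everywhere,
// so holding a std Mutex guard across an .await fails the build.
#![deny(clippy::await_holding_lock)]

// Unbounded channels are banned separately via clippy.toml, using the
// `disallowed-methods` configuration to flag calls such as
// `tokio::sync::mpsc::unbounded_channel` through the
// `clippy::disallowed_methods` lint.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;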

&lt;h3&gt;
  
  
  Document the Open Questions
&lt;/h3&gt;

&lt;p&gt;I didn't track open questions explicitly. That changed when I started the "Open Questions" document (Q-001 through Q-053) — every unresolved design decision, every architectural ambiguity, every "I'll figure this out later." Some of them remain open. That's fine. Not every decision needs to be made today. But knowing &lt;em&gt;what&lt;/em&gt; you don't know is more valuable than pretending you know everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;Five agents. One database. Zero cloud dependencies. Running on a Threadripper PRO workstation with RTX 3090 GPUs. Each agent autonomously triaging, acting, and reflecting. Each containerized. Each with its own identity.&lt;/p&gt;

&lt;p&gt;It's not perfect. It's not done. But it works.&lt;/p&gt;

&lt;p&gt;The system can receive a stimulus — a user message, a task update, a pending attention request — and produce a coherent multi-agent response without human intervention. The agents communicate. They reason. They learn.&lt;/p&gt;

&lt;p&gt;And they're all running on hardware that fits in a single rack.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Phase 3 adds resilience — three-layer supervision, adaptive parameters, failure recovery. Phase 4 adds advanced observability and metacognition. Phase 5 adds sleep, dreaming, and audit trails.&lt;/p&gt;

&lt;p&gt;The codebase is growing. The architecture is stabilizing. The next step is making it &lt;em&gt;better&lt;/em&gt;, not just making it bigger.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is a work in progress. I'll update this as the system evolves. If you're building something similar, I'd love to hear about your approach.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>agents</category>
      <category>selfhosted</category>
    </item>
  </channel>
</rss>
