Rumblingb

Posted on • Originally published at smithery.ai

Building a Distributed Agent Fabric in Rust: Lessons from Cord’s Architecture

Building a distributed agent system that talks to multiple MCP servers without imploding under latency or memory chaos is hard. I learned that the hard way while building Cord, an agent fabric that coordinates dozens of tool providers across a mesh of concurrent workers—and Rust’s ownership model and zero-cost concurrency turned what could have been a debugging nightmare into a surprisingly smooth ride.

The core challenge: each agent runs as a lightweight async task, communicating with MCP servers over JSON-RPC. You need to share state (agent identity, tool registry, pending requests) across those tasks without data races or heavy lock contention, and you need to do it fast: agent handoffs happen in microseconds. Rust’s Arc<tokio::sync::RwLock<T>> is the obvious choice, but the real win is the compiler telling you when you’ve accidentally cloned a value that should be a borrow, or when a &mut borrow conflicts across an .await boundary.

Here’s a simplified snippet of how Cord spawns an MCP agent worker that forwards tool calls to the right server without blocking:

use std::collections::HashMap;
use std::sync::Arc;

use anyhow::{anyhow, Result}; // assuming anyhow for the Result alias
use tokio::sync::RwLock;

use mcp_client::{McpClient, ToolCall};

struct AgentWorker {
    tool_registry: Arc<RwLock<HashMap<String, McpClient>>>,
}

impl AgentWorker {
    async fn handle_tool_call(&self, tool: &str, input: serde_json::Value) -> Result<String> {
        // Clone the client handle inside a short-lived scope so the read
        // guard is dropped before we await on network I/O.
        let client = {
            let registry = self.tool_registry.read().await;
            registry
                .get(tool)
                .cloned() // McpClient is assumed cheap to clone (an Arc internally)
                .ok_or_else(|| anyhow!("unknown tool: {tool}"))?
        };
        client.call(tool, input).await // no lock held during I/O
    }
}

The key: the read lock is released before the async call. The inner scope makes the guard’s lifetime explicit, and Rust’s drop semantics guarantee it’s gone by the time we hit the .await. No accidental lock contention, no guard dragged across an async boundary, just a clean, predictable ownership flow.
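For contrast, here’s the shape of the anti-pattern that scoping avoids. This is a hypothetical sketch reusing the AgentWorker and imports from above; note that tokio’s RwLock guard is perfectly happy to live across an .await, so this compiles fine and only shows up as contention under load:

impl AgentWorker {
    // ANTI-PATTERN (sketch): the read guard stays alive across the await,
    // so every other task touching the registry blocks for the full
    // duration of the network round-trip.
    async fn handle_tool_call_slow(&self, tool: &str, input: serde_json::Value) -> Result<String> {
        let registry = self.tool_registry.read().await;
        let client = registry
            .get(tool)
            .ok_or_else(|| anyhow!("unknown tool: {tool}"))?;
        client.call(tool, input).await // read guard still held during I/O
    }
}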

What surprised me most was the performance. With Go or Node.js, the runtime’s GC would cause occasional spikes when the agent fabric was under load (e.g., streaming logs from 50 MCP servers simultaneously). Rust’s lack of GC means zero pauses, and because we control allocation, we can pre-allocate connection buffers per worker. In production, p99 latency dropped from 12ms to 3ms after porting the core fabric from Go to Rust—and memory usage halved.
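To make the pre-allocation point concrete, here’s a minimal sketch of a per-worker reusable buffer. The bytes crate, the BUF_CAPACITY figure, and the ConnBuffer type are illustrative assumptions, not Cord’s actual internals:

use bytes::BytesMut;

// Sized for the largest JSON-RPC frame we expect, so the hot path
// never has to grow the buffer (illustrative figure).
const BUF_CAPACITY: usize = 64 * 1024;

struct ConnBuffer {
    buf: BytesMut,
}

impl ConnBuffer {
    fn new() -> Self {
        Self { buf: BytesMut::with_capacity(BUF_CAPACITY) }
    }

    // Reclaim the buffer between requests instead of allocating afresh:
    // clear() drops the contents but keeps the capacity.
    fn reset(&mut self) {
        self.buf.clear();
    }
}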

If you’re building your own agent–MCP mesh, I’d also recommend embracing tower layers for middleware (retry, timeout, rate limiting) and tokio’s select! for graceful shutdown; a sketch of the latter follows below. But the biggest lesson: don’t fight the borrow checker. It’s not a gatekeeper; it’s the world’s cheapest correctness proof for your distributed state.
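Here’s a minimal sketch of the select! shutdown pattern, assuming the worker exposes run_loop and drain methods and that shutdown is signaled over a tokio watch channel (all three are illustrative names, not Cord’s API):

use tokio::sync::watch;

// Race the worker's main loop against a shutdown signal; whichever
// finishes first wins, and the losing future is dropped (cancelled).
async fn run_with_shutdown(
    worker: AgentWorker,
    mut shutdown: watch::Receiver<bool>,
) -> anyhow::Result<()> {
    tokio::select! {
        result = worker.run_loop() => result,
        _ = shutdown.changed() => {
            // Let in-flight tool calls finish before exiting.
            worker.drain().await;
            Ok(())
        }
    }
}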

The full Cord architecture is open-source and live on Smithery. I’ve packaged it so you can run your own agent fabric with zero Rust experience if you just want to try the network topology.

See the catalog entry here: https://smithery.ai/servers/vishar-rumbling

Would love to hear how you handle shared mutable state in your own agent systems—comment or PR if you’ve got a smarter async pattern.
