Mininglamp

Posted on Jun 10

Why IM Is the Natural Infrastructure Layer for AI Agent Collaboration

#ai #agents #collaboration #opensource

When you're building a multi-agent system, the first real question isn't which model to use or how to structure your prompts. It's simpler and harder than that: how do these agents actually talk to each other?

Most teams reach for the familiar toolkit. Kafka for event streaming. RabbitMQ for task queues. gRPC or REST for synchronous calls. Custom WebSocket servers when latency matters. These are all reasonable. We tried several of them. But after spending significant time on this problem at Mininglamp, we kept running into the same friction: we were building coordination infrastructure that already existed, just under a different name.

The protocol we needed had been running in production, at scale, for decades. We'd been using it to coordinate humans.

What Multi-Agent Coordination Actually Needs

Strip away the specifics of any given agent framework and you get the same core requirements:

Asynchronous message delivery with some ordering guarantee
Routing to specific recipients or groups without point-to-point coupling
Scoped context so agents have what's relevant, not everything
Access control that determines which agents can participate in which workflows
Enough structure to be machine-parseable, enough flexibility to handle novel situations

That list is not a description of a message queue or an RPC framework. It's a description of an instant messaging system.

This isn't a metaphor. IM was designed to coordinate loosely coupled participants who operate asynchronously, hold different roles and permissions, work across multiple parallel contexts, and communicate primarily in natural language. Those participants were originally people, but the design constraints that shaped IM are almost identical to the design constraints of multi-agent systems. The primitives line up because the problem is the same problem.

Async Is Not Optional

One of the first things you internalize when building agent pipelines is that synchronous coupling kills you at scale. An orchestrator that blocks on a sub-agent response doesn't parallelize. Under load, it queues up and falls over.

Traditional message queues solve this, but they introduce a separate operational surface. You have application logic in one place, coordination infrastructure in another. Separate deployments, separate monitoring, separate schemas, separate debugging. That's fine for systems where the coordination patterns are stable and well-understood. It's friction for systems that are evolving quickly.

IM channels are async by design. A message sent to a channel is delivered when the recipient is ready to receive it. The sender doesn't wait. This is the correct semantic for agent coordination, and you get it for free because that's how IM was built.

What you also get, which message queues typically don't provide well, is threaded context. A thread in IM has a natural boundary. It has a topic, a beginning, and a coherent exchange that took place inside it. When an agent joins a thread, it reads the history and understands the context. The scope is bounded by the thread itself.

For LLM-based agents this matters enormously. Context windows are finite and expensive to fill. You want to give an agent the relevant slice of history, not the entire channel going back months. Thread semantics handle this naturally. The thread is already a curated context window.
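To make the "thread as curated context window" idea concrete, here is a minimal sketch. It assumes a flat message history tagged with thread IDs and uses a crude word count as a stand-in for tokens. The `Message` type and `thread_context` function are hypothetical names for illustration, not part of any IM platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    thread_id: str
    author: str
    text: str


def thread_context(history: list[Message], thread_id: str, max_words: int) -> list[Message]:
    """Return the most recent messages of one thread that fit a rough word budget."""
    in_thread = [m for m in history if m.thread_id == thread_id]
    selected: list[Message] = []
    used = 0
    # Walk newest-first so the latest messages win when the budget is tight.
    for msg in reversed(in_thread):
        cost = len(msg.text.split())  # assumption: words approximate tokens
        if used + cost > max_words:
            break
        selected.append(msg)
        used += cost
    selected.reverse()  # restore chronological order for the prompt
    return selected
```

The thread boundary does the filtering and the budget only trims the oldest messages. A real system would count tokens with the model's own tokenizer rather than splitting on whitespace.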

There's another angle here too: backpressure. When an agent is busy or rate-limited, messages sit in the channel. The sender gets a natural indication that something isn't being processed. The conversation just pauses. You don't need to implement retry logic or circuit breakers at the coordination layer because the persistence of IM messages handles the buffering implicitly. This isn't always enough for high-throughput pipelines, but for the conversational, task-oriented workflows where most agent collaboration happens, it covers the common cases well.
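The delivery and buffering semantics described in this section can be sketched with a small in-memory model. This is an illustration under stated assumptions, not how any particular IM server is implemented: `Channel`, `send`, `receive`, and `pending` are hypothetical names, and persistence is reduced to a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Hypothetical channel: an append-only message list plus one cursor per reader."""
    messages: list[str] = field(default_factory=list)
    cursors: dict[str, int] = field(default_factory=dict)

    def send(self, text: str) -> None:
        # The sender appends and returns immediately; it never waits on a reader.
        self.messages.append(text)

    def pending(self, reader: str) -> int:
        # Messages this reader has not consumed yet: a crude backlog signal.
        return len(self.messages) - self.cursors.get(reader, 0)

    def receive(self, reader: str) -> list[str]:
        # The reader pulls everything since its cursor whenever it is ready.
        start = self.cursors.get(reader, 0)
        self.cursors[reader] = len(self.messages)
        return self.messages[start:]
```

Because the channel keeps the messages, a busy agent simply falls behind and catches up later, and a growing `pending` count is the visible sign that something is not being processed.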

Channel Isolation Maps to Permission Structure

Real multi-agent systems need access control. An agent with database write permissions shouldn't operate in the same coordination space as a customer-facing agent. Sensitive financial workflows need to be isolated from general-purpose agents. Access needs to be auditable.

IM systems model this with a hierarchy that most organizations find intuitive:

Organization
  └── Spaces
        └── Categories
              └── Channels
                    └── Threads

Permissions flow down through this structure. An agent placed in a channel inherits the access boundaries of that channel. It can't see channels it isn't a member of. It can't act outside its scope. You don't configure this per-agent; you configure the channel and the membership.
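As a rough sketch of membership flowing down the hierarchy, assume each level is a node with a member set and that a node marked `private` does not inherit members from its parents. The `Node` class, the `private` flag, and `can_access` are all assumptions made for illustration. Real platforms differ in how inheritance and overrides work.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One level of a hypothetical Space > Category > Channel tree."""
    name: str
    parent: Optional["Node"] = None
    members: set[str] = field(default_factory=set)
    private: bool = False  # assumption: private nodes ignore inherited membership


def can_access(agent: str, node: Node) -> bool:
    """True if the agent is a member of the node or of an ancestor it inherits from."""
    current: Optional[Node] = node
    while current is not None:
        if agent in current.members:
            return True
        if current.private:
            return False  # stop climbing: membership above does not apply here
        current = current.parent
    return False
```

Access is decided by where the agent was placed, so adding an agent to a space or channel is the only configuration step. There is no per-agent rule table to maintain.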

Contrast this with a custom message broker where access control is your responsibility to design, implement, and maintain. IM gives you a working model backed by years of production use and security scrutiny. You're not reinventing this wheel.

Channel isolation also provides something less obvious: organizational legibility. When you look at your channel structure, you can understand at a glance which agents are participating in which workflows. The access model is visible. That's not always true with queue-based architectures where routing logic lives in application code.

Persistent Context Without a Separate State Store

This took us a while to fully appreciate, so it's worth making explicit.

Custom agent coordination systems almost always have two components: the coordination protocol (messages moving between agents) and a state store (a database or cache where you persist what matters). You design the schema, you write the glue code, you maintain both. When something goes wrong, you debug across both.

IM collapses these. The message history is the state. The conversation is the audit log. Persistence isn't a separate concern you manage; it's the default behavior of the system.

For agentic workflows this enables something practically useful: agents can join ongoing conversations and immediately understand context. You don't need to serialize state into a separate store and inject it. The agent reads the thread. It sees what was discussed, what was decided, what's still open. This is exactly how human onboarding works. Someone joins a project channel, reads back through the conversation, understands what's happening. Agents do the same thing using the same interface.

There's an audit story here too. When something in your agent system produces an unexpected output, the conversation history is your trace. Every message, every decision point, every agent response is logged in the order it happened. You don't need a separate tracing system for the coordination layer because the coordination layer is already a log.
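"The message history is the state" can be read as a fold over an append-only log. The sketch below assumes a toy convention where messages starting with `OPEN` or `DONE` mark task status. That convention, the `Entry` type, and `open_items` are hypothetical and exist only to show the replay pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    seq: int      # position in the channel's ordered log
    author: str
    text: str


def open_items(log: list[Entry]) -> list[str]:
    """Derive the currently open tasks by replaying the log in order."""
    still_open: dict[str, None] = {}  # dict keeps insertion order
    for entry in sorted(log, key=lambda e: e.seq):
        verb, _, item = entry.text.partition(" ")
        if verb == "OPEN":
            still_open[item] = None
        elif verb == "DONE":
            still_open.pop(item, None)
        # Any other message is ordinary conversation and leaves state unchanged.
    return list(still_open)
```

An agent that joins late runs the same replay and arrives at the same state as everyone else, and the log it replayed is also the audit trail.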

Natural Language as the Default Protocol

Most agent coordination protocols require structured data interchange. JSON schemas. Function call specifications. Defined message envelopes. These work well when all the agents in the system were designed by the same team with the same interfaces in mind.

They work poorly for heterogeneous agent systems. If you're integrating an agent from one provider with infrastructure from another, you need a shared interface specification. Usually this means an integration layer, custom serialization, and schema translation. The more agents you add, the more integration surface you're managing.

Natural language sidesteps most of this. If both agents can read and write text, they can coordinate without a shared schema. The protocol is the language. This isn't always the right choice. Some interactions genuinely need structured data, especially when precision and machine parseability matter. But having natural language as the default, with structure as an option you add when needed, is the right starting point for a heterogeneous system.
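One way to picture "text by default, structure when needed" is a message type whose only required field is prose. The sketch below is an assumption-laden illustration: `AgentMessage`, `encode`, and `decode` are hypothetical names, and the JSON envelope is just one possible wire format.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class AgentMessage:
    text: str                               # always present, always readable
    data: Optional[dict[str, Any]] = None   # optional structured payload


def encode(msg: AgentMessage) -> str:
    payload: dict[str, Any] = {"text": msg.text}
    if msg.data is not None:
        payload["data"] = msg.data
    return json.dumps(payload)


def decode(raw: str) -> AgentMessage:
    """Accept an envelope if one is present, otherwise treat the input as plain prose."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return AgentMessage(text=raw)
    if isinstance(obj, dict) and isinstance(obj.get("text"), str):
        data = obj.get("data")
        return AgentMessage(text=obj["text"], data=data if isinstance(data, dict) else None)
    return AgentMessage(text=raw)
```

An agent that knows nothing about the envelope can still send plain text and be understood, while agents that need precision can attach `data` without breaking anyone else.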

IM is natural language first. The base message type is text. Structured elements like mentions, attachments, and reactions exist and are useful, but the default is prose. That is the right default for systems that need to be legible to both humans and machines at the same time.

There's a practical consequence for debugging too. When an agent-to-agent coordination system uses binary protocols or custom envelopes, debugging requires tooling that understands those formats. When coordination happens in natural language in an IM channel, you can read it. A human can open the channel, read the conversation, and understand immediately what the agents were doing and where something went wrong. This sounds trivial but it's not. Debuggability is a real operational cost, and natural language makes the entire coordination layer human-auditable by default.

Organizational Distribution, Not Just Individual Use

There's a quieter argument here that matters for real deployments.

Most AI tool adoption is individual. You have an agent that helps you personally: it answers questions, summarizes documents, writes code. Its capability is scoped to you. When you're not using it, it produces nothing. When you leave the organization, that capability leaves with you.

When agents participate in IM channels alongside teams, the distribution model changes fundamentally. The channel is shared. Every team member interacts with the same agent, sees what it produces, learns how to prompt it effectively, and builds shared intuitions about how to work with it. A capable agent in a well-run channel becomes organizational infrastructure, not a personal productivity tool.

This matters for adoption in ways that are easy to underestimate. Individual AI adoption is relatively frictionless. One person decides to use a tool and starts using it. Organizational AI adoption is hard. How does a team develop shared working patterns? How does knowledge about what the agent can and can't do propagate across people? How does the team maintain shared context about what agents are doing and why?

IM already solves these problems for human coordination. Agents that live in channels inherit the solutions: the communication infrastructure, the notification patterns, and the norms around how people stay in sync. Agents in IM don't need a separate adoption process because they're already part of the process that exists.

The Adapter Problem

Accepting that IM is the right coordination layer creates a practical engineering challenge. Most organizations already have an IM platform. You're integrating into existing infrastructure, not building from scratch.

Making AI agents genuine first-class participants in IM requires more than a webhook that posts messages. True participation means understanding conversation context across multiple messages, respecting inherited permissions from channel membership, handling the full vocabulary of message types the platform supports, maintaining coherent state across sessions, and responding appropriately to the full range of signals in a channel: direct messages, thread replies, mentions, reactions.

Building this as a reusable layer means abstracting over the specific behaviors of different IM platforms while preserving the semantics that actually matter for agent coordination. The primitives are similar across platforms but the details diverge in ways that will bite you if you don't abstract carefully.
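The shape of such an abstraction might look like the sketch below: platform-specific code normalizes raw payloads into a common event, and agent logic only ever sees that event. Everything here is hypothetical, including `Event`, `PlatformAdapter`, `InMemoryAdapter`, the `@agent` mention marker, and the raw payload keys. It is not the octo-adapters API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol


@dataclass(frozen=True)
class Event:
    kind: str                     # "message", "thread_reply", or "mention"
    channel: str
    author: str
    text: str
    thread: Optional[str] = None


class PlatformAdapter(Protocol):
    def parse(self, raw: dict) -> Event: ...
    def post(self, channel: str, text: str, thread: Optional[str] = None) -> None: ...


class InMemoryAdapter:
    """Stand-in for a real platform: records outgoing posts instead of sending them."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str, Optional[str]]] = []

    def parse(self, raw: dict) -> Event:
        thread = raw.get("thread")
        kind = "thread_reply" if thread else "message"
        if "@agent" in raw["text"]:  # assumption: mentions appear as a literal marker
            kind = "mention"
        return Event(kind, raw["channel"], raw["author"], raw["text"], thread)

    def post(self, channel: str, text: str, thread: Optional[str] = None) -> None:
        self.sent.append((channel, text, thread))


def handle(adapter: PlatformAdapter, raw: dict,
           agent: Callable[[Event], Optional[str]]) -> None:
    """Normalize one raw payload, let the agent decide, and reply in the same thread."""
    event = adapter.parse(raw)
    reply = agent(event)
    if reply is not None:
        adapter.post(event.channel, reply, event.thread)
```

Supporting another platform then means writing another adapter with the same two methods. The agent callback does not change.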

What We Built

This line of thinking led us to build Octo, an open-source AI-native team collaboration platform released under Apache 2.0. The central architectural choice was to make AI agents first-class participants in the organizational communication layer rather than an add-on to it.

Agents in Octo join channels and work alongside human teammates through the same interface. There is no separate AI dashboard, no special mode. The same conversation history, the same permission model, the same threading semantics apply to humans and agents both. From a channel member's perspective, an agent is another participant.

A few specific pieces worth describing:

octo-adapters bridges third-party AI agents and IM platforms. Rather than requiring every agent to natively understand IM semantics, the adapter layer handles the translation. Agents expose their capabilities; the adapter manages their participation in channels. This means existing agents can operate inside IM without being redesigned for it. The bridge is the integration point, not each individual agent.

group.md is a structured document that agents help maintain to keep a group aligned. Teams often accumulate shared context, decisions, and working agreements across hundreds of messages that nobody maintains explicitly. An agent with channel access can keep a structured summary coherent and current based on what it observes in the conversation. The document lives in the channel, visible to all members.

Voice input with context-aware correction addresses a specific problem that becomes visible in team settings. General-purpose transcription models make systematic errors on domain-specific vocabulary. If your team discusses a particular codebase, product, or technical domain, transcription accuracy on that vocabulary is poor out of the box. Context-aware correction uses the channel history to improve accuracy on the terms that actually matter for your team. Beyond transcription, it also considers what was said in recent conversation to resolve ambiguous phrases correctly.
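To give a feel for the idea, here is a toy version of vocabulary-based correction. It is not how Octo implements this feature. It assumes the channel history is available as plain strings, treats longer tokens as candidate domain terms, and uses fuzzy string matching from Python's standard `difflib` module. `build_vocabulary` and `correct` are hypothetical names.

```python
import difflib
import re


def build_vocabulary(history: list[str], min_len: int = 5) -> set[str]:
    """Collect longer tokens from channel history as candidate domain terms."""
    vocab: set[str] = set()
    for line in history:
        for token in re.findall(r"[A-Za-z][A-Za-z0-9_-]+", line):
            if len(token) >= min_len:
                vocab.add(token)
    return vocab


def correct(transcript: str, vocab: set[str], cutoff: float = 0.8) -> str:
    """Replace words that closely match a known term with that term's spelling."""
    by_lower = {term.lower(): term for term in vocab}
    candidates = list(by_lower)
    corrected = []
    for word in transcript.split():
        match = difflib.get_close_matches(word.lower(), candidates, n=1, cutoff=cutoff)
        corrected.append(by_lower[match[0]] if match else word)
    return " ".join(corrected)
```

A production system would more likely bias the speech model itself or rescore its hypotheses, but the principle is the same: the channel already holds the vocabulary that generic transcription gets wrong.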

The organizational structure (Spaces, Categories, Channels, Threads) maps directly to the permission hierarchy we described. Agents inherit from the structure they're placed in. Access control stays manageable without per-agent configuration.

Octo is at github.com/Mininglamp-OSS under Apache 2.0. The org has 20 repos and 217+ total stars across the project. The core repos are octo-web, octo-server, octo-deployment, and octo-adapters. If you're building multi-agent systems and thinking about the coordination layer, the Discord community at discord.gg/vj9Vsj9hSB is where we discuss architecture decisions. Come find us there.