We Replaced Message Buses with Telemetry for AI Agent Coordination

#ai #opentelemetry #devops #architecture

By Fae McLachlan

After 2.85 years working with AI agents and building multi-agent systems in production, we discovered something surprising: the best way to coordinate AI agents isn't through traditional message buses—it's through shared observability.

This post explains the BossCat Protocol: a framework we developed to use OpenTelemetry as a primary coordination mechanism. This architecture has allowed us to achieve 96% quality gate pass rates and 6-8x workflow speedups while dramatically simplifying our infrastructure.

The Origin: A 3-Year Production Journey

This pattern wasn't invented in a weekend hackathon. It emerged from thousands of hours of production engineering alongside AI.

We first deployed this protocol in early 2023, long before the current wave of agent frameworks. While the industry was focused on "chain-of-thought" prompting, we were solving the infrastructure reality: how do you debug an agent that has been running for 4 hours?

By treating Observability as Memory, we found that agents could self-correct without human intervention. We are sharing this now because we believe this "Evidence-First" architecture is the missing link for reliable autonomous systems.

The Problem: Traditional Agent Coordination is Complex

When building multi-agent systems, the typical approach is:

Agent A → Message Bus → Agent B
Agent A ← Message Bus ← Agent B

This creates several challenges:

Coordination Overhead: Explicit message passing requires careful protocol design.
Debugging Nightmares: When something goes wrong, you're piecing together messages across agents.
Scaling Issues: Routing logic grows much faster than the agent count, because each new agent can add a route to every existing one.
State Management: Keeping agents synchronized requires complex state machines.

We experienced this firsthand when building our AI-powered development workflows. The message passing overhead was becoming a bottleneck.

The Insight: Telemetry IS Coordination

Here's the key insight: if all agents emit structured telemetry to a shared observability backend, they can coordinate through that shared context.

Instead of:

Agent A: "I finished task X, here's the result"
Agent B: [receives message, processes, responds]

We have:

Agent A: [emits telemetry span: "task_x_complete" with attributes]
Agent B: [queries telemetry, sees task_x_complete, proceeds]

This fundamentally changes the architecture from a "Push" model to a "Stigmergic" model (where agents react to traces left in the environment).
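
To make the push-versus-stigmergic distinction concrete, here is a minimal sketch in plain Python. It assumes an in-memory list as a stand-in for the shared observability backend; `SharedTraceStore`, `Span`, `agent_a`, and `agent_b` are hypothetical names for illustration, not part of otel-ops-pack.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """Simplified stand-in for a telemetry span (hypothetical, for illustration)."""
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)


class SharedTraceStore:
    """In-memory stand-in for a shared observability backend."""

    def __init__(self) -> None:
        self._spans: list[Span] = []

    def emit(self, name: str, **attributes: Any) -> None:
        self._spans.append(Span(name, attributes))

    def find(self, name: str) -> list[Span]:
        return [s for s in self._spans if s.name == name]


def agent_a(store: SharedTraceStore) -> None:
    # Agent A records what it did; it does not know or care who reads it.
    store.emit("task_x_complete", agent_id="agent-a", result="ok")


def agent_b(store: SharedTraceStore) -> str:
    # Agent B reads the shared environment instead of waiting for a message.
    if store.find("task_x_complete"):
        return "proceed"
    return "wait"


store = SharedTraceStore()
before = agent_b(store)  # nothing has been emitted yet
agent_a(store)
after = agent_b(store)   # the trace left by Agent A is now visible
print(before, after)
```

Note that neither agent holds a reference to the other. The only shared dependency is the store, which is the property the real backend is meant to provide.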

How It Works: The BossCat Approach

We built otel-ops-pack, a framework for OpenTelemetry-based agent coordination. Here is the core loop:

1. Agents Emit Structured Telemetry

Every agent operation becomes a telemetry span with rich attributes:


python
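from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("agent_task") as span:
    span.set_attribute("agent.id", "cursor-agent-1")
    span.set_attribute("task.type", "code_generation")
    # Do work...
    # Record the outcome only after the work has actually finished.
    span.set_attribute("task.status", "complete")
    span.set_attribute("quality.score", 0.95)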

2. Agents Query Their Own Telemetry

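This is the key step: agents can query the telemetry backend (SigNoz/ClickHouse) to understand what they and other agents have done:

python
def check_prerequisites(task_id):
    """Return True if at least one prerequisite span for task_id is complete."""
    # Bind task_id as a query parameter rather than interpolating it into the
    # SQL string, so a malformed ID cannot alter the query. The placeholder
    # syntax and the `parameters` keyword depend on the client library in use.
    query = """
    SELECT status, quality_score
    FROM spans
    WHERE task_parent_id = %(task_id)s
      AND status = 'complete'
    """
    results = telemetry_client.query(query, parameters={"task_id": task_id})
    return len(results) > 0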
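
The query step can be tried end to end without a real backend. The sketch below assumes an in-memory SQLite table standing in for the ClickHouse `spans` table, with the same column names as the query above. `make_store`, `prerequisites_complete`, and `wait_for_prerequisites` are hypothetical helpers, and the polling interval and timeout are illustrative values only.

```python
import sqlite3
import time


def make_store() -> sqlite3.Connection:
    """Create an in-memory table that mimics the columns used in the query above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE spans (task_parent_id TEXT, status TEXT, quality_score REAL)"
    )
    return conn


def prerequisites_complete(conn: sqlite3.Connection, task_id: str) -> bool:
    # The task ID is bound as a parameter, never formatted into the SQL text.
    rows = conn.execute(
        "SELECT status, quality_score FROM spans "
        "WHERE task_parent_id = ? AND status = 'complete'",
        (task_id,),
    ).fetchall()
    return len(rows) > 0


def wait_for_prerequisites(
    conn: sqlite3.Connection, task_id: str, timeout_s: float = 1.0, poll_s: float = 0.05
) -> bool:
    """Poll the store until a completed prerequisite appears or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if prerequisites_complete(conn, task_id):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


conn = make_store()
ready_before = wait_for_prerequisites(conn, "task-42", timeout_s=0.2)
conn.execute("INSERT INTO spans VALUES (?, ?, ?)", ("task-42", "complete", 0.95))
ready_after = wait_for_prerequisites(conn, "task-42", timeout_s=0.2)
print(ready_before, ready_after)
```

Polling with a deadline is one simple option. A production agent would likely also want backoff and a way to tell "not finished yet" apart from "failed".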

3. Emergent Coordination

Agents naturally coordinate because they are all working from the same source of truth. There are no explicit messages needed, and the audit trail is built-in automatically.

Evidence-Based Governance

On top of telemetry-based coordination, we built BossCat, an evidence-first governance framework.

Instead of hoping agents do good work, we use gates—checkpoints that require evidence before proceeding. A gate can only pass if the agent provides a specific telemetry query result.

The "Evidence Rule": An agent cannot simply report "Security Check Passed." It must produce the span ID where the security tool wrote its output. This prevents "hallucinated compliance."

gate_requirements:
  - name: "security_scan"
    evidence_type: "telemetry_span"
    span_name: "security_scan_complete"
    required_attributes:
      - "vulnerabilities.critical: 0"
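
One way to read the Evidence Rule is as a small check over recorded spans. The sketch below assumes the gate config has already been loaded into a plain dict, that spans are looked up by ID from an in-memory mapping, and that each `required_attributes` entry means "attribute equals value". `evaluate_gate` and the span IDs are hypothetical, not the BossCat implementation.

```python
from typing import Any, Optional

# The gate from the config above, written as a dict to avoid a YAML dependency.
GATE: dict[str, Any] = {
    "name": "security_scan",
    "evidence_type": "telemetry_span",
    "span_name": "security_scan_complete",
    "required_attributes": ["vulnerabilities.critical: 0"],
}


def evaluate_gate(
    gate: dict[str, Any], spans: dict[str, dict[str, Any]], span_id: Optional[str]
) -> bool:
    """Pass only if span_id names a recorded span that satisfies the gate."""
    span = spans.get(span_id) if span_id else None
    if span is None or span["name"] != gate["span_name"]:
        return False  # no evidence, or evidence from the wrong step
    for requirement in gate["required_attributes"]:
        # Assumed semantics: "key: value" means the attribute must equal value.
        key, _, expected = requirement.partition(":")
        actual = span["attributes"].get(key.strip())
        if actual is None or str(actual) != expected.strip():
            return False
    return True


# Hypothetical spans as the security tool might have recorded them.
spans = {
    "span-001": {
        "name": "security_scan_complete",
        "attributes": {"vulnerabilities.critical": 0},
    },
    "span-002": {
        "name": "security_scan_complete",
        "attributes": {"vulnerabilities.critical": 2},
    },
}

clean = evaluate_gate(GATE, spans, "span-001")        # real span, zero criticals
vulnerable = evaluate_gate(GATE, spans, "span-002")   # real span, fails the check
fabricated = evaluate_gate(GATE, spans, "span-999")   # span ID that does not exist
missing = evaluate_gate(GATE, spans, None)            # agent offered no evidence
print(clean, vulnerable, fabricated, missing)
```

The point of the design is the third case: an agent that claims a pass but cites a span that was never written gets rejected the same way as one that cites nothing.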

Results

With BossCat governance:

    96% of gates pass on the first attempt.

    Debug time for complex workflows dropped from hours to seconds.

    Code reduction: We removed ~85% of our coordination logic.

Why This Matters

We believe this is the future of AI infrastructure. As we move toward autonomous swarms, we cannot rely on fragile message buses. We need systems that are self-documenting, self-auditing, and self-correcting by design.

We are open-sourcing the otel-ops-pack to help others build this future.

About the Author

I am an infrastructure engineer focused on High-Reliability Agent Systems. I have been building the BossCat Protocol since 2023 to solve the "black box" problem of AI coordination.

I am currently standardizing this protocol. If you are building agent infrastructure at scale, let's talk.