DEV Community

ameyb88

Introducing AOP: A Common Protocol for Multi-Agent AI Systems

Every multi-agent AI framework invents its own way for agents to talk to each other. LangGraph has one format. CrewAI has another. AutoGen has a third. An agent built for one framework can't work with another without a complete rewrite.

We've been here before. Before HTTP, every networked app invented its own protocol. Before the Language Server Protocol, every editor built its own language intelligence. Before OpenAPI, every API documented itself differently.

Today I'm publishing the Agent Orchestration Protocol (AOP) — an open specification that defines how AI agents communicate in multi-agent systems.

GitHub: github.com/ameyb88/agent-orchestration-protocol


The Problem

If you've built multi-agent systems, you've hit these walls:

  1. Agents aren't portable. A security analysis agent built for LangChain needs a full rewrite to work with AutoGen.

  2. Orchestrators are framework-locked. Switching frameworks means rewriting every agent adapter.

  3. Confidence scores are meaningless across systems. When Agent A says "0.8 confidence" and Agent B says "0.8 confidence," those numbers might mean completely different things.

  4. No shared vocabulary. There's no standard way for an agent to declare what it can do, report what it found, or ask a human for help.

What AOP Defines

AOP specifies five message types, all with formal JSON schemas:

1. Capability Declaration

An agent tells the orchestrator what it can do:

```json
{
  "agent_id": "security-analyzer-v2",
  "agent_version": "2.1.0",
  "protocol_version": "aop/1.0",
  "display_name": "Security Vulnerability Analyzer",
  "supported_task_types": ["code_review", "security_scan"],
  "confidence_range": {
    "min": 0.0, "max": 1.0, "typical": [0.6, 0.95]
  },
  "supported_languages": ["typescript", "python", "go"],
  "typical_latency_ms": { "p50": 8000, "p95": 25000 }
}
```

Think of it like a service declaring its API. The orchestrator uses this to decide which agents to dispatch for a given task.
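As a sketch of that dispatch step, here's what matching a task against declared capabilities might look like in TypeScript (the reference implementation's language). The field names mirror the capability declaration above; the `selectAgents` helper and the registry shape are illustrative, not part of the spec:

```typescript
// Illustrative only: capability-based dispatch over declared fields.
interface CapabilityDeclaration {
  agent_id: string;
  supported_task_types: string[];
  supported_languages?: string[];
}

function selectAgents(
  registry: CapabilityDeclaration[],
  taskType: string,
  language?: string,
): CapabilityDeclaration[] {
  return registry.filter(
    (cap) =>
      cap.supported_task_types.includes(taskType) &&
      // An agent with no language list is treated as language-agnostic.
      (language === undefined ||
        cap.supported_languages === undefined ||
        cap.supported_languages.includes(language)),
  );
}

const registry: CapabilityDeclaration[] = [
  {
    agent_id: "security-analyzer-v2",
    supported_task_types: ["code_review", "security_scan"],
    supported_languages: ["typescript", "python", "go"],
  },
  { agent_id: "perf-analyzer-v1", supported_task_types: ["perf_scan"] },
];

console.log(selectAgents(registry, "code_review", "typescript").map((c) => c.agent_id));
// ["security-analyzer-v2"]
```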

2. Task Request

The orchestrator sends work to an agent:

```json
{
  "task_id": "tr-20260402-a7f3b2c1",
  "task_type": "code_review",
  "priority": "high",
  "timeout_ms": 30000,
  "context": {
    "repository": "acme/web-app",
    "pull_request": { "number": 142 },
    "diff": "--- a/src/auth.ts\n+++ b/src/auth.ts\n..."
  },
  "constraints": {
    "max_findings": 20,
    "min_confidence": 0.5,
    "focus_areas": ["security", "null_handling"]
  }
}
```

3. Agent Response

The agent returns structured findings with calibrated confidence scores:

```json
{
  "task_id": "tr-20260402-a7f3b2c1",
  "agent_id": "security-analyzer-v2",
  "status": "completed",
  "findings": [{
    "finding_id": "f-001",
    "category": "security",
    "severity": "critical",
    "title": "SQL injection via string interpolation",
    "location": { "file": "src/auth.ts", "start_line": 18 },
    "confidence": {
      "score": 0.92,
      "evidence_strength": 0.95,
      "model_certainty": 0.90,
      "context_completeness": 0.88
    },
    "evidence": [{
      "type": "code_pattern",
      "content": "db.query(`SELECT * FROM users WHERE id = ${userId}`)"
    }],
    "suggested_actions": [{
      "action": "replace",
      "description": "Use parameterized query",
      "priority": "required"
    }]
  }],
  "metadata": {
    "execution_time_ms": 12400,
    "tokens_used": { "input": 15200, "output": 3100 }
  }
}
```
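Note how the response pairs with the task request's `constraints`: somebody has to enforce `min_confidence` and `max_findings`. A minimal TypeScript sketch of that enforcement, assuming it happens orchestrator-side; `applyConstraints` is an illustrative helper, not a spec-defined function:

```typescript
// Illustrative only: enforcing a task request's constraints on a response.
interface Finding {
  finding_id: string;
  confidence: { score: number };
}

function applyConstraints(
  findings: Finding[],
  constraints: { min_confidence?: number; max_findings?: number },
): Finding[] {
  const min = constraints.min_confidence ?? 0;
  const kept = findings
    .filter((f) => f.confidence.score >= min)
    // When over the cap, keep the highest-confidence findings.
    .sort((a, b) => b.confidence.score - a.confidence.score);
  return constraints.max_findings === undefined
    ? kept
    : kept.slice(0, constraints.max_findings);
}
```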

4. Handoff Context

When Agent A's output needs to go to Agent B:

```json
{
  "handoff_id": "ho-20260402-x9k2m1",
  "source_agent_id": "security-analyzer-v2",
  "target_agent_id": "logic-analyzer-v1",
  "handoff_reason": "Security findings require logic validation",
  "accumulated_findings": [
    { "finding_id": "f-001", "confidence": { "score": 0.92 } }
  ],
  "confidence_chain": [
    { "agent_id": "security-analyzer-v2", "aggregate_confidence": 0.88 }
  ]
}
```

5. Escalation Request

When an agent or orchestrator needs a human:

```json
{
  "escalation_id": "esc-20260402-p3q8",
  "level": "review_requested",
  "reason": "Conflicting recommendations between security and performance agents",
  "suggested_options": [
    { "option_id": "a", "description": "Prioritize security", "recommended": true },
    { "option_id": "b", "description": "Prioritize performance" }
  ],
  "timeout_ms": 86400000,
  "auto_resolve": { "option_id": "a", "after_ms": 86400000 }
}
```
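The `auto_resolve` field is what keeps a missing human from stalling the pipeline. One way an orchestrator might act on it, sketched in TypeScript; the spec defines the message shape, and the timer logic here is an illustrative assumption:

```typescript
// Illustrative only: resolving an escalation, falling back to auto_resolve.
interface EscalationRequest {
  escalation_id: string;
  suggested_options: { option_id: string; recommended?: boolean }[];
  auto_resolve?: { option_id: string; after_ms: number };
}

function resolveEscalation(
  esc: EscalationRequest,
  humanChoice: string | null,
  elapsedMs: number,
): string | null {
  if (humanChoice !== null) return humanChoice; // a human answer always wins
  if (esc.auto_resolve && elapsedMs >= esc.auto_resolve.after_ms) {
    return esc.auto_resolve.option_id; // timed out: apply the preset option
  }
  return null; // still waiting on a human
}
```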

Why Confidence Scoring Needs a Standard

This is the core insight behind AOP: confidence scores are useless without shared semantics.

If your security agent says "0.8 confidence" and your performance agent says "0.8 confidence," you can't compare them, aggregate them, or filter by them unless both numbers mean the same thing.

AOP defines a concrete scale:

| Range | Label | Meaning |
| --- | --- | --- |
| 0.0–0.3 | Low | Speculative. Suppress unless configured for maximum sensitivity. |
| 0.3–0.6 | Moderate | Partial evidence. Needs human verification. |
| 0.6–0.8 | High | Strong evidence. Surface for review. |
| 0.8–0.95 | Very High | Almost certainly correct. May auto-resolve in low-risk contexts. |
| 0.95–1.0 | Near Certain | Deterministic checks only (regex-matched secrets, schema violations). |

And a calibration requirement: an agent producing findings at 0.8 confidence should be correct approximately 80% of the time. This makes confidence scores comparable across agents, frameworks, and domains.
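The calibration requirement is checkable after the fact. A TypeScript sketch, assuming you have findings labeled correct/incorrect: bucket them by confidence and compare each bucket's mean confidence to its observed accuracy. The bucketing scheme and any tolerance you apply are policy choices, not part of the spec:

```typescript
// Illustrative only: measuring how far an agent's confidence scores drift
// from its observed accuracy, per confidence bucket.
function calibrationGaps(
  samples: { confidence: number; correct: boolean }[],
  bucketWidth = 0.1,
): { meanConfidence: number; accuracy: number; gap: number }[] {
  const nBuckets = Math.round(1 / bucketWidth);
  const buckets = new Map<number, { confSum: number; hits: number; n: number }>();
  for (const s of samples) {
    // Clamp so confidence === 1.0 falls in the top bucket.
    const b = Math.min(Math.floor(s.confidence / bucketWidth), nBuckets - 1);
    const cur = buckets.get(b) ?? { confSum: 0, hits: 0, n: 0 };
    cur.confSum += s.confidence;
    cur.hits += s.correct ? 1 : 0;
    cur.n += 1;
    buckets.set(b, cur);
  }
  return [...buckets.values()].map((v) => {
    const meanConfidence = v.confSum / v.n;
    const accuracy = v.hits / v.n;
    return { meanConfidence, accuracy, gap: Math.abs(meanConfidence - accuracy) };
  });
}
```

A well-calibrated agent reporting 0.8 that is right 8 times out of 10 shows a near-zero gap; a large gap in any bucket is the signal to recalibrate.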

Each score also decomposes into three factors:

  • Evidence strength — How concrete is the proof?
  • Model certainty — How sure is the LLM?
  • Context completeness — Did the agent have enough information?

This lets orchestrators make nuanced decisions. A finding with high evidence strength but low context completeness might still be worth surfacing. One with high model certainty but low evidence strength is probably a hallucination.
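Those two rules of thumb can be written down directly. A minimal TypeScript triage sketch; the thresholds and labels here are illustrative policy choices, not values mandated by the spec:

```typescript
// Illustrative only: a triage rule built on the three confidence factors.
interface ConfidenceDetail {
  score: number;
  evidence_strength: number;
  model_certainty: number;
  context_completeness: number;
}

type Triage = "surface" | "surface_with_caveat" | "likely_hallucination" | "suppress";

function triage(c: ConfidenceDetail): Triage {
  // Confident model with no concrete proof: treat as a probable hallucination.
  if (c.model_certainty >= 0.8 && c.evidence_strength < 0.4) return "likely_hallucination";
  // Solid proof, but the agent lacked context: surface it, flagged for review.
  if (c.evidence_strength >= 0.8 && c.context_completeness < 0.5) return "surface_with_caveat";
  return c.score >= 0.6 ? "surface" : "suppress";
}
```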


How It Works in Practice

Imagine an orchestrator reviewing PR #142. It dispatches three agents in parallel:

  1. Security Agent finds a SQL injection (confidence: 0.92)
  2. Logic Agent finds a null handling issue (confidence: 0.78)
  3. Performance Agent flags an unbounded query (confidence: 0.71)

The orchestrator aggregates:

  • Deduplication: Security and Logic agents both flagged overlapping lines. Different categories, so both are preserved and cross-referenced.
  • Threshold filtering: All findings exceed the 0.7 threshold. Included.
  • Conflict resolution: No contradictions in this case. If there were, the orchestrator uses configurable strategies: highest confidence wins, consensus required, or human escalation.
  • Output: One consolidated PR review with three findings, sorted by severity.

The entire exchange uses standardized JSON — the agents don't need to know about each other, and the orchestrator doesn't need agent-specific adapters.
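The aggregation pass above can be sketched as one TypeScript function: drop sub-threshold findings, treat same-category findings at the same location as duplicates while preserving cross-category overlaps, and sort by severity. This is an illustrative sketch, not code from agent-orchestra:

```typescript
// Illustrative only: dedup + threshold filter + severity sort.
interface AggFinding {
  finding_id: string;
  category: string;
  severity: "critical" | "warning" | "info";
  location: { file: string; start_line: number };
  confidence: { score: number };
}

const severityRank = { critical: 0, warning: 1, info: 2 } as const;

function aggregate(findings: AggFinding[], threshold: number): AggFinding[] {
  const kept: AggFinding[] = [];
  for (const f of findings) {
    if (f.confidence.score < threshold) continue;
    // Same category at the same location is a duplicate; different
    // categories on overlapping lines are both preserved.
    const isDup = kept.some(
      (k) =>
        k.category === f.category &&
        k.location.file === f.location.file &&
        k.location.start_line === f.location.start_line,
    );
    if (!isDup) kept.push(f);
  }
  return kept.sort((a, b) => severityRank[a.severity] - severityRank[b.severity]);
}
```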


Reference Implementation

agent-orchestra is the reference implementation of AOP in TypeScript/Node.js. It implements all five message types and includes confidence-gated orchestration out of the box.

I've built two production systems on top of AOP:

  • Code Sentinel — Autonomous AI codebase maintenance with multi-agent orchestration
  • ReviewStack (coming soon) — Multi-agent PR review as a GitHub App

The Spec

The full specification is at github.com/ameyb88/agent-orchestration-protocol. It includes:

  • 12 sections covering every message type, aggregation rules, security considerations, and transport bindings
  • JSON Schemas for all message types
  • Complete worked examples
  • MIT licensed

It follows RFC conventions (MUST/SHOULD/MAY from RFC 2119) and is transport-agnostic — works over HTTP, message queues, gRPC, or in-process function calls.


What's Next

AOP is a draft. I'm looking for feedback from anyone building multi-agent systems:

  • Does this match the problems you're hitting?
  • What's missing from the spec?
  • Would you implement AOP in your framework?

Star the repo, file issues, or open a PR. Let's make AI agents interoperable.
