Tim Wood

Posted on Jun 16 • Edited on Jul 27

GoAgentX Architecture Deep Dive (Part 2): AHP — The Communication Backbone for Multi-Agents

#ai #agents #go

This is the second article in the GoAgentX series. Read Part 1 here

A common question when talking about multi-agent systems is: How do agents talk to each other? HTTP? WebSocket? Message queues?

My blunt answer was:
They’re running in the same process. Why the hell would I go through the network just to chat?
That’s how AHP (Agent Hierarchical Protocol) was born — a lightweight, network-free communication protocol built purely with Go channels.

Why Build My Own Wheel?

In GoAgentX, we have two main roles: Leader Agent (the boss that delegates tasks) and Sub Agents (the workers). The headaches they need to solve include:

Asynchronous task dispatching
Progress reporting
Heartbeat & liveness detection
Fault tolerance and dead letter handling

I tried Redis, looked at RabbitMQ… they were either too heavy, too slow, or just not feeling right for an in-process system. So I decided to build something simple, fast, and Go-native.
Core Advantages:

Blazing fast (channel vs network RTT)
No serialization overhead in single-process mode
Easy to evolve — swap the backend to gRPC later without changing business logic

Core Components of AHP

Protocol: Facade that ties everything together
MessageQueue: Buffered channel + backup + atomic flags
HeartbeatMonitor: Detects dead agents with smart timeout logic
DLQ (Dead Letter Queue): Handles failed messages with retry support
QueueRegistry: Lazy loading with double-checked locking

Message Model

AHP defines 5 message types: TASK, RESULT, PROGRESS, ACK, HEARTBEAT.

const (
    AHPMethodTask      AHPMethod = "TASK"      // Task allocation
    AHPMethodResult    AHPMethod = "RESULT"     // Task Result 
    AHPMethodProgress  AHPMethod = "PROGRESS"   // Progress feedback
    AHPMethodACK       AHPMethod = "ACK"        // Confirm Reply
    AHPMethodHeartbeat AHPMethod = "HEARTBEAT"  // Heartbeat signal
)

message struct

type AHPMessage struct {
    MessageID   string         `json:"message_id"`
    Method      AHPMethod      `json:"method"`
    AgentID     string         `json:"agent_id"`
    TargetAgent string         `json:"target_agent"`
    TaskID      string         `json:"task_id"`
    SessionID   string         `json:"session_id"`
    Payload     map[string]any `json:"payload"`
    Timestamp   time.Time      `json:"timestamp"`
}

MessageID Generation

MessageID is designed as a three-part ID:

func generateMessageID() string {
    id := atomic.AddUint64(&messageIDCounter, 1)
    randSuffix := getRandomSuffix()
    return fmt.Sprintf("%s.%d.%s",
        time.Now().Format("20060102150405.000000"), id, randSuffix)
}

Timestamp prefix: Highly readable, facilitates troubleshooting.
Atomic counter: Sequence numbers of multiple messages within the same nanosecond increment.
Random suffix: Avoids conflicts in multi-process scenarios.

This scheme does not rely on a global coordinator and guarantees uniqueness within the process.

Don't worry if you don't quite understand; I've simply borrowed some design elements from blockchains. I previously participated in the design of a public chain, and the messages were referenced. There's no other intention; its main purpose is state synchronization.

HeartbeatMonitor

Core Process:

Each Agent sends a heartbeat at a fixed interval (default 5 seconds).
HeartbeatMonitor records the time of the most recent heartbeat.
If the timeout period (default 30 seconds) is exceeded and the number of consecutive missed heartbeats reaches the threshold (default 3 times), the agent is marked as offline.

Timeout detection algorithm

func (m *HeartbeatMonitor) CheckTimeouts() []string {
    timedOut := m.checkAndMarkOffline()  // 写锁下检测
    for _, agentID := range timedOut {
        m.notifyCallbacks(agentID)        // 锁外执行回调
    }
    return timedOut
}

Key Boundary Condition Handling:

Progressive Timeout: Offline is only determined after 3 missed heartbeats, avoiding false positives due to occasional network latency. 2.Avoid Duplicate Callbacks: Offline agents will not trigger callbacks again. 3.Callbacks Executed Outside Lock: The notifyCallbacks list is copied, the lock is released, and then execution begins; this is crucial for preventing deadlocks.

There are two types of HeartbeatSenders:

ahp.HeartbeatSender: Sends AHPMethodHeartbeat messages to the target's MessageQueue, which includes an internal heartbeat.
heartbeatSender(in internal/agents/sub/): Directly calls HeartbeatMonitor.RecordHeartbeat, which includes an external heartbeat.

Currently, the Sub Agent uses the second method, which is more efficient in monolithic deployments.

Dead-Letter Queues: DLQ

When the Enqueue returns an error, Protocol.SendMessage routes the failure message to the DLQ:

func classifyEnqueueError(err error) string {
    switch {
    case errors.Is(err, apperrors.ErrQueueClosed):  return "queue_closed"
    case errors.Is(err, apperrors.ErrQueueFull):    return "queue_full"
    case errors.Is(err, context.Canceled):          return "context_canceled"
    case errors.Is(err, context.DeadlineExceeded):  return "context_deadline"
    default:                                        return "unknown"
    }
}

DLQProcessor supports registering custom processors based on error type and supports automatic retries:

MaxRetries = 0: Infinite retries
MaxRetries > 0: Marked as exhausted after reaching the specified number of retries.
Currently, there is no exponential backoff, which is an area for improvement.

Key Design Decisions

Why Non-Blocking Enqueues?

The Agent operates in a multi-threaded environment; blocking can lead to cascading waits.
DLQ provides better fault-tolerance semantics; failed messages can be retried.
The caller has greater control: immediate retry, later retry, or discard.

Avoiding TOCTOUs

SendMessage has a key design flaw: it doesn't check IsFull before enqueuing. If it checks IsFull before enqueuing, the queue might go from not full to full between the check and enqueue (TOCTOU race condition), leading to message loss. Directly executing operations and handling errors is more robust.

Serialization Reserved for Extensions

Currently, AHP is purely intra-process communication, and JSON is sufficient. However, the Codec interface reserves extensions in two directions:

Inter-process communication: protobuf/msgpack can provide a smaller payload.

Persistence: If DLQ messages are written to disk, binary format is more advantageous.

Now for the much-anticipated exposé:

To be honest, AHP isn't perfect. I've encountered some pitfalls myself:

Purely in-process: Can't cross processes; for distributed systems, you'd need to switch to a MessageQueue implementation.
No broadcast: Need to send messages to multiple Subs? You have to send them one by one.
Insufficient retry strategy: DLQ retry intervals are fixed, without exponential backoff—continuous failures could cause a retry storm.
Inflexible routing: Doesn't support content-based dynamic routing or Topic subscriptions.

However, these limitations are trade-offs made as needed—in the monolithic phase, the channel solution saves 90% of the distributed complexity. If you really want to implement microservices in the future, you only need to replace the underlying implementation; the business logic code above doesn't need to be changed. That's the benefit of an abstraction layer. When used on a single machine, it is indeed quite smooth. However, if you are going to use this project for an interview, I suggest you prepare how to handle it in a distributed environment. The purpose of writing this is to serve as a starting point. In the AI era, in addition to basic programming skills, I personally suggest you hone your software engineering skills. After all, being an LLM API caller is not our goal.

In summary, AHP is the communication framework I developed for GoAgentX. Channel message passing, DLQ fallback, and HeartbeatMonitor monitoring – these three components together provide the complete infrastructure for multi-agent communication.

The interfaces left in the code (Codec, DLQ handler, MessageSender) are essentially backup plans: if you need to switch to gRPC or RabbitMQ later, simply change the implementation layer; the underlying business logic remains unchanged. This design is especially important in startup projects – you never know what the architecture will look like tomorrow.

Next, let's talk about memory distillation – how agents extract useful experience from hundreds of conversations in their history. This is also a key technology I think is frequently asked in agent interviews: how to ensure the model's information is not distorted in long-context scenarios.

GitHub: https://github.com/Timwood0x10/GoAgentX
What do you think? Star, try it, or drop your feedback — highly appreciated!

DEV Community