This is the second article in the GoAgentX series. Read Part 1 here
A common question when talking about multi-agent systems is: How do agents talk to each other? HTTP? WebSocket? Message queues?
My blunt answer was:
They’re running in the same process. Why the hell would I go through the network just to chat?
That’s how AHP (Agent Hierarchical Protocol) was born — a lightweight, network-free communication protocol built purely with Go channels.
Why Build My Own Wheel?
In GoAgentX, we have two main roles: Leader Agent (the boss that delegates tasks) and Sub Agents (the workers). The headaches they need to solve include:
- Asynchronous task dispatching
- Progress reporting
- Heartbeat & liveness detection
- Fault tolerance and dead letter handling
I tried Redis, looked at RabbitMQ… they were either too heavy, too slow, or just not feeling right for an in-process system. So I decided to build something simple, fast, and Go-native.
Core Advantages:
- Blazing fast (channel vs network RTT)
- No serialization overhead in single-process mode
- Easy to evolve — swap the backend to gRPC later without changing business logic
Core Components of AHP
- Protocol: Facade that ties everything together
- MessageQueue: Buffered channel + backup + atomic flags
- HeartbeatMonitor: Detects dead agents with smart timeout logic
- DLQ (Dead Letter Queue): Handles failed messages with retry support
- QueueRegistry: Lazy loading with double-checked locking
Message Model
AHP defines 5 message types: TASK, RESULT, PROGRESS, ACK, HEARTBEAT.
const (
AHPMethodTask AHPMethod = "TASK" // Task allocation
AHPMethodResult AHPMethod = "RESULT" // Task Result
AHPMethodProgress AHPMethod = "PROGRESS" // Progress feedback
AHPMethodACK AHPMethod = "ACK" // Confirm Reply
AHPMethodHeartbeat AHPMethod = "HEARTBEAT" // Heartbeat signal
)
message struct
type AHPMessage struct {
MessageID string `json:"message_id"`
Method AHPMethod `json:"method"`
AgentID string `json:"agent_id"`
TargetAgent string `json:"target_agent"`
TaskID string `json:"task_id"`
SessionID string `json:"session_id"`
Payload map[string]any `json:"payload"`
Timestamp time.Time `json:"timestamp"`
}
MessageID Generation
MessageID is designed as a three-part ID:
func generateMessageID() string {
id := atomic.AddUint64(&messageIDCounter, 1)
randSuffix := getRandomSuffix()
return fmt.Sprintf("%s.%d.%s",
time.Now().Format("20060102150405.000000"), id, randSuffix)
}
- Timestamp prefix: Highly readable, facilitates troubleshooting.
- Atomic counter: Sequence numbers of multiple messages within the same nanosecond increment.
- Random suffix: Avoids conflicts in multi-process scenarios.
This scheme does not rely on a global coordinator and guarantees uniqueness within the process.
Don't worry if you don't quite understand; I've simply borrowed some design elements from blockchains. I previously participated in the design of a public chain, and the messages were referenced. There's no other intention; its main purpose is state synchronization.
HeartbeatMonitor
Core Process:
- Each Agent sends a heartbeat at a fixed interval (default 5 seconds).
- HeartbeatMonitor records the time of the most recent heartbeat.
- If the timeout period (default 30 seconds) is exceeded and the number of consecutive missed heartbeats reaches the threshold (default 3 times), the agent is marked as offline.
Timeout detection algorithm
func (m *HeartbeatMonitor) CheckTimeouts() []string {
timedOut := m.checkAndMarkOffline() // 写锁下检测
for _, agentID := range timedOut {
m.notifyCallbacks(agentID) // 锁外执行回调
}
return timedOut
}
Key Boundary Condition Handling:
- Progressive Timeout: Offline is only determined after 3 missed heartbeats, avoiding false positives due to occasional network latency. 2.Avoid Duplicate Callbacks: Offline agents will not trigger callbacks again. 3.Callbacks Executed Outside Lock: The notifyCallbacks list is copied, the lock is released, and then execution begins; this is crucial for preventing deadlocks.
There are two types of HeartbeatSenders:
-
ahp.HeartbeatSender: Sends AHPMethodHeartbeat messages to the target's MessageQueue, which includes an internal heartbeat. -
heartbeatSender(in internal/agents/sub/): Directly calls HeartbeatMonitor.RecordHeartbeat, which includes an external heartbeat.
Currently, the Sub Agent uses the second method, which is more efficient in monolithic deployments.
Dead-Letter Queues: DLQ
When the Enqueue returns an error, Protocol.SendMessage routes the failure message to the DLQ:
func classifyEnqueueError(err error) string {
switch {
case errors.Is(err, apperrors.ErrQueueClosed): return "queue_closed"
case errors.Is(err, apperrors.ErrQueueFull): return "queue_full"
case errors.Is(err, context.Canceled): return "context_canceled"
case errors.Is(err, context.DeadlineExceeded): return "context_deadline"
default: return "unknown"
}
}
DLQProcessor supports registering custom processors based on error type and supports automatic retries:
- MaxRetries = 0: Infinite retries
- MaxRetries > 0: Marked as exhausted after reaching the specified number of retries.
- Currently, there is no exponential backoff, which is an area for improvement.
Key Design Decisions
Why Non-Blocking Enqueues?
- The Agent operates in a multi-threaded environment; blocking can lead to cascading waits.
- DLQ provides better fault-tolerance semantics; failed messages can be retried.
- The caller has greater control: immediate retry, later retry, or discard.
Avoiding TOCTOUs
SendMessage has a key design flaw: it doesn't check IsFull before enqueuing. If it checks IsFull before enqueuing, the queue might go from not full to full between the check and enqueue (TOCTOU race condition), leading to message loss. Directly executing operations and handling errors is more robust.
Serialization Reserved for Extensions
Currently, AHP is purely intra-process communication, and JSON is sufficient. However, the Codec interface reserves extensions in two directions:
Inter-process communication: protobuf/msgpack can provide a smaller payload.
Persistence: If DLQ messages are written to disk, binary format is more advantageous.
Now for the much-anticipated exposé:
To be honest, AHP isn't perfect. I've encountered some pitfalls myself:
Purely in-process: Can't cross processes; for distributed systems, you'd need to switch to a MessageQueue implementation.
No broadcast: Need to send messages to multiple Subs? You have to send them one by one.
Insufficient retry strategy: DLQ retry intervals are fixed, without exponential backoff—continuous failures could cause a retry storm.
Inflexible routing: Doesn't support content-based dynamic routing or Topic subscriptions.
However, these limitations are trade-offs made as needed—in the monolithic phase, the channel solution saves 90% of the distributed complexity. If you really want to implement microservices in the future, you only need to replace the underlying implementation; the business logic code above doesn't need to be changed. That's the benefit of an abstraction layer. When used on a single machine, it is indeed quite smooth. However, if you are going to use this project for an interview, I suggest you prepare how to handle it in a distributed environment. The purpose of writing this is to serve as a starting point. In the AI era, in addition to basic programming skills, I personally suggest you hone your software engineering skills. After all, being an LLM API caller is not our goal.
In summary, AHP is the communication framework I developed for GoAgentX. Channel message passing, DLQ fallback, and HeartbeatMonitor monitoring – these three components together provide the complete infrastructure for multi-agent communication.
The interfaces left in the code (Codec, DLQ handler, MessageSender) are essentially backup plans: if you need to switch to gRPC or RabbitMQ later, simply change the implementation layer; the underlying business logic remains unchanged. This design is especially important in startup projects – you never know what the architecture will look like tomorrow.
Next, let's talk about memory distillation – how agents extract useful experience from hundreds of conversations in their history. This is also a key technology I think is frequently asked in agent interviews: how to ensure the model's information is not distorted in long-context scenarios.
GitHub: https://github.com/Timwood0x10/GoAgentX
What do you think? Star, try it, or drop your feedback — highly appreciated!

Top comments (0)