
mgbec for AWS Community Builders

Originally published at Medium

I can’t see without my Session ID!

I’ve been building out a multi-agent system and wanted to document some of the issues I have been running into with observability. Since we are in an evolving field, I am sure much of this will change soon, like probably next week. At this point in time, however, this is what is happening for me.

The pattern I am building on deploys a multi-agent system with four coordinating agents. The Orchestrator routes requests to the Specialist (detailed analysis), the Fact Checker (claim verification), and the Critic (quality evaluation with feedback loops).

Key Features:

- Four-agent architecture with multi-hop inter-agent communication and quality feedback loops

- ADOT (AWS Distro for OpenTelemetry) instrumentation for full distributed tracing

- Web search capability via Tavily API (Specialist and Fact Checker)

- Critic agent providing live LLM-as-a-judge quality scoring

- Automated Docker image building via CodeBuild

- S3-based source code management with change detection

- IAM-based security with least-privilege access

- Windows (PowerShell) and Linux/macOS compatible deployment

My Multi-Agent Problems — Tribbles vs Hive Mind

In my previous variant (https://github.com/mgbec/multi-agent-runtime-with-evals), I added a guardrail and an evaluator. This time around (https://github.com/mgbec/multi-agent-eval-guardrail-optimize), I wanted to experiment with more evaluation and a newer feature called optimization. This is where my troubles began. My evaluation scripts kept failing with “ERROR: Session span data is incomplete. Span with ID: aeaa2ae01c96f4db and name: invoke_agent SpecialistAge”, “Error: Session span data is incomplete. Span with ID: 7b5c234474478”, and the like. Oh so many errors!

Quite a while later, and with the help of some Kiro troubleshooting, I found the issue. I was having a problem with coherence. That is not the first time I have heard that, but on this occasion it was session coherence. I built this multi-agent, looping workflow using invoke_agent_runtime via boto3 in the “agent as tool” pattern described here: https://builder.aws.com/content/3DCax04M9o7gBpAMttstsLjmPUD/multi-agent-architecture-patterns-with-amazon-bedrock-agentcore-runtime.

There are multiple ways to set this pattern up.

Our first option is having each agent spawn its own runtime session, analogous to tribbles:

The linked article calls this “Multiple Runtimes, one or few agents each”: each specialized agent is deployed as its own AgentCore Runtime and invoked over the network (InvokeAgentRuntime API, MCP, or A2A). Since the agents have their own runtimes, they also have their own session IDs. The parent agent, the orchestrator in this case, passes the parent session ID down to the child in its payload, but not in the span data.
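To make the payload-only hand-off concrete, here is a minimal sketch of how an orchestrator might prepare a call to a child runtime. It is not the code from my repo: the `parent_session_id` payload field, the helper name, and the example ARN are made up for illustration. The parameter names match the boto3 `invoke_agent_runtime` call as I understand it, but check the current boto3 reference before relying on them.

```python
import json
import uuid


def build_child_invocation(child_runtime_arn, parent_session_id, prompt):
    """Build keyword arguments for invoking a child agent runtime.

    The child gets a brand-new runtime session ID. The parent's session ID
    travels only inside the JSON payload (hypothetical field name), so it
    never becomes part of the child's span data on its own.
    """
    child_session_id = str(uuid.uuid4())  # new session for the child runtime
    payload = json.dumps(
        {
            "prompt": prompt,
            "parent_session_id": parent_session_id,  # assumed field name
        }
    ).encode("utf-8")
    return {
        "agentRuntimeArn": child_runtime_arn,
        "runtimeSessionId": child_session_id,
        "payload": payload,
    }


kwargs = build_child_invocation(
    # Placeholder ARN, not a real runtime.
    "arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/specialist-EXAMPLE",
    parent_session_id="orchestrator-session-0001",
    prompt="Analyze this claim",
)

# With AWS credentials configured, the call itself would look roughly like:
#   import boto3
#   client = boto3.client("bedrock-agentcore")
#   response = client.invoke_agent_runtime(**kwargs)
```

Note that the child's `runtimeSessionId` and the parent's session ID are two different values. Only the child can decide to copy the payload field into its own telemetry.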

The second option is keeping our agents together in one runtime, Hive Mind style. The article calls this “Single Runtime, multiple agents”: the main agent and its subagents run within the same runtime process. We probably wouldn’t link billions of agents into a unified telepathic whole, but you get the picture.

Subagents share the main agent’s resources (RAM, CPU, storage, and OS namespace), so collaboration happens inside the microVM with no network overhead. All agents in the same runtime would keep everything in the same session, and that session would be easily visible in CloudWatch spans.

Architecture Trade-offs

Both the “Single Runtime, multiple agents” and “Multiple Runtimes, one or few agents each” setups are valid and have use cases. There are trade-offs, of course:

For this project (not a production system), the single-runtime approach would have made it easier to get evals and optimization working. The multi-runtime approach is better for production-grade independent scaling, separate security boundaries, or teams owning different agents. We could also build our projects as a hybrid, using new runtimes for some agents but not others. In any case, I’m not sorry I chose the harder path, since it led to some good learning opportunities.

MCP and A2A protocols follow remote agent patterns, which you can read about in the article mentioned above. Using a genuine A2A protocol instead of the quasi-A2A implementation I used would have given us a different situation: “A2A is inherently a multi-Runtime pattern. Each agent is deployed as its own AgentCore Runtime (or on another platform) and exposes an A2A server. Orchestrators and peer agents discover each other via Agent Cards and communicate using the A2A protocol.”

Observability

So, in my project, set up in tribble mode, each agent was getting its own runtime and its own session ID. The parent session ID propagates to the child sessions in the payload, and we have great observability using traces and spans.

Great data, but unfortunately it was not reaching the place evaluations look. So the evals were seeing what looked like incomplete sessions.

Evaluation

AgentCore evaluations need to read the session ID from span data to work correctly. This is where my architectural decision created problems. My current implementation passes the parent session ID to the sub-agents, but it does so in the payload. The payload isn’t an OTEL component, though, so the parent session ID does not make it into the spans of the child sessions. The parent session ID gets into the application logs, but evaluations look at the OTEL span attributes. So when I call the Evaluations API, it can’t see my entire end-to-end session.
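A toy model of the mismatch may help. The sketch below is not how the Evaluations service actually queries spans; it only assumes that the session is looked up through a span attribute (I use the key `session.id` for illustration) and shows why child spans drop out of view when each runtime stamps its own session ID. The span dictionaries and IDs are invented.

```python
def spans_for_session(spans, session_id):
    """Return only the spans whose session attribute matches session_id."""
    return [s for s in spans if s["attributes"].get("session.id") == session_id]


# Invented spans: one trace, three runtimes, three different session IDs.
spans = [
    {"name": "invoke_agent Orchestrator", "trace_id": "t1",
     "attributes": {"session.id": "parent-1"}},
    {"name": "invoke_agent Specialist", "trace_id": "t1",
     "attributes": {"session.id": "child-a"}},
    {"name": "invoke_agent FactChecker", "trace_id": "t1",
     "attributes": {"session.id": "child-b"}},
]

visible = spans_for_session(spans, "parent-1")
missing = [s["name"] for s in spans if s not in visible]

print("visible:", [s["name"] for s in visible])
print("missing:", missing)
```

All three spans share a trace ID, so tracing looks healthy, yet a lookup by the parent session ID returns only the orchestrator span.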

I have a work-around for this project, but it is not ideal. The eval_goal_attainment.py script collects spans by trace ID (shared across all agents via OTEL), unifies them under one session ID, and includes log events from all agent runtime log groups. This enables evals like Builtin.Helpfulness to work on most traces, but other evaluators may still report incomplete data.
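The core idea of the work-around can be sketched in a few lines. This is a simplified illustration, not the actual contents of eval_goal_attainment.py: the span dictionaries, the `session.id` attribute key, and the function name are assumptions, and the real script also has to fetch spans and log events from CloudWatch.

```python
import copy


def unify_session_by_trace(spans, target_session_id):
    """Relabel every span that shares a trace with the target session.

    Step 1: find the trace IDs that contain at least one span already
            tagged with target_session_id.
    Step 2: copy every span in those traces and overwrite its session
            attribute, so the whole trace reads as one session.
    """
    trace_ids = {
        s["trace_id"]
        for s in spans
        if s["attributes"].get("session.id") == target_session_id
    }
    unified = []
    for s in spans:
        if s["trace_id"] in trace_ids:
            relabeled = copy.deepcopy(s)  # leave the input spans untouched
            relabeled["attributes"]["session.id"] = target_session_id
            unified.append(relabeled)
    return unified


# Invented spans: trace t1 spans three runtimes, trace t2 is unrelated.
raw_spans = [
    {"name": "invoke_agent Orchestrator", "trace_id": "t1",
     "attributes": {"session.id": "parent-1"}},
    {"name": "invoke_agent Specialist", "trace_id": "t1",
     "attributes": {"session.id": "child-a"}},
    {"name": "invoke_agent Critic", "trace_id": "t1",
     "attributes": {"session.id": "child-b"}},
    {"name": "invoke_agent Orchestrator", "trace_id": "t2",
     "attributes": {"session.id": "parent-2"}},
]

unified_spans = unify_session_by_trace(raw_spans, "parent-1")
```

The relabeling happens on a local copy just before the spans are handed to an evaluator; nothing in CloudWatch changes, which is part of why this only helps when I control how the spans are collected.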

This work-around will not work for online evaluations or batch evaluations.

Optimization

If you haven’t used Optimization in AgentCore yet, this is the gist of it (see https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/optimization-how-it-works.html): to generate a recommendation for a better system prompt or tool description, you use your agent traces from CloudWatch Logs and specify the evaluator you want to optimize for. The service analyzes failure patterns and returns the optimized recommendations, along with an explanation of what it recommends and why.

Since optimization discovers spans from a CloudWatch Logs source, it has trouble finding the session and trace details it needs in my CloudWatch Logs. Again, running multiple agents in a single runtime would help optimization discover the span data and do what it needs to do.

Final Result

This project ended up functional and provided many learning opportunities, but it is probably not something you want to emulate. In case you do, though, the GitHub repo is https://github.com/mgbec/multi-agent-eval-guardrail-optimize. I suspect we will see changes to these capabilities soon. Let me know your thoughts and if you have had a different experience with your observability projects. Thanks for reading!
