Multi-agent AI systems fail silently.
A 200 OK response doesn’t mean the AI made good decisions.
That was the biggest thing I realized while building a multi-agent system.
My architecture looked like this:
User Query → Multi-Agent Call → Final Response
Everything looked normal from an infrastructure perspective.
- APIs were healthy
- Latency looked fine
- Users were getting responses
But I still couldn’t answer important questions:
- Did the agent route the query to the right specialist?
- Did the agent hallucinate information?
- Did it ignore specialist outputs?
- Did it attribute responses incorrectly?
Traditional monitoring couldn’t help because the system technically wasn’t failing.
The failures were happening at the decision layer.
Full Trace Visibility
I used Langfuse to trace every agent execution.
That includes:
- Tool calls
- Input/output payloads
- Token usage
- Latency per step
If an agent touched something, I wanted visibility into it.
No black boxes.
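As a minimal sketch, this is roughly the per-step data I capture for each agent execution. In practice it maps onto Langfuse traces and spans; the type and function names below are illustrative, not the Langfuse SDK's own API.

```typescript
// Hypothetical shape of the per-step data recorded for each agent execution.
// In a real setup this would be sent to Langfuse as spans/generations.

interface AgentStep {
  agent: string;     // which agent or specialist ran
  tool?: string;     // tool call, if any
  input: unknown;    // input payload
  output: unknown;   // output payload
  tokens: number;    // token usage for this step
  latencyMs: number; // latency per step
}

interface AgentTrace {
  traceId: string;
  steps: AgentStep[];
}

// Wrap a step so its payloads, tokens, and timing are always captured.
function recordStep<T>(
  trace: AgentTrace,
  agent: string,
  input: unknown,
  run: () => { output: T; tokens: number },
  tool?: string,
): T {
  const start = Date.now();
  const { output, tokens } = run();
  trace.steps.push({
    agent,
    tool,
    input,
    output,
    tokens,
    latencyMs: Date.now() - start,
  });
  return output;
}
```

Wrapping every agent and tool call this way means nothing runs without leaving a record behind.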
Deterministic Checks
Some validations didn’t need another LLM.
I added rule-based checks for things like:
- Did the agent call tools from the correct domain?
- Did the agent call tools it wasn’t supposed to?
- Was the expected workflow followed?
These checks are binary:
- Pass → 1
- Fail → 0
Fast and cheap.
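A sketch of what these rule-based checks can look like. The domain and tool names are made up for illustration; the real allowlists and workflows would come from your own system.

```typescript
// Rule-based checks: no LLM needed. Each returns 1 (pass) or 0 (fail).
// Domains and tool names here are hypothetical examples.

const ALLOWED_TOOLS: Record<string, string[]> = {
  billing: ["getInvoice", "getBalance"],
  support: ["searchDocs", "createTicket"],
};

// Did the agent only call tools from its own domain?
function checkToolDomain(domain: string, calledTools: string[]): 0 | 1 {
  const allowed = new Set(ALLOWED_TOOLS[domain] ?? []);
  return calledTools.every((t) => allowed.has(t)) ? 1 : 0;
}

// Was the expected workflow followed (required steps present, in order)?
function checkWorkflow(expected: string[], actual: string[]): 0 | 1 {
  let i = 0;
  for (const step of actual) {
    if (step === expected[i]) i++;
  }
  return i === expected.length ? 1 : 0;
}
```

Because the output is just 0 or 1, these scores are trivial to aggregate and alert on.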
Faithfulness Checks
This was mainly for hallucination detection.
I compare the final response with outputs from specialist agents.
If the final layer introduces claims that don't exist in the source outputs, it gets flagged.
This helped catch cases where the system sounded confident but wasn’t grounded.
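As one crude sketch of the idea: flag sentences in the final response whose content words mostly don't appear anywhere in the specialist outputs. Real faithfulness checks are more sophisticated; the word-overlap heuristic and the 0.5 threshold below are assumptions, not the article's actual method.

```typescript
// Crude grounding check: flag final-response sentences whose content
// words are mostly absent from the specialist outputs.

function contentWords(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]{4,}/g) ?? [];
}

function ungroundedSentences(
  finalResponse: string,
  specialistOutputs: string[],
): string[] {
  const source = new Set(specialistOutputs.flatMap(contentWords));
  return finalResponse
    .split(/(?<=[.!?])\s+/)
    .filter((sentence) => {
      const words = contentWords(sentence);
      if (words.length === 0) return false;
      const grounded = words.filter((w) => source.has(w)).length;
      // Mostly novel content → likely not grounded → flag it.
      return grounded / words.length < 0.5;
    });
}
```

Anything this returns is a candidate hallucination worth a closer look.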
LLM Judges
For things deterministic checks can't measure, I use Azure OpenAI models as judges.
They evaluate:
- Routing correctness
- Response quality
- Attribution accuracy
- Conflict handling
This runs for every multi-agent response.
Expensive? Yes.
Useful? Definitely.
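A sketch of the judge step. The `complete` callback stands in for a call to an Azure OpenAI chat deployment; the rubric, the JSON schema, and the function names are my assumptions for illustration, not the exact prompts used here.

```typescript
// LLM-as-judge sketch. `complete` abstracts over the actual Azure OpenAI
// call so the evaluation logic stays testable; the verdict schema below
// is a hypothetical example.

interface JudgeVerdict {
  routingCorrect: boolean;
  responseQuality: number; // 1-5
  attributionAccurate: boolean;
}

async function judgeResponse(
  query: string,
  specialistOutputs: string[],
  finalResponse: string,
  complete: (prompt: string) => Promise<string>,
): Promise<JudgeVerdict> {
  const prompt = [
    "You are evaluating a multi-agent response. Reply with JSON only:",
    '{"routingCorrect": boolean, "responseQuality": 1-5, "attributionAccurate": boolean}',
    `Query: ${query}`,
    `Specialist outputs:\n${specialistOutputs.join("\n---\n")}`,
    `Final response: ${finalResponse}`,
  ].join("\n");
  return JSON.parse(await complete(prompt)) as JudgeVerdict;
}
```

Keeping the model call behind a callback also makes it cheap to swap judges or mock them in tests.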
100% Traffic Monitoring
I didn’t want sampling.
Every production request goes through the evaluation pipeline.
Because edge cases are usually the exact things sampling misses.
Cost + Latency Tracking
Multi-agent systems get expensive very fast.
I track:
- Tokens per agent
- Latency per step
- Expensive execution paths
This made optimization much easier.
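Aggregating the per-step trace data makes the expensive paths obvious. A minimal sketch, reusing illustrative field names rather than any real schema:

```typescript
// Roll up tokens and latency per agent from per-step trace data
// to surface expensive execution paths.

interface StepMetrics {
  agent: string;
  tokens: number;
  latencyMs: number;
}

function costByAgent(
  steps: StepMetrics[],
): Map<string, { tokens: number; latencyMs: number }> {
  const totals = new Map<string, { tokens: number; latencyMs: number }>();
  for (const s of steps) {
    const t = totals.get(s.agent) ?? { tokens: 0, latencyMs: 0 };
    t.tokens += s.tokens;
    t.latencyMs += s.latencyMs;
    totals.set(s.agent, t);
  }
  return totals;
}
```

Sorting the result by tokens or latency immediately shows which agent to optimize first.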
What This Actually Caught
This surfaced issues normal monitoring completely missed.
Wrong attribution
Correct insights were assigned to the wrong specialist.
Ignored outputs
Sometimes the orchestrating agent completely ignored specialist responses.
Routing mistakes
The routing step occasionally sent queries to the wrong specialist.
None of these showed up in normal monitoring dashboards.
Everything looked healthy.
Stack
Observability: Langfuse
LLM Evaluation: Azure OpenAI
Deterministic Checks: TypeScript
Final Takeaway
For multi-agent systems, uptime monitoring is not enough.
You also need decision monitoring.
Because a successful response can still be completely wrong.