Multi-agent AI systems fail silently.
A 200 OK response doesn’t mean the AI made good decisions.
That was the biggest thing I realized while building a multi-agent system.
My architecture looked like this:
User Query → Multi-Agent Call → Final Response
Everything looked normal from an infrastructure perspective.
- APIs were healthy
- Latency looked fine
- Users were getting responses
But I still couldn’t answer important questions:
- Did the agent route the query to the right specialist?
- Did the agent hallucinate information?
- Did it ignore specialist outputs?
- Did it attribute responses incorrectly?
Traditional monitoring couldn’t help because the system technically wasn’t failing.
The failures were happening at the decision layer.
Full Trace Visibility
I used Langfuse to trace every agent execution.
That includes:
- Tool calls
- Input/output payloads
- Token usage
- Latency per step
If an agent touched something, I wanted visibility into it.
No black boxes.
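As a minimal sketch, this is roughly the per-step data I capture for each agent execution. In practice it maps onto Langfuse traces and spans; the type and function names below are illustrative, not the Langfuse SDK's own API.

```typescript
// Hypothetical shape of the per-step data recorded for each agent execution.
// In a real setup this would be sent to Langfuse as spans/generations.

interface AgentStep {
  agent: string;     // which agent or specialist ran
  tool?: string;     // tool call, if any
  input: unknown;    // input payload
  output: unknown;   // output payload
  tokens: number;    // token usage for this step
  latencyMs: number; // latency per step
}

interface AgentTrace {
  traceId: string;
  steps: AgentStep[];
}

// Wrap a step so its payloads, tokens, and timing are always captured.
function recordStep<T>(
  trace: AgentTrace,
  agent: string,
  input: unknown,
  run: () => { output: T; tokens: number },
  tool?: string,
): T {
  const start = Date.now();
  const { output, tokens } = run();
  trace.steps.push({
    agent,
    tool,
    input,
    output,
    tokens,
    latencyMs: Date.now() - start,
  });
  return output;
}
```

Wrapping every agent and tool call this way means nothing runs without leaving a record behind.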
Deterministic Checks
Some validations didn’t need another LLM.
I added rule-based checks for things like:
- Did the agent call tools from the correct domain?
- Did the agent call tools it wasn’t supposed to?
- Was the expected workflow followed?
These checks are binary:
- Pass → 1
- Fail → 0
Fast and cheap.
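A sketch of what these rule-based checks can look like. The domain and tool names are made up for illustration; the real allowlists and workflows would come from your own system.

```typescript
// Rule-based checks: no LLM needed. Each returns 1 (pass) or 0 (fail).
// Domains and tool names here are hypothetical examples.

const ALLOWED_TOOLS: Record<string, string[]> = {
  billing: ["getInvoice", "getBalance"],
  support: ["searchDocs", "createTicket"],
};

// Did the agent only call tools from its own domain?
function checkToolDomain(domain: string, calledTools: string[]): 0 | 1 {
  const allowed = new Set(ALLOWED_TOOLS[domain] ?? []);
  return calledTools.every((t) => allowed.has(t)) ? 1 : 0;
}

// Was the expected workflow followed (required steps present, in order)?
function checkWorkflow(expected: string[], actual: string[]): 0 | 1 {
  let i = 0;
  for (const step of actual) {
    if (step === expected[i]) i++;
  }
  return i === expected.length ? 1 : 0;
}
```

Because the output is just 0 or 1, these scores are trivial to aggregate and alert on.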
Faithfulness Checks
This was mainly for hallucination detection.
I compare the final response with outputs from specialist agents.
If the final layer introduces claims that don't exist in the source outputs, it gets flagged.
This helped catch cases where the system sounded confident but wasn’t grounded.
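As one crude sketch of the idea: flag sentences in the final response whose content words mostly don't appear anywhere in the specialist outputs. Real faithfulness checks are more sophisticated; the word-overlap heuristic and the 0.5 threshold below are assumptions, not the article's actual method.

```typescript
// Crude grounding check: flag final-response sentences whose content
// words are mostly absent from the specialist outputs.

function contentWords(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]{4,}/g) ?? [];
}

function ungroundedSentences(
  finalResponse: string,
  specialistOutputs: string[],
): string[] {
  const source = new Set(specialistOutputs.flatMap(contentWords));
  return finalResponse
    .split(/(?<=[.!?])\s+/)
    .filter((sentence) => {
      const words = contentWords(sentence);
      if (words.length === 0) return false;
      const grounded = words.filter((w) => source.has(w)).length;
      // Mostly novel content → likely not grounded → flag it.
      return grounded / words.length < 0.5;
    });
}
```

Anything this returns is a candidate hallucination worth a closer look.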
LLM Judges
For things deterministic checks can't measure, I use Azure OpenAI models as judges.
They evaluate:
- Routing correctness
- Response quality
- Attribution accuracy
- Conflict handling
This runs for every multi-agent response.
Expensive? Yes.
Useful? Definitely.
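A sketch of the judge step. The `complete` callback stands in for a call to an Azure OpenAI chat deployment; the rubric, the JSON schema, and the function names are my assumptions for illustration, not the exact prompts used here.

```typescript
// LLM-as-judge sketch. `complete` abstracts over the actual Azure OpenAI
// call so the evaluation logic stays testable; the verdict schema below
// is a hypothetical example.

interface JudgeVerdict {
  routingCorrect: boolean;
  responseQuality: number; // 1-5
  attributionAccurate: boolean;
}

async function judgeResponse(
  query: string,
  specialistOutputs: string[],
  finalResponse: string,
  complete: (prompt: string) => Promise<string>,
): Promise<JudgeVerdict> {
  const prompt = [
    "You are evaluating a multi-agent response. Reply with JSON only:",
    '{"routingCorrect": boolean, "responseQuality": 1-5, "attributionAccurate": boolean}',
    `Query: ${query}`,
    `Specialist outputs:\n${specialistOutputs.join("\n---\n")}`,
    `Final response: ${finalResponse}`,
  ].join("\n");
  return JSON.parse(await complete(prompt)) as JudgeVerdict;
}
```

Keeping the model call behind a callback also makes it cheap to swap judges or mock them in tests.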
100% Traffic Monitoring
I didn’t want sampling.
Every production request goes through the evaluation pipeline.
Because edge cases are usually the exact things sampling misses.
Cost + Latency Tracking
Multi-agent systems get expensive very fast.
I track:
- Tokens per agent
- Latency per step
- Expensive execution paths
This made optimization much easier.
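Aggregating the per-step trace data makes the expensive paths obvious. A minimal sketch, reusing illustrative field names rather than any real schema:

```typescript
// Roll up tokens and latency per agent from per-step trace data
// to surface expensive execution paths.

interface StepMetrics {
  agent: string;
  tokens: number;
  latencyMs: number;
}

function costByAgent(
  steps: StepMetrics[],
): Map<string, { tokens: number; latencyMs: number }> {
  const totals = new Map<string, { tokens: number; latencyMs: number }>();
  for (const s of steps) {
    const t = totals.get(s.agent) ?? { tokens: 0, latencyMs: 0 };
    t.tokens += s.tokens;
    t.latencyMs += s.latencyMs;
    totals.set(s.agent, t);
  }
  return totals;
}
```

Sorting the result by tokens or latency immediately shows which agent to optimize first.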
What This Actually Caught
This surfaced issues normal monitoring completely missed.
Wrong attribution
Correct insights were assigned to the wrong specialist.
Ignored outputs
Sometimes the orchestrating agent completely ignored specialist responses.
Routing mistakes
The routing step occasionally sent queries to the wrong specialist.
None of these showed up in normal monitoring dashboards.
Everything looked healthy.
Stack
Observability: Langfuse
LLM Evaluation: Azure OpenAI
Deterministic Checks: TypeScript
Final Takeaway
For multi-agent systems, uptime monitoring is not enough.
You also need decision monitoring.
Because a successful response can still be completely wrong.