Haripriya Veluchamy
How I Built Production AI Agent Monitoring with Langfuse

Multi-agent AI systems fail silently.

A 200 OK response doesn’t mean the AI made good decisions.

That was the biggest thing I realized while building a multi-agent system.

My architecture looked like this:

User Query → Multi Agent Call → Final Response

Everything looked normal from an infrastructure perspective.

  • APIs were healthy
  • Latency looked fine
  • Users were getting responses

But I still couldn’t answer important questions:

  • Did the agent route the query to the right specialist?
  • Did the agent hallucinate information?
  • Did it ignore specialist outputs?
  • Did it attribute responses incorrectly?

Traditional monitoring couldn’t help because the system technically wasn’t failing.

The failures were happening at the decision layer.


Full Trace Visibility

I used Langfuse to trace every agent execution.

That includes:

  • Tool calls
  • Input/output payloads
  • Token usage
  • Latency per step

If an agent touched something, I wanted visibility into it.

No black boxes.
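With the Langfuse TypeScript SDK, instrumenting one request looks roughly like the sketch below. The helper names (`routeToSpecialist`, `callSpecialist`) and the exact `usage` field shape are assumptions for illustration, not part of the SDK; check the SDK docs for your version.

```typescript
import { Langfuse } from "langfuse";

const langfuse = new Langfuse(); // reads LANGFUSE_SECRET_KEY etc. from the environment

async function handleQuery(query: string) {
  const trace = langfuse.trace({ name: "multi-agent-query", input: query });

  // One span per step → per-step latency plus input/output payloads.
  const span = trace.span({ name: "router", input: { query } });
  const specialist = await routeToSpecialist(query); // hypothetical routing fn
  span.end({ output: { specialist } });

  // Generations additionally record the model and token usage.
  const gen = trace.generation({ name: specialist, model: "gpt-4o", input: query });
  const result = await callSpecialist(specialist, query); // hypothetical specialist call
  gen.end({
    output: result.text,
    usage: { promptTokens: result.promptTokens, completionTokens: result.completionTokens },
  });

  trace.update({ output: result.text });
  await langfuse.flushAsync(); // don't lose buffered events on early exits
  return result.text;
}
```

Every step lands as a nested span in the Langfuse UI, so a bad answer can be walked backwards to the exact tool call or generation that produced it.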


Deterministic Checks

Some validations didn’t need another LLM.

I added rule-based checks for things like:

  • Did the agent call tools from the correct domain?
  • Did the agent call tools it wasn’t supposed to?
  • Was the expected workflow followed?

These checks are binary:

  • Pass → 1
  • Fail → 0

Fast and cheap.
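A minimal sketch of what these binary checks look like. The `ToolCall` shape and names like `checkToolDomains` are my illustration, not a library API:

```typescript
// Illustrative shape for a recorded tool call pulled from a trace.
interface ToolCall {
  name: string;
  domain: string;
}

// Rule: every tool call must come from the agent's allowed domain.
function checkToolDomains(calls: ToolCall[], allowedDomain: string): number {
  return calls.every((c) => c.domain === allowedDomain) ? 1 : 0;
}

// Rule: the expected workflow steps must appear, in order,
// somewhere in the sequence of tool calls.
function checkWorkflowOrder(calls: ToolCall[], expected: string[]): number {
  let i = 0;
  for (const c of calls) {
    if (c.name === expected[i]) i++;
  }
  return i === expected.length ? 1 : 0;
}
```

Each check returns 1 or 0, so the scores can be logged straight onto the trace and aggregated without any extra model calls.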


Faithfulness Checks

This was mainly for hallucination detection.

I compare the final response with outputs from specialist agents.

If the final layer introduces claims that don't exist in the source outputs, it gets flagged.

This helped catch cases where the system sounded confident but wasn’t grounded.
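A very naive version of this comparison can be sketched as a lexical grounding check: flag any final-response sentence whose content words barely overlap with the specialist outputs. This is my simplified illustration (real faithfulness checks usually use an LLM or embeddings), and the threshold is an arbitrary assumption:

```typescript
// Crude proxy for "content words": lowercase tokens of 4+ characters.
function contentWords(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]{4,}/g) ?? []);
}

// Flag final-response sentences that are mostly novel relative to
// everything the specialist agents actually produced.
function ungroundedSentences(
  finalResponse: string,
  specialistOutputs: string[],
  threshold = 0.5
): string[] {
  const source = contentWords(specialistOutputs.join(" "));
  return finalResponse
    .split(/(?<=[.!?])\s+/)
    .filter((sentence) => {
      const words = [...contentWords(sentence)];
      if (words.length === 0) return false;
      const grounded = words.filter((w) => source.has(w)).length / words.length;
      return grounded < threshold; // mostly unsupported claims → flag
    });
}
```

Even this crude version catches the worst failure mode: a confident-sounding sentence that none of the specialists ever said.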


LLM Judges

For things deterministic checks can't measure, I use Azure OpenAI models as LLM judges.

They evaluate:

  • Routing correctness
  • Response quality
  • Attribution accuracy
  • Conflict handling

This runs for every multi-agent response.

Expensive? Yes.
Useful? Definitely.
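The judge step boils down to a prompt builder and a verdict parser around a chat completion call. The names below (`buildJudgePrompt`, `parseVerdict`, the four score fields) are my own illustration; the actual model call would go through the Azure OpenAI chat completions API:

```typescript
interface Verdict {
  routing: number;
  quality: number;
  attribution: number;
  conflicts: number;
}

function buildJudgePrompt(
  query: string,
  specialistOutputs: string[],
  finalResponse: string
): string {
  return [
    "You are evaluating a multi-agent response. Score each dimension from 0 to 1.",
    `User query: ${query}`,
    `Specialist outputs:\n${specialistOutputs.join("\n---\n")}`,
    `Final response: ${finalResponse}`,
    'Reply with JSON only: {"routing":_, "quality":_, "attribution":_, "conflicts":_}',
  ].join("\n\n");
}

// Models sometimes wrap JSON in prose, so extract the first {...} block.
function parseVerdict(raw: string): Verdict | null {
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    const v = JSON.parse(match[0]);
    return {
      routing: Number(v.routing),
      quality: Number(v.quality),
      attribution: Number(v.attribution),
      conflicts: Number(v.conflicts),
    };
  } catch {
    return null;
  }
}
```

The parsed verdict gets attached to the trace as scores, so bad judge results are queryable next to the spans that caused them.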


100% Traffic Monitoring

I didn’t want sampling.

Every production request goes through the evaluation pipeline.

Because edge cases are usually the exact things sampling misses.


Cost + Latency Tracking

Multi-agent systems get expensive very fast.

I track:

  • Tokens per agent
  • Latency per step
  • Expensive execution paths

This made optimization much easier.
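The aggregation itself is trivial once every step is traced. The `Span` shape here is illustrative; in practice these numbers come out of the Langfuse trace data:

```typescript
// Illustrative per-step record pulled from trace data.
interface Span {
  agent: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
}

// Roll spans up into per-agent token and latency totals,
// which makes the expensive execution paths obvious.
function perAgentTotals(spans: Span[]): Map<string, { tokens: number; latencyMs: number }> {
  const totals = new Map<string, { tokens: number; latencyMs: number }>();
  for (const s of spans) {
    const t = totals.get(s.agent) ?? { tokens: 0, latencyMs: 0 };
    t.tokens += s.promptTokens + s.completionTokens;
    t.latencyMs += s.latencyMs;
    totals.set(s.agent, t);
  }
  return totals;
}
```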


What This Actually Caught

This surfaced issues normal monitoring completely missed.

Wrong attribution

Correct insights were assigned to the wrong specialist.

Ignored outputs

Sometimes the orchestrating agent completely ignored specialist responses.

Routing mistakes

The router occasionally sent queries to the wrong specialist.


None of these showed up in normal monitoring dashboards.

Everything looked healthy.


Stack

Observability: Langfuse
LLM Evaluation: Azure OpenAI
Deterministic Checks: TypeScript


Final Takeaway

For multi-agent systems, uptime monitoring is not enough.

You also need decision monitoring.

Because a successful response can still be completely wrong.
