Daniel Malek

Posted on Jun 16 • Edited on Jun 18

LLM Agent Observability with Langfuse

#ai #agents #observability #langfuse

Repository link: https://github.com/Dan1618/langfuse-observability

Introduction

In this post I will show how to install Langfuse locally to add observability to an agent application. This lets us inspect the agent's internals, such as the prompt used, the agent's answer, and token usage.
I will use the existing application from this article: https://dev.to/dan1618/building-ai-agent-with-langgraph-and-nestjs-4d86

Why is agent application observability important?

Observability means having deep visibility into the "how" and "why" of your software's behavior, a requirement that is rapidly evolving with the rise of AI. Traditional Application Performance Monitoring (APM) tools were built for deterministic systems: a user clicks a button, a predefined API is called, the database is queried, and a response is returned. If you point a traditional APM tool at an AI agent, you will only see that an HTTP request was made to an LLM provider and that it took 4 seconds. It won't tell you what the model was thinking. Working with LLMs requires us to measure different things: Cost and Token Management, Reasoning Loops (including agents that spawn other agents and other agent-specific actions), and Response Quality (Evals).

Why Langfuse?

Langfuse has emerged as a go-to open-source AI engineering platform because it was built specifically for the probabilistic nature of LLMs, rather than trying to shoehorn AI metrics into legacy APM dashboards.

Purpose-Built Tracing: It natively understands LLM concepts. It traces the full execution tree—sessions, individual exchanges, LLM calls, and external tool usage—so you can visually inspect an agent's exact decision path.
OpenTelemetry (OTel) Native: Modern agent frameworks (like AutoGen, Pydantic AI, or Amazon Bedrock AgentCore) emit traces via OpenTelemetry. Langfuse seamlessly ingests these standard OTel traces and maps them to its generative AI data model.
Prompt Management: Langfuse allows you to decouple your prompts from your codebase. You can test, version, and roll back prompts directly in the platform without needing to redeploy your application.
Built-in Evaluations: It allows you to run "evals" on production data continuously. You can track quality metrics (faithfulness, accuracy) alongside latency and cost, triggering alerts if the agent's performance degrades.

Langfuse features

Here is a visual overview on YouTube: https://www.youtube.com/watch?v=2E8iTvGo9Hs

The core features and metrics you can track and interact with inside the Langfuse dashboard are organized across several key areas:

1. Core Observability & In-Depth Tracing

Tracing is the foundation of the platform. The dashboard allows you to visually map out exactly what happens under the hood during an LLM execution:

Hierarchical Trace Views: See the entire execution journey of a request. It breaks down multi-step agentic workflows into a timeline or a visual graph, showing every tool invocation, retrieval step (RAG), and LLM call.
Session Tracking: Group individual traces into "Sessions" to track and debug multi-turn user conversations over time.
Deep Filtering: Quickly slice your execution data by user ID, specific models, releases/versions, or custom metadata tags.
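
To make session grouping and filtering more concrete, here is a minimal sketch in plain TypeScript. The `TraceRecord` shape and its field names are illustrative assumptions, not the actual Langfuse data model; the point is only to show what "group by session" and "slice by user and tag" mean on a list of traces.

```typescript
// Hypothetical, simplified trace shape used only for illustration.
interface TraceRecord {
  id: string;
  sessionId?: string;
  userId?: string;
  tags: string[];
}

// Group traces by session so a multi-turn conversation can be read together.
function groupBySession(traces: TraceRecord[]): Map<string, TraceRecord[]> {
  const sessions = new Map<string, TraceRecord[]>();
  for (const trace of traces) {
    const key = trace.sessionId ?? "(no session)";
    const bucket = sessions.get(key) ?? [];
    bucket.push(trace);
    sessions.set(key, bucket);
  }
  return sessions;
}

// Keep only traces that belong to a given user and carry a given tag.
function filterTraces(traces: TraceRecord[], userId: string, tag: string): TraceRecord[] {
  return traces.filter((t) => t.userId === userId && t.tags.indexOf(tag) !== -1);
}

// Made-up sample data.
const sampleTraces: TraceRecord[] = [
  { id: "t1", sessionId: "s1", userId: "u1", tags: ["langgraph"] },
  { id: "t2", sessionId: "s1", userId: "u1", tags: ["langgraph", "portfolio-review"] },
  { id: "t3", sessionId: "s2", userId: "u2", tags: ["langgraph"] },
];
```

In the dashboard the same operations are done through the UI filters; the sketch is just a mental model of what they do.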

2. Live Performance & Health Metrics

The home dashboard natively aggregates technical metrics to monitor system stability:

Latency Monitoring: Inspect overall response times and Time-to-First-Token (TTFT) distributions to find bottlenecks or slow steps in your agents.
Streaming Speed: Tracks tokens per second to optimize real-time user experiences.
Error Rates: Automatically flags failed LLM calls, timeout errors, or API hiccups.
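
As a rough illustration of how the two latency metrics above are defined, here is a small TypeScript sketch. The `StreamTiming` type and the sample numbers are made up for this example; Langfuse computes these values for you from the recorded spans.

```typescript
// Hypothetical timing data for one streamed LLM call (all values in milliseconds).
interface StreamTiming {
  requestStartMs: number;
  firstTokenMs: number;
  lastTokenMs: number;
  outputTokens: number;
}

// Time-to-first-token: how long the user waits before anything appears.
function timeToFirstTokenMs(t: StreamTiming): number {
  return t.firstTokenMs - t.requestStartMs;
}

// Streaming speed: output tokens divided by the time spent generating them.
function tokensPerSecond(t: StreamTiming): number {
  const generationMs = t.lastTokenMs - t.firstTokenMs;
  if (generationMs <= 0) return 0; // avoid dividing by zero for instant or single-token replies
  return t.outputTokens / (generationMs / 1000);
}

// Made-up sample: first token after 400 ms, 100 tokens streamed over the next 2 seconds.
const exampleTiming: StreamTiming = {
  requestStartMs: 0,
  firstTokenMs: 400,
  lastTokenMs: 2400,
  outputTokens: 100,
};
```

For the sample above, the time to first token is 400 ms and the streaming speed is 50 tokens per second.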

3. Cost & Token Tracking

LLM costs can get out of hand quickly. The dashboard includes a robust cost-management suite:

Token Breakdowns: Track exact token consumption (input vs. output tokens) aggregated across different model types (e.g., GPT-4, Claude, Llama).
Cost Attribution: Compute spending trends and look at cost breakdowns by model provider or down to individual user IDs so you know who or what is driving your API bills.
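
The arithmetic behind cost attribution is simple, and a short sketch may help. The model names and prices below are placeholders, not real price lists, and the `Usage` type is an assumption made for this example.

```typescript
// Prices in USD per one million tokens. Placeholder values, not real pricing.
interface ModelPrice {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Hypothetical usage record for a single LLM call.
interface Usage {
  model: string;
  userId: string;
  inputTokens: number;
  outputTokens: number;
}

const examplePrices: Record<string, ModelPrice> = {
  "model-a": { inputPerMillion: 2, outputPerMillion: 8 },
  "model-b": { inputPerMillion: 1, outputPerMillion: 4 },
};

// Cost of one call: input and output tokens are priced separately.
function callCostUsd(u: Usage, prices: Record<string, ModelPrice>): number {
  const price = prices[u.model];
  if (!price) throw new Error(`No price configured for model ${u.model}`);
  return (u.inputTokens * price.inputPerMillion + u.outputTokens * price.outputPerMillion) / 1e6;
}

// Sum the cost of all calls per user to see who drives the bill.
function costByUser(usages: Usage[], prices: Record<string, ModelPrice>): Map<string, number> {
  const totals = new Map<string, number>();
  for (const u of usages) {
    totals.set(u.userId, (totals.get(u.userId) ?? 0) + callCostUsd(u, prices));
  }
  return totals;
}
```

Grouping by `model` instead of `userId` gives the per-model breakdown in the same way.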

4. Evaluation & Quality Analytics

The dashboard helps you understand if your app is actually outputting good, safe answers:

Multi-Method Score Tracking: Visualizes quality scores compiled from user feedback (thumbs up/down), manual human annotations, or automated "LLM-as-a-judge" evaluators (checking for hallucinations, correctness, toxicity, etc.).
Annotation Queues: Provides a space for teams to systematically review production traces, apply labels, and build "golden datasets" for future testing.
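
To illustrate what tracking a quality score means in practice, here is a minimal sketch that averages scores by name and lists traces that fall below a threshold. The `FeedbackScore` type and the score names are assumptions for this example, not the Langfuse schema.

```typescript
// Hypothetical score record, e.g. 1 for thumbs up and 0 for thumbs down,
// or a 0..1 value produced by an LLM-as-a-judge evaluator.
interface FeedbackScore {
  traceId: string;
  name: string;
  value: number;
}

// Average value of all scores with the given name, or null if there are none.
function averageScore(scores: FeedbackScore[], name: string): number | null {
  const matching = scores.filter((s) => s.name === name);
  if (matching.length === 0) return null;
  const total = matching.reduce((sum, s) => sum + s.value, 0);
  return total / matching.length;
}

// Trace IDs whose score with the given name is below the threshold.
// These are natural candidates for an annotation queue.
function tracesBelow(scores: FeedbackScore[], name: string, threshold: number): string[] {
  return scores.filter((s) => s.name === name && s.value < threshold).map((s) => s.traceId);
}

// Made-up sample data.
const sampleScores: FeedbackScore[] = [
  { traceId: "t1", name: "user-feedback", value: 1 },
  { traceId: "t2", name: "user-feedback", value: 1 },
  { traceId: "t3", name: "user-feedback", value: 0 },
  { traceId: "t4", name: "user-feedback", value: 1 },
  { traceId: "t3", name: "correctness", value: 0.5 },
];
```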

5. Custom Dashboards & Advanced Reporting

Langfuse includes a fully customizable dashboard builder so you don’t get overwhelmed by irrelevant charts:

Widget Builder: Create, move, and resize different charts (like histograms or line graphs) to build tailored views.
Pivot Tables: Built-in pivot table widgets allow for multi-dimensional data analysis—for instance, evaluating how quality scores stack up against model costs across different application features.
Curated Layouts: Pre-built, single-click templates focused heavily on Cost, Latency, Prompts, or Evals to get you started immediately.

6. Prompt Management Content System

The dashboard doubles as a low-code prompt registry:

Version Control: Create, test, and track variations of prompts over time without updating your core codebase.
Playground Integration: Open any real production input trace directly inside an interactive LLM Playground to test prompt tweaks and compare model behaviors side by side.

💡 Data Portability Note: If you prefer visualizing this data outside of Langfuse, the platform includes a Query API endpoint. This allows you to fetch any aggregated trace metric directly from the dashboard database and pipe it straight into your own internal business intelligence tools (like Tableau or Looker).
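
As a rough sketch of what such a request could look like from TypeScript: the code below builds the URL and the auth header and sends the request with `fetch`. The endpoint path `/api/public/metrics`, the shape of `MetricsQuery`, and the example base URL are my assumptions and should be checked against the Langfuse API reference for your server version. The part I am confident about is that the public API uses HTTP Basic auth, with the public key as the username and the secret key as the password.

```typescript
// Assumed query shape; verify the field names against the Langfuse API reference.
interface MetricsQuery {
  view: string;
  metrics: { measure: string; aggregation: string }[];
  dimensions: { field: string }[];
  filters: unknown[];
  fromTimestamp: string;
  toTimestamp: string;
}

// Build the request URL. The path is an assumption, the query is sent as URL-encoded JSON.
function buildMetricsUrl(baseUrl: string, query: MetricsQuery): string {
  const url = new URL("/api/public/metrics", baseUrl);
  url.searchParams.set("query", JSON.stringify(query));
  return url.toString();
}

// HTTP Basic auth: public key as username, secret key as password.
function basicAuthHeader(publicKey: string, secretKey: string): string {
  return "Basic " + btoa(`${publicKey}:${secretKey}`);
}

// Send the request and return the parsed JSON body.
async function fetchMetrics(
  baseUrl: string,
  publicKey: string,
  secretKey: string,
  query: MetricsQuery,
): Promise<unknown> {
  const response = await fetch(buildMetricsUrl(baseUrl, query), {
    headers: { Authorization: basicAuthHeader(publicKey, secretKey) },
  });
  if (!response.ok) {
    throw new Error(`Metrics request failed with status ${response.status}`);
  }
  return response.json();
}

// Example query: count traces per trace name over a fixed time window.
const exampleQuery: MetricsQuery = {
  view: "traces",
  metrics: [{ measure: "count", aggregation: "count" }],
  dimensions: [{ field: "name" }],
  filters: [],
  fromTimestamp: "2025-06-01T00:00:00Z",
  toTimestamp: "2025-06-18T00:00:00Z",
};
```

A call would then look like `await fetchMetrics(process.env.LANGFUSE_BASE_URL!, publicKey, secretKey, exampleQuery)`, with the result forwarded to whatever BI tool you use.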

Installation

The installation example is based on an article and repository I created earlier: an agent built with NestJS and LangGraph.

https://dev.to/dan1618/building-ai-agent-with-langgraph-and-nestjs-4d86
https://github.com/Dan1618/nest-langgraph

Langfuse’s solution to LLM monitoring and observability consists of two parts:

Langfuse SDKs
Langfuse Server

The Langfuse SDKs are the coding side of Langfuse. They are available for various platforms and let you instrument your application's code. In practice this means adding a few lines of code in the right places in your codebase.

The Langfuse server, on the other hand, is the UI-based dashboard together with the underlying services used to log, view, and persist all the traces and metrics. The dashboard is accessible through any modern web browser.

1. Set up the Langfuse server with Docker
I suggest installing this part by following the docs, since they will be the most up to date.
https://langfuse.com/self-hosting/deployment/docker-compose
After a successful installation you can create a new project and generate the LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY values, which need to go into the .env file so that Langfuse recognizes your app.

2. Installing SDK
https://langfuse.com/docs/observability/sdk/overview
In the example NestJS app I provided:

a) update your .env file

LANGFUSE_SECRET_KEY=""
LANGFUSE_PUBLIC_KEY=""
LANGFUSE_BASE_URL="http://localhost:port"
OPENAI_API_KEY=""
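
Because empty or missing keys tend to fail silently (the app runs, but no traces show up), it can help to validate these variables at startup. Below is a minimal sketch; `readLangfuseEnv` is a hypothetical helper of mine, not part of the Langfuse SDK.

```typescript
// The Langfuse-related variables from the .env file above.
const REQUIRED_KEYS = ["LANGFUSE_SECRET_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_BASE_URL"] as const;

type LangfuseEnv = Record<(typeof REQUIRED_KEYS)[number], string>;

// Hypothetical helper: returns the trimmed values or throws, listing every missing key.
function readLangfuseEnv(env: Record<string, string | undefined>): LangfuseEnv {
  const values: Partial<LangfuseEnv> = {};
  const missing: string[] = [];
  for (const key of REQUIRED_KEYS) {
    const value = (env[key] ?? "").trim();
    if (value) {
      values[key] = value;
    } else {
      missing.push(key);
    }
  }
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return values as LangfuseEnv;
}

// Usage at startup, after dotenv has loaded the file:
// const langfuseEnv = readLangfuseEnv(process.env);
```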

b) create a file that starts OpenTelemetry, and load it at the very top of your application's startup code

import * as dotenv from 'dotenv';
dotenv.config(); // ← must run before any Langfuse / OTel import reads env vars

import { NodeSDK } from '@opentelemetry/sdk-node';
import { LangfuseSpanProcessor } from '@langfuse/otel';

const sdk = new NodeSDK({
  spanProcessors: [new LangfuseSpanProcessor()],
});

sdk.start();
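
One thing worth considering here: span processors usually send data in batches, so spans that are still buffered when the process exits can be lost. `NodeSDK` exposes a `shutdown()` method that returns a promise, and a small wrapper can make sure it runs only once. The sketch below is my own suggestion and is not taken from the example repository; `makeShutdownHandler` is a hypothetical helper name.

```typescript
// Anything with a promise-returning shutdown(), such as the NodeSDK instance.
interface Shutdownable {
  shutdown(): Promise<void>;
}

// Returns a handler that shuts the SDK down at most once, even if several signals arrive.
function makeShutdownHandler(
  sdk: Shutdownable,
  log: (message: string) => void = console.log,
): () => Promise<void> {
  let started = false;
  return async () => {
    if (started) return;
    started = true;
    try {
      await sdk.shutdown();
      log("Telemetry flushed");
    } catch (err) {
      log(`Telemetry shutdown failed: ${String(err)}`);
    }
  };
}

// Usage, next to sdk.start():
// const stopTelemetry = makeShutdownHandler(sdk);
// process.once("SIGTERM", stopTelemetry);
// process.once("SIGINT", stopTelemetry);
```

In a NestJS app the same handler could also be called from an application shutdown hook instead of raw process signals.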

c) add a CallbackHandler to each appGraph invocation

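    // CallbackHandler must be imported at the top of the file. Assumption: it comes from
    // the '@langfuse/langchain' package, i.e. import { CallbackHandler } from '@langfuse/langchain';
    // Each invocation gets its own handler so the trace is correctly scoped.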
    const langfuseHandler = new CallbackHandler({
      tags: ['langgraph', 'portfolio-review'],
    });

    // Invoke the graph — it will pause at the first interrupt in humanReview
    const result = await this.appGraph.invoke(initialState, {
      ...this.threadConfig,
      callbacks: [langfuseHandler],
    });

Observations in practice

Here are some examples of what we can observe thanks to the Langfuse dashboard.

In the screenshot below, a call to the LLM is visible along with its most important properties: the prompt, user input, LLM output, the time it took, and token usage (117):

Here you can see a detailed table of logs for every action taken by the agent. The table is interactive, filterable, and customizable.

Summary

In this post we covered the basics of observability for agentic systems: the main features of Langfuse, an installation guide for NestJS apps with an example LangGraph integration, and finally tracing examples from the working agent app.
