<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: clemra</title>
    <description>The latest articles on DEV Community by clemra (@clemra).</description>
    <link>https://dev.to/clemra</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1150622%2F94e677bd-2802-48b6-ac4f-f81d0311ac90.png</url>
      <title>DEV Community: clemra</title>
      <link>https://dev.to/clemra</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/clemra"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>clemra</dc:creator>
      <pubDate>Wed, 01 Oct 2025 16:00:07 +0000</pubDate>
      <link>https://dev.to/clemra/-52j0</link>
      <guid>https://dev.to/clemra/-52j0</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/margarita_sliachina" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3517901%2F9cf559b7-bf45-4232-90ee-f04ecd21e2c5.png" alt="margarita_sliachina"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/margarita_sliachina/llm-observability-debugging-my-journaling-agent-457m" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;LLM Observability: Debugging My Journaling Agent&lt;/h2&gt;
      &lt;h3&gt;Margarita Sliachina ・ Sep 20&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#llm&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#opensource&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#langfuse&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>langfuse</category>
    </item>
    <item>
      <title>Excited about this!</title>
      <dc:creator>clemra</dc:creator>
      <pubDate>Fri, 06 Jun 2025 13:05:01 +0000</pubDate>
      <link>https://dev.to/clemra/excited-about-this-143o</link>
      <guid>https://dev.to/clemra/excited-about-this-143o</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/clemra/all-langfuse-product-features-now-free-open-source-4140" class="crayons-story__hidden-navigation-link"&gt;All Langfuse Product Features now Free Open-Source&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/clemra" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1150622%2F94e677bd-2802-48b6-ac4f-f81d0311ac90.png" alt="clemra profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/clemra" class="crayons-story__secondary fw-medium m:hidden"&gt;
              clemra
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                clemra
                
              
              &lt;div id="story-author-preview-content-2569296" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/clemra" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1150622%2F94e677bd-2802-48b6-ac4f-f81d0311ac90.png" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;clemra&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/clemra/all-langfuse-product-features-now-free-open-source-4140" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jun 6 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/clemra/all-langfuse-product-features-now-free-open-source-4140" id="article-link-2569296"&gt;
          All Langfuse Product Features now Free Open-Source
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/opensource"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;opensource&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/llm"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;llm&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devtools"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devtools&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/langfuse"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;langfuse&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/clemra/all-langfuse-product-features-now-free-open-source-4140" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/fire-f60e7a582391810302117f987b22a8ef04a2fe0df7e3258a5f49332df1cec71e.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/exploding-head-daceb38d627e6ae9b730f36a1e390fca556a4289d5a41abb2c35068ad3e2c4b5.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;13&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/clemra/all-langfuse-product-features-now-free-open-source-4140#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              3&lt;span class="hidden s:inline"&gt; comments&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            2 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>opensource</category>
      <category>llm</category>
      <category>devtools</category>
      <category>langfuse</category>
    </item>
    <item>
      <title>All Langfuse Product Features now Free Open-Source</title>
      <dc:creator>clemra</dc:creator>
      <pubDate>Fri, 06 Jun 2025 13:02:00 +0000</pubDate>
      <link>https://dev.to/clemra/all-langfuse-product-features-now-free-open-source-4140</link>
      <guid>https://dev.to/clemra/all-langfuse-product-features-now-free-open-source-4140</guid>
      <description>&lt;p&gt;Today, we’re excited to share a big milestone: &lt;strong&gt;all Langfuse product features are now open source under the MIT license&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The LLM landscape is evolving rapidly - and so are the tools and workflows that developers use to build and improve their LLM applications. To empower our community and invite deeper collaboration, we’ve decided to open source all previously commercial features of Langfuse.&lt;/p&gt;

&lt;p&gt;That means you can now self-host features like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://langfuse.com/docs/scores/model-based-evals" rel="noopener noreferrer"&gt;Managed LLM-as-a-Judge evaluations&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://langfuse.com/docs/scores/annotation" rel="noopener noreferrer"&gt;Annotation queues&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://langfuse.com/docs/datasets/prompt-experiments" rel="noopener noreferrer"&gt;Prompt experiments&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://langfuse.com/docs/playground" rel="noopener noreferrer"&gt;LLM playground&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All fully open under the MIT license.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pteo40w7qeiotzbs0g4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pteo40w7qeiotzbs0g4.png" alt="Langfuse Open Source Project" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're already self-hosting Langfuse, &lt;a href="https://langfuse.com/self-hosting/upgrade" rel="noopener noreferrer"&gt;upgrade to the latest version&lt;/a&gt; to unlock these new features.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why we’re doing this&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Langfuse is building the &lt;strong&gt;open source LLM engineering platform&lt;/strong&gt; - a platform to observe, evaluate, and improve LLM applications.&lt;/p&gt;

&lt;p&gt;To truly serve the developer community, we believe the &lt;strong&gt;full dev cycle&lt;/strong&gt; needs to be covered by open source. Features like evals and prompt experiments are table stakes in today’s LLMOps world. These shouldn’t be behind a paywall.&lt;/p&gt;

&lt;p&gt;By removing commercial restrictions from key product features, we’re doubling down on transparency, adoption, and community collaboration. It lets us move faster together.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Our Open Source Journey&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Langfuse &lt;a href="https://langfuse.com/blog/product-analytics-for-LLM-apps" rel="noopener noreferrer"&gt;started as an open source project&lt;/a&gt;, grounded in a few core principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers should fully own and access their data
&lt;/li&gt;
&lt;li&gt;Langfuse must integrate with any stack, framework, or model
&lt;/li&gt;
&lt;li&gt;Teams should have the freedom to customize the platform for their workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That philosophy still guides us today - now with an even wider open-source surface.&lt;/p&gt;

&lt;p&gt;While we continue to support enterprise needs (like SCIM, Audit Logs, and custom data retention policies), &lt;strong&gt;the core of Langfuse is now entirely OSS&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Langfuse in the Wild&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Langfuse is growing fast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;7,000,000 SDK installs/month
&lt;/li&gt;
&lt;li&gt;5,500,000 Docker pulls
&lt;/li&gt;
&lt;li&gt;8,000 monthly active self-hosted instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re constantly amazed by the scale and creativity of our community. We hope this move makes Langfuse even more accessible and extensible for all developers building with LLMs.&lt;/p&gt;

&lt;p&gt;📌 &lt;a href="https://langfuse.com/self-hosting" rel="noopener noreferrer"&gt;Check out the self-hosting docs&lt;/a&gt;&lt;br&gt;&lt;br&gt;
🚀 &lt;a href="https://langfuse.com/changelog/2025-05-22-terraform-modules" rel="noopener noreferrer"&gt;Deploy with Terraform&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let us know what you're building - and how we can help make Langfuse even better.&lt;/p&gt;

&lt;p&gt;✍️ &lt;em&gt;By Clemens, Marc &amp;amp; Max&lt;/em&gt;  &lt;/p&gt;

</description>
      <category>opensource</category>
      <category>llm</category>
      <category>devtools</category>
      <category>langfuse</category>
    </item>
    <item>
      <title>RAG observability in 2 lines of code with Llama Index &amp; Langfuse</title>
      <dc:creator>clemra</dc:creator>
      <pubDate>Mon, 18 Mar 2024 19:34:03 +0000</pubDate>
      <link>https://dev.to/clemra/rag-observability-in-2-lines-of-code-with-llama-index-langfuse-51pa</link>
      <guid>https://dev.to/clemra/rag-observability-in-2-lines-of-code-with-llama-index-langfuse-51pa</guid>
      <description>&lt;h2&gt;
  
  
  Why you need observability for RAG
&lt;/h2&gt;

&lt;p&gt;There are so many different ways to make RAG work for a use case. What vector store to use? What retrieval strategy to use? LlamaIndex makes it easy to try many of them without having to deal with the complexity of integrations, prompts and memory all at once.&lt;/p&gt;

&lt;p&gt;Initially, we at Langfuse worked on complex RAG/agent applications and quickly realized that there is a new need for observability and experimentation to tweak and iterate on the details. In the end, these details matter to get from something cool to an actually reliable RAG application that is safe for users and customers. Think of this: if there is a user session of interest in your &lt;em&gt;production&lt;/em&gt; RAG application, how can you quickly see whether the retrieved context for that session was actually relevant or the LLM response was on point? &lt;/p&gt;

&lt;h2&gt;
  
  
  What is Langfuse?
&lt;/h2&gt;

&lt;p&gt;Thus, we started working on &lt;a href="https://langfuse.com"&gt;Langfuse.com&lt;/a&gt; (&lt;a href="https://github.com/langfuse/langfuse"&gt;GitHub&lt;/a&gt;) to establish an open source LLM engineering platform with tightly integrated features for tracing, prompt management, and evaluation. In the beginning we just solved our own and our friends’ problems. Today we are at over 1000 projects which rely on Langfuse, and 2.3k stars on GitHub. You can either &lt;a href="https://langfuse.com/docs/deployment/self-host"&gt;self-host&lt;/a&gt; Langfuse or use the &lt;a href="https://cloud.langfuse.com"&gt;cloud instance&lt;/a&gt; maintained by us.&lt;/p&gt;

&lt;p&gt;We are thrilled to announce our new integration with LlamaIndex today. This feature was &lt;a href="https://github.com/orgs/langfuse/discussions/828"&gt;highly requested&lt;/a&gt; by our community and aligns with our project's focus on native integration with major application frameworks. Thank you to everyone who contributed and tested it during the beta phase!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcq1wmpy47yjcyjvp7qw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcq1wmpy47yjcyjvp7qw.png" alt="Integrating Llama Index with Langfuse for open source observability and analytics" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The challenge
&lt;/h2&gt;

&lt;p&gt;We love LlamaIndex, since the clean and standardized interface abstracts a lot of complexity away. Let’s take this simple example of a &lt;code&gt;VectorStoreIndex&lt;/code&gt; and a &lt;code&gt;ChatEngine&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleDirectoryReader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;

&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleDirectoryReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;chat_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_chat_engine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What problems can I solve with RAG?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I optimize my RAG application?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In just 3 lines we loaded our local documents, added them to an index and initialized a ChatEngine with memory. Subsequently we had a stateful conversation with the chat_engine.&lt;/p&gt;

&lt;p&gt;This is awesome to get started, but we quickly run into questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;“What context is actually retrieved from the index to answer the questions?”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“How is chat memory managed?”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“Which steps add the most latency to the overall execution? How to optimize it?”&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  One-click OSS observability to the rescue
&lt;/h2&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/4PZbb9XwG2o"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;We integrated Langfuse to be a &lt;code&gt;one-click&lt;/code&gt; integration with LlamaIndex using the global callback manager.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preparation
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Install the community package (&lt;code&gt;pip install llama-index-callbacks-langfuse&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Copy/paste the environment variables from the Langfuse project settings to your Python project: 'LANGFUSE_SECRET_KEY', 'LANGFUSE_PUBLIC_KEY' and 'LANGFUSE_HOST'&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, you only need to set the global &lt;code&gt;langfuse&lt;/code&gt; handler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;set_global_handler&lt;/span&gt;

&lt;span class="nf"&gt;set_global_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langfuse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And voilà, with just two lines of code you get detailed traces for all aspects of your RAG application in Langfuse. They automatically include latency and usage/cost breakdowns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Group multiple chat threads into a session
&lt;/h2&gt;

&lt;p&gt;Working with lots of teams building GenAI/LLM/RAG applications, we’ve continuously added more features that are useful to debug and improve these applications. One example is &lt;a href="https://langfuse.com/docs/tracing/sessions"&gt;session tracking&lt;/a&gt; for conversational applications to see the traces in context of a full message thread.&lt;/p&gt;

&lt;p&gt;To activate it, just add an id that identifies the session as a trace param before calling the chat_engine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;global_handler&lt;/span&gt;

&lt;span class="n"&gt;global_handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_trace_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-session-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;chat_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What did he do growing up?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chat_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What did he do at USC?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chat_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How old is he?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thereby you can see all these chat invocations grouped into a session view in Langfuse Tracing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkchrqqz9ttol6z1jsbv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkchrqqz9ttol6z1jsbv.png" alt="RAG Observability with Langfuse" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next to sessions, you can also track individual users or add tags and metadata to your Langfuse traces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trace more complex applications and use other Langfuse features for prompt management and evaluation
&lt;/h2&gt;

&lt;p&gt;This integration makes it easy to get started with Tracing. If your application ends up growing into using custom logic or other frameworks/packages, all Langfuse integrations are fully interoperable.&lt;/p&gt;

&lt;p&gt;We have also built additional features to version control and collaborate on prompts (langfuse &lt;a href="https://langfuse.com/docs/prompts/get-started"&gt;prompt management&lt;/a&gt;), track &lt;a href="https://langfuse.com/docs/experimentation"&gt;experiments&lt;/a&gt;, and &lt;a href="https://langfuse.com/docs/scores/overview"&gt;evaluate&lt;/a&gt; production traces. For RAG specifically, we collaborated with the RAGAS team and it’s easy to run their popular eval suite on traces captured with Langfuse (see &lt;a href="https://langfuse.com/docs/scores/model-based-evals/ragas"&gt;cookbook&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;

&lt;p&gt;The easiest way to get started is to follow the &lt;a href="https://docs.llamaindex.ai/en/stable/examples/callbacks/LangfuseCallbackHandler.html"&gt;cookbook&lt;/a&gt; and check out the &lt;a href="https://langfuse.com/docs/integrations/llama-index/get-started"&gt;docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feedback? Ping us
&lt;/h2&gt;

&lt;p&gt;We’d love to hear any feedback. Come join us on our &lt;a href="https://langfuse.com/discord"&gt;community discord&lt;/a&gt; or add your thoughts to this &lt;a href="https://github.com/orgs/langfuse/discussions/828"&gt;GitHub thread&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>llamaindex</category>
      <category>rag</category>
      <category>opensource</category>
    </item>
    <item>
      <title>LLM Analytics 101 - How to Improve your LLM app</title>
      <dc:creator>clemra</dc:creator>
      <pubDate>Thu, 14 Sep 2023 20:10:17 +0000</pubDate>
      <link>https://dev.to/clemra/llm-analytics-101-how-to-improve-your-llm-app-1ph5</link>
      <guid>https://dev.to/clemra/llm-analytics-101-how-to-improve-your-llm-app-1ph5</guid>
      <description>&lt;p&gt;&lt;em&gt;This guide gives builders on the LLM application layer an  understanding of the &lt;strong&gt;why&lt;/strong&gt;, &lt;strong&gt;what&lt;/strong&gt; and &lt;strong&gt;how&lt;/strong&gt; of tracing &amp;amp; analytics to improve their LLM applications&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  LLMs Have Changed Software Delivery
&lt;/h2&gt;

&lt;p&gt;Generative AI outputs are not deterministic. That is, they cannot be reliably forecasted. This changes how software is delivered as compared to more 'traditional' software engineering. If it is not clear what an output will look like and what a '&lt;em&gt;good&lt;/em&gt;' output is, it is harder to assure quality and build robust tests before shipping code.&lt;/p&gt;

&lt;p&gt;Learning from production data has taken the place of extensive software design and testing on the LLM application layer. But to learn from production, you have to trace your LLMs and analyze what works and what does not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracing LLM apps - What's Different?
&lt;/h2&gt;

&lt;p&gt;Building LLM-based apps means integrating multiple complex elements and interactions to your code. This can mean chains, agents, different base models, tools, embedding retrieval and routing. Traditional logging and analytics tools are not well equipped to ingest, display and analyze these new ways of interacting with LLMs. &lt;br&gt;
The new logging stack needs to think LLM-native from the ground up. That means grouping calls and visualizing them in a way that enables teams to understand and debug them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Dive in: What to Measure?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example generation creation&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;generation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;chat-completion&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;modelParameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The baseline requirement to improve an LLM-based app is to trace its activity. But what does that mean and what do I want to record? From working with our users at the bleeding edge of LLMs, we've seen five metrics emerge to keep track of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Volume:&lt;/strong&gt; The foundation for all other metrics - track all LLM calls and their content and attach relevant metadata for both prompts and completions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Costs:&lt;/strong&gt; Record token counts and pricing to compute the cost of each call. Track GPU seconds and pricing for self-hosted models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; Measure latency for every call. Use this data to analyze which steps add latency and start improving your users' experience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality:&lt;/strong&gt; Proactively solicit user feedback, conduct manual evaluations and score outputs using model-based evaluations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Errors/Exceptions:&lt;/strong&gt; Monitor for timeouts and HTTP errors, such as rate limits, that are indicative of systemic issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementing Effective Analytics through KPIs
&lt;/h2&gt;

&lt;p&gt;We've seen successful teams implement the following best practice KPIs by slicing the above five metrics (volume, cost, latency, quality, errors) by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use case:&lt;/strong&gt; Cluster prompts and completions by use case to understand how your users are interacting with your LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model and configuration:&lt;/strong&gt; How do different models and model configurations affect quality, latency or errors?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chain and step:&lt;/strong&gt; Drill down into chains to understand what drives performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User data:&lt;/strong&gt; Group users by specific characteristics to gain insight into personas and specific constituencies in your product&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time:&lt;/strong&gt; Inspect your KPIs over time and detect trends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version:&lt;/strong&gt; Track prompts, chains and software releases by their version and understand performance changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geography:&lt;/strong&gt; Especially important for latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language:&lt;/strong&gt; Understand how well your app works by user language&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-Step: Implementing Tracing &amp;amp; Analytics in LLM Applications
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Define goals:&lt;/strong&gt; What do you want to achieve, and how do your goals align with your users' requirements? Take the above metrics as a starting point to define KPIs unique to your application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incorporate tracking:&lt;/strong&gt; This means backend execution and scores (e.g. capturing user feedback in the frontend).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inspect and debug:&lt;/strong&gt; Understand your users by inspecting runtime traces through a visual UI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyze:&lt;/strong&gt; Start by measuring cost by model/user and time, cost by product feature, latency by step of a chain and start scattering quality/latency/cost grouped by experiments or production versions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Give Langfuse a Spin
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://langfuse.com" rel="noopener noreferrer"&gt;Langfuse&lt;/a&gt; makes tracing and analyzing LLM applications accessible. It is an open-source project under MIT license.&lt;/p&gt;

&lt;p&gt;It offers data integration with async SDKs (JS/TS, Python), via API, and Langchain integrations. It provides a UI for debugging complex traces &amp;amp; includes pre-built dashboards to analyze quality, latency and cost. It allows for recording user feedback and using LLM models to grade and score your outputs. To get going, refer to the &lt;a href="https://langfuse.com/docs/get-started" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt; in the docs. &lt;/p&gt;

&lt;p&gt;Visit us on &lt;a href="https://langfuse.com/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; and &lt;a href="https://github.com/langfuse/langfuse/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; to engage with our project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3ksrdnl61f7vzaej4t6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3ksrdnl61f7vzaej4t6.png" alt="A trace in Langfuse"&gt;&lt;/a&gt;&lt;br&gt;
Interested? Sign up to try the demo at &lt;a href="http://langfuse.com" rel="noopener noreferrer"&gt;langfuse.com&lt;/a&gt;. Self-hosting instructions can be found in &lt;a href="https://langfuse.com/docs/deployment/self-host" rel="noopener noreferrer"&gt;our docs&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>llm</category>
      <category>observability</category>
      <category>llmops</category>
    </item>
  </channel>
</rss>
