<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ayub Shah</title>
    <description>The latest articles on DEV Community by Ayub Shah (@ayubshah014sys).</description>
    <link>https://dev.to/ayubshah014sys</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906545%2F9b00e6e0-15d4-41a5-8a69-a61d354056ec.jpg</url>
      <title>DEV Community: Ayub Shah</title>
      <link>https://dev.to/ayubshah014sys</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ayubshah014sys"/>
    <language>en</language>
    <item>
      <title>MLflow Tutorial: How to Track ML Experiments Like a Pro (2026)</title>
      <dc:creator>Ayub Shah</dc:creator>
      <pubDate>Fri, 01 May 2026 19:04:47 +0000</pubDate>
      <link>https://dev.to/ayubshah014sys/mlflow-tutorial-how-to-track-ml-experiments-like-a-pro-2026-362f</link>
      <guid>https://dev.to/ayubshah014sys/mlflow-tutorial-how-to-track-ml-experiments-like-a-pro-2026-362f</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mlopslab.org/mlflow-tutorial/" rel="noopener noreferrer"&gt;mlopslab.org/mlflow-tutorial&lt;/a&gt; — updated weekly. 0 sponsors, 0 affiliate links.&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚡ Quick answer:&lt;/strong&gt; MLflow is an open-source platform that tracks everything about your ML experiments — parameters, metrics, model artifacts, and code versions — so you can reproduce any result and never lose a winning configuration again. You'll have your first experiment tracked in under 20 minutes.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What is MLflow?&lt;/li&gt;
&lt;li&gt;Before you start&lt;/li&gt;
&lt;li&gt;Step 1 — Install MLflow&lt;/li&gt;
&lt;li&gt;Step 2 — Start the tracking server&lt;/li&gt;
&lt;li&gt;Step 3 — Write your first tracking script&lt;/li&gt;
&lt;li&gt;Step 4 — View results in the UI&lt;/li&gt;
&lt;li&gt;Step 5 — Compare multiple runs&lt;/li&gt;
&lt;li&gt;What to learn next&lt;/li&gt;
&lt;li&gt;FAQ&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. What is MLflow?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MLflow is an open-source platform that tracks everything about your ML experiments&lt;/strong&gt; — parameters, metrics, model artifacts, and code versions — so you can reproduce any result and never lose a winning configuration again.&lt;/p&gt;

&lt;p&gt;Without experiment tracking, most ML engineers waste hours rerunning experiments they've already done — or ship models they can't reproduce. MLflow eliminates both problems permanently.&lt;/p&gt;

&lt;p&gt;At its core, MLflow gives you four things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tracking&lt;/strong&gt; — log parameters, metrics, and artifacts for every run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Projects&lt;/strong&gt; — package code so it's reproducible on any machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt; — a standard format to package models for deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Registry&lt;/strong&gt; — a central hub to manage model lifecycle (staging → production)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tutorial covers the Tracking component, which is where 90% of the day-to-day value lives.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Note:&lt;/strong&gt; MLflow is model-framework agnostic. It works with scikit-learn, PyTorch, TensorFlow, XGBoost, Keras, LightGBM — anything you're already using.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. Before you start
&lt;/h2&gt;

&lt;p&gt;You need three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.8+&lt;/strong&gt; — run &lt;code&gt;python --version&lt;/code&gt; to check&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pip installed&lt;/strong&gt; — comes with Python 3.4+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Basic ML knowledge&lt;/strong&gt; — you should know what "training a model" and "accuracy" mean&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. No Docker, no AWS account, no paid tier.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Step 1 — Install MLflow
&lt;/h2&gt;

&lt;p&gt;⏱ &lt;em&gt;2 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;MLflow is a single pip install. It includes the tracking server, the UI, and the full Python API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;mlflow scikit-learn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mlflow &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# mlflow, version 2.x.x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;Using a virtual environment?&lt;/strong&gt; Run &lt;code&gt;python -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate&lt;/code&gt; before installing. This is recommended: it keeps MLflow and its dependencies isolated from your system Python.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  4. Step 2 — Start the tracking server
&lt;/h2&gt;

&lt;p&gt;⏱ &lt;em&gt;1 minute&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In a terminal, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mlflow ui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[2026-04-15 10:23:01 +0000] [INFO] Starting gunicorn 21.2.0
[2026-04-15 10:23:01 +0000] [INFO] Listening at: http://127.0.0.1:5000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;strong&gt;&lt;a href="http://localhost:5000" rel="noopener noreferrer"&gt;http://localhost:5000&lt;/a&gt;&lt;/strong&gt; in your browser — you'll see an empty MLflow dashboard. Leave this terminal running.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Port conflict?&lt;/strong&gt; If port 5000 is taken (common on macOS), run &lt;code&gt;mlflow ui --port 5001&lt;/code&gt; and visit &lt;code&gt;http://localhost:5001&lt;/code&gt; instead.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  5. Step 3 — Write your first tracking script
&lt;/h2&gt;

&lt;p&gt;⏱ &lt;em&gt;10 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Create a file called &lt;code&gt;train.py&lt;/code&gt; and paste this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow.sklearn&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_iris&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;

&lt;span class="c1"&gt;# Configuration — change these to experiment
&lt;/span&gt;&lt;span class="n"&gt;N_ESTIMATORS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="n"&gt;MAX_DEPTH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="n"&gt;RANDOM_STATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;

&lt;span class="c1"&gt;# Load data
&lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_iris&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;RANDOM_STATE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Name your experiment (MLflow creates it if it doesn't exist)
&lt;/span&gt;&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_experiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iris-classifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Train model
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;N_ESTIMATORS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MAX_DEPTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;RANDOM_STATE&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Evaluate
&lt;/span&gt;    &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weighted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Log everything to MLflow
&lt;/span&gt;    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n_estimators&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N_ESTIMATORS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_depth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MAX_DEPTH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;f1_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sklearn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;random-forest-model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | F1: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Run ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;active_run&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python train.py
&lt;span class="c"&gt;# Accuracy: 0.9667 | F1: 0.9667&lt;/span&gt;
&lt;span class="c"&gt;# Run ID: a1b2c3d4e5f6...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MLflow created an &lt;code&gt;mlruns/&lt;/code&gt; folder in your working directory. That's where everything is stored locally.&lt;/p&gt;

&lt;h3&gt;
  
  
  What each MLflow call does
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Call&lt;/th&gt;
&lt;th&gt;What it logs&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mlflow.set_experiment()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Groups runs under a named experiment&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"iris-classifier"&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mlflow.log_param()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A single key-value config value&lt;/td&gt;
&lt;td&gt;&lt;code&gt;n_estimators=100&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mlflow.log_metric()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A numeric result (can be logged repeatedly with a step index)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;accuracy=0.967&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mlflow.sklearn.log_model()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The trained model artifact + signature&lt;/td&gt;
&lt;td&gt;Serialized RandomForest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;It worked!&lt;/strong&gt; Every run gets a unique run ID, timestamp, and its own folder under &lt;code&gt;mlruns/&lt;/code&gt;. Nothing overwrites anything.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Step 4 — View results in the MLflow UI
&lt;/h2&gt;

&lt;p&gt;⏱ &lt;em&gt;2 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Go back to &lt;strong&gt;&lt;a href="http://localhost:5000" rel="noopener noreferrer"&gt;http://localhost:5000&lt;/a&gt;&lt;/strong&gt;. You'll now see your &lt;code&gt;iris-classifier&lt;/code&gt; experiment with one run logged.&lt;/p&gt;

&lt;p&gt;Click the run to see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parameters tab&lt;/strong&gt; — &lt;code&gt;n_estimators&lt;/code&gt;, &lt;code&gt;max_depth&lt;/code&gt;, &lt;code&gt;random_state&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics tab&lt;/strong&gt; — &lt;code&gt;accuracy&lt;/code&gt;, &lt;code&gt;f1_score&lt;/code&gt; with a time-series chart&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artifacts tab&lt;/strong&gt; — the serialized model, ready to load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2pf6nnrqvn9hl1scms8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2pf6nnrqvn9hl1scms8.png" alt="MLflow UI showing metric tracking dashboard" width="800" height="438"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: MLflow tracking UI — parameters and metrics are visualized automatically per run&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  7. Step 5 — Compare multiple runs
&lt;/h2&gt;

&lt;p&gt;⏱ &lt;em&gt;5 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is where MLflow pays off. Run &lt;code&gt;train.py&lt;/code&gt; a few more times with different parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Edit N_ESTIMATORS and MAX_DEPTH in train.py between runs, then:&lt;/span&gt;
python train.py  &lt;span class="c"&gt;# run 2: n_estimators=50, max_depth=3&lt;/span&gt;
python train.py  &lt;span class="c"&gt;# run 3: n_estimators=200, max_depth=10&lt;/span&gt;
python train.py  &lt;span class="c"&gt;# run 4: n_estimators=10, max_depth=2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the MLflow UI, check the checkboxes next to multiple runs and click &lt;strong&gt;"Compare"&lt;/strong&gt;. You'll get a side-by-side table of every parameter and metric across all runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv0j8hom7z8kbuxoa8lu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv0j8hom7z8kbuxoa8lu.png" alt="MLflow run comparison table" width="800" height="608"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: Compare runs side-by-side — MLflow shows exactly which parameters produced the best results&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can now answer: &lt;em&gt;"Which configuration gave us the best result, and can we reproduce it?"&lt;/em&gt; The comparison table answers the first question; the run ID answers the second (the sketch after the tip below shows how).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🏆 &lt;strong&gt;Pro tip:&lt;/strong&gt; In the UI, click any metric column header to sort runs by that metric. The best run floats to the top instantly.&lt;/p&gt;
&lt;/blockquote&gt;
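
&lt;p&gt;To make "reproduce it" concrete, here's a minimal sketch of loading a logged model back by its run ID (the run ID below is hypothetical; substitute the one printed by &lt;code&gt;train.py&lt;/code&gt; or copied from the UI):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlflow.sklearn

# Hypothetical run ID -- use the one printed by train.py or shown in the UI
RUN_ID = "a1b2c3d4e5f6"

# "random-forest-model" is the artifact path we passed to log_model() above
model = mlflow.sklearn.load_model(f"runs:/{RUN_ID}/random-forest-model")
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))  # one iris sample
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;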




&lt;h2&gt;
  
  
  8. What to learn next
&lt;/h2&gt;

&lt;p&gt;Once you have basic tracking working, these are the natural next steps in order of complexity:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Registry&lt;/strong&gt; — register your best run's model, then promote it from "Staging" to "Production" with one click. Gives you a version-controlled model store with transition history.&lt;/p&gt;
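
&lt;p&gt;In code, that flow is two calls. A minimal sketch (the run ID is hypothetical, and stage transitions are the classic API; recent MLflow versions also offer aliases):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlflow
from mlflow import MlflowClient

# Register the model logged under a run (hypothetical run ID)
result = mlflow.register_model(
    "runs:/a1b2c3d4e5f6/random-forest-model", "iris-classifier"
)

# Promote that version to Staging (classic stage-based workflow)
client = MlflowClient()
client.transition_model_version_stage(
    name="iris-classifier", version=result.version, stage="Staging"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;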

&lt;p&gt;&lt;strong&gt;Log more metrics&lt;/strong&gt; — use &lt;code&gt;mlflow.log_metric("loss", loss, step=epoch)&lt;/code&gt; inside your training loop to track metrics over time, not just at the end. The UI plots them automatically.&lt;/p&gt;
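
&lt;p&gt;A toy sketch of what that looks like (the loss values here are fabricated for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlflow

mlflow.set_experiment("iris-classifier")

with mlflow.start_run():
    for epoch in range(10):
        loss = 1.0 / (epoch + 1)  # stand-in for your real training loss
        mlflow.log_metric("loss", loss, step=epoch)
        # The UI plots "loss" against step automatically
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;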

&lt;p&gt;&lt;strong&gt;Serve your model&lt;/strong&gt; — run &lt;code&gt;mlflow models serve -m runs:/&amp;lt;RUN_ID&amp;gt;/random-forest-model --port 8080&lt;/code&gt; to expose your logged model as a REST API endpoint. No extra code needed.&lt;/p&gt;
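
&lt;p&gt;Once the server is up, querying it is a plain HTTP POST. A sketch using &lt;code&gt;requests&lt;/code&gt; (whether the payload key is &lt;code&gt;inputs&lt;/code&gt; or &lt;code&gt;dataframe_split&lt;/code&gt; depends on your MLflow version and model signature):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Assumes `mlflow models serve` is running locally on port 8080
resp = requests.post(
    "http://localhost:8080/invocations",
    json={"inputs": [[5.1, 3.5, 1.4, 0.2]]},  # one iris sample
    timeout=10,
)
print(resp.json())  # e.g. {"predictions": [0]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;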

&lt;p&gt;&lt;strong&gt;Remote tracking server&lt;/strong&gt; — instead of &lt;code&gt;mlflow ui&lt;/code&gt; on localhost, point your team at one shared PostgreSQL-backed server: &lt;code&gt;mlflow server --backend-store-uri postgresql://...&lt;/code&gt;. Every engineer's runs go to the same place.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's the difference between MLflow and Weights &amp;amp; Biases?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MLflow is fully open-source and self-hostable — your data never leaves your infrastructure. W&amp;amp;B is cloud-first with a better UI and more advanced features (sweeps, reports), but costs money at scale. For teams that need data sovereignty or are cost-sensitive, MLflow wins. See the &lt;a href="https://mlopslab.org/mlflow-vs-weights-biases-which-actually-saves-engineering-time/" rel="noopener noreferrer"&gt;full MLflow vs W&amp;amp;B comparison&lt;/a&gt; for a detailed breakdown.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can MLflow track deep learning training loops?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Use &lt;code&gt;mlflow.log_metric("loss", loss, step=epoch)&lt;/code&gt; inside your epoch loop and MLflow plots the full training curve. It also has autologging support for PyTorch Lightning, Keras, and Hugging Face — one line enables automatic logging of all metrics, params, and the final model.&lt;/p&gt;
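
&lt;p&gt;The autologging setup really is one line (MLflow detects supported frameworks automatically):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlflow

mlflow.autolog()  # enables autologging for every supported framework

# From here on, fit() / train() calls from sklearn, Keras, PyTorch
# Lightning, etc. log params, metrics, and the final model automatically
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;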

&lt;p&gt;&lt;strong&gt;What happens to my runs if I delete &lt;code&gt;mlruns/&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They're gone. For anything beyond local experimentation, set up a proper backend store (SQLite at minimum, PostgreSQL for teams) and an artifact store (S3, GCS, or Azure Blob). Then your runs survive machine restarts and are shareable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does MLflow work with open-source models like Llama or Mistral?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes — MLflow has an &lt;code&gt;mlflow.transformers&lt;/code&gt; flavor for Hugging Face models and supports custom Python function (pyfunc) flavors for anything else. You can log any model as long as you can serialize it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does MLflow compare to ClearML?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both are strong open-source options. ClearML has a richer built-in UI and experiment orchestration features out of the box. MLflow has a larger ecosystem and better framework integrations. See the &lt;a href="https://mlopslab.org/mlflow-vs-clearml-which-open-source-mlops-tool-actually-wins-2026/" rel="noopener noreferrer"&gt;MLflow vs ClearML breakdown&lt;/a&gt; for a production-focused comparison.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;MLflow experiment tracking isn't optional once you're running more than a handful of experiments. The "I'll remember which config worked best" approach breaks fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The minimum viable setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pip install mlflow&lt;/code&gt; → &lt;code&gt;mlflow ui&lt;/code&gt; → &lt;code&gt;mlflow.log_param()&lt;/code&gt; + &lt;code&gt;mlflow.log_metric()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination gives you full reproducibility with maybe 30 minutes of implementation work.&lt;/p&gt;

&lt;p&gt;Don't set up the perfect MLflow infrastructure before you ship. Start local, log everything, move to a shared server when you have a team. The habit of logging compounds.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔗 &lt;strong&gt;Next step:&lt;/strong&gt; Run the &lt;code&gt;train.py&lt;/code&gt; above → check your first run in the UI at &lt;code&gt;localhost:5000&lt;/code&gt;. That's the first 15 minutes. Everything else follows from having that first run visible.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Related articles on MLOpsLab
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/mlflow-vs-weights-biases-which-actually-saves-engineering-time/" rel="noopener noreferrer"&gt;MLflow vs Weights &amp;amp; Biases: Which Actually Saves Engineering Time?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/mlflow-vs-clearml-which-open-source-mlops-tool-actually-wins-2026/" rel="noopener noreferrer"&gt;MLflow vs ClearML: Which Open Source MLOps Tool Actually Wins (2026)?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/how-to-deploy-a-machine-learning-model-with-docker-and-mlflow-2026-tutorial/" rel="noopener noreferrer"&gt;How to Deploy a Machine Learning Model with Docker &amp;amp; MLflow (2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/llm-observability/" rel="noopener noreferrer"&gt;LLM Observability: The ML Engineer's Practical Guide (2026)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;MLflow Documentation. &lt;a href="https://mlflow.org/docs/latest/index.html" rel="noopener noreferrer"&gt;https://mlflow.org/docs/latest/index.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Chen, A., et al. (2020). Developments in MLflow: A System to Accelerate the Machine Learning Lifecycle. DEEM Workshop, ACM SIGMOD. &lt;a href="https://doi.org/10.1145/3399579.3399867" rel="noopener noreferrer"&gt;https://doi.org/10.1145/3399579.3399867&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;scikit-learn Documentation. &lt;a href="https://scikit-learn.org/stable/" rel="noopener noreferrer"&gt;https://scikit-learn.org/stable/&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Written by Ayub Shah — ML Engineering student, MLOps enthusiast. Testing every tool so you don't have to. No sponsors, no affiliate links.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;→ More at &lt;a href="https://mlopslab.org" rel="noopener noreferrer"&gt;mlopslab.org&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What is LLM Observability? The ML Engineer's Practical Guide (2026)</title>
      <dc:creator>Ayub Shah</dc:creator>
      <pubDate>Fri, 01 May 2026 17:49:00 +0000</pubDate>
      <link>https://dev.to/ayubshah014sys/what-is-llm-observability-the-ml-engineers-practical-guide-2026-1l4h</link>
      <guid>https://dev.to/ayubshah014sys/what-is-llm-observability-the-ml-engineers-practical-guide-2026-1l4h</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mlopslab.org/llm-observability/" rel="noopener noreferrer"&gt;mlopslab.org/llm-observability&lt;/a&gt; — updated weekly. 0 sponsors, 0 affiliate links.&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚡ Quick answer:&lt;/strong&gt; LLM observability is the practice of collecting metrics, traces, and logs from large language model applications to monitor behavior, catch failures, control costs, and improve output quality — in real time. Unlike traditional APM, it handles non-deterministic outputs, prompt/response pairs, token costs, hallucination rates, and multi-step agent chains that standard monitoring tools were never built for.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;LLM observability: the actual definition&lt;/li&gt;
&lt;li&gt;Why traditional APM fails for LLMs&lt;/li&gt;
&lt;li&gt;Why it matters in 2026&lt;/li&gt;
&lt;li&gt;The three pillars: metrics, traces, logs&lt;/li&gt;
&lt;li&gt;Key LLM observability metrics&lt;/li&gt;
&lt;li&gt;Best LLM observability tools (2026)&lt;/li&gt;
&lt;li&gt;How to implement it in Python — step by step&lt;/li&gt;
&lt;li&gt;RAG observability: what's different&lt;/li&gt;
&lt;li&gt;Common mistakes to avoid&lt;/li&gt;
&lt;li&gt;FAQ&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. LLM observability: the actual definition
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LLM observability&lt;/strong&gt; is the ability to understand what your large language model is doing, why it's doing it, and whether it's doing it well — while it's running in production.&lt;/p&gt;

&lt;p&gt;The formal definition: it's the process of instrumenting LLM applications to collect structured data (metrics, traces, logs) about inputs, outputs, latency, token usage, and downstream behavior — then making that data queryable and actionable.&lt;/p&gt;

&lt;p&gt;But here's the part most definitions skip: &lt;strong&gt;LLMs are non-deterministic&lt;/strong&gt;. The same prompt can produce different outputs. That single fact breaks every assumption traditional application monitoring was built on.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Note:&lt;/strong&gt; "Observability" comes from control theory — a system is observable if you can infer its internal state from its outputs. For LLMs, the "internal state" is opaque by design. Observability is how you compensate for that opacity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A complete LLM observability setup lets you answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why did this prompt return garbage output on Tuesday at 3pm?&lt;/li&gt;
&lt;li&gt;How many tokens did we burn last week, and on which features?&lt;/li&gt;
&lt;li&gt;Is our retrieval step actually finding relevant context, or just noise?&lt;/li&gt;
&lt;li&gt;Which user flows are generating the most hallucinations?&lt;/li&gt;
&lt;li&gt;Did our prompt change last Wednesday improve or hurt response quality?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without observability, you're guessing at all of the above.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Why traditional APM fails for LLMs
&lt;/h2&gt;

&lt;p&gt;You might already have Datadog, New Relic, or Prometheus running. They're great tools. They will &lt;strong&gt;not&lt;/strong&gt; help you monitor an LLM application properly. Here's why:&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional APM vs LLM Observability
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Traditional APM&lt;/th&gt;
&lt;th&gt;LLM Observability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output nature&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deterministic — same input → same output&lt;/td&gt;
&lt;td&gt;Non-deterministic — same prompt → different outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Binary (HTTP 200 vs 500)&lt;/td&gt;
&lt;td&gt;Output can be grammatically correct but factually wrong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance definition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Speed + uptime&lt;/td&gt;
&lt;td&gt;Relevance, factual accuracy, coherence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not applicable&lt;/td&gt;
&lt;td&gt;First-class concern with dedicated metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tracing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fixed execution paths&lt;/td&gt;
&lt;td&gt;Spans across prompt → retrieval → generation → re-ranking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost tracking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not needed&lt;/td&gt;
&lt;td&gt;Token cost per request is critical (it's your AWS bill)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clear: stack traces, exceptions&lt;/td&gt;
&lt;td&gt;"Silent failures" — plausible-sounding wrong answers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most dangerous failure mode in LLM production is the &lt;strong&gt;silent failure&lt;/strong&gt;: the model returns a 200 OK with a confident, fluent, completely wrong answer. Your APM sees green. Your users are getting misinformation. You have no idea.&lt;/p&gt;

&lt;p&gt;That's the problem LLM observability is built to solve.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Why it matters in 2026
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. You're paying per token — and it adds up fast
&lt;/h3&gt;

&lt;p&gt;GPT-4o charges ~$5 per million input tokens. Claude Opus is $15. If you're running a RAG pipeline that sends 3,000-token prompts for every user query, 10,000 daily active users at just one query each burn 30 million input tokens a day: roughly $150/day at GPT-4o rates, before counting a single output token.&lt;/p&gt;

&lt;p&gt;Without observability, you have zero visibility into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which features are expensive&lt;/li&gt;
&lt;li&gt;Which prompts are bloated&lt;/li&gt;
&lt;li&gt;Which retrieval chunks are redundant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A 40% cost reduction is realistic&lt;/strong&gt; just from instrumenting your token usage and trimming waste.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Hallucinations don't throw exceptions
&lt;/h3&gt;

&lt;p&gt;When a SQL query fails, you get an error. When an LLM confidently fabricates a legal clause, a medical dosage, or a product spec — you get a 200 OK.&lt;/p&gt;

&lt;p&gt;The only way to catch this is output evaluation: either automated (LLM-as-judge, assertion checks) or via user feedback signals — both of which require an observability layer to collect and route.&lt;/p&gt;
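
&lt;p&gt;To give the assertion-check idea some flavor, here's a deliberately crude sketch (real evaluators use an LLM judge or learned metrics; this keyword-overlap check is only a starting point):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def grounded_in_context(answer: str, context_chunks: list[str]) -&amp;gt; bool:
    """Crude faithfulness assertion: every sentence in the answer must
    share at least one word with the retrieved context."""
    context = " ".join(context_chunks).lower()
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return all(
        any(word in context for word in sentence.lower().split())
        for sentence in sentences
    )

# A fabricated answer with no overlap with the context gets flagged
ctx = ["The Eiffel Tower is in Paris, France."]
print(grounded_in_context("Berlin has great weather.", ctx))  # False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;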

&lt;h3&gt;
  
  
  3. LLM apps are increasingly multi-step
&lt;/h3&gt;

&lt;p&gt;A modern RAG agent might do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query rewriting → vector search → reranking → generation → post-processing → tool calls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any step can fail silently. Without distributed tracing across all those steps, you have no way to know which node in the chain is degrading your quality.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;Tip:&lt;/strong&gt; If you're already logging prompts and responses to a database, you have the raw material for LLM observability. The difference is structure, aggregation, and making that data queryable — which is what proper tooling does.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  4. The three pillars: metrics, traces, logs
&lt;/h2&gt;

&lt;p&gt;LLM observability, like traditional observability, rests on three data types. But each has LLM-specific meaning:&lt;/p&gt;

&lt;h3&gt;
  
  
  📊 Metrics — aggregated numbers over time
&lt;/h3&gt;

&lt;p&gt;Latency percentiles, token consumption per day, error rates, hallucination rate, TTFT (time to first token), user thumbs-up/down ratio.&lt;/p&gt;

&lt;p&gt;These are your dashboards — the signals that tell you whether the system is healthy at a glance.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔍 Traces — the execution path of a single request
&lt;/h3&gt;

&lt;p&gt;A trace for an LLM request spans every step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input received → prompt constructed → retrieval triggered → chunks fetched → LLM called → response parsed → returned
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traces tell you &lt;em&gt;where&lt;/em&gt; time and tokens were spent on a specific request and let you drill into failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  📋 Logs — raw structured records of events
&lt;/h3&gt;

&lt;p&gt;Every prompt sent, every response received, every retrieved chunk, every tool call. Logs are the ground truth — unsampled, timestamped, filterable.&lt;/p&gt;

&lt;p&gt;They're what you reach for during incident investigation when metrics tell you &lt;em&gt;something is wrong&lt;/em&gt; but not exactly what.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A mature LLM observability setup collects all three and links them:&lt;/strong&gt; a metric spike points you to a trace, a trace links to the logs of that specific exchange.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Warning:&lt;/strong&gt; Logging raw prompts and responses raises data privacy and compliance considerations. If users send PII, it ends up in your logs. Make sure you have a redaction or anonymization strategy before you log at full fidelity in production.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  5. Key LLM observability metrics
&lt;/h2&gt;

&lt;p&gt;These are the metrics that actually matter — not the generic list you'll find everywhere, but the ones that show up when something goes wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⏱️ Latency metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;TTFT&lt;/strong&gt; (Time To First Token)&lt;/td&gt;
&lt;td&gt;Latency before streaming starts&lt;/td&gt;
&lt;td&gt;User-perceived speed — low TTFT feels fast even if total is high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;TPS&lt;/strong&gt; (Tokens Per Second)&lt;/td&gt;
&lt;td&gt;Generation speed&lt;/td&gt;
&lt;td&gt;Degrades under load — track p50, p95, p99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End-to-end latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total request time including retrieval + generation&lt;/td&gt;
&lt;td&gt;What SLAs are measured against&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
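
&lt;p&gt;Measuring TTFT yourself takes a few lines with a streaming call. A sketch using the OpenAI Python SDK (the model name is just an example, and &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; is assumed to be set in your environment):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in your environment
start = time.perf_counter()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Ping"}],
    stream=True,
)
for chunk in stream:
    # The first chunk carrying content marks time-to-first-token
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.3f}s")
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;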

&lt;h3&gt;
  
  
  💸 Cost metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Input tokens/request&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prompt tokens per call&lt;/td&gt;
&lt;td&gt;Where cost bloat hides — long system prompts, noisy chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost per request&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Input+output tokens × model price&lt;/td&gt;
&lt;td&gt;Unit economics for your feature&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Daily token burn rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total tokens across all requests&lt;/td&gt;
&lt;td&gt;Set alerts here — a loop bug shows up here before your bill does&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
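
&lt;p&gt;To make the "cost per request" row concrete, here's a minimal sketch (the prices are illustrative, not authoritative; always check your provider's current rate card):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative USD prices per million tokens (verify against your provider)
PRICE_PER_M = {"gpt-4o": {"input": 5.00, "output": 15.00}}

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -&amp;gt; float:
    price = PRICE_PER_M[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000

# A 3,000-token RAG prompt with a 500-token answer:
print(f"${cost_per_request('gpt-4o', 3_000, 500):.4f}")  # $0.0225
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;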

&lt;h3&gt;
  
  
  🎯 Quality metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Faithfulness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Does answer stay grounded in retrieved context?&lt;/td&gt;
&lt;td&gt;Unfaithful answers are hallucinations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Relevance score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Is the answer relevant to what was asked?&lt;/td&gt;
&lt;td&gt;Factually correct but wrong-topic answers still fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User feedback rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Thumbs up/down, ratings, correction events&lt;/td&gt;
&lt;td&gt;Highest-signal quality metric — direct from users&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Note:&lt;/strong&gt; Quality metrics are the hardest to collect automatically. Start with user feedback signals (explicit) and retry/abandon rate (implicit). Then layer in automated evaluation once you have a baseline.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Best LLM observability tools (2026)
&lt;/h2&gt;

&lt;p&gt;Honest breakdown. I've tested all of these. No affiliate links, no vendor bias.&lt;/p&gt;

&lt;h3&gt;
  
  
  🦜 Langfuse — &lt;em&gt;Best open-source default&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Self-hostable, developer-first LLM tracing. Best OSS option if you want full data control and a clean SDK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-hostable via Docker (free)&lt;/li&gt;
&lt;li&gt;SDKs for Python, JS, LangChain, LlamaIndex&lt;/li&gt;
&lt;li&gt;Prompt management + version tracking&lt;/li&gt;
&lt;li&gt;Dataset + evaluation workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Most teams. Start here.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔥 Arize Phoenix — &lt;em&gt;Best for embedding analysis&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;ML observability platform with strong LLM support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenInference tracing standard&lt;/li&gt;
&lt;li&gt;Embedding drift &amp;amp; cluster visualization&lt;/li&gt;
&lt;li&gt;Built-in evals (hallucination, toxicity)&lt;/li&gt;
&lt;li&gt;Works fully offline / local&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams already using Arize for traditional ML monitoring.&lt;/p&gt;




&lt;h3&gt;
  
  
  ⚡ Helicone — &lt;em&gt;Fastest to set up&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Proxy-based approach — zero SDK changes. One header = instant logging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-line integration (proxy URL swap)&lt;/li&gt;
&lt;li&gt;Real-time cost dashboard&lt;/li&gt;
&lt;li&gt;Request caching (reduces cost)&lt;/li&gt;
&lt;li&gt;10k req/month free&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Cost tracking, teams that want zero implementation overhead.&lt;/p&gt;




&lt;h3&gt;
  
  
  🌊 W&amp;amp;B Weave — &lt;em&gt;Best if you're already on W&amp;amp;B&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Weights &amp;amp; Biases' LLM observability layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native W&amp;amp;B integration&lt;/li&gt;
&lt;li&gt;Automatic function tracing via decorator&lt;/li&gt;
&lt;li&gt;Evaluation pipelines built-in&lt;/li&gt;
&lt;li&gt;Free for individual use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams using W&amp;amp;B for experiment tracking.&lt;/p&gt;




&lt;h3&gt;
  
  
  📡 OpenTelemetry — &lt;em&gt;Most flexible, most work&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Vendor-neutral observability standard. Build your own pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vendor-neutral (ship to any backend)&lt;/li&gt;
&lt;li&gt;OpenLLMetry SDK for LLM spans&lt;/li&gt;
&lt;li&gt;Works with Jaeger, Tempo, Datadog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprise, multi-backend infrastructure.&lt;/p&gt;




&lt;h3&gt;
  
  
  🐕 Datadog LLM Observability — &lt;em&gt;Enterprise grade, enterprise price&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unified with existing Datadog APM&lt;/li&gt;
&lt;li&gt;Auto-instrumentation for OpenAI/Anthropic&lt;/li&gt;
&lt;li&gt;Cluster analysis for prompt patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Existing Datadog shops with budget.&lt;/p&gt;




&lt;h3&gt;
  
  
  Quick comparison table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Open Source&lt;/th&gt;
&lt;th&gt;Self-hostable&lt;/th&gt;
&lt;th&gt;RAG support&lt;/th&gt;
&lt;th&gt;Evals built-in&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Langfuse&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Most teams — best OSS default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Arize Phoenix&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Embedding analysis, ML teams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helicone&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Cost tracking, fastest setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;W&amp;amp;B Weave&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;W&amp;amp;B users, experiment correlation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenTelemetry&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Enterprise, multi-backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Datadog LLM Obs&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Existing Datadog shops&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;Recommendation:&lt;/strong&gt; Start with Langfuse. Open source, self-hostable with Docker in 5 minutes, clean Python SDK, covers 90% of what you need. Graduate to OpenTelemetry when you need unified tracing across complex multi-service infra.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  7. How to implement it in Python — step by step
&lt;/h2&gt;

&lt;p&gt;Enough theory. Here's how you actually do it. We'll use &lt;strong&gt;Langfuse&lt;/strong&gt; — the best open-source option — for the full flow from a simple LLM call to a RAG pipeline with spans, scores, and cost tracking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Set up Langfuse (self-hosted via Docker)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and start Langfuse locally&lt;/span&gt;
git clone https://github.com/langfuse/langfuse.git
&lt;span class="nb"&gt;cd &lt;/span&gt;langfuse
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Langfuse UI will be at http://localhost:3000&lt;/span&gt;
&lt;span class="c"&gt;# Create a project and grab your API keys&lt;/span&gt;

&lt;span class="c"&gt;# Install the Python SDK&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;langfuse openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Basic LLM call with full tracing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Langfuse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;  &lt;span class="c1"&gt;# drop-in replacement
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Init — reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST from env
&lt;/span&gt;&lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Langfuse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;public_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pk-lf-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-lf-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# or https://cloud.langfuse.com
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This single import swap gives you automatic tracing
# of every OpenAI call: prompt, response, tokens, latency, cost
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful MLOps assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is MLflow used for?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="c1"&gt;# Optional: tag this trace for filtering in the UI
&lt;/span&gt;    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlops-qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;u_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# All trace data is now visible in Langfuse UI — zero extra code needed
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The import swap is the key. &lt;code&gt;from langfuse.openai import openai&lt;/code&gt; patches the OpenAI client and captures everything automatically: token counts, cost, latency, the full prompt and response.&lt;/p&gt;
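
&lt;p&gt;Quality signals ride on the same traces. A hedged sketch of attaching a user thumbs-up as a score, reusing the &lt;code&gt;langfuse&lt;/code&gt; client from above (this is the v2-style SDK call; the exact signature varies across Langfuse versions, so treat it as an assumption to verify):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Attach user feedback to a specific trace (v2-style API; verify against
# your installed Langfuse SDK version)
langfuse.score(
    trace_id="trace-id-from-your-app",  # hypothetical trace ID
    name="user-feedback",
    value=1,  # 1 = thumbs up, 0 = thumbs down
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;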

&lt;h3&gt;
  
  
  Step 3: Custom spans for multi-step pipelines
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Langfuse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;langfuse_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observe&lt;/span&gt;

&lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Langfuse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# @observe creates a span for this function automatically
&lt;/span&gt;&lt;span class="nd"&gt;@observe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Simulated vector store retrieval&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# In production: call your Chroma / Pinecone / Weaviate here
&lt;/span&gt;    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLflow is an open source platform for ML lifecycle management...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLflow Tracking logs parameters, metrics, and artifacts...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.87&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Log retrieval metadata to the span
&lt;/span&gt;    &lt;span class="n"&gt;langfuse_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_current_observation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;

&lt;span class="nd"&gt;@observe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Assemble the final prompt from query + retrieved context&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer using only the context below.

Context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@observe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# The root trace — wraps the whole pipeline
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: retrieve — traced as a child span
&lt;/span&gt;    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: build prompt — traced as a child span
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: generate — traced via patched OpenAI client
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: score the output quality (0-1 scale)
&lt;/span&gt;    &lt;span class="n"&gt;langfuse_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score_current_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer_quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# replace with your eval logic
&lt;/span&gt;        &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Auto-scored: retrieval found relevant chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;

&lt;span class="c1"&gt;# Run it
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rag_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is MLflow used for?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Flush traces before script exits
&lt;/span&gt;&lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Automated quality scoring (LLM-as-judge)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;raw_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# unpatched — don't trace the judge calls
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_faithfulness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    LLM-as-judge: score whether the answer is faithful to the retrieved context.
    Returns a score from 0.0 (hallucination) to 1.0 (fully grounded).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;judge_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are evaluating an AI assistant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s answer for faithfulness.

RETRIEVED CONTEXT:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

QUESTION: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

ANSWER: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Task: Score whether the answer is ONLY based on the retrieved context (not hallucinated).
Respond with JSON only: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 0.0-1.0, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brief explanation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}
0.0 = completely hallucinated | 1.0 = fully grounded in context&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;judge_prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Post the score back to Langfuse for any trace
&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_faithfulness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-trace-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# from langfuse_context.get_current_trace_id()
&lt;/span&gt;    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;faithfulness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Capture user feedback signals
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Langfuse&lt;/span&gt;

&lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Langfuse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_user_feedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thumbs_up&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Record user feedback against the trace that generated the response&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_feedback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;thumbs_up&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Example: in your FastAPI endpoint
# @app.post("/feedback")
# async def feedback(trace_id: str, positive: bool, comment: str = None):
#     handle_user_feedback(trace_id, positive, comment)
#     return {"status": "recorded"}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this implementation, your Langfuse dashboard shows every trace, its constituent spans (retrieval → prompt build → generation), token counts, latency by step, faithfulness scores, and user feedback — all correlated.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;Pro tip:&lt;/strong&gt; Get the current &lt;code&gt;trace_id&lt;/code&gt; inside any &lt;code&gt;@observe&lt;/code&gt;-decorated function with &lt;code&gt;langfuse_context.get_current_trace_id()&lt;/code&gt;. Store this in your response payload so you can link user feedback back to the exact trace.&lt;/p&gt;
&lt;/blockquote&gt;
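
&lt;p&gt;A minimal sketch of that pattern (the wrapper function name here is illustrative; &lt;code&gt;rag_answer&lt;/code&gt; is the pipeline from Step 3):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langfuse.decorators import langfuse_context, observe

@observe()
def answer_endpoint(query: str) -&gt; dict:
    answer = rag_answer(query)  # the Step 3 pipeline, traced as a child span
    # Capture the root trace id while still inside the @observe context
    trace_id = langfuse_context.get_current_trace_id()
    # Ship it with the response so user feedback can be linked back later
    return {"answer": answer, "trace_id": trace_id}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;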




&lt;h2&gt;
  
  
  8. RAG observability: what's different
&lt;/h2&gt;

&lt;p&gt;RAG pipelines have unique failure modes that generic LLM observability doesn't capture.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG-specific metrics to track
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Good range&lt;/th&gt;
&lt;th&gt;Bad signal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context precision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Are retrieved chunks actually relevant?&lt;/td&gt;
&lt;td&gt;&amp;gt; 0.8&lt;/td&gt;
&lt;td&gt;Low → noisy retrieval, poor embedding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context recall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Did retrieval find all needed chunks?&lt;/td&gt;
&lt;td&gt;&amp;gt; 0.75&lt;/td&gt;
&lt;td&gt;Low → answer is incomplete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Faithfulness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Is the answer grounded in context?&lt;/td&gt;
&lt;td&gt;&amp;gt; 0.85&lt;/td&gt;
&lt;td&gt;Low → hallucination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Answer relevance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Does the answer address what was asked?&lt;/td&gt;
&lt;td&gt;&amp;gt; 0.8&lt;/td&gt;
&lt;td&gt;Low → model answering wrong question&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retrieval latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time spent in vector search&lt;/td&gt;
&lt;td&gt;&amp;lt; 200ms&lt;/td&gt;
&lt;td&gt;High → index needs optimization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chunk token count&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Avg tokens per retrieved chunk&lt;/td&gt;
&lt;td&gt;200–600&lt;/td&gt;
&lt;td&gt;Too high → inflated cost, diluted signal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The RAG failure nobody talks about: context stuffing
&lt;/h3&gt;

&lt;p&gt;The most common undetected RAG failure: retrieval returns chunks that look semantically similar to the query but &lt;strong&gt;don't contain the actual answer&lt;/strong&gt;. The model then either hallucinates or returns a plausible-sounding non-answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context precision catches this.&lt;/strong&gt; Track it per query, and set an alert if it drops below 0.6 for more than 5% of requests.&lt;/p&gt;
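
&lt;p&gt;A rolling-window version of that alert might look like this sketch (the window size and helper function are illustrative; the thresholds mirror the ones above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import deque

precision_window = deque(maxlen=1000)  # context precision of the last 1,000 requests

def precision_alert(score: float) -&gt; bool:
    """True when more than 5% of recent requests scored below 0.6."""
    precision_window.append(score)
    bad_ratio = sum(s &lt; 0.6 for s in precision_window) / len(precision_window)
    return bad_ratio &gt; 0.05
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;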

&lt;h3&gt;
  
  
  Measuring RAG quality with RAGAS
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ragas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ragas.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;faithfulness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;answer_relevancy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context_precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context_recall&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;

&lt;span class="c1"&gt;# Collect your RAG pipeline outputs
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is MLflow used for?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLflow is used for experiment tracking...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contexts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLflow is an open source platform...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLflow Tracking logs...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ground_truth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLflow manages the ML lifecycle including tracking...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run RAGAS evaluation — gives you all 4 RAG metrics at once
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;faithfulness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer_relevancy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_recall&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.94, 'context_recall': 0.81}
&lt;/span&gt;
&lt;span class="c1"&gt;# Then post these scores to Langfuse for the corresponding trace
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
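
&lt;p&gt;To close that loop, each metric in the &lt;code&gt;result&lt;/code&gt; object above can be posted back as a Langfuse score. A minimal sketch, assuming you captured the trace id when the pipeline ran (the &lt;code&gt;trace_id&lt;/code&gt; value is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langfuse import Langfuse

langfuse = Langfuse()
trace_id = "your-trace-id"  # from langfuse_context.get_current_trace_id()

# RAGAS results are dict-like, so each metric maps to one score
for name in ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]:
    langfuse.score(trace_id=trace_id, name=name, value=float(result[name]))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;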






&lt;h2&gt;
  
  
  9. Common mistakes to avoid
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Logging everything with no retention policy
&lt;/h3&gt;

&lt;p&gt;Storing every raw prompt and response forever will balloon your storage costs. Set a 30–90 day retention window. Sample high-volume low-value traces (e.g., 1 in 10 for healthy routine calls), and keep 100% of error traces and scored traces.&lt;/p&gt;
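
&lt;p&gt;A sampling gate along those lines, as a sketch (the function and rates are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

def should_keep_trace(is_error: bool, has_score: bool, sample_rate: float = 0.1) -&gt; bool:
    """Keep 100% of error and scored traces; sample the routine rest at 1 in 10."""
    if is_error or has_score:
        return True
    return random.random() &lt; sample_rate
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;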

&lt;h3&gt;
  
  
  ❌ Treating latency as the only quality signal
&lt;/h3&gt;

&lt;p&gt;Fast bad answers are worse than slow good ones. Build quality metrics from day one — even if it's just a user thumbs-up/down. Don't let "it's fast" become your proxy for "it's working."&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ Adding observability as an afterthought
&lt;/h3&gt;

&lt;p&gt;If you retrofit tracing into a production system with no span structure, you'll get a flat blob of logs with no actionable signal. Instrument at the architecture level — define your spans (retrieval, generation, eval) from the first prototype.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ Not separating judge calls from production traces
&lt;/h3&gt;

&lt;p&gt;If you're using an LLM to evaluate your LLM's outputs, those evaluation calls &lt;strong&gt;must&lt;/strong&gt; use an unpatched client. Otherwise: recursive tracing, inflated token counts, meaningless cost data.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ Ignoring PII in logs
&lt;/h3&gt;

&lt;p&gt;Users will paste email addresses, names, and medical details into your LLM app. In production, run a PII redaction pass before writing traces to storage. If you handle EU users, this is a GDPR requirement, not an optional extra.&lt;/p&gt;
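
&lt;p&gt;As a rough starting point only, here is a regex pass for emails; a real deployment should use an NER-based scrubber such as Microsoft Presidio:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")

def redact_pii(text: str) -&gt; str:
    # Masks emails only; names and medical details need an NER-based pass
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

# Apply before logging, e.g. trace_input = redact_pii(user_message)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;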




&lt;h2&gt;
  
  
  10. FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What's the difference between LLM monitoring and LLM observability?
&lt;/h3&gt;

&lt;p&gt;Monitoring tracks predefined metrics (latency, error rate) and alerts when they cross thresholds.&lt;/p&gt;

&lt;p&gt;Observability is broader — it's the ability to ask arbitrary questions about your system's behavior from its outputs, including things you didn't anticipate when you set up the system.&lt;/p&gt;

&lt;p&gt;In practice: &lt;strong&gt;monitoring tells you &lt;em&gt;something is wrong&lt;/em&gt;; observability helps you figure out &lt;em&gt;what went wrong and why&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use Prometheus and Grafana for LLM observability?
&lt;/h3&gt;

&lt;p&gt;Yes, for system-level metrics (latency, throughput, error rate, token counts). Expose these via a &lt;code&gt;/metrics&lt;/code&gt; endpoint and scrape with Prometheus.&lt;/p&gt;

&lt;p&gt;But you'll still need a purpose-built tool like Langfuse or Phoenix for prompt/response tracing, RAG-specific metrics, and quality evaluation. Prometheus doesn't understand the semantic content of LLM outputs.&lt;/p&gt;
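
&lt;p&gt;For the system-level half, a minimal &lt;code&gt;prometheus_client&lt;/code&gt; sketch (the metric names and port are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from prometheus_client import Counter, Histogram, start_http_server

TOKENS = Counter("llm_tokens_total", "Tokens consumed", ["model", "kind"])
LATENCY = Histogram("llm_request_seconds", "LLM call latency", ["model"])

start_http_server(9100)  # serves /metrics on :9100 for Prometheus to scrape

# After each LLM call, assuming `usage` came back on the response object:
# TOKENS.labels(model="gpt-4o-mini", kind="prompt").inc(usage.prompt_tokens)
# TOKENS.labels(model="gpt-4o-mini", kind="completion").inc(usage.completion_tokens)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;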

&lt;h3&gt;
  
  
  How do you detect hallucinations automatically?
&lt;/h3&gt;

&lt;p&gt;Three main approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Faithfulness scoring&lt;/strong&gt; — use an LLM judge to check if the answer is grounded in retrieved context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assertion checks&lt;/strong&gt; — programmatic rules for your domain (e.g., "answer must not contain dates before 2020")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic similarity&lt;/strong&gt; — compare answer embedding to context embedding; low similarity suggests "off-context" generation (see the sketch below)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these are perfect. Start with LLM-as-judge faithfulness scoring combined with user feedback signals.&lt;/p&gt;
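
&lt;p&gt;A sketch of the semantic similarity approach using OpenAI embeddings (the model name and the 0.5 threshold are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from openai import OpenAI

client = OpenAI()

def answer_context_similarity(answer: str, context: str) -&gt; float:
    """Cosine similarity between answer and context; low values flag off-context output."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=[answer, context])
    a, c = (np.array(d.embedding) for d in resp.data)
    return float(a @ c / (np.linalg.norm(a) * np.linalg.norm(c)))

# e.g. flag for human review when similarity falls below 0.5 (tune on your data)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;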

&lt;h3&gt;
  
  
  Is LLM observability the same as MLOps?
&lt;/h3&gt;

&lt;p&gt;MLOps is the broader practice of operationalizing machine learning — including training pipelines, experiment tracking, model deployment, and monitoring.&lt;/p&gt;

&lt;p&gt;LLM observability is a narrower subset focused on monitoring LLM-powered applications in production. It overlaps with MLOps but uses different tooling: token costs, prompt management, and output quality evaluation rather than model drift detection and retraining pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the cheapest way to start?
&lt;/h3&gt;

&lt;p&gt;Self-host Langfuse via Docker (free). Use the Python SDK with the OpenAI import swap (5 lines of code). You'll have full tracing, token tracking, and a queryable UI for $0.&lt;/p&gt;

&lt;p&gt;Your only cost is the server running Langfuse — a $5/month DigitalOcean droplet is enough for early-stage projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does LLM observability work with open-source models (Llama, Mistral)?
&lt;/h3&gt;

&lt;p&gt;Yes. Langfuse and Phoenix work with any model via their generic SDK (you manually log inputs/outputs). For models served via vLLM or Ollama with an OpenAI-compatible API, the OpenAI import swap works directly.&lt;/p&gt;

&lt;p&gt;Token cost tracking requires manual calculation since open-source model servers don't report costs.&lt;/p&gt;
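
&lt;p&gt;A hypothetical cost estimator for a self-hosted model; the per-1K-token rates are placeholders you derive from your own GPU bill:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Placeholder rates: roughly (GPU cost per hour) / (tokens served per hour)
PRICE_PER_1K = {"prompt": 0.0002, "completion": 0.0002}

def estimate_cost_usd(prompt_tokens: int, completion_tokens: int) -&gt; float:
    return (prompt_tokens / 1000) * PRICE_PER_1K["prompt"] \
         + (completion_tokens / 1000) * PRICE_PER_1K["completion"]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;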




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;LLM observability isn't optional at production scale. The "it works in testing" mindset breaks fast when real users send unexpected inputs, retrieval quality degrades silently, or a token-hungry prompt pattern quietly inflates your inference bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stack to start with:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Langfuse&lt;/strong&gt; for tracing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAGAS&lt;/strong&gt; for RAG quality metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User feedback signals&lt;/strong&gt; for ground truth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination gives you 80% of what you need with maybe a day of implementation work.&lt;/p&gt;

&lt;p&gt;Don't build the perfect observability system before shipping. Instrument as you build. Add quality metrics when you have baseline data to compare against. The value compounds.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔗 &lt;strong&gt;Next step:&lt;/strong&gt; Set up Langfuse locally → instrument one LLM call → check the trace in the UI. That's the first 20 minutes. Everything else follows from having that first trace visible.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Related articles on MLOpsLab
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/ml-pipeline-tutorial-build-your-first-production-ml-pipeline-2026/" rel="noopener noreferrer"&gt;ML Pipeline Tutorial: Build Your First Production ML Pipeline (2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/model-drift-detection-tutorial-how-to-monitor-ml-models-in-production-2026/" rel="noopener noreferrer"&gt;Model Drift Detection: Monitor ML Models in Production (2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/mlops-roadmap-2026-how-to-become-an-ml-engineer-step-by-step/" rel="noopener noreferrer"&gt;MLOps Roadmap 2026: How to Become an ML Engineer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Dong, L., Lu, Q., &amp;amp; Zhu, L. (2024). AgentOps: Enabling Observability of LLM Agents. arXiv. &lt;a href="https://arxiv.org/abs/2411.05285" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2411.05285&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv. &lt;a href="https://arxiv.org/abs/2309.15217" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2309.15217&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Langfuse Documentation. &lt;a href="https://langfuse.com/docs" rel="noopener noreferrer"&gt;https://langfuse.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenTelemetry Semantic Conventions for LLM systems. &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/specs/semconv/gen-ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Vesely, K., &amp;amp; Lewis, M. (2024). Real-Time Monitoring and Diagnostics of ML Pipelines. Journal of Systems and Software, 185, 111136.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Written by Ayub Shah — ML Engineering student, MLOps enthusiast. Testing every tool so you don't have to. No sponsors, no affiliate links.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;→ More at &lt;a href="https://mlopslab.org" rel="noopener noreferrer"&gt;mlopslab.org&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>llm</category>
      <category>python</category>
    </item>
  </channel>
</rss>
