<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: OpenObserve</title>
    <description>The latest articles on DEV Community by OpenObserve (@openobserve).</description>
    <link>https://dev.to/openobserve</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F13191%2Fa8a5832e-60af-4ee0-a047-091276bd76be.png</url>
      <title>DEV Community: OpenObserve</title>
      <link>https://dev.to/openobserve</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/openobserve"/>
    <language>en</language>
    <item>
      <title>What's New: Terraform Support, Kubernetes and AWS Automation, Bring Your Own Bucket, and UX Improvements</title>
      <dc:creator>Sara</dc:creator>
      <pubDate>Tue, 19 May 2026 15:23:19 +0000</pubDate>
      <link>https://dev.to/openobserve/whats-new-terraform-support-kubernetes-and-aws-automation-bring-your-own-bucket-and-ux-341m</link>
      <guid>https://dev.to/openobserve/whats-new-terraform-support-kubernetes-and-aws-automation-bring-your-own-bucket-and-ux-341m</guid>
      <description>&lt;h1&gt;
  
  
  What's New in OpenObserve: Terraform Support, Kubernetes and AWS Automation, Bring Your Own Bucket, and UX Improvements
&lt;/h1&gt;

&lt;p&gt;OpenObserve has shipped three major updates that help engineering teams automate observability, keep full control over telemetry data, and troubleshoot incidents faster.&lt;/p&gt;

&lt;p&gt;In this release:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terraform support for managing OpenObserve deployments and resources as code&lt;/li&gt;
&lt;li&gt;Bring Your Own Bucket (BYOB) for Amazon S3 and Azure Blob Storage&lt;/li&gt;
&lt;li&gt;UX and UI improvements for logs, distributed tracing, and root cause analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you run observability on Kubernetes, AWS, Azure, or other cloud environments, these updates simplify deployment, improve governance, and streamline day-to-day troubleshooting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraform Support for Observability as Code
&lt;/h2&gt;

&lt;p&gt;OpenObserve now includes a Terraform provider that lets you manage observability resources using infrastructure as code.&lt;/p&gt;

&lt;p&gt;Supported resources include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streams&lt;/li&gt;
&lt;li&gt;Dashboards&lt;/li&gt;
&lt;li&gt;Users and organizations&lt;/li&gt;
&lt;li&gt;Retention policies&lt;/li&gt;
&lt;li&gt;Indexed fields&lt;/li&gt;
&lt;li&gt;Full-text search settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenObserve also provides a Kubernetes Terraform module that deploys the platform using the official Helm chart. The module supports both single-node environments and production high-availability deployments with PostgreSQL, NATS, S3, and Ingress.&lt;/p&gt;

&lt;p&gt;For AWS users, the module can optionally provision:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon VPC&lt;/li&gt;
&lt;li&gt;Amazon EKS&lt;/li&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;IAM roles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it possible to manage both the observability platform and its configuration through &lt;a href="https://na2.hubs.ly/H05zTH90" rel="noopener noreferrer"&gt;Terraform or OpenTofu&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bring Your Own Bucket (BYOB) for Amazon S3 and Azure Blob Storage
&lt;/h2&gt;

&lt;p&gt;Commercial OpenObserve Cloud customers can now connect their own Amazon S3 bucket or Azure Blob Storage container.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://na2.hubs.ly/H05zVJ40" rel="noopener noreferrer"&gt;Telemetry data remains in your cloud account&lt;/a&gt;, region, and security boundary, while OpenObserve continues to handle ingestion, compaction, and querying.&lt;/p&gt;

&lt;p&gt;Key benefits include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full ownership of logs, metrics, and traces&lt;/li&gt;
&lt;li&gt;Data residency and compliance control&lt;/li&gt;
&lt;li&gt;Better use of existing cloud storage commitments&lt;/li&gt;
&lt;li&gt;No storage lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  UX and UI Improvements for Logs and Distributed Tracing
&lt;/h2&gt;

&lt;p&gt;This release also includes several improvements to help engineers move from alert to root cause more quickly.&lt;/p&gt;

&lt;p&gt;Highlights include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service Catalog&lt;/li&gt;
&lt;li&gt;Span details directly in the flame graph&lt;/li&gt;
&lt;li&gt;Better default log columns&lt;/li&gt;
&lt;li&gt;Multi-stream log correlation&lt;/li&gt;
&lt;li&gt;Smarter View Logs filters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These changes reduce the number of clicks required to investigate incidents and correlate logs and traces. &lt;a href="https://na2.hubs.ly/H05zWzP0" rel="noopener noreferrer"&gt;Try it!&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Get all the details, features, and how-tos:
&lt;/h2&gt;

&lt;p&gt;This article is a summary of the latest OpenObserve release.&lt;/p&gt;

&lt;p&gt;For screenshots, implementation details, and links to the Terraform provider and Kubernetes module, &lt;a href="https://na2.hubs.ly/H05yVY80" rel="noopener noreferrer"&gt;read the full announcement&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>terraform</category>
      <category>devops</category>
      <category>automation</category>
    </item>
    <item>
      <title>How to Monitor OpenAI API Costs and Token Usage with OpenTelemetry</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Fri, 15 May 2026 13:02:47 +0000</pubDate>
      <link>https://dev.to/openobserve/how-to-monitor-openai-api-costs-and-token-usage-with-opentelemetry-4c5o</link>
      <guid>https://dev.to/openobserve/how-to-monitor-openai-api-costs-and-token-usage-with-opentelemetry-4c5o</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Capture &lt;code&gt;gen_ai.*&lt;/code&gt; semantic convention attributes on every OpenAI call: request model, input tokens, output tokens. Add &lt;code&gt;feature&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;, and &lt;code&gt;team&lt;/code&gt; on every span so you can break down cost by who and what is spending.&lt;/li&gt;
&lt;li&gt;Compute &lt;code&gt;gen_ai.usage.cost_usd&lt;/code&gt; from a pricing table you control and emit it as both a span attribute (for per-request drill-down) and a histogram metric (for aggregation and alerting).&lt;/li&gt;
&lt;li&gt;Alert on cost anomalies relative to your historical baseline, not just static budget thresholds. Retry loops and runaway agents show up as deviations before they ever cross a daily spend limit.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why OpenAI bills are impossible to predict without instrumentation
&lt;/h2&gt;

&lt;p&gt;Running an LLM app in production without instrumentation is a slow way to find out your margins are negative. Token consumption is non-obvious: a single user with a verbose system prompt and long chat history can cost 20x more per interaction than an average user. A bug in a retry loop can 10x your daily spend in an hour. A single new feature that adds RAG context to every call can double your input token count overnight.&lt;/p&gt;

&lt;p&gt;The OpenAI dashboard tells you what you spent yesterday. It does not tell you which feature, which user, which prompt template, or which model variant drove the spend. By the time you notice a cost spike in your billing dashboard, you have already paid for it.&lt;/p&gt;

&lt;p&gt;The fix is the same fix you use for any production system: emit structured telemetry at the point of the API call and make it queryable. OpenTelemetry gives you a vendor-neutral way to do this, and a growing set of GenAI-specific conventions means the fields you emit today will still be meaningful in two years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick start:&lt;/strong&gt; Jump to the Python setup or Node.js setup if you just need the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three signals you actually need to track
&lt;/h2&gt;

&lt;p&gt;For LLM cost monitoring, three signals carry almost all the value:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token usage&lt;/strong&gt; tells you how much capacity you consumed. Input tokens and output tokens, always separately, because they price differently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; is the dollar-denominated derivative of token usage. You compute it at emit time using a pricing table you control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; tells you how long users waited. For streaming endpoints, split this into time to first token and total duration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything else (error rate, finish reason, response model) is useful context for these three. Start with the three and add context as you need it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What OpenTelemetry's GenAI semantic conventions give you
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry has a dedicated set of semantic conventions for generative AI workloads, living under the &lt;code&gt;gen_ai.*&lt;/code&gt; namespace. The point of conventions is that the same attribute names work across providers and observability backends, so your queries do not break when you swap from OpenAI to Anthropic or from one backend to another.&lt;/p&gt;

&lt;p&gt;The attributes you will use most:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;What it holds&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.provider.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Provider name: &lt;code&gt;openai&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.request.model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Model requested by your code: &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;gpt-4o-mini&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.response.model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Model the provider actually used (can differ if provider routes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.operation.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;chat&lt;/code&gt;, &lt;code&gt;text_completion&lt;/code&gt;, &lt;code&gt;embeddings&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prompt tokens consumed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.usage.output_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Completion tokens generated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.request.temperature&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Temperature parameter (useful when debugging determinism)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.request.max_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Max tokens parameter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.response.finish_reasons&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Why the model stopped: &lt;code&gt;stop&lt;/code&gt;, &lt;code&gt;length&lt;/code&gt;, &lt;code&gt;content_filter&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One attribute worth noting: &lt;code&gt;gen_ai.system&lt;/code&gt; has been renamed to &lt;code&gt;gen_ai.provider.name&lt;/code&gt; in the current OTel GenAI spec. Most instrumentation libraries still emit &lt;code&gt;gen_ai.system&lt;/code&gt; today. Your backend should accept both until library adoption catches up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz42cw9al0h7kffykwgu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz42cw9al0h7kffykwgu7.png" alt="OpenTelemetry GenAI semantic convention attributes" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Instrumenting a Python app with the official OTel OpenAI SDK
&lt;/h2&gt;

&lt;p&gt;This guide uses &lt;code&gt;opentelemetry-instrumentation-openai-v2&lt;/code&gt;, the official OTel package maintained in &lt;code&gt;opentelemetry-python-contrib&lt;/code&gt;. It follows the GenAI semantic conventions closely and is the right choice for OpenAI instrumentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install the three packages
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;opentelemetry-distro
pip &lt;span class="nb"&gt;install &lt;/span&gt;opentelemetry-exporter-otlp
pip &lt;span class="nb"&gt;install &lt;/span&gt;opentelemetry-instrumentation-openai-v2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the bootstrap command once to install auto-instrumentation for any other libraries in your app (Flask, FastAPI, &lt;code&gt;requests&lt;/code&gt;, and so on):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;opentelemetry-bootstrap &lt;span class="nt"&gt;--action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Set the OTLP endpoint for OpenObserve
&lt;/h3&gt;

&lt;p&gt;Grab your OTLP HTTP endpoint and Authorization header from the OpenObserve UI under &lt;strong&gt;Data Sources -&amp;gt; Traces (OpenTelemetry) -&amp;gt; OTLP HTTP&lt;/strong&gt;. Set these environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_SERVICE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-llm-app
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://api.openobserve.ai/api/&amp;lt;your-org&amp;gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Authorization=Basic &amp;lt;your-auth-token&amp;gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http/protobuf
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are self-hosting OpenObserve, the endpoint is typically &lt;code&gt;http://localhost:5080/api/&amp;lt;your-org&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run with &lt;code&gt;opentelemetry-instrument&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Wrap your existing run command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;opentelemetry-instrument python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No code changes to &lt;code&gt;app.py&lt;/code&gt;. The OpenAI SDK is wrapped at import time, and every &lt;code&gt;chat.completions.create&lt;/code&gt; call emits a span with the &lt;code&gt;gen_ai.*&lt;/code&gt; attributes populated.&lt;/p&gt;

&lt;h3&gt;
  
  
  A minimal example app
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize observability in one sentence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Input tokens:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output tokens:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it with &lt;code&gt;opentelemetry-instrument python app.py&lt;/code&gt; and check the Traces tab in OpenObserve. You should see a span named &lt;code&gt;chat gpt-4o-mini&lt;/code&gt; with the token counts attached.&lt;/p&gt;

&lt;h3&gt;
  
  
  Capturing message content (and the privacy tradeoff)
&lt;/h3&gt;

&lt;p&gt;The instrumentation does not capture the prompt or completion text by default. To enable it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ships the full prompt and completion as log events. It is useful for debugging but has real privacy implications: you are now logging whatever your users typed, including anything they pasted in. If your app handles regulated data (health, finance, anything under GDPR or HIPAA), do not enable this globally. Enable it per-environment or per-feature flag, and scrub sensitive fields before the exporter sees them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr73dgxg0j9a2df25cwf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr73dgxg0j9a2df25cwf.png" alt="OpenObserve Traces view showing LLM spans" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Instrumenting a Node.js app
&lt;/h2&gt;

&lt;p&gt;For Node.js, the pattern is the same. Install the packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @opentelemetry/api &lt;span class="se"&gt;\&lt;/span&gt;
  @opentelemetry/sdk-node &lt;span class="se"&gt;\&lt;/span&gt;
  @opentelemetry/exporter-trace-otlp-http &lt;span class="se"&gt;\&lt;/span&gt;
  @opentelemetry/instrumentation-openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a &lt;code&gt;tracing.js&lt;/code&gt; bootstrap file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// tracing.js&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeSDK&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/sdk-node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OTLPTraceExporter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/exporter-trace-otlp-http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OpenAIInstrumentation&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/instrumentation-openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/resources&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeSDK&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;service.name&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-llm-app-node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;deployment.environment&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NODE_ENV&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;development&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;traceExporter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OTLPTraceExporter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/v1/traces`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;instrumentations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAIInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then preload it when you run your app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node &lt;span class="nt"&gt;--require&lt;/span&gt; ./tracing.js app.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same result: every OpenAI call produces a span in OpenObserve with the GenAI attributes populated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a cost calculation layer
&lt;/h2&gt;

&lt;p&gt;OpenAI's SDK gives you token counts. It does not give you dollars. You have to multiply tokens by a price, and that price changes. Build this as a small, updatable module.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing table as code
&lt;/h3&gt;

&lt;p&gt;Keep this in source control. Review it every quarter, or every time a provider announces a price change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pricing.py
# Prices in USD per 1 million tokens, as of April 2026.
# Verify against provider pricing pages before each release.
&lt;/span&gt;
&lt;span class="n"&gt;MODEL_PRICING&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.00&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.60&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;15.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;60.00&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o1-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;12.00&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return the estimated cost in USD for a single LLM call.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;pricing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MODEL_PRICING&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;pricing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Unknown model. Emit 0 and alert separately so you can add pricing.
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="n"&gt;input_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pricing&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;output_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pricing&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_cost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;output_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Emitting cost as a custom metric
&lt;/h3&gt;

&lt;p&gt;The official &lt;code&gt;-v2&lt;/code&gt; package does not emit cost, only tokens. Add cost yourself with a thin wrapper that runs after each call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tracked_llm.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pricing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calculate_cost&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm-cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;meter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_meter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm-cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cost_histogram&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.cost_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimated cost of a single LLM call in USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tracked_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anon&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.provider.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.request.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;elapsed_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;

        &lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens&lt;/span&gt;
        &lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_tokens&lt;/span&gt;
        &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Span attributes for per-request investigation
&lt;/span&gt;        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.cost_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.latency.duration_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elapsed_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.response.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Metric for aggregation
&lt;/span&gt;        &lt;span class="n"&gt;cost_histogram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.provider.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.request.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You now have cost on the span (for drill-down) and cost as a metric (for aggregation, alerting, and dashboards). Both are labeled with &lt;code&gt;feature&lt;/code&gt; so you can break them down later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Attributing cost to users, features, and teams
&lt;/h2&gt;

&lt;p&gt;This is the section most readers came for. Raw token counts do not answer "who is spending our money." Attribution does.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding attributes on every span
&lt;/h3&gt;

&lt;p&gt;Every LLM call should carry four attribution dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;feature&lt;/code&gt;: which product path triggered the call (&lt;code&gt;document_summary&lt;/code&gt;, &lt;code&gt;chat_reply&lt;/code&gt;, &lt;code&gt;rag_answer&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user_id&lt;/code&gt;: hashed user identifier for per-user rollups&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;team&lt;/code&gt;: which internal team or product area owns the feature&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;environment&lt;/code&gt;: &lt;code&gt;prod&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;, &lt;code&gt;dev&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wire them through as keyword arguments on your wrapper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tracked_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hashed_user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building the cost attribution dashboard
&lt;/h2&gt;

&lt;p&gt;A complete LLM cost dashboard covers two concerns: spend attribution and token efficiency. Organize it across two tabs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tab 1: LLM Cost Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Four single-stat tiles at the top give you the headline numbers at a glance: &lt;strong&gt;Total LLM Cost ($)&lt;/strong&gt;, &lt;strong&gt;Total Input Tokens&lt;/strong&gt;, &lt;strong&gt;Total Output Tokens&lt;/strong&gt;, and &lt;strong&gt;Total LLM Calls&lt;/strong&gt;. These are the first things you check when something looks off.&lt;/p&gt;

&lt;p&gt;Below the tiles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM Cost Over Time ($)&lt;/strong&gt;: bar chart over the selected time range. Reveals bursty spend patterns and days that are trending above baseline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost by Model&lt;/strong&gt;: pie chart, one slice per &lt;code&gt;gen_ai.request.model&lt;/code&gt;. Shows your model mix and whether a cheaper model is handling the bulk of traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input vs Output Cost Over Time ($)&lt;/strong&gt;: grouped bar chart with two series, &lt;code&gt;input_cost&lt;/code&gt; and &lt;code&gt;output_cost&lt;/code&gt;. Output tokens cost 3-4x more than input tokens on most models; this panel tells you which side is driving cost growth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Usage by Model&lt;/strong&gt;: grouped bar chart of &lt;code&gt;input_tokens&lt;/code&gt; and &lt;code&gt;output_tokens&lt;/code&gt; per model. Cross-reference this with Cost by Model to spot models that are expensive relative to their token volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Usage Over Time&lt;/strong&gt;: time series of token counts. Useful for capacity planning and catching prompt inflation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn78rhwv8i09shrgsoo27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn78rhwv8i09shrgsoo27.png" alt="LLM Cost Monitoring dashboard" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Alerting on cost anomalies and rate-limit errors
&lt;/h2&gt;

&lt;p&gt;Static budget thresholds are table stakes. The interesting failures are the ones that do not cross a static threshold until it is too late.&lt;/p&gt;

&lt;h3&gt;
  
  
  Threshold alerts vs anomaly detection
&lt;/h3&gt;

&lt;p&gt;A threshold alert fires when daily spend exceeds $500. It works for the blunt cases. It misses three common failure modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A retry loop that 3x's a specific feature's token usage in an hour. The daily threshold may still be fine by end of day, but you paid 3x for that hour.&lt;/li&gt;
&lt;li&gt;A prompt injection that triggers a long runaway completion on a single request, burning 100k output tokens in one call.&lt;/li&gt;
&lt;li&gt;Seasonal growth that quietly pushes baseline from $300/day to $600/day over a month, outpacing capacity plans.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Anomaly detection catches all three by comparing current behavior to historical baseline rather than to a fixed number.&lt;/p&gt;

&lt;h3&gt;
  
  
  A daily budget threshold
&lt;/h3&gt;

&lt;p&gt;Set this first. In OpenObserve, create an alert on the &lt;code&gt;gen_ai.usage.cost_usd&lt;/code&gt; metric:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trigger:&lt;/strong&gt; &lt;code&gt;SUM(gen_ai_usage_cost_usd)&lt;/code&gt; over &lt;code&gt;24h&lt;/code&gt; is greater than &lt;code&gt;500&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation frequency:&lt;/strong&gt; every 5 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Slack or PagerDuty, routed to the LLM-platform team&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  An anomaly-based alert for cost spikes
&lt;/h3&gt;

&lt;p&gt;This is more valuable. Create an anomaly alert on &lt;code&gt;gen_ai.usage.cost_usd&lt;/code&gt; grouped by &lt;code&gt;feature&lt;/code&gt;, with a training window of the last 14 days and a sensitivity tuned to catch 3x deviations. A retry loop in the &lt;code&gt;document_summary&lt;/code&gt; feature shows up in minutes, before it hits your daily threshold.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alert on rate-limit errors (HTTP 429)
&lt;/h3&gt;

&lt;p&gt;When OpenAI rate-limits you, downstream calls fail and retries pile up. Fire an alert when &lt;code&gt;gen_ai.response.error.type = rate_limit_exceeded&lt;/code&gt; exceeds a low threshold (say, 5 in 5 minutes). This usually surfaces a runaway loop before a cost anomaly does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reconciling estimated cost with the OpenAI billing API
&lt;/h2&gt;

&lt;p&gt;Your OTel-derived cost is an estimate. It is usually within a couple of percent, but it drifts from the real bill for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cached input tokens.&lt;/strong&gt; Repeat prompts are billed at a discount. Your naive pricing math assumes full price.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning tokens.&lt;/strong&gt; &lt;code&gt;o1&lt;/code&gt; and similar models emit internal reasoning tokens that count toward billing but may not appear in the standard &lt;code&gt;usage&lt;/code&gt; object.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch API discounts.&lt;/strong&gt; If you use the async batch endpoint, those requests are priced lower.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Reconcile monthly. Pull the OpenAI usage endpoint and compare total cost for the window against your OTel sum. If the drift is more than 5 percent, dig in and adjust your pricing table. This is the pattern production teams use: OTel for real-time signal, billing API for ground truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring time to first token for streaming
&lt;/h2&gt;

&lt;p&gt;For chat UIs, users feel time to first token (TTFT), not total duration. If you use streaming responses, capture it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_with_ttft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.provider.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.request.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.response.streaming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;ttft_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

        &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ttft_ms&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;ttft_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
                &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.latency.ttft_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttft_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;total_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.latency.duration_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can alert on TTFT regressions separately from total-duration regressions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production checklist
&lt;/h2&gt;

&lt;p&gt;Before shipping this to prod:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Retention policy set on your LLM telemetry stream&lt;/li&gt;
&lt;li&gt;✅ PII scrubbing pipeline in place if capturing message content&lt;/li&gt;
&lt;li&gt;✅ Sampling strategy decided (100% for LLM spans is usually fine)&lt;/li&gt;
&lt;li&gt;✅ Pricing table in source control with quarterly review reminder&lt;/li&gt;
&lt;li&gt;✅ Budget threshold alert and anomaly-based alert configured&lt;/li&gt;
&lt;li&gt;✅ Monthly reconciliation against OpenAI billing API scheduled&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Send your LLM telemetry to OpenObserve
&lt;/h2&gt;

&lt;p&gt;OpenObserve is an open-source observability platform that accepts standard OTLP over HTTP and gRPC. There is no proprietary SDK to adopt and no special instrumentation to learn. Point your OTLP exporter at OpenObserve Cloud or a self-hosted instance, and your LLM spans, logs, and metrics land in the same place as your infrastructure telemetry.&lt;/p&gt;

&lt;p&gt;If you want to see this working end to end, spin up a free account at &lt;a href="https://cloud.openobserve.ai/" rel="noopener noreferrer"&gt;OpenObserve Cloud&lt;/a&gt; or check out the &lt;a href="https://openobserve.ai/llm-observability/" rel="noopener noreferrer"&gt;LLM Observability overview&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/llm-monitoring-best-practices/" rel="noopener noreferrer"&gt;LLM monitoring best practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/ai-anomaly-detection-guide/" rel="noopener noreferrer"&gt;AI anomaly detection guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/llm-observability-tools/" rel="noopener noreferrer"&gt;Top open source LLM observability tools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/distributed-tracing-basics-to-beyond-guide/" rel="noopener noreferrer"&gt;Distributed tracing guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>opentelemetry</category>
      <category>llm</category>
      <category>observability</category>
      <category>openai</category>
    </item>
    <item>
      <title>I Built a Dashboard in 30 Seconds with AI</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Thu, 14 May 2026 10:17:29 +0000</pubDate>
      <link>https://dev.to/openobserve/i-built-a-dashboard-in-30-seconds-with-ai-500p</link>
      <guid>https://dev.to/openobserve/i-built-a-dashboard-in-30-seconds-with-ai-500p</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;It's 2 AM. An alert fires. Cart service is throwing errors. You've got five minutes before someone escalates.&lt;/p&gt;

&lt;p&gt;The runbook says: "Check the dashboard. Look at the logs." But which dashboard? What query? You're half-asleep, the alert description tells you nothing useful, and now you're supposed to write SQL from scratch while someone in Slack asks "any update?"&lt;/p&gt;

&lt;p&gt;Most of us have been there. And most runbooks were written by someone who never had to use them under pressure.&lt;/p&gt;

&lt;p&gt;What if you could just type: &lt;strong&gt;"cart is throwing errors. find the root cause."&lt;/strong&gt; and get a real answer?&lt;/p&gt;

&lt;p&gt;That's what I tested with the new AI Assistant in OpenObserve. Here's what happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  It's Not Anomaly Detection. It's Something Simpler.
&lt;/h2&gt;

&lt;p&gt;Most AI + observability discussions jump straight to anomaly detection or ML-powered forecasting. Those are interesting. But the thing that's actually changing how I work right now is simpler: an assistant embedded in the platform that lets me ask questions in plain English and get answers from my own production data.&lt;/p&gt;

&lt;p&gt;No SQL. No PromQL. Just describe what you want.&lt;/p&gt;

&lt;p&gt;I ran four real scenarios against live data from an otel-demo microservices app and a Kubernetes cluster. Here's how each one went.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. The Dashboard Request That Normally Kills Your Afternoon
&lt;/h3&gt;

&lt;p&gt;Someone from the business team asks for a dashboard. They don't know SQL. They don't know PromQL. They just want to see what's happening with nginx — request rate, how fast it's responding, how many errors.&lt;/p&gt;

&lt;p&gt;Normally this kills thirty minutes: finding the right log stream, writing queries, dragging panels, tweaking units.&lt;/p&gt;

&lt;p&gt;Instead, I typed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create a dashboard for my nginx logs showing request rate, latency percentiles, and 4xx vs 5xx errors.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thirty seconds later I had a production-ready dashboard. It picked the right log stream. It listed the relevant fields. It wrote the SQL queries. It chose appropriate visualizations — line chart for request rate, heatmap for latency distribution, stacked bar for status codes. These were real queries against actual data. Not a template.&lt;/p&gt;

&lt;p&gt;Here's what stuck with me: &lt;strong&gt;the person who asked for this could have done it themselves.&lt;/strong&gt; They don't need to know what a PromQL query looks like. They just describe what they want to see.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Same Thing, Different Domain: Infrastructure
&lt;/h3&gt;

&lt;p&gt;Application logs worked. But what about infrastructure?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;build a K8s host metrics dashboard showing CPU, memory, disk per node.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Completely different data source — Kubernetes metrics, not nginx logs. Same experience. The assistant figured out where the data lived, what metrics to pull, and how to visualize them.&lt;/p&gt;

&lt;p&gt;What impressed me was the panel design. Usage per node and cumulative across the cluster. Separate tabs for CPU, memory, and disk. It understood that "CPU per node" implies a time series grouped by host, not a single aggregate gauge. That's the kind of design decision a human SRE makes after looking at the data — and the assistant just did it.&lt;/p&gt;

&lt;p&gt;The assistant had enough context about the infrastructure to know what clusters were running and what hosts were connected. I didn't explain my setup. It already knew.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Proactive: Don't Wait Until Something Breaks
&lt;/h3&gt;

&lt;p&gt;Dashboards are great, but nobody wants to stare at them all day. I wanted to see if I could use the assistant proactively — scan everything, find problems before they escalate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;what's the health of the otel-demo right now? if anything is red, create an alert.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't asking for one dashboard or one service. It's saying: scan all services, tell me how we're doing, and if something looks off, lock in an alert so I'm covered.&lt;/p&gt;

&lt;p&gt;It checked error rates and latencies across every service. Found the ones running green, identified the ones that weren't. And for anything red — it created an alert. Right there. No configuration. No navigating to the alerts page.&lt;/p&gt;

&lt;p&gt;This is the kind of thing most teams only set up &lt;em&gt;after&lt;/em&gt; an incident, during the postmortem, when someone says "we should have caught this earlier." One sentence and you're covered before the page goes off.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Something's Actually Broken: Root Cause Analysis
&lt;/h3&gt;

&lt;p&gt;Now the real test. The cart service in the otel-demo app is throwing errors. Not a synthetic scenario — a real incident.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;otel-demo app cart is throwing errors. find the root cause.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happened next is worth breaking down step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It searched across &lt;strong&gt;both logs and traces&lt;/strong&gt; — not one or the other, both at once&lt;/li&gt;
&lt;li&gt;It looked for errors in the last six hours and found none&lt;/li&gt;
&lt;li&gt;It &lt;strong&gt;automatically widened the search window&lt;/strong&gt; — I didn't tell it to do that&lt;/li&gt;
&lt;li&gt;It identified the pattern: cart service failing on database writes under load&lt;/li&gt;
&lt;li&gt;It showed me the exact traces, the error distribution over time, and the specific downstream call that was failing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every step was visible. I could expand any tool call, see the exact query it ran, and verify the result. It's not a black box. It shows its work — and if I disagreed with where it was going, I could redirect it.&lt;/p&gt;

&lt;p&gt;Once I had the root cause, I stayed in the same conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alert me if cart error rate crosses 10 errors in 5 minutes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same context. Same conversation. Investigation to prevention in two sentences.&lt;/p&gt;

&lt;p&gt;That last part is what I keep coming back to. The assistant doesn't just help you find problems — it helps you lock in the fix so you don't get paged for the same thing at 3 AM next week.&lt;/p&gt;




&lt;h2&gt;
  
  
  Beyond the UI: Take It to Your IDE
&lt;/h2&gt;

&lt;p&gt;Here's the part that changes the workflow entirely. You don't have to be inside the OpenObserve UI to get this.&lt;/p&gt;

&lt;p&gt;OpenObserve exposes all of this through an MCP server. Connect your AI coding assistant (Claude Code, Cursor, whatever you use) directly to your production observability data. One command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add o2 https://api.openobserve.ai/api/default/mcp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; http &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &amp;lt;YOUR_TOKEN&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Under five minutes. Now your IDE can query production logs, metrics, and traces. Debug a deploy from your terminal. Pull up a trace without leaving your editor. Check error rates during a code review.&lt;/p&gt;

&lt;p&gt;The assistant follows you wherever you work — not just inside the observability platform.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Actually Changes
&lt;/h2&gt;

&lt;p&gt;There's been a lot of noise about AI in observability. Most of it falls into two camps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anomaly detection&lt;/strong&gt; — useful in theory, unpredictable in practice, hard to trust&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI replaces on-call&lt;/strong&gt; — not happening, and most engineers don't want it to&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thing that's working right now is neither of those. It's reducing the friction between &lt;em&gt;"something is wrong"&lt;/em&gt; and &lt;em&gt;"here's what I know."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not replacing your judgment. Not replacing your experience. Just removing the parts of incident response that feel like operating a query builder with one eye open at 2 AM.&lt;/p&gt;

&lt;p&gt;From &lt;em&gt;"I need to see what's happening"&lt;/em&gt; to &lt;em&gt;"I know what happened and we're covered next time"&lt;/em&gt; — in one conversation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/docs/integration/ai/mcp/claude/" rel="noopener noreferrer"&gt;OpenObserve MCP Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/webinars-videos/integration-with-ai-tools-a-step-by-step-guide-using-mcp/" rel="noopener noreferrer"&gt;Integration with AI Tools Using MCP — Workshop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/openobserve/openobserve" rel="noopener noreferrer"&gt;OpenObserve on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Have you tried connecting AI assistants to your observability stack? What's working? What's still painful? Drop a comment — I'm genuinely curious what others are seeing.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ai</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>OpenObserve Just Raised $10M and Launched Observability 3.0 with New AI Capabilities</title>
      <dc:creator>Sara</dc:creator>
      <pubDate>Wed, 29 Apr 2026 14:16:01 +0000</pubDate>
      <link>https://dev.to/openobserve/openobserve-just-raised-10m-and-launched-observability-30-with-new-ai-capabilities-3ibl</link>
      <guid>https://dev.to/openobserve/openobserve-just-raised-10m-and-launched-observability-30-with-new-ai-capabilities-3ibl</guid>
      <description>&lt;p&gt;Today we’re announcing two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A $10M Series A&lt;/li&gt;
&lt;li&gt;The launch of Observability 3.0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This funding accelerates a shift we’ve been building toward: Observability 3.0.&lt;/p&gt;

&lt;p&gt;Observability is breaking under AI-scale systems.&lt;br&gt;
More data. More tools. More noise.&lt;/p&gt;

&lt;p&gt;Most teams are still:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1. Stitching together 6 – 15 tools&lt;/li&gt;
&lt;li&gt;2. Sampling away critical data&lt;/li&gt;
&lt;li&gt;3. Debugging incidents manually&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That model doesn’t scale.&lt;/p&gt;

&lt;p&gt;So we built something different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://na2.hubs.ly/H059Cr60" rel="noopener noreferrer"&gt;Observability 3.0&lt;/a&gt;&lt;/strong&gt; is a shift from dashboards and alerts to systems that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Correlate data automatically&lt;/li&gt;
&lt;li&gt;Detect issues early&lt;/li&gt;
&lt;li&gt;Help resolve incidents without manual digging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI SRE (autonomous incident analysis)&lt;/li&gt;
&lt;li&gt;Anomaly detection (early warning signals)&lt;/li&gt;
&lt;li&gt;LLM observability (visibility into AI systems)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All in a single platform. No fragmentation. No forced tradeoffs.&lt;/p&gt;

&lt;p&gt;This is what the Series A is fueling.&lt;/p&gt;

&lt;p&gt;👉 Full story, vision, and what we’re building next: &lt;a href="https://na2.hubs.ly/H059Cq20" rel="noopener noreferrer"&gt;https://na2.hubs.ly/H059Cq20&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>news</category>
      <category>cloud</category>
      <category>observability</category>
    </item>
    <item>
      <title>AI Agent Monitoring: How to Observe Autonomous AI Agents in Production</title>
      <dc:creator>Simran Kumari</dc:creator>
      <pubDate>Fri, 10 Apr 2026 16:09:26 +0000</pubDate>
      <link>https://dev.to/openobserve/ai-agent-monitoring-how-to-observe-autonomous-ai-agents-in-production-39a8</link>
      <guid>https://dev.to/openobserve/ai-agent-monitoring-how-to-observe-autonomous-ai-agents-in-production-39a8</guid>
      <description>&lt;p&gt;&lt;strong&gt;AI agent monitoring&lt;/strong&gt; — also called &lt;strong&gt;LLM observability&lt;/strong&gt; — is the practice of collecting, analysing, and acting on telemetry data generated by LLM calls and the autonomous agents built on top of them. Think of it as traditional APM, but purpose-built for AI workloads.&lt;/p&gt;

&lt;p&gt;A modern AI agent is not a static API call. It's a dynamic, multi-step reasoning system that may:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plan and decompose subtasks autonomously&lt;/li&gt;
&lt;li&gt;Call external tools (web search, code execution, APIs)&lt;/li&gt;
&lt;li&gt;Retrieve documents via Retrieval-Augmented Generation (RAG)&lt;/li&gt;
&lt;li&gt;Spawn sub-agents for parallel task execution&lt;/li&gt;
&lt;li&gt;Loop and self-correct until a goal is satisfied&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of those steps is a potential point of failure, latency spike, or cost explosion. Just as DevOps engineers would never deploy a microservice without metrics, traces, and logs, MLOps and AI engineers need the same rigour for LLM-powered systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It Matters in Production
&lt;/h2&gt;

&lt;p&gt;The jump from a prototype that "works on my machine" to a reliable production AI agent is enormous. Here's what routinely breaks without proper monitoring:&lt;/p&gt;

&lt;h3&gt;
  
  
  🔴 Runaway Token Costs
&lt;/h3&gt;

&lt;p&gt;An unchecked agentic loop can consume millions of tokens before you notice. A single misbehaving agent session — stuck in a reasoning loop — can exhaust your entire daily token budget in minutes. Token-level telemetry gives you per-request cost visibility and the ability to set budget-based circuit breakers.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔴 Silent Latency Regressions
&lt;/h3&gt;

&lt;p&gt;A new model version, a longer system prompt, or a change in retrieval strategy can quietly double your agent's response time. Without distributed latency traces, you discover this from frustrated users — not from a proactive alert.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔴 Rate-Limit Cascade Failures
&lt;/h3&gt;

&lt;p&gt;LLM API rate limits hit unpredictably under production load. A single rate-limit event can trigger aggressive retries across multiple parallel agent sessions, cascading into a full outage.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔴 Degraded Output Quality
&lt;/h3&gt;

&lt;p&gt;Hallucinations, refusals, and incoherent responses increase as context windows grow or prompts drift. Span-level metadata correlating prompt structure with output quality lets you catch these regressions systematically.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔴 Multi-Step Reasoning Failures
&lt;/h3&gt;

&lt;p&gt;In agentic pipelines, a failure deep in a reasoning chain is nearly impossible to attribute without distributed tracing. Did the agent fail because the web search tool returned bad data, because the LLM misinterpreted the tool output, or because the context window overflowed? Traces answer this.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔴 Compliance &amp;amp; Audit Requirements
&lt;/h3&gt;

&lt;p&gt;Enterprise deployments increasingly require complete audit logs of what the agent decided, why, what data it accessed, and what actions it took.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Pillars of LLM Observability
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Distributed Tracing
&lt;/h3&gt;

&lt;p&gt;Every agent action — from receiving a user prompt to returning a final answer — is instrumented as a &lt;strong&gt;trace&lt;/strong&gt; composed of &lt;strong&gt;spans&lt;/strong&gt;. Each span captures a discrete unit of work: an LLM call, a tool invocation, a database retrieval, or a sub-agent call.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tracing answers: &lt;em&gt;"What happened, in what order, and how long did each step take?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  2. Metrics
&lt;/h3&gt;

&lt;p&gt;Aggregated numerical data over time — token counts, latency percentiles (p50/p95/p99), error rates, throughput, and cost per request. Metrics are cheap to store and fast to query, making them ideal for real-time dashboards and threshold-based alerting.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Structured Logs
&lt;/h3&gt;

&lt;p&gt;Rich, machine-readable event records attached to each agent action — prompt text, model parameters, completion content, tool call arguments, and exception stack traces. Unlike metrics, logs retain the full context needed for post-incident debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Evaluations (Evals)
&lt;/h3&gt;

&lt;p&gt;A layer unique to AI observability: automated or human-assisted scoring of agent outputs for correctness, safety, relevance, and faithfulness. Evals close the loop between operational telemetry and output quality.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro Tip:&lt;/strong&gt; For most teams starting out, &lt;strong&gt;distributed tracing&lt;/strong&gt; delivers the highest immediate value. It reveals exactly where latency and failures originate across multi-step agent pipelines — something neither metrics nor logs alone can show.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Key Metrics to Track
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What It Tells You&lt;/th&gt;
&lt;th&gt;Typical Alert Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.usage.prompt_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Input token consumption per request&lt;/td&gt;
&lt;td&gt;&amp;gt; 80% of model context window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.usage.completion_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Output token consumption per request&lt;/td&gt;
&lt;td&gt;Sudden spike &amp;gt; 2× baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.usage.total_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Combined cost proxy per call&lt;/td&gt;
&lt;td&gt;Daily cost budget exceeded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;duration&lt;/code&gt; (end-to-end)&lt;/td&gt;
&lt;td&gt;User-perceived latency&lt;/td&gt;
&lt;td&gt;p95 &amp;gt; 10s for interactive agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;error.rate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;% of requests that fail or timeout&lt;/td&gt;
&lt;td&gt;&amp;gt; 1% over a 5-minute window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tool_call.count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tool invocations per session&lt;/td&gt;
&lt;td&gt;&amp;gt; 20 per session (loop indicator)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agent.steps&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Depth of reasoning chain&lt;/td&gt;
&lt;td&gt;&amp;gt; configured max steps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.request.model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Which model was invoked&lt;/td&gt;
&lt;td&gt;Unexpected model fallback detected&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  OpenTelemetry: The Standard for AI Observability
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenTelemetry (OTel)&lt;/strong&gt; is the open-source observability framework that has become the industry standard for instrumenting distributed systems. For AI agents, it provides a vendor-neutral way to emit traces, metrics, and logs from any LLM call to any compatible backend — OpenObserve, Prometheus, Jaeger, Grafana, Datadog, and more.&lt;/p&gt;

&lt;p&gt;The ecosystem includes dedicated auto-instrumentation libraries for all major LLM providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;opentelemetry-instrumentation-openai&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;opentelemetry-instrumentation-anthropic&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;opentelemetry-instrumentation-langchain&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;opentelemetry-instrumentation-llama-index&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;opentelemetry-instrumentation-cohere&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These libraries wrap LLM client calls and automatically attach semantic attributes — token counts, model name, temperature, max tokens, error details — as span attributes, &lt;strong&gt;with no manual instrumentation required&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How OTel Spans Map to Agent Steps
&lt;/h3&gt;

&lt;p&gt;In an agentic pipeline, the OTel trace tree mirrors the agent's reasoning hierarchy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root trace] user-request
  └── [span] planner-llm-call
        └── [span] tool: web_search
        └── [span] tool: code_executor
              └── [span] sub-agent: summariser-llm-call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets you instantly see which step was the bottleneck or failure point in any given agent run.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up LLM Monitoring with OpenObserve
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://openobserve.ai/" rel="noopener noreferrer"&gt;OpenObserve&lt;/a&gt; is an open-source observability platform with a native OTLP endpoint — purpose-built for high-volume telemetry at significantly lower cost and resource footprint than alternatives like the Elastic Stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8+&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;uv&lt;/code&gt; package manager (or &lt;code&gt;pip&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;An OpenObserve account — &lt;a href="https://cloud.openobserve.ai/" rel="noopener noreferrer"&gt;cloud&lt;/a&gt; or &lt;a href="https://openobserve.ai/downloads/" rel="noopener noreferrer"&gt;self-hosted&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Your OpenObserve organisation ID and Base64-encoded auth token&lt;/li&gt;
&lt;li&gt;API key for your LLM provider (OpenAI, Anthropic, etc.)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Step 1: Configure Your Environment
&lt;/h3&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file in your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# OpenObserve instance URL
&lt;/span&gt;&lt;span class="py"&gt;OPENOBSERVE_URL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;https://api.openobserve.ai/&lt;/span&gt;

&lt;span class="c"&gt;# Your OpenObserve organisation slug or ID
&lt;/span&gt;&lt;span class="py"&gt;OPENOBSERVE_ORG&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;your_org_id&lt;/span&gt;

&lt;span class="c"&gt;# Basic auth token — Base64-encoded "email:password"
&lt;/span&gt;&lt;span class="py"&gt;OPENOBSERVE_AUTH_TOKEN&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Basic &amp;lt;your_base64_token&amp;gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Enable or disable tracing (default: true)
&lt;/span&gt;&lt;span class="py"&gt;OPENOBSERVE_ENABLED&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# LLM provider keys
&lt;/span&gt;&lt;span class="py"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"your-openai-key"&lt;/span&gt;
&lt;span class="py"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"your-anthropic-key"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 2: Install Dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using uv (recommended)&lt;/span&gt;
uv pip &lt;span class="nb"&gt;install &lt;/span&gt;openobserve-telemetry-sdk &lt;span class="se"&gt;\&lt;/span&gt;
               opentelemetry-instrumentation-openai &lt;span class="se"&gt;\&lt;/span&gt;
               opentelemetry-instrumentation-anthropic &lt;span class="se"&gt;\&lt;/span&gt;
               python-dotenv

&lt;span class="c"&gt;# Or with pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;openobserve-telemetry-sdk opentelemetry-instrumentation-openai python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 3: Instrument Your Application
&lt;/h3&gt;

&lt;h4&gt;
  
  
  OpenAI
&lt;/h4&gt;

&lt;p&gt;Add two lines &lt;strong&gt;before any LLM calls are made&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.instrumentation.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIInstrumentor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openobserve&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openobserve_init&lt;/span&gt;

&lt;span class="c1"&gt;# Instrument OpenAI and initialise the OpenObserve exporter
&lt;/span&gt;&lt;span class="nc"&gt;OpenAIInstrumentor&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;openobserve_init&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Use the client exactly as normal — traces are captured automatically
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarise this document...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Anthropic (Claude)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.instrumentation.anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnthropicInstrumentor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openobserve&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openobserve_init&lt;/span&gt;

&lt;span class="nc"&gt;AnthropicInstrumentor&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;openobserve_init&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyse this data...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every call is now captured as a trace span and exported to OpenObserve automatically.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The &lt;code&gt;openobserve-telemetry-sdk&lt;/code&gt; is an optional thin wrapper around the standard OTel Python SDK. If you already use OpenTelemetry, you can send telemetry directly to OpenObserve's OTLP endpoint without it.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Step 4: View Traces in OpenObserve
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Log in to your OpenObserve instance&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Traces&lt;/strong&gt; in the left sidebar&lt;/li&gt;
&lt;li&gt;Filter by service name, model name, or time range&lt;/li&gt;
&lt;li&gt;Click any span to inspect token counts, latency, parameters, and full request metadata&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What Gets Captured in Each Trace Span
&lt;/h2&gt;

&lt;p&gt;The OTel instrumentation libraries automatically attach the following attributes — no manual coding needed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;OTel Attribute&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.request.model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Model identifier&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gpt-4o&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.usage.prompt_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tokens in the prompt&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1,247&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.usage.completion_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tokens in the response&lt;/td&gt;
&lt;td&gt;&lt;code&gt;312&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.usage.total_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Combined token usage&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1,559&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.request.temperature&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sampling temperature&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.7&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.request.max_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Max response length&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2048&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;End-to-end request latency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2,340ms&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Exception details on failure&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RateLimitError: 429&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Adding Custom Span Attributes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent-task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usr_abc123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sess_xyz789&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task.type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document-summarisation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt.version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v2.3.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Your LLM call here — child spans are created automatically
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Unique Challenges in Agentic Systems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Non-Determinism
&lt;/h3&gt;

&lt;p&gt;Unlike traditional software, the same input to an agent may produce different execution paths on different runs. Your monitoring must capture the full trace of &lt;strong&gt;each individual run&lt;/strong&gt;, not just aggregated statistics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Long-Horizon Context Windows
&lt;/h3&gt;

&lt;p&gt;As agents maintain conversation history across multiple turns, context windows grow substantially. A single agent session can consume tens of thousands of tokens. Per-turn token tracking is essential.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nested and Parallel Tool Calls
&lt;/h3&gt;

&lt;p&gt;Modern agents call multiple tools — often in parallel. Distributed tracing with proper parent-child span relationships is the only reliable way to reconstruct the true execution timeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infinite Loop Detection
&lt;/h3&gt;

&lt;p&gt;Agents can get stuck in reasoning loops, repeatedly calling the same tool without making progress. Monitor &lt;code&gt;agent.steps&lt;/code&gt; and &lt;code&gt;tool_call.count&lt;/code&gt; per session, combined with a max-step circuit breaker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Agent Coordination
&lt;/h3&gt;

&lt;p&gt;Orchestrator-worker architectures require trace context propagation across agent boundaries. OpenTelemetry's W3C TraceContext standard enables this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.propagate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extract&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="c1"&gt;# Orchestrator: inject trace context into outgoing request headers
&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="nf"&gt;inject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# adds traceparent, tracestate headers
&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://worker-agent/execute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_payload&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Worker agent: extract and continue the trace
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incoming_request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;worker-task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Appears as child span in orchestrator's trace
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Critical:&lt;/strong&gt; Always propagate the W3C &lt;code&gt;traceparent&lt;/code&gt; header when your orchestrator calls a worker agent. Without this, each agent's activity appears as a disconnected root trace — making end-to-end debugging nearly impossible.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Best Practices for AI Agent Monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ Instrument Early, Not After the Fact
&lt;/h3&gt;

&lt;p&gt;Add observability during development, not after incidents. Retrofitting into a complex agentic system leaves blind spots in the most critical execution paths.&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ Separate Evaluation Metrics from Operational Metrics
&lt;/h3&gt;

&lt;p&gt;Don't conflate system health (latency, error rate, tokens) with output quality (correctness, relevance, safety). Keep them in separate pipelines with separate alert policies.&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ Sample Intelligently, Not Uniformly
&lt;/h3&gt;

&lt;p&gt;Use &lt;strong&gt;head-based sampling&lt;/strong&gt; for normal traffic (e.g., 10%), but configure &lt;strong&gt;tail-based sampling&lt;/strong&gt; to capture 100% of failed or slow requests. Full fidelity where it matters most, without prohibitive storage costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ Mask Sensitive Data Before Export
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SpanProcessor&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SensitiveDataRedactor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SpanProcessor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;SENSITIVE_ATTRS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.prompts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user.email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SENSITIVE_ATTRS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[REDACTED]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ✅ Version Your Prompts
&lt;/h3&gt;

&lt;p&gt;Treat prompt templates as software artefacts with version identifiers. Attach &lt;code&gt;prompt.version: v2.3.1&lt;/code&gt; as a span attribute to compare performance across prompt versions — just like canary deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ Tag Every Trace with Business Context
&lt;/h3&gt;

&lt;p&gt;Add &lt;code&gt;user.id&lt;/code&gt;, &lt;code&gt;session.id&lt;/code&gt;, &lt;code&gt;agent.name&lt;/code&gt;, &lt;code&gt;task.type&lt;/code&gt;, and &lt;code&gt;feature.flag&lt;/code&gt; to every trace. These transform your observability data from an engineering artefact into a &lt;strong&gt;product intelligence asset&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ Build a Feedback Loop from Evals to Prompts
&lt;/h3&gt;

&lt;p&gt;Connect your evaluation pipeline back to your prompt management system. When evaluations detect a quality regression, it should automatically trigger a prompt review workflow — the AI equivalent of failing a CI/CD pipeline on test failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As autonomous AI agents take on consequential tasks — writing and executing code, managing business workflows, interacting with customers at scale — the organisations that invest in proper observability will have a decisive operational advantage: faster debugging cycles, lower costs, better output quality, and the confidence to scale reliably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenTelemetry + OpenObserve&lt;/strong&gt; gives you a vendor-neutral, open-source foundation that scales from a solo developer's project to an enterprise deployment, without lock-in or prohibitive cost at scale.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You cannot improve what you cannot measure. For AI agents, observability is the measurement layer that makes continuous improvement possible.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/docs/integration/llm-applications/" rel="noopener noreferrer"&gt;OpenObserve LLM Observability Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/docs/opentelemetry/openobserve-python-sdk/" rel="noopener noreferrer"&gt;OpenObserve Python SDK&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;OpenTelemetry Semantic Conventions for LLMs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/languages/python/automatic/" rel="noopener noreferrer"&gt;OpenTelemetry Python Auto-Instrumentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on the &lt;a href="https://openobserve.ai/blog/ai-agent-monitoring/" rel="noopener noreferrer"&gt;OpenObserve blog&lt;/a&gt; by Simran Kumari.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>observability</category>
      <category>opentelemetry</category>
      <category>llm</category>
    </item>
    <item>
      <title>Monitoring Java Microservices with OpenTelemetry and OpenObserve</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:14:39 +0000</pubDate>
      <link>https://dev.to/openobserve/monitoring-java-microservices-with-opentelemetry-and-openobserve-2d1</link>
      <guid>https://dev.to/openobserve/monitoring-java-microservices-with-opentelemetry-and-openobserve-2d1</guid>
      <description>&lt;p&gt;Monitoring microservices is hard.&lt;/p&gt;

&lt;p&gt;When a user request fans out across multiple services, each with its own database, logs, and failure modes, traditional monitoring tools often give you a fragmented picture. You can tell &lt;em&gt;something&lt;/em&gt; is slow, but not exactly &lt;em&gt;where&lt;/em&gt; or &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Distributed tracing solves this.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll implement distributed tracing for a Java Spring Boot microservices application using two open-source tools: &lt;strong&gt;OpenTelemetry&lt;/strong&gt; and &lt;strong&gt;OpenObserve&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If your stack includes other languages, check out these guides too:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/distributed-tracing-in-dotnet-application/" rel="noopener noreferrer"&gt;.NET&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/monitoring-go-with-opentelemetry/" rel="noopener noreferrer"&gt;Go&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/distributed-tracing-in-nodejs-with-opentelemetry/" rel="noopener noreferrer"&gt;Node.js&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What you'll build
&lt;/h2&gt;

&lt;p&gt;By the end of this guide, you'll have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A working Spring Boot microservices setup with cross-service HTTP calls&lt;/li&gt;
&lt;li&gt;Zero-code instrumentation using the OpenTelemetry Java Agent&lt;/li&gt;
&lt;li&gt;End-to-end traces in OpenObserve with flamegraph and Gantt chart views&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What is distributed tracing?
&lt;/h2&gt;

&lt;p&gt;In microservices, one user action can trigger a chain of calls across many services. If a request takes 3 seconds, tracing helps answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which service caused the delay?&lt;/li&gt;
&lt;li&gt;Which operation failed?&lt;/li&gt;
&lt;li&gt;Where exactly time was spent?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Distributed tracing works by attaching context (&lt;code&gt;trace_id&lt;/code&gt;, &lt;code&gt;span_id&lt;/code&gt;) at request entry and propagating it across service boundaries (usually with &lt;code&gt;traceparent&lt;/code&gt; headers). This gives you one complete request journey.&lt;/p&gt;

&lt;p&gt;A trace is made up of spans. Each span records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service + operation&lt;/li&gt;
&lt;li&gt;Start time + duration&lt;/li&gt;
&lt;li&gt;HTTP details (method, URL, status)&lt;/li&gt;
&lt;li&gt;DB query metadata&lt;/li&gt;
&lt;li&gt;Errors/exceptions&lt;/li&gt;
&lt;li&gt;Parent-child relationships&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77fyvps3sm2xpi4kqtj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77fyvps3sm2xpi4kqtj5.png" alt="Key components of distributed tracing in an e-commerce example" width="788" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For deeper fundamentals: &lt;a href="https://openobserve.ai/blog/distributed-tracing-basics-to-beyond-guide/" rel="noopener noreferrer"&gt;Distributed Tracing Basics to Beyond&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why OpenTelemetry + OpenObserve?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  OpenTelemetry
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; is a CNCF standard for traces, metrics, and logs.&lt;br&gt;&lt;br&gt;
For Java, the &lt;strong&gt;OpenTelemetry Java Agent&lt;/strong&gt; can auto-instrument Spring Boot, JDBC, and HTTP clients with no code changes.&lt;/p&gt;
&lt;h3&gt;
  
  
  OpenObserve
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://openobserve.ai/" rel="noopener noreferrer"&gt;OpenObserve&lt;/a&gt; is an open-source backend for logs, metrics, and traces.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OTLP-native ingest&lt;/li&gt;
&lt;li&gt;SQL-powered analytics&lt;/li&gt;
&lt;li&gt;Unified observability in one interface&lt;/li&gt;
&lt;li&gt;Lightweight and storage-efficient&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Architecture used in this tutorial
&lt;/h2&gt;

&lt;p&gt;We'll run four services:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;discovery-service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8761&lt;/td&gt;
&lt;td&gt;Eureka registry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user-service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8081&lt;/td&gt;
&lt;td&gt;User CRUD (MySQL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;order-service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8082&lt;/td&gt;
&lt;td&gt;Order management; calls &lt;code&gt;user-service&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;payment-service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8083&lt;/td&gt;
&lt;td&gt;Payment processing; calls &lt;code&gt;order-service&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key trace path is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;payment-service -&amp;gt; order-service -&amp;gt; user-service -&amp;gt; MySQL&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl10230zmt03dpjjnge17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl10230zmt03dpjjnge17.png" alt="Payment Processing Workflow" width="800" height="505"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Java 17+&lt;/li&gt;
&lt;li&gt;Maven 3.8+&lt;/li&gt;
&lt;li&gt;Docker + Docker Compose&lt;/li&gt;
&lt;li&gt;MySQL 8 (or use Dockerized MySQL from compose)&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Step 1: Clone the project
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/openobserve/java-distributed-tracing.git
&lt;span class="nb"&gt;cd &lt;/span&gt;java-distributed-tracing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Start OpenObserve and MySQL
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1n201ddvbbv2p4u6oos.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1n201ddvbbv2p4u6oos.png" alt="Docker Compose startup" width="800" height="95"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This starts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenObserve: &lt;code&gt;http://localhost:5080&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;MySQL: &lt;code&gt;localhost:3306&lt;/code&gt; (&lt;code&gt;tracingdb&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Login to OpenObserve with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Email: &lt;code&gt;admin@example.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Password: &lt;code&gt;Admin123!&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjktcbc95nhgy2yby0osz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjktcbc95nhgy2yby0osz.png" alt="OpenObserve dashboard" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 3: Download OpenTelemetry Java Agent
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;agents
curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; agents/opentelemetry-javaagent.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fx87t0a6oqn4pg7ofnb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fx87t0a6oqn4pg7ofnb.png" alt="Download OpenTelemetry Java Agent" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 4: Configure agent export to OpenObserve
&lt;/h2&gt;

&lt;p&gt;Example from &lt;code&gt;user-service/scripts/start.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_SERVICE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;user-service
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_RESOURCE_ATTRIBUTES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;service.name&lt;span class="o"&gt;=&lt;/span&gt;user-service,deployment.environment&lt;span class="o"&gt;=&lt;/span&gt;dev
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_TRACES_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_METRICS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;none
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_LOGS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;none
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_TRACES_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:5080/api/default/traces
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_TRACES_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http/protobuf
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_TRACES_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Authorization=Basic {token}"&lt;/span&gt;

java &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-Xms256m&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-Xmx512m&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-javaagent&lt;/span&gt;:../agents/opentelemetry-javaagent.jar &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-jar&lt;/span&gt; target/user-service-0.0.1-SNAPSHOT.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get &lt;code&gt;{token}&lt;/code&gt; from OpenObserve UI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flavcpola4g6acknmci64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flavcpola4g6acknmci64.png" alt="OpenTelemetry token location" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Start discovery-service
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;discovery-service
mvn clean &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-Dmaven&lt;/span&gt;.test.skip
sh scripts/start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open: &lt;code&gt;http://localhost:8761&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw8dxbyyfusxrafqvduk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw8dxbyyfusxrafqvduk.png" alt="Discovery service" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: Start user/order/payment services
&lt;/h2&gt;

&lt;p&gt;Run each in a separate terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;user-service
mvn clean &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-Dmaven&lt;/span&gt;.test.skip
sh scripts/start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;order-service
mvn clean &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-Dmaven&lt;/span&gt;.test.skip
sh scripts/start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;payment-service
mvn clean &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-Dmaven&lt;/span&gt;.test.skip
sh scripts/start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify registration in Eureka:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3btnshbhbcoc5wx25ozb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3btnshbhbcoc5wx25ozb.png" alt="Eureka dashboard" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 7: Generate traces
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) Create user
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8081/api/users &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "Priya Sharma",
    "email": "priya@example.com",
    "phone": "+91-9876543210"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq5neqphvupz3bidqdmu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq5neqphvupz3bidqdmu.png" alt="Create user API" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Create order
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8082/api/orders &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "userId": 1,
    "productName": "Mechanical Keyboard",
    "quantity": 1,
    "totalAmount": 4999.00
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl4100lzdbalcaux0hag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl4100lzdbalcaux0hag.png" alt="Create order API" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Process payment (full distributed trace)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8083/api/payments/process &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "userId": 1,
    "orderId": 1,
    "amount": 4999.00,
    "currency": "INR",
    "paymentMethod": "UPI"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cit2w42p9mohjrjz0uz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cit2w42p9mohjrjz0uz.png" alt="Process payment API" width="800" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4) Trigger an error trace
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8082/api/orders &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "userId": 9999,
    "productName": "Test Product",
    "quantity": 1,
    "totalAmount": 100.00
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected: &lt;code&gt;400 Bad Request&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm2rbdfrvh1qlf4x05x6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm2rbdfrvh1qlf4x05x6.png" alt="Error test API" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Visualize in OpenObserve
&lt;/h2&gt;

&lt;p&gt;Go to &lt;code&gt;http://localhost:5080&lt;/code&gt; -&amp;gt; &lt;strong&gt;Traces&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Trace Explorer
&lt;/h3&gt;

&lt;p&gt;You'll see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trace ID&lt;/li&gt;
&lt;li&gt;Root span&lt;/li&gt;
&lt;li&gt;Service&lt;/li&gt;
&lt;li&gt;Duration&lt;/li&gt;
&lt;li&gt;Span count&lt;/li&gt;
&lt;li&gt;Status&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ahns6faayter6s4tg1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ahns6faayter6s4tg1n.png" alt="Trace explorer" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Filter examples
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;service_name = payment-service&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;status = ERROR&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Duration range&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;operation_name&lt;/code&gt; for specific endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2th52axaqj1yl6pcav0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2th52axaqj1yl6pcav0n.png" alt="Filter by service" width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bxpyygudbc8715k29n7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bxpyygudbc8715k29n7.png" alt="Filter by error" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Flamegraph + Gantt chart
&lt;/h3&gt;

&lt;p&gt;Click a &lt;code&gt;POST /api/payments/process&lt;/code&gt; trace.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flamegraph&lt;/strong&gt;: nested span timing hierarchy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gantt&lt;/strong&gt;: timeline-aligned span bars&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qjfrrtg9p48v1bunh43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qjfrrtg9p48v1bunh43.png" alt="Flamegraph view" width="800" height="225"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ertp2jg6wiq45g4qx4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ertp2jg6wiq45g4qx4n.png" alt="Gantt chart view" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Query traces with SQL
&lt;/h2&gt;

&lt;p&gt;OpenObserve supports SQL over trace data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Slowest payment traces
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;operation_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'payment-service'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;operation_name&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%payments/process%'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffoqi5yexqdywd2x58953.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffoqi5yexqdywd2x58953.png" alt="Slow traces SQL" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Error count by service
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;error_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;span_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'ERROR'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;error_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F422qumf0xylbneza7n9m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F422qumf0xylbneza7n9m.png" alt="Error count SQL" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Avg/max latency by service
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_duration_us&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_duration_us&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;request_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ay426qextfca93vmhhs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ay426qextfca93vmhhs.png" alt="Latency SQL" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Java agent captured automatically
&lt;/h2&gt;

&lt;p&gt;Without adding tracing code, the OpenTelemetry Java Agent instrumented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spring Web incoming HTTP requests&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RestTemplate&lt;/code&gt; outbound calls (&lt;code&gt;traceparent&lt;/code&gt; injected)&lt;/li&gt;
&lt;li&gt;JDBC/MySQL queries&lt;/li&gt;
&lt;li&gt;Context propagation across service boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See supported libraries: &lt;a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md" rel="noopener noreferrer"&gt;OpenTelemetry Java Instrumentation&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final takeaway
&lt;/h2&gt;

&lt;p&gt;You now have end-to-end distributed tracing for a Java microservices app with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero-code instrumentation&lt;/li&gt;
&lt;li&gt;Full request path visibility&lt;/li&gt;
&lt;li&gt;Visual root-cause analysis (flamegraph/Gantt)&lt;/li&gt;
&lt;li&gt;SQL-based troubleshooting in OpenObserve&lt;/li&gt;
&lt;li&gt;A path to production scaling without vendor lock-in&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/distributed-tracing-in-dotnet-application/" rel="noopener noreferrer"&gt;Distributed Tracing in .NET Applications using OpenTelemetry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/monitoring-go-with-opentelemetry/" rel="noopener noreferrer"&gt;Distributed Tracing in Go Applications with OpenTelemetry and OpenObserve&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/distributed-tracing-in-nodejs-with-opentelemetry/" rel="noopener noreferrer"&gt;Distributed Tracing in Node.js Applications with OpenTelemetry&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>java</category>
      <category>springboot</category>
      <category>opentelemetry</category>
      <category>observability</category>
    </item>
    <item>
      <title>Best Open Source LLM Observability Tools in 2026: Complete Guide</title>
      <dc:creator>Simran Kumari</dc:creator>
      <pubDate>Wed, 25 Mar 2026 15:27:12 +0000</pubDate>
      <link>https://dev.to/openobserve/best-open-source-llm-observability-tools-in-2026-complete-guide-kn5</link>
      <guid>https://dev.to/openobserve/best-open-source-llm-observability-tools-in-2026-complete-guide-kn5</guid>
      <description>&lt;h2&gt;
  
  
  What Is LLM Observability?
&lt;/h2&gt;

&lt;p&gt;LLM observability is the practice of monitoring, tracing, and analyzing every layer of an AI application — from the prompt you send to the final response your model returns. As AI systems grow more complex, with multi-step agent workflows, retrieval-augmented generation (RAG) pipelines, and tool calls chained together, traditional logging falls short.&lt;/p&gt;

&lt;p&gt;The four core components of LLM observability are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tracing&lt;/strong&gt; — tracking the full lifecycle of a user interaction, including intermediate steps, model API calls, and tool invocations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation&lt;/strong&gt; — measuring output quality through automated metrics (relevance, faithfulness, toxicity) or human annotation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost &amp;amp; Usage Monitoring&lt;/strong&gt; — tracking token consumption, latency, and spend per model, user, or session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Management&lt;/strong&gt; — versioning, testing, and iterating on prompts without losing reproducibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these, teams are blind to quality regressions, prompt drift, hallucinations, and runaway API costs in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why LLM Observability Is Different from Traditional Monitoring
&lt;/h2&gt;

&lt;p&gt;Traditional observability tools like Grafana and Prometheus are excellent for infrastructure-level signals — CPU, memory, request rates, latency percentiles. But LLMs introduce an entirely new class of failure that metrics alone cannot detect:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Traditional Monitoring&lt;/th&gt;
&lt;th&gt;LLM Observability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tracks uptime, latency, error rates&lt;/td&gt;
&lt;td&gt;Tracks hallucinations, prompt quality, output relevance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alerts on crashes or timeouts&lt;/td&gt;
&lt;td&gt;Alerts on silent quality regressions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Measures infrastructure health&lt;/td&gt;
&lt;td&gt;Measures model behavior and output correctness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query languages: PromQL, SQL&lt;/td&gt;
&lt;td&gt;Evaluation frameworks: LLM-as-judge, semantic similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboards for SREs&lt;/td&gt;
&lt;td&gt;Dashboards for ML engineers and product teams&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What to Look for in an Open Source LLM Observability Tool
&lt;/h2&gt;

&lt;p&gt;A &lt;a href="https://dl.acm.org/doi/10.1145/3706598.3713913" rel="noopener noreferrer"&gt;CHI 2025 study with 30 developers&lt;/a&gt; identified four core design principles every solid LLM observability tool should satisfy:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Principle&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Awareness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Makes model behavior visible — you understand what is happening inside the system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time feedback during training and evaluation to catch issues early&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Intervention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enables you to act on problems as they surface, not after users report them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supports long-term maintainability as models and requirements evolve&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Beyond those principles, evaluate tools on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosting support&lt;/strong&gt; — critical for data residency and compliance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework integrations&lt;/strong&gt; — LangChain, LlamaIndex, OpenAI SDK, LiteLLM, Vercel AI SDK, Haystack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry compatibility&lt;/strong&gt; — avoids vendor lock-in and lets you route traces to any OTEL-compatible backend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation capabilities&lt;/strong&gt; — LLM-as-judge, human annotation, hallucination detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt management&lt;/strong&gt; — versioning and collaboration features for iterating on prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost tracking&lt;/strong&gt; — per-user, per-model, per-session breakdowns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified observability&lt;/strong&gt; — whether the tool also covers infrastructure so you don't need a second platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt; — MIT, Apache 2.0, and Elastic License 2.0 carry very different implications for commercial use&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Top Open Source LLM Observability Tools
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. OpenObserve
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; AGPL-3.0 | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://openobserve.ai/" rel="noopener noreferrer"&gt;openobserve.ai&lt;/a&gt; | &lt;strong&gt;Cloud:&lt;/strong&gt; &lt;a href="https://cloud.openobserve.ai/" rel="noopener noreferrer"&gt;cloud.openobserve.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenObserve is our top pick for 2026.&lt;/strong&gt; While most tools on this list specialize in LLM-specific concerns, OpenObserve unifies LLM observability with full infrastructure monitoring — logs, metrics, traces, and frontend (RUM) monitoring — in a single deployment. For teams tired of managing a separate DevOps telemetry stack alongside a dedicated LLM tool, OpenObserve eliminates that overhead entirely.&lt;/p&gt;

&lt;p&gt;Built on OpenTelemetry standards and using a Parquet/Vertex columnar format with aggressive compression, OpenObserve delivers &lt;strong&gt;140x lower storage costs&lt;/strong&gt; compared to traditional stacks like Prometheus + Loki + Tempo. Its SQL-based query interface means teams can correlate LLM trace data with infrastructure metrics without learning multiple proprietary query languages. With single binary deployment, you can be up and running in under 2 minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxgs69jmt69ojwxt3ynx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxgs69jmt69ojwxt3ynx.png" alt="LLM Observability in OpenObserve" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified platform&lt;/strong&gt; — logs, metrics, traces, LLM traces, and RUM monitoring in one tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry-native&lt;/strong&gt; — drop-in instrumentation for LLM applications using any OTEL SDK&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL-based queries&lt;/strong&gt; — correlate LLM trace data with infrastructure signals using familiar syntax&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;140x lower storage costs&lt;/strong&gt; — Parquet columnar format with aggressive compression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-cardinality support&lt;/strong&gt; — handles per-user, per-session, and per-request LLM telemetry without performance degradation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single binary deployment&lt;/strong&gt; — self-hosted in under 2 minutes; no Kubernetes expertise required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time alerting&lt;/strong&gt; — set alerts on token usage, latency spikes, error rates, and custom LLM metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich dashboards&lt;/strong&gt; — visualization for both infrastructure health and LLM operational metrics side by side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted or Cloud&lt;/strong&gt; — full data residency control with flexible deployment options&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only open source platform covering infrastructure observability AND LLM tracing in a single tool&lt;/li&gt;
&lt;li&gt;140x storage cost reduction makes it dramatically cheaper to retain long-term LLM trace history&lt;/li&gt;
&lt;li&gt;SQL querying lowers the learning curve — one language for both infrastructure and LLM queries&lt;/li&gt;
&lt;li&gt;Fully OpenTelemetry-native — no vendor lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM-specific features like LLM-as-judge evaluation and prompt management are handled through integrations rather than built-in modules&lt;/li&gt;
&lt;li&gt;Advanced LLM dashboard templates require manual configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open source (self-hosted): Free&lt;/li&gt;
&lt;li&gt;Cloud: Free tier available; usage-based pricing beyond that&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want a single open source platform covering both LLM observability and infrastructure monitoring, or organizations with strict self-hosting/data residency requirements.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Langfuse
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GitHub Stars:&lt;/strong&gt; 21,000+ | &lt;strong&gt;License:&lt;/strong&gt; MIT (core) | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://langfuse.com/" rel="noopener noreferrer"&gt;langfuse.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Langfuse is the most widely adopted open source LLM-specific observability platform. Originally from YCombinator W23, it was recently acquired by ClickHouse, signalling strong long-term investment in its data infrastructure. Its MIT-licensed core covers end-to-end tracing, prompt management, evaluation, and datasets — everything a production LLM team needs on the application layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xwdq6nso10y0z0xj9v8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xwdq6nso10y0z0xj9v8.png" alt="Langfuse" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;End-to-end tracing across LLM calls, retrieval steps, and agent actions with waterfall views&lt;/li&gt;
&lt;li&gt;Session replay to reconstruct complete conversation histories for debugging&lt;/li&gt;
&lt;li&gt;Prompt management with version control and live iteration without redeployment&lt;/li&gt;
&lt;li&gt;LLM-as-a-judge evaluation workflows for hallucination, toxicity, and relevance&lt;/li&gt;
&lt;li&gt;LLM Playground for testing prompts directly from a failed trace&lt;/li&gt;
&lt;li&gt;Native integrations: LangChain, LlamaIndex, OpenAI SDK, LiteLLM, Vercel AI SDK, Haystack, Mastra&lt;/li&gt;
&lt;li&gt;Self-host via Docker Compose in under 5 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strongest LLM-specific community adoption in the open source space&lt;/li&gt;
&lt;li&gt;Covers the full LLM development lifecycle — tracing, evals, datasets, prompt management&lt;/li&gt;
&lt;li&gt;Generous free tier on Langfuse Cloud (50k events/month, 2 users)&lt;/li&gt;
&lt;li&gt;True MIT license on core features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No built-in infrastructure monitoring — needs a separate platform for full-stack visibility&lt;/li&gt;
&lt;li&gt;Enterprise features (SSO, RBAC, advanced security) are separately licensed&lt;/li&gt;
&lt;li&gt;Cloud pricing can grow quickly at high event volumes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-hosted: Free&lt;/li&gt;
&lt;li&gt;Cloud: Free up to 50k events/month, then $29/month for 100k events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Engineering teams that want the deepest open source LLM-specific observability with prompt management and evaluation built in.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Arize Phoenix
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; Elastic License 2.0 (source-available) | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://phoenix.arize.com/" rel="noopener noreferrer"&gt;phoenix.arize.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Arize Phoenix is a source-available observability platform built specifically for LLM applications, RAG pipelines, and agent workflows. Built on OpenTelemetry standards, it includes built-in hallucination detection and embedding drift visualization, making it particularly powerful for teams iterating on retrieval pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mcq92bni4kt3j42zvqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mcq92bni4kt3j42zvqn.png" alt="Arize Phoenix" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;End-to-end tracing for prompts, responses, and agent workflows&lt;/li&gt;
&lt;li&gt;RAG observability — inspect retrieval results, chunk quality, and grounding&lt;/li&gt;
&lt;li&gt;Hallucination detection built in&lt;/li&gt;
&lt;li&gt;Embedding drift detection for monitoring distribution shifts over time&lt;/li&gt;
&lt;li&gt;OpenTelemetry-native export to OpenObserve, Datadog, Grafana, or any OTEL backend&lt;/li&gt;
&lt;li&gt;Supports Python and JavaScript&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purpose-built for RAG and agent debugging — best-in-class for retrieval pipeline visibility&lt;/li&gt;
&lt;li&gt;OTEL-native design eliminates vendor lock-in&lt;/li&gt;
&lt;li&gt;Rich visualizations for understanding embedding spaces and cluster drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Elastic License 2.0 restricts certain commercial uses (not true open source)&lt;/li&gt;
&lt;li&gt;Less mature prompt management than Langfuse&lt;/li&gt;
&lt;li&gt;No infrastructure monitoring — requires a separate backend&lt;/li&gt;
&lt;li&gt;Enterprise features require moving to Arize AI platform ($50/month+)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phoenix (open source): Free&lt;/li&gt;
&lt;li&gt;Arize AX Pro: $50/month; Enterprise: custom&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; AI engineering teams building RAG-based systems and agent workflows where deep retrieval pipeline visibility is critical.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. OpenLLMetry
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0 | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://www.traceloop.com/docs/openllmetry" rel="noopener noreferrer"&gt;openllmetry.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenLLMetry is the most vendor-neutral option on this list. An open source observability framework built purely on OpenTelemetry standards, it provides LLM instrumentation for Python and TypeScript with a single line of setup code. It then ships traces to any OTEL-compatible backend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk9kddqochvx4dkxur1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk9kddqochvx4dkxur1e.png" alt="OpenLLMetry" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-line setup for automatic instrumentation&lt;/li&gt;
&lt;li&gt;Supports OpenAI, Anthropic, Cohere, Azure OpenAI, Bedrock, Vertex AI, and more&lt;/li&gt;
&lt;li&gt;Framework support: LangChain, LlamaIndex, Haystack, CrewAI, and others&lt;/li&gt;
&lt;li&gt;Privacy controls for redacting sensitive prompts from traces&lt;/li&gt;
&lt;li&gt;Custom attributes for A/B testing and feature flag tracking&lt;/li&gt;
&lt;li&gt;Completely free — no licensing costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True vendor neutrality — switch backends without changing instrumentation code&lt;/li&gt;
&lt;li&gt;Widest framework and provider coverage on the list&lt;/li&gt;
&lt;li&gt;Fully Apache 2.0 licensed — safe for any commercial use&lt;/li&gt;
&lt;li&gt;Zero cost, zero lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instrumentation library only — requires a separate backend for storage, dashboards, and alerting&lt;/li&gt;
&lt;li&gt;No built-in evaluation, prompt management, or dashboards&lt;/li&gt;
&lt;li&gt;Requires more setup work to build a complete observability stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Completely free&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want vendor-neutral LLM instrumentation and already have an observability backend, or teams building a custom OpenTelemetry-native stack.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Comet Opik
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0 | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://www.comet.com/site/products/opik/" rel="noopener noreferrer"&gt;comet.com/site/products/opik&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Opik is an open source LLM observability and evaluation platform from Comet ML, focused on systematic testing, optimization, and production monitoring. It stands out for its automated prompt optimization — six algorithms including Few-shot Bayesian, evolutionary, and LLM-powered MetaPrompt approaches — which is rare in open source tooling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F574n68s79c95yvhpp370.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F574n68s79c95yvhpp370.png" alt="Comet Opik" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full tracing for LLM calls, agent steps, and RAG pipelines&lt;/li&gt;
&lt;li&gt;Automated prompt optimization (six algorithms built in)&lt;/li&gt;
&lt;li&gt;Built-in guardrails for PII filtering, off-topic detection, and competitor mention blocking&lt;/li&gt;
&lt;li&gt;Works with any LLM provider; native integrations for LangChain, LlamaIndex, OpenAI, Anthropic, Vertex AI&lt;/li&gt;
&lt;li&gt;60-day data retention on free hosted plan with unlimited team members&lt;/li&gt;
&lt;li&gt;Self-hostable with full features available in the codebase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated prompt optimization is a major differentiator&lt;/li&gt;
&lt;li&gt;Guardrails are built in, not bolted on&lt;/li&gt;
&lt;li&gt;Truly open source (Apache 2.0) with full feature access&lt;/li&gt;
&lt;li&gt;Unlimited team members on free tier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller community than Langfuse&lt;/li&gt;
&lt;li&gt;No infrastructure monitoring&lt;/li&gt;
&lt;li&gt;Some advanced analytics features are cloud-only&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free hosted: 25k spans/month, unlimited team members, 60-day retention&lt;/li&gt;
&lt;li&gt;Pro: $39/month for 100k spans&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want comprehensive observability with automated prompt optimization and guardrails built in.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Helicone
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; MIT | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://helicone.ai/" rel="noopener noreferrer"&gt;helicone.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Helicone takes a fundamentally different approach: it is a proxy-first observability platform. Rather than adding an SDK, you simply change your base URL to route traffic through Helicone — and it immediately logs every request, response, token count, cost, and error with zero code changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbimfxtlw3f4y4rmn4c8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbimfxtlw3f4y4rmn4c8m.png" alt="Helicone" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proxy-based setup — change one line of code (base URL), nothing else&lt;/li&gt;
&lt;li&gt;Works with 100+ models and any OpenAI-compatible endpoint&lt;/li&gt;
&lt;li&gt;Request caching to reduce latency and cost on repeated calls&lt;/li&gt;
&lt;li&gt;Intelligent request routing and automatic provider failover&lt;/li&gt;
&lt;li&gt;Rate limiting and usage controls to prevent runaway spend&lt;/li&gt;
&lt;li&gt;Cost tracking by model, user, and session&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fastest time-to-value — production observability in under 5 minutes&lt;/li&gt;
&lt;li&gt;No SDK to install or manage&lt;/li&gt;
&lt;li&gt;Caching and routing features go beyond pure observability&lt;/li&gt;
&lt;li&gt;MIT licensed and self-hostable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proxy architecture introduces a network hop&lt;/li&gt;
&lt;li&gt;Less suited for deep agent workflow tracing than Langfuse or Arize Phoenix&lt;/li&gt;
&lt;li&gt;No infrastructure monitoring&lt;/li&gt;
&lt;li&gt;Evaluation features are limited compared to dedicated eval platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hobby (free): 50k monthly logs&lt;/li&gt;
&lt;li&gt;Pro: $79/month&lt;/li&gt;
&lt;li&gt;Team: $799/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that need lightweight model-level observability and cost control with the absolute minimum setup friction.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. Lunary
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0 | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://lunary.ai/" rel="noopener noreferrer"&gt;lunary.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lunary is a lightweight open source observability platform optimized for RAG pipelines and chatbot applications. It offers SDKs for JavaScript (Node.js, Deno, Vercel Edge, Cloudflare Workers) and Python, with a setup time of roughly two minutes. Its Radar feature automatically categorizes LLM responses based on pre-defined criteria, making it easy to audit outputs at scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7v5b5e5czc3twnkwktf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7v5b5e5czc3twnkwktf.png" alt="Lunary" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specialized RAG tracing with embedding metrics and latency visualization&lt;/li&gt;
&lt;li&gt;Radar: rule-based categorization of LLM responses for downstream auditing&lt;/li&gt;
&lt;li&gt;SDKs for JavaScript environments including Vercel Edge and Cloudflare Workers&lt;/li&gt;
&lt;li&gt;Session-level tracing for chatbot conversations&lt;/li&gt;
&lt;li&gt;10k events/month free with 30-day retention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Best JavaScript/TypeScript support of any tool on this list&lt;/li&gt;
&lt;li&gt;Lightweight and fast to set up — under 2 minutes&lt;/li&gt;
&lt;li&gt;Purpose-built for RAG and chatbot use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Narrower feature set than Langfuse or OpenObserve&lt;/li&gt;
&lt;li&gt;Some advanced features require Enterprise licensing&lt;/li&gt;
&lt;li&gt;Smaller community and ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free tier: 10k events/month, 30-day retention&lt;/li&gt;
&lt;li&gt;Enterprise: Custom (includes self-hosting)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; JavaScript-first teams building RAG pipelines or chatbot applications who need quick observability setup.&lt;/p&gt;




&lt;h3&gt;
  
  
  8. TruLens
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; MIT | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://www.trulens.org/" rel="noopener noreferrer"&gt;trulens.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TruLens takes a qualitative-first approach to LLM observability, built around structured feedback functions that evaluate LLM responses after each call. It is particularly strong for teams using LlamaIndex and LangChain who want systematic evaluation pipelines rather than traditional tracing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxix8ggqa72fngjv4b0pl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxix8ggqa72fngjv4b0pl.png" alt="TruLens" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feedback functions that run automatically after each LLM call&lt;/li&gt;
&lt;li&gt;Pre-built evaluators for relevance, groundedness, and coherence&lt;/li&gt;
&lt;li&gt;RAG triad evaluation: answer relevance, context relevance, groundedness&lt;/li&gt;
&lt;li&gt;Deep integration with LlamaIndex and LangChain&lt;/li&gt;
&lt;li&gt;LLM-agnostic — supports any model as an evaluator&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Best-in-class for structured, systematic evaluation pipelines&lt;/li&gt;
&lt;li&gt;RAG triad evaluation is a well-regarded methodology for RAG quality assessment&lt;/li&gt;
&lt;li&gt;MIT licensed with no restrictions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python only — no JavaScript/TypeScript support&lt;/li&gt;
&lt;li&gt;Less focus on tracing and production monitoring&lt;/li&gt;
&lt;li&gt;Smaller community than Langfuse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free (MIT licensed)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Research teams and ML engineers who need rigorous, automated evaluation pipelines for RAG systems with Python-native tooling.&lt;/p&gt;




&lt;h3&gt;
  
  
  9. PostHog LLM Analytics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GitHub Stars:&lt;/strong&gt; 32,100+ | &lt;strong&gt;License:&lt;/strong&gt; MIT | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://posthog.com/docs/ai-engineering" rel="noopener noreferrer"&gt;posthog.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PostHog bundles LLM observability alongside product analytics, session replay, feature flags, A/B testing, and error tracking. For teams who want to understand not just how their LLM performs technically but how users actually interact with it, PostHog is uniquely positioned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmb41flykagr9xwec52y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmb41flykagr9xwec52y.png" alt="PostHog LLM Analytics" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM generation capture with cost, latency, and usage metrics&lt;/li&gt;
&lt;li&gt;Combines LLM data with product analytics — funnels, retention, and user behaviour&lt;/li&gt;
&lt;li&gt;Session replay for AI interactions — watch exactly what users experienced&lt;/li&gt;
&lt;li&gt;A/B testing for prompts using the same experiment framework as product features&lt;/li&gt;
&lt;li&gt;Prompt management (beta) with version control&lt;/li&gt;
&lt;li&gt;100k LLM observability events/month on free tier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only tool on this list that combines LLM observability with full product analytics&lt;/li&gt;
&lt;li&gt;Session replay for AI interactions is a uniquely powerful debugging tool&lt;/li&gt;
&lt;li&gt;Massive community (32k+ GitHub stars)&lt;/li&gt;
&lt;li&gt;Transparent, usage-based pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM-specific features (evaluation, RAG tracing) are less mature than dedicated tools&lt;/li&gt;
&lt;li&gt;No infrastructure monitoring&lt;/li&gt;
&lt;li&gt;Prompt management is still in beta&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free: 100k LLM events/month, 30-day retention&lt;/li&gt;
&lt;li&gt;Usage-based beyond that&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Product-led teams who want to combine LLM monitoring with user behaviour and product analytics in one platform.&lt;/p&gt;




&lt;h3&gt;
  
  
  10. Weave by Weights &amp;amp; Biases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0 | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://wandb.ai/site/weave" rel="noopener noreferrer"&gt;wandb.ai/site/weave&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Weave is the LLM observability product from Weights &amp;amp; Biases (W&amp;amp;B), extending W&amp;amp;B's ML experiment tracking into LLM application observability — covering tracing, evaluation, and dataset management in a unified interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw99ue1opxd9et0bey7j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw99ue1opxd9et0bey7j.png" alt="Weave by Weights &amp;amp; Biases" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;End-to-end tracing for LLM calls, chains, and agent workflows&lt;/li&gt;
&lt;li&gt;Dataset management with versioning for evaluation benchmarks&lt;/li&gt;
&lt;li&gt;Integration with W&amp;amp;B experiment tracking for model-level and application-level comparison&lt;/li&gt;
&lt;li&gt;Human annotation tools for labelling and review workflows&lt;/li&gt;
&lt;li&gt;Supports Python and JavaScript&lt;/li&gt;
&lt;li&gt;Model-agnostic — works with OpenAI, Anthropic, open source models, and custom endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Natural fit for teams already using W&amp;amp;B for model training and experiment tracking&lt;/li&gt;
&lt;li&gt;Strong dataset and evaluation management inherited from W&amp;amp;B's research-grade tooling&lt;/li&gt;
&lt;li&gt;Apache 2.0 license — commercially safe&lt;/li&gt;
&lt;li&gt;Bridges model development and production deployment in one workspace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less specialized for production LLM monitoring than Langfuse or OpenObserve&lt;/li&gt;
&lt;li&gt;Tightly coupled to the W&amp;amp;B ecosystem — less useful if you're not already a W&amp;amp;B user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free tier available via W&amp;amp;B&lt;/li&gt;
&lt;li&gt;Team and Enterprise plans: custom pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; ML research teams already invested in the W&amp;amp;B ecosystem who want to extend experiment tracking into production LLM observability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Self-Hosted&lt;/th&gt;
&lt;th&gt;Tracing&lt;/th&gt;
&lt;th&gt;Evaluation&lt;/th&gt;
&lt;th&gt;Prompt Mgmt&lt;/th&gt;
&lt;th&gt;Infra Monitoring&lt;/th&gt;
&lt;th&gt;RAG Support&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenObserve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AGPL-3.0&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Unified infra + LLM observability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Langfuse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT (core)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Full-lifecycle LLM observability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Arize Phoenix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ELv2&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;RAG and agent debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenLLMetry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Vendor-neutral instrumentation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Comet Opik&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Prompt optimization + observability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Helicone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;Lightweight proxy-based monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lunary&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;JavaScript RAG &amp;amp; chatbots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TruLens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Structured evaluation pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PostHog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;LLM + product analytics combined&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weave (W&amp;amp;B)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;ML research teams on W&amp;amp;B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;✅ = strong support, ⚠️ = partial or in beta, ❌ = not available&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Choose the Right Tool
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Start with your deployment requirement
&lt;/h3&gt;

&lt;p&gt;If your organization requires data residency or strict compliance, every tool on this list supports self-hosting. For the simplest self-hosted path, &lt;strong&gt;OpenObserve&lt;/strong&gt; stands out — single binary deployment in under 2 minutes, covering both infrastructure and LLM telemetry. For pure LLM-specific self-hosting, &lt;strong&gt;Langfuse&lt;/strong&gt; via Docker Compose takes about 5 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Match the tool to your primary bottleneck
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If your main problem is...&lt;/th&gt;
&lt;th&gt;Best tool(s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unified infra + LLM observability in one place&lt;/td&gt;
&lt;td&gt;OpenObserve&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging agent and chain failures&lt;/td&gt;
&lt;td&gt;OpenObserve, Langfuse, Arize Phoenix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG pipeline quality&lt;/td&gt;
&lt;td&gt;Arize Phoenix, TruLens, Lunary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt quality and optimization&lt;/td&gt;
&lt;td&gt;Comet Opik, Langfuse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost and token tracking&lt;/td&gt;
&lt;td&gt;Helicone, Langfuse, OpenObserve&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage cost at scale&lt;/td&gt;
&lt;td&gt;OpenObserve (140x compression)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor-neutral instrumentation&lt;/td&gt;
&lt;td&gt;OpenLLMetry → OpenObserve as backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JavaScript/Node.js first&lt;/td&gt;
&lt;td&gt;Lunary, PostHog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Product analytics + LLM&lt;/td&gt;
&lt;td&gt;PostHog&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Consider your framework dependencies
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain / LangGraph users:&lt;/strong&gt; Langfuse has the deepest native LLM-specific integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LlamaIndex users:&lt;/strong&gt; TruLens and Arize Phoenix have strong LlamaIndex support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI SDK / Anthropic SDK users:&lt;/strong&gt; All tools support this; Helicone is fastest to set up&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom stacks / framework agnostic:&lt;/strong&gt; OpenLLMetry → OpenObserve is the safest, most future-proof combination&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Think about the evaluation maturity you need
&lt;/h3&gt;

&lt;p&gt;In early development, basic tracing and cost monitoring (Helicone, Lunary) may be enough. As you move to production, evaluation becomes critical. &lt;strong&gt;Langfuse&lt;/strong&gt; and &lt;strong&gt;Arize Phoenix&lt;/strong&gt; lead for comprehensive evaluation workflows; &lt;strong&gt;TruLens&lt;/strong&gt; leads for structured RAG evaluation methodology.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Factor in long-term lock-in risk
&lt;/h3&gt;

&lt;p&gt;Tools built on OpenTelemetry standards — particularly &lt;strong&gt;OpenLLMetry&lt;/strong&gt;, &lt;strong&gt;Arize Phoenix&lt;/strong&gt;, and &lt;strong&gt;OpenObserve&lt;/strong&gt; — give you the most flexibility to change components without re-instrumenting your application.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is the best open source LLM observability tool in 2026?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenObserve is our top pick for 2026 — the only open source platform covering both LLM observability and infrastructure monitoring in a single deployment. For LLM-specific evaluation and prompt management on top, &lt;strong&gt;Langfuse&lt;/strong&gt; is the strongest companion. For RAG-specific debugging, &lt;strong&gt;Arize Phoenix&lt;/strong&gt; leads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use these tools with any LLM provider?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. All tools on this list support major providers including OpenAI, Anthropic, Cohere, Azure OpenAI, AWS Bedrock, Vertex AI, and most open source model endpoints. &lt;strong&gt;OpenLLMetry&lt;/strong&gt; and &lt;strong&gt;Helicone&lt;/strong&gt; have the broadest provider coverage (100+ models).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the difference between LLM tracing and LLM evaluation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tracing records &lt;em&gt;what happened&lt;/em&gt; — prompts sent, responses received, latencies, token counts, tool calls. Evaluation assesses &lt;em&gt;whether what happened was good&lt;/em&gt; — was the response accurate, relevant, grounded in retrieved context, free of hallucinations?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need a separate observability stack for infrastructure if I adopt one of these tools?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not if you choose &lt;strong&gt;OpenObserve&lt;/strong&gt;. It handles metrics, logs, distributed traces, and LLM telemetry in a single platform — replacing the need for separate tools like Prometheus, Loki, and Tempo. For all other tools on this list, you will need a separate infrastructure monitoring stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the easiest tool to set up?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helicone&lt;/strong&gt; wins on LLM-specific setup speed — one line of code (change your base URL) and you have immediate production observability. &lt;strong&gt;OpenObserve&lt;/strong&gt; wins on full-stack setup speed — single binary deployment in under 2 minutes covering both LLM and infrastructure telemetry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much does LLM observability cost at scale?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;OpenObserve&lt;/strong&gt; stands out most clearly. Its Parquet-based 140x compression technology dramatically reduces the cost of storing LLM traces, prompt histories, and operational metrics at scale — critical as LLM application volumes grow.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://openobserve.ai/blog/llm-observability-tools/" rel="noopener noreferrer"&gt;openobserve.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>observability</category>
      <category>opentelemetry</category>
    </item>
    <item>
      <title>Top Log Visualization Tools in 2026: Dashboards, Search &amp; AI-Assisted Analysis</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Tue, 17 Mar 2026 08:44:41 +0000</pubDate>
      <link>https://dev.to/openobserve/top-log-visualization-tools-in-2026-dashboards-search-ai-assisted-analysis-2g9</link>
      <guid>https://dev.to/openobserve/top-log-visualization-tools-in-2026-dashboards-search-ai-assisted-analysis-2g9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; The best log visualization tools in 2026 are &lt;strong&gt;OpenObserve&lt;/strong&gt;, Kibana (Elastic Stack), Grafana + Loki, Datadog Logs, and Splunk. OpenObserve stands out by combining traditional dashboards with a built-in AI assistant (&lt;strong&gt;O2 Assistant&lt;/strong&gt;) that lets you query, correlate, and visualize logs in plain English.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Separates Great Log Visualization from Basic Log Search?
&lt;/h2&gt;

&lt;p&gt;Most log tools can search. The best ones let you &lt;em&gt;understand&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;In 2026, the gap has widened between tools that simply dump raw text and those that provide a fast path from &lt;strong&gt;alert → root cause → fix&lt;/strong&gt;. The features that define the leaders today include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Saved Views &amp;amp; Search Templates&lt;/strong&gt; – Reuse complex filters without starting from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard Templating&lt;/strong&gt; – Parameterized views that scale across services and environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anomaly Detection&lt;/strong&gt; – Surfacing "unknown unknowns" without manual thresholds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep Drill-Down&lt;/strong&gt; – Moving from a high-level spike to specific log lines in one click.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-Assisted Analysis&lt;/strong&gt; – Using natural language to generate complex queries.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Best Log Visualization Tools in 2026
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;AI-Assisted Analysis&lt;/th&gt;
&lt;th&gt;Open Source&lt;/th&gt;
&lt;th&gt;Deployment&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenObserve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;O2 Assistant + MCP&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Self-hosted / Cloud&lt;/td&gt;
&lt;td&gt;Full-stack observability with AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kibana (Elastic)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Partial (ML add-on)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Self-hosted / Cloud&lt;/td&gt;
&lt;td&gt;Full-text search, complex pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grafana + Loki&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Partial (plugin)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Self-hosted / Cloud&lt;/td&gt;
&lt;td&gt;Prometheus-native teams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datadog Logs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Watchdog AI&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;SaaS&lt;/td&gt;
&lt;td&gt;Managed, all-in-one observability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Splunk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Splunk AI&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Self-hosted / Cloud&lt;/td&gt;
&lt;td&gt;Enterprise SIEM &amp;amp; security&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  1. OpenObserve — Best for AI-Assisted Log Visualization
&lt;/h2&gt;

&lt;p&gt;OpenObserve is the only tool where AI-assisted analysis is native, not bolted on. Its &lt;strong&gt;O2 Assistant&lt;/strong&gt; is a full observability co-pilot that understands your schema, queries, and infrastructure topology.&lt;/p&gt;

&lt;h3&gt;
  
  
  What makes O2 Assistant different?
&lt;/h3&gt;

&lt;p&gt;Traditional visualization requires you to know what to look for. With O2 Assistant, the workflow inverts: &lt;strong&gt;You describe the problem; the tool finds the evidence.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Show me error rate spikes in the payment service over the last 6 hours, correlated with any upstream database latency."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gmm9x86afugdgnemr4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gmm9x86afugdgnemr4o.png" alt="NLP mode for SQL queries with AI Assistant" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Capabilities
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Natural Language to Query:&lt;/strong&gt; Translates English into SQL, PromQL, or VRL scripts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Telemetry Correlation:&lt;/strong&gt; Query logs, metrics, and traces in the same conversation thread.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-Generated Dashboards:&lt;/strong&gt; Use the MCP (Model Context Protocol) server to build entire dashboards from a single prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ad-hoc Investigation:&lt;/strong&gt; Perfect for "2 AM incidents" where you don't have a pre-built dashboard ready.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Works with Your Existing Stack
&lt;/h3&gt;

&lt;p&gt;OpenObserve supports &lt;strong&gt;Fluent Bit, Vector, Logstash, Filebeat, and OpenTelemetry&lt;/strong&gt;. You can repoint your existing shippers and be up and running in minutes. It also features a built-in visual pipeline editor with over 100 VRL functions for real-time parsing and redaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo117ol034gyau7m5ecu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo117ol034gyau7m5ecu2.png" alt="Agent receivers ingestion flow into OpenObserve" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Kibana (Elastic Stack) — Best for Full-Text Search
&lt;/h2&gt;

&lt;p&gt;Kibana remains the gold standard for inverted-index search. Its &lt;strong&gt;Lens&lt;/strong&gt; visualization engine and &lt;strong&gt;Discover&lt;/strong&gt; view are incredibly mature.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt; High customizability, mature drag-and-drop editors, and powerful ML-driven anomaly detection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaknesses:&lt;/strong&gt; High resource consumption (RAM-hungry) and a steeper learning curve for KQL (Kibana Query Language) compared to natural language interfaces.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Grafana + Loki — Best for Prometheus-Native Teams
&lt;/h2&gt;

&lt;p&gt;For teams already deep in the Prometheus ecosystem, Grafana + Loki is the natural choice. It uses the same label model and UI you already know.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt; Unified dashboards for metrics, logs, and traces; excellent Kubernetes integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaknesses:&lt;/strong&gt; Loki only indexes labels, making full-text search over unstructured logs slower and more expensive than indexed alternatives.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  4. Datadog Logs — Best Managed Option
&lt;/h2&gt;

&lt;p&gt;Datadog offers the most polished "zero-ops" experience. Its &lt;strong&gt;Watchdog AI&lt;/strong&gt; surfaces anomalies automatically, and the integration between logs and distributed traces is seamless.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tradeoff:&lt;/strong&gt; Cost. As log volume grows, Datadog’s pricing often forces teams to sample or redact data aggressively to stay within budget.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Splunk — Best for Enterprise Security
&lt;/h2&gt;

&lt;p&gt;Splunk is the powerhouse of the SIEM world. If your log visualization needs are tied to forensic investigation and strict compliance, Splunk’s SPL (Search Processing Language) is unmatched. For standard app observability, however, it is often considered overengineered.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Shift: From Dashboards to Conversations
&lt;/h2&gt;

&lt;p&gt;The old way of observing involved building dashboards for "known" failure modes. But modern, distributed systems fail in "unknown" ways. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-assisted log analysis&lt;/strong&gt; changes the game by allowing exploratory investigation. When you can generate a correlated view across logs and metrics via a chat interface, the "Time to Resolution" (TTR) drops significantly. This is why OpenObserve’s native AI integration represents a fundamental shift in how we handle incidents in 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is the lowest-cost log tool?&lt;/strong&gt;&lt;br&gt;
OpenObserve typically offers the lowest storage costs (up to 140x lower than ELK) due to its S3-native architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does OpenObserve work with OpenTelemetry?&lt;/strong&gt;&lt;br&gt;
Yes, it is OTLP-native and supports logs, metrics, and traces via OpenTelemetry collectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I create dashboards using AI?&lt;/strong&gt;&lt;br&gt;
Yes. Using OpenObserve's AI assistant, you can generate complete dashboard panels from a simple text prompt.&lt;/p&gt;




&lt;h3&gt;
  
  
  Get Started
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://cloud.openobserve.ai" rel="noopener noreferrer"&gt;OpenObserve Cloud&lt;/a&gt;&lt;/strong&gt; — 14-day free trial, no credit card required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted&lt;/strong&gt; — Run it as a single binary or via Helm charts in under 10 minutes.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>observability</category>
      <category>logs</category>
      <category>ai</category>
    </item>
    <item>
      <title>Jaeger for Distributed Tracing: A Complete Guide with OpenObserve Comparison</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Fri, 13 Feb 2026 15:12:29 +0000</pubDate>
      <link>https://dev.to/openobserve/jaeger-for-distributed-tracing-a-complete-guide-with-openobserve-comparison-22ac</link>
      <guid>https://dev.to/openobserve/jaeger-for-distributed-tracing-a-complete-guide-with-openobserve-comparison-22ac</guid>
      <description>&lt;p&gt;As software systems evolve, they become increasingly complex, especially with the rise of microservices and distributed architectures. Keeping track of what's happening across different services can quickly become a daunting task. Tracing tools like Jaeger have emerged as essential solutions for debugging and monitoring distributed applications, helping developers understand and optimise their systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this blog, we will cover:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Pillars of Observability&lt;/li&gt;
&lt;li&gt;Background on Distributed Tracing&lt;/li&gt;
&lt;li&gt;What Is Jaeger?&lt;/li&gt;
&lt;li&gt;How Jaeger Works: Key Concepts and Components&lt;/li&gt;
&lt;li&gt;How Jaeger Collects and Visualizes Traces&lt;/li&gt;
&lt;li&gt;Getting Started with Jaeger&lt;/li&gt;
&lt;li&gt;Getting Started with OpenObserve&lt;/li&gt;
&lt;li&gt;Jaeger vs. OpenObserve&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;li&gt;Real-World Case Study: Jidu's Journey to 100% Tracing Fidelity&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Prerequisites:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A running Docker instance with admin access.&lt;/li&gt;
&lt;li&gt;An OpenObserve instance or cloud account ready to receive logs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Pillars of Observability
&lt;/h2&gt;

&lt;p&gt;To truly understand Jaeger, it's vital to grasp the concept of observability. Observability allows us to infer the internal states of systems through their outputs, and it primarily revolves around three pillars:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logging:&lt;/strong&gt; Capturing individual events or errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics:&lt;/strong&gt; Quantifying system performance and resource usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tracing:&lt;/strong&gt; Visualizing request paths and measuring latency across services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While logging and metrics provide critical insights, distributed tracing complements them by offering context on how different services interact and depend on one another.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background on Distributed Tracing
&lt;/h2&gt;

&lt;p&gt;Before we dive into Jaeger, it's essential to understand the concept of distributed tracing and why it's crucial in microservices environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Distributed Tracing?
&lt;/h3&gt;

&lt;p&gt;Distributed tracing is a methodology used to track and analyze requests as they traverse through various services in a distributed system. It helps in visualizing the journey of a request, from the initial entry point all the way to the final response.&lt;/p&gt;

&lt;p&gt;E.g. Service A → Service B → Service C → Service D&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is Distributed Tracing Important?
&lt;/h3&gt;

&lt;p&gt;In monolithic applications, tracing and debugging are straightforward. However, modern applications often depend on multiple microservices communicating over networks, complicating the identification of delays or failures.&lt;/p&gt;

&lt;p&gt;Logging alone can't capture complex dependencies or detect bottlenecks. Distributed tracing tools like Jaeger provide end-to-end visibility of requests, capturing metadata at each step, which helps developers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trace requests across services&lt;/li&gt;
&lt;li&gt;Visualise service dependencies and interactions&lt;/li&gt;
&lt;li&gt;Identify performance bottlenecks&lt;/li&gt;
&lt;li&gt;Quickly troubleshoot issues by pinpointing problematic services&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Is Jaeger?
&lt;/h2&gt;

&lt;p&gt;Jaeger is an open-source, end-to-end distributed tracing tool originally developed by Uber Technologies. Now part of the CNCF (Cloud Native Computing Foundation), Jaeger allows developers to trace requests as they propagate through distributed systems, providing insights into service behavior and performance bottlenecks.&lt;/p&gt;

&lt;p&gt;With Jaeger, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track request latency and identify services contributing to slow response times&lt;/li&gt;
&lt;li&gt;Monitor errors and investigate the root cause of failures across services&lt;/li&gt;
&lt;li&gt;Visualise dependency graphs for services to understand relationships and interactions&lt;/li&gt;
&lt;li&gt;Optimise performance by identifying and removing bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Jaeger is widely adopted due to its powerful tracing capabilities, ease of use, and integration with other monitoring tools in the observability stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Jaeger Works: Key Concepts and Components
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh58gelacidtvo5vb5aww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh58gelacidtvo5vb5aww.png" alt="jaeger_architecture" width="800" height="220"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6zjvc6i9g09are7jbfs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6zjvc6i9g09are7jbfs.png" alt="jaeger_architecture" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Jaeger traces requests as they travel through various services in a distributed system. It captures information about each service's interaction, which helps in pinpointing issues. Let's break down the primary components of Jaeger to understand its functioning:&lt;/p&gt;
&lt;h3&gt;
  
  
  Spans and Traces:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Span:&lt;/strong&gt; A span represents a single unit of work within a trace, capturing details like start time, duration, and any metadata or tags. Each span represents a single service call or action in the overall trace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace:&lt;/strong&gt; A trace represents the entire journey of a request across multiple spans. For instance, when a user makes a request to an application, a trace records the entire sequence, from the front end to each microservice involved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf62f3whriy1l7oyrksa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf62f3whriy1l7oyrksa.png" alt="jaeger_trace" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This screenshot is from the HOT Commerce project by OpenObserve, which demonstrates tracing across microservices. For more details, visit the project on &lt;a href="https://github.com/openobserve/hotcommerce/" rel="noopener noreferrer"&gt;GitHub here.&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Trace Analysis:
&lt;/h4&gt;

&lt;p&gt;In the image above, each line represents a span—a single operation within the overall trace, showing the journey of a request across services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trace:&lt;/strong&gt; The set of spans forms the trace, covering services like frontend, shop, product, review, and price.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Longest Span:&lt;/strong&gt; The frontend service takes the longest time at 2.53 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shortest Span:&lt;/strong&gt; The request handler completes in just 27.00 microseconds (µs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Spans:&lt;/strong&gt; There are 15 spans, each representing a unit of work, such as middleware processing, database calls, and service interactions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This breakdown shows how the request interacts with multiple services and highlights areas for potential optimization.&lt;/p&gt;
&lt;h3&gt;
  
  
  Jaeger Client:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Jaeger clients are libraries that you embed in your application code to instrument services and collect tracing data. These clients generate spans and traces, sending them to a collector for storage and analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alternatively, instead of using the Jaeger-specific client, you can also use OpenTelemetry (OTel) SDKs for instrumentation. OpenTelemetry is a vendor-neutral observability framework that can work with multiple tracing backends, including Jaeger. Using OTel SDKs allows flexibility to switch or integrate with other observability tools.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Agent:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Jaeger agent is a lightweight daemon running alongside the application. It receives traces emitted by the client and batches them for efficient transmission to the collector.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alternatively, the OpenTelemetry Collector can be used as an alternative to the Jaeger Agent. The OTel Collector is a versatile tool that not only receives, processes, and exports tracing data but can also handle metrics and logs. It can send data to multiple observability backends, making it a flexible choice for distributed tracing setups.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Collector:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Jaeger collector receives traces from agents and stores them in a backend. It also performs any preprocessing or filtering needed for the traces before they are stored.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In OpenTelemetry-based setups, the OTel Collector can handle this role as well, offering additional features like data transformation and routing, which make it ideal for complex or multi-backend environments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Query Service and UI:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Jaeger provides a UI for querying and visualising traces. Through this UI, developers can search for traces, identify latency bottlenecks, and visualise service dependencies and call hierarchies.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Storage Backend:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Jaeger supports various storage backends like Cassandra, Elasticsearch, or even local files for persistence. This allows you to store traces for later analysis and comparisons.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How Jaeger Collects and Visualizes Traces
&lt;/h2&gt;

&lt;p&gt;When a user request enters a service, the Jaeger client library starts a trace, generating a unique trace ID for that request. As the request flows through different services, the trace ID propagates along, with each service generating a span representing its part of the work. These spans are sent to the Jaeger agent and ultimately stored in the backend.&lt;/p&gt;

&lt;p&gt;The Jaeger UI allows you to visualise traces in a timeline view, making it easier to observe the sequence of events and locate bottlenecks. The UI also provides a service dependency graph that shows the relationships between services, allowing you to monitor dependencies and the overall health of your system.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started with Jaeger
&lt;/h2&gt;

&lt;p&gt;Here's a quick guide to setting up Jaeger in your environment. We'll use Docker to deploy Jaeger and assume you have Docker installed.&lt;br&gt;
For a complete setup guide, refer to the &lt;a href="https://www.jaegertracing.io/docs/1.62/getting-started/" rel="noopener noreferrer"&gt;Jaeger Getting Started Documentation.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Deploy Jaeger with Docker
&lt;/h3&gt;

&lt;p&gt;Jaeger offers an all-in-one image for testing and development purposes. To start the Jaeger all-in-one container, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; jaeger &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;COLLECTOR_ZIPKIN_HOST_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;:9411 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 6831:6831/udp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 6832:6832/udp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 5778:5778 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 16686:16686 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4317:4317 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4318:4318 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 14250:14250 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 14268:14268 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 14269:14269 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 9411:9411 &lt;span class="se"&gt;\&lt;/span&gt;
  jaegertracing/all-in-one:1.62.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command runs the Jaeger all-in-one Docker container, which is useful for testing and development. It exposes the following ports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6831/udp &amp;amp; 6832/udp:&lt;/strong&gt; Receive trace data from Jaeger agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5778:&lt;/strong&gt; Agent configuration HTTP endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;16686:&lt;/strong&gt; Jaeger Query UI for viewing and searching traces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4317:&lt;/strong&gt; OpenTelemetry gRPC endpoint for tracing data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4318:&lt;/strong&gt; OpenTelemetry HTTP endpoint for tracing data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;14250:&lt;/strong&gt; gRPC endpoint for the Jaeger collector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;14268:&lt;/strong&gt; HTTP endpoint for the collector to receive traces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;14269:&lt;/strong&gt; Health check endpoint for the collector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9411:&lt;/strong&gt; Zipkin-compatible endpoint for receiving data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This setup uses memory as the default backend storage, which is intended for short-term use and is not recommended for production due to the lack of persistence.&lt;/p&gt;

&lt;p&gt;You can access the Jaeger UI at &lt;strong&gt;&lt;a href="http://localhost:16686" rel="noopener noreferrer"&gt;http://localhost:16686&lt;/a&gt;&lt;/strong&gt;, to visualise and interact with the traces collected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16fjasa45n0jtybt3ynu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16fjasa45n0jtybt3ynu.jpg" alt="jaeger_UI" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Instrument the HotROD Sample Application
&lt;/h3&gt;

&lt;p&gt;Next, we'll instrument the HotROD sample application to work with Jaeger for distributed tracing.&lt;/p&gt;

&lt;h4&gt;
  
  
  What is HotROD?
&lt;/h4&gt;

&lt;p&gt;HotROD is a microservices application simulating a ride-hailing service, similar to Uber or Lyft. It consists of multiple services, such as ride management and driver management, making it an ideal example for demonstrating distributed tracing in a microservices architecture.&lt;/p&gt;

&lt;p&gt;To run the HotROD application alongside Jaeger, use the following Docker command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--link&lt;/span&gt; jaeger &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p8080-8083&lt;/span&gt;:8080-8083 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://jaeger:4318"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  jaegertracing/example-hotrod:1.62.0 &lt;span class="se"&gt;\&lt;/span&gt;
  all &lt;span class="nt"&gt;--otel-exporter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command will run the HotROD sample application in a Docker container, linking it to the Jaeger container. It will expose ports 8080 to 8083 on the host for accessing the HotROD services. The application is configured to send tracing data to Jaeger via the OpenTelemetry Protocol (OTLP) at the specified endpoint.&lt;/p&gt;

&lt;p&gt;You can access the HotROD UI at &lt;strong&gt;&lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7b4m2pbiu0jslly5ama.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7b4m2pbiu0jslly5ama.jpg" alt="hotrod_UI" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: View Traces in Jaeger UI
&lt;/h3&gt;

&lt;p&gt;Once your application is instrumented, run a few requests to generate some traces.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sb9fcjwqooi9alix5o1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sb9fcjwqooi9alix5o1.gif" alt="hotrod_UI_clicks" width="600" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, navigate to &lt;strong&gt;&lt;a href="http://localhost:16686" rel="noopener noreferrer"&gt;http://localhost:16686&lt;/a&gt;&lt;/strong&gt;, where you can query traces, visualise the flow of requests, and see latency and dependency data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgrll4vcgxh6w9s2g7gs.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgrll4vcgxh6w9s2g7gs.gif" alt="jeager_UI_1" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with OpenObserve
&lt;/h2&gt;

&lt;p&gt;Now, let's guide you through the setup of OpenObserve using Docker for deployment.&lt;br&gt;
For a detailed setup guide, you can refer to the &lt;a href="https://openobserve.ai/docs/quickstart/#openobserve-cloud/" rel="noopener noreferrer"&gt;OpenObserve Quickstart Documentation.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Deploy OpenObserve with Docker
&lt;/h3&gt;

&lt;p&gt;OpenObserve provides a Docker image for easy deployment. To start using OpenObserve, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; openobserve &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$PWD&lt;/span&gt;/data:/data &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ZO_DATA_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/data"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 5080:5080 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ZO_ROOT_USER_EMAIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"root@example.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ZO_ROOT_USER_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Complexpass#123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    public.ecr.aws/zinclabs/openobserve:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command will start an OpenObserve Docker container named openobserve, with the following configurations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Storage:&lt;/strong&gt; Maps the local directory $PWD/data to the container's /data directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication:&lt;/strong&gt; Sets the root user email and password for the OpenObserve interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port Exposure:&lt;/strong&gt; Exposes port 5080 for external access to the OpenObserve web application.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can access the OpenObserve UI at &lt;strong&gt;&lt;a href="http://localhost:5080" rel="noopener noreferrer"&gt;http://localhost:5080&lt;/a&gt;&lt;/strong&gt; to visualise and interact with your observability data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6wdf804p6lpkpoc3gzq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6wdf804p6lpkpoc3gzq.jpg" alt="O2_login_page" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Log in with the following credentials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User email:&lt;/strong&gt; &lt;a href="mailto:root@example.com"&gt;root@example.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Password:&lt;/strong&gt; Complexpass#123&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh99r0u8d3orr1jthkub.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh99r0u8d3orr1jthkub.gif" alt="O2_login" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Instrument the HotROD Sample Application
&lt;/h3&gt;

&lt;p&gt;Run the following command to configure the HotROD sample app to send tracing data to OpenObserve (O2). Replace placeholders with the correct values from your OpenObserve setup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--link&lt;/span&gt; &amp;lt;O2_CONTAINER_NAME&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env&lt;/span&gt; &lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;O2_ENDPOINT&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env&lt;/span&gt; &lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;Authorization=Basic &amp;lt;BASE64_ENCODED_CREDENTIALS&amp;gt;&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8080-8083:8080-8083 &lt;span class="se"&gt;\&lt;/span&gt;
  jaegertracing/example-hotrod:latest &lt;span class="se"&gt;\&lt;/span&gt;
  all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs the HotROD application in a Docker container and links it to your OpenObserve container.&lt;/li&gt;
&lt;li&gt;Sets the environment variable for the OpenTelemetry exporter endpoint to send tracing data to OpenObserve.&lt;/li&gt;
&lt;li&gt;Configures the necessary headers for authentication.&lt;/li&gt;
&lt;li&gt;Maps ports 8080 to 8083 for accessing the HotROD services externally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By running this command, you'll be able to generate trace data from the HotROD application and send it to OpenObserve for visualisation and analysis.&lt;/p&gt;

&lt;p&gt;You can find the HTTP endpoint and authorization details in the Data Sources section, under Traces (OpenTelemetry).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkllnh3ucbvv8tspj5pa4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkllnh3ucbvv8tspj5pa4.gif" alt="O2_endpoint" width="600" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is how the command looks after replacing required fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--link&lt;/span&gt; openobserve &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env&lt;/span&gt; &lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://13.232.45.32:5080/api/default &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env&lt;/span&gt; &lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Authorization=Basic cm9vdEBleGFtcGxlLmNvbTpTMzVHMjhaMEkxVEdxYm9q"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8080-8083:8080-8083 &lt;span class="se"&gt;\&lt;/span&gt;
  jaegertracing/example-hotrod:latest &lt;span class="se"&gt;\&lt;/span&gt;
  all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;strong&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;/strong&gt; with your specific values.&lt;/p&gt;

&lt;p&gt;You can access the HotROD UI at &lt;strong&gt;&lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;&lt;/strong&gt;. Once your application is instrumented, run a few requests to generate some traces.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7irphunqqc4fbt66tajv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7irphunqqc4fbt66tajv.gif" alt="hotrod_UI_clicks" width="600" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: View Traces in OpenObserve UI
&lt;/h3&gt;

&lt;p&gt;Once your application is instrumented, generate some telemetry data by making requests to your services. You can then explore the data in the OpenObserve UI at &lt;strong&gt;&lt;a href="http://localhost:5080" rel="noopener noreferrer"&gt;http://localhost:5080&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6iefveaaicfdtwqgsm55.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6iefveaaicfdtwqgsm55.gif" alt="O2_traces" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu34uzvg71clcr9ej5hr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu34uzvg71clcr9ej5hr.jpg" alt="O2_traces" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1w2yqsy77byomw10172.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1w2yqsy77byomw10172.jpg" alt="O2_traces" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Jaeger vs. OpenObserve
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Challenge&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Jaeger&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OpenObserve (O2)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Struggles with high traffic&lt;/td&gt;
&lt;td&gt;Built for high scalability and performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unified Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate tools for logs and metrics&lt;/td&gt;
&lt;td&gt;Combines metrics, logs, and traces into one platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Querying&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Basic querying options&lt;/td&gt;
&lt;td&gt;Advanced querying capabilities for deeper insights&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher storage and processing costs&lt;/td&gt;
&lt;td&gt;Optimized for lower resource usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User Experience&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Traditional, complex interfaces&lt;/td&gt;
&lt;td&gt;Modern, intuitive interface for easy navigation and analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Jaeger is an excellent tool for getting started with distributed tracing and is widely adopted for microservices observability. However, as systems grow, Jaeger's limitations in data handling and cross-function observability (metrics, logs, and traces) may become restrictive.&lt;/p&gt;

&lt;p&gt;OpenObserve addresses these limitations by unifying metrics, logs, and traces in a single platform, making it a more comprehensive observability solution. With its scalability, enhanced query capabilities, and cost-effectiveness, OpenObserve empowers teams to monitor, troubleshoot, and optimise complex distributed systems more efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Case Study: Jidu's Journey to 100% Tracing Fidelity
&lt;/h2&gt;

&lt;p&gt;To see OpenObserve's impact in action, read about Jidu's journey to achieving &lt;strong&gt;100% tracing fidelity using OpenObserve&lt;/strong&gt;. Their challenge with Jaeger with Elasticsearch backend limited their ability to ingest traces and they were able to ingest only 10% of traces that their application generated (10 TB per day) and performance was bad for the money that was spent on the resources.&lt;/p&gt;

&lt;p&gt;After moving from Jaeger+Elasticsearch to OpenObserve they were able to increase trace ingestion to 100% (10 TB) offering higher performance on the same hardware and reduced storage cost as well. They eventually started ingesting 100 TB of traces per day in OpenObserve. Their team's work offers valuable insights into overcoming the challenges of tracing at scale and ensuring trace fidelity. You can read the full case study &lt;a href="https://openobserve.ai/blog/jidu-journey-to-100-tracing-fidelity/" rel="noopener noreferrer"&gt;here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This case demonstrates how OpenObserve's unified approach to observability enables improved trace fidelity and facilitates better troubleshooting, performance optimization, and insight gathering across distributed systems.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Ready to get started?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/downloads/" rel="noopener noreferrer"&gt;Download OpenObserve&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.openobserve.ai/" rel="noopener noreferrer"&gt;Try OpenObserve Cloud&lt;/a&gt; with a 14-day free trial&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://short.openobserve.ai/community" rel="noopener noreferrer"&gt;Join our community&lt;/a&gt; for support and discussions&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>jaeger</category>
      <category>observability</category>
      <category>microservices</category>
      <category>tracing</category>
    </item>
    <item>
      <title>Top 10 Lightstep Alternatives for 2026 (OpenTelemetry-Native Options)</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Wed, 04 Feb 2026 14:41:04 +0000</pubDate>
      <link>https://dev.to/openobserve/top-10-lightstep-alternatives-for-2026-opentelemetry-native-options-2ol4</link>
      <guid>https://dev.to/openobserve/top-10-lightstep-alternatives-for-2026-opentelemetry-native-options-2ol4</guid>
      <description>&lt;p&gt;ServiceNow announced the sunset of &lt;strong&gt;Lightstep (Cloud Observability)&lt;/strong&gt; effective March 1, 2026. If you're a Lightstep user, you're facing a forced migration with no direct replacement offered by ServiceNow.&lt;/p&gt;

&lt;p&gt;Several factors are driving teams to evaluate &lt;strong&gt;Lightstep alternatives&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Forced migration&lt;/strong&gt; - March 2026 EOL deadline approaching with no migration path from ServiceNow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt; - Opportunity to reduce observability spending by 60-90% with modern platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor lock-in concerns&lt;/strong&gt; - Avoid future platform sunsets by choosing OpenTelemetry-native solutions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry standardization&lt;/strong&gt; - Move to vendor-neutral instrumentation that works across platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data sovereignty&lt;/strong&gt; - Teams need self-hosted or regional deployment options for compliance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this guide, we'll explore ten &lt;strong&gt;OpenTelemetry-native alternatives to Lightstep&lt;/strong&gt; that address these concerns, from open source platforms to specialized SaaS solutions. We'll include real cost comparisons, migration code snippets, and technical analysis to help you choose the right replacement and migrate before the March 2026 deadline.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Lightstep Sunset: What You Need to Know
&lt;/h2&gt;

&lt;p&gt;The clock is ticking. ServiceNow has officially announced the sunset of Lightstep (rebranded as ServiceNow Cloud Observability), with the service reaching End-of-Life (EOL) by March 1, 2026.&lt;/p&gt;

&lt;p&gt;For engineering teams that relied on Lightstep for its pioneering work in distributed tracing and OpenTelemetry (OTel), this is a critical turning point. You need a replacement that respects your existing OTel instrumentation, handles high-cardinality data without breaking the bank, and doesn't trap you in a proprietary agent ecosystem.&lt;/p&gt;

&lt;p&gt;This guide analyzes the &lt;strong&gt;Top 10 Lightstep alternatives for 2026&lt;/strong&gt;, focusing on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry compatibility&lt;/strong&gt; - Native OTel support vs translation layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migration ease&lt;/strong&gt; - How quickly can you switch without rewriting code?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total cost of ownership&lt;/strong&gt; - Real pricing for production workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-cardinality support&lt;/strong&gt; - Can it handle user IDs, request IDs at scale?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor lock-in risk&lt;/strong&gt; - Will you face this problem again in 3 years?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom line&lt;/strong&gt;: OpenObserve emerges as the best drop-in replacement, offering significant cost savings while maintaining OpenTelemetry-native architecture and distributed tracing capabilities.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Guide Exists
&lt;/h2&gt;

&lt;p&gt;As observability requirements evolve in 2026, Lightstep users face a forced migration due to ServiceNow's March 1, 2026 end-of-life announcement. With no direct replacement or migration path provided by ServiceNow, teams must evaluate alternatives quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence from Real Migrations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost reduction:&lt;/strong&gt; - Production data shows dramatic savings when moving from Lightstep to modern OpenTelemetry-native alternatives.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Migration timeline: Fast with OTel&lt;/strong&gt; - Teams using OpenTelemetry can migrate quickly by changing collector configuration. This is significantly faster than platforms that need new instrumentation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenTelemetry-native prevents lock-in&lt;/strong&gt; - Vendor-neutral instrumentation using OpenTelemetry standards enables future flexibility. You're not rewriting code or learning proprietary agents if you need to switch platforms again.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unified observability simplifies operations&lt;/strong&gt; - Logs, metrics, and traces in one platform reduces tool sprawl, context switching, and correlation complexity that teams experienced with fragmented monitoring stacks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Lightstep Users Need to Replicate
&lt;/h3&gt;

&lt;p&gt;Lightstep was known for several key capabilities that any replacement must match:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry pioneer&lt;/strong&gt; - Lightstep was an early contributor to OpenTelemetry and built its platform as OTel-native from day one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed tracing excellence&lt;/strong&gt; - High-cardinality trace data at scale without performance penalties or cost explosions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified observability&lt;/strong&gt; - Logs, metrics, and traces correlated in a single platform with powerful cross-signal queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change Intelligence&lt;/strong&gt; - Deployment tracking and automatic correlation between changes and performance impacts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service dependency mapping&lt;/strong&gt; - Visual representation of service relationships and data flows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL-based querying&lt;/strong&gt; - Accessible query language for both developers and SREs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your replacement platform needs to match these capabilities while avoiding the vendor lock-in risk that led to this forced migration.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Look for in a Lightstep Alternative
&lt;/h2&gt;

&lt;p&gt;When evaluating observability platforms to replace Lightstep, assess these critical dimensions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;th&gt;What to Evaluate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenTelemetry Native&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ensures easy migration without code changes&lt;/td&gt;
&lt;td&gt;Native OTLP support vs translation layers that add complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Migration Timeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;March 2026 deadline approaching fast&lt;/td&gt;
&lt;td&gt;Can you complete migration quickly with your team size?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost Structure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opportunity to reduce observability spend&lt;/td&gt;
&lt;td&gt;Transparent pricing vs usage-based surprises and hidden fees&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distributed Tracing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Core Lightstep capability you can't lose&lt;/td&gt;
&lt;td&gt;High-cardinality support, trace quality, sampling strategies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Ownership&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Avoid future vendor lock-in scenarios&lt;/td&gt;
&lt;td&gt;Self-hosted deployment option available or SaaS-only?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unified Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reduce tool sprawl and context switching&lt;/td&gt;
&lt;td&gt;Logs, metrics, traces in one platform with correlation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Capabilities&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Investigation efficiency during incidents&lt;/td&gt;
&lt;td&gt;SQL/PromQL vs proprietary query languages requiring training&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Service Maps&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dependency visualization and troubleshooting&lt;/td&gt;
&lt;td&gt;Automatic topology mapping from trace data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Integration Ecosystem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works with your existing infrastructure&lt;/td&gt;
&lt;td&gt;Cloud providers, databases, Kubernetes, CI/CD tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vendor Stability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Avoid another sudden platform sunset&lt;/td&gt;
&lt;td&gt;Long-term viability, funding, community support, roadmap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Handle growing data volumes&lt;/td&gt;
&lt;td&gt;Performance at 2x, 5x, 10x current data volumes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High-Cardinality Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Modern app requirements (user IDs, request IDs)&lt;/td&gt;
&lt;td&gt;Cost and performance impact of high-cardinality dimensions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Top 10 Lightstep Alternatives
&lt;/h2&gt;

&lt;p&gt;Jump to comparison table&lt;/p&gt;

&lt;h3&gt;
  
  
  1. OpenObserve (The Drop-in Replacement)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://openobserve.ai" rel="noopener noreferrer"&gt;OpenObserve&lt;/a&gt;&lt;/strong&gt; is the best Lightstep alternative for teams wanting unified observability with OpenTelemetry-native architecture, no vendor lock-in, and 90% cost savings. It delivers the same distributed tracing capabilities Lightstep users rely on, but with transparent pricing and self-hosting options.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s1kk3jdcee6k2xxxa4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s1kk3jdcee6k2xxxa4u.png" alt="OpenObserve Dashboard" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why OpenObserve is the best Lightstep alternative:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenObserve isn't just similar to Lightstep - it's architecturally compatible. Both platforms are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built for OpenTelemetry from day one&lt;/li&gt;
&lt;li&gt;Designed for high-cardinality distributed tracing at scale&lt;/li&gt;
&lt;li&gt;Focused on unified observability (logs, metrics, traces)&lt;/li&gt;
&lt;li&gt;Using SQL-based query languages (vs proprietary DSLs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The difference?&lt;/strong&gt; OpenObserve gives you complete data ownership through self-hosting options.&lt;/p&gt;

&lt;h4&gt;
  
  
  OpenObserve Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;True Drop-in Replacement&lt;/strong&gt;: Migration from Lightstep requires changing one config file in your OpenTelemetry Collector - no application code changes needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry-Native&lt;/strong&gt;: Native OTLP support means seamless integration with your existing OTel instrumentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-Cardinality Friendly&lt;/strong&gt;: Handles user-level dimensions and request IDs without performance degradation or cost explosions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Observability&lt;/strong&gt;: Logs, metrics, and traces in one platform with powerful correlation capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL + PromQL Querying&lt;/strong&gt;: Familiar query languages instead of proprietary syntax requiring training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Hosted or Cloud&lt;/strong&gt;: Deploy on your infrastructure for complete control, or use managed cloud for simplicity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparent Pricing&lt;/strong&gt;: Ingestion-based pricing model with no hidden per-host or per-metric fees&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  OpenObserve Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Community maturity: While the core platform is battle-tested, the AI agent community is newer compared to established vendors&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Easiest migration path of any alternative.&lt;/strong&gt; If you're using OpenTelemetry (which Lightstep users are):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign up for OpenObserve (cloud or self-hosted in 10 minutes)&lt;/li&gt;
&lt;li&gt;Update your OpenTelemetry Collector exporter configuration (change endpoint URL and auth token)&lt;/li&gt;
&lt;li&gt;Restart collector - data immediately flows to OpenObserve&lt;/li&gt;
&lt;li&gt;Rebuild dashboards (OpenObserve provides similar visualization capabilities)&lt;/li&gt;
&lt;li&gt;Set up alerts (SQL-based, often simpler than Lightstep's UI-based approach)&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;Teams seeking a &lt;strong&gt;Lightstep replacement&lt;/strong&gt; that maintains OpenTelemetry-native architecture, matches distributed tracing capabilities, and dramatically reduces costs without sacrificing functionality. Ideal for organizations wanting data ownership through self-hosting while avoiding vendor lock-in.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Grafana Stack (LGTM)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana Stack&lt;/a&gt;&lt;/strong&gt; (Loki for logs, Grafana for visualization, Tempo for traces, Mimir/Prometheus for metrics) is a popular open-source Lightstep alternative composed of best-in-class tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xxtoswb9wvxwxqpprzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xxtoswb9wvxwxqpprzg.png" alt="Grafana Dashboard" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Grafana Stack Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best Visualization&lt;/strong&gt;: Grafana dashboards are industry-leading with extensive customization options&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open Source &amp;amp; Vendor-Neutral&lt;/strong&gt;: No proprietary formats or lock-in across the stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tempo for Tracing&lt;/strong&gt;: OpenTelemetry-native distributed tracing with excellent performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large Ecosystem&lt;/strong&gt;: Thousands of integrations, plugins, and community dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Deployment&lt;/strong&gt;: Self-host components individually or use managed Grafana Cloud&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus Standard&lt;/strong&gt;: Industry-standard metrics collection and querying (PromQL)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Grafana Stack Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Not a single unified product like Lightstep - requires managing multiple components&lt;/li&gt;
&lt;li&gt;Operational complexity increases significantly at scale (4 different systems)&lt;/li&gt;
&lt;li&gt;Correlation across logs/metrics/traces requires manual setup&lt;/li&gt;
&lt;li&gt;Steeper learning curve than unified platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;Configure OpenTelemetry Collector to export traces to Tempo, metrics to Prometheus/Mimir, and logs to Loki. More complex than single-platform alternatives due to multiple destinations.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;Teams wanting &lt;strong&gt;maximum flexibility&lt;/strong&gt; and best-in-class visualization who are comfortable managing multiple components. Good for organizations with strong infrastructure teams or using Grafana Cloud to reduce operational burden.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Honeycomb
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.honeycomb.io/" rel="noopener noreferrer"&gt;Honeycomb&lt;/a&gt;&lt;/strong&gt; is a modern Lightstep alternative focused on high-cardinality observability and debugging distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02ugeeqs0a19hlmsg3wg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02ugeeqs0a19hlmsg3wg.png" alt="Honeycomb Traces" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Honeycomb Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Excellent for Distributed Tracing&lt;/strong&gt;: Purpose-built for understanding complex request flows across microservices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-Cardinality Native&lt;/strong&gt;: Handles millions of unique dimension values (user IDs, request IDs) without performance issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast Exploratory Queries&lt;/strong&gt;: Rapid ad-hoc querying enables real-time investigation during incidents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Native&lt;/strong&gt;: Built from ground up to ingest and leverage OpenTelemetry data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BubbleUp Feature&lt;/strong&gt;: Automatically surfaces anomalies and patterns in high-cardinality data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer-Centric UX&lt;/strong&gt;: Designed around developer and SRE workflows rather than infrastructure-only monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Honeycomb Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;SaaS-only (no self-hosted option)&lt;/li&gt;
&lt;li&gt;Less focus on traditional dashboards (more investigation-oriented)&lt;/li&gt;
&lt;li&gt;Pricing scales with event volume (can grow quickly with high traffic)&lt;/li&gt;
&lt;li&gt;Logs and metrics support still evolving compared to tracing strength&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;Straightforward for OpenTelemetry users. Update collector configuration to send traces to Honeycomb. Strong documentation for Lightstep migration scenarios.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;Teams prioritizing &lt;strong&gt;distributed tracing excellence&lt;/strong&gt; and high-cardinality debugging capabilities over traditional dashboard-heavy monitoring. Ideal for microservices architectures where understanding request flows is critical.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Datadog
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.datadoghq.com/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt;&lt;/strong&gt; is a comprehensive Lightstep alternative offering all-in-one observability with extensive integrations and enterprise features.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhnct1t2q00nwq20j61w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhnct1t2q00nwq20j61w.png" alt="Datadog APM" width="800" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Datadog Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Most Comprehensive Platform&lt;/strong&gt;: Covers infrastructure, APM, logs, traces, RUM, synthetics, and security in one platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;700+ Integrations&lt;/strong&gt;: Extensive integration marketplace for cloud providers, databases, and frameworks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mature APM&lt;/strong&gt;: Deep application performance monitoring with code-level insights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise-Grade&lt;/strong&gt;: Strong governance, compliance, and multi-tenancy capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Excellent UX&lt;/strong&gt;: Polished interface with powerful visualization and alerting&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Datadog Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Very Expensive&lt;/strong&gt;: Often more expensive than Lightstep, with complex multi-vector pricing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor Lock-in&lt;/strong&gt;: Proprietary agents and data formats make switching difficult&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Surprises&lt;/strong&gt;: Usage-based pricing can lead to unexpected bills with traffic spikes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Support Limited&lt;/strong&gt;: Treats OTel metrics as expensive "custom metrics"&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;Requires Datadog agents or OpenTelemetry Collector configured for Datadog. More complex than OTel-native alternatives due to Datadog's proprietary ingestion formats.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Enterprise teams&lt;/strong&gt; with large budgets prioritizing ecosystem breadth and polished UX over cost optimization. Good if observability budget isn't constrained and you value comprehensive built-in features.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. New Relic
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://newrelic.com/" rel="noopener noreferrer"&gt;New Relic&lt;/a&gt;&lt;/strong&gt; is a SaaS observability platform offering unified logs, metrics, traces, and APM with OpenTelemetry support.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7obcbz3xf34z8136uqi1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7obcbz3xf34z8136uqi1.png" alt="New Relic APM" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  New Relic Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified Platform&lt;/strong&gt;: Full-stack observability in single SaaS platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong APM&lt;/strong&gt;: Deep code-level performance insights and error tracking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Support&lt;/strong&gt;: Native OTLP ingestion simplifies migration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-GB Pricing&lt;/strong&gt;: More predictable than per-host models (though still usage-based)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer-Friendly&lt;/strong&gt;: Good documentation and onboarding experience&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  New Relic Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proprietary Translation&lt;/strong&gt;: Translates OpenTelemetry data into New Relic format (vendor lock-in)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Costs Scale Quickly&lt;/strong&gt;: Per-GB pricing grows fast with verbose logging or high trace volumes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SaaS-Only&lt;/strong&gt;: No self-hosted option for data sovereignty&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical Billing Issues&lt;/strong&gt;: Past controversies around retroactive pricing changes&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;OpenTelemetry Collector can send data directly to New Relic via OTLP. Simpler than Datadog but creates some vendor lock-in through data format translation.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;Teams wanting a &lt;strong&gt;familiar SaaS experience&lt;/strong&gt; similar to Lightstep with strong APM capabilities and willing to accept usage-based pricing for operational simplicity.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Chronosphere
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://chronosphere.io/" rel="noopener noreferrer"&gt;Chronosphere&lt;/a&gt;&lt;/strong&gt; is a cloud-native observability platform built by ex-Uber engineers, focused on controlling costs at scale while supporting OpenTelemetry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadtmr9s0xz703x8pmos8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadtmr9s0xz703x8pmos8.png" alt="Chronosphere Platform" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Chronosphere Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Built for Scale&lt;/strong&gt;: Created by engineers who built M3 at Uber for handling massive metric volumes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Controls&lt;/strong&gt;: Native cost visibility and controls to prevent observability bill explosions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Compatible&lt;/strong&gt;: Works with OTel Collector and standard instrumentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-Cardinality Metrics&lt;/strong&gt;: Handles modern application requirements without performance degradation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance Features&lt;/strong&gt;: Strong multi-tenancy and access controls for large organizations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Performance&lt;/strong&gt;: Fast queries even on large datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Chronosphere Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Primarily metrics-focused (traces and logs less mature than competitors)&lt;/li&gt;
&lt;li&gt;Enterprise pricing (not as cost-effective as open source alternatives)&lt;/li&gt;
&lt;li&gt;Smaller ecosystem compared to established players&lt;/li&gt;
&lt;li&gt;SaaS-focused (limited self-hosted options)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;OpenTelemetry Collector can export metrics to Chronosphere. Straightforward for metrics migration, but you'll need additional solutions for comprehensive tracing that Lightstep provided.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Large-scale environments&lt;/strong&gt; generating massive metric volumes where cost control and governance are critical. Good for teams migrating from Lightstep who want enterprise support but need better cost predictability.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. Jaeger
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.jaegertracing.io/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt;&lt;/strong&gt; is an open-source distributed tracing platform and graduated CNCF project, offering core tracing capabilities without logs or metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2mfqcj53vzll9rad3q8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2mfqcj53vzll9rad3q8.png" alt="Jaeger UI" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Jaeger Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Completely Free&lt;/strong&gt;: Open source with no licensing costs whatsoever&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CNCF Graduated&lt;/strong&gt;: Proven stability and community support through Cloud Native Computing Foundation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Native&lt;/strong&gt;: Built as the reference implementation for OpenTelemetry tracing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Battle-Tested&lt;/strong&gt;: Used in production by thousands of organizations globally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Storage&lt;/strong&gt;: Supports Cassandra, Elasticsearch, Kafka, and Badger backends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight&lt;/strong&gt;: Focused solely on distributed tracing without feature bloat&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Jaeger Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tracing Only&lt;/strong&gt;: No logs or metrics - requires separate tools for unified observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Basic UI&lt;/strong&gt;: Functional but less polished than commercial alternatives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Hosted Only&lt;/strong&gt;: Requires managing infrastructure (no managed SaaS option)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited Advanced Features&lt;/strong&gt;: Missing some of Lightstep's Change Intelligence and correlation features&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;Simple for OpenTelemetry users. Point collector traces to Jaeger endpoint. However, you'll need additional tools for logs and metrics that Lightstep provided.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;Teams needing &lt;strong&gt;just distributed tracing&lt;/strong&gt; at zero cost and comfortable with self-hosting. Often paired with Prometheus (metrics) and Grafana Loki (logs) for complete observability.&lt;/p&gt;




&lt;h3&gt;
  
  
  8. Elastic Observability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.elastic.co/observability" rel="noopener noreferrer"&gt;Elastic Observability&lt;/a&gt;&lt;/strong&gt; (part of Elastic Stack/ELK) provides unified logs, metrics, APM, and traces with powerful search capabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesw1pnbms5l4h924tu8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesw1pnbms5l4h924tu8x.png" alt="Elastic APM" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Elastic Observability Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Powerful Search&lt;/strong&gt;: Elasticsearch excels at full-text and structured log search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Platform&lt;/strong&gt;: Logs, metrics, APM, and traces in single stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Deployment&lt;/strong&gt;: Self-hosted, managed Elastic Cloud, or hybrid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large Ecosystem&lt;/strong&gt;: Extensive integrations with Beats and Logstash&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security + Observability&lt;/strong&gt;: Strong overlap with SIEM capabilities for security teams&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Elastic Observability Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expensive at Scale&lt;/strong&gt;: Elasticsearch clusters require significant infrastructure investment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Complexity&lt;/strong&gt;: Managing Elasticsearch at scale requires expertise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Costs&lt;/strong&gt;: Full-fidelity data retention gets expensive quickly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Support&lt;/strong&gt;: Works but not as seamless as OTel-native platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;OpenTelemetry Collector can export to Elastic APM. Requires more operational setup than simpler alternatives due to Elasticsearch cluster management.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;Teams with &lt;strong&gt;heavy log analytics&lt;/strong&gt; requirements or existing Elasticsearch investments who want to consolidate observability into their ELK stack.&lt;/p&gt;




&lt;h3&gt;
  
  
  9. Dynatrace
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.dynatrace.com/" rel="noopener noreferrer"&gt;Dynatrace&lt;/a&gt;&lt;/strong&gt; is an enterprise APM and observability platform with AI-powered automation and root cause analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zmgnttfrflrun2hoiu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zmgnttfrflrun2hoiu1.png" alt="Dynatrace Dashboard" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Dynatrace Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Instrumentation&lt;/strong&gt;: OneAgent automatically discovers and instruments applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Davis AI&lt;/strong&gt;: AI engine reduces alert noise through intelligent root cause analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise-Grade&lt;/strong&gt;: Handles very large, complex enterprise environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Support&lt;/strong&gt;: Works across on-premises, cloud, and hybrid infrastructures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low Maintenance&lt;/strong&gt;: Highly automated requiring minimal configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Dynatrace Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Very Expensive&lt;/strong&gt;: Premium enterprise pricing, often higher than Lightstep&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proprietary Technology&lt;/strong&gt;: OneAgent and data formats create vendor lock-in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex Licensing&lt;/strong&gt;: Unit-based pricing model can be difficult to predict&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry&lt;/strong&gt;: Supports OTel but pushes proprietary OneAgent approach&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;Requires deploying OneAgent (Dynatrace's proprietary agent) rather than continuing with OpenTelemetry Collector. More disruptive migration than OTel-native alternatives.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Large enterprises&lt;/strong&gt; with complex environments prioritizing automation and willing to pay premium prices for reduced operational overhead.&lt;/p&gt;




&lt;h3&gt;
  
  
  10. Splunk Observability Cloud
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.splunk.com/en_us/products/observability.html" rel="noopener noreferrer"&gt;Splunk Observability Cloud&lt;/a&gt;&lt;/strong&gt; (formerly SignalFx) offers real-time metrics, APM, and infrastructure monitoring focused on cloud-native environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3czu341ad6wonip8jmvw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3czu341ad6wonip8jmvw.png" alt="Splunk Observability" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Splunk Observability Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Streaming&lt;/strong&gt;: NoSample architecture provides full-fidelity, real-time telemetry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong Metrics&lt;/strong&gt;: Excellent time-series metrics handling and analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Features&lt;/strong&gt;: Robust access controls, compliance, and security capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Splunk Ecosystem&lt;/strong&gt;: Integrates with Splunk platform for unified security and observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mature Platform&lt;/strong&gt;: Proven at scale in large enterprise environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Splunk Observability Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expensive&lt;/strong&gt;: Data-volume-based pricing can be prohibitively expensive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity&lt;/strong&gt;: Splunk's enterprise focus adds complexity for smaller teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Costs&lt;/strong&gt;: Full-fidelity streaming requires significant storage investment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry&lt;/strong&gt;: Supports OTel but historically pushed proprietary instrumentation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Migrating from Lightstep to OpenObserve
&lt;/h2&gt;

&lt;p&gt;OpenObserve has first-class support for OpenTelemetry, which means no vendor lock-in and seamless integration with your existing instrumentation.&lt;/p&gt;

&lt;p&gt;Your applications don't change. Your OpenTelemetry instrumentation doesn't change. Only the collector destination changes.&lt;/p&gt;

&lt;p&gt;O2 supports standardized telemetry collection (i.e., FluentBit, OpenTelemetry, Logstash) ensuring seamless integration. It exposes APIs for ingestion, search, and more, allowing programmatic access to everything. OpenObserve works with any object storage such as S3 or GCS and stores data in open formats, avoiding vendor lock-in on collection and storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo117ol034gyau7m5ecu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo117ol034gyau7m5ecu2.png" alt="Agent receivers ingestion flow into OpenObserve" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Migration Path
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Point your OTel collectors to OpenObserve&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Already using OpenTelemetry? Just update your exporter endpoint. No re-instrumentation required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo53n7wkkly06tqz8o7md.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo53n7wkkly06tqz8o7md.png" alt="Otel Collector Data Sources Page" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After (OpenObserve Configuration):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlphttp/openobserve&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://your-org.openobserve.ai/api/default/&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;${OPENOBSERVE_TOKEN}"&lt;/span&gt;
      &lt;span class="na"&gt;stream-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Run both platforms in parallel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Test OpenObserve with your production traffic while Lightstep still runs. Validate data quality and dashboard parity before fully committing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Complete migration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once validated, migrate all workloads to OpenObserve.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why Migration is Seamless
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SQL/PromQL querying&lt;/strong&gt; - Universal languages your team already knows. No proprietary DSL to learn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenTelemetry-native&lt;/strong&gt; - Your existing instrumentation works as-is. No agent rewrites or application changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted or cloud&lt;/strong&gt; - Deploy however your team prefers. Cloud for simplicity, self-hosted for complete control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Similar visualization&lt;/strong&gt; - Familiar observability workflows. Dashboards, service maps, trace views work the same way.&lt;/p&gt;




&lt;h3&gt;
  
  
  Need Help?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Talk to our team for a personalized migration plan.&lt;/strong&gt; We'll help you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validate technical feasibility for your specific setup&lt;/li&gt;
&lt;li&gt;Recreate your critical dashboards and alerting rules&lt;/li&gt;
&lt;li&gt;Accelerate the migration process with hands-on support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://openobserve.ai/contact-us/" rel="noopener noreferrer"&gt;Contact us for migration support&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparison Table: Lightstep Alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Deployment&lt;/th&gt;
&lt;th&gt;OTel Native&lt;/th&gt;
&lt;th&gt;Pricing Model&lt;/th&gt;
&lt;th&gt;Migration Ease&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenObserve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud / Self-hosted&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Ingestion-based&lt;/td&gt;
&lt;td&gt;Very Easy (1 config change)&lt;/td&gt;
&lt;td&gt;Drop-in Lightstep replacement with 90% cost savings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grafana Stack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud / Self-hosted&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Modular (LGTM)&lt;/td&gt;
&lt;td&gt;Moderate (Multiple components)&lt;/td&gt;
&lt;td&gt;Maximum flexibility and best visualization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Honeycomb&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS only&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Event-based&lt;/td&gt;
&lt;td&gt;Very Easy (OTel-native)&lt;/td&gt;
&lt;td&gt;High-cardinality tracing excellence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datadog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS only&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Host/Usage-based&lt;/td&gt;
&lt;td&gt;Moderate (More complex)&lt;/td&gt;
&lt;td&gt;Enterprise teams with unlimited budget&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New Relic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS only&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Per-GB&lt;/td&gt;
&lt;td&gt;Easy (OTel-native)&lt;/td&gt;
&lt;td&gt;Familiar SaaS with strong APM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chronosphere&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS / Cloud&lt;/td&gt;
&lt;td&gt;Compatible&lt;/td&gt;
&lt;td&gt;Enterprise&lt;/td&gt;
&lt;td&gt;Moderate (Metrics-focused)&lt;/td&gt;
&lt;td&gt;Large-scale metrics with cost controls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jaeger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Free (Open source)&lt;/td&gt;
&lt;td&gt;Easy (Traces only)&lt;/td&gt;
&lt;td&gt;Distributed tracing only (no logs/metrics)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Elastic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud / Self-hosted&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Data-volume&lt;/td&gt;
&lt;td&gt;Moderate (Operational complexity)&lt;/td&gt;
&lt;td&gt;Log-heavy workloads with search focus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynatrace&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS / Hybrid&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Unit-based&lt;/td&gt;
&lt;td&gt;Moderate (OneAgent required)&lt;/td&gt;
&lt;td&gt;Large enterprises needing automation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Splunk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS / On-prem&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Data-volume&lt;/td&gt;
&lt;td&gt;Moderate (Complex pricing)&lt;/td&gt;
&lt;td&gt;Security + Observability convergence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With ServiceNow's March 1, 2026 Lightstep end-of-life deadline approaching, teams have an opportunity to modernize their observability stack while dramatically reducing costs and avoiding future vendor lock-in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. OpenObserve is the best drop-in replacement for Lightstep&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For most teams, OpenObserve offers the optimal combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenTelemetry-native architecture (easy migration - just change collector config)&lt;/li&gt;
&lt;li&gt;Similar distributed tracing capabilities (high-cardinality support, service maps, unified observability)&lt;/li&gt;
&lt;li&gt;Data ownership through self-hosting option&lt;/li&gt;
&lt;li&gt;No vendor lock-in risk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. OpenTelemetry-native platforms prevent future lock-in&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Choose alternatives that support OpenTelemetry natively (OpenObserve, Honeycomb, Jaeger, Grafana) rather than platforms that translate OTel data into proprietary formats (Datadog, Dynatrace). This ensures you can switch platforms again in the future without rewriting application code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Migration is straightforward with OpenTelemetry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're already using OpenTelemetry (which Lightstep users are), migration to OTel-native platforms like OpenObserve requires just updating your collector configuration. No application code changes, no re-instrumentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Start migration now&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the EOL deadline approaching, begin your evaluation and pilot testing immediately. Most teams can validate OpenObserve in a test environment within days.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommended Action Plan
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;This week&lt;/strong&gt;: Sign up for OpenObserve free trial and test with a non-critical service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next week&lt;/strong&gt;: Update OpenTelemetry Collector config and validate data flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Following weeks&lt;/strong&gt;: Build dashboards and alerts, run parallel with Lightstep&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete migration&lt;/strong&gt;: Gradually move production workloads to OpenObserve&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whether you choose OpenObserve or another alternative, prioritize &lt;strong&gt;OpenTelemetry-native platforms&lt;/strong&gt; to avoid rewriting instrumentation and ensure long-term flexibility.&lt;/p&gt;




&lt;h2&gt;
  
  
  Take the Next Step
&lt;/h2&gt;

&lt;p&gt;Ready to explore the best Lightstep alternative?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try OpenObserve&lt;/strong&gt;: &lt;a href="https://openobserve.ai/downloads/" rel="noopener noreferrer"&gt;Download&lt;/a&gt; or sign up for &lt;a href="https://cloud.openobserve.ai/" rel="noopener noreferrer"&gt;OpenObserve Cloud&lt;/a&gt; with a 14-day free trial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Talk to our team&lt;/strong&gt;: &lt;a href="https://openobserve.ai/contact-us/" rel="noopener noreferrer"&gt;Schedule a migration consultation&lt;/a&gt; to get a personalized plan for your Lightstep replacement.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ: Lightstep Alternatives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why is ServiceNow shutting down Lightstep?
&lt;/h3&gt;

&lt;p&gt;ServiceNow acquired Lightstep but decided to discontinue it without providing a replacement. The official reason wasn't detailed publicly, but it's part of their portfolio rationalization. For you, this means finding an alternative before March 1, 2026.&lt;/p&gt;

&lt;h3&gt;
  
  
  I'm using Lightstep right now - what should I do?
&lt;/h3&gt;

&lt;p&gt;Start testing alternatives immediately. Most migrations take 2-4 weeks, so:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;This month&lt;/strong&gt;: Test OpenObserve or another OTel-native platform with a non-prod service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next month&lt;/strong&gt;: Validate data volume handling and build critical dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Following months&lt;/strong&gt;: Migrate production workloads gradually&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Will I lose all my historical data when Lightstep shuts down?
&lt;/h3&gt;

&lt;p&gt;Yes, unless you export it now. ServiceNow stops accepting data after March 1, 2026. Use Lightstep's export APIs to save critical traces you need for compliance or debugging. Most teams only export essential data since full historical migration is rarely necessary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I have to rewrite all my instrumentation code?
&lt;/h3&gt;

&lt;p&gt;No. If you're using OpenTelemetry (most Lightstep users are), just update your OTel Collector config to point to the new platform. Zero application code changes. Only if you're using Lightstep-specific SDKs (rare) would you need to re-instrument.&lt;/p&gt;

&lt;h3&gt;
  
  
  How long does it actually take to migrate from Lightstep?
&lt;/h3&gt;

&lt;p&gt;2-4 weeks realistically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Week 1: Setup and testing&lt;/li&gt;
&lt;li&gt;Week 2: Build dashboards, run parallel with Lightstep&lt;/li&gt;
&lt;li&gt;Week 3-4: Migrate production services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some vendors claim "migrations in an hour" - that's just the config change. Budget a month to do it properly with dashboard recreation and validation.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens if I miss the March 2026 deadline?
&lt;/h3&gt;

&lt;p&gt;ServiceNow stops accepting telemetry. Your observability goes dark - zero visibility into production. Set up at least a basic OTel-native platform (even free Jaeger) as a fallback to avoid complete blindness.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I keep using OpenTelemetry after migrating?
&lt;/h3&gt;

&lt;p&gt;Yes - that's the whole point. Your OTel instrumentation continues working unchanged. This is why we recommend OTel-native platforms (OpenObserve, Honeycomb, Jaeger) over proprietary ones (Datadog, Dynatrace) that translate OTel into their formats. Keeps you flexible for future switches.&lt;/p&gt;




</description>
      <category>observability</category>
      <category>opentelemetry</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>FastAPI + OpenTelemetry: Stop Debugging with grep (Use Distributed Tracing)</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Mon, 02 Feb 2026 03:50:55 +0000</pubDate>
      <link>https://dev.to/openobserve/fastapi-opentelemetry-stop-debugging-with-grep-use-distributed-tracing-16m5</link>
      <guid>https://dev.to/openobserve/fastapi-opentelemetry-stop-debugging-with-grep-use-distributed-tracing-16m5</guid>
      <description>&lt;p&gt;How do you debug a FastAPI app that talks to 5 other services?&lt;/p&gt;

&lt;p&gt;Most people grep through logs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service A logs: "Request received ✓"&lt;/li&gt;
&lt;li&gt;Service B logs: "Processing ✓"&lt;/li&gt;
&lt;li&gt;Service C logs: "Query executed ✓"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User:&lt;/strong&gt; "It failed"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Classic distributed systems problem: every service &lt;em&gt;thinks&lt;/em&gt; it worked, but the request still broke somewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The issue?&lt;/strong&gt; Logs are isolated. Each service writes independently with no context about where the request came from or where it's going next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix?&lt;/strong&gt; OpenTelemetry distributed tracing. Every request gets a unique trace ID that follows it across all services—like a tracking number for API calls. When something breaks, you follow the trace ID and see exactly where it failed.&lt;/p&gt;

&lt;p&gt;Setup takes 20 minutes. Debugging goes from hours of log archaeology to "oh, there it is" in under a minute.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction to OpenTelemetry &amp;amp; OpenObserve
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry represents "an open-source observability framework" that enables developers to gather logs, metrics, and traces in a standardized manner. OpenObserve serves as a complementary platform, providing intuitive interfaces for analyzing telemetry data effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OpenTelemetry for FastAPI?
&lt;/h2&gt;

&lt;p&gt;The framework streamlines logging by integrating with existing logging libraries. This unified methodology enables consistent metadata capture across logs, traces, and metrics—making it simpler to correlate information throughout your application stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem with Traditional Logging
&lt;/h3&gt;

&lt;p&gt;When debugging microservices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each service logs separately&lt;/li&gt;
&lt;li&gt;No connection between related requests across services&lt;/li&gt;
&lt;li&gt;You're grep-ing through multiple log files trying to piece together what happened&lt;/li&gt;
&lt;li&gt;Time zones, log formats, and missing context make correlation nearly impossible&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What OpenTelemetry Solves
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Distributed Tracing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every request gets a unique trace ID&lt;/li&gt;
&lt;li&gt;Trace ID follows the request across all services&lt;/li&gt;
&lt;li&gt;See the complete request path in one view&lt;/li&gt;
&lt;li&gt;Identify exactly where failures occur&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Unified Observability:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs, metrics, and traces in one place&lt;/li&gt;
&lt;li&gt;Correlate log lines to specific traces&lt;/li&gt;
&lt;li&gt;See performance metrics alongside request flows&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  OpenObserve Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight &amp;amp; Deployable&lt;/strong&gt;: Operates as a single binary on laptops or containerized environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intuitive Interface&lt;/strong&gt;: More user-friendly than comparable tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Flexibility&lt;/strong&gt;: Supports both SQL and PromQL syntax&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated Alerting&lt;/strong&gt;: Built-in capabilities eliminate additional configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Efficiency&lt;/strong&gt;: Achieves substantially lower storage expenses than competitors (140x less than Elasticsearch)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Works: Quick Overview
&lt;/h2&gt;

&lt;p&gt;The setup involves five main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Collector&lt;/strong&gt; - Receives and processes telemetry data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI Instrumentation&lt;/strong&gt; - Automatically captures traces from your FastAPI app&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenObserve&lt;/strong&gt; - Stores and visualizes logs, metrics, and traces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace IDs&lt;/strong&gt; - Unique identifiers that follow requests across services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboards&lt;/strong&gt; - See correlated logs and traces in one view&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Example: Debugging with Trace IDs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before OpenTelemetry:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"user_id=12345"&lt;/span&gt; service1.log  &lt;span class="c"&gt;# Found request&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"timestamp=14:23:45"&lt;/span&gt; service2.log  &lt;span class="c"&gt;# Which timezone?&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"error"&lt;/span&gt; service3.log  &lt;span class="c"&gt;# Too many results&lt;/span&gt;
&lt;span class="c"&gt;# 2 hours later... still searching&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After OpenTelemetry:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Search by trace ID across all services&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"trace_id=abc123"&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;.log
&lt;span class="c"&gt;# Instantly see: Request → Auth → Database → External API timeout&lt;/span&gt;
&lt;span class="c"&gt;# 2 minutes to identify root cause&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What You'll Get
&lt;/h2&gt;

&lt;p&gt;With FastAPI + OpenTelemetry + OpenObserve:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Automatic tracing&lt;/strong&gt; for all FastAPI endpoints&lt;br&gt;
✅ &lt;strong&gt;Trace IDs&lt;/strong&gt; that follow requests across microservices&lt;br&gt;
✅ &lt;strong&gt;Log correlation&lt;/strong&gt; - click a trace to see all related logs&lt;br&gt;
✅ &lt;strong&gt;Performance metrics&lt;/strong&gt; - response times, error rates per endpoint&lt;br&gt;
✅ &lt;strong&gt;Fast debugging&lt;/strong&gt; - find issues in minutes, not hours&lt;/p&gt;




&lt;h2&gt;
  
  
  Ready to Set This Up?
&lt;/h2&gt;

&lt;p&gt;The complete setup guide (with step-by-step instructions, code examples, and configuration files) is available on OpenObserve's blog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installing OpenTelemetry Collector&lt;/li&gt;
&lt;li&gt;Configuring YAML for log and trace collection&lt;/li&gt;
&lt;li&gt;Setting up OpenObserve locally or in the cloud&lt;/li&gt;
&lt;li&gt;Instrumenting your FastAPI application with automatic tracing&lt;/li&gt;
&lt;li&gt;Testing and analyzing traces in the OpenObserve dashboard&lt;/li&gt;
&lt;li&gt;Common troubleshooting tips&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://openobserve.ai/blog/monitoring-fastapi-application-using-opentelemetry-and-openobserve/" rel="noopener noreferrer"&gt;Read the full setup guide here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Looking for an OpenTelemetry-native backend?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you need something that works with your existing OTel setup—self-hosted or managed cloud, SQL + PromQL querying, unified logs/metrics/traces, with enterprise features (SSO, RBAC, multi-tenancy) but without the Datadog/Elastic price tag:&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://openobserve.ai" rel="noopener noreferrer"&gt;OpenObserve&lt;/a&gt;. Open-source, 140x lower storage costs, built for teams that want control over their observability stack.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://cloud.openobserve.ai" rel="noopener noreferrer"&gt;Try the cloud version&lt;/a&gt; (14-day trial)&lt;br&gt;
  → &lt;a href="https://openobserve.ai/downloads/" rel="noopener noreferrer"&gt;Download&lt;/a&gt;&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>python</category>
      <category>opentelemetry</category>
      <category>observability</category>
    </item>
    <item>
      <title>NVIDIA GPU Monitoring: Catch Thermal Throttling Before It Costs You $50k/Year</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Sun, 01 Feb 2026 09:19:19 +0000</pubDate>
      <link>https://dev.to/openobserve/nvidia-gpu-monitoring-with-dcgm-exporter-and-openobserve-complete-setup-guide-34k6</link>
      <guid>https://dev.to/openobserve/nvidia-gpu-monitoring-with-dcgm-exporter-and-openobserve-complete-setup-guide-34k6</guid>
      <description>&lt;p&gt;Thermal throttling at 3 AM because you didn't catch that GPU running hot? Your $240k H200 cluster shouldn't be bleeding $50k+ annually through silent failures and inefficiencies.&lt;/p&gt;

&lt;p&gt;We built this guide because monitoring NVIDIA GPUs with traditional tools was taking 4-8 hours of setup time. Here's how to get DCGM Exporter + OpenObserve running in ~30 minutes and catch issues before they torch your budget.&lt;/p&gt;




&lt;p&gt;AI-driven infrastructure landscape is evolving and GPU clusters represent one of the most significant capital investments for organizations. Whether you're running large language models, training deep learning models, or processing massive datasets, your NVIDIA GPUs (H100s, H200s, A100s, or L40S) are the workhorses powering your most critical workloads.&lt;/p&gt;

&lt;p&gt;But here's the challenge: &lt;strong&gt;how do you know if your GPU infrastructure is performing optimally?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional monitoring approaches fall short when it comes to GPU infrastructure. System metrics like CPU and memory utilization don't tell you if your GPUs are thermal throttling, experiencing memory bottlenecks, or operating at peak efficiency. You need deep visibility into GPU-specific metrics like utilization, temperature, power consumption, memory usage, and PCIe throughput.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;NVIDIA's Data Center GPU Manager (DCGM) Exporter&lt;/strong&gt; combined with &lt;strong&gt;OpenObserve&lt;/strong&gt; creates a powerful, cost-effective monitoring solution that gives you real-time insights into your GPU infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why GPU Monitoring Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The High Cost of GPU Inefficiency
&lt;/h3&gt;

&lt;p&gt;Consider this scenario: You're running an 8x NVIDIA H200 cluster. Each H200 costs approximately $30,000-$40,000, meaning your hardware investment alone is around $240,000-$320,000. Operating costs (power, cooling, infrastructure) can easily add another $50,000-$100,000 annually.&lt;/p&gt;

&lt;p&gt;Now imagine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thermal throttling&lt;/strong&gt; reducing performance by 15% due to poor cooling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU memory leaks&lt;/strong&gt; causing jobs to fail silently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Underutilization&lt;/strong&gt; with GPUs sitting idle 40% of the time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware failures&lt;/strong&gt; going undetected until complete outage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PCIe bottlenecks&lt;/strong&gt; limiting data transfer rates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without proper monitoring, you're flying blind. You might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wasting $50,000+ annually&lt;/strong&gt; on inefficient GPU utilization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing critical performance degradation&lt;/strong&gt; before it impacts production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unable to justify ROI&lt;/strong&gt; on GPU infrastructure to stakeholders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lacking data&lt;/strong&gt; for capacity planning and optimization decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What You Need to Monitor
&lt;/h3&gt;

&lt;p&gt;Effective GPU monitoring requires tracking dozens of metrics across multiple dimensions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU compute utilization (%)&lt;/li&gt;
&lt;li&gt;Memory bandwidth utilization (%)&lt;/li&gt;
&lt;li&gt;Tensor Core utilization&lt;/li&gt;
&lt;li&gt;SM (Streaming Multiprocessor) occupancy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Thermal &amp;amp; Power:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU temperature (°C)&lt;/li&gt;
&lt;li&gt;Power consumption (W)&lt;/li&gt;
&lt;li&gt;Power limit throttling events&lt;/li&gt;
&lt;li&gt;Thermal throttling events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Memory:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU memory usage (MB/GB)&lt;/li&gt;
&lt;li&gt;Memory allocation failures&lt;/li&gt;
&lt;li&gt;ECC (Error Correction Code) errors&lt;/li&gt;
&lt;li&gt;Memory clock speeds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interconnect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PCIe throughput (TX/RX)&lt;/li&gt;
&lt;li&gt;NVLink bandwidth&lt;/li&gt;
&lt;li&gt;NVSwitch fabric health&lt;/li&gt;
&lt;li&gt;Data transfer bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Health &amp;amp; Reliability:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XID errors (hardware faults)&lt;/li&gt;
&lt;li&gt;Page retirement events&lt;/li&gt;
&lt;li&gt;GPU compute capability&lt;/li&gt;
&lt;li&gt;Driver version compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution: DCGM Exporter + OpenObserve
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is DCGM Exporter?
&lt;/h3&gt;

&lt;p&gt;NVIDIA's Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs. DCGM Exporter exposes GPU metrics in Prometheus format, making it easy to integrate with modern observability platforms.&lt;/p&gt;

&lt;p&gt;You can find more details about DCGM exporter &lt;a href="https://github.com/NVIDIA/dcgm-exporter" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exposes 40+ GPU metrics per device&lt;/li&gt;
&lt;li&gt;Supports all modern NVIDIA datacenter GPUs (A100, H100, H200, L40S)&lt;/li&gt;
&lt;li&gt;Low overhead monitoring (~1% GPU utilization)&lt;/li&gt;
&lt;li&gt;Works with Docker, Kubernetes, and bare metal&lt;/li&gt;
&lt;li&gt;Handles multi-GPU and multi-node deployments&lt;/li&gt;
&lt;li&gt;Provides health diagnostics and error detection&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Complete Setup Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before starting, ensure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU-enabled server (cloud or on-premises)&lt;/li&gt;
&lt;li&gt;NVIDIA GPUs installed and recognized by the system&lt;/li&gt;
&lt;li&gt;NVIDIA drivers version 535+ (550+ recommended for H200)&lt;/li&gt;
&lt;li&gt;Docker installed and configured with NVIDIA Container Toolkit&lt;/li&gt;
&lt;li&gt;OpenObserve instance (cloud or self-hosted)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Verify GPU Detection
&lt;/h3&gt;

&lt;p&gt;First, confirm your GPUs are properly detected by the system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if GPUs are visible&lt;/span&gt;
nvidia-smi

&lt;span class="c"&gt;# Expected output: List of GPUs with utilization, temperature, and memory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For NVIDIA H200 or multi-GPU systems with NVSwitch, you'll need the NVIDIA Fabric Manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install fabric manager (version should match your driver)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nvidia-driver-535 nvidia-fabricmanager-535

&lt;span class="c"&gt;# Reboot to load new driver&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;reboot

&lt;span class="c"&gt;# After reboot, start the service&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start nvidia-fabricmanager
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;nvidia-fabricmanager

&lt;span class="c"&gt;# Verify&lt;/span&gt;
nvidia-smi  &lt;span class="c"&gt;# Should now show all GPUs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Deploy DCGM Exporter
&lt;/h3&gt;

&lt;p&gt;Deploy DCGM Exporter as a Docker container. This lightweight container exposes GPU metrics on port 9400:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cap-add&lt;/span&gt; SYS_ADMIN &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt; host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; dcgm-exporter &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuration breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--gpus all&lt;/code&gt; - Grants access to all GPUs on the host&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--cap-add SYS_ADMIN&lt;/code&gt; - Required for DCGM to query GPU metrics&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--network host&lt;/code&gt; - Uses host networking for easier access&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--restart unless-stopped&lt;/code&gt; - Ensures resilience across reboots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verify DCGM is working:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Wait 10 seconds for initialization&lt;/span&gt;
&lt;span class="nb"&gt;sleep &lt;/span&gt;10

&lt;span class="c"&gt;# Access metrics from inside the container&lt;/span&gt;
docker &lt;span class="nb"&gt;exec &lt;/span&gt;dcgm-exporter curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:9400/metrics | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-30&lt;/span&gt;

&lt;span class="c"&gt;# You should see output like:&lt;/span&gt;
&lt;span class="c"&gt;# DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-xxxx",...} 45.0&lt;/span&gt;
&lt;span class="c"&gt;# DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-xxxx",...} 42.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Configure OpenTelemetry Collector
&lt;/h3&gt;

&lt;p&gt;The OpenTelemetry Collector scrapes metrics from DCGM Exporter and forwards them to OpenObserve. Create the configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dcgm-gpu-metrics'&lt;/span&gt;
          &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
          &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9400'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
          &lt;span class="na"&gt;metric_relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Keep only DCGM metrics&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__name__&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
              &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DCGM_.*'&lt;/span&gt;
              &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;keep&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlphttp/openobserve&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://example.openobserve.ai/api/ORG_NAME/&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;YOUR_O2_TOKEN"&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="na"&gt;send_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlphttp/openobserve&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Get your OpenObserve credentials:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# For Ingestion token authentication (recommended):&lt;/span&gt;
Go to OpenObserve UI → Datasources -&amp;gt; Custom -&amp;gt; Otel Collector
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyd0f0rh47j5c6jy1ii6q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyd0f0rh47j5c6jy1ii6q.jpeg" alt="openobserve ingestion token" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Update the &lt;code&gt;Authorization&lt;/code&gt; header in the config with your base64-encoded credentials.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Deploy OpenTelemetry Collector
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt; host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/otel-collector-config.yaml:/etc/otel-collector-config.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; otel-collector &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
  otel/opentelemetry-collector-contrib:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/otel-collector-config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check OpenTelemetry Collector:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# View collector logs&lt;/span&gt;
docker logs otel-collector

&lt;span class="c"&gt;# Look for successful scrapes (no error messages)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check OpenObserve:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log into OpenObserve UI&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Metrics&lt;/strong&gt; section&lt;/li&gt;
&lt;li&gt;Search for metrics starting with &lt;code&gt;DCGM_&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Data should appear within 1-2 minutes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk82kheukfq0gjptaamg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk82kheukfq0gjptaamg1.png" alt="dcgm metrics list" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Generate GPU Load (Optional)
&lt;/h3&gt;

&lt;p&gt;To verify monitoring is working, generate some GPU activity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install PyTorch&lt;/span&gt;
pip3 &lt;span class="nb"&gt;install &lt;/span&gt;torch

&lt;span class="c"&gt;# Create a load test script&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gpu_load.py &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
import torch
import time

print("Starting GPU load test...")
devices = [torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]
tensors = [torch.randn(15000, 15000, device=d) for d in devices]

print(f"Loaded {len(devices)} GPUs")
while True:
    for tensor in tensors:
        _ = torch.mm(tensor, tensor)
    time.sleep(0.5)
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Run load test&lt;/span&gt;
python3 gpu_load.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watch your metrics in OpenObserve - you should see GPU utilization spike!&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating Dashboards in OpenObserve
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Download the Dashboards from our &lt;a href="https://github.com/openobserve/dashboards/tree/main/NVIDIA%20GPU%20Monitoring" rel="noopener noreferrer"&gt;community repository&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;In OpenObserve UI, go to Dashboards → Import -&amp;gt; Drop your files here -&amp;gt; select your json -&amp;gt; Import&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1prhwbahlrq91r5n5ecn.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1prhwbahlrq91r5n5ecn.gif" alt="steps to show how to import dashboards" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Once the dashboard has been imported, you will see the below metrics that were prebuilt and you can always customize the dashboards as needed.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnqidg9ca68b8y81uq2n.gif" alt="gpu-dash.gif" width="560" height="304"&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Setting Up Alerts
&lt;/h2&gt;

&lt;p&gt;Critical alerts to configure in OpenObserve:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. High GPU Temperature
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DCGM_FI_DEV_GPU_TEMP &amp;gt; 85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Severity:&lt;/strong&gt; Warning at 85°C, Critical at 90°C&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Check cooling systems, reduce workload&lt;/p&gt;

&lt;h3&gt;
  
  
  2. GPU Memory Near Capacity
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) &amp;gt; 0.90
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Severity:&lt;/strong&gt; Warning at 90%, Critical at 95%&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Optimize memory usage or scale horizontally&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Low GPU Utilization (Waste Detection)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(DCGM_FI_DEV_GPU_UTIL) &amp;lt; 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Duration:&lt;/strong&gt; For 30 minutes&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Review workload scheduling, consider rightsizing&lt;/p&gt;

&lt;h3&gt;
  
  
  4. GPU Hardware Errors
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;increase(DCGM_FI_DEV_XID_ERRORS[5m]) &amp;gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Severity:&lt;/strong&gt; Critical&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Immediate investigation, potential RMA&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Thermal Throttling Detected
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;increase(DCGM_FI_DEV_THERMAL_VIOLATION[5m]) &amp;gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Severity:&lt;/strong&gt; Warning&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Improve cooling or reduce ambient temperature&lt;/p&gt;

&lt;h3&gt;
  
  
  6. GPU Offline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;absent(DCGM_FI_DEV_GPU_TEMP)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Duration:&lt;/strong&gt; For 2 minutes&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Check GPU health, driver status, fabric manager&lt;/p&gt;

&lt;h2&gt;
  
  
  Traditional Monitoring vs. GPU Monitoring with OpenObserve
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Traditional Monitoring (Prometheus/Grafana)&lt;/th&gt;
&lt;th&gt;OpenObserve for GPU Monitoring&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires Prometheus, node exporters, Grafana, storage backend, and complex configuration&lt;/td&gt;
&lt;td&gt;Single unified platform with built-in visualization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Costs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High - Prometheus stores all metrics at full resolution, requires expensive SSD storage&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;80% lower&lt;/strong&gt; - Advanced compression and columnar storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-tenancy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complex setup requiring multiple Prometheus instances or federation&lt;/td&gt;
&lt;td&gt;Built-in with organization isolation and access controls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Alerting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate alerting system (Alertmanager), complex routing configuration&lt;/td&gt;
&lt;td&gt;Integrated alerting with flexible notification channels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Long-term Retention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Expensive - requires additional tools like Thanos or Cortex&lt;/td&gt;
&lt;td&gt;Native long-term storage with automatic data lifecycle management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU-Specific Features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generic time-series database, not optimized for GPU metrics&lt;/td&gt;
&lt;td&gt;Optimized for high-cardinality workloads like GPU monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Log Correlation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate log management system needed (ELK, Loki)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Unified logs, metrics, and traces&lt;/strong&gt; in one platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4-8 hours (multiple components, configurations, troubleshooting)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;30 minutes&lt;/strong&gt; (end-to-end)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintenance Overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High - multiple systems to update, monitor, and troubleshoot&lt;/td&gt;
&lt;td&gt;Low - single platform with automatic updates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  ROI Examples
&lt;/h3&gt;

&lt;p&gt;For an 8-GPU H200 cluster worth $320,000:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detect thermal throttling early:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;15% performance loss = $48,000 annual waste&lt;/li&gt;
&lt;li&gt;Early detection saves this loss&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROI: 990% in first year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optimize utilization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase from 40% to 70% = 75% more work&lt;/li&gt;
&lt;li&gt;Defer $240,000 expansion by 1 year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROI: 4,900% in first year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prevent downtime:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 hour downtime = $2,800 revenue loss&lt;/li&gt;
&lt;li&gt;Preventing 5 hours/year = $14,000 saved&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROI: 289% in first year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;GPU monitoring is no longer optional—it's essential infrastructure for any organization running GPU workloads. The combination of DCGM Exporter and OpenObserve provides:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Complete visibility&lt;/strong&gt; into GPU health, performance, and utilization&lt;br&gt;
✅ &lt;strong&gt;Cost optimization&lt;/strong&gt; through identifying waste and inefficiencies&lt;br&gt;
✅ &lt;strong&gt;Proactive alerting&lt;/strong&gt; to prevent outages and degradation&lt;br&gt;
✅ &lt;strong&gt;Data-driven decisions&lt;/strong&gt; for capacity planning and architecture&lt;br&gt;
✅ &lt;strong&gt;89% lower TCO&lt;/strong&gt; compared to traditional monitoring stacks&lt;br&gt;
✅ &lt;strong&gt;30-minute setup&lt;/strong&gt; vs. days with traditional tools&lt;/p&gt;

&lt;p&gt;Whether you're running AI/ML workloads, rendering farms, scientific computing, or GPU-accelerated databases, this monitoring solution delivers immediate ROI while scaling effortlessly as your infrastructure grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DCGM Exporter:&lt;/strong&gt; &lt;a href="https://github.com/NVIDIA/dcgm-exporter" rel="noopener noreferrer"&gt;github.com/NVIDIA/dcgm-exporter&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenObserve:&lt;/strong&gt; &lt;a href="https://openobserve.ai" rel="noopener noreferrer"&gt;openobserve.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenObserve Docs:&lt;/strong&gt; &lt;a href="https://openobserve.ai/docs" rel="noopener noreferrer"&gt;openobserve.ai/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Collector:&lt;/strong&gt; &lt;a href="https://opentelemetry.io/docs/collector" rel="noopener noreferrer"&gt;opentelemetry.io/docs/collector&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;h4&gt;
  
  
  Get Started with OpenObserve Today!
&lt;/h4&gt;

&lt;p&gt;Sign up for a &lt;a href="https://cloud.openobserve.ai" rel="noopener noreferrer"&gt;14 day trial&lt;/a&gt;&lt;br&gt;
Check out our &lt;a href="https://github.com/openobserve" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; for self-hosting and contribution opportunities&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Debugging GPU infrastructure shouldn't feel like a 2 AM guessing game.&lt;/strong&gt;&lt;br&gt;
Try &lt;a href="//cloud.openobserve.ai"&gt;OpenObserve&lt;/a&gt; for free&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>gpu</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
