<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manas Sharma</title>
    <description>The latest articles on DEV Community by Manas Sharma (@manas_sharma).</description>
    <link>https://dev.to/manas_sharma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3739096%2F64c63567-d504-47de-b304-1cd488cc2906.jpeg</url>
      <title>DEV Community: Manas Sharma</title>
      <link>https://dev.to/manas_sharma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manas_sharma"/>
    <language>en</language>
    <item>
      <title>How to Monitor AI Agents in Production</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Thu, 28 May 2026 06:18:40 +0000</pubDate>
      <link>https://dev.to/manas_sharma/how-to-monitor-ai-agents-in-production-1mn2</link>
      <guid>https://dev.to/manas_sharma/how-to-monitor-ai-agents-in-production-1mn2</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TLDR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring AI agents in production requires distributed tracing: a single user request fans out into 10 or more internal operations, and logs alone cannot show you which step is slow, failing, or burning your token budget.&lt;/li&gt;
&lt;li&gt;OpenTelemetry's &lt;code&gt;gen_ai.*&lt;/code&gt; semantic conventions give you standardized span attributes for LLM calls, tool invocations, and agent steps. Some are stable today; others are still experimental.&lt;/li&gt;
&lt;li&gt;Auto-instrumentation libraries (OpenLLMetry, OpenInference, OpenLIT) cover most agent frameworks with two to three lines of initialization code. You do not change your agent code.&lt;/li&gt;
&lt;li&gt;Traces ship to OpenObserve over OTLP. From there you get SQL-queryable trace data, token usage dashboards, cost attribution by agent and model, and alerting on latency and cost anomalies.&lt;/li&gt;
&lt;li&gt;OpenObserve also exposes an MCP server. You can query your live agent traces from a Claude or GPT session without opening a dashboard.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why Agents Are Harder to Monitor Than a Single LLM Call
&lt;/h2&gt;

&lt;p&gt;A single LLM call is straightforward to observe. One HTTP request, one response, one latency number. You can log the input and output and call it done.&lt;/p&gt;

&lt;p&gt;An agent is different. When a user sends a message, the agent calls an LLM to decide what to do, invokes a tool, processes the result, calls the LLM again, possibly calls another tool, and eventually returns a response. That one user message becomes ten or more internal operations. Some of those operations call external APIs. Some retry. Some spawn sub-agents.&lt;/p&gt;

&lt;p&gt;Without distributed tracing, you see none of this structure. You know the response took 8 seconds. You do not know whether the LLM took 7 of those seconds or whether a tool made three retries before timing out.&lt;/p&gt;

&lt;p&gt;Four categories of problems appear in production agents that you cannot debug without traces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency.&lt;/strong&gt; Which step is slow? The LLM call? The tool execution? A retry loop the agent entered because the tool returned ambiguous output?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost.&lt;/strong&gt; Which agent, which task, which model is consuming tokens? A single misconfigured prompt can bloat your monthly bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failures.&lt;/strong&gt; Did the tool fail silently and return an empty result? Did the agent exhaust its step limit and return a fallback response?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality.&lt;/strong&gt; Did the agent complete the task, or did it reason its way to a confident-sounding wrong answer?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Distributed tracing gives you a complete record of every operation, in order, with timing and attributes. That record is what makes these questions answerable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The OTel Data Model for AI Agents
&lt;/h2&gt;

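&lt;p&gt;OpenTelemetry's GenAI semantic conventions define a standard set of span attributes for AI workloads. The core attributes, which the major instrumentation libraries already emit consistently (check the semantic-conventions release you pin for each attribute's stability level):&lt;/p&gt;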

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;What it captures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.system&lt;/code&gt;&lt;/td&gt;
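&lt;td&gt;LLM provider: openai, anthropic, cohere (newer semantic-convention releases rename this attribute to &lt;code&gt;gen_ai.provider.name&lt;/code&gt;)&lt;/td&gt;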
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.operation.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Operation type: chat, embeddings, text_completion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.request.model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Model name: gpt-4o, claude-3-5-sonnet-20241022&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tokens consumed by the prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.usage.output_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tokens in the model response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.response.finish_reasons&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Why the model stopped: stop, tool_calls, length&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For agent-specific spans, the conventions extend to &lt;code&gt;gen_ai.agent.name&lt;/code&gt;, &lt;code&gt;gen_ai.agent.description&lt;/code&gt;, &lt;code&gt;gen_ai.tool.name&lt;/code&gt;, and &lt;code&gt;gen_ai.tool.description&lt;/code&gt;. These are still marked experimental as of early 2026 but are already implemented by the major instrumentation libraries and are stable enough to use in production.&lt;/p&gt;
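&lt;p&gt;To make these attributes concrete, here is a minimal sketch of rolling token counts up into an estimated cost per agent and model. The span dictionaries, the &lt;code&gt;PRICE_PER_1K&lt;/code&gt; table, and its prices are illustrative assumptions, not real pricing and not an OpenObserve API; in practice the attribute values would come from your exported spans.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens as (input, output); not real pricing.
PRICE_PER_1K = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}


def span_cost(attrs: dict) -> float:
    """Estimate the cost of one LLM span from its gen_ai.* attributes."""
    in_price, out_price = PRICE_PER_1K[attrs["gen_ai.request.model"]]
    return (
        attrs["gen_ai.usage.input_tokens"] / 1000 * in_price
        + attrs["gen_ai.usage.output_tokens"] / 1000 * out_price
    )


def cost_by_agent_and_model(spans: list) -> dict:
    """Sum the estimated cost per (agent name, model) pair."""
    totals = defaultdict(float)
    for attrs in spans:
        key = (attrs.get("gen_ai.agent.name", "unknown"), attrs["gen_ai.request.model"])
        totals[key] += span_cost(attrs)
    return dict(totals)


# Made-up sample data standing in for exported span attributes.
SAMPLE_SPANS = [
    {"gen_ai.agent.name": "triage_agent", "gen_ai.request.model": "gpt-4o-mini",
     "gen_ai.usage.input_tokens": 1000, "gen_ai.usage.output_tokens": 500},
    {"gen_ai.agent.name": "support_agent", "gen_ai.request.model": "gpt-4o",
     "gen_ai.usage.input_tokens": 2000, "gen_ai.usage.output_tokens": 1000},
]

if __name__ == "__main__":
    for key, usd in cost_by_agent_and_model(SAMPLE_SPANS).items():
        print(key, round(usd, 4))
```

&lt;p&gt;Grouping on the agent name and the model together is what lets you tell an expensive model apart from an agent that simply makes too many calls.&lt;/p&gt;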

&lt;p&gt;For a full breakdown of what OpenTelemetry captures for LLM workloads, including how SRE teams use the three signal types together, see &lt;a href="https://openobserve.ai/blog/opentelemetry-for-llms/" rel="noopener noreferrer"&gt;OpenTelemetry for LLMs: Complete SRE Guide&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spans: LLM calls, tool invocations, and agent steps
&lt;/h3&gt;

&lt;p&gt;Every significant operation in an agent's lifecycle becomes a span:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;gen_ai.chat&lt;/code&gt;: wraps a single LLM API call. Carries model name, token counts, and finish reason.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gen_ai.tool&lt;/code&gt;: wraps a single tool invocation. Child of the LLM call span that requested it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;agent.step&lt;/code&gt;: wraps one full reasoning cycle. Parent of all LLM and tool spans within that cycle.&lt;/li&gt;
&lt;/ul&gt;
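&lt;p&gt;The parent and child links between these spans are what make latency questions answerable. The sketch below is one way to find the slowest leaf operation in a single trace; the &lt;code&gt;SpanRecord&lt;/code&gt; shape and the sample durations are assumptions for illustration, and the self-time calculation assumes child spans run sequentially without overlap.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanRecord:
    """Hypothetical, simplified view of one exported span."""
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def children_of(spans: list, parent_id: str) -> list:
    """Return the direct children of the span with the given ID."""
    return [s for s in spans if s.parent_id == parent_id]


def self_time_ms(span: SpanRecord, spans: list) -> float:
    """Duration of a span minus the time spent in its direct children."""
    return span.duration_ms - sum(c.duration_ms for c in children_of(spans, span.span_id))


def slowest_leaf(spans: list) -> SpanRecord:
    """Return the childless span with the longest duration."""
    leaves = [s for s in spans if not children_of(spans, s.span_id)]
    return max(leaves, key=lambda s: s.duration_ms)


# Made-up trace: one reasoning cycle with two LLM calls and one tool call.
TRACE = [
    SpanRecord("a", None, "agent.step", 8000.0),
    SpanRecord("b", "a", "gen_ai.chat", 900.0),
    SpanRecord("c", "a", "gen_ai.tool", 6500.0),
    SpanRecord("d", "a", "gen_ai.chat", 500.0),
]

if __name__ == "__main__":
    worst = slowest_leaf(TRACE)
    print(worst.name, worst.duration_ms)
    print("agent.step overhead:", self_time_ms(TRACE[0], TRACE))
```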

&lt;h3&gt;
  
  
  Events vs. attributes for prompt and response content
&lt;/h3&gt;

&lt;p&gt;Prompt and completion content is large. Storing it as span attributes inflates trace payloads and storage costs. The OTel GenAI convention puts prompt and completion content into span events (typed &lt;code&gt;gen_ai.content.prompt&lt;/code&gt; and &lt;code&gt;gen_ai.content.completion&lt;/code&gt;) rather than attributes. Events attach to the span but are stored separately, keeping the attribute payload small while preserving full content for debugging.&lt;/p&gt;

&lt;p&gt;In practice: leave content capture enabled during development. Before shipping to production, disable it at the application level or route it through the Collector for redaction.&lt;/p&gt;
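&lt;p&gt;As a rough illustration of application-level redaction, the sketch below either drops content events entirely or keeps them with email addresses masked. The event dictionaries (&lt;code&gt;name&lt;/code&gt; and &lt;code&gt;body&lt;/code&gt; keys), the &lt;code&gt;scrub_events&lt;/code&gt; helper, and the email pattern are assumptions made for this example; a real setup would apply the same idea in a span processor or in the Collector.&lt;/p&gt;

```python
import re

# Simple email pattern for illustration; real PII detection needs more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Event names the article associates with prompt and completion content.
CONTENT_EVENT_NAMES = ("gen_ai.content.prompt", "gen_ai.content.completion")


def redact_text(text: str) -> str:
    """Mask anything that looks like an email address."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def scrub_events(events: list, capture_content: bool) -> list:
    """Drop content events, or keep them with emails masked."""
    scrubbed = []
    for event in events:
        if event["name"] in CONTENT_EVENT_NAMES:
            if not capture_content:
                continue  # production mode: content never leaves the process
            event = {**event, "body": redact_text(event["body"])}
        scrubbed.append(event)
    return scrubbed


if __name__ == "__main__":
    sample = [
        {"name": "gen_ai.content.prompt", "body": "Email me at jane@example.com"},
        {"name": "retry", "body": "attempt 2"},
    ]
    print(scrub_events(sample, capture_content=True))
    print(scrub_events(sample, capture_content=False))
```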

&lt;h3&gt;
  
  
  Trace context propagation across agent boundaries
&lt;/h3&gt;

&lt;p&gt;When an orchestrator delegates to a worker agent, the worker's spans need to appear under the same root trace. For HTTP-based delegation, include the W3C &lt;code&gt;traceparent&lt;/code&gt; header in the outgoing request and extract it in the worker. For in-process delegation (LangGraph node transitions, OpenAI Agents SDK handoffs), auto-instrumentation handles this automatically.&lt;/p&gt;
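&lt;p&gt;To show what actually travels in that header, here is a standard-library sketch of the W3C &lt;code&gt;traceparent&lt;/code&gt; format (version, trace ID, parent span ID, flags). The helper names are hypothetical. In real code you would normally let &lt;code&gt;inject&lt;/code&gt; and &lt;code&gt;extract&lt;/code&gt; from &lt;code&gt;opentelemetry.propagate&lt;/code&gt; do this for you rather than building the header by hand.&lt;/p&gt;

```python
import re
import secrets

# Version 00 traceparent: 00-(32 hex trace id)-(16 hex span id)-(2 hex flags)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Orchestrator side: start a new trace and build the outgoing header."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Worker side: validate the incoming header and split it into fields."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_span_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        raise ValueError("all-zero trace or span ID is invalid")
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": int(flags, 16) % 2 == 1,  # lowest bit is the sampled flag
    }


def child_traceparent(incoming: str) -> str:
    """Keep the trace ID and mint a new span ID for the worker's own span."""
    parent = parse_traceparent(incoming)
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    outgoing = new_traceparent()
    print("orchestrator sends:", outgoing)
    print("worker continues:  ", child_traceparent(outgoing))
```

&lt;p&gt;Because the worker reuses the trace ID and only replaces the span ID, its spans land under the same root trace as the orchestrator's.&lt;/p&gt;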

&lt;h2&gt;
  
  
  Picking Your Auto-Instrumentation Library
&lt;/h2&gt;

&lt;p&gt;Three libraries sit between your agent code and the OTel SDK. The examples in this post use LangChain and the OpenAI Agents SDK, both of which are supported by all three libraries. For other frameworks (CrewAI, AutoGen, DSPy, and more), check each library's docs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Signals&lt;/th&gt;
&lt;th&gt;LangChain&lt;/th&gt;
&lt;th&gt;OpenAI Agents&lt;/th&gt;
&lt;th&gt;Config overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenLLMetry (&lt;code&gt;traceloop-sdk&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Traces + Metrics + Logs&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenInference&lt;/td&gt;
&lt;td&gt;Traces only&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenLIT&lt;/td&gt;
&lt;td&gt;Traces + Metrics&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OpenLLMetry captures the most signals and covers the widest framework catalog. OpenLIT is the easiest entry point: one import, one function call. OpenInference is traces-only but has the closest alignment with OTel GenAI semantic conventions.&lt;/p&gt;

&lt;p&gt;For teams starting out: use OpenLLMetry. For teams already running an OTel SDK setup: use the official &lt;code&gt;opentelemetry-instrumentation-*&lt;/code&gt; packages from &lt;code&gt;opentelemetry-python-contrib&lt;/code&gt;, which include &lt;code&gt;opentelemetry-instrumentation-langchain&lt;/code&gt; and &lt;code&gt;opentelemetry-instrumentation-openai-agents-v2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For a full walkthrough of OpenLIT with OpenObserve, including pre-built dashboards for GPU and vector database monitoring, see &lt;a href="https://openobserve.ai/blog/observability-for-ai-applications-using-openobserve-and-openlit/" rel="noopener noreferrer"&gt;LLM Observability for AI Applications with OpenObserve and OpenLIT&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For a broader comparison of open-source LLM observability tooling, see &lt;a href="https://openobserve.ai/blog/llm-observability-tools/" rel="noopener noreferrer"&gt;Top Open Source LLM Observability Tools&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 1: Instrumenting a LangChain Agent
&lt;/h2&gt;

&lt;p&gt;The following examples use LangChain and the OpenAI Agents SDK. The instrumentation pattern is the same for virtually every other agent framework: install a library, initialize before importing framework classes, point the exporter at your backend.&lt;/p&gt;

&lt;p&gt;LangChain's current recommended approach for building agents uses LangGraph as the execution runtime. This example instruments the OpenAI client underneath it with OpenLLMetry's &lt;code&gt;opentelemetry-instrumentation-openai&lt;/code&gt; package, which captures every LLM call the agent makes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;opentelemetry-sdk &lt;span class="se"&gt;\&lt;/span&gt;
    opentelemetry-exporter-otlp-proto-http &lt;span class="se"&gt;\&lt;/span&gt;
    opentelemetry-instrumentation-openai &lt;span class="se"&gt;\&lt;/span&gt;
    langgraph langchain-openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Initialize before any LangChain imports:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TracerProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace.export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BatchSpanProcessor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.http.trace_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OTLPSpanExporter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.instrumentation.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIInstrumentor&lt;/span&gt;

&lt;span class="n"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;your-openobserve-otlp-endpoint&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic &amp;lt;base64(email:password)&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nc"&gt;OpenAIInstrumentor&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tracer_provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;code&gt;opentelemetry-instrumentation-langchain&lt;/code&gt; has a known compatibility issue with current LangGraph versions. &lt;code&gt;OpenAIInstrumentor&lt;/code&gt; covers the spans that matter: LLM calls with token counts, model name, and finish reason. LangChain graph-level spans can be added manually if needed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;A simple ReAct agent with a tool:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_react_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_stock_price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get the current stock price for a ticker symbol.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Replace with your actual data source
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: $142.50&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_react_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_stock_price&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the price of AAPL?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



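&lt;p&gt;You did not add a single line to the agent code. &lt;code&gt;OpenAIInstrumentor&lt;/code&gt; patches the OpenAI client that &lt;code&gt;ChatOpenAI&lt;/code&gt; calls under the hood, so every LLM call emits a span automatically. Tool executions and graph-level steps are not covered by this instrumentor; add manual spans for them if you need them.&lt;/p&gt;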

&lt;p&gt;&lt;strong&gt;What you get in OpenObserve:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A root span for the graph execution, if you add one manually (the OpenAI instrumentor does not create it)&lt;/li&gt;
&lt;li&gt;One span per LLM call with &lt;code&gt;gen_ai.request.model&lt;/code&gt;, &lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt;, and &lt;code&gt;gen_ai.usage.output_tokens&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;One span per tool invocation with the tool name and execution result, once you wrap your tools in manual spans&lt;/li&gt;
&lt;li&gt;Wall clock timing on every span&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By default, prompt and completion content is captured. Disable it for production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;TRACELOOP_TRACE_CONTENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example 2: Instrumenting an OpenAI Agents SDK App
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;opentelemetry-sdk &lt;span class="se"&gt;\&lt;/span&gt;
    opentelemetry-exporter-otlp-proto-http &lt;span class="se"&gt;\&lt;/span&gt;
    opentelemetry-instrumentation-openai-agents &lt;span class="se"&gt;\&lt;/span&gt;
    openai-agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Initialize:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TracerProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace.export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BatchSpanProcessor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.http.trace_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OTLPSpanExporter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.instrumentation.openai_agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIAgentsInstrumentor&lt;/span&gt;

&lt;span class="n"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;your-openobserve-otlp-endpoint&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic &amp;lt;base64(email:password)&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nc"&gt;OpenAIAgentsInstrumentor&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tracer_provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A two-agent handoff:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handoff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function_tool&lt;/span&gt;

&lt;span class="nd"&gt;@function_tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Search the internal knowledge base for product information.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Results for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: Feature Y has been available since v2.3.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;support_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer customer questions using the knowledge base.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;triage_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triage_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Route incoming requests to the correct specialist.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handoffs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;handoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;support_agent&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;triage_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I enable feature Y?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The instrumentation generates spans for each agent activation (tagged with &lt;code&gt;gen_ai.agent.name&lt;/code&gt;), each LLM generation (with model and token counts), each tool call (with name and arguments), and each handoff between agents. The handoff span shows up as a child of the triage agent span and a parent of the support agent span, giving you the full call tree.&lt;/p&gt;

&lt;p&gt;Content capture here uses a different switch than OpenLLMetry's. Set the capture mode through this environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;span_only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Options: &lt;code&gt;span_only&lt;/code&gt;, &lt;code&gt;event_only&lt;/code&gt;, &lt;code&gt;span_and_event&lt;/code&gt;, &lt;code&gt;no_content&lt;/code&gt;. Use &lt;code&gt;no_content&lt;/code&gt; in production if prompts contain PII.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shipping Traces to OpenObserve
&lt;/h2&gt;

&lt;p&gt;The OTLP exporter configuration shown in the examples above works for both self-hosted and cloud deployments. The only difference is the endpoint URL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted OpenObserve (port 5080):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_TRACES_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:5080/api/default/v1/traces
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;Authorization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Basic &amp;lt;base64_token&amp;gt;,stream-name&lt;span class="o"&gt;=&lt;/span&gt;default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OpenObserve Cloud:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_TRACES_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://api.openobserve.ai/api/&amp;lt;your_org&amp;gt;/v1/traces
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;Authorization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Basic &amp;lt;base64_token&amp;gt;,stream-name&lt;span class="o"&gt;=&lt;/span&gt;default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate the base64 token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"your_email@example.com:your_password"&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
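&lt;p&gt;If you would rather build the header in code than in the shell, the same token can be produced with Python's standard library. This is a minimal sketch; &lt;code&gt;basic_auth_header&lt;/code&gt; is a hypothetical helper name and the credentials are placeholders.&lt;/p&gt;

```python
import base64


def basic_auth_header(email: str, password: str) -> str:
    """Build the value of the Authorization header for HTTP Basic auth."""
    raw = f"{email}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    # Placeholder credentials; load real ones from the environment or a secret store.
    print(basic_auth_header("your_email@example.com", "your_password"))
```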



&lt;h3&gt;
  
  
  Direct export vs. OTel Collector
&lt;/h3&gt;

&lt;p&gt;Direct export is simpler for development and small deployments. The application sends spans directly to OpenObserve with no intermediate hop.&lt;/p&gt;

&lt;p&gt;The OTel Collector adds a processing layer between your agent and OpenObserve. It is worth adding when you need any of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PII redaction before spans leave your application network&lt;/li&gt;
&lt;li&gt;Tail-based sampling to reduce trace volume (see the production checklist below)&lt;/li&gt;
&lt;li&gt;Routing the same telemetry to multiple backends simultaneously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a complete OTLP exporter configuration guide covering both the direct and Collector paths, see &lt;a href="https://openobserve.ai/blog/langchain-llamaindex-openobserve/" rel="noopener noreferrer"&gt;LangChain and LlamaIndex Tracing with OpenObserve&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample Collector configuration pointing at OpenObserve:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4317&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4318&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlphttp/openobserve&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;your-openobserve-otlp-endpoint&amp;gt;&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;base64_token&amp;gt;"&lt;/span&gt;
      &lt;span class="na"&gt;stream-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlphttp/openobserve&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find your OTLP endpoint and the matching &lt;code&gt;Authorization&lt;/code&gt; header in the OpenObserve UI under &lt;strong&gt;Data Sources → OpenTelemetry Collector&lt;/strong&gt;. Copy both values from there directly into your Collector config.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Look For in OpenObserve
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Reading a multi-agent trace waterfall
&lt;/h3&gt;

&lt;p&gt;The trace timeline shows every span as a horizontal bar: width is duration, indentation is the parent-child relationship. For a LangChain ReAct agent, you can immediately see which LLM call or tool invocation is driving latency, something that's invisible in logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL queries for token usage and cost
&lt;/h3&gt;

&lt;p&gt;OpenObserve lets you query trace data with SQL directly against the &lt;code&gt;gen_ai.*&lt;/code&gt; attributes. For example, this query totals token usage by model; set the time picker to the last hour to scope it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;gen_ai_request_model&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gen_ai_usage_input_tokens&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gen_ai_usage_output_tokens&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;output_tokens&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;gen_ai_request_model&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;gen_ai_request_model&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; OpenObserve stores span attributes as top-level flattened fields using underscores (&lt;code&gt;gen_ai_request_model&lt;/code&gt;, not &lt;code&gt;attributes['gen_ai.request.model']&lt;/code&gt;). The time range filter is applied via the dashboard time picker rather than in SQL, since &lt;code&gt;_timestamp&lt;/code&gt; is stored as nanosecond &lt;code&gt;Int64&lt;/code&gt; and is not directly comparable to &lt;code&gt;NOW()&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
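
&lt;p&gt;A minimal sketch of that naming rule, assuming the flattening is a plain dot-to-underscore replacement, as the field names in the query above suggest. The helper &lt;code&gt;flattened_field&lt;/code&gt; is hypothetical and is not an OpenObserve API:&lt;/p&gt;

```python
def flattened_field(attribute_name):
    """Map an OTel attribute name to the flattened column name used in SQL."""
    # Assumption: dots become underscores and nothing else changes.
    return attribute_name.replace(".", "_")


if __name__ == "__main__":
    for name in ("gen_ai.request.model", "gen_ai.usage.input_tokens"):
        print(name, "->", flattened_field(name))
```

&lt;p&gt;This is handy when you generate queries from a list of semantic-convention attribute names instead of typing the column names by hand.&lt;/p&gt;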

&lt;p&gt;You can extend the same pattern to P99 latency by agent (&lt;code&gt;span_name = 'agent.step'&lt;/code&gt;) or error rate by tool (&lt;code&gt;span_name = 'gen_ai.tool'&lt;/code&gt;). For a full cost attribution setup (per-agent, per-model, with real-time spend alerting), see &lt;a href="https://openobserve.ai/blog/llm-cost-monitoring/" rel="noopener noreferrer"&gt;LLM Cost Monitoring with OpenObserve&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Querying Agent Traces via MCP
&lt;/h2&gt;

&lt;p&gt;OpenObserve exposes an MCP server, so any MCP-compatible LLM client can query your trace store directly, with no dashboard or SQL client required. Connect it to Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add o2 https://api.openobserve.ai/api/&amp;lt;your_org&amp;gt;/mcp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; http &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &amp;lt;base64_token&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For self-hosted OpenObserve, replace the URL with &lt;code&gt;http://localhost:5080/api/&amp;lt;your_org&amp;gt;/mcp&lt;/code&gt;. Once connected, ask questions like "which tool had the highest error rate in the last hour?" and get structured results back in your LLM session.&lt;/p&gt;

&lt;p&gt;For a full guide to MCP servers in the observability stack, see &lt;a href="https://openobserve.ai/blog/mcp-servers-observability-guide/" rel="noopener noreferrer"&gt;What OpenObserve MCP server can do?&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Checklist
&lt;/h2&gt;

&lt;h3&gt;
  
  
  PII redaction
&lt;/h3&gt;

&lt;p&gt;Disable prompt and completion capture at the application level before traces leave the process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# OpenLLMetry&lt;/span&gt;
&lt;span class="nv"&gt;TRACELOOP_TRACE_CONTENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;

&lt;span class="c"&gt;# OpenAI Agents SDK / OTel GenAI instrumentation&lt;/span&gt;
&lt;span class="nv"&gt;OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;no_content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For finer-grained redaction (specific patterns, or third-party instrumentation you don't fully control), OpenObserve has a native sensitive data redaction feature with 140+ built-in PII patterns and redact/hash/drop actions applied at ingestion time. See &lt;a href="https://openobserve.ai/blog/sensitive-data-redaction-openobserve/" rel="noopener noreferrer"&gt;Sensitive Data Redaction in OpenObserve&lt;/a&gt; for a full walkthrough, or the &lt;a href="https://openobserve.ai/blog/redact-sensitive-data-in-logs/" rel="noopener noreferrer"&gt;OTel Collector approach for logs&lt;/a&gt; if you prefer to handle it at the pipeline level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sampling for LLM traffic
&lt;/h3&gt;

&lt;p&gt;LLM spans are large and frequent. Tracing at 100% is expensive. Use tail-based sampling in the Collector: keep 100% of error traces and slow traces (e.g. &amp;gt;5s), and sample the rest probabilistically (e.g. 10%). This preserves the traces you need for debugging while keeping storage costs predictable. For a deeper look at head- vs. tail-based sampling tradeoffs and Collector configuration, see &lt;a href="https://openobserve.ai/blog/head-and-tail-based-sampling/" rel="noopener noreferrer"&gt;Head-Based vs Tail-Based Sampling&lt;/a&gt;.&lt;/p&gt;
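
&lt;p&gt;The decision rule itself is small. The sketch below illustrates the policy described above in plain Python; it is not the Collector's implementation, and the function name, parameters, and defaults (5 seconds, 10%) are assumptions taken from the example numbers in this section:&lt;/p&gt;

```python
import random


def keep_trace(has_error, duration_s, slow_threshold_s=5.0,
               sample_rate=0.10, rng=random.random):
    """Decide whether to keep a finished trace.

    Tail-based means the decision is made after the whole trace is known,
    so error status and total duration are available.
    """
    if has_error:
        return True  # always keep error traces
    if duration_s > slow_threshold_s:
        return True  # always keep slow traces
    # Everything else is kept with probability sample_rate.
    return sample_rate > rng()


if __name__ == "__main__":
    print(keep_trace(has_error=True, duration_s=0.4))
    print(keep_trace(has_error=False, duration_s=7.2))
```

&lt;p&gt;In production this logic belongs in the Collector rather than in your application, because only the Collector sees all spans of a trace before deciding.&lt;/p&gt;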

&lt;h3&gt;
  
  
  Alerting
&lt;/h3&gt;

&lt;p&gt;Four alerts to configure before your agent goes to production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency spike:&lt;/strong&gt; P99 of &lt;code&gt;agent.step&lt;/code&gt; spans exceeds 10 seconds in a 5-minute window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost anomaly:&lt;/strong&gt; total &lt;code&gt;gen_ai.usage.output_tokens&lt;/code&gt; per hour exceeds 3x your 7-day hourly baseline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool failure rate:&lt;/strong&gt; error percentage on any &lt;code&gt;gen_ai.tool&lt;/code&gt; span exceeds 5% in 15 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace volume spike:&lt;/strong&gt; unique trace IDs per minute exceeds 5x the normal rate (retry storm or agent stuck in a loop)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenObserve supports scheduled and real-time alerts with SQL, PromQL, or the query builder. See the &lt;a href="https://openobserve.ai/docs/user-guide/analytics/alerts/" rel="noopener noreferrer"&gt;Alerts docs&lt;/a&gt; to configure these.&lt;/p&gt;
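
&lt;p&gt;To make two of those thresholds concrete, here is a hedged sketch of the comparisons an alert would evaluate. The function names and inputs are hypothetical; in practice the counts would come from a query over your trace stream, and the alert itself would be defined in OpenObserve:&lt;/p&gt;

```python
def cost_anomaly(tokens_this_hour, baseline_hourly_tokens, factor=3.0):
    """True when this hour's output tokens exceed factor times the baseline."""
    return tokens_this_hour > factor * baseline_hourly_tokens


def tool_failure_alert(error_count, total_count, max_error_pct=5.0):
    """True when the tool error percentage exceeds max_error_pct."""
    if total_count == 0:
        return False  # no calls in the window, nothing to alert on
    return 100.0 * error_count / total_count > max_error_pct


if __name__ == "__main__":
    print(cost_anomaly(tokens_this_hour=42000, baseline_hourly_tokens=10000))
    print(tool_failure_alert(error_count=2, total_count=100))
```

&lt;p&gt;Both checks compare against a strict threshold, so a value sitting exactly on the limit does not fire.&lt;/p&gt;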

&lt;h2&gt;
  
  
  Try It on OpenObserve Cloud
&lt;/h2&gt;

&lt;p&gt;OpenObserve Cloud gives you an OTLP endpoint ready to accept traces, metrics, and logs with no infrastructure to provision. Point your exporter at &lt;code&gt;https://api.openobserve.ai/api/&amp;lt;your_org&amp;gt;/v1/traces&lt;/code&gt;, set your auth header, and agent traces start appearing in the UI within seconds. The same SQL queries, cost dashboards, and MCP server are available from day one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.openobserve.ai" rel="noopener noreferrer"&gt;Start for free on OpenObserve Cloud&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>opentelemetry</category>
      <category>observability</category>
      <category>llm</category>
    </item>
    <item>
      <title>How to Monitor OpenAI API Costs and Token Usage with OpenTelemetry</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Fri, 15 May 2026 13:02:47 +0000</pubDate>
      <link>https://dev.to/openobserve/how-to-monitor-openai-api-costs-and-token-usage-with-opentelemetry-4c5o</link>
      <guid>https://dev.to/openobserve/how-to-monitor-openai-api-costs-and-token-usage-with-opentelemetry-4c5o</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Capture &lt;code&gt;gen_ai.*&lt;/code&gt; semantic convention attributes on every OpenAI call: request model, input tokens, output tokens. Add &lt;code&gt;feature&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;, and &lt;code&gt;team&lt;/code&gt; on every span so you can break down cost by who and what is spending.&lt;/li&gt;
&lt;li&gt;Compute &lt;code&gt;gen_ai.usage.cost_usd&lt;/code&gt; from a pricing table you control and emit it as both a span attribute (for per-request drill-down) and a histogram metric (for aggregation and alerting).&lt;/li&gt;
&lt;li&gt;Alert on cost anomalies relative to your historical baseline, not just static budget thresholds. Retry loops and runaway agents show up as deviations before they ever cross a daily spend limit.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why OpenAI bills are impossible to predict without instrumentation
&lt;/h2&gt;

&lt;p&gt;Running an LLM app in production without instrumentation is a slow way to find out your margins are negative. Token consumption is non-obvious: a single user with a verbose system prompt and long chat history can cost 20x more per interaction than an average user. A bug in a retry loop can 10x your daily spend in an hour. A single new feature that adds RAG context to every call can double your input token count overnight.&lt;/p&gt;

&lt;p&gt;The OpenAI dashboard tells you what you spent yesterday. It does not tell you which feature, which user, which prompt template, or which model variant drove the spend. By the time you notice a cost spike in your billing dashboard, you have already paid for it.&lt;/p&gt;

&lt;p&gt;The fix is the same fix you use for any production system: emit structured telemetry at the point of the API call and make it queryable. OpenTelemetry gives you a vendor-neutral way to do this, and a growing set of GenAI-specific conventions means the fields you emit today will still be meaningful in two years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick start:&lt;/strong&gt; Jump to the Python setup or Node.js setup if you just need the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three signals you actually need to track
&lt;/h2&gt;

&lt;p&gt;For LLM cost monitoring, three signals carry almost all the value:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token usage&lt;/strong&gt; tells you how much capacity you consumed. Input tokens and output tokens, always separately, because they price differently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; is the dollar-denominated derivative of token usage. You compute it at emit time using a pricing table you control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; tells you how long users waited. For streaming endpoints, split this into time to first token and total duration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything else (error rate, finish reason, response model) is useful context for these three. Start with the three and add context as you need it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What OpenTelemetry's GenAI semantic conventions give you
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry has a dedicated set of semantic conventions for generative AI workloads, living under the &lt;code&gt;gen_ai.*&lt;/code&gt; namespace. The point of conventions is that the same attribute names work across providers and observability backends, so your queries do not break when you swap from OpenAI to Anthropic or from one backend to another.&lt;/p&gt;

&lt;p&gt;The attributes you will use most:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;What it holds&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.provider.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Provider name: &lt;code&gt;openai&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.request.model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Model requested by your code: &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;gpt-4o-mini&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.response.model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Model the provider actually used (can differ from the requested model if the provider routes the request)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.operation.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;chat&lt;/code&gt;, &lt;code&gt;text_completion&lt;/code&gt;, &lt;code&gt;embeddings&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prompt tokens consumed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.usage.output_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Completion tokens generated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.request.temperature&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Temperature parameter (useful when debugging determinism)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.request.max_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Max tokens parameter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.response.finish_reasons&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Why the model stopped: &lt;code&gt;stop&lt;/code&gt;, &lt;code&gt;length&lt;/code&gt;, &lt;code&gt;content_filter&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One attribute worth noting: &lt;code&gt;gen_ai.system&lt;/code&gt; has been renamed to &lt;code&gt;gen_ai.provider.name&lt;/code&gt; in the current OTel GenAI spec. Most instrumentation libraries still emit &lt;code&gt;gen_ai.system&lt;/code&gt; today. Your backend should accept both until library adoption catches up.&lt;/p&gt;
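
&lt;p&gt;If you post-process spans yourself, accepting both names can be as simple as a fallback lookup. This is a sketch that assumes span attributes are available as a plain dictionary keyed by the dotted attribute names; &lt;code&gt;provider_name&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;

```python
def provider_name(span_attributes):
    """Return the GenAI provider from a span's attributes, or None if absent."""
    # Prefer the current attribute, then fall back to the older gen_ai.system.
    return (span_attributes.get("gen_ai.provider.name")
            or span_attributes.get("gen_ai.system"))


if __name__ == "__main__":
    print(provider_name({"gen_ai.system": "openai"}))
    print(provider_name({"gen_ai.provider.name": "openai"}))
```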

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz42cw9al0h7kffykwgu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz42cw9al0h7kffykwgu7.png" alt="OpenTelemetry GenAI semantic convention attributes" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Instrumenting a Python app with the official OTel OpenAI instrumentation
&lt;/h2&gt;

&lt;p&gt;This guide uses &lt;code&gt;opentelemetry-instrumentation-openai-v2&lt;/code&gt;, the official OTel package maintained in &lt;code&gt;opentelemetry-python-contrib&lt;/code&gt;. It follows the GenAI semantic conventions closely and is the right choice for OpenAI instrumentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install the three packages
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;opentelemetry-distro
pip &lt;span class="nb"&gt;install &lt;/span&gt;opentelemetry-exporter-otlp
pip &lt;span class="nb"&gt;install &lt;/span&gt;opentelemetry-instrumentation-openai-v2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the bootstrap command once to install auto-instrumentation for any other libraries in your app (Flask, FastAPI, &lt;code&gt;requests&lt;/code&gt;, and so on):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;opentelemetry-bootstrap &lt;span class="nt"&gt;--action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Set the OTLP endpoint for OpenObserve
&lt;/h3&gt;

&lt;p&gt;Grab your OTLP HTTP endpoint and Authorization header from the OpenObserve UI under &lt;strong&gt;Data Sources -&amp;gt; Traces (OpenTelemetry) -&amp;gt; OTLP HTTP&lt;/strong&gt;. Set these environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_SERVICE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-llm-app
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://api.openobserve.ai/api/&amp;lt;your-org&amp;gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Authorization=Basic &amp;lt;your-auth-token&amp;gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http/protobuf
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are self-hosting OpenObserve, the endpoint is typically &lt;code&gt;http://localhost:5080/api/&amp;lt;your-org&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run with &lt;code&gt;opentelemetry-instrument&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Wrap your existing run command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;opentelemetry-instrument python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No code changes to &lt;code&gt;app.py&lt;/code&gt;. The OpenAI SDK is wrapped at import time, and every &lt;code&gt;chat.completions.create&lt;/code&gt; call emits a span with the &lt;code&gt;gen_ai.*&lt;/code&gt; attributes populated.&lt;/p&gt;

&lt;h3&gt;
  
  
  A minimal example app
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize observability in one sentence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Input tokens:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output tokens:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it with &lt;code&gt;opentelemetry-instrument python app.py&lt;/code&gt; and check the Traces tab in OpenObserve. You should see a span named &lt;code&gt;chat gpt-4o-mini&lt;/code&gt; with the token counts attached.&lt;/p&gt;

&lt;h3&gt;
  
  
  Capturing message content (and the privacy tradeoff)
&lt;/h3&gt;

&lt;p&gt;The instrumentation does not capture the prompt or completion text by default. To enable it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ships the full prompt and completion as log events. It is useful for debugging but has real privacy implications: you are now logging whatever your users typed, including anything they pasted in. If your app handles regulated data (health, finance, anything under GDPR or HIPAA), do not enable this globally. Enable it per-environment or per-feature flag, and scrub sensitive fields before the exporter sees them.&lt;/p&gt;
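
&lt;p&gt;As a rough illustration of scrubbing before export, the sketch below masks email addresses in a piece of text. The pattern is deliberately simple and only an example: real redaction needs a broader, tested pattern set, and the name &lt;code&gt;scrub_emails&lt;/code&gt; is hypothetical:&lt;/p&gt;

```python
import re

# Illustrative pattern only: matches common email shapes, not every valid address.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", text)


if __name__ == "__main__":
    print(scrub_emails("Contact bob@example.com about the invoice"))
```

&lt;p&gt;You would apply a function like this to prompt and completion text before it is attached to telemetry, so the raw value never reaches the exporter.&lt;/p&gt;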

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr73dgxg0j9a2df25cwf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr73dgxg0j9a2df25cwf.png" alt="OpenObserve Traces view showing LLM spans" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Instrumenting a Node.js app
&lt;/h2&gt;

&lt;p&gt;For Node.js, the pattern is the same. Install the packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @opentelemetry/api &lt;span class="se"&gt;\&lt;/span&gt;
  @opentelemetry/sdk-node &lt;span class="se"&gt;\&lt;/span&gt;
  @opentelemetry/exporter-trace-otlp-http &lt;span class="se"&gt;\&lt;/span&gt;
  @opentelemetry/resources &lt;span class="se"&gt;\&lt;/span&gt;
  @opentelemetry/instrumentation-openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a &lt;code&gt;tracing.js&lt;/code&gt; bootstrap file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// tracing.js&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeSDK&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/sdk-node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OTLPTraceExporter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/exporter-trace-otlp-http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OpenAIInstrumentation&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/instrumentation-openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// resourceFromAttributes is the @opentelemetry/resources 2.x API; on 1.x use `new Resource({...})` instead.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;resourceFromAttributes&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/resources&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeSDK&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;resourceFromAttributes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;service.name&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-llm-app-node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;deployment.environment&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NODE_ENV&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;development&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;traceExporter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OTLPTraceExporter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/v1/traces`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// Assumes this variable holds only the header value (e.g. "Basic ..."), not the spec's key=value list.&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;instrumentations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAIInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then preload it when you run your app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node &lt;span class="nt"&gt;--require&lt;/span&gt; ./tracing.js app.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same result: every OpenAI call produces a span in OpenObserve with the GenAI attributes populated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a cost calculation layer
&lt;/h2&gt;

&lt;p&gt;OpenAI's SDK gives you token counts. It does not give you dollars. You have to multiply tokens by a price, and that price changes. Build this as a small, updatable module.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing table as code
&lt;/h3&gt;

&lt;p&gt;Keep this in source control. Review it every quarter, or every time a provider announces a price change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pricing.py
# Prices in USD per 1 million tokens, as of April 2026.
# Verify against provider pricing pages before each release.
&lt;/span&gt;
&lt;span class="n"&gt;MODEL_PRICING&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.00&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.60&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;15.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;60.00&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o1-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;12.00&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return the estimated cost in USD for a single LLM call.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;pricing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MODEL_PRICING&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;pricing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Unknown model. Emit 0 and alert separately so you can add pricing.
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="n"&gt;input_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pricing&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;output_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pricing&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_cost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;output_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
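
&lt;p&gt;One practical wrinkle: a response often reports a dated snapshot name (for example &lt;code&gt;gpt-4o-2024-08-06&lt;/code&gt;) rather than the alias you requested, and an exact-match lookup on that name would return 0. Below is a minimal sketch of a longest-prefix fallback, assuming snapshot names extend the alias with a hyphenated suffix. &lt;code&gt;resolve_pricing_key&lt;/code&gt; and &lt;code&gt;PRICED_MODELS&lt;/code&gt; are hypothetical names, not part of any SDK.&lt;/p&gt;

```python
# Hypothetical helper: map a reported model name onto a pricing-table key.
from typing import Optional

# Keys mirror the pricing table above; prices are omitted because only names matter here.
PRICED_MODELS = ("gpt-4o", "gpt-4o-mini", "o1", "o1-mini")


def resolve_pricing_key(model: str) -> Optional[str]:
    """Return the longest priced name that `model` equals or extends, else None."""
    candidates = [
        key for key in PRICED_MODELS
        if model == key or model.startswith(key + "-")
    ]
    # Longest match wins, so "gpt-4o-mini-2024-07-18" maps to "gpt-4o-mini", not "gpt-4o".
    return max(candidates, key=len) if candidates else None
```

&lt;p&gt;Prefix matching can also mis-map a sibling model that happens to share a prefix, so treat any fallback hit as something to review when you update the table.&lt;/p&gt;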



&lt;h3&gt;
  
  
  Emitting cost as a custom metric
&lt;/h3&gt;

&lt;p&gt;The official &lt;code&gt;-v2&lt;/code&gt; package does not emit cost, only tokens. Add cost yourself with a thin wrapper that runs after each call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tracked_llm.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pricing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calculate_cost&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm-cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;meter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_meter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm-cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cost_histogram&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.cost_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimated cost of a single LLM call in USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tracked_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anon&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.provider.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.request.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;elapsed_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;

        &lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens&lt;/span&gt;
        &lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_tokens&lt;/span&gt;
        &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Span attributes for per-request investigation
&lt;/span&gt;        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.cost_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.latency.duration_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elapsed_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.response.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Metric for aggregation. user_id adds one series per user; drop it if cardinality becomes a problem.
&lt;/span&gt;        &lt;span class="n"&gt;cost_histogram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.provider.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.request.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You now have cost on the span (for drill-down) and cost as a metric (for aggregation, alerting, and dashboards). Both are labeled with &lt;code&gt;feature&lt;/code&gt; so you can break them down later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Attributing cost to users, features, and teams
&lt;/h2&gt;

&lt;p&gt;This is the section most readers came for. Raw token counts do not answer "who is spending our money." Attribution does.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding attributes on every span
&lt;/h3&gt;

&lt;p&gt;Every LLM call should carry four attribution dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;feature&lt;/code&gt;: which product path triggered the call (&lt;code&gt;document_summary&lt;/code&gt;, &lt;code&gt;chat_reply&lt;/code&gt;, &lt;code&gt;rag_answer&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user_id&lt;/code&gt;: hashed user identifier for per-user rollups&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;team&lt;/code&gt;: which internal team or product area owns the feature&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;environment&lt;/code&gt;: &lt;code&gt;prod&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;, &lt;code&gt;dev&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

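&lt;p&gt;Wire them through as keyword arguments on your wrapper. The &lt;code&gt;tracked_chat&lt;/code&gt; wrapper above accepts only &lt;code&gt;feature&lt;/code&gt; and &lt;code&gt;user_id&lt;/code&gt;; &lt;code&gt;team&lt;/code&gt; and &lt;code&gt;environment&lt;/code&gt; need the same treatment before they show up on spans:&lt;br&gt;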
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tracked_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hashed_user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
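
&lt;p&gt;To keep the four dimensions consistent across call sites, it can help to build them in one place. The sketch below is one possible shape, assuming SHA-256 truncated to 16 hex characters is an acceptable pseudonym for your users. &lt;code&gt;attribution_attributes&lt;/code&gt; and the sample values are hypothetical.&lt;/p&gt;

```python
import hashlib


def attribution_attributes(feature: str, raw_user_id: str, team: str, environment: str) -> dict:
    """Build the four attribution attributes for a span (hypothetical helper)."""
    # Hash the identifier so the raw value never reaches the telemetry backend.
    hashed = hashlib.sha256(raw_user_id.encode("utf-8")).hexdigest()[:16]
    return {
        "feature": feature,
        "user_id": hashed,
        "team": team,
        "environment": environment,
    }


attrs = attribution_attributes("document_summary", "user-42", "docs-platform", "prod")
# Each pair could then be applied with span.set_attribute(key, value).
```

&lt;p&gt;An unsalted hash is only a pseudonym, so check whether your privacy requirements call for a keyed hash instead.&lt;/p&gt;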



&lt;h2&gt;
  
  
  Building the cost attribution dashboard
&lt;/h2&gt;

&lt;p&gt;A complete LLM cost dashboard covers two concerns: spend attribution and token efficiency. Organize it across two tabs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tab 1: LLM Cost Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Four single-stat tiles at the top give you the headline numbers at a glance: &lt;strong&gt;Total LLM Cost ($)&lt;/strong&gt;, &lt;strong&gt;Total Input Tokens&lt;/strong&gt;, &lt;strong&gt;Total Output Tokens&lt;/strong&gt;, and &lt;strong&gt;Total LLM Calls&lt;/strong&gt;. These are the first things you check when something looks off.&lt;/p&gt;

&lt;p&gt;Below the tiles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM Cost Over Time ($)&lt;/strong&gt;: bar chart over the selected time range. Reveals bursty spend patterns and days that are trending above baseline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost by Model&lt;/strong&gt;: pie chart, one slice per &lt;code&gt;gen_ai.request.model&lt;/code&gt;. Shows your model mix and whether a cheaper model is handling the bulk of traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input vs Output Cost Over Time ($)&lt;/strong&gt;: grouped bar chart with two series, &lt;code&gt;input_cost&lt;/code&gt; and &lt;code&gt;output_cost&lt;/code&gt;. Output tokens cost 4x as much as input tokens for every model in the pricing table above, so a rise in output volume moves the bill more than the same rise in input; this panel tells you which side is driving cost growth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Usage by Model&lt;/strong&gt;: grouped bar chart of &lt;code&gt;input_tokens&lt;/code&gt; and &lt;code&gt;output_tokens&lt;/code&gt; per model. Cross-reference this with Cost by Model to spot models that are expensive relative to their token volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Usage Over Time&lt;/strong&gt;: time series of token counts. Useful for capacity planning and catching prompt inflation.&lt;/li&gt;
&lt;/ul&gt;
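
&lt;p&gt;The Cost by Model panel reduces to a group-by over span attributes. Here is a rough sketch of that aggregation in Python, assuming each span is available as a plain dict of attributes. The record shape and sample values are illustrative only and are not OpenObserve's export format.&lt;/p&gt;

```python
from collections import defaultdict


def cost_by_model(spans):
    """Sum the per-call cost attribute for each requested model."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["gen_ai.request.model"]] += span["gen_ai.usage.cost_usd"]
    return dict(totals)


# Illustrative records only; real values come from your trace store.
sample_spans = [
    {"gen_ai.request.model": "gpt-4o", "gen_ai.usage.cost_usd": 0.5},
    {"gen_ai.request.model": "gpt-4o-mini", "gen_ai.usage.cost_usd": 0.25},
    {"gen_ai.request.model": "gpt-4o", "gen_ai.usage.cost_usd": 0.25},
]
model_totals = cost_by_model(sample_spans)
```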

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn78rhwv8i09shrgsoo27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn78rhwv8i09shrgsoo27.png" alt="LLM Cost Monitoring dashboard" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Alerting on cost anomalies and rate-limit errors
&lt;/h2&gt;

&lt;p&gt;Static budget thresholds are table stakes. The interesting failures are the ones that do not cross a static threshold until it is too late.&lt;/p&gt;

&lt;h3&gt;
  
  
  Threshold alerts vs anomaly detection
&lt;/h3&gt;

&lt;p&gt;A threshold alert fires when daily spend exceeds $500. It works for the blunt cases. It misses three common failure modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A retry loop that triples a specific feature's token usage for an hour. Daily spend may still land under the threshold, but you paid three times the normal rate for that hour.&lt;/li&gt;
&lt;li&gt;A prompt injection that triggers a long runaway completion on a single request, burning 100k output tokens in one call.&lt;/li&gt;
&lt;li&gt;Seasonal growth that quietly pushes the baseline from $300/day to $600/day over a month. The $500 threshold fires only after spend has already outpaced your capacity plans.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Anomaly detection catches all three by comparing current behavior to historical baseline rather than to a fixed number.&lt;/p&gt;

&lt;h3&gt;
  
  
  A daily budget threshold
&lt;/h3&gt;

&lt;p&gt;Set this first. In OpenObserve, create an alert on the &lt;code&gt;gen_ai.usage.cost_usd&lt;/code&gt; metric:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trigger:&lt;/strong&gt; &lt;code&gt;SUM(gen_ai_usage_cost_usd)&lt;/code&gt; over &lt;code&gt;24h&lt;/code&gt; is greater than &lt;code&gt;500&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation frequency:&lt;/strong&gt; every 5 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Slack or PagerDuty, routed to the LLM-platform team&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  An anomaly-based alert for cost spikes
&lt;/h3&gt;

&lt;p&gt;This is more valuable. Create an anomaly alert on &lt;code&gt;gen_ai.usage.cost_usd&lt;/code&gt; grouped by &lt;code&gt;feature&lt;/code&gt;, with a training window of the last 14 days and a sensitivity tuned to catch 3x deviations. A retry loop in the &lt;code&gt;document_summary&lt;/code&gt; feature shows up in minutes, before it hits your daily threshold.&lt;/p&gt;
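
&lt;p&gt;To make "3x deviation" concrete, here is a deliberately naive sketch that compares the current window's cost for one feature against the mean of its historical windows. It illustrates the idea only and is not how OpenObserve's anomaly detection works; &lt;code&gt;is_cost_spike&lt;/code&gt; and its default ratio are assumptions.&lt;/p&gt;

```python
from statistics import mean


def is_cost_spike(baseline_costs, current_cost, ratio=3.0):
    """Flag `current_cost` when it is at least `ratio` times the baseline mean."""
    if not baseline_costs:
        return False  # No history yet, so there is nothing to compare against.
    baseline = mean(baseline_costs)
    if baseline == 0:
        # Any spend at all is a deviation from a zero baseline.
        return current_cost > 0
    return current_cost / baseline >= ratio
```

&lt;p&gt;A mean-based baseline is sensitive to earlier spikes in the training window; a median or percentile baseline is a common alternative.&lt;/p&gt;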

&lt;h3&gt;
  
  
  Alert on rate-limit errors (HTTP 429)
&lt;/h3&gt;

&lt;p&gt;When OpenAI rate-limits you, downstream calls fail and retries pile up. Fire an alert when &lt;code&gt;gen_ai.response.error.type = rate_limit_exceeded&lt;/code&gt; exceeds a low threshold (say, 5 in 5 minutes). This usually surfaces a runaway loop before a cost anomaly does.&lt;/p&gt;
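
&lt;p&gt;The "5 in 5 minutes" rule is a sliding-window count. If you want the same check inside the application (for example, to trip a circuit breaker), a minimal sketch could look like this. &lt;code&gt;RateLimitWindow&lt;/code&gt; is a hypothetical helper, and timestamps are assumed to be seconds that arrive in non-decreasing order.&lt;/p&gt;

```python
from collections import deque


class RateLimitWindow:
    """Count rate-limit errors inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=300.0, threshold=5):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()

    def record(self, timestamp):
        """Record one rate-limit error; return True when the count exceeds the threshold."""
        self._events.append(timestamp)
        # Evict events that have aged out of the window. The event just added
        # always stays, so the deque is never empty here.
        while timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.threshold
```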

&lt;h2&gt;
  
  
  Reconciling estimated cost with the OpenAI billing API
&lt;/h2&gt;

&lt;p&gt;Your OTel-derived cost is an estimate. It is usually within a couple of percent, but it drifts from the real bill for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cached input tokens.&lt;/strong&gt; Repeat prompts are billed at a discount. Your naive pricing math assumes full price.&lt;/li&gt;
&lt;li&gt;
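&lt;strong&gt;Reasoning tokens.&lt;/strong&gt; &lt;code&gt;o1&lt;/code&gt; and similar models emit internal reasoning tokens that are billed as output tokens but never appear in the response text. OpenAI's &lt;code&gt;usage&lt;/code&gt; object folds them into &lt;code&gt;completion_tokens&lt;/code&gt; (with a breakdown under &lt;code&gt;completion_tokens_details&lt;/code&gt;), so any estimate based on visible output length will undercount.&lt;/li&gt;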
&lt;li&gt;
&lt;strong&gt;Batch API discounts.&lt;/strong&gt; If you use the async batch endpoint, those requests are priced lower.&lt;/li&gt;
&lt;/ol&gt;
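
&lt;p&gt;The first source of drift can be narrowed if you also record how many prompt tokens were served from cache (OpenAI reports this under &lt;code&gt;prompt_tokens_details.cached_tokens&lt;/code&gt;). The sketch below assumes cached tokens are a subset of the input count. The default &lt;code&gt;cached_multiplier&lt;/code&gt; is a placeholder rather than a quoted price, and &lt;code&gt;input_cost_with_cache&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def input_cost_with_cache(input_tokens, cached_tokens, price_per_million, cached_multiplier=0.5):
    """Estimate input cost in USD when part of the prompt is billed at a cached rate.

    cached_multiplier is a placeholder assumption; take the real ratio from
    the provider's pricing page for each model.
    """
    # Cached tokens are assumed to be included in input_tokens.
    uncached = input_tokens - cached_tokens
    billable = uncached + cached_tokens * cached_multiplier
    return round(billable / 1_000_000 * price_per_million, 6)
```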

&lt;p&gt;Reconcile monthly. Pull the OpenAI usage endpoint and compare total cost for the window against your OTel sum. If the drift is more than 5 percent, dig in and adjust your pricing table. This is the pattern production teams use: OTel for real-time signal, billing API for ground truth.&lt;/p&gt;
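
&lt;p&gt;The drift check itself is simple arithmetic. Here is a small sketch, assuming you already have the two totals for the same window as plain numbers; the function names and the 5 percent default are illustrative.&lt;/p&gt;

```python
def cost_drift_percent(estimated_usd, billed_usd):
    """Return how far the OTel estimate is from the billed total, as a percentage."""
    if billed_usd == 0:
        raise ValueError("billed_usd must be non-zero")
    return abs(estimated_usd - billed_usd) / billed_usd * 100


def needs_pricing_review(estimated_usd, billed_usd, tolerance_percent=5.0):
    """True when the estimate has drifted past the tolerance."""
    return cost_drift_percent(estimated_usd, billed_usd) > tolerance_percent
```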

&lt;h2&gt;
  
  
  Measuring time to first token for streaming
&lt;/h2&gt;

&lt;p&gt;For chat UIs, users feel time to first token (TTFT), not total duration. If you use streaming responses, capture it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_with_ttft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.provider.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.request.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.response.streaming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;ttft_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

        &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ttft_ms&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;ttft_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
                &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.latency.ttft_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttft_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;total_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.latency.duration_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can alert on TTFT regressions separately from total-duration regressions.&lt;/p&gt;
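
&lt;p&gt;To make "TTFT regression" concrete, here is a minimal sketch of one way to compare a recent window against a baseline using the p95. The function names, the 1.25 tolerance, and the sample values are all made up for illustration; in practice the samples would come from the &lt;code&gt;gen_ai.latency.ttft_ms&lt;/code&gt; attribute recorded above, and the comparison would usually live in your alerting backend rather than in application code.&lt;/p&gt;

```python
from statistics import quantiles


def p95(samples):
    """Return the 95th percentile of a list of latency samples (ms)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(samples, n=20)[-1]


def ttft_regressed(baseline_ms, current_ms, tolerance=1.25):
    """True when the current p95 TTFT exceeds the baseline p95 by more than `tolerance`."""
    return p95(current_ms) > tolerance * p95(baseline_ms)


# Made-up sample values, only to exercise the functions.
baseline = [180, 200, 210, 190, 220, 205, 195, 215, 185, 200]
current = [400, 420, 390, 450, 410, 430, 405, 415, 395, 440]
print(ttft_regressed(baseline, current))
```

&lt;p&gt;Running the same check on total duration gives you the second, independent signal.&lt;/p&gt;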

&lt;h2&gt;
  
  
  Production checklist
&lt;/h2&gt;

&lt;p&gt;Before shipping this to prod:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Retention policy set on your LLM telemetry stream&lt;/li&gt;
&lt;li&gt;✅ PII scrubbing pipeline in place if capturing message content&lt;/li&gt;
&lt;li&gt;✅ Sampling strategy decided (100% for LLM spans is usually fine)&lt;/li&gt;
&lt;li&gt;✅ Pricing table in source control with quarterly review reminder&lt;/li&gt;
&lt;li&gt;✅ Budget threshold alert and anomaly-based alert configured&lt;/li&gt;
&lt;li&gt;✅ Monthly reconciliation against OpenAI billing API scheduled&lt;/li&gt;
&lt;/ul&gt;
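
&lt;p&gt;As a rough illustration of the "pricing table in source control" item, the sketch below keeps prices in a plain dictionary and derives a cost estimate from token counts. The model name &lt;code&gt;example-model&lt;/code&gt; and the prices are placeholders, not real provider rates, and &lt;code&gt;estimate_cost_usd&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
# Placeholder prices in USD per million tokens; NOT real provider rates.
# Keep this table in source control and review it on a schedule.
PRICING = {
    "example-model": {"input": 2.50, "output": 10.00},
}


def estimate_cost_usd(model, input_tokens, output_tokens, pricing=PRICING):
    """Estimate the cost of one call from its token counts."""
    rates = pricing.get(model)
    if rates is None:
        # Fail loudly so an unpriced model never silently reports zero cost.
        raise KeyError(f"no pricing entry for model {model!r}")
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


print(estimate_cost_usd("example-model", 1_200, 300))
```

&lt;p&gt;Raising on an unknown model is a deliberate choice here: a missing entry is exactly the kind of drift the monthly reconciliation is meant to catch.&lt;/p&gt;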

&lt;h2&gt;
  
  
  Send your LLM telemetry to OpenObserve
&lt;/h2&gt;

&lt;p&gt;OpenObserve is an open-source observability platform that accepts standard OTLP over HTTP and gRPC. There is no proprietary SDK to adopt and no special instrumentation to learn. Point your OTLP exporter at OpenObserve Cloud or a self-hosted instance, and your LLM spans, logs, and metrics land in the same place as your infrastructure telemetry.&lt;/p&gt;

&lt;p&gt;If you want to see this working end to end, spin up a free account at &lt;a href="https://cloud.openobserve.ai/" rel="noopener noreferrer"&gt;OpenObserve Cloud&lt;/a&gt; or check out the &lt;a href="https://openobserve.ai/llm-observability/" rel="noopener noreferrer"&gt;LLM Observability overview&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/llm-monitoring-best-practices/" rel="noopener noreferrer"&gt;LLM monitoring best practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/ai-anomaly-detection-guide/" rel="noopener noreferrer"&gt;AI anomaly detection guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/llm-observability-tools/" rel="noopener noreferrer"&gt;Top open source LLM observability tools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/distributed-tracing-basics-to-beyond-guide/" rel="noopener noreferrer"&gt;Distributed tracing guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>opentelemetry</category>
      <category>llm</category>
      <category>observability</category>
      <category>openai</category>
    </item>
    <item>
      <title>I Built a Dashboard in 30 Seconds with AI</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Thu, 14 May 2026 10:17:29 +0000</pubDate>
      <link>https://dev.to/openobserve/i-built-a-dashboard-in-30-seconds-with-ai-500p</link>
      <guid>https://dev.to/openobserve/i-built-a-dashboard-in-30-seconds-with-ai-500p</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;It's 2 AM. An alert fires. Cart service is throwing errors. You've got five minutes before someone escalates.&lt;/p&gt;

&lt;p&gt;The runbook says: "Check the dashboard. Look at the logs." But which dashboard? What query? You're half-asleep, the alert description tells you nothing useful, and now you're supposed to write SQL from scratch while someone in Slack asks "any update?"&lt;/p&gt;

&lt;p&gt;Most of us have been there. And most runbooks were written by someone who never had to use them under pressure.&lt;/p&gt;

&lt;p&gt;What if you could just type: &lt;strong&gt;"cart is throwing errors. find the root cause."&lt;/strong&gt; and get a real answer?&lt;/p&gt;

&lt;p&gt;That's what I tested with the new AI Assistant in OpenObserve. Here's what happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  It's Not Anomaly Detection. It's Something Simpler.
&lt;/h2&gt;

&lt;p&gt;Most AI + observability discussions jump straight to anomaly detection or ML-powered forecasting. Those are interesting. But the thing that's actually changing how I work right now is simpler: an assistant embedded in the platform that lets me ask questions in plain English and get answers from my own production data.&lt;/p&gt;

&lt;p&gt;No SQL. No PromQL. Just describe what you want.&lt;/p&gt;

&lt;p&gt;I ran four real scenarios against live data from an otel-demo microservices app and a Kubernetes cluster. Here's how each one went.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. The Dashboard Request That Normally Kills Your Afternoon
&lt;/h3&gt;

&lt;p&gt;Someone from the business team asks for a dashboard. They don't know SQL. They don't know PromQL. They just want to see what's happening with nginx — request rate, how fast it's responding, how many errors.&lt;/p&gt;

&lt;p&gt;Normally this kills thirty minutes: finding the right log stream, writing queries, dragging panels, tweaking units.&lt;/p&gt;

&lt;p&gt;Instead, I typed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create a dashboard for my nginx logs showing request rate, latency percentiles, and 4xx vs 5xx errors.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thirty seconds later I had a production-ready dashboard. It picked the right log stream. It listed the relevant fields. It wrote the SQL queries. It chose appropriate visualizations — line chart for request rate, heatmap for latency distribution, stacked bar for status codes. These were real queries against actual data. Not a template.&lt;/p&gt;

&lt;p&gt;Here's what stuck with me: &lt;strong&gt;the person who asked for this could have done it themselves.&lt;/strong&gt; They don't need to know what a PromQL query looks like. They just describe what they want to see.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Same Thing, Different Domain: Infrastructure
&lt;/h3&gt;

&lt;p&gt;Application logs worked. But what about infrastructure?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;build a K8s host metrics dashboard showing CPU, memory, disk per node.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Completely different data source — Kubernetes metrics, not nginx logs. Same experience. The assistant figured out where the data lived, what metrics to pull, and how to visualize them.&lt;/p&gt;

&lt;p&gt;What impressed me was the panel design. Usage per node and cumulative across the cluster. Separate tabs for CPU, memory, and disk. It understood that "CPU per node" implies a time series grouped by host, not a single aggregate gauge. That's the kind of design decision a human SRE makes after looking at the data — and the assistant just did it.&lt;/p&gt;

&lt;p&gt;The assistant had enough context about the infrastructure to know what clusters were running and what hosts were connected. I didn't explain my setup. It already knew.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Proactive: Don't Wait Until Something Breaks
&lt;/h3&gt;

&lt;p&gt;Dashboards are great, but nobody wants to stare at them all day. I wanted to see if I could use the assistant proactively — scan everything, find problems before they escalate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;what's the health of the otel-demo right now? if anything is red, create an alert.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't asking for one dashboard or one service. It's saying: scan all services, tell me how we're doing, and if something looks off, lock in an alert so I'm covered.&lt;/p&gt;

&lt;p&gt;It checked error rates and latencies across every service. Found the ones running green, identified the ones that weren't. And for anything red — it created an alert. Right there. No configuration. No navigating to the alerts page.&lt;/p&gt;

&lt;p&gt;This is the kind of thing most teams only set up &lt;em&gt;after&lt;/em&gt; an incident, during the postmortem, when someone says "we should have caught this earlier." One sentence and you're covered before the page goes off.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Something's Actually Broken: Root Cause Analysis
&lt;/h3&gt;

&lt;p&gt;Now the real test. The cart service in the otel-demo app is throwing errors. Not a synthetic scenario — a real incident.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;otel-demo app cart is throwing errors. find the root cause.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happened next is worth breaking down step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It searched across &lt;strong&gt;both logs and traces&lt;/strong&gt; — not one or the other, both at once&lt;/li&gt;
&lt;li&gt;It looked for errors in the last six hours and found none&lt;/li&gt;
&lt;li&gt;It &lt;strong&gt;automatically widened the search window&lt;/strong&gt; — I didn't tell it to do that&lt;/li&gt;
&lt;li&gt;It identified the pattern: cart service failing on database writes under load&lt;/li&gt;
&lt;li&gt;It showed me the exact traces, the error distribution over time, and the specific downstream call that was failing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every step was visible. I could expand any tool call, see the exact query it ran, and verify the result. It's not a black box. It shows its work — and if I disagreed with where it was going, I could redirect it.&lt;/p&gt;

&lt;p&gt;Once I had the root cause, I stayed in the same conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alert me if cart error rate crosses 10 errors in 5 minutes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same context. Same conversation. Investigation to prevention in two sentences.&lt;/p&gt;

&lt;p&gt;That last part is what I keep coming back to. The assistant doesn't just help you find problems; it helps you lock in detection, so the same failure is caught early instead of surprising you at 3 AM next week.&lt;/p&gt;
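
&lt;p&gt;For readers who want to see what a rule like "more than 10 errors in 5 minutes" means mechanically, here is a small sliding-window sketch. It is not how OpenObserve evaluates alerts (the article does not describe that); &lt;code&gt;ErrorRateAlert&lt;/code&gt; is a hypothetical class written only to illustrate the threshold logic.&lt;/p&gt;

```python
from collections import deque


class ErrorRateAlert:
    """Fire when more than `threshold` errors fall inside a sliding time window."""

    def __init__(self, threshold=10, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()  # error timestamps in seconds, oldest first

    def record_error(self, timestamp):
        """Record one error and return True if the alert should fire."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop errors that have aged out of the window.
        while self._events and cutoff >= self._events[0]:
            self._events.popleft()
        return len(self._events) > self.threshold


alert = ErrorRateAlert()
fired = [alert.record_error(t) for t in range(11)]  # 11 errors, one per second
print(fired[-1])
```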




&lt;h2&gt;
  
  
  Beyond the UI: Take It to Your IDE
&lt;/h2&gt;

&lt;p&gt;Here's the part that changes the workflow entirely. You don't have to be inside the OpenObserve UI to get this.&lt;/p&gt;

&lt;p&gt;OpenObserve exposes all of this through an MCP server. Connect your AI coding assistant (Claude Code, Cursor, whatever you use) directly to your production observability data. One command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add o2 https://api.openobserve.ai/api/default/mcp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; http &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &amp;lt;YOUR_TOKEN&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Under five minutes. Now your IDE can query production logs, metrics, and traces. Debug a deploy from your terminal. Pull up a trace without leaving your editor. Check error rates during a code review.&lt;/p&gt;

&lt;p&gt;The assistant follows you wherever you work — not just inside the observability platform.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Actually Changes
&lt;/h2&gt;

&lt;p&gt;There's been a lot of noise about AI in observability. Most of it falls into two camps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anomaly detection&lt;/strong&gt; — useful in theory, unpredictable in practice, hard to trust&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI replaces on-call&lt;/strong&gt; — not happening, and most engineers don't want it to&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thing that's working right now is neither of those. It's reducing the friction between &lt;em&gt;"something is wrong"&lt;/em&gt; and &lt;em&gt;"here's what I know."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not replacing your judgment. Not replacing your experience. Just removing the parts of incident response that feel like operating a query builder with one eye open at 2 AM.&lt;/p&gt;

&lt;p&gt;From &lt;em&gt;"I need to see what's happening"&lt;/em&gt; to &lt;em&gt;"I know what happened and we're covered next time"&lt;/em&gt; — in one conversation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/docs/integration/ai/mcp/claude/" rel="noopener noreferrer"&gt;OpenObserve MCP Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/webinars-videos/integration-with-ai-tools-a-step-by-step-guide-using-mcp/" rel="noopener noreferrer"&gt;Integration with AI Tools Using MCP — Workshop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/openobserve/openobserve" rel="noopener noreferrer"&gt;OpenObserve on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Have you tried connecting AI assistants to your observability stack? What's working? What's still painful? Drop a comment — I'm genuinely curious what others are seeing.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ai</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Monitoring Java Microservices with OpenTelemetry and OpenObserve</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:14:39 +0000</pubDate>
      <link>https://dev.to/openobserve/monitoring-java-microservices-with-opentelemetry-and-openobserve-2d1</link>
      <guid>https://dev.to/openobserve/monitoring-java-microservices-with-opentelemetry-and-openobserve-2d1</guid>
      <description>&lt;p&gt;Monitoring microservices is hard.&lt;/p&gt;

&lt;p&gt;When a user request fans out across multiple services, each with its own database, logs, and failure modes, traditional monitoring tools often give you a fragmented picture. You can tell &lt;em&gt;something&lt;/em&gt; is slow, but not exactly &lt;em&gt;where&lt;/em&gt; or &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Distributed tracing solves this.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll implement distributed tracing for a Java Spring Boot microservices application using two open-source tools: &lt;strong&gt;OpenTelemetry&lt;/strong&gt; and &lt;strong&gt;OpenObserve&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If your stack includes other languages, check out these guides too:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/distributed-tracing-in-dotnet-application/" rel="noopener noreferrer"&gt;.NET&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/monitoring-go-with-opentelemetry/" rel="noopener noreferrer"&gt;Go&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/distributed-tracing-in-nodejs-with-opentelemetry/" rel="noopener noreferrer"&gt;Node.js&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What you'll build
&lt;/h2&gt;

&lt;p&gt;By the end of this guide, you'll have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A working Spring Boot microservices setup with cross-service HTTP calls&lt;/li&gt;
&lt;li&gt;Zero-code instrumentation using the OpenTelemetry Java Agent&lt;/li&gt;
&lt;li&gt;End-to-end traces in OpenObserve with flamegraph and Gantt chart views&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What is distributed tracing?
&lt;/h2&gt;

&lt;p&gt;In microservices, one user action can trigger a chain of calls across many services. If a request takes 3 seconds, tracing helps answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which service caused the delay?&lt;/li&gt;
&lt;li&gt;Which operation failed?&lt;/li&gt;
&lt;li&gt;Where exactly was time spent?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Distributed tracing works by attaching context (&lt;code&gt;trace_id&lt;/code&gt;, &lt;code&gt;span_id&lt;/code&gt;) at request entry and propagating it across service boundaries (usually with &lt;code&gt;traceparent&lt;/code&gt; headers). This gives you one complete request journey.&lt;/p&gt;
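
&lt;p&gt;To show what that propagated context looks like, here is a minimal sketch that builds and parses a W3C &lt;code&gt;traceparent&lt;/code&gt; header (version, trace ID, parent span ID, flags). The helper names are hypothetical, and the sketch skips some spec rules such as rejecting all-zero IDs. The OpenTelemetry Java Agent does all of this for you; the code is only here to make the format visible.&lt;/p&gt;

```python
import re
import secrets

# version - 16-byte trace id - 8-byte parent span id - flags, all lowercase hex
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace at the request entry point."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Split an incoming header into its fields."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    sampled = int(flags, 16) % 2 == 1  # lowest bit is the sampled flag
    return {"version": version, "trace_id": trace_id, "parent_id": parent_id, "sampled": sampled}


def child_traceparent(header):
    """Header for an outgoing call: same trace id, fresh span id."""
    ctx = parse_traceparent(header)
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()
outgoing = child_traceparent(incoming)
print(incoming)
print(outgoing)
```

&lt;p&gt;The trace ID stays constant across hops while the span ID changes, which is what lets the backend stitch separate services into one request journey.&lt;/p&gt;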

&lt;p&gt;A trace is made up of spans. Each span records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service + operation&lt;/li&gt;
&lt;li&gt;Start time + duration&lt;/li&gt;
&lt;li&gt;HTTP details (method, URL, status)&lt;/li&gt;
&lt;li&gt;DB query metadata&lt;/li&gt;
&lt;li&gt;Errors/exceptions&lt;/li&gt;
&lt;li&gt;Parent-child relationships&lt;/li&gt;
&lt;/ul&gt;
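
&lt;p&gt;The sketch below models a few of those span fields and uses the parent-child links to answer "where was time spent?" by computing each span's self time (its duration minus its direct children). The &lt;code&gt;Span&lt;/code&gt; class, the operation names, and the durations are invented for illustration, and the calculation assumes child spans run one after another rather than in parallel.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    operation: str
    duration_ms: float


def self_times(spans):
    """Map span_id to duration minus the time spent in direct children."""
    result = {s.span_id: s.duration_ms for s in spans}
    for s in spans:
        if s.parent_id in result:
            result[s.parent_id] -= s.duration_ms
    return result


# Invented example of a 3-second request.
spans = [
    Span("a", None, "payment-service", "process payment", 3000.0),
    Span("b", "a", "order-service", "fetch order", 2800.0),
    Span("c", "b", "user-service", "fetch user", 2600.0),
    Span("d", "c", "user-service", "MySQL query", 2500.0),
]

times = self_times(spans)
slowest = max(spans, key=lambda s: times[s.span_id])
print(slowest.service, slowest.operation, times[slowest.span_id])
```

&lt;p&gt;A flamegraph or Gantt view in a tracing UI is essentially a visual version of this calculation.&lt;/p&gt;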

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77fyvps3sm2xpi4kqtj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77fyvps3sm2xpi4kqtj5.png" alt="Key components of distributed tracing in an e-commerce example" width="788" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For deeper fundamentals: &lt;a href="https://openobserve.ai/blog/distributed-tracing-basics-to-beyond-guide/" rel="noopener noreferrer"&gt;Distributed Tracing Basics to Beyond&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why OpenTelemetry + OpenObserve?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  OpenTelemetry
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; is a CNCF project that provides a vendor-neutral standard for traces, metrics, and logs.&lt;br&gt;&lt;br&gt;
For Java, the &lt;strong&gt;OpenTelemetry Java Agent&lt;/strong&gt; can auto-instrument Spring Boot, JDBC, and HTTP clients with no code changes.&lt;/p&gt;
&lt;h3&gt;
  
  
  OpenObserve
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://openobserve.ai/" rel="noopener noreferrer"&gt;OpenObserve&lt;/a&gt; is an open-source backend for logs, metrics, and traces.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OTLP-native ingest&lt;/li&gt;
&lt;li&gt;SQL-powered analytics&lt;/li&gt;
&lt;li&gt;Unified observability in one interface&lt;/li&gt;
&lt;li&gt;Lightweight and storage-efficient&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Architecture used in this tutorial
&lt;/h2&gt;

&lt;p&gt;We'll run four services:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;discovery-service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8761&lt;/td&gt;
&lt;td&gt;Eureka registry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user-service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8081&lt;/td&gt;
&lt;td&gt;User CRUD (MySQL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;order-service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8082&lt;/td&gt;
&lt;td&gt;Order management; calls &lt;code&gt;user-service&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;payment-service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8083&lt;/td&gt;
&lt;td&gt;Payment processing; calls &lt;code&gt;order-service&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key trace path is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;payment-service -&amp;gt; order-service -&amp;gt; user-service -&amp;gt; MySQL&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl10230zmt03dpjjnge17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl10230zmt03dpjjnge17.png" alt="Payment Processing Workflow" width="800" height="505"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Java 17+&lt;/li&gt;
&lt;li&gt;Maven 3.8+&lt;/li&gt;
&lt;li&gt;Docker + Docker Compose&lt;/li&gt;
&lt;li&gt;MySQL 8 (or use Dockerized MySQL from compose)&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Step 1: Clone the project
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/openobserve/java-distributed-tracing.git
&lt;span class="nb"&gt;cd &lt;/span&gt;java-distributed-tracing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Start OpenObserve and MySQL
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1n201ddvbbv2p4u6oos.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1n201ddvbbv2p4u6oos.png" alt="Docker Compose startup" width="800" height="95"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This starts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenObserve: &lt;code&gt;http://localhost:5080&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;MySQL: &lt;code&gt;localhost:3306&lt;/code&gt; (&lt;code&gt;tracingdb&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Log in to OpenObserve with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Email: &lt;code&gt;admin@example.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Password: &lt;code&gt;Admin123!&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjktcbc95nhgy2yby0osz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjktcbc95nhgy2yby0osz.png" alt="OpenObserve dashboard" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 3: Download OpenTelemetry Java Agent
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;agents
curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; agents/opentelemetry-javaagent.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fx87t0a6oqn4pg7ofnb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fx87t0a6oqn4pg7ofnb.png" alt="Download OpenTelemetry Java Agent" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 4: Configure agent export to OpenObserve
&lt;/h2&gt;

&lt;p&gt;Example from &lt;code&gt;user-service/scripts/start.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_SERVICE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;user-service
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_RESOURCE_ATTRIBUTES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;service.name&lt;span class="o"&gt;=&lt;/span&gt;user-service,deployment.environment&lt;span class="o"&gt;=&lt;/span&gt;dev
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_TRACES_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_METRICS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;none
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_LOGS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;none
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_TRACES_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:5080/api/default/traces
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_TRACES_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http/protobuf
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_TRACES_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Authorization=Basic {token}"&lt;/span&gt;

java &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-Xms256m&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-Xmx512m&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-javaagent&lt;/span&gt;:../agents/opentelemetry-javaagent.jar &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-jar&lt;/span&gt; target/user-service-0.0.1-SNAPSHOT.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get &lt;code&gt;{token}&lt;/code&gt; from the OpenObserve UI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flavcpola4g6acknmci64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flavcpola4g6acknmci64.png" alt="OpenTelemetry token location" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;
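
&lt;p&gt;If you would rather generate these settings than hand-edit each &lt;code&gt;start.sh&lt;/code&gt;, the sketch below builds the same environment variables in Python. It assumes the token is a standard HTTP Basic credential, that is, the Base64 encoding of &lt;code&gt;user:secret&lt;/code&gt;; which secret OpenObserve expects is not spelled out here, so treat the value shown in the UI as authoritative. &lt;code&gt;basic_token&lt;/code&gt; and &lt;code&gt;otlp_env&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

```python
import base64


def basic_token(user, secret):
    """Base64-encode "user:secret" as used by HTTP Basic authentication."""
    raw = f"{user}:{secret}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def otlp_env(service_name, user, secret, base_url="http://localhost:5080", org="default"):
    """Return the trace-export variables from start.sh as a dictionary."""
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_METRICS_EXPORTER": "none",
        "OTEL_LOGS_EXPORTER": "none",
        "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": f"{base_url}/api/{org}/traces",
        "OTEL_EXPORTER_OTLP_TRACES_PROTOCOL": "http/protobuf",
        "OTEL_EXPORTER_OTLP_TRACES_HEADERS": f"Authorization=Basic {basic_token(user, secret)}",
    }


# "change-me" is a placeholder; never commit real credentials.
for key, value in otlp_env("user-service", "admin@example.com", "change-me").items():
    print(f"export {key}={value!r}")
```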




&lt;h2&gt;
  
  
  Step 5: Start discovery-service
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;discovery-service
mvn clean &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-Dmaven&lt;/span&gt;.test.skip
sh scripts/start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open: &lt;code&gt;http://localhost:8761&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw8dxbyyfusxrafqvduk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw8dxbyyfusxrafqvduk.png" alt="Discovery service" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: Start user/order/payment services
&lt;/h2&gt;

&lt;p&gt;Run each in a separate terminal, starting from the repository root.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;user-service
mvn clean &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-Dmaven&lt;/span&gt;.test.skip
sh scripts/start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;order-service
mvn clean &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-Dmaven&lt;/span&gt;.test.skip
sh scripts/start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;payment-service
mvn clean &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-Dmaven&lt;/span&gt;.test.skip
sh scripts/start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify registration in Eureka:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3btnshbhbcoc5wx25ozb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3btnshbhbcoc5wx25ozb.png" alt="Eureka dashboard" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 7: Generate traces
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) Create user
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8081/api/users &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "Priya Sharma",
    "email": "priya@example.com",
    "phone": "+91-9876543210"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq5neqphvupz3bidqdmu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq5neqphvupz3bidqdmu.png" alt="Create user API" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Create order
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8082/api/orders &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "userId": 1,
    "productName": "Mechanical Keyboard",
    "quantity": 1,
    "totalAmount": 4999.00
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl4100lzdbalcaux0hag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl4100lzdbalcaux0hag.png" alt="Create order API" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Process payment (full distributed trace)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8083/api/payments/process &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "userId": 1,
    "orderId": 1,
    "amount": 4999.00,
    "currency": "INR",
    "paymentMethod": "UPI"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cit2w42p9mohjrjz0uz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cit2w42p9mohjrjz0uz.png" alt="Process payment API" width="800" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4) Trigger an error trace
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8082/api/orders &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "userId": 9999,
    "productName": "Test Product",
    "quantity": 1,
    "totalAmount": 100.00
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected: &lt;code&gt;400 Bad Request&lt;/code&gt;, since user ID 9999 should not exist at this point.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm2rbdfrvh1qlf4x05x6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm2rbdfrvh1qlf4x05x6.png" alt="Error test API" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Visualize in OpenObserve
&lt;/h2&gt;

&lt;p&gt;Open &lt;code&gt;http://localhost:5080&lt;/code&gt; and go to &lt;strong&gt;Traces&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trace Explorer
&lt;/h3&gt;

&lt;p&gt;You'll see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trace ID&lt;/li&gt;
&lt;li&gt;Root span&lt;/li&gt;
&lt;li&gt;Service&lt;/li&gt;
&lt;li&gt;Duration&lt;/li&gt;
&lt;li&gt;Span count&lt;/li&gt;
&lt;li&gt;Status&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ahns6faayter6s4tg1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ahns6faayter6s4tg1n.png" alt="Trace explorer" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Filter examples
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;service_name = payment-service&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;status = ERROR&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Duration range&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;operation_name&lt;/code&gt; for specific endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2th52axaqj1yl6pcav0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2th52axaqj1yl6pcav0n.png" alt="Filter by service" width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bxpyygudbc8715k29n7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bxpyygudbc8715k29n7.png" alt="Filter by error" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Flamegraph + Gantt chart
&lt;/h3&gt;

&lt;p&gt;Click a &lt;code&gt;POST /api/payments/process&lt;/code&gt; trace.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flamegraph&lt;/strong&gt;: nested span timing hierarchy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gantt&lt;/strong&gt;: timeline-aligned span bars&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qjfrrtg9p48v1bunh43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qjfrrtg9p48v1bunh43.png" alt="Flamegraph view" width="800" height="225"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ertp2jg6wiq45g4qx4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ertp2jg6wiq45g4qx4n.png" alt="Gantt chart view" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Query traces with SQL
&lt;/h2&gt;

&lt;p&gt;OpenObserve supports SQL over trace data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Slowest payment traces
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;operation_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'payment-service'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;operation_name&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%payments/process%'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffoqi5yexqdywd2x58953.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffoqi5yexqdywd2x58953.png" alt="Slow traces SQL" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Error count by service
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;error_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;span_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'ERROR'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;error_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F422qumf0xylbneza7n9m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F422qumf0xylbneza7n9m.png" alt="Error count SQL" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Avg/max latency by service
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_duration_us&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_duration_us&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;request_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ay426qextfca93vmhhs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ay426qextfca93vmhhs.png" alt="Latency SQL" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;
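
&lt;p&gt;Averages can hide failures, so an error-count view is a useful companion to the latency query. The following is a minimal sketch, assuming the same &lt;code&gt;"default"&lt;/code&gt; stream and the &lt;code&gt;span_status&lt;/code&gt; column used in the queries above, and assuming each row in the stream is one span (so the counts are span counts, not request counts):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sketch only: stream name and column names are taken from the queries above.
-- Counts all spans and error spans per service in one pass.
SELECT service_name,
       COUNT(*) AS span_count,
       SUM(CASE WHEN span_status = 'ERROR' THEN 1 ELSE 0 END) AS error_count
FROM "default"
GROUP BY service_name
ORDER BY error_count DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Dividing &lt;code&gt;error_count&lt;/code&gt; by &lt;code&gt;span_count&lt;/code&gt; gives a rough per-service error ratio that can be compared across services with different traffic levels.&lt;/p&gt;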




&lt;h2&gt;
  
  
  What the Java agent captured automatically
&lt;/h2&gt;

&lt;p&gt;Without adding tracing code, the OpenTelemetry Java Agent instrumented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spring Web incoming HTTP requests&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RestTemplate&lt;/code&gt; outbound calls (with the W3C &lt;code&gt;traceparent&lt;/code&gt; header injected)&lt;/li&gt;
&lt;li&gt;JDBC/MySQL queries&lt;/li&gt;
&lt;li&gt;Context propagation across service boundaries&lt;/li&gt;
&lt;/ul&gt;
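
&lt;p&gt;One way to check that context propagation worked is to confirm that a single trace contains spans from more than one service. This is a minimal sketch, assuming the same &lt;code&gt;"default"&lt;/code&gt; stream and column names used in the SQL queries earlier in this article; &lt;code&gt;YOUR_TRACE_ID&lt;/code&gt; is a placeholder for a trace ID copied from the Trace Explorer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sketch only: replace YOUR_TRACE_ID with a real trace ID from the Trace Explorer.
-- Lists every service that contributed spans to that one trace.
SELECT service_name, COUNT(*) AS span_count
FROM "default"
WHERE trace_id = 'YOUR_TRACE_ID'
GROUP BY service_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the query returns more than one row for a request that crosses services, the trace context was carried across the service boundary.&lt;/p&gt;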

&lt;p&gt;See supported libraries: &lt;a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md" rel="noopener noreferrer"&gt;OpenTelemetry Java Instrumentation&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final takeaway
&lt;/h2&gt;

&lt;p&gt;You now have end-to-end distributed tracing for a Java microservices app with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero-code instrumentation&lt;/li&gt;
&lt;li&gt;Full request path visibility&lt;/li&gt;
&lt;li&gt;Visual root-cause analysis (flamegraph/Gantt)&lt;/li&gt;
&lt;li&gt;SQL-based troubleshooting in OpenObserve&lt;/li&gt;
&lt;li&gt;A path to production scaling without vendor lock-in&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/distributed-tracing-in-dotnet-application/" rel="noopener noreferrer"&gt;Distributed Tracing in .NET Applications using OpenTelemetry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/monitoring-go-with-opentelemetry/" rel="noopener noreferrer"&gt;Distributed Tracing in Go Applications with OpenTelemetry and OpenObserve&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/blog/distributed-tracing-in-nodejs-with-opentelemetry/" rel="noopener noreferrer"&gt;Distributed Tracing in Node.js Applications with OpenTelemetry&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>java</category>
      <category>springboot</category>
      <category>opentelemetry</category>
      <category>observability</category>
    </item>
    <item>
      <title>Top Log Visualization Tools in 2026: Dashboards, Search &amp; AI-Assisted Analysis</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Tue, 17 Mar 2026 08:44:41 +0000</pubDate>
      <link>https://dev.to/openobserve/top-log-visualization-tools-in-2026-dashboards-search-ai-assisted-analysis-2g9</link>
      <guid>https://dev.to/openobserve/top-log-visualization-tools-in-2026-dashboards-search-ai-assisted-analysis-2g9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; The best log visualization tools in 2026 are &lt;strong&gt;OpenObserve&lt;/strong&gt;, Kibana (Elastic Stack), Grafana + Loki, Datadog Logs, and Splunk. OpenObserve stands out by combining traditional dashboards with a built-in AI assistant (&lt;strong&gt;O2 Assistant&lt;/strong&gt;) that lets you query, correlate, and visualize logs in plain English.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Separates Great Log Visualization from Basic Log Search?
&lt;/h2&gt;

&lt;p&gt;Most log tools can search. The best ones let you &lt;em&gt;understand&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;In 2026, the gap has widened between tools that simply dump raw text and those that provide a fast path from &lt;strong&gt;alert → root cause → fix&lt;/strong&gt;. The features that define the leaders today include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Saved Views &amp;amp; Search Templates&lt;/strong&gt; – Reuse complex filters without starting from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard Templating&lt;/strong&gt; – Parameterized views that scale across services and environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anomaly Detection&lt;/strong&gt; – Surfacing "unknown unknowns" without manual thresholds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep Drill-Down&lt;/strong&gt; – Moving from a high-level spike to specific log lines in one click.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-Assisted Analysis&lt;/strong&gt; – Using natural language to generate complex queries.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Best Log Visualization Tools in 2026
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;AI-Assisted Analysis&lt;/th&gt;
&lt;th&gt;Open Source&lt;/th&gt;
&lt;th&gt;Deployment&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenObserve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;O2 Assistant + MCP&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Self-hosted / Cloud&lt;/td&gt;
&lt;td&gt;Full-stack observability with AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kibana (Elastic)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Partial (ML add-on)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Self-hosted / Cloud&lt;/td&gt;
&lt;td&gt;Full-text search, complex pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grafana + Loki&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Partial (plugin)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Self-hosted / Cloud&lt;/td&gt;
&lt;td&gt;Prometheus-native teams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datadog Logs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Watchdog AI&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;SaaS&lt;/td&gt;
&lt;td&gt;Managed, all-in-one observability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Splunk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Splunk AI&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Self-hosted / Cloud&lt;/td&gt;
&lt;td&gt;Enterprise SIEM &amp;amp; security&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  1. OpenObserve — Best for AI-Assisted Log Visualization
&lt;/h2&gt;

&lt;p&gt;Of the five tools compared here, OpenObserve is the one that treats AI-assisted analysis as a native feature rather than an add-on. Its &lt;strong&gt;O2 Assistant&lt;/strong&gt; is a full observability co-pilot that understands your schema, queries, and infrastructure topology.&lt;/p&gt;

&lt;h3&gt;
  
  
  What makes O2 Assistant different?
&lt;/h3&gt;

&lt;p&gt;Traditional visualization requires you to know what to look for. With O2 Assistant, the workflow inverts: &lt;strong&gt;You describe the problem; the tool finds the evidence.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Show me error rate spikes in the payment service over the last 6 hours, correlated with any upstream database latency."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gmm9x86afugdgnemr4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gmm9x86afugdgnemr4o.png" alt="NLP mode for SQL queries with AI Assistant" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Capabilities
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Natural Language to Query:&lt;/strong&gt; Translates English into SQL, PromQL, or VRL scripts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Telemetry Correlation:&lt;/strong&gt; Query logs, metrics, and traces in the same conversation thread.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-Generated Dashboards:&lt;/strong&gt; Use the MCP (Model Context Protocol) server to build entire dashboards from a single prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ad-hoc Investigation:&lt;/strong&gt; Perfect for "2 AM incidents" where you don't have a pre-built dashboard ready.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Works with Your Existing Stack
&lt;/h3&gt;

&lt;p&gt;OpenObserve supports &lt;strong&gt;Fluent Bit, Vector, Logstash, Filebeat, and OpenTelemetry&lt;/strong&gt;. You can repoint your existing shippers and be up and running in minutes. It also features a built-in visual pipeline editor with over 100 VRL functions for real-time parsing and redaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo117ol034gyau7m5ecu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo117ol034gyau7m5ecu2.png" alt="Agent receivers ingestion flow into OpenObserve" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Kibana (Elastic Stack) — Best for Full-Text Search
&lt;/h2&gt;

&lt;p&gt;Kibana remains the gold standard for inverted-index search. Its &lt;strong&gt;Lens&lt;/strong&gt; visualization engine and &lt;strong&gt;Discover&lt;/strong&gt; view are incredibly mature.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt; High customizability, mature drag-and-drop editors, and powerful ML-driven anomaly detection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaknesses:&lt;/strong&gt; High resource consumption (RAM-hungry) and a steeper learning curve for KQL (Kibana Query Language) compared to natural language interfaces.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Grafana + Loki — Best for Prometheus-Native Teams
&lt;/h2&gt;

&lt;p&gt;For teams already deep in the Prometheus ecosystem, Grafana + Loki is the natural choice. It uses the same label model and UI you already know.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt; Unified dashboards for metrics, logs, and traces; excellent Kubernetes integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaknesses:&lt;/strong&gt; Loki only indexes labels, making full-text search over unstructured logs slower and more expensive than indexed alternatives.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  4. Datadog Logs — Best Managed Option
&lt;/h2&gt;

&lt;p&gt;Datadog offers the most polished "zero-ops" experience. Its &lt;strong&gt;Watchdog AI&lt;/strong&gt; surfaces anomalies automatically, and the integration between logs and distributed traces is seamless.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tradeoff:&lt;/strong&gt; Cost. As log volume grows, Datadog’s pricing often forces teams to sample or redact data aggressively to stay within budget.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Splunk — Best for Enterprise Security
&lt;/h2&gt;

&lt;p&gt;Splunk is the powerhouse of the SIEM world. If your log visualization needs are tied to forensic investigation and strict compliance, Splunk’s SPL (Search Processing Language) is unmatched. For standard app observability, however, it is often considered overengineered.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Shift: From Dashboards to Conversations
&lt;/h2&gt;

&lt;p&gt;Traditional observability meant building dashboards for "known" failure modes, but modern distributed systems also fail in "unknown" ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-assisted log analysis&lt;/strong&gt; changes the game by allowing exploratory investigation. When you can generate a correlated view across logs and metrics via a chat interface, the "Time to Resolution" (TTR) drops significantly. This is why OpenObserve’s native AI integration represents a fundamental shift in how we handle incidents in 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is the lowest-cost log tool?&lt;/strong&gt;&lt;br&gt;
OpenObserve typically offers the lowest storage costs (up to 140x lower than ELK) due to its S3-native architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does OpenObserve work with OpenTelemetry?&lt;/strong&gt;&lt;br&gt;
Yes, it is OTLP-native and supports logs, metrics, and traces via OpenTelemetry collectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I create dashboards using AI?&lt;/strong&gt;&lt;br&gt;
Yes. Using OpenObserve's AI assistant, you can generate complete dashboard panels from a simple text prompt.&lt;/p&gt;




&lt;h3&gt;
  
  
  Get Started
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://cloud.openobserve.ai" rel="noopener noreferrer"&gt;OpenObserve Cloud&lt;/a&gt;&lt;/strong&gt; — 14-day free trial, no credit card required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted&lt;/strong&gt; — Run it as a single binary or via Helm charts in under 10 minutes.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>observability</category>
      <category>logs</category>
      <category>ai</category>
    </item>
    <item>
      <title>Jaeger for Distributed Tracing: A Complete Guide with OpenObserve Comparison</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Fri, 13 Feb 2026 15:12:29 +0000</pubDate>
      <link>https://dev.to/openobserve/jaeger-for-distributed-tracing-a-complete-guide-with-openobserve-comparison-22ac</link>
      <guid>https://dev.to/openobserve/jaeger-for-distributed-tracing-a-complete-guide-with-openobserve-comparison-22ac</guid>
      <description>&lt;p&gt;As software systems evolve, they become increasingly complex, especially with the rise of microservices and distributed architectures. Keeping track of what's happening across different services can quickly become a daunting task. Tracing tools like Jaeger have emerged as essential solutions for debugging and monitoring distributed applications, helping developers understand and optimise their systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this blog, we will cover:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Pillars of Observability&lt;/li&gt;
&lt;li&gt;Background on Distributed Tracing&lt;/li&gt;
&lt;li&gt;What Is Jaeger?&lt;/li&gt;
&lt;li&gt;How Jaeger Works: Key Concepts and Components&lt;/li&gt;
&lt;li&gt;How Jaeger Collects and Visualizes Traces&lt;/li&gt;
&lt;li&gt;Getting Started with Jaeger&lt;/li&gt;
&lt;li&gt;Getting Started with OpenObserve&lt;/li&gt;
&lt;li&gt;Jaeger vs. OpenObserve&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;li&gt;Real-World Case Study: Jidu's Journey to 100% Tracing Fidelity&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Prerequisites:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A running Docker instance with admin access.&lt;/li&gt;
&lt;li&gt;An OpenObserve instance or cloud account ready to receive trace data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Pillars of Observability
&lt;/h2&gt;

&lt;p&gt;To truly understand Jaeger, it's vital to grasp the concept of observability. Observability allows us to infer the internal states of systems through their outputs, and it primarily revolves around three pillars:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logging:&lt;/strong&gt; Capturing individual events or errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics:&lt;/strong&gt; Quantifying system performance and resource usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tracing:&lt;/strong&gt; Visualizing request paths and measuring latency across services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While logging and metrics provide critical insights, distributed tracing complements them by offering context on how different services interact and depend on one another.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background on Distributed Tracing
&lt;/h2&gt;

&lt;p&gt;Before we dive into Jaeger, it's essential to understand the concept of distributed tracing and why it's crucial in microservices environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Distributed Tracing?
&lt;/h3&gt;

&lt;p&gt;Distributed tracing is a methodology used to track and analyze requests as they traverse through various services in a distributed system. It helps in visualizing the journey of a request, from the initial entry point all the way to the final response.&lt;/p&gt;

&lt;p&gt;For example, a single request might flow through Service A → Service B → Service C → Service D.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is Distributed Tracing Important?
&lt;/h3&gt;

&lt;p&gt;In monolithic applications, tracing and debugging are straightforward. However, modern applications often depend on multiple microservices communicating over networks, complicating the identification of delays or failures.&lt;/p&gt;

&lt;p&gt;Logging alone can't capture complex dependencies or detect bottlenecks. Distributed tracing tools like Jaeger provide end-to-end visibility of requests, capturing metadata at each step, which helps developers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trace requests across services&lt;/li&gt;
&lt;li&gt;Visualise service dependencies and interactions&lt;/li&gt;
&lt;li&gt;Identify performance bottlenecks&lt;/li&gt;
&lt;li&gt;Quickly troubleshoot issues by pinpointing problematic services&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Is Jaeger?
&lt;/h2&gt;

&lt;p&gt;Jaeger is an open-source, end-to-end distributed tracing tool originally developed by Uber Technologies. Now part of the CNCF (Cloud Native Computing Foundation), Jaeger allows developers to trace requests as they propagate through distributed systems, providing insights into service behavior and performance bottlenecks.&lt;/p&gt;

&lt;p&gt;With Jaeger, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track request latency and identify services contributing to slow response times&lt;/li&gt;
&lt;li&gt;Monitor errors and investigate the root cause of failures across services&lt;/li&gt;
&lt;li&gt;Visualise dependency graphs for services to understand relationships and interactions&lt;/li&gt;
&lt;li&gt;Optimise performance by identifying and removing bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Jaeger is widely adopted due to its powerful tracing capabilities, ease of use, and integration with other monitoring tools in the observability stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Jaeger Works: Key Concepts and Components
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh58gelacidtvo5vb5aww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh58gelacidtvo5vb5aww.png" alt="jaeger_architecture" width="800" height="220"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6zjvc6i9g09are7jbfs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6zjvc6i9g09are7jbfs.png" alt="jaeger_architecture" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Jaeger traces requests as they travel through various services in a distributed system. It captures information about each service's interaction, which helps in pinpointing issues. Let's break down the primary components of Jaeger to understand its functioning:&lt;/p&gt;
&lt;h3&gt;
  
  
  Spans and Traces:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Span:&lt;/strong&gt; A span represents a single unit of work within a trace, capturing details like start time, duration, and any metadata or tags. Each span represents a single service call or action in the overall trace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace:&lt;/strong&gt; A trace represents the entire journey of a request across multiple spans. For instance, when a user makes a request to an application, a trace records the entire sequence, from the front end to each microservice involved.&lt;/li&gt;
&lt;/ul&gt;
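&lt;p&gt;As a rough illustration of how the two concepts relate, the sketch below models spans as records that share a trace ID and point at their parent span. The &lt;code&gt;Span&lt;/code&gt; class, its field names, and the sample services are assumptions made for this example; they are not Jaeger's actual data model or API.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """Hypothetical, simplified span record (not Jaeger's real schema)."""
    trace_id: str             # shared by every span in the same trace
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_us: int             # start time in microseconds
    duration_us: int
    tags: Dict[str, str] = field(default_factory=dict)


def build_trace(spans: List[Span], trace_id: str) -> List[Span]:
    """A trace is every span carrying the same trace ID, in start order."""
    members = [s for s in spans if s.trace_id == trace_id]
    return sorted(members, key=lambda s: s.start_us)


# Made-up sample data: two spans in trace t1, one span in trace t2.
spans = [
    Span("t1", "b", "a", "shop", "GET /products", start_us=120, duration_us=900),
    Span("t1", "a", None, "frontend", "GET /", start_us=0, duration_us=1500),
    Span("t2", "x", None, "frontend", "GET /health", start_us=50, duration_us=10),
]

trace = build_trace(spans, "t1")
root = next(s for s in trace if s.parent_id is None)
print([s.operation for s in trace], "root service:", root.service)
```

&lt;p&gt;The parent pointer is what lets a tracing UI draw the nested timeline: every span except the root names the span that caused it.&lt;/p&gt;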

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf62f3whriy1l7oyrksa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf62f3whriy1l7oyrksa.png" alt="jaeger_trace" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This screenshot is from the HOT Commerce project by OpenObserve, which demonstrates tracing across microservices. For more details, visit the project on &lt;a href="https://github.com/openobserve/hotcommerce/" rel="noopener noreferrer"&gt;GitHub here.&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Trace Analysis:
&lt;/h4&gt;

&lt;p&gt;In the image above, each line represents a span—a single operation within the overall trace, showing the journey of a request across services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trace:&lt;/strong&gt; The set of spans forms the trace, covering services like frontend, shop, product, review, and price.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Longest Span:&lt;/strong&gt; The frontend service takes the longest time at 2.53 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shortest Span:&lt;/strong&gt; The request handler completes in just 27.00 microseconds (µs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Spans:&lt;/strong&gt; There are 15 spans, each representing a unit of work, such as middleware processing, database calls, and service interactions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This breakdown shows how the request interacts with multiple services and highlights areas for potential optimization.&lt;/p&gt;
&lt;h3&gt;
  
  
  Jaeger Client:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Jaeger clients are libraries that you embed in your application code to instrument services and collect tracing data. They generate spans and send them to the agent or collector for storage and analysis. Note that the Jaeger project has since deprecated its own client libraries in favor of OpenTelemetry.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The recommended alternative is to instrument with OpenTelemetry (OTel) SDKs. OpenTelemetry is a vendor-neutral observability framework that works with multiple tracing backends, including Jaeger, so you can switch to or add other observability tools without re-instrumenting your code.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Agent:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Jaeger agent is a lightweight daemon running alongside the application. It receives traces emitted by the client and batches them for efficient transmission to the collector.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The OpenTelemetry Collector can be used in place of the Jaeger agent. It receives, processes, and exports tracing data, and it can also handle metrics and logs. Because it can send data to multiple observability backends, it is a flexible choice for distributed tracing setups.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Collector:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Jaeger collector receives traces from agents and stores them in a backend. It also performs any preprocessing or filtering needed for the traces before they are stored.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In OpenTelemetry-based setups, the OTel Collector can handle this role as well, offering additional features like data transformation and routing, which make it ideal for complex or multi-backend environments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Query Service and UI:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Jaeger provides a UI for querying and visualising traces. Through this UI, developers can search for traces, identify latency bottlenecks, and visualise service dependencies and call hierarchies.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Storage Backend:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Jaeger supports various storage backends like Cassandra, Elasticsearch, or even local files for persistence. This allows you to store traces for later analysis and comparisons.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How Jaeger Collects and Visualizes Traces
&lt;/h2&gt;

&lt;p&gt;When a user request enters a service, the Jaeger client library starts a trace, generating a unique trace ID for that request. As the request flows through different services, the trace ID propagates along, with each service generating a span representing its part of the work. These spans are sent to the Jaeger agent and ultimately stored in the backend.&lt;/p&gt;
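&lt;p&gt;The propagation step can be sketched in a few lines. The example below assumes the W3C Trace Context &lt;code&gt;traceparent&lt;/code&gt; header format, which OpenTelemetry SDKs use by default (legacy Jaeger clients used their own &lt;code&gt;uber-trace-id&lt;/code&gt; header instead). The function names are hypothetical, and a real SDK handles all of this for you.&lt;/p&gt;

```python
import secrets


def new_traceparent():
    """Create a W3C traceparent value for a brand-new trace."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # version 00, flag 01 = sampled


def child_traceparent(incoming):
    """Keep the caller's trace ID, but mint a new span ID for this hop."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Service A starts the trace; B and C each continue it.
header_a = new_traceparent()
header_b = child_traceparent(header_a)
header_c = child_traceparent(header_b)

for name, header in [("A", header_a), ("B", header_b), ("C", header_c)]:
    print(name, header)
```

&lt;p&gt;All three headers carry the same trace ID, which is what allows the backend to stitch spans from different services into one trace.&lt;/p&gt;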

&lt;p&gt;The Jaeger UI allows you to visualise traces in a timeline view, making it easier to observe the sequence of events and locate bottlenecks. The UI also provides a service dependency graph that shows the relationships between services, allowing you to monitor dependencies and the overall health of your system.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started with Jaeger
&lt;/h2&gt;

&lt;p&gt;Here's a quick guide to setting up Jaeger in your environment. We'll use Docker to deploy Jaeger and assume you have Docker installed.&lt;br&gt;
For a complete setup guide, refer to the &lt;a href="https://www.jaegertracing.io/docs/1.62/getting-started/" rel="noopener noreferrer"&gt;Jaeger Getting Started Documentation.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Deploy Jaeger with Docker
&lt;/h3&gt;

&lt;p&gt;Jaeger offers an all-in-one image for testing and development purposes. To start the Jaeger all-in-one container, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; jaeger &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;COLLECTOR_ZIPKIN_HOST_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;:9411 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 6831:6831/udp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 6832:6832/udp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 5778:5778 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 16686:16686 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4317:4317 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4318:4318 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 14250:14250 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 14268:14268 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 14269:14269 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 9411:9411 &lt;span class="se"&gt;\&lt;/span&gt;
  jaegertracing/all-in-one:1.62.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command runs the Jaeger all-in-one Docker container, which is useful for testing and development. It exposes the following ports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6831/udp &amp;amp; 6832/udp:&lt;/strong&gt; Agent ports that receive trace data from Jaeger clients (Thrift over UDP).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5778:&lt;/strong&gt; Agent configuration HTTP endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;16686:&lt;/strong&gt; Jaeger Query UI for viewing and searching traces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4317:&lt;/strong&gt; OpenTelemetry gRPC endpoint for tracing data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4318:&lt;/strong&gt; OpenTelemetry HTTP endpoint for tracing data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;14250:&lt;/strong&gt; gRPC endpoint for the Jaeger collector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;14268:&lt;/strong&gt; HTTP endpoint for the collector to receive traces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;14269:&lt;/strong&gt; Health check endpoint for the collector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9411:&lt;/strong&gt; Zipkin-compatible endpoint for receiving data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This setup uses memory as the default backend storage, which is intended for short-term use and is not recommended for production due to the lack of persistence.&lt;/p&gt;

&lt;p&gt;You can access the Jaeger UI at &lt;strong&gt;&lt;a href="http://localhost:16686" rel="noopener noreferrer"&gt;http://localhost:16686&lt;/a&gt;&lt;/strong&gt; to visualise and interact with the collected traces.&lt;/p&gt;
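&lt;p&gt;If you want to check the OTLP/HTTP port (4318) before wiring up a real application, one option is to hand-build a single span and post it as JSON. This is a minimal sketch using only the Python standard library; it assumes the all-in-one container accepts JSON-encoded OTLP on &lt;code&gt;/v1/traces&lt;/code&gt;, and the service, span, and scope names are placeholders invented for the example.&lt;/p&gt;

```python
import json
import secrets
import time
import urllib.request

# Assumption: the Jaeger all-in-one container from the command above is
# running locally with port 4318 published.
OTLP_TRACES_URL = "http://localhost:4318/v1/traces"


def build_payload(service_name, span_name, duration_ms=50):
    """Build a minimal OTLP/JSON export request containing one span."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    span = {
        "traceId": secrets.token_hex(16),  # 32 hex characters
        "spanId": secrets.token_hex(8),    # 16 hex characters
        "name": span_name,
        "kind": 1,                         # SPAN_KIND_INTERNAL
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
    }
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{"scope": {"name": "manual-demo"}, "spans": [span]}],
        }]
    }


def send(payload, url=OTLP_TRACES_URL):
    """POST the payload and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_payload("demo-service", "say-hello")
print(json.dumps(payload, indent=2))
# Uncomment once the Jaeger container is running:
# print(send(payload))
```

&lt;p&gt;If the request is accepted, a service named &lt;code&gt;demo-service&lt;/code&gt; should show up in the Jaeger UI's service dropdown. For real applications, prefer an OpenTelemetry SDK over hand-built payloads.&lt;/p&gt;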

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16fjasa45n0jtybt3ynu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16fjasa45n0jtybt3ynu.jpg" alt="jaeger_UI" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Instrument the HotROD Sample Application
&lt;/h3&gt;

&lt;p&gt;Next, we'll run the HotROD sample application, which ships already instrumented with OpenTelemetry, and point it at Jaeger for distributed tracing.&lt;/p&gt;

&lt;h4&gt;
  
  
  What is HotROD?
&lt;/h4&gt;

&lt;p&gt;HotROD is a microservices application simulating a ride-hailing service, similar to Uber or Lyft. It consists of multiple services, such as ride management and driver management, making it an ideal example for demonstrating distributed tracing in a microservices architecture.&lt;/p&gt;

&lt;p&gt;To run the HotROD application alongside Jaeger, use the following Docker command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--link&lt;/span&gt; jaeger &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8080-8083:8080-8083 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://jaeger:4318"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  jaegertracing/example-hotrod:1.62.0 &lt;span class="se"&gt;\&lt;/span&gt;
  all &lt;span class="nt"&gt;--otel-exporter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command will run the HotROD sample application in a Docker container, linking it to the Jaeger container. It will expose ports 8080 to 8083 on the host for accessing the HotROD services. The application is configured to send tracing data to Jaeger via the OpenTelemetry Protocol (OTLP) at the specified endpoint.&lt;/p&gt;
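&lt;p&gt;Note that the endpoint in the command is only a base URL. As far as the OpenTelemetry exporter specification describes it, an OTLP/HTTP exporter appends a per-signal path (&lt;code&gt;/v1/traces&lt;/code&gt; for traces) to &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt;. The helper below is a hypothetical sketch of that rule, not code taken from HotROD or any SDK.&lt;/p&gt;

```python
from urllib.parse import urlsplit, urlunsplit


def otlp_http_traces_url(base_endpoint):
    """Append the traces path to a base OTLP/HTTP endpoint."""
    parts = urlsplit(base_endpoint)
    # Drop any trailing slash so the result never contains a double slash.
    path = parts.path.rstrip("/") + "/v1/traces"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(otlp_http_traces_url("http://jaeger:4318"))
print(otlp_http_traces_url("http://jaeger:4318/"))
```

&lt;p&gt;This is why the command passes &lt;code&gt;http://jaeger:4318&lt;/code&gt; without a path: the exporter is expected to add the rest itself.&lt;/p&gt;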

&lt;p&gt;You can access the HotROD UI at &lt;strong&gt;&lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7b4m2pbiu0jslly5ama.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7b4m2pbiu0jslly5ama.jpg" alt="hotrod_UI" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: View Traces in Jaeger UI
&lt;/h3&gt;

&lt;p&gt;Once your application is instrumented, run a few requests to generate some traces.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sb9fcjwqooi9alix5o1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sb9fcjwqooi9alix5o1.gif" alt="hotrod_UI_clicks" width="600" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, navigate to &lt;strong&gt;&lt;a href="http://localhost:16686" rel="noopener noreferrer"&gt;http://localhost:16686&lt;/a&gt;&lt;/strong&gt;, where you can query traces, visualise the flow of requests, and see latency and dependency data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgrll4vcgxh6w9s2g7gs.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgrll4vcgxh6w9s2g7gs.gif" alt="jaeger_UI_1" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with OpenObserve
&lt;/h2&gt;

&lt;p&gt;Now, let's guide you through the setup of OpenObserve using Docker for deployment.&lt;br&gt;
For a detailed setup guide, you can refer to the &lt;a href="https://openobserve.ai/docs/quickstart/#openobserve-cloud/" rel="noopener noreferrer"&gt;OpenObserve Quickstart Documentation.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Deploy OpenObserve with Docker
&lt;/h3&gt;

&lt;p&gt;OpenObserve provides a Docker image for easy deployment. To start using OpenObserve, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; openobserve &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$PWD&lt;/span&gt;/data:/data &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ZO_DATA_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/data"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 5080:5080 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ZO_ROOT_USER_EMAIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"root@example.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ZO_ROOT_USER_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Complexpass#123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    public.ecr.aws/zinclabs/openobserve:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command will start an OpenObserve Docker container named openobserve, with the following configurations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Storage:&lt;/strong&gt; Maps the local directory $PWD/data to the container's /data directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication:&lt;/strong&gt; Sets the root user email and password for the OpenObserve interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port Exposure:&lt;/strong&gt; Exposes port 5080 for external access to the OpenObserve web application.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can access the OpenObserve UI at &lt;strong&gt;&lt;a href="http://localhost:5080" rel="noopener noreferrer"&gt;http://localhost:5080&lt;/a&gt;&lt;/strong&gt; to visualise and interact with your observability data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6wdf804p6lpkpoc3gzq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6wdf804p6lpkpoc3gzq.jpg" alt="O2_login_page" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Log in with the following credentials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User email:&lt;/strong&gt; &lt;a href="mailto:root@example.com"&gt;root@example.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Password:&lt;/strong&gt; Complexpass#123&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh99r0u8d3orr1jthkub.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh99r0u8d3orr1jthkub.gif" alt="O2_login" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Instrument the HotROD Sample Application
&lt;/h3&gt;

&lt;p&gt;Run the following command to configure the HotROD sample app to send tracing data to OpenObserve (O2). Replace placeholders with the correct values from your OpenObserve setup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--link&lt;/span&gt; &amp;lt;O2_CONTAINER_NAME&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env&lt;/span&gt; &lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;O2_ENDPOINT&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env&lt;/span&gt; &lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Authorization=Basic &amp;lt;BASE64_ENCODED_CREDENTIALS&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8080-8083:8080-8083 &lt;span class="se"&gt;\&lt;/span&gt;
  jaegertracing/example-hotrod:latest &lt;span class="se"&gt;\&lt;/span&gt;
  all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs the HotROD application in a Docker container and links it to your OpenObserve container.&lt;/li&gt;
&lt;li&gt;Sets the environment variable for the OpenTelemetry exporter endpoint to send tracing data to OpenObserve.&lt;/li&gt;
&lt;li&gt;Configures the necessary headers for authentication.&lt;/li&gt;
&lt;li&gt;Maps ports 8080 to 8083 for accessing the HotROD services externally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By running this command, you'll be able to generate trace data from the HotROD application and send it to OpenObserve for visualisation and analysis.&lt;/p&gt;

&lt;p&gt;You can find the HTTP endpoint and authorization details in the Data Sources section, under Traces (OpenTelemetry).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkllnh3ucbvv8tspj5pa4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkllnh3ucbvv8tspj5pa4.gif" alt="O2_endpoint" width="600" height="289"&gt;&lt;/a&gt;&lt;/p&gt;
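&lt;p&gt;If you prefer to build the header value yourself rather than copy it from the UI, HTTP Basic authentication is the Base64 encoding of &lt;code&gt;user:secret&lt;/code&gt;. The sketch below shows the encoding step; the function name is hypothetical, and the credentials are the sample root user from the Step 1 command, so substitute your own.&lt;/p&gt;

```python
import base64


def otlp_basic_auth_header(email, secret):
    """Return a value for OTEL_EXPORTER_OTLP_HEADERS using HTTP Basic auth."""
    credentials = f"{email}:{secret}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return f"Authorization=Basic {encoded}"


# Sample values from the OpenObserve docker run command in Step 1.
header = otlp_basic_auth_header("root@example.com", "Complexpass#123")
print(header)
```

&lt;p&gt;Base64 is an encoding, not encryption, so treat the resulting header as a secret and avoid committing it to version control.&lt;/p&gt;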

&lt;p&gt;This is how the command looks after replacing required fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--link&lt;/span&gt; openobserve &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env&lt;/span&gt; &lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://13.232.45.32:5080/api/default &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env&lt;/span&gt; &lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Authorization=Basic cm9vdEBleGFtcGxlLmNvbTpTMzVHMjhaMEkxVEdxYm9q"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8080-8083:8080-8083 &lt;span class="se"&gt;\&lt;/span&gt;
  jaegertracing/example-hotrod:latest &lt;span class="se"&gt;\&lt;/span&gt;
  all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the endpoint and the Base64-encoded credentials with the values from your own OpenObserve setup.&lt;/p&gt;

&lt;p&gt;You can access the HotROD UI at &lt;strong&gt;&lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;&lt;/strong&gt;. Once your application is instrumented, run a few requests to generate some traces.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7irphunqqc4fbt66tajv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7irphunqqc4fbt66tajv.gif" alt="hotrod_UI_clicks" width="600" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: View Traces in OpenObserve UI
&lt;/h3&gt;

&lt;p&gt;Once your application is instrumented, generate some telemetry data by making requests to your services. You can then explore the data in the OpenObserve UI at &lt;strong&gt;&lt;a href="http://localhost:5080" rel="noopener noreferrer"&gt;http://localhost:5080&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6iefveaaicfdtwqgsm55.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6iefveaaicfdtwqgsm55.gif" alt="O2_traces" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu34uzvg71clcr9ej5hr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu34uzvg71clcr9ej5hr.jpg" alt="O2_traces" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1w2yqsy77byomw10172.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1w2yqsy77byomw10172.jpg" alt="O2_traces" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Jaeger vs. OpenObserve
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Challenge&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Jaeger&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OpenObserve (O2)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Struggles with high traffic&lt;/td&gt;
&lt;td&gt;Built for high scalability and performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unified Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate tools for logs and metrics&lt;/td&gt;
&lt;td&gt;Combines metrics, logs, and traces into one platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Querying&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Basic querying options&lt;/td&gt;
&lt;td&gt;Advanced querying capabilities for deeper insights&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher storage and processing costs&lt;/td&gt;
&lt;td&gt;Optimized for lower resource usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User Experience&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Traditional, complex interfaces&lt;/td&gt;
&lt;td&gt;Modern, intuitive interface for easy navigation and analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Jaeger is an excellent tool for getting started with distributed tracing and is widely adopted for microservices observability. However, as systems grow, Jaeger's limitations in data handling and cross-function observability (metrics, logs, and traces) may become restrictive.&lt;/p&gt;

&lt;p&gt;OpenObserve addresses these limitations by unifying metrics, logs, and traces in a single platform, making it a more comprehensive observability solution. With its scalability, enhanced query capabilities, and cost-effectiveness, OpenObserve empowers teams to monitor, troubleshoot, and optimise complex distributed systems more efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Case Study: Jidu's Journey to 100% Tracing Fidelity
&lt;/h2&gt;

&lt;p&gt;To see OpenObserve's impact in action, read about Jidu's journey to achieving &lt;strong&gt;100% tracing fidelity using OpenObserve&lt;/strong&gt;. With Jaeger backed by Elasticsearch, Jidu could ingest only 10% of the 10 TB of traces its applications generated each day, and performance was poor relative to the money spent on infrastructure.&lt;/p&gt;

&lt;p&gt;After moving from Jaeger and Elasticsearch to OpenObserve, they increased trace ingestion to 100% (the full 10 TB per day), with higher performance on the same hardware and lower storage costs. They eventually began ingesting 100 TB of traces per day in OpenObserve. Their team's work offers valuable insights into overcoming the challenges of tracing at scale and ensuring trace fidelity. You can read the full case study &lt;a href="https://openobserve.ai/blog/jidu-journey-to-100-tracing-fidelity/" rel="noopener noreferrer"&gt;here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This case demonstrates how OpenObserve's unified approach to observability enables improved trace fidelity and facilitates better troubleshooting, performance optimization, and insight gathering across distributed systems.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Ready to get started?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openobserve.ai/downloads/" rel="noopener noreferrer"&gt;Download OpenObserve&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.openobserve.ai/" rel="noopener noreferrer"&gt;Try OpenObserve Cloud&lt;/a&gt; with a 14-day free trial&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://short.openobserve.ai/community" rel="noopener noreferrer"&gt;Join our community&lt;/a&gt; for support and discussions&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>jaeger</category>
      <category>observability</category>
      <category>microservices</category>
      <category>tracing</category>
    </item>
    <item>
      <title>Top 10 Lightstep Alternatives for 2026 (OpenTelemetry-Native Options)</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Wed, 04 Feb 2026 14:41:04 +0000</pubDate>
      <link>https://dev.to/openobserve/top-10-lightstep-alternatives-for-2026-opentelemetry-native-options-2ol4</link>
      <guid>https://dev.to/openobserve/top-10-lightstep-alternatives-for-2026-opentelemetry-native-options-2ol4</guid>
      <description>&lt;p&gt;ServiceNow announced the sunset of &lt;strong&gt;Lightstep (Cloud Observability)&lt;/strong&gt; effective March 1, 2026. If you're a Lightstep user, you're facing a forced migration with no direct replacement offered by ServiceNow.&lt;/p&gt;

&lt;p&gt;Several factors are driving teams to evaluate &lt;strong&gt;Lightstep alternatives&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Forced migration&lt;/strong&gt; - March 2026 EOL deadline approaching with no migration path from ServiceNow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt; - Opportunity to reduce observability spending by 60-90% with modern platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor lock-in concerns&lt;/strong&gt; - Avoid future platform sunsets by choosing OpenTelemetry-native solutions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry standardization&lt;/strong&gt; - Move to vendor-neutral instrumentation that works across platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data sovereignty&lt;/strong&gt; - Teams need self-hosted or regional deployment options for compliance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this guide, we'll explore ten &lt;strong&gt;OpenTelemetry-native alternatives to Lightstep&lt;/strong&gt; that address these concerns, from open source platforms to specialized SaaS solutions. We'll include real cost comparisons, migration code snippets, and technical analysis to help you choose the right replacement and migrate before the March 2026 deadline.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Lightstep Sunset: What You Need to Know
&lt;/h2&gt;

&lt;p&gt;The clock is ticking. ServiceNow has officially announced the sunset of Lightstep (rebranded as ServiceNow Cloud Observability), with the service reaching End-of-Life (EOL) by March 1, 2026.&lt;/p&gt;

&lt;p&gt;For engineering teams that relied on Lightstep for its pioneering work in distributed tracing and OpenTelemetry (OTel), this is a critical turning point. You need a replacement that respects your existing OTel instrumentation, handles high-cardinality data without breaking the bank, and doesn't trap you in a proprietary agent ecosystem.&lt;/p&gt;

&lt;p&gt;This guide analyzes the &lt;strong&gt;Top 10 Lightstep alternatives for 2026&lt;/strong&gt;, focusing on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry compatibility&lt;/strong&gt; - Native OTel support vs translation layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migration ease&lt;/strong&gt; - How quickly can you switch without rewriting code?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total cost of ownership&lt;/strong&gt; - Real pricing for production workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-cardinality support&lt;/strong&gt; - Can it handle user IDs, request IDs at scale?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor lock-in risk&lt;/strong&gt; - Will you face this problem again in 3 years?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom line&lt;/strong&gt;: OpenObserve emerges as the best drop-in replacement, offering significant cost savings while maintaining OpenTelemetry-native architecture and distributed tracing capabilities.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Guide Exists
&lt;/h2&gt;

&lt;p&gt;As observability requirements evolve in 2026, Lightstep users face a forced migration due to ServiceNow's March 1, 2026 end-of-life announcement. With no direct replacement or migration path provided by ServiceNow, teams must evaluate alternatives quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence from Real Migrations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost reduction&lt;/strong&gt; - Production data shows substantial savings when moving from Lightstep to modern OpenTelemetry-native alternatives.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Migration timeline: Fast with OTel&lt;/strong&gt; - Teams using OpenTelemetry can migrate quickly by changing collector configuration. This is significantly faster than platforms that need new instrumentation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenTelemetry-native prevents lock-in&lt;/strong&gt; - Vendor-neutral instrumentation using OpenTelemetry standards enables future flexibility. You're not rewriting code or learning proprietary agents if you need to switch platforms again.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unified observability simplifies operations&lt;/strong&gt; - Keeping logs, metrics, and traces in one platform reduces the tool sprawl, context switching, and correlation complexity that teams experience with fragmented monitoring stacks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Lightstep Users Need to Replicate
&lt;/h3&gt;

&lt;p&gt;Lightstep was known for several key capabilities that any replacement must match:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry pioneer&lt;/strong&gt; - Lightstep was an early contributor to OpenTelemetry and built its platform as OTel-native from day one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed tracing excellence&lt;/strong&gt; - High-cardinality trace data at scale without performance penalties or cost explosions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified observability&lt;/strong&gt; - Logs, metrics, and traces correlated in a single platform with powerful cross-signal queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change Intelligence&lt;/strong&gt; - Deployment tracking and automatic correlation between changes and performance impacts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service dependency mapping&lt;/strong&gt; - Visual representation of service relationships and data flows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessible querying&lt;/strong&gt; - A text-based query language (UQL, Lightstep's Unified Query Language) usable by both developers and SREs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your replacement platform needs to match these capabilities while avoiding the vendor lock-in risk that led to this forced migration.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Look for in a Lightstep Alternative
&lt;/h2&gt;

&lt;p&gt;When evaluating observability platforms to replace Lightstep, assess these critical dimensions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;th&gt;What to Evaluate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenTelemetry Native&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ensures easy migration without code changes&lt;/td&gt;
&lt;td&gt;Native OTLP support vs translation layers that add complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Migration Timeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;March 2026 deadline approaching fast&lt;/td&gt;
&lt;td&gt;Can you complete migration quickly with your team size?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost Structure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opportunity to reduce observability spend&lt;/td&gt;
&lt;td&gt;Transparent pricing vs usage-based surprises and hidden fees&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distributed Tracing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Core Lightstep capability you can't lose&lt;/td&gt;
&lt;td&gt;High-cardinality support, trace quality, sampling strategies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Ownership&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Avoid future vendor lock-in scenarios&lt;/td&gt;
&lt;td&gt;Self-hosted deployment option available or SaaS-only?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unified Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reduce tool sprawl and context switching&lt;/td&gt;
&lt;td&gt;Logs, metrics, traces in one platform with correlation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Capabilities&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Investigation efficiency during incidents&lt;/td&gt;
&lt;td&gt;SQL/PromQL vs proprietary query languages requiring training&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Service Maps&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dependency visualization and troubleshooting&lt;/td&gt;
&lt;td&gt;Automatic topology mapping from trace data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Integration Ecosystem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works with your existing infrastructure&lt;/td&gt;
&lt;td&gt;Cloud providers, databases, Kubernetes, CI/CD tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vendor Stability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Avoid another sudden platform sunset&lt;/td&gt;
&lt;td&gt;Long-term viability, funding, community support, roadmap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Handle growing data volumes&lt;/td&gt;
&lt;td&gt;Performance at 2x, 5x, 10x current data volumes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High-Cardinality Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Modern app requirements (user IDs, request IDs)&lt;/td&gt;
&lt;td&gt;Cost and performance impact of high-cardinality dimensions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Top 10 Lightstep Alternatives
&lt;/h2&gt;


&lt;h3&gt;
  
  
  1. OpenObserve (The Drop-in Replacement)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://openobserve.ai" rel="noopener noreferrer"&gt;OpenObserve&lt;/a&gt;&lt;/strong&gt; is the best Lightstep alternative for teams wanting unified observability with OpenTelemetry-native architecture, no vendor lock-in, and 90% cost savings. It delivers the same distributed tracing capabilities Lightstep users rely on, but with transparent pricing and self-hosting options.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s1kk3jdcee6k2xxxa4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s1kk3jdcee6k2xxxa4u.png" alt="OpenObserve Dashboard" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why OpenObserve is the best Lightstep alternative:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenObserve isn't just similar to Lightstep - it's architecturally compatible. Both platforms are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built for OpenTelemetry from day one&lt;/li&gt;
&lt;li&gt;Designed for high-cardinality distributed tracing at scale&lt;/li&gt;
&lt;li&gt;Focused on unified observability (logs, metrics, traces)&lt;/li&gt;
&lt;li&gt;Queried through text-based query languages that developers and SREs can both use (SQL and PromQL in OpenObserve, UQL in Lightstep)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The difference?&lt;/strong&gt; OpenObserve gives you complete data ownership through self-hosting options.&lt;/p&gt;

&lt;h4&gt;
  
  
  OpenObserve Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;True Drop-in Replacement&lt;/strong&gt;: Migration from Lightstep requires changing one config file in your OpenTelemetry Collector - no application code changes needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry-Native&lt;/strong&gt;: Native OTLP support means seamless integration with your existing OTel instrumentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-Cardinality Friendly&lt;/strong&gt;: Handles user-level dimensions and request IDs without performance degradation or cost explosions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Observability&lt;/strong&gt;: Logs, metrics, and traces in one platform with powerful correlation capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL + PromQL Querying&lt;/strong&gt;: Familiar query languages instead of proprietary syntax requiring training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Hosted or Cloud&lt;/strong&gt;: Deploy on your infrastructure for complete control, or use managed cloud for simplicity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparent Pricing&lt;/strong&gt;: Ingestion-based pricing model with no hidden per-host or per-metric fees&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  OpenObserve Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Community maturity&lt;/strong&gt;: While the core platform is battle-tested, its community and ecosystem are newer than those of established vendors&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Easiest migration path of any alternative.&lt;/strong&gt; If you're already using OpenTelemetry (as most Lightstep users are):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign up for OpenObserve (cloud or self-hosted in 10 minutes)&lt;/li&gt;
&lt;li&gt;Update your OpenTelemetry Collector exporter configuration (change endpoint URL and auth token)&lt;/li&gt;
&lt;li&gt;Restart the collector - data starts flowing to OpenObserve as soon as it comes back up&lt;/li&gt;
&lt;li&gt;Rebuild dashboards (OpenObserve provides similar visualization capabilities)&lt;/li&gt;
&lt;li&gt;Set up alerts (SQL-based, often simpler than Lightstep's UI-based approach)&lt;/li&gt;
&lt;/ol&gt;
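
&lt;p&gt;Steps 2 and 3 above can be sketched in code. The Python snippet below is a minimal illustration, not an official migration script: it takes a collector configuration as a dictionary, adds an &lt;code&gt;otlphttp&lt;/code&gt; exporter, and repoints every pipeline that used the old one. The exporter names, the endpoint URL, the organization name, and the token variables are all placeholders assumed for the example; take the real values from your own collector file and your OpenObserve ingestion settings.&lt;/p&gt;

```python
import copy
import json


def repoint_exporter(config, old_name, new_name, endpoint, auth_header):
    """Return a copy of a collector config with one exporter swapped out.

    Receivers, processors, and unrelated exporters are left untouched,
    which is why no application code has to change.
    """
    updated = copy.deepcopy(config)
    # Drop the old exporter definition and add the new OTLP/HTTP one.
    updated["exporters"].pop(old_name, None)
    updated["exporters"][new_name] = {
        "endpoint": endpoint,
        "headers": {"Authorization": auth_header},
    }
    # Repoint every pipeline that referenced the old exporter.
    for pipeline in updated["service"]["pipelines"].values():
        pipeline["exporters"] = [
            new_name if name == old_name else name
            for name in pipeline["exporters"]
        ]
    return updated


# Hypothetical starting point: a collector that exported to Lightstep.
lightstep_config = {
    "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
    "processors": {"batch": {}},
    "exporters": {
        "otlp/lightstep": {
            "endpoint": "ingest.lightstep.com:443",
            "headers": {"lightstep-access-token": "${env:LS_ACCESS_TOKEN}"},
        }
    },
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["otlp/lightstep"],
            }
        }
    },
}

openobserve_config = repoint_exporter(
    lightstep_config,
    old_name="otlp/lightstep",
    new_name="otlphttp/openobserve",
    # Placeholder endpoint and token; use the values from your own account.
    endpoint="https://openobserve.example.com/api/your_org",
    auth_header="Basic ${env:OPENOBSERVE_TOKEN}",
)

# Print the result for review before writing it into the collector's config file.
print(json.dumps(openobserve_config, indent=2))
```

&lt;p&gt;Only the &lt;code&gt;exporters&lt;/code&gt; block and the pipeline references change; receivers and processors stay as they were. In practice most teams make the same edit by hand in the collector's YAML file.&lt;/p&gt;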

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;Teams seeking a &lt;strong&gt;Lightstep replacement&lt;/strong&gt; that maintains OpenTelemetry-native architecture, matches distributed tracing capabilities, and dramatically reduces costs without sacrificing functionality. Ideal for organizations wanting data ownership through self-hosting while avoiding vendor lock-in.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Grafana Stack (LGTM)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana Stack&lt;/a&gt;&lt;/strong&gt; (Loki for logs, Grafana for visualization, Tempo for traces, Mimir/Prometheus for metrics) is a popular open-source Lightstep alternative composed of best-in-class tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xxtoswb9wvxwxqpprzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xxtoswb9wvxwxqpprzg.png" alt="Grafana Dashboard" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Grafana Stack Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best Visualization&lt;/strong&gt;: Grafana dashboards are industry-leading with extensive customization options&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open Source &amp;amp; Vendor-Neutral&lt;/strong&gt;: No proprietary formats or lock-in across the stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tempo for Tracing&lt;/strong&gt;: OpenTelemetry-native distributed tracing with excellent performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large Ecosystem&lt;/strong&gt;: Thousands of integrations, plugins, and community dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Deployment&lt;/strong&gt;: Self-host components individually or use managed Grafana Cloud&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus Standard&lt;/strong&gt;: Industry-standard metrics collection and querying (PromQL)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Grafana Stack Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Not a single unified product like Lightstep - requires managing multiple components&lt;/li&gt;
&lt;li&gt;Operational complexity increases significantly at scale (4 different systems)&lt;/li&gt;
&lt;li&gt;Correlation across logs/metrics/traces requires manual setup&lt;/li&gt;
&lt;li&gt;Steeper learning curve than unified platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;Configure OpenTelemetry Collector to export traces to Tempo, metrics to Prometheus/Mimir, and logs to Loki. More complex than single-platform alternatives due to multiple destinations.&lt;/p&gt;
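
&lt;p&gt;To make the extra moving parts concrete, here is a rough Python sketch of the pipeline layout this implies: one exporter per signal instead of one for everything. The exporter names and the &lt;code&gt;example.internal&lt;/code&gt; endpoints are hypothetical, and the choice of exporter types (OTLP for Tempo and Loki, Prometheus remote write for Mimir) is one common setup assumed for illustration, so check it against the Grafana documentation for the versions you run.&lt;/p&gt;

```python
# Hypothetical per-signal destinations; hosts and ports are placeholders.
DESTINATIONS = {
    "traces": ("otlp/tempo", {"endpoint": "tempo.example.internal:4317"}),
    "metrics": (
        "prometheusremotewrite/mimir",
        {"endpoint": "https://mimir.example.internal/api/v1/push"},
    ),
    "logs": ("otlphttp/loki", {"endpoint": "https://loki.example.internal/otlp"}),
}


def build_fanout_config(destinations):
    """Build a collector config with one pipeline and exporter per signal."""
    exporters = {}
    pipelines = {}
    for signal, (exporter_name, settings) in destinations.items():
        exporters[exporter_name] = settings
        pipelines[signal] = {
            "receivers": ["otlp"],
            "processors": ["batch"],
            "exporters": [exporter_name],
        }
    return {
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "processors": {"batch": {}},
        "exporters": exporters,
        "service": {"pipelines": pipelines},
    }


def undefined_exporters(config):
    """List exporters that a pipeline references but the config never defines."""
    defined = set(config["exporters"])
    missing = []
    for pipeline in config["service"]["pipelines"].values():
        missing.extend(n for n in pipeline["exporters"] if n not in defined)
    return missing


grafana_config = build_fanout_config(DESTINATIONS)
print(sorted(grafana_config["exporters"]))
print(undefined_exporters(grafana_config))
```

&lt;p&gt;Three destinations mean three sets of credentials, retry settings, and failure modes to watch, which is the operational overhead listed in the cons above.&lt;/p&gt;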

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;Teams wanting &lt;strong&gt;maximum flexibility&lt;/strong&gt; and best-in-class visualization who are comfortable managing multiple components. Good for organizations with strong infrastructure teams or using Grafana Cloud to reduce operational burden.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Honeycomb
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.honeycomb.io/" rel="noopener noreferrer"&gt;Honeycomb&lt;/a&gt;&lt;/strong&gt; is a modern Lightstep alternative focused on high-cardinality observability and debugging distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02ugeeqs0a19hlmsg3wg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02ugeeqs0a19hlmsg3wg.png" alt="Honeycomb Traces" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Honeycomb Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Excellent for Distributed Tracing&lt;/strong&gt;: Purpose-built for understanding complex request flows across microservices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-Cardinality Native&lt;/strong&gt;: Handles millions of unique dimension values (user IDs, request IDs) without performance issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast Exploratory Queries&lt;/strong&gt;: Rapid ad-hoc querying enables real-time investigation during incidents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Native&lt;/strong&gt;: Built from the ground up to ingest and leverage OpenTelemetry data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BubbleUp Feature&lt;/strong&gt;: Automatically surfaces anomalies and patterns in high-cardinality data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer-Centric UX&lt;/strong&gt;: Designed around developer and SRE workflows rather than infrastructure-only monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Honeycomb Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;SaaS-only (no self-hosted option)&lt;/li&gt;
&lt;li&gt;Less focus on traditional dashboards (more investigation-oriented)&lt;/li&gt;
&lt;li&gt;Pricing scales with event volume (can grow quickly with high traffic)&lt;/li&gt;
&lt;li&gt;Logs and metrics support still evolving compared to tracing strength&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;Straightforward for OpenTelemetry users. Update collector configuration to send traces to Honeycomb. Strong documentation for Lightstep migration scenarios.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;Teams prioritizing &lt;strong&gt;distributed tracing excellence&lt;/strong&gt; and high-cardinality debugging capabilities over traditional dashboard-heavy monitoring. Ideal for microservices architectures where understanding request flows is critical.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Datadog
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.datadoghq.com/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt;&lt;/strong&gt; is a comprehensive Lightstep alternative offering all-in-one observability with extensive integrations and enterprise features.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhnct1t2q00nwq20j61w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhnct1t2q00nwq20j61w.png" alt="Datadog APM" width="800" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Datadog Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Most Comprehensive Platform&lt;/strong&gt;: Covers infrastructure, APM, logs, traces, RUM, synthetics, and security in one platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;700+ Integrations&lt;/strong&gt;: Extensive integration marketplace for cloud providers, databases, and frameworks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mature APM&lt;/strong&gt;: Deep application performance monitoring with code-level insights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise-Grade&lt;/strong&gt;: Strong governance, compliance, and multi-tenancy capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Excellent UX&lt;/strong&gt;: Polished interface with powerful visualization and alerting&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Datadog Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Very Expensive&lt;/strong&gt;: Often more expensive than Lightstep, with complex multi-vector pricing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor Lock-in&lt;/strong&gt;: Proprietary agents and data formats make switching difficult&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Surprises&lt;/strong&gt;: Usage-based pricing can lead to unexpected bills with traffic spikes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Support Limited&lt;/strong&gt;: Treats OTel metrics as expensive "custom metrics"&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;Requires Datadog agents or OpenTelemetry Collector configured for Datadog. More complex than OTel-native alternatives due to Datadog's proprietary ingestion formats.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Enterprise teams&lt;/strong&gt; with large budgets prioritizing ecosystem breadth and polished UX over cost optimization. Good if observability budget isn't constrained and you value comprehensive built-in features.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. New Relic
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://newrelic.com/" rel="noopener noreferrer"&gt;New Relic&lt;/a&gt;&lt;/strong&gt; is a SaaS observability platform offering unified logs, metrics, traces, and APM with OpenTelemetry support.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7obcbz3xf34z8136uqi1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7obcbz3xf34z8136uqi1.png" alt="New Relic APM" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  New Relic Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified Platform&lt;/strong&gt;: Full-stack observability in a single SaaS platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong APM&lt;/strong&gt;: Deep code-level performance insights and error tracking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Support&lt;/strong&gt;: Native OTLP ingestion simplifies migration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-GB Pricing&lt;/strong&gt;: More predictable than per-host models (though still usage-based)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer-Friendly&lt;/strong&gt;: Good documentation and onboarding experience&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  New Relic Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proprietary Translation&lt;/strong&gt;: Translates OpenTelemetry data into New Relic format (vendor lock-in)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Costs Scale Quickly&lt;/strong&gt;: Per-GB pricing grows fast with verbose logging or high trace volumes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SaaS-Only&lt;/strong&gt;: No self-hosted option for data sovereignty&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical Billing Issues&lt;/strong&gt;: Past controversies around retroactive pricing changes&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;OpenTelemetry Collector can send data directly to New Relic via OTLP. Simpler than Datadog but creates some vendor lock-in through data format translation.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;Teams wanting a &lt;strong&gt;familiar SaaS experience&lt;/strong&gt; similar to Lightstep with strong APM capabilities and willing to accept usage-based pricing for operational simplicity.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Chronosphere
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://chronosphere.io/" rel="noopener noreferrer"&gt;Chronosphere&lt;/a&gt;&lt;/strong&gt; is a cloud-native observability platform built by ex-Uber engineers, focused on controlling costs at scale while supporting OpenTelemetry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadtmr9s0xz703x8pmos8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadtmr9s0xz703x8pmos8.png" alt="Chronosphere Platform" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Chronosphere Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Built for Scale&lt;/strong&gt;: Created by engineers who built M3 at Uber for handling massive metric volumes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Controls&lt;/strong&gt;: Native cost visibility and controls to prevent observability bill explosions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Compatible&lt;/strong&gt;: Works with OTel Collector and standard instrumentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-Cardinality Metrics&lt;/strong&gt;: Handles modern application requirements without performance degradation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance Features&lt;/strong&gt;: Strong multi-tenancy and access controls for large organizations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Performance&lt;/strong&gt;: Fast queries even on large datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Chronosphere Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Primarily metrics-focused (traces and logs less mature than competitors)&lt;/li&gt;
&lt;li&gt;Enterprise pricing (not as cost-effective as open source alternatives)&lt;/li&gt;
&lt;li&gt;Smaller ecosystem compared to established players&lt;/li&gt;
&lt;li&gt;SaaS-focused (limited self-hosted options)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;OpenTelemetry Collector can export metrics to Chronosphere. Straightforward for metrics migration, but you'll need additional solutions for comprehensive tracing that Lightstep provided.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Large-scale environments&lt;/strong&gt; generating massive metric volumes where cost control and governance are critical. Good for teams migrating from Lightstep who want enterprise support but need better cost predictability.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. Jaeger
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.jaegertracing.io/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt;&lt;/strong&gt; is an open-source distributed tracing platform and graduated CNCF project, offering core tracing capabilities without logs or metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2mfqcj53vzll9rad3q8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2mfqcj53vzll9rad3q8.png" alt="Jaeger UI" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Jaeger Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Completely Free&lt;/strong&gt;: Open source with no licensing costs whatsoever&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CNCF Graduated&lt;/strong&gt;: Proven stability and community support through Cloud Native Computing Foundation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Native&lt;/strong&gt;: Accepts OTLP natively, and Jaeger v2 is built on top of the OpenTelemetry Collector&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Battle-Tested&lt;/strong&gt;: Used in production by thousands of organizations globally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Storage&lt;/strong&gt;: Supports Cassandra, Elasticsearch, and Badger storage backends, with Kafka available as an ingestion buffer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight&lt;/strong&gt;: Focused solely on distributed tracing without feature bloat&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Jaeger Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tracing Only&lt;/strong&gt;: No logs or metrics - requires separate tools for unified observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Basic UI&lt;/strong&gt;: Functional but less polished than commercial alternatives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Hosted Only&lt;/strong&gt;: Requires managing infrastructure (no managed SaaS option)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited Advanced Features&lt;/strong&gt;: Missing some of Lightstep's Change Intelligence and correlation features&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;Simple for OpenTelemetry users: point the collector's trace pipeline at your Jaeger endpoint. However, you'll need additional tools for the logs and metrics that Lightstep provided.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;Teams needing &lt;strong&gt;just distributed tracing&lt;/strong&gt; at zero cost and comfortable with self-hosting. Often paired with Prometheus (metrics) and Grafana Loki (logs) for complete observability.&lt;/p&gt;




&lt;h3&gt;
  
  
  8. Elastic Observability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.elastic.co/observability" rel="noopener noreferrer"&gt;Elastic Observability&lt;/a&gt;&lt;/strong&gt; (part of Elastic Stack/ELK) provides unified logs, metrics, APM, and traces with powerful search capabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesw1pnbms5l4h924tu8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesw1pnbms5l4h924tu8x.png" alt="Elastic APM" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Elastic Observability Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Powerful Search&lt;/strong&gt;: Elasticsearch excels at full-text and structured log search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Platform&lt;/strong&gt;: Logs, metrics, APM, and traces in a single stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Deployment&lt;/strong&gt;: Self-hosted, managed Elastic Cloud, or hybrid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large Ecosystem&lt;/strong&gt;: Extensive integrations with Beats and Logstash&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security + Observability&lt;/strong&gt;: Strong overlap with SIEM capabilities for security teams&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Elastic Observability Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expensive at Scale&lt;/strong&gt;: Elasticsearch clusters require significant infrastructure investment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Complexity&lt;/strong&gt;: Managing Elasticsearch at scale requires expertise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Costs&lt;/strong&gt;: Full-fidelity data retention gets expensive quickly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Support&lt;/strong&gt;: Works but not as seamless as OTel-native platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;OpenTelemetry Collector can export to Elastic APM. Requires more operational setup than simpler alternatives due to Elasticsearch cluster management.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;Teams with &lt;strong&gt;heavy log analytics&lt;/strong&gt; requirements or existing Elasticsearch investments who want to consolidate observability into their ELK stack.&lt;/p&gt;




&lt;h3&gt;
  
  
  9. Dynatrace
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.dynatrace.com/" rel="noopener noreferrer"&gt;Dynatrace&lt;/a&gt;&lt;/strong&gt; is an enterprise APM and observability platform with AI-powered automation and root cause analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zmgnttfrflrun2hoiu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zmgnttfrflrun2hoiu1.png" alt="Dynatrace Dashboard" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Dynatrace Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Instrumentation&lt;/strong&gt;: OneAgent automatically discovers and instruments applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Davis AI&lt;/strong&gt;: AI engine reduces alert noise through intelligent root cause analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise-Grade&lt;/strong&gt;: Handles very large, complex enterprise environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Support&lt;/strong&gt;: Works across on-premises, cloud, and hybrid infrastructures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low Maintenance&lt;/strong&gt;: Highly automated, requiring minimal configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Dynatrace Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Very Expensive&lt;/strong&gt;: Premium enterprise pricing, often higher than Lightstep&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proprietary Technology&lt;/strong&gt;: OneAgent and data formats create vendor lock-in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex Licensing&lt;/strong&gt;: Unit-based pricing model can be difficult to predict&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry&lt;/strong&gt;: Supports OTel but pushes proprietary OneAgent approach&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Migration from Lightstep:
&lt;/h4&gt;

&lt;p&gt;Typically means deploying OneAgent (Dynatrace's proprietary agent) rather than continuing with the OpenTelemetry Collector, which makes for a more disruptive migration than OTel-native alternatives.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best For:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Large enterprises&lt;/strong&gt; with complex environments prioritizing automation and willing to pay premium prices for reduced operational overhead.&lt;/p&gt;




&lt;h3&gt;
  
  
  10. Splunk Observability Cloud
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.splunk.com/en_us/products/observability.html" rel="noopener noreferrer"&gt;Splunk Observability Cloud&lt;/a&gt;&lt;/strong&gt; (formerly SignalFx) offers real-time metrics, APM, and infrastructure monitoring focused on cloud-native environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3czu341ad6wonip8jmvw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3czu341ad6wonip8jmvw.png" alt="Splunk Observability" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Splunk Observability Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Streaming&lt;/strong&gt;: NoSample architecture provides full-fidelity, real-time telemetry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong Metrics&lt;/strong&gt;: Excellent time-series metrics handling and analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Features&lt;/strong&gt;: Robust access controls, compliance, and security capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Splunk Ecosystem&lt;/strong&gt;: Integrates with Splunk platform for unified security and observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mature Platform&lt;/strong&gt;: Proven at scale in large enterprise environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Splunk Observability Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expensive&lt;/strong&gt;: Data-volume-based pricing can be prohibitively expensive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity&lt;/strong&gt;: Splunk's enterprise focus adds complexity for smaller teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Costs&lt;/strong&gt;: Full-fidelity streaming requires significant storage investment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry&lt;/strong&gt;: Supports OTel but historically pushed proprietary instrumentation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Migrating from Lightstep to OpenObserve
&lt;/h2&gt;

&lt;p&gt;OpenObserve has first-class support for OpenTelemetry, which means no vendor lock-in and seamless integration with your existing instrumentation.&lt;/p&gt;

&lt;p&gt;Your applications don't change. Your OpenTelemetry instrumentation doesn't change. Only the collector destination changes.&lt;/p&gt;

&lt;p&gt;OpenObserve (O2) supports standard telemetry collectors (e.g., Fluent Bit, OpenTelemetry, Logstash). It exposes APIs for ingestion, search, and more, allowing programmatic access to everything. It works with any object storage such as S3 or GCS and stores data in open formats, avoiding vendor lock-in on collection and storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo117ol034gyau7m5ecu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo117ol034gyau7m5ecu2.png" alt="Agent receivers ingestion flow into OpenObserve" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Migration Path
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Point your OTel collectors to OpenObserve&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Already using OpenTelemetry? Just update your exporter endpoint. No re-instrumentation required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo53n7wkkly06tqz8o7md.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo53n7wkkly06tqz8o7md.png" alt="Otel Collector Data Sources Page" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenObserve exporter configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlphttp/openobserve&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://your-org.openobserve.ai/api/default/&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;${OPENOBSERVE_TOKEN}"&lt;/span&gt;
      &lt;span class="na"&gt;stream-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
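&lt;p&gt;For reference, here is a minimal Python sketch of how a value like &lt;code&gt;OPENOBSERVE_TOKEN&lt;/code&gt; could be produced, assuming the header follows the standard HTTP Basic scheme (base64 of &lt;code&gt;user:password&lt;/code&gt;). The function name and the credentials are hypothetical; use the exact token shown on your own OpenObserve ingestion page.&lt;/p&gt;

```python
import base64


def basic_auth_value(user, password):
    """Return the base64 part of an HTTP Basic Authorization header."""
    raw = f"{user}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Hypothetical credentials, for illustration only.
token = basic_auth_value("you@example.com", "s3cret")
print(f"Authorization: Basic {token}")
```

&lt;p&gt;Exporting the result as an environment variable keeps the secret out of the collector YAML itself.&lt;/p&gt;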



&lt;p&gt;&lt;strong&gt;2. Run both platforms in parallel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Test OpenObserve with your production traffic while Lightstep still runs. Validate data quality and dashboard parity before fully committing.&lt;/p&gt;
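&lt;p&gt;One way to make "validate data quality" concrete is to compare per-service span counts from both backends over the same time window. The sketch below assumes you have already pulled those counts into two dictionaries; the function name, the 2% tolerance, and the sample numbers are all hypothetical.&lt;/p&gt;

```python
def parity_report(old_counts, new_counts, tolerance=0.02):
    """Return services whose span counts differ by more than `tolerance`."""
    mismatches = {}
    for service in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(service, 0)
        new = new_counts.get(service, 0)
        baseline = max(old, new, 1)  # avoid dividing by zero
        drift = abs(old - new) / baseline
        if drift > tolerance:
            mismatches[service] = (old, new, round(drift, 3))
    return mismatches


# Hypothetical counts from the old and new backends for the same hour.
lightstep = {"api": 100, "db": 50}
openobserve = {"api": 99, "db": 40}
print(parity_report(lightstep, openobserve))
```

&lt;p&gt;An empty result suggests both pipelines see the same traffic; a service that appears only on one side usually points at a missing exporter or pipeline entry.&lt;/p&gt;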

&lt;p&gt;&lt;strong&gt;3. Complete migration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once validated, migrate all workloads to OpenObserve.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why Migration is Seamless
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SQL/PromQL querying&lt;/strong&gt; - Universal languages your team already knows. No proprietary DSL to learn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenTelemetry-native&lt;/strong&gt; - Your existing instrumentation works as-is. No agent rewrites or application changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted or cloud&lt;/strong&gt; - Deploy however your team prefers. Cloud for simplicity, self-hosted for complete control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Similar visualization&lt;/strong&gt; - Familiar observability workflows. Dashboards, service maps, and trace views work much the same way.&lt;/p&gt;




&lt;h3&gt;
  
  
  Need Help?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Talk to our team for a personalized migration plan.&lt;/strong&gt; We'll help you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validate technical feasibility for your specific setup&lt;/li&gt;
&lt;li&gt;Recreate your critical dashboards and alerting rules&lt;/li&gt;
&lt;li&gt;Accelerate the migration process with hands-on support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://openobserve.ai/contact-us/" rel="noopener noreferrer"&gt;Contact us for migration support&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparison Table: Lightstep Alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Deployment&lt;/th&gt;
&lt;th&gt;OTel Native&lt;/th&gt;
&lt;th&gt;Pricing Model&lt;/th&gt;
&lt;th&gt;Migration Ease&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenObserve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud / Self-hosted&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Ingestion-based&lt;/td&gt;
&lt;td&gt;Very Easy (1 config change)&lt;/td&gt;
&lt;td&gt;Drop-in Lightstep replacement with 90% cost savings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grafana Stack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud / Self-hosted&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Modular (LGTM)&lt;/td&gt;
&lt;td&gt;Moderate (Multiple components)&lt;/td&gt;
&lt;td&gt;Maximum flexibility and best visualization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Honeycomb&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS only&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Event-based&lt;/td&gt;
&lt;td&gt;Very Easy (OTel-native)&lt;/td&gt;
&lt;td&gt;High-cardinality tracing excellence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datadog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS only&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Host/Usage-based&lt;/td&gt;
&lt;td&gt;Moderate (More complex)&lt;/td&gt;
&lt;td&gt;Enterprise teams with unlimited budget&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New Relic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS only&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Per-GB&lt;/td&gt;
&lt;td&gt;Easy (OTel-native)&lt;/td&gt;
&lt;td&gt;Familiar SaaS with strong APM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chronosphere&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS / Cloud&lt;/td&gt;
&lt;td&gt;Compatible&lt;/td&gt;
&lt;td&gt;Enterprise&lt;/td&gt;
&lt;td&gt;Moderate (Metrics-focused)&lt;/td&gt;
&lt;td&gt;Large-scale metrics with cost controls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jaeger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Free (Open source)&lt;/td&gt;
&lt;td&gt;Easy (Traces only)&lt;/td&gt;
&lt;td&gt;Distributed tracing only (no logs/metrics)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Elastic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud / Self-hosted&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Data-volume&lt;/td&gt;
&lt;td&gt;Moderate (Operational complexity)&lt;/td&gt;
&lt;td&gt;Log-heavy workloads with search focus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynatrace&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS / Hybrid&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Unit-based&lt;/td&gt;
&lt;td&gt;Moderate (OneAgent required)&lt;/td&gt;
&lt;td&gt;Large enterprises needing automation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Splunk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS / On-prem&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Data-volume&lt;/td&gt;
&lt;td&gt;Moderate (Complex pricing)&lt;/td&gt;
&lt;td&gt;Security + Observability convergence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With ServiceNow's March 1, 2026 Lightstep end-of-life deadline approaching, teams have an opportunity to modernize their observability stack while dramatically reducing costs and avoiding future vendor lock-in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. OpenObserve is the best drop-in replacement for Lightstep&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For most teams, OpenObserve offers the optimal combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenTelemetry-native architecture (easy migration - just change collector config)&lt;/li&gt;
&lt;li&gt;Similar distributed tracing capabilities (high-cardinality support, service maps, unified observability)&lt;/li&gt;
&lt;li&gt;Data ownership through self-hosting option&lt;/li&gt;
&lt;li&gt;No vendor lock-in risk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. OpenTelemetry-native platforms prevent future lock-in&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Choose alternatives that support OpenTelemetry natively (OpenObserve, Honeycomb, Jaeger, Grafana) rather than platforms that translate OTel data into proprietary formats (Datadog, Dynatrace). This ensures you can switch platforms again in the future without rewriting application code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Migration is straightforward with OpenTelemetry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're already using OpenTelemetry (as most Lightstep users are), migration to OTel-native platforms like OpenObserve requires just updating your collector configuration. No application code changes, no re-instrumentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Start migration now&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the EOL deadline approaching, begin your evaluation and pilot testing immediately. Most teams can validate OpenObserve in a test environment within days.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommended Action Plan
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;This week&lt;/strong&gt;: Sign up for OpenObserve free trial and test with a non-critical service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next week&lt;/strong&gt;: Update OpenTelemetry Collector config and validate data flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Following weeks&lt;/strong&gt;: Build dashboards and alerts, run parallel with Lightstep&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete migration&lt;/strong&gt;: Gradually move production workloads to OpenObserve&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whether you choose OpenObserve or another alternative, prioritize &lt;strong&gt;OpenTelemetry-native platforms&lt;/strong&gt; to avoid rewriting instrumentation and ensure long-term flexibility.&lt;/p&gt;




&lt;h2&gt;
  
  
  Take the Next Step
&lt;/h2&gt;

&lt;p&gt;Ready to explore the best Lightstep alternative?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try OpenObserve&lt;/strong&gt;: &lt;a href="https://openobserve.ai/downloads/" rel="noopener noreferrer"&gt;Download&lt;/a&gt; or sign up for &lt;a href="https://cloud.openobserve.ai/" rel="noopener noreferrer"&gt;OpenObserve Cloud&lt;/a&gt; with a 14-day free trial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Talk to our team&lt;/strong&gt;: &lt;a href="https://openobserve.ai/contact-us/" rel="noopener noreferrer"&gt;Schedule a migration consultation&lt;/a&gt; to get a personalized plan for your Lightstep replacement.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ: Lightstep Alternatives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why is ServiceNow shutting down Lightstep?
&lt;/h3&gt;

&lt;p&gt;ServiceNow acquired Lightstep but decided to discontinue it without providing a replacement. The official reason wasn't detailed publicly, though it appears to be part of a portfolio rationalization. For you, this means finding an alternative before March 1, 2026.&lt;/p&gt;

&lt;h3&gt;
  
  
  I'm using Lightstep right now - what should I do?
&lt;/h3&gt;

&lt;p&gt;Start testing alternatives immediately. The hands-on work typically takes 2-4 weeks, but staging it over a few months leaves room for validation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;This month&lt;/strong&gt;: Test OpenObserve or another OTel-native platform with a non-prod service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next month&lt;/strong&gt;: Validate data volume handling and build critical dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Following months&lt;/strong&gt;: Migrate production workloads gradually&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Will I lose all my historical data when Lightstep shuts down?
&lt;/h3&gt;

&lt;p&gt;Yes, unless you export it now. ServiceNow stops accepting data after March 1, 2026. Use Lightstep's export APIs to save critical traces you need for compliance or debugging. Most teams only export essential data since full historical migration is rarely necessary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I have to rewrite all my instrumentation code?
&lt;/h3&gt;

&lt;p&gt;No. If you're using OpenTelemetry (most Lightstep users are), just update your OTel Collector config to point to the new platform. Zero application code changes. Only if you're using Lightstep-specific SDKs (rare) would you need to re-instrument.&lt;/p&gt;

&lt;h3&gt;
  
  
  How long does it actually take to migrate from Lightstep?
&lt;/h3&gt;

&lt;p&gt;2-4 weeks realistically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Week 1: Setup and testing&lt;/li&gt;
&lt;li&gt;Week 2: Build dashboards, run parallel with Lightstep&lt;/li&gt;
&lt;li&gt;Week 3-4: Migrate production services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some vendors claim "migrations in an hour" - that's just the config change. Budget a month to do it properly with dashboard recreation and validation.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens if I miss the March 2026 deadline?
&lt;/h3&gt;

&lt;p&gt;ServiceNow stops accepting telemetry. Your observability goes dark - zero visibility into production. Set up at least a basic OTel-native platform (even free Jaeger) as a fallback to avoid complete blindness.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I keep using OpenTelemetry after migrating?
&lt;/h3&gt;

&lt;p&gt;Yes - that's the whole point. Your OTel instrumentation continues working unchanged. This is why we recommend OTel-native platforms (OpenObserve, Honeycomb, Jaeger) over proprietary ones (Datadog, Dynatrace) that translate OTel into their formats. Keeps you flexible for future switches.&lt;/p&gt;




</description>
      <category>observability</category>
      <category>opentelemetry</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>FastAPI + OpenTelemetry: Stop Debugging with grep (Use Distributed Tracing)</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Mon, 02 Feb 2026 03:50:55 +0000</pubDate>
      <link>https://dev.to/openobserve/fastapi-opentelemetry-stop-debugging-with-grep-use-distributed-tracing-16m5</link>
      <guid>https://dev.to/openobserve/fastapi-opentelemetry-stop-debugging-with-grep-use-distributed-tracing-16m5</guid>
      <description>&lt;p&gt;How do you debug a FastAPI app that talks to 5 other services?&lt;/p&gt;

&lt;p&gt;Most people grep through logs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service A logs: "Request received ✓"&lt;/li&gt;
&lt;li&gt;Service B logs: "Processing ✓"&lt;/li&gt;
&lt;li&gt;Service C logs: "Query executed ✓"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User:&lt;/strong&gt; "It failed"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Classic distributed systems problem: every service &lt;em&gt;thinks&lt;/em&gt; it worked, but the request still broke somewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The issue?&lt;/strong&gt; Logs are isolated. Each service writes independently with no context about where the request came from or where it's going next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix?&lt;/strong&gt; OpenTelemetry distributed tracing. Every request gets a unique trace ID that follows it across all services—like a tracking number for API calls. When something breaks, you follow the trace ID and see exactly where it failed.&lt;/p&gt;

&lt;p&gt;Setup takes 20 minutes. Debugging goes from hours of log archaeology to "oh, there it is" in under a minute.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction to OpenTelemetry &amp;amp; OpenObserve
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry is an open-source observability framework that lets developers collect logs, metrics, and traces in a standardized way. OpenObserve complements it as a backend with an intuitive interface for analyzing that telemetry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OpenTelemetry for FastAPI?
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry integrates with the logging libraries you already use, so logs, traces, and metrics carry consistent metadata. That makes it simpler to correlate information across your application stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem with Traditional Logging
&lt;/h3&gt;

&lt;p&gt;When debugging microservices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each service logs separately&lt;/li&gt;
&lt;li&gt;No connection between related requests across services&lt;/li&gt;
&lt;li&gt;You're grep-ing through multiple log files trying to piece together what happened&lt;/li&gt;
&lt;li&gt;Time zones, log formats, and missing context make correlation nearly impossible&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What OpenTelemetry Solves
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Distributed Tracing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every request gets a unique trace ID&lt;/li&gt;
&lt;li&gt;Trace ID follows the request across all services&lt;/li&gt;
&lt;li&gt;See the complete request path in one view&lt;/li&gt;
&lt;li&gt;Identify exactly where failures occur&lt;/li&gt;
&lt;/ul&gt;
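&lt;p&gt;To make the "trace ID follows the request" idea concrete, here is a small stdlib-only sketch of the W3C &lt;code&gt;traceparent&lt;/code&gt; header format (version, 32-hex trace ID, 16-hex span ID, flags). In a real app the OpenTelemetry propagators handle this for you; the two helper functions below are hypothetical and exist only to illustrate the mechanism.&lt;/p&gt;

```python
import secrets


def new_traceparent():
    """Build a W3C traceparent header for the first service in a request."""
    trace_id = secrets.token_hex(16)  # 32 hex chars, shared by every hop
    span_id = secrets.token_hex(8)    # 16 hex chars, unique per operation
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming):
    """Keep the caller's trace ID but mint a new span ID for this hop."""
    version, trace_id, _parent_span, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


header_a = new_traceparent()            # service A starts the trace
header_b = child_traceparent(header_a)  # service B continues it
# Both hops share the same trace ID, so this prints True.
print(header_a.split("-")[1] == header_b.split("-")[1])
```

&lt;p&gt;Because every hop forwards the same trace ID, a backend can stitch the spans from all services into a single request view.&lt;/p&gt;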

&lt;p&gt;&lt;strong&gt;Unified Observability:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs, metrics, and traces in one place&lt;/li&gt;
&lt;li&gt;Correlate log lines to specific traces&lt;/li&gt;
&lt;li&gt;See performance metrics alongside request flows&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  OpenObserve Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight &amp;amp; Deployable&lt;/strong&gt;: Operates as a single binary on laptops or containerized environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intuitive Interface&lt;/strong&gt;: More user-friendly than comparable tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Flexibility&lt;/strong&gt;: Supports both SQL and PromQL syntax&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated Alerting&lt;/strong&gt;: Built-in capabilities eliminate additional configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Efficiency&lt;/strong&gt;: Substantially lower storage costs than competitors (OpenObserve cites 140x lower than Elasticsearch)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Works: Quick Overview
&lt;/h2&gt;

&lt;p&gt;The setup has five main pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Collector&lt;/strong&gt; - Receives and processes telemetry data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI Instrumentation&lt;/strong&gt; - Automatically captures traces from your FastAPI app&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenObserve&lt;/strong&gt; - Stores and visualizes logs, metrics, and traces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace IDs&lt;/strong&gt; - Unique identifiers that follow requests across services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboards&lt;/strong&gt; - See correlated logs and traces in one view&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Example: Debugging with Trace IDs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before OpenTelemetry:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"user_id=12345"&lt;/span&gt; service1.log  &lt;span class="c"&gt;# Found request&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"timestamp=14:23:45"&lt;/span&gt; service2.log  &lt;span class="c"&gt;# Which timezone?&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"error"&lt;/span&gt; service3.log  &lt;span class="c"&gt;# Too many results&lt;/span&gt;
&lt;span class="c"&gt;# 2 hours later... still searching&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After OpenTelemetry:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Search by trace ID across all services&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"trace_id=abc123"&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;.log
&lt;span class="c"&gt;# Instantly see: Request → Auth → Database → External API timeout&lt;/span&gt;
&lt;span class="c"&gt;# 2 minutes to identify root cause&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
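&lt;p&gt;The same lookup can be sketched in Python: collect log lines from every service and group them by trace ID. This assumes each line carries a &lt;code&gt;trace_id=...&lt;/code&gt; field as in the grep example above; the log lines and the &lt;code&gt;group_by_trace&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
import re
from collections import defaultdict

TRACE_RE = re.compile(r"trace_id=(\w+)")


def group_by_trace(lines):
    """Map each trace ID to the log lines that mention it, in input order."""
    grouped = defaultdict(list)
    for line in lines:
        match = TRACE_RE.search(line)
        if match:
            grouped[match.group(1)].append(line)
    return dict(grouped)


# Hypothetical log lines from three services.
logs = [
    "auth trace_id=abc123 token verified",
    "db trace_id=abc123 query ok",
    "auth trace_id=def456 token verified",
    "payments trace_id=abc123 external API timeout",
]
for line in group_by_trace(logs)["abc123"]:
    print(line)
```

&lt;p&gt;A tracing backend does the equivalent grouping for you and adds timing and parent/child structure on top.&lt;/p&gt;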



&lt;h2&gt;
  
  
  What You'll Get
&lt;/h2&gt;

&lt;p&gt;With FastAPI + OpenTelemetry + OpenObserve:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Automatic tracing&lt;/strong&gt; for all FastAPI endpoints&lt;br&gt;
✅ &lt;strong&gt;Trace IDs&lt;/strong&gt; that follow requests across microservices&lt;br&gt;
✅ &lt;strong&gt;Log correlation&lt;/strong&gt; - click a trace to see all related logs&lt;br&gt;
✅ &lt;strong&gt;Performance metrics&lt;/strong&gt; - response times, error rates per endpoint&lt;br&gt;
✅ &lt;strong&gt;Fast debugging&lt;/strong&gt; - find issues in minutes, not hours&lt;/p&gt;




&lt;h2&gt;
  
  
  Ready to Set This Up?
&lt;/h2&gt;

&lt;p&gt;The complete setup guide (with step-by-step instructions, code examples, and configuration files) is available on OpenObserve's blog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installing OpenTelemetry Collector&lt;/li&gt;
&lt;li&gt;Configuring YAML for log and trace collection&lt;/li&gt;
&lt;li&gt;Setting up OpenObserve locally or in the cloud&lt;/li&gt;
&lt;li&gt;Instrumenting your FastAPI application with automatic tracing&lt;/li&gt;
&lt;li&gt;Testing and analyzing traces in the OpenObserve dashboard&lt;/li&gt;
&lt;li&gt;Common troubleshooting tips&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://openobserve.ai/blog/monitoring-fastapi-application-using-opentelemetry-and-openobserve/" rel="noopener noreferrer"&gt;Read the full setup guide here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Looking for an OpenTelemetry-native backend?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you need something that works with your existing OTel setup—self-hosted or managed cloud, SQL + PromQL querying, unified logs/metrics/traces, with enterprise features (SSO, RBAC, multi-tenancy) but without the Datadog/Elastic price tag:&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://openobserve.ai" rel="noopener noreferrer"&gt;OpenObserve&lt;/a&gt;. Open-source, 140x lower storage costs, built for teams that want control over their observability stack.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://cloud.openobserve.ai" rel="noopener noreferrer"&gt;Try the cloud version&lt;/a&gt; (14-day trial)&lt;br&gt;
  → &lt;a href="https://openobserve.ai/downloads/" rel="noopener noreferrer"&gt;Download&lt;/a&gt;&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>python</category>
      <category>opentelemetry</category>
      <category>observability</category>
    </item>
    <item>
      <title>Your GPU cluster might be wasting $50k/year through thermal throttling and you'd never know. NVIDIA GPU Monitoring Dashboards</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Sun, 01 Feb 2026 17:35:54 +0000</pubDate>
      <link>https://dev.to/manas_sharma/-1epg</link>
      <guid>https://dev.to/manas_sharma/-1epg</guid>
      <description>&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://dev.to/openobserve/nvidia-gpu-monitoring-with-dcgm-exporter-and-openobserve-complete-setup-guide-34k6" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focr4871j6jbr7k3xunyt.png" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://dev.to/openobserve/nvidia-gpu-monitoring-with-dcgm-exporter-and-openobserve-complete-setup-guide-34k6" rel="noopener noreferrer" class="c-link"&gt;
            NVIDIA GPU Monitoring: Catch Thermal Throttling Before It Costs You $50k/Year - DEV Community
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Stop wasting $50k+ annually on GPU inefficiencies. Monitor H100/H200/A100 clusters in 30 minutes with DCGM + OpenObserve.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j7kvp660rqzt99zui8e.png"&gt;
          dev.to
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>gpu</category>
      <category>observability</category>
    </item>
    <item>
      <title>Your GPU cluster might be wasting $50k/year through thermal throttling and you'd never know. Here's how to catch it before it burns your budget. 30-min setup with DCGM + OpenTelemetry.</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Sun, 01 Feb 2026 17:35:00 +0000</pubDate>
      <link>https://dev.to/manas_sharma/your-gpu-cluster-might-be-wasting-50kyear-through-thermal-throttling-and-youd-never-know-heres-3gcf</link>
      <guid>https://dev.to/manas_sharma/your-gpu-cluster-might-be-wasting-50kyear-through-thermal-throttling-and-youd-never-know-heres-3gcf</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/manas_sharma" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3739096%2F64c63567-d504-47de-b304-1cd488cc2906.jpeg" alt="manas_sharma"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/manas_sharma/nvidia-gpu-monitoring-with-dcgm-exporter-and-openobserve-complete-setup-guide-34k6" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;NVIDIA GPU Monitoring: Catch Thermal Throttling Before It Costs You $50k/Year&lt;/h2&gt;
      &lt;h3&gt;Manas Sharma ・ Feb 1&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#devops&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#monitoring&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#gpu&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#observability&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>gpu</category>
      <category>observability</category>
    </item>
    <item>
      <title>NVIDIA GPU Monitoring: Catch Thermal Throttling Before It Costs You $50k/Year</title>
      <dc:creator>Manas Sharma</dc:creator>
      <pubDate>Sun, 01 Feb 2026 09:19:19 +0000</pubDate>
      <link>https://dev.to/openobserve/nvidia-gpu-monitoring-with-dcgm-exporter-and-openobserve-complete-setup-guide-34k6</link>
      <guid>https://dev.to/openobserve/nvidia-gpu-monitoring-with-dcgm-exporter-and-openobserve-complete-setup-guide-34k6</guid>
      <description>&lt;p&gt;Thermal throttling at 3 AM because you didn't catch that GPU running hot? Your $240k H200 cluster shouldn't be bleeding $50k+ annually through silent failures and inefficiencies.&lt;/p&gt;

&lt;p&gt;We built this guide because monitoring NVIDIA GPUs with traditional tools was taking 4-8 hours of setup time. Here's how to get DCGM Exporter + OpenObserve running in ~30 minutes and catch issues before they torch your budget.&lt;/p&gt;




&lt;p&gt;The AI-driven infrastructure landscape is evolving, and GPU clusters are among the most significant capital investments an organization makes. Whether you're running large language models, training deep learning models, or processing massive datasets, your NVIDIA GPUs (H100, H200, A100, or L40S) are the workhorses powering your most critical workloads.&lt;/p&gt;

&lt;p&gt;But here's the challenge: &lt;strong&gt;how do you know if your GPU infrastructure is performing optimally?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional monitoring approaches fall short when it comes to GPU infrastructure. System metrics like CPU and memory utilization don't tell you if your GPUs are thermal throttling, experiencing memory bottlenecks, or operating at peak efficiency. You need deep visibility into GPU-specific metrics like utilization, temperature, power consumption, memory usage, and PCIe throughput.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;NVIDIA's Data Center GPU Manager (DCGM) Exporter&lt;/strong&gt; combined with &lt;strong&gt;OpenObserve&lt;/strong&gt; creates a powerful, cost-effective monitoring solution that gives you real-time insights into your GPU infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why GPU Monitoring Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The High Cost of GPU Inefficiency
&lt;/h3&gt;

&lt;p&gt;Consider this scenario: You're running an 8x NVIDIA H200 cluster. Each H200 costs approximately $30,000-$40,000, meaning your hardware investment alone is around $240,000-$320,000. Operating costs (power, cooling, infrastructure) can easily add another $50,000-$100,000 annually.&lt;/p&gt;

&lt;p&gt;Now imagine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thermal throttling&lt;/strong&gt; reducing performance by 15% due to poor cooling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU memory leaks&lt;/strong&gt; causing jobs to fail silently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Underutilization&lt;/strong&gt; with GPUs sitting idle 40% of the time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware failures&lt;/strong&gt; going undetected until complete outage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PCIe bottlenecks&lt;/strong&gt; limiting data transfer rates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without proper monitoring, you're flying blind. You might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wasting $50,000+ annually&lt;/strong&gt; on inefficient GPU utilization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing critical performance degradation&lt;/strong&gt; before it impacts production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unable to justify ROI&lt;/strong&gt; on GPU infrastructure to stakeholders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lacking data&lt;/strong&gt; for capacity planning and optimization decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What You Need to Monitor
&lt;/h3&gt;

&lt;p&gt;Effective GPU monitoring requires tracking dozens of metrics across multiple dimensions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU compute utilization (%)&lt;/li&gt;
&lt;li&gt;Memory bandwidth utilization (%)&lt;/li&gt;
&lt;li&gt;Tensor Core utilization&lt;/li&gt;
&lt;li&gt;SM (Streaming Multiprocessor) occupancy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Thermal &amp;amp; Power:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU temperature (°C)&lt;/li&gt;
&lt;li&gt;Power consumption (W)&lt;/li&gt;
&lt;li&gt;Power limit throttling events&lt;/li&gt;
&lt;li&gt;Thermal throttling events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Memory:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU memory usage (MB/GB)&lt;/li&gt;
&lt;li&gt;Memory allocation failures&lt;/li&gt;
&lt;li&gt;ECC (Error Correction Code) errors&lt;/li&gt;
&lt;li&gt;Memory clock speeds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interconnect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PCIe throughput (TX/RX)&lt;/li&gt;
&lt;li&gt;NVLink bandwidth&lt;/li&gt;
&lt;li&gt;NVSwitch fabric health&lt;/li&gt;
&lt;li&gt;Data transfer bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Health &amp;amp; Reliability:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XID errors (hardware faults)&lt;/li&gt;
&lt;li&gt;Page retirement events&lt;/li&gt;
&lt;li&gt;GPU compute capability&lt;/li&gt;
&lt;li&gt;Driver version compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution: DCGM Exporter + OpenObserve
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is DCGM Exporter?
&lt;/h3&gt;

&lt;p&gt;NVIDIA's Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs. DCGM Exporter exposes GPU metrics in Prometheus format, making it easy to integrate with modern observability platforms.&lt;/p&gt;

&lt;p&gt;More details are available in the &lt;a href="https://github.com/NVIDIA/dcgm-exporter" rel="noopener noreferrer"&gt;DCGM Exporter repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exposes 40+ GPU metrics per device&lt;/li&gt;
&lt;li&gt;Supports all modern NVIDIA datacenter GPUs (A100, H100, H200, L40S)&lt;/li&gt;
&lt;li&gt;Low overhead monitoring (~1% GPU utilization)&lt;/li&gt;
&lt;li&gt;Works with Docker, Kubernetes, and bare metal&lt;/li&gt;
&lt;li&gt;Handles multi-GPU and multi-node deployments&lt;/li&gt;
&lt;li&gt;Provides health diagnostics and error detection&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Complete Setup Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before starting, ensure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU-enabled server (cloud or on-premises)&lt;/li&gt;
&lt;li&gt;NVIDIA GPUs installed and recognized by the system&lt;/li&gt;
&lt;li&gt;NVIDIA drivers version 535+ (550+ recommended for H200)&lt;/li&gt;
&lt;li&gt;Docker installed and configured with NVIDIA Container Toolkit&lt;/li&gt;
&lt;li&gt;OpenObserve instance (cloud or self-hosted)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Verify GPU Detection
&lt;/h3&gt;

&lt;p&gt;First, confirm your GPUs are properly detected by the system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if GPUs are visible&lt;/span&gt;
nvidia-smi

&lt;span class="c"&gt;# Expected output: List of GPUs with utilization, temperature, and memory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For NVIDIA H200 or multi-GPU systems with NVSwitch, you'll need the NVIDIA Fabric Manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install fabric manager (version should match your driver)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nvidia-driver-535 nvidia-fabricmanager-535

&lt;span class="c"&gt;# Reboot to load new driver&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;reboot

&lt;span class="c"&gt;# After reboot, start the service&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start nvidia-fabricmanager
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;nvidia-fabricmanager

&lt;span class="c"&gt;# Verify&lt;/span&gt;
nvidia-smi  &lt;span class="c"&gt;# Should now show all GPUs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Deploy DCGM Exporter
&lt;/h3&gt;

&lt;p&gt;Deploy DCGM Exporter as a Docker container. This lightweight container exposes GPU metrics on port 9400:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cap-add&lt;/span&gt; SYS_ADMIN &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt; host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; dcgm-exporter &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuration breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--gpus all&lt;/code&gt; - Grants access to all GPUs on the host&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--cap-add SYS_ADMIN&lt;/code&gt; - Required for DCGM to query GPU metrics&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--network host&lt;/code&gt; - Uses host networking for easier access&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--restart unless-stopped&lt;/code&gt; - Ensures resilience across reboots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verify DCGM is working:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Wait 10 seconds for initialization&lt;/span&gt;
&lt;span class="nb"&gt;sleep &lt;/span&gt;10

&lt;span class="c"&gt;# Access metrics from inside the container&lt;/span&gt;
docker &lt;span class="nb"&gt;exec &lt;/span&gt;dcgm-exporter curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:9400/metrics | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-30&lt;/span&gt;

&lt;span class="c"&gt;# You should see output like:&lt;/span&gt;
&lt;span class="c"&gt;# DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-xxxx",...} 45.0&lt;/span&gt;
&lt;span class="c"&gt;# DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-xxxx",...} 42.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
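&lt;p&gt;If you want to sanity-check the exporter output programmatically, the sample lines shown above can be parsed with a few lines of Python. This is a minimal sketch, not part of DCGM Exporter: &lt;code&gt;parse_sample&lt;/code&gt; is a hypothetical helper, and it assumes each sample has a label set and no trailing timestamp or escaped quotes in label values.&lt;/p&gt;

```python
import re

# Matches one Prometheus exposition sample of the form: name{labels} value
# Assumption: a label set is present, as in the DCGM output shown above.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Return (metric_name, labels, value), or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    name, raw_labels, raw_value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels))
    return name, labels, float(raw_value)


if __name__ == "__main__":
    sample = 'DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-xxxx"} 42.0'
    print(parse_sample(sample))
```

&lt;p&gt;Feeding it the lines returned by the &lt;code&gt;curl&lt;/code&gt; command gives you one tuple per GPU per metric, which is enough to confirm that every GPU index you expect is actually reporting.&lt;/p&gt;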



&lt;h3&gt;
  
  
  Step 3: Configure OpenTelemetry Collector
&lt;/h3&gt;

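&lt;p&gt;The OpenTelemetry Collector scrapes metrics from DCGM Exporter and forwards them to OpenObserve. Save the following configuration as &lt;code&gt;otel-collector-config.yaml&lt;/code&gt; in your working directory (Step 4 mounts this file into the container):&lt;br&gt;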
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dcgm-gpu-metrics'&lt;/span&gt;
          &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
          &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9400'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
          &lt;span class="na"&gt;metric_relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Keep only DCGM metrics&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__name__&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
              &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DCGM_.*'&lt;/span&gt;
              &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;keep&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlphttp/openobserve&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://example.openobserve.ai/api/ORG_NAME/&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;YOUR_O2_TOKEN"&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="na"&gt;send_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlphttp/openobserve&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Get your OpenObserve credentials:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# For Ingestion token authentication (recommended):&lt;/span&gt;
Go to OpenObserve UI → Datasources -&amp;gt; Custom -&amp;gt; Otel Collector
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyd0f0rh47j5c6jy1ii6q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyd0f0rh47j5c6jy1ii6q.jpeg" alt="openobserve ingestion token" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Update the &lt;code&gt;Authorization&lt;/code&gt; header in the config with your base64-encoded credentials.&lt;/p&gt;
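&lt;p&gt;The OpenObserve UI normally shows a ready-made token, but if you need to build the header value yourself, the sketch below shows how a standard HTTP Basic value is formed. It assumes your credentials are a user and a password or token joined by a colon; &lt;code&gt;basic_auth_header&lt;/code&gt; and the placeholder credentials are illustrative, not part of OpenObserve.&lt;/p&gt;

```python
import base64


def basic_auth_header(user, secret):
    """Build an HTTP Basic Authorization value: 'Basic ' + base64('user:secret')."""
    raw = f"{user}:{secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    # Hypothetical placeholder credentials; substitute your own.
    print(basic_auth_header("user@example.com", "YOUR_PASSWORD_OR_TOKEN"))
```

&lt;p&gt;Paste the printed string as the full value of &lt;code&gt;Authorization&lt;/code&gt; in the exporter section, and keep the config file out of version control since it now contains a secret.&lt;/p&gt;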

&lt;h3&gt;
  
  
  Step 4: Deploy OpenTelemetry Collector
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt; host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/otel-collector-config.yaml:/etc/otel-collector-config.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; otel-collector &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
  otel/opentelemetry-collector-contrib:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/otel-collector-config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check OpenTelemetry Collector:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# View collector logs&lt;/span&gt;
docker logs otel-collector

&lt;span class="c"&gt;# Look for successful scrapes (no error messages)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check OpenObserve:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log into OpenObserve UI&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Metrics&lt;/strong&gt; section&lt;/li&gt;
&lt;li&gt;Search for metrics starting with &lt;code&gt;DCGM_&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Data should appear within 1-2 minutes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk82kheukfq0gjptaamg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk82kheukfq0gjptaamg1.png" alt="dcgm metrics list" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Generate GPU Load (Optional)
&lt;/h3&gt;

&lt;p&gt;To verify monitoring is working, generate some GPU activity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install PyTorch&lt;/span&gt;
pip3 &lt;span class="nb"&gt;install &lt;/span&gt;torch

&lt;span class="c"&gt;# Create a load test script&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gpu_load.py &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
import torch
import time

print("Starting GPU load test...")
devices = [torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]
tensors = [torch.randn(15000, 15000, device=d) for d in devices]

print(f"Loaded {len(devices)} GPUs")
while True:
    for tensor in tensors:
        _ = torch.mm(tensor, tensor)
    time.sleep(0.5)
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Run load test&lt;/span&gt;
python3 gpu_load.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watch your metrics in OpenObserve - you should see GPU utilization spike!&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating Dashboards in OpenObserve
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Download the Dashboards from our &lt;a href="https://github.com/openobserve/dashboards/tree/main/NVIDIA%20GPU%20Monitoring" rel="noopener noreferrer"&gt;community repository&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;In the OpenObserve UI, go to Dashboards → Import → Drop your files here → select your JSON file → Import&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1prhwbahlrq91r5n5ecn.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1prhwbahlrq91r5n5ecn.gif" alt="steps to show how to import dashboards" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Once the dashboard is imported, you will see the prebuilt panels shown below. You can customize them as needed.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnqidg9ca68b8y81uq2n.gif" alt="gpu-dash.gif" width="560" height="304"&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Setting Up Alerts
&lt;/h2&gt;

&lt;p&gt;Critical alerts to configure in OpenObserve:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. High GPU Temperature
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DCGM_FI_DEV_GPU_TEMP &amp;gt; 85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Severity:&lt;/strong&gt; Warning at 85°C, Critical at 90°C&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Check cooling systems, reduce workload&lt;/p&gt;

&lt;h3&gt;
  
  
  2. GPU Memory Near Capacity
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) &amp;gt; 0.90
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Severity:&lt;/strong&gt; Warning at 90%, Critical at 95%&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Optimize memory usage or scale horizontally&lt;/p&gt;
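&lt;p&gt;To make the two-tier severities concrete, here is a small sketch of how the temperature and memory thresholds above could be evaluated against raw readings. The function names are hypothetical, and it assumes strict "greater than" comparisons, matching the alert expressions.&lt;/p&gt;

```python
def temperature_severity(temp_c, warning=85.0, critical=90.0):
    """Classify a DCGM_FI_DEV_GPU_TEMP reading against the thresholds above."""
    if temp_c > critical:
        return "critical"
    if temp_c > warning:
        return "warning"
    return "ok"


def memory_severity(fb_used, fb_free, warning=0.90, critical=0.95):
    """Classify framebuffer usage as used / (used + free)."""
    total = fb_used + fb_free
    if total == 0:
        return "unknown"  # no framebuffer data reported
    ratio = fb_used / total
    if ratio > critical:
        return "critical"
    if ratio > warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(temperature_severity(87.0))
    print(memory_severity(fb_used=92, fb_free=8))
```

&lt;p&gt;In practice you would configure these as two alerts per metric in OpenObserve rather than run them in application code; the sketch only shows how the warning and critical tiers relate.&lt;/p&gt;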

&lt;h3&gt;
  
  
  3. Low GPU Utilization (Waste Detection)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(DCGM_FI_DEV_GPU_UTIL) &amp;lt; 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Duration:&lt;/strong&gt; For 30 minutes&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Review workload scheduling, consider rightsizing&lt;/p&gt;

&lt;h3&gt;
  
  
  4. GPU Hardware Errors
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;increase(DCGM_FI_DEV_XID_ERRORS[5m]) &amp;gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Severity:&lt;/strong&gt; Critical&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Immediate investigation, potential RMA&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Thermal Throttling Detected
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;increase(DCGM_FI_DEV_THERMAL_VIOLATION[5m]) &amp;gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Severity:&lt;/strong&gt; Warning&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Improve cooling or reduce ambient temperature&lt;/p&gt;

&lt;h3&gt;
  
  
  6. GPU Offline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;absent(DCGM_FI_DEV_GPU_TEMP)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Duration:&lt;/strong&gt; For 2 minutes&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Check GPU health, driver status, fabric manager&lt;/p&gt;

&lt;h2&gt;
  
  
  Traditional Monitoring vs. GPU Monitoring with OpenObserve
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Traditional Monitoring (Prometheus/Grafana)&lt;/th&gt;
&lt;th&gt;OpenObserve for GPU Monitoring&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires Prometheus, node exporters, Grafana, storage backend, and complex configuration&lt;/td&gt;
&lt;td&gt;Single unified platform with built-in visualization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Costs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High - Prometheus stores all metrics at full resolution, requires expensive SSD storage&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;80% lower&lt;/strong&gt; - Advanced compression and columnar storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-tenancy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complex setup requiring multiple Prometheus instances or federation&lt;/td&gt;
&lt;td&gt;Built-in with organization isolation and access controls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Alerting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate alerting system (Alertmanager), complex routing configuration&lt;/td&gt;
&lt;td&gt;Integrated alerting with flexible notification channels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Long-term Retention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Expensive - requires additional tools like Thanos or Cortex&lt;/td&gt;
&lt;td&gt;Native long-term storage with automatic data lifecycle management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU-Specific Features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generic time-series database, not optimized for GPU metrics&lt;/td&gt;
&lt;td&gt;Optimized for high-cardinality workloads like GPU monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Log Correlation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate log management system needed (ELK, Loki)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Unified logs, metrics, and traces&lt;/strong&gt; in one platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4-8 hours (multiple components, configurations, troubleshooting)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;30 minutes&lt;/strong&gt; (end-to-end)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintenance Overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High - multiple systems to update, monitor, and troubleshoot&lt;/td&gt;
&lt;td&gt;Low - single platform with automatic updates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  ROI Examples
&lt;/h3&gt;

&lt;p&gt;For an 8-GPU H200 cluster worth $320,000:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detect thermal throttling early:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;15% performance loss = $48,000 annual waste&lt;/li&gt;
&lt;li&gt;Early detection saves this loss&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROI: 990% in first year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optimize utilization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase from 40% to 70% = 75% more work&lt;/li&gt;
&lt;li&gt;Defer $240,000 expansion by 1 year&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROI: 4,900% in first year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prevent downtime:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 hour downtime = $2,800 revenue loss&lt;/li&gt;
&lt;li&gt;Preventing 5 hours/year = $14,000 saved&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROI: 289% in first year&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
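&lt;p&gt;The dollar figures above follow from simple arithmetic, sketched below so you can plug in your own numbers. The function names are illustrative. The ROI percentages additionally depend on what monitoring costs you, which this article does not state, so &lt;code&gt;monitoring_cost&lt;/code&gt; is a hypothetical input you must supply.&lt;/p&gt;

```python
def annual_throttling_waste(cluster_value, loss_percent):
    """Dollar value of capacity lost to throttling, as a share of cluster value."""
    return cluster_value * loss_percent / 100


def extra_work_fraction(old_utilization, new_utilization):
    """Relative increase in useful work when utilization rises."""
    return new_utilization / old_utilization - 1


def downtime_savings(hours_prevented, cost_per_hour):
    """Revenue preserved by preventing downtime."""
    return hours_prevented * cost_per_hour


def roi_percent(savings, monitoring_cost):
    """First-year ROI. monitoring_cost is hypothetical: the article does not state it."""
    return (savings - monitoring_cost) / monitoring_cost * 100


if __name__ == "__main__":
    print(annual_throttling_waste(320_000, 15))  # 15% loss on a $320,000 cluster
    print(extra_work_fraction(40, 70))           # utilization rising from 40% to 70%
    print(downtime_savings(5, 2_800))            # 5 hours at $2,800 per hour
```

&lt;p&gt;Because ROI scales inversely with the monitoring cost, it is worth computing it with your actual licence, storage, and engineering-time figures before quoting a percentage to stakeholders.&lt;/p&gt;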

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;GPU monitoring is no longer optional; it is essential infrastructure for any organization running GPU workloads. The combination of DCGM Exporter and OpenObserve provides:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Complete visibility&lt;/strong&gt; into GPU health, performance, and utilization&lt;br&gt;
✅ &lt;strong&gt;Cost optimization&lt;/strong&gt; through identifying waste and inefficiencies&lt;br&gt;
✅ &lt;strong&gt;Proactive alerting&lt;/strong&gt; to prevent outages and degradation&lt;br&gt;
✅ &lt;strong&gt;Data-driven decisions&lt;/strong&gt; for capacity planning and architecture&lt;br&gt;
✅ &lt;strong&gt;89% lower TCO&lt;/strong&gt; compared to traditional monitoring stacks&lt;br&gt;
✅ &lt;strong&gt;30-minute setup&lt;/strong&gt; vs. 4-8 hours with traditional tools&lt;/p&gt;

&lt;p&gt;Whether you're running AI/ML workloads, rendering farms, scientific computing, or GPU-accelerated databases, this monitoring solution delivers immediate ROI while scaling effortlessly as your infrastructure grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DCGM Exporter:&lt;/strong&gt; &lt;a href="https://github.com/NVIDIA/dcgm-exporter" rel="noopener noreferrer"&gt;github.com/NVIDIA/dcgm-exporter&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenObserve:&lt;/strong&gt; &lt;a href="https://openobserve.ai" rel="noopener noreferrer"&gt;openobserve.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenObserve Docs:&lt;/strong&gt; &lt;a href="https://openobserve.ai/docs" rel="noopener noreferrer"&gt;openobserve.ai/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Collector:&lt;/strong&gt; &lt;a href="https://opentelemetry.io/docs/collector" rel="noopener noreferrer"&gt;opentelemetry.io/docs/collector&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;h4&gt;
  
  
  Get Started with OpenObserve Today!
&lt;/h4&gt;

&lt;p&gt;Sign up for a &lt;a href="https://cloud.openobserve.ai" rel="noopener noreferrer"&gt;14-day trial&lt;/a&gt;.&lt;br&gt;
Check out our &lt;a href="https://github.com/openobserve" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; for self-hosting and contribution opportunities.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Debugging GPU infrastructure shouldn't feel like a 2 AM guessing game.&lt;/strong&gt;&lt;br&gt;
Try &lt;a href="//cloud.openobserve.ai"&gt;OpenObserve&lt;/a&gt; for free&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>gpu</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
