DEV Community: Manas Sharma

How to Monitor AI Agents in Production

Manas Sharma — Thu, 28 May 2026 06:18:40 +0000

TLDR

Monitoring AI agents in production requires distributed tracing: a single user request fans out into 10 or more internal operations, and logs alone cannot show you which step is slow, failing, or burning your token budget.

OpenTelemetry's gen_ai.* semantic conventions give you standardized span attributes for LLM calls, tool invocations, and agent steps. Some are stable today; others are still experimental.

Auto-instrumentation libraries (OpenLLMetry, OpenInference, OpenLIT) cover most agent frameworks with two to three lines of initialization code. You do not change your agent code.

Traces ship to OpenObserve over OTLP. From there you get SQL-queryable trace data, token usage dashboards, cost attribution by agent and model, and alerting on latency and cost anomalies.

OpenObserve also exposes an MCP server. You can query your live agent traces from a Claude or GPT session without opening a dashboard.

Why Agents Are Harder to Monitor Than a Single LLM Call

A single LLM call is straightforward to observe. One HTTP request, one response, one latency number. You can log the input and output and call it done.

An agent is different. When a user sends a message, the agent calls an LLM to decide what to do, invokes a tool, processes the result, calls the LLM again, possibly calls another tool, and eventually returns a response. That one user message becomes ten or more internal operations. Some of those operations call external APIs. Some retry. Some spawn sub-agents.

Without distributed tracing, you see none of this structure. You know the response took 8 seconds. You do not know whether the LLM took 7 of those seconds or whether a tool made three retries before timing out.

Four categories of problems appear in production agents that you cannot debug without traces:

Latency. Which step is slow? The LLM call? The tool execution? A retry loop the agent entered because the tool returned ambiguous output?
Cost. Which agent, which task, which model is consuming tokens? A single misconfigured prompt can bloat your monthly bill.
Failures. Did the tool fail silently and return an empty result? Did the agent exhaust its step limit and return a fallback response?
Quality. Did the agent complete the task, or did it reason its way to a confident-sounding wrong answer?

Distributed tracing gives you a complete record of every operation, in order, with timing and attributes. That record is what makes these questions answerable.

The OTel Data Model for AI Agents

OpenTelemetry's GenAI semantic conventions define a standard set of span attributes for AI workloads. The stable attributes you can build on today:

Attribute	What it captures
`gen_ai.system`	LLM provider: openai, anthropic, cohere (renamed to `gen_ai.provider.name` in the current spec)
`gen_ai.operation.name`	Operation type: chat, embeddings, text_completion
`gen_ai.request.model`	Model name: gpt-4o, claude-3-5-sonnet-20241022
`gen_ai.usage.input_tokens`	Tokens consumed by the prompt
`gen_ai.usage.output_tokens`	Tokens in the model response
`gen_ai.response.finish_reasons`	Why the model stopped: stop, tool_calls, length

For agent-specific spans, the conventions extend to gen_ai.agent.name, gen_ai.agent.description, gen_ai.tool.name, and gen_ai.tool.description. These are still marked experimental as of early 2026 but are already implemented by the major instrumentation libraries and are stable enough to use in production.

For a full breakdown of what OpenTelemetry captures for LLM workloads, including how SRE teams use the three signal types together, see OpenTelemetry for LLMs: Complete SRE Guide.

Spans: LLM calls, tool invocations, and agent steps

Every significant operation in an agent's lifecycle becomes a span:

gen_ai.chat: wraps a single LLM API call. Carries model name, token counts, and finish reason.
gen_ai.tool: wraps a single tool invocation. Typically a sibling of the LLM call that requested it, under the same agent.step span.
agent.step: wraps one full reasoning cycle. Parent of all LLM and tool spans within that cycle.

Events vs. attributes for prompt and response content

Prompt and completion content is large. Storing it as span attributes inflates trace payloads and storage costs. The OTel GenAI convention puts prompt and completion content into span events (typed gen_ai.content.prompt and gen_ai.content.completion) rather than attributes. Events attach to the span but are stored separately, keeping the attribute payload small while preserving full content for debugging.

In practice: leave content capture enabled during development. Before shipping to production, disable it at the application level or route it through the Collector for redaction.

Trace context propagation across agent boundaries

When an orchestrator delegates to a worker agent, the worker's spans need to appear under the same root trace. For HTTP-based delegation, include the W3C traceparent header in the outgoing request and extract it in the worker. For in-process delegation (LangGraph node transitions, OpenAI Agents SDK handoffs), the auto-instrumentation propagates context for you.
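To make the HTTP case concrete, here is a minimal stdlib-only sketch of what the traceparent header carries. The helper names (build_traceparent, parse_traceparent) are made up for illustration and only version 00 of the header is handled. In a real service you would normally let opentelemetry.propagate.inject and opentelemetry.propagate.extract do this rather than formatting the header yourself.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format the header the orchestrator attaches to its outgoing request."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if malformed."""
    match = _TRACEPARENT.match(header.strip().lower())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the Trace Context spec.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


# Orchestrator side: in practice these IDs come from the active span.
trace_id = secrets.token_hex(16)  # 32 hex characters
span_id = secrets.token_hex(8)    # 16 hex characters
outgoing_headers = {"traceparent": build_traceparent(trace_id, span_id)}

# Worker side: recover the parent context from the incoming request.
parsed = parse_traceparent(outgoing_headers["traceparent"])
```

The worker starts its first span with the parsed trace ID and parent span ID, which is what places its spans under the orchestrator's root trace.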

Picking Your Auto-Instrumentation Library

Three libraries sit between your agent code and the OTel SDK. The examples in this blog use LangChain and the OpenAI Agents SDK, both supported by all three libraries. For support across other frameworks (CrewAI, AutoGen, DSPy, and more), check each library's docs.

Library	Signals	LangChain	OpenAI Agents	Config overhead
OpenLLMetry (`traceloop-sdk`)	Traces + Metrics + Logs	Yes	Yes	Medium
OpenInference	Traces only	Yes	Yes	Low
OpenLIT	Traces + Metrics	Yes	Yes	Minimal

OpenLLMetry captures the most signals and covers the widest framework catalog. OpenLIT is the easiest entry point: one import, one function call. OpenInference is traces-only but has the closest alignment with OTel GenAI semantic conventions.

For teams starting out: use OpenLLMetry. For teams already running an OTel SDK setup: use the official opentelemetry-instrumentation-* packages from opentelemetry-python-contrib, which include opentelemetry-instrumentation-langchain and opentelemetry-instrumentation-openai-agents-v2.

For a full walkthrough of OpenLIT with OpenObserve, including pre-built dashboards for GPU and vector database monitoring, see LLM Observability for AI Applications with OpenObserve and OpenLIT.

For a broader comparison of open-source LLM observability tooling, see Top Open Source LLM Observability Tools.

Example 1: Instrumenting a LangChain Agent

The following examples use LangChain and the OpenAI Agents SDK. The instrumentation pattern is the same for virtually every other agent framework: install a library, initialize before importing framework classes, point the exporter at your backend.

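LangChain's current recommended approach for building agents uses LangGraph as the execution runtime. The opentelemetry-instrumentation-langchain package is meant to instrument both, but because of the compatibility issue noted below, this example instruments the OpenAI client instead.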

Install:

pip install opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-http \
    opentelemetry-instrumentation-openai \
    langgraph langchain-openai

Initialize before any LangChain imports:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.openai import OpenAIInstrumentor

exporter = OTLPSpanExporter(
    endpoint="<your-openobserve-otlp-endpoint>",
    headers={
        "Authorization": "Basic <base64(email:password)>",
        "stream-name": "default",
    },
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))

OpenAIInstrumentor().instrument(tracer_provider=provider)

Note: opentelemetry-instrumentation-langchain has a known compatibility issue with current LangGraph versions. OpenAIInstrumentor covers the spans that matter: LLM calls with token counts, model name, and finish reason. LangChain graph-level spans can be added manually if needed.

A simple ReAct agent with a tool:

# create_react_agent(llm, tools) with a "messages" input is the LangGraph prebuilt API
from langgraph.prebuilt import create_react_agent
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool

@tool
def get_stock_price(ticker: str) -> str:
    """Get the current stock price for a ticker symbol."""
    # Replace with your actual data source
    return f"{ticker}: $142.50"

llm = ChatOpenAI(model="gpt-4o-mini")
agent = create_react_agent(llm, [get_stock_price])

result = agent.invoke({
    "messages": [{"role": "user", "content": "What is the price of AAPL?"}]
})

You did not add a single line to the agent code. The instrumentation wraps the OpenAI client at import time and emits a span for every LLM call the agent makes.

What you get in OpenObserve:

One span per LLM call with gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens
Wall clock timing on every span
A root span for the graph execution and one child span per tool invocation, but only once graph-level spans are added (see the note above)

By default, prompt and completion content is captured. Disable it for production:

TRACELOOP_TRACE_CONTENT=false

Example 2: Instrumenting an OpenAI Agents SDK App

Install:

pip install opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-http \
    opentelemetry-instrumentation-openai-agents \
    openai-agents

Initialize:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.openai_agents import OpenAIAgentsInstrumentor

exporter = OTLPSpanExporter(
    endpoint="<your-openobserve-otlp-endpoint>",
    headers={
        "Authorization": "Basic <base64(email:password)>",
        "stream-name": "default",
    },
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
OpenAIAgentsInstrumentor().instrument(tracer_provider=provider)

A two-agent handoff:

from agents import Agent, handoff, Runner, function_tool

@function_tool
def search_knowledge_base(query: str) -> str:
    """Search the internal knowledge base for product information."""
    return f"Results for '{query}': Feature Y has been available since v2.3."

support_agent = Agent(
    name="support_agent",
    instructions="Answer customer questions using the knowledge base.",
    tools=[search_knowledge_base],
    model="gpt-4o-mini",
)

triage_agent = Agent(
    name="triage_agent",
    instructions="Route incoming requests to the correct specialist.",
    handoffs=[handoff(support_agent)],
    model="gpt-4o-mini",
)

result = Runner.run_sync(triage_agent, "How do I enable feature Y?")

The instrumentation generates spans for each agent activation (tagged with gen_ai.agent.name), each LLM generation (with model and token counts), each tool call (with name and arguments), and each handoff between agents. The handoff span shows up as a child of the triage agent span and a parent of the support agent span, giving you the full call tree.

Content capture here uses a different setting from OpenLLMetry's TRACELOOP_TRACE_CONTENT:

OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=span_only

Options: span_only, event_only, span_and_event, no_content. Use no_content in production if prompts contain PII.

Shipping Traces to OpenObserve

The OTLP exporter configuration shown in the examples above works for both self-hosted and cloud deployments. The only difference is the endpoint URL.

Self-hosted OpenObserve (port 5080):

OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:5080/api/default/v1/traces
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic <base64_token>,stream-name=default

OpenObserve Cloud:

OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=https://api.openobserve.ai/api/<your_org>/v1/traces
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic <base64_token>,stream-name=default

Generate the base64 token:

echo -n "your_email@example.com:your_password" | base64
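If you would rather build the header in application code than in the shell, the same encoding takes a few lines of Python. This is a sketch: basic_auth_header is a hypothetical helper name, and the credentials are placeholders.

```python
import base64


def basic_auth_header(email: str, password: str) -> str:
    """Build the Authorization header value from OpenObserve credentials."""
    token = base64.b64encode(f"{email}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


# Placeholder credentials; load real ones from a secret store, not source code.
headers = {
    "Authorization": basic_auth_header("your_email@example.com", "your_password"),
    "stream-name": "default",
}
```

The resulting dict has the same shape as the headers argument passed to OTLPSpanExporter in the examples above.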

Direct export vs. OTel Collector

Direct export is simpler for development and small deployments. The application sends spans directly to OpenObserve with no intermediate hop.

The OTel Collector adds a processing layer between your agent and OpenObserve. It is worth adding when you need any of the following:

PII redaction before spans leave your application network
Tail-based sampling to reduce trace volume (see the production checklist below)
Routing the same telemetry to multiple backends simultaneously

For a complete OTLP exporter configuration guide covering both the direct and Collector paths, see LangChain and LlamaIndex Tracing with OpenObserve.

Sample Collector configuration pointing at OpenObserve:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  otlphttp/openobserve:
    endpoint: <your-openobserve-otlp-endpoint>
    headers:
      Authorization: "Basic <base64_token>"
      stream-name: default

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/openobserve]

You can find your OTLP endpoint and the matching Authorization header in the OpenObserve UI under Data Sources → OpenTelemetry Collector. Copy the values directly from there into your Collector config.

What to Look For in OpenObserve

Reading a multi-agent trace waterfall

The trace timeline shows every span as a horizontal bar: width is duration, indentation is the parent-child relationship. For a LangChain ReAct agent, you can immediately see which LLM call or tool invocation is driving latency, something that's invisible in logs.

SQL queries for token usage and cost

OpenObserve lets you query trace data with SQL directly against the gen_ai.* attributes. For example, token usage by model over the last hour:

SELECT
    gen_ai_request_model AS model,
    SUM(CAST(gen_ai_usage_input_tokens AS BIGINT)) AS input_tokens,
    SUM(CAST(gen_ai_usage_output_tokens AS BIGINT)) AS output_tokens
FROM default
WHERE gen_ai_request_model IS NOT NULL
GROUP BY gen_ai_request_model
ORDER BY input_tokens DESC

Note: OpenObserve stores span attributes as top-level flattened fields using underscores (gen_ai_request_model, not attributes['gen_ai.request.model']). The time range filter is applied via the dashboard time picker rather than in SQL, since _timestamp is stored as nanosecond Int64 and is not directly comparable to NOW().

You can extend the same pattern to P99 latency by agent (span_name = 'agent.step') or error rate by tool (span_name = 'gen_ai.tool'). For a full cost attribution setup (per-agent, per-model, with real-time spend alerting), see LLM Cost Monitoring with OpenObserve.

Querying Agent Traces via MCP

OpenObserve exposes an MCP server, so any MCP-compatible LLM client can query your trace store directly, with no dashboard or SQL client required. Connect it to Claude Code:

claude mcp add o2 https://api.openobserve.ai/api/<your_org>/mcp \
  -t http \
  --header "Authorization: Basic <base64_token>"

For self-hosted OpenObserve, replace the URL with http://localhost:5080/api/<your_org>/mcp. Once connected, ask questions like "which tool had the highest error rate in the last hour?" and get structured results back in your LLM session.

For a full guide to MCP servers in the observability stack, see What OpenObserve MCP server can do?

Production Checklist

PII redaction

Disable prompt and completion capture at the application level before traces leave the process:

# OpenLLMetry
TRACELOOP_TRACE_CONTENT=false

# OpenAI Agents SDK / OTel GenAI instrumentation
OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=no_content

For finer-grained redaction (specific patterns, or third-party instrumentation you don't fully control), OpenObserve has a native sensitive data redaction feature with 140+ built-in PII patterns and redact/hash/drop actions applied at ingestion time. See Sensitive Data Redaction in OpenObserve for a full walkthrough, or the OTel Collector approach for logs if you prefer to handle it at the pipeline level.

Sampling for LLM traffic

LLM spans are large and frequent. Tracing at 100% is expensive. Use tail-based sampling in the Collector: keep 100% of error traces and slow traces (e.g. >5s), and sample the rest probabilistically (e.g. 10%). This preserves the traces you need for debugging while keeping storage costs predictable. For a deeper look at head- vs. tail-based sampling tradeoffs and Collector configuration, see Head-Based vs Tail-Based Sampling.
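In the Collector this policy is expressed as configuration for the tail_sampling processor, not as code. The Python sketch below only illustrates the decision that policy encodes. TraceSummary and keep_trace are hypothetical names, and the 5-second and 10% values are the examples from the paragraph above.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str      # hex trace ID
    duration_s: float  # root span duration in seconds
    has_error: bool    # True if any span in the trace ended with an error


def keep_trace(trace: TraceSummary, slow_threshold_s: float = 5.0,
               sample_rate: float = 0.10) -> bool:
    """Decide, after the whole trace has been seen, whether to keep it."""
    # Always keep the traces you will need for debugging.
    if trace.has_error or trace.duration_s > slow_threshold_s:
        return True
    # Hash the trace ID so the decision is deterministic: the same trace
    # gets the same answer no matter which process evaluates it.
    digest = hashlib.sha256(trace.trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


decisions = [
    keep_trace(TraceSummary("a" * 32, 0.8, has_error=True)),   # error: kept
    keep_trace(TraceSummary("b" * 32, 7.2, has_error=False)),  # slow: kept
]
```

The key property of tail-based sampling is visible in the function signature: the decision needs duration and error status, which are only known once the trace has finished.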

Alerting

Four alerts to configure before your agent goes to production:

Latency spike: P99 of agent.step spans exceeds 10 seconds in a 5-minute window
Cost anomaly: total gen_ai.usage.output_tokens per hour exceeds your 7-day baseline by 3x
Tool failure rate: error percentage on any gen_ai.tool span exceeds 5% in 15 minutes
Trace volume spike: unique trace IDs per minute exceeds 5x the normal rate (retry storm or agent stuck in a loop)

OpenObserve supports scheduled and real-time alerts with SQL, PromQL, or the query builder. See the Alerts docs to configure these.
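Before translating the four alerts into OpenObserve alert rules, it can help to pin the conditions down precisely. The sketch below writes them as plain predicates over pre-aggregated window statistics. WindowStats and fired_alerts are hypothetical names, the thresholds are the ones listed above, and tool failures are simplified to a single tool.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    p99_step_latency_s: float        # P99 of agent.step spans, 5-minute window
    output_tokens_last_hour: int     # sum of gen_ai.usage.output_tokens
    baseline_tokens_per_hour: float  # hourly average over the last 7 days
    tool_calls: int                  # gen_ai.tool spans in the last 15 minutes
    tool_errors: int                 # how many of those ended in error
    traces_per_minute: float         # unique trace IDs per minute, current
    normal_traces_per_minute: float  # unique trace IDs per minute, typical


def fired_alerts(s: WindowStats) -> list[str]:
    """Return the names of the alert conditions that currently hold."""
    alerts = []
    if s.p99_step_latency_s > 10:
        alerts.append("latency_spike")
    if s.output_tokens_last_hour > 3 * s.baseline_tokens_per_hour:
        alerts.append("cost_anomaly")
    # Guard against division by zero when no tools ran in the window.
    if s.tool_calls and s.tool_errors / s.tool_calls > 0.05:
        alerts.append("tool_failure_rate")
    if s.traces_per_minute > 5 * s.normal_traces_per_minute:
        alerts.append("trace_volume_spike")
    return alerts


retry_storm = WindowStats(12.0, 4000, 1000.0, 100, 10, 60.0, 10.0)
alerts = fired_alerts(retry_storm)
```

A retry storm typically trips several conditions at once, which is why the trace volume alert is worth having alongside the cost one.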

Try It on OpenObserve Cloud

OpenObserve Cloud gives you an OTLP endpoint ready to accept traces, metrics, and logs with no infrastructure to provision. Point your exporter at https://api.openobserve.ai/api/<your_org>/v1/traces, set your auth header, and agent traces start appearing in the UI within seconds. The same SQL queries, cost dashboards, and MCP server are available from day one.

Start for free on OpenObserve Cloud

How to Monitor OpenAI API Costs and Token Usage with OpenTelemetry

Manas Sharma — Fri, 15 May 2026 13:02:47 +0000

TL;DR

Capture gen_ai.* semantic convention attributes on every OpenAI call: request model, input tokens, output tokens. Add feature, user_id, and team on every span so you can break down cost by who and what is spending.
Compute gen_ai.usage.cost_usd from a pricing table you control and emit it as both a span attribute (for per-request drill-down) and a histogram metric (for aggregation and alerting).
Alert on cost anomalies relative to your historical baseline, not just static budget thresholds. Retry loops and runaway agents show up as deviations before they ever cross a daily spend limit.

Why OpenAI bills are impossible to predict without instrumentation

Running an LLM app in production without instrumentation is a slow way to find out your margins are negative. Token consumption is non-obvious: a single user with a verbose system prompt and long chat history can cost 20x more per interaction than an average user. A bug in a retry loop can 10x your daily spend in an hour. A single new feature that adds RAG context to every call can double your input token count overnight.

The OpenAI dashboard tells you what you spent yesterday. It does not tell you which feature, which user, which prompt template, or which model variant drove the spend. By the time you notice a cost spike in your billing dashboard, you have already paid for it.

The fix is the same fix you use for any production system: emit structured telemetry at the point of the API call and make it queryable. OpenTelemetry gives you a vendor-neutral way to do this, and a growing set of GenAI-specific conventions means the fields you emit today will still be meaningful in two years.

Quick start: Jump to the Python setup or Node.js setup if you just need the code.

The three signals you actually need to track

For LLM cost monitoring, three signals carry almost all the value:

Token usage tells you how much capacity you consumed. Input tokens and output tokens, always separately, because they price differently.
Cost is the dollar-denominated derivative of token usage. You compute it at emit time using a pricing table you control.
Latency tells you how long users waited. For streaming endpoints, split this into time to first token and total duration.

Everything else (error rate, finish reason, response model) is useful context for these three. Start with the three and add context as you need it.

What OpenTelemetry's GenAI semantic conventions give you

OpenTelemetry has a dedicated set of semantic conventions for generative AI workloads, living under the gen_ai.* namespace. The point of conventions is that the same attribute names work across providers and observability backends, so your queries do not break when you swap from OpenAI to Anthropic or from one backend to another.

The attributes you will use most:

Attribute	What it holds
`gen_ai.provider.name`	Provider name: `openai`
`gen_ai.request.model`	Model requested by your code: `gpt-4o`, `gpt-4o-mini`
`gen_ai.response.model`	Model the provider actually used (can differ if provider routes)
`gen_ai.operation.name`	`chat`, `text_completion`, `embeddings`
`gen_ai.usage.input_tokens`	Prompt tokens consumed
`gen_ai.usage.output_tokens`	Completion tokens generated
`gen_ai.request.temperature`	Temperature parameter (useful when debugging determinism)
`gen_ai.request.max_tokens`	Max tokens parameter
`gen_ai.response.finish_reasons`	Why the model stopped: `stop`, `length`, `content_filter`

One attribute worth noting: gen_ai.system has been renamed to gen_ai.provider.name in the current OTel GenAI spec. Most instrumentation libraries still emit gen_ai.system today. Your backend should accept both until library adoption catches up.
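If you post-process spans yourself (in a pipeline step or an export script), accepting both names is a small fallback. A minimal sketch, with provider_name as a hypothetical helper and span attributes represented as a plain dict:

```python
def provider_name(span_attributes: dict) -> str:
    """Read the provider from a span, preferring the current attribute name."""
    return (
        span_attributes.get("gen_ai.provider.name")
        or span_attributes.get("gen_ai.system")
        or "unknown"
    )


new_style = provider_name({"gen_ai.provider.name": "openai"})
old_style = provider_name({"gen_ai.system": "anthropic"})
```

Both spans end up grouped under one provider field, so dashboards keep working while libraries migrate.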

Instrumenting a Python app with the official OTel OpenAI SDK

This guide uses opentelemetry-instrumentation-openai-v2, the official OTel package maintained in opentelemetry-python-contrib. It follows the GenAI semantic conventions closely and is the right choice for OpenAI instrumentation.

Install the three packages

pip install opentelemetry-distro
pip install opentelemetry-exporter-otlp
pip install opentelemetry-instrumentation-openai-v2

Then run the bootstrap command once to install auto-instrumentation for any other libraries in your app (Flask, FastAPI, requests, and so on):

opentelemetry-bootstrap --action=install

Set the OTLP endpoint for OpenObserve

Grab your OTLP HTTP endpoint and Authorization header from the OpenObserve UI under Data Sources -> Traces (OpenTelemetry) -> OTLP HTTP. Set these environment variables:

export OTEL_SERVICE_NAME=my-llm-app
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.openobserve.ai/api/<your-org>"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <your-auth-token>"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true

If you are self-hosting OpenObserve, the endpoint is typically http://localhost:5080/api/<your-org>.

Run with `opentelemetry-instrument`

Wrap your existing run command:

opentelemetry-instrument python app.py

No code changes to app.py. The OpenAI SDK is wrapped at import time, and every chat.completions.create call emits a span with the gen_ai.* attributes populated.

A minimal example app

# app.py
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize observability in one sentence."}],
)

print(resp.choices[0].message.content)
print("Input tokens:", resp.usage.prompt_tokens)
print("Output tokens:", resp.usage.completion_tokens)

Run it with opentelemetry-instrument python app.py and check the Traces tab in OpenObserve. You should see a span named chat gpt-4o-mini with the token counts attached.

Capturing message content (and the privacy tradeoff)

The instrumentation does not capture the prompt or completion text by default. To enable it:

export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true

This ships the full prompt and completion as log events. It is useful for debugging but has real privacy implications: you are now logging whatever your users typed, including anything they pasted in. If your app handles regulated data (health, finance, anything under GDPR or HIPAA), do not enable this globally. Enable it per-environment or per-feature flag, and scrub sensitive fields before the exporter sees them.
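What scrubbing looks like depends on your data, but the core is usually a set of pattern substitutions. The sketch below is illustrative only: the two patterns (email addresses and card-like digit runs) and the scrub function are assumptions, not a complete PII ruleset, and where you hook it in (application code, a log processor, or the Collector) depends on your setup.

```python
import re

# Illustrative patterns only; real PII coverage needs a much broader ruleset.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
LONG_DIGITS = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def scrub(text: str) -> str:
    """Replace email addresses and card-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


cleaned = scrub("Mail jane.doe@example.com about card 4111 1111 1111 1111")
```

Short numbers such as order IDs are left alone, because the digit pattern only matches runs of 13 digits or more.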

Instrumenting a Node.js app

For Node.js, the pattern is the same. Install the packages:

npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/instrumentation-openai \
  @opentelemetry/resources

Create a tracing.js bootstrap file:

// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OpenAIInstrumentation } = require('@opentelemetry/instrumentation-openai');
const { Resource } = require('@opentelemetry/resources');

const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': 'my-llm-app-node',
    'deployment.environment': process.env.NODE_ENV || 'development',
  }),
  // With no options, the exporter reads OTEL_EXPORTER_OTLP_ENDPOINT (appending
  // /v1/traces) and OTEL_EXPORTER_OTLP_HEADERS from the environment.
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [new OpenAIInstrumentation()],
});

sdk.start();

Then preload it when you run your app:

node --require ./tracing.js app.js

Same result: every OpenAI call produces a span in OpenObserve with the GenAI attributes populated.

Building a cost calculation layer

OpenAI's SDK gives you token counts. It does not give you dollars. You have to multiply tokens by a price, and that price changes. Build this as a small, updatable module.

Pricing table as code

Keep this in source control. Review it every quarter, or every time a provider announces a price change.

# pricing.py
# Prices in USD per 1 million tokens, as of April 2026.
# Verify against provider pricing pages before each release.

MODEL_PRICING = {
    "gpt-4o":      {"input": 2.50,  "output": 10.00},
    "gpt-4o-mini": {"input": 0.15,  "output": 0.60},
    "o1":          {"input": 15.00, "output": 60.00},
    "o1-mini":     {"input": 3.00,  "output": 12.00},
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for a single LLM call."""
    pricing = MODEL_PRICING.get(model)
    if not pricing:
        # Unknown model. Emit 0 and alert separately so you can add pricing.
        return 0.0
    input_cost = (input_tokens / 1_000_000) * pricing["input"]
    output_cost = (output_tokens / 1_000_000) * pricing["output"]
    return round(input_cost + output_cost, 6)
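One practical wrinkle: the model name the provider reports back is often a dated snapshot such as gpt-4o-mini-2024-07-18, which misses an exact-match lookup and falls into the unknown-model branch. A minimal sketch of one way to handle this, assuming dated names extend the base name with a hyphenated suffix; resolve_pricing_key and PRICING_KEYS are hypothetical names, not part of the module above.

```python
from typing import Optional

# Stand-in for the keys of MODEL_PRICING, trimmed for the example.
PRICING_KEYS = ["gpt-4o", "gpt-4o-mini"]


def resolve_pricing_key(model: str, known=PRICING_KEYS) -> Optional[str]:
    """Map a possibly dated model name to the longest matching pricing key."""
    if model in known:
        return model
    # "gpt-4o-mini-2024-07-18" starts with both "gpt-4o-" and "gpt-4o-mini-";
    # the longest key is the most specific match.
    candidates = [key for key in known if model.startswith(key + "-")]
    return max(candidates, key=len) if candidates else None


dated = resolve_pricing_key("gpt-4o-mini-2024-07-18")
unknown = resolve_pricing_key("some-new-model")
```

Prefix matching can also mis-price a differently priced variant that shares a prefix with a known model, so treat any non-exact match as something to log and review, not as ground truth.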

Emitting cost as a custom metric

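The official -v2 package does not emit cost, only tokens. Add cost yourself with a thin wrapper that runs after each call. The gen_ai.usage.cost_usd and gen_ai.latency.duration_ms names used here are custom, not part of the GenAI semantic conventions: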

# tracked_llm.py
import time
from opentelemetry import trace, metrics
from openai import OpenAI
from pricing import calculate_cost

tracer = trace.get_tracer("llm-cost")
meter = metrics.get_meter("llm-cost")

cost_histogram = meter.create_histogram(
    name="gen_ai.usage.cost_usd",
    description="Estimated cost of a single LLM call in USD",
    unit="USD",
)

client = OpenAI()


def tracked_chat(messages, model="gpt-4o-mini", feature="unknown", user_id="anon"):
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.provider.name", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("feature", feature)
        span.set_attribute("user_id", user_id)

        start = time.perf_counter()
        response = client.chat.completions.create(model=model, messages=messages)
        elapsed_ms = (time.perf_counter() - start) * 1000

        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        cost = calculate_cost(model, input_tokens, output_tokens)

        # Span attributes for per-request investigation
        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        span.set_attribute("gen_ai.usage.cost_usd", cost)
        span.set_attribute("gen_ai.latency.duration_ms", elapsed_ms)
        span.set_attribute("gen_ai.response.model", response.model)

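        # Metric for aggregation. user_id is high-cardinality: keep it on the
        # span and drop it here if the metric series count becomes a problem.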
        cost_histogram.record(cost, {
            "gen_ai.provider.name": "openai",
            "gen_ai.request.model": model,
            "feature": feature,
            "user_id": user_id,
        })

        return response

You now have cost on the span (for drill-down) and cost as a metric (for aggregation, alerting, and dashboards). Both are labeled with feature so you can break them down later.

Attributing cost to users, features, and teams

This is the section most readers came for. Raw token counts do not answer "who is spending our money." Attribution does.

Adding attributes on every span

Every LLM call should carry four attribution dimensions:

feature: which product path triggered the call (document_summary, chat_reply, rag_answer)
user_id: hashed user identifier for per-user rollups
team: which internal team or product area owns the feature
environment: prod, staging, dev

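Wire them through as keyword arguments on your wrapper. The tracked_chat wrapper above accepts feature and user_id; team and environment can be added the same way: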

result = tracked_chat(
    messages=[{"role": "user", "content": prompt}],
    model="gpt-4o",
    feature="document_summary",
    user_id=hashed_user_id,
)
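The call above assumes hashed_user_id already exists. One way to produce it, together with the full set of four dimensions, is sketched below. The function names, the 16-character truncation, and the use of a keyed HMAC are choices made for illustration; the secret shown is a placeholder.

```python
import hashlib
import hmac


def hash_user_id(raw_user_id: str, secret: bytes) -> str:
    """Keyed hash, so raw identifiers never reach the telemetry backend."""
    digest = hmac.new(secret, raw_user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability in dashboards


def attribution_attributes(feature: str, raw_user_id: str, team: str,
                           environment: str, secret: bytes) -> dict:
    """Build the four attribution dimensions as one attribute dict."""
    return {
        "feature": feature,
        "user_id": hash_user_id(raw_user_id, secret),
        "team": team,
        "environment": environment,
    }


# Placeholder secret; load the real one from your secret store.
attrs = attribution_attributes(
    "document_summary", "user-8841", "docs-platform", "prod", secret=b"rotate-me"
)
```

A keyed hash is used here instead of a bare SHA-256 so that someone with access to the traces cannot confirm a guessed user ID without the secret. The resulting dict can be applied to a span in one call and reused as metric attributes.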

Building the cost attribution dashboard

A complete LLM cost dashboard covers two concerns: spend attribution and token efficiency. Organize it across two tabs.

Tab 1: LLM Cost Overview

Four single-stat tiles at the top give you the headline numbers at a glance: Total LLM Cost ($), Total Input Tokens, Total Output Tokens, and Total LLM Calls. These are the first things you check when something looks off.

Below the tiles:

LLM Cost Over Time ($): bar chart over the selected time range. Reveals bursty spend patterns and days that are trending above baseline.
Cost by Model: pie chart, one slice per gen_ai.request.model. Shows your model mix and whether a cheaper model is handling the bulk of traffic.
Input vs Output Cost Over Time ($): grouped bar chart with two series, input_cost and output_cost. Output tokens cost 3-4x more than input tokens on most models; this panel tells you which side is driving cost growth.
Token Usage by Model: grouped bar chart of input_tokens and output_tokens per model. Cross-reference this with Cost by Model to spot models that are expensive relative to their token volume.
Token Usage Over Time: time series of token counts. Useful for capacity planning and catching prompt inflation.

Alerting on cost anomalies and rate-limit errors

Static budget thresholds are table stakes. The interesting failures are the ones that do not cross a static threshold until it is too late.

Threshold alerts vs anomaly detection

A threshold alert fires when daily spend exceeds $500. It works for the blunt cases. It misses three common failure modes:

A retry loop that 3x's a specific feature's token usage in an hour. The daily threshold may still be fine by end of day, but you paid 3x for that hour.
A prompt injection that triggers a long runaway completion on a single request, burning 100k output tokens in one call.
Seasonal growth that quietly pushes baseline from $300/day to $600/day over a month, outpacing capacity plans.

Anomaly detection catches all three by comparing current behavior to historical baseline rather than to a fixed number.

A daily budget threshold

Set this first. In OpenObserve, create an alert on the gen_ai.usage.cost_usd metric:

Trigger: SUM(gen_ai_usage_cost_usd) over 24h is greater than 500
Evaluation frequency: every 5 minutes
Action: Slack or PagerDuty, routed to the LLM-platform team

An anomaly-based alert for cost spikes

This is more valuable. Create an anomaly alert on gen_ai.usage.cost_usd grouped by feature, with a training window of the last 14 days and a sensitivity tuned to catch 3x deviations. A retry loop in the document_summary feature shows up in minutes, before it hits your daily threshold.

Alert on rate-limit errors (HTTP 429)

When OpenAI rate-limits you, downstream calls fail and retries pile up. Fire an alert when gen_ai.response.error.type = rate_limit_exceeded exceeds a low threshold (say, 5 in 5 minutes). This usually surfaces a runaway loop before a cost anomaly does.
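The "5 in 5 minutes" rule is a sliding-window count. A minimal sketch of that logic, assuming you feed it the timestamp of each 429 error; the class name and defaults are hypothetical, and in practice the alerting backend evaluates this for you.

```python
from collections import deque

class RateLimitAlert:
    """Fire when more than `threshold` 429 errors land inside `window_s` seconds."""

    def __init__(self, threshold=5, window_s=300):
        self.threshold = threshold
        self.window_s = window_s
        self._events = deque()  # timestamps of recent 429 errors

    def record(self, ts):
        """Record one 429 at time `ts` (seconds); return True if the alert should fire."""
        self._events.append(ts)
        # Drop errors that have aged out of the window.
        while self._events and self._events[0] <= ts - self.window_s:
            self._events.popleft()
        return len(self._events) > self.threshold

alert = RateLimitAlert()
fired = [alert.record(t) for t in (0, 10, 20, 30, 40, 50)]
print(fired[-1])  # the sixth error inside five minutes -> True
```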

Reconciling estimated cost with the OpenAI billing API

Your OTel-derived cost is an estimate. It is usually within a couple of percent, but it drifts from the real bill for three reasons:

Cached input tokens. Repeat prompts are billed at a discount. Your naive pricing math assumes full price.
Reasoning tokens. o1 and similar models emit internal reasoning tokens that count toward billing but may not appear in the standard usage object.
Batch API discounts. If you use the async batch endpoint, those requests are priced lower.

Reconcile monthly. Pull the OpenAI usage endpoint and compare total cost for the window against your OTel sum. If the drift is more than 5 percent, dig in and adjust your pricing table. This is the pattern production teams use: OTel for real-time signal, billing API for ground truth.
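The comparison itself is simple arithmetic. A small sketch, assuming you already have the two monthly totals as numbers; the function names and the sample figures are hypothetical.

```python
def cost_drift_pct(otel_cost_usd, billed_cost_usd):
    """Percentage by which the OTel estimate differs from the billed amount.

    Positive means the telemetry overestimated; negative means it underestimated.
    """
    if billed_cost_usd <= 0:
        raise ValueError("billed cost must be positive")
    return (otel_cost_usd - billed_cost_usd) / billed_cost_usd * 100

def needs_review(otel_cost_usd, billed_cost_usd, tolerance_pct=5.0):
    """True when the drift exceeds the tolerance in either direction."""
    return abs(cost_drift_pct(otel_cost_usd, billed_cost_usd)) > tolerance_pct

# Hypothetical month: telemetry estimated $1,260, the invoice says $1,180.
print(f"drift: {cost_drift_pct(1260.0, 1180.0):.1f}%")  # drift: 6.8%
print(needs_review(1260.0, 1180.0))                      # True
```

An overestimate like this one is what you would expect if cached input tokens or batch discounts are missing from the pricing table.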

Measuring time to first token for streaming

For chat UIs, users feel time to first token (TTFT), not total duration. If you use streaming responses, capture it:

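import time

# `tracer` (an OpenTelemetry tracer) and `client` (an OpenAI client) are
# assumed to have been created earlier in your setup code.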

def stream_with_ttft(messages, model="gpt-4o"):
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.provider.name", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.response.streaming", True)

        start = time.perf_counter()
        ttft_ms = None

        stream = client.chat.completions.create(
            model=model,
            messages=messages,
            stream=True,
        )

        chunks = []
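        for chunk in stream:
            # Some chunks (for example a trailing usage chunk) carry no choices,
            # so guard the index before reading the delta.
            has_text = bool(chunk.choices) and bool(chunk.choices[0].delta.content)
            if ttft_ms is None and has_text:
                # First chunk with visible text: record time to first token.
                ttft_ms = (time.perf_counter() - start) * 1000
                span.set_attribute("gen_ai.latency.ttft_ms", ttft_ms)
            chunks.append(chunk)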

        total_ms = (time.perf_counter() - start) * 1000
        span.set_attribute("gen_ai.latency.duration_ms", total_ms)
        return chunks

Now you can alert on TTFT regressions separately from total-duration regressions.

Production checklist

Before shipping this to prod:

✅ Retention policy set on your LLM telemetry stream
✅ PII scrubbing pipeline in place if capturing message content
✅ Sampling strategy decided (100% for LLM spans is usually fine)
✅ Pricing table in source control with quarterly review reminder
✅ Budget threshold alert and anomaly-based alert configured
✅ Monthly reconciliation against OpenAI billing API scheduled

Send your LLM telemetry to OpenObserve

OpenObserve is an open-source observability platform that accepts standard OTLP over HTTP and gRPC. There is no proprietary SDK to adopt and no special instrumentation to learn. Point your OTLP exporter at OpenObserve Cloud or a self-hosted instance, and your LLM spans, logs, and metrics land in the same place as your infrastructure telemetry.

If you want to see this working end to end, spin up a free account at OpenObserve Cloud or check out the LLM Observability overview.

I Built a Dashboard in 30 Seconds with AI

Manas Sharma — Thu, 14 May 2026 10:17:29 +0000

The Problem

It's 2 AM. An alert fires. Cart service is throwing errors. You've got five minutes before someone escalates.

The runbook says: "Check the dashboard. Look at the logs." But which dashboard? What query? You're half-asleep, the alert description tells you nothing useful, and now you're supposed to write SQL from scratch while someone in Slack asks "any update?"

Most of us have been there. And most runbooks were written by someone who never had to use them under pressure.

What if you could just type: "cart is throwing errors. find the root cause." and get a real answer?

That's what I tested with the new AI Assistant in OpenObserve. Here's what happened.

It's Not Anomaly Detection. It's Something Simpler.

Most AI + observability discussions jump straight to anomaly detection or ML-powered forecasting. Those are interesting. But the thing that's actually changing how I work right now is simpler: an assistant embedded in the platform that lets me ask questions in plain English and get answers from my own production data.

No SQL. No PromQL. Just describe what you want.

I ran four real scenarios against live data from an otel-demo microservices app and a Kubernetes cluster. Here's how each one went.

1. The Dashboard Request That Normally Kills Your Afternoon

Someone from the business team asks for a dashboard. They don't know SQL. They don't know PromQL. They just want to see what's happening with nginx — request rate, how fast it's responding, how many errors.

Normally this kills thirty minutes: finding the right log stream, writing queries, dragging panels, tweaking units.

Instead, I typed:

create a dashboard for my nginx logs showing request rate, latency percentiles, and 4xx vs 5xx errors.

Thirty seconds later I had a production-ready dashboard. It picked the right log stream. It listed the relevant fields. It wrote the SQL queries. It chose appropriate visualizations — line chart for request rate, heatmap for latency distribution, stacked bar for status codes. These were real queries against actual data. Not a template.

Here's what stuck with me: the person who asked for this could have done it themselves. They don't need to know what a PromQL query looks like. They just describe what they want to see.

2. Same Thing, Different Domain: Infrastructure

Application logs worked. But what about infrastructure?

build a K8s host metrics dashboard showing CPU, memory, disk per node.

Completely different data source — Kubernetes metrics, not nginx logs. Same experience. The assistant figured out where the data lived, what metrics to pull, and how to visualize them.

What impressed me was the panel design. Usage per node and cumulative across the cluster. Separate tabs for CPU, memory, and disk. It understood that "CPU per node" implies a time series grouped by host, not a single aggregate gauge. That's the kind of design decision a human SRE makes after looking at the data — and the assistant just did it.

The assistant had enough context about the infrastructure to know what clusters were running and what hosts were connected. I didn't explain my setup. It already knew.

3. Proactive: Don't Wait Until Something Breaks

Dashboards are great, but nobody wants to stare at them all day. I wanted to see if I could use the assistant proactively — scan everything, find problems before they escalate.

what's the health of the otel-demo right now? if anything is red, create an alert.

This isn't asking for one dashboard or one service. It's saying: scan all services, tell me how we're doing, and if something looks off, lock in an alert so I'm covered.

It checked error rates and latencies across every service. Found the ones running green, identified the ones that weren't. And for anything red — it created an alert. Right there. No configuration. No navigating to the alerts page.

This is the kind of thing most teams only set up after an incident, during the postmortem, when someone says "we should have caught this earlier." One sentence and you're covered before the page goes off.

4. Something's Actually Broken: Root Cause Analysis

Now the real test. The cart service in the otel-demo app is throwing errors. Not a synthetic scenario — a real incident.

otel-demo app cart is throwing errors. find the root cause.

What happened next is worth breaking down step by step:

It searched across both logs and traces — not one or the other, both at once
It looked for errors in the last six hours and found none
It automatically widened the search window — I didn't tell it to do that
It identified the pattern: cart service failing on database writes under load
It showed me the exact traces, the error distribution over time, and the specific downstream call that was failing

Every step was visible. I could expand any tool call, see the exact query it ran, and verify the result. It's not a black box. It shows its work — and if I disagreed with where it was going, I could redirect it.

Once I had the root cause, I stayed in the same conversation:

alert me if cart error rate crosses 10 errors in 5 minutes.

Same context. Same conversation. Investigation to prevention in two sentences.

That last part is what I keep coming back to. The assistant doesn't just help you find problems — it helps you lock in the fix so you don't get paged for the same thing at 3 AM next week.

Beyond the UI: Take It to Your IDE

Here's the part that changes the workflow entirely. You don't have to be inside the OpenObserve UI to get this.

OpenObserve exposes all of this through an MCP server. Connect your AI coding assistant (Claude Code, Cursor, whatever you use) directly to your production observability data. One command:

claude mcp add o2 https://api.openobserve.ai/api/default/mcp \
  -t http \
  --header "Authorization: Basic <YOUR_TOKEN>"

That's it. Under five minutes. Now your IDE can query production logs, metrics, and traces. Debug a deploy from your terminal. Pull up a trace without leaving your editor. Check error rates during a code review.

The assistant follows you wherever you work — not just inside the observability platform.

What This Actually Changes

There's been a lot of noise about AI in observability. Most of it falls into two camps:

Anomaly detection — useful in theory, unpredictable in practice, hard to trust
AI replaces on-call — not happening, and most engineers don't want it to

The thing that's working right now is neither of those. It's reducing the friction between "something is wrong" and "here's what I know."

Not replacing your judgment. Not replacing your experience. Just removing the parts of incident response that feel like operating a query builder with one eye open at 2 AM.

From "I need to see what's happening" to "I know what happened and we're covered next time" — in one conversation.

Resources

Have you tried connecting AI assistants to your observability stack? What's working? What's still painful? Drop a comment — I'm genuinely curious what others are seeing.

Monitoring Java Microservices with OpenTelemetry and OpenObserve

Manas Sharma — Fri, 10 Apr 2026 12:14:39 +0000

Monitoring microservices is hard.

When a user request fans out across multiple services, each with its own database, logs, and failure modes, traditional monitoring tools often give you a fragmented picture. You can tell something is slow, but not exactly where or why.

Distributed tracing solves this.

In this tutorial, we'll implement distributed tracing for a Java Spring Boot microservices application using two open-source tools: OpenTelemetry and OpenObserve.

If your stack includes other languages, check out these guides too:

What you'll build

By the end of this guide, you'll have:

A working Spring Boot microservices setup with cross-service HTTP calls
Zero-code instrumentation using the OpenTelemetry Java Agent
End-to-end traces in OpenObserve with flamegraph and Gantt chart views

What is distributed tracing?

In microservices, one user action can trigger a chain of calls across many services. If a request takes 3 seconds, tracing helps answer:

Which service caused the delay?
Which operation failed?
Where exactly time was spent?

Distributed tracing works by attaching context (trace_id, span_id) at request entry and propagating it across service boundaries (usually with traceparent headers). This gives you one complete request journey.
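To make the header concrete, here is a small sketch of the W3C traceparent format (`version-trace_id-parent_id-flags`). The Java agent used later in this guide does all of this for you; the helper names here are hypothetical and exist only to show what gets propagated.

```python
import re
import secrets

# version (2 hex) - trace_id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}

def child_traceparent(incoming):
    """Header a service would send downstream: same trace_id, new span id."""
    ctx = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(parse_traceparent(outgoing)["trace_id"])  # unchanged across the hop
```

The trace_id stays constant across every hop while each service contributes its own span id, which is what lets the backend stitch the spans into one request journey.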

A trace is made up of spans. Each span records:

Service + operation
Start time + duration
HTTP details (method, URL, status)
DB query metadata
Errors/exceptions
Parent-child relationships

For deeper fundamentals: Distributed Tracing Basics to Beyond

Why OpenTelemetry + OpenObserve?

OpenTelemetry

OpenTelemetry is a CNCF standard for traces, metrics, and logs.

For Java, the OpenTelemetry Java Agent can auto-instrument Spring Boot, JDBC, and HTTP clients with no code changes.

OpenObserve

OpenObserve is an open-source backend for logs, metrics, and traces.

OTLP-native ingest
SQL-powered analytics
Unified observability in one interface
Lightweight and storage-efficient

Architecture used in this tutorial

We'll run four services:

Service	Port	Responsibility
`discovery-service`	8761	Eureka registry
`user-service`	8081	User CRUD (MySQL)
`order-service`	8082	Order management; calls `user-service`
`payment-service`	8083	Payment processing; calls `order-service`

The key trace path is:

payment-service -> order-service -> user-service -> MySQL

Prerequisites

Java 17+
Maven 3.8+
Docker + Docker Compose
MySQL 8 (or use Dockerized MySQL from compose)

Step 1: Clone the project

git clone https://github.com/openobserve/java-distributed-tracing.git
cd java-distributed-tracing

Step 2: Start OpenObserve and MySQL

docker-compose up -d

This starts:

OpenObserve: http://localhost:5080
MySQL: localhost:3306 (tracingdb)

OpenObserve login:

Email: admin@example.com
Password: Admin123!

Step 3: Download OpenTelemetry Java Agent

mkdir agents
curl -L https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar \
  -o agents/opentelemetry-javaagent.jar

Step 4: Configure agent export to OpenObserve

Example from user-service/scripts/start.sh:

export OTEL_SERVICE_NAME=user-service
export OTEL_RESOURCE_ATTRIBUTES=service.name=user-service,deployment.environment=dev
export OTEL_TRACES_EXPORTER=otlp
export OTEL_METRICS_EXPORTER=none
export OTEL_LOGS_EXPORTER=none
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:5080/api/default/traces
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_TRACES_HEADERS="Authorization=Basic {token}"

java \
  -Xms256m \
  -Xmx512m \
  -javaagent:../agents/opentelemetry-javaagent.jar \
  -jar target/user-service-0.0.1-SNAPSHOT.jar

Get {token} from OpenObserve UI:
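If the UI is not handy, the value can usually be derived by hand, because HTTP Basic auth is the base64 encoding of `user:password`. A minimal sketch, assuming the token OpenObserve shows is built from the login email and password from Step 2; the helper name is hypothetical, and the UI remains the authoritative source.

```python
import base64

def basic_auth_token(email, password):
    """Base64-encode "email:password", as HTTP Basic auth expects."""
    raw = f"{email}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")

# Credentials from Step 2 of this tutorial (local development only).
token = basic_auth_token("admin@example.com", "Admin123!")
print(f"Authorization=Basic {token}")
```

Paste the printed value into OTEL_EXPORTER_OTLP_TRACES_HEADERS in place of the {token} placeholder.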

Step 5: Start discovery-service

cd discovery-service
mvn clean install -Dmaven.test.skip
sh scripts/start.sh

Open: http://localhost:8761

Step 6: Start user/order/payment services

Run each in a separate terminal.

cd user-service
mvn clean install -Dmaven.test.skip
sh scripts/start.sh

cd order-service
mvn clean install -Dmaven.test.skip
sh scripts/start.sh

cd payment-service
mvn clean install -Dmaven.test.skip
sh scripts/start.sh

Verify registration in Eureka:

Step 7: Generate traces

1) Create user

curl -X POST http://localhost:8081/api/users \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Priya Sharma",
    "email": "priya@example.com",
    "phone": "+91-9876543210"
  }'

2) Create order

curl -X POST http://localhost:8082/api/orders \
  -H "Content-Type: application/json" \
  -d '{
    "userId": 1,
    "productName": "Mechanical Keyboard",
    "quantity": 1,
    "totalAmount": 4999.00
  }'

3) Process payment (full distributed trace)

curl -X POST http://localhost:8083/api/payments/process \
  -H "Content-Type: application/json" \
  -d '{
    "userId": 1,
    "orderId": 1,
    "amount": 4999.00,
    "currency": "INR",
    "paymentMethod": "UPI"
  }'

4) Trigger an error trace

curl -X POST http://localhost:8082/api/orders \
  -H "Content-Type: application/json" \
  -d '{
    "userId": 9999,
    "productName": "Test Product",
    "quantity": 1,
    "totalAmount": 100.00
  }'

Expected: 400 Bad Request

Visualize in OpenObserve

Go to http://localhost:5080 -> Traces

Trace Explorer

You'll see:

Trace ID
Root span
Service
Duration
Span count
Status

Filter examples

service_name = payment-service
status = ERROR
Duration range
operation_name for specific endpoints

Flamegraph + Gantt chart

Click a POST /api/payments/process trace.

Flamegraph: nested span timing hierarchy
Gantt: timeline-aligned span bars

Query traces with SQL

OpenObserve supports SQL over trace data.

Slowest payment traces

SELECT trace_id, duration, service_name, operation_name
FROM "default"
WHERE service_name = 'payment-service'
  AND operation_name LIKE '%payments/process%'
ORDER BY duration DESC
LIMIT 10;

Error count by service

SELECT service_name, COUNT(*) as error_count
FROM "default"
WHERE span_status = 'ERROR'
GROUP BY service_name
ORDER BY error_count DESC;

Avg/max latency by service

SELECT service_name,
       AVG(duration) as avg_duration_us,
       MAX(duration) as max_duration_us,
       COUNT(*) as request_count
FROM "default"
GROUP BY service_name;

What the Java agent captured automatically

Without adding tracing code, the OpenTelemetry Java Agent instrumented:

Spring Web incoming HTTP requests
RestTemplate outbound calls (traceparent injected)
JDBC/MySQL queries
Context propagation across service boundaries

See supported libraries: OpenTelemetry Java Instrumentation

Final takeaway

You now have end-to-end distributed tracing for a Java microservices app with:

Zero-code instrumentation
Full request path visibility
Visual root-cause analysis (flamegraph/Gantt)
SQL-based troubleshooting in OpenObserve
A path to production scaling without vendor lock-in

Top Log Visualization Tools in 2026: Dashboards, Search & AI-Assisted Analysis

Manas Sharma — Tue, 17 Mar 2026 08:44:41 +0000

Quick answer: The best log visualization tools in 2026 are OpenObserve, Kibana (Elastic Stack), Grafana + Loki, Datadog Logs, and Splunk. OpenObserve stands out by combining traditional dashboards with a built-in AI assistant (O2 Assistant) that lets you query, correlate, and visualize logs in plain English.

What Separates Great Log Visualization from Basic Log Search?

Most log tools can search. The best ones let you understand.

In 2026, the gap has widened between tools that simply dump raw text and those that provide a fast path from alert → root cause → fix. The features that define the leaders today include:

Saved Views & Search Templates – Reuse complex filters without starting from scratch.
Dashboard Templating – Parameterized views that scale across services and environments.
Anomaly Detection – Surfacing "unknown unknowns" without manual thresholds.
Deep Drill-Down – Moving from a high-level spike to specific log lines in one click.
AI-Assisted Analysis – Using natural language to generate complex queries.

The Best Log Visualization Tools in 2026

Tool	AI-Assisted Analysis	Open Source	Deployment	Best For
OpenObserve	O2 Assistant + MCP	✅	Self-hosted / Cloud	Full-stack observability with AI
Kibana (Elastic)	Partial (ML add-on)	✅	Self-hosted / Cloud	Full-text search, complex pipelines
Grafana + Loki	Partial (plugin)	✅	Self-hosted / Cloud	Prometheus-native teams
Datadog Logs	Watchdog AI	❌	SaaS	Managed, all-in-one observability
Splunk	Splunk AI	❌	Self-hosted / Cloud	Enterprise SIEM & security

1. OpenObserve — Best for AI-Assisted Log Visualization

OpenObserve is the only tool where AI-assisted analysis is native, not bolted on. Its O2 Assistant is a full observability co-pilot that understands your schema, queries, and infrastructure topology.

What makes O2 Assistant different?

Traditional visualization requires you to know what to look for. With O2 Assistant, the workflow inverts: You describe the problem; the tool finds the evidence.

"Show me error rate spikes in the payment service over the last 6 hours, correlated with any upstream database latency."

Key Capabilities

Natural Language to Query: Translates English into SQL, PromQL, or VRL scripts.
Cross-Telemetry Correlation: Query logs, metrics, and traces in the same conversation thread.
AI-Generated Dashboards: Use the MCP (Model Context Protocol) server to build entire dashboards from a single prompt.
Ad-hoc Investigation: Perfect for "2 AM incidents" where you don't have a pre-built dashboard ready.

Works with Your Existing Stack

OpenObserve supports Fluent Bit, Vector, Logstash, Filebeat, and OpenTelemetry. You can repoint your existing shippers and be up and running in minutes. It also features a built-in visual pipeline editor with over 100 VRL functions for real-time parsing and redaction.

2. Kibana (Elastic Stack) — Best for Full-Text Search

Kibana remains the gold standard for inverted-index search. Its Lens visualization engine and Discover view are incredibly mature.

Strengths: High customizability, mature drag-and-drop editors, and powerful ML-driven anomaly detection.
Weaknesses: High resource consumption (RAM-hungry) and a steeper learning curve for KQL (Kibana Query Language) compared to natural language interfaces.

3. Grafana + Loki — Best for Prometheus-Native Teams

For teams already deep in the Prometheus ecosystem, Grafana + Loki is the natural choice. It uses the same label model and UI you already know.

Strengths: Unified dashboards for metrics, logs, and traces; excellent Kubernetes integration.
Weaknesses: Loki only indexes labels, making full-text search over unstructured logs slower and more expensive than indexed alternatives.

4. Datadog Logs — Best Managed Option

Datadog offers the most polished "zero-ops" experience. Its Watchdog AI surfaces anomalies automatically, and the integration between logs and distributed traces is seamless.

Tradeoff: Cost. As log volume grows, Datadog’s pricing often forces teams to sample or redact data aggressively to stay within budget.

5. Splunk — Best for Enterprise Security

Splunk is the powerhouse of the SIEM world. If your log visualization needs are tied to forensic investigation and strict compliance, Splunk’s SPL (Search Processing Language) is unmatched. For standard app observability, however, it is often considered overengineered.

The Shift: From Dashboards to Conversations

The old way of observing involved building dashboards for "known" failure modes. But modern, distributed systems fail in "unknown" ways.

AI-assisted log analysis changes the game by allowing exploratory investigation. When you can generate a correlated view across logs and metrics via a chat interface, the "Time to Resolution" (TTR) drops significantly. This is why OpenObserve’s native AI integration represents a fundamental shift in how we handle incidents in 2026.

FAQ

What is the lowest-cost log tool?
OpenObserve typically offers the lowest storage costs (up to 140x lower than ELK) due to its S3-native architecture.

Does OpenObserve work with OpenTelemetry?
Yes, it is OTLP-native and supports logs, metrics, and traces via OpenTelemetry collectors.

Can I create dashboards using AI?
Yes. Using OpenObserve's AI assistant, you can generate complete dashboard panels from a simple text prompt.

Get Started

OpenObserve Cloud — 14-day free trial, no credit card required.
Self-hosted — Run it as a single binary or via Helm charts in under 10 minutes.

Jaeger for Distributed Tracing: A Complete Guide with OpenObserve Comparison

Manas Sharma — Fri, 13 Feb 2026 15:12:29 +0000

As software systems evolve, they become increasingly complex, especially with the rise of microservices and distributed architectures. Keeping track of what's happening across different services can quickly become a daunting task. Tracing tools like Jaeger have emerged as essential solutions for debugging and monitoring distributed applications, helping developers understand and optimise their systems.

In this blog, we will cover:

The Pillars of Observability
Background on Distributed Tracing
What Is Jaeger?
How Jaeger Works: Key Concepts and Components
How Jaeger Collects and Visualizes Traces
Getting Started with Jaeger
Getting Started with OpenObserve
Jaeger vs. OpenObserve
Conclusion
Real-World Case Study: Jidu's Journey to 100% Tracing Fidelity

Prerequisites:

A running Docker instance with admin access.
An OpenObserve instance or cloud account ready to receive traces.

The Pillars of Observability

To truly understand Jaeger, it's vital to grasp the concept of observability. Observability allows us to infer the internal states of systems through their outputs, and it primarily revolves around three pillars:

Logging: Capturing individual events or errors.
Metrics: Quantifying system performance and resource usage.
Tracing: Visualizing request paths and measuring latency across services.

While logging and metrics provide critical insights, distributed tracing complements them by offering context on how different services interact and depend on one another.

Background on Distributed Tracing

Before we dive into Jaeger, it's essential to understand the concept of distributed tracing and why it's crucial in microservices environments.

What is Distributed Tracing?

Distributed tracing is a method for tracking and analyzing requests as they pass through the services of a distributed system. It helps you visualize the journey of a request, from the initial entry point all the way to the final response.

E.g. Service A → Service B → Service C → Service D

Why is Distributed Tracing Important?

In monolithic applications, tracing and debugging are straightforward. However, modern applications often depend on multiple microservices communicating over networks, complicating the identification of delays or failures.

Logging alone can't capture complex dependencies or detect bottlenecks. Distributed tracing tools like Jaeger provide end-to-end visibility of requests, capturing metadata at each step, which helps developers:

Trace requests across services
Visualise service dependencies and interactions
Identify performance bottlenecks
Quickly troubleshoot issues by pinpointing problematic services

What Is Jaeger?

Jaeger is an open-source, end-to-end distributed tracing tool originally developed by Uber Technologies. Now part of the CNCF (Cloud Native Computing Foundation), Jaeger allows developers to trace requests as they propagate through distributed systems, providing insights into service behavior and performance bottlenecks.

With Jaeger, you can:

Track request latency and identify services contributing to slow response times
Monitor errors and investigate the root cause of failures across services
Visualise dependency graphs for services to understand relationships and interactions
Optimise performance by identifying and removing bottlenecks

Jaeger is widely adopted due to its powerful tracing capabilities, ease of use, and integration with other monitoring tools in the observability stack.

How Jaeger Works: Key Concepts and Components

Jaeger traces requests as they travel through various services in a distributed system. It captures information about each service's interaction, which helps in pinpointing issues. Let's break down the primary components of Jaeger to understand its functioning:

Spans and Traces:

Span: A span represents a single unit of work within a trace, capturing details like start time, duration, and any metadata or tags. Each span represents a single service call or action in the overall trace.
Trace: A trace represents the entire journey of a request across multiple spans. For instance, when a user makes a request to an application, a trace records the entire sequence, from the front end to each microservice involved.

This screenshot is from the HOT Commerce project by OpenObserve, which demonstrates tracing across microservices. For more details, visit the project on GitHub here.

Trace Analysis:

In the image above, each line represents a span—a single operation within the overall trace, showing the journey of a request across services:

Trace: The set of spans forms the trace, covering services like frontend, shop, product, review, and price.
Longest Span: The frontend service takes the longest time at 2.53 seconds.
Shortest Span: The request handler completes in just 27.00 microseconds (µs).
Total Spans: There are 15 spans, each representing a unit of work, such as middleware processing, database calls, and service interactions.

This breakdown shows how the request interacts with multiple services and highlights areas for potential optimization.
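The same summary can be computed from raw span data. A short sketch, assuming each span is reduced to a service name, an operation name, and a duration in microseconds; the `Span` class and the sample values are hypothetical and only loosely modeled on the trace described above.

```python
from dataclasses import dataclass

@dataclass
class Span:
    service: str
    operation: str
    duration_us: float

def summarize(spans):
    """Report span count, services involved, and the longest and shortest spans."""
    longest = max(spans, key=lambda s: s.duration_us)
    shortest = min(spans, key=lambda s: s.duration_us)
    return {
        "total_spans": len(spans),
        "services": sorted({s.service for s in spans}),
        "longest": (longest.service, longest.duration_us),
        "shortest": (shortest.service, shortest.duration_us),
    }

# Hypothetical spans: 2.53 s at the frontend, 27 µs in a request handler.
spans = [
    Span("frontend", "GET /", 2_530_000),
    Span("shop", "GET /shop", 1_200_000),
    Span("product", "request handler", 27),
]
print(summarize(spans))
```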

Jaeger Client:

Jaeger clients are libraries that you embed in your application code to instrument services and collect tracing data. These clients generate spans and traces, sending them to a collector for storage and analysis.
Instead of the Jaeger-specific clients, which the Jaeger project has deprecated in favor of OpenTelemetry, you can use OpenTelemetry (OTel) SDKs for instrumentation. OpenTelemetry is a vendor-neutral observability framework that works with multiple tracing backends, including Jaeger, so you keep the flexibility to switch to or integrate with other observability tools.

Agent:

The Jaeger agent is a lightweight daemon running alongside the application. It receives traces emitted by the client and batches them for efficient transmission to the collector.
The OpenTelemetry Collector can be used in place of the Jaeger Agent. The OTel Collector not only receives, processes, and exports tracing data but also handles metrics and logs. It can send data to multiple observability backends, making it a flexible choice for distributed tracing setups.

Collector:

The Jaeger collector receives traces from agents and stores them in a backend. It also performs any preprocessing or filtering needed for the traces before they are stored.
In OpenTelemetry-based setups, the OTel Collector can handle this role as well, offering additional features like data transformation and routing, which make it ideal for complex or multi-backend environments.

Query Service and UI:

Jaeger provides a UI for querying and visualising traces. Through this UI, developers can search for traces, identify latency bottlenecks, and visualise service dependencies and call hierarchies.

Storage Backend:

Jaeger supports various storage backends like Cassandra, Elasticsearch, or even local files for persistence. This allows you to store traces for later analysis and comparisons.

How Jaeger Collects and Visualizes Traces

When a user request enters a service, the Jaeger client library starts a trace, generating a unique trace ID for that request. As the request flows through different services, the trace ID propagates along, with each service generating a span representing its part of the work. These spans are sent to the Jaeger agent and ultimately stored in the backend.

The Jaeger UI allows you to visualise traces in a timeline view, making it easier to observe the sequence of events and locate bottlenecks. The UI also provides a service dependency graph that shows the relationships between services, allowing you to monitor dependencies and the overall health of your system.

Getting Started with Jaeger

Here's a quick guide to setting up Jaeger in your environment. We'll use Docker to deploy Jaeger and assume you have Docker installed.
For a complete setup guide, refer to the Jaeger Getting Started Documentation.

Step 1: Deploy Jaeger with Docker

Jaeger offers an all-in-one image for testing and development purposes. To start the Jaeger all-in-one container, run the following command:

docker run --rm --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 14250:14250 \
  -p 14268:14268 \
  -p 14269:14269 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.62.0

The above command runs the Jaeger all-in-one Docker container, which is useful for testing and development. It exposes the following ports:

6831/udp & 6832/udp: Agent ports that receive trace data from Jaeger clients over UDP.
5778: Agent configuration HTTP endpoint.
16686: Jaeger Query UI for viewing and searching traces.
4317: OpenTelemetry gRPC endpoint for tracing data.
4318: OpenTelemetry HTTP endpoint for tracing data.
14250: gRPC endpoint for the Jaeger collector.
14268: HTTP endpoint for the collector to receive traces.
14269: Health check endpoint for the collector.
9411: Zipkin-compatible endpoint for receiving data.

Note: This setup uses memory as the default backend storage, which is intended for short-term use and is not recommended for production due to the lack of persistence.

You can access the Jaeger UI at http://localhost:16686 to visualise and interact with the collected traces.

Step 2: Instrument the HotROD Sample Application

Next, we'll instrument the HotROD sample application to work with Jaeger for distributed tracing.

What is HotROD?

HotROD is a microservices application simulating a ride-hailing service, similar to Uber or Lyft. It consists of multiple services, such as ride management and driver management, making it an ideal example for demonstrating distributed tracing in a microservices architecture.

To run the HotROD application alongside Jaeger, use the following Docker command:

docker run --rm -it --link jaeger \
  -p8080-8083:8080-8083 \
  -e OTEL_EXPORTER_OTLP_ENDPOINT="http://jaeger:4318" \
  jaegertracing/example-hotrod:1.62.0 \
  all --otel-exporter=otlp

The above command will run the HotROD sample application in a Docker container, linking it to the Jaeger container. It will expose ports 8080 to 8083 on the host for accessing the HotROD services. The application is configured to send tracing data to Jaeger via the OpenTelemetry Protocol (OTLP) at the specified endpoint.

You can access the HotROD UI at http://localhost:8080

Step 3: View Traces in Jaeger UI

Once your application is instrumented, run a few requests to generate some traces.

Then, navigate to http://localhost:16686, where you can query traces, visualise the flow of requests, and see latency and dependency data.
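If you prefer to check for traces from a script, the sketch below builds a search URL against the JSON endpoint that the Jaeger UI itself calls. Treat it as an assumption-laden example: `/api/traces` is an internal API that Jaeger does not promise to keep stable, and `frontend` is assumed to be the service name HotROD reports. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def trace_search_url(base_url: str, service: str, limit: int = 20) -> str:
    """Build a search URL for the JSON endpoint used by the Jaeger UI."""
    query = urlencode({"service": service, "limit": limit})
    return f"{base_url.rstrip('/')}/api/traces?{query}"


def fetch_traces(url: str, timeout: float = 5.0) -> list:
    """Fetch matching traces; requires a running Jaeger instance."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response).get("data", [])


url = trace_search_url("http://localhost:16686", "frontend", limit=5)
print(url)

# Uncomment once Jaeger and HotROD are running locally:
# for trace in fetch_traces(url):
#     print(trace["traceID"], len(trace["spans"]), "spans")
```

The network call is left commented out so the snippet runs without a live Jaeger instance.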

Getting Started with OpenObserve

Now let's walk through setting up OpenObserve, again using Docker for deployment.
For a detailed setup guide, you can refer to the OpenObserve Quickstart Documentation.

Step 1: Deploy OpenObserve with Docker

OpenObserve provides a Docker image for easy deployment. To start using OpenObserve, run the following command:

docker run \
    --name openobserve \
    -v $PWD/data:/data \
    -e ZO_DATA_DIR="/data" \
    -p 5080:5080 \
    -e ZO_ROOT_USER_EMAIL="root@example.com" \
    -e ZO_ROOT_USER_PASSWORD="Complexpass#123" \
    public.ecr.aws/zinclabs/openobserve:latest

The command will start an OpenObserve Docker container named openobserve, with the following configurations:

Persistent Storage: Maps the local directory $PWD/data to the container's /data directory.
Authentication: Sets the root user email and password for the OpenObserve interface.
Port Exposure: Exposes port 5080 for external access to the OpenObserve web application.

You can access the OpenObserve UI at http://localhost:5080 to visualise and interact with your observability data.

User email: root@example.com
Password: Complexpass#123

Step 2: Instrument the HotROD Sample Application

Run the following command to configure the HotROD sample app to send tracing data to OpenObserve (O2). Replace placeholders with the correct values from your OpenObserve setup.

docker run \
  --rm \
  --link <O2_CONTAINER_NAME> \
  --env OTEL_EXPORTER_OTLP_ENDPOINT=<O2_ENDPOINT> \
  --env OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <BASE64_ENCODED_CREDENTIALS>" \
  -p 8080-8083:8080-8083 \
  jaegertracing/example-hotrod:latest \
  all

This command does the following:

Runs the HotROD application in a Docker container and links it to your OpenObserve container.
Sets the environment variable for the OpenTelemetry exporter endpoint to send tracing data to OpenObserve.
Configures the necessary headers for authentication.
Maps ports 8080 to 8083 for accessing the HotROD services externally.

By running this command, you'll be able to generate trace data from the HotROD application and send it to OpenObserve for visualisation and analysis.

You can find the HTTP endpoint and authorization details in the Data Sources section, under Traces (OpenTelemetry).
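If you would rather compute the header value yourself, the sketch below shows how a Basic auth value is typically derived: base64 of `email:secret`. The function name and the sample secret are hypothetical. Which secret OpenObserve expects (password or ingestion token) depends on your setup, so use the value shown in your own instance.

```python
import base64


def otlp_basic_auth_header(user_email: str, secret: str) -> str:
    """Return a value for OTEL_EXPORTER_OTLP_HEADERS using HTTP Basic auth."""
    raw = f"{user_email}:{secret}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Authorization=Basic {token}"


# Hypothetical credentials; substitute the values from your own instance.
header = otlp_basic_auth_header("root@example.com", "example-secret")
print(header)

# Decode the token again to confirm it round-trips to "email:secret".
encoded = header.split("Basic ", 1)[1]
print(base64.b64decode(encoded).decode("utf-8"))
```

Keep in mind that base64 is an encoding, not encryption: anyone who sees the header can recover the credentials, so avoid committing it to shell history or version control.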

This is how the command looks after replacing required fields:

docker run \
  --rm \
  --link openobserve \
  --env OTEL_EXPORTER_OTLP_ENDPOINT=http://13.232.45.32:5080/api/default \
  --env OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic cm9vdEBleGFtcGxlLmNvbTpTMzVHMjhaMEkxVEdxYm9q" \
  -p 8080-8083:8080-8083 \
  jaegertracing/example-hotrod:latest \
  all

Replace the placeholders (<O2_CONTAINER_NAME>, <O2_ENDPOINT>, and <BASE64_ENCODED_CREDENTIALS>) with your specific values.

You can access the HotROD UI at http://localhost:8080. Once your application is instrumented, run a few requests to generate some traces.

Step 3: View Traces in OpenObserve UI

Once your application is instrumented, generate some telemetry data by making requests to your services. You can then explore the data in the OpenObserve UI at http://localhost:5080.

Jaeger vs. OpenObserve

Challenge	Jaeger	OpenObserve (O2)
Scalability	Struggles with high traffic	Built for high scalability and performance
Unified Platform	Separate tools for logs and metrics	Combines metrics, logs, and traces into one platform
Querying	Basic querying options	Advanced querying capabilities for deeper insights
Cost Management	Higher storage and processing costs	Optimized for lower resource usage
User Experience	Traditional, complex interfaces	Modern, intuitive interface for easy navigation and analysis

Conclusion

Jaeger is an excellent tool for getting started with distributed tracing and is widely adopted for microservices observability. However, as systems grow, Jaeger's limitations in data handling and cross-function observability (metrics, logs, and traces) may become restrictive.

OpenObserve addresses these limitations by unifying metrics, logs, and traces in a single platform, making it a more comprehensive observability solution. With its scalability, enhanced query capabilities, and cost-effectiveness, OpenObserve empowers teams to monitor, troubleshoot, and optimise complex distributed systems more efficiently.

Real-World Case Study: Jidu's Journey to 100% Tracing Fidelity

To see OpenObserve's impact in action, read about Jidu's journey to achieving 100% tracing fidelity using OpenObserve. With Jaeger on an Elasticsearch backend, Jidu could ingest only 10% of the traces its application generated (10 TB per day), and performance was poor relative to the money spent on resources.

After moving from Jaeger+Elasticsearch to OpenObserve, they increased trace ingestion to 100% (10 TB per day), with higher performance on the same hardware and lower storage cost. They eventually started ingesting 100 TB of traces per day in OpenObserve. Their team's work offers valuable insights into overcoming the challenges of tracing at scale and ensuring trace fidelity. You can read the full case study here.

This case demonstrates how OpenObserve's unified approach to observability enables improved trace fidelity and facilitates better troubleshooting, performance optimization, and insight gathering across distributed systems.

Ready to get started?

Download OpenObserve
Try OpenObserve Cloud with a 14-day free trial
Join our community for support and discussions

Top 10 Lightstep Alternatives for 2026 (OpenTelemetry-Native Options)

Manas Sharma — Wed, 04 Feb 2026 14:41:04 +0000

ServiceNow announced the sunset of Lightstep (Cloud Observability) effective March 1, 2026. If you're a Lightstep user, you're facing a forced migration with no direct replacement offered by ServiceNow.

Several factors are driving teams to evaluate Lightstep alternatives:

Forced migration - March 2026 EOL deadline approaching with no migration path from ServiceNow
Cost optimization - Opportunity to reduce observability spending by 60-90% with modern platforms
Vendor lock-in concerns - Avoid future platform sunsets by choosing OpenTelemetry-native solutions
OpenTelemetry standardization - Move to vendor-neutral instrumentation that works across platforms
Data sovereignty - Teams need self-hosted or regional deployment options for compliance

In this guide, we'll explore ten OpenTelemetry-native alternatives to Lightstep that address these concerns, from open source platforms to specialized SaaS solutions. We'll include real cost comparisons, migration code snippets, and technical analysis to help you choose the right replacement and migrate before the March 2026 deadline.

The Lightstep Sunset: What You Need to Know

The clock is ticking. ServiceNow has officially announced the sunset of Lightstep (rebranded as ServiceNow Cloud Observability), with the service reaching End-of-Life (EOL) by March 1, 2026.

For engineering teams that relied on Lightstep for its pioneering work in distributed tracing and OpenTelemetry (OTel), this is a critical turning point. You need a replacement that respects your existing OTel instrumentation, handles high-cardinality data without breaking the bank, and doesn't trap you in a proprietary agent ecosystem.

This guide analyzes the Top 10 Lightstep alternatives for 2026, focusing on:

OpenTelemetry compatibility - Native OTel support vs translation layers
Migration ease - How quickly can you switch without rewriting code?
Total cost of ownership - Real pricing for production workloads
High-cardinality support - Can it handle user IDs, request IDs at scale?
Vendor lock-in risk - Will you face this problem again in 3 years?

Bottom line: OpenObserve emerges as the best drop-in replacement, offering significant cost savings while maintaining OpenTelemetry-native architecture and distributed tracing capabilities.

Why This Guide Exists

As observability requirements evolve in 2026, Lightstep users face a forced migration due to ServiceNow's March 1, 2026 end-of-life announcement. With no direct replacement or migration path provided by ServiceNow, teams must evaluate alternatives quickly.

Evidence from Real Migrations:

Cost reduction - Production data shows dramatic savings when moving from Lightstep to modern OpenTelemetry-native alternatives.
Migration timeline: Fast with OTel - Teams using OpenTelemetry can migrate quickly by changing collector configuration. This is significantly faster than platforms that need new instrumentation.
OpenTelemetry-native prevents lock-in - Vendor-neutral instrumentation using OpenTelemetry standards enables future flexibility. You're not rewriting code or learning proprietary agents if you need to switch platforms again.
Unified observability simplifies operations - Having logs, metrics, and traces in one platform reduces the tool sprawl, context switching, and correlation complexity that teams experience with fragmented monitoring stacks.

What Lightstep Users Need to Replicate

Lightstep was known for several key capabilities that any replacement must match:

OpenTelemetry pioneer - Lightstep was an early contributor to OpenTelemetry and built its platform as OTel-native from day one
Distributed tracing excellence - High-cardinality trace data at scale without performance penalties or cost explosions
Unified observability - Logs, metrics, and traces correlated in a single platform with powerful cross-signal queries
Change Intelligence - Deployment tracking and automatic correlation between changes and performance impacts
Service dependency mapping - Visual representation of service relationships and data flows
SQL-based querying - Accessible query language for both developers and SREs

Your replacement platform needs to match these capabilities while avoiding the vendor lock-in risk that led to this forced migration.

What to Look for in a Lightstep Alternative

When evaluating observability platforms to replace Lightstep, assess these critical dimensions:

Criterion	Why It Matters	What to Evaluate
OpenTelemetry Native	Ensures easy migration without code changes	Native OTLP support vs translation layers that add complexity
Migration Timeline	March 2026 deadline approaching fast	Can you complete migration quickly with your team size?
Cost Structure	Opportunity to reduce observability spend	Transparent pricing vs usage-based surprises and hidden fees
Distributed Tracing	Core Lightstep capability you can't lose	High-cardinality support, trace quality, sampling strategies
Data Ownership	Avoid future vendor lock-in scenarios	Self-hosted deployment option available or SaaS-only?
Unified Observability	Reduce tool sprawl and context switching	Logs, metrics, traces in one platform with correlation
Query Capabilities	Investigation efficiency during incidents	SQL/PromQL vs proprietary query languages requiring training
Service Maps	Dependency visualization and troubleshooting	Automatic topology mapping from trace data
Integration Ecosystem	Works with your existing infrastructure	Cloud providers, databases, Kubernetes, CI/CD tools
Vendor Stability	Avoid another sudden platform sunset	Long-term viability, funding, community support, roadmap
Scalability	Handle growing data volumes	Performance at 2x, 5x, 10x current data volumes
High-Cardinality Support	Modern app requirements (user IDs, request IDs)	Cost and performance impact of high-cardinality dimensions

Top 10 Lightstep Alternatives

Jump to comparison table

1. OpenObserve (The Drop-in Replacement)

OpenObserve is the best Lightstep alternative for teams wanting unified observability with OpenTelemetry-native architecture, no vendor lock-in, and 90% cost savings. It delivers the same distributed tracing capabilities Lightstep users rely on, but with transparent pricing and self-hosting options.

Why OpenObserve is the best Lightstep alternative:

OpenObserve isn't just similar to Lightstep - it's architecturally compatible. Both platforms are:

Built for OpenTelemetry from day one
Designed for high-cardinality distributed tracing at scale
Focused on unified observability (logs, metrics, traces)
Using SQL-based query languages (vs proprietary DSLs)

The difference? OpenObserve gives you complete data ownership through self-hosting options.

OpenObserve Pros:

True Drop-in Replacement: Migration from Lightstep requires changing one config file in your OpenTelemetry Collector - no application code changes needed
OpenTelemetry-Native: Native OTLP support means seamless integration with your existing OTel instrumentation
High-Cardinality Friendly: Handles user-level dimensions and request IDs without performance degradation or cost explosions
Unified Observability: Logs, metrics, and traces in one platform with powerful correlation capabilities
SQL + PromQL Querying: Familiar query languages instead of proprietary syntax requiring training
Self-Hosted or Cloud: Deploy on your infrastructure for complete control, or use managed cloud for simplicity
Transparent Pricing: Ingestion-based pricing model with no hidden per-host or per-metric fees

OpenObserve Cons:

Community maturity: While the core platform is battle-tested, the AI agent community is newer compared to established vendors

Migration from Lightstep:

Easiest migration path of any alternative. If you're using OpenTelemetry (which Lightstep users are):

Sign up for OpenObserve (cloud or self-hosted in 10 minutes)
Update your OpenTelemetry Collector exporter configuration (change endpoint URL and auth token)
Restart collector - data immediately flows to OpenObserve
Rebuild dashboards (OpenObserve provides similar visualization capabilities)
Set up alerts (SQL-based, often simpler than Lightstep's UI-based approach)

Best For:

Teams seeking a Lightstep replacement that maintains OpenTelemetry-native architecture, matches distributed tracing capabilities, and dramatically reduces costs without sacrificing functionality. Ideal for organizations wanting data ownership through self-hosting while avoiding vendor lock-in.

2. Grafana Stack (LGTM)

Grafana Stack (Loki for logs, Grafana for visualization, Tempo for traces, Mimir/Prometheus for metrics) is a popular open-source Lightstep alternative composed of best-in-class tools.

Grafana Stack Pros:

Best Visualization: Grafana dashboards are industry-leading with extensive customization options
Open Source & Vendor-Neutral: No proprietary formats or lock-in across the stack
Tempo for Tracing: OpenTelemetry-native distributed tracing with excellent performance
Large Ecosystem: Thousands of integrations, plugins, and community dashboards
Flexible Deployment: Self-host components individually or use managed Grafana Cloud
Prometheus Standard: Industry-standard metrics collection and querying (PromQL)

Grafana Stack Cons:

Not a single unified product like Lightstep - requires managing multiple components
Operational complexity increases significantly at scale (4 different systems)
Correlation across logs/metrics/traces requires manual setup
Steeper learning curve than unified platforms

Migration from Lightstep:

Configure OpenTelemetry Collector to export traces to Tempo, metrics to Prometheus/Mimir, and logs to Loki. More complex than single-platform alternatives due to multiple destinations.

Best For:

Teams wanting maximum flexibility and best-in-class visualization who are comfortable managing multiple components. Good for organizations with strong infrastructure teams or using Grafana Cloud to reduce operational burden.

3. Honeycomb

Honeycomb is a modern Lightstep alternative focused on high-cardinality observability and debugging distributed systems.

Honeycomb Pros:

Excellent for Distributed Tracing: Purpose-built for understanding complex request flows across microservices
High-Cardinality Native: Handles millions of unique dimension values (user IDs, request IDs) without performance issues
Fast Exploratory Queries: Rapid ad-hoc querying enables real-time investigation during incidents
OpenTelemetry Native: Built from ground up to ingest and leverage OpenTelemetry data
BubbleUp Feature: Automatically surfaces anomalies and patterns in high-cardinality data
Developer-Centric UX: Designed around developer and SRE workflows rather than infrastructure-only monitoring

Honeycomb Cons:

SaaS-only (no self-hosted option)
Less focus on traditional dashboards (more investigation-oriented)
Pricing scales with event volume (can grow quickly with high traffic)
Logs and metrics support still evolving compared to tracing strength

Migration from Lightstep:

Straightforward for OpenTelemetry users. Update collector configuration to send traces to Honeycomb. Strong documentation for Lightstep migration scenarios.

Best For:

Teams prioritizing distributed tracing excellence and high-cardinality debugging capabilities over traditional dashboard-heavy monitoring. Ideal for microservices architectures where understanding request flows is critical.

4. Datadog

Datadog is a comprehensive Lightstep alternative offering all-in-one observability with extensive integrations and enterprise features.

Datadog Pros:

Most Comprehensive Platform: Covers infrastructure, APM, logs, traces, RUM, synthetics, and security in one platform
700+ Integrations: Extensive integration marketplace for cloud providers, databases, and frameworks
Mature APM: Deep application performance monitoring with code-level insights
Enterprise-Grade: Strong governance, compliance, and multi-tenancy capabilities
Excellent UX: Polished interface with powerful visualization and alerting

Datadog Cons:

Very Expensive: Often more expensive than Lightstep, with complex multi-vector pricing
Vendor Lock-in: Proprietary agents and data formats make switching difficult
Cost Surprises: Usage-based pricing can lead to unexpected bills with traffic spikes
OpenTelemetry Support Limited: Treats OTel metrics as expensive "custom metrics"

Migration from Lightstep:

Requires Datadog agents or OpenTelemetry Collector configured for Datadog. More complex than OTel-native alternatives due to Datadog's proprietary ingestion formats.

Best For:

Enterprise teams with large budgets prioritizing ecosystem breadth and polished UX over cost optimization. Good if observability budget isn't constrained and you value comprehensive built-in features.

5. New Relic

New Relic is a SaaS observability platform offering unified logs, metrics, traces, and APM with OpenTelemetry support.

New Relic Pros:

Unified Platform: Full-stack observability in single SaaS platform
Strong APM: Deep code-level performance insights and error tracking
OpenTelemetry Support: Native OTLP ingestion simplifies migration
Per-GB Pricing: More predictable than per-host models (though still usage-based)
Developer-Friendly: Good documentation and onboarding experience

New Relic Cons:

Proprietary Translation: Translates OpenTelemetry data into New Relic format (vendor lock-in)
Costs Scale Quickly: Per-GB pricing grows fast with verbose logging or high trace volumes
SaaS-Only: No self-hosted option for data sovereignty
Historical Billing Issues: Past controversies around retroactive pricing changes

Migration from Lightstep:

OpenTelemetry Collector can send data directly to New Relic via OTLP. Simpler than Datadog but creates some vendor lock-in through data format translation.

Best For:

Teams wanting a familiar SaaS experience similar to Lightstep with strong APM capabilities and willing to accept usage-based pricing for operational simplicity.

6. Chronosphere

Chronosphere is a cloud-native observability platform built by ex-Uber engineers, focused on controlling costs at scale while supporting OpenTelemetry.

Chronosphere Pros:

Built for Scale: Created by engineers who built M3 at Uber for handling massive metric volumes
Cost Controls: Native cost visibility and controls to prevent observability bill explosions
OpenTelemetry Compatible: Works with OTel Collector and standard instrumentation
High-Cardinality Metrics: Handles modern application requirements without performance degradation
Governance Features: Strong multi-tenancy and access controls for large organizations
Query Performance: Fast queries even on large datasets

Chronosphere Cons:

Primarily metrics-focused (traces and logs less mature than competitors)
Enterprise pricing (not as cost-effective as open source alternatives)
Smaller ecosystem compared to established players
SaaS-focused (limited self-hosted options)

Migration from Lightstep:

OpenTelemetry Collector can export metrics to Chronosphere. Straightforward for metrics migration, but you'll need additional solutions for comprehensive tracing that Lightstep provided.

Best For:

Large-scale environments generating massive metric volumes where cost control and governance are critical. Good for teams migrating from Lightstep who want enterprise support but need better cost predictability.

7. Jaeger

Jaeger is an open-source distributed tracing platform and graduated CNCF project, offering core tracing capabilities without logs or metrics.

Jaeger Pros:

Completely Free: Open source with no licensing costs whatsoever
CNCF Graduated: Proven stability and community support through Cloud Native Computing Foundation
OpenTelemetry Native: Accepts OTLP directly, so OpenTelemetry SDKs and the Collector can export traces to it without a translation layer
Battle-Tested: Used in production by thousands of organizations globally
Flexible Storage: Supports Cassandra, Elasticsearch, and Badger backends, with Kafka available as an intermediate buffer
Lightweight: Focused solely on distributed tracing without feature bloat

Jaeger Cons:

Tracing Only: No logs or metrics - requires separate tools for unified observability
Basic UI: Functional but less polished than commercial alternatives
Self-Hosted Only: Requires managing infrastructure (no managed SaaS option)
Limited Advanced Features: Missing some of Lightstep's Change Intelligence and correlation features

Migration from Lightstep:

Simple for OpenTelemetry users. Point collector traces to Jaeger endpoint. However, you'll need additional tools for logs and metrics that Lightstep provided.

Best For:

Teams needing just distributed tracing at zero cost and comfortable with self-hosting. Often paired with Prometheus (metrics) and Grafana Loki (logs) for complete observability.

8. Elastic Observability

Elastic Observability (part of Elastic Stack/ELK) provides unified logs, metrics, APM, and traces with powerful search capabilities.

Elastic Observability Pros:

Powerful Search: Elasticsearch excels at full-text and structured log search
Unified Platform: Logs, metrics, APM, and traces in single stack
Flexible Deployment: Self-hosted, managed Elastic Cloud, or hybrid
Large Ecosystem: Extensive integrations with Beats and Logstash
Security + Observability: Strong overlap with SIEM capabilities for security teams

Elastic Observability Cons:

Expensive at Scale: Elasticsearch clusters require significant infrastructure investment
Operational Complexity: Managing Elasticsearch at scale requires expertise
Storage Costs: Full-fidelity data retention gets expensive quickly
OpenTelemetry Support: Works but not as seamless as OTel-native platforms

Migration from Lightstep:

OpenTelemetry Collector can export to Elastic APM. Requires more operational setup than simpler alternatives due to Elasticsearch cluster management.

Best For:

Teams with heavy log analytics requirements or existing Elasticsearch investments who want to consolidate observability into their ELK stack.

9. Dynatrace

Dynatrace is an enterprise APM and observability platform with AI-powered automation and root cause analysis.

Dynatrace Pros:

Automatic Instrumentation: OneAgent automatically discovers and instruments applications
Davis AI: AI engine reduces alert noise through intelligent root cause analysis
Enterprise-Grade: Handles very large, complex enterprise environments
Hybrid Support: Works across on-premises, cloud, and hybrid infrastructures
Low Maintenance: Highly automated requiring minimal configuration

Dynatrace Cons:

Very Expensive: Premium enterprise pricing, often higher than Lightstep
Proprietary Technology: OneAgent and data formats create vendor lock-in
Complex Licensing: Unit-based pricing model can be difficult to predict
OpenTelemetry: Supports OTel but pushes proprietary OneAgent approach

Migration from Lightstep:

Requires deploying OneAgent (Dynatrace's proprietary agent) rather than continuing with OpenTelemetry Collector. More disruptive migration than OTel-native alternatives.

Best For:

Large enterprises with complex environments prioritizing automation and willing to pay premium prices for reduced operational overhead.

10. Splunk Observability Cloud

Splunk Observability Cloud (formerly SignalFx) offers real-time metrics, APM, and infrastructure monitoring focused on cloud-native environments.

Splunk Observability Pros:

Real-Time Streaming: NoSample architecture provides full-fidelity, real-time telemetry
Strong Metrics: Excellent time-series metrics handling and analytics
Enterprise Features: Robust access controls, compliance, and security capabilities
Splunk Ecosystem: Integrates with Splunk platform for unified security and observability
Mature Platform: Proven at scale in large enterprise environments

Splunk Observability Cons:

Expensive: Data-volume-based pricing can be prohibitively expensive
Complexity: Splunk's enterprise focus adds complexity for smaller teams
Storage Costs: Full-fidelity streaming requires significant storage investment
OpenTelemetry: Supports OTel but historically pushed proprietary instrumentation

Migrating from Lightstep to OpenObserve

OpenObserve has first-class support for OpenTelemetry, which means no vendor lock-in and seamless integration with your existing instrumentation.

Your applications don't change. Your OpenTelemetry instrumentation doesn't change. Only the collector destination changes.

O2 supports standardized telemetry collection (e.g., FluentBit, OpenTelemetry, Logstash), ensuring seamless integration. It exposes APIs for ingestion, search, and more, allowing programmatic access to everything. OpenObserve works with any object storage such as S3 or GCS and stores data in open formats, avoiding vendor lock-in on collection and storage.

Migration Path

1. Point your OTel collectors to OpenObserve

Already using OpenTelemetry? Just update your exporter endpoint. No re-instrumentation required.

OpenObserve exporter configuration:

exporters:
  otlphttp/openobserve:
    endpoint: https://your-org.openobserve.ai/api/default/
    headers:
      Authorization: "Basic ${OPENOBSERVE_TOKEN}"
      stream-name: "default"

2. Run both platforms in parallel

Test OpenObserve with your production traffic while Lightstep still runs. Validate data quality and dashboard parity before fully committing.
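Before routing real traffic, a quick way to validate the new destination is to push one hand-built span at it. The sketch below prepares such a request using the OTLP/HTTP JSON encoding. The helper names, the `migration-check` service name, and the `Basic <token>` value are placeholders, the endpoint is copied from the exporter example above, and it assumes your backend accepts JSON-encoded OTLP on the standard `/v1/traces` path.

```python
import json
import secrets
import time
from urllib.request import Request


def build_test_span_payload(service_name: str, span_name: str) -> dict:
    """Build a one-span OTLP/JSON trace payload."""
    end_ns = time.time_ns()
    start_ns = end_ns - 5_000_000  # pretend the work took 5 ms
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "migration-smoke-test"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


def build_request(endpoint: str, auth_header: str, payload: dict) -> Request:
    """Prepare (but do not send) a POST to the OTLP/HTTP traces path."""
    return Request(
        url=endpoint.rstrip("/") + "/v1/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": auth_header},
        method="POST",
    )


payload = build_test_span_payload("migration-check", "smoke-test")
request = build_request("https://your-org.openobserve.ai/api/default/", "Basic <token>", payload)
print(request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=10)
```

If the span shows up in the new backend under the `migration-check` service, the endpoint and credentials are correct, and you can move on to mirroring real traffic through the collector.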

3. Complete migration

Once validated, migrate all workloads to OpenObserve.

Why Migration is Seamless

SQL/PromQL querying - Universal languages your team already knows. No proprietary DSL to learn.

OpenTelemetry-native - Your existing instrumentation works as-is. No agent rewrites or application changes.

Self-hosted or cloud - Deploy however your team prefers. Cloud for simplicity, self-hosted for complete control.

Similar visualization - Familiar observability workflows. Dashboards, service maps, trace views work the same way.

Need Help?

Talk to our team for a personalized migration plan. We'll help you:

Validate technical feasibility for your specific setup
Recreate your critical dashboards and alerting rules
Accelerate the migration process with hands-on support

Comparison Table: Lightstep Alternatives

Tool	Deployment	OTel Native	Pricing Model	Migration Ease	Best For
OpenObserve	Cloud / Self-hosted	Yes	Ingestion-based	Very Easy (1 config change)	Drop-in Lightstep replacement with 90% cost savings
Grafana Stack	Cloud / Self-hosted	Yes	Modular (LGTM)	Moderate (Multiple components)	Maximum flexibility and best visualization
Honeycomb	SaaS only	Yes	Event-based	Very Easy (OTel-native)	High-cardinality tracing excellence
Datadog	SaaS only	Supported	Host/Usage-based	Moderate (More complex)	Enterprise teams with unlimited budget
New Relic	SaaS only	Yes	Per-GB	Easy (OTel-native)	Familiar SaaS with strong APM
Chronosphere	SaaS / Cloud	Compatible	Enterprise	Moderate (Metrics-focused)	Large-scale metrics with cost controls
Jaeger	Self-hosted	Yes	Free (Open source)	Easy (Traces only)	Distributed tracing only (no logs/metrics)
Elastic	Cloud / Self-hosted	Supported	Data-volume	Moderate (Operational complexity)	Log-heavy workloads with search focus
Dynatrace	SaaS / Hybrid	Supported	Unit-based	Moderate (OneAgent required)	Large enterprises needing automation
Splunk	SaaS / On-prem	Supported	Data-volume	Moderate (Complex pricing)	Security + Observability convergence

Conclusion

With ServiceNow's March 1, 2026 Lightstep end-of-life deadline approaching, teams have an opportunity to modernize their observability stack while dramatically reducing costs and avoiding future vendor lock-in.

Key Takeaways

1. OpenObserve is the best drop-in replacement for Lightstep

For most teams, OpenObserve offers the optimal combination of:

OpenTelemetry-native architecture (easy migration - just change collector config)
Similar distributed tracing capabilities (high-cardinality support, service maps, unified observability)
Data ownership through self-hosting option
No vendor lock-in risk

2. OpenTelemetry-native platforms prevent future lock-in

Choose alternatives that support OpenTelemetry natively (OpenObserve, Honeycomb, Jaeger, Grafana) rather than platforms that translate OTel data into proprietary formats (Datadog, Dynatrace). This ensures you can switch platforms again in the future without rewriting application code.

3. Migration is straightforward with OpenTelemetry

If you're already using OpenTelemetry (which Lightstep users are), migration to OTel-native platforms like OpenObserve requires just updating your collector configuration. No application code changes, no re-instrumentation.

4. Start migration now

With the EOL deadline approaching, begin your evaluation and pilot testing immediately. Most teams can validate OpenObserve in a test environment within days.

Recommended Action Plan

This week: Sign up for OpenObserve free trial and test with a non-critical service
Next week: Update OpenTelemetry Collector config and validate data flow
Following weeks: Build dashboards and alerts, run parallel with Lightstep
Complete migration: Gradually move production workloads to OpenObserve

Whether you choose OpenObserve or another alternative, prioritize OpenTelemetry-native platforms to avoid rewriting instrumentation and ensure long-term flexibility.

Take the Next Step

Ready to explore the best Lightstep alternative?

Try OpenObserve: Download or sign up for OpenObserve Cloud with a 14-day free trial.

Talk to our team: Schedule a migration consultation to get a personalized plan for your Lightstep replacement.

FAQ: Lightstep Alternatives

Why is ServiceNow shutting down Lightstep?

ServiceNow acquired Lightstep but decided to discontinue it without providing a replacement. The official reason wasn't detailed publicly, but it's part of their portfolio rationalization. For you, this means finding an alternative before March 1, 2026.

I'm using Lightstep right now - what should I do?

Start testing alternatives immediately. Most migrations take 2-4 weeks, so:

This month: Test OpenObserve or another OTel-native platform with a non-prod service
Next month: Validate data volume handling and build critical dashboards
Following months: Migrate production workloads gradually

Will I lose all my historical data when Lightstep shuts down?

Yes, unless you export it now. ServiceNow stops accepting data after March 1, 2026. Use Lightstep's export APIs to save critical traces you need for compliance or debugging. Most teams only export essential data since full historical migration is rarely necessary.

Do I have to rewrite all my instrumentation code?

No. If you're using OpenTelemetry (most Lightstep users are), just update your OTel Collector config to point to the new platform. Zero application code changes. Only if you're using Lightstep-specific SDKs (rare) would you need to re-instrument.

How long does it actually take to migrate from Lightstep?

2-4 weeks realistically:

Week 1: Setup and testing
Week 2: Build dashboards, run parallel with Lightstep
Week 3-4: Migrate production services

Some vendors claim "migrations in an hour" - that's just the config change. Budget a month to do it properly with dashboard recreation and validation.

What happens if I miss the March 2026 deadline?

ServiceNow stops accepting telemetry. Your observability goes dark - zero visibility into production. Set up at least a basic OTel-native platform (even free Jaeger) as a fallback to avoid complete blindness.

Can I keep using OpenTelemetry after migrating?

Yes - that's the whole point. Your OTel instrumentation continues working unchanged. This is why we recommend OTel-native platforms (OpenObserve, Honeycomb, Jaeger) over proprietary ones (Datadog, Dynatrace) that translate OTel into their formats. Keeps you flexible for future switches.

FastAPI + OpenTelemetry: Stop Debugging with grep (Use Distributed Tracing)

Manas Sharma — Mon, 02 Feb 2026 03:50:55 +0000

How do you debug a FastAPI app that talks to 5 other services?

Most people grep through logs:

Service A logs: "Request received ✓"
Service B logs: "Processing ✓"
Service C logs: "Query executed ✓"
User: "It failed"

Classic distributed systems problem: every service thinks it worked, but the request still broke somewhere.

The issue? Logs are isolated. Each service writes independently with no context about where the request came from or where it's going next.

The fix? OpenTelemetry distributed tracing. Every request gets a unique trace ID that follows it across all services—like a tracking number for API calls. When something breaks, you follow the trace ID and see exactly where it failed.

Setup takes 20 minutes. Debugging goes from hours of log archaeology to "oh, there it is" in under a minute.

Introduction to OpenTelemetry & OpenObserve

OpenTelemetry is an open-source observability framework that lets developers collect logs, metrics, and traces in a standardized way. OpenObserve complements it as the backend that stores this telemetry and provides an interface for analyzing it.

Why OpenTelemetry for FastAPI?

OpenTelemetry integrates with your existing logging libraries, so logs, traces, and metrics all carry the same metadata. That consistency makes it simpler to correlate information across your application stack.

The Problem with Traditional Logging

When debugging microservices:

Each service logs separately
No connection between related requests across services
You're grep-ing through multiple log files trying to piece together what happened
Time zones, log formats, and missing context make correlation nearly impossible

What OpenTelemetry Solves

Distributed Tracing:

Every request gets a unique trace ID
Trace ID follows the request across all services
See the complete request path in one view
Identify exactly where failures occur
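
To make the trace ID idea concrete, here is a minimal sketch using only the Python standard library. It assumes the W3C Trace Context `traceparent` header layout (version, 32-hex trace ID, 16-hex parent span ID, flags). The helper names `new_traceparent` and `child_traceparent` are hypothetical; in a real application the OpenTelemetry SDK creates and propagates this header for you.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def child_traceparent(incoming: str) -> str:
    """Continue an incoming trace: keep the trace ID, mint a new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Malformed or missing header: start a fresh trace instead.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Service A starts the trace; B and C forward it on their outgoing calls.
header_a = new_traceparent()
header_b = child_traceparent(header_a)
header_c = child_traceparent(header_b)

# The trace ID (second field) is identical at every hop.
print(header_a.split("-")[1] == header_c.split("-")[1])
```

Each service gets its own span ID, while the shared trace ID is what lets a backend stitch the hops into one request path.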

Unified Observability:

Logs, metrics, and traces in one place
Correlate log lines to specific traces
See performance metrics alongside request flows
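
Correlating log lines to traces comes down to stamping every log record with the active trace ID. The sketch below shows one way to do that with the standard `logging` and `contextvars` modules. The names `current_trace_id`, `TraceIdFilter`, and `build_logger` are hypothetical, and the trace ID is set by hand here; with OpenTelemetry it would come from the current span.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


def build_logger(stream) -> logging.Logger:
    """Return a logger whose output includes trace_id=<id> on each line."""
    logger = logging.getLogger("orders")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("abc123")  # normally taken from the incoming request
log.info("payment declined")
print(buffer.getvalue(), end="")
```

Once every line carries the same `trace_id` field as the trace, a backend can join the two on that value.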

OpenObserve Key Features

Lightweight & Deployable: Operates as a single binary on laptops or containerized environments
Intuitive Interface: More user-friendly than comparable tools
Query Flexibility: Supports both SQL and PromQL syntax
Integrated Alerting: Built-in capabilities eliminate additional configuration
Cost Efficiency: Achieves substantially lower storage expenses than competitors (140x less than Elasticsearch)

How It Works: Quick Overview

The setup involves five main components:

OpenTelemetry Collector - Receives and processes telemetry data
FastAPI Instrumentation - Automatically captures traces from your FastAPI app
OpenObserve - Stores and visualizes logs, metrics, and traces
Trace IDs - Unique identifiers that follow requests across services
Dashboards - See correlated logs and traces in one view

Example: Debugging with Trace IDs

Before OpenTelemetry:

grep "user_id=12345" service1.log  # Found request
grep "timestamp=14:23:45" service2.log  # Which timezone?
grep "error" service3.log  # Too many results
# 2 hours later... still searching

After OpenTelemetry:

# Search by trace ID across all services
grep "trace_id=abc123" *.log
# Instantly see: Request → Auth → Database → External API timeout
# 2 minutes to identify root cause
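
The same lookup can be sketched in Python: collect lines from every service, keep the ones carrying a given trace ID, and order them by time. The log lines, service names, and the `timeline` helper are made up for illustration, and the sketch assumes all timestamps share one format and time zone.

```python
import re
from collections import defaultdict

TRACE_RE = re.compile(r"trace_id=(\w+)")

# Hypothetical log lines from several services: (service, timestamp, message).
LOGS = [
    ("auth", "14:23:45.120", "trace_id=abc123 token verified"),
    ("api", "14:23:45.100", "trace_id=abc123 request received"),
    ("db", "14:23:45.180", "trace_id=abc123 query executed"),
    ("api", "14:23:45.150", "trace_id=zzz999 request received"),
    ("payments", "14:23:50.200", "trace_id=abc123 external API timeout"),
]


def timeline(logs, trace_id):
    """Return one trace's log lines from all services, ordered by time."""
    by_trace = defaultdict(list)
    for service, ts, message in logs:
        match = TRACE_RE.search(message)
        if match:
            by_trace[match.group(1)].append((ts, service, message))
    return sorted(by_trace[trace_id])


for ts, service, message in timeline(LOGS, "abc123"):
    print(ts, service, message)
```

A tracing backend does this join for you and adds span durations, which is why the five-second gap before the timeout stands out immediately in a trace view.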

What You'll Get

With FastAPI + OpenTelemetry + OpenObserve:

✅ Automatic tracing for all FastAPI endpoints
✅ Trace IDs that follow requests across microservices
✅ Log correlation - click a trace to see all related logs
✅ Performance metrics - response times, error rates per endpoint
✅ Fast debugging - find issues in minutes, not hours

Ready to Set This Up?

The complete setup guide (with step-by-step instructions, code examples, and configuration files) is available on OpenObserve's blog.

What you'll learn:

Installing OpenTelemetry Collector
Configuring YAML for log and trace collection
Setting up OpenObserve locally or in the cloud
Instrumenting your FastAPI application with automatic tracing
Testing and analyzing traces in the OpenObserve dashboard
Common troubleshooting tips

👉 Read the full setup guide here

Looking for an OpenTelemetry-native backend?

If you need something that works with your existing OTel setup—self-hosted or managed cloud, SQL + PromQL querying, unified logs/metrics/traces, with enterprise features (SSO, RBAC, multi-tenancy) but without the Datadog/Elastic price tag:

Check out OpenObserve. Open-source, 140x lower storage costs, built for teams that want control over their observability stack.

→ Try the cloud version (14-day trial)
→ Download

Your GPU cluster might be wasting $50k/year through thermal throttling and you'd never know. NVIDIA GPU Monitoring Dashboards

Manas Sharma — Sun, 01 Feb 2026 17:35:54 +0000

Your GPU cluster might be wasting $50k/year through thermal throttling and you'd never know. Here's how to catch it before it burns your budget. 30-min setup with DCGM + OpenTelemetry.

Manas Sharma — Sun, 01 Feb 2026 17:35:00 +0000

NVIDIA GPU Monitoring: Catch Thermal Throttling Before It Costs You $50k/Year

Manas Sharma — Sun, 01 Feb 2026 09:19:19 +0000

Thermal throttling at 3 AM because you didn't catch that GPU running hot? Your $240k H200 cluster shouldn't be bleeding $50k+ annually through silent failures and inefficiencies.

We built this guide because monitoring NVIDIA GPUs with traditional tools was taking 4-8 hours of setup time. Here's how to get DCGM Exporter + OpenObserve running in ~30 minutes and catch issues before they torch your budget.

The AI-driven infrastructure landscape is evolving, and GPU clusters represent one of the most significant capital investments for organizations. Whether you're running large language models, training deep learning models, or processing massive datasets, your NVIDIA GPUs (H100s, H200s, A100s, or L40S) are the workhorses powering your most critical workloads.

But here's the challenge: how do you know if your GPU infrastructure is performing optimally?

Traditional monitoring approaches fall short when it comes to GPU infrastructure. System metrics like CPU and memory utilization don't tell you if your GPUs are thermal throttling, experiencing memory bottlenecks, or operating at peak efficiency. You need deep visibility into GPU-specific metrics like utilization, temperature, power consumption, memory usage, and PCIe throughput.

This is where NVIDIA's Data Center GPU Manager (DCGM) Exporter combined with OpenObserve creates a powerful, cost-effective monitoring solution that gives you real-time insights into your GPU infrastructure.

Why GPU Monitoring Matters

The High Cost of GPU Inefficiency

Consider this scenario: You're running an 8x NVIDIA H200 cluster. Each H200 costs approximately $30,000-$40,000, meaning your hardware investment alone is around $240,000-$320,000. Operating costs (power, cooling, infrastructure) can easily add another $50,000-$100,000 annually.

Now imagine:

Thermal throttling reducing performance by 15% due to poor cooling
GPU memory leaks causing jobs to fail silently
Underutilization with GPUs sitting idle 40% of the time
Hardware failures going undetected until complete outage
PCIe bottlenecks limiting data transfer rates

Without proper monitoring, you're flying blind. You might be:

Wasting $50,000+ annually on inefficient GPU utilization
Missing critical performance degradation before it impacts production
Unable to justify ROI on GPU infrastructure to stakeholders
Lacking data for capacity planning and optimization decisions

What You Need to Monitor

Effective GPU monitoring requires tracking dozens of metrics across multiple dimensions:

Performance Metrics:

GPU compute utilization (%)
Memory bandwidth utilization (%)
Tensor Core utilization
SM (Streaming Multiprocessor) occupancy

Thermal & Power:

GPU temperature (°C)
Power consumption (W)
Power limit throttling events
Thermal throttling events

Memory:

GPU memory usage (MB/GB)
Memory allocation failures
ECC (Error Correction Code) errors
Memory clock speeds

Interconnect:

PCIe throughput (TX/RX)
NVLink bandwidth
NVSwitch fabric health
Data transfer bottlenecks

Health & Reliability:

XID errors (hardware faults)
Page retirement events
GPU compute capability
Driver version compliance

The Solution: DCGM Exporter + OpenObserve

What is DCGM Exporter?

NVIDIA's Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs. DCGM Exporter exposes GPU metrics in Prometheus format, making it easy to integrate with modern observability platforms.

You can find more details in the DCGM Exporter repository: github.com/NVIDIA/dcgm-exporter

Key capabilities:

Exposes 40+ GPU metrics per device
Supports all modern NVIDIA datacenter GPUs (A100, H100, H200, L40S)
Low overhead monitoring (~1% GPU utilization)
Works with Docker, Kubernetes, and bare metal
Handles multi-GPU and multi-node deployments
Provides health diagnostics and error detection

Complete Setup Guide

Prerequisites

Before starting, ensure you have:

GPU-enabled server (cloud or on-premises)
NVIDIA GPUs installed and recognized by the system
NVIDIA drivers version 535+ (550+ recommended for H200)
Docker installed and configured with NVIDIA Container Toolkit
OpenObserve instance (cloud or self-hosted)

Step 1: Verify GPU Detection

First, confirm your GPUs are properly detected by the system:

# Check if GPUs are visible
nvidia-smi

# Expected output: List of GPUs with utilization, temperature, and memory

For NVIDIA H200 or multi-GPU systems with NVSwitch, you'll need the NVIDIA Fabric Manager:

# Install fabric manager (version should match your driver)
sudo apt update
sudo apt install -y nvidia-driver-535 nvidia-fabricmanager-535

# Reboot to load new driver
sudo reboot

# After reboot, start the service
sudo systemctl start nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager

# Verify
nvidia-smi  # Should now show all GPUs

Step 2: Deploy DCGM Exporter

Deploy DCGM Exporter as a Docker container. This lightweight container exposes GPU metrics on port 9400:

docker run -d \
  --gpus all \
  --cap-add SYS_ADMIN \
  --network host \
  --name dcgm-exporter \
  --restart unless-stopped \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04

Configuration breakdown:

--gpus all - Grants access to all GPUs on the host
--cap-add SYS_ADMIN - Required for DCGM to query GPU metrics
--network host - Uses host networking for easier access
--restart unless-stopped - Ensures resilience across reboots

Verify DCGM is working:

# Wait 10 seconds for initialization
sleep 10

# Access metrics from inside the container
docker exec dcgm-exporter curl -s http://localhost:9400/metrics | head -30

# You should see output like:
# DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-xxxx",...} 45.0
# DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-xxxx",...} 42.0
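
If you want to sanity-check that output programmatically, the Prometheus text format is simple enough to parse with the standard library. The sketch below is not a full parser (it ignores timestamps and unusual label escaping), and `parse_metrics` plus the sample text are illustrative assumptions modeled on the output above.

```python
import re

# name{label="value",...} value   (the label block is optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


# Sample input shaped like the exporter output; values are made up.
SAMPLE = '''
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-xxxx"} 45.0
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-xxxx"} 42.0
'''

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels["gpu"], value)
```

In practice you would feed it the body returned by the exporter's `/metrics` endpoint instead of the inline sample.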

Step 3: Configure OpenTelemetry Collector

The OpenTelemetry Collector scrapes metrics from DCGM Exporter and forwards them to OpenObserve. Create the configuration:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'dcgm-gpu-metrics'
          scrape_interval: 30s
          static_configs:
            - targets: ['localhost:9400']
          metric_relabel_configs:
            # Keep only DCGM metrics
            - source_labels: [__name__]
              regex: 'DCGM_.*'
              action: keep

exporters:
  otlphttp/openobserve:
    endpoint: https://example.openobserve.ai/api/ORG_NAME/
    headers:
      Authorization: "Basic YOUR_O2_TOKEN"

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlphttp/openobserve]
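
The `keep` rule in `metric_relabel_configs` is worth a closer look, because Prometheus anchors relabel regexes at both ends. A rough Python equivalent of that filter, with a hypothetical `keep_metric` helper and made-up metric names, looks like this:

```python
import re


def keep_metric(name: str, pattern: str = "DCGM_.*") -> bool:
    """Mimic a `keep` relabel rule: the regex must match the whole name."""
    return re.fullmatch(pattern, name) is not None


# Metric names a scrape might return; only the first two are DCGM metrics.
scraped = [
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_GPU_TEMP",
    "go_goroutines",
    "my_DCGM_copy",
    "process_cpu_seconds_total",
]

kept = [name for name in scraped if keep_metric(name)]
print(kept)
```

Because of the anchoring, `my_DCGM_copy` is dropped even though it contains `DCGM_`, so only series whose names start with the prefix are forwarded.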

Get your OpenObserve credentials:

# For ingestion token authentication (recommended):
Go to OpenObserve UI → Datasources → Custom → Otel Collector

Update the Authorization header in the config with your base64-encoded credentials.
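
The OpenObserve UI normally shows the ready-made header value, so you can copy it directly. If you need to build it yourself, the sketch below assumes the value follows standard HTTP Basic authentication, that is, base64 of `user:token`. The `basic_auth_header` helper and the credentials are placeholders; confirm the exact user and token pairing against what your instance displays.

```python
import base64


def basic_auth_header(user: str, token: str) -> str:
    """Build an HTTP Basic Authorization value: base64("user:token")."""
    raw = f"{user}:{token}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Placeholder credentials for illustration only.
print(basic_auth_header("you@example.com", "YOUR_INGESTION_TOKEN"))
```

Treat the resulting string as a secret: anyone holding it can write data to your organization.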

Step 4: Deploy OpenTelemetry Collector

docker run -d \
  --network host \
  -v $(pwd)/otel-collector-config.yaml:/etc/otel-collector-config.yaml \
  --name otel-collector \
  --restart unless-stopped \
  otel/opentelemetry-collector-contrib:latest \
  --config=/etc/otel-collector-config.yaml

Check OpenTelemetry Collector:

# View collector logs
docker logs otel-collector

# Look for successful scrapes (no error messages)

Check OpenObserve:

Log into OpenObserve UI
Navigate to Metrics section
Search for metrics starting with DCGM_
Data should appear within 1-2 minutes

Step 5: Generate GPU Load (Optional)

To verify monitoring is working, generate some GPU activity:

# Install PyTorch
pip3 install torch

# Create a load test script
cat > gpu_load.py <<'EOF'
import torch
import time

print("Starting GPU load test...")
devices = [torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]
tensors = [torch.randn(15000, 15000, device=d) for d in devices]

print(f"Loaded {len(devices)} GPUs")
while True:
    for tensor in tensors:
        _ = torch.mm(tensor, tensor)
    time.sleep(0.5)
EOF

# Run load test
python3 gpu_load.py

Watch your metrics in OpenObserve - you should see GPU utilization spike!

Creating Dashboards in OpenObserve

Download the Dashboards from our community repository.
In OpenObserve UI, go to Dashboards → Import → Drop your files here → select your JSON file → Import

Once the dashboard has been imported, you will see the prebuilt GPU metric panels, which you can customize as needed.

Setting Up Alerts

Critical alerts to configure in OpenObserve:

1. High GPU Temperature

DCGM_FI_DEV_GPU_TEMP > 85

Severity: Warning at 85°C, Critical at 90°C
Action: Check cooling systems, reduce workload

2. GPU Memory Near Capacity

(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.90

Severity: Warning at 90%, Critical at 95%
Action: Optimize memory usage or scale horizontally
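
To see how the memory expression and its two severity levels fit together, here is a small worked sketch. The function names and sample values are hypothetical; the thresholds are the 90% and 95% levels listed above.

```python
def memory_used_fraction(fb_used_mib: float, fb_free_mib: float) -> float:
    """Same ratio as the alert expression: used / (used + free)."""
    total = fb_used_mib + fb_free_mib
    if total <= 0:
        raise ValueError("used + free must be positive")
    return fb_used_mib / total


def severity(fraction: float, warning: float = 0.90, critical: float = 0.95) -> str:
    """Map a usage fraction to ok / warning / critical."""
    if fraction > critical:
        return "critical"
    if fraction > warning:
        return "warning"
    return "ok"


# Made-up framebuffer readings in MiB for three GPUs.
for used, free in [(40_000, 40_000), (73_000, 7_000), (77_000, 3_000)]:
    fraction = memory_used_fraction(used, free)
    print(f"{fraction:.4f}", severity(fraction))
```

Checking the critical threshold first keeps the two levels from overlapping, so a GPU at 96% reports only the higher severity.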

3. Low GPU Utilization (Waste Detection)

avg(DCGM_FI_DEV_GPU_UTIL) < 20

Duration: For 30 minutes
Action: Review workload scheduling, consider rightsizing
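
The "for 30 minutes" duration matters as much as the threshold: a single idle sample should not page anyone. The sketch below shows one way to model that condition, assuming the 30-second scrape interval from the collector config. The class name is hypothetical and this is not how OpenObserve evaluates alerts internally.

```python
from collections import deque


class SustainedLowUtilization:
    """Fire only when utilization stays below the threshold for a full window."""

    def __init__(self, threshold=20.0, window_seconds=1800, scrape_interval=30):
        self.threshold = threshold
        self.needed = window_seconds // scrape_interval  # samples per window
        self.recent = deque(maxlen=self.needed)

    def observe(self, avg_gpu_util: float) -> bool:
        """Record one sample; return True when the alert should fire."""
        self.recent.append(avg_gpu_util < self.threshold)
        return len(self.recent) == self.needed and all(self.recent)


alert = SustainedLowUtilization()
fired = [alert.observe(5.0) for _ in range(60)]  # 60 samples = 30 minutes
print(fired.index(True) + 1)  # number of samples needed before firing
```

A single busy sample resets the clock, because the window must then fill with low readings again before the alert can fire.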

4. GPU Hardware Errors

DCGM_FI_DEV_XID_ERRORS > 0

Severity: Critical
Action: Immediate investigation, potential RMA

5. Thermal Throttling Detected

increase(DCGM_FI_DEV_THERMAL_VIOLATION[5m]) > 0

Severity: Warning
Action: Improve cooling or reduce ambient temperature
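
The `increase(...[5m]) > 0` pattern asks whether a counter grew inside the window. A simplified sketch of that calculation follows; it handles counter resets but, unlike PromQL's `increase()`, does no extrapolation to the window edges. The sample values are made up.

```python
def increase(samples):
    """Total growth of a counter across samples, treating any drop as a reset."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter restarted from zero
    return total


# Made-up readings of a throttling counter over one window.
print(increase([100, 100, 100]))      # flat: no throttling in the window
print(increase([100, 150, 400]))      # steady growth
print(increase([100, 150, 20, 60]))   # growth across a counter reset
```

Alerting on the increase rather than the raw value means an old, large counter does not keep the alert firing after throttling has stopped.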

6. GPU Offline

absent(DCGM_FI_DEV_GPU_TEMP)

Duration: For 2 minutes
Action: Check GPU health, driver status, fabric manager

Traditional Monitoring vs. GPU Monitoring with OpenObserve

Aspect	Traditional Monitoring (Prometheus/Grafana)	OpenObserve for GPU Monitoring
Setup Complexity	Requires Prometheus, node exporters, Grafana, storage backend, and complex configuration	Single unified platform with built-in visualization
Storage Costs	High - Prometheus stores all metrics at full resolution, requires expensive SSD storage	80% lower - Advanced compression and columnar storage
Multi-tenancy	Complex setup requiring multiple Prometheus instances or federation	Built-in with organization isolation and access controls
Alerting	Separate alerting system (Alertmanager), complex routing configuration	Integrated alerting with flexible notification channels
Long-term Retention	Expensive - requires additional tools like Thanos or Cortex	Native long-term storage with automatic data lifecycle management
GPU-Specific Features	Generic time-series database, not optimized for GPU metrics	Optimized for high-cardinality workloads like GPU monitoring
Log Correlation	Separate log management system needed (ELK, Loki)	Unified logs, metrics, and traces in one platform
Setup Time	4-8 hours (multiple components, configurations, troubleshooting)	30 minutes (end-to-end)
Maintenance Overhead	High - multiple systems to update, monitor, and troubleshoot	Low - single platform with automatic updates

ROI Examples

For an 8-GPU H200 cluster worth $320,000:

Detect thermal throttling early:

15% performance loss = $48,000 annual waste
Early detection saves this loss
ROI: 990% in first year

Optimize utilization:

Increase from 40% to 70% = 75% more work
Defer $240,000 expansion by 1 year
ROI: 4,900% in first year

Prevent downtime:

1 hour downtime = $2,800 revenue loss
Preventing 5 hours/year = $14,000 saved
ROI: 289% in first year
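
The dollar figures above can be reproduced with a few lines of arithmetic. The ROI percentages also depend on what the monitoring itself costs, which is not stated here, so `roi_percent` takes that cost as an explicit parameter and the value passed to it below is purely hypothetical.

```python
CLUSTER_VALUE = 320_000       # 8x H200 at the upper price estimate
THROTTLE_LOSS_PERCENT = 15    # performance lost to thermal throttling
DOWNTIME_COST_PER_HOUR = 2_800
DOWNTIME_HOURS_PREVENTED = 5

# 15% of the cluster value is wasted each year.
throttling_waste = CLUSTER_VALUE * THROTTLE_LOSS_PERCENT // 100

# Going from 40% to 70% utilization yields 70/40 - 1 = 75% more work.
extra_work_percent = (70 / 40 - 1) * 100

# Five prevented hours of downtime at $2,800 per hour.
downtime_saved = DOWNTIME_COST_PER_HOUR * DOWNTIME_HOURS_PREVENTED


def roi_percent(benefit: float, monitoring_cost: float) -> float:
    """Return on investment: net gain divided by what was spent."""
    return (benefit - monitoring_cost) / monitoring_cost * 100


print(throttling_waste, extra_work_percent, downtime_saved)

# Hypothetical annual monitoring cost, only to show how the formula behaves.
print(roi_percent(throttling_waste, monitoring_cost=4_000))
```

Plugging in your own monitoring cost is the quickest way to check whether these scenarios hold for your cluster.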

Conclusion

GPU monitoring is no longer optional—it's essential infrastructure for any organization running GPU workloads. The combination of DCGM Exporter and OpenObserve provides:

✅ Complete visibility into GPU health, performance, and utilization
✅ Cost optimization through identifying waste and inefficiencies
✅ Proactive alerting to prevent outages and degradation
✅ Data-driven decisions for capacity planning and architecture
✅ 89% lower TCO compared to traditional monitoring stacks
✅ 30-minute setup vs. 4-8 hours with traditional tools

Whether you're running AI/ML workloads, rendering farms, scientific computing, or GPU-accelerated databases, this monitoring solution delivers immediate ROI while scaling effortlessly as your infrastructure grows.

Resources

DCGM Exporter: github.com/NVIDIA/dcgm-exporter
OpenObserve: openobserve.ai
OpenObserve Docs: openobserve.ai/docs
OpenTelemetry Collector: opentelemetry.io/docs/collector

Get Started with OpenObserve Today!

Sign up for a 14-day trial
Check out our GitHub repository for self-hosting and contribution opportunities

Debugging GPU infrastructure shouldn't feel like a 2 AM guessing game.
Try OpenObserve for free

DEV Community: Manas Sharma

How to Monitor AI Agents in Production

Why Agents Are Harder to Monitor Than a Single LLM Call

The OTel Data Model for AI Agents

Spans: LLM calls, tool invocations, and agent steps

Events vs. attributes for prompt and response content

Trace context propagation across agent boundaries

Picking Your Auto-Instrumentation Library

Example 1: Instrumenting a LangChain Agent

Example 2: Instrumenting an OpenAI Agents SDK App

Shipping Traces to OpenObserve

Direct export vs. OTel Collector

What to Look For in OpenObserve

Reading a multi-agent trace waterfall

SQL queries for token usage and cost

Querying Agent Traces via MCP

Production Checklist

PII redaction

Sampling for LLM traffic

Alerting

Try It on OpenObserve Cloud

How to Monitor OpenAI API Costs and Token Usage with OpenTelemetry

TL;DR

Why OpenAI bills are impossible to predict without instrumentation

The three signals you actually need to track

What OpenTelemetry's GenAI semantic conventions give you

Instrumenting a Python app with the official OTel OpenAI SDK

Install the three packages

Set the OTLP endpoint for OpenObserve

Run with opentelemetry-instrument

A minimal example app

Capturing message content (and the privacy tradeoff)

Instrumenting a Node.js app

Building a cost calculation layer

Pricing table as code

Emitting cost as a custom metric

Attributing cost to users, features, and teams

Adding attributes on every span

Building the cost attribution dashboard

Alerting on cost anomalies and rate-limit errors

Threshold alerts vs anomaly detection

A daily budget threshold

An anomaly-based alert for cost spikes

Alert on rate-limit errors (HTTP 429)

Reconciling estimated cost with the OpenAI billing API

Measuring time to first token for streaming

Production checklist

Send your LLM telemetry to OpenObserve

Further Reading

I Built a Dashboard in 30 Seconds with AI

The Problem

It's Not Anomaly Detection. It's Something Simpler.

1. The Dashboard Request That Normally Kills Your Afternoon

2. Same Thing, Different Domain: Infrastructure

3. Proactive: Don't Wait Until Something Breaks

4. Something's Actually Broken: Root Cause Analysis

Beyond the UI: Take It to Your IDE

What This Actually Changes

Resources

Monitoring Java Microservices with OpenTelemetry and OpenObserve

What you'll build

What is distributed tracing?

Why OpenTelemetry + OpenObserve?

OpenTelemetry

OpenObserve

Architecture used in this tutorial

Prerequisites

Step 1: Clone the project

Step 2: Start OpenObserve and MySQL

Step 3: Download OpenTelemetry Java Agent

Step 4: Configure agent export to OpenObserve

Step 5: Start discovery-service

Step 6: Start user/order/payment services

Step 7: Generate traces

1) Create user

2) Create order

3) Process payment (full distributed trace)

4) Trigger an error trace

Visualize in OpenObserve

Trace Explorer

Run with `opentelemetry-instrument`