didiViking

Posted on Jun 28 • Originally published at Medium on Aug 21, 2025

OTel Me More on Traces: introducing VictoriaMetrics’ Trace Analyzer

#opentelemetry #opentelemetrycollect #distributedtracing #observability

Tracing is an essential part of modern observability, helping developers understand how requests flow through distributed systems. OpenTelemetry (OTel) has become the de facto standard for collecting traces across services, and VictoriaMetrics’ UI now includes a powerful Trace Analyzer that provides detailed execution traces of queries. These traces show how VictoriaMetrics queries are processed internally, highlighting stages, timings, and resource usage so you can turn raw query execution data into actionable performance insights. In this article, we’ll explore how the Trace Analyzer works, how to use it, why it matters for your monitoring stack and how you can combine it with OTel to get both system-wide spans and deep query-level traces.

Why Tracing Matters

In complex architectures, latency issues and failures rarely occur in isolation. Distributed tracing helps answer critical questions:

Which service is slowing down my request?
Where are errors originating in a multi-service call chain?
How does traffic affect the performance of each component?

OTel provides a standardized way to instrument your applications and collect trace data. However, collecting traces is just the first step; analyzing them efficiently is where the real value comes in.

Introducing VictoriaMetrics Trace Analyzer

VictoriaMetrics’ Trace Analyzer is designed to make query trace exploration intuitive and fast. With it, you can:

Visualize query execution as an interactive tree, showing each evaluation step and how long it took.
Inspect details such as time ranges, step size, number of matched series, and points processed.
Identify bottlenecks in queries (e.g. heavy rollups, large series scans, or cache misses).
Combine Trace Analyzer with OTel-collected spans in your stack: use OTel for distributed request flows and Trace Analyzer for deep insights into how VictoriaMetrics executes your queries.

Before diving into setup, let’s get a better understanding of VictoriaMetrics’ architecture and where Trace Analyzer fits.

Components of a VictoriaMetrics cluster explained

In VMUI (VictoriaMetrics Native UI), you’ll discover powerful, unique tools such as Trace Analyzer, Trace Query, and Query Analyzer, along with features that make exploring and debugging metrics a breeze. These include raw metrics exploration, cardinality analysis, top and active queries inspection, a WITH expressions playground, metric relabeling, downsampling, and retention filter debuggers.

Getting Started

Install either the single-node or the cluster version of VictoriaMetrics. For this tutorial, I'm using the single-node version, which makes the setup a lot easier. In a single-node setup, we do not need a separate vmagent container because the single-node binary already includes a built-in Prometheus-compatible scraping engine. All Prometheus-compatible targets (like node_exporter or application endpoints) can be scraped directly using a scrape.yaml configuration file, which is mounted into the VictoriaMetrics container and passed to it with the -promscrape.config flag. This file defines all the jobs, endpoints, and labels needed for metrics collection, so adding vmagent would be redundant and could cause duplicate scrapes.
I installed VictoriaMetrics single-node and defined all static scrape jobs in a scrape.yaml file:

global:
  scrape_interval: 10s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'standard-app'
    static_configs:
      - targets: ['demo-app:8083']

  - job_name: 'highcard'
    static_configs:
      - targets: ['demo-highcard-app:8084']

  - job_name: 'traffic-skew'
    static_configs:
      - targets: ['demo-traffic-skew-app:8085']
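
Once the stack is up, one way to confirm that every job above is actually being scraped is the Prometheus-compatible /api/v1/targets endpoint that VictoriaMetrics exposes. The following is a minimal sketch, not part of the original setup: the helper names are hypothetical, and the sample payload (including its error text) is made up for illustration.

```python
import json
import urllib.request


def fetch_targets(base_url):
    """Fetch the Prometheus-compatible target list from VictoriaMetrics."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (job, scrape URL, last error) for every target that is not up."""
    return [
        (t["labels"].get("job", ""), t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in payload["data"]["activeTargets"]
        if t.get("health") != "up"
    ]


# Hypothetical response fragment covering two of the jobs defined above.
sample_targets = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "node", "instance": "node-exporter:9100"},
                "scrapeUrl": "http://node-exporter:9100/metrics",
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "highcard", "instance": "demo-highcard-app:8084"},
                "scrapeUrl": "http://demo-highcard-app:8084/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
        ]
    },
}

print(unhealthy_targets(sample_targets))
```

Against the running stack you would call unhealthy_targets(fetch_targets("http://localhost:18428")), assuming the host port mapping used in this demo; an empty list means every target is up.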

Pull and run the OTel Collector. I used the otel/opentelemetry-collector-contrib Docker image and defined my entire setup in a docker-compose.yaml file. Below is that file, which also includes a local Grafana OSS instance and some basic apps to generate traffic and metrics.

services:
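  victoria-metrics:
    image: victoriametrics/victoria-metrics:v1.122.0
    # Mount scrape.yaml and point the built-in scraper at it.
    # The host and container paths are assumptions; adjust them to your layout.
    command:
      - "-promscrape.config=/etc/victoria-metrics/scrape.yaml"
    volumes:
      - ./scrape.yaml:/etc/victoria-metrics/scrape.yaml:ro
    ports:
      - "18428:8428"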

  grafana:
    image: grafana/grafana:11.1.0
    ports:
      - "13000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=[redacted]
      - GF_SECURITY_ADMIN_PASSWORD=[redacted]
    volumes:
      - ./grafana-provisioning:/etc/grafana/provisioning
    depends_on:
      - victoria-metrics

  node-exporter:
    image: prom/node-exporter:v1.7.0
    ports:
      - "9101:9100"

  demo-app:
    build:
      context: ./demo-app
    environment:
      - PORT=8083
      - LABEL_COUNT=5
    ports:
      - "8083:8083"

  loadgen-demo-app:
    image: python:3.12-alpine
    depends_on:
      - demo-app
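    # A compound statement (while) cannot follow "import ...;" on one line,
    # so the Python loop is passed to -c as a multi-line script.
    command:
      - sh
      - -c
      - |
        pip install --no-cache-dir requests &&
        python -u -c '
        import requests, time
        while True:
            requests.get("http://demo-app:8083")
            time.sleep(0.1)
        '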

  demo-highcard-app:
    build:
      context: ./demo-highcard-app
    environment:
      - PORT=8084
      - LABEL_COUNT=50
    ports:
      - "8084:8084"

  loadgen-demo-highcard-app:
    image: python:3.12-alpine
    depends_on:
      - demo-highcard-app
    # Multi-line script: "while" cannot follow "import ...;" on one line.
    command:
      - sh
      - -c
      - |
        pip install --no-cache-dir requests &&
        python -u -c '
        import requests, time
        while True:
            requests.get("http://demo-highcard-app:8084")
            time.sleep(0.1)
        '

  demo-traffic-skew-app:
    build:
      context: ./demo-traffic-skew-app
    environment:
      - PORT=8085
      - LABEL_COUNT=5
    ports:
      - "8085:8085"

  loadgen-demo-traffic-skew-app:
    image: python:3.12-alpine
    depends_on:
      - demo-traffic-skew-app
    # Multi-line script: "while" cannot follow "import ...;" on one line.
    command:
      - sh
      - -c
      - |
        pip install --no-cache-dir requests &&
        python -u -c '
        import requests, time
        while True:
            requests.get("http://demo-traffic-skew-app:8085")
            time.sleep(0.1)
        '

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.132.0
    command: ["--config=/etc/otel-collector/config.yaml"]
    volumes:
      - ./otel-collector/config.yaml:/etc/otel-collector/config.yaml:ro

    ports:
      - "8889:8889"  
      - "4317:4317"  
      - "4318:4318"
    depends_on:
      - victoria-metrics
      - demo-app
      - demo-highcard-app
      - demo-traffic-skew-app
      - node-exporter

Even so, I still keep a scrape job in the collector's own config, because it lets the OTel Collector ship its internal metrics (such as otelcol_exporter_send_failed_metric_points_total) to VictoriaMetrics. Without this job, the collector's own telemetry would not be visible. In short: scrape.yaml handles all external targets, while the collector's scrape job handles self-monitoring of the collector. Below is my otel-collector config.yaml file, where metricsql-lab-victoria-metrics-1 is the hostname of the VictoriaMetrics container in this demo setup. In my case, this config.yaml is stored in a separate folder from the one hosting VictoriaMetrics and the metrics apps.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          static_configs:
            - targets: ['0.0.0.0:8888']

  filelog:
    include: ["/logs/*.log"]
    start_at: beginning

processors:
  resourcedetection:
    detectors: [system]
  batch:
    timeout: 5s

exporters:
  prometheusremotewrite:
    endpoint: "http://metricsql-lab-victoria-metrics-1:8428/api/v1/write"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 0

  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  telemetry:
    metrics:
      level: basic

  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [resourcedetection, batch]
      exporters: [prometheus, prometheusremotewrite]

Ensure the OTel Collector is on the same Docker network as VictoriaMetrics and the demo apps. You might run into different errors, depending on how complex your setup is, and if one of those errors is network related, you might want to troubleshoot it with:

docker network ls
docker network inspect <your-network>

If the OTel Collector doesn't show up in the same network as the rest of your services, add it to that network, restart the collector container, and you're good to test.

docker network connect <your-network> <otelcol>
docker restart <otelcol>
docker ps -a
docker logs -f <otelcol>

Switch to VMUI: open http://localhost:8428/vmui, or use whatever host port you mapped for VictoriaMetrics (18428 in the docker-compose.yaml above).

What I like to do immediately is go to Explore>Explore Cardinality. This is where the juicy part starts happening and this view actually helps me troubleshoot cardinality.

Looking at the metric names with highest number of series I can drill down into the metric name. In this case, I’m searching for otelcol-contrib, so I’ll check the labels section, job and drill down there.

Done. Now I’m going even further to look at the metrics for this job.

This panel helps me understand which metrics have the highest cardinality and which have the lowest. To learn how to use the Cardinality Explorer properly and troubleshoot more efficiently, please read this article.
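
The same cardinality data can also be pulled programmatically from the Prometheus-compatible /api/v1/status/tsdb endpoint. The sketch below is an illustration rather than part of the original walkthrough: the helper names are hypothetical, and the sample numbers are invented.

```python
import urllib.parse


def tsdb_status_url(base_url, top_n=10, match=None):
    """Build a /api/v1/status/tsdb URL, optionally limited by a series selector."""
    params = {"topN": str(top_n)}
    if match:
        params["match[]"] = match
    return f"{base_url}/api/v1/status/tsdb?{urllib.parse.urlencode(params)}"


def top_metrics(payload):
    """Return (metric name, series count) pairs from a TSDB status response."""
    return [(e["name"], e["value"]) for e in payload["data"]["seriesCountByMetricName"]]


# Hypothetical response fragment; real counts depend on your data.
sample_status = {
    "status": "success",
    "data": {
        "totalSeries": 42,
        "seriesCountByMetricName": [
            {"name": "otelcol_exporter_sent_metric_points_total", "value": 2},
            {"name": "otelcol_process_uptime_seconds_total", "value": 1},
        ],
    },
}

print(tsdb_status_url("http://localhost:18428", 5, '{job="otelcol-contrib"}'))
print(top_metrics(sample_status))
```

Fetching the printed URL with any HTTP client and passing the decoded JSON to top_metrics gives roughly the same ranking as the first table in the Cardinality Explorer.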

For example, this panel shows us the OTel Collector's own internal metrics, which give us the information we need to monitor the collector's health. We can check:

whether the collector is receiving data from all configured sources (otelcol_receiver_accepted_spans_total, otelcol_receiver_accepted_metric_points_total)
whether any exporters are failing to send metrics or traces to the backend (otelcol_exporter_send_failed_metric_points_total)
how fast the collector is processing incoming telemetry (otelcol_processor_accepted_spans_total, otelcol_processor_dropped_spans_total)
whether any backlogs or bottlenecks are forming in any pipeline stage

As a concrete example, the metric otelcol_exporter_prometheusremotewrite_consumers tells me how many “consumer pipelines” are connected to that exporter at a given moment, so I can verify that the metrics pipelines are actually connected to the remote write backend and not idle.
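
The health checks above can be turned into a small script. This is a sketch under stated assumptions: it uses the standard /api/v1/query endpoint, the helper names are hypothetical, and the sample response is invented to show the shape of an instant-query result.

```python
import json
import urllib.parse
import urllib.request

# Per-exporter failure rate over the last 5 minutes.
FAILED_POINTS_QUERY = "rate(otelcol_exporter_send_failed_metric_points_total[5m])"


def instant_query(base_url, query):
    """Run an instant query against VictoriaMetrics and return the decoded JSON."""
    params = urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=10) as resp:
        return json.load(resp)


def failing_exporters(payload):
    """Return {exporter: failures per second} for every series with a non-zero rate."""
    failing = {}
    for series in payload["data"]["result"]:
        rate = float(series["value"][1])  # value is [timestamp, "number as string"]
        if rate > 0:
            failing[series["metric"].get("exporter", "unknown")] = rate
    return failing


# Hypothetical instant-query response with one healthy and one failing exporter.
sample_result = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"exporter": "prometheusremotewrite"}, "value": [1755777600, "0"]},
            {"metric": {"exporter": "prometheus"}, "value": [1755777600, "0.5"]},
        ],
    },
}

print(failing_exporters(sample_result))
```

Against the live stack the call would be failing_exporters(instant_query("http://localhost:18428", FAILED_POINTS_QUERY)), assuming the host port from this demo; an empty dict means no exporter reported failed metric points in the window.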

We can drill down into each query from the Query view. Using the query {job="otelcol-contrib"}, I can activate Trace query by enabling the toggle below the query field and executing the query again to get the results.

When you enable query tracing in VictoriaMetrics, each line in the output represents a step in the query’s execution. The trace shows how long the step took, which API endpoint was called and what kind of operation was performed. You’ll also see the time range that was evaluated, the resolution of the query, and the actual expression being run. At the end of the line, VictoriaMetrics reports how many time series matched the query and how many data points were processed. Taken together, these lines form a detailed breakdown of the query lifecycle, making it easier to understand how the system executed your request and where most of the work happened.
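
Outside VMUI, the same trace can be requested over HTTP by adding trace=1 to a query request; the response then carries a trace object whose nodes have duration_msec, message, and children fields. The following sketch prints that tree as indented lines. The helper names are hypothetical, and the sample messages are placeholders rather than real VictoriaMetrics output.

```python
import json
import urllib.parse
import urllib.request


def fetch_trace(base_url, query):
    """Run an instant query with trace=1 and return the 'trace' object."""
    params = urllib.parse.urlencode({"query": query, "trace": "1"})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=10) as resp:
        return json.load(resp)["trace"]


def format_trace(node, depth=0):
    """Render a trace node and its children as indented 'duration: message' lines."""
    lines = [f"{'  ' * depth}{node['duration_msec']:.3f}ms: {node['message']}"]
    for child in node.get("children") or []:
        lines.extend(format_trace(child, depth + 1))
    return lines


# Placeholder trace with the same shape as a real one.
sample_trace = {
    "duration_msec": 12.5,
    "message": "root step",
    "children": [
        {
            "duration_msec": 10.0,
            "message": "child step",
            "children": [{"duration_msec": 8.25, "message": "grandchild step"}],
        }
    ],
}

print("\n".join(format_trace(sample_trace)))
```

With the demo stack running, format_trace(fetch_trace("http://localhost:18428", '{job="otelcol-contrib"}')) should give a text version of what the Trace query toggle shows, assuming the host port used here.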

Let’s save one of these traces to JSON and explore the Trace Analyzer.

Upload the trace JSON file into Trace Analyzer, then expand all nodes to explore details.

Once you upload a trace JSON into VictoriaMetrics’ Trace Analyzer, you get more than just raw query execution lines: it offers interactive exploration and analysis. You can see the query broken down into a tree of operations, where each node shows the type of step (e.g., evaluation, rollup, merge), how long it took, and how many series and points it processed. You can expand or collapse nodes to focus on the steps that matter most and quickly identify bottlenecks or expensive parts of your query. It also highlights cache hits vs full computation, so you can see whether repeated queries benefit from caching.

Additionally, the interface lets you filter, sort, and navigate through large traces, making it easier to understand complex queries at a glance. Essentially, the Trace Analyzer transforms a static JSON trace into a visual, interactive breakdown that makes query performance and optimization much more intuitive.

Beyond visualizing queries, Trace Analyzer helps you optimize performance by revealing which steps take the most time or process the most series. It lets you monitor cache effectiveness, compare runs to spot regressions, and provides actionable insights for query tuning, infrastructure adjustments, and team education.
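
To find the most expensive steps without clicking through the tree, a saved trace can also be ranked by self time, meaning a node's own duration minus the time attributed to its children. This is a rough sketch: it assumes the saved file has the same duration_msec, message, and children structure as the API trace, the helper names are hypothetical, and the sample trace is a placeholder.

```python
import json


def load_trace(path):
    """Load a trace JSON file saved from VMUI."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def self_times(node):
    """Yield (self time in ms, message) for a node and all of its descendants."""
    children = node.get("children") or []
    child_total = sum(c["duration_msec"] for c in children)
    # Clamp at zero: children may overlap in time, so their sum can exceed the parent.
    yield (max(node["duration_msec"] - child_total, 0.0), node["message"])
    for child in children:
        yield from self_times(child)


def slowest_steps(trace, top=3):
    """Return the `top` steps with the largest self time, slowest first."""
    return sorted(self_times(trace), key=lambda item: item[0], reverse=True)[:top]


# Placeholder trace with the same shape as a real one.
sample_trace = {
    "duration_msec": 12.5,
    "message": "root step",
    "children": [
        {
            "duration_msec": 10.0,
            "message": "child step",
            "children": [{"duration_msec": 8.25, "message": "grandchild step"}],
        }
    ],
}

for msec, message in slowest_steps(sample_trace):
    print(f"{msec:.2f}ms  {message}")
```

Running slowest_steps(load_trace("trace.json")) on two saved runs of the same query is a simple way to compare them; the file name here is only an example.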

What’s Ahead: VictoriaTraces and the Bigger Observability Picture

While Trace Analyzer gives deep insights into how VictoriaMetrics executes individual queries, VictoriaTraces takes observability to the next level. VictoriaTraces is designed to handle distributed tracing at scale, collecting and storing OpenTelemetry spans across services and applications. It complements VictoriaMetrics by focusing on end-to-end request flows, while Trace Analyzer focuses on query-level execution details.

Together, they provide a full-spectrum view of your observability stack: VictoriaTraces shows how requests move through your distributed system, helping you detect latency or bottlenecks across services, while Trace Analyzer lets you dig into the performance of your queries themselves, identifying expensive steps, cache behavior, and series volume. By combining the two, developers and SREs can correlate application-level traces with database/query-level insights, making performance optimization more precise and actionable.

This integration strengthens the overall ecosystem, enabling teams to move from detection to diagnosis to optimization faster, all within a unified, efficient monitoring environment.

Ready to explore more and engage with the community?

Have questions for me? Give me a shout on:

Github

Bluesky: @didiviking.bsky.social

X: @dianavtodea

Mastodon: @dianatodea