Muhammad Syukur Abadi

Application Tracing With Golang, OpenTelemetry, and Grafana Tempo

Trace

A trace is a record, or history, of a request. It gives us the big picture of what happens from the moment the application accepts a request until it returns a response. This record is built out of spans.

Span

A span is a unit that records a particular operation within a certain time window. A span contains the operation name, timestamps for when the operation starts and ends, and other information in the form of span attributes such as the HTTP method, URL, and route. Here is an example of a hello span in JSON format.

{
  "name": "hello",
  "context": {
    "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "051581bf3cb55c13"
  },
  "parent_id": null,
  "start_time": "2022-04-29T18:52:58.114201Z",
  "end_time": "2022-04-29T18:52:58.114687Z",
  "attributes": {
    "http.route": "some_route1"
  },
  "events": [
    {
      "name": "Guten Tag!",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "event_attributes": 1
      }
    }
  ]
}

The span above contains information about:

  • its name, hello
  • its lifetime, in the start_time and end_time fields
  • the http.route span attribute, which records the matched route
  • the events field, which tells us what happened during the span's lifetime

The hello span is a root (parent) span, because the value of its parent_id field is null. Now pay attention to the hello-greetings span below.

{
  "name": "hello-greetings",
  "context": {
    "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "5fb397be34d26b51"
  },
  "parent_id": "051581bf3cb55c13",
  "start_time": "2022-04-29T18:52:58.114304Z",
  "end_time": "2022-04-29T22:52:58.114561Z",
  "attributes": {
    "http.route": "some_route2"
  },
  "events": [
    {
      "name": "hey there!",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "event_attributes": 1
      }
    },
    {
      "name": "bye now!",
      "timestamp": "2022-04-29T18:52:58.114585Z",
      "attributes": {
        "event_attributes": 1
      }
    }
  ]
}

The value of the parent_id field on the hello-greetings span is the same as the value of span_id on the hello span, so we call hello-greetings a child span of hello. Below is the hello-salutations span, whose parent_id is also equal to hello's span_id.

{
  "name": "hello-salutations",
  "context": {
    "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "93564f51e1abe1c2"
  },
  "parent_id": "051581bf3cb55c13",
  "start_time": "2022-04-29T18:52:58.114492Z",
  "end_time": "2022-04-29T18:52:58.114631Z",
  "attributes": {
    "http.route": "some_route3"
  },
  "events": [
    {
      "name": "hey there!",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "event_attributes": 1
      }
    }
  ]
}
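To see how this parent/child relationship comes about in code, here is a minimal, self-contained Go sketch. It uses the stdouttrace exporter purely so the spans are printed as JSON, similar to the examples above; the rest of this article exports to a collector instead. A span started from the context returned by a parent span automatically records that parent's span_id as its parent_id.

package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
    sdkTrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    // Print finished spans to stdout as JSON, similar to the examples above.
    exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
    if err != nil {
        log.Fatal(err)
    }

    tp := sdkTrace.NewTracerProvider(sdkTrace.WithBatcher(exporter))
    defer tp.Shutdown(context.Background())
    otel.SetTracerProvider(tp)

    tracer := otel.Tracer("example")

    // Parent span: its span_id becomes the parent_id of every span started
    // from the context it returns.
    ctx, parent := tracer.Start(context.Background(), "hello")
    defer parent.End()

    // Child span: started from the parent's context, so it shares the same
    // trace_id and records the parent's span_id as its parent_id.
    _, child := tracer.Start(ctx, "hello-greetings")
    child.AddEvent("hey there!")
    child.End()
}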

OpenTelemetry Collector

A collector is a component that receives, processes, and exports telemetry data. The OpenTelemetry Collector can do this for data in various formats, such as Jaeger, Prometheus, or Fluent Bit. The OpenTelemetry Collector is built from several components:

  1. Receiver

A receiver focuses on collecting data from various sources and formats. Here is an example of a receivers configuration.

receivers:
  # Data sources: logs
  fluentforward:
    endpoint: 0.0.0.0:8006

  # Data sources: metrics
  hostmetrics:
    scrapers:
      cpu:
      disk:
      filesystem:
      load:
      memory:
      network:
      process:
      processes:
      paging:

  # Data sources: traces
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_binary:
      thrift_compact:
      thrift_http:

  # Data sources: traces, metrics, logs
  kafka:
    protocol_version: 2.0.0

  # Data sources: traces, metrics
  opencensus:

  # Data sources: traces, metrics, logs
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: cert.pem
          key_file: cert-key.pem
      http:
        endpoint: 0.0.0.0:4318

  # Data sources: metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 5s
          static_configs:
            - targets: [localhost:8888]

  # Data sources: traces
  zipkin:
  2. Processor

A processor is used to process data coming from receivers; processing includes filtering, renaming, or recalculating attributes. We can define not just one processor but several, depending on our needs, and the resulting telemetry data depends on the order in which the processors run. Here is an example of a processors definition.

processors:
  # Data sources: traces
  attributes:
    actions:
      - key: environment
        value: production
        action: insert
      - key: db.statement
        action: delete
      - key: email
        action: hash

  # Data sources: traces, metrics, logs
  batch:

  # Data sources: traces, metrics, logs
  filter:
    error_mode: ignore
    traces:
      span:
        - 'attributes["container.name"] == "app_container_1"'
        - 'resource.attributes["host.name"] == "localhost"'
        - 'name == "app_3"'
      spanevent:
        - 'attributes["grpc"] == true'
        - 'IsMatch(name, ".*grpc.*")'
    metrics:
      metric:
        - 'name == "my.metric" and resource.attributes["my_label"] == "abc123"'
        - 'type == METRIC_DATA_TYPE_HISTOGRAM'
      datapoint:
        - 'metric.type == METRIC_DATA_TYPE_SUMMARY'
        - 'resource.attributes["service.name"] == "my_service_name"'
    logs:
      log_record:
        - 'IsMatch(body, ".*password.*")'
        - 'severity_number < SEVERITY_NUMBER_WARN'

  # Data sources: traces, metrics, logs
  memory_limiter:
    check_interval: 5s
    limit_mib: 4000
    spike_limit_mib: 500

  # Data sources: traces
  resource:
    attributes:
      - key: cloud.zone
        value: zone-1
        action: upsert
      - key: k8s.cluster.name
        from_attribute: k8s-cluster
        action: insert
      - key: redundant-attribute
        action: delete

  # Data sources: traces
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: 15

  # Data sources: traces
  span:
    name:
      to_attributes:
        rules:
          - ^\/api\/v1\/document\/(?P<documentId>.*)\/update$
      from_attributes: [db.svc, operation]
      separator: '::'
  3. Exporter

An exporter sends the processed telemetry data to backends such as Jaeger, Zipkin, Prometheus, or Kafka. Each key under exporters uses the type/name format, where type defines the kind of exporter (Jaeger, Kafka, OTLP, and so on) and name is an optional suffix used to distinguish exporters of the same type. Here is an example of an exporters configuration.

exporters:
  # Data sources: traces, metrics, logs
  file:
    path: ./filename.json

  # Data sources: traces
  otlp/jaeger:
    endpoint: jaeger-server:4317
    tls:
      cert_file: cert.pem
      key_file: cert-key.pem

  # Data sources: traces, metrics, logs
  kafka:
    protocol_version: 2.0.0

  # Data sources: traces, metrics, logs
  # NOTE: Prior to v0.86.0 use `logging` instead of `debug`
  debug:
    verbosity: detailed

  # Data sources: traces, metrics
  opencensus:
    endpoint: otelcol2:55678

  # Data sources: traces, metrics, logs
  otlp:
    endpoint: otelcol2:4317
    tls:
      cert_file: cert.pem
      key_file: cert-key.pem

  # Data sources: traces, metrics
  otlphttp:
    endpoint: https://otlp.example.com:4318

  # Data sources: metrics
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: default

  # Data sources: metrics
  prometheusremotewrite:
    endpoint: http://prometheus.example.com:9411/api/prom/push
    # When using the official Prometheus (running via Docker)
    # endpoint: 'http://prometheus:9090/api/v1/write', add:
    # tls:
    #   insecure: true

  # Data sources: traces
  zipkin:
    endpoint: http://zipkin.example.com:9411/api/v2/spans

Instrumenting Go Application With OpenTelemetry Tracing

Spin Up Collector

Trace and span data emitted by the server will be collected, processed, and exported by the OpenTelemetry Collector. The collector's behavior is defined by a configuration file, as shown below.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  otlphttp/trace:
    endpoint: http://tempo:4318
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/trace]

The OpenTelemetry Collector collects all emitted traces and spans on the endpoints configured under receivers.otlp. The HTTP endpoint is 0.0.0.0:4318, where 0.0.0.0 means the collector listens on all IPv4 addresses.

After traces and spans have been collected by the receiver, they are handled by the processors. In this example, the data is batched before export.

The processed data is then exported to a tracing backend such as Grafana Tempo. Exporters are defined in the exporters section.

Finally, we wire receiving, processing, and exporting of trace data together in the service section.

Defining Exporter

At the code level, the exporter connects the application to the collector. The NewTracer function below creates an exporter and registers it with a tracer provider.

package trace

import (
    "context"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    traceSdk "go.opentelemetry.io/otel/sdk/trace"
)

var globalServiceName string

func NewTracer(
    ctx context.Context,
    serviceName string,
) error {
    propagator := propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    )
    otel.SetTextMapPropagator(propagator)

    // Create OpenTelemetry tracing client that exports
    // trace data to OpenTelemetry HTTP receiver
    traceClientHTTP := otlptracehttp.NewClient(
        otlptracehttp.WithInsecure(),
    )

    traceExporter, err := otlptrace.New(
        ctx,
        traceClientHTTP,
    )
    if err != nil {
        return err
    }

    res, err := resource.New(
        ctx,
        resource.WithContainer(),
        resource.WithFromEnv(),
    )
    if err != nil {
        return err
    }

    tracerProvider := traceSdk.NewTracerProvider(
        traceSdk.WithBatcher(
            traceExporter,
            traceSdk.WithBatchTimeout(time.Second),
        ),
        traceSdk.WithResource(
            res,
        ),
    )

    globalServiceName = serviceName

    otel.SetTracerProvider(tracerProvider)

    return nil
}
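Note that the serviceName argument is only stored in globalServiceName and used later as the tracer name; the service.name resource attribute itself comes from the OTEL_SERVICE_NAME environment variable read by resource.WithFromEnv. As a hedged alternative, not used in this article, the service name can also be attached to the resource in code. The semconv import version below is an assumption; use the version shipped with your SDK.

package trace

import (
    "context"

    "go.opentelemetry.io/otel/sdk/resource"
    semconv "go.opentelemetry.io/otel/semconv/v1.26.0" // assumed version; pick the one matching your SDK
)

// newResource is a sketch of building the resource with service.name set in
// code instead of relying only on environment variables.
func newResource(ctx context.Context, serviceName string) (*resource.Resource, error) {
    return resource.New(
        ctx,
        resource.WithContainer(),
        resource.WithFromEnv(),
        resource.WithAttributes(
            semconv.ServiceNameKey.String(serviceName), // sets "service.name"
        ),
    )
}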

Define Tracing Components

In this article, we are going to record what happens at the HTTP layer and at the business-logic (service) layer by generating a span from each component: HTTPLevelSpan traces the HTTP handler, while ServiceLevelSpan traces the service level.

package trace

import (
    "context"
    "fmt"
    "net/http"
    "runtime"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/trace"

    "go.opentelemetry.io/otel/attribute"
)

func HTTPLevelSpan(
    ctx context.Context,
    r *http.Request,
) (context.Context, trace.Span) {
    spanName := fmt.Sprintf("%s %s", r.Method, r.URL.String())
    ctx, span := otel.Tracer(globalServiceName).Start(ctx, spanName)

    if span.IsRecording() {
        span.SetAttributes(attribute.String("http.request.method", r.Method))
        span.SetAttributes(attribute.String("server.address", r.RemoteAddr))
        span.SetAttributes(attribute.String("url.full", r.URL.String()))
    }

    return ctx, span
}

func ServiceLevelSpan(
    ctx context.Context,
) (context.Context, trace.Span) {
    pc := make([]uintptr, 10)
    runtime.Callers(2, pc)
    f := runtime.FuncForPC(pc[0])

    ctx, span := otel.Tracer(globalServiceName).Start(ctx, f.Name())

    if span.IsRecording() {
        span.SetAttributes(attribute.String("code.function.name", f.Name()))
    }

    return ctx, span
}

When tracing any aspect of an application, such as HTTP, service, gRPC, or database calls, it is recommended to follow the Semantic Conventions (semconv). Semantic conventions give generated traces and spans consistent, meaningful attribute names. In the code above, code.function.name, server.address, and url.full are examples of semantic convention attributes.

Instrumenting Application

The code below demonstrates the usage of HTTPLevelSpan and ServiceLevelSpan.

package server

import (
    "log"
    "net/http"
)

func RunServer() {
    http.HandleFunc("/", home())
    if err := http.ListenAndServe(":9320", nil); err != nil {
        log.Fatalf("could not start server: %v\n", err)
    }
}
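Before looking at the handler itself, a quick side note: HTTP-level spans can also be produced without a hand-written helper by wrapping the handler with the contrib otelhttp middleware. This is not used in the rest of this article; it is only a sketch, assuming go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp is added as a dependency.

package server

import (
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// RunServerWithMiddleware is a hypothetical variant of RunServer: otelhttp
// starts a server span for every incoming request automatically.
func RunServerWithMiddleware() {
    handler := otelhttp.NewHandler(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("Hello, World!"))
    }), "home")

    http.Handle("/", handler)
    if err := http.ListenAndServe(":9320", nil); err != nil {
        log.Fatalf("could not start server: %v\n", err)
    }
}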
package server

import (
    "net/http"
    "strconv"

    "go-trace-demo/service"
    utilsTrace "go-trace-demo/utils/trace"
)

func home() http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context()

        ctx, span := utilsTrace.HTTPLevelSpan(ctx, r)
        defer span.End()

        failQuery := r.URL.Query().Get("fail")

        if failQuery != "" {

            isFail, err := strconv.ParseBool(failQuery)
            if err != nil {
                w.WriteHeader(http.StatusInternalServerError)
                w.Write([]byte("error parse query"))

                return
            }

            if isFail {
                w.WriteHeader(http.StatusInternalServerError)
                w.Write([]byte("encounter fail"))

                return
            }
        }

        err := service.GetUserByID(ctx)
        if err != nil {
            return
        }

        err = service.ListStores(ctx)
        if err != nil {
            return
        }

        w.Write([]byte("Hello, World!"))
    }
}
package service

import (
    "context"

    utilsTrace "go-trace-demo/utils/trace"
)

func GetUserByID(
    ctx context.Context,
) error {
    ctx, span := utilsTrace.ServiceLevelSpan(ctx)

    defer span.End()

    return nil
}

func ListStores(
    ctx context.Context,
) error {
    ctx, span := utilsTrace.ServiceLevelSpan(ctx)

    defer span.End()

    return nil
}
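The service functions above always return nil. When a service call can fail, it is worth recording the error on its span so the failure shows up in Tempo. Here is a hedged sketch; GetUserByIDWithError is a hypothetical variant, not part of the demo code.

package service

import (
    "context"
    "errors"

    "go.opentelemetry.io/otel/codes"

    utilsTrace "go-trace-demo/utils/trace"
)

// GetUserByIDWithError is a hypothetical variant that records a failure on its span.
func GetUserByIDWithError(ctx context.Context) error {
    _, span := utilsTrace.ServiceLevelSpan(ctx)
    defer span.End()

    err := errors.New("user not found") // placeholder failure for illustration
    if err != nil {
        span.RecordError(err)                    // attaches an exception event to the span
        span.SetStatus(codes.Error, err.Error()) // marks the span status as error
        return err
    }

    return nil
}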

Below is the main function.

package main

import (
    "context"
    "log"
    "os"

    // import paths assume the go-trace-demo module layout used above
    "go-trace-demo/server"
    utilsTrace "go-trace-demo/utils/trace"
)

func main() {
    ctx := context.Background()

    serviceName := os.Getenv("OTEL_SERVICE_NAME")

    err := utilsTrace.NewTracer(ctx, serviceName)
    if err != nil {
        log.Fatalln("failed to set up tracer:", err)
    }

    server.RunServer()
}
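One thing main does not do is shut the tracer provider down, so spans still buffered by the batcher when the process exits can be lost. Below is a hedged sketch of a hypothetical variant, NewTracerWithShutdown (not part of the demo code, propagator setup omitted for brevity), that also returns a shutdown function; main could then call shutdown, err := utilsTrace.NewTracerWithShutdown(ctx, serviceName) and defer shutdown(ctx).

package trace

import (
    "context"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/sdk/resource"
    traceSdk "go.opentelemetry.io/otel/sdk/trace"
)

// NewTracerWithShutdown is a hypothetical variant of NewTracer that also
// returns a function that flushes and stops the provider, so buffered spans
// are exported before the process exits.
func NewTracerWithShutdown(ctx context.Context, serviceName string) (func(context.Context) error, error) {
    client := otlptracehttp.NewClient(otlptracehttp.WithInsecure())

    exporter, err := otlptrace.New(ctx, client)
    if err != nil {
        return nil, err
    }

    res, err := resource.New(ctx, resource.WithContainer(), resource.WithFromEnv())
    if err != nil {
        return nil, err
    }

    tp := traceSdk.NewTracerProvider(
        traceSdk.WithBatcher(exporter, traceSdk.WithBatchTimeout(time.Second)),
        traceSdk.WithResource(res),
    )

    globalServiceName = serviceName
    otel.SetTracerProvider(tp)

    // Shutdown flushes any spans still queued in the batcher.
    return tp.Shutdown, nil
}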

Glue Everything

We'll run the application, the OpenTelemetry Collector, Grafana Tempo, and Grafana using the Docker Compose file below.

version: "3"

networks:
  observability:

services:
  tempo:
    image: grafana/tempo:main-c5323bf
    command: [ "-config.file=/etc/tempo.yaml" ]
    volumes:
      - ./config/tempo-config.yaml:/etc/tempo.yaml
      - ./config/tempo-data:/var/tempo
    ports:
      - "3200:3200" # tempo
      - "4317" # open this port for OpenTelemetry gRPC exporter
      - "4318" # open this port for OpenTelemetry HTTP exporter
    networks:
      - observability

  otel-collector:
    image: otel/opentelemetry-collector:latest
    volumes:
      - ./config/otel-config.yaml:/etc/otel/config.yaml
    command:
      - '--config=/etc/otel/config.yaml'
    ports:
      - "4317:4317" #grpc
      - "4318:4318" #http
    depends_on:
      - tempo
    networks:
      - observability

  app:
    container_name: app
    build:
      context: ./
      dockerfile: ./Dockerfile
    ports:
      - 9320:9320
    networks:
      - observability
    depends_on:
      - otel-collector
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
      - OTEL_SERVICE_NAME=go-loki-app-demo

  grafana:
    environment:
      - GF_PATHS_PROVISIONING=/etc/grafana/provisioning
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    entrypoint:
      - sh
      - -euc
      - |
        mkdir -p /etc/grafana/provisioning/datasources
        cat <<EOF > /etc/grafana/provisioning/datasources/ds.yaml
        apiVersion: 1
        datasources:
        - name: Loki
          type: loki
          access: proxy 
          orgId: 1
          url: http://loki:3100
          basicAuth: false
          isDefault: true
          version: 1
          editable: false
        - name: Tempo
          type: tempo
          access: proxy 
          orgId: 1
          url: http://tempo:3200
          basicAuth: false
          version: 1
          editable: false
        EOF
        /run.sh
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    networks:
      - observability

Our application is defined in the services.app section. Be aware that we set the OTEL_EXPORTER_OTLP_ENDPOINT environment variable for the application. The reason is that when we define the OpenTelemetry HTTP exporter in code, we don't specify the collector's address; the OpenTelemetry SDK looks up the OTEL_EXPORTER_OTLP_ENDPOINT environment variable to know where the application should send trace data. If you want to target a specific collector explicitly, you can pass the WithEndpointURL option to otlptracehttp.NewClient. Additionally, we set the OTEL_SERVICE_NAME environment variable so OpenTelemetry can identify which service the trace data belongs to. If OTEL_SERVICE_NAME is not provided, OpenTelemetry falls back to a service name of unknown_service.
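For completeness, here is a minimal sketch of an exporter that targets the collector directly with WithEndpointURL instead of reading OTEL_EXPORTER_OTLP_ENDPOINT. WithEndpointURL requires a reasonably recent otlptracehttp release, and the URL below simply reuses the Compose service name; adjust both to your setup.

package trace

import (
    "context"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)

// newExplicitExporter is a sketch: it hardcodes the collector's OTLP/HTTP
// endpoint rather than relying on environment variables.
func newExplicitExporter(ctx context.Context) (*otlptrace.Exporter, error) {
    client := otlptracehttp.NewClient(
        otlptracehttp.WithEndpointURL("http://otel-collector:4318"), // http scheme implies a non-TLS connection
    )

    return otlptrace.New(ctx, client)
}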

Visualizing Traces Using Grafana Tempo

After the containers are up with Docker Compose, let's generate trace data by accessing localhost:9320/. Next, open Grafana at localhost:3000 and head to the Explore menu via the dropdown in the top-left corner.

Switch to the Search query type instead of TraceQL, then click the Service Name dropdown. It will show the service name we provided in OTEL_SERVICE_NAME. After clicking Run Query, we'll see a list of matching traces.

Explore the trace data by clicking any row in the Trace column; this opens the trace detail view.

Additionally, explore span attributes by clicking one of the spans. We will see the span attributes and resource attributes we defined when creating the OpenTelemetry tracer.
