Michael Murphy

OpenTelemetry in .NET

Introduction

Why OpenTelemetry?

As ecosystems grow, applications invariably get tied into services that handle logging, tracing and observability data. An example of this is using Grafana-specific agents and Prometheus for collecting data from containers.

OpenTelemetry offers an easy, platform-agnostic way to collect all this data. Instead of being tied to a specific platform or SaaS, you can migrate from one to another as you please. You can even use multiple platforms, e.g. coupling Grafana and Zipkin. This piece offers a summation of my journey into OpenTelemetry in .NET.

Instrumenting an application

What do we collect?

For this example, I'll use .NET to illustrate how instrumentation works. There are three key types of data that we can collect.

  1. Logs - Logs are the standard we're all used to: a captured piece of data at a point in time. This piece won't really go into them, as logs were already being collected and dispatched with other tooling, and I've found OpenTelemetry is still in its infancy with them.
  2. Traces - These are spans of time that can have multiple child spans, illustrating an event that occurred and the length of time it took. Traces can be distributed and tracked across multiple systems, provided those systems are instrumented for it. Traces also have TraceIds, in the form of a GUID, that can link logs and traces together.

    Causal relationships between Spans in a single Trace
    
            [Span A]  ←←←(the root span)
                |
         +------+------+
         |             |
     [Span B]      [Span C] ←←←(Span C is a `child` of Span A)
         |             |
     [Span D]      +---+-------+
                   |           |
               [Span E]    [Span F]
    

    Source of Illustration

  3. Metrics - Metrics are measurements of specific events that can be used to capture statistical and observability data, e.g. how long a request takes or CPU usage at a point in time.

Integrating instrumentation

Most of the setup in an application is pretty streamlined and gets integrated into the startup of the application. Separate endpoints in the OpenTelemetry Collector handle logs, traces and metrics.

Traces

For one of our applications, we set up a tracer as one of the .NET services. AddSource names the ActivitySource the tracer should listen to, the resource builder is shared across all the OpenTelemetry signals (similar to a host builder in .NET), and we also specify the endpoint of the collector.
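
The rb resource builder isn't shown in the snippet, so here's a rough sketch of my own of how it might be constructed; the service name and version are placeholders, and it assumes the OpenTelemetry.Resources extensions:

    // A rough sketch of the shared resource builder; "sample-app" and "1.0.0"
    // are illustrative values rather than the original application's.
    var rb = ResourceBuilder.CreateDefault()
        .AddService(serviceName: "sample-app", serviceVersion: "1.0.0");

With rb in place, the tracing setup looks like this: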

    services.AddOpenTelemetryTracing(options =>
    {
        // Listen to our custom ActivitySource.
        options.AddSource("Email.RiskScore.Tracing");
        options.SetResourceBuilder(rb)
            .SetSampler(new AlwaysOnSampler())
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddMySqlDataInstrumentation(settings => settings.SetDbStatement = true);
        // Send traces to the collector's OTLP endpoint.
        options.AddOtlpExporter(otlpOptions => { otlpOptions.Endpoint = new Uri(endpoint + "/api/v1/traces"); });
    });

In this case, I've included the ASP.NET Core instrumentation, which captures the incoming requests handled by the application. On top of that, we're capturing each outgoing HTTP request and all SQL queries. This allows a far more accurate picture to be painted of what the requests are doing and how much time they're taking out of the application flow.

On top of that, it's also possible to create custom traces. First, a static ActivitySource needs to be created, as below. Once that's done, you can access it and use it to write traces.

public static ActivitySource Writer = null;

public static void Init(string tracerName)
{
    Writer = new ActivitySource(tracerName, "1.0.0");

    // StartActivity returns null when nothing is listening to the source,
    // so register a listener that samples everything.
    var listener = new ActivityListener()
    {
        ShouldListenTo = _ => true,
        Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData
    };
    ActivitySource.AddActivityListener(listener);
}

Following this, introduce a trace around an operation. In the case of retrieving data, we mark the activity as Client because we're calling out to retrieve data; Server and Internal are other ActivityKind values we use. On top of that, we can store tags with associated info, e.g. parameters.

These end up merging into existing traces with the capturing of additional data.

var activity = SharedTelemetryUtilities.Writer?.StartActivity("get_phone_data", ActivityKind.Client);
// Some operation takes place for a length of time.
activity?.AddTag("param", paramValue);
activity?.Stop();
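
Worth noting, as an addition of my own rather than something from the snippet above: an activity can also record a failure status, which is part of what makes traces useful for seeing whether an operation failed and why. A rough sketch (ActivityStatusCode needs a recent System.Diagnostics.DiagnosticSource):

    using var activity = SharedTelemetryUtilities.Writer?.StartActivity("get_phone_data", ActivityKind.Client);
    try
    {
        // The operation being traced goes here.
    }
    catch (Exception ex)
    {
        // Mark the span as failed and attach a hint about what went wrong.
        activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
        activity?.AddTag("exception.type", ex.GetType().Name);
        throw;
    }

Disposing the activity (via using) stops it, so no explicit Stop call is needed here.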

In terms of accessing traces, the optimal way is to embed the trace ID within your logs; then you can see the traffic associated with those logs. Tooling such as Grafana Tempo or Zipkin can produce a flame graph with a nice breakdown of the request. Below is an example of a trace from Grafana that was produced entirely by auto-instrumentation, which picks up all outgoing and incoming calls, down to URLs and query data. You can see how the length of each span gives a clear picture of which operations are taking the most time, and you can delve further into each call in the explorer.

Trace Grafana
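
On the log-correlation point above, here's a minimal sketch of my own (not from the original post) of pulling the current trace ID into a log scope, assuming an injected ILogger and the usual System.Diagnostics and Microsoft.Extensions.Logging usings:

    // Include the current TraceId in a log scope so log lines can be matched
    // with the corresponding trace in Tempo or Zipkin.
    var traceId = Activity.Current?.TraceId.ToString() ?? "none";
    using (logger.BeginScope(new Dictionary<string, object> { ["TraceId"] = traceId }))
    {
        logger.LogInformation("Retrieved phone data");
    }

Whether the scope actually shows up in your log output depends on the logging provider's configuration (e.g. IncludeScopes for the console logger).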

Metrics

Metrics allow statistical data to be captured; this can be used in conjunction with tooling such as Prometheus.

    services.AddOpenTelemetryMetrics(options =>
    {
        options.SetResourceBuilder(rb)
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation();
        options.AddMeter(SharedTelemetryUtilities.MeterName);
        options.AddView(TelemetryProperties.RequestMeter, new ExplicitBucketHistogramConfiguration()
        {
            Boundaries = new double[] { 50, 75, 100, 250, 500, 1000, 2000, 5000, 10000 }
        });
        options.AddOtlpExporter(otlpOptions => { otlpOptions.Endpoint = new Uri(endpoint + "/api/v1/metrics"); });
    });

Setup is similar to traces. The one slightly different piece is the inclusion of the view, which allows non-standard buckets to be used to track our request duration; we include the name of the instrument so the SDK knows which one to apply it to. On top of that, we've introduced some tracking around the duration of HttpClient requests and other ASP.NET functionality.

    // Create the histogram from the shared meter, then time the operation.
    var histogram = SharedTelemetryUtilities.Meter.CreateHistogram<double>(TelemetryProperties.RequestMeter);
    var stopwatch = Stopwatch.StartNew();

    // Operation occurs here.

    histogram.Record(stopwatch.Elapsed.TotalMilliseconds);

Similar to the trace, we can access a static meter and create our preferred instrument from it. Histograms make sense for statistical info, but counters and other instruments are also available on the Meter depending on one's needs. Once again, we're still using built-in .NET functionality to measure this data.
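
As a quick illustration of one of those other instruments, here's a sketch of my own reusing the shared meter from the snippets above (the counter name and tag are made up):

    // A counter on the same shared Meter, incremented once per processed request.
    var requestCounter = SharedTelemetryUtilities.Meter.CreateCounter<long>("requests_processed");
    requestCounter.Add(1, new KeyValuePair<string, object?>("endpoint", "/phone-data"));

Like the histogram, this only reaches the collector because the meter's name was registered via AddMeter in the metrics setup.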

Below is such data being sent to Grafana. The particularly great thing about capturing metrics like this is that your dashboards tend to load near-instantly, because they're not attempting to parse through logs, for example.

Metrics dashboard in Grafana

Collecting the data

So, one way in which we're being platform agnostic is that built-in .NET tooling is used for instrumentation. The next is the collector. Tools such as Grafana or Datadog have their own agents that tend to run alongside your containers; in the world of OpenTelemetry, we can let those agents die the deaths they deserve.

The OpenTelemetry Collector needs to run alongside your application, similar to those agents. In the example below, we send metrics to a Prometheus instance and traces to Tempo, both hosted in Grafana Cloud. I've obfuscated API keys.

receivers:
    prometheus:
        config:
            scrape_configs:
                - job_name: 'otel-collector'
                  scrape_interval: 10s
                  static_configs:
                      - targets: ['0.0.0.0:8888']
    otlp:
        protocols:
            grpc:
            http:
processors:
    batch:
        timeout: 10s
    resourcedetection:
        detectors: [ecs, system, env]
exporters:
    otlp:
        endpoint: tempo-us-central1.grafana.net:443
        headers:
            authorization: Basic [ApiKey]
    prometheusremotewrite:
        endpoint: https://364320:api_key@prometheus-prod-10-prod-us-central-0.grafana.net/api/prom/push
        external_labels:
            env: dev
            region: us-west-2
            app: sample-app
service:
    pipelines:
        traces:
            receivers: [otlp]
            processors: [batch, resourcedetection]
            exporters: [otlp]
        metrics:
            receivers: [prometheus]
            processors: [batch, resourcedetection]
            exporters: [prometheusremotewrite]

Now I'll break down this config to understand each piece of it.

Receivers

Receivers are effectively where you configure how data gets into the collector. In this case, I've configured a Prometheus receiver, and we're also using OTLP itself, which is the OpenTelemetry standard protocol. With this combination we gather data about the performance of the collector itself and use the Prometheus node exporter to expose additional host data.

receivers:
    prometheus:
        config:
            scrape_configs:
                - job_name: 'otel-collector'
                  scrape_interval: 10s
                  static_configs:
                      - targets: ['0.0.0.0:8888']
                - job_name: 'node'
                  scrape_interval: 30s
                  static_configs:
                      - targets: ['localhost:9100']
    otlp:
        protocols:
            grpc:
            http:

If we wished to expand this to use other tooling, it would just be a matter of adding it here. Some receivers require no config at all, e.g. zipkin: on its own gives a default Zipkin configuration. The collected data won't go anywhere, though, unless it's configured at the exporter level and wired into a pipeline.

Processors

My use of processors is pretty basic. Processors operate on the data itself, but can also enrich it with contextual information such as regional info. Common usages include sampling, to reduce the amount of data being sent to providers, and preventing the collector from overusing memory. In our case, we rely on batch and resource detection.

Batch allows metrics, traces etc. to be bundled up before sending. The default batch size is 8192 items; the timeout is the maximum time to wait before sending regardless of how full the batch is. I'd highly recommend batching if you're dealing with high volumes of traffic.

processors:
    batch:
        timeout: 10s
    resourcedetection:
        detectors: [ecs, system, env]

Resource detection gathers data about the kind of system the collector is running on. In this case it will gather standard machine info, Amazon ECS-specific details, and any associated environment variables.

Exporters

Exporters configure where the data will go. Metrics and traces, for example, can each go to multiple different locations, and we can embed authentication info into the config. In the case of metrics, I've included custom labelling for sending metrics to Prometheus, and I've configured authentication for Grafana Tempo.

exporters:
    otlp:
        endpoint: tempo-us-central1.grafana.net:443
        headers:
            authorization: Basic [Base64 ApiKey]
    prometheusremotewrite:
        endpoint: https://364320:api_key@prometheus-prod-10-prod-us-central-0.grafana.net/api/prom/push
        external_labels:
            env: dev
            region: us-west-2
            app: sample-app

Service

At this layer, we wire up which of the pieces we've configured we actually want to use. This allows for a quick way to swap in different tooling if necessary, for example when debugging in a dev environment.

service:
    pipelines:
        traces:
            receivers: [otlp]
            processors: [batch, resourcedetection]
            exporters: [otlp]
        metrics:
            receivers: [prometheus]
            processors: [batch,resourcedetection]
            exporters: [prometheusremotewrite]

The pipelines can be thought of like a deployment pipeline: we have multiple pieces that can be used, but if they're not wired into a pipeline, they won't be.

Traces, metrics and logs are each configured separately. In this case we just have traces and metrics configured.

The otlp receiver is used for traces; the pipeline batches the data and embeds additional resource information into the traces. They are then exported over the OTLP protocol to Grafana's Tempo service.

Meanwhile, the metrics pipeline uses the data gathered by the prometheus receiver and the same processors. These are then exported by the Prometheus remote write exporter we configured.

Conclusion

OpenTelemetry offers a solution that allows far greater control over our data, and this piece has only touched upon its advantages. It has widespread adoption from the open-source community and the major telemetry services.

There are still points that are not fully up to scratch, as much of the tooling is still in beta; however, I am very much of the opinion that it's production ready. A big advantage is that it supports SaaS tooling such as Grafana, Datadog and Splunk. Another huge benefit is that it forces developers to think of more than simply logs. Traces, for example, can offer far greater utility than logs much of the time, because they're constantly capturing request details, including whether requests are failing and why.

The overall support for the different languages is very strong, although many of the libraries are still in beta. For any companies already using OpenTracing or OpenCensus, this should be a particularly straightforward transition. AWS also offers a custom collector distribution if needed, although I have found the stock OpenTelemetry one sufficient.


Couple of gotchas

  • When defining your instance ID, make sure it's set to a random GUID or something guaranteed to be unique, e.g. the ECS task ID. It acts as your unique identifier for OTel, and with multiple instances your metrics will become a mess otherwise (see the sketch after this list).
  • Things like traces and metrics may need custom implementations for tracking the invocations and events.
  • There will be bugs; that goes with the newness of the platform, and everyone from Datadog to Grafana is still in the process of integrating with it.
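
On that first gotcha, a rough sketch of setting a unique instance ID on the resource builder; the names are illustrative, and the ECS task ID (or any other guaranteed-unique value) can be swapped in for the GUID:

    // Give each running instance a unique service.instance.id so metrics from
    // multiple containers don't get merged together.
    var rb = ResourceBuilder.CreateDefault()
        .AddService(
            serviceName: "sample-app",
            serviceVersion: "1.0.0",
            serviceInstanceId: Guid.NewGuid().ToString());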
