Dmitrii
Observability with Grafana Cloud and OpenTelemetry in .NET microservices

People say that the application development lifecycle consists of 3 steps:

  • make it work
  • make it right ← we’re here
  • make it fast

Suppose you’re developing a microservice. It might communicate over REST, gRPC, Kafka, or whatever. You’ve completed all functional and non-functional requirements, including authentication/authorization and validation; your application is secure, scalable, and solves a business problem.

In this article, you’ll find out how to make your app production-ready, whether it is cloud native and hosted in Kubernetes, or more traditional and running without containers.

We will cover popular tools and frameworks aiming to solve common needs for every application without reinventing the wheel:

  • Grafana Cloud (Prometheus, Grafana, Loki, Tempo),
  • OpenTelemetry,
  • Serilog

If you haven’t heard of them, no worries, the article applies to any experience level. You can find a fully working demo project on GitHub.

Observability/Monitoring

Observability means collecting data that explains the state of your application. For production environments it is critical to know how your application behaves. An Nginx blog post simplified it into 3 questions:

  • Metrics – “Is there a problem?”
  • Traces – “Where is the problem?”
  • Logs – “What is the problem?”

Plenty of cool tools capable of doing full monitoring of your system can answer all the questions above. The problem we face nowadays isn’t knowing the answer but having too many answers.

This article can help you make the right decision and save tons of time and money for your business. The approach described here works best for start-ups and small companies.

From a high-level perspective, monitoring stacks fall into 2 popular groups: SaaS products and self-hosted open-source distributions.

Today we’re talking about Grafana Cloud (Prometheus for metrics, Loki for logs, Tempo for traces), a SaaS product. Its free plan includes:

  • 10,000 series for Prometheus or Graphite metrics
  • 50 GB of logs
  • 50 GB of traces

It's very generous compared to competitors!

If you exceed these limits, you can choose to continue on the SaaS plan or switch to the open-source self-hosted distribution.

If you decide to give it a try, sign up here before moving to the next step.

Before we start, two portal shortcuts will be helpful: your Grafana instance at https://{YourOrganizationName}.grafana.net and your organization settings at https://grafana.com/orgs/{YourOrganizationName}.

Grafana Agent

Grafana Agent is responsible for delivering metrics and traces from your application to the cloud. We use the Grafana Agent for **metrics** and **traces** only; the approach for **logging** will be different.

The data flow is illustrated below:

Grafana Agent

Open https://{YourOrganizationName}.grafana.net

=> go to the Integrations tab,
=> choose Grafana Agent
=> and then follow the instructions.

Grafana Agent works on Windows, macOS, Debian, and Red Hat.

Grafana Agent Integration Tab

After installation, we need to configure the agent:

If you’re using Windows with the default installation, go to C:\Program Files\Grafana Agent and edit agent-config.yaml. For other cases, check the documentation.

metrics:
  configs:
  - name: integrations
    remote_write:
    - basic_auth:
        password: {replaceit}
        username: {replaceit}
      url: {replaceit}
    scrape_configs:
      - job_name: dogs-service
        scrape_interval: 30s
        metrics_path: /metrics/prometheus
        static_configs:
          - targets: ['localhost:5000']
      - job_name: dogs-service-healthchecks
        scrape_interval: 30s
        metrics_path: /health/prometheus
        static_configs:
          - targets: ['localhost:5000']
  global:
    scrape_interval: 60s
  wal_directory: /tmp/grafana-agent-wal

traces:
  configs:
  - name: default
    remote_write:
      - endpoint: {replaceit}
        basic_auth:
          username: {replaceit}
          password: {replaceit}
    receivers:
      otlp:
        protocols:
          grpc:

To get the username, password, and URLs, go to https://grafana.com/orgs/{YourOrganizationName} and hit ‘Details’ on Tempo and Prometheus.

Organization settings

Save the file and restart Grafana Agent service.

Restarting Grafana Agent

With this, Grafana Agent configuration is complete. Now, let’s send some data to Grafana Cloud.

Traces

To understand traces, it’s easiest to take a look at the picture below.

Grafana Tempo

In short, you can trace an activity through your services in a distributed system.

To demonstrate it, our simple demo project uses 3 components: the dogs API service itself, an external API called through Refit, and a database accessed via Entity Framework Core.

To send traces to the monitoring system, we need an instrumentation framework. OpenTelemetry is the standardized and recommended way to implement tracing nowadays. It is supported by all popular tools, so integration is seamless.

OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.

We're going to use the OpenTelemetry .NET SDK. Add the following NuGet dependencies to the project:

<PackageReference Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.4.0-alpha.2" />
<PackageReference Include="OpenTelemetry.Exporter.Prometheus.AspNetCore" Version="1.4.0-alpha.2" />
<PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.0.0-rc9.6" />
<PackageReference Include="OpenTelemetry.Instrumentation.AspNetCore" Version="1.0.0-rc9.6" />
<PackageReference Include="OpenTelemetry.Instrumentation.EntityFrameworkCore" Version="1.0.0-beta.3" />
<PackageReference Include="OpenTelemetry.Instrumentation.EventCounters" Version="0.1.0-alpha.1" />
<PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="1.0.0-rc9.6" />
<PackageReference Include="OpenTelemetry.Instrumentation.SqlClient" Version="1.0.0-rc9.6" />

Then, configure tracing in Program.cs:

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetryTracing(options =>
{
    options.ConfigureResource(resourceBuilder =>
    {
        resourceBuilder.AddService(
            builder.Environment.ApplicationName,
            builder.Environment.EnvironmentName,
            builder.Configuration["OpenTelemetry:ApplicationVersion"],
            false,
            Environment.MachineName);
    })
    .AddHttpClientInstrumentation(instrumentationOptions =>
    {
        instrumentationOptions.RecordException = true;
    })
    .AddAspNetCoreInstrumentation(instrumentationOptions =>
    {
        instrumentationOptions.RecordException = true;
    })
    .AddSqlClientInstrumentation(instrumentationOptions =>
    {
        instrumentationOptions.RecordException = true;
        instrumentationOptions.SetDbStatementForText = true;
    })
    .AddEntityFrameworkCoreInstrumentation(instrumentationOptions =>
    {
        instrumentationOptions.SetDbStatementForText = true;
    })
    .AddOtlpExporter(opt =>
    {
        opt.Protocol = OtlpExportProtocol.Grpc;
        opt.Endpoint = new Uri(builder.Configuration["OpenTelemetry:Exporter:Otlp:Endpoint"]);
    });
});


‘OpenTelemetry:Exporter:Otlp:Endpoint’ comes from appsettings.json


  "OpenTelemetry": {
    "ApplicationVersion": "1.0.0", 
    "Exporter": {
      "Otlp": {
        "Endpoint": "http://localhost:4317"
      }
    }
  } 

where http://localhost:4317 is the endpoint of the Grafana Agent we installed in the previous step.

By using the OTLP protocol, our application sends traces to the Grafana Agent, which takes care of the rest; in our case, the agent forwards them to Grafana Cloud. If required, you can always switch from Grafana Cloud to self-hosted Tempo just by reconfiguring the agent, with no need to modify the source code, as shown in the sketch below.
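For illustration, this is roughly what the agent's traces block could look like when pointing at a self-hosted Tempo instead of Grafana Cloud; the hostname is a placeholder, and the insecure flag assumes plain gRPC without TLS:

traces:
  configs:
  - name: default
    remote_write:
      # Self-hosted Tempo OTLP gRPC endpoint (placeholder hostname); no basic_auth needed on a private network.
      - endpoint: tempo.internal.example:4317
        insecure: true
    receivers:
      otlp:
        protocols:
          grpc: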

That’s it.

Let’s run the app and hit the test endpoint.

GET {{host}}/api/v1/dogs/new

Testing traces

The parent span belongs to our API request, with 2 child spans for calling the external API and saving data to the database. Each span carries its duration, operation result, and metadata, which is basically everything we need to trace and debug an activity from top to bottom.
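The spans above come from automatic instrumentation (ASP.NET Core, HttpClient, SQL client, EF Core). If you want custom spans of your own, which the demo doesn't require, the OpenTelemetry SDK picks up System.Diagnostics.ActivitySource. A minimal sketch; the source name and service are illustrative, and the source must also be registered with options.AddSource("Demo.Dogs") in the tracing configuration above:

using System.Diagnostics;

// Hypothetical custom instrumentation; "Demo.Dogs" is an illustrative source name.
public static class Telemetry
{
    public static readonly ActivitySource Source = new("Demo.Dogs");
}

public class DogRegistrationService
{
    public void Register(string name)
    {
        // Creates a child span under the current trace (e.g. the incoming HTTP request span).
        using var activity = Telemetry.Source.StartActivity("RegisterDog");
        activity?.SetTag("dog.name", name);
        // ... actual work goes here ...
    }
}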

Let’s hit another endpoint to see what we get in case of error:

GET {{host}}/api/v1/fail500

Going back to the search panel and searching for our request:

Traces Search Panel

Trace Tags

It looks perfect. We have all the information needed to trace the issue back: the external API returned HTTP 404, Refit threw an exception, and our API returned HTTP 500 to the client.

Metrics

Metrics are aggregated, real-time data used to measure your application's performance.

For example, it can be the latency of your API endpoints, the number of HTTP 5XX errors, or the free space on the hard drive.

There are many frameworks for collecting metrics in a .NET service. All of them work just fine, but we're going to use the OpenTelemetry SDK again and expose a Prometheus endpoint. Grafana Agent will scrape it and send the data to Grafana Cloud, similarly to traces.

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetryMetrics(options =>
{
    options.ConfigureResource(resourceBuilder =>
    {
        resourceBuilder.AddService(
            builder.Environment.ApplicationName,
            builder.Environment.EnvironmentName,
            builder.Configuration["OpenTelemetry:ApplicationVersion"],
            false,
            Environment.MachineName);
        resourceBuilder.AddTelemetrySdk();
    })
    .AddHttpClientInstrumentation()
    .AddAspNetCoreInstrumentation()
    .AddEventCounterMetrics()
    .AddPrometheusExporter();
});

var app = builder.Build();
app.UseHealthChecksPrometheusExporter("/health/prometheus", options =>
{
    options.ResultStatusCodes[HealthStatus.Unhealthy] = 200;
});
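One detail worth noting: the agent config above scrapes application metrics from /metrics/prometheus, so the OpenTelemetry Prometheus scraping endpoint has to be mapped as well. A minimal sketch, assuming the path overload of the extension method from OpenTelemetry.Exporter.Prometheus.AspNetCore (the default path is /metrics):

// Serves the OpenTelemetry metrics configured above in Prometheus text format,
// on the path the agent's scrape_configs expects.
app.UseOpenTelemetryPrometheusScrapingEndpoint("/metrics/prometheus");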

That’s it, very simple. Go to Grafana Cloud, change the data source to Prometheus, and try to visualize some metrics of your choice, e.g. queries per second:

Setting up metrics visualization

Queries per second

You might already be familiar with Prometheus and Grafana. It's one of the best-loved stacks among DevOps engineers around the globe for monitoring VMs, networks, databases, and whatever metrics you can imagine.
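Besides the built-in ASP.NET Core and HttpClient metrics, you can publish your own counters with System.Diagnostics.Metrics. A hypothetical sketch, not part of the demo project; the meter would also need to be registered with options.AddMeter("Demo.Dogs") in the metrics configuration above:

using System.Diagnostics.Metrics;

// Hypothetical custom metric; "Demo.Dogs" and the counter name are illustrative.
public static class DogsMetrics
{
    private static readonly Meter DogsMeter = new("Demo.Dogs");

    private static readonly Counter<long> DogsCreated =
        DogsMeter.CreateCounter<long>("dogs_created_total", description: "Total number of dogs created");

    // Call this wherever a dog is created; the Prometheus exporter exposes it on the scrape endpoint.
    public static void RecordDogCreated() => DogsCreated.Add(1);
}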

Healthchecks

Health checks are a special case of app metrics used to automate operations: for example, restarting your app automatically when it's running out of memory, or routing traffic to a new instance in your cluster only once it becomes available.

Microsoft provides extensive documentation explaining health checks in depth.

To keep the article short and simple, I won't go into more detail on this subject; we'll just try it out.

The first step is installing the required packages:

    <PackageReference Include="AspNetCore.HealthChecks.Network" Version="6.0.4" />
    <PackageReference Include="AspNetCore.HealthChecks.Prometheus.Metrics" Version="6.0.2" />
    <PackageReference Include="AspNetCore.HealthChecks.Publisher.Prometheus" Version="6.0.2" />
    <PackageReference Include="AspNetCore.HealthChecks.System" Version="6.0.5" />
    <PackageReference Include="Microsoft.Diagnostics.NETCore.Client" Version="0.2.328102" />
    <PackageReference Include="Microsoft.Extensions.Diagnostics.HealthChecks.EntityFrameworkCore" Version="6.0.8" />

Configure Program.cs:

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHealthChecks()
    .AddDiskStorageHealthCheck(_ => { }, tags: new[] { "live", "ready" })
    .AddPingHealthCheck(_ => { }, tags: new[] { "live", "ready" })
    .AddPrivateMemoryHealthCheck(512 * 1024 * 1024, tags: new[] { "live", "ready" })
    .AddDnsResolveHealthCheck(_ => { }, tags: new[] { "live", "ready" })
    .AddDbContextCheck<DogsDbContext>(tags: new[] { "ready" });
var app = builder.Build();
app.MapHealthChecks("health", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = HealthChecksLogWriters.WriteResponseAsync
});

app.MapHealthChecks("health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = HealthChecksLogWriters.WriteResponseAsync
});

app.MapHealthChecks("health/live", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("live"),
    ResponseWriter = HealthChecksLogWriters.WriteResponseAsync
});

app.UseHealthChecksPrometheusExporter("/health/prometheus", options =>
{
    options.ResultStatusCodes[HealthStatus.Unhealthy] = 200;
});

In this example, we collect some crucial health check metrics and expose them in 3 different ways:

  1. For kubernetes probes
  2. For prometheus collector
  3. For people

Kubernetes Probes

You can skip it if you’re not using Kubernetes.

Kubernetes pings our application. Depending on the result, it can consider the instance up and running, restart it, or wait a bit longer until the app has loaded all required dependencies; only then will it route incoming traffic to the instance.

You can find detailed documentation here.

We’re exposing 2 endpoints (a sample probe configuration follows the list):

  • /health/live - for **startupProbe** and **livenessProbe**. If the app signals that it is not live (some of the health checks failed), Kubernetes will restart the service.
  • /health/ready - for **readinessProbe**. When the app is ready, Kubernetes will route traffic to this instance.
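For reference, here is a minimal sketch of how these endpoints could be wired into a Deployment manifest; the container name, port, and timings are illustrative, not taken from the demo project:

containers:
  - name: dogs-service
    ports:
      - containerPort: 5000
    # Restart the container if the app reports it is not live.
    livenessProbe:
      httpGet:
        path: /health/live
        port: 5000
      initialDelaySeconds: 10
      periodSeconds: 15
    # Route traffic to the pod only once the app reports it is ready.
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 5000
      periodSeconds: 10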

Prometheus healthcheck

An additional endpoint exports the health checks in Prometheus format. Similarly to app metrics, we can use them in Grafana:

Prometheus exporter endpoint

Memory consumption

Human-readable format

The last endpoint simply makes testing and operations easier: just open it in your browser:

Human-readable health checks

Logs

Logging for development and production environments will be different.

In production, the best practice is to write logs to stdout and then use a log collector (e.g. Promtail) to deliver them to storage; a sketch of that setup follows. For development purposes, it is much easier to configure a sink and write logs directly to Grafana Cloud.
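For the production path, an abridged Promtail client configuration could look roughly like this; it's only a sketch, with placeholder credentials and log path, and it is not part of the demo project:

clients:
  # Grafana Cloud Loki push endpoint with basic auth.
  - url: https://logs-prod3.grafana.net/loki/api/v1/push
    basic_auth:
      username: {replaceit}
      password: {replaceit}
scrape_configs:
  - job_name: dogs-service
    static_configs:
      - targets: [localhost]
        labels:
          job: dogs-service
          __path__: /var/log/dogs-service/*.log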

I believe that in 2022 OpenTelemetry logging is not yet ready for production use: it doesn't give you any advantages, and its tooling is quite poor compared to well-known alternatives.

So I still recommend using Serilog for logging. It has an OpenTelemetry sink, so you can switch to OTLP at any time without modifying your source code.

As usual we start with installing packages:

<PackageReference Include="Serilog.AspNetCore" Version="6.0.1" />
<PackageReference Include="Serilog.Enrichers.Demystifier" Version="1.0.2" />
<PackageReference Include="Serilog.Enrichers.Span" Version="2.3.0" />
<PackageReference Include="Serilog.Exceptions" Version="8.4.0" />
<PackageReference Include="Serilog.Exceptions.EntityFrameworkCore" Version="8.4.0" />
<PackageReference Include="Serilog.Exceptions.Refit" Version="8.4.0" />
<PackageReference Include="Serilog.Formatting.Compact" Version="1.1.0" />
<PackageReference Include="Serilog.Sinks.Grafana.Loki" Version="8.0.0" />

Program.cs:

builder.Host.UseSerilog((_, configuration) => configuration
    .ReadFrom.Configuration(builder.Configuration)
    .Enrich.WithSpan()
    .Enrich.WithExceptionDetails(new DestructuringOptionsBuilder()
        .WithDefaultDestructurers()
        .WithDestructurers(new IExceptionDestructurer[]
        {
            new DbUpdateExceptionDestructurer(),
            new ApiExceptionDestructurer()
        }))
    .Enrich.WithDemystifiedStackTraces());

builder.Services.AddHttpLogging(logging =>
{
    logging.LoggingFields = HttpLoggingFields.All;
});

var app = builder.Build();
app.UseHttpLogging();

appsettings.json:

"Serilog": {
    "Using": [
      "Serilog.Sinks.Grafana.Loki"
    ],
    "MinimumLevel": {
      "Default": "Debug"
    },
    "WriteTo": [
      {
        "Name": "Console",
        "Args": {
          "formatter": "Serilog.Formatting.Compact.CompactJsonFormatter, Serilog.Formatting.Compact"
        }
      },
      {
        "Name": "GrafanaLoki",
        "Args": {
          "uri": "https://logs-prod3.grafana.net",
          "credentials": {
            "login": "",
            "password": ""
          },
          "labels": [
            {
              "key": "service",
              "value": "demo-services-dogs"
            }
          ],
          "propertiesAsLabels": [
            "app"
          ]
        }
      }
    ]
  }

Don’t forget to put your Grafana credentials in your secrets.

In our demo project we’re going to log HTTP requests/responses. Microsoft provides two middlewares for logging HTTP messages (body, headers, etc.).

Serilog is the most popular logging framework and has tons of extensions; we use two of them here (a usage example follows the list):

  • WithSpan - adds information from the current OpenTelemetry trace (trace and span IDs).
  • WithExceptionDetails - logs exceptions in a convenient, human-readable format.
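With these enrichers in place, even an ordinary structured log line carries the current trace context, which is what makes the jump from Loki to Tempo shown below possible. A hypothetical example; the logger and dog variables are illustrative:

// WithSpan attaches the active TraceId/SpanId to this event automatically.
logger.LogInformation("Created dog {DogId} with name {DogName}", dog.Id, dog.Name);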

Okay, after running our service go to Grafana Cloud -> Explore -> Loki Logs Datasource

Loki Logs Datasource

Here are our logs. Thanks to the deep integration between Loki and Tempo, Grafana allows us to quickly jump from logs to the corresponding traces.

Loki-tempo integration

Conclusion

The source code is on GitHub.

In this article we explored Grafana Cloud and tried 3 observability tools:

  • Prometheus for metrics
  • Tempo for traces
  • Loki for logs

For all three data stores we use only the Grafana UI, which is super convenient for analytics and troubleshooting.

We also got our hands dirty with OpenTelemetry, which aims to standardize observability tools and protocols to make distributed applications maintenance much easier.

In the next article we’ll cover more topics needed for production-ready apps, such as:

  • Error handling;
  • Retry, jitter, and circuit breaker patterns.
