Goh Chun Lin

Originally published at cuteprogramming.blog

Observing Orchard Core: Metrics and Logs with Grafana and Amazon CloudWatch

I recently deployed an Orchard Core app on Amazon ECS and wanted to gain better visibility into its performance and health.

Instead of relying solely on basic Amazon CloudWatch metrics, I decided to build a custom monitoring pipeline: Grafana runs on Amazon EC2 and visualises the metrics and EMF (Embedded Metric Format) logs sent from the Orchard Core app on ECS, with everything provisioned through CloudFormation.

In this post, I will walk through how I set this up from scratch, what challenges I faced, and how you can do the same.

Source Code

The CloudFormation templates and relevant C# source code discussed in this article are available on GitHub as part of the Orchard Core Basics Companion (OCBC) Project: https://github.com/gcl-team/Experiment.OrchardCore.Main.

Why Grafana?

In the previous post, where we set up Orchard Core on ECS, we talked about how we can send metrics and logs to CloudWatch. While CloudWatch offers out-of-the-box infrastructure metrics, AWS-native alarms, and logs, its dashboards are limited and not very customisable. Managing observability with just CloudWatch also gets tricky when our apps span multiple AWS regions, accounts, or other cloud environments.


The GrafanaLive event in Singapore in September 2023. (Event Page)

If we are looking for a solution that is not tied to a single vendor like AWS, Grafana can be one of the options. Grafana is an open-source visualisation platform that lets teams monitor real-time metrics from multiple sources, such as CloudWatch, X-Ray, and Prometheus, all in unified dashboards. It is lightweight, extensible, and ideal for observability in cloud-native environments.

Is Grafana the only solution? Definitely not! However, I personally still prefer Grafana because it is open source and free to start with. In this blog post, we will also see how easy it is to host Grafana on EC2 and integrate it directly with CloudWatch, with no extra agents needed.

Three Pillars of Observability

In observability, there are three pillars, i.e. logs, metrics, and traces.


Lisa Jung, senior developer advocate at Grafana, talks about the three pillars in observability (Image Credit: Grafana Labs)

Firstly, logs are text records that capture events happening in the system.

Secondly, metrics are numeric measurements tracked over time, such as HTTP status code counts, response times, or ECS CPU and memory utilisation rates.

Finally, traces show the path a request takes as it moves through the different parts of the system.

Together, these three pillars form a strong observability foundation which can help us identify issues faster, reduce downtime, and improve system reliability. This ultimately supports a better user experience for our apps.

This is where a tool like Grafana comes in: it helps us visualise, analyse, and alert on our metrics, making observability practical and actionable.

Setting Up Grafana on EC2 with CloudFormation

It is straightforward to install Grafana on EC2.

Firstly, let’s define the security group that we will use for the EC2 instance.

ec2SecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Allow access to the EC2 instance hosting Grafana
    VpcId: {"Fn::ImportValue": !Sub "${CoreNetworkStackName}-${AWS::Region}-vpcId"}
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 22
        ToPort: 22
        CidrIp: 0.0.0.0/0 # Caution: SSH open to public, restrict as needed
      - IpProtocol: tcp
        FromPort: 3000
        ToPort: 3000
        CidrIp: 0.0.0.0/0 # Caution: Grafana open to public, restrict as needed
    Tags:
      - Key: Stack
        Value: !Ref AWS::StackName

The VPC ID is imported from the common network stack, cld-core-network, that we set up previously. Please refer to the cld-core-network stack here.

For demo purposes, both SSH (port 22) and Grafana (port 3000) are open to the world (0.0.0.0/0). It is important to protect access to the EC2 instance later by adding a bastion host, a VPN, or IP restrictions.

In addition, SSH should only be opened temporarily; it is there for when we need to log in to the EC2 instance and troubleshoot the Grafana installation manually.

Now, we can proceed to set up the EC2 instance with Grafana installed using the CloudFormation resource below.

ec2Instance:
  Type: AWS::EC2::Instance
  Properties:
    InstanceType: !Ref InstanceType
    ImageId: !Ref Ec2Ami
    NetworkInterfaces:
      - AssociatePublicIpAddress: true
        DeviceIndex: 0
        SubnetId: {"Fn::ImportValue": !Sub "${CoreNetworkStackName}-${AWS::Region}-publicSubnet1Id"}
        GroupSet:
          - !Ref ec2SecurityGroup
    UserData:
      Fn::Base64: !Sub |
        #!/bin/bash
        yum update -y
        yum install -y wget unzip
        wget https://dl.grafana.com/oss/release/grafana-10.1.0-1.x86_64.rpm
        yum install -y grafana-10.1.0-1.x86_64.rpm
        systemctl enable --now grafana-server
    Tags:
      - Key: Name
        Value: "Observability-Instance"

In the CloudFormation template above, we are expecting our users to access the Grafana dashboard directly over the Internet. Hence, we put the EC2 instance in a public subnet and assign an Elastic IP (EIP) to it, as demonstrated below, so that Grafana has a consistent, publicly accessible static IP.

ecsEip:
  Type: AWS::EC2::EIP

ec2EIPAssociation:
  Type: AWS::EC2::EIPAssociation
  Properties:
    AllocationId: !GetAtt ecsEip.AllocationId
    InstanceId: !Ref ec2Instance

For production systems, placing instances in public subnets and exposing them with a public IP requires strong security measures to be in place. Otherwise, it is recommended to place our Grafana EC2 instance in a private subnet and access it via an Application Load Balancer (ALB) or a NAT Gateway to reduce the attack surface.

Pump CloudWatch Metrics to Grafana

Grafana supports CloudWatch as a native data source.

With the appropriate AWS credentials and region, we can use an Access Key ID and Secret Access Key to grant Grafana access to CloudWatch. The IAM user that the credentials belong to must have the AmazonGrafanaCloudWatchAccess policy.


The user that Grafana uses to access CloudWatch must have the AmazonGrafanaCloudWatchAccess policy.

However, using an AWS Access Key/Secret in the Grafana data source connection details is less secure and not ideal for EC2 setups. In addition, AmazonGrafanaCloudWatchAccess is a managed policy optimised for running Grafana as a managed service within AWS. Thus, it is recommended to create our own custom policy so that we can limit the permissions to only what is needed, as demonstrated in the following CloudFormation template.

ec2InstanceRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: ec2.amazonaws.com
          Action: sts:AssumeRole

    Policies:
      - PolicyName: EC2MetricsAndLogsPolicy
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Sid: AllowReadingMetricsFromCloudWatch
              Effect: Allow
              Action:
                - cloudwatch:ListMetrics
                - cloudwatch:GetMetricData
              Resource: "*"
            - Sid: AllowReadingLogsFromCloudWatch
              Effect: Allow
              Action:
                - logs:DescribeLogGroups
                - logs:GetLogGroupFields
                - logs:StartQuery
                - logs:StopQuery
                - logs:GetQueryResults
                - logs:GetLogEvents
              Resource: "*"

Again, using our own custom policy gives us better control and follows the best practice of least privilege. Note that the role still needs to be attached to the EC2 instance through an instance profile, which is not shown in the snippet above.


With an IAM role, we do not need to provide an AWS Access Key/Secret in the Grafana connection details for CloudWatch as a data source.

Visualising ECS Service Metrics

Now that Grafana is configured to pull data from CloudWatch, ECS metrics like CPUUtilization and MemoryUtilization are available. We can proceed to create a dashboard and select the right namespace as well as the right metric name.


Setting up the diagram for memory utilisation of our Orchard Core app in our ECS cluster.

As shown in the following dashboard, we plot memory and CPU utilisation rates because they help us ensure that our ECS services are performing within safe limits, neither overusing nor underutilising resources.


Both ECS service metrics and Container Insights metrics are displayed on the Grafana dashboard.

Visualising ECS Container Insights Metrics

ECS Container Insights provides deeper metrics such as task counts, network I/O, storage I/O, and so on.

In the dashboard above, we can also see the Task Count. Task Count helps us make sure our services are running the right number of tasks at all times.

Task Count by itself is not a cost metric, but if we consistently see high task counts with low CPU/memory usage, it indicates we can potentially consolidate workloads and reduce costs.

Instrumenting Orchard Core to Send Custom App Metrics

Now that we have seen how ECS metrics are visualised in Grafana, let’s move on to instrumenting our Orchard Core app to send custom app-level metrics. This will give us deeper visibility into what our app is really doing.

Metrics should be tied to business objectives. It is crucial that the metrics we collect align with KPIs that can drive decision-making.

Metrics should be actionable. The collected data should help identify where to optimise, what to improve, and how to make decisions. For example, by tracking app-level metrics such as response times and HTTP status codes, we gain insight into both the performance and reliability of our Orchard Core app. This allows us to catch slowdowns or failures early, improving user satisfaction.


SLA vs SLO vs SLI: Key Differences in Service Metrics (Image Credit: Atlassian)

By tracking response times and HTTP status code counts at the endpoint level, we are measuring SLIs that are necessary to monitor whether we are meeting our SLOs.

With clear SLOs and SLIs, we can then focus on what really matters from a performance and reliability perspective. For example, a common SLO could be “99.9% of requests to our Orchard Core API endpoints must be processed within 500ms.”
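To make the relationship concrete, here is a tiny illustration (the helpers below are hypothetical and not part of the OCBC codebase) of how such an SLI could be computed from raw request counts and checked against the SLO.

// Hypothetical helpers: the latency SLI is the fraction of requests answered
// within 500 ms; the SLO above is met while the SLI stays at or above 0.999.
static double ComputeLatencySli(long requestsWithin500Ms, long totalRequests) =>
    totalRequests == 0 ? 1.0 : (double)requestsWithin500Ms / totalRequests;

static bool IsLatencySloMet(long requestsWithin500Ms, long totalRequests) =>
    ComputeLatencySli(requestsWithin500Ms, totalRequests) >= 0.999;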

In terms of sending custom app-level metrics from our Orchard Core app to CloudWatch and then to Grafana, there are many approaches depending on the use case. If we are looking for simplicity and speed, the CloudWatch SDK and EMF are the easiest and most straightforward ways to get started with sending custom metrics from Orchard Core to CloudWatch and then visualising them in Grafana.

Using CloudWatch SDK to Send Metrics

We will start by creating a middleware called EndpointStatisticsMiddleware, with the AWSSDK.CloudWatch NuGet package referenced. In the middleware, we create a MetricDatum object to define the metric that we want to send to CloudWatch.

var metricData = new MetricDatum
    {
        MetricName = metricName,
        Value = value,
        Unit = StandardUnit.Count,
        Dimensions = new List<Dimension>
        {
            new Dimension
            {
                Name = "Endpoint", 
                Value = endpointPath
            }
        }
    };

var request = new PutMetricDataRequest
    {
        Namespace = "Experiment.OrchardCore.Main/Performance",
        MetricData = new List<MetricDatum> { metricData }
    };

In the code above, we see new concepts like Namespace, Metric, and Dimension. They are foundational in CloudWatch, and we can think of them as ways to organise and label our data to make it easy to find, group, and analyse.

  • Namespace: A container or category for our metrics. It helps to group related metrics together;
  • Metric: A series of data points that we want to track, i.e. the thing we are measuring; in our example, it could be Http2xxCount and Http4xxCount;
  • Dimension: A key-value pair that adds context to a metric.

If we do not define the Namespace, Metric, and Dimensions carefully when we send data, Grafana will not be able to find them later, or our dashboard charts will be very messy and hard to filter or analyse.

In addition, as shown in the code above, we are capturing the HTTP status codes for our Orchard Core endpoints. We then use PutMetricDataAsync to send the PutMetricDataRequest asynchronously to CloudWatch.
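To see how these pieces fit together, below is a simplified sketch of what the middleware as a whole could look like. The exact implementation in the OCBC repository may differ; in particular, deriving the metric name from the status code family and the way IAmazonCloudWatch is obtained are assumptions made here.

using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

public class EndpointStatisticsMiddleware(RequestDelegate next, IAmazonCloudWatch cloudWatch)
{
    public async Task InvokeAsync(HttpContext context)
    {
        // Let the rest of the pipeline (including Orchard Core) handle the request first.
        await next(context);

        // Derive a metric name such as Http2xxCount or Http4xxCount from the response.
        var statusFamily = context.Response.StatusCode / 100;
        var metricName = $"Http{statusFamily}xxCount";
        var endpointPath = context.Request.Path.Value ?? "/";

        var metricData = new MetricDatum
        {
            MetricName = metricName,
            Value = 1,
            Unit = StandardUnit.Count,
            Dimensions = new List<Dimension>
            {
                new Dimension { Name = "Endpoint", Value = endpointPath }
            }
        };

        // One data point per request for now; we will batch these calls later in this post.
        await cloudWatch.PutMetricDataAsync(new PutMetricDataRequest
        {
            Namespace = "Experiment.OrchardCore.Main/Performance",
            MetricData = new List<MetricDatum> { metricData }
        });
    }
}

This assumes IAmazonCloudWatch is registered in the dependency injection container, for example through the AWSSDK.Extensions.NETCore.Setup package.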


The HTTP status codes of each of our Orchard Core endpoints are now captured on CloudWatch.

Now, when we want to configure a CloudWatch panel in Grafana to show the HTTP status codes for each endpoint, the first thing we select is the Namespace, which is Experiment.OrchardCore.Main/Performance in our example. The Namespace tells Grafana which group of metrics to query.

After picking the Namespace, Grafana lists the available Metrics inside that Namespace. We pick the Metrics we want to plot, such as Http2xxCount and Http4xxCount. Finally, since we are tracking metrics by endpoint, we set the Dimension to Endpoint and select the specific endpoint we are interested in, as shown in the following screenshot.

Using EMF to Send Metrics

While using the CloudWatch SDK works well for sending individual metrics, EMF (Embedded Metric Format) offers a more powerful and scalable way to emit structured metrics directly through our application logs.

Before we can use EMF, we must first ensure that the Orchard Core application logs from our ECS tasks are correctly sent to CloudWatch Logs. This is done by configuring the LogConfiguration inside the ECS TaskDefinition as we discussed last time.

  # Unit 12: ECS Task Definition and Service
  ecsTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      ...
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Image: !Ref OrchardCoreImage
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Sub "/ecs/${ServiceName}-log-group"
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: ecs
          ...

Once the ECS task is sending logs to CloudWatch Logs, we can start embedding custom metrics into the logs using EMF.

Instead of pushing metrics directly using the CloudWatch SDK, we write structured JSON messages into the container logs. CloudWatch then automatically detects these EMF messages and converts them into CloudWatch metrics.

The following shows what a simple EMF log message looks like.

{
  "_aws": {
    "Timestamp": 1745653519000,
    "CloudWatchMetrics": [
      {
        "Namespace": "Experiment.OrchardCore.Main/Performance",
        "Dimensions": [["Endpoint"]],
        "Metrics": [
          { "Name": "ResponseTimeMs", "Unit": "Milliseconds" }
        ]
      }
    ]
  },
  "Endpoint": "/api/v1/packages",
  "ResponseTimeMs": 142
}

When a log message reaches CloudWatch Logs, CloudWatch scans the text and looks for a valid _aws JSON object anywhere inside the message. Thus, even if our log line has extra text before or after, as long as the EMF JSON is properly formatted, CloudWatch extracts it and publishes the metrics automatically.
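How the log line gets written is up to us. As a rough illustration, and not necessarily how the OCBC project does it, the app could serialise the EMF payload and write it to standard output, which the awslogs driver then ships to CloudWatch Logs.

using System.Text.Json;

public static class EmfMetricLogger
{
    // Writes one EMF document per line to stdout; the awslogs driver forwards it to CloudWatch Logs.
    public static void LogResponseTime(string endpointPath, double responseTimeMs)
    {
        var payload = new Dictionary<string, object>
        {
            ["_aws"] = new
            {
                Timestamp = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds(),
                CloudWatchMetrics = new[]
                {
                    new
                    {
                        Namespace = "Experiment.OrchardCore.Main/Performance",
                        Dimensions = new[] { new[] { "Endpoint" } },
                        Metrics = new[] { new { Name = "ResponseTimeMs", Unit = "Milliseconds" } }
                    }
                }
            },
            ["Endpoint"] = endpointPath,
            ["ResponseTimeMs"] = responseTimeMs
        };

        Console.WriteLine(JsonSerializer.Serialize(payload));
    }
}

Calling EmfMetricLogger.LogResponseTime("/api/v1/packages", 142) would then emit a log line similar to the JSON above.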


An example of a log with EMF JSON in it on CloudWatch.

After CloudWatch extracts the EMF block from our log message, it automatically turns it into a proper CloudWatch Metric. These metrics are then queryable just like any normal CloudWatch metric and thus available inside Grafana too, as shown in the screenshot below.


Metrics extracted from logs containing EMF JSON are automatically turned into metrics that can be visualised in Grafana just like any other metric.

As we can see, using EMF is easier compared to going the CloudWatch SDK route because we do not need to change or add any extra AWS infrastructure. With EMF, all our app does is write specially formatted JSON logs.

CloudWatch then automatically extracts the metrics from those EMF JSON logs. The entire process requires no new service, no special SDK code, and no CloudWatch PutMetricData API calls.

Cost Optimisation with Logs vs Metrics

Logs are more expensive than metrics, especially when we are storing large amounts of data over time. Longer retention periods and more detailed logs both mean higher storage costs.

Metrics are cheaper to store because they are aggregated data points that do not require the same level of detail as logs.

CloudWatch treats each unique combination of dimensions as a separate metric, even if the metrics have the same metric name. However, compared to logs, metrics are still usually much cheaper at scale.

By embedding metrics into our log data via EMF, we are piggybacking metrics on logs and letting CloudWatch extract the metrics without duplicating effort. Thus, when using EMF, we will be paying for both, i.e.

  1. Log ingestion and storage (for the raw logs);
  2. The extracted custom metric (for the metric).

Hence, when we are leveraging EMF, we should consider expiring logs faster if we only need the extracted metrics long-term.
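Retention can be configured on the log group itself, for example with the RetentionInDays property of the AWS::Logs::LogGroup resource in CloudFormation, or programmatically. The snippet below is a rough sketch using the AWSSDK.CloudWatchLogs package; the log group name is illustrative only and simply follows the pattern from our task definition.

using Amazon.CloudWatchLogs;
using Amazon.CloudWatchLogs.Model;

// Sketch: keep the raw EMF logs for only 7 days, since the extracted metrics
// are what we rely on for long-term dashboards.
var logsClient = new AmazonCloudWatchLogsClient();
await logsClient.PutRetentionPolicyAsync(new PutRetentionPolicyRequest
{
    LogGroupName = "/ecs/orchardcore-service-log-group", // illustrative name only
    RetentionInDays = 7
});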

Granularity and Sampling

Granularity refers to how frequently the metric data is collected. Fine granularity provides more detailed insights but can lead to increased data volume and costs.

Sampling is a technique to reduce the amount of data collected by capturing only a subset of data points, which is especially helpful in high-traffic systems. However, the challenge is ensuring that we maintain enough data to make informed decisions while keeping storage and processing costs manageable.
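As a simple illustration of sampling, and this is an assumption rather than something the OCBC project currently does, the middleware could record only a fraction of requests:

// Record roughly 10% of requests; sampled counts under-report the true totals,
// so this suits latency-style metrics better than exact status-code counts.
private const double SamplingRate = 0.10;

private static bool ShouldSample() =>
    Random.Shared.NextDouble() < SamplingRate;

The tracking call in the middleware would then be wrapped in an if (ShouldSample()) check.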

In our Orchard Core app above, the middleware currently calls PutMetricDataAsync immediately for every request, which not only slows down our API but also costs more, because we pay for the requests that send custom metrics to CloudWatch. Thus, we usually “buffer” the metrics first and then batch-send them periodically. This can be done with, for example, a hosted service, which is an ASP.NET Core background service, that flushes the metrics at an interval.

using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
using System.Collections.Concurrent;

public class MetricsPublisher(
        IAmazonCloudWatch cloudWatch,
        IOptions<MetricsOptions> optionsAccessor,
        ILogger<MetricsPublisher> logger) : BackgroundService
{
    // Unwrap IOptions once so the rest of the class can read the settings directly.
    private readonly MetricsOptions options = optionsAccessor.Value;
    private readonly ConcurrentBag<MetricDatum> _pendingMetrics = new();

    public void TrackMetric(string metricName, double value, string endpointPath)
    {
        _pendingMetrics.Add(new MetricDatum
        {
            MetricName = metricName,
            Value = value,
            Unit = StandardUnit.Count,
            Dimensions = new List<Dimension>
            {
                new Dimension 
                { 
                    Name = "Endpoint", 
                    Value = endpointPath
                }
            }
        });
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        logger.LogInformation("MetricsPublisher started.");
        while (!stoppingToken.IsCancellationRequested)
        {
            await Task.Delay(TimeSpan.FromSeconds(options.FlushIntervalSeconds), stoppingToken);
            await FlushMetricsAsync();
        }
    }

    private async Task FlushMetricsAsync()
    {
        if (_pendingMetrics.IsEmpty) return;

        const int MaxMetricsPerRequest = 1000;

        var metricsToSend = new List<MetricDatum>();
        var metricsCount = 0;
        while (_pendingMetrics.TryTake(out var datum))
        {
            metricsToSend.Add(datum);

            metricsCount += 1;
            if (metricsCount >= MaxMetricsPerRequest) break;
        }

        var request = new PutMetricDataRequest
        {
            Namespace = options.Namespace,
            MetricData = metricsToSend
        };

        int attempt = 0;
        while (attempt < options.MaxRetryAttempts)
        {
            try
            {
                await cloudWatch.PutMetricDataAsync(request);
                logger.LogInformation("Flushed {Count} metrics to CloudWatch.", metricsToSend.Count);
                break;
            }
            catch (Exception ex)
            {
                attempt++;
                logger.LogWarning(ex, "Failed to flush metrics. Attempt {Attempt}/{MaxAttempts}", attempt, options.MaxRetryAttempts);
                if (attempt < options.MaxRetryAttempts)
                    await Task.Delay(TimeSpan.FromSeconds(options.RetryDelaySeconds));
                else
                    logger.LogError("Max retry attempts reached. Dropping {Count} metrics.", metricsToSend.Count);
            }
        }
    }

    public override async Task StopAsync(CancellationToken cancellationToken)
    {
        logger.LogInformation("MetricsPublisher stopping.");
        await FlushMetricsAsync();
        await base.StopAsync(cancellationToken);
    }
}

In our Orchard Core API, each incoming HTTP request may run on a different thread. Hence, we need a thread-safe data structure like ConcurrentBag for storing the pending metrics.

Please take note that ConcurrentBag is designed to be an unordered collection; it does not maintain the order of insertion when items are taken from it. However, since the metrics we are sending are counts of HTTP status codes, it does not matter in what order the requests were processed.

In addition, the limit on the number of MetricDatum items we can send to CloudWatch per PutMetricData request is 1,000. Thus, we have the constant MaxMetricsPerRequest to make sure that we retrieve and remove at most 1,000 metrics from the ConcurrentBag per flush.

Finally, we can inject MetricsPublisher into our middleware, EndpointStatisticsMiddleware, so that it can automatically track every API request, as sketched below.
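Here is a possible wiring. The registration lines are assumptions about how the pieces fit together rather than the exact code in the repository; the important part is that the middleware and the hosted service must share the same MetricsPublisher instance.

public class EndpointStatisticsMiddleware(RequestDelegate next, MetricsPublisher metricsPublisher)
{
    public async Task InvokeAsync(HttpContext context)
    {
        await next(context);

        // Buffer one count per request; MetricsPublisher flushes them in batches in the background.
        var statusFamily = context.Response.StatusCode / 100;
        metricsPublisher.TrackMetric($"Http{statusFamily}xxCount", 1, context.Request.Path.Value ?? "/");
    }
}

// In Program.cs (sketch): register one MetricsPublisher instance for both roles.
// builder.Services.AddSingleton<MetricsPublisher>();
// builder.Services.AddHostedService(sp => sp.GetRequiredService<MetricsPublisher>());
// app.UseMiddleware<EndpointStatisticsMiddleware>();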

Wrap-Up

In this post, we started by setting up Grafana on EC2 and connecting it to CloudWatch to visualise ECS metrics. After that, we explored two ways, the CloudWatch SDK and EMF logs, to send custom app-level metrics from our Orchard Core app.

Whether we are monitoring system health or reporting on business KPIs, Grafana with CloudWatch offers a powerful observability stack that is both flexible and cost-aware.
