<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: An Nguyen</title>
    <description>The latest articles on DEV Community by An Nguyen (@nthienan).</description>
    <link>https://dev.to/nthienan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F635683%2Fd1b09319-5578-4f01-9c48-836dc8100631.jpeg</url>
      <title>DEV Community: An Nguyen</title>
      <link>https://dev.to/nthienan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nthienan"/>
    <language>en</language>
    <item>
      <title>A GitOps Way To Manage Grafana Data Sources At Scale</title>
      <dc:creator>An Nguyen</dc:creator>
      <pubDate>Fri, 27 May 2022 14:04:48 +0000</pubDate>
      <link>https://dev.to/aws-builders/a-gitops-way-to-manage-grafana-data-sources-at-scale-59la</link>
      <guid>https://dev.to/aws-builders/a-gitops-way-to-manage-grafana-data-sources-at-scale-59la</guid>
      <description>&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;I work for an enterprise organization and was assigned the task of improving its monitoring system. Since the monitoring system is a centralized system used by the whole organization, it has to be easy to use for teams across the organization. The system uses Grafana for the visualization layer. I won't cover Grafana's backend in this post; if you're interested, refer to my post &lt;a href="https://dev.to/aws-builders/ultra-monitoring-with-victoria-metrics-1p2"&gt;Ultra Monitoring with Victoria Metrics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the past, Grafana data sources were added manually via the web UI. We want to avoid that kind of manual operation; instead, it should be automated as much as possible. We also need to follow GitOps practices to manage, track, and audit changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;Thanks to the &lt;a href="https://grafana.com/docs/grafana/latest/administration/provisioning/" rel="noopener noreferrer"&gt;Grafana Provisioning&lt;/a&gt; feature, it's possible to manage data sources in Grafana by adding one or more YAML config files in the &lt;code&gt;provisioning/datasources&lt;/code&gt; directory. Each config file can contain a list of data sources that will be added or updated during startup. If a data source already exists, Grafana updates it to match the configuration file.&lt;/p&gt;

&lt;p&gt;Combined with the &lt;a href="https://grafana.com/docs/grafana/latest/http_api/admin/#reload-provisioning-configurations" rel="noopener noreferrer"&gt;reload provisioning configurations API&lt;/a&gt;, we can achieve this goal without restarting Grafana on every data source change.&lt;/p&gt;
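&lt;p&gt;As an illustration, here is a minimal sketch of calling that reload API from Python; the host, username, and password are placeholders, and the real runbook may do this differently:&lt;/p&gt;

```python
import base64
import urllib.request

def build_reload_request(base_url, user, password):
    """Build a POST request for Grafana's datasource provisioning reload API."""
    url = base_url.rstrip("/") + "/api/admin/provisioning/datasources/reload"
    # The reload endpoint requires admin credentials; basic auth is used here.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, method="POST")
    req.add_header("Authorization", "Basic " + token)
    return req

# Sending it requires a reachable Grafana server:
# with urllib.request.urlopen(build_reload_request("https://grafana.example.com", "admin", "secret")) as resp:
#     print(resp.status)
```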

&lt;p&gt;The idea is that Grafana data source configuration files are kept in a Git repository, and AWS Automation syncs the configurations to the Grafana servers. The Git repository structure looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
├── team-1
│   ├── clickhouse-2.yaml
│   └── cloudwatch-1.yaml
├── team-2
│   ├── clickhouse-1.yaml
│   └── influxdb-1.yaml
├── team-3
│   ├── elasticsearch-1.yaml
│   └── victoria-metrics-1.yaml
└── team-4
    ├── mysql-1.yaml
    └── prometheus-1.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The solution is a combination of an &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html" rel="noopener noreferrer"&gt;AWS Automation&lt;/a&gt; &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html" rel="noopener noreferrer"&gt;Runbook&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html" rel="noopener noreferrer"&gt;Secrets Manager&lt;/a&gt;, so it's a secure, fully AWS-managed, serverless solution.&lt;/p&gt;

&lt;p&gt;The following diagram shows the high-level architecture of the solution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bsfqznnj9m49pobwxet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bsfqznnj9m49pobwxet.png" alt="high-level architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But wait, why is Secrets Manager in the architecture diagram?&lt;br&gt;
To answer this question, let's look at how a data source is stored in the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prometheus Example &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
&lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
&lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://123.123.1.1:9090&lt;/span&gt;
&lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;username"&lt;/span&gt;
&lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password"&lt;/span&gt;
&lt;span class="na"&gt;basicAuth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;
&lt;span class="na"&gt;jsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpMethod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POST&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data sources may need credentials, and leaving them as plaintext in the repository would be a security issue.&lt;/p&gt;

&lt;p&gt;Let's go back to the architecture diagram. Here is how the process works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Administrators create a secret to store the credentials of a data source (this can be done through an automation portal and/or chatbot)&lt;/li&gt;
&lt;li&gt;Administrators review and merge a PR&lt;/li&gt;
&lt;li&gt;When the PR is merged, a GitHub/GitLab pipeline triggers the predefined Automation runbook&lt;/li&gt;
&lt;li&gt;The runbook executes steps from SSM documents and gets secrets from Secrets Manager&lt;/li&gt;
&lt;li&gt;The runbook generates the data source provisioning files and invokes the Grafana API to reload data sources&lt;/li&gt;
&lt;/ol&gt;
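&lt;p&gt;As a sketch, triggering the runbook from the pipeline could look like the snippet below using boto3. The document name and parameter names here are hypothetical, not the actual runbook's:&lt;/p&gt;

```python
def build_runbook_execution(env, git_ref):
    """Build the kwargs for ssm_client.start_automation_execution().

    SSM Automation parameters are passed as lists of strings.
    """
    return {
        "DocumentName": "SyncGrafanaDatasources",  # hypothetical runbook name
        "Parameters": {
            "Environment": [env],
            "GitRef": [git_ref],
        },
    }

# In the pipeline (requires AWS credentials and the runbook to exist):
# import boto3
# ssm = boto3.client("ssm")
# ssm.start_automation_execution(**build_runbook_execution("prod", "main"))
```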

&lt;p&gt;The runbook has three main steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull the repository from GitHub/GitLab onto the Grafana server&lt;/li&gt;
&lt;li&gt;Get data source credentials from Secrets Manager&lt;/li&gt;
&lt;li&gt;Generate data source provisioning files with the credentials&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydxdypq0kcy6tfvehbjs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydxdypq0kcy6tfvehbjs.png" alt="Runbook"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Secrets stored in Secrets Manager are named using the following format:&lt;br&gt;
&lt;code&gt;{env}/grafana/datasource/{team}/{datasource-name}&lt;/code&gt;&lt;br&gt;
E.g. &lt;code&gt;prod/grafana/datasource/team-3/elasticsearch-1&lt;/code&gt;&lt;/p&gt;
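&lt;p&gt;For illustration, this naming convention is easy to capture in a small helper:&lt;/p&gt;

```python
def secret_name(env, team, datasource):
    """Build a Secrets Manager secret name following the naming convention."""
    return f"{env}/grafana/datasource/{team}/{datasource}"

print(secret_name("prod", "team-3", "elasticsearch-1"))
# prod/grafana/datasource/team-3/elasticsearch-1
```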

&lt;p&gt;Secret values are stored in JSON format. E.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"elasticUser"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"elasticP@ssw0rD"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each secret has two required tags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;env: prod/qa/dev&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;secret-type: grafana-datasource&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A data source file now looks like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Elasticsearch Example &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elasticsearch&lt;/span&gt;
&lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
&lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://elasticsearc.example.com:9200&lt;/span&gt;
&lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@team-3/elasticsearch-1:username"&lt;/span&gt;
&lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@team-3/elasticsearch-1:password"&lt;/span&gt;
&lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logs-index&lt;/span&gt;
&lt;span class="na"&gt;basicAuth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;jsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;esVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;7.7.0&lt;/span&gt;
  &lt;span class="na"&gt;includeFrozen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;logLevelField&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
  &lt;span class="na"&gt;logMessageField&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
  &lt;span class="na"&gt;maxConcurrentShardRequests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;timeField&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@timestamp"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For step #2 of the runbook, I wrote a Python script that gets secret values from Secrets Manager and passes them to step #3. The script returns the secrets as JSON with the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"team-1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"clickhouse-2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"team-1-clickhouse-2-username"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"team-1-clickhouse-2-password"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"team-2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mysql-1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mysql-1-username"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mysql1P@ssword"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"team-3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"victoria-metrics-1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"authorizationToken"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vict0ri@Metric$Tok3n"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"elasticsearch-1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"elasticUser"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"elasticP@ssw0rD"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
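&lt;p&gt;The &lt;code&gt;@{team}/{datasource}:{key}&lt;/code&gt; placeholders seen in the data source files can then be resolved against this structure. Here is a minimal sketch of that substitution; the regex and function names are my own, not from the actual runbook:&lt;/p&gt;

```python
import re

# Matches placeholders such as @team-3/elasticsearch-1:username
PLACEHOLDER = re.compile(r"@([\w-]+)/([\w-]+):(\w+)")

def resolve_placeholders(text, secrets):
    """Replace @{team}/{datasource}:{key} markers with values from the secrets dict."""
    def lookup(match):
        team, datasource, key = match.groups()
        return secrets[team][datasource][key]
    return PLACEHOLDER.sub(lookup, text)

secrets = {"team-3": {"elasticsearch-1": {"username": "elasticUser", "password": "elasticP@ssw0rD"}}}
line = 'user: "@team-3/elasticsearch-1:username"'
print(resolve_placeholders(line, secrets))  # user: "elasticUser"
```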



&lt;p&gt;For step #3 of the runbook, I wrote another small Python script that combines the data source files in the repository into Grafana data source provisioning files, replacing the secret placeholders with the secret values from Secrets Manager.&lt;br&gt;
The resulting Grafana data source provisioning configuration looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;root@grafana datasources]# &lt;span class="nb"&gt;pwd&lt;/span&gt;
/var/lib/grafana/provisioning/datasources

&lt;span class="o"&gt;[&lt;/span&gt;root@grafana datasources]# ll
total 16
&lt;span class="nt"&gt;-rw-r--r--&lt;/span&gt; 1 root root 362 May 22 11:00 team-1.yaml
&lt;span class="nt"&gt;-rw-r--r--&lt;/span&gt; 1 root root 628 May 22 11:00 team-2.yaml
&lt;span class="nt"&gt;-rw-r--r--&lt;/span&gt; 1 root root 669 May 22 11:00 team-3.yaml
&lt;span class="nt"&gt;-rw-r--r--&lt;/span&gt; 1 root root 515 May 22 11:00 team-4.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;/var/lib/grafana/provisioning/datasources/team-3.yaml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;datasources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
  &lt;span class="na"&gt;basicAuth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logs-index&lt;/span&gt;
  &lt;span class="na"&gt;jsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;esVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;7.7.0&lt;/span&gt;
    &lt;span class="na"&gt;includeFrozen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;logLevelField&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
    &lt;span class="na"&gt;logMessageField&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
    &lt;span class="na"&gt;maxConcurrentShardRequests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
    &lt;span class="na"&gt;timeField&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@timestamp'&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Elasticsearch Example &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elasticP@ssw0rD&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elasticsearch&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://elasticsearc.example.com:9200&lt;/span&gt;
  &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elasticUser&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
  &lt;span class="na"&gt;isDefault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;httpHeaderName1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorization&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Victoria Metrics Example &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;secureJsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;httpHeaderValue1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bearer vict0ri@Metric$Tok3n&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://ultra-metrics.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
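&lt;p&gt;Leaving aside the YAML parsing and dumping (handled by a YAML library in the real script), the combining step boils down to wrapping each team's data source definitions in Grafana's provisioning schema. A sketch, with abbreviated example data:&lt;/p&gt;

```python
def build_provisioning(datasources):
    """Wrap a team's data source definitions in Grafana's provisioning schema."""
    return {"apiVersion": 1, "datasources": list(datasources)}

# Each dict stands for one parsed YAML file from the team's directory.
team3 = [
    {"name": "Elasticsearch Example 1", "type": "elasticsearch"},
    {"name": "Victoria Metrics Example 1", "type": "prometheus"},
]
provisioning = build_provisioning(team3)
# A YAML library (e.g. PyYAML) would then dump `provisioning` to team-3.yaml.
```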



</description>
      <category>monitoring</category>
      <category>aws</category>
      <category>grafana</category>
      <category>gitops</category>
    </item>
    <item>
      <title>Ultra Monitoring with Victoria Metrics</title>
      <dc:creator>An Nguyen</dc:creator>
      <pubDate>Sun, 01 May 2022 09:48:55 +0000</pubDate>
      <link>https://dev.to/aws-builders/ultra-monitoring-with-victoria-metrics-1p2</link>
      <guid>https://dev.to/aws-builders/ultra-monitoring-with-victoria-metrics-1p2</guid>
      <description>&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;Recently, my team was assigned the task of redesigning our monitoring system. My organization has an ecosystem of hundreds of applications deployed across multiple cloud providers, mostly AWS (tens of AWS accounts in our AWS Organization).&lt;/p&gt;

&lt;p&gt;The old monitoring system was designed and deployed years ago. It's a Prometheus stack with a single Prometheus instance, Grafana, Alertmanager, and various types of exporters. It was good at the time; as the ecosystem grew fast, however, problems appeared:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not highly available&lt;/li&gt;
&lt;li&gt;Not scalable; scaling is too complex and inefficient&lt;/li&gt;
&lt;li&gt;Data retention is too short (14 days) due to dramatically decreasing performance and scaling difficulties&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Given all the problems above, the ideal solution must meet the following requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly available&lt;/li&gt;
&lt;li&gt;Scalable, able to scale easily&lt;/li&gt;
&lt;li&gt;Disaster recovery&lt;/li&gt;
&lt;li&gt;Data must be stored for at least a year&lt;/li&gt;
&lt;li&gt;Compatible with Prom stack and PromQL so that we don’t spend much effort on migration and getting familiar with the new stack.&lt;/li&gt;
&lt;li&gt;Have an efficient way to collect metrics from multiple AWS accounts&lt;/li&gt;
&lt;li&gt;The deployment process must be automated, both infra and configurations&lt;/li&gt;
&lt;li&gt;Easy to manage and maintain, with daily operations tasks automated&lt;/li&gt;
&lt;li&gt;Nice to have: multi-tenancy support&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;After researching and building some PoCs, we found that &lt;a href="https://victoriametrics.com/" rel="noopener noreferrer"&gt;Victoria Metrics&lt;/a&gt; is a good fit for us. Victoria Metrics has all of the required features: high availability is built in, and scaling is easy since every component is separate. We implemented it and are using it in the production environment. We call it &lt;code&gt;Ultra Metrics&lt;/code&gt;. Let's look at our solution in detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  High-level architecture
&lt;/h3&gt;

&lt;p&gt;This is the high-level architecture of the solution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcgt5qzry0yjefrx9mmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcgt5qzry0yjefrx9mmp.png" alt="high-level architecture" width="800" height="753"&gt;&lt;/a&gt;&lt;br&gt;
We use &lt;a href="https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html" rel="noopener noreferrer"&gt;cluster version&lt;/a&gt; of Victoria Metrics (VM), the cluster has some major components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vmstorage&lt;/code&gt;: stores the raw data and returns the queried data on the given time range for the given label filters. This is the only stateful component in the cluster.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vminsert&lt;/code&gt;: accepts the ingested data and spreads it among &lt;code&gt;vmstorage&lt;/code&gt;
 nodes according to consistent hashing over metric name and all its labels.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vmselect&lt;/code&gt;: performs incoming queries by fetching the needed data from all the configured &lt;code&gt;vmstorage&lt;/code&gt; nodes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vmauth&lt;/code&gt;: is a simple auth proxy, router for the cluster. It reads auth credentials from &lt;em&gt;Authorization&lt;/em&gt; HTTP header (&lt;em&gt;Basic Auth&lt;/em&gt;, &lt;em&gt;Bearer token&lt;/em&gt;, and &lt;a href="https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1897" rel="noopener noreferrer"&gt;&lt;em&gt;InfluxDB authorization&lt;/em&gt;&lt;/a&gt; is supported), matches them against configs, and proxies incoming HTTP requests to the configured targets.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vmagent&lt;/code&gt;: is a tiny but mighty agent which helps you collect metrics from various sources and store them in Victoria Metrics or any other Prometheus-compatible storage systems that support the &lt;em&gt;remote_write&lt;/em&gt; protocol.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vmalert&lt;/code&gt;: executes a list of the given &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/" rel="noopener noreferrer"&gt;alerting&lt;/a&gt; or &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/" rel="noopener noreferrer"&gt;recording&lt;/a&gt; rules against configured data sources. For sending alerting notifications &lt;code&gt;vmalert&lt;/code&gt; relies on configured &lt;a href="https://github.com/prometheus/alertmanager" rel="noopener noreferrer"&gt;Alertmanager&lt;/a&gt;. Recording rules results are persisted via &lt;a href="https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations" rel="noopener noreferrer"&gt;remote write&lt;/a&gt; protocol. &lt;code&gt;vmalert&lt;/code&gt; is heavily inspired by &lt;a href="https://prometheus.io/docs/alerting/latest/overview/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; implementation and aims to be compatible with its syntax&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;promxy&lt;/code&gt;: used for querying the data from multiple clusters. It's a Prometheus proxy that makes many shards of Prometheus appear as a single API endpoint to the user.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How does the solution fit into our case?
&lt;/h3&gt;

&lt;p&gt;Here is how &lt;code&gt;Ultra Metrics&lt;/code&gt; addresses the requirements:&lt;/p&gt;
&lt;h4&gt;
  
  
  High availability
&lt;/h4&gt;

&lt;p&gt;The system is able to continue accepting new incoming data and processing new queries when some components of the cluster are temporarily unavailable.&lt;/p&gt;

&lt;p&gt;We accomplish this by using the cluster version of VM. Each component is deployed with redundancy and auto-healing. Data is also made redundant by replication (&lt;a href="https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#replication-and-data-safety" rel="noopener noreferrer"&gt;read more&lt;/a&gt;) across multiple nodes in the same cluster.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vminsert&lt;/code&gt; and &lt;code&gt;vmselect&lt;/code&gt; are stateless components and deployed behind a proxy &lt;code&gt;vmauth&lt;/code&gt;. &lt;code&gt;vmauth&lt;/code&gt; stops routing requests into unavailable nodes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vmstorage&lt;/code&gt; is the only stateful component, however, since &lt;a href="https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#replication-and-data-safety" rel="noopener noreferrer"&gt;data is redundant&lt;/a&gt;, it’s fine if some nodes go down temporarily.

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vminsert&lt;/code&gt; re-routes incoming data from unavailable &lt;code&gt;vmstorage&lt;/code&gt; nodes to healthy &lt;code&gt;vmstorage&lt;/code&gt; nodes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vmselect&lt;/code&gt; continues serving responses if a &lt;code&gt;vmstorage&lt;/code&gt; node is unavailable&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Scalability
&lt;/h4&gt;

&lt;p&gt;Since each component has a separate responsibility and most services are stateless, it's much easier to scale both vertically and horizontally, and each component can scale independently.&lt;/p&gt;

&lt;p&gt;The storage component is the only stateful one. However, &lt;code&gt;vmstorage&lt;/code&gt; nodes don't know about each other, don't communicate with each other, and don't share any data, which simplifies cluster maintenance and scaling. Scaling the storage layer is now easy: just add new nodes and update the &lt;code&gt;vminsert&lt;/code&gt; and &lt;code&gt;vmselect&lt;/code&gt; configurations. That's it; no more steps are required.&lt;/p&gt;
&lt;h4&gt;
  
  
  Disaster recovery
&lt;/h4&gt;

&lt;p&gt;We follow Victoria Metrics' recommendation that all components run in the same subnet (same availability zone) to take advantage of high bandwidth, low latency, and thus low error rates. This increases cluster performance.&lt;/p&gt;

&lt;p&gt;To have a multi-AZ or even multi-region setup (we chose multi-region), we run an independent cluster in each AZ or region, then configure &lt;code&gt;vmagent&lt;/code&gt; to send data to all clusters; &lt;code&gt;vmagent&lt;/code&gt; has this feature built in. &lt;a href="https://github.com/jacksontj/promxy" rel="noopener noreferrer"&gt;&lt;code&gt;promxy&lt;/code&gt;&lt;/a&gt; may be used for querying the data from multiple clusters. It provides a single data source for all PromQL queries, meaning Grafana can have a single source and we can run globally aggregated PromQL queries.&lt;/p&gt;

&lt;p&gt;Failover can be achieved with a combination of &lt;code&gt;Route53&lt;/code&gt; failover and/or &lt;code&gt;promxy&lt;/code&gt;. When an entire AZ/region goes down, the system is still available for both read and write operations. Once the AZ/region is back in operation, &lt;code&gt;vmagent&lt;/code&gt; sends the missing data to that cluster from its caching buffer.&lt;/p&gt;
&lt;h4&gt;
  
  
  Multi-tenancy
&lt;/h4&gt;

&lt;p&gt;The system is a centralized monitoring system used by multiple teams. Each team's data is stored independently and isolated from the others, and a team can access only its own data. This is exactly what VM's multi-tenancy feature offers.&lt;/p&gt;

&lt;p&gt;The Victoria Metrics cluster has built-in support for multiple isolated tenants. Authentication and authorization for tenants are expected to be handled by a separate service sitting in front of the Victoria Metrics cluster, such as &lt;a href="https://docs.victoriametrics.com/vmauth.html" rel="noopener noreferrer"&gt;&lt;code&gt;vmauth&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Data for all tenants is evenly spread among the available &lt;code&gt;vmstorage&lt;/code&gt; nodes. This guarantees an even load among &lt;code&gt;vmstorage&lt;/code&gt; nodes even when different tenants have different amounts of data and different query loads. Performance and resource usage also don't depend on the number of tenants.&lt;/p&gt;

&lt;p&gt;Let’s say a tenant is an AWS account in the above architecture. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;vmagent&lt;/code&gt; remote write URLs are configured as in the example below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;URLs for data ingestion:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;https://us-east-1.ultra-metrics.com:8427/api/v1/write&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://ap-southeast-1.ultra-metrics.com:8427/api/v1/write&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;URLs for Prometheus querying:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;https://us-east-1.ultra-metrics.com:8427/api/v1/query&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://ap-southeast-1.ultra-metrics.com:8427/api/v1/query&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;vmauth&lt;/code&gt; configuration looks like this snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;users&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="c1"&gt;# Requests with the 'Authorization: Bearer account1Secret' and 'Authorization: Token account1Secret'&lt;/span&gt;
&lt;span class="c1"&gt;# header are proxied to https://&amp;lt;internal-nlb-domain&amp;gt;:8481&lt;/span&gt;
&lt;span class="c1"&gt;# For example, https://&amp;lt;internal-nlb-domain&amp;gt;:8427/api/v1/query is proxied to https://&amp;lt;internal-nlb-domain&amp;gt;:8481/select/1/prometheus/api/v1/query&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;bearer_token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;account1Secret&lt;/span&gt;
  &lt;span class="na"&gt;url_map&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;src_paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/query&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/query_range&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/series&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/label/[^/]+/values&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/metadata&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/labels&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/query_exemplars&lt;/span&gt;
    &lt;span class="na"&gt;url_prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://&amp;lt;internal-nlb-domain&amp;gt;:8481/select/1/prometheus&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;src_paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/write&lt;/span&gt;
    &lt;span class="na"&gt;url_prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://&amp;lt;internal-nlb-domain&amp;gt;:8480/insert/1/prometheus&lt;/span&gt;

&lt;span class="c1"&gt;# Requests with the 'Authorization: Bearer account2Secret' and 'Authorization: Token account2Secret'&lt;/span&gt;
&lt;span class="c1"&gt;# header are proxied to https://&amp;lt;internal-nlb-domain&amp;gt;:8481&lt;/span&gt;
&lt;span class="c1"&gt;# For example, https://&amp;lt;internal-nlb-domain&amp;gt;:8427/api/v1/query is proxied to https://&amp;lt;internal-nlb-domain&amp;gt;:8481/select/2/prometheus/api/v1/query&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;bearer_token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;account2Secret&lt;/span&gt;
  &lt;span class="na"&gt;url_map&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;src_paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/query&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/query_range&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/series&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/label/[^/]+/values&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/metadata&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/labels&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/query_exemplars&lt;/span&gt;
    &lt;span class="na"&gt;url_prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://&amp;lt;internal-nlb-domain&amp;gt;:8481/select/2/prometheus&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;src_paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/write&lt;/span&gt;
    &lt;span class="na"&gt;url_prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://&amp;lt;internal-nlb-domain&amp;gt;:8480/insert/2/prometheus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;8427&lt;/code&gt; is &lt;code&gt;vmauth&lt;/code&gt;'s port&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;8481&lt;/code&gt; is &lt;code&gt;vmselect&lt;/code&gt;'s port&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;8480&lt;/code&gt; is &lt;code&gt;vminsert&lt;/code&gt;'s port&lt;/li&gt;
&lt;/ul&gt;
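&lt;p&gt;To make the routing above concrete, here is a minimal Python sketch (illustrative only, not part of &lt;code&gt;vmauth&lt;/code&gt;; the token-to-tenant mapping and the NLB placeholder are assumptions) that mirrors how a bearer token plus request path resolve to an internal &lt;code&gt;vminsert&lt;/code&gt;/&lt;code&gt;vmselect&lt;/code&gt; URL:&lt;/p&gt;

```python
# Illustrative sketch of the vmauth url_map above: the bearer token selects the
# tenant (accountID), and the request path selects vminsert (8480) or vmselect (8481).
TOKEN_TO_TENANT = {"account1Secret": 1, "account2Secret": 2}  # assumed mapping
NLB = "https://internal-nlb-domain"  # stands in for the internal NLB domain

def route(bearer_token, path):
    tenant = TOKEN_TO_TENANT[bearer_token]
    if path == "/api/v1/write":
        # ingestion path goes to vminsert
        return f"{NLB}:8480/insert/{tenant}/prometheus{path}"
    # query-style paths (/api/v1/query, /api/v1/series, ...) go to vmselect
    return f"{NLB}:8481/select/{tenant}/prometheus{path}"
```

For example, a query carrying &lt;code&gt;account1Secret&lt;/code&gt; lands on &lt;code&gt;/select/1/prometheus&lt;/code&gt;, while a write carrying &lt;code&gt;account2Secret&lt;/code&gt; lands on &lt;code&gt;/insert/2/prometheus&lt;/code&gt;, matching the two &lt;code&gt;url_map&lt;/code&gt; entries in the snippet above.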

&lt;h4&gt;
  
  
  Prom-stack compatibility
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;VM implements the &lt;a href="https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#prometheus-querying-api-usage" rel="noopener noreferrer"&gt;Prometheus querying API&lt;/a&gt;, so there are no changes to query APIs, syntax, etc. All the tools in use continue to function as they are.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We don't even need to make any changes (sidecar, agent, etc.) except adding a few lines of configuration to the old monitoring system to make it work with the new system.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://us-east-1.ultra-metrics.com:8427/api/v1/write&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://ap-southeast-1.ultra-metrics.com:8427/api/v1/write&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Thus, we can continue using the old monitoring system while experimenting with the new one.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
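&lt;p&gt;Because VM serves the standard Prometheus querying API through &lt;code&gt;vmauth&lt;/code&gt;, an existing client only needs the tenant's bearer token. Here is a hedged Python sketch of building such a request (the endpoint and token below are the examples from this post, not real credentials):&lt;/p&gt;

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_query_request(endpoint, token, promql):
    # Standard Prometheus instant-query endpoint, served unchanged by vmauth/vmselect.
    url = endpoint + "/api/v1/query?" + urlencode({"query": promql})
    return Request(url, headers={"Authorization": "Bearer " + token})

# Example values taken from this post (the token is a placeholder, not a real secret):
req = build_query_request(
    "https://us-east-1.ultra-metrics.com:8427",
    "account1Secret",
    "up",
)
```

Passing &lt;code&gt;req&lt;/code&gt; to &lt;code&gt;urllib.request.urlopen&lt;/code&gt; (or pointing any PromQL-speaking tool at the same URL with the same header) would hit &lt;code&gt;vmauth&lt;/code&gt;, which proxies it to the tenant's &lt;code&gt;vmselect&lt;/code&gt; path.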

&lt;h3&gt;
  
  
  Some statistics
&lt;/h3&gt;

&lt;p&gt;Will be updated soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure provisioning by CDK&lt;/li&gt;
&lt;li&gt;Automate cluster deployment using AWS Automation runbook&lt;/li&gt;
&lt;li&gt;GitOps for daily operation tasks on the cluster

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/aws-builders/a-gitops-way-to-manage-grafana-data-sources-at-scale-59la"&gt;A GitOps Way To Manage Grafana Data Sources At Scale&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Dynamic routing for multi-tenant multi-region React application with AWS CloudFront</title>
      <dc:creator>An Nguyen</dc:creator>
      <pubDate>Sun, 23 Jan 2022 14:53:50 +0000</pubDate>
      <link>https://dev.to/aws-builders/dynamic-routing-for-multi-tenant-multi-region-react-application-with-aws-cloudfront-389g</link>
      <guid>https://dev.to/aws-builders/dynamic-routing-for-multi-tenant-multi-region-react-application-with-aws-cloudfront-389g</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In my organization, we built a SaaS application. It's a multi-tenant application. We leverage AWS to host the application and deliver the best experience to users across the globe. The application spans multiple regions, which helps us distribute and isolate infrastructure. This improves availability and mitigates outages caused by disasters: if there is an outage in one region, only that region is affected, not the others.&lt;/p&gt;

&lt;p&gt;Our application has two main components: a frontend module - a single page web application (React) - and a backend module that is a set of microservices running on Kubernetes clusters. It's quite a basic architecture. However, there are challenges to deal with, especially since the application is multi-tenant and multi-region.&lt;/p&gt;

&lt;p&gt;In this post, let’s talk about the frontend module.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;As said, the frontend module is designed and deployed as a region-specific application. Initially, it was deployed in regional Kubernetes clusters as Nginx pods. For each region, the module is built and hosted in a separate directory of a Docker image; based on the region in which it's deployed, the corresponding directory is used to serve requests.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FzfmkS06z%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-Nginx-deployment.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FzfmkS06z%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-Nginx-deployment.png" alt="Multi-tenant-multi-region-React-application-with-AWS-Cloud-Front-Nginx-deployment.png"&gt;&lt;/a&gt;&lt;br&gt;
This deployment architecture requires us to operate and maintain Nginx in Kubernetes clusters as well as handle scaling to meet on-demand user traffic. It's also not good in terms of latency, since every end-user request has to reach the Nginx pods in a specific region. Let's say a user located in the US accesses a tenant in Singapore, such as &lt;a href="https://xyz.example.com" rel="noopener noreferrer"&gt;https://xyz.example.com&lt;/a&gt;. That user's requests are routed from the US to Singapore and back. That increases latency, so site loading speed is poor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;To overcome the above challenges and provide a better user experience, we tried to find a solution that meets the requirements below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce latency as much as possible so site performance increases no matter where end-users are&lt;/li&gt;
&lt;li&gt;Reduce operational cost as much as we can&lt;/li&gt;
&lt;li&gt;For business reasons, we want some regions to go live
before/after others, so the application must remain region-specific&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Solutions
&lt;/h2&gt;

&lt;p&gt;Fortunately, a CDN (AWS CloudFront) is the best fit for our case. It's an ideal solution that meets the above requirements.&lt;/p&gt;

&lt;p&gt;There are two possible solutions:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. A CloudFront distribution for each region&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FSxYrMwZ1%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-Multi-CFs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FSxYrMwZ1%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-Multi-CFs.png" alt="Multi-tenant-multi-region-React-application-with-AWS-Cloud-Front-Multi-CFs.png"&gt;&lt;/a&gt;&lt;br&gt;
This is the first solution that comes to mind, and it is the simplest one. However, we quickly realized that it cannot be implemented because of a CloudFront limitation on &lt;code&gt;Alternate domain names&lt;/code&gt;. Below is the error when setting up a second distribution with the same alternate domain name &lt;code&gt;*.example.com&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;

&lt;span class="nx"&gt;Invalid&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="nx"&gt;provided&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;One&lt;/span&gt; &lt;span class="nx"&gt;or&lt;/span&gt; &lt;span class="nx"&gt;more&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;CNAMEs&lt;/span&gt; &lt;span class="nx"&gt;you&lt;/span&gt; &lt;span class="nx"&gt;provided&lt;/span&gt; &lt;span class="nx"&gt;are&lt;/span&gt; &lt;span class="nx"&gt;already&lt;/span&gt; &lt;span class="nx"&gt;associated&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;different&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Read more: &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/CNAMEs.html#alternate-domain-names-restrictions" rel="noopener noreferrer"&gt;alternate-domain-names-restrictions&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
&lt;strong&gt;2. One CloudFront distribution + Lambda@Edge for all regions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We leverage CloudFront, &lt;a href="https://aws.amazon.com/vi/lambda/edge/" rel="noopener noreferrer"&gt;Lambda@Edge&lt;/a&gt;, and a &lt;a href="https://aws.amazon.com/vi/dynamodb/global-tables/" rel="noopener noreferrer"&gt;DynamoDB global table&lt;/a&gt;. Here is a high-level view of the solution:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FFsBp19xJ%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-One-CF.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FFsBp19xJ%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-One-CF.png" alt="Multi-tenant-multi-region-React-application-with-AWS-Cloud-Front-One-CF.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since we host the frontend module for each region in a directory of an S3 bucket, we have to implement some kind of dynamic routing for the CloudFront distribution that sends origin requests to the correct directory of the S3 bucket.&lt;/p&gt;

&lt;p&gt;To implement that dynamic routing, we &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/lambda-at-the-edge.html" rel="noopener noreferrer"&gt;use Lambda@Edge&lt;/a&gt;. Its capability allows us to use any attribute of the HTTP request such as &lt;code&gt;Host&lt;/code&gt;, &lt;code&gt;URIPath&lt;/code&gt;, &lt;code&gt;Headers&lt;/code&gt;, &lt;code&gt;Cookies&lt;/code&gt;, or &lt;code&gt;Query String&lt;/code&gt; and set the Origin accordingly.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FMKTtWc1S%2Flambda-edge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FMKTtWc1S%2Flambda-edge.png" alt="lambda-edge.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our case, we'll use the &lt;code&gt;Origin request&lt;/code&gt; event to trigger a Lambda@Edge function that inspects the &lt;code&gt;Host&lt;/code&gt; header to determine the location of the tenant and routes the request to the correct directory of the S3 origin bucket.&lt;/p&gt;

&lt;p&gt;The following diagram illustrates the sequence of events for our case.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FrpSNh52S%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-FE.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FrpSNh52S%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-FE.png" alt="Multi-tenant-multi-region-React-application-with-AWS-Cloud-Front-FE.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is how the process works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User navigates to the tenant. E.g. &lt;a href="https://xyz.example.com" rel="noopener noreferrer"&gt;https://xyz.example.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CloudFront serves content from cache if available, otherwise it goes to step 3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only after&lt;/strong&gt; a CloudFront cache miss is the &lt;strong&gt;origin request&lt;/strong&gt; trigger fired for that behavior. This triggers the Lambda@Edge function to modify the origin request.&lt;/li&gt;
&lt;li&gt;The Lambda@Edge function queries the DynamoDB table to determine which folder should be served for that tenant.&lt;/li&gt;
&lt;li&gt;The function then sends the request on to the chosen folder.&lt;/li&gt;
&lt;li&gt;The object is returned to CloudFront from Amazon S3, served to the viewer, and cached, if applicable.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Issues
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Cannot get tenant identity from Origin request.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To determine the tenant location, we need the &lt;code&gt;Host&lt;/code&gt; header, which is also the tenant identity. However, the origin request overrides the &lt;code&gt;Host&lt;/code&gt; header with the S3 bucket host, see &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html#request-custom-headers-behavior" rel="noopener noreferrer"&gt;HTTP request headers and CloudFront behavior&lt;/a&gt;. We will use the &lt;code&gt;X-Forwarded-Host&lt;/code&gt; header instead. Wait, where does &lt;code&gt;X-Forwarded-Host&lt;/code&gt; come from? It is a copy of the &lt;code&gt;Host&lt;/code&gt; header, added by a CloudFront function triggered by the &lt;code&gt;Viewer request&lt;/code&gt; event. &lt;/p&gt;

&lt;p&gt;Here is what the CloudFront function (viewer request) looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-forwarded-host&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;host&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here is what the Lambda@Edge function (origin request) looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;boto3.dynamodb.conditions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;request&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tenant-location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dynamodb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;KeyConditionExpression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tenant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x-forwarded-host&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="n"&gt;ScanIndexForward&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dynamodb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;KeyConditionExpression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tenant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x-forwarded-host&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="n"&gt;ScanIndexForward&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Items&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;origin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Items&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Region&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;302&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.example.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;}]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
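&lt;p&gt;The origin-rewrite step of the function above boils down to a small pure transformation. Here is an extracted sketch (a hypothetical helper, not part of the deployed code) that mirrors the folder-resolution logic so it can be reasoned about without AWS access:&lt;/p&gt;

```python
def resolve_s3_path(items):
    # Mirrors the Lambda@Edge logic above: the first matching tenant-location
    # item decides which regional directory of the S3 bucket serves the request.
    if items:
        return "/" + items[0]["Region"]
    # No item found: the real function returns a 302 redirect to www.example.com.
    return None
```

For example, a tenant whose &lt;code&gt;tenant-location&lt;/code&gt; item carries &lt;code&gt;Region = ap-southeast-1&lt;/code&gt; is served from the &lt;code&gt;/ap-southeast-1&lt;/code&gt; directory of the origin bucket.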

&lt;p&gt;&lt;strong&gt;2. High latency when cache miss at edge region&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This issue is the answer to the question “why a DynamoDB global table?”&lt;/p&gt;

&lt;p&gt;In the first implementation, a regular DynamoDB table was used. We experienced poor latency (&lt;strong&gt;&lt;em&gt;3.57 seconds&lt;/em&gt;&lt;/strong&gt;) when loading the site on a cache miss at the CloudFront edge region. Inspecting the CloudWatch logs, we found that the Lambda function took more than &lt;strong&gt;&lt;em&gt;2.2 seconds&lt;/em&gt;&lt;/strong&gt; to complete. Querying tenant info from the DynamoDB table was the most time-consuming step.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;

&lt;span class="nx"&gt;REPORT&lt;/span&gt; &lt;span class="nx"&gt;RequestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;c12f91db&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5880&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="nx"&gt;ff6&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;94&lt;/span&gt;&lt;span class="nx"&gt;c3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;d5d1f454092c&lt;/span&gt;  &lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2274.74&lt;/span&gt; &lt;span class="nx"&gt;ms&lt;/span&gt;    &lt;span class="nx"&gt;Billed&lt;/span&gt; &lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2275&lt;/span&gt; &lt;span class="nx"&gt;ms&lt;/span&gt;    &lt;span class="nx"&gt;Memory&lt;/span&gt; &lt;span class="nx"&gt;Size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt; &lt;span class="nx"&gt;MB&lt;/span&gt; &lt;span class="nx"&gt;Max&lt;/span&gt; &lt;span class="nx"&gt;Memory&lt;/span&gt; &lt;span class="nx"&gt;Used&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;69&lt;/span&gt; &lt;span class="nx"&gt;MB&lt;/span&gt;  &lt;span class="nx"&gt;Init&lt;/span&gt; &lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;335.50&lt;/span&gt; &lt;span class="nx"&gt;ms&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After CloudFront caches the response at the edge region, latency is good, so only the first user to access a tenant in a given region experiences the high latency. Still, it’s better to eliminate the issue entirely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/vi/dynamodb/global-tables/" rel="noopener noreferrer"&gt;DynamoDB global table&lt;/a&gt; helps to overcome this issue.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FMpvY2XgF%2FDynamo-DB-Global-Tables-01-dad2508b80e8b7c544fe1a94a2abd3f770b789da.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FMpvY2XgF%2FDynamo-DB-Global-Tables-01-dad2508b80e8b7c544fe1a94a2abd3f770b789da.png" alt="Dynamo-DB-Global-Tables-01-dad2508b80e8b7c544fe1a94a2abd3f770b789da.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After enabling the DynamoDB global table, request latency dropped from &lt;strong&gt;&lt;em&gt;3.57 seconds&lt;/em&gt;&lt;/strong&gt; to &lt;strong&gt;&lt;em&gt;968 milliseconds&lt;/em&gt;&lt;/strong&gt;, and the Lambda function now takes &lt;strong&gt;&lt;em&gt;254 milliseconds&lt;/em&gt;&lt;/strong&gt; to complete.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="nx"&gt;REPORT&lt;/span&gt; &lt;span class="nx"&gt;RequestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;af3889c5&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;838&lt;/span&gt;&lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="nx"&gt;aed&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;bc0c&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="nx"&gt;d96e890d444&lt;/span&gt;  &lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;253.61&lt;/span&gt; &lt;span class="nx"&gt;ms&lt;/span&gt; &lt;span class="nx"&gt;Billed&lt;/span&gt; &lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;254&lt;/span&gt; &lt;span class="nx"&gt;ms&lt;/span&gt; &lt;span class="nx"&gt;Memory&lt;/span&gt; &lt;span class="nx"&gt;Size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt; &lt;span class="nx"&gt;MB&lt;/span&gt; &lt;span class="nx"&gt;Max&lt;/span&gt; &lt;span class="nx"&gt;Memory&lt;/span&gt; &lt;span class="nx"&gt;Used&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt; &lt;span class="nx"&gt;MB&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
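&lt;p&gt;For reference, adding replicas under the current global tables version (2019.11.21) is an &lt;code&gt;UpdateTable&lt;/code&gt; call. Below is a minimal boto3 sketch; the table name and regions are hypothetical examples, not the ones used here.&lt;/p&gt;

```python
# Sketch: add replicas to an existing DynamoDB table (global tables
# version 2019.11.21). Table name and regions are hypothetical.

def replica_update_params(table_name, new_regions):
    """Build the UpdateTable kwargs that create replicas in new_regions."""
    return {
        "TableName": table_name,
        "ReplicaUpdates": [
            {"Create": {"RegionName": region}} for region in new_regions
        ],
    }

def add_replicas(table_name, new_regions):
    import boto3  # imported lazily; only needed when actually calling AWS
    client = boto3.client("dynamodb")
    return client.update_table(**replica_update_params(table_name, new_regions))
```

&lt;p&gt;The Lambda function serving a request can then query the replica in (or nearest to) its own region instead of crossing the globe to a single table.&lt;/p&gt;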
&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The application architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FmDBNKhTK%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FmDBNKhTK%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front.png" alt="Multi-tenant-multi-region-React-application-with-AWS-Cloud-Front.png"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloudfront</category>
      <category>lambda</category>
      <category>react</category>
    </item>
    <item>
      <title>Tracking and Notifying on AWS Sign-in activities</title>
      <dc:creator>An Nguyen</dc:creator>
      <pubDate>Sat, 08 Jan 2022 16:08:10 +0000</pubDate>
      <link>https://dev.to/aws-builders/tracking-and-notifying-on-aws-sign-in-activities-31el</link>
      <guid>https://dev.to/aws-builders/tracking-and-notifying-on-aws-sign-in-activities-31el</guid>
      <description>&lt;p&gt;It is critical to prevent root user access from getting into the wrong hands and to be aware whenever root user activity occurs in your AWS account. Here are some of the key recommendations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Avoid using the root account&lt;/em&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;All IAM users, including the root account, must have multi-factor authentication (MFA) enabled&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;Abnormal activities (many failed sign-in attempts, ...) must be detected and notified&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, there are certain actions that can only be performed by the root user. To be certain that all root user activity is authorized and expected, it is important to monitor root API calls to a given AWS account and to send a notification when this type of activity is detected. This notification gives you the ability to take any necessary steps when illegitimate root API activity is detected, or it can simply serve as a record for any future auditing needs.&lt;/p&gt;

&lt;p&gt;To comply with the best practices above, in this post I walk through a solution that tracks and notifies on root user activities and abnormal sign-in activities in an AWS account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Track all sign-in and related activities&lt;/li&gt;
&lt;li&gt;Be notified/alerted whenever the root account signs in&lt;/li&gt;
&lt;li&gt;Be notified/alerted whenever an IAM user signs in without MFA&lt;/li&gt;
&lt;li&gt;Be notified/alerted if the number of failed sign-in attempts of an IAM user is greater than 3 in the last hour&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Create and enable a multi-region &lt;a href="https://aws.amazon.com/cloudtrail" rel="noopener noreferrer"&gt;AWS CloudTrail&lt;/a&gt; trail for all AWS regions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;The picture below shows a high-level architecture of the solution:&lt;br&gt;
&lt;a href="https://postimg.cc/MMQ4zj9G" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1d4anholalrnmveq3ku.png" alt="architect.png" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;IAM users and/or the root account sign in to either the Web Console or the Mobile Console&lt;/li&gt;
&lt;li&gt;The sign-in activity is captured and tracked by CloudTrail&lt;/li&gt;
&lt;li&gt;The CloudTrail event is sent to EventBridge automatically&lt;/li&gt;
&lt;li&gt;EventBridge triggers a state machine in Step Functions&lt;/li&gt;
&lt;li&gt;The state machine processes the event and sends a message to an SNS topic if needed&lt;/li&gt;
&lt;li&gt;A Lambda function subscribed to the SNS topic sends the appropriate notifications to Slack&lt;/li&gt;
&lt;/ol&gt;
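&lt;p&gt;The branching the state machine performs over a sign-in event can be sketched in plain Python. The field names follow CloudTrail's &lt;code&gt;ConsoleLogin&lt;/code&gt; events; the returned labels are illustrative, not the actual state names.&lt;/p&gt;

```python
# Sketch of the state machine's branching over a CloudTrail ConsoleLogin
# event. Field names follow CloudTrail sign-in events; the returned
# labels are illustrative only.

def classify_sign_in(event):
    """Return which alert (if any) a sign-in event should raise."""
    if event.get("eventName") != "ConsoleLogin":
        return "not-sign-in"
    success = event.get("responseElements", {}).get("ConsoleLogin") == "Success"
    if not success:
        return "failed-attempt"  # counted later against the threshold
    if event.get("userIdentity", {}).get("type") == "Root":
        return "alert-root-sign-in"
    if event.get("additionalEventData", {}).get("MFAUsed") != "Yes":
        return "alert-no-mfa"
    return "ok"
```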

&lt;p&gt;Here are the details of the state machine, the main part of the solution:&lt;br&gt;
&lt;a href="https://postimg.cc/BjLjSYcB" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fst8n0xfkc0yvkt56tn7y.png" alt="aws-sign-in-activity-step-functions.png" width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the first step of the above state machine, the activity is stored in a DynamoDB table. We store it because we will need a user's historical data for other purposes, such as future security audits and investigating security issues, and because a later step in the state machine (&lt;code&gt;Count failed sign-in attempt&lt;/code&gt;) needs it to determine whether an alert should be sent.&lt;/p&gt;
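&lt;p&gt;The &lt;code&gt;Count failed sign-in attempt&lt;/code&gt; decision could look like the sketch below. The record shape (&lt;code&gt;user&lt;/code&gt;, &lt;code&gt;result&lt;/code&gt;, &lt;code&gt;time&lt;/code&gt; attributes) and the threshold default are hypothetical, not the actual table schema.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# Sketch of the "Count failed sign-in attempt" check; the record layout
# is hypothetical, and the threshold/window are configurable.

def should_alert(records, user, now, threshold=2, window=timedelta(hours=1)):
    """True when `user` has more than `threshold` failures inside the window."""
    cutoff = now - window
    failures = [
        r for r in records
        if r["user"] == user
        and r["result"] == "Failure"
        and r["time"] >= cutoff
    ]
    return len(failures) > threshold
```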

&lt;p&gt;The DynamoDB table is designed (hash key, sort key, indexes, etc.) to store not only sign-in activities but also other kinds of activities. This makes the solution easy to extend in the future.&lt;/p&gt;
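&lt;p&gt;One way such a design could look (a hypothetical sketch, not the actual schema): one partition per user, with the sort key prefixed by the activity type so that other kinds of activities can share the same table.&lt;/p&gt;

```python
# Hypothetical single-table layout: partition key per user, sort key
# prefixed by the activity type so new activity kinds need no schema change.

def activity_item(user, activity_type, timestamp_iso, detail):
    """Build a DynamoDB item for any kind of tracked activity."""
    return {
        "pk": f"USER#{user}",
        "sk": f"{activity_type}#{timestamp_iso}",
        "detail": detail,
    }
```

&lt;p&gt;A query on &lt;code&gt;pk = USER#alice&lt;/code&gt; with &lt;code&gt;begins_with(sk, 'SIGNIN#')&lt;/code&gt; would then return just that user's sign-in history.&lt;/p&gt;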

&lt;p&gt;In the final alert-sending steps of the state machine, SNS publish tasks are used instead of Lambda tasks because we don't want to duplicate the alert-sending code. A centralized Lambda function subscribed to the SNS topic sends the messages to Slack via an incoming webhook.&lt;/p&gt;
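&lt;p&gt;That centralized Lambda function can be as small as the sketch below; the webhook URL environment variable and message layout are hypothetical.&lt;/p&gt;

```python
import json
import os
import urllib.request

# Minimal sketch of the Lambda subscribed to the SNS topic. The env var
# name and the Slack message layout are hypothetical.

SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "")

def format_slack_message(subject, message):
    """Build a Slack incoming-webhook payload from an SNS notification."""
    text = f"*{subject}*\n{message}" if subject else message
    return {"text": text}

def handler(event, context):
    for record in event["Records"]:
        sns = record["Sns"]
        payload = format_slack_message(sns.get("Subject"), sns["Message"])
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # fire-and-forget; add retries in practice
```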

&lt;h3&gt;
  
  
  Scenarios
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Root sign-in&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;State machine execution:&lt;br&gt;
&lt;a href="https://postimg.cc/BjMm3vpK" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzkdulbp5v44f7tut2m7.png" alt="aws-sign-in-activity-root-sign-in.png" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Slack alert:&lt;br&gt;
&lt;a href="https://postimg.cc/47qJSZjC" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01m4mz4emk586f2kbrqw.png" alt="aws-sign-in-activity-root-sign-in-alert.png" width="530" height="310"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Sign-in successful but no MFA used&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;State machine execution:&lt;br&gt;
&lt;a href="https://postimg.cc/G9DztRdH" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgat309ul5z2vz8pb4s96.png" alt="aws-sign-in-activity-no-mfa.png" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Slack alert:&lt;br&gt;
&lt;a href="https://postimg.cc/sBRGTkb0" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fha6ms1wl001lg6yjt5hw.png" alt="aws-sign-in-activity-no-mfa-alert.png" width="530" height="250"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Sign-in failed, but no more than 2 attempts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State machine execution:
&lt;a href="https://postimg.cc/HV69rkc4" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe67l4hqya0n6n6ysbwra.png" alt="aws-sign-in-activity-no-more-than-2-times.png" width="800" height="567"&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. More than 2 failed sign-in attempts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;State machine execution:&lt;br&gt;
&lt;a href="https://postimg.cc/1fbcGYBf" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytfo4ohtcvjg95b4ygcd.png" alt="aws-sign-in-activity-more-than-2-times.png" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Slack alert:&lt;br&gt;
&lt;a href="https://postimg.cc/d7GYsXHd" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2azuxtuqo2l53b2hh54.png" alt="aws-sign-in-activity-more-than-2-times-alert.png" width="530" height="244"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Sign-in successful and MFA used&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State machine execution:
&lt;a href="https://postimg.cc/5HzCx4W3" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs2org7e5al3nqdcjjif.png" alt="aws-sign-in-activity-ok.png" width="800" height="567"&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Not a sign-in activity&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State machine execution:
&lt;a href="https://postimg.cc/v4LJC9cL" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Footru2xj39qn17vdc2zh.png" alt="aws-sign-in-activity-not-sign-in.png" width="800" height="567"&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devsecops</category>
      <category>aws</category>
      <category>stepfunctions</category>
      <category>security</category>
    </item>
  </channel>
</rss>
