<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: An Nguyen</title>
    <description>The latest articles on DEV Community by An Nguyen (@nthienan).</description>
    <link>https://dev.to/nthienan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F635683%2Fd1b09319-5578-4f01-9c48-836dc8100631.jpeg</url>
      <title>DEV Community: An Nguyen</title>
      <link>https://dev.to/nthienan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nthienan"/>
    <language>en</language>
    <item>
      <title>A GitOps Way To Manage Grafana Data Sources At Scale</title>
      <dc:creator>An Nguyen</dc:creator>
      <pubDate>Fri, 27 May 2022 14:04:48 +0000</pubDate>
      <link>https://dev.to/aws-builders/a-gitops-way-to-manage-grafana-data-sources-at-scale-59la</link>
      <guid>https://dev.to/aws-builders/a-gitops-way-to-manage-grafana-data-sources-at-scale-59la</guid>
      <description>&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;I work for an enterprise organization and was assigned the task of improving its monitoring system. Since the monitoring system is a centralized system used by the whole organization, it has to be easy to use for teams across the organization. The system uses Grafana for the visualization layer. I won't cover Grafana's backend in this post; if you're interested, refer to my post &lt;a href="https://dev.to/aws-builders/ultra-monitoring-with-victoria-metrics-1p2"&gt;Ultra Monitoring with Victoria Metrics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the past, Grafana data sources were added manually via the web UI. We want to avoid that kind of manual operation; instead, it should be automated as much as possible. We also need to follow GitOps practices to manage, track, and audit changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;Thanks to the &lt;a href="https://grafana.com/docs/grafana/latest/administration/provisioning/" rel="noopener noreferrer"&gt;Grafana Provisioning&lt;/a&gt; feature, it's possible to manage data sources in Grafana by adding one or more YAML config files in the &lt;code&gt;provisioning/datasources&lt;/code&gt; directory. Each config file can contain a list of data sources that will be added or updated during startup. If a data source already exists, Grafana updates it to match the configuration file.&lt;/p&gt;

&lt;p&gt;Combined with the &lt;a href="https://grafana.com/docs/grafana/latest/http_api/admin/#reload-provisioning-configurations" rel="noopener noreferrer"&gt;reload provisioning configurations API&lt;/a&gt;, we can achieve this goal without restarting Grafana on every data source change.&lt;/p&gt;
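&lt;p&gt;As an illustration, here is a minimal sketch of calling that reload API from Python; the host, username, and password are placeholders, and the real runbook may do this differently:&lt;/p&gt;

```python
import base64
import urllib.request

def build_reload_request(base_url, user, password):
    """Build a POST request for Grafana's datasource provisioning reload API."""
    url = base_url.rstrip("/") + "/api/admin/provisioning/datasources/reload"
    # The reload endpoint requires admin credentials; basic auth is used here.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, method="POST")
    req.add_header("Authorization", "Basic " + token)
    return req

# Sending it requires a reachable Grafana server:
# with urllib.request.urlopen(build_reload_request("https://grafana.example.com", "admin", "secret")) as resp:
#     print(resp.status)
```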

&lt;p&gt;The idea is that Grafana data source configuration files are kept in a Git repository, and AWS Automation syncs the configurations to the Grafana servers. The Git repository structure looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
├── team-1
│   ├── clickhouse-2.yaml
│   └── cloudwatch-1.yaml
├── team-2
│   ├── clickhouse-1.yaml
│   └── influxdb-1.yaml
├── team-3
│   ├── elasticsearch-1.yaml
│   └── victoria-metrics-1.yaml
└── team-4
    ├── mysql-1.yaml
    └── prometheus-1.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The solution is a combination of an &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html" rel="noopener noreferrer"&gt;AWS Automation&lt;/a&gt; &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html" rel="noopener noreferrer"&gt;Runbook&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html" rel="noopener noreferrer"&gt;Secrets Manager&lt;/a&gt;, so it's a secure, fully AWS-managed, serverless solution.&lt;/p&gt;

&lt;p&gt;The following diagram shows the high-level architecture of the solution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bsfqznnj9m49pobwxet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bsfqznnj9m49pobwxet.png" alt="high-level architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But wait, why is Secrets Manager in the architecture diagram?&lt;br&gt;
To answer this question, let's look at how a data source is stored in the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prometheus Example &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
&lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
&lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://123.123.1.1:9090&lt;/span&gt;
&lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;username"&lt;/span&gt;
&lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password"&lt;/span&gt;
&lt;span class="na"&gt;basicAuth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;
&lt;span class="na"&gt;jsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpMethod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POST&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data sources may need credentials, and leaving them as plaintext in the repository would be a security issue.&lt;/p&gt;

&lt;p&gt;Let's go back to the architecture diagram. Here is how the process works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Administrators create a secret to store the credentials of a data source (this can be done through an automation portal and/or chatbot)&lt;/li&gt;
&lt;li&gt;Administrators review and merge a PR&lt;/li&gt;
&lt;li&gt;When the PR is merged, a GitHub/GitLab pipeline triggers the predefined Automation runbook&lt;/li&gt;
&lt;li&gt;The runbook executes steps from SSM documents and gets secrets from Secrets Manager&lt;/li&gt;
&lt;li&gt;The runbook generates the data source provisioning files and invokes the Grafana API to reload data sources&lt;/li&gt;
&lt;/ol&gt;
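&lt;p&gt;As a sketch, triggering the runbook from the pipeline could look like the snippet below using boto3. The document name and parameter names here are hypothetical, not the actual runbook's:&lt;/p&gt;

```python
def build_runbook_execution(env, git_ref):
    """Build the kwargs for ssm_client.start_automation_execution().

    SSM Automation parameters are passed as lists of strings.
    """
    return {
        "DocumentName": "SyncGrafanaDatasources",  # hypothetical runbook name
        "Parameters": {
            "Environment": [env],
            "GitRef": [git_ref],
        },
    }

# In the pipeline (requires AWS credentials and the runbook to exist):
# import boto3
# ssm = boto3.client("ssm")
# ssm.start_automation_execution(**build_runbook_execution("prod", "main"))
```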

&lt;p&gt;The runbook has three main steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull the repository from GitHub/GitLab onto the Grafana server&lt;/li&gt;
&lt;li&gt;Get data source credentials from Secrets Manager&lt;/li&gt;
&lt;li&gt;Generate data source provisioning files with the credentials&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydxdypq0kcy6tfvehbjs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydxdypq0kcy6tfvehbjs.png" alt="Runbook"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Secrets stored in Secrets Manager are named using the following format:&lt;br&gt;
&lt;code&gt;{env}/grafana/datasource/{team}/{datasource-name}&lt;/code&gt;&lt;br&gt;
E.g. &lt;code&gt;prod/grafana/datasource/team-3/elasticsearch-1&lt;/code&gt;&lt;/p&gt;
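&lt;p&gt;For illustration, this naming convention is easy to capture in a small helper:&lt;/p&gt;

```python
def secret_name(env, team, datasource):
    """Build a Secrets Manager secret name following the naming convention."""
    return f"{env}/grafana/datasource/{team}/{datasource}"

print(secret_name("prod", "team-3", "elasticsearch-1"))
# prod/grafana/datasource/team-3/elasticsearch-1
```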

&lt;p&gt;Secret values are stored in JSON format. E.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"elasticUser"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"elasticP@ssw0rD"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each secret has two required tags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;env: prod/qa/dev&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;secret-type: grafana-datasource&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A data source file now looks like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Elasticsearch Example &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elasticsearch&lt;/span&gt;
&lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
&lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://elasticsearc.example.com:9200&lt;/span&gt;
&lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@team-3/elasticsearch-1:username"&lt;/span&gt;
&lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@team-3/elasticsearch-1:password"&lt;/span&gt;
&lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logs-index&lt;/span&gt;
&lt;span class="na"&gt;basicAuth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;jsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;esVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;7.7.0&lt;/span&gt;
  &lt;span class="na"&gt;includeFrozen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;logLevelField&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
  &lt;span class="na"&gt;logMessageField&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
  &lt;span class="na"&gt;maxConcurrentShardRequests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;timeField&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@timestamp"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For step #2 of the runbook, I wrote a Python script that gets secret values from Secrets Manager and passes them to step #3. The script returns the secrets as JSON with the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"team-1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"clickhouse-2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"team-1-clickhouse-2-username"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"team-1-clickhouse-2-password"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"team-2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mysql-1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mysql-1-username"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mysql1P@ssword"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"team-3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"victoria-metrics-1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"authorizationToken"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vict0ri@Metric$Tok3n"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"elasticsearch-1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"elasticUser"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"elasticP@ssw0rD"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
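&lt;p&gt;The &lt;code&gt;@{team}/{datasource}:{key}&lt;/code&gt; placeholders seen in the data source files can then be resolved against this structure. Here is a minimal sketch of that substitution; the regex and function names are my own, not from the actual runbook:&lt;/p&gt;

```python
import re

# Matches placeholders such as @team-3/elasticsearch-1:username
PLACEHOLDER = re.compile(r"@([\w-]+)/([\w-]+):(\w+)")

def resolve_placeholders(text, secrets):
    """Replace @{team}/{datasource}:{key} markers with values from the secrets dict."""
    def lookup(match):
        team, datasource, key = match.groups()
        return secrets[team][datasource][key]
    return PLACEHOLDER.sub(lookup, text)

secrets = {"team-3": {"elasticsearch-1": {"username": "elasticUser", "password": "elasticP@ssw0rD"}}}
line = 'user: "@team-3/elasticsearch-1:username"'
print(resolve_placeholders(line, secrets))  # user: "elasticUser"
```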



&lt;p&gt;For step #3 of the runbook, I wrote another small Python script that combines the data source files in the repository into Grafana data source provisioning files, replacing the secret placeholders with the secret values from Secrets Manager.&lt;br&gt;
The resulting Grafana data source provisioning configuration looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;root@grafana datasources]# &lt;span class="nb"&gt;pwd&lt;/span&gt;
/var/lib/grafana/provisioning/datasources

&lt;span class="o"&gt;[&lt;/span&gt;root@grafana datasources]# ll
total 16
&lt;span class="nt"&gt;-rw-r--r--&lt;/span&gt; 1 root root 362 May 22 11:00 team-1.yaml
&lt;span class="nt"&gt;-rw-r--r--&lt;/span&gt; 1 root root 628 May 22 11:00 team-2.yaml
&lt;span class="nt"&gt;-rw-r--r--&lt;/span&gt; 1 root root 669 May 22 11:00 team-3.yaml
&lt;span class="nt"&gt;-rw-r--r--&lt;/span&gt; 1 root root 515 May 22 11:00 team-4.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;/var/lib/grafana/provisioning/datasources/team-3.yaml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;datasources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
  &lt;span class="na"&gt;basicAuth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logs-index&lt;/span&gt;
  &lt;span class="na"&gt;jsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;esVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;7.7.0&lt;/span&gt;
    &lt;span class="na"&gt;includeFrozen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;logLevelField&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
    &lt;span class="na"&gt;logMessageField&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
    &lt;span class="na"&gt;maxConcurrentShardRequests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
    &lt;span class="na"&gt;timeField&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@timestamp'&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Elasticsearch Example &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elasticP@ssw0rD&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elasticsearch&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://elasticsearc.example.com:9200&lt;/span&gt;
  &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elasticUser&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
  &lt;span class="na"&gt;isDefault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;httpHeaderName1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorization&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Victoria Metrics Example &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;secureJsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;httpHeaderValue1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bearer vict0ri@Metric$Tok3n&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://ultra-metrics.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
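&lt;p&gt;Leaving aside the YAML parsing and dumping (handled by a YAML library in the real script), the combining step boils down to wrapping each team's data source definitions in Grafana's provisioning schema. A sketch, with abbreviated example data:&lt;/p&gt;

```python
def build_provisioning(datasources):
    """Wrap a team's data source definitions in Grafana's provisioning schema."""
    return {"apiVersion": 1, "datasources": list(datasources)}

# Each dict stands for one parsed YAML file from the team's directory.
team3 = [
    {"name": "Elasticsearch Example 1", "type": "elasticsearch"},
    {"name": "Victoria Metrics Example 1", "type": "prometheus"},
]
provisioning = build_provisioning(team3)
# A YAML library (e.g. PyYAML) would then dump `provisioning` to team-3.yaml.
```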



</description>
      <category>monitoring</category>
      <category>aws</category>
      <category>grafana</category>
      <category>gitops</category>
    </item>
    <item>
      <title>Ultra Monitoring with Victoria Metrics</title>
      <dc:creator>An Nguyen</dc:creator>
      <pubDate>Sun, 01 May 2022 09:48:55 +0000</pubDate>
      <link>https://dev.to/aws-builders/ultra-monitoring-with-victoria-metrics-1p2</link>
      <guid>https://dev.to/aws-builders/ultra-monitoring-with-victoria-metrics-1p2</guid>
      <description>&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;Recently, my team was assigned the task of redesigning our monitoring system. My organization has an ecosystem of hundreds of applications deployed across multiple cloud providers, mostly AWS (tens of AWS accounts in our AWS Organization).&lt;/p&gt;

&lt;p&gt;The old monitoring system was designed and deployed years ago. It's a Prometheus stack with a single Prometheus instance, Grafana, Alertmanager, and various types of exporters. It was good at the time; as the ecosystem grew fast, however, problems appeared:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not highly available&lt;/li&gt;
&lt;li&gt;Not scalable; scaling is too complex and inefficient&lt;/li&gt;
&lt;li&gt;Data retention is too short (14 days) due to dramatically decreasing performance and scaling difficulties&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Given all the problems above, the ideal solution must meet the following requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly available&lt;/li&gt;
&lt;li&gt;Scalable, able to scale easily&lt;/li&gt;
&lt;li&gt;Disaster recovery&lt;/li&gt;
&lt;li&gt;Data must be stored for at least a year&lt;/li&gt;
&lt;li&gt;Compatible with Prom stack and PromQL so that we don’t spend much effort on migration and getting familiar with the new stack.&lt;/li&gt;
&lt;li&gt;Have an efficient way to collect metrics from multiple AWS accounts&lt;/li&gt;
&lt;li&gt;The deployment process must be automated, both infra and configurations&lt;/li&gt;
&lt;li&gt;Easy to manage and maintain, with daily operations tasks automated&lt;/li&gt;
&lt;li&gt;Nice to have: multi-tenancy support&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;After researching and building some PoCs, we found that &lt;a href="https://victoriametrics.com/" rel="noopener noreferrer"&gt;Victoria Metrics&lt;/a&gt; is a good fit for us. Victoria Metrics has all of the required features: high availability is built in, and scaling is easy since every component is separate. We implemented it and are using it in the production environment. We call it &lt;code&gt;Ultra Metrics&lt;/code&gt;. Let's look at our solution in detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  High-level architecture
&lt;/h3&gt;

&lt;p&gt;This is the high-level architecture of the solution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcgt5qzry0yjefrx9mmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcgt5qzry0yjefrx9mmp.png" alt="high-level architecture" width="800" height="753"&gt;&lt;/a&gt;&lt;br&gt;
We use &lt;a href="https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html" rel="noopener noreferrer"&gt;cluster version&lt;/a&gt; of Victoria Metrics (VM), the cluster has some major components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vmstorage&lt;/code&gt;: stores the raw data and returns the queried data on the given time range for the given label filters. This is the only stateful component in the cluster.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vminsert&lt;/code&gt;: accepts the ingested data and spreads it among &lt;code&gt;vmstorage&lt;/code&gt;
 nodes according to consistent hashing over metric name and all its labels.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vmselect&lt;/code&gt;: performs incoming queries by fetching the needed data from all the configured &lt;code&gt;vmstorage&lt;/code&gt; nodes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vmauth&lt;/code&gt;: is a simple auth proxy, router for the cluster. It reads auth credentials from &lt;em&gt;Authorization&lt;/em&gt; HTTP header (&lt;em&gt;Basic Auth&lt;/em&gt;, &lt;em&gt;Bearer token&lt;/em&gt;, and &lt;a href="https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1897" rel="noopener noreferrer"&gt;&lt;em&gt;InfluxDB authorization&lt;/em&gt;&lt;/a&gt; is supported), matches them against configs, and proxies incoming HTTP requests to the configured targets.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vmagent&lt;/code&gt;: is a tiny but mighty agent which helps you collect metrics from various sources and store them in Victoria Metrics or any other Prometheus-compatible storage systems that support the &lt;em&gt;remote_write&lt;/em&gt; protocol.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vmalert&lt;/code&gt;: executes a list of the given &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/" rel="noopener noreferrer"&gt;alerting&lt;/a&gt; or &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/" rel="noopener noreferrer"&gt;recording&lt;/a&gt; rules against configured data sources. For sending alerting notifications &lt;code&gt;vmalert&lt;/code&gt; relies on configured &lt;a href="https://github.com/prometheus/alertmanager" rel="noopener noreferrer"&gt;Alertmanager&lt;/a&gt;. Recording rules results are persisted via &lt;a href="https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations" rel="noopener noreferrer"&gt;remote write&lt;/a&gt; protocol. &lt;code&gt;vmalert&lt;/code&gt; is heavily inspired by &lt;a href="https://prometheus.io/docs/alerting/latest/overview/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; implementation and aims to be compatible with its syntax&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;promxy&lt;/code&gt;: used for querying the data from multiple clusters. It's a Prometheus proxy that makes many shards of Prometheus appear as a single API endpoint to the user.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How does the solution fit into our case?
&lt;/h3&gt;

&lt;p&gt;Here is how &lt;code&gt;Ultra Metrics&lt;/code&gt; addresses the requirements:&lt;/p&gt;
&lt;h4&gt;
  
  
  High availability
&lt;/h4&gt;

&lt;p&gt;The system is able to continue accepting new incoming data and processing new queries when some components of the cluster are temporarily unavailable.&lt;/p&gt;

&lt;p&gt;We accomplish this by using the cluster version of VM. Each component is deployed with redundancy and auto-healing. Data is also made redundant by replication (&lt;a href="https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#replication-and-data-safety" rel="noopener noreferrer"&gt;read more&lt;/a&gt;) across multiple nodes in the same cluster.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vminsert&lt;/code&gt; and &lt;code&gt;vmselect&lt;/code&gt; are stateless components and deployed behind a proxy &lt;code&gt;vmauth&lt;/code&gt;. &lt;code&gt;vmauth&lt;/code&gt; stops routing requests into unavailable nodes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vmstorage&lt;/code&gt; is the only stateful component, however, since &lt;a href="https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#replication-and-data-safety" rel="noopener noreferrer"&gt;data is redundant&lt;/a&gt;, it’s fine if some nodes go down temporarily.

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vminsert&lt;/code&gt; re-routes incoming data from unavailable &lt;code&gt;vmstorage&lt;/code&gt; nodes to healthy &lt;code&gt;vmstorage&lt;/code&gt; nodes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vmselect&lt;/code&gt; continues serving responses if a &lt;code&gt;vmstorage&lt;/code&gt; node is unavailable&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Scalability
&lt;/h4&gt;

&lt;p&gt;Since each component has a separate responsibility and most services are stateless, it's much easier to scale both vertically and horizontally, and each component can scale independently.&lt;/p&gt;

&lt;p&gt;The storage component is the only stateful one. However, &lt;code&gt;vmstorage&lt;/code&gt; nodes don't know about each other, don't communicate with each other, and don't share any data, which simplifies cluster maintenance and scaling. Scaling the storage layer is now easy: just add new nodes and update the &lt;code&gt;vminsert&lt;/code&gt; and &lt;code&gt;vmselect&lt;/code&gt; configurations. That's it; no more steps are required.&lt;/p&gt;
&lt;h4&gt;
  
  
  Disaster recovery
&lt;/h4&gt;

&lt;p&gt;We follow Victoria Metrics' recommendation that all components run in the same subnet (same availability zone) to take advantage of high bandwidth, low latency, and thus low error rates. This increases cluster performance.&lt;/p&gt;

&lt;p&gt;To have a multi-AZ or even multi-region setup (we chose multi-region), we run an independent cluster in each AZ or region, then configure &lt;code&gt;vmagent&lt;/code&gt; to send data to all clusters; &lt;code&gt;vmagent&lt;/code&gt; has this feature built in. &lt;a href="https://github.com/jacksontj/promxy" rel="noopener noreferrer"&gt;&lt;code&gt;promxy&lt;/code&gt;&lt;/a&gt; may be used for querying the data from multiple clusters. It provides a single data source for all PromQL queries, meaning Grafana can have a single source and we can run globally aggregated PromQL queries.&lt;/p&gt;

&lt;p&gt;Failover can be achieved with a combination of &lt;code&gt;Route53&lt;/code&gt; failover and/or &lt;code&gt;promxy&lt;/code&gt;. When an entire AZ/region goes down, the system is still available for both read and write operations. Once the AZ/region is back in operation, &lt;code&gt;vmagent&lt;/code&gt; sends the missing data to that cluster from its caching buffer.&lt;/p&gt;
&lt;h4&gt;
  
  
  Multi-tenancy
&lt;/h4&gt;

&lt;p&gt;The system is a centralized monitoring system used by multiple teams. Each team's data is stored independently and isolated from the others, and a team can access only its own data. This is exactly what VM's multi-tenancy feature offers.&lt;/p&gt;

&lt;p&gt;The Victoria Metrics cluster has built-in support for multiple isolated tenants. Authentication and authorization for tenants are expected to be handled by a separate service sitting in front of the Victoria Metrics cluster, such as &lt;a href="https://docs.victoriametrics.com/vmauth.html" rel="noopener noreferrer"&gt;&lt;code&gt;vmauth&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Data for all tenants is evenly spread among the available &lt;code&gt;vmstorage&lt;/code&gt; nodes. This guarantees an even load among &lt;code&gt;vmstorage&lt;/code&gt; nodes even when different tenants have different amounts of data and different query loads. Performance and resource usage also don't depend on the number of tenants.&lt;/p&gt;

&lt;p&gt;Let’s say a tenant is an AWS account in the above architecture. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;vmagent&lt;/code&gt; remote write URLs are configured as in the example below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;URLs for data ingestion:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;https://us-east-1.ultra-metrics.com:8427/api/v1/write&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://ap-southeast-1.ultra-metrics.com:8427/api/v1/write&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;URLs for Prometheus querying:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;https://us-east-1.ultra-metrics.com:8427/api/v1/query&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://ap-southeast-1.ultra-metrics.com:8427/api/v1/query&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;vmauth&lt;/code&gt; configuration looks like this snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;users&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="c1"&gt;# Requests with the 'Authorization: Bearer account1Secret' and 'Authorization: Token account1Secret'&lt;/span&gt;
&lt;span class="c1"&gt;# header are proxied to https://&amp;lt;internal-nlb-domain&amp;gt;:8481&lt;/span&gt;
&lt;span class="c1"&gt;# For example, https://&amp;lt;internal-nlb-domain&amp;gt;:8427/api/v1/query is proxied to https://&amp;lt;internal-nlb-domain&amp;gt;:8481/select/1/prometheus/api/v1/query&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;bearer_token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;account1Secret&lt;/span&gt;
  &lt;span class="na"&gt;url_map&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;src_paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/query&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/query_range&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/series&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/label/[^/]+/values&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/metadata&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/labels&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/query_exemplars&lt;/span&gt;
    &lt;span class="na"&gt;url_prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://&amp;lt;internal-nlb-domain&amp;gt;:8481/select/1/prometheus&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;src_paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/write&lt;/span&gt;
    &lt;span class="na"&gt;url_prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://&amp;lt;internal-nlb-domain&amp;gt;:8480/insert/1/prometheus&lt;/span&gt;

&lt;span class="c1"&gt;# Requests with the 'Authorization: Bearer account2Secret' and 'Authorization: Token account2Secret'&lt;/span&gt;
&lt;span class="c1"&gt;# header are proxied to https://&amp;lt;internal-nlb-domain&amp;gt;:8481&lt;/span&gt;
&lt;span class="c1"&gt;# For example, https://&amp;lt;internal-nlb-domain&amp;gt;:8427/api/v1/query is proxied to https://&amp;lt;internal-nlb-domain&amp;gt;:8481/select/2/prometheus/api/v1/query&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;bearer_token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;account2Secret&lt;/span&gt;
  &lt;span class="na"&gt;url_map&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;src_paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/query&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/query_range&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/series&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/label/[^/]+/values&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/metadata&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/labels&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/query_exemplars&lt;/span&gt;
    &lt;span class="na"&gt;url_prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://&amp;lt;internal-nlb-domain&amp;gt;:8481/select/2/prometheus&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;src_paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/api/v1/write&lt;/span&gt;
    &lt;span class="na"&gt;url_prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://&amp;lt;internal-nlb-domain&amp;gt;:8480/insert/2/prometheus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;8427&lt;/code&gt; is &lt;code&gt;vmauth&lt;/code&gt;'s port&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;8481&lt;/code&gt; is &lt;code&gt;vmselect&lt;/code&gt;'s port&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;8480&lt;/code&gt; is &lt;code&gt;vminsert&lt;/code&gt;'s port&lt;/li&gt;
&lt;/ul&gt;
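&lt;p&gt;To make the routing above concrete, here is a minimal Python sketch (illustrative only, not part of &lt;code&gt;vmauth&lt;/code&gt;; the token-to-tenant mapping and the NLB placeholder are assumptions) that mirrors how a bearer token plus request path resolve to an internal &lt;code&gt;vminsert&lt;/code&gt;/&lt;code&gt;vmselect&lt;/code&gt; URL:&lt;/p&gt;

```python
# Illustrative sketch of the vmauth url_map above: the bearer token selects the
# tenant (accountID), and the request path selects vminsert (8480) or vmselect (8481).
TOKEN_TO_TENANT = {"account1Secret": 1, "account2Secret": 2}  # assumed mapping
NLB = "https://internal-nlb-domain"  # stands in for the internal NLB domain

def route(bearer_token, path):
    tenant = TOKEN_TO_TENANT[bearer_token]
    if path == "/api/v1/write":
        # ingestion path goes to vminsert
        return f"{NLB}:8480/insert/{tenant}/prometheus{path}"
    # query-style paths (/api/v1/query, /api/v1/series, ...) go to vmselect
    return f"{NLB}:8481/select/{tenant}/prometheus{path}"
```

For example, a query carrying &lt;code&gt;account1Secret&lt;/code&gt; lands on &lt;code&gt;/select/1/prometheus&lt;/code&gt;, while a write carrying &lt;code&gt;account2Secret&lt;/code&gt; lands on &lt;code&gt;/insert/2/prometheus&lt;/code&gt;, matching the two &lt;code&gt;url_map&lt;/code&gt; entries in the snippet above.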

&lt;h4&gt;
  
  
  Prom-stack compatibility
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;VM implements the &lt;a href="https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#prometheus-querying-api-usage" rel="noopener noreferrer"&gt;Prometheus querying API&lt;/a&gt;, so there are no changes to query APIs, syntax, etc. All the tools in use continue to function as they are.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We don't even need to make any changes (sidecar, agent, etc.) except adding a few lines of configuration to the old monitoring system to make it work with the new system.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://us-east-1.ultra-metrics.com:8427/api/v1/write&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://ap-southeast-1.ultra-metrics.com:8427/api/v1/write&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Thus, we can continue using the old monitoring system while experimenting with the new one.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
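&lt;p&gt;Because VM serves the standard Prometheus querying API through &lt;code&gt;vmauth&lt;/code&gt;, an existing client only needs the tenant's bearer token. Here is a hedged Python sketch of building such a request (the endpoint and token below are the examples from this post, not real credentials):&lt;/p&gt;

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_query_request(endpoint, token, promql):
    # Standard Prometheus instant-query endpoint, served unchanged by vmauth/vmselect.
    url = endpoint + "/api/v1/query?" + urlencode({"query": promql})
    return Request(url, headers={"Authorization": "Bearer " + token})

# Example values taken from this post (the token is a placeholder, not a real secret):
req = build_query_request(
    "https://us-east-1.ultra-metrics.com:8427",
    "account1Secret",
    "up",
)
```

Passing &lt;code&gt;req&lt;/code&gt; to &lt;code&gt;urllib.request.urlopen&lt;/code&gt; (or pointing any PromQL-speaking tool at the same URL with the same header) would hit &lt;code&gt;vmauth&lt;/code&gt;, which proxies it to the tenant's &lt;code&gt;vmselect&lt;/code&gt; path.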

&lt;h3&gt;
  
  
  Some statistics
&lt;/h3&gt;

&lt;p&gt;Will be updated soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure provisioning by CDK&lt;/li&gt;
&lt;li&gt;Automate cluster deployment using AWS Automation runbook&lt;/li&gt;
&lt;li&gt;GitOps for daily operation tasks on the cluster

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/aws-builders/a-gitops-way-to-manage-grafana-data-sources-at-scale-59la"&gt;A GitOps Way To Manage Grafana Data Sources At Scale&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Dynamic routing for multi-tenant multi-region React application with AWS CloudFront</title>
      <dc:creator>An Nguyen</dc:creator>
      <pubDate>Sun, 23 Jan 2022 14:53:50 +0000</pubDate>
      <link>https://dev.to/aws-builders/dynamic-routing-for-multi-tenant-multi-region-react-application-with-aws-cloudfront-389g</link>
      <guid>https://dev.to/aws-builders/dynamic-routing-for-multi-tenant-multi-region-react-application-with-aws-cloudfront-389g</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In my organization, we built a SaaS application. It's a multi-tenant application. We leverage AWS to host the application and deliver the best experience to users across the globe. The application spans multiple regions, which helps us distribute and isolate infrastructure. This improves availability and mitigates outages caused by disasters: if there is an outage in one region, only that region is affected, not the others.&lt;/p&gt;

&lt;p&gt;Our application has two main components: a frontend module - a single page web application (React) - and a backend module that is a set of microservices running on Kubernetes clusters. It's quite a basic architecture. However, there are challenges to deal with, especially since the application is multi-tenant and multi-region.&lt;/p&gt;

&lt;p&gt;In this post, let’s talk about the frontend module.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;As said, the frontend module is designed and deployed as a region-specific application. Initially, it was deployed in regional Kubernetes clusters as Nginx pods. For each region, the module is built and hosted in a separate directory of a Docker image; based on the region in which it's deployed, the corresponding directory is used to serve requests.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FzfmkS06z%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-Nginx-deployment.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FzfmkS06z%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-Nginx-deployment.png" alt="Multi-tenant-multi-region-React-application-with-AWS-Cloud-Front-Nginx-deployment.png"&gt;&lt;/a&gt;&lt;br&gt;
This deployment architecture requires us to operate and maintain Nginx in Kubernetes clusters as well as handle scaling to meet on-demand user traffic. It's also not good in terms of latency, since every end-user request has to reach the Nginx pods in a specific region. Let's say a user located in the US accesses a tenant in Singapore, such as &lt;a href="https://xyz.example.com" rel="noopener noreferrer"&gt;https://xyz.example.com&lt;/a&gt;. That user's requests are routed from the US to Singapore and back. That increases latency, so site loading speed is poor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;To overcome the above challenges and provide a better user experience, we tried to find a solution that meets the requirements below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce latency as much as possible so site performance increases no matter where end-users are&lt;/li&gt;
&lt;li&gt;Reduce operational cost as much as we can&lt;/li&gt;
&lt;li&gt;For business reasons, we want some regions to go live
before/after others, so the application must remain region-specific&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Solutions
&lt;/h2&gt;

&lt;p&gt;Fortunately, a CDN (AWS CloudFront) is the best fit for our case. It's an ideal solution that meets the above requirements.&lt;/p&gt;

&lt;p&gt;There are two possible solutions:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. A CloudFront distribution for each region&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FSxYrMwZ1%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-Multi-CFs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FSxYrMwZ1%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-Multi-CFs.png" alt="Multi-tenant-multi-region-React-application-with-AWS-Cloud-Front-Multi-CFs.png"&gt;&lt;/a&gt;&lt;br&gt;
This is the first solution that comes to mind, and it is the simplest one. However, we quickly realized that it cannot be implemented because of a CloudFront limitation on &lt;code&gt;Alternate domain names&lt;/code&gt;. Below is the error when setting up a second distribution with the same alternate domain name &lt;code&gt;*.example.com&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;

&lt;span class="nx"&gt;Invalid&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="nx"&gt;provided&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;One&lt;/span&gt; &lt;span class="nx"&gt;or&lt;/span&gt; &lt;span class="nx"&gt;more&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;CNAMEs&lt;/span&gt; &lt;span class="nx"&gt;you&lt;/span&gt; &lt;span class="nx"&gt;provided&lt;/span&gt; &lt;span class="nx"&gt;are&lt;/span&gt; &lt;span class="nx"&gt;already&lt;/span&gt; &lt;span class="nx"&gt;associated&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;different&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Read more: &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/CNAMEs.html#alternate-domain-names-restrictions" rel="noopener noreferrer"&gt;alternate-domain-names-restrictions&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
&lt;strong&gt;2. One CloudFront distribution + Lambda@Edge for all regions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We leverage CloudFront, &lt;a href="https://aws.amazon.com/vi/lambda/edge/" rel="noopener noreferrer"&gt;Lambda@Edge&lt;/a&gt;, and a &lt;a href="https://aws.amazon.com/vi/dynamodb/global-tables/" rel="noopener noreferrer"&gt;DynamoDB global table&lt;/a&gt;. Here is a high-level view of the solution:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FFsBp19xJ%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-One-CF.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FFsBp19xJ%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-One-CF.png" alt="Multi-tenant-multi-region-React-application-with-AWS-Cloud-Front-One-CF.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since we host the frontend module for each region in a directory of an S3 bucket, we have to implement some kind of dynamic routing for the CloudFront distribution that sends origin requests to the correct directory of the S3 bucket.&lt;/p&gt;

&lt;p&gt;To implement that dynamic routing, we &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/lambda-at-the-edge.html" rel="noopener noreferrer"&gt;use Lambda@Edge&lt;/a&gt;. Its capability allows us to use any attribute of the HTTP request such as &lt;code&gt;Host&lt;/code&gt;, &lt;code&gt;URIPath&lt;/code&gt;, &lt;code&gt;Headers&lt;/code&gt;, &lt;code&gt;Cookies&lt;/code&gt;, or &lt;code&gt;Query String&lt;/code&gt; and set the Origin accordingly.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FMKTtWc1S%2Flambda-edge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FMKTtWc1S%2Flambda-edge.png" alt="lambda-edge.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our case, we'll use the &lt;code&gt;Origin request&lt;/code&gt; event to trigger a Lambda@Edge function that inspects the &lt;code&gt;Host&lt;/code&gt; header to determine the location of the tenant and routes the request to the correct directory of the S3 origin bucket.&lt;/p&gt;

&lt;p&gt;The following diagram illustrates the sequence of events for our case.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FrpSNh52S%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-FE.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FrpSNh52S%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front-FE.png" alt="Multi-tenant-multi-region-React-application-with-AWS-Cloud-Front-FE.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is how the process works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User navigates to the tenant. E.g. &lt;a href="https://xyz.example.com" rel="noopener noreferrer"&gt;https://xyz.example.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CloudFront serves content from cache if available, otherwise it goes to step 3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only after&lt;/strong&gt; a CloudFront cache miss is the &lt;strong&gt;origin request&lt;/strong&gt; trigger fired for that behavior. This triggers the Lambda@Edge function to modify the origin request.&lt;/li&gt;
&lt;li&gt;The Lambda@Edge function queries the DynamoDB table to determine which folder should be served for that tenant.&lt;/li&gt;
&lt;li&gt;The function then sends the request on to the chosen folder.&lt;/li&gt;
&lt;li&gt;The object is returned to CloudFront from Amazon S3, served to the viewer, and cached, if applicable.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Issues
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Cannot get tenant identity from Origin request.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To determine the tenant location, we need the &lt;code&gt;Host&lt;/code&gt; header, which is also the tenant identity. However, the origin request overrides the &lt;code&gt;Host&lt;/code&gt; header with the S3 bucket host, see &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html#request-custom-headers-behavior" rel="noopener noreferrer"&gt;HTTP request headers and CloudFront behavior&lt;/a&gt;. We will use the &lt;code&gt;X-Forwarded-Host&lt;/code&gt; header instead. Wait, where does &lt;code&gt;X-Forwarded-Host&lt;/code&gt; come from? It is a copy of the &lt;code&gt;Host&lt;/code&gt; header, added by a CloudFront function triggered by the &lt;code&gt;Viewer request&lt;/code&gt; event. &lt;/p&gt;

&lt;p&gt;Here is what the CloudFront function (viewer request) looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-forwarded-host&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;host&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here is what the Lambda@Edge function (origin request) looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;boto3.dynamodb.conditions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;request&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tenant-location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dynamodb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;KeyConditionExpression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tenant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x-forwarded-host&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="n"&gt;ScanIndexForward&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dynamodb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;KeyConditionExpression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tenant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x-forwarded-host&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="n"&gt;ScanIndexForward&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Items&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;origin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Items&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Region&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;302&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.example.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;}]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
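&lt;p&gt;The origin-rewrite step of the function above boils down to a small pure transformation. Here is an extracted sketch (a hypothetical helper, not part of the deployed code) that mirrors the folder-resolution logic so it can be reasoned about without AWS access:&lt;/p&gt;

```python
def resolve_s3_path(items):
    # Mirrors the Lambda@Edge logic above: the first matching tenant-location
    # item decides which regional directory of the S3 bucket serves the request.
    if items:
        return "/" + items[0]["Region"]
    # No item found: the real function returns a 302 redirect to www.example.com.
    return None
```

For example, a tenant whose &lt;code&gt;tenant-location&lt;/code&gt; item carries &lt;code&gt;Region = ap-southeast-1&lt;/code&gt; is served from the &lt;code&gt;/ap-southeast-1&lt;/code&gt; directory of the origin bucket.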

&lt;p&gt;&lt;strong&gt;2. High latency when cache miss at edge region&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This issue is the answer to the question “why a DynamoDB global table?”&lt;/p&gt;

&lt;p&gt;In the first implementation, a regular DynamoDB table was used. We experienced poor latency (&lt;strong&gt;&lt;em&gt;3.57 seconds&lt;/em&gt;&lt;/strong&gt;) when loading the site on a cache miss at the CloudFront edge region. Inspecting the CloudWatch logs, we found that the Lambda function took more than &lt;strong&gt;&lt;em&gt;2.2 seconds&lt;/em&gt;&lt;/strong&gt; to complete. Querying tenant info from the DynamoDB table was the most time-consuming step.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;

&lt;span class="nx"&gt;REPORT&lt;/span&gt; &lt;span class="nx"&gt;RequestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;c12f91db&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5880&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="nx"&gt;ff6&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;94&lt;/span&gt;&lt;span class="nx"&gt;c3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;d5d1f454092c&lt;/span&gt;  &lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2274.74&lt;/span&gt; &lt;span class="nx"&gt;ms&lt;/span&gt;    &lt;span class="nx"&gt;Billed&lt;/span&gt; &lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2275&lt;/span&gt; &lt;span class="nx"&gt;ms&lt;/span&gt;    &lt;span class="nx"&gt;Memory&lt;/span&gt; &lt;span class="nx"&gt;Size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt; &lt;span class="nx"&gt;MB&lt;/span&gt; &lt;span class="nx"&gt;Max&lt;/span&gt; &lt;span class="nx"&gt;Memory&lt;/span&gt; &lt;span class="nx"&gt;Used&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;69&lt;/span&gt; &lt;span class="nx"&gt;MB&lt;/span&gt;  &lt;span class="nx"&gt;Init&lt;/span&gt; &lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;335.50&lt;/span&gt; &lt;span class="nx"&gt;ms&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After CloudFront caches the response at the edge region, latency is good, so only the first user to access a tenant in a given region experiences the high latency. Still, it’s better to eliminate the issue entirely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/vi/dynamodb/global-tables/" rel="noopener noreferrer"&gt;DynamoDB global table&lt;/a&gt; helps to overcome this issue.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FMpvY2XgF%2FDynamo-DB-Global-Tables-01-dad2508b80e8b7c544fe1a94a2abd3f770b789da.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FMpvY2XgF%2FDynamo-DB-Global-Tables-01-dad2508b80e8b7c544fe1a94a2abd3f770b789da.png" alt="Dynamo-DB-Global-Tables-01-dad2508b80e8b7c544fe1a94a2abd3f770b789da.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After enabling the DynamoDB global table, request latency dropped from &lt;strong&gt;&lt;em&gt;3.57 seconds&lt;/em&gt;&lt;/strong&gt; to &lt;strong&gt;&lt;em&gt;968 milliseconds&lt;/em&gt;&lt;/strong&gt;, and the Lambda function now takes &lt;strong&gt;&lt;em&gt;254 milliseconds&lt;/em&gt;&lt;/strong&gt; to complete.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="nx"&gt;REPORT&lt;/span&gt; &lt;span class="nx"&gt;RequestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;af3889c5&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;838&lt;/span&gt;&lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="nx"&gt;aed&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;bc0c&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="nx"&gt;d96e890d444&lt;/span&gt;  &lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;253.61&lt;/span&gt; &lt;span class="nx"&gt;ms&lt;/span&gt; &lt;span class="nx"&gt;Billed&lt;/span&gt; &lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;254&lt;/span&gt; &lt;span class="nx"&gt;ms&lt;/span&gt; &lt;span class="nx"&gt;Memory&lt;/span&gt; &lt;span class="nx"&gt;Size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt; &lt;span class="nx"&gt;MB&lt;/span&gt; &lt;span class="nx"&gt;Max&lt;/span&gt; &lt;span class="nx"&gt;Memory&lt;/span&gt; &lt;span class="nx"&gt;Used&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt; &lt;span class="nx"&gt;MB&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
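&lt;p&gt;For reference, adding replicas under the current global tables version (2019.11.21) is an &lt;code&gt;UpdateTable&lt;/code&gt; call. Below is a minimal boto3 sketch; the table name and regions are hypothetical examples, not the ones used here.&lt;/p&gt;

```python
# Sketch: add replicas to an existing DynamoDB table (global tables
# version 2019.11.21). Table name and regions are hypothetical.

def replica_update_params(table_name, new_regions):
    """Build the UpdateTable kwargs that create replicas in new_regions."""
    return {
        "TableName": table_name,
        "ReplicaUpdates": [
            {"Create": {"RegionName": region}} for region in new_regions
        ],
    }

def add_replicas(table_name, new_regions):
    import boto3  # imported lazily; only needed when actually calling AWS
    client = boto3.client("dynamodb")
    return client.update_table(**replica_update_params(table_name, new_regions))
```

&lt;p&gt;The Lambda function serving a request can then query the replica in (or nearest to) its own region instead of crossing the globe to a single table.&lt;/p&gt;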
&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The application architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FmDBNKhTK%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.postimg.cc%2FmDBNKhTK%2FMulti-tenant-multi-region-React-application-with-AWS-Cloud-Front.png" alt="Multi-tenant-multi-region-React-application-with-AWS-Cloud-Front.png"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloudfront</category>
      <category>lambda</category>
      <category>react</category>
    </item>
    <item>
      <title>Tracking and Notifying on AWS Sign-in activities</title>
      <dc:creator>An Nguyen</dc:creator>
      <pubDate>Sat, 08 Jan 2022 16:08:10 +0000</pubDate>
      <link>https://dev.to/aws-builders/tracking-and-notifying-on-aws-sign-in-activities-31el</link>
      <guid>https://dev.to/aws-builders/tracking-and-notifying-on-aws-sign-in-activities-31el</guid>
      <description>&lt;p&gt;It is critical to prevent root user access from getting into the wrong hands and to be aware whenever root user activity occurs in your AWS account. Here are some of the key recommendations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Avoid using the root account&lt;/em&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;All IAM users, including the root account, must have multi-factor authentication (MFA) enabled&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;Abnormal activities (many failed sign-in attempts, ...) must be detected and notified&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, there are certain actions that can only be performed by the root user. To be certain that all root user activity is authorized and expected, it is important to monitor root API calls to a given AWS account and to send a notification when this type of activity is detected. This notification gives you the ability to take any necessary steps when illegitimate root API activity is detected, or it can simply serve as a record for any future auditing needs.&lt;/p&gt;

&lt;p&gt;To comply with the best practices above, in this post I walk through a solution that tracks and notifies on root user activities and abnormal sign-in activities in an AWS account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Track all sign-in and related activities&lt;/li&gt;
&lt;li&gt;Be notified/alerted whenever the root account signs in&lt;/li&gt;
&lt;li&gt;Be notified/alerted whenever an IAM user signs in without MFA&lt;/li&gt;
&lt;li&gt;Be notified/alerted if the number of failed sign-in attempts of an IAM user is greater than 3 in the last hour&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Create and enable a multi-region &lt;a href="https://aws.amazon.com/cloudtrail" rel="noopener noreferrer"&gt;AWS CloudTrail&lt;/a&gt; trail for all AWS regions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;The picture below shows a high-level architecture of the solution:&lt;br&gt;
&lt;a href="https://postimg.cc/MMQ4zj9G" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1d4anholalrnmveq3ku.png" alt="architect.png" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;IAM users and/or the root account sign in to either the Web Console or the Mobile Console&lt;/li&gt;
&lt;li&gt;The sign-in activity is captured and tracked by CloudTrail&lt;/li&gt;
&lt;li&gt;The CloudTrail event is sent to EventBridge automatically&lt;/li&gt;
&lt;li&gt;EventBridge triggers a state machine in Step Functions&lt;/li&gt;
&lt;li&gt;The state machine processes the event and sends a message to an SNS topic if needed&lt;/li&gt;
&lt;li&gt;A Lambda function subscribed to the SNS topic sends the appropriate notifications to Slack&lt;/li&gt;
&lt;/ol&gt;
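&lt;p&gt;The branching the state machine performs over a sign-in event can be sketched in plain Python. The field names follow CloudTrail's &lt;code&gt;ConsoleLogin&lt;/code&gt; events; the returned labels are illustrative, not the actual state names.&lt;/p&gt;

```python
# Sketch of the state machine's branching over a CloudTrail ConsoleLogin
# event. Field names follow CloudTrail sign-in events; the returned
# labels are illustrative only.

def classify_sign_in(event):
    """Return which alert (if any) a sign-in event should raise."""
    if event.get("eventName") != "ConsoleLogin":
        return "not-sign-in"
    success = event.get("responseElements", {}).get("ConsoleLogin") == "Success"
    if not success:
        return "failed-attempt"  # counted later against the threshold
    if event.get("userIdentity", {}).get("type") == "Root":
        return "alert-root-sign-in"
    if event.get("additionalEventData", {}).get("MFAUsed") != "Yes":
        return "alert-no-mfa"
    return "ok"
```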

&lt;p&gt;Here are the details of the state machine, the main part of the solution:&lt;br&gt;
&lt;a href="https://postimg.cc/BjLjSYcB" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fst8n0xfkc0yvkt56tn7y.png" alt="aws-sign-in-activity-step-functions.png" width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the first step of the above state machine, the activity is stored in a DynamoDB table. We store it because we will need a user's historical data for other purposes, such as future security audits and investigating security issues, and because a later step in the state machine (&lt;code&gt;Count failed sign-in attempt&lt;/code&gt;) needs it to determine whether an alert should be sent.&lt;/p&gt;
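&lt;p&gt;The &lt;code&gt;Count failed sign-in attempt&lt;/code&gt; decision could look like the sketch below. The record shape (&lt;code&gt;user&lt;/code&gt;, &lt;code&gt;result&lt;/code&gt;, &lt;code&gt;time&lt;/code&gt; attributes) and the threshold default are hypothetical, not the actual table schema.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# Sketch of the "Count failed sign-in attempt" check; the record layout
# is hypothetical, and the threshold/window are configurable.

def should_alert(records, user, now, threshold=2, window=timedelta(hours=1)):
    """True when `user` has more than `threshold` failures inside the window."""
    cutoff = now - window
    failures = [
        r for r in records
        if r["user"] == user
        and r["result"] == "Failure"
        and r["time"] >= cutoff
    ]
    return len(failures) > threshold
```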

&lt;p&gt;The DynamoDB table is designed (hash key, sort key, indexes, etc.) to store not only sign-in activities but also other kinds of activities. This makes the solution easy to extend in the future.&lt;/p&gt;
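&lt;p&gt;One way such a design could look (a hypothetical sketch, not the actual schema): one partition per user, with the sort key prefixed by the activity type so that other kinds of activities can share the same table.&lt;/p&gt;

```python
# Hypothetical single-table layout: partition key per user, sort key
# prefixed by the activity type so new activity kinds need no schema change.

def activity_item(user, activity_type, timestamp_iso, detail):
    """Build a DynamoDB item for any kind of tracked activity."""
    return {
        "pk": f"USER#{user}",
        "sk": f"{activity_type}#{timestamp_iso}",
        "detail": detail,
    }
```

&lt;p&gt;A query on &lt;code&gt;pk = USER#alice&lt;/code&gt; with &lt;code&gt;begins_with(sk, 'SIGNIN#')&lt;/code&gt; would then return just that user's sign-in history.&lt;/p&gt;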

&lt;p&gt;In the final alert-sending steps of the state machine, SNS publish tasks are used instead of Lambda tasks because we don't want to duplicate the alert-sending code. A centralized Lambda function subscribed to the SNS topic sends the messages to Slack via an incoming webhook.&lt;/p&gt;
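&lt;p&gt;That centralized Lambda function can be as small as the sketch below; the webhook URL environment variable and message layout are hypothetical.&lt;/p&gt;

```python
import json
import os
import urllib.request

# Minimal sketch of the Lambda subscribed to the SNS topic. The env var
# name and the Slack message layout are hypothetical.

SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "")

def format_slack_message(subject, message):
    """Build a Slack incoming-webhook payload from an SNS notification."""
    text = f"*{subject}*\n{message}" if subject else message
    return {"text": text}

def handler(event, context):
    for record in event["Records"]:
        sns = record["Sns"]
        payload = format_slack_message(sns.get("Subject"), sns["Message"])
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # fire-and-forget; add retries in practice
```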

&lt;h3&gt;
  
  
  Scenarios
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Root sign-in&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;State machine execution:&lt;br&gt;
&lt;a href="https://postimg.cc/BjMm3vpK" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzkdulbp5v44f7tut2m7.png" alt="aws-sign-in-activity-root-sign-in.png" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Slack alert:&lt;br&gt;
&lt;a href="https://postimg.cc/47qJSZjC" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01m4mz4emk586f2kbrqw.png" alt="aws-sign-in-activity-root-sign-in-alert.png" width="530" height="310"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Sign-in successful but no MFA used&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;State machine execution:&lt;br&gt;
&lt;a href="https://postimg.cc/G9DztRdH" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgat309ul5z2vz8pb4s96.png" alt="aws-sign-in-activity-no-mfa.png" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Slack alert:&lt;br&gt;
&lt;a href="https://postimg.cc/sBRGTkb0" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fha6ms1wl001lg6yjt5hw.png" alt="aws-sign-in-activity-no-mfa-alert.png" width="530" height="250"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Sign-in failed, but no more than 2 attempts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State machine execution:
&lt;a href="https://postimg.cc/HV69rkc4" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe67l4hqya0n6n6ysbwra.png" alt="aws-sign-in-activity-no-more-than-2-times.png" width="800" height="567"&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. More than 2 failed sign-in attempts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;State machine execution:&lt;br&gt;
&lt;a href="https://postimg.cc/1fbcGYBf" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytfo4ohtcvjg95b4ygcd.png" alt="aws-sign-in-activity-more-than-2-times.png" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Slack alert:&lt;br&gt;
&lt;a href="https://postimg.cc/d7GYsXHd" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2azuxtuqo2l53b2hh54.png" alt="aws-sign-in-activity-more-than-2-times-alert.png" width="530" height="244"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Sign-in successful and MFA used&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State machine execution:
&lt;a href="https://postimg.cc/5HzCx4W3" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs2org7e5al3nqdcjjif.png" alt="aws-sign-in-activity-ok.png" width="800" height="567"&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Not a sign-in activity&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State machine execution:
&lt;a href="https://postimg.cc/v4LJC9cL" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Footru2xj39qn17vdc2zh.png" alt="aws-sign-in-activity-not-sign-in.png" width="800" height="567"&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devsecops</category>
      <category>aws</category>
      <category>stepfunctions</category>
      <category>security</category>
    </item>
  </channel>
</rss>
