<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Varun Subramanian</title>
    <description>The latest articles on DEV Community by Varun Subramanian (@varunpappu).</description>
    <link>https://dev.to/varunpappu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F496540%2F65e53736-d90a-4329-a6d6-eed78e00817b.jpeg</url>
      <title>DEV Community: Varun Subramanian</title>
      <link>https://dev.to/varunpappu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/varunpappu"/>
    <language>en</language>
    <item>
      <title>Capacity Planning for Elasticsearch</title>
      <dc:creator>Varun Subramanian</dc:creator>
      <pubDate>Thu, 22 Jul 2021 14:05:49 +0000</pubDate>
      <link>https://dev.to/varunpappu/capacity-planning-117l</link>
      <guid>https://dev.to/varunpappu/capacity-planning-117l</guid>
      <description>&lt;p&gt;Elasticsearch (ES) is a scalable distributed system that can be used for search, logging, metrics, and much more. To run ES in production, whether self-hosted or in the cloud, one needs to plan the infrastructure and cluster configuration to ensure a healthy, highly reliable, and performant deployment.&lt;br&gt;
In this article, we will focus on how to estimate and create a plan based on usage metrics before deploying a production-grade cluster.&lt;/p&gt;
&lt;h1&gt;
  
  
  Capacity Planning:
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Identifying the minimum number of master nodes&lt;/li&gt;
&lt;li&gt;Sizing of the Elasticsearch Service&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Identifying the minimum number of master nodes:
&lt;/h2&gt;

&lt;p&gt;The most important node in the cluster is the master node. The master node is responsible for a wide range of cluster-wide activities such as index creation and deletion, shard allocation, and tracking of cluster membership. A stable cluster depends on the health of its master node.&lt;/p&gt;

&lt;p&gt;It is advisable to have dedicated master nodes, because a master node overloaded with other responsibilities will not function properly. The most reliable way to avoid overloading the master with other tasks is to configure all master-eligible nodes as dedicated master-eligible nodes that have only the master role, allowing them to focus on managing the cluster.&lt;/p&gt;

&lt;p&gt;A lightweight cluster may not require dedicated master-eligible nodes, but once the cluster has more than &lt;strong&gt;6&lt;/strong&gt; nodes, it is advisable to use them.&lt;/p&gt;

&lt;p&gt;The quorum for master election is calculated using the following formula:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Minimum Master Nodes = (N / 2) + 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;N&lt;/strong&gt; is the total number of “master-eligible” nodes in your cluster, with &lt;strong&gt;N / 2&lt;/strong&gt; rounded down to the nearest integer.&lt;/p&gt;
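&lt;p&gt;As a quick sketch, the quorum formula can be checked in a few lines of Python, where integer division handles the rounding down:&lt;/p&gt;

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Quorum needed to elect a master: floor(N / 2) + 1."""
    return master_eligible_nodes // 2 + 1

print(minimum_master_nodes(10))  # 10 master-eligible nodes: quorum of 6
print(minimum_master_nodes(3))   # 3 master-eligible nodes: quorum of 2
print(minimum_master_nodes(2))   # 2 master-eligible nodes: quorum of 2
```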

&lt;p&gt;In an ideal environment, the minimum number of master-eligible nodes is &lt;strong&gt;3&lt;/strong&gt;. With fewer, a network partition can result in a “split-brain”, leading to an unhealthy cluster and loss of data.&lt;/p&gt;

&lt;p&gt;Let us consider the below examples for better understanding:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VOvbYlrY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bpvczfab55e6zbrw8tyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VOvbYlrY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bpvczfab55e6zbrw8tyv.png" alt="Scenario A"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Scenario &lt;strong&gt;A&lt;/strong&gt;, you have ten regular nodes (nodes that can both hold data and become master), so the quorum is 6. Even if we lose the master node due to a network failure, the cluster will elect a new master and remain healthy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k5qIaHsc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4cdeztiny21gd2pk32iy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k5qIaHsc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4cdeztiny21gd2pk32iy.png" alt="Scenario B"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Scenario &lt;strong&gt;B&lt;/strong&gt;, you have three dedicated master nodes and a hundred data nodes, so the quorum is 2. Even if we lose the master node due to a failure, the cluster will elect a new master and remain healthy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x0HuHn-R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zelaf1709tzwt3mv8bpg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x0HuHn-R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zelaf1709tzwt3mv8bpg.png" alt="Scenario C"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Scenario &lt;strong&gt;C&lt;/strong&gt;, you have two regular nodes, with a quorum of 2. If there is a network failure between the nodes, each node will try to elect itself as master, making the cluster inoperable.&lt;/p&gt;

&lt;p&gt;Setting the value to &lt;strong&gt;1&lt;/strong&gt; is permissible, but it doesn't guarantee protection against data loss when the master node goes down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Avoid repeated changes to the master node setting as it may lead to cluster instability when the service attempts to change the number of dedicated master nodes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Sizing of the Elasticsearch Service:
&lt;/h2&gt;

&lt;p&gt;Sizing an Elasticsearch service is more about making an educated estimate than following a surefire methodology. The estimate takes into consideration the storage, the services to be used, and Elasticsearch itself. It acts as a useful starting point for the most critical aspects of sizing your domains: testing them with representative workloads and monitoring their performance.&lt;/p&gt;

&lt;p&gt;The following are the key components to remember before sizing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use case&lt;/strong&gt;, e.g. real-time search, security monitoring, log analytics, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Growth planning&lt;/strong&gt;, for the long-term and the short-term.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since Elasticsearch is horizontally scalable, if indexing and sharding are not done appropriately at the initial stages, one will either have to go through painful approvals to add hardware or end up underutilising the infrastructure.&lt;/p&gt;

&lt;p&gt;The three key components to remember before choosing the appropriate cluster settings are as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Calculating the storage requirements&lt;/li&gt;
&lt;li&gt;Choosing the number of shards&lt;/li&gt;
&lt;li&gt;Choosing the instance types and testing&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Calculating storage requirements:
&lt;/h2&gt;

&lt;p&gt;In Elasticsearch, every document is stored within an index. &lt;br&gt;
The storage of documents can be classified as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Growing Index:&lt;/strong&gt; A single index that keeps growing over time with periodic updates or insertions. For a growing index, the data is stored on disk, and based on the data sources one can determine how much storage space it will consume. Common examples are document search, e-commerce search, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rollover Index:&lt;/strong&gt; Data is continuously written to a temporary index with an indexing period and a retention time. For rollover indices, storage is calculated from the amount of data generated during the retention period of the index. For example, if you generate &lt;strong&gt;100 MiB&lt;/strong&gt; of logs per hour, that’s about &lt;strong&gt;2.4 GiB&lt;/strong&gt; per day, which amounts to &lt;strong&gt;72 GiB&lt;/strong&gt; of data for a retention period of &lt;strong&gt;30 days&lt;/strong&gt;. Common examples are log analytics, time-series processing, etc.&lt;/li&gt;
&lt;/ul&gt;
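&lt;p&gt;As an illustrative sketch (not part of any official tooling), the rollover-index arithmetic above can be expressed as:&lt;/p&gt;

```python
def rollover_storage_gib(mib_per_hour: float, retention_days: int) -> float:
    """Raw data accumulated over the retention period.

    Follows the article's rough conversion of 1000 MiB to 1 GiB.
    """
    return mib_per_hour * 24 * retention_days / 1000

# 100 MiB of logs per hour, retained for 30 days -> 72 GiB of raw data
print(rollover_storage_gib(100, 30))
```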

&lt;p&gt;Other aspects that need to be taken into consideration in addition to the raw storage space are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Number of replicas:&lt;/strong&gt; A replica is a complete copy of an index and consumes the same amount of disk space. By default, every index in ES has a replica count of &lt;strong&gt;1&lt;/strong&gt;. Keeping at least one replica is recommended, as it prevents data loss; replicas also help improve search performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ES overhead:&lt;/strong&gt; ES reserves roughly 10% as a margin of error and 15% to stay under the disk watermarks, covering segment merges, logs, and other internal operations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Insufficient storage space is one of the most common causes of cluster instability, so you should cross-check the numbers when you choose instance types, instance counts, and storage volumes.&lt;/p&gt;
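&lt;p&gt;Putting the replica count and the ES overhead together, a minimal sketch of the disk-space estimate (assuming one replica and a combined ~25% overhead, per the points above) looks like:&lt;/p&gt;

```python
def required_disk_gib(index_gib: float, replicas: int = 1,
                      overhead: float = 0.25) -> float:
    """Disk needed for an index: all copies of the data plus ES overhead.

    `overhead` combines the ~10% margin of error and the 15% watermark
    headroom mentioned above; treat it as an assumption, not a fixed rule.
    """
    return index_gib * (replicas + 1) * (1 + overhead)

# 72 GiB of raw data with one replica -> 180 GiB of disk
print(required_disk_gib(72))
```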
&lt;h2&gt;
  
  
  Choosing the number of shards:
&lt;/h2&gt;

&lt;p&gt;The second component to consider is choosing the right indexing strategy for the indices. In ES, every index is divided into a number of primary and replica shards. (For example, with &lt;strong&gt;2&lt;/strong&gt; primary shards and &lt;strong&gt;1&lt;/strong&gt; replica, the total shard count is &lt;strong&gt;4&lt;/strong&gt;.) The primary shard count of an index cannot be changed once it is created.&lt;br&gt;
Every shard uses some amount of CPU and memory, and having too many small shards can cause performance issues and out-of-memory errors. But that doesn't mean one should create shards that are too large either.&lt;/p&gt;

&lt;p&gt;The rule of thumb is to ensure that the shard size is always between &lt;strong&gt;10-50 GiB&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The formula for calculating the approximate number of shards is as follows:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Approx. Number of Primary Shards = (Source Data + Room to Grow) * (1 + Indexing Overhead) / Desired Shard Size&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In simple terms, shards should be small, but not so small that they put needless strain on the underlying hardware.&lt;br&gt;
Let us consider the below examples for better understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scenario 1:&lt;/strong&gt;
Suppose you have &lt;strong&gt;50 GiB&lt;/strong&gt; of data and you don't expect it to grow over time. Based on the formula above, the number of shards should be (50 * 1.1 / 30) ≈ 1.83, rounded up to &lt;strong&gt;2&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The chosen desired shard size is &lt;em&gt;30 GiB&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scenario 2:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suppose the same &lt;strong&gt;50 GiB&lt;/strong&gt; is expected to quadruple by next year; then the approximate shard count would be ((50 + 150) * 1.1 / 30) ≈ 7.33, rounded up to &lt;strong&gt;8&lt;/strong&gt;.&lt;br&gt;
Even though we will not have the extra &lt;strong&gt;150 GiB&lt;/strong&gt; of data immediately, it is important that this preparation does not end up creating multiple unnecessary shards. As noted earlier, every shard consumes CPU and memory, so creating tiny shards now can degrade performance in the present.&lt;br&gt;
With the shard count as &lt;strong&gt;8&lt;/strong&gt;, let us do the calculation: (50 * 1.1) / 8 ≈ &lt;strong&gt;6.88 GiB&lt;/strong&gt; per shard.&lt;br&gt;
This shard size is well below the recommended range of &lt;strong&gt;10-50 GiB&lt;/strong&gt; and will end up consuming extra resources. To solve this, a middle-ground approach of &lt;strong&gt;5&lt;/strong&gt; shards is better, which leaves you with 11 GiB shards (50 * 1.1 / 5) at present and 44 GiB shards ((50 + 150) * 1.1 / 5) in the future.&lt;/p&gt;
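&lt;p&gt;The scenario arithmetic above can be sketched in Python; the 1.1 factor is the 10% indexing overhead and the 30 GiB default is the desired shard size chosen earlier:&lt;/p&gt;

```python
import math

def approx_primary_shards(source_gib: float, growth_gib: float,
                          desired_shard_gib: float = 30,
                          indexing_overhead: float = 0.1) -> int:
    """(Source Data + Room to Grow) * (1 + Indexing Overhead) / Desired Shard Size."""
    total = (source_gib + growth_gib) * (1 + indexing_overhead)
    return math.ceil(total / desired_shard_gib)

def shard_size_gib(source_gib: float, shards: int,
                   indexing_overhead: float = 0.1) -> float:
    """Resulting shard size for a given shard count."""
    return source_gib * (1 + indexing_overhead) / shards

print(approx_primary_shards(50, 0))     # Scenario 1: 2 shards
print(approx_primary_shards(50, 150))   # Scenario 2: 8 shards
print(round(shard_size_gib(50, 5), 1))  # middle ground of 5 shards: 11.0 GiB each
```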

&lt;p&gt;In both of the above approaches, shard sizing is more of an &lt;strong&gt;approximation&lt;/strong&gt; than an &lt;strong&gt;exact fit&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is also important never to size shards to fill the disk exactly, as you risk running out of disk space before even reaching the threshold limit we set. For example, consider an instance that has a disk space of &lt;strong&gt;128 GiB&lt;/strong&gt;. If you stay below &lt;strong&gt;80% (103 GiB)&lt;/strong&gt; disk usage and the size of the shards is &lt;strong&gt;10 GiB&lt;/strong&gt;, then we can accommodate approximately &lt;strong&gt;10&lt;/strong&gt; shards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; On a given node, it is advisable to have no more than 20 shards per GiB of Java heap.&lt;/p&gt;
&lt;h2&gt;
  
  
  Choosing instance types and testing:
&lt;/h2&gt;

&lt;p&gt;After calculating the storage requirements and choosing the number of shards you need, the next step is to make the hardware decisions. Hardware requirements vary between workloads, but we can make a guesstimate. In general, the storage limits for each instance type map to the amount of CPU and memory that you might need for your workloads. &lt;br&gt;
The following formulae help when it comes to choosing the right instance type:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0DWxtnRb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bm0c4m4phxd49n1j00wl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0DWxtnRb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bm0c4m4phxd49n1j00wl.png" alt="Choosing the right instance type"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total Data (GB) = Raw data (GB) per day * Number of days retained * (Number of replicas + 1)
Total Storage (GB) = Total data (GB) * (1 + 0.15 disk Watermark threshold + 0.1 Margin of error)
Total Data Nodes = ROUNDUP(Total storage (GB) / Memory per data node / Memory:Data ratio)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a better understanding of the formulae, let us consider the below example:&lt;/p&gt;

&lt;p&gt;A logging application pushes close to &lt;strong&gt;3 GiB&lt;/strong&gt; of data per day, and the retention period is &lt;strong&gt;90&lt;/strong&gt; days.&lt;br&gt;
You can use &lt;strong&gt;8 GB&lt;/strong&gt; of memory per node for this small deployment. Let’s do the math:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Total Data (GB) = 3GB x 90 days x 2 = 540GB&lt;br&gt;
Total Storage (GB) = 540GB x (1 + 0.15 + 0.1) = 675GB&lt;br&gt;
Total Data Nodes = ROUNDUP(675GB disk / 8GB RAM / 30 ratio) = 3 nodes&lt;/code&gt;&lt;/p&gt;
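&lt;p&gt;The same arithmetic, sketched as small Python helpers; the 1:30 memory-to-data ratio and the 25% overhead are the assumptions used in this example, not universal constants:&lt;/p&gt;

```python
import math

def total_data_gb(raw_gb_per_day: float, retention_days: int,
                  replicas: int = 1) -> float:
    """Raw data per day, over the retention window, for all copies."""
    return raw_gb_per_day * retention_days * (replicas + 1)

def total_storage_gb(data_gb: float) -> float:
    # 15% disk watermark threshold + 10% margin of error
    return data_gb * (1 + 0.15 + 0.1)

def total_data_nodes(storage_gb: float, ram_per_node_gb: float,
                     memory_data_ratio: int = 30) -> int:
    """ROUNDUP(total storage / memory per node / memory:data ratio)."""
    return math.ceil(storage_gb / (ram_per_node_gb * memory_data_ratio))

data = total_data_gb(3, 90)          # 540 GB
storage = total_storage_gb(data)     # 675 GB
print(total_data_nodes(storage, 8))  # 3 nodes
```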

&lt;p&gt;To summarise everything we have seen so far:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Lzl_eBcx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ap4r42jffq11utpjnmh5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Lzl_eBcx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ap4r42jffq11utpjnmh5.png" alt="Capacity Planner"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  References:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.elastic.co/blog/found-sizing-elasticsearch"&gt;Sizing of ES&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.cncf.io/blog/2021/03/25/how-to-build-an-elastic-search-cluster-for-production/"&gt;ES Cluster for Production&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/what-is-amazon-elasticsearch-service.html"&gt;AWS ES&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.elastic.co/blog/benchmarking-and-sizing-your-elasticsearch-cluster-for-logs-and-metrics"&gt;Sizing and Benchmarking&lt;/a&gt;&lt;/p&gt;

</description>
      <category>elasticsearch</category>
      <category>capacityplanning</category>
      <category>productivity</category>
      <category>management</category>
    </item>
    <item>
      <title>Monitoring AWS Managed Services using Elastic</title>
      <dc:creator>Varun Subramanian</dc:creator>
      <pubDate>Wed, 24 Mar 2021 11:08:18 +0000</pubDate>
      <link>https://dev.to/varunpappu/monitoring-aws-managed-services-using-elastic-46f4</link>
      <guid>https://dev.to/varunpappu/monitoring-aws-managed-services-using-elastic-46f4</guid>
      <description>&lt;p&gt;Over the years, the number of cloud-native services provided by cloud platforms has been ever increasing. The platforms come with their own built-in monitoring services, but from an administrative standpoint these are not that helpful, as they do not provide a consolidated view for end users. To overcome this and to provide a common platform for various service logs and their corresponding visualisation, many third-party providers have come into play, such as Elastic, Datadog, Splunk, etc. &lt;/p&gt;

&lt;p&gt;In this article we will look at the use of one such provider: Elastic. Elastic provides out-of-the-box dashboards to visualise AWS managed services such as load balancers, VPC, S3, CloudTrail, and Billing. We will use a sample web application hosted on AWS to show how consolidated dashboards can be viewed in Kibana.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Dashboards for AWS Services
&lt;/h2&gt;

&lt;p&gt;Once we complete the exercise, below is the set of dashboards that will be available for the infrastructure monitoring team to visualise how the AWS services are functioning and to identify and troubleshoot problems pertaining to any of them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4o9emf39t1cn1ift45y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4o9emf39t1cn1ift45y.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhtpdz1u9xa81lxyi6uv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhtpdz1u9xa81lxyi6uv.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Deployment Architecture
&lt;/h2&gt;

&lt;p&gt;Below is a typical web application deployed on the AWS cloud that uses a combination of cloud-native services and a web application server to serve APIs. We also have a static site hosted on CloudFront to serve static HTML/CSS content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwfm2civa92knqllodt0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwfm2civa92knqllodt0.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Metricbeat
&lt;/h2&gt;

&lt;p&gt;Metricbeat is a lightweight shipper that you can install on your servers to periodically collect metrics (CPU usage, disk I/O, network bytes in/out) from the operating system and from services running on the server. Metricbeat ships the metrics and statistics it collects to the output you specify, such as Elasticsearch or Logstash. Metricbeat supports various pre-built modules, but in this article we will focus on the AWS module.&lt;/p&gt;

&lt;p&gt;Metricbeat collects two broad categories of metrics: host metrics and managed-services metrics. We will dedicate one EC2 machine to run Metricbeat for AWS managed-services metrics collection, and run Metricbeat on another EC2 machine to demonstrate metrics collected from the host server.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;NOTE: We will be focusing on the Metricbeat agent version 7.4 and later.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  AWS Module of Metricbeat
&lt;/h3&gt;

&lt;p&gt;The AWS module for metricbeat currently supports out of the box collection of metrics for the following AWS Services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2&lt;/li&gt;
&lt;li&gt;Elastic Loadbalancer&lt;/li&gt;
&lt;li&gt;Lambda functions &lt;/li&gt;
&lt;li&gt;NAT Gateway&lt;/li&gt;
&lt;li&gt;RDS&lt;/li&gt;
&lt;li&gt;S3 Storage&lt;/li&gt;
&lt;li&gt;SNS&lt;/li&gt;
&lt;li&gt;SQS&lt;/li&gt;
&lt;li&gt;Transit Gateway&lt;/li&gt;
&lt;li&gt;VPN&lt;/li&gt;
&lt;li&gt;Billing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before we dive into installing and configuring metricbeat, let's understand how metricbeat collects, stores and sends metrics to Elastic.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enable CloudWatch metrics for each of the AWS managed services. This will start sending AWS service metrics to CloudWatch, from where Metricbeat will read them. This &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/aws-services-cloudwatch-metrics.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; gives step-by-step instructions on how to enable CloudWatch for a specific AWS service.&lt;/li&gt;
&lt;li&gt;Metricbeat will be running on an EC2 machine, configured to collect AWS managed-services metrics. This is as simple as enabling the AWS module in the Metricbeat config file.&lt;/li&gt;
&lt;li&gt;Hands-off: Metricbeat periodically queries AWS CloudWatch to read the metrics and sends them to the Elasticsearch server, where they are indexed and visualised in Kibana.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Prerequisites:
&lt;/h3&gt;

&lt;p&gt;Running Metricbeat requires the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An AWS account with credentials. The section below covers the IAM policies required for Metricbeat to read metrics.&lt;/li&gt;
&lt;li&gt;A running Elastic Stack (use your self-hosted Elastic or &lt;a href="https://www.elastic.co/observability" rel="noopener noreferrer"&gt;create a free 14-day trial on Elastic Cloud&lt;/a&gt;). &lt;em&gt;This article doesn’t work with AWS managed Elasticsearch.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;An EC2 machine to run the Metricbeat agent&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  IAM policy:
&lt;/h3&gt;

&lt;p&gt;Metricbeat requires certain IAM permissions to fetch data from the required resources. An &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html" rel="noopener noreferrer"&gt;IAM policy&lt;/a&gt; is an entity that defines permissions for an object within your AWS environment. A customised IAM policy with the specific permissions Metricbeat needs should be created; see &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies.html" rel="noopener noreferrer"&gt;Creating IAM Policies&lt;/a&gt; for more details. After the Metricbeat IAM policy is created, attach it to the IAM user that provides the credentials mentioned in the previous step.&lt;/p&gt;

&lt;p&gt;The following table shows the IAM policies that need to be added for each Metricbeat:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ory7yldwab8pg51ijwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ory7yldwab8pg51ijwj.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Consolidated IAM Policy:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeRegions",
                "ec2:DescribeInstances"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricData",
                "cloudwatch:ListMetrics"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "sts:GetCallerIdentity"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:ListAccountAliases"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "tag:getResources"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBInstances",
                "rds:ListTagsForResource"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "sns:ListTopics"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "sqs:ListQueues"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ce:GetCostAndUsage"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Metricbeat Agent:
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Installing the Metricbeat Agent on EC2 machine:
&lt;/h4&gt;

&lt;p&gt;Launch an EC2 machine running Ubuntu, log in to it, and run the commands below to download and install Metricbeat.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;curl -L -O https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-7.11.1-amd64.deb&lt;br&gt;
sudo dpkg -i metricbeat-7.11.1-amd64.deb&lt;/code&gt;&lt;/p&gt;
&lt;h5&gt;
  
  
  Edit the configuration:
&lt;/h5&gt;
&lt;h6&gt;
  
  
  Elastic Cloud:
&lt;/h6&gt;

&lt;p&gt;Modify &lt;em&gt;/etc/metricbeat/metricbeat.yml&lt;/em&gt; to set the connection information for Elastic Cloud:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cloud.id: &amp;lt;Get your cloud_id from the Elastic Cloud&amp;gt;&lt;br&gt;
cloud.auth: "elastic:&amp;lt;password&amp;gt;"&lt;/code&gt;&lt;/p&gt;
&lt;h5&gt;
  
  
  Enable the AWS module:
&lt;/h5&gt;

&lt;p&gt;In the out-of-box configuration of Metricbeat, only the system module is enabled by default, so you will need to explicitly enable the AWS module. The following command enables the AWS configuration in the &lt;em&gt;modules.d&lt;/em&gt; directory on MacOS and Linux systems:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;metricbeat modules enable aws&lt;/code&gt;&lt;/p&gt;
&lt;h5&gt;
  
  
  Set AWS credentials in the config file:
&lt;/h5&gt;

&lt;p&gt;To configure AWS credentials, you can either put the credentials into the Metricbeat module configuration or pass them via environment variables. The AWS module can also load credentials from a shared credentials file. &lt;a href="https://docs.aws.amazon.com/ses/latest/DeveloperGuide/create-shared-credentials-file.html" rel="noopener noreferrer"&gt;Create a shared credentials file&lt;/a&gt; by following these steps. The file will be created at the below location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For Windows:
  C:\Users\&amp;lt;yourusername&amp;gt;\.aws\credentials
For  Linux, MacOS, or Unix:
  ~/.aws/credentials
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Optional) You can specify the profile name using the &lt;em&gt;credential_profile_name&lt;/em&gt; parameter in the AWS module config. For more details on AWS credential types and supported formats, please see &lt;a href="https://www.elastic.co/guide/en/beats/metricbeat/7.4/metricbeat-module-aws.html#aws-credentials-config" rel="noopener noreferrer"&gt;AWS credentials configuration&lt;/a&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  Sample AWS module Configuration:
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;metricbeat.modules:
- module: aws
  period: 300s
  metricsets:
    - ec2
- module: aws
  period: 300s
  metricsets:
    - sqs
  regions:
    - us-west-1
- module: aws
  period: 86400s
  metricsets:
    - s3_request
    - s3_daily_storage
- module: aws
  period: 300s
  metricsets:
    - cloudwatch
  metrics:
    - namespace: AWS/EC2
      name: ["CPUUtilization"]
      statistic: ["Average"]    
    - namespace: AWS/EBS
    - namespace: AWS/ELB
      resource_type: elasticloadbalancing
- module: aws
  period: 60s
  metricsets:
    - elb
    - natgateway
    - rds
    - transitgateway
    - usage
    - vpn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;namespace:&lt;/strong&gt; A namespace in AWS CloudWatch is a container for metrics from a specific application or service. Each service has its own namespace; for example, Amazon EC2 uses the AWS/EC2 namespace and Amazon Elastic Block Store uses the AWS/EBS namespace. Please see the full list of services and namespaces that publish CloudWatch metrics for more details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;name:&lt;/strong&gt; Users can specify the specific CloudWatch metrics that need to be collected per namespace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dimensions:&lt;/strong&gt; Dimensions are used to refine metrics returned for each instance. For example, InstanceId, ImageId and InstanceType all can be used as dimensions to filter data requested from Amazon EC2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;statistic:&lt;/strong&gt; Users can specify one or more statistic methods for each CloudWatch metric setting. By default, average, count, maximum, minimum and sum will all be collected for each metric.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tags:&lt;/strong&gt; resource_type_filter: Tags for resources will not be collected unless this parameter is set. Each resource in AWS has a specific resource type and the common format is service[:resourceType]. Please see &lt;a href="https://docs.aws.amazon.com/resourcegroupstagging/latest/APIReference/API_GetResources.html#resourcegrouptagging-GetResources-request-ResourceTypeFilters" rel="noopener noreferrer"&gt;resource type filters&lt;/a&gt; for more details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;credential_profile_name:&lt;/strong&gt; If an AWS shared credentials file is already configured, it will be picked up automatically; otherwise, the credentials can be entered manually in the module configuration.&lt;/li&gt;
&lt;/ul&gt;
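&lt;p&gt;As a rough sketch of how the dimensions and statistic settings described above fit together (the metric, dimension value, and statistics here are illustrative placeholders, not requirements), a cloudwatch metricset entry filtered down to a single instance might look like:&lt;/p&gt;

```yaml
- module: aws
  period: 300s
  metricsets:
    - cloudwatch
  metrics:
    - namespace: AWS/EC2
      # Collect only this metric rather than every AWS/EC2 metric
      name: ["CPUUtilization"]
      # Restrict results to one instance; the InstanceId value is a placeholder
      dimensions:
        - name: InstanceId
          value: i-0123456789abcdef0
      # Without a statistic setting, average, count, maximum, minimum and sum
      # are all collected by default
      statistic: ["Average", "Maximum"]
```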

&lt;h5&gt;
  
  
  Start Metricbeat:
&lt;/h5&gt;

&lt;p&gt;The setup command loads the Kibana dashboards. If the dashboards are already set up, omit this command.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo metricbeat setup&lt;br&gt;
sudo service metricbeat start&lt;/code&gt;&lt;/p&gt;
&lt;h5&gt;
  
  
  Module Status:
&lt;/h5&gt;

&lt;p&gt;Metricbeat comes with pre-built Kibana dashboards and UIs for visualizing metric data. The dashboards were loaded earlier when the setup command was run.&lt;/p&gt;

&lt;p&gt;In the Kibana side navigation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click Discover to see Metricbeat data, and make sure the predefined metricbeat-* index pattern is selected.&lt;/li&gt;
&lt;li&gt;Click Dashboard, search for "AWS metric*", and select the dashboard that you want to open.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Filebeat
&lt;/h2&gt;

&lt;p&gt;Filebeat is a lightweight shipper for forwarding and centralizing log data. Installed as an agent on your servers, Filebeat monitors the log files or locations that you specify, collects log events, and forwards them either to Elasticsearch or Logstash for indexing. Filebeat ships with various pre-built modules, but in this article we will focus on the AWS module.&lt;/p&gt;
&lt;h3&gt;
  
  
  AWS Module of Filebeat:
&lt;/h3&gt;

&lt;p&gt;The AWS module for Filebeat currently supports the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudTrail&lt;/li&gt;
&lt;li&gt;CloudWatch&lt;/li&gt;
&lt;li&gt;EC2&lt;/li&gt;
&lt;li&gt;Elastic Loadbalancer&lt;/li&gt;
&lt;li&gt;S3 Access&lt;/li&gt;
&lt;li&gt;VPC Flow logs&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Prerequisites:
&lt;/h3&gt;

&lt;p&gt;Running Filebeat requires the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure AWS services to send logs to S3. All files can be placed in the same bucket.

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.amazonaws.cn/en_us/awscloudtrail/latest/userguide/cloudtrail-create-a-trail-using-the-console-first-time.html" rel="noopener noreferrer"&gt;CloudTrail logs to S3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-s3.html" rel="noopener noreferrer"&gt;VPC flow logs to S3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html#enable-access-logging" rel="noopener noreferrer"&gt;Elastic Loadbalancer Access logs to S3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-server-access-logging.html#enable-sever-access-logs" rel="noopener noreferrer"&gt;Enable S3 Access Logs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Set up an SQS queue in the AWS account (&lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-getting-started.html" rel="noopener noreferrer"&gt;SQS Service&lt;/a&gt;). This queue will be used to notify Filebeat when a new file is placed in the S3 bucket configured in step 1.&lt;/li&gt;
&lt;li&gt;An AWS account with credentials.&lt;/li&gt;
&lt;li&gt;A running Elastic Stack. (This can be either a self-hosted cluster or the Elasticsearch Service on Elastic Cloud.)&lt;/li&gt;
&lt;li&gt;The Filebeat agent.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Filebeat Agent:
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Installing the Filebeat Agent
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-7.11.1-amd64.deb&lt;br&gt;
sudo dpkg -i filebeat-7.11.1-amd64.deb&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Configure
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Modify &lt;em&gt;/etc/filebeat/filebeat.yml&lt;/em&gt; to set the connection information for Elastic Cloud:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;cloud.id: &amp;lt;Get your cloud_id from the Elastic Cloud&amp;gt;&lt;br&gt;
cloud.auth: "elastic:&amp;lt;password&amp;gt;"&lt;/code&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Enable S3 to send a notification to SQS when a new file is placed in the bucket.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Steps to Enable S3 Event notification to SQS are &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html" rel="noopener noreferrer"&gt;documented here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
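&lt;p&gt;As a minimal sketch of what that notification setup amounts to (the queue ARN reuses the placeholder region, account, and queue name from the sample configuration below; adjust all of them to your own values), the bucket's notification configuration would contain something like:&lt;/p&gt;

```json
{
  "QueueConfigurations": [
    {
      "QueueArn": "arn:aws:sqs:myregion:123456:myqueue",
      "Events": ["s3:ObjectCreated:*"]
    }
  ]
}
```

&lt;p&gt;With this in place, every object-created event in the bucket produces an SQS message that Filebeat can consume via var.queue_url.&lt;/p&gt;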
&lt;h4&gt;
  
  
  Enable the AWS module
&lt;/h4&gt;

&lt;p&gt;The AWS module must be explicitly enabled in the Filebeat configuration. The following command enables the AWS configuration in the &lt;em&gt;modules.d&lt;/em&gt; directory on macOS and Linux systems:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo filebeat modules enable aws&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Set AWS credentials in the config file
&lt;/h4&gt;

&lt;p&gt;(Skip this step if it was already done during the Metricbeat setup.)&lt;br&gt;
To configure AWS credentials, users can either put the credentials into the Filebeat module configuration or pass them via environment variables. &lt;a href="https://docs.aws.amazon.com/ses/latest/DeveloperGuide/create-shared-credentials-file.html" rel="noopener noreferrer"&gt;Create a shared credentials file&lt;/a&gt; by following these steps. The file will be created in the following location:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For Windows:
C:\Users\&amp;lt;yourusername&amp;gt;\.aws\credentials

For Linux, macOS, or Unix:
~/.aws/credentials
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
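&lt;p&gt;The shared credentials file itself uses a simple INI format; a minimal sketch using AWS's documented example key values (replace them with your own) looks like:&lt;/p&gt;

```ini
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

&lt;p&gt;Additional named profiles can be added as further sections, which is where the credential_profile_name parameter mentioned below comes in.&lt;/p&gt;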



&lt;p&gt;(Optional) Users can specify the profile name using the parameter credential_profile_name in the aws module config. For more details on AWS credentials types and supported formats, please see the &lt;a href="https://www.elastic.co/guide/en/beats/metricbeat/7.4/metricbeat-module-aws.html#aws-credentials-config" rel="noopener noreferrer"&gt;AWS credentials configuration&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Sample AWS configuration for Filebeat:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- module: aws
  cloudtrail:
    enabled: true
    var.queue_url: https://sqs.myregion.amazonaws.com/123456/myqueue
  cloudwatch:
    enabled: true
    var.queue_url: https://sqs.myregion.amazonaws.com/123456/myqueue
  ec2:
    enabled: true
    var.queue_url: https://sqs.myregion.amazonaws.com/123456/myqueue
  elb:
    enabled: true
    var.queue_url: https://sqs.myregion.amazonaws.com/123456/myqueue
  s3access:
    enabled: true
    var.queue_url: https://sqs.myregion.amazonaws.com/123456/myqueue
  vpcflow:
    enabled: true
    var.queue_url: https://sqs.myregion.amazonaws.com/123456/myqueue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;var.queue_url: (Required) The AWS SQS queue URL.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Start Filebeat:
&lt;/h4&gt;

&lt;p&gt;The setup command loads the Kibana dashboards. If the dashboards are already set up, omit this command.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo filebeat setup&lt;br&gt;
sudo service filebeat start&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Module Status:
&lt;/h4&gt;

&lt;p&gt;Filebeat comes with pre-built Kibana dashboards and UIs for visualizing log data. The dashboards were loaded earlier when the setup command was run.&lt;/p&gt;

&lt;p&gt;In the Kibana side navigation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click Discover to see Filebeat data, and make sure the predefined filebeat-* index pattern is selected.&lt;/li&gt;
&lt;li&gt;Click Dashboard, search for "AWS Filebeat*", and select the dashboard that you want to open.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>elasticsearch</category>
      <category>infrastructureautomation</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
