Tarlan Huseynov

Endpoint Health Monitoring with AWS Services and Terraform

Table of Contents

  1. Introduction
  2. Implementation via console
  3. IaC and CI/CD

Introduction

In an era where businesses rely on the digital world more than ever, it’s imperative to have robust and reliable systems monitoring our applications’ health. An unexpected outage or system failure can not only affect the user experience but can also lead to revenue loss and damage the company’s reputation. Hence, automated, efficient, and robust health checks and notifications are paramount.

Amazon’s Route53 Health Checks offer an excellent solution to this problem. By sending regular requests to your application endpoints, Route53 continuously monitors the health of your applications. If the application doesn’t respond within a specified period or the response doesn’t meet certain conditions, Route53 considers the endpoint unhealthy. But identifying the problem is only half the battle; the other half lies in efficient reporting and notification. This is where integrating CloudWatch alarms, SNS, Lambda, and communication tools like Slack and Microsoft Teams becomes incredibly valuable.

Architecture diagram

By setting up this kind of architecture, we ensure that we are notified the moment our systems detect any issues. We won’t have to wait for a user to report an issue or for a developer to stumble upon it during their work. This kind of proactive monitoring can save a lot of time and resources.

Furthermore, by automating the whole process using a modular Terraform project, we bring in immense scalability and consistency to our setup. As the infrastructure grows, the ability to manage resources as code becomes a significant advantage. It allows us to quickly deploy similar resources, maintain consistency across environments, and more efficiently manage and track infrastructure changes. We will add an extra layer of automation by implementing Terraform pipelines using GitLab CI. This will enable us to plan and deploy changes to our infrastructure automatically, making the maintenance and extension of our systems even more efficient.

In this article, we’ll explore how to strengthen endpoint monitoring by integrating AWS Route53 health checks with AWS CloudWatch, AWS SNS, and AWS Lambda as the backbone, using Slack, Microsoft Teams, and email for notifications, with automated deployments and configuration changes via Terraform and GitLab CI.

Step by step implementation via console

To fully appreciate the level of automation we’re aiming for, it’s essential to first walk through the manual steps required in the process. This will provide the context needed to understand the tasks we’re targeting to automate.

Setting up Route53 Health Checks

The first step is to create health checks for the application endpoints using Route53. These health checks monitor the health of the endpoints by sending regular requests and analyzing the responses based on pre-defined conditions.

AWS Developer Guide: Creating and updating health checks
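
As a rough illustration of what the console form collects, the sketch below builds the configuration for one health check as a plain Python dictionary. The key names follow the HealthCheckConfig structure of the Route 53 CreateHealthCheck API. The function name build_health_check_config and the default values (30-second interval, failure threshold of 3, mirroring the Terraform module later in this article) are assumptions for illustration only.

```python
def build_health_check_config(fqdn, port=443, path=None, search_string=None,
                              request_interval=30, failure_threshold=3):
    """Return a dict shaped like Route 53's HealthCheckConfig (hypothetical helper)."""
    config = {
        "FullyQualifiedDomainName": fqdn,
        "Port": port,
        # A search string switches the check from plain HTTPS to body matching
        "Type": "HTTPS_STR_MATCH" if search_string else "HTTPS",
        "RequestInterval": request_interval,    # seconds between requests
        "FailureThreshold": failure_threshold,  # consecutive failures before "unhealthy"
    }
    if path:
        config["ResourcePath"] = path
    if search_string:
        config["SearchString"] = search_string
    return config


config = build_health_check_config("huseynov.net", path="healthcheck",
                                   search_string="status:ok")
print(config["Type"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed as the HealthCheckConfig argument of the Route 53 client’s create_health_check call, together with a unique CallerReference.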

Creating CloudWatch Alarms

For each health check, a corresponding CloudWatch alarm is created. These alarms monitor the health check status metric and change state when the endpoint’s status changes, that is, when it becomes unhealthy or recovers.

AWS Developer Guide: Monitoring health checks using CloudWatch
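
To make the alarm logic concrete, here is a small simulation of how such an alarm could be evaluated. It assumes the HealthCheckStatus metric reports 1 for healthy and 0 for unhealthy, and that the alarm fires when the minimum over one period drops below 1, as configured in the Terraform module later in this article. The functions evaluate_alarm and state_changes are hypothetical stand-ins: CloudWatch’s own evaluation is more involved, for example in how it treats missing data.

```python
def evaluate_alarm(datapoints, threshold=1):
    """Return the alarm state for one evaluation period (simplified model)."""
    if not datapoints:
        return "INSUFFICIENT_DATA"
    # statistic = Minimum, comparison = LessThanThreshold
    return "ALARM" if min(datapoints) < threshold else "OK"


def state_changes(periods):
    """Yield (index, new_state) each time the state differs from the previous one."""
    previous = None
    for index, datapoints in enumerate(periods):
        state = evaluate_alarm(datapoints)
        if state != previous:
            yield index, state
        previous = state


# Each inner list holds the datapoints collected during one 60-second period
periods = [[1, 1], [1, 0], [0, 0], [1, 1]]
print(list(state_changes(periods)))
```

Only the transitions matter for notifications: a single failed datapoint in the second period is enough to flip the state, while the third period produces no new message because the alarm is already active.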

Setting up SNS Topic

An Amazon Simple Notification Service (SNS) topic is created for the alarms. When an alarm changes state, it publishes a message to this topic. The topic has subscribers who will receive these messages, like our AWS Lambda function and any email addresses for direct email notifications.

AWS Developer Guide: Creating an Amazon SNS topic
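
It helps to see what a subscriber actually receives. The sketch below shows a trimmed-down version of the event that SNS hands to a Lambda subscriber when an alarm changes state, and how the alarm details can be pulled out of it. The sample values are made up, and extract_alarm is a hypothetical helper. Real events carry many more fields.

```python
import json

# Trimmed-down example of the event Lambda receives from SNS (values are made up)
sample_event = {
    "Records": [{
        "EventSource": "aws:sns",
        "Sns": {
            "Subject": 'ALARM: "endpoint-1-hc-alarm" in US East (N. Virginia)',
            # CloudWatch serializes the alarm details into a JSON string
            "Message": json.dumps({
                "AlarmName": "endpoint-1-hc-alarm",
                "AlarmDescription": "huseynov.net",
                "NewStateValue": "ALARM",
                "OldStateValue": "OK",
            }),
        },
    }]
}


def extract_alarm(event):
    """Return (description, new_state) from the first SNS record."""
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    return alarm["AlarmDescription"], alarm["NewStateValue"]


print(extract_alarm(sample_event))
```

Note that Message is a string, not a nested object, so it needs its own json.loads call before the alarm fields are accessible.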

Building AWS Lambda Function

The Lambda function, which subscribes to the SNS topic, is built to handle the incoming alarm messages. This function parses the alarm data and prepares customized messages for Slack and Microsoft Teams.

AWS Developer Guide: Getting started with Lambda

lambda code [runtime Python 3.7]

import json
import os
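import requests  # third-party: not bundled with the Lambda runtime, so package it with the function or add it via a layer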

def send_slack_message(message):
    slack_webhook_url = os.getenv('SLACK_WEBHOOK_URL')
    headers = {'Content-type': 'application/json'}
    response = requests.post(slack_webhook_url, headers=headers, data=json.dumps({'text': message}))

    if response.status_code != 200:
        raise ValueError(f'Request to Slack returned an error {response.status_code}, the response is:\n{response.text}')

def send_teams_message(message, color):
    teams_webhook_url = os.getenv('TEAMS_WEBHOOK_URL')
    headers = {'Content-type': 'application/json'}
    message = {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "themeColor": color,
        "summary": "Health Check Status",
        "sections": [{
            "activityTitle": "Health Check Status",
            "text": message
        }],
        "potentialAction": [{
            "@type": "OpenUri",
            "name": "Details",
            "targets": [{"os": "default", "uri": "https://console.aws.amazon.com/route53/healthchecks/home#"}]
        }]
    }
    response = requests.post(teams_webhook_url, headers=headers, data=json.dumps(message))

    if response.status_code != 200:
        raise ValueError(f'Request to Teams returned an error {response.status_code}, the response is:\n{response.text}')


def handler(event, context):
    # SNS delivers the CloudWatch alarm as a JSON string in the record's Message field
    alarm_message = json.loads(event['Records'][0]['Sns']['Message'])
    endpoint = alarm_message['AlarmDescription']  # Assumes the endpoint URL is in the AlarmDescription
    state = alarm_message['NewStateValue']

    if state == 'ALARM':
        formatted_message_slack = f"*Endpoint:* {endpoint}\n" \
                            "*State:* :elmofire: endpoint health check failed :warning:\n" \
                            "<https://console.aws.amazon.com/route53/healthchecks/home#|Details>"
        formatted_message_teams = f"{endpoint} health check failed!"
        teams_color = "FF0000"  # Red
    elif state == 'OK':
        formatted_message_slack = f"*Endpoint:* {endpoint}\n" \
                            "*State:* :baby-yoda-soup: endpoint recovered :white_check_mark:\n" \
                            "<https://console.aws.amazon.com/route53/healthchecks/home#|Details>"
        formatted_message_teams = f"{endpoint} health check is ok!"
        teams_color = "00FF00"  # Green
    else:
        # Other states (e.g. INSUFFICIENT_DATA) have no message defined; skip them
        # instead of failing on undefined variables below
        return

    slack_webhook_url = os.getenv('SLACK_WEBHOOK_URL')
    teams_webhook_url = os.getenv('TEAMS_WEBHOOK_URL')

    # Only notify the channels whose webhook URL is configured
    if slack_webhook_url and slack_webhook_url.strip():
        send_slack_message(formatted_message_slack)
    if teams_webhook_url and teams_webhook_url.strip():
        send_teams_message(formatted_message_teams, teams_color)

Our Lambda function acts as a message interceptor and formatter for health checks. Here’s a brief description of its operation:

  • Environment Variables: Our function utilizes two environment variables: ‘SLACK_WEBHOOK_URL’ and ‘TEAMS_WEBHOOK_URL’. These variables hold the webhook URLs of our Slack and Teams channels, respectively.
  • Event Handler: The main entry point of our Lambda function is the handler method. This method receives an event object which contains the message from Amazon SNS, informing about the health status of our endpoints.
  • Message Processing: Our handler method processes the incoming SNS message, parsing it as JSON and extracting relevant information such as the endpoint (stored in AlarmDescription) and the new state of the alarm (NewStateValue).
  • Notification Formatting: Depending on the state of the alarm, it constructs custom messages for both Slack and Teams. For Slack, it uses markdown syntax and includes relevant emojis. For Teams, it creates a simpler message and determines the color of the Teams card (red for ‘ALARM’ state, green for ‘OK’ state).
  • Message Sending: If the respective webhook URL environment variable is set and not empty, it dispatches the corresponding formatted message to Slack or Teams using the send_slack_message or send_teams_message methods respectively.
  • Sending Slack Message: The send_slack_message method sends an HTTP POST request to the Slack webhook URL with the constructed message as payload. If Slack responds with a status code other than 200, it raises an exception with a detailed error message.
  • Sending Teams Message: The send_teams_message method also sends an HTTP POST request, but to the Teams webhook URL, with a MessageCard payload structure suited for Teams. It raises the same kind of exception on a non-200 response.

Through this Lambda function, we’re able to transform a raw health check alert from AWS into a rich, custom, human-readable notification in our communication channels.
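
One practical caveat: the requests library is not part of the Lambda Python runtime, so it has to be packaged with the function or supplied through a layer. As an alternative, the sketch below posts the same kind of JSON payload using only the standard library. The helper names build_webhook_request and post_json are hypothetical, the URL is a placeholder, and the 5-second timeout is an assumed value.

```python
import json
import urllib.request


def build_webhook_request(url, payload):
    """Build a JSON POST request for an incoming webhook (hypothetical helper)."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(url, payload, timeout=5):
    """Send the request; urlopen raises HTTPError or URLError on HTTP or network failures."""
    with urllib.request.urlopen(build_webhook_request(url, payload), timeout=timeout) as response:
        return response.status


# Building the request does not touch the network, so it is safe to inspect locally
request = build_webhook_request("https://example.invalid/webhook",
                                {"text": "health check failed"})
print(request.get_method(), request.data)
```

Splitting request construction from sending keeps the payload logic testable without a live webhook. In the handler, post_json could stand in for the requests.post calls.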

Configuring Slack and Microsoft Teams Webhooks

Webhooks in Slack and Microsoft Teams are set up to receive incoming messages from our AWS Lambda function. Each platform’s webhook URL is saved as an environment variable for the Lambda function to use.

Slack documentation: Sending messages using Incoming Webhooks
MS Teams documentation: Create Incoming Webhooks

Here is the Slack channel view after a few notifications 😅

IaC automation with Terraform and GitLab CI

Terraform Setup: Now that the individual components are ready, they need to be managed efficiently. Terraform scripts will automate the creation and management of these resources.

Root Module

module "health-checks" {
  source              = "./modules/healthcheck"
  endpoints           = {
    endpoint-1 = {
      fqdn          = "huseynov.net"
      port          = 443
      path          = "healthcheck"
      search_string = "status:ok"
    },
    endpoint-2 = {
      fqdn = "google.az"
    }
  }
  notification_emails = ["tarlan@huseynov.net"]
  slack_webhook_url   = "https://<Webhook_url>"
  teams_webhook_url   = "https://<Webhook_url>"
}

This root module brings together all the elements needed to set up and manage health checks for specified endpoints, send alerts and enable notifications.

You provide inputs to the module through variables:

  • endpoints: This variable holds a map of the endpoints to monitor, keyed by a name of your choice. Each endpoint is itself a map in which the fqdn key holds the endpoint’s domain name; port and path can be set as well. If an endpoint includes the search_string key, the module creates a special kind of health check which checks not only the endpoint’s availability but also whether the response body contains the specified string.
  • notification_emails: A list of emails that should receive notifications about the health status of endpoints.
  • slack_webhook_url: The URL for the Slack channel where alerts will be posted.
  • teams_webhook_url: The URL for the Microsoft Teams channel where alerts will be posted.

Once configured with appropriate inputs, this module takes care of creating Route53 health checks for the listed endpoints, setting up CloudWatch alarms to monitor these health checks, configuring SNS topics for notifications, and connecting the alerts to your Slack and Microsoft Teams channels as well as your email addresses. All of this complexity is hidden behind a simple and elegant interface, making your health checks management a breeze.

HealthCheck Module

locals {
  endpoints_with_str_check   = { for name, parameters in var.endpoints : name => parameters if contains(keys(parameters), "search_string") }
  endpoints_with_https_check = { for name, parameters in var.endpoints : name => parameters if !contains(keys(parameters), "search_string") }
  route53_health_checks      = merge(aws_route53_health_check.https_check, aws_route53_health_check.https_str_check)
  lambda_alerts_enabled      = var.slack_webhook_url != "" || var.teams_webhook_url != ""
}

# Route 53 health-checks
resource "aws_route53_health_check" "https_check" {
  for_each          = local.endpoints_with_https_check
  fqdn              = each.value.fqdn
  reference_name    = each.key
  port              = try(each.value.port, 443)  # assumed default when port is omitted
  type              = "HTTPS"
  resource_path     = try(each.value.path, null) # left unset when no path is given
  failure_threshold = "3"
  request_interval  = "30"

  tags = {
    Name = "${each.key}-hc"
  }
}

resource "aws_route53_health_check" "https_str_check" {
  for_each          = local.endpoints_with_str_check
  fqdn              = each.value.fqdn
  reference_name    = each.key
  port              = each.value.port
  type              = "HTTPS_STR_MATCH"
  search_string     = each.value.search_string
  resource_path     = each.value.path
  failure_threshold = "3"
  request_interval  = "30"

  tags = {
    Name = "${each.key}-hc"
  }
}

# Cloudwatch alarm
resource "aws_cloudwatch_metric_alarm" "endpoint_hc_alarm" {
  for_each            = local.route53_health_checks
  alarm_name          = "${each.value.reference_name}-hc-alarm"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "HealthCheckStatus"
  namespace           = "AWS/Route53"
  period              = "60"
  statistic           = "Minimum"
  threshold           = "1"
  alarm_actions       = [aws_sns_topic.alarm_sns_topic.arn]
  ok_actions          = [aws_sns_topic.alarm_sns_topic.arn]
  alarm_description   = each.value.fqdn # This is passed to lambda as endpoint value
  dimensions = {
    HealthCheckId = each.value.id
  }
}

data "archive_file" "notifications_lambda_zip" {
  type        = "zip"
  source_file = "${path.root}/files/lambdas/healthcheck.py"
  output_path = "/tmp/healthcheck.zip"
}

data "aws_iam_policy_document" "lambda_assume_role_policy" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "notifications_lambda_role" {
  count              = local.lambda_alerts_enabled ? 1 : 0
  name               = "health-check-notifications-lambda-role"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume_role_policy.json
}

resource "aws_lambda_function" "notifications_lambda" {
  count            = local.lambda_alerts_enabled ? 1 : 0
  filename         = data.archive_file.notifications_lambda_zip.output_path
  source_code_hash = data.archive_file.notifications_lambda_zip.output_base64sha256
  function_name    = "health-check-notifications"
  # handler - <python filename>.<handler function name>
  handler = "healthcheck.handler"
  runtime = "python3.7"
  role    = aws_iam_role.notifications_lambda_role[0].arn

  environment {
    variables = {
      TEAMS_WEBHOOK_URL = var.teams_webhook_url
      SLACK_WEBHOOK_URL = var.slack_webhook_url
    }
  }

  lifecycle {
    ignore_changes = [source_code_hash, last_modified]
  }
}

resource "aws_lambda_permission" "lambda_permission" {
  count         = local.lambda_alerts_enabled ? 1 : 0
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.notifications_lambda[0].function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.alarm_sns_topic.arn
}

data "aws_iam_policy_document" "lambda_policy" {
  count = local.lambda_alerts_enabled ? 1 : 0
  statement {
    actions = [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents"
    ]
    effect    = "Allow"
    resources = ["*"]
  }
}

resource "aws_iam_policy" "lambda_policy" {
  count  = local.lambda_alerts_enabled ? 1 : 0
  name   = "health-check-notifications-lambda"
  policy = data.aws_iam_policy_document.lambda_policy[0].json
}

resource "aws_iam_role_policy_attachment" "lambda_role_attached_policy" {
  count      = local.lambda_alerts_enabled ? 1 : 0
  role       = aws_iam_role.notifications_lambda_role[0].name
  policy_arn = aws_iam_policy.lambda_policy[0].arn
}


# SNS Topic for healthcheck alerts
resource "aws_sns_topic" "alarm_sns_topic" {
  name                             = "endpoint-health-check-alarm-topic"
  lambda_failure_feedback_role_arn = aws_iam_role.delivery_feedback_role.arn
  lambda_success_feedback_role_arn = aws_iam_role.delivery_feedback_role.arn
}

resource "aws_sns_topic_subscription" "lambda_topic_subscription" {
  count     = local.lambda_alerts_enabled ? 1 : 0
  topic_arn = aws_sns_topic.alarm_sns_topic.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.notifications_lambda[0].arn
}

resource "aws_sns_topic_subscription" "email_topic_subscription" {
  for_each  = toset(var.notification_emails)
  topic_arn = aws_sns_topic.alarm_sns_topic.arn
  protocol  = "email"
  endpoint  = each.value
}

resource "aws_sns_topic_subscription" "sms_topic_subscription" {
  for_each  = toset(var.notification_mobile)
  topic_arn = aws_sns_topic.alarm_sns_topic.arn
  protocol  = "sms"
  endpoint  = each.value
}

# Feedback role
data "aws_iam_policy_document" "feedback_assume_role_policy" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["sns.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "delivery_feedback_role" {
  name               = "SNSFeedbackRole"
  assume_role_policy = data.aws_iam_policy_document.feedback_assume_role_policy.json

  inline_policy {
    name = "SNSFeedbackPolicy"

    policy = jsonencode({
      "Version" : "2012-10-17",
      "Statement" : [
        {
          "Effect" : "Allow",
          "Action" : [
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
            "logs:PutMetricFilter",
            "logs:PutRetentionPolicy"
          ],
          "Resource" : [
            "*"
          ]
        }
      ]
    })
  }
}

The health-check module is a comprehensive health check solution. It splits the provided endpoints into two categories based on the presence of a search_string. It generates Route53 health checks, with HTTPS type for endpoints without search_string and HTTPS_STR_MATCH for those with search_string.
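
For readers less familiar with Terraform’s for expressions, the split performed in the locals block can be mirrored in a few lines of Python. This is only an illustration of the filtering logic; split_endpoints is a hypothetical name, and the sample data repeats the endpoints from the root module above.

```python
def split_endpoints(endpoints):
    """Partition endpoints by the presence of a search_string key."""
    with_str_check = {name: params for name, params in endpoints.items()
                      if "search_string" in params}
    with_https_check = {name: params for name, params in endpoints.items()
                        if "search_string" not in params}
    return with_str_check, with_https_check


endpoints = {
    "endpoint-1": {"fqdn": "huseynov.net", "port": 443,
                   "path": "healthcheck", "search_string": "status:ok"},
    "endpoint-2": {"fqdn": "google.az"},
}
str_checks, https_checks = split_endpoints(endpoints)
print(sorted(str_checks), sorted(https_checks))
```

Every endpoint lands in exactly one of the two groups, which is why merging the two sets of health checks afterwards yields one entry per endpoint.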

The module also creates a CloudWatch alarm for each endpoint, monitoring the health check status and publishing to the SNS topic when the status changes. The SNS topic is linked to a Lambda function for alerts on Teams or Slack, and also sends email notifications.

If either Teams or Slack webhook URLs are provided, a Lambda function is deployed. It has the necessary IAM role and permissions for execution and logging. The module also handles SNS topic subscriptions for both email and Lambda function.

Finally, the module provides an IAM role for SNS to handle delivery feedback.

The entire module is neatly encapsulated and parameterized, ready to be utilized in any Terraform project.

Integrating with GitLab CI: The final step is to automate the Terraform workflows using GitLab CI. GitLab pipelines are created for planning and deploying changes to the infrastructure. With each git push, a pipeline is triggered that runs Terraform commands to apply the changes to the infrastructure.

Requirement: GitLab-managed Terraform state

.gitlab-ci.yml

#################################################################################################################
variables:
  TF_ROOT: ${CI_PROJECT_DIR}
  TF_ADDRESS: ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/terraform/state/tfstate
  ################ AWS Credentials ####################
  TF_VAR_region: $AWS_REGION_DEV
  TF_VAR_access_key: $AWS_ACCESS_KEY_ID
  TF_VAR_secret_key: $AWS_SECRET_ACCESS_KEY
##################################################################################################################
before_script:
  - cd ${TF_ROOT}
  - echo "SYSTEM INFO:"
  - cat /etc/os-release
  - cat /etc/hostname
##################################################################################################################
stages:
  - init
  - validate
  - plan
  - apply
##################################################################################################################

##################################################################################################################
default:

  image: registry.gitlab.com/gitlab-org/terraform-images/stable:latest

  cache:
    key: ${CI_PROJECT_ID}-terraform-cache

    paths:
      - ${TF_ROOT}/.terraform
##################################################################################################################

##################################################################################################################
INITIALIZE:

  stage: init

  script:
    - gitlab-terraform init

  only:
    - merge_requests
    - master
##################################################################################################################

##################################################################################################################
VALIDATE:

  stage: validate

  script:
    - gitlab-terraform validate

  only:
    - merge_requests
    - master  
##################################################################################################################

##################################################################################################################
PLAN:

  stage: plan

  script:
    - gitlab-terraform plan
    - gitlab-terraform plan-json

  artifacts:
    name: plan
    paths:
      - ${TF_ROOT}/plan.cache
    reports:
      terraform: ${TF_ROOT}/plan.json

  only:
    - merge_requests
    - master      
##################################################################################################################

##################################################################################################################
APPLY:

  stage: apply

  environment:
    name: production

  script:
    - gitlab-terraform apply

  dependencies:
    - PLAN

  when: manual

  only:
    - master
##################################################################################################################

This GitLab CI/CD pipeline is configured to utilize a branch-based workflow with merge requests. It consists of four stages: init, validate, plan, and apply, that get triggered based on the branch and the action performed.

Here is the breakdown:

  1. INITIALIZE stage: This is the first stage that runs gitlab-terraform init. It initializes the Terraform working directory, downloading necessary providers and modules, and sets up the backend. It’s executed for both merge requests and changes to the master branch.

  2. VALIDATE stage: This stage runs gitlab-terraform validate which validates the Terraform configuration files for syntax and internal consistency. This also runs for both merge requests and changes to the master branch.

  3. PLAN stage: This stage generates an execution plan with gitlab-terraform plan and outputs it as JSON with gitlab-terraform plan-json. The plan is saved as an artifact for inspection. It is triggered for merge requests and changes to the master branch.

  4. APPLY stage: This stage is triggered manually and only on the master branch. It runs gitlab-terraform apply to apply the proposed changes from the execution plan. It depends on the PLAN stage.
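
The branch rules above can be summarized with a small model. The sketch below is a deliberate simplification of GitLab’s only keyword, written just to show which jobs appear in which pipeline; the JOBS table and the jobs_for function are hypothetical and not part of GitLab.

```python
# Job name -> refs it is restricted to, and whether it waits for a manual trigger
JOBS = {
    "INITIALIZE": {"only": {"merge_requests", "master"}, "manual": False},
    "VALIDATE": {"only": {"merge_requests", "master"}, "manual": False},
    "PLAN": {"only": {"merge_requests", "master"}, "manual": False},
    "APPLY": {"only": {"master"}, "manual": True},
}


def jobs_for(ref):
    """Return the jobs included for a ref: 'merge_requests' or a branch name."""
    return [name for name, rules in JOBS.items() if ref in rules["only"]]


print(jobs_for("merge_requests"))
print(jobs_for("master"))
print(jobs_for("feature-x"))
```

In this model a merge request pipeline stops at the plan, a pipeline on master additionally offers the manual APPLY job, and a plain push to another branch runs nothing.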

Note that this pipeline configuration relies on GitLab’s managed Terraform state. The state is stored on GitLab’s infrastructure and is managed by GitLab, providing a secure and reliable backend for Terraform. This is specified by the TF_ADDRESS variable in the configuration, which points to GitLab’s API endpoint for managing Terraform state.
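
To show how the state address is composed, the sketch below rebuilds the TF_ADDRESS value from its parts. The function name and the sample project ID are hypothetical; in a real pipeline GitLab fills in CI_API_V4_URL and CI_PROJECT_ID automatically.

```python
def terraform_state_address(api_v4_url, project_id, state_name="tfstate"):
    """Compose the GitLab-managed Terraform state URL, as TF_ADDRESS does."""
    return f"{api_v4_url}/projects/{project_id}/terraform/state/{state_name}"


# Sample values standing in for CI_API_V4_URL and CI_PROJECT_ID
address = terraform_state_address("https://gitlab.com/api/v4", 12345)
print(address)
```

The last path segment is the state name, so using a different name per environment would keep their states separate within the same project.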

In this comprehensive article, we delved into the depths of infrastructure as code (IaC) through Terraform, focusing on deploying and managing health checks for a system architecture, aided by GitLab’s CI/CD pipeline. By utilizing AWS services like Route53, CloudWatch, SNS, and Lambda, we constructed a robust system capable of monitoring server status, carrying out HTTP string matching health checks, and notifying stakeholders via multiple channels when incidents occur.

Through this exploration, we hope you’ve gained valuable insights into how you can leverage these tools to enhance your systems’ reliability and maintainability. As always, keep exploring, learning, and innovating!
