Darryl Ruggles

Posted on Dec 23, 2023 • Edited on Sep 7, 2024

Serverless Site Health Check Notification System

#aws #lambda #serverless #eventdriven

I have spent a lot of time over the last few years working on projects in cloud environments. Most of my adventures involve AWS, but I have spent time with all the major clouds and have tried out many of the components they provide. There are so many tools available that in many cases it is just a matter of piecing together the best components to solve whatever your current problem is.

When you have deployed a new app or site in the cloud, the focus usually shifts to monitoring and maintenance. One of the core things to keep an eye on is making sure your site is reachable to your users at all times.

On AWS there are many ways to do this, but one approach I like is to set up Health Checks in Route 53. One misconception about these checks is that you have to use Route 53 for your DNS hosting, or have everything in Route 53, to use the Health Check feature. This is not the case. You can use this feature to monitor ANY endpoint (even ones you don't control) via HTTP, HTTPS, or TCP, and specify the host by hostname or IP address.

This post details how you can use key serverless components from AWS, such as Amazon EventBridge, AWS Lambda, and Simple Notification Service (SNS), to set up a system that monitors your site (which can be running anywhere) and sends emails, text messages, Slack messages, and more when the reachability status of your site changes.

Serverless Application Model (SAM)

I'm a big fan of using an Infrastructure as Code (IaC) approach for any project. My go-to tools for this are the Serverless Application Model (SAM) and its associated CLI (SAM CLI). For more official use cases and for cross-platform apps I typically use Terraform.

The setup for this project is done using SAM, and the sample code in the GitHub repository uses SAM as well. Of course, all of the components described here can also be set up using the AWS console, the AWS CLI, or many other approaches.

The full SAM project can be found in the GitHub repository, but here is the template.yaml file.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  route53-health-check-sam

  SAM Template for route53-health-check-sam
    - Route53 Health Check
    - Health Check Cloudwatch Alarm
    - Lambda Function to run when alarm state changes
    - SNS Topic to send updates on health check changes to

Globals:
  Function:
    Timeout: 3
    MemorySize: 128
    Tracing: Active
    LoggingConfig:
      LogFormat: JSON
  Api:
    TracingEnabled: true

Parameters:
  Hostname:
    Type: String
    Description: Hostname to monitor   
    Default: www.amazon.com
  SlackWebhookURL:
    Type: String
    Description: URL to publish slack messages to when health check changes state
    Default: https://hooks.slack.com/triggers/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

Resources:
  HealthCheckStateChangedFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: site_health_check/
      Handler: app.lambda_handler
      Runtime: python3.12
      Architectures:
        - x86_64      
      Policies:          
        - SNSPublishMessagePolicy: 
            TopicName: !GetAtt Route53HealthCheckSNSTopic.TopicName
      Environment:
        Variables:
          SNS_TOPIC_ARN: !Ref Route53HealthCheckSNSTopic
          SLACK_WEBHOOK_URL: !Ref SlackWebhookURL
      Events:
        Trigger:
          Type: EventBridgeRule
          Properties:
            Pattern:
              source:
                - aws.cloudwatch
              detail-type:
                - CloudWatch Alarm State Change
              detail:
                alarmName:
                  - wildcard: "*-HealthCheckAlarm"

  Route53HealthCheckSNSTopic:
    Type: "AWS::SNS::Topic"
    Properties:
      DisplayName: "Route53 Health Check SNS Topic"
      Subscription:  
        - Endpoint: healthcheck_status@example.com
          Protocol: email          
      TopicName: "Route53HealthCheckSNSTopic"

  Route53HealthCheck: 
    Type: 'AWS::Route53::HealthCheck'
    Properties: 

      HealthCheckConfig: 
        Port: 443
        Type: HTTPS
        ResourcePath: '/'
        FullyQualifiedDomainName: !Ref Hostname
        RequestInterval: 30
        FailureThreshold: 3    

  Route53HealthCheckAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Health Check Alarm
      AlarmName: !Join
        - ''
        - - !Ref Hostname
          - '-HealthCheckAlarm'
      Namespace: AWS/Route53
      MetricName: HealthCheckStatus
      Dimensions:
        - Name: HealthCheckId
          Value: !Ref Route53HealthCheck
      ComparisonOperator: LessThanThreshold
      EvaluationPeriods: 1
      Period: 30
      Statistic: Minimum
      Threshold: 1.0
      TreatMissingData: breaching

Route 53 Health Checks

The core component of this solution is the AWS service called Route 53 Health Checks. Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources.

Route 53 has health checkers in locations around the world. When you create a health check that monitors an endpoint, health checkers start to send requests to the endpoint that you specify to determine whether the endpoint is healthy. You can choose which locations you want Route 53 to use, and you can specify the interval between checks: every 10 seconds or every 30 seconds.

CloudWatch Alarms for Route 53 Health Checks

You can use Amazon CloudWatch Alarms to monitor the status of your Route 53 Health Checks. When setting up the alarm, you specify that it should be based on the metric associated with the Route 53 Health Check. When the metric changes from UP to DOWN (the metric goes from 1 to 0) or from DOWN to UP (the metric goes from 0 to 1), the state of the CloudWatch alarm changes as well. Note that Route 53 publishes its health check metrics to CloudWatch in the US East (N. Virginia) Region only, so the alarm needs to live in us-east-1.
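To make the alarm settings from the template concrete, here is a small Python sketch of how a single evaluation period is judged. It is a simplified model and not how CloudWatch is actually implemented. The function name alarm_state is made up for illustration, while the threshold, the Minimum statistic, and the "missing data is breaching" behavior mirror the values in the template above.

```python
from typing import Optional, Sequence

# Mirrors "Threshold: 1.0" with "ComparisonOperator: LessThanThreshold".
THRESHOLD = 1.0


def alarm_state(datapoints: Optional[Sequence[float]]) -> str:
    """Simplified model of one evaluation period of the alarm.

    HealthCheckStatus is 1 when the endpoint is healthy and 0 when it is not.
    """
    if not datapoints:
        # "TreatMissingData: breaching": no data counts as a failure.
        return "ALARM"
    # "Statistic: Minimum": a single unhealthy datapoint trips the alarm.
    return "ALARM" if min(datapoints) < THRESHOLD else "OK"


print(alarm_state([1, 1, 1]))
print(alarm_state([1, 0, 1]))
```

Using Minimum means the alarm reacts to any unhealthy datapoint in the period, which is what you usually want for a reachability check.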

One of the most important components in AWS for building Event-Driven Architectures is Amazon EventBridge. It is a serverless event bus that helps you receive, filter, transform, route, and deliver events. Most changes that happen in your AWS account are automatically sent to this bus, including CloudWatch alarm state changes. We'll discuss this below.

Amazon EventBridge

Whenever the state of a CloudWatch alarm changes, AWS automatically sends an event to the default EventBridge event bus in your account. To take advantage of this and trigger actions on events like this, you need to create an EventBridge Rule that matches the events you are interested in and defines what to do when a matching event is seen.

The rule we use matches state changes of the CloudWatch alarm attached to the Route 53 Health Check. We have set up the alarm name to be the hostname of what we are checking with a suffix of "-HealthCheckAlarm". The rule (the Pattern under Events in the SAM template above) matches any event that has an alarm name ending with this suffix.
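As a rough illustration of what the rule matches, here is a trimmed-down example of a CloudWatch alarm state change event and a Python function that mimics the pattern. The event is abbreviated (real events carry more fields, such as account, region, and time), and matches_rule is a hypothetical helper for illustration only. EventBridge does the real matching for you.

```python
# Abbreviated example of the event EventBridge delivers on an alarm state change.
SAMPLE_EVENT = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "www.amazon.com-HealthCheckAlarm",
        "state": {"value": "ALARM"},
        "previousState": {"value": "OK"},
    },
}

ALARM_SUFFIX = "-HealthCheckAlarm"


def matches_rule(event: dict) -> bool:
    """Mimic the event pattern from the SAM template in plain Python."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.cloudwatch"
        and event.get("detail-type") == "CloudWatch Alarm State Change"
        # Equivalent of the wildcard "*-HealthCheckAlarm" in the pattern.
        and str(detail.get("alarmName", "")).endswith(ALARM_SUFFIX)
    )


print(matches_rule(SAMPLE_EVENT))
```

Because the match is on the suffix only, one rule covers every hostname you deploy this stack for.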

When you create a rule in EventBridge to match events, you can specify a list of targets or actions you want to execute when a match happens. The target for this rule is an AWS Lambda function, which takes care of notifying the people we have set up via SNS and Slack.

Simple Notification Service (SNS)

There are many cases where you need to notify someone, or some other service, about changes in your system. On AWS, SNS is usually the best fit for this.

SNS is a managed service that provides message delivery from publishers to subscribers (also known as producers and consumers). Publishers communicate asynchronously with subscribers by sending messages to a topic, which is a logical access point and communication channel. Clients can subscribe to the SNS topic and receive published messages using a supported endpoint type, such as Amazon Kinesis Data Firehose, Amazon SQS, AWS Lambda, HTTP, email, mobile push notifications, and mobile text messages (SMS).

When you set up a new SNS topic and register endpoints (listeners), each subscriber typically has to accept being added to the notification list. With email notifications, for example, you will receive an email asking you to confirm that you are expecting the notifications and want to receive them.
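Once a subscription is confirmed, publishing to the topic from Python is a single boto3 call. The sketch below is a hypothetical helper, not necessarily the one in the repository. The optional client argument and the subject truncation are my own additions: SNS documents the subject as needing to be under 100 characters, and passing in a client makes the function easy to try out with a stub.

```python
from typing import Any, Optional

# SNS requires subjects to be less than 100 characters long.
MAX_SUBJECT_LENGTH = 99


def publish_to_sns(subject: str, message: str, topic_arn: str,
                   client: Optional[Any] = None) -> dict:
    """Publish a message with a custom subject to an SNS topic."""
    if client is None:
        # Imported lazily so the function can be exercised with a stub client.
        import boto3
        client = boto3.client("sns")
    return client.publish(
        TopicArn=topic_arn,
        Subject=subject[:MAX_SUBJECT_LENGTH],
        Message=message,
    )
```

Inside Lambda you would call it with just the subject, message, and topic ARN, and let it create the boto3 client itself.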

AWS Lambda Function

When the CloudWatch alarm state changes, the Lambda function we set up (named, for example, route53-health-check-sam-HealthCheckStateChangedFu) is executed. The information about which site changed and its new status is sent in the payload of the Lambda invocation.

Below is an example of the Lambda handler code. It uses the highly recommended Powertools for AWS Lambda library to follow best practices around tracing, logging, metrics, and more. The function is given an SNS topic ARN and a Slack webhook URL at creation time (via environment variables) to send notifications to. It parses the incoming event to determine which hostname had its status changed and what the new status is, and then sends out notifications.

One nice part about using SNS from a Lambda function is that you can set the notification subject and body to be whatever you want. When a CloudWatch alarm publishes to SNS directly, the subject and body are defined for you.

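import os

from aws_lambda_powertools import Logger, Metrics, Tracer

# The Powertools service name and metrics namespace are assumed to be
# configured elsewhere (for example via the POWERTOOLS_SERVICE_NAME and
# POWERTOOLS_METRICS_NAMESPACE environment variables).
logger = Logger()
tracer = Tracer()
metrics = Metrics()

ALARM_NAME_SUFFIX = "-HealthCheckAlarm"

# send_slack_message() and publish_to_sns() are helper functions defined
# elsewhere in the project.


@tracer.capture_lambda_handler
@logger.inject_lambda_context(log_event=True)
@metrics.log_metrics(capture_cold_start_metric=True)
def lambda_handler(event, context):

    SNS_TOPIC_ARN = os.environ['SNS_TOPIC_ARN']
    SLACK_WEBHOOK_URL = os.environ['SLACK_WEBHOOK_URL']

    logger.info(f"SNS_TOPIC_ARN={SNS_TOPIC_ARN}")
    # The webhook URL is a secret, so only log whether one is configured.
    logger.info(f"slack_webhook_configured={bool(SLACK_WEBHOOK_URL)}")
    try:
        if event['detail-type'] != "CloudWatch Alarm State Change":
            logger.error("This is not an event we care about - not sure why we got here")
            return

        event_detail = event['detail']
        # The alarm name is "<hostname>-HealthCheckAlarm"; strip the suffix to get the hostname.
        site_that_changed = event_detail['alarmName'].removesuffix(ALARM_NAME_SUFFIX)
        logger.info(f"site_that_changed={site_that_changed}")
        new_status = event_detail['state']['value']
        logger.info(f"new_status={new_status}")

        # Any state other than OK (ALARM or INSUFFICIENT_DATA) is reported as DOWN.
        if new_status != 'OK':
            icon = ":x:"
            text_status = "DOWN"
        else:
            icon = ":white_check_mark:"
            text_status = "UP"

        message_to_show = f"{icon} {site_that_changed} ( https://{site_that_changed} ) is now {text_status}"
        slack_data = {
            "text": message_to_show
        }

        # An empty webhook URL skips the Slack notification.
        if SLACK_WEBHOOK_URL:
            logger.info(f"slack_data={slack_data}")
            slack_response = send_slack_message(slack_data, SLACK_WEBHOOK_URL)
            logger.info(f"slack_response={slack_response}")

        sns_subject = f"{site_that_changed} is now {text_status}"
        sns_msg = f"https://{site_that_changed} is now {text_status}"
        sns_response = publish_to_sns(sns_subject, sns_msg, SNS_TOPIC_ARN)
        logger.info(f"sns_response={sns_response}")

    except Exception:
        # Log the full traceback rather than failing the invocation.
        logger.exception("Failed to process health check state change")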
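The handler calls send_slack_message, whose body is not shown above. Here is a minimal sketch of what such a helper could look like using only the standard library. It is an assumption on my part and the version in the repository may differ; build_slack_request and the timeout parameter are names I introduced for illustration.

```python
import json
import urllib.request


def build_slack_request(slack_data: dict, webhook_url: str) -> urllib.request.Request:
    """Build the JSON POST request for a Slack webhook."""
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(slack_data).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(slack_data: dict, webhook_url: str, timeout: float = 2.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = build_slack_request(slack_data, webhook_url)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

The default timeout is kept below the 3 second function timeout set in the template's Globals section, so a slow Slack response does not use up the whole invocation before the SNS message is published.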

Notifications on Route 53 Health Check status

In the example I have put together, you will get email notifications from SNS at the email address defined in the SAM template, as shown below (healthcheck_status@example.com is the default in the code). Of course, you will need to use a real email address that you have access to in order to accept the SNS subscription confirmation. You can also set up SMS text messages and more with SNS.

You will also receive Slack messages via the Slack webhook URL you define (the example below uses a FAKE webhook URL of https://hooks.slack.com/triggers/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). You will need to get a valid webhook URL, or you can set this to '' to skip the Slack update and just use SNS.

Setting up a Slack application or bot is out of scope for this article, but there is a good tutorial here: How to create a webhook URL for a Slack Channel?

The hostname to monitor (www.amazon.com is set up as the default) is defined in the parameters of the SAM template, as shown below.

  Route53HealthCheckSNSTopic:
    Type: "AWS::SNS::Topic"
    Properties:
      DisplayName: "Route53 Health Check SNS Topic"
      Subscription:  
        - Endpoint: healthcheck_status@example.com
          Protocol: email          
      TopicName: "Route53HealthCheckSNSTopic"

Parameters:
  Hostname:
    Type: String
    Description: Hostname to monitor   
    Default: www.amazon.com
  SlackWebhookURL:
    Type: String
    Description: URL to publish slack messages to when health check changes state
    Default: https://hooks.slack.com/triggers/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

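With this setup, each state change of the alarm produces a Slack message such as ":x: www.amazon.com ( https://www.amazon.com ) is now DOWN" and an email with a subject such as "www.amazon.com is now DOWN".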

Conclusion

I hope you learned something about using various components in AWS to set up a serverless site monitoring solution. Please clone the GitHub repository and try it out for yourself. There are likely many improvements that could be made.

Please let me know if you have any questions or concerns.

Cleanup

If you did clone the repo and set this solution up for yourself please remember to clean up the resources to avoid any ongoing costs.

Running sam delete deletes the underlying CloudFormation stack and the resources it provisioned in AWS.

For more articles from me please visit my blog at Darryl's World of Cloud or find me on X, LinkedIn, Medium, Dev.to, or the AWS Community.

For tons of great serverless content and discussions please join the Believe In Serverless community we have put together at this link: Believe In Serverless Community

Top comments (1)

Sean_Fang • Dec 24 '23

I'm a rookie dev using AWS serverless services to build a project currently. I remember it did make me scratch my head a bit when dealing with monitoring the errors caught in lambda functions, and I ended up with setting up cloudwatch custom alarms and custom metrics in a trycatch block in my lambda functions and using SNS to send notification just as mentioned in blog. It turned out it's great, especially we are able to set thresholds for the errors based on for example severity level.
However, by reading this blog, it makes me realize that we actually have a couple of options to choose from based on our needs when it comes to monitoring or health check.
Last but not least, thank you so much for sharing your experience! Really learned something today 😄 And I will definitely give this solution a try, particularly it's a good opportunity to practice SAM a bit 😄