DEV Community

shimo for AWS Community Builders


Suspend any Lambda functions using a single CloudWatch Alarm

(Note: This post does not reflect the CloudWatch alarm update of December 2022.)

Motivation

When using Lambda functions, there is a risk of far more invocations than intended, for example from a Lambda function that triggers itself in a loop or from a DDoS attack on a Lambda function URL. A user in Japan reported an incident with 2.8 billion invocations in 10 days (Japanese blog).

So it would be nice to set thresholds that suspend runaway functions. CloudWatch can detect such behavior and trigger alarms. This works, but we usually need one alarm per Lambda function, and each alarm costs 0.1 USD per month. With many Lambda functions, the cost adds up.

In this post, I share an approach in which a single CloudWatch Alarm covers all of the Lambda functions in a region. (So, 0.1 USD per month!)

How it works

The image below shows the steps of this architecture.

Idea design

  1. Suppose one or more Lambda functions are being invoked far too often. At this point we don't care which functions or how many are involved. Only the sum of invocations is used as the metric.

  2. When the sum of invocations reaches the first threshold, the CloudWatch Alarm is triggered. This alarm only says "the SUM of invocations reached the threshold."

  3. Lambda-throttle is invoked by the CloudWatch alarm. This function queries which functions were invoked at a high rate and how many times.

  4. For each Lambda function whose invocations exceed the second threshold, Lambda-throttle sets its concurrency to 0.

  5. Lambda-throttle sends a message to the user via Amazon SNS.
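
The two-stage decision in steps 1 to 4 can be sketched without any AWS calls. This is only an illustration: `pick_functions_to_throttle` and the `counts` mapping are hypothetical names, and using `>=` for both comparisons is an assumption.

```python
def pick_functions_to_throttle(counts, alarm_threshold, stop_threshold):
    """Return the names of functions to suspend, following steps 1-4.

    counts: hypothetical mapping of function name -> invocations in the window.
    """
    # Steps 1-2: the alarm only looks at the sum over all functions.
    if sum(counts.values()) < alarm_threshold:
        return []
    # Steps 3-4: only then is each function compared with the second threshold.
    return sorted(name for name, n in counts.items() if n >= stop_threshold)


counts = {"fn-a": 120, "fn-b": 3, "fn-c": 40}
print(pick_functions_to_throttle(counts, alarm_threshold=100, stop_threshold=50))
```

Only the first check needs a CloudWatch alarm. The per-function check runs inside Lambda-throttle, which is why a single alarm is enough.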

Code

(Find complete CDK code in my repository.)

First, here is a screenshot of the CloudWatch alarm settings. It is an ordinary alarm on the AWS/Lambda namespace. The first threshold, for the sum of invocations, is set to 100 this time.

CloudWatch alarm setting
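
For readers who prefer code to a screenshot, a roughly equivalent alarm could be described as arguments to `put_metric_alarm`. This is a sketch, not the configuration from the repository: the period, evaluation periods, comparison operator, and missing-data handling are assumptions, and `build_alarm_params` and `action_arn` are hypothetical names.

```python
def build_alarm_params(alarm_name, threshold, action_arn):
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/Lambda",
        "MetricName": "Invocations",
        # No "Dimensions" key: without a FunctionName dimension the metric is
        # aggregated over all functions in the region.
        "Statistic": "Sum",
        "Period": 60,  # assumed: evaluate one-minute sums
        "EvaluationPeriods": 1,  # assumed
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",  # assumed
        "TreatMissingData": "notBreaching",  # assumed
        "AlarmActions": [action_arn],
    }


params = build_alarm_params(
    "lambda-total-invocations", 100, "arn:aws:sns:us-east-1:123456789012:example"
)
print(params["Threshold"])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. `action_arn` stands for whatever resource ends up invoking Lambda-throttle. The post does not spell out that wiring, so an SNS topic ARN is used here as a placeholder.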

Next, let's see the Lambda code.

  • The second threshold is used to decide whether to suspend each function.
  • The most fun part of this post is MetricDataQueries. In this query, the Lambda function uses Metrics Insights to get the sum of invocations for each Lambda function. (I'll say it twice because it's important: not the sum over all functions, but the sum for each function.)
    • The query range is 5 minutes. A count for every 1 minute is obtained over this range.
    • Option: Adding "LIMIT 10", for example, to the Expression narrows the result.
  • After the metrics query, the sum of invocations for each function is compared with the second threshold (threshold_lambda_stop).
  • To suspend a function, call put_function_concurrency with ReservedConcurrentExecutions=0.
  • ("Failed to throttle." is just a verbose branch for the failure case.)
import os
from datetime import datetime, timedelta, timezone

import boto3


def send_sns(message, subject):
    """Publish a notification to the SNS topic given by the SNS_ARN environment variable."""
    client = boto3.client("sns")
    topic_arn = os.environ["SNS_ARN"]
    client.publish(TopicArn=topic_arn, Message=message, Subject=subject)


def get_invocation_top_functions():
    """
    Check which functions are invoked many times
    """
    range_minutes = 5
    cloud_watch = boto3.client("cloudwatch")
    # Take one timezone-aware timestamp so both ends of the range share the same "now"
    end_time = datetime.now(timezone.utc)

    response = cloud_watch.get_metric_data(
        MetricDataQueries=[
            {
                "Id": "q1",
                # Metrics Insights query: one time series per function, busiest first
                "Expression": """
                    SELECT SUM(Invocations)
                    FROM SCHEMA(\"AWS/Lambda\", FunctionName)
                    GROUP BY FunctionName
                    ORDER BY SUM() DESC
                    """,
                "Period": 60,
                "Label": "Invocation top",
            },
        ],
        StartTime=end_time - timedelta(minutes=range_minutes),
        EndTime=end_time,
    )

    return response


def handler(event, context):

    threshold_lambda_stop = int(os.environ["THRESHOLD_LAMBDA_STOP"])
    lambda_client = boto3.client("lambda")

    metrics = get_invocation_top_functions()

    # Count invocations over the query range (5 minutes) for each function
    # If the count reaches the threshold, throttle the function
    for fn in metrics["MetricDataResults"]:
        count = sum(fn["Values"])
        # The function name is the last word of the series label
        fn_name = fn["Label"].split()[-1]

        if count >= threshold_lambda_stop:
            # Reserved concurrency of 0 stops all further invocations
            result = lambda_client.put_function_concurrency(
                FunctionName=fn_name, ReservedConcurrentExecutions=0
            )

            # Notify
            if result["ResponseMetadata"]["HTTPStatusCode"] == 200:
                message = f"Lambda: {fn_name} was throttled. Count in 5 minutes: {count}."
                subject = "Lambda throttled."
                send_sns(message, subject)
            else:  # Verbose
                message = f"Failed throttling Lambda: {fn_name}. Count in 5 minutes: {count}."
                subject = "Failed to throttle."
                send_sns(message, subject)
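
The counting step in the handler can be tried offline against a hand-written response. In this sketch, `counts_per_function` is a hypothetical helper and the sample is made up. Its label format ("Invocation top" followed by the function name) is an assumption inferred from how the handler splits the label.

```python
def counts_per_function(response):
    """Sum the per-minute values of each time series, keyed by function name."""
    counts = {}
    for series in response["MetricDataResults"]:
        # Same convention as the handler: the name is the last word of the label.
        name = series["Label"].split()[-1]
        counts[name] = sum(series["Values"])
    return counts


# Hand-written sample shaped like a get_metric_data response (not real data).
sample = {
    "MetricDataResults": [
        {"Id": "q1", "Label": "Invocation top my-loop-fn", "Values": [30.0, 25.0, 20.0]},
        {"Id": "q1", "Label": "Invocation top my-quiet-fn", "Values": [1.0]},
    ]
}
print(counts_per_function(sample))
```

Because each series holds one value per minute, summing `Values` gives the count over the whole 5-minute range, which is the number compared with threshold_lambda_stop.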


Try

For a test, I set the first threshold (in the CloudWatch alarm) to 10 and the second threshold to 5. Then I manually ran a Lambda function more than 10 times in a minute.

I received an SNS message like this.

SNS sample

Note

  • When testing, make sure not to suspend your critical Lambda functions.

  • Because the resources are region-specific, one alarm per region is required.
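
After a test, the suspended functions need to be resumed. One possible way, not covered in the post itself, is the Lambda `delete_function_concurrency` call, which removes the reserved-concurrency setting. In this sketch, `restore_functions` is a hypothetical helper that takes an already created Lambda client.

```python
def restore_functions(lambda_client, function_names):
    """Remove the reserved-concurrency setting from each function.

    Returns the names that were processed, in order.
    """
    restored = []
    for name in function_names:
        # Deleting the setting (0 after a throttle) lets the function use
        # the unreserved account concurrency again.
        lambda_client.delete_function_concurrency(FunctionName=name)
        restored.append(name)
    return restored
```

It could be called as `restore_functions(boto3.client("lambda"), ["my-loop-fn"])`, where the function name is a placeholder. If a function had its own reserved concurrency before it was suspended, this removes the setting rather than restoring the old value, so that value would have to be set again separately.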

Summary

I have shared how to detect and throttle any Lambda function using a single CloudWatch Alarm.

Setting the two thresholds properly is essential. One catches the sum of invocations across all functions, and the other decides whether to suspend each function. The right values depend on your system.

Appendix

Find complete CDK code in my repository.
