Thomas Laue

Keep your CloudWatch bill under control when running AWS Lambda at scale

In this post, I show how to keep the AWS CloudWatch costs caused by AWS Lambda log messages under control without losing insights and debug information when errors occur. I present a logger with a built-in cache that controls how many messages are sent to AWS CloudWatch depending on the log level and the result of the function invocation.

AWS Lambda and AWS CloudWatch

AWS Lambda, the serverless compute service offered by AWS, sends all log messages (platform as well as custom messages) to AWS CloudWatch. Log messages are sorted into log groups and streams which are associated with the Lambda function and its invocations from which the messages originated.

Depending on the AWS region, CloudWatch charges for data ingestion (up to $0.90 per GB) and data storage (up to $0.0408 per GB per month). These fees add up quickly, and in a production environment it is not uncommon to spend far more on CloudWatch Logs than on Lambda itself (sometimes up to 10 times more). In addition, logs are often forwarded from CloudWatch to third-party systems for analysis, which adds further costs to the bill.
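To get a feeling for the numbers, the fees can be estimated with a few lines of JavaScript. The sketch below uses the upper-bound prices quoted above as defaults. The workload (100 million invocations per month with 2 KB of log output each) is a made-up assumption, and the function name `estimateMonthlyLogCost` is only illustrative. It ignores any free tier and the fact that stored data accumulates when no retention policy is set.

```javascript
// Rough monthly CloudWatch Logs cost estimate (a sketch, not a billing tool).
// Default prices are the upper bounds quoted in this post; check the pricing
// page for your region before relying on them.
const BYTES_PER_GB = 1024 ** 3;

function estimateMonthlyLogCost({
  invocationsPerMonth,
  logBytesPerInvocation,
  ingestionPricePerGb = 0.9,
  storagePricePerGbMonth = 0.0408,
}) {
  const ingestedGb =
    (invocationsPerMonth * logBytesPerInvocation) / BYTES_PER_GB;

  return {
    ingestedGb,
    ingestionCost: ingestedGb * ingestionPricePerGb,
    // Storage for one month of the data ingested in this month only
    storageCost: ingestedGb * storagePricePerGbMonth,
  };
}

// Hypothetical workload: 100 million invocations, 2 KB of log output each
const estimate = estimateMonthlyLogCost({
  invocationsPerMonth: 100_000_000,
  logBytesPerInvocation: 2048,
});

console.log(estimate);
```

Under these assumptions, the function ingests roughly 190 GB per month, which corresponds to about $170 of ingestion fees before any storage or third-party costs.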

Logging

Nevertheless, log files are an important resource for debugging problems and for gaining deeper insight into the behavior of a serverless system. Every logged detail might help to identify issues and fix bugs. Structured logging is important because structured logs can be analyzed much more easily (e.g. with CloudWatch Logs Insights), which saves time and engineering costs. The dazn-lambda-powertools library provides a logger that supports structured logging for Node.js; the AWS Lambda Powertools offer the same for Python and Java.
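As a minimal illustration of what "structured" means here, the sketch below writes every log entry as a single JSON object per line, so that tools can filter on fields such as `level` or `userId`. The helper `toLogLine` and the field names are assumptions for this example and do not reproduce the output format of the libraries mentioned above.

```javascript
// Minimal structured logging sketch: one JSON object per log line.
function toLogLine(level, message, params = {}) {
  return JSON.stringify({
    ...params, // custom fields first, so the fixed fields below always win
    level,
    message,
    timestamp: new Date().toISOString(),
  });
}

// Hypothetical log entry with two custom fields
const line = toLogLine("INFO", "User loaded", { userId: "42", durationMs: 17 });
console.log(line);
```

Because every entry is valid JSON, a query can select, for example, all entries with a given `userId` instead of relying on free-text search.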

Furthermore, it is highly recommended to reduce the retention time of CloudWatch log groups to a suitable period. By default, logs are stored forever, which leads to increasing costs over time. The retention policy of each log group can be changed manually in the AWS Console or, preferably, with an automated approach such as the one provided by this AWS SAR app.
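For a handful of log groups, the retention period can also be set with a short script. The sketch below assumes the AWS SDK for JavaScript v3, whose `@aws-sdk/client-cloudwatch-logs` package provides `CloudWatchLogsClient` and `PutRetentionPolicyCommand`. The function name `setRetention`, the log group name, and the 30-day period are made-up examples. The client and the command class are passed in so that the function itself has no hard dependency on the SDK.

```javascript
// Sets the same retention period on several log groups.
// `client` is expected to be a CloudWatchLogsClient and
// `PutRetentionPolicyCommand` the command class of the same name,
// both from "@aws-sdk/client-cloudwatch-logs".
async function setRetention(
  { client, PutRetentionPolicyCommand },
  logGroupNames,
  retentionInDays = 30
) {
  for (const logGroupName of logGroupNames) {
    await client.send(
      new PutRetentionPolicyCommand({ logGroupName, retentionInDays })
    );
  }
}

// Usage (needs AWS credentials and the SDK package, therefore commented out):
// const {
//   CloudWatchLogsClient,
//   PutRetentionPolicyCommand,
// } = require("@aws-sdk/client-cloudwatch-logs");
//
// setRetention(
//   { client: new CloudWatchLogsClient({}), PutRetentionPolicyCommand },
//   ["/aws/lambda/my-function"], // hypothetical log group name
//   30
// ).catch(console.error);
```

CloudWatch Logs only accepts specific values for the retention period (30 days is one of them), so an arbitrary number of days may be rejected by the API.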

Finally, sampling debug logs can cut the biggest part of the CloudWatch Logs bill, especially when running AWS Lambda at scale, without losing all insight into the system. Depending on the sampling rate (which has to be representative for the workload), a certain amount of debugging information remains available for monitoring and diagnostics.
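The sampling decision itself is simple: for each invocation, a random number decides whether debug logging is switched on. The sketch below only illustrates the idea and is not the implementation of any particular library; `shouldSampleDebugLogs` and the injectable `random` parameter are assumptions for this example.

```javascript
// Decide per invocation whether debug logs should be enabled.
// `sampleRate` is a fraction between 0 and 1 (0.01 = about 1% of invocations).
// `random` can be replaced in tests to make the decision deterministic.
function shouldSampleDebugLogs(sampleRate, random = Math.random) {
  return random() < sampleRate;
}

// Rough check: with a 10% rate, about 1,000 of 10,000 invocations are sampled.
let sampled = 0;
for (let i = 0; i < 10000; i++) {
  if (shouldSampleDebugLogs(0.1)) {
    sampled++;
  }
}

console.log(`Sampled ${sampled} of 10000 invocations`);
```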

The following image shows a CloudWatch log stream belonging to a Lambda function for which a sampling rate of 10% was used for demonstration purposes. A reasonable value for production will probably be much lower (e.g. 1%).

[Image: CloudWatch log stream with debug output for about every 10th Lambda invocation]

Problem with sampling debug logs

Nevertheless, as life goes, the sampling might not be active when something goes wrong (e.g. a bug that only occurs in edge cases), leaving a developer without the detailed information needed to fix the issue. For instance, the invocation event or the parameters of database or external API requests are of interest when problems occur.

One solution is a logger that caches all messages that are not written to the output stream because their severity is below the configured log level. The cached messages are only sent to CloudWatch in case of a program error, together with the error information, to give a full picture of the function invocation. This idea originated from the Production-Ready Serverless course by Yan Cui.

A reduced version of the logger which is based on the dazn-lambda-powertools-logger:

const log = require("@dazn/lambda-powertools-logger");

const LogLevels = {
  DEBUG: 20, INFO: 30, WARN: 40, ERROR: 50
};

class Logger {
  #logMessages = [];
  #level = "DEBUG";

  constructor() {
    this.#level = log.level;
  }

  handleMessage(levelName = "debug", message = "", params = {}, error = {}) {
    log[levelName](message, params, error);

    const level = LogLevels[levelName.toUpperCase()];

    if (level < LogLevels[this.#level]) {
      this.addToCache(levelName, message, params, error);
      return;
    }
  }

  addToCache(levelName, ...params) {
    this.#logMessages.push({ levelName, params });
  }

  writeAllMessages() {
    try {
      // The log level of the underlying logger has to be set to "debug"
      // because the current log level might prevent the cached
      // messages from being written.
      log.enableDebug();

      this.#logMessages.forEach((item) => {
        log[item.levelName.toLowerCase()](...item.params);
      });
    } finally {
      log.resetLevel();
    }
  }

  static debug(message, params) {
    globalLogger.handleMessage("debug", message, params);
  }

  static info(message, params) {
    globalLogger.handleMessage("info", message, params);
  }

  static warn(message, params, error) {
    globalLogger.handleMessage("warn", message, params, error);
  }

  static error(message, params, error) {
    globalLogger.handleMessage("error", message, params, error);
  }

  static writeAllMessages() {
    globalLogger.writeAllMessages();
  }

  ...
}

const globalLogger = new Logger();
module.exports = Logger;

The logger provides methods for the most common log levels. A message is either written to the output stream or added to the internal cache, depending on the current log level defined in the Lambda environment. If required, all cached messages can be written out as well using the "writeAllMessages" method.
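To try the caching behavior without any dependencies, the following standalone sketch reimplements the idea on top of an injectable `write` function (defaulting to `console.log`). It is a simplified model and not the logger shown above: the class name `CachingLogger`, its constructor arguments, and the `clear` method are assumptions made for this example.

```javascript
const LEVELS = { DEBUG: 20, INFO: 30, WARN: 40, ERROR: 50 };

class CachingLogger {
  #cache = [];
  #threshold;
  #write;

  constructor(level = "INFO", write = console.log) {
    this.#threshold = LEVELS[level];
    this.#write = write;
  }

  log(levelName, message, params = {}) {
    const entry = JSON.stringify({ ...params, level: levelName, message });

    if (LEVELS[levelName] < this.#threshold) {
      // Below the configured level: hold back instead of writing
      this.#cache.push(entry);
    } else {
      this.#write(entry);
    }
  }

  // Emit everything that was held back, e.g. after an error
  writeAllMessages() {
    this.#cache.forEach((entry) => this.#write(entry));
  }

  // Drop cached entries so they do not leak into the next invocation
  clear() {
    this.#cache = [];
  }
}

const logger = new CachingLogger("INFO");
logger.log("DEBUG", "Input event", { userId: "42" }); // cached, not printed
logger.log("INFO", "Processing"); // printed immediately
logger.writeAllMessages(); // prints the cached debug entry
logger.clear();
```

Note that in this model the cached entries are written after the messages that were already emitted, so the output is not in chronological order; a timestamp field would be needed to restore the order during analysis.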

How to use the logger within AWS Lambda

All required logic (including the sample logging configuration) has been added to a wrapper that receives the Lambda handler function as an argument. This wrapper can be reused for any Lambda function and published, for instance, in a private npm package.

const middy = require("middy");
const sampleLogging = require("@dazn/lambda-powertools-middleware-sample-logging");

const log = require("./logger");

module.exports = (lambdaHandler) => {
  const lambdaWrapper = async (event, context) => {
    log.debug(`Input event...`, { event });

    try {
      const response = await lambdaHandler(event, context, log);

      log.info(
        `Function [${context.functionName}] finished successfully with result: [${JSON.stringify(
          response
        )}] at [${new Date()}]`
      );

      return response;
    } catch (error) {
      log.writeAllMessages();
      throw error;
    } finally {
      log.clear();
    }
  };

  return middy(lambdaWrapper).use(
    sampleLogging({
      sampleRate: parseFloat(process.env.SAMPLE_DEBUG_LOG_RATE || "0.01"),
    })
  );
};
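The control flow of the wrapper can be checked locally without middy or sample logging. The sketch below is a stripped-down stand-in and not the wrapper above: `wrapWithCachedLogs` and the stub logger, which merely records which methods were called, are assumptions for this example.

```javascript
// Stripped-down wrapper: flush cached logs on error, always clear afterwards.
function wrapWithCachedLogs(lambdaHandler, log) {
  return async (event, context) => {
    try {
      return await lambdaHandler(event, context, log);
    } catch (error) {
      log.writeAllMessages(); // emit the cached debug output for this failure
      throw error; // let Lambda report the invocation as failed
    } finally {
      log.clear(); // the execution environment may be reused
    }
  };
}

// Stub logger that only records which methods were called
const calls = [];
const stubLog = {
  writeAllMessages: () => calls.push("writeAllMessages"),
  clear: () => calls.push("clear"),
};

const failingHandler = wrapWithCachedLogs(async () => {
  throw new Error("boom");
}, stubLog);

failingHandler({}, {}).catch((error) => {
  console.log(error.message, calls);
});
```

For the failing handler, `calls` contains "writeAllMessages" followed by "clear"; for a successful handler, only "clear" would be recorded.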

An example of a simple Lambda handler in which some user information is retrieved from DynamoDB is given below. This function fails randomly to demonstrate the logger behavior.

const { DynamoDB } = require("@aws-sdk/client-dynamodb");
const { marshall, unmarshall } = require("@aws-sdk/util-dynamodb");

// Assumption: the wrapper shown above is saved as ./wrapper.js
const wrapper = require("./wrapper");

const dynamoDBClient = new DynamoDB({ region: "eu-central-1" });

const handler = async (event, context, log) => {
  const userId = event.queryStringParameters.userId;
  const { name, age } = await getUserDetailsFromDB(userId, log);

  // Fail in about half of the invocations to demonstrate the logger behavior
  if (Math.random() > 0.5) {
    throw new Error("An error occurred");
  }

  const response = {
    statusCode: 200,
    body: JSON.stringify({
      name,
      age,
    }),
  };

  log.debug(`Response...`, { response });

  return response;
};

// The logger is passed in because it is only available as a handler argument
const getUserDetailsFromDB = async (userId, log) => {
  log.debug(`Get user information for user with id...`, { userId });

  const { Item } = await dynamoDBClient.getItem({
    TableName: process.env.TABLE_NAME,
    // Assumes the partition key "userId" is stored as a string,
    // because query string parameters always arrive as strings
    Key: marshall({
      userId,
    }),
  });

  const userDetails = unmarshall(Item);
  log.debug("Retrieved user information...", { userDetails });

  return userDetails;
};

module.exports.handler = wrapper(handler);

A small sample application (shown here in the Lumigo platform) demonstrates the different logger behavior:

[Image: Sample application]

A successful invocation of the sample app with the log level set to "INFO" does not write out any debug messages (except in the rare case of a sampled invocation):

[Image: Successful invocation of the sample application with resulting log stream]

However, all debug information is sent to CloudWatch Logs in case of an error, as can be seen below:

[Image: Failed invocation of the sample application with resulting log stream]

Caveats

Platform errors like timeouts or out-of-memory issues will not trigger the logger logic because the function does not run to its end but is terminated by the Lambda runtime.
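One possible mitigation for timeouts, sketched here as an assumption and not part of the logger above, is to flush the cache shortly before the deadline. The Lambda `context` object provides `getRemainingTimeInMillis()`, which can be used to schedule that flush. The name `withTimeoutFlush` and the 50 ms safety margin are hypothetical. Out-of-memory terminations cannot be handled this way, and the timer only fires if the handler does not block the event loop.

```javascript
// Schedules a flush of the cached logs shortly before the Lambda deadline.
function withTimeoutFlush(lambdaHandler, log, safetyMarginMs = 50) {
  return async (event, context) => {
    const delay = Math.max(
      context.getRemainingTimeInMillis() - safetyMarginMs,
      0
    );
    const timer = setTimeout(() => log.writeAllMessages(), delay);

    try {
      return await lambdaHandler(event, context);
    } finally {
      clearTimeout(timer); // finished in time: no flush needed
    }
  };
}

// Local demonstration with a fake context that reports 80 ms of remaining
// time and a handler that needs 150 ms (Lambda would terminate it earlier).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
let flushCount = 0;
const slowHandler = withTimeoutFlush(() => sleep(150), {
  writeAllMessages: () => flushCount++,
});

slowHandler({}, { getRemainingTimeInMillis: () => 80 }).then(() => {
  console.log(`Flushed ${flushCount} time(s) before the handler returned`);
});
```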

Takeaways

Logging is one of the most important tools for gaining insight into the behavior of any system, including AWS Lambda. CloudWatch Logs centralizes and manages the logs of most AWS services. It is not free, but there are ways to reduce the bill, such as sampling logs in production. As sampling might result in no debug logs at all when an error occurs, a logger with an internal cache has been presented that outputs all logs, but only in case of a problem. This logger can be combined with the sample logging strategy to keep the bill low while still getting all the information when it is really needed.

Let me know if you found this useful and what other approaches are used to keep the CloudWatch bill reasonable without losing all insights. Thank you for reading.

The full code including a small test application can be found in:

TLaue / logger-with-cache

An example of a logger for AWS Lambda which caches all messages
