Daniel Kneipp for AWS Community Builders

Posted on Aug 1, 2021 • Edited on Aug 6, 2021

Instrumenting AMIs for GPU monitoring on CloudWatch

#aws #devops #machinelearning #cloud

If you have used provisioned instances on AWS before, you know that the default metrics are fairly limited. You only have access to CPU utilization, network transfer rates, and disk reads/writes. By default, there is no monitoring of some basic information, like RAM and filesystem usage (which can be valuable for preventing an instance malfunction due to lack of resources).

In the case of GPU-accelerated applications (like Machine Learning apps), this problem goes even further, since you also don't have any access to GPU metrics, which are critical to guarantee the reliability of the system (e.g., exhausting the GPU memory can crash any application running on the GPU).

I've created a project (available here) showing how we can create an AMI with the CloudWatch agent for RAM and filesystem monitoring, and a custom service called gpumon to gather GPU metrics and send them to AWS CloudWatch.

Project structure

In the project we have two main directories like this:

.
├── packer  ==> AMI creation
└── tf      ==> AMI usage example

The first one contains all the necessary files to create the AMI based on Amazon Linux 2 using a tool called packer. The second one has infrastructure as code in terraform to provision an instance using the newly created AMI for testing purposes.

AMI creation

packer is a great tool for applying Infrastructure as Code principles to the AMI creation step. It can provision an instance with the specified base AMI, run scripts through ssh, start the process of AMI creation, and clean everything up (e.g. instance, EBS volume, ssh key pair) afterwards.

The file packer/gpu.pkr.hcl contains the specification of the AMI. There we can find the base AMI, the instance used to create the AMI, the storage configuration, and the scripts used to configure the instance.

Base AMI

In order to make my life a bit easier, I tried to look for AMIs that already have NVIDIA drivers installed, so that I don't have to install them myself. Looking through the AWS documentation about installing NVIDIA drivers, we can see that there are options already in the marketplace of AMIs with pre-shipped NVIDIA drivers. Among the options, we're going to use the Amazon Linux 2 one, because it already comes with the AWS Systems Manager agent, which we will use later on.

A couple of notes:

You don't need to subscribe to the marketplace product in order to have access to the AMI currently selected. However, you will need to subscribe to have access to the AMI id of new releases.
You will need a GPU-based instance to build the AMI (as it's required by the marketplace product specifications). I've tested this project in a new AWS account and it seems that the default limits don't allow the provisioning of GPU-based instances (G family). packer will show an error if that's your case as well. If it is, you can request a limit increase here.

CloudWatch Agent

The first addon that we're going to make to the base AMI is to install and configure the AWS CloudWatch Agent.

Installing the agent is well documented by AWS, and you can see more details and installation methods for other Linux distributions here.

The agent is configured through a .json file that it reads in order to know which metrics to monitor and how to publish them on CloudWatch. You can also see more about it on the documentation page.

The process is automated by the script packer/scripts/install-cloudwatch-agent.sh. It installs the agent and configures it with some relevant metrics like filesystem, RAM, and swap usage.

Note that the agent is configured to publish metrics with a period of 60 seconds. This can incur costs, since it's considered a detailed metric (see the CloudWatch pricing page to know more).
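To make the shape of that .json file more concrete, here is a minimal Python sketch that builds a configuration covering RAM, swap, and filesystem usage with a 60-second period. The measurement names follow the CloudWatch agent's documented schema, but the exact set of metrics and the monitored mount point ("/") are assumptions for illustration; the project's own script may configure things differently.

```python
import json


def build_agent_config(interval_seconds=60):
    """Return a CloudWatch agent config dict for RAM, swap and disk usage."""
    return {
        # How often the agent collects (and publishes) each metric.
        "agent": {"metrics_collection_interval": interval_seconds},
        "metrics": {
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "swap": {"measurement": ["swap_used_percent"]},
                "disk": {
                    "measurement": ["used_percent"],
                    # Assumption: only the root filesystem is monitored.
                    "resources": ["/"],
                },
            },
        },
    }


if __name__ == "__main__":
    # Print the JSON document that would be handed to the agent.
    print(json.dumps(build_agent_config(), indent=2))
```

Without an explicit namespace in the file, the agent publishes under CWAgent, which matches the namespace seen later in the CloudWatch console.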

Gathering the GPU metrics

AWS already has documentation covering ways to monitor GPU usage. There is a brief description of a tool called gpumon and also a more extended blog post about it.

gpumon is a (kind of old) python script developed by AWS that makes use of an NVIDIA library called NVML (NVIDIA Management Library) to gather metrics from the GPUs of the instance and publish them on CloudWatch. In this project the script was turned into a systemd unit. The script itself was also modified to make the error handling more readable and to capture memory usage correctly.

The gpumon service resides in packer/addons/gpumon, and install-cloudwatch-gpumon.sh automates the installation process. The service is configured to start the python script at boot and restart it if it stops working for some reason. Since systemd manages the service, its logs can be seen with journalctl --unit gpumon.

Note: the python script has only been tested on python2, which is deprecated. pip warns about that during installation while you create the AMI. You should keep that in mind if you intend to use this script for any production workload.
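The publishing half of a gpumon-style script can be sketched roughly as follows. This is not the project's code: it is a Python 3 illustration using boto3's put_metric_data, and the metric names, the dimension names (InstanceId, GPUNumber), and the sample instance id are assumptions. Only the GPU namespace and the us-east-1 region come from this article.

```python
def build_metric_data(instance_id, gpu_index, readings, resolution=60):
    """Turn {metric name: (value, unit)} into CloudWatch MetricData entries."""
    # Assumed dimensions: one time series per instance and per GPU.
    dimensions = [
        {"Name": "InstanceId", "Value": instance_id},
        {"Name": "GPUNumber", "Value": str(gpu_index)},
    ]
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Unit": unit,
            "Value": float(value),
            "StorageResolution": resolution,  # 60 = standard resolution
        }
        for name, (value, unit) in readings.items()
    ]


def publish(metric_data, namespace="GPU", region="us-east-1"):
    """Send the entries to CloudWatch; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so build_metric_data works without it

    client = boto3.client("cloudwatch", region_name=region)
    client.put_metric_data(Namespace=namespace, MetricData=metric_data)


if __name__ == "__main__":
    # Hypothetical readings; a real script would get them from NVML.
    sample = {"GPU Usage": (37, "Percent"), "Memory Usage": (12.5, "Percent")}
    for entry in build_metric_data("i-0123456789abcdef0", 0, sample):
        print(entry["MetricName"], entry["Value"], entry["Unit"])
```

Keeping the payload construction separate from the API call makes it possible to check what would be sent without credentials or a GPU.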

About the GPU memory usage metric gathering

The original script gets the GPU memory usage from the nvmlDeviceGetUtilizationRates() function. I noticed through some tests that this metric was 0 even though I had data loaded into the GPU.

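According to the NVIDIA documentation, the memory value returned by this function is actually the percentage of time during which GPU memory was being read or written, which isn't what I wanted. Data that just sits in GPU memory generates no reads or writes, which explains the 0. In order to get the amount of GPU memory allocated, nvmlDeviceGetMemoryInfo() should be used instead.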
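The difference between the two calls can be sketched with the pynvml bindings as below. This is an illustrative Python 3 sketch rather than the modified gpumon script itself; the function names memory_used_percent and read_gpu are hypothetical, and read_gpu only works on a machine with an NVIDIA driver and pynvml installed.

```python
def memory_used_percent(used_bytes, total_bytes):
    """Share of GPU memory currently allocated, as a percentage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def read_gpu(index=0):
    """Query one GPU through NVML; requires pynvml and an NVIDIA driver."""
    import pynvml  # lazy import: only needed on a GPU instance

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return {
            # Percentage of time device memory was being read or written.
            "memory_activity_percent": util.memory,
            # Percentage of device memory that is actually allocated.
            "memory_used_percent": memory_used_percent(mem.used, mem.total),
        }
    finally:
        pynvml.nvmlShutdown()
```

With data loaded but no kernels running, the first value can stay at 0 while the second reflects the allocation, which is the behavior described above.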

AMI Usage example

As an example of how to use this AMI, there is also a terraform project that contains the necessary resources to provision an instance and monitor it using the CloudWatch interface.

The tf/main.tf is the root file containing the reference to the module tf/modules/monitored-gpu, which encapsulates the resources such as the instance and IAM permissions.

This example doesn't require SSH capabilities from the instance. We will use AWS Systems Manager - Session Manager to access the instance (the base AMI already comes with the SSM agent preinstalled). This method is better because the access is registered in AWS, allowing security audits of the instance access. Also, there are no credentials or keys stored on any machine that could be leaked.

The required AWS managed permissions are:

CloudWatchAgentServerPolicy: allows the instance to publish CloudWatch metrics;
AmazonSSMManagedInstanceCore: allows instance access through Session Manager.
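The project wires these permissions up in terraform. Purely as an illustration of what that amounts to, the following Python sketch builds the two managed policy ARNs and the EC2 trust policy, and attaches them to a role with boto3. The role name and helper names are hypothetical, and this is not a substitute for the terraform module.

```python
import json

# The two AWS managed policies listed above.
MANAGED_POLICIES = (
    "CloudWatchAgentServerPolicy",
    "AmazonSSMManagedInstanceCore",
)


def managed_policy_arn(name):
    """ARN of an AWS managed policy (these live in the 'aws' account)."""
    return f"arn:aws:iam::aws:policy/{name}"


def ec2_trust_policy():
    """Trust policy that lets EC2 instances assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_monitoring_role(role_name="monitored-gpu-role"):
    """Create the role and attach both policies; needs boto3 and credentials."""
    import boto3  # lazy import so the helpers above work without it

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,  # hypothetical name
        AssumeRolePolicyDocument=json.dumps(ec2_trust_policy()),
    )
    for name in MANAGED_POLICIES:
        iam.attach_role_policy(
            RoleName=role_name, PolicyArn=managed_policy_arn(name)
        )
```

An instance also needs an instance profile wrapping the role, which this sketch leaves out.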

How to run it

All right, let's go to the fun part! To play with this project we first need to install some dependencies (packer and terraform).

A really handy tool that you can use to install and manage multiple versions of tools is asdf. It helps you keep track of and use different versions of a variety of tools. With it there is no need for you to uninstall the versions of the tools you may already have. With some simple commands it installs the versions needed and makes them context aware (the tooling version changes automatically after entering a directory that has a .tool-versions file).

You can go to this link to install asdf. After that you can simply run the following to have the correct versions of packer and terraform:

asdf plugin-add terraform https://github.com/asdf-community/asdf-hashicorp.git
asdf plugin-add packer https://github.com/asdf-community/asdf-hashicorp.git

asdf install

After that, it's time to build the AMI:

cd packer
packer init .
packer build .

This will start the process of building the AMI in the us-east-1 region. You can follow the terminal to see what is happening and the logs of the scripts. You can also watch the snapshot being taken in the AWS console, where the "Snapshots" page shows a progress bar.

The snapshot name tag will appear after the AMI has been created.

The AMI creation will be completed when you see something like this on your terminal:

...
==> amazon-ebs.gpu: Terminating the source AWS instance...
==> amazon-ebs.gpu: Cleaning up any extra volumes...
==> amazon-ebs.gpu: No volumes to clean up, skipping
==> amazon-ebs.gpu: Deleting temporary security group...
==> amazon-ebs.gpu: Deleting temporary keypair...
Build 'amazon-ebs.gpu' finished after 9 minutes 38 seconds.

==> Wait completed after 9 minutes 38 seconds

==> Builds finished. The artifacts of successful builds are:
--> amazon-ebs.gpu: AMIs were created:
us-east-1: ami-09a9fd45137e9129e

✅ At this point, you should have an AMI ready to be used!!

Now it's time to test it! Grab the AMI id (ami-09a9fd45137e9129e in this case) and paste it, replacing the text "<your-ami-id>" in the tf/main.tf file. After the modification, the section of the file that specifies the module should look like this:

module "gpu_vm" {
  source = "./modules/monitored-gpu"

  ami = "ami-09a9fd45137e9129e"
}

After that, just run:

cd tf
terraform init
terraform apply

terraform will ask you if you want to perform the actions specified. If, right before the prompt, it shows that it will create 6 resources, as shown below, you can type yes to start the resource provisioning.

...
Plan: 6 to add, 0 to change, 0 to destroy.
...

After roughly 5 minutes, go to the All metrics page on CloudWatch. You should already be able to see two new custom namespaces: CWAgent and GPU. These are the metrics the newly created instance publishes while idle.
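If you prefer to check this from a script instead of the console, a sketch along these lines lists what has been published under both namespaces. It uses boto3's list_metrics paginator; the helper names are hypothetical and the region is assumed to be the same us-east-1 used for the build.

```python
def metrics_by_namespace(pages):
    """Group metric names by namespace from list_metrics response pages."""
    grouped = {}
    for page in pages:
        for metric in page.get("Metrics", []):
            names = grouped.setdefault(metric["Namespace"], set())
            names.add(metric["MetricName"])
    return grouped


def fetch_custom_metrics(namespaces=("CWAgent", "GPU"), region="us-east-1"):
    """Query CloudWatch for both namespaces; needs boto3 and credentials."""
    import boto3  # lazy import so the grouping helper works without it

    client = boto3.client("cloudwatch", region_name=region)
    paginator = client.get_paginator("list_metrics")
    pages = []
    for namespace in namespaces:
        pages.extend(paginator.paginate(Namespace=namespace))
    return metrics_by_namespace(pages)
```

An empty result shortly after boot usually just means the first data points have not arrived yet.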

You can see more details about RAM and swap, for example, using the CWAgent namespace, like the next figure shows. With that you can monitor the boot behavior of the AMI, assess its performance and verify if it's behaving as expected.

The swap usage is 0 because there is no swap configured in this AMI (you can follow this documentation in order to add it). The spike of RAM usage you see is a test that I was making 😅.

Now, let's use this hardware a bit to see the metrics moving. Go to the Instances tab on the EC2 page, as shown in the next figure. Right-click on the running instance and hit Connect.

After that, go to the Session Manager tab and hit Connect.

You should now have shell access through your browser. Running the commands below will clone and build a utility that stress-tests the GPU for 10 minutes (gpu_burn takes its duration in seconds).

sudo -s
yum install -y git

cd ~
git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
make CUDAPATH=/opt/nvidia/cuda

./gpu_burn 600

You can look at CloudWatch to see the impact of the resource usage while gpu-burn does its thing, as shown in the figure below.

With these metrics, it's now easy to create alarms that alert you when an anomaly is detected in the resource usage, or to build autoscaling capabilities for a cluster using custom metrics.
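As a starting point for such an alarm, here is a hedged Python sketch using boto3's put_metric_alarm. The metric name ("Memory Usage"), the dimension names, the threshold, and the alarm name are all assumptions. Check the GPU namespace in your account for the exact metric and dimension names, since an alarm only matches a metric when its dimensions are identical.

```python
def gpu_memory_alarm(instance_id, gpu_index, threshold_percent=90.0):
    """Build put_metric_alarm arguments for high GPU memory usage."""
    return {
        "AlarmName": f"gpu-memory-high-{instance_id}-gpu{gpu_index}",
        "Namespace": "GPU",
        "MetricName": "Memory Usage",  # assumed metric name
        "Dimensions": [  # assumed dimension names
            {"Name": "InstanceId", "Value": instance_id},
            {"Name": "GPUNumber", "Value": str(gpu_index)},
        ],
        "Statistic": "Average",
        "Period": 60,  # seconds, matching the publishing period
        "EvaluationPeriods": 3,  # three consecutive breaches trigger it
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_alarm(params, region="us-east-1"):
    """Create or update the alarm; needs boto3 and AWS credentials."""
    import boto3  # lazy import so gpu_memory_alarm works without it

    client = boto3.client("cloudwatch", region_name=region)
    client.put_metric_alarm(**params)
```

Adding an AlarmActions entry with an SNS topic ARN would turn this into a notification; that part is left out because it depends on your account setup.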