In this post we discuss the IaC (Infrastructure as Code) files needed to provision an EKS cluster capable of sharing a single GPU between multiple pods (code available here).
If you have ever tried to use GPU-based instances with AWS ECS, or on EKS using the default NVIDIA plugin, you know that it's not possible to make several tasks/pods share the same GPU on an instance. If you want to add more replicas to your service (for redundancy or load balancing), you need one GPU for each replica.
This doesn't seem likely to change in the near future for ECS (see this feature request).
GPU-based instances are expensive, and although some Machine Learning frameworks (e.g. TensorFlow) are pre-configured to use the entire GPU by default, a workload does not always need all of it. ML services can be configured to run an independent inference per request instead of batch processing, and this requires just a fraction of the 16 GiB of VRAM that comes with some instances.
Currently, GPU-based instances only publish to ECS/EKS the number of GPUs they have. This means that a task/pod can only request a whole GPU, not a share of its resources (as is possible with CPU and RAM). The solution is to make the instance publish the amount of GPU resources it has (processing cores, memory, etc.) so that a pod can request only a fraction of them.
This project (available here) uses the k8s device plugin described by this AWS blog post to make GPU-based nodes publish the amount of GPU resources they have available. Instead of the amount of VRAM available or some abstract metric, this plugin advertises the number of pods/processes that can be connected to the GPU. This is controlled by what NVIDIA calls Multi-Process Service (MPS).
MPS manages workloads submitted by different processes so that they can be scheduled and executed concurrently on a GPU. On Volta and newer architectures we can also limit the number of GPU threads a process can use, which bounds how much of the GPU each process can take and ensures some Quality of Service (QoS) level.
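As a rough illustration of the thread limit mentioned above: on Volta and newer GPUs, MPS client processes read the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable. The sketch below shows one way such a value could be derived from the number of processes sharing a GPU. The helper name `mps_client_env` and the even-split policy are assumptions made for illustration; they are not something this project configures.

```python
import os


def mps_client_env(num_clients, base_env=None):
    """Build an environment for an MPS client limited to an even share of GPU threads.

    Hypothetical helper: the even split is an illustrative policy, not the
    behavior of the device plugin used in this post.
    """
    if num_clients < 1:
        raise ValueError("num_clients must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    # Even split rounded down, but never below 1%.
    percentage = max(1, 100 // num_clients)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(percentage)
    return env


if __name__ == "__main__":
    env = mps_client_env(num_clients=4, base_env={})
    print(env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"])
```

The resulting dictionary could be passed as the `env` argument of `subprocess.Popen` when launching a CUDA process that connects to an MPS daemon.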
Here we put it all together to deliver an infrastructure and deployment lifecycle that can be managed entirely by terraform. Here is the full list of tools needed:
- `terraform`: for infrastructure provisioning and service deployment (including the `DaemonSet` for the device plugin)
- `packer`: to create an instrumented AMI for GPU usage monitoring in CloudWatch
- `asdf`: a really handy tool used to install other tools in a version-controlled way
The rest will come along in the next steps ;)
At the end, you should have an infrastructure with the following features:
✔️ EKS cluster with encrypted volumes and secrets using KMS
✔️ All workers reside in private subnets and access the control plane only from within the VPC (no internet communication)
✔️ IP whitelist configured for accessing the k8s API from the internet
✔️ Instrumented instances with GPU usage monitored in CloudWatch
✔️ Nodes can be accessed with AWS SSM Session Manager
The first tool to be installed is `asdf`. With it, all the others will follow easily.
asdf can be installed following this guide from its documentation page. After that, you should be able to run the following list of commands to install the rest of the tooling.
    asdf plugin-add terraform https://github.com/asdf-community/asdf-hashicorp.git
    asdf plugin-add pre-commit git@github.com:jonathanmorley/asdf-pre-commit.git
    asdf plugin-add tflint https://github.com/skyzyx/asdf-tflint
    # plugin-add needs a plugin name before the repository URL
    asdf plugin-add awscli https://github.com/MetricMike/asdf-awscli.git
    asdf install
    pre-commit install
    tflint --init
This project also comes with `pre-commit` configured, to serve as a reference on how terraform-based projects can check for syntax and linting errors even before a commit is made (so that you don't have to wait for some CI pipeline).
For details about how the AMI is created and what comes with it, I highly suggest you check my other repo, which explains in detail how the AMI works and what IAM permissions it requires.
From that repo, the only thing changed is the base AMI: in this case, an AMI tailored for accelerated hardware on EKS was used. The list of compatible AMIs for EKS can be obtained from this link, which is updated regularly by AWS. Also, the AMI from AWS comes with the SSM agent installed, so there is no need to change anything regarding that.
The following commands will create an AMI named `packer-gpu-ami-0-1`, which should be picked up automatically by the terraform code of the cluster. All `packer` commands assume that you have already configured your AWS credentials properly.
    cd ami/
    packer build .
The cluster and network resources are defined together in the `cluster` directory. Here is a short description of the files:
- `main.tf`: defines the versions and configuration of the main providers, and sets values for variables that can be used in other files (e.g. the name of the cluster);
- `vpc.tf`: encompasses the network configuration where the EKS cluster will be provisioned. It doesn't contain a subnet for `us-east-1e` because, at the time of this writing, there were no `g4dn.xlarge` instances available in that availability zone;
- `eks.tf`: contains the cluster definition using managed workers. This is also where the `k8s.amazonaws.com/accelerator` label is defined, which is important to tell the device plugin where it should be deployed;
- `kms.tf`: here we have the definition of the Customer Managed Keys (CMKs) alongside the policies necessary to make them work for the encryption of the volumes of the cluster nodes and of k8s secrets;
- `iam.tf`: has the permissions necessary to make Session Manager access work and to allow the nodes to publish metrics to CloudWatch regarding CPU, RAM, swap, disk and GPU usage (go here to know more about the permissions for Session Manager and here to learn more about the permissions required by the CloudWatch Agent);
- `aws-virtual-gpu--device-plugin.tf`: generated from the `yaml` file of the same name obtained from the AWS blog post. Some modifications had to be made in order to make this `DaemonSet` work. Here they are:
  - The image `nvidia/cuda:latest` doesn't exist anymore, as the tag `latest` is now deprecated (source). Because of that, the image `nvidia/cuda:11.4.2-base-ubuntu20.04` is being used instead.
  - The number of `vgpu` configured for the container `aws-virtual-gpu-device-plugin-ctr` was raised from its default to `48`, because NVIDIA architectures from Volta onwards can handle up to `48` connections to the MPS (source). This has been done to increase how finely the GPU can be partitioned. Theoretically (not tested), 48 pods could share the same GPU (as long as they don't exceed the amount of VRAM available). At this point, the limitations of instance networking are more restrictive than GPU shareability.
  - Because this `vgpu` limit depends on the architecture of the GPU, the plugin was also configured to be deployed only on `g4dn.xlarge` instances (see how here), whose architecture (Turing) is known and on which this demo was tested.
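To make the two limits discussed above concrete, here is a minimal sketch of how many pods could share one GPU: the number is bounded both by the MPS slots advertised by the plugin and by the VRAM each pod needs. The function name and the per-pod VRAM figures are hypothetical; the 48 slots and the 16 GiB of VRAM are the figures used elsewhere in this post.

```python
def max_pods_per_gpu(mps_slots, slots_per_pod, vram_gib, vram_per_pod_gib):
    """Return how many pods fit on one GPU under both the slot and the VRAM limit."""
    if slots_per_pod < 1 or vram_per_pod_gib <= 0:
        raise ValueError("per-pod requirements must be positive")
    by_slots = mps_slots // slots_per_pod        # limit imposed by MPS connections
    by_vram = int(vram_gib // vram_per_pod_gib)  # limit imposed by GPU memory
    return min(by_slots, by_vram)


if __name__ == "__main__":
    # Hypothetical pods needing 0.5 GiB each: VRAM is the tighter limit.
    print(max_pods_per_gpu(48, 1, 16.0, 0.5))
    # Hypothetical pods needing 0.25 GiB each: the MPS slots are the tighter limit.
    print(max_pods_per_gpu(48, 1, 16.0, 0.25))
```

This ignores the networking limits of the instance mentioned above, which in practice may be reached first.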
Pro tip: If you want to convert k8s `yaml` files to `.tf`, you can use `k2tf` (repo), which converts the resource types in the `yaml` to their appropriate counterparts in the k8s provider for terraform. To install it, just run:
    wget https://github.com/sl1pm4t/k2tf/releases/download/v0.6.3/k2tf_0.6.3_Linux_x86_64.tar.gz
    tar zxvf k2tf_0.6.3_Linux_x86_64.tar.gz k2tf
    sudo mv k2tf /usr/local/bin/
    rm k2tf_0.6.3_Linux_x86_64.tar.gz
After that, you should be able to convert a `yaml` manifest with a simple command like `cat file.yaml | k2tf > file.tf`.
To provision all of this, the following command should be sufficient:
    cd cluster/
    terraform init
    terraform apply
`apply` should show `Plan: 59 to add, 0 to change, 0 to destroy.` If that's the case, type `yes` and go grab a cup of coffee, as this can take dozens of minutes.
After the resources are provisioned, you might want to run `terraform apply -refresh-only` to refresh your local state, as the creation of some resources changes the state of others within AWS. Also, state differences on `metadata.resource_version` of k8s resources almost always show up after an `apply`. This seems to be related to this issue.
Now you should see an EKS cluster with the following workloads:
The app is a `Deployment`, also obtained from the AWS blog post, that spawns 3 replicas of a resnet model in the cluster. This line defines "how much" GPU it needs. Because of this requirement, k8s will not schedule a pod of this deployment to a node that doesn't have a GPU.
This deployment is configured to use 20% of the GPU memory (using a TensorFlow feature here). Based on this VRAM usage, we need to configure how many of the 48 MPS process slots of an instance we want to reserve. Let's use `ceil` to be conservative, so `ceil(48 * 0.2) = 10`. With this we should be able to schedule up to 4 replicas on the same instance.
Since we're using the same tool for infrastructure management and app deployment, we can follow the exact same procedure to deploy the app.
    cd app/
    terraform init
    terraform apply
And now you should be seeing the resnet workload deployed like this:
Also, we can see on CloudWatch the amount of VRAM used on that instance, to confirm that more than one replica is actually allocating resources there. To know more about the new metrics available in CloudWatch published by instances using this custom AMI, please go here.
Now, how about we scale the deployment to `4` replicas? Go to this line, change the number of replicas from `3` to `4`, and run another `terraform apply`. After some time (~3-5 minutes) you should be able to see the VRAM usage of that instance increasing a bit more, like this:
Leveraging again the fact that we interact mostly with terraform, cleaning everything up should be as simple as:
    cd app/
    terraform destroy
    cd ../cluster/
    terraform destroy
Note: The order matters because you can't delete the EKS cluster before removing the resources allocated in it; otherwise you will get error messages from the AWS API about resources still being in use.
Also, don't forget to follow the clean-up procedure of the AMI repo to delete the created AMI and avoid EBS costs.
- [ ] Implement/test autoscaling features making a load test to resnet
- [ ] Enable and use IRSA
- [ ] Add Infracost on pre-commit config
Here we've implemented a complete infrastructure for an EKS cluster with shared GPU-based instances.