<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Daniel Kneipp</title>
    <description>The latest articles on DEV Community by Daniel Kneipp (@danielkneipp).</description>
    <link>https://dev.to/danielkneipp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F557810%2F08b38316-9fa0-4493-8966-275fc2d2a621.png</url>
      <title>DEV Community: Daniel Kneipp</title>
      <link>https://dev.to/danielkneipp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/danielkneipp"/>
    <language>en</language>
    <item>
      <title>Global Service on AWS</title>
      <dc:creator>Daniel Kneipp</dc:creator>
      <pubDate>Tue, 30 Jan 2024 20:29:21 +0000</pubDate>
      <link>https://dev.to/aws-builders/global-service-on-aws-1e2b</link>
      <guid>https://dev.to/aws-builders/global-service-on-aws-1e2b</guid>
      <description>&lt;p&gt;In a &lt;a href="https://dev.to/aws-builders/global-endpoint-for-a-multi-region-service-1oae"&gt;previous post&lt;/a&gt; I showed how you can have a multi-region service running while keeping response times low using an architectural pattern called &lt;a href="https://aws.amazon.com/blogs/architecture/improving-performance-and-reducing-cost-using-availability-zone-affinity/" rel="noopener noreferrer"&gt;Availability Zone Affinity&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, the previous design has a considerable issue: it doesn't perform a regional failover. In other words, if an entire region goes down, the service will become inoperable for the customers closer to that specific region.&lt;/p&gt;

&lt;p&gt;To overcome this problem, this project shows how &lt;a href="https://aws.amazon.com/global-accelerator/" rel="noopener noreferrer"&gt;Global Accelerator&lt;/a&gt; can be used to provide a single point of entry to your service with static IPs available globally.&lt;/p&gt;

&lt;p&gt;The code of this project is available &lt;a href="https://github.com/DanielKneipp/aws-global-endpoint" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Working with Global Accelerator&lt;/h2&gt;

&lt;p&gt;To improve the previous design, we will add a new entry point on top of it using Global Accelerator (GA). This AWS service provides a fixed global endpoint with two static IPs.&lt;/p&gt;

&lt;p&gt;When a web client uses this endpoint, its traffic is sent to the nearest point of presence of the AWS edge network, and from there it travels over the AWS backbone, instead of going through the public internet all the way to the intended resource (which can be a load balancer or an EC2 instance).&lt;/p&gt;

&lt;p&gt;GA is used by several &lt;a href="https://aws.amazon.com/global-accelerator/customers/" rel="noopener noreferrer"&gt;customers&lt;/a&gt;. Let's take &lt;a href="https://www.okta.com/" rel="noopener noreferrer"&gt;Okta&lt;/a&gt; as an example. Okta follows a multi-tenant architecture with a subdomain per customer, and for LinkedIn's subdomain you can see the GA endpoint exposed as a &lt;code&gt;CNAME&lt;/code&gt; record, as shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxqim2rel0i9jp2hakjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxqim2rel0i9jp2hakjf.png" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Feel free to test this on other &lt;a href="https://www.okta.com/customers/" rel="noopener noreferrer"&gt;Okta customers&lt;/a&gt;, such as Zoom, to see a different GA endpoint.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AWS also provides a webpage that allows you to see the differences in response times from different regions when you use GA as opposed to going via the public Internet to reach an AWS endpoint: &lt;a href="https://speedtest.globalaccelerator.aws/" rel="noopener noreferrer"&gt;https://speedtest.globalaccelerator.aws/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Implementing the Design&lt;/h2&gt;

&lt;p&gt;GA alone can provide the same features that Route53 offers through latency-based and failover records, so it could replace them entirely. However, in the code we will keep everything deployed previously, to allow some comparisons between the two approaches.&lt;/p&gt;

&lt;p&gt;In summary, using GA in place of Route53, the design looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dfsa463hp743p6enxgd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dfsa463hp743p6enxgd.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To understand step by step how the design was built, please visit the &lt;a href="https://github.com/DanielKneipp/aws-route53-global-dns" rel="noopener noreferrer"&gt;&lt;code&gt;aws-route53-global-dns&lt;/code&gt;&lt;/a&gt; repository.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A new file has been added at &lt;code&gt;aws-route53-global-dns/terraform/ga.tf&lt;/code&gt; with all the relevant code. The change was made in a separate branch so we can keep track of what changed and leave the previous project untouched.&lt;/p&gt;

&lt;p&gt;GA follows a component hierarchy of listener -&amp;gt; endpoint group -&amp;gt; endpoint.&lt;/p&gt;

&lt;p&gt;Listeners define the port and network protocol to listen to and which endpoint groups should receive the traffic.&lt;/p&gt;
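
&lt;p&gt;As a rough Terraform sketch (resource names and the port are illustrative, not taken from the project code), the accelerator and its listener could look like this:&lt;/p&gt;

```terraform
# Hypothetical sketch: a Global Accelerator with a single TCP listener on 443.
resource "aws_globalaccelerator_accelerator" "this" {
  name            = "global-service"
  ip_address_type = "IPV4"
  enabled         = true
}

resource "aws_globalaccelerator_listener" "tls" {
  # The accelerator's id attribute is its ARN.
  accelerator_arn = aws_globalaccelerator_accelerator.this.id
  protocol        = "TCP"

  port_range {
    from_port = 443
    to_port   = 443
  }
}
```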

&lt;p&gt;An endpoint group describes a regional group of endpoints, which can be Application Load Balancers, EC2 instances, or, in this case, Network Load Balancers (NLBs). For each endpoint you can set a weight that defines how traffic is balanced within the endpoint group.&lt;/p&gt;

&lt;p&gt;In the code you can see endpoints defined as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="nx"&gt;endpoint_configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;client_ip_preservation_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="nx"&gt;endpoint_id&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;services_eu&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"eu1"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;nlb_arn&lt;/span&gt;
  &lt;span class="nx"&gt;weight&lt;/span&gt;                         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;endpoint_configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;client_ip_preservation_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="nx"&gt;endpoint_id&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;services_eu&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"eu2"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;nlb_arn&lt;/span&gt;
  &lt;span class="nx"&gt;weight&lt;/span&gt;                         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the above configuration, we are defining that the primary endpoint in the EU (&lt;code&gt;eu1&lt;/code&gt;) should receive 255/256 of the traffic, while the secondary endpoint used for failover receives 1/256.&lt;/p&gt;

&lt;p&gt;An endpoint with weight 0 doesn't receive traffic as long as another endpoint group has healthy endpoints. In other words, if &lt;code&gt;eu2&lt;/code&gt; had its weight set to 0 and &lt;code&gt;eu1&lt;/code&gt; stopped working, traffic would fail over to the other region, not to &lt;code&gt;eu2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;⚠️ Note: the failover cluster receives a small portion of traffic (1/256 ≈ 0.39%), which makes the design an &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover-types.html#dns-failover-types-active-active" rel="noopener noreferrer"&gt;active-active setup&lt;/a&gt;, in contrast to the active-passive configuration we had before. This has the benefit of ensuring that the failover cluster is always operational, since it receives a portion of customer traffic at all times.&lt;/p&gt;
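
&lt;p&gt;You can sanity-check the traffic split implied by the weights with a quick one-liner (using the weights 255 and 1 from the configuration above):&lt;/p&gt;

```shell
# Traffic share per endpoint is weight / sum(weights).
primary=255
secondary=1
awk -v p="$primary" -v s="$secondary" \
  'BEGIN { printf "primary: %.2f%%  secondary: %.2f%%\n", 100*p/(p+s), 100*s/(p+s) }'
# prints: primary: 99.61%  secondary: 0.39%
```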

&lt;blockquote&gt;
&lt;p&gt;As another implementation detail: as of now, GA doesn't support &lt;a href="https://docs.aws.amazon.com/global-accelerator/latest/dg/preserve-client-ip-address.html" rel="noopener noreferrer"&gt;client IP preservation&lt;/a&gt; when traffic is forwarded to an NLB with a TLS listener. That is why you see &lt;code&gt;client_ip_preservation_enabled = false&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This addition alone is enough to test GA without impacting the existing infrastructure, which shows the benefit of a progressive design that allows improvement by composition, with minimal change to existing components.&lt;/p&gt;

&lt;h2&gt;Testing the Design&lt;/h2&gt;

&lt;p&gt;As mentioned before, a new branch &lt;code&gt;global_accelerator&lt;/code&gt; has been created on the &lt;code&gt;aws-route53-global-dns&lt;/code&gt; project with the changes required to add GA.&lt;/p&gt;

&lt;p&gt;To deploy everything, just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;aws-route53-global-dns/terraform/
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The deployment can take several minutes. For more information regarding the deployment procedure, please refer to &lt;a href="https://github.com/DanielKneipp/aws-route53-global-dns?tab=readme-ov-file#deploy" rel="noopener noreferrer"&gt;this more detailed description&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The previous domain names still work, so we can test them and compare the differences. To hit the GA, use a domain name such as &lt;code&gt;service.dkneipp.com&lt;/code&gt;; to reach the closest primary NLB directly, use &lt;code&gt;www.service.dkneipp.com&lt;/code&gt;, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c6nuw4buu6bcv0xrb0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c6nuw4buu6bcv0xrb0w.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡 From the domain name using the Global Accelerator, we can see the two &lt;a href="https://aws.amazon.com/global-accelerator/features/#Static_anycast_IP_addresses" rel="noopener noreferrer"&gt;anycast IPs&lt;/a&gt;. So, even in the event of a regional failover, the IPs of your service don't change, which allows customers to define network policies for your service based solely on IPs if required.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: &lt;code&gt;service.dkneipp.com&lt;/code&gt; now works. Unlike a &lt;code&gt;CNAME&lt;/code&gt; record, which cannot coexist with the zone's &lt;code&gt;SOA&lt;/code&gt; and &lt;code&gt;NS&lt;/code&gt; records at the apex, the &lt;code&gt;A&lt;/code&gt; record of type &lt;code&gt;alias&lt;/code&gt; used for the GA endpoint can live at the apex of the zone.&lt;/p&gt;
&lt;/blockquote&gt;
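
&lt;p&gt;For reference, such a zone-apex alias record could be declared roughly like this (zone and resource names are hypothetical, and the &lt;code&gt;hosted_zone_id&lt;/code&gt; attribute on the accelerator depends on your AWS provider version):&lt;/p&gt;

```terraform
# Hypothetical sketch: apex A record aliasing the Global Accelerator.
# Unlike a CNAME, an alias A record is allowed at the apex of the zone.
resource "aws_route53_record" "apex" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "service.dkneipp.com"
  type    = "A"

  alias {
    name                   = aws_globalaccelerator_accelerator.this.dns_name
    zone_id                = aws_globalaccelerator_accelerator.this.hosted_zone_id
    evaluate_target_health = true
  }
}
```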

&lt;p&gt;Now, let's do some testing.&lt;/p&gt;

&lt;h3&gt;Failover&lt;/h3&gt;

&lt;p&gt;To simulate an issue in one of the web servers, the instance is removed from the associated target group (as shown &lt;a href="https://github.com/DanielKneipp/aws-route53-global-dns?tab=readme-ov-file#test-failover" rel="noopener noreferrer"&gt;here&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;After that, you can see that external web clients are seamlessly redirected to the secondary web server, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frau622qk09iebakr2113.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frau622qk09iebakr2113.png" width="800" height="611"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is the command used to perform the test shown above:&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;2 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s1"&gt;'Total: %{time_total}s\n'&lt;/span&gt; &lt;span class="s1"&gt;'https://service.dkneipp.com'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; +%T &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;p&gt;This was also performed in the &lt;a href="https://github.com/DanielKneipp/aws-route53-global-dns" rel="noopener noreferrer"&gt;previous project&lt;/a&gt;; the interesting addition here is the cross-region failover. After all the web servers in the region are taken down:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu65lc7jcnup1yifcpzpr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu65lc7jcnup1yifcpzpr.png" width="800" height="763"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the failover happening automatically, and the web server in the other region starts responding to the traffic (now with much higher response times, but with the service still operational). &lt;em&gt;However, in this case, the failover was not transparent, and the end user would have experienced issues for around 17 seconds.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And finally, once the primary web server is live again, the recovery is also automatic after a transition window, as seen below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl2w1vyyq0qfiibjfjh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl2w1vyyq0qfiibjfjh2.png" width="800" height="1009"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Response times&lt;/h3&gt;

&lt;p&gt;To get more interesting statistics over response times (such as the mean with standard deviation, and percentiles), I've created a small utility in Go that computes them from the response times of &lt;code&gt;GET&lt;/code&gt; requests to a specified URL.&lt;/p&gt;

&lt;p&gt;The code of the utility is in &lt;code&gt;http-latency-test/&lt;/code&gt;. Binaries for macOS on ARM and x86 Linux have already been built, and you can use the &lt;code&gt;Makefile&lt;/code&gt; to build a binary from source if required.&lt;/p&gt;

&lt;p&gt;The utility accepts the following arguments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-count&lt;/code&gt;: Max number of requests. Pass 0 to keep it running forever;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-sleep&lt;/code&gt;: The amount of time in milliseconds to wait between requests (default is 500ms);&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-url&lt;/code&gt;: The endpoint to make the request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, a simple test can be &lt;code&gt;./http-latency-test --url https://service.dkneipp.com --count 1000 --sleep 100&lt;/code&gt;.&lt;/p&gt;
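
&lt;p&gt;If you'd rather not build the Go binary, a rough shell equivalent can collect &lt;code&gt;curl&lt;/code&gt; timings and summarize them with &lt;code&gt;awk&lt;/code&gt; (the URL and request count below are placeholders):&lt;/p&gt;

```shell
# Rough sketch: collect total-time samples with curl, then print mean and p95.
url="https://example.com"
count=20
for _ in $(seq "$count"); do
  curl -s -o /dev/null -w '%{time_total}\n' "$url"
  sleep 0.1
done | sort -n | awk '
  { v[NR] = $1; sum += $1 }
  END {
    p = int(NR * 0.95); if (p == 0) p = 1
    printf "mean: %.3fs  p95: %.3fs\n", sum / NR, v[p]
  }'
```

&lt;p&gt;It lacks the standard deviation and nicer output of the Go utility, but it needs nothing beyond &lt;code&gt;curl&lt;/code&gt; and a POSIX shell.&lt;/p&gt;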

&lt;p&gt;With this tool, a test was performed comparing the response times of GA against hitting the closest NLB directly. This was done for both the European and South American regions, and the results are shown below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Europe&lt;/th&gt;
&lt;th&gt;South America&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2w4x4uuafzlagge97u5.png" width="800" height="520"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mrn3148rtrdzeqwjtbw.png" width="800" height="550"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The interesting thing to point out is that Global Accelerator delivers the same or better response times than hitting the NLB directly. For the European region, we can see an average improvement of 22% in response times! 🤩&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;However, this varies by region and also depends on several networking factors, such as the location of the web client and its Internet connection conditions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As mentioned before, this improvement comes from the network path the packets take. When using the NLB directly, traffic goes through the public internet before reaching the AWS resource; when using GA, it travels over the AWS backbone as much as possible.&lt;/p&gt;

&lt;h3&gt;Cleanup&lt;/h3&gt;

&lt;p&gt;A simple &lt;code&gt;terraform destroy&lt;/code&gt; should delete all 127 resources.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This project shows how you can use the AWS Global Accelerator to make a service globally available, resilient to failures in availability zones and entire regions, while also keeping response times low for users worldwide.&lt;/p&gt;

&lt;p&gt;AWS Global Accelerator can be used for many other use-cases, such as &lt;a href="https://aws.amazon.com/blogs/networking-and-content-delivery/using-aws-global-accelerator-to-achieve-blue-green-deployments/" rel="noopener noreferrer"&gt;Blue-Green deployments&lt;/a&gt;, custom routing to build &lt;a href="https://docs.aws.amazon.com/global-accelerator/latest/dg/about-custom-routing-how-it-works.html" rel="noopener noreferrer"&gt;sessions for online games&lt;/a&gt;, or even just to provide a static IP and endpoint to customers without having to rely on DNS.&lt;/p&gt;

&lt;p&gt;I encourage you to have a look at it if you manage AWS environments and didn't know about this service.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>networking</category>
      <category>dns</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Global Endpoint For a Multi-region Service</title>
      <dc:creator>Daniel Kneipp</dc:creator>
      <pubDate>Fri, 19 Jan 2024 18:45:23 +0000</pubDate>
      <link>https://dev.to/aws-builders/global-endpoint-for-a-multi-region-service-1oae</link>
      <guid>https://dev.to/aws-builders/global-endpoint-for-a-multi-region-service-1oae</guid>
      <description>&lt;p&gt;Making a web application highly available requires several components working together: automated health checks, failover mechanisms, backup infrastructure, you name it. And when it comes to keeping response times low globally, the complexity increases even further.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/DanielKneipp/aws-route53-global-dns" rel="noopener noreferrer"&gt;This project&lt;/a&gt; presents progressively how to go from a simple application available on the internet to a full-fledged setup with automated failover, together with user traffic segmented per region to keep response times low.&lt;/p&gt;

&lt;h2&gt;Progressive Architecture&lt;/h2&gt;

&lt;p&gt;Here we have three infrastructure designs of the same service, at increasing levels of complexity.&lt;/p&gt;

&lt;p&gt;The first one is a simple web server running on an EC2 machine.&lt;/p&gt;

&lt;p&gt;The second is a more reliable design, with another web server running in passive mode. This second web server would only respond to traffic in case the first one becomes inoperable.&lt;/p&gt;

&lt;p&gt;Finally, the last one replicates the previous design in a different region, to provide the same service to a customer base in a geographically distant location.&lt;/p&gt;

&lt;h3&gt;As simple as it gets&lt;/h3&gt;

&lt;p&gt;This design is, as the section title implies, as simple as it gets. Essentially, we have just a two-tier application running, with three main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A DNS record pointing to an &lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html" rel="noopener noreferrer"&gt;AWS Network Load Balancer (NLB)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This NLB is deployed in the public subnet to receive traffic from the internet. This first layer/tier keeps the web server from being publicly exposed to the internet, and also enables load balancing across multiple servers if more become available&lt;/li&gt;
&lt;li&gt;Finally, we have the web server itself receiving traffic only from the NLB on a private subnet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv17xmb404qcmb8dl2lpa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv17xmb404qcmb8dl2lpa.png" alt="Simple Architecture" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything is deployed on the same &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones" rel="noopener noreferrer"&gt;availability zone (AZ)&lt;/a&gt;, which represents one or more AWS data centers geographically close to each other. This is to ensure fast communication between the load balancer and the web server. This also avoids &lt;a href="https://aws.amazon.com/blogs/architecture/overview-of-data-transfer-costs-for-common-architectures/" rel="noopener noreferrer"&gt;costs related to data transfer between AZs on AWS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The obvious problem with this design is that if the web server goes out of service, or if the AZ becomes inoperable, the application stops responding to user requests. To overcome this, we will move on to the next design.&lt;/p&gt;

&lt;h3&gt;Going active-passive&lt;/h3&gt;

&lt;p&gt;Here we replicate the same infrastructure in a different AZ. The key difference is that this secondary deployment should not receive user traffic unless the primary one stops working. This is also known as an active-passive deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguw1csdbqwbg1jjcgwyf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguw1csdbqwbg1jjcgwyf.png" alt="Highly Available Architecture" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this scenario we use AWS Route53 records of type &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover-types.html#dns-failover-types-active-passive-one-resource" rel="noopener noreferrer"&gt;&lt;code&gt;failover&lt;/code&gt;&lt;/a&gt;. Those records work in pairs of primary and secondary records. As the name implies, the primary record should point to the server that responds to traffic under normal circumstances.&lt;/p&gt;

&lt;p&gt;The secondary DNS record will only take precedence if the primary server is not working properly, or, in other words, it is unhealthy.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;And how can a server be detected as unhealthy?&lt;/em&gt; Using &lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html" rel="noopener noreferrer"&gt;health checks&lt;/a&gt;. These can be configured so that the load balancer performs regular checks to confirm the web server is working properly.&lt;/p&gt;

&lt;p&gt;If those checks fail beyond a certain threshold, the target (the EC2 instance) is labeled as unhealthy. If a certain number of targets are considered unhealthy, the endpoint of the load balancer itself is considered unhealthy.&lt;/p&gt;
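
&lt;p&gt;As an illustration, those thresholds live on the target group's health check block; something along these lines (values and names are illustrative, not the project's actual settings):&lt;/p&gt;

```terraform
# Hypothetical sketch: TCP health check on an NLB target group.
# After 3 consecutive failed checks a target is marked unhealthy.
resource "aws_lb_target_group" "web" {
  name        = "web"
  port        = 443
  protocol    = "TCP"
  vpc_id      = module.vpc_sa.vpc_id
  target_type = "instance"

  health_check {
    protocol            = "TCP"
    interval            = 10
    healthy_threshold   = 3
    unhealthy_threshold = 3
  }
}
```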

&lt;p&gt;This status is reported to Route53 (when using records of type &lt;code&gt;alias&lt;/code&gt;), which uses this information to decide whether DNS requests should resolve to the secondary load balancer instead, consequently performing the failover.&lt;/p&gt;
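
&lt;p&gt;A failover pair of alias records can be sketched like this (resource, zone, and record names are hypothetical; the &lt;code&gt;nlb_domain_name&lt;/code&gt; and &lt;code&gt;nlb_zone_id&lt;/code&gt; outputs come from the service module):&lt;/p&gt;

```terraform
# Hypothetical sketch: primary/secondary failover alias records.
# evaluate_target_health lets Route53 react to the NLB's health status.
resource "aws_route53_record" "primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "sa.service.dkneipp.com"
  type           = "A"
  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = module.services_sa["sa1"].nlb_domain_name
    zone_id                = module.services_sa["sa1"].nlb_zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "sa.service.dkneipp.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = module.services_sa["sa2"].nlb_domain_name
    zone_id                = module.services_sa["sa2"].nlb_zone_id
    evaluate_target_health = true
  }
}
```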

&lt;blockquote&gt;
&lt;p&gt;ℹ️ We are also making use of the &lt;a href="https://aws.amazon.com/blogs/architecture/improving-performance-and-reducing-cost-using-availability-zone-affinity/" rel="noopener noreferrer"&gt;Availability Zone Affinity&lt;/a&gt; architectural pattern to, as mentioned previously, improve response times and reduce costs. This way, traffic that reaches an AZ never leaves it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, now we have a more reliable design with an automated failover mechanism between distinct data centers. However, as a single-region deployment, users in a geographically distant region will suffer from slow response times. Although a &lt;a href="https://aws.amazon.com/what-is/cdn/" rel="noopener noreferrer"&gt;CDN&lt;/a&gt; exists for this use case, it serves static assets, not dynamic APIs.&lt;/p&gt;

&lt;h3&gt;Multi-region setup&lt;/h3&gt;

&lt;p&gt;To achieve a multi-region architecture, while retaining the previous features, the strategy is still the same: replicate the previous design, this time in a different region.&lt;/p&gt;

&lt;p&gt;However, in this case, the two deployments are connected via &lt;code&gt;CNAME&lt;/code&gt; DNS records with a &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-latency.html" rel="noopener noreferrer"&gt;latency&lt;/a&gt; routing policy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhymh5sqz4atpelu5ujvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhymh5sqz4atpelu5ujvq.png" alt="Complete Architecture" width="800" height="770"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The records are assigned to different AWS regions, and AWS measures the latency between users and those regions. The record of the region with the lowest latency is used to resolve the user's DNS requests.&lt;/p&gt;
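
&lt;p&gt;These latency records can be sketched as follows (regions and record names are hypothetical; each record carries the region it represents):&lt;/p&gt;

```terraform
# Hypothetical sketch: latency-based CNAME records, one per region.
# Route53 answers with the record whose region has the lowest measured latency.
resource "aws_route53_record" "latency_sa" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "www.service.dkneipp.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "sa-east-1"
  records        = ["sa.service.dkneipp.com"]

  latency_routing_policy {
    region = "sa-east-1"
  }
}

resource "aws_route53_record" "latency_eu" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "www.service.dkneipp.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "eu-west-1"
  records        = ["eu.service.dkneipp.com"]

  latency_routing_policy {
    region = "eu-west-1"
  }
}
```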

&lt;p&gt;This design is highly available and provides fast response times to users in different regions of the world. Now, it's time to deploy it!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Note: this design provides resilience against AZ failure, but not regional outages. If the entire region fails, users of that region won't have the service operational.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Implementation&lt;/h2&gt;

&lt;p&gt;Now that we have discussed the architectural design, it is time to implement it. To access the source code, please go to the &lt;a href="https://github.com/DanielKneipp/aws-route53-global-dns" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Infrastructure as code&lt;/h3&gt;

&lt;p&gt;Here we use Terraform to define all the infrastructure as code. This makes the design replications mentioned previously relatively easy: we can define pieces of the infrastructure as modules and reuse them as many times as we need.&lt;/p&gt;

&lt;p&gt;Here is a brief description of the main directories and files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── 📁 regional_domain/ -&amp;gt; Per-region DNS config module
├── 📁 service/         -&amp;gt; NLB + EC2 + web server module
├── 📄 dns.tf           -&amp;gt; Hosted zone and DNS module usage
├── 📄 locals.tf        -&amp;gt; Variables used for replicating server resources
├── 📄 services.tf      -&amp;gt; Loop over the variables to deploy the servers
└── 📄 vpc.tf           -&amp;gt; Network config for public/private subnets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The specifications of the service deployments are defined in &lt;code&gt;./terraform/locals.tf&lt;/code&gt;: for each region and AZ, a service is defined. As shown below, a public subnet is passed for the NLB and a private one for the server. The subnets determine which AZ the service is deployed in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="nx"&gt;sa1&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sa1-service"&lt;/span&gt;
  &lt;span class="nx"&gt;private_subnet&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;public_subnet&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The name defines what you will see as a response if that server specifically responds to the request.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Those variables are looped over in &lt;code&gt;services.tf&lt;/code&gt; by region via &lt;code&gt;for_each = local.services.&amp;lt;REGION&amp;gt;&lt;/code&gt;. This is a nice example of how terraform features let us replicate infrastructure with next to no code duplication.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dns.tf&lt;/code&gt; defines which service deployments are the primary and the secondary. DNS records are deployed in primary/secondary pairs via the &lt;code&gt;regional_domain/&lt;/code&gt; module, together with a latency record associated with the region the module is deployed to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="nx"&gt;elb_target_primary&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;services_sa&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"sa1"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;nlb_domain_name&lt;/span&gt;
  &lt;span class="nx"&gt;zone_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;services_sa&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"sa1"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;nlb_zone_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those are the main components of the code. Feel free to dive into the modules to see how they are implemented. Now let's jump into how to deploy this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy
&lt;/h3&gt;

&lt;p&gt;The very first thing needed for this demo is a domain name. If you already have one, remember to configure your domain's registrar to use Route53 as the DNS server once the infrastructure has been deployed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In my case, I added the name servers assigned to the Hosted Zone as &lt;code&gt;NS&lt;/code&gt; records for the &lt;code&gt;locals.domain_name&lt;/code&gt; on Cloudflare.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And if you don't have one, remember you can also buy one from AWS itself, as shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foyx4aft78x50jhl4rscy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foyx4aft78x50jhl4rscy.png" alt="Buy domain name from AWS" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Just remember that this will automatically create a Hosted Zone for the domain you bought. You will need to change it to make use of the new one created by this project.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In order to run this project you essentially only need terraform. However, I highly suggest installing it via &lt;a href="https://asdf-vm.com/guide/getting-started.html" rel="noopener noreferrer"&gt;&lt;code&gt;asdf&lt;/code&gt;&lt;/a&gt;, as it allows you to automate the installation of several other tools and keep multiple versions of them installed at the same time.&lt;/p&gt;

&lt;p&gt;Once &lt;code&gt;asdf&lt;/code&gt; is installed, terraform can be installed in its correct version via&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;asdf plugin-add terraform https://github.com/asdf-community/asdf-hashicorp.git
asdf &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This project obviously also requires that you have properly configured access to your AWS account. If you have configured the credentials using a &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-format-profile" rel="noopener noreferrer"&gt;file with a profile name&lt;/a&gt;, the only thing needed is to change the profile name in the &lt;code&gt;providers.tf&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;And lastly, change the domain name defined in &lt;code&gt;locals.tf&lt;/code&gt; to your own.&lt;/p&gt;

&lt;p&gt;With that, you can run the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cd terraform &amp;amp;&amp;amp; terraform init &amp;amp;&amp;amp; terraform apply&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You should see the output:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Plan: 123 to add, 0 to change, 0 to destroy.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The deployment of all resources can take from 5 to around 10 minutes.&lt;/p&gt;

&lt;p&gt;After all has been deployed, you can already try to reach the service via &lt;code&gt;www.&amp;lt;DOMAIN-NAME&amp;gt;&lt;/code&gt;, in my case, &lt;code&gt;www.service.dkneipp.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You should see &lt;code&gt;sa1-service&lt;/code&gt; or &lt;code&gt;eu1-service&lt;/code&gt;, depending on the region you are currently in 😉. A &lt;code&gt;dig&lt;/code&gt; command also lets us identify which region is responding to the request.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Web UI&lt;/th&gt;
&lt;th&gt;DNS Resolution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqwyikz5ua0yw3bjf01y.png" alt="Web service" width="800" height="391"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqya59m9pbddin0xe05y1.png" alt="Dig" width="800" height="513"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Also, the IP returned by the DNS resolution should match that of the NLB in the primary AZ of the region you are closest to.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 There is only one IP returned because the NLB has only one AZ associated. This ensures the traffic always goes to the designated AZ unless a failure happens in that AZ.&lt;/p&gt;
&lt;/blockquote&gt;
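&lt;p&gt;Conceptually, the latency-based resolution happening here just picks the regional record with the lowest measured latency to the client. Below is an illustrative toy model in python (the function name and latency numbers are made up; Route53's real algorithm is more involved):&lt;/p&gt;

```python
# Toy model of Route53 latency-based routing: each region owns one record
# pointing at its single-AZ NLB, and the record with the lowest measured
# latency to the client is the one returned.
def pick_latency_record(records, client_latencies_ms):
    # records: region name mapped to the NLB IP of that region
    # client_latencies_ms: region name mapped to the measured latency
    best_region = min(records, key=lambda region: client_latencies_ms[region])
    return records[best_region]
```

&lt;p&gt;With this model, a client measured closer to South America gets the South American NLB's IP, while a European client gets the European one.&lt;/p&gt;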

&lt;p&gt;And by using another computer closer to the other region, we can see the response from that region.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Web UI&lt;/th&gt;
&lt;th&gt;DNS Resolution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fteha4kfmy57mo29xevf9.png" alt="Web service SA" width="" height=""&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feon63pdl40fjxmttf7gn.png" alt="Dig SA" width="794" height="466"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Test failover
&lt;/h3&gt;

&lt;p&gt;In order to test whether the failover from one AZ to the other happens as expected, remove the instance from the target group associated with the primary NLB of the region you want to test. The image below shows how the instance can be removed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8hbfpj4t8n2chcpnvqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8hbfpj4t8n2chcpnvqf.png" alt="AWS Console" width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also identify the correct target group by checking the listener of the primary NLB. The listener will be forwarding the traffic to the target group that should be changed.&lt;/p&gt;

&lt;p&gt;After this, the NLB has no healthy targets left and is itself considered unhealthy. This status is reported to Route53, which automatically starts resolving DNS requests to the secondary NLB.&lt;/p&gt;
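&lt;p&gt;This health-driven failover can be sketched as pure logic: an NLB with no healthy targets is reported unhealthy, and resolution falls back to the secondary record. A toy model in python (the names are illustrative, not real Route53 or NLB APIs):&lt;/p&gt;

```python
# Toy model of the AZ failover described above. An NLB is healthy while at
# least one registered target passes its health checks; Route53 answers with
# the primary record only while its health check passes.
def nlb_is_healthy(target_states):
    return any(state == "healthy" for state in target_states)

def resolve(primary, secondary):
    # primary / secondary: {"ip": str, "targets": [state, ...]}
    if nlb_is_healthy(primary["targets"]):
        return primary["ip"]
    return secondary["ip"]
```

&lt;p&gt;Deregistering the instance empties the primary's target list, so resolution flips to the secondary IP, which is exactly what the next screenshots show.&lt;/p&gt;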

&lt;p&gt;Wait around 2 minutes, and you should see the following while accessing the same &lt;code&gt;www.service.dkneipp.com&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Web UI&lt;/th&gt;
&lt;th&gt;DNS Resolution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fza43aur4pb0e24lj2wmz.png" alt="Web service failover" width="800" height="426"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8hfqfiya9mmfuv0vmxv.png" alt="Dig failover" width="800" height="513"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now, the secondary server is responding to the traffic, as we can identify from the Web UI response and the fact that the IP of the server changed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Note: since different regions effectively serve different applications, this could also be used as a way to roll out service upgrades segmented by region.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And now, if you add the instance back to the target group of the primary NLB, in a couple of minutes you should see the previous response back again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cleanup
&lt;/h3&gt;

&lt;p&gt;A simple &lt;code&gt;terraform destroy&lt;/code&gt; should delete all 123 resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project shows a step-by-step design evolution from a basic web server available on the web, to a resilient design capable of handling failures in data centers and keeping response times low for users in different parts of the globe.&lt;/p&gt;

&lt;p&gt;A lot more can still be done, of course, like using multiple servers in the same AZ to handle more load, or adopting a microservice approach with Kubernetes or AWS ECS to handle the web-server code deployment.&lt;/p&gt;

&lt;p&gt;However, the goal of this project is to show some interesting AWS Route53 and NLB features we can use to build a fast, reliable, and cost-effective web service spread across different parts of the world.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>infrastructureascode</category>
      <category>dns</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Your own Stable Diffusion endpoint with AWS SageMaker</title>
      <dc:creator>Daniel Kneipp</dc:creator>
      <pubDate>Thu, 13 Oct 2022 20:49:09 +0000</pubDate>
      <link>https://dev.to/aws-builders/your-own-stable-diffusion-endpoint-with-aws-sagemaker-1534</link>
      <guid>https://dev.to/aws-builders/your-own-stable-diffusion-endpoint-with-aws-sagemaker-1534</guid>
      <description>&lt;p&gt;&lt;a href="https://stability.ai/blog/stable-diffusion-public-release" rel="noopener noreferrer"&gt;Stable Diffusion&lt;/a&gt; is the name of a Deep Learning model created by stability.ai that allows you to generate images from their description. In short, you feed a textual scene description to the model and it returns an image that fits that description. What is cool about it is that it can generate very artistic and arguable beautiful images resembling pieces of art like the following&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz2syxl5qvzw8as3ogwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz2syxl5qvzw8as3ogwq.png" alt="Taken from official Stable Diffusion release page" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fke9w2lgbgcoyit52m078.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fke9w2lgbgcoyit52m078.png" alt="Taken from official Stable Diffusion release page" width="800" height="590"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://github.com/DanielKneipp/aws-sagemaker-stable-diffusion" rel="noopener noreferrer"&gt;aws-sagemaker-stable-diffusion repo&lt;/a&gt; you will find everything needed in order to spin-up your own personal public endpoint with a Stable Diffusion model deployed using AWS SageMaker to show your friends 😎&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Your own Stable Diffusion endpoint with AWS SageMaker

&lt;ul&gt;
&lt;li&gt;TL;DR&lt;/li&gt;
&lt;li&gt;Setting things up&lt;/li&gt;
&lt;li&gt;Trying the model locally&lt;/li&gt;
&lt;li&gt;Going to the cloud&lt;/li&gt;
&lt;li&gt;Initial setup&lt;/li&gt;
&lt;li&gt;Deploy on AWS using SageMaker&lt;/li&gt;
&lt;li&gt;Lambda + API Gateway&lt;/li&gt;
&lt;li&gt;Clean-up&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Assuming you already have &lt;code&gt;asdf&lt;/code&gt; and &lt;code&gt;pyenv&lt;/code&gt; installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define the bucket name (where the model will be sent) in &lt;code&gt;./terraform/variables.tf&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Here you will need to provide your huggingface credentials in order to confirm&lt;/span&gt;
&lt;span class="c"&gt;# the model license has been accepted&lt;/span&gt;
&lt;span class="nv"&gt;INSTALL_TOOLING&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true &lt;/span&gt;bash setup.sh

&lt;span class="nb"&gt;cd &lt;/span&gt;terraform/ &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; terraform init &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; terraform apply &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd&lt;/span&gt; ../
&lt;span class="nb"&gt;cd &lt;/span&gt;sagemaker/ &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; bash zip-model.sh &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd&lt;/span&gt; ../
&lt;span class="nb"&gt;cd &lt;/span&gt;lambda/sd-public-endpoint/ &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; bash deploy.sh &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd&lt;/span&gt; ../../
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The endpoint that appears in the output can be used for inference. Here's how to use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;endpoint&amp;gt;/&lt;/code&gt;              -&amp;gt; Generates a random description and feeds it into the model&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;endpoint&amp;gt;/default&lt;/code&gt;       -&amp;gt; Uses the default description "a photo of an astronaut riding a horse on mars"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;endpoint&amp;gt;/&amp;lt;description&amp;gt;&lt;/code&gt; -&amp;gt; Uses the &lt;code&gt;&amp;lt;description&amp;gt;&lt;/code&gt; as input for the model. You can use spaces here.&lt;/li&gt;
&lt;/ul&gt;
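&lt;p&gt;The routing above can be sketched as a small pure function. This is only an illustration of the described behavior; the helper name, the random subject list, and the URL-decoding detail are assumptions, and the actual chalice app in &lt;code&gt;lambda/sd-public-endpoint/&lt;/code&gt; may differ:&lt;/p&gt;

```python
# Illustrative sketch of the endpoint's path-to-prompt routing rules.
import random

DEFAULT_PROMPT = "a photo of an astronaut riding a horse on mars"
RANDOM_SUBJECTS = [
    "a watercolor painting of a lighthouse at dawn",
    "a robot reading a book in a library",
]

def path_to_prompt(path):
    path = path.strip("/")
    if path == "":
        # bare endpoint: generate a random description
        return random.choice(RANDOM_SUBJECTS)
    if path == "default":
        return DEFAULT_PROMPT
    # free-form description; spaces arrive URL-encoded
    return path.replace("%20", " ")
```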
&lt;h2&gt;
  
  
  Setting things up
&lt;/h2&gt;

&lt;p&gt;For this repo, you will need these tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;awscli&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2.8.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;terraform&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.3.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;python&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3.9.13&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For python, it's recommended to use &lt;a href="https://github.com/pyenv/pyenv-installer" rel="noopener noreferrer"&gt;&lt;code&gt;pyenv&lt;/code&gt;&lt;/a&gt;, which allows you to install several versions of python at the same time with simple commands like this: &lt;code&gt;pyenv install 3.9.13&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For the rest, you can use a tool called &lt;a href="https://asdf-vm.com/guide/getting-started.html" rel="noopener noreferrer"&gt;&lt;code&gt;asdf&lt;/code&gt;&lt;/a&gt;, which allows basically the same but for several other tools. If you have it installed, you can install the rest with just &lt;code&gt;asdf install&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To be able to clone the repo with the ML model (&lt;code&gt;stable-diffusion-v1-4/&lt;/code&gt;), we will also need &lt;a href="https://git-lfs.github.com/" rel="noopener noreferrer"&gt;&lt;code&gt;git-lfs&lt;/code&gt;&lt;/a&gt;, which allows versioning of large files. The model is defined as a submodule of this repo, and when cloning the submodule you will be asked for your huggingface account credentials. This is to confirm that you have accepted the license required to access the model weights.&lt;/p&gt;

&lt;p&gt;And finally, &lt;a href="https://zlib.net/pigz/" rel="noopener noreferrer"&gt;&lt;code&gt;pigz&lt;/code&gt;&lt;/a&gt; is also necessary as it allows parallel gzip compression. This makes the model packaging for SageMaker run much faster.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;setup.sh&lt;/code&gt; script can do all of this for you. It works on Mac OS X and Debian-based Linux distros. If you already have &lt;code&gt;asdf&lt;/code&gt; and &lt;code&gt;pyenv&lt;/code&gt; installed and want to install the rest, just run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;INSTALL_TOOLING&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true &lt;/span&gt;bash setup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, if you already have the tooling, and just want &lt;code&gt;git-lfs&lt;/code&gt; and &lt;code&gt;pigz&lt;/code&gt; with the submodule cloned, please run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash setup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Trying the model locally
&lt;/h2&gt;

&lt;p&gt;To better understand how to use the model and the inference code we'll use later on, we first should try to run the model locally. Based on the &lt;a href="https://stability.ai/blog/stable-diffusion-public-release" rel="noopener noreferrer"&gt;official release page&lt;/a&gt; you should be able to access the model weights on their &lt;a href="https://huggingface.co/CompVis/stable-diffusion-v1-4" rel="noopener noreferrer"&gt;huggingface page&lt;/a&gt;. Here we will use version &lt;code&gt;v1.4&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📄 In order to have access to the weights, you have to accept their terms of use&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On the page we can see code samples showing how to run an inference on the model. There you will see examples of a standard inference; however, it's recommended to have an NVIDIA GPU with at least 10GB of VRAM. To let my weaker hardware make an inference, I chose to test the configuration using &lt;code&gt;float16&lt;/code&gt; precision instead, which is the code you can find in &lt;code&gt;local/code.py&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ There is also a &lt;code&gt;local/code-mac.py&lt;/code&gt; available for those who want to try it out on a Mac. However, bear in mind that one inference can take several minutes (~30 min on an M1 Pro), whereas on a GPU it might take 5 seconds using the low-VRAM configuration on an RTX 3070.&lt;/p&gt;
&lt;/blockquote&gt;
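&lt;p&gt;For reference, the &lt;code&gt;float16&lt;/code&gt; configuration boils down to roughly the following sketch (assuming the &lt;code&gt;diffusers&lt;/code&gt; &lt;code&gt;StableDiffusionPipeline&lt;/code&gt; API; actually running it requires &lt;code&gt;torch&lt;/code&gt;, &lt;code&gt;diffusers&lt;/code&gt;, a CUDA GPU, and huggingface credentials with the model license accepted):&lt;/p&gt;

```python
# Sketch of a float16 (low-VRAM) Stable Diffusion inference, roughly what
# local/code.py does. Output path and function name are illustrative.
def generate(prompt, output_path="output/output.png"):
    # Imports live inside the function so the sketch is readable without
    # the heavy dependencies installed.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        torch_dtype=torch.float16,  # roughly halves VRAM usage vs. float32
    )
    pipe = pipe.to("cuda")
    image = pipe(prompt).images[0]
    image.save(output_path)
    return output_path
```

&lt;p&gt;Calling &lt;code&gt;generate("a photo of an astronaut riding a horse on mars")&lt;/code&gt; on suitable hardware writes the generated PNG to the output path.&lt;/p&gt;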

&lt;p&gt;On my local computer, I used &lt;a href="https://github.com/NVIDIA/nvidia-docker" rel="noopener noreferrer"&gt;&lt;code&gt;nvidia-docker&lt;/code&gt;&lt;/a&gt; to run the code inside a container. This way I don't have to worry about installing the right version of CUDA (matching pytorch) on my machine.&lt;/p&gt;

&lt;p&gt;So, going to the fun part, to make an inference, you just need to run &lt;code&gt;cd local &amp;amp;&amp;amp; bash build-run.sh&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All scripts in this repo assume they are executed from their own directories.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The whole execution can take several minutes the first time, as it builds the container image with the model inside. After everything has finished, you should be able to see the generated image at &lt;code&gt;local/output/output.png&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You might see something like this on the default description of &lt;code&gt;"a photo of an astronaut riding a horse on mars. VFX, octane renderer"&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvc4i0g2lq6raoll4kmf2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvc4i0g2lq6raoll4kmf2.png" alt="Stable Diffusion local output" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And you can change the description by editing the &lt;code&gt;prompt&lt;/code&gt; variable in &lt;code&gt;local/code.py&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going to the cloud
&lt;/h2&gt;

&lt;p&gt;Cool, so now that we have the inference code working, it's time to put it in the cloud to, ultimately, make it available to others. Let's start with the resources we'll need and the overall architecture.&lt;/p&gt;

&lt;p&gt;Firstly, about how the ML model will be executed: although we could simply spin up an EC2 instance and attach a web server to it to receive requests, we will use &lt;a href="https://aws.amazon.com/pt/sagemaker/" rel="noopener noreferrer"&gt;AWS SageMaker&lt;/a&gt;, which lets us do exactly that, and much more, in a managed way (meaning several components, e.g. the web-server implementation, are managed by AWS). SageMaker will manage the GPU-backed EC2 instance and give us a &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deployment.html" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;private&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; endpoint to interact with the model. Since we are using huggingface, the two have a nice integration, and you can learn more about it from the &lt;a href="https://huggingface.co/docs/sagemaker/inference" rel="noopener noreferrer"&gt;huggingface docs about SageMaker&lt;/a&gt; or the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html" rel="noopener noreferrer"&gt;AWS docs about huggingface&lt;/a&gt; 🤝.&lt;/p&gt;

&lt;p&gt;However, one thing to note is that SageMaker provides a private endpoint, only accessible if you have access to the AWS account associated with the resource. Since we want a public endpoint, we need to put on top of it a &lt;a href="https://aws.amazon.com/lambda/?nc1=h_ls" rel="noopener noreferrer"&gt;lambda&lt;/a&gt; that forwards requests from an &lt;a href="https://aws.amazon.com/api-gateway/" rel="noopener noreferrer"&gt;API Gateway&lt;/a&gt;. This combination follows the recommended approach from an official &lt;a href="https://aws.amazon.com/blogs/machine-learning/call-an-amazon-sagemaker-model-endpoint-using-amazon-api-gateway-and-aws-lambda/" rel="noopener noreferrer"&gt;AWS blog post&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ One important limitation of API Gateway to note is the hard &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html#http-api-quotas" rel="noopener noreferrer"&gt;30-second timeout&lt;/a&gt; on requests. Because of this, the same low-VRAM configuration (&lt;code&gt;float16&lt;/code&gt; precision) was needed to guarantee a lower response time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Putting it all together, the architecture goes like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8f4o5d3387on7qvf1e6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8f4o5d3387on7qvf1e6.png" alt="Overall architecture" width="758" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram shows all the resources needed and which directory manages each of them. Here is a brief description:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Directory&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;lambda&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Has the lambda + API Gateway implementation using a framework called &lt;a href="https://aws.github.io/chalice/#" rel="noopener noreferrer"&gt;chalice&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sagemaker&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Contains the python code to manage the SageMaker Model, Endpoint, and the custom inference code. It also has the script used to pack and send the model to an S3 bucket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;terraform&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Manages the S3 bucket itself and the IAM roles required by the lambda code (to access the SageMaker endpoint) and for the Sagemaker endpoint (to access the model on the S3 bucket)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Initial setup
&lt;/h3&gt;

&lt;p&gt;Before anything, we need to deploy the roles and the S3 bucket mentioned in order to set the stage to the rest of the resources. You can configure the bucket name used in the file &lt;code&gt;terraform/variables.tf&lt;/code&gt;. After that, you can run (assuming that you already have the access to your AWS account properly configured):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;terraform
terraform init
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In total, 11 resources should be created. In this process, a couple of &lt;code&gt;.txt&lt;/code&gt; files will be created as well. Those files are used by other parts of this repo to get the AWS ARNs and the bucket name.&lt;/p&gt;

&lt;p&gt;A brief explanation of the roles is as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Name on IAM&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SageMaker Endpoint Access&lt;/td&gt;
&lt;td&gt;&lt;code&gt;lambda-sagemaker-access&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;This role basically allows the lambda to execute properly (with the &lt;code&gt;AWSLambdaBasicExecutionRole&lt;/code&gt; managed policy) and also allows it to invoke a SageMaker endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SageMaker Full Access&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sagemaker-admin&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;This one has the &lt;code&gt;AmazonSageMakerFullAccess&lt;/code&gt; managed policy attached to it, which allows, &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-create-execution-role" rel="noopener noreferrer"&gt;among other things&lt;/a&gt;, the SageMaker endpoint to access the S3 bucket to load the model, and also to publish its own logs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Deploy on AWS using SageMaker
&lt;/h3&gt;

&lt;p&gt;With the IAM roles and the S3 bucket in place, now it's time to create the SageMaker endpoint itself. For that, we use the python &lt;a href="https://github.com/aws/sagemaker-huggingface-inference-toolkit" rel="noopener noreferrer"&gt;package from AWS&lt;/a&gt; which allows us to use Transformers models with the huggingface SDK.&lt;/p&gt;

&lt;p&gt;However, since we will use a diffusion model rather than a transformer, the default inference code needs to be overridden to use the &lt;a href="https://github.com/huggingface/diffusers" rel="noopener noreferrer"&gt;&lt;code&gt;diffusers&lt;/code&gt;&lt;/a&gt; package (also from huggingface), just like we do in &lt;code&gt;local/code.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/aws/sagemaker-huggingface-inference-toolkit#-user-defined-codemodules" rel="noopener noreferrer"&gt;package readme&lt;/a&gt; has general information about overriding the inference code, and there is also an example in &lt;a href="https://github.com/huggingface/notebooks/blob/main/sagemaker/17_custom_inference_script/sagemaker-notebook.ipynb" rel="noopener noreferrer"&gt;this jupyter notebook&lt;/a&gt;. We do what is necessary via the files inside &lt;code&gt;sagemaker/code&lt;/code&gt;, which contain the inference code following SageMaker requirements and a &lt;code&gt;requirements.txt&lt;/code&gt; with the dependencies that will be installed when the endpoint gets created.&lt;/p&gt;
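&lt;p&gt;A custom inference script using the toolkit's override hooks looks roughly like the sketch below (assuming the &lt;code&gt;model_fn&lt;/code&gt;/&lt;code&gt;predict_fn&lt;/code&gt; hook names from the toolkit readme; the actual files in &lt;code&gt;sagemaker/code&lt;/code&gt; may be structured differently):&lt;/p&gt;

```python
# Sketch of a custom inference script overriding the huggingface toolkit
# defaults to load a diffusers pipeline instead of a transformers model.
def model_fn(model_dir):
    # Runs once at endpoint start-up: load the pipeline from the
    # unpacked model artifact.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_dir, torch_dtype=torch.float16)
    return pipe.to("cuda")

def predict_fn(data, pipe):
    # Runs per request: turn the prompt into a PNG, returned base64-encoded
    # so it survives the JSON response. Payload keys are illustrative.
    import base64
    import io

    prompt = data.get("inputs", "a photo of an astronaut riding a horse on mars")
    image = pipe(prompt).images[0]
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return {"image": base64.b64encode(buffer.getvalue()).decode("utf-8")}
```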

&lt;p&gt;With the inference code ready, it's time to ship the model with the code to the created S3 bucket (as the SageMaker endpoint will access the model via this bucket). For this, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;sagemaker/
bash zip-model.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Keep in mind that this process will send around 4.2GB of data to S3. Just make sure that's a cost you are willing to pay 😉&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With that, we should have what we need in the bucket. Now, let's create the SageMaker endpoint. To manage the endpoint, as stated before, we use the &lt;a href="https://github.com/aws/sagemaker-huggingface-inference-toolkit" rel="noopener noreferrer"&gt;SageMaker toolkit for huggingface&lt;/a&gt;. As such, we have three Python scripts to create, use, and delete the endpoint.&lt;/p&gt;
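&lt;p&gt;Roughly, the creation script boils down to a few SDK calls. Here is a minimal sketch, assuming the &lt;code&gt;sagemaker&lt;/code&gt; SDK; the S3 URI, framework versions, and instance type are assumptions, not the repo's exact values:&lt;/p&gt;

```python
# Hypothetical sketch of sagemaker-create-endpoint.py
def save_endpoint_name(name, path="endpoint-name.txt"):
    # Persist the endpoint name so the use/delete scripts can reference it.
    with open(path, "w") as f:
        f.write(name)


def create_endpoint(model_s3_uri, role_arn):
    # Lazy import: only needed when actually deploying to AWS.
    from sagemaker.huggingface import HuggingFaceModel

    model = HuggingFaceModel(
        model_data=model_s3_uri,      # s3://BUCKET/model.tar.gz (model + code/)
        role=role_arn,                # the SageMaker execution role
        transformers_version="4.17",  # assumed versions; check the repo for the real ones
        pytorch_version="1.10",
        py_version="py38",
    )
    predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")
    save_endpoint_name(predictor.endpoint_name)
    return predictor
```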

&lt;p&gt;So let's go ahead and create the endpoint with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements  &lt;span class="c"&gt;# To install the sagemaker pkg&lt;/span&gt;
python sagemaker-create-endpoint.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this, the endpoint will be created, along with a file &lt;code&gt;endpoint-name.txt&lt;/code&gt; that the other Python scripts use to keep a reference to it. When the endpoint is ready, you will be able to see it in the AWS console, as in the following image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkeqtzpnqpx531am4nx8d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkeqtzpnqpx531am4nx8d.png" alt="AWS console with SageMaker endpoints" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we should be ready to run an inference. To do it, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bash sagemaker-use-endpoint.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
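&lt;p&gt;Under the hood, this amounts to an &lt;code&gt;invoke_endpoint&lt;/code&gt; call against the SageMaker runtime. A hedged sketch (the payload shape and output path are assumptions matching the custom handler described above):&lt;/p&gt;

```python
# Hypothetical sketch of sagemaker-use-endpoint.py
import base64
import json


def build_payload(prompt):
    # Matches the {"inputs": ...} shape expected by the custom predict_fn.
    return json.dumps({"inputs": prompt}).encode("utf-8")


def run_inference(prompt, endpoint_file="endpoint-name.txt"):
    import boto3  # lazy import: only needed when talking to AWS

    with open(endpoint_file) as f:
        endpoint_name = f.read().strip()
    client = boto3.client("sagemaker-runtime")
    resp = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_payload(prompt),
    )
    body = json.loads(resp["Body"].read())
    # Decode the base64-encoded image returned by the endpoint.
    with open("output/image.jpg", "wb") as f:
        f.write(base64.b64decode(body["image"]))
```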



&lt;p&gt;And we should get a cool image like this one in &lt;code&gt;sagemaker/output/image.jpg&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xc6b2h9fkn1lmqdzgv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xc6b2h9fkn1lmqdzgv6.png" alt="Darth vader dancing on top of the millennium falcon" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And it's nice that, with just this amount of code, we now have a deployed ML model server with metrics, logs, health checks, etc. available, as we can see in the following images.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metrics&lt;/th&gt;
&lt;th&gt;Logs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi56y85lyy4onjugn7mpq.png" alt="AWS SageMaker metrics" width="800" height="473"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1vc30uy14i24tedb0b1.png" alt="AWS SageMaker logs" width="800" height="473"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Yes, to get that cool image from Darth Vader, lots of attempts were required 😅&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So far so good: we have a Python script capable of running inference on the Stable Diffusion model, but this script interacts directly with your AWS resources. To achieve public access, it's time to work on the Lambda + API Gateway combo&lt;/p&gt;

&lt;h3&gt;
  
  
  Lambda + API Gateway
&lt;/h3&gt;

&lt;p&gt;For this part we have a couple of options to choose from, like &lt;a href="https://www.serverless.com/framework/docs/providers/aws/events/http-api" rel="noopener noreferrer"&gt;serverless&lt;/a&gt; and &lt;a href="https://github.com/Miserlou/Zappa?ref=thechiefio" rel="noopener noreferrer"&gt;Zappa&lt;/a&gt;, both serverless frameworks that would allow us to deploy a serverless app (fulfilling the Lambda + API Gateway combo).&lt;/p&gt;

&lt;p&gt;However, for this project we will go with &lt;a href="https://aws.github.io/chalice/index.html" rel="noopener noreferrer"&gt;chalice&lt;/a&gt;, a serverless framework for Python designed by AWS itself that resembles the API design of &lt;a href="https://flask.palletsprojects.com/en/2.2.x/" rel="noopener noreferrer"&gt;Flask&lt;/a&gt;. Since I had never heard of it before, I decided to give it a try.&lt;/p&gt;
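&lt;p&gt;To get a feel for that Flask-like API, here is a rough sketch of what such a &lt;code&gt;chalice&lt;/code&gt; app could look like (hypothetical, not the repo's exact &lt;code&gt;app.py&lt;/code&gt;; the route names and handler bodies are assumptions):&lt;/p&gt;

```python
# Hypothetical sketch of lambda/sd-public-endpoint/app.py
from urllib.parse import unquote

try:
    from chalice import Chalice  # available after installing the app requirements
except ImportError:  # keep the sketch importable without chalice installed
    Chalice = None


def normalize_prompt(raw):
    # API Gateway URL-encodes path segments, so spaces arrive as %20.
    return unquote(raw)


def build_app():
    app = Chalice(app_name="sd-public-endpoint")

    @app.route("/")
    def index():
        # Home page, just for testing that the app is up.
        return {"status": "ok"}

    @app.route("/inference/{text}")
    def inference(text):
        prompt = normalize_prompt(text)
        # ...here the handler would call the SageMaker endpoint and return the image...
        return {"prompt": prompt}

    return app
```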

&lt;p&gt;The code for the serverless app is all inside &lt;code&gt;lambda/&lt;/code&gt;. To manage the &lt;code&gt;chalice&lt;/code&gt; app, run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;lambda
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cd &lt;/span&gt;sd-public-endpoint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way, you will have the chalice tool installed and will be able to deploy the app with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
bash deploy.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script will get the proper IAM role and deploy the app.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ You need to install the dependencies of the serverless app locally because they are packaged and sent with the app code when the lambda is published&lt;/p&gt;

&lt;p&gt;⚠️ Just a reminder that API Gateway has a non-configurable 30-second timeout that might impact your inferences depending on the configuration of the model. However, it seems there is a &lt;a href="https://stackoverflow.com/a/71778537/4097211" rel="noopener noreferrer"&gt;workaround for API Gateway&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now you should have a public endpoint to access the model, and you should be able to use the app like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Page&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Image&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/api/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The home page, just for testing&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzhn4yxqer2cc9ylgxel.png" alt="App Index" width="800" height="465"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/api/inference/&amp;lt;text&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;You can type any text replacing &lt;code&gt;&amp;lt;text&amp;gt;&lt;/code&gt; and it will become the input for the model. Go wild 🦄 (and you can use spaces)&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqniymw02bru5djizogo.gif" alt="App manual inference" width="" height=""&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/api/inference/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;And that's where the fun actually begins!! If you leave &lt;code&gt;&amp;lt;text&amp;gt;&lt;/code&gt; empty, a &lt;a href="https://pypi.org/project/essential-generators/" rel="noopener noreferrer"&gt;random sentence generator&lt;/a&gt; will be used to automatically generate the input for the model. Let programs "talk" to each other, and you will be able to see the sentence that was used below the image&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxfk9b3jfh4ghlmddrb4.png" alt="App Random Inference" width="800" height="473"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Clean-up
&lt;/h2&gt;

&lt;p&gt;To clean up everything (and avoid unwanted AWS costs 💸), you can do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete the serverless app&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;lambda/sd-public-endpoint/
chalice delete

&lt;span class="c"&gt;# Delete the SageMaker resources&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../../sagemaker
python sagemaker-delete-endpoint.py
bash delete-model.sh

&lt;span class="nb"&gt;cd&lt;/span&gt; ../terraform/
terraform destroy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you will only need to delete the related CloudWatch Log groups via the console. Those will be &lt;code&gt;/aws/lambda/sd-public-endpoint-dev&lt;/code&gt; and &lt;code&gt;/aws/sagemaker/Endpoints/huggingface-pytorch-inference&lt;/code&gt;&lt;/p&gt;
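&lt;p&gt;If you would rather script that last step than click through the console, the same clean-up can be done with a few lines of &lt;code&gt;boto3&lt;/code&gt; (a sketch, assuming default AWS credentials are configured):&lt;/p&gt;

```python
# Hypothetical helper to remove the leftover CloudWatch log groups
LOG_GROUPS = [
    "/aws/lambda/sd-public-endpoint-dev",
    "/aws/sagemaker/Endpoints/huggingface-pytorch-inference",
]


def delete_log_groups(groups=LOG_GROUPS):
    import boto3  # lazy import: only needed when talking to AWS

    logs = boto3.client("logs")
    for name in groups:
        # delete_log_group raises ResourceNotFoundException if the group is already gone
        logs.delete_log_group(logGroupName=name)
```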

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So here we created our own personal cloud deployment of the &lt;a href="https://www.businessinsider.com/stable-diffusion-stability-ai-1b-funding-round-midjourney-dalle-openai-2022-10?international=true&amp;amp;r=US&amp;amp;IR=T" rel="noopener noreferrer"&gt;popular&lt;/a&gt; Stable Diffusion ML model and played a little with several cloud resources in the process.&lt;/p&gt;

&lt;p&gt;The architecture works fine to play around with, but an &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html" rel="noopener noreferrer"&gt;asynchronous implementation&lt;/a&gt; might become necessary to allow inferences that take more than 30 seconds, and also for batch processing.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>deeplearning</category>
      <category>aws</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>App with self-contained infrastructure on AWS</title>
      <dc:creator>Daniel Kneipp</dc:creator>
      <pubDate>Sun, 02 Oct 2022 20:39:50 +0000</pubDate>
      <link>https://dev.to/aws-builders/app-with-self-contained-infrastructure-on-aws-l22</link>
      <guid>https://dev.to/aws-builders/app-with-self-contained-infrastructure-on-aws-l22</guid>
      <description>&lt;p&gt;Here is an example app with self-contained infrastructure-as-code. With less places to go, the developers has access to not only to the app, but also relevant parts of the infrastructure used to deploy it, allowing him/her to evaluate and change the deployment configuration.&lt;/p&gt;

&lt;p&gt;The repo with the implementation is available at &lt;a href="https://github.com/DanielKneipp/aws-self-infra-app" rel="noopener noreferrer"&gt;https://github.com/DanielKneipp/aws-self-infra-app&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
App with self-contained infrastructure on AWS

&lt;ul&gt;
&lt;li&gt;Context&lt;/li&gt;
&lt;li&gt;Overall Architecture&lt;/li&gt;
&lt;li&gt;Platform Side&lt;/li&gt;
&lt;li&gt;
Giving Github access to AWS

&lt;ul&gt;
&lt;li&gt;How to get thumbprint value&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Giving AWS App Runner Access to Github&lt;/li&gt;

&lt;li&gt;Wrap-up&lt;/li&gt;

&lt;li&gt;Development Side&lt;/li&gt;

&lt;li&gt;Code quality checks&lt;/li&gt;

&lt;li&gt;Sync infrastructure&lt;/li&gt;

&lt;li&gt;

Trigger the app deployment

&lt;ul&gt;
&lt;li&gt;Why API-triggered deployment?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Putting everything together&lt;/li&gt;

&lt;li&gt;Update procedure&lt;/li&gt;

&lt;li&gt;About how to refine the AWS role for Github&lt;/li&gt;

&lt;li&gt;Clean-up&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;More often than not, we see developers in companies struggling to understand how their apps are actually being deployed. Usually they're limited to their local deployments with tools like &lt;code&gt;docker-compose&lt;/code&gt; and testing environments, which can diverge significantly from production environments (to which they sometimes don't even have access).&lt;/p&gt;

&lt;p&gt;At the same time, the platform/infrastructure team responsible for preparing the production environments doesn't have specific knowledge about individual apps, forcing them to apply sensible defaults that might work, but won't be optimized for each app.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This creates the infamous Dev and Ops silos, where those two teams struggle to communicate to achieve production-ready and optimized deployments.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;DevOps practices should help avoid this exact situation. Through automation, here we show how an application deployment can be managed by developers directly from their own app repository.&lt;/p&gt;

&lt;p&gt;This brings the infrastructure and deployment configuration closer to where the know-how about the app is: the development team. This way they can take ownership and control not only of the app itself, but of how it should be deployed, including which cloud resources it needs, in an independent fashion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overall Architecture
&lt;/h2&gt;

&lt;p&gt;The way we're going to achieve that is by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Preparing a base infrastructure that allows automation pipelines to manage resources in the cloud&lt;/li&gt;
&lt;li&gt;Creating a pipeline that allows the management of the necessary cloud resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows us to have a clear separation of duties between teams (the platform team manages cloud governance and the development team manages the workloads), while giving the development team arguably full control of their app (code and infrastructure).&lt;/p&gt;

&lt;p&gt;Translating this strategy to the directories we have in this repo, this is what we have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------------+    +------------------+
| Development   |    | Platform         |
|               |    |                  |
|   +---------+ |    |   +------------+ |
|   | aws     | |    |   | terraform  | |
|   +---------+ |    |   +------------+ |
|               |    |                  |
|   +---------+ |    +------------------+
|   | app     | |
|   +---------+ |
|               |
|   +---------+ |
|   | .github | |
|   +---------+ |
|               |
+---------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The breakdown is:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Directory&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;terraform/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Contains the necessary infrastructure to allow the Github Workflow (our automation tool for this example) to access and manage resources in AWS (our cloud provider). It usually would be in an independent repo managed by the platform team, but we will keep it here for simplicity.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;aws/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Has the Cloudformation stack definition (the cloud resources management tool used by the development team, which can, and in this case will, diverge from the tool used by the platform team). There we can see the cloud resources used by the app (an &lt;a href="https://aws.amazon.com/pt/apprunner/" rel="noopener noreferrer"&gt;AWS App Runner&lt;/a&gt; service), the compute resources required, the listening port, and how it should be built and executed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;app/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The app itself, which is just a python Front-end app that has one static index page and one dynamic page.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.github/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Contains the Github workflow used to sync the code in the &lt;code&gt;main&lt;/code&gt; branch (app and its infrastructure) with the AWS account.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In a nutshell, the way that it works is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Github workflow assumes an AWS role provided by the platform team&lt;/li&gt;
&lt;li&gt;The workflow deploys and keeps the Cloudformation stack with the App Runner service up to date&lt;/li&gt;
&lt;li&gt;And if something changes in the app code, the workflow also triggers the service redeployment via the AWS API.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are more of a visual learner, here is a diagram of how the whole integration is supposed to work, together with the separation of duties:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n8sw14n43kr6qnrpmzc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n8sw14n43kr6qnrpmzc.png" alt="Architecture diagram" width="711" height="672"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So let's jump into the implementation details!&lt;/p&gt;

&lt;h2&gt;
  
  
  Platform Side
&lt;/h2&gt;

&lt;p&gt;As stated before, the platform team's responsibility will be to provide a way for the Github workflow to manage the app's infrastructure. Also, one thing that hasn't been mentioned yet: App Runner needs access to the code repo in order to deploy the app. Both will be addressed by the platform team.&lt;/p&gt;

&lt;h3&gt;
  
  
  Giving Github access to AWS
&lt;/h3&gt;

&lt;p&gt;Following security best practices, we will use an AWS role assumed by the workflow to interact with the cloud resources. This avoids the usage of static credentials, which could be leaked and would require the time-consuming manual task of being rotated regularly.&lt;/p&gt;

&lt;p&gt;To give permissions to the workflow without the use of any kind of static credentials, we will set up on AWS an identity provider that will allow the workflows of this repo (&lt;code&gt;main&lt;/code&gt; branch specifically) to assume a role on AWS. This restriction is defined as a role condition that verifies the value of the &lt;code&gt;token.actions.githubusercontent.com:sub&lt;/code&gt; attribute (&lt;code&gt;terraform/iam-gh.tf&lt;/code&gt; file). Patterns can be used to grant broader permissions to all branches or to all repos of the owner. You can go to the &lt;a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/configuring-openid-connect-in-amazon-web-services" rel="noopener noreferrer"&gt;Github docs&lt;/a&gt; to learn more about the OIDC configurations (and &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp_oidc.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; is also the AWS documentation about the subject)&lt;/p&gt;

&lt;p&gt;With the role defined, it's only a matter of defining a policy that will allow the workflow to do its job and associate it with the role. In this example, we're giving very broad permissions (basically Cloudformation and App Runner administrator permissions) to keep the policy small and simple.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;However, in practice, you would want to restrict this policy to specific actions (maybe deny &lt;code&gt;Delete*&lt;/code&gt; actions) and to resources of a specific app (using pattern matching to only allow the manipulation of a Cloudformation stack that has a specific name)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To put everything into action, just run &lt;code&gt;cd terraform/ &amp;amp;&amp;amp; terraform apply&lt;/code&gt;. 6 resources will be created in total. The one value you will need is the output &lt;code&gt;gh_role_arn&lt;/code&gt;, which is the ARN of the role and will be used by the workflow for authentication. You will be able to see the ARN after applying the code, or you can run &lt;code&gt;terraform output gh_role_arn&lt;/code&gt; after the fact. This value needs to be shared with the development team so they can configure the workflow properly.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to get thumbprint value
&lt;/h4&gt;

&lt;p&gt;One configuration that is worth a bit of explanation is the thumbprint that needs to be specified for the OIDC identity provider. It is basically the fingerprint of Github's certificate: AWS needs it in order to trust Github as an identity provider (given the current certificate used by Github's domain).&lt;/p&gt;

&lt;p&gt;AWS has a &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc_verify-thumbprint.html" rel="noopener noreferrer"&gt;step-by-step guide&lt;/a&gt; on how to get this value. And here is an automated script for the process described there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The issuer can be obtained at https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect&lt;/span&gt;
&lt;span class="nv"&gt;issuer_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'https://token.actions.githubusercontent.com/.well-known/openid-configuration'&lt;/span&gt;
&lt;span class="nv"&gt;servername&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;issuer_url&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.jwks_uri'&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s|^[^/]*//||'&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s|/.*$||'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;cert_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; | openssl s_client &lt;span class="nt"&gt;-servername&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;servername&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-showcerts&lt;/span&gt; &lt;span class="nt"&gt;-connect&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;servername&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;:443 2&amp;gt;/dev/null&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;last_cert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;cert_info&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'/-----BEGIN CERTIFICATE-----/{s=""} {s=s$0"\n"} /-----END CERTIFICATE-----/{cert=s} END {print cert}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;last_cert&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | openssl x509 &lt;span class="nt"&gt;-fingerprint&lt;/span&gt; &lt;span class="nt"&gt;-noout&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"="&lt;/span&gt; &lt;span class="nt"&gt;-f2&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/://g'&lt;/span&gt; | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="s1"&gt;'[:upper:]'&lt;/span&gt; &lt;span class="s1"&gt;'[:lower:]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Giving AWS App Runner Access to Github
&lt;/h3&gt;

&lt;p&gt;So far we have allowed Github to talk to AWS. Now it's time to allow AWS to talk to Github. This is something specific to &lt;a href="https://aws.amazon.com/apprunner/?nc1=h_ls" rel="noopener noreferrer"&gt;App Runner&lt;/a&gt;, which is just one of many ways to deploy a production-ready web application on AWS. With automatic scaling, logging management and ingress configuration with a public domain, we will go with this one for today 😅&lt;/p&gt;

&lt;p&gt;App Runner allows us to deploy services using container images or directly from the code. We will go with the latter in this example to simplify both the resources required on AWS by the development team and the upgrade procedure for the app.&lt;/p&gt;

&lt;p&gt;To allow App Runner to access the code on Github, the "AWS Connector for Github" needs to be installed in your Github account. In order to do that, follow the steps below&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step number&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Image&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Go to App Runner console page and click on "Create an App Runner service"&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyunsl2fdb5d9om1lqiw7.png" alt="Github connection - step 1" width="800" height="553"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Select "Source code repository" and "Add new" to create a connection to a Github account or organization&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vluudra4mwbebgaetx5.png" alt="Github connection - step 2" width="800" height="553"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;A new window will pop up. There you can give the connection a name and select "Install another" to initiate the installation of the AWS app for Github&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ghleu9w8qrrekt55lwj.png" alt="Github connection - step 3" width="800" height="903"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;After logging in to your Github account, review and authorize the connector to perform its duties&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaeb3jbzjyluf4y49059.png" alt="Github connection - step 4" width="800" height="903"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;After authorizing the app, confirm its permissions and confirm the AWS app installation&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifjnmdbrvmr6khobaayb.png" alt="Github connection - step 5" width="800" height="1060"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;The Github connector has been installed and now it can be selected. Select it and press "Next" to close the window&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vysyxlgmh1p56s8mi8p.png" alt="Github connection - step 6" width="800" height="579"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Now go to the Github connections page on the top-left of the previous page&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5pm3wxxv2tqqzj14qepz.png" alt="Github connection - step 7" width="800" height="591"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;There you can see the ARN of the connector, &lt;em&gt;which will also be required by the development team&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbntu5cncnr5ntvkj6ov.png" alt="Github connection - step 8" width="800" height="591"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: this process could be partially automated by using a terraform resource called &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/apprunner_connection" rel="noopener noreferrer"&gt;&lt;code&gt;aws_apprunner_connection&lt;/code&gt;&lt;/a&gt;; however, the connector installation on your Github account would still need to be performed manually (steps 4 and 5)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Wrap-up
&lt;/h3&gt;

&lt;p&gt;After those two actions, three pieces of information need to be shared with the development team:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AWS region name available&lt;/li&gt;
&lt;li&gt;The ARN of the role that will be used by the Github workflow&lt;/li&gt;
&lt;li&gt;The ARN of the Github connector created for App Runner&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Development Side
&lt;/h2&gt;

&lt;p&gt;With the base infrastructure ready to go, now it's time to talk about the automated pipeline with Github Actions. In summary, it has three main functionalities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run code quality checks with &lt;code&gt;pre-commit&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sync infrastructure changes made on &lt;code&gt;aws/app-template.yaml&lt;/code&gt; with the existing cloud resources&lt;/li&gt;
&lt;li&gt;Trigger the app redeployment if changes happened inside &lt;code&gt;app/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The workflow only runs on the &lt;code&gt;main&lt;/code&gt; branch, but you can adapt its logic to whatever process you and your team might be following.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's talk about each individual functionality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code quality checks
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://pre-commit.com/" rel="noopener noreferrer"&gt;&lt;code&gt;pre-commit&lt;/code&gt;&lt;/a&gt; is a really good tool to avoid mistakes before committing (hence, the name). Locally the developer should run &lt;code&gt;pre-commit install&lt;/code&gt; after cloning the repo and installing the &lt;code&gt;pre-commit&lt;/code&gt; tool. After this, automatic checks will be execute right after running a &lt;code&gt;git commit&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In this repo we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some standard checks, ensuring files end with a newline and contain no trailing whitespace. These run on all files of the repo.&lt;/li&gt;
&lt;li&gt;A Python formatter called &lt;code&gt;black&lt;/code&gt;, since our front-end app is written in Python.&lt;/li&gt;
&lt;li&gt;A linter for our AWS Cloudformation stack called &lt;a href="https://github.com/aws-cloudformation/cfn-lint" rel="noopener noreferrer"&gt;&lt;code&gt;cfn-lint&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Security checks for the Cloudformation stack using &lt;a href="https://github.com/stelligent/cfn_nag" rel="noopener noreferrer"&gt;&lt;code&gt;cfn-nag&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;In the future a cost estimation could be made using the &lt;a href="https://awscli.amazonaws.com/v2/documentation/api/latest/reference/cloudformation/estimate-template-cost.html" rel="noopener noreferrer"&gt;aws cli&lt;/a&gt;. However, App Runner is not currently supported by the cost estimator. Potential command for future reference: &lt;code&gt;aws cloudformation estimate-template-cost --template-body file://aws/app-template.yaml --parameters "ParameterKey=RepoUrl,ParameterValue='',UsePreviousValue=true" "ParameterKey=GithubConnArn,ParameterValue='',UsePreviousValue=true"&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Normally those checks are executed on the local machine right before the commit. However, since developers sometimes forget to run &lt;code&gt;pre-commit install&lt;/code&gt;, the checks are also executed in the workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sync infrastructure
&lt;/h3&gt;

&lt;p&gt;This is the moment where the Cloudformation stack gets updated (if it already exists and something changed) or created (if there is no stack with the specified name).&lt;/p&gt;

&lt;p&gt;In order to achieve this, AWS credentials need to be properly configured. Here we use a handy Github action called &lt;a href="https://github.com/aws-actions/configure-aws-credentials" rel="noopener noreferrer"&gt;&lt;code&gt;configure-aws-credentials&lt;/code&gt;&lt;/a&gt;, from AWS itself. You can also read more about the &lt;a href="https://github.com/aws-actions/configure-aws-credentials" rel="noopener noreferrer"&gt;many methods of authentication&lt;/a&gt; available. This step requires the &lt;code&gt;AWS_REGION&lt;/code&gt; and &lt;code&gt;AWS_ROLE_ARN&lt;/code&gt; secrets to be properly configured in the repo, both of which should have been shared by the platform team.&lt;/p&gt;

&lt;p&gt;With the authentication in order, it's time to trigger the Cloudformation creation/update. The &lt;a href="https://github.com/aws-actions/aws-cloudformation-github-deploy" rel="noopener noreferrer"&gt;&lt;code&gt;aws-cloudformation-github-deploy&lt;/code&gt;&lt;/a&gt; action is used for that. Although the action has been archived, it still works just fine, so I'll keep it for now. This action will deploy our &lt;code&gt;aws/app-template.yaml&lt;/code&gt; stack, which has the configuration to build and run the service, as well as its resource usage.&lt;/p&gt;
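&lt;p&gt;For reference, roughly the same deployment can be reproduced from a terminal with the AWS CLI. The stack name, repo URL and connection ARN below are hypothetical placeholders; the parameter names come from &lt;code&gt;aws/app-template.yaml&lt;/code&gt;, and &lt;code&gt;CAPABILITY_IAM&lt;/code&gt; is likely needed because the template creates IAM resources:&lt;/p&gt;

```shell
STACK_NAME="my-app"                 # hypothetical stack name
TEMPLATE="aws/app-template.yaml"    # the template deployed by the workflow
REPO_URL="https://github.com/org/repo"
GITHUB_CONN_ARN="arn:aws:apprunner:eu-west-1:123456789012:connection/github-conn"

# Guarded so this sketch is a no-op without the AWS CLI and credentials
if command -v aws >/dev/null; then
  aws cloudformation deploy \
    --template-file "$TEMPLATE" \
    --stack-name "$STACK_NAME" \
    --capabilities CAPABILITY_IAM \
    --parameter-overrides RepoUrl="$REPO_URL" GithubConnArn="$GITHUB_CONN_ARN"
fi
```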

&lt;blockquote&gt;
&lt;p&gt;⭐️ Notice that this empowers the developers to configure many more things about the deployment (please see the &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-apprunner-service.html" rel="noopener noreferrer"&gt;AWS docs&lt;/a&gt; for other App Runner configurations). Also, since this is a Cloudformation template, given enough permissions, the development team could specify many more cloud resources that the app might need, like dedicated message queues, S3 buckets, databases, etc.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Trigger the app deployment
&lt;/h3&gt;

&lt;p&gt;If the Python code has changed, the AWS CLI is installed to trigger a redeployment of the App Runner service, effectively updating it.&lt;/p&gt;

&lt;p&gt;The script first gets the ARN of the App Runner service from one of the outputs of the Cloudformation stack. With that, it triggers the redeployment.&lt;/p&gt;
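&lt;p&gt;A minimal sketch of those two steps with the AWS CLI (the stack name and the output key are hypothetical and should match your template):&lt;/p&gt;

```shell
STACK_NAME="my-app"            # hypothetical stack name
OUTPUT_KEY="ServiceArn"        # hypothetical output key of the stack
QUERY="Stacks[0].Outputs[?OutputKey=='${OUTPUT_KEY}'].OutputValue"

# Guarded so this sketch is a no-op without the AWS CLI and credentials
if command -v aws >/dev/null; then
  # 1. Read the service ARN from the stack outputs
  SERVICE_ARN=$(aws cloudformation describe-stacks \
    --stack-name "$STACK_NAME" --query "$QUERY" --output text)
  # 2. Trigger the redeployment of the App Runner service
  aws apprunner start-deployment --service-arn "$SERVICE_ARN"
fi
```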

&lt;h4&gt;
  
  
  Why API-triggered deployment?
&lt;/h4&gt;

&lt;p&gt;You might be wondering why the App Runner service hasn't been configured with automatic deployments enabled. &lt;em&gt;Because if we had enabled it, every change on the infrastructure code would yield a failed workflow execution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With automatic deployments enabled, the App Runner service would enter a state of "Operation in progress" (which can take several minutes) on every code change. If the workflow tries to change the App Runner service (part of the infrastructure code) while the service is in that state, Cloudformation receives an error like &lt;code&gt;Resource handler returned message: "Service cannot be updated in the current state: OPERATION_IN_PROGRESS. ..."&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So, in order to avoid that, the app redeployment/update can only happen after an infrastructure change has concluded. This way the infrastructure can be changed without worrying about whether the service is in an "Operation in progress" state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting everything together
&lt;/h2&gt;

&lt;p&gt;Now we have everything in place. Just push this repo to Github (with the proper name, matching the AWS role), and the workflow should be able to create the Cloudformation stack. If everything goes according to plan, the Cloudformation console should look similar to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zoyp2c70hrv0jsmtxaa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zoyp2c70hrv0jsmtxaa.png" alt="AWS Cloudformation" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And going to the "Outputs" tab, you should be able to see the URL of the app. If the app is running properly, you will see:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Page Description&lt;/th&gt;
&lt;th&gt;Image&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Index page&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkzylucuohavui2dkgeh6.png" alt="App home" width="800" height="547"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dynamic page accessed via custom path&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4ex8zw39667oleflo0r.png" alt="App dynamic page" width="800" height="547"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And you can also see general information about the app just deployed in the AWS App Runner console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef8be2l2hdzryztee1sa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef8be2l2hdzryztee1sa.png" alt="AWS App Runner" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There you will be able to see lots of useful information, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs of the deployment/update procedure, as well as the logs of the app itself&lt;/li&gt;
&lt;li&gt;Metrics of app usage like request count (with status code), compute resource usage and request latency, to name a few&lt;/li&gt;
&lt;li&gt;Activity log with the changes applied to the service, allowing later auditing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And since those logs and metrics are integrated with CloudWatch, you can later filter, query, and post-process that telemetry data as you wish. Here is an example, on CloudWatch, of the CPU usage of several different instances of the app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdfe171wfpxk7b6w10ep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdfe171wfpxk7b6w10ep.png" alt="AWS metrics" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And to see the app logs, you can do the following:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Image&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Go to the available Log Groups on CloudWatch. There you will spot log groups associated with the &lt;code&gt;my-app-service&lt;/code&gt; app: &lt;code&gt;application&lt;/code&gt; and &lt;code&gt;service&lt;/code&gt;. &lt;code&gt;service&lt;/code&gt; relates to the build process performed on deployment/update events. We want to access the &lt;code&gt;application&lt;/code&gt; logs&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzcgdt15ybov5m75rf2x.png" alt="AWS logs - step 1" width="800" height="473"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;There you can add a new column to sort the available streams by creation date. This way you will easily get the latest logs&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7wzuqdvlbk9jlygr3x0.png" alt="AWS logs - step 2" width="800" height="473"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Clicking on one of those streams, you should be able to see the logs of the python web server&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4idpc8f2knpipgus9va.png" alt="AWS logs - step 3" width="800" height="473"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: The log groups created by App Runner don't have a retention period. Remember to configure a retention period on your log groups to avoid unnecessary recurrent costs 😉&lt;/p&gt;
&lt;/blockquote&gt;
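&lt;p&gt;Setting a retention period takes a single AWS CLI call per log group. The group name below is a hypothetical example; App Runner names groups as &lt;code&gt;/aws/apprunner/&amp;lt;service-name&amp;gt;/&amp;lt;service-id&amp;gt;/application&lt;/code&gt;:&lt;/p&gt;

```shell
LOG_GROUP="/aws/apprunner/my-app-service/0123456789/application"  # hypothetical
RETENTION_DAYS=30

# Guarded so this sketch is a no-op without the AWS CLI and credentials
if command -v aws >/dev/null; then
  aws logs put-retention-policy \
    --log-group-name "$LOG_GROUP" \
    --retention-in-days "$RETENTION_DAYS"
fi
```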

&lt;p&gt;Because of those features and others that could be enabled (like tracing), App Runner shows itself to be a quick way to deploy reliable and maintainable web apps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Update procedure
&lt;/h3&gt;

&lt;p&gt;And what needs to be done in order to update the running app? Just change the python or html code and &lt;strong&gt;push it to the &lt;code&gt;main&lt;/code&gt; branch&lt;/strong&gt;! The workflow will do the rest. The developer only needs view access to the AWS console. Any write interaction is made either by the workflow or the platform team.&lt;/p&gt;

&lt;h3&gt;
  
  
  About how to refine the AWS role for Github
&lt;/h3&gt;

&lt;p&gt;Here's a quick tip on how to refine the role created for the Github workflow. Starting with a more permissive role, you can see on CloudTrail exactly which actions the role performs during normal operation. Filtering the events by the user name &lt;code&gt;GithubActions&lt;/code&gt;, you can determine each individual event generated by the role, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmcv088ktrti4318t0vl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmcv088ktrti4318t0vl3.png" alt="AWS CloudTrail" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From there you can get the actions that need to be allowed and the resources associated with those actions. This will allow you to specify a well-refined role that follows the &lt;a href="https://en.wikipedia.org/wiki/Principle_of_least_privilege" rel="noopener noreferrer"&gt;least privilege principle&lt;/a&gt;.&lt;/p&gt;
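&lt;p&gt;The same filtering can be scripted with the AWS CLI, assuming the role session shows up under the user name &lt;code&gt;GithubActions&lt;/code&gt;:&lt;/p&gt;

```shell
USER_NAME="GithubActions"   # user name used to filter the CloudTrail events

# Guarded so this sketch is a no-op without the AWS CLI and credentials
if command -v aws >/dev/null; then
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=Username,AttributeValue="$USER_NAME" \
    --max-results 50
fi
```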

&lt;h2&gt;
  
  
  Clean-up
&lt;/h2&gt;

&lt;p&gt;In order to clean up this whole demo, just:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delete the Cloudformation stack created&lt;/li&gt;
&lt;li&gt;Delete the CloudWatch log groups associated with the app&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;cd terraform &amp;amp;&amp;amp; terraform destroy&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
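&lt;p&gt;The first two steps can also be done from a terminal. Stack name and service id below are hypothetical; App Runner creates one &lt;code&gt;application&lt;/code&gt; and one &lt;code&gt;service&lt;/code&gt; log group per service:&lt;/p&gt;

```shell
STACK_NAME="my-app"           # hypothetical stack name
SERVICE_NAME="my-app-service"
SERVICE_ID="0123456789"       # hypothetical App Runner service id

# Guarded so this sketch is a no-op without the AWS CLI and credentials
if command -v aws >/dev/null; then
  # 1. Delete the Cloudformation stack created by the workflow
  aws cloudformation delete-stack --stack-name "$STACK_NAME"
  # 2. Delete the log groups left behind by App Runner
  aws logs delete-log-group \
    --log-group-name "/aws/apprunner/${SERVICE_NAME}/${SERVICE_ID}/application"
  aws logs delete-log-group \
    --log-group-name "/aws/apprunner/${SERVICE_NAME}/${SERVICE_ID}/service"
fi
```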

&lt;p&gt;You can also delete the AWS connector on Github and the Github connection on the App Runner console if you don't intend to use that integration again.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>github</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Monitoring C++ Applications</title>
      <dc:creator>Daniel Kneipp</dc:creator>
      <pubDate>Wed, 31 Aug 2022 00:45:45 +0000</pubDate>
      <link>https://dev.to/danielkneipp/monitoring-c-applications-4152</link>
      <guid>https://dev.to/danielkneipp/monitoring-c-applications-4152</guid>
<description>&lt;p&gt;This document describes in general terms what is expected from a monitoring solution, together with suggestions of tools that could be used for C++ applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://opentelemetry.io/docs/concepts/signals/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; has a broad and generic terminology to define concepts around possible types of telemetry data. Here we will focus on traces, metrics, logs and something that is not covered there: crash reports.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Monitoring C++ Applications

&lt;ul&gt;
&lt;li&gt;Overall Architecture&lt;/li&gt;
&lt;li&gt;
Tools and services available

&lt;ul&gt;
&lt;li&gt;Logs&lt;/li&gt;
&lt;li&gt;Metrics&lt;/li&gt;
&lt;li&gt;Traces&lt;/li&gt;
&lt;li&gt;Crash reports&lt;/li&gt;
&lt;li&gt;Visualizing&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Overall Architecture
&lt;/h2&gt;

&lt;p&gt;Generally speaking, any software application generates the following kinds of data for monitoring purposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt;: text records with metadata and potentially semantic information about the action being performed by the software&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: numeric measurements that can describe useful information (e.g. an execution count or a timer) about an action or event being triggered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traces&lt;/strong&gt;: metadata that can be correlated between applications that interact with each other (used to build distributed profilers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crash reports&lt;/strong&gt;: artifacts generated by the application when an unrecoverable failure happens (usually in the form of memory dumps)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each kind of data usually has a service (or an independent functionality of a service) that supports and stores it for future queries and analyses. And depending on the tool used, the mechanism to obtain the data can vary. However, most services in the market follow a similar approach where the data is pushed, with the exception of logs, which rely on a separate piece of software (called an agent) that is responsible for shipping the logs from a source (e.g. a file in the filesystem) to the destination service.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: this is a simple analysis of just an independent application being monitored. It doesn't cover the possibilities that container orchestration systems like Kubernetes offer&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The general architecture of what was described can be shown by the following diagram:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                                           +-----------------------------------------+
                                                           | Company Infrastructure                  |
                                                           |                                         |
+-----------------------------------+                      |     +-----------------+                 |
| Machine                           |                      |     |                 |                 |
|                                   |             +--------+-----&amp;gt;   Logs Server   |                 |
|                 +-------------+   |             |        |     |                 |                 |
|                 |             |   | Push        |        |     +-----------------+                 |
|         Pull    |  Log agent  +---+-------------+        |                                         |
|        +--------&amp;gt;             |   |                      |     +------------------------------+    |
|        |        +-------------+   |                      |     |                              |    |
|        |                          |               +------+-----&amp;gt;  Distributed Tracing System  |    |
|   +----v--+                       | Push traces   |      |     |                              |    |
|   |       +-----------------------+---------------+      |     +------------------------------+    |
|   |  App  |                       |                      |                                         |
|   |       +-----------------------+---------------+      |     +------------------+                |
|   +-----+-+                       | Push metrics  |      |     |                  |                |
|         |                         |               +------+-----&amp;gt;  Metrics Server  |                |
|         |                         |                      |     |                  |                |
|         |                         |                      |     +------------------+                |
+---------+-------------------------+                      |                                         |
          |                                                |     +----------------------------+      |
          |                                                |     |                            |      |
          +------------------------------------------------+-----&amp;gt;  Crash Reporting System    |      |
                                      Push minidumps       |     |                            |      |
                                                           |     +----------------------------+      |
                                                           |                                         |
                                                           |                                         |
                                                           +-----------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For traces, metrics and crash reports, changes in the code are needed to send the data to the related services. And for logs, as said before, an agent is responsible for that. So the application only needs to send the logs to a file, for example, and a properly configured agent will do the rest.&lt;/p&gt;

&lt;p&gt;With this approach, an application running in a customers environment can publish telemetry data without having to expose its internal network by allowing external incoming traffic. All traffic is outgoing.&lt;/p&gt;

&lt;p&gt;For each component present in the diagram, there are several (and I do mean, &lt;strong&gt;several&lt;/strong&gt; 😅) tools and services available in the market that can help accomplish this monitoring architecture. To keep the discussion brief, here I'll mention just a few for each kind of data we want to track.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools and services available
&lt;/h2&gt;

&lt;p&gt;The focus will be given to the OpenTelemetry standard and to tools in the Grafana ecosystem. This will keep the application compatible with a variety of tools and services in the market, while allowing easy visualization and management of almost all telemetry data (with the exception of crash reports, which will be discussed later).&lt;/p&gt;

&lt;h3&gt;
  
  
  Logs
&lt;/h3&gt;

&lt;p&gt;It's recommended to use a logging formatter to add severity, timestamp, correlation ids, and other metadata, to allow correlating the logs with other telemetry data. It's also good to emit the logs in a structured format (e.g. JSON) to be able to quickly query them afterwards without too many preprocessing rules.&lt;/p&gt;

&lt;p&gt;For that the &lt;a href="https://github.com/open-telemetry/opentelemetry-cpp/blob/main/examples/common/logs_foo_library/foo_library.cc" rel="noopener noreferrer"&gt;OpenTelemetry SDK&lt;/a&gt; could be used.&lt;/p&gt;
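&lt;p&gt;Just to make the idea concrete, a single structured log line could look like this (the field names are illustrative; in practice the application's logging library would produce it):&lt;/p&gt;

```shell
# Sketch of one JSON-formatted log line carrying severity, a timestamp
# and a hypothetical correlation id, ready to be shipped by a log agent
TS=$(date -u +%Y-%m-%dT%H:%M:%SZ)
LINE="{\"ts\":\"$TS\",\"severity\":\"INFO\",\"correlation_id\":\"req-42\",\"msg\":\"export started\"}"
echo "$LINE"
```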

&lt;p&gt;As for the Logs Server, &lt;a href="https://grafana.com/oss/loki/" rel="noopener noreferrer"&gt;Grafana Loki&lt;/a&gt; could be used (also, the &lt;a href="https://www.elastic.co/pt/what-is/elk-stack" rel="noopener noreferrer"&gt;ELK stack&lt;/a&gt; and &lt;a href="https://www.datadoghq.com/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt; are well-known options that support not only logs, but traces and metrics as well).&lt;/p&gt;

&lt;p&gt;As the agent must be compatible with the Logs Server, we need to follow &lt;a href="https://grafana.com/docs/loki/latest/clients/" rel="noopener noreferrer"&gt;Grafana's documentation&lt;/a&gt;, which shows several options to choose from, like Promtail or Fluent Bit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;Following Grafana's ecosystem, &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; is a widely used metrics server, and &lt;a href="https://github.com/open-telemetry/opentelemetry-cpp/tree/main/examples/prometheus" rel="noopener noreferrer"&gt;OpenTelemetry already has a nice example&lt;/a&gt; on how to use it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traces
&lt;/h3&gt;

&lt;p&gt;And here again we can leverage a tool from the Grafana ecosystem and OpenTelemetry. With &lt;a href="https://grafana.com/oss/tempo/" rel="noopener noreferrer"&gt;Tempo&lt;/a&gt; as the Distributed Tracing System, we can use OpenTelemetry (&lt;a href="https://github.com/open-telemetry/opentelemetry-cpp/blob/main/examples/simple/main.cc" rel="noopener noreferrer"&gt;example&lt;/a&gt;) to send traces to it, since &lt;a href="https://grafana.com/docs/tempo/latest/getting-started/#1-instrumentation" rel="noopener noreferrer"&gt;Tempo is compatible&lt;/a&gt; with the OpenTelemetry standard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Crash reports
&lt;/h3&gt;

&lt;p&gt;So, for crash reports it gets more interesting. There is nothing in OpenTelemetry or Grafana dedicated to handling this kind of telemetry data. We are still able to publish &lt;a href="https://opentelemetry.io/docs/reference/specification/error-handling/" rel="noopener noreferrer"&gt;traces with error information&lt;/a&gt; for exceptions that can be handled, but to get information about crashes like &lt;code&gt;segfault&lt;/code&gt;s, another tool must be used.&lt;/p&gt;

&lt;p&gt;One interesting option is &lt;a href="https://sentry.io/for/c-plus-plus/" rel="noopener noreferrer"&gt;Sentry&lt;/a&gt;. It also has integration with &lt;a href="https://docs.sentry.io/platforms/native/guides/qt/" rel="noopener noreferrer"&gt;Qt-based applications&lt;/a&gt;, a framework widely used for C++ GUIs.&lt;/p&gt;

&lt;p&gt;Another one is &lt;a href="https://raygun.com/documentation/language-guides/cpp/crash-reporting/installation/" rel="noopener noreferrer"&gt;Raygun&lt;/a&gt;. Although it doesn't have an SDK itself, it shows how you can integrate your software with &lt;a href="https://github.com/google/breakpad" rel="noopener noreferrer"&gt;Google's Breakpad&lt;/a&gt; and send the crash report via an HTTP request.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Both options have their own GUIs, so you won't access them in the same place you access the rest of the telemetry data&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Visualizing
&lt;/h3&gt;

&lt;p&gt;As a closing point, to visualize all this data (except crash reports), Grafana itself can be used to query and manage it, and to create dashboards, alerts, etc. And with everything properly configured, correlation ids can be used to let the developer grab all kinds of telemetry data related to a user interaction, getting a better idea of how the application is being used and how performant it is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For sure this is not the only way to achieve a good observability level for your application, but it's one with good synergy between the chosen tools and services, and a succinct tech stack.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>cpp</category>
    </item>
    <item>
      <title>Share a GPU between pods on AWS EKS</title>
      <dc:creator>Daniel Kneipp</dc:creator>
      <pubDate>Thu, 04 Nov 2021 22:29:51 +0000</pubDate>
      <link>https://dev.to/danielkneipp/share-a-gpu-between-pods-on-aws-eks-1519</link>
      <guid>https://dev.to/danielkneipp/share-a-gpu-between-pods-on-aws-eks-1519</guid>
<description>&lt;p&gt;In this post we discuss the necessary IaC (Infrastructure as Code) files to provision an EKS cluster capable of sharing a single GPU between multiple pods (code available &lt;a href="https://github.com/DanielKneipp/aws-eks-share-gpu" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;If you have ever tried to use GPU-based instances with AWS ECS, or on EKS using the default &lt;a href="https://github.com/NVIDIA/k8s-device-plugin" rel="noopener noreferrer"&gt;Nvidia plugin&lt;/a&gt;, you know that it's not possible to make tasks/pods share the same GPU on an instance. If you want to add more replicas to your service (for redundancy or load balancing), you need one GPU for each replica.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;And this doesn't seem likely to change in the near future for ECS (see this &lt;a href="https://github.com/aws/containers-roadmap/issues/327#issuecomment-580455803" rel="noopener noreferrer"&gt;feature request&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;GPU-based instances are expensive, and although some Machine Learning frameworks (e.g. Tensorflow) are pre-configured to use the entire GPU by default, that's not always needed. ML services can be configured to make independent inferences per request instead of batch processing, and this may require just a fraction of the 16 GiB of VRAM that comes with some instances.&lt;/p&gt;

&lt;p&gt;Currently, GPU-based instances only publish to ECS/EKS the number of GPUs they have. This means that a task/pod can only request a whole GPU, not a share of GPU resources (as is possible with CPU and RAM). The solution is to make the instance publish the amount of GPU resources (processing cores, memory, etc.) so that a pod can request only a fraction of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;This project (available &lt;a href="https://github.com/DanielKneipp/aws-eks-share-gpu" rel="noopener noreferrer"&gt;here&lt;/a&gt;) uses the k8s device plugin described in &lt;a href="https://aws.amazon.com/blogs/opensource/virtual-gpu-device-plugin-for-inference-workload-in-kubernetes/" rel="noopener noreferrer"&gt;this AWS blog post&lt;/a&gt; to make GPU-based nodes publish the amount of GPU resources they have available. Instead of the amount of VRAM available or some abstract metric, this plugin advertises the number of pods/processes that can be connected to the GPU. This is controlled by what NVIDIA calls the &lt;a href="https://docs.nvidia.com/deploy/mps/index.html" rel="noopener noreferrer"&gt;Multi-Process Service&lt;/a&gt; (MPS).&lt;/p&gt;

&lt;p&gt;MPS manages workloads submitted by different processes to allow them to be scheduled and executed concurrently in a GPU. On Volta and newer architectures we can also limit the &lt;a href="https://docs.nvidia.com/deploy/mps/index.html#topic_5_2_5" rel="noopener noreferrer"&gt;amount of threads&lt;/a&gt; a process can use from the GPU to limit the shareability of resources and ensure some Quality of Service (QoS) level.&lt;/p&gt;
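&lt;p&gt;As a sketch, that thread limit is controlled through an environment variable read by the MPS control daemon (the 25% value below is just an example):&lt;/p&gt;

```shell
# Limit each MPS client to ~25% of the GPU's threads (Volta and newer).
# This is a QoS hint, not a hard partition of the hardware.
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25

# Start the MPS control daemon
# (guarded so this sketch is a no-op without the NVIDIA tooling)
if command -v nvidia-cuda-mps-control >/dev/null; then
  nvidia-cuda-mps-control -d
fi
```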

&lt;h2&gt;
  
  
  How to use it
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/DanielKneipp/aws-eks-share-gpu" rel="noopener noreferrer"&gt;Here&lt;/a&gt; we put it all together to deliver an infrastructure and deployment lifecycle which all can be managed by terraform. Integrally, here is the list of tools needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;terraform&lt;/code&gt;: for infrastructure provisioning and service deployment (including the &lt;code&gt;DaemonSet&lt;/code&gt; for the device plugin and the &lt;code&gt;Deployment&lt;/code&gt; for testing);&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;packer&lt;/code&gt;: to create an instrumented AMI for GPU usage monitoring in CloudWatch&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;asdf&lt;/code&gt;: really handy tool used to install other tools in a version-controlled way&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest will come along in the next steps ;)&lt;/p&gt;

&lt;p&gt;At the end, you should have an infrastructure with the following features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;✔️ EKS cluster with encrypted volumes and secrets using KMS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;✔️ All workers reside on private subnets and access the control plane only from within the VPC (no internet communication)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;✔️ IP whitelist configured for accessing the k8s API from the internet&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;✔️ Instrumented instances with GPU usage monitored in Cloudwatch&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;✔️ Nodes can be access with AWS SSM Session Manager (no &lt;code&gt;ssh&lt;/code&gt; required).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Installing the tooling
&lt;/h3&gt;

&lt;p&gt;The first tool to be installed is &lt;code&gt;asdf&lt;/code&gt;. With it, all the others come easily. &lt;code&gt;asdf&lt;/code&gt; can be installed following &lt;a href="https://asdf-vm.com/guide/getting-started.html" rel="noopener noreferrer"&gt;this guide&lt;/a&gt; from its documentation page. After that, you should be able to run the following commands to install the rest of the tooling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;asdf plugin-add terraform https://github.com/asdf-community/asdf-hashicorp.git
asdf plugin-add pre-commit git@github.com:jonathanmorley/asdf-pre-commit.git
asdf plugin-add tflint https://github.com/skyzyx/asdf-tflint
asdf plugin-add awscli https://github.com/MetricMike/asdf-awscli.git

asdf install
pre-commit install
tflint --init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This project also comes with &lt;code&gt;pre-commit&lt;/code&gt; configured, serving as a reference on how terraform-based projects can be set up to check for syntax and linting errors before a commit is even made (so you don't have to wait for a CI pipeline).&lt;/p&gt;
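&lt;p&gt;A minimal &lt;code&gt;.pre-commit-config.yaml&lt;/code&gt; for a terraform project could look like the sketch below. The hook ids come from the &lt;code&gt;pre-commit-terraform&lt;/code&gt; project; the &lt;code&gt;rev&lt;/code&gt; is a placeholder, so check that repo for a current release:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.50.0  # placeholder; pin to a real release
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_tflint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;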

&lt;h3&gt;
  
  
  Creating the AMI
&lt;/h3&gt;

&lt;p&gt;For details about how the AMI is created and what comes with it, I highly suggest my &lt;a href="https://github.com/DanielKneipp/aws-ami-gpu-monitoring" rel="noopener noreferrer"&gt;other repo&lt;/a&gt;, which explains in detail how the AMI works and what IAM permissions it requires.&lt;/p&gt;

&lt;p&gt;From &lt;a href="https://github.com/DanielKneipp/aws-ami-gpu-monitoring" rel="noopener noreferrer"&gt;that repo&lt;/a&gt;, the only thing changed is the base AMI: in this case, an AMI tailored for accelerated hardware on EKS was used. The list of compatible AMIs for EKS can be found at &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html" rel="noopener noreferrer"&gt;this link&lt;/a&gt;, updated regularly by AWS. Also, the AMI from AWS already comes with the &lt;a href="https://github.com/aws/containers-roadmap/issues/593" rel="noopener noreferrer"&gt;SSM agent&lt;/a&gt; in it, so nothing needs to change regarding that.&lt;/p&gt;

&lt;p&gt;The following commands will create an AMI named &lt;code&gt;packer-gpu-ami-0-1&lt;/code&gt;, which should be picked up automatically by the terraform code of the cluster. &lt;em&gt;All &lt;code&gt;terraform&lt;/code&gt; and &lt;code&gt;packer&lt;/code&gt; commands assume that you have already configured your AWS credentials properly&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd ami/
packer build .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
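&lt;p&gt;For reference, picking an AMI up by name from terraform can be done with an &lt;code&gt;aws_ami&lt;/code&gt; data source along these lines (a hedged sketch; the actual lookup lives in the cluster code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data "aws_ami" "gpu" {
  most_recent = true
  owners      = ["self"]  # only AMIs built in your own account

  filter {
    name   = "name"
    values = ["packer-gpu-ami-0-1"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;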



&lt;h3&gt;
  
  
  About the infrastructure
&lt;/h3&gt;

&lt;p&gt;The cluster and network resources are defined together in the &lt;code&gt;cluster&lt;/code&gt; directory. Here is a small description of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;main.tf&lt;/code&gt;: defines the versions and configuration of the main providers, as well as set values for variables that can be used on other files (e.g. name of the cluster);&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;vpc.tf&lt;/code&gt;: encompasses the network configuration where the EKS cluster will be provisioned. It doesn't contain a subnet for &lt;code&gt;us-east-1e&lt;/code&gt; because, at the time of this writing, there were no &lt;code&gt;g4dn.xlarge&lt;/code&gt; instances available in that availability zone;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;eks.tf&lt;/code&gt;: contains the cluster definition using managed workers. &lt;a href="https://github.com/DanielKneipp/aws-eks-share-gpu/blob/master/cluster/eks.tf#L54" rel="noopener noreferrer"&gt;Here&lt;/a&gt; is also where the &lt;code&gt;node-label&lt;/code&gt; &lt;code&gt;k8s.amazonaws.com/accelerator&lt;/code&gt; is defined, which is important to tell the device plugin where it should be deployed;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;kms.tf&lt;/code&gt;: here we have the definition of the Customer Managed Keys (CMKs), alongside the policies necessary to make them work for the encryption of the cluster nodes' volumes and k8s secrets;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;iam.tf&lt;/code&gt;: has the permissions necessary to make the Session Manager access work and to allow the nodes to publish metrics on CloudWatch regarding CPU, RAM, swap, disk and GPU usage (go &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-instance-profile.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; to learn more about the permissions for Session Manager and &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-iam-roles-for-cloudwatch-agent.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; about the permissions required by the CloudWatch Agent);&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;aws-virtual-gpu-device-plugin.tf&lt;/code&gt;: generated from the &lt;code&gt;yaml&lt;/code&gt; file of the same name obtained from the &lt;a href="https://aws.amazon.com/blogs/opensource/virtual-gpu-device-plugin-for-inference-workload-in-kubernetes/" rel="noopener noreferrer"&gt;AWS blog post&lt;/a&gt;. Some modifications needed to be made in order to make this &lt;code&gt;DaemonSet&lt;/code&gt; work. Here they are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The image &lt;code&gt;nvidia/cuda:latest&lt;/code&gt; doesn't exist anymore as the tag &lt;code&gt;latest&lt;/code&gt; is now deprecated (&lt;a href="https://hub.docker.com/r/nvidia/cuda/" rel="noopener noreferrer"&gt;source&lt;/a&gt;). Because of that, the image &lt;code&gt;nvidia/cuda:11.4.2-base-ubuntu20.04&lt;/code&gt; is being used instead.&lt;/li&gt;
&lt;li&gt;The number of &lt;code&gt;vgpu&lt;/code&gt; configured for the container &lt;code&gt;aws-virtual-gpu-device-plugin-ctr&lt;/code&gt; was modified from its default of &lt;code&gt;16&lt;/code&gt; to &lt;code&gt;42&lt;/code&gt;, since NVIDIA architectures from Volta onward can handle up to &lt;code&gt;48&lt;/code&gt; connections to MPS (&lt;a href="https://docs.nvidia.com/deploy/mps/index.html#topic_3_3_5_1" rel="noopener noreferrer"&gt;source&lt;/a&gt;). This has been done to increase how finely the GPU can be fractioned. Theoretically (&lt;em&gt;not tested&lt;/em&gt;), 42 pods could share the same GPU (as long as they don't exceed the amount of VRAM available). At this point, instance networking limits are more restrictive than GPU shareability.&lt;/li&gt;
&lt;li&gt;Because this &lt;code&gt;vgpu&lt;/code&gt; configuration can have different limits depending on the GPU architecture, the plugin was also configured to be deployed only on &lt;code&gt;g4dn.xlarge&lt;/code&gt; instances (see how &lt;a href="https://github.com/DanielKneipp/aws-eks-share-gpu/blob/master/cluster/aws-virtual-gpu-device-plugin.tf#L104" rel="noopener noreferrer"&gt;here&lt;/a&gt;), which use a newer architecture (Turing) and are what this demo was tested on.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
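&lt;p&gt;As a hedged sketch of how that node label fits into the node group definition (attribute names depend on the version of the terraform-aws-eks module in use, and the label value here is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node_groups = {
  gpu = {
    instance_type = "g4dn.xlarge"

    # Label the GPU nodes so the device plugin's nodeSelector matches them
    k8s_labels = {
      "k8s.amazonaws.com/accelerator" = "vgpu"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;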

&lt;p&gt;&lt;strong&gt;Pro tip&lt;/strong&gt;: If you want to convert k8s &lt;code&gt;yaml&lt;/code&gt; files to &lt;code&gt;.tf&lt;/code&gt;, you can use &lt;code&gt;k2tf&lt;/code&gt; (&lt;a href="https://github.com/sl1pm4t/k2tf" rel="noopener noreferrer"&gt;repo&lt;/a&gt;), which is able to convert the resource types of the &lt;code&gt;yaml&lt;/code&gt; to their appropriate counterparts in the k8s provider for terraform. To install it, just:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://github.com/sl1pm4t/k2tf/releases/download/v0.6.3/k2tf_0.6.3_Linux_x86_64.tar.gz
tar zxvf k2tf_0.6.3_Linux_x86_64.tar.gz k2tf
sudo mv k2tf /usr/local/bin/
rm k2tf_0.6.3_Linux_x86_64.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, you should be able to convert a &lt;code&gt;yaml&lt;/code&gt; manifest with a simple command like &lt;code&gt;cat file.yaml | k2tf &amp;gt; file.tf&lt;/code&gt;. This has been done for &lt;code&gt;cluster/aws-virtual-gpu-device-plugin.yaml&lt;/code&gt; and &lt;code&gt;app/app.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Provisioning the infrastructure
&lt;/h3&gt;

&lt;p&gt;To provision all of this, the following command should be sufficient:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd cluster/
terraform init
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;apply&lt;/code&gt; should show &lt;code&gt;Plan: 59 to add, 0 to change, 0 to destroy.&lt;/code&gt;. If that's the case, hit &lt;code&gt;yes&lt;/code&gt; and go grab a cup of coffee, as this can take tens of minutes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;After the resources are provisioned, you might want to run &lt;code&gt;terraform apply -refresh-only&lt;/code&gt; to refresh your local state, as the creation of some resources changes the state of others within AWS. Also, state differences on &lt;code&gt;metadata.resource_version&lt;/code&gt; of k8s resources almost always show up after an &lt;code&gt;apply&lt;/code&gt;. This seems to be related to &lt;a href="https://github.com/hashicorp/terraform-provider-kubernetes/issues/1087" rel="noopener noreferrer"&gt;this issue&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now you should see an EKS cluster with the following workloads:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh5o6cqbxcb7781z7tjn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh5o6cqbxcb7781z7tjn.png" alt="eks-default-workloads" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  About the app
&lt;/h3&gt;

&lt;p&gt;The app is a &lt;code&gt;Deployment&lt;/code&gt;, also obtained from the &lt;a href="https://aws.amazon.com/blogs/opensource/virtual-gpu-device-plugin-for-inference-workload-in-kubernetes/" rel="noopener noreferrer"&gt;AWS blog post&lt;/a&gt;, that spawns 3 replicas of a resnet model in the cluster. &lt;a href="https://github.com/DanielKneipp/aws-eks-share-gpu/blob/master/app/app.tf#L65" rel="noopener noreferrer"&gt;This line&lt;/a&gt; defines "how much" GPU it needs. Because of this requirement, k8s will not schedule a pod of this deployment onto a node that doesn't have a GPU.&lt;/p&gt;

&lt;p&gt;This deployment is configured to use 20% of the GPU memory (using a tensorflow feature &lt;a href="https://github.com/DanielKneipp/aws-eks-share-gpu/blob/master/app/app.tf#L50" rel="noopener noreferrer"&gt;here&lt;/a&gt;). Based on this VRAM usage, we need to configure how many of the 48 process slots from MPS of an instance we want to reserve. Let's use &lt;code&gt;ceil&lt;/code&gt; to be conservative, so &lt;code&gt;ceil(48 * 0.2) = 10&lt;/code&gt;. With this, we should be able to schedule even 4 replicas on the same instance.&lt;/p&gt;
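&lt;p&gt;The slot calculation above can be reproduced with integer arithmetic in plain shell (a small illustrative snippet, not part of the repo):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Slots to request per replica, given the MPS slot count advertised by the
# device plugin and the fraction of VRAM each replica uses (20%).
limit=48
percent=20

# Integer ceiling of limit * percent / 100
slots=$(( (limit * percent + 99) / 100 ))
echo "vgpu per replica: ${slots}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;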

&lt;h3&gt;
  
  
  Deploying the app
&lt;/h3&gt;

&lt;p&gt;Since we're using the same tool for infrastructure management and app deployment, we can now leverage this by following the exact same procedure to deploy the app.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd app/
terraform init
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now you should see the resnet workload deployed like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvvr5ukoqo38bzz4u5wn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvvr5ukoqo38bzz4u5wn.png" alt="eks-resnet-workloads" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, we can see on CloudWatch the amount of VRAM used on that instance to confirm that more than one replica&lt;br&gt;
is actually allocating resources there. To learn more about the new metrics available in CloudWatch published by instances using this custom AMI, please go &lt;a href="https://github.com/DanielKneipp/aws-ami-gpu-monitoring" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcmbn4r0jvgzoxzcdlbs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcmbn4r0jvgzoxzcdlbs.png" alt="cw-vram-usage" width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, what if we scale the deployment to &lt;code&gt;4&lt;/code&gt; replicas? Go to &lt;a href="https://github.com/DanielKneipp/aws-eks-share-gpu/blob/master/app/app.tf#L19" rel="noopener noreferrer"&gt;this line&lt;/a&gt;, change the number of replicas from &lt;code&gt;3&lt;/code&gt; to &lt;code&gt;4&lt;/code&gt;, and run another &lt;code&gt;tf apply&lt;/code&gt;. After some time (~3-5 minutes) you should see the VRAM usage of that instance increase a bit more, like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feeaiev2wjwbvouzk75wj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feeaiev2wjwbvouzk75wj.png" alt="cw-vram-usage-after-scale" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Clean-up
&lt;/h3&gt;

&lt;p&gt;Leveraging again the fact that we interact mostly with terraform, cleaning everything up should be as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd app/
tf destroy

cd ../cluster/
tf destroy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The order matters because you can't delete the EKS cluster before removing the resources allocated in it; otherwise, you will get error messages from the AWS API about &lt;em&gt;resources&lt;/em&gt; still being in use.&lt;/p&gt;

&lt;p&gt;Also, don't forget to follow the clean-up procedure of the &lt;a href="https://github.com/DanielKneipp/aws-ami-gpu-monitoring" rel="noopener noreferrer"&gt;AMI repo&lt;/a&gt; to delete the created AMI and avoid EBS costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Todo next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Implement/test autoscaling features making a load test to resnet&lt;/li&gt;
&lt;li&gt;[ ] Enable and use &lt;a href="https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/" rel="noopener noreferrer"&gt;IRSA&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[ ] Add &lt;a href="https://www.infracost.io/" rel="noopener noreferrer"&gt;Infracost&lt;/a&gt; on pre-commit config&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;Here we've implemented a complete infrastructure for an EKS cluster with shared GPU-based instances.&lt;/p&gt;

&lt;p&gt;Please, feel free to reach out to me on my &lt;a href="https://github.com/DanielKneipp" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/daniel-kneipp/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; accounts with suggestions or questions. ✌️&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>machinelearning</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Instrumenting AMIs for GPU monitoring on CloudWatch</title>
      <dc:creator>Daniel Kneipp</dc:creator>
      <pubDate>Sun, 01 Aug 2021 21:12:30 +0000</pubDate>
      <link>https://dev.to/aws-builders/instrumenting-amis-for-gpu-monitoring-on-cloudwatch-105m</link>
      <guid>https://dev.to/aws-builders/instrumenting-amis-for-gpu-monitoring-on-cloudwatch-105m</guid>
<description>&lt;p&gt;If you have used provisioned instances on AWS before, you know that the default monitored metrics are kind of limited. You only have access to CPU utilization, network transfer rates, and disk reads/writes. By default, you don't have monitoring of some basic information, like RAM and filesystem usage (which can be valuable information to prevent an instance malfunction due to lack of resources).&lt;/p&gt;

&lt;p&gt;In the case of GPU-accelerated applications (like Machine Learning apps), this problem goes even further, since you also don't have access to GPU metrics, which are critical to guarantee the reliability of the system (e.g., total GPU memory consumption can crash any application running on the GPU).&lt;/p&gt;

&lt;p&gt;I've created a project (available &lt;a href="https://github.com/DanielKneipp/aws-ami-gpu-monitoring" rel="noopener noreferrer"&gt;here&lt;/a&gt;) showing how we can create an AMI with CloudWatch agent for RAM and filesystem monitoring, and a custom service called &lt;code&gt;gpumon&lt;/code&gt; to gather GPU metrics and send them to AWS CloudWatch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project structure
&lt;/h2&gt;

&lt;p&gt;In the project we have two main directories like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── packer  ==&amp;gt; AMI creation
└── tf      ==&amp;gt; AMI usage example
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first one contains all the files necessary to create the AMI based on Amazon Linux 2 using a tool called &lt;code&gt;packer&lt;/code&gt;. The second one has infrastructure as code in &lt;code&gt;terraform&lt;/code&gt; to provision an instance using the newly created AMI for testing purposes.&lt;/p&gt;

&lt;h2&gt;
  
  
  AMI creation
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;packer&lt;/code&gt; is a great tool for applying Infrastructure as Code principles to the AMI creation step. It can provision an instance with the specified base AMI, run scripts through ssh, start the AMI creation process, and clean everything up (e.g. instance, EBS volume, ssh key pair) afterwards.&lt;/p&gt;

&lt;p&gt;The file &lt;code&gt;packer/gpu.pkr.hcl&lt;/code&gt; contains the specification of the AMI. There we can find the base AMI, the instance used to create the AMI, the storage configuration, and the scripts used to configure the instance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Base AMI
&lt;/h3&gt;

&lt;p&gt;In order to make my life a bit easier, I looked for AMIs that already have NVIDIA drivers installed, so that I don't have to install them myself. Looking through the &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html#preinstalled-nvidia-driver" rel="noopener noreferrer"&gt;AWS documentation about installing NVIDIA drivers&lt;/a&gt;, we can see that the marketplace already offers AMIs with pre-shipped NVIDIA drivers. Among the options, we're going to use &lt;a href="https://aws.amazon.com/marketplace/pp/prodview-64e4rx3h733ru?qid=1627738530182&amp;amp;sr=0-3&amp;amp;ref_=srh_res_product_title" rel="noopener noreferrer"&gt;Amazon Linux 2&lt;/a&gt;, because it already comes with the AWS Systems Manager agent, which we will use later on.&lt;/p&gt;

&lt;p&gt;A couple of notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You don't need to subscribe to the marketplace product in order to have access to the AMI currently selected. However, you will need to subscribe to have access to the AMI id of new releases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You will &lt;strong&gt;need a GPU-based instance&lt;/strong&gt; to build the AMI (as it's required by the marketplace product specifications). I've tested this project in a new AWS account and it seems that the default limits don't allow the provisioning of GPU-based instances (G family). &lt;code&gt;packer&lt;/code&gt; will show an error if that's your case as well. If it is, you can request a limit increase &lt;a href="http://aws.amazon.com/contact-us/ec2-request" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CloudWatch Agent
&lt;/h3&gt;

&lt;p&gt;The first addon that we're going to make to the base AMI is to install and configure the AWS CloudWatch Agent.&lt;/p&gt;

&lt;p&gt;The process of installation of the agent is well documented by AWS and you can see more details and methods of installation in other Linux distributions &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-commandline-fleet.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The agent configuration is done via a &lt;code&gt;.json&lt;/code&gt; file that the agent reads to know which metrics to monitor and how to publish them on CloudWatch. You can read more about it on the &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html" rel="noopener noreferrer"&gt;documentation page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The process is automated by the script &lt;code&gt;packer/scripts/install-cloudwatch-agent.sh&lt;/code&gt;. It installs the agent and configures it with some relevant metrics like filesystem, RAM, and swap usage.&lt;/p&gt;
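&lt;p&gt;For illustration, a minimal agent configuration covering those metrics could look like the sketch below (the field names follow the agent's documented schema, but the actual file shipped in the repo may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "metrics": {
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "swap": { "measurement": ["swap_used_percent"] },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["*"]
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;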

&lt;blockquote&gt;
&lt;p&gt;Note that the agent is configured to publish metrics with a period of 60 seconds. This can incur costs since it's considered a detailed metric (go to the &lt;a href="https://aws.amazon.com/cloudwatch/pricing/" rel="noopener noreferrer"&gt;CloudWatch pricing page&lt;/a&gt; to know more).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Gathering the GPU metrics
&lt;/h3&gt;

&lt;p&gt;AWS already have &lt;a href="https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-gpu-monitoring.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; talking about ways to monitor GPU usage. There is a &lt;a href="https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-gpu-monitoring-gpumon.html" rel="noopener noreferrer"&gt;brief description&lt;/a&gt; about a tool called &lt;code&gt;gpumon&lt;/code&gt; and also a more extended &lt;a href="https://aws.amazon.com/blogs/machine-learning/monitoring-gpu-utilization-with-amazon-cloudwatch/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; about it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;gpumon&lt;/code&gt; is a (kind of old) &lt;a href="https://s3.amazonaws.com/aws-bigdata-blog/artifacts/GPUMonitoring/gpumon.py" rel="noopener noreferrer"&gt;python script&lt;/a&gt; developed by AWS that makes use of an NVIDIA library called NVML (NVIDIA Management Library) to gather metrics from the GPUs of the instance and publish them on CloudWatch. In this project the script was turned into a &lt;code&gt;systemd&lt;/code&gt; unit. The script itself was also modified to make the error handling more readable and to capture memory usage correctly.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;gpumon&lt;/code&gt; service resides in &lt;code&gt;packer/addons/gpumon&lt;/code&gt;, and the &lt;code&gt;install-cloudwatch-gpumon.sh&lt;/code&gt; automates the installation process. The service is configured to start the python script at boot and restart it if it stops working for some reason. Since &lt;code&gt;systemd&lt;/code&gt; manages the service, its logs can be seen with &lt;code&gt;journalctl --unit gpumon&lt;/code&gt;.&lt;/p&gt;
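&lt;p&gt;A unit file for such a service could look like the hedged sketch below (the paths are illustrative; the actual unit lives in &lt;code&gt;packer/addons/gpumon&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Unit]
Description=GPU metrics publisher for CloudWatch (gpumon)
After=network-online.target

[Service]
ExecStart=/usr/bin/python /opt/gpumon/gpumon.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;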

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: the python script has only been tested on python2, which &lt;a href="https://www.python.org/doc/sunset-python-2/" rel="noopener noreferrer"&gt;is deprecated&lt;/a&gt;. &lt;code&gt;pip&lt;/code&gt; warns about that on the installation process while you create the AMI. You should keep that in mind if you intend to use this script for any production workload.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  About the GPU memory usage metric gathering
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://s3.amazonaws.com/aws-bigdata-blog/artifacts/GPUMonitoring/gpumon.py" rel="noopener noreferrer"&gt;original script&lt;/a&gt; gets the GPU memory usage from the &lt;code&gt;nvmlDeviceGetUtilizationRates()&lt;/code&gt; function. I noticed through some tests that this metric was 0 even though I had data loaded into the GPU.&lt;/p&gt;

&lt;p&gt;From the &lt;a href="https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g540824faa6cef45500e0d1dc2f50b321" rel="noopener noreferrer"&gt;NVIDIA documentation&lt;/a&gt; this function actually &lt;a href="https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t" rel="noopener noreferrer"&gt;returns&lt;/a&gt; the amount of memory that is being read/written, which isn't what I wanted. In order to get the amount of GPU memory allocated, &lt;a href="https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g2dfeb1db82aa1de91aa6edf941c85ca8" rel="noopener noreferrer"&gt;&lt;code&gt;nvmlDeviceGetMemoryInfo()&lt;/code&gt;&lt;/a&gt; should be used instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  AMI Usage example
&lt;/h2&gt;

&lt;p&gt;As an example of how to use this AMI, there is also a terraform project that contains the necessary resources to provision an instance and monitor it using the CloudWatch interface.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;tf/main.tf&lt;/code&gt; is the root file containing the reference to the module &lt;code&gt;tf/modules/monitored-gpu&lt;/code&gt;, which encapsulates resources such as the instance and IAM permissions.&lt;/p&gt;

&lt;p&gt;This example doesn't require SSH access to the instance. We will use AWS Systems Manager - Session Manager to access the instance (the base AMI already comes with the SSM agent preinstalled). This method is better because access is logged in AWS, allowing security audits of instance access. Also, there are no credentials or keys stored on any machine to be leaked.&lt;/p&gt;

&lt;p&gt;The required AWS managed permissions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CloudWatchAgentServerPolicy&lt;/code&gt;: allow the instance to publish CloudWatch metrics;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AmazonSSMManagedInstanceCore&lt;/code&gt;: allow instance access through Session Manager.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to run it
&lt;/h2&gt;

&lt;p&gt;All right, let's go to the fun part! To play with this project we first need to install some dependencies (&lt;code&gt;packer&lt;/code&gt; and &lt;code&gt;terraform&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;A really handy tool to install and manage multiple versions of tools is &lt;code&gt;asdf&lt;/code&gt;. It helps you track and use different versions of a variety of tools, with no need to uninstall the versions you may already have. With a few simple commands it installs the versions needed and makes them context-aware (the tooling version changes automatically when you enter a directory that has a &lt;code&gt;.tool-versions&lt;/code&gt; file).&lt;/p&gt;
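&lt;p&gt;For example, a &lt;code&gt;.tool-versions&lt;/code&gt; file is just a plain list of tool/version pairs, one per line (the versions below are illustrative; the project pins its own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform 1.0.3
packer 1.7.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;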

&lt;p&gt;You can go to &lt;a href="https://asdf-vm.com/guide/getting-started.html" rel="noopener noreferrer"&gt;this link&lt;/a&gt; to install &lt;code&gt;asdf&lt;/code&gt;. After that you can simply run the following to have the correct versions of &lt;code&gt;packer&lt;/code&gt; and &lt;code&gt;terraform&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;asdf plugin-add terraform https://github.com/asdf-community/asdf-hashicorp.git
asdf plugin-add packer https://github.com/asdf-community/asdf-hashicorp.git

asdf install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, it's time to build the AMI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd packer
packer init
packer build .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start the process of building the AMI in the &lt;code&gt;us-east-1&lt;/code&gt; region. You can follow the terminal to see what is happening and the logs of the scripts. You can also see the snapshot being taken by accessing the AWS console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubm33n7lcpfyo85zpfi8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubm33n7lcpfyo85zpfi8.png" alt="AMI page" width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And get a progress bar in the "Snapshots" page like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqemiuthrh7qz3qx9qon.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqemiuthrh7qz3qx9qon.png" alt="EBS Snapshot page" width="800" height="114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The snapshot name tag will appear after the AMI has been created.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AMI creation will be completed when you see something like this on your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
==&amp;gt; amazon-ebs.gpu: Terminating the source AWS instance...
==&amp;gt; amazon-ebs.gpu: Cleaning up any extra volumes...
==&amp;gt; amazon-ebs.gpu: No volumes to clean up, skipping
==&amp;gt; amazon-ebs.gpu: Deleting temporary security group...
==&amp;gt; amazon-ebs.gpu: Deleting temporary keypair...
Build 'amazon-ebs.gpu' finished after 9 minutes 38 seconds.

==&amp;gt; Wait completed after 9 minutes 38 seconds

==&amp;gt; Builds finished. The artifacts of successful builds are:
--&amp;gt; amazon-ebs.gpu: AMIs were created:
us-east-1: ami-09a9fd45137e9129e
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ At this point, you should have an AMI ready to be used!!&lt;/p&gt;

&lt;p&gt;Now it's time to test it! Grab the AMI id (&lt;code&gt;ami-09a9fd45137e9129e&lt;/code&gt; in this case) and paste it, replacing the text &lt;code&gt;"&amp;lt;your-ami-id&amp;gt;"&lt;/code&gt; in the &lt;code&gt;tf/main.tf&lt;/code&gt; file. After the modification, the section of the file that specifies the module should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"gpu_vm"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/monitored-gpu"&lt;/span&gt;

  &lt;span class="nx"&gt;ami&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ami-09a9fd45137e9129e"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd tf
terraform init
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;terraform&lt;/code&gt; will ask whether you want to perform the planned actions. If, right before the prompt, it reports that 6 resources will be created, as shown below, type &lt;code&gt;yes&lt;/code&gt; to start the provisioning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
Plan: 6 to add, 0 to change, 0 to destroy.
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a few minutes (roughly 5), go to the &lt;em&gt;All metrics&lt;/em&gt; page in CloudWatch. You should already see two new custom namespaces: &lt;code&gt;CWAgent&lt;/code&gt; and &lt;code&gt;GPU&lt;/code&gt;. This is the newly created instance publishing its metrics while idle.&lt;/p&gt;
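
&lt;p&gt;If you prefer the terminal, you can also confirm that the namespaces exist with the AWS CLI (a quick check, assuming your credentials and region are already configured):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# List the metrics published by the CloudWatch agent
aws cloudwatch list-metrics --namespace CWAgent

# List the custom GPU metrics
aws cloudwatch list-metrics --namespace GPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An empty response means the instance hasn't published anything yet, so give it a few more minutes.&lt;/p&gt;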

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzftyatlcwrfqnj9h6kj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzftyatlcwrfqnj9h6kj5.png" alt="CW main interface" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see more details about RAM and swap, for example, using the &lt;code&gt;CWAgent&lt;/code&gt; namespace, as the next figure shows. With that, you can monitor the boot behavior of the AMI, assess its performance, and verify that it's behaving as expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5azcrr02cca54px3vuku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5azcrr02cca54px3vuku.png" alt="CW metrics" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The swap usage is 0 because no swap is configured in this AMI (you can follow &lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/ec2-memory-swap-file/" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt; to add it). The spike in RAM usage you see is from a test I was running 😅.&lt;/p&gt;

&lt;p&gt;Now, let's use this hardware a bit to see the metrics move. Go to the &lt;em&gt;Instances&lt;/em&gt; tab on the EC2 page, as shown in the next figure. Right-click on the running instance and hit &lt;em&gt;Connect&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvassimnj4wxsmcc472ua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvassimnj4wxsmcc472ua.png" alt="SSM connect 1" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, go to the &lt;em&gt;Session Manager&lt;/em&gt; tab and hit &lt;em&gt;Connect&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjkhgb83dtq0xv0sf775.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjkhgb83dtq0xv0sf775.png" alt="SSM connect 2" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;
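
&lt;p&gt;As an alternative to the console, you can open the same session from your terminal with the AWS CLI (this assumes the Session Manager plugin for the AWS CLI is installed locally; the instance ID below is a placeholder — use the one from your &lt;em&gt;Instances&lt;/em&gt; tab):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Open an interactive shell on the instance via SSM (no SSH key or open port needed)
aws ssm start-session --target i-0123456789abcdef0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;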

&lt;p&gt;You should now have shell access through your browser. Running the commands below will clone and build a utility and then stress-test the GPU for 10 minutes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo -s
yum install -y git

cd ~
git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
make CUDAPATH=/opt/nvidia/cuda

./gpu_burn 600
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can look at CloudWatch to see the impact of the resource usage while &lt;code&gt;gpu-burn&lt;/code&gt; does its thing, as shown in the figure below.&lt;/p&gt;

&lt;p&gt;With these metrics, it's now easy to create alarms that alert you when an anomaly is detected in resource usage, or to build autoscaling capabilities for a cluster based on custom metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkspyq8uaepwzk07q0bo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkspyq8uaepwzk07q0bo.png" alt="GPU stress test" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;
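
&lt;p&gt;As a sketch of the alarm idea, something like the command below would raise an alarm on sustained high GPU utilization. Note that the metric name and the lack of dimensions here are assumptions — check the exact metric names and dimensions under the &lt;code&gt;GPU&lt;/code&gt; namespace in your account before using it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Alarm when average GPU utilization stays above 90% for two 5-minute periods
aws cloudwatch put-metric-alarm \
  --alarm-name gpu-utilization-high \
  --namespace GPU \
  --metric-name utilization_gpu \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 90 \
  --comparison-operator GreaterThanThreshold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In practice you would also add &lt;code&gt;--alarm-actions&lt;/code&gt; with an SNS topic ARN to actually be notified.&lt;/p&gt;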

&lt;h2&gt;
  
  
  Clean up
&lt;/h2&gt;

&lt;p&gt;To finish the party and turn off the lights, just:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;run &lt;code&gt;terraform destroy&lt;/code&gt; from the &lt;code&gt;tf/&lt;/code&gt; directory;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;deregister the AMI;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9bghbgaxjwfucv23fm4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9bghbgaxjwfucv23fm4.png" alt="Deregister AMI" width="753" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;and delete the EBS snapshot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzrd8oudreiofyysl6mp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzrd8oudreiofyysl6mp.png" alt="Delete snapshot" width="689" height="332"&gt;&lt;/a&gt;&lt;/p&gt;
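
&lt;p&gt;The last two steps can also be done from the AWS CLI. The AMI ID is the one from this walkthrough; the snapshot ID below is a placeholder — grab the real one from the AMI details or the &lt;em&gt;Snapshots&lt;/em&gt; page before running this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Deregister the AMI
aws ec2 deregister-image --image-id ami-09a9fd45137e9129e

# Delete the EBS snapshot that backed it
aws ec2 delete-snapshot --snapshot-id snap-0123456789abcdef0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;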

&lt;p&gt;Thank you, guys! Comments and feedback are much appreciated.&lt;/p&gt;

&lt;p&gt;Feel free to reach out to me on &lt;a href="https://www.linkedin.com/in/daniel-kneipp/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://github.com/DanielKneipp" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>machinelearning</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
