Akash Hardia for AWS Community Builders

Posted on • Originally published at Medium

Improve container boot time by lazy loading with SOCI

Managing a big sale event is not simple, and autoscaling on its own is not enough, because scaling out takes time: new containers have to boot on their hosts before they can process requests. That is exactly the problem we are trying to solve today, and it is where a recent offering from AWS, SOCI (Seekable OCI), can help us!

Booting Containers — before SOCI

To speed up container boot time, we first look at the relevant container orchestration steps performed by ECS.

When a new ECS task is provisioned by the AWS ECS Agent on the host, whether due to auto-scaling or scheduling, the snapshotter first has to pull the container image onto the task. Only once the image is fully downloaded & decompressed is the container started.

ECS Agent provisioning a task which pulls image from ECR & starts container — traditional way

Factors affecting container boot-time

After gaining a better understanding, two key points are evident here:

  1. Image size directly impacts the time taken to boot the containers.

  2. Image has to be first downloaded & decompressed before the container can be started — unnecessary blocking.

For point 1, image size is driven by application requirements and can be reduced significantly with optimization techniques such as multi-stage builds. But even with a smaller image, the whole image (which could still be a large one) has to be downloaded every time before a container can start serving requests.

This matters most during a sale, when there’s a sudden spike in traffic & ECS decides to double the containers (200 -> 400). Requests can queue up while the new containers boot, & that’s not an experience we want for our customers. What’s worse, we overprovision our resources as a safety net for this.

Lazy Loading with SOCI - for faster boot-times

So our focus today is on point 2. Instead of waiting for the whole image to download & decompress first, we pull only some of the files in the image from the registry & start the container with those, while the rest are lazy loaded in the background. With this non-blocking approach, the container starts a lot earlier than it otherwise would, as the figure shows:

load the container image asynchronously

Challenges for lazy loading

But wait…

we don’t know which file is present in which layer of the image

& how to pull.. only some files?… 🤔

The latter is taken care of by ECR, since it’s an OCI-compatible registry (OCI, the Open Container Initiative, is an effort to make image format, storage & distribution generic across vendors). The former requires some additional steps in our docker build process.

A Docker Image

A docker image is composed of multiple layers tied together by a manifest file. These layers are nothing but tarballs that store files in different compressed sections (spans). Layers are often shared between different images.

representation of layers for docker image 1 & 2

Solution

To load the files asynchronously, we need a way to know which file is present in which layer & particularly which span/section of that layer tarball.

For this, we create a table-of-contents index file (a zTOC) for each layer, recording every file in the layer & its location (offset) in the tarball. We also create a SOCI index that maps the image’s layers to their respective zTOCs. A zTOC is much like the table of contents at the front of a book, before the chapters begin.

SOCI index & zTOCs

With the SOCI index & the zTOCs, we know the exact location where a particular file can be found inside an image.
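To make the "which file sits at which offset" idea concrete, here is a minimal sketch that prints a table of contents for a plain tarball. It is only an illustration, not how the soci CLI builds a zTOC: it assumes GNU tar (for the `-R` / `--block-number` option), and the names `tar_toc`, `demo_toc`, `a.txt`, `b.txt` and `layer.tar` are made up for the demo.

```shell
#!/bin/sh
# List each member of a tar archive with the 512-byte block at which its
# header starts. Requires GNU tar (-R is --block-number).
tar_toc() {
  LC_ALL=C tar -tRf "$1"
}

# Build a tiny two-file "layer" in a temp dir, print its table of
# contents, then clean up. All file names here are hypothetical.
demo_toc() {
  workdir=$(mktemp -d) || return 1
  printf 'hello' > "$workdir/a.txt"
  printf 'world' > "$workdir/b.txt"
  tar --format=ustar -cf "$workdir/layer.tar" -C "$workdir" a.txt b.txt
  tar_toc "$workdir/layer.tar"
  rm -rf "$workdir"
}

demo_toc
```

Each member's data starts one block (512 bytes) after its header, so a reader that knows the block number can seek straight to a file. A real zTOC goes further: besides file offsets it stores checkpoints of the decompressor state, so a span of a compressed layer can be read without decompressing everything before it.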

Integrate SOCI

Don’t worry, we have a tool that generates the indexes for us. It can be just one more step in your typical docker build workflow, run after the image is built & pushed to ECR. We do two things here:

basic pipeline to deploy container

  1. Create Indices for the image by pulling it first.

  2. Push these generated SOCI artifacts back to ECR.

The script below does both. It downloads the SOCI CLI, pulls the image from ECR, generates the artifacts & finally pushes them back to the image’s original repository. Note that soci doesn’t work with images held in the Docker runtime’s image store, which is why we pull with ctr so that the image is available to containerd.

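# Download the SOCI release and extract only the soci CLI binary
wget https://github.com/awslabs/soci-snapshotter/releases/download/v0.3.0/soci-snapshotter-0.3.0-linux-amd64.tar.gz
sudo tar -C . -xvf soci-snapshotter-0.3.0-linux-amd64.tar.gz soci

# Pull the image into containerd's image store (not Docker's) and confirm it is there
sudo ctr i pull --user xyz:password <image-uri>:latest
sudo ctr i ls
# Build the zTOCs and the SOCI index, then list the generated indexes
sudo ./soci create <image-uri>:latest
sudo ./soci index list
# Push the SOCI artifacts back to the image's repository
sudo ./soci push --user xyz:password <image-uri>:latest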

ECR image with SOCI artifacts (not visible to the user)

SOCI in action (figure below)

ECS Fargate has a special SOCI snapshotter for pulling docker images, which works with OCI-compatible registries like ECR. If it detects a SOCI index when pulling an image from ECR, it immediately pulls in full the layers that have no zTOC (the ones skipped during indexing) & automatically lazy loads the rest of the image.

And the best part: if it doesn’t detect the SOCI artifacts, it falls back to the traditional way of loading the image (download the image completely, then start the container).

Faster booting containers with SOCI lazy loading

To verify that the container was started using the SOCI artifacts, query the task metadata endpoint from inside the container. The Snapshotter value should be set to “soci”:

# -qO- prints the response to stdout instead of saving it to a file named "task"
wget -qO- "${ECS_CONTAINER_METADATA_URI_V4}/task"
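The metadata response is a fairly large JSON document, so it can help to print just the snapshotter values. The sketch below is one way to do that with grep and sed only. The function name `list_snapshotters` is made up, and it assumes each container entry in the `/task` response carries a `Snapshotter` field with a plain string value. It is a text match, not a real JSON parser.

```shell
#!/bin/sh
# Print the Snapshotter value reported for each container in a task
# metadata (v4) JSON document read from stdin. This is a plain grep/sed
# sketch, not a full JSON parser.
list_snapshotters() {
  grep -o '"Snapshotter"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*"\([^"]*\)"$/\1/'
}

# Only query the endpoint when running inside an ECS task, where the
# agent sets ECS_CONTAINER_METADATA_URI_V4.
if [ -n "${ECS_CONTAINER_METADATA_URI_V4:-}" ]; then
  wget -qO- "${ECS_CONTAINER_METADATA_URI_V4}/task" | list_snapshotters
fi
```

A container that was lazy loaded should print soci, while one loaded the traditional way should print overlayfs.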

Improved Scaling

To measure the improvement, we can simply look at the difference between a task’s creation time & its start time. A similar observation can be made at the ALB level, by checking how long a container takes to be considered healthy before & after SOCI.
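The creation-to-start difference can be computed along these lines. This is a sketch under a few assumptions: `CLUSTER` and `TASK_ARN` are hypothetical environment variables you would set yourself, the AWS CLI prints `createdAt` and `startedAt` as ISO 8601 timestamps (the AWS CLI v2 default), and `date` is GNU date, which is what the `-d` option relies on.

```shell
#!/bin/sh
# Whole seconds between two ISO 8601 timestamps (requires GNU date).
seconds_between() {
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo $((end - start))
}

# CLUSTER and TASK_ARN are hypothetical placeholders; the query runs
# only when both are set.
if [ -n "${CLUSTER:-}" ] && [ -n "${TASK_ARN:-}" ]; then
  # Text output puts createdAt and startedAt on one line, tab separated.
  set -- $(aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$TASK_ARN" \
    --query 'tasks[0].[createdAt,startedAt]' --output text)
  echo "created -> started: $(seconds_between "$1" "$2")s"
fi
```

Running this for a handful of tasks before & after enabling SOCI gives a rough comparison. A single task is too noisy to draw conclusions from.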

In my tests, I saw a 30% reduction in boot time, which is a really good number & can mean quicker scaling during sales as well. Your improvement may differ depending on your workload. It also depends on the number of layers you create zTOCs for.

Few Limitations

SOCI is a recent introduction by AWS & is only available on Fargate as of now, so community support is limited. In fact, during my first iteration I hit a roadblock caused by a problem that was later traced to AWS’s end, & support gave an ETA of a month for the fix. Apart from this, there are a few things to keep in mind:

  1. You need an image larger than 250 MB to see a noticeable difference.

  2. The improvement you see is also dependent on the number of layers you create zTOCs for. Yes, that’s configurable.

  3. It doesn’t support ARM architecture for Fargate Spot provider as of now.

  4. It only works for tasks running on Linux platform version 1.4.0.

  5. It doesn’t work with images that use zstd compression.

  6. Service Connect support might be a hit or miss.

  7. Although there’s no direct cost involved with using SOCI, we’re pushing extra artifacts to ECR continuously & for that, we’ll have to pay for the ECR storage cost eventually.
