<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vlad Ionescu</title>
    <description>The latest articles on DEV Community by Vlad Ionescu (@vlaaaaaaad).</description>
    <link>https://dev.to/vlaaaaaaad</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F411056%2F73d233d5-133f-416d-8c5f-732f7be0b523.png</url>
      <title>DEV Community: Vlad Ionescu</title>
      <link>https://dev.to/vlaaaaaaad</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vlaaaaaaad"/>
    <language>en</language>
    <item>
      <title>Scaling containers on AWS in 2022</title>
      <dc:creator>Vlad Ionescu</dc:creator>
      <pubDate>Wed, 13 Apr 2022 00:00:00 +0000</pubDate>
      <link>https://dev.to/aws-heroes/scaling-containers-on-aws-in-2022-3lml</link>
      <guid>https://dev.to/aws-heroes/scaling-containers-on-aws-in-2022-3lml</guid>
      <description>&lt;p&gt;This all started with &lt;a href="https://www.vladionescu.me/posts/scaling-containers-in-aws/" rel="noopener noreferrer"&gt;a blog post back in 2020&lt;/a&gt;, from a tech curiosity: what's the fastest way to scale containers on AWS? Is ECS faster than EKS? What about Fargate? Is there a difference between ECS on Fargate and EKS on Fargate? I &lt;strong&gt;had to know&lt;/strong&gt; this to build better architectures for my clients.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;2021&lt;/strong&gt;, containers got even better, and &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;I was lucky enough to get a preview and present just how fast they got at re:Invent&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;What about &lt;strong&gt;2022&lt;/strong&gt;? What's next in the landscape of scaling containers? Did the previous trends continue? How will containers scale this year? What about Lambda? We now have the answer!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Foverview.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Foverview.min.svg" alt="Hand-drawn-style graph showing how long it takes to scale from 0 to 3500 containers: Lambda instantly spikes to 3000 and then jumps to 3500, ECS on Fargate starts scaling after 30 seconds and reaches close to 3500 around the four and a half minute mark, EKS on Fargate starts scaling after about a minute and reaches close to 3500 around the eight and a half minute mark, EKS on EC2 starts scaling after two and a half minutes and reaches 3500 around the six and a half minute mark, and ECS on EC2 starts scaling after two and a half minutes and reaches 3500 around the ten minute mark"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Hand-drawn-style graph showing how long it takes to scale from 0 to 3500 containers: Lambda instantly spikes to 3000 and then jumps to 3500, ECS on Fargate starts scaling after 30 seconds and reaches close to 3500 around the four and a half minute mark, EKS on Fargate starts scaling after about a minute and reaches close to 3500 around the eight and a half minute mark, EKS on EC2 starts scaling after two and a half minutes and reaches 3500 around the six and a half minute mark, and ECS on EC2 starts scaling after two and a half minutes and reaches 3500 around the ten minute mark&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fargate is now faster than EC2&lt;/li&gt;
&lt;li&gt;ECS on Fargate improved so much and is the perfect example of why offloading engineering effort to AWS is a good idea&lt;/li&gt;
&lt;li&gt;ECS on Fargate using Windows containers is surprisingly fast&lt;/li&gt;
&lt;li&gt;App Runner is on the way to becoming a fantastic service&lt;/li&gt;
&lt;li&gt;Up to a point, EKS on Fargate is faster than EKS on EC2&lt;/li&gt;
&lt;li&gt;EKS on EC2 scales faster when using &lt;a href="https://karpenter.sh" rel="noopener noreferrer"&gt;karpenter&lt;/a&gt; rather than &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;cluster-autoscaler&lt;/a&gt;, even in the worst possible scenario&lt;/li&gt;
&lt;li&gt;EKS on EC2 is a tiny bit faster when using IPv6&lt;/li&gt;
&lt;li&gt;Lambda with increased limits scales &lt;em&gt;ridiculously&lt;/em&gt; fast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beware: &lt;strong&gt;this benchmark is extremely specific and meant to provide a &lt;em&gt;frame of reference&lt;/em&gt;, not completely accurate results&lt;/strong&gt; — the focus here is on making informed architectural decisions, not on squeezing out the most performance!&lt;/p&gt;

&lt;p&gt;That's it! If you want to get more insights or if you want details about how I tested all this, continue reading on &lt;a href="https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/" rel="noopener noreferrer"&gt;https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Friendly warning: the &lt;strong&gt;estimated reading time for this blog post is about 45 minutes&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;Might I suggest getting comfortable? A cup of tea or water goes great with container scaling.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Preparation&lt;/h2&gt;

&lt;p&gt;Before any testing can be done, we have to set up.&lt;/p&gt;

&lt;h3&gt;Limit increases&lt;/h3&gt;

&lt;p&gt;First up, we will reuse the same dedicated &lt;strong&gt;AWS Account&lt;/strong&gt; we used in 2020 and 2021 — my "&lt;em&gt;Container scaling&lt;/em&gt;" account.&lt;/p&gt;

&lt;p&gt;To be able to create non-trivial amounts of resources, we have to &lt;a href="https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html" rel="noopener noreferrer"&gt;increase a couple of &lt;strong&gt;AWS quotas&lt;/strong&gt;&lt;/a&gt;.&lt;br&gt;
I do not want to showcase exotic levels of performance that only the top 1% of the top 1% of AWS customers can achieve. At the same time, we can't look at out-of-the-box performance since AWS accounts have safeguards in place. The goal of this testing is to see what "ordinary" performance levels all of us can get, and for that we need some quota increases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  by default, one can run a maximum of 1 000 &lt;strong&gt;&lt;em&gt;Fargate Concurrent Tasks&lt;/em&gt;&lt;/strong&gt;. We're scaling to more than that, so the limit was increased to 10 000&lt;/li&gt;
&lt;li&gt;  by default, one can run at most 5 EC2 Spot Instances. That is not enough for our testing, and, after chatting with AWS Support, the &lt;strong&gt;&lt;em&gt;EC2 Spot Instances&lt;/em&gt;&lt;/strong&gt; limit was raised to 4 500 vCPUs, which is about 280 EC2 Spot instances&lt;/li&gt;
&lt;li&gt;  by default, EKS does a fantastic job of scaling the Kubernetes Control Plane components (really — I tested this extensively with my customers). That said, our test clusters will be spending a lot of time idle, with zero containers running. We are not benchmarking EKS Control Plane scaling, and I'd rather eliminate this variable. AWS is happy to pre-scale clusters depending on the workload, and they did precisely that after some discussions and validation of my workload: the &lt;strong&gt;Kubernetes Control Plane was pre-scaled&lt;/strong&gt; for all our EKS clusters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it — &lt;strong&gt;these are not performance quota increases, but capacity quota increases&lt;/strong&gt;! To scale a bunch, we need to be able to create a bunch of instances. These are quotas that everybody should be able to get without too much work. Unless explicitly stated, these are all the quota increases I got.&lt;/p&gt;

&lt;h3&gt;Testing setup&lt;/h3&gt;

&lt;p&gt;Based on the previous tests in 2020 and 2021, we will &lt;strong&gt;scale&lt;/strong&gt; up to 3 500 containers, as fast as possible: we'll force scaling by manually changing the desired number of containers from 1 to 3 500.&lt;br&gt;
I think this keeps a good balance: we're not scaling to a small number, but we're not wasting resources scaling to millions of containers. By forcing the scaling, we're eliminating a lot of complexity: there is a ridiculous number of ways to scale containers! We're avoiding some deep rabbit holes: no optimizing CloudWatch or AWS Autoscaling intervals and reaction times, no stressing about the granularity at which the application exposes metrics, no optimizing the app → Prometheus Push Gateway → Prometheus → Metrics Server flow. For an application to actually scale up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  we first must detect that scaling is required (that can happen with a multitude of tools, from CloudWatch Alarms to KEDA to custom application events)&lt;/li&gt;
&lt;li&gt;  we must then decide how much to scale (which can again be complex logic — is it a defined step or is it dynamic?)&lt;/li&gt;
&lt;li&gt;  we then have to actually scale up (how is new capacity added and how does it join the cluster?)&lt;/li&gt;
&lt;li&gt;  and finally we have to gracefully utilize that capacity (how are new instances connected to load balancers, and how do they impact the scaling metrics?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This whole process can be very complex, and it is often application-specific and company-specific. We want results that are relevant to everybody, so we will ignore all this and focus on how quickly we can get new capacity from AWS.&lt;/p&gt;
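
&lt;p&gt;As a rough illustration of what "forcing the scaling" means — setting the desired count directly instead of waiting for metric-driven autoscaling — here is a small Python sketch. The deployment, namespace, and ECS cluster/service names are hypothetical placeholders, and the actual benchmark drives this through its own tooling:&lt;/p&gt;

```python
# Sketch: forcing a scale-up by setting the desired container count directly,
# skipping metric-driven autoscaling entirely. The deployment, namespace, and
# ECS cluster/service names below are hypothetical placeholders.
import subprocess


def kubectl_scale_cmd(deployment, replicas, namespace="default"):
    """Build the kubectl command that forces a Deployment to a given size."""
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "--namespace", namespace,
    ]


def ecs_update_service_args(cluster, service, desired):
    """Arguments for boto3's ecs.update_service() call, the ECS equivalent."""
    return {"cluster": cluster, "service": service, "desiredCount": desired}


if __name__ == "__main__":
    # 1 -> 3500, like the benchmark's forced scale-up
    cmd = kubectl_scale_cmd("web", 3500, namespace="scaling-test")
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # needs a configured cluster
```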

&lt;p&gt;We will run all the tests in AWS' North Virginia &lt;strong&gt;region&lt;/strong&gt; (&lt;code&gt;us-east-1&lt;/code&gt;), as we've seen in previous years that different AWS regions have the same performance levels. No need to also run the tests in South America, Europe, the Middle East, Africa, and Asia Pacific.&lt;/p&gt;

&lt;p&gt;In terms of &lt;strong&gt;networking&lt;/strong&gt;, each test that requires a networking setup will use a dedicated VPC spanning four availability zones (&lt;code&gt;use1-az1&lt;/code&gt;, &lt;code&gt;use1-az4&lt;/code&gt;, &lt;code&gt;use1-az5&lt;/code&gt;, and &lt;code&gt;use1-az6&lt;/code&gt;), and all containers will be created across four private &lt;code&gt;/16&lt;/code&gt; subnets. For increased performance and lower cost, each VPC will use three &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints.html" rel="noopener noreferrer"&gt;VPC Endpoints&lt;/a&gt;: &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html" rel="noopener noreferrer"&gt;S3 Gateway&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html" rel="noopener noreferrer"&gt;ECR API, and ECR Docker API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In terms of &lt;strong&gt;servers&lt;/strong&gt;, for the tests that require servers, we will use the latest and the greatest: AWS' &lt;code&gt;c6g.4xlarge&lt;/code&gt; EC2 instances.&lt;br&gt;
In the previous years, we used &lt;code&gt;c5.4xlarge&lt;/code&gt; instances, but the landscape has evolved since then. Ideally, we'd like to keep the same server size to accurately compare results across 2020, 2021, and 2022.&lt;br&gt;
This year we have two options: &lt;code&gt;c6i.4xlarge&lt;/code&gt;, which are Intel-based servers with 32 GB of memory and 16 vCPUs, or &lt;code&gt;c6g.4xlarge&lt;/code&gt;, which are ARM-based servers using AWS' Graviton 2 processors with 32 GB of memory and 16 vCPUs. AWS also announced the next-generation Graviton 3 processors and &lt;code&gt;c7g&lt;/code&gt; servers, but those are only available in limited preview. Seeing how &lt;em&gt;16 Graviton vCPUs&lt;/em&gt; are both faster and cheaper than &lt;em&gt;16 Intel vCPUs&lt;/em&gt;, &lt;code&gt;c6g.4xlarge&lt;/code&gt; sounds like the best option, so that's what we will use for our EC2 instances. To further optimize our costs, we will use &lt;a href="https://aws.amazon.com/ec2/spot/" rel="noopener noreferrer"&gt;EC2 Spot Instances&lt;/a&gt;, which are up to 90% cheaper than On-Demand EC2 instances. More things have to happen on the AWS side when a Spot Instance is requested, but the scaling impact should not be significant and the cost savings are a big draw.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;operating system&lt;/strong&gt; landscape has evolved too. In previous years we used the default &lt;em&gt;Amazon Linux 2&lt;/em&gt; operating system, but in &lt;a href="https://aws.amazon.com/about-aws/whats-new/2020/08/announcing-general-availability-of-bottlerocket/" rel="noopener noreferrer"&gt;late 2020, Amazon launched an open-source operating system focused on containers: Bottlerocket OS&lt;/a&gt;. Since then, Bottlerocket matured and grew into an awesome operating system! Seeing how Bottlerocket OS is optimized for containers, we'll run &lt;a href="https://github.com/bottlerocket-os/bottlerocket" rel="noopener noreferrer"&gt;Bottlerocket&lt;/a&gt; as the operating system on all our EC2 servers.&lt;/p&gt;

&lt;p&gt;How should we &lt;strong&gt;measure&lt;/strong&gt; how fast containers scale?&lt;br&gt;
In past years, we used &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html" rel="noopener noreferrer"&gt;CloudWatch Container Insights&lt;/a&gt;, but that won't really work this year: the best metrics we can get are minute-level metrics. Last year we had services scale from 1 to 3 500 in a couple of minutes, and with minute-level data we won't get proper insights, will we?&lt;br&gt;
To get the most relevant results, I decided we should move the measurement directly into the container: the application running in the container should record its own start time! That gives us the best possible metric: we will know exactly when each container started.&lt;br&gt;
In past years, we used &lt;a href="https://github.com/poc-hello-world/namer-service" rel="noopener noreferrer"&gt;the poc-hello-world/namer-service application&lt;/a&gt;: a &lt;strong&gt;small web application&lt;/strong&gt; that returns &lt;code&gt;hello world&lt;/code&gt;. For this year, I extended the code a bit based on this idea: as soon as the application starts, it records its start time! It then does the normal web stuff — configuring a couple of web routes using the &lt;a href="https://flask.palletsprojects.com/" rel="noopener noreferrer"&gt;Flask micro-framework&lt;/a&gt;.&lt;br&gt;
Besides the timestamp, the application also records details about the container (name, unique id, and whatever else was easily available) and sends all this data synchronously to &lt;a href="https://www.honeycomb.io" rel="noopener noreferrer"&gt;Honeycomb&lt;/a&gt; and asynchronously to &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html" rel="noopener noreferrer"&gt;CloudWatch Logs&lt;/a&gt; — by using two providers, we are protected against failures or errors. We'll use Honeycomb as a live UI with proper observability that allows us to explore, and CloudWatch Logs as the definitive source of truth.&lt;/p&gt;
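
&lt;p&gt;The measurement idea can be sketched in a few lines of Python: capture a timestamp at import time, before any web-framework setup, and emit it once. This is a simplified stand-in for the real app — the field names are illustrative, and the Honeycomb and CloudWatch Logs shipping is omitted:&lt;/p&gt;

```python
# Simplified stand-in for the test application's measurement logic: record the
# start timestamp at import time, before any web-framework setup. Field names
# are illustrative; shipping to Honeycomb/CloudWatch Logs is omitted.
import json
import os
import time

STARTED_AT = time.time()  # captured as early as possible in the process


def startup_event():
    """Build the event each container emits exactly once, right after starting."""
    return {
        "started_at": STARTED_AT,          # the metric we actually care about
        "hostname": os.uname().nodename,   # container/pod name in practice
        "pid": os.getpid(),
    }


if __name__ == "__main__":
    print(json.dumps(startup_event()))
    # ...then configure the Flask routes and serve "hello world" as usual
```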

&lt;p&gt;Now that we have the application defined, we need to put it in a &lt;strong&gt;container&lt;/strong&gt;! I built the app using GitHub Actions and &lt;a href="https://github.com/docker/build-push-action" rel="noopener noreferrer"&gt;Docker's &lt;code&gt;build-push-action&lt;/code&gt;&lt;/a&gt;, which produced a 380 MB multi-arch (Intel and ARM) container image. The image was then pushed to &lt;a href="https://aws.amazon.com/ecr/" rel="noopener noreferrer"&gt;AWS' Elastic Container Registry (ECR)&lt;/a&gt;, which is where all our tests will download it from.&lt;/p&gt;
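
&lt;p&gt;A multi-arch build-and-push workflow along these lines might look as follows — this is a sketch, not the repository's actual workflow, and the credentials setup and image name are placeholders:&lt;/p&gt;

```yaml
# Sketch of a multi-arch build-and-push workflow; secrets and the image
# name are placeholders, not the benchmark repository's actual pipeline.
name: build-image
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: docker/setup-qemu-action@v1    # emulation for the ARM build
      - uses: docker/setup-buildx-action@v1
      - uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - uses: aws-actions/amazon-ecr-login@v1
        id: ecr
      - uses: docker/build-push-action@v2
        with:
          platforms: linux/amd64,linux/arm64   # Intel and ARM in one image
          push: true
          tags: ${{ steps.ecr.outputs.registry }}/namer-service:latest
```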

&lt;p&gt;Keeping in line with the "&lt;em&gt;default setup&lt;/em&gt;" we are doing, the &lt;strong&gt;container image&lt;/strong&gt; is not at all optimized!&lt;br&gt;
There are many ways to optimize image sizes, many ways to optimize image pulling, and many ways to optimize image use — we would never get to testing if we kept optimizing. Again, this testing is meant to provide a frame of reference, not to showcase the highest levels of performance.&lt;br&gt;
For latency-sensitive workloads, people are optimizing image sizes, reusing layers, using custom schedulers, using base server images (AMIs) that have large layers or even full images already cached, and much more. It's not only that: container runtimes &lt;a href="https://events19.linuxfoundation.org/wp-content/uploads/2017/11/How-Container-Runtime-Matters-in-Kubernetes_-OSS-Kunal-Kushwaha.pdf" rel="noopener noreferrer"&gt;matter too&lt;/a&gt;, and &lt;a href="https://www.scitepress.org/Papers/2020/93404/93404.pdf" rel="noopener noreferrer"&gt;performance can differ between, say, Docker and containerd&lt;/a&gt;. And it's not only image sizes and runtimes, it's also container setup: a container without any storage differs from a container with a 2 TB RAM disk, which differs from a container with an EBS volume attached, which differs from a container with an EFS volume attached.&lt;br&gt;
We are testing to get an idea of how fast containers scale, and a lightweight but not minimal container is good enough. Performance in real life will vary.&lt;/p&gt;

&lt;p&gt;That's all in terms of common setup used by all the tests in our benchmark! With the base defined, let's get into the setup required for each container service and the results we got!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally posted at &lt;a href="https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/" rel="noopener noreferrer"&gt;https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/&lt;/a&gt; and the dev.to version may contain errors and less-than-ideal presentation&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Kubernetes scaling&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt; — the famous container orchestrator.&lt;br&gt;
Kubernetes has its components divided into two sections: the Control Plane and the Worker Plane. The Control Plane is like a brain: it decides where containers run, what happens to them, and it talks to us. The Worker Plane is made up of the servers on which our containers actually run.&lt;/p&gt;

&lt;p&gt;We could run our own &lt;strong&gt;Kubernetes Control Plane&lt;/strong&gt;, with tools like &lt;a href="https://github.com/kubernetes/kops/" rel="noopener noreferrer"&gt;kops&lt;/a&gt;, &lt;a href="https://github.com/kubernetes/kubeadm" rel="noopener noreferrer"&gt;kubeadm&lt;/a&gt;, or the newer &lt;a href="https://github.com/kubernetes-sigs/cluster-api" rel="noopener noreferrer"&gt;cluster-api&lt;/a&gt;, and that would allow us to optimize each component to the extreme. Or, we could let AWS handle that for us through a managed service: &lt;a href="https://aws.amazon.com/eks/" rel="noopener noreferrer"&gt;Amazon Elastic Kubernetes Service, or &lt;strong&gt;EKS&lt;/strong&gt; for short&lt;/a&gt;. By using EKS, we can let AWS stress about scaling and optimizations and we can focus on our applications. Few companies manage their own Kubernetes Control Planes now, and AWS offers enough customization for our use-case, so we'll use EKS!&lt;/p&gt;

&lt;p&gt;For the &lt;strong&gt;Worker Plane&lt;/strong&gt;, we have a lot more options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  self-managed workers. These are EC2 instances: we manage them, we configure them, we update them, we do everything&lt;/li&gt;
&lt;li&gt;  AWS-managed workers. If we want to stress a bit less, we can take advantage of &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html" rel="noopener noreferrer"&gt;EKS Managed Node Groups&lt;/a&gt; where we still have EC2 instances in our AWS account, but AWS handles the lifecycle management of those EC2s&lt;/li&gt;
&lt;li&gt;  serverless workers. If we want the least amount of stress, like not even caring about configuration or the operating system of servers and patches, we can use serverless workers through &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/fargate.html" rel="noopener noreferrer"&gt;EKS on Fargate&lt;/a&gt;: we give AWS a container and we tell them to run it using EKS on Fargate — AWS will run it on a server they manage, update, and control&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;fancy&lt;/em&gt; 5G workers through &lt;a href="https://aws.amazon.com/wavelength/" rel="noopener noreferrer"&gt;AWS Wavelength&lt;/a&gt;. These are also EC2 instances, but hosted by 5G networking providers for super-low latency&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;extended-region&lt;/em&gt; workers through &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/local-zones.html" rel="noopener noreferrer"&gt;AWS Local Zones&lt;/a&gt;. These are still EC2 instances, but hosted in popular cities for lower latency&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;your own&lt;/em&gt; workers running on a big AWS-managed server that you buy from AWS and install in your own datacenter through &lt;a href="https://aws.amazon.com/outposts/" rel="noopener noreferrer"&gt;AWS Outposts&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For our testing, we'll ignore the &lt;em&gt;less common&lt;/em&gt; options, and we'll use both EC2 workers managed through EKS Managed Node Groups and AWS-managed serverless workers through EKS on Fargate.&lt;/p&gt;

&lt;p&gt;To get &lt;strong&gt;visibility&lt;/strong&gt; into what is happening on the clusters, we will install two helper tools on the cluster: &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html" rel="noopener noreferrer"&gt;CloudWatch Container Insights&lt;/a&gt; for metrics and logs, and &lt;a href="https://github.com/AliyunContainerService/kube-eventer" rel="noopener noreferrer"&gt;kube-eventer&lt;/a&gt; for event insights.&lt;/p&gt;

&lt;p&gt;Our &lt;strong&gt;application container&lt;/strong&gt; will use its own AWS IAM Role through &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html" rel="noopener noreferrer"&gt;IAM Roles for Service Accounts&lt;/a&gt;. When using EC2 workers, we will use the underlying node's Security Group and not &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html" rel="noopener noreferrer"&gt;a dedicated security group for the pod&lt;/a&gt; due to the impact that has on scaling — a lot more networking setup has to happen and that slows things down. When using Fargate workers there is no performance impact, so we will configure a &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html" rel="noopener noreferrer"&gt;per-pod security group&lt;/a&gt; in that case.&lt;br&gt;
Containers are not run by themselves — what happens if the container has an error, restarts, or needs an update? — but as part of larger concepts. In the ECS world we have &lt;code&gt;Services&lt;/code&gt; and in the Kubernetes world we have &lt;code&gt;Deployments&lt;/code&gt;, but both do the same work: they manage containers. For example, a &lt;code&gt;Deployment&lt;/code&gt; or a &lt;code&gt;Service&lt;/code&gt; of 30 containers will constantly try to make sure 30 containers are running. If a container has an error, it is restarted. If a container dies, it is replaced by a new container. If the application has to be updated, the &lt;code&gt;Deployment&lt;/code&gt; will handle the complex logic around replacing each of the 30 running containers with new and updated containers, and so on. In 2021, we saw that multiple &lt;code&gt;Deployments&lt;/code&gt; or multiple &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/fargate-profile.html" rel="noopener noreferrer"&gt;Fargate Profiles&lt;/a&gt; have no impact on the scaling speed, so our application's containers will be part of a single &lt;code&gt;Deployment&lt;/code&gt;, using a single Fargate Profile. The Kubernetes &lt;code&gt;Pod&lt;/code&gt; will use 1 vCPU and 2 GB of memory and will have a single container: our test application.&lt;/p&gt;
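
&lt;p&gt;Put together, the single &lt;code&gt;Deployment&lt;/code&gt; could be sketched like this — names, the image URI, and the service account are placeholders, not the benchmark's actual manifest:&lt;/p&gt;

```yaml
# Sketch of the single Deployment used for the test; names, the image URI,
# and the service account are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: namer-service
spec:
  replicas: 1            # manually bumped to 3500 to force the scale-up
  selector:
    matchLabels:
      app: namer-service
  template:
    metadata:
      labels:
        app: namer-service
    spec:
      serviceAccountName: namer-service  # bound to an IAM Role via IRSA
      containers:
        - name: app
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/namer-service:latest
          resources:
            requests:
              cpu: "1"       # 1 vCPU per Pod
              memory: 2Gi    # 2 GB per Pod
            limits:
              cpu: "1"
              memory: 2Gi
```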

&lt;p&gt;There are multiple other configuration options, and if you want to see the full configuration used, you can check out the Terraform infrastructure code in the &lt;a href="https://github.com/vlaaaaaaad/blog-scaling-containers-on-aws-in-2022" rel="noopener noreferrer"&gt;&lt;code&gt;eks-*&lt;/code&gt; folders in &lt;code&gt;vlaaaaaaad/blog-scaling-containers-on-aws-in-2022&lt;/code&gt; on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, with the setup covered, let's get to testing! I ran all the tests between December 2021 and February 2022, using all the latest versions available at the time of testing.&lt;/p&gt;

&lt;p&gt;In terms of yearly evolution, &lt;strong&gt;EKS on EC2&lt;/strong&gt; remained pretty constant — in the graph below there are actually three different lines for three different years, all overlapping:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-ec2-yearly.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-ec2-yearly.min.svg" alt="Hand-drawn-style graph showing yearly evolution for EKS: the 2020, 2021, and 2022 lines are all overlapping each other. Scaling starts after the two minute mark and reaches 3500 around the seven minute mark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing yearly evolution for EKS: the 2020, 2021, and 2022 lines are all overlapping each other. Scaling starts after the two minute mark and reaches 3500 around the seven minute mark&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But that is not the whole story! For 2022, we get a couple of new options when using EC2 workers: an &lt;strong&gt;alternative scaler&lt;/strong&gt; and &lt;strong&gt;IPv6 support&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Until late 2021, the default way of scaling EC2 nodes was &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;cluster-autoscaler&lt;/a&gt; — there were other scalers too, but nothing as popular or as adopted. &lt;strong&gt;Cluster Autoscaler&lt;/strong&gt; is an official Kubernetes project, with support for scaling worker nodes on a bunch of providers — from the classic big 3 clouds all the way to Hetzner or Equinix Metal. Cluster Autoscaler works on homogeneous node groups: we have an AutoScaling Group that has instances of the exact same size and type, and Cluster Autoscaler sets the number of desired instances. If new instances are needed, the number is increased. If instances are unused, servers are cleanly drained and the number is decreased. For our benchmark, we're using a single AutoScaling Group with &lt;code&gt;c6g.4xlarge&lt;/code&gt; instances.&lt;/p&gt;

&lt;p&gt;For folks running Kubernetes on AWS, Cluster Autoscaler is great, but not perfect.&lt;br&gt;
Cluster Autoscaler does its own node management, lifecycle management, and scaling, which means a lot of EC2 AutoScaling Group (ASG) features are disabled. For example, ASGs have support for launching different instance sizes as part of the same group: if one instance with 16 vCPUs is not available, ASGs can figure out that two instances with 8 vCPUs are available, and launch those to satisfy the requested number. That feature and many others (predictive scaling, node refreshes, rebalancing) have to be disabled.&lt;br&gt;
Cluster Autoscaler also knows little about AWS and what happens with EC2 nodes — it configures the desired number of instances, and then waits to see if &lt;code&gt;desired_number == running_number&lt;/code&gt;. In cases of EC2 Spot exhaustion, it takes a while for Cluster Autoscaler to figure out what happened, for example.&lt;br&gt;
By design, and by having to support a multitude of infrastructure providers, Cluster Autoscaler works close to the lowest common denominator: barebones node groups. This approach also forces architectural decisions: by only supporting homogeneous node groups, a bunch of node groups are defined and applications are allocated to one of those node groups. Does your team have an application with custom needs? Too bad: it needs to fit in a pre-existing node group, or it has to justify the creation of a new node group.&lt;/p&gt;

&lt;p&gt;To offer an alternative, &lt;a href="https://karpenter.sh" rel="noopener noreferrer"&gt;&lt;strong&gt;Karpenter&lt;/strong&gt;&lt;/a&gt; was built and it &lt;a href="https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/" rel="noopener noreferrer"&gt;was released in late 2021&lt;/a&gt;. Karpenter, instead of working with AutoScaling Groups that must have servers of the same size (same RAM and CPU), directly calls the EC2 APIs to launch or remove nodes. Karpenter does not use AutoScaling Groups at all — it manages EC2 instances directly! This is a fundamental shift: there is no longer a need to think about server groups! Each application's needs can be considered individually.&lt;br&gt;
Karpenter will look at what containers have to run and at what EC2 instances AWS offers, and make the best decision on what server to launch. Karpenter will then manage the server for its full lifecycle, with full knowledge of what AWS does and what happens with each server. All this detailed information gives Karpenter an advantage in resource-constrained or complex environments — say, not enough EC2 Spot capacity, or hardware requirements like GPUs or specific types of processors: Karpenter knows why an EC2 instance could not be launched and does not have to wait a while to confirm that indeed &lt;code&gt;desired_number == running_number&lt;/code&gt;. That sounds great, and it sounds like it will impact scaling speeds!&lt;/p&gt;

&lt;p&gt;That said, how should we compare Cluster Autoscaler with Karpenter? Given free rein to scale from 1 container to 3 500 containers, Karpenter will choose the best option: launching the biggest instances possible. While that may be a fair way to compare them, the results would not be directly comparable.&lt;br&gt;
I decided to compare Cluster Autoscaler (with one ASG of &lt;code&gt;c6g.4xlarge&lt;/code&gt;) with Karpenter also limited to launching just &lt;code&gt;c6g.4xlarge&lt;/code&gt; instances. This is the worst possible case for Karpenter and the absolute best case for Cluster Autoscaler, but it should give us enough information about how the two compare.&lt;/p&gt;
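
&lt;p&gt;Restricting Karpenter to a single instance type is done through a requirement in its &lt;code&gt;Provisioner&lt;/code&gt;. A sketch, using the &lt;code&gt;v1alpha5&lt;/code&gt; API that was current at the time and placeholder names — not the benchmark's actual configuration:&lt;/p&gt;

```yaml
# Sketch of a Provisioner pinned to a single instance type, matching the
# "worst case for Karpenter" setup; names and values are placeholders, and
# the provider/subnet configuration is omitted for brevity.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: benchmark
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["c6g.4xlarge"]   # the restriction to one instance type
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]          # EC2 Spot, like the rest of the benchmark
  limits:
    resources:
      cpu: "4500"               # mirrors the EC2 Spot vCPU quota
```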

&lt;p&gt;Surprisingly, even in this worst possible case, EKS on EC2 using Karpenter is faster than EKS on EC2 using Cluster Autoscaler:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-ec2-ipv4.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-ec2-ipv4.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. EKS on EC2 with cluster-autoscaler starts around the two and a half minute mark, reaches 3000 containers around the six and a half minute mark, and 3500 containers around the seven minute mark. EKS on EC2 with Karpenter is faster: it starts around the same two and a half minute mark, but reaches 3000 containers a minute earlier, around the five and a half minute mark, and reaches 3500 containers around the same seven minute mark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. EKS on EC2 with cluster-autoscaler starts around the two and a half minute mark, reaches 3000 containers around the six and a half minute mark, and 3500 containers around the seven minute mark. EKS on EC2 with Karpenter is faster: it starts around the same two and a half minute mark, but reaches 3000 containers a minute earlier, around the five and a half minute mark, and reaches 3500 containers around the same seven minute mark&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Another enhancement we got this year is the &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/cni-ipv6.html" rel="noopener noreferrer"&gt;&lt;strong&gt;support for IPv6&lt;/strong&gt;, released in early 2022&lt;/a&gt;. Existing clusters cannot be migrated, which means that for our testing we have to create a new IPv6 EKS cluster in an IPv6 VPC. You can see the full code used in &lt;a href="https://github.com/vlaaaaaaad/blog-scaling-containers-on-aws-in-2022" rel="noopener noreferrer"&gt;the &lt;code&gt;eks-on-ec2-ipv6&lt;/code&gt; folder in the &lt;code&gt;vlaaaaaaad/blog-scaling-containers-on-aws-in-2022&lt;/code&gt; repository on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As AWS pointed out in their &lt;a href="https://aws.amazon.com/blogs/containers/amazon-eks-launches-ipv6-support/" rel="noopener noreferrer"&gt;announcement blog post&lt;/a&gt;, IPv6 reduces the work that the EKS network plugin (&lt;a href="https://github.com/aws/amazon-vpc-cni-k8s" rel="noopener noreferrer"&gt;amazon-vpc-cni-k8s&lt;/a&gt;) has to do, giving us a nice bump in scaling speed, for both Cluster Autoscaler and Karpenter:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-ec2-ipv6.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-ec2-ipv6.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. There are four lines, in two clusters. The first cluster is EKS on EC2 using Karpenter where we can see IPv6 is around 30 seconds faster than IPv4. The second cluster is EKS on EC2 using cluster-autoscaler where IPv6 is again faster than IPv4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. There are four lines, in two clusters. The first cluster is EKS on EC2 using Karpenter where we can see IPv6 is around 30 seconds faster than IPv4. The second cluster is EKS on EC2 using cluster-autoscaler where IPv6 is again faster than IPv4&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I envision people will start migrating towards IPv6 and Karpenter, but that will be a slow migration: both are fundamental changes! Migrating from IPv4 to IPv6 requires a complete networking revamp, with multiple components and integrations affected. Migrating from Cluster Autoscaler to Karpenter is easier, as it can be done gradually and in place (workloads that fit Karpenter-only clusters are rare), but taking full advantage of Karpenter requires deeply understanding what resources applications need: no more just putting applications in groups.&lt;/p&gt;

&lt;p&gt;Now, let's move to &lt;strong&gt;serverless Kubernetes&lt;/strong&gt; workers!&lt;br&gt;
As mentioned above, Fargate is an alternative to EC2s: no more stressing about servers, operating systems, patches, and updates. We only have to care about the container image!&lt;/p&gt;

&lt;p&gt;This year, &lt;strong&gt;EKS on Fargate&lt;/strong&gt; is neck-and-neck with EKS on EC2 in terms of scaling:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-fargate.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-fargate.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. EKS on Fargate starts around the one-minute mark and reaches close to 3500 containers around the eight minute mark. EKS on EC2 with Karpenter and EKS on EC2 with cluster-autoscaler both start around the two minute mark, and reach 3500 containers around the seven minute mark, but EKS on EC2 with Karpenter scales faster initially"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. EKS on Fargate starts around the one-minute mark and reaches close to 3500 containers around the eight minute mark. EKS on EC2 with Karpenter and EKS on EC2 with cluster-autoscaler both start around the two minute mark, and reach 3500 containers around the seven minute mark, but EKS on EC2 with Karpenter scales faster initially&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The graph above does not paint the full picture, though. If we look at the yearly evolution of EKS on Fargate, we see how much AWS improved it, without any effort required from users:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-fargate-yearly.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-fargate-yearly.min.svg" alt="Hand-drawn-style graph showing yearly evolution for EKS on Fargate: in 2020 it took about 55 minutes to reach 3500 containers. In 2021, it takes around 20 minutes, and in 2022 it takes a little over 8 minutes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing yearly evolution for EKS on Fargate: in 2020 it took about 55 minutes to reach 3500 containers. In 2021, it takes around 20 minutes, and in 2022 it takes a little over 8 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Massive improvements! If we built and ran an application using EKS on Fargate in 2020, we would have scaled to 3 500 containers in about an hour. Without any effort or changes, in 2021 the scaling would be done in 20 minutes. Without having to invest any effort or change any lines of code, in 2022 the same application would finish scaling in less than 10 minutes! &lt;strong&gt;IN LESS THAN 2 YEARS, WITHOUT ANY EFFORT, WE WENT FROM 1 HOUR TO 10 MINUTES!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I do see people migrating from EKS on EC2 to EKS on Fargate, but in small numbers. EKS on Fargate has &lt;a href="https://github.com/aws/containers-roadmap/issues/622" rel="noopener noreferrer"&gt;no support for Spot pricing&lt;/a&gt;, which makes Fargate an expensive proposition when compared with &lt;a href="https://aws.amazon.com/ec2/spot/" rel="noopener noreferrer"&gt;EC2 Spot Instances&lt;/a&gt;. EC2 Spot Instances are cheaper by up to 90% (and ECS on Fargate Spot by up to 70%), but they may be interrupted with a 2-minute warning. For a lot of containerized workloads, that is not a problem: a container gets interrupted and is quickly replaced by a new one. No harm done, and way lower bills.&lt;/p&gt;
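&lt;p&gt;As a back-of-the-envelope sketch of why Spot matters so much for the bill, here is the blended hourly cost of a mixed worker plane (all prices, discounts, and fractions below are made up for illustration; real Spot prices vary per pool and over time):&lt;/p&gt;

```python
def blended_hourly_cost(on_demand_price, spot_discount_pct, spot_fraction, instances):
    # Placeholder cost model: a fraction of the fleet runs on Spot at a
    # fixed discount, the rest pays the On-Demand price
    spot_price = on_demand_price * (1 - spot_discount_pct / 100)
    spot_count = instances * spot_fraction
    on_demand_count = instances - spot_count
    return on_demand_count * on_demand_price + spot_count * spot_price

# 100 instances at a fictional $1/h, 80% of them on Spot at a 90% discount
print(round(blended_hourly_cost(1.00, 90, 0.8, 100), 2))  # 28.0
```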

&lt;p&gt;The most common setup I see for Kubernetes on AWS is using a combination of workers. &lt;strong&gt;EKS on Fargate is ideal for long-running and critical components&lt;/strong&gt; like &lt;a href="https://github.com/kubernetes-sigs/aws-load-balancer-controller" rel="noopener noreferrer"&gt;AWS Load Balancer Controller&lt;/a&gt;, &lt;a href="https://karpenter.sh" rel="noopener noreferrer"&gt;Karpenter&lt;/a&gt;, or &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;Cluster Autoscaler&lt;/a&gt;. &lt;strong&gt;EKS on EC2 is ideal for interruption-sensitive workloads&lt;/strong&gt; like stateful applications. What I see most often with my customers is the largest part of the worker plane using &lt;strong&gt;EKS on EC2 Spot, which is good enough for most applications&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Keep in mind that these are &lt;strong&gt;DEFAULT PERFORMANCE RESULTS&lt;/strong&gt;, with an extreme test case, and with manual scaling! Performance levels will differ depending on your applications and what setup you're running.&lt;/p&gt;

&lt;p&gt;One would think the upper limit on performance is EKS on EC2 with all the servers already up and ready to run containers. No servers have to be requested, created, or started; containers just need to start:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-overview.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-overview.min.svg" alt="Hand-drawn-style graph showing the scaling performance from 0 to 3500 containers. All the previous graphs are merged into this graph, in a mess of lines"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling performance from 0 to 3500 containers. All the previous graphs are merged into this graph, in a mess of lines&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That gets close, but even that could be optimized by having the container images already cached on the servers and by tuning the networking stack!&lt;br&gt;
There are always optimizations to be done, and &lt;strong&gt;more performance can always be squeezed out&lt;/strong&gt;. For example, I spent 3 months with a customer tuning, experimenting, and optimizing Cluster Autoscaler for their applications and their specific workload. After all the work was done, my customer's costs decreased by 20% while their end-users saw a 15% speed-up! For them and their scale, it was worth it. For other customers, spending this much time would represent a lot of wasted effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ALWAYS MAKE SURE TO CONSIDER ALL THE TRADE-OFFS WHEN DESIGNING!&lt;/strong&gt; Pre-scaled, pre-warmed, and finely tuned EKS on EC2 servers may be fast, but the associated costs will be huge — both pure AWS costs, but also development time costs, and &lt;em&gt;missed opportunity costs&lt;/em&gt; 💸&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally posted at &lt;a href="https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/" rel="noopener noreferrer"&gt;https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/&lt;/a&gt; and the dev.to version may contain errors and less-than-ideal presentation&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  Elastic Container Service scaling
&lt;/h2&gt;

&lt;p&gt;Amazon's &lt;strong&gt;Elastic Container Service&lt;/strong&gt; — the pole-position orchestrator that needed competition to become awesome.&lt;/p&gt;

&lt;p&gt;Just like Kubernetes, &lt;a href="https://aws.amazon.com/ecs/" rel="noopener noreferrer"&gt;Amazon Elastic Container Service&lt;/a&gt;, or ECS for short, has its components divided into two parts: the Control Plane and the Worker Plane. The Control Plane is like the brain: it decides where containers run, what happens to them, and it talks to us. The Worker Plane is the servers on which our containers actually run.&lt;/p&gt;

&lt;p&gt;For ECS, the &lt;strong&gt;Control Plane&lt;/strong&gt; is proprietary: AWS built it, AWS runs it, AWS manages it, and AWS develops it further. As users, we can create a Control Plane by creating an ECS Cluster. That's it!&lt;/p&gt;

&lt;p&gt;For the &lt;strong&gt;Worker Plane&lt;/strong&gt; we have more options, with the first five options being the same as the options we had when using Kubernetes as the orchestrator:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  self-managed workers. These are EC2 instances: we manage them, we configure them, we update them, we do everything&lt;/li&gt;
&lt;li&gt;  serverless workers. For the least amount of stress — not having to care about servers, operating systems, patches, and all that — we can use serverless workers through &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html" rel="noopener noreferrer"&gt;ECS on Fargate&lt;/a&gt;: we give AWS a container and we tell them to run it — AWS will run it on a server they manage, update, and control&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;fancy&lt;/em&gt; 5G workers through &lt;a href="https://aws.amazon.com/wavelength/" rel="noopener noreferrer"&gt;AWS Wavelength&lt;/a&gt;. These are also EC2 instances, but hosted by 5G networking providers for super-low latency&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;extended-region&lt;/em&gt; workers through &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cluster-regions-zones.html" rel="noopener noreferrer"&gt;AWS Local Zones&lt;/a&gt;. These are still EC2 instances, but hosted in popular cities for lower latency&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;your own&lt;/em&gt; workers running on a big AWS-managed server that you buy from AWS and install in your own datacenter through &lt;a href="https://aws.amazon.com/outposts/" rel="noopener noreferrer"&gt;AWS Outposts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  an &lt;em&gt;extra&lt;/em&gt; option, exclusively for ECS: &lt;em&gt;your own workers on your own hardware&lt;/em&gt; that is connected to AWS through &lt;a href="https://aws.amazon.com/ecs/anywhere/" rel="noopener noreferrer"&gt;ECS Anywhere&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For our testing, we'll focus on the most common options, using both EC2 workers that we manage ourselves and AWS-managed serverless workers through Fargate.&lt;/p&gt;

&lt;p&gt;To get &lt;strong&gt;visibility&lt;/strong&gt; into what is happening on the clusters, we don't have to install anything, we just have to enable &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-ECS-cluster.html" rel="noopener noreferrer"&gt;CloudWatch Container Insights&lt;/a&gt; on the ECS cluster.&lt;/p&gt;

&lt;p&gt;Our &lt;strong&gt;application container&lt;/strong&gt; will use its own dedicated AWS IAM Role and its own dedicated Security Group. Containers are not run by themselves (what happens if a container errors out, restarts, or needs an update?) but as part of larger concepts. In the Kubernetes world we have &lt;code&gt;Deployments&lt;/code&gt; and in the ECS world we have &lt;code&gt;Services&lt;/code&gt;, but both do the same job: they manage containers. For example, a &lt;code&gt;Deployment&lt;/code&gt; or a &lt;code&gt;Service&lt;/code&gt; of 30 containers will always try to make sure 30 containers are running. If a container has an error, it is restarted. If a container dies, it is replaced by a new one. If the 30 containers have to be updated, the &lt;code&gt;Service&lt;/code&gt; handles the complex logic of gradually replacing each of them with new, updated containers, always making sure at least 30 containers are running. In 2021, we saw that ECS scales faster when multiple &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html" rel="noopener noreferrer"&gt;ECS Services&lt;/a&gt; are used, so our application's containers will be part of multiple &lt;code&gt;Services&lt;/code&gt;, all in the same ECS Cluster. The ECS &lt;code&gt;Task&lt;/code&gt; will use 1 vCPU and 2 GB of memory and will have a single container: our test application.&lt;/p&gt;
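&lt;p&gt;The "always keep 30 containers running" behavior boils down to a reconciliation loop. A toy Python version of one pass (real controllers also handle rolling updates, health checks, and placement):&lt;/p&gt;

```python
import itertools

_ids = itertools.count(1)  # toy id generator for replacement containers

def reconcile(desired, containers):
    # One pass of a Deployment/Service-style control loop: drop dead
    # containers, then start replacements until the desired count is met
    alive = [c for c in containers if c["healthy"]]
    while len(alive) < desired:
        alive.append({"id": f"container-{next(_ids)}", "healthy": True})
    return alive

fleet = [{"id": "a", "healthy": True}, {"id": "b", "healthy": False}]
fleet = reconcile(30, fleet)  # the unhealthy container is replaced
print(len(fleet))  # 30
```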

&lt;p&gt;There are multiple other configuration options, and if you want to see the full configuration used, you can check out the Terraform infrastructure code in the &lt;a href="https://github.com/vlaaaaaaad/blog-scaling-containers-on-aws-in-2022" rel="noopener noreferrer"&gt;&lt;code&gt;ecs-*&lt;/code&gt; folders in &lt;code&gt;vlaaaaaaad/blog-scaling-containers-on-aws-in-2022&lt;/code&gt; on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, with the setup covered, let's get to testing! I ran all the tests between December 2021 and April 2022, using all the latest versions available at the time of testing.&lt;/p&gt;

&lt;p&gt;In 2020, I did not test &lt;strong&gt;ECS on EC2&lt;/strong&gt; at all. In 2021, I tested ECS on EC2, but the performance was not great. This year, the &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-ecs-improved-capacity-providers-cluster-auto-scaling/" rel="noopener noreferrer"&gt;announcement for improved &lt;strong&gt;capacity provider auto-scaling&lt;/strong&gt;&lt;/a&gt; gave me hope, and I thought we should re-test ECS on EC2.&lt;/p&gt;

&lt;p&gt;Scaling ECS clusters with EC2 workers as part of an AutoScaling Group managed by Capacity Providers is very similar to cluster-autoscaler from the Kubernetes world: based on demand, EC2 instances are added or removed from the cluster. Not enough space to run all the containers? New EC2 instances are added. Too many underutilized EC2 instances? Instances are cleanly removed from the cluster.&lt;/p&gt;
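&lt;p&gt;Under the hood, ECS cluster auto scaling steers on a metric called &lt;code&gt;CapacityProviderReservation&lt;/code&gt;: roughly, how much capacity the scheduler wants, as a percentage of what is currently running. A simplified sketch (the real metric handles more edge cases, such as scaling from zero):&lt;/p&gt;

```python
def capacity_provider_reservation(instances_needed, instances_running):
    # Above the targetCapacity setting, the ASG scales out; below it,
    # the ASG scales in. Simplified for illustration.
    return 100 * instances_needed / max(instances_running, 1)

# The scheduler wants 20 instances' worth of tasks, only 10 are running:
# the reservation is 200%, well above a 100% target, so the ASG adds capacity
print(capacity_provider_reservation(20, 10))  # 200.0
```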

&lt;p&gt;In 2021, we saw that ECS can scale a lot faster when using multiple Services (there's a bit of extra configuration to do, but it's worth it). We first have to figure out which number of ECS Services scales the fastest:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-ec2-services.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-ec2-services.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers, with different number of services. They all start scaling around the two and a half minute mark, with ECS on EC2 with 1 Service reaching about 1600 containers after ten minutes, with 5 Services reaching 3500 containers after ten minutes, with 7 Services also reaching 3500 containers after ten minutes, and with 10 Services reaching 2500 containers after ten minutes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers, with different number of services. They all start scaling around the two and a half minute mark, with ECS on EC2 with 1 Service reaching about 1600 containers after ten minutes, with 5 Services reaching 3500 containers after ten minutes, with 7 Services also reaching 3500 containers after ten minutes, and with 10 Services reaching 2500 containers after ten minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now that we know the ideal number of Services, we can focus on tuning Capacity Providers. A very important setting for the Capacity Provider is the &lt;strong&gt;target capacity&lt;/strong&gt;, which can be anything between 1% and 100%.&lt;br&gt;
If we set a low target capacity of 30%, we are keeping 70% of the EC2 capacity free, ready to host new containers. That's awesome from a scaling performance perspective, but terrible from a cost perspective: we are overpaying by 70%! Using a larger target of, say, 95% offers better cost efficiency (only 5% unused space), but scaling would be slower since we have to wait for new EC2 instances to start. How does this impact scaling performance? How much slower would scaling be? To figure it out, let's test:&lt;/p&gt;
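&lt;p&gt;The trade-off is easy to put in numbers (a trivial sketch, just to make the idle headroom and the cost multiplier explicit):&lt;/p&gt;

```python
def idle_headroom_pct(target_capacity_pct):
    # targetCapacity is the utilization the Capacity Provider aims for;
    # everything above it is idle EC2 kept warm for new containers
    return 100 - target_capacity_pct

def cost_per_used_vcpu_multiplier(target_capacity_pct):
    # At 30% utilization you pay for 100 vCPUs to use 30: a 3.33x multiplier
    return 100 / target_capacity_pct

print(idle_headroom_pct(30))  # 70 -> fast scaling, 70% idle spend
print(idle_headroom_pct(95))  # 5  -> cheap, but scaling waits on new EC2s
```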

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-ec2-target.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-ec2-target.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. ECS on EC2 with 5 Services and 30% Target Capacity starts scaling after about thirty seconds, and reaches 3500 containers just before the six minute mark. ECS on EC2 with 5 Services and 80% Target Capacity starts scaling around the two and a half minute mark and reaches 3500 containers after about ten minutes. ECS on EC2 with 5 Services and 95% Target Capacity starts scaling around the two and a half minute mark and reaches about 1600 containers after ten minutes of scaling"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. ECS on EC2 with 5 Services and 30% Target Capacity starts scaling after about thirty seconds, and reaches 3500 containers just before the six minute mark. ECS on EC2 with 5 Services and 80% Target Capacity starts scaling around the two and a half minute mark and reaches 3500 containers after about ten minutes. ECS on EC2 with 5 Services and 95% Target Capacity starts scaling around the two and a half minute mark and reaches about 1600 containers after ten minutes of scaling&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the real world, I mostly see ECS on EC2 used when ECS on Fargate is not enough — for things like GPU support, high-bandwidth networking, and so on.&lt;/p&gt;

&lt;p&gt;Let's move to &lt;strong&gt;ECS on Fargate&lt;/strong&gt; — serverless containers!&lt;/p&gt;

&lt;p&gt;As mentioned above, Fargate is an alternative to EC2s: no more stressing about servers, operating systems, patches, and updates. We only have to care about the container image!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FARGATE DIFFERS MASSIVELY BETWEEN ECS AND EKS&lt;/strong&gt;! ECS is AWS-native and serverless by design, which means ECS on Fargate can move faster and it can fully utilize the power of Fargate. Besides the "default" Fargate — &lt;em&gt;On-Demand Intel-based Fargate&lt;/em&gt; by its full name — ECS on Fargate also has support for &lt;strong&gt;Spot&lt;/strong&gt; (up to 70% discount, but containers may be interrupted), &lt;strong&gt;ARM&lt;/strong&gt; support (faster and cheaper than the default), &lt;strong&gt;Windows&lt;/strong&gt; support, and additional &lt;strong&gt;storage&lt;/strong&gt; options.&lt;/p&gt;

&lt;p&gt;In 2021, we saw that ECS on Fargate can scale a lot faster when using multiple &lt;code&gt;Services&lt;/code&gt; (there's a bit of extra configuration to do, but it's worth it). We first have to figure out which number of ECS Services scales the fastest:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-services.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-services.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers, with different numbers of services. ECS on Fargate with 2 Services takes about eight minutes, with 3 Services it takes about six minutes, with 5 Services it takes around five minutes, and with 7 Services it takes the same around five minutes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers, with different numbers of services. ECS on Fargate with 2 Services takes about eight minutes, with 3 Services it takes about six minutes, with 5 Services it takes around five minutes, and with 7 Services it takes the same around five minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For our test application, the ideal number seems to be 5 Services: our application needs to be split into 5 Services, each ECS Service launching and taking care of 700 containers. If we use fewer than 5 Services, performance is lower. If we use more than 5 Services, performance does not improve.&lt;/p&gt;
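&lt;p&gt;Splitting the desired count across Services is simple arithmetic; a quick sketch matching the layout used in the test:&lt;/p&gt;

```python
def split_across_services(total_containers, num_services):
    # Spread the desired count across N ECS Services as evenly as
    # possible, giving any remainder to the first Services
    base, remainder = divmod(total_containers, num_services)
    return [base + 1 if i < remainder else base for i in range(num_services)]

print(split_across_services(3500, 5))  # [700, 700, 700, 700, 700]
```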

&lt;p&gt;Amazing performance from ECS on Fargate! If we compare this year's best result with the results from previous years, we get a fabulous graph showing how much ECS on Fargate has evolved:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-yearly.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-yearly.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. In 2020 ECS on Fargate took about 55 minutes to reach 3500 containers. In 2021, it takes around 12 minutes, and in 2022 it takes a little over 5 minutes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. In 2020 ECS on Fargate took about 55 minutes to reach 3500 containers. In 2021, it takes around 12 minutes, and in 2022 it takes a little over 5 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If we built and ran an application using ECS on Fargate in 2020, we would have scaled to 3 500 containers in about 60 minutes. Without any effort or changes, in 2021 the scaling would be done in a little over 10 minutes. Without having to invest any effort or change any lines of code, in 2022 the same application would finish scaling in a little over 5 minutes! &lt;strong&gt;IN LESS THAN 2 YEARS, WITHOUT ANY EFFORT, WE WENT FROM 1 HOUR TO 5 MINUTES!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To further optimize our costs, we can run &lt;a href="https://aws.amazon.com/fargate/pricing/" rel="noopener noreferrer"&gt;ECS on Fargate Spot, which is discounted by up to 70%&lt;/a&gt;, but AWS can interrupt our containers with a 2-minute warning. For our testing, and for a lot of real-life workloads, we don't care if our containers get interrupted and then replaced by other containers. The up-to-70% discount is… appealing, but AWS mentions that Spot might be slower due to additional work that has to happen on their end. Let's test and see if there's any impact:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-spot.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-spot.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. The lines for ECS on Fargate On-Demand using 5 Services and ECS on Fargate Spot using 5 Services are really close, with the Spot line having a small bump"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. The lines for ECS on Fargate On-Demand using 5 Services and ECS on Fargate Spot using 5 Services are really close, with the Spot line having a small bump&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What a surprise! As per AWS, "&lt;em&gt;customers can expect a greater variability in performance when using ECS on Fargate Spot&lt;/em&gt;", and we saw exactly that: for the test I ran, ECS on Fargate Spot was just a smidge faster than ECS on Fargate On-Demand. &lt;em&gt;SPOT PERFORMANCE AND AVAILABILITY VARY&lt;/em&gt;, so make sure to account for that when architecting!&lt;/p&gt;

&lt;p&gt;That said, how sustained is this ECS on Fargate scaling performance? We can see that scaling slows to a crawl as we get close to our target of 3 500 containers. Is that because there are only a few remaining containers left to start, or because we are hitting a performance limitation? Let's test what happens when we try to scale to 10 000 containers!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-to-10k.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-to-10k.min.svg" alt="Hand-drawn-style graph showing the scaling performance from 0 to 3500 containers. There are lines for ECS on Fargate with 2, 3, 5, and 7 Services, and they all scale super-fast to 3400-ish containers and then slow down. There is a tall line for ECS on Fargate to 10000 containers with 5 Services taking close to ten minutes to scale, also slowing down when reaching the top"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling performance from 0 to 3500 containers. There are lines for ECS on Fargate with 2, 3, 5, and 7 Services, and they all scale super-fast to 3400-ish containers and then slow down. There is a tall line for ECS on Fargate to 10000 containers with 5 Services taking close to ten minutes to scale, also slowing down when reaching the top&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ok, ECS on Fargate has awesome &lt;strong&gt;and&lt;/strong&gt; sustained performance!&lt;/p&gt;

&lt;p&gt;That is not all though! This year, we have even more options: &lt;strong&gt;ECS on Fargate ARM&lt;/strong&gt; and &lt;strong&gt;ECS on Fargate Windows&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In late 2021, &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/aws-fargate-amazon-ecs-aws-graviton2-processors/" rel="noopener noreferrer"&gt;AWS announced &lt;strong&gt;ECS on Fargate ARM&lt;/strong&gt;&lt;/a&gt;, which takes advantage of &lt;a href="https://aws.amazon.com/ec2/graviton/" rel="noopener noreferrer"&gt;AWS' Graviton2 processors&lt;/a&gt;. ECS on Fargate ARM is both faster and cheaper than the "default" ECS on Fargate On-Demand, which uses Intel processors. There is no option for &lt;a href="https://github.com/aws/containers-roadmap/issues/1594" rel="noopener noreferrer"&gt;ECS on Fargate ARM Spot right now&lt;/a&gt;, so ECS on Fargate Spot remains the most cost-effective option.&lt;/p&gt;

&lt;p&gt;To run a container on ARM processors (be they AWS' Graviton processors or Apple Silicon in the latest Macs), we have to build a container image for ARM architectures. In our case, this was easy: we add a single line to Docker's &lt;a href="https://github.com/docker/build-push-action" rel="noopener noreferrer"&gt;&lt;code&gt;build-push-action&lt;/code&gt;&lt;/a&gt; to build a multi-architecture image for both Intel and ARM processors: &lt;code&gt;platforms: linux/amd64, linux/arm64&lt;/code&gt;. It's that easy! The image is then pushed to ECR, which has &lt;a href="https://aws.amazon.com/blogs/containers/introducing-multi-architecture-container-images-for-amazon-ecr/" rel="noopener noreferrer"&gt;supported multi-architecture images since 2020&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With the image built and pushed, we can test how ECS on Fargate ARM scales:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-arm.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-arm.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. ECS on Fargate is a bit slower than ECS on Fargate ARM, with about 20 seconds of difference between the two"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. ECS on Fargate is a bit slower than ECS on Fargate ARM, with about 20 seconds of difference between the two&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Interesting: ECS on Fargate ARM is faster! AWS' Graviton2 processors are faster, and I expected that to be visible in the application processing time, but I did not expect it to have an impact on scaling too. Thinking about it, though, it makes sense: a faster processor also helps with container image extraction and application startup. Even better!&lt;/p&gt;

&lt;p&gt;Also in late 2021, &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/10/aws-fargate-amazon-ecs-windows-containers/" rel="noopener noreferrer"&gt;AWS announced &lt;strong&gt;ECS on Fargate Windows&lt;/strong&gt;&lt;/a&gt;, which can run Windows Server containers. Since Windows has licensing fees, the pricing is a bit different: there is an extra OS licensing fee and, while billing is still done per-second, there is a 15-minute minimum charge.&lt;/p&gt;

&lt;p&gt;Some folks would dismiss ECS on Fargate Windows, but it is a major announcement! People who had to run specific Windows-only dependencies can now easily adopt containerized applications or, for the first time, run Windows containers serverlessly on AWS. Windows support is great news for folks running complex .NET applications that cannot be moved to Linux: they can now move at way higher velocity!&lt;/p&gt;

&lt;p&gt;To build Windows container images, we have to make &lt;strong&gt;a couple of changes&lt;/strong&gt;.&lt;br&gt;
First, we have to use a Windows base image for our container. For our Python web app it's easy: the official Python base images &lt;a href="https://github.com/docker-library/python/pull/142" rel="noopener noreferrer"&gt;have had Windows support since 2016&lt;/a&gt;. Unfortunately, Docker's &lt;code&gt;build-and-push&lt;/code&gt; Action &lt;a href="https://github.com/docker/build-push-action/issues/18" rel="noopener noreferrer"&gt;has no support for building Windows containers&lt;/a&gt;. To get the image built, we'll have to run the &lt;code&gt;docker build&lt;/code&gt; commands manually. Since GitHub Actions has native support for Windows runners, this is straightforward.&lt;br&gt;
Like all our images, the Windows container image is pushed to ECR, which &lt;a href="https://aws.amazon.com/about-aws/whats-new/2017/01/amazon-ecr-supports-docker-image-manifest-v2-schema-2/" rel="noopener noreferrer"&gt;has supported Windows images since 2017&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I tried running our application and it failed: the &lt;strong&gt;web server&lt;/strong&gt; could not start. As of right now, our &lt;a href="https://github.com/benoitc/gunicorn/issues/524" rel="noopener noreferrer"&gt;&lt;code&gt;gunicorn&lt;/code&gt; web server does not support Windows&lt;/a&gt;. No worries, we can use a drop-in alternative: &lt;a href="https://docs.pylonsproject.org/projects/waitress/en/latest/" rel="noopener noreferrer"&gt;&lt;code&gt;waitress&lt;/code&gt;&lt;/a&gt;! This will lead to a small difference in the test application code between Windows and Linux, but no fundamental changes.&lt;/p&gt;
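&lt;p&gt;A minimal sketch of the idea (not the actual test application): the same WSGI callable can be served by &lt;code&gt;gunicorn&lt;/code&gt; on Linux and by &lt;code&gt;waitress&lt;/code&gt; on Windows, so only the launcher differs:&lt;/p&gt;

```python
# Minimal WSGI app, a simplified stand-in for the actual test application.
import platform


def app(environ, start_response):
    """Tiny WSGI application: answer every request with 200 OK."""
    body = b"Hello from " + platform.system().encode()
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


if __name__ == "__main__" and platform.system() == "Windows":
    # gunicorn does not run on Windows, so use the drop-in waitress server
    from waitress import serve

    serve(app, host="0.0.0.0", port=8080)
# On Linux, the very same callable is served by gunicorn instead:
#   gunicorn app:app --bind 0.0.0.0:8080
```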

&lt;p&gt;Because I used Honeycomb for &lt;strong&gt;proper&lt;/strong&gt; observability, I was able to discover there is one more thing we have to do for the best scaling results: push non-distributable artifacts! You can &lt;a href="https://www.honeycomb.io/blog/observability-power-of-asking-questions" rel="noopener noreferrer"&gt;read the whole story in a guest post I wrote on the Honeycomb blog&lt;/a&gt;, but the short version is that, right now, Windows container images are special and an &lt;strong&gt;extra configuration&lt;/strong&gt; option has to be enabled.&lt;br&gt;
Windows has &lt;strong&gt;complex licensing&lt;/strong&gt; and the "base" container image is a non-distributable artifact: we are not allowed to distribute it! That means that when we build our container image in GitHub Actions and then run &lt;code&gt;docker push&lt;/code&gt; to upload our image to ECR, only some parts of the image are pushed to ECR — the parts we are allowed to share, which in our case are our application code and our app's dependencies. If we open the AWS Console and look at our image, we will see that only 76 MB were pushed to ECR.&lt;br&gt;
When ECS on Fargate wants to start a Windows container with our application, it first has to download the container image: our Python application and its dependencies, totaling about 76 MB, will be downloaded from ECR, but the base Windows Server image of about 2.7 GB will be downloaded from Microsoft. Unless everything is perfect in the universe and on the Internet between AWS and Microsoft, download performance can vary wildly!&lt;br&gt;
As per Steve Lasker, PM Architect at Microsoft Azure, Microsoft recognizes this licensing constraint has caused frustration and is working to remove the constraint and the default configuration. Until then, for the highest consistent performance, &lt;a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html" rel="noopener noreferrer"&gt;both AWS&lt;/a&gt; and &lt;a href="https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#how-do-i-push-non-distributable-layers-to-a-registry-" rel="noopener noreferrer"&gt;Microsoft&lt;/a&gt; recommend setting a &lt;strong&gt;Docker daemon flag&lt;/strong&gt; for private-use images: &lt;code&gt;--allow-nondistributable-artifacts&lt;/code&gt;. With this flag set, the full image totaling 2.8 GB is pushed to ECR — both the base image and our application code. When ECS on Fargate has to download the container image, it will download the whole thing from the close-by ECR.&lt;/p&gt;
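&lt;p&gt;For reference, the same setting can be made persistent in the Docker daemon configuration (&lt;code&gt;daemon.json&lt;/code&gt;); the account ID and region below are placeholders for your private ECR registry:&lt;/p&gt;

```json
{
  "allow-nondistributable-artifacts": [
    "111122223333.dkr.ecr.us-east-1.amazonaws.com"
  ]
}
```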

&lt;p&gt;With this extra flag set, and with the full image pushed to ECR, we can test ECS on Fargate Windows and get astonishing performance:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-windows.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-windows.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. ECS on Fargate and ECS on Fargate ARM both start around the 30 second mark and reach close to 3500 containers around the four to five minute mark. ECS on Fargate Windows starts just before the six minute mark, reaching 3500 containers in about eleven minutes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. ECS on Fargate and ECS on Fargate ARM both start around the 30 second mark and reach close to 3500 containers around the four to five minute mark. ECS on Fargate Windows starts just before the six minute mark, reaching 3500 containers in about eleven minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ECS on Fargate Windows is slower to start — about 5 minutes of delay compared to about 30 seconds of delay when using Linux containers — but that was expected: Fargate has to do licensing stuff and Windows containers are, for good reasons, bigger. After that initial delay, ECS on Fargate Windows is scaling just as fast, which is awesome!&lt;/p&gt;

&lt;p&gt;I am seeing a massive migration to ECS on Fargate — it's so much easier!&lt;br&gt;
Since ECS on Fargate launched in 2017, for the best-case scenario, approximate pricing per vCPU-hour saw a whopping 76% reduction from &lt;a href="https://aws.amazon.com/blogs/aws/aws-fargate/" rel="noopener noreferrer"&gt;$ 0.05&lt;/a&gt; to &lt;a href="https://aws.amazon.com/fargate/pricing/" rel="noopener noreferrer"&gt;$ 0.01&lt;/a&gt;, and pricing per GB-hour saw a shocking 89% reduction from &lt;a href="https://aws.amazon.com/blogs/aws/aws-fargate/" rel="noopener noreferrer"&gt;$ 0.010&lt;/a&gt; to &lt;a href="https://aws.amazon.com/fargate/pricing/" rel="noopener noreferrer"&gt;$ 0.001&lt;/a&gt;. Since I started testing in 2020, ECS on Fargate got 12 times faster!&lt;/p&gt;

&lt;p&gt;While initially a slow and expensive service, ECS on Fargate grew into an outstanding service. I started advising my clients to start moving smaller applications to ECS on Fargate in late 2019, and that recommendation became stronger each year. I think &lt;a href="https://www.vladionescu.me/posts/flowchart-how-should-i-run-containers-on-aws-2021/" rel="noopener noreferrer"&gt;ECS on Fargate should be the default choice for any new container deployments on AWS&lt;/a&gt;. As a bonus, &lt;a href="https://aws.amazon.com/ecs/anywhere/" rel="noopener noreferrer"&gt;running in other data-centers became a thing instantly, without any effort, through ECS Anywhere&lt;/a&gt;! That enables some amazingly easy cross-cloud SaaS scenarios and instant edge computing use-cases.&lt;/p&gt;

&lt;p&gt;That said, ECS on Fargate is not an ideal fit for everything: there is &lt;a href="https://github.com/aws/containers-roadmap/issues/164" rel="noopener noreferrer"&gt;limited support for CPU and memory size&lt;/a&gt;, &lt;a href="https://github.com/aws/containers-roadmap/issues/384" rel="noopener noreferrer"&gt;limited storage support&lt;/a&gt;, &lt;a href="https://github.com/aws/containers-roadmap/issues/88" rel="noopener noreferrer"&gt;no GPU support yet&lt;/a&gt;, &lt;a href="https://github.com/aws/containers-roadmap/issues/715" rel="noopener noreferrer"&gt;no dedicated network bandwidth&lt;/a&gt;, and so on. Using EC2 workers with ECS on EC2 offers a lot more flexibility and power — say servers with multiple TBs of memory, hundreds of CPUs, and dedicated network bandwidth in the range of 100s of Gigabits. It's always tradeoffs!&lt;/p&gt;

&lt;p&gt;Again, keep in mind that these are &lt;strong&gt;DEFAULT AND FORCED PERFORMANCE RESULTS&lt;/strong&gt;, with an extreme test case, and with manual scaling! Performance levels will differ depending on your applications and what setup you're running! &lt;a href="https://aws.amazon.com/blogs/containers/under-the-hood-amazon-elastic-container-service-and-aws-fargate-increase-task-launch-rates/" rel="noopener noreferrer"&gt;This post on the AWS Containers Blog&lt;/a&gt; gets into more details, if you're curious.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-overview.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-overview.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers, using ECS. There are a lot of lines and it's messy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers, using ECS. There are a lot of lines and it's messy&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally posted at &lt;a href="https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/" rel="noopener noreferrer"&gt;https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/&lt;/a&gt; and the dev.to version may contain errors and less-than-ideal presentation&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;App Runner&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;App Runner&lt;/strong&gt; is a higher-level service &lt;a href="https://aws.amazon.com/blogs/containers/introducing-aws-app-runner/" rel="noopener noreferrer"&gt;released in May 2021&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For App Runner, we don't see any complex Control Plane and Worker Plane separation: we tell App Runner to run our services, and it does that for us!&lt;br&gt;
App Runner is an easier way of running containers, further building on the experience offered by ECS on Fargate. If you're really curious how this works under the covers, &lt;a href="https://tty.neveragain.de/2021/06/18/app-runner-deep-dive.html" rel="noopener noreferrer"&gt;an awesome deep-dive can be read here&lt;/a&gt; and &lt;a href="https://aws.amazon.com/blogs/containers/deep-dive-on-aws-app-runner-vpc-networking/" rel="noopener noreferrer"&gt;AWS has a splendid networking deep-dive here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For our use case, there is no need to create a VPC since App Runner does not require one (but App Runner &lt;a href="https://aws.amazon.com/blogs/aws/new-for-app-runner-vpc-support/" rel="noopener noreferrer"&gt;can connect to resources in a VPC&lt;/a&gt;). To test App Runner, we create an &lt;strong&gt;App Runner service&lt;/strong&gt; configured to run our 1 vCPU 2 GB container using our container image from ECR. That's it!&lt;/p&gt;

&lt;p&gt;To force &lt;strong&gt;scaling&lt;/strong&gt;, we can edit the "&lt;em&gt;Minimum number of instances&lt;/em&gt;" to equal the "&lt;em&gt;Maximum number of instances&lt;/em&gt;", and we quickly get the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fapprunner.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fapprunner.min.svg" alt="Hand-drawn-style graph showing the scaling performance from 0 to 30 containers instead of 3500 containers. There are 2 lines for ECS on Fargate using 2 and 5 services: they both start around the 30 seconds mark and go straight up. App Runner has a shorter line that starts around the one minute mark, goes straight up, and stops abruptly at 25 containers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling performance from 0 to 30 containers instead of 3500 containers. There are 2 lines for ECS on Fargate using 2 and 5 services: they both start around the 30 seconds mark and go straight up. App Runner has a shorter line that starts around the one minute mark, goes straight up, and stops abruptly at 25 containers&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;App Runner starts scaling a bit slower than ECS on Fargate, but then scales just as fast. Scaling finishes quickly, as &lt;strong&gt;APP RUNNER SUPPORTS A MAXIMUM OF 25 CONTAINERS&lt;/strong&gt; per service. There is no way to run more than 25 containers per service, but multiple services could be used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't be fooled&lt;/strong&gt; by the seemingly small number! Each App Runner container can use a maximum of 2 vCPUs and 4 GB of RAM, for a grand total of 50 vCPUs and 100 GB of memory in a single service. For many applications, this is more than enough, and the advantage of AWS managing things is not to be underestimated!&lt;/p&gt;

&lt;p&gt;In the future, I expect App Runner will continue to mature, and I think it might become the default way of running containers sometime in 2023 — but those are my hopes and dreams. With AWS managing capacity for App Runner, there are a lot of optimizations that AWS could implement. We'll see what the future brings!&lt;/p&gt;

&lt;p&gt;Keep in mind that these are &lt;strong&gt;DEFAULT PERFORMANCE RESULTS&lt;/strong&gt;, with manual, forced scaling! Performance levels will differ depending on your applications and what setup you're running! App Runner also requires way less effort to set up and has some awesome scaling features 😉&lt;/p&gt;


&lt;h2&gt;Lambda&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Lambda&lt;/strong&gt; is different: it's Function-as-a-Service, not containers.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;previous years&lt;/strong&gt; I saw no need to test Lambda as &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html" rel="noopener noreferrer"&gt;AWS publishes the exact speed at which Lambda scales&lt;/a&gt;. In our Northern Virginia (&lt;code&gt;us-east-1&lt;/code&gt;) region, that is an initial burst of 3 000 instances in the first minute, and then 500 instances every minute after. No need to test when &lt;strong&gt;we know the exact results we are going to get&lt;/strong&gt;: Lambda will scale to 3 500 instances in 2 minutes!&lt;br&gt;
This year, based on countless requests, I thought we should include Lambda, even if only to confirm AWS' claims of performance. &lt;strong&gt;Lambda instances are not directly comparable with containers&lt;/strong&gt;, but we'll get to that in a few.&lt;/p&gt;
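&lt;p&gt;The advertised numbers compose simply, and we can sanity-check the claim with a back-of-the-envelope model (a simplification: it ignores how requests arrive within each minute):&lt;/p&gt;

```python
# Back-of-the-envelope model of Lambda's documented scaling behavior:
# an initial burst in the first minute, then a fixed number of extra
# instances every following minute.
def lambda_instances(minutes, burst=3000, per_minute=500):
    """Instances reachable `minutes` minutes after scaling starts."""
    if minutes < 1:
        return 0  # simplification: the burst lands within the first minute
    return burst + (minutes - 1) * per_minute


# With the default us-east-1 limits, 3 500 instances in 2 minutes:
print(lambda_instances(2))  # 3500
```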

&lt;p&gt;Lambda is a fully managed &lt;strong&gt;event-driven Function-as-a-Service&lt;/strong&gt; product. In plain language, when events happen (HTTP requests, messages put in a queue, a file gets created) Lambda will run code for us. We send Lambda the code, say "&lt;em&gt;run it when X happens&lt;/em&gt;", and Lambda will do everything for us, without any Control Plane and Worker Plane separation — we don't even see the workers directly!&lt;/p&gt;

&lt;p&gt;The code that Lambda will run can be &lt;strong&gt;packaged&lt;/strong&gt; in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  in a &lt;code&gt;.zip&lt;/code&gt; archive, of at most 50 MB. This is the "classic" way of sending code to Lambda&lt;/li&gt;
&lt;li&gt;  in a container image (which, at its lowest level, is a collection of archives too), which can be as large as 10 GB. This was launched in &lt;a href="https://aws.amazon.com/blogs/aws/new-for-aws-lambda-container-image-support/" rel="noopener noreferrer"&gt;late 2020&lt;/a&gt; and is an alternative way of packaging the code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To &lt;strong&gt;adapt our application&lt;/strong&gt; for Lambda, we have first to figure out what &lt;em&gt;event&lt;/em&gt; our Lambda function will react to.&lt;br&gt;
We could configure Lambda to &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html" rel="noopener noreferrer"&gt;have a bunch of containers waiting ready to accept traffic&lt;/a&gt;, but that is not the same thing as Lambda creating workers for us when traffic spikes. For accurate results, I think we need to have at least a semi-realistic scenario in place.&lt;br&gt;
Looking at the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-services.html" rel="noopener noreferrer"&gt;many integrations Lambda has&lt;/a&gt;, I think the easiest one to use for our testing is the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/services-apigateway.html" rel="noopener noreferrer"&gt;Amazon API Gateway integration&lt;/a&gt;: API Gateway will receive HTTP requests, run our code for each request, and return the response.&lt;/p&gt;

&lt;p&gt;Our &lt;strong&gt;test application code&lt;/strong&gt; can be adapted now that we know what will call our Lambda. I decided to re-write our test application. While Lambda can run whatever code we send it (including &lt;a href="https://lamby.custominktech.com" rel="noopener noreferrer"&gt;big frameworks&lt;/a&gt; with &lt;a href="https://twitter.com/metaskills/status/1377219340936826881" rel="noopener noreferrer"&gt;awesome performance&lt;/a&gt;), I see no need for a big framework for our use-case. I re-wrote the code (still in Python, but without the Flask micro-framework), packaged it into a &lt;code&gt;.zip&lt;/code&gt; file, and configured it as the source for my Lambda function.&lt;/p&gt;

&lt;p&gt;To &lt;strong&gt;scale&lt;/strong&gt; the Lambda function, since there is no option of manually editing a number, we will have to create a lot of events: flood the API Gateway with a lot of HTTP requests. I did some experiments, and the easiest option seems to be &lt;a href="https://httpd.apache.org/docs/2.4/programs/ab.html" rel="noopener noreferrer"&gt;Apache Benchmark&lt;/a&gt; running on multiple EC2 instances with a lot of &lt;a href="https://www.ec2throughput.info" rel="noopener noreferrer"&gt;sustained network bandwidth&lt;/a&gt;. We will run Apache Benchmark on each EC2 instance, and send a gigantic flood of requests to API Gateway, which will in turn send that to a lot of Lambdas.&lt;br&gt;
Since &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html" rel="noopener noreferrer"&gt;Lambda publishes its scaling performance targets&lt;/a&gt;, we know that Lambda will scale to our target of 3 500 containers in just 2 minutes. That's not enough time — we have to scale even higher! When we wanted to confirm that ECS on Fargate has sustained scaling performance, we scaled to 10 000 containers in about 10 minutes. That seems like a good target, right? Let's scale Lambda!&lt;/p&gt;
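&lt;p&gt;The actual flood was generated with Apache Benchmark on EC2; purely as an illustration of the idea, a similar request flood can be sketched in Python (the URL, request count, and concurrency below are whatever your setup needs):&lt;/p&gt;

```python
# Hypothetical illustration of the flood; the real tests used Apache
# Benchmark (`ab`) running on multiple EC2 instances, not this script.
import concurrent.futures
import urllib.request


def flood(url, total_requests, concurrency):
    """Send `total_requests` GET requests at `url` with `concurrency` workers.

    Returns the number of HTTP 200 responses. Keeping many requests
    in flight at once is what forces Lambda to create new instances.
    """
    def hit(_):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.status == 200
        except OSError:
            return False

    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        return sum(pool.map(hit, range(total_requests)))
```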

&lt;p&gt;There are multiple other configuration options, and if you want to see the full configuration used, you can check out the Terraform infrastructure code in the &lt;a href="https://github.com/vlaaaaaaad/blog-scaling-containers-on-aws-in-2022" rel="noopener noreferrer"&gt;&lt;code&gt;lambda&lt;/code&gt; folder in &lt;code&gt;vlaaaaaaad/blog-scaling-containers-on-aws-in-2022&lt;/code&gt; on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, with the setup covered, let's get to testing! I ran all the tests between January and February 2022, using all the latest versions available at the time of testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Flambda-vs-ecs-on-fargate.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Flambda-vs-ecs-on-fargate.min.svg" alt="Hand-drawn-style graph showing the scaling performance from 0 to 10000 containers. ECS on Fargate starts around the 30 seconds mark and grows smoothly until 10000 around the ten minute mark. The Lambda line spikes instantly to 3000 containers, and then spikes again to 3500 containers. After that, the Lambda line follows a stair pattern, every minutes spiking an additional 500 containers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling performance from 0 to 10000 containers. ECS on Fargate starts around the 30 seconds mark and grows smoothly until 10000 around the ten minute mark. The Lambda line spikes instantly to 3000 containers, and then spikes again to 3500 containers. After that, the Lambda line follows a stair pattern, every minute spiking an additional 500 containers&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Lambda followed the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html" rel="noopener noreferrer"&gt;advertised performance&lt;/a&gt; to the letter: an initial burst of 3 000 containers in the first minute, and 500 containers each minute after. I am surprised by Lambda scaling in steps — I expected the scaling to be spread over the minutes, not to have those spikes when each minute starts.&lt;/p&gt;

&lt;p&gt;Funnily enough, at about 7 minutes after the scaling command, both Lambda and ECS on Fargate were running almost the same number of containers: 6 500, give or take. That does not tell us much though — these are pure containers launched. How would this work in the real world? How would this impact our applications? How would this scaling impact response times and customer happiness?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LAMBDA AND ECS ON FARGATE WORK IN DIFFERENT WAYS&lt;/strong&gt; and direct comparisons between the two cannot be easily done.&lt;br&gt;
Lambda has &lt;a href="https://www.sentiatechblog.com/aws-re-invent-2020-day-3-optimizing-lambda-cost-with-multi-threading" rel="noopener noreferrer"&gt;certain container sizes&lt;/a&gt; and ECS on Fargate has &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html" rel="noopener noreferrer"&gt;other container sizes&lt;/a&gt; with &lt;a href="https://github.com/aws/containers-roadmap/issues/164" rel="noopener noreferrer"&gt;even more sizes in the works&lt;/a&gt;. Lambda has &lt;a href="https://aws.amazon.com/lambda/pricing/" rel="noopener noreferrer"&gt;one pricing for x86 and one pricing for ARM&lt;/a&gt; and ECS on Fargate has &lt;a href="https://aws.amazon.com/fargate/pricing/" rel="noopener noreferrer"&gt;two pricing options for x86 and one pricing option for ARM&lt;/a&gt;. Lambda is more managed and tightly integrated with AWS, which means, for example, that there is no need to write code to receive a web request, or to get a message from an SQS Queue. But ECS on Fargate can more easily use mature and validated frameworks. Lambda can only handle 1 HTTP request at a time per container, while ECS on Fargate can do as many as the application can handle, but that means more complex configuration. And so on and so forth.&lt;br&gt;
I won't even attempt to compare them here.&lt;/p&gt;

&lt;p&gt;And that is not all — it gets worse/better!&lt;br&gt;
We all know that ECS on Fargate and EKS on Fargate limits can be increased, and we even saw that in the 2021 tests. In the middle of testing Lambda, I got some &lt;strong&gt;surprising news&lt;/strong&gt;: Lambda's &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html" rel="noopener noreferrer"&gt;default scaling limits of 3 000 burst and 500 sustained&lt;/a&gt; can be increased!&lt;br&gt;
Unlike the previous limit increases (which were actually capacity limit increases), this is a performance limit increase, and it is not a straightforward request. It's not something that only the top 1% of the top 1% of AWS customers can achieve, but it's not something that is easily available either. With a &lt;strong&gt;LEGITIMATE WORKLOAD&lt;/strong&gt; and some conversations with the AWS teams and engineers through AWS Support, I was able to get my Lambda limits increased by a shocking amount: from the default initial burst of 3 000 and a sustained rate of 500, my limits were increased to an initial burst of 15 000 instances and a sustained rate of 3 000 instances each minute 🤯&lt;/p&gt;

&lt;p&gt;To see these increased limits in action, we have to scale even higher. A test to 10 000 containers is useless:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Flambda-increased-limits.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Flambda-increased-limits.min.svg" alt="Hand-drawn-style graph showing the scaling performance from 0 to 10000 containers. The same graph as before, with an additional line going straight up from 0 to 10000, instantly"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling performance from 0 to 10000 containers. The same graph as before, with an additional line going straight up from 0 to 10000, instantly&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Did you notice the vertical line? Yeah, scaling to 10 000 is not a great benchmark when Lambda has increased limits to burst to 15 000.&lt;/p&gt;

&lt;p&gt;To properly test Lambda with increased limits, we have to go even higher! If we want to scale for the same 10-minute duration, we have to scale up to 50 000 containers!&lt;br&gt;
At this scale, things start getting complicated. To support this many requests, we also have to increase the API Gateway traffic limits. I talked to AWS Support and after validating my workflow, we got the &lt;em&gt;Throttle quota per account, per Region across HTTP APIs, REST APIs, WebSocket APIs, and WebSocket callback APIs&lt;/em&gt; limit increased from the default 10 000 to our required 50 000.&lt;/p&gt;

&lt;p&gt;With the extra limits increased for our setup, we can run our test and see the results. Prepare to scroll for a while:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Flambda-to-50k.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Flambda-to-50k.min.svg" alt="Hand-drawn-style graph showing the scaling performance form 0 to 50000 containers. It is a comically tall graph, requiring a lot of scrolling. ECS on Fargate starts around the 30 seconds mark and grows smoothly until 10000 around the same ten minute mark. There is an additional line for Lambda with increased limits which goes straight up to 12000-ish containers and then spikes again to 18000-ish. After that, the line follows the same stair pattern, every minute spiking an additional 3000 containers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling performance from 0 to 50000 containers. It is a comically tall graph, requiring a lot of scrolling. ECS on Fargate starts around the 30 seconds mark and grows smoothly until 10000 around the same ten minute mark. There is an additional line for Lambda with increased limits which goes straight up to 12000-ish containers and then spikes again to 18000-ish. After that, the line follows the same stair pattern, every minute spiking an additional 3000 containers&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Yeah… that is &lt;strong&gt;&lt;em&gt;a lot&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We are talking about large numbers here: 3 500, 10 000, and now 50 000 containers. We are getting desensitized, and I think it would help to put these numbers in &lt;strong&gt;perspective&lt;/strong&gt;. The biggest Lambda size is 6 vCPUs with 10 GB of memory.&lt;br&gt;
With the default limits, Lambda scales to 3 000 containers in a couple of seconds. That means that with default limits, we get &lt;strong&gt;30 TB OF MEMORY AND 18 000 VCPUS IN A COUPLE OF SECONDS&lt;/strong&gt;.&lt;br&gt;
With a &lt;strong&gt;legitimate workload&lt;/strong&gt; and increased limits, as we just saw, we are now living in a world where we can &lt;strong&gt;INSTANTLY GET 150 TB OF RAM AND 90 000 VCPUS&lt;/strong&gt; for our apps 🤯&lt;/p&gt;
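&lt;p&gt;The arithmetic behind those numbers, spelled out:&lt;/p&gt;

```python
# The raw capacity behind the instance counts, assuming every instance
# uses the biggest Lambda size: 6 vCPUs and 10 GB of memory.
VCPUS_PER_INSTANCE = 6
GB_PER_INSTANCE = 10


def capacity(instances):
    """Total (vCPUs, TB of memory) for `instances` Lambda instances."""
    return instances * VCPUS_PER_INSTANCE, instances * GB_PER_INSTANCE / 1000


print(capacity(3_000))   # default burst -> (18000, 30.0)
print(capacity(15_000))  # increased burst -> (90000, 150.0)
```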

&lt;h2&gt;
  
  
  Acknowledgments
&lt;/h2&gt;

&lt;p&gt;First, massive thanks go to &lt;a href="https://twitter.com/FarrahC32" rel="noopener noreferrer"&gt;Farrah&lt;/a&gt; and &lt;a href="https://twitter.com/mreferre" rel="noopener noreferrer"&gt;Massimo&lt;/a&gt;, and everybody at AWS that helped with this! You are all awesome and the time and care you took to answer all my annoying emails is much appreciated!&lt;br&gt;
I also want to thank &lt;a href="https://twitter.com/SteveLasker" rel="noopener noreferrer"&gt;Steve Lasker&lt;/a&gt; and the nice folks at Microsoft that answered my questions around Windows containers!&lt;/p&gt;

&lt;p&gt;Special thanks go to all my friends who helped with this — from friends being interviewed about what they would like to see and their scaling use-cases, all the way to friends reading rough drafts and giving feedback. Thank you all so much!&lt;/p&gt;

&lt;p&gt;Because I value transparency, and because "&lt;em&gt;oh, just write a blog post&lt;/em&gt;" fills me with anger, here are a couple of stats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  this whole thing took &lt;strong&gt;almost 6 months&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;  November 2021: early discovery and calls to figure out how to test, showcase, and contextualize Lambda and App Runner&lt;/li&gt;
&lt;li&gt;  December 2021: continued discovery, coding, testing&lt;/li&gt;
&lt;li&gt;  January 2022: continued testing, data visualization explorations&lt;/li&gt;
&lt;li&gt;  February 2022: final testing, early drafts, and data visualization explorations&lt;/li&gt;
&lt;li&gt;  March 2022: a lot of &lt;a href="https://twitter.com/iamvlaaaaaaad/status/1499049324864577543" rel="noopener noreferrer"&gt;writing and reviews&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  April 2022: final reviews, re-testing&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  final results from the tests were exported to &lt;strong&gt;over 80 MB of spreadsheets&lt;/strong&gt;, from about &lt;strong&gt;5 GB of raw data&lt;/strong&gt;. That's 80 MB of CSVs and 5 GB of compressed JSONs, not counting the discarded data from experiments or failed tests&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;over 200 emails&lt;/strong&gt; were sent between me and &lt;strong&gt;more than 30 engineers at AWS&lt;/strong&gt;. Again, thank you all so much, and I apologize for the spam!&lt;/li&gt;

&lt;li&gt;  the total &lt;strong&gt;AWS bill was about 7 000 $&lt;/strong&gt;. Thank you &lt;a href="https://twitter.com/FarrahC32" rel="noopener noreferrer"&gt;Farrah&lt;/a&gt; and AWS for the Credits!&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;more than 300 000 containers&lt;/strong&gt; were launched&lt;/li&gt;

&lt;li&gt;  one major bug was discovered&lt;/li&gt;

&lt;li&gt;  carbon emissions are TBD&lt;/li&gt;

&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally posted at &lt;a href="https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/" rel="noopener noreferrer"&gt;https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/&lt;/a&gt; and the dev.to version may contain errors and less-than-ideal presentation&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>serverless</category>
      <category>cloud</category>
    </item>
    <item>
      <title>AWS User Group Dubai 2021 Container Series Meetups</title>
      <dc:creator>Vlad Ionescu</dc:creator>
      <pubDate>Tue, 14 Sep 2021 20:37:50 +0000</pubDate>
      <link>https://dev.to/aws-heroes/aws-user-group-dubai-2021-container-series-meetups-49fi</link>
      <guid>https://dev.to/aws-heroes/aws-user-group-dubai-2021-container-series-meetups-49fi</guid>
      <description>&lt;h2&gt;
  
  
  Details
&lt;/h2&gt;

&lt;p&gt;This series of talks and hands-on workshops around the concept of "&lt;em&gt;AWS cloud-native modern applications&lt;/em&gt;" introduces the AWS cloud platform in that light, starting with the core building blocks of modernizing traditional applications and showing how to capitalize on AWS services and capabilities to build more resilient, reliable applications with cloud-native design in mind.&lt;/p&gt;

&lt;p&gt;It's 2021. We need ways to speed up deploying, scaling, and automating our applications, to enable developers and operations teams ("&lt;em&gt;DevOps&lt;/em&gt;") to collaborate effectively, work efficiently, and save resources, and to solve the matrix-from-Hell problem. The magic word here is containers. This is a pragmatic, hands-on series of workshops introducing members to AWS container workloads, focusing on the fundamentals and where to start: an AWS Containers 101 workshop.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conception &amp;amp; leadership: AWS Container Hero &lt;strong&gt;Walid Shaari&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Organizer: AWS Community Hero &lt;strong&gt;Anas Khattar&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Outreach: AWS Community Hero &lt;strong&gt;Ahmed Samir&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Speakers:

&lt;ul&gt;
&lt;li&gt;AWS Container Hero &lt;strong&gt;Walid Shaari&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;AWS Container Hero &lt;strong&gt;Vlad Ionescu&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow AWS UG Dubai on Twitter: &lt;a href="https://twitter.com/awsdubai"&gt;@awsdubai&lt;/a&gt; and &lt;a href="https://twitter.com/AWSomeMENA"&gt;@AWSomeMENA&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro to Containers 101
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wlsme5iQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/07rqf2tjsg5yefrb5up6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wlsme5iQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/07rqf2tjsg5yefrb5up6.png" alt="Cover image for the first talk"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the first talk of this new series, we will introduce the format for this multi-talk series. After that, we’ll discuss containers in an abstract way. Why do we use containers? What are some different ways of running containers?&lt;/p&gt;

&lt;p&gt;Agenda:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intro to this series: speakers, format, sponsors&lt;/li&gt;
&lt;li&gt;What are some ideal use cases for containers?&lt;/li&gt;
&lt;li&gt;How can we run containers in production?&lt;/li&gt;
&lt;li&gt;Open discussions, Q&amp;amp;A&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keywords:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Containers, Docker, OCI&lt;/li&gt;
&lt;li&gt;AWS ECS, Amazon EKS, Kubernetes, Lambda Containers, Serverless&lt;/li&gt;
&lt;li&gt;Immutable Infrastructure, Virtual Machines, VMs&lt;/li&gt;
&lt;li&gt;Gitpod, GitHub Codespaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📆 Date: Wednesday, Sept. 15, 2021, at 8:30 PM UTC +4&lt;/p&gt;

&lt;p&gt;🌍 Venue: online, see all the details at &lt;a href="https://www.meetup.com/AWS-Dubai/events/280711543/"&gt;AWS UG Dubai Meetup #8 (virtual): Part 1 - Intro to Containers 101&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎥 Video recording of the meetup: &lt;a href="https://www.youtube.com/watch?v=PwbPlk-KE7s"&gt;here on YouTube&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Containers 102
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ivBjqAOe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oo0xm708vdyg3q206zdj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ivBjqAOe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oo0xm708vdyg3q206zdj.png" alt="Cover image for the second talk"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this session, we will discuss everything about building containers: what a container image is, how we build one, and what the OCI is. We will end this talk with some best practices, and we will build up some excitement for the workshop!&lt;/p&gt;

&lt;p&gt;Agenda:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Container workflows&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;Dockerfiles&lt;/li&gt;
&lt;li&gt;Building an image&lt;/li&gt;
&lt;li&gt;Tagging an image&lt;/li&gt;
&lt;li&gt;Running a container from an image&lt;/li&gt;
&lt;li&gt;Best practices and helpful tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keywords:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dockerfiles&lt;/li&gt;
&lt;li&gt;Docker, OCI, Open Container Initiative&lt;/li&gt;
&lt;li&gt;GitHub Actions, CircleCI&lt;/li&gt;
&lt;li&gt;AWS ECR, Github Container Registry, Dockerhub, Quay&lt;/li&gt;
&lt;li&gt;Hadolint, Snyk, Dependabot, Dive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📆 Date: Tuesday, Sept. 21, 2021, at 8:30 PM UTC +4&lt;/p&gt;

&lt;p&gt;🌍 Venue: online, see all the details at &lt;a href="https://www.meetup.com/AWS-Dubai/events/280722294/"&gt;AWS UG Dubai Meetup #9 (virtual): Part 2 - Building Containers 102&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎥 Video recording of the meetup: &lt;a href="https://www.youtube.com/watch?v=mTzAD0s3kd4"&gt;here on YouTube&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Containers 102 Lab
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jy2_2HDE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/79jq2bko622tqt8rwpuv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jy2_2HDE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/79jq2bko622tqt8rwpuv.png" alt="Cover image for the workshop"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this hands-on lab, you will learn how to build containers for different applications. After a short introduction, each student will open a pre-configured Gitpod workspace and build container images for 3 applications: one Go app, one Python app, and a React app! Once an image is built, we will push the image to Amazon Elastic Container Registry. For extra credit, we will also inspect the image we built. No previous experience with Docker, Go, Python, React, or JavaScript is required!&lt;/p&gt;

&lt;p&gt;This will be a live and guided workshop, leveraging Zoom, Slack, and Gitpod.&lt;/p&gt;

&lt;p&gt;Requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub account&lt;/li&gt;
&lt;li&gt;Gitpod account (can be created instantly)&lt;/li&gt;
&lt;li&gt;An active AWS account (if you don’t have an AWS account, you can do the first half of the workshop)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keywords:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS, ECR, Amazon Elastic Container Registry, Github Container Registry, Dockerhub, Quay&lt;/li&gt;
&lt;li&gt;Workshop, Interactive, Live, Guided&lt;/li&gt;
&lt;li&gt;Docker, OCI, Containerd, Buildkit, GitHub Actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📆 Date: Wednesday, Sept. 22, 2021, at 8:30 PM UTC +4&lt;/p&gt;

&lt;p&gt;🌍 Venue: online, see all the details at &lt;a href="https://www.meetup.com/AWS-Dubai/events/280722765/"&gt;AWS UG Dubai Meetup #10 (virtual): Part 3 - Building Containers Lab&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎥 Video recording of the meetup: &lt;a href="https://www.youtube.com/watch?v=87KSsZtH1uw"&gt;here on YouTube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Vlaaaaaaad/aws-ug-dbx-2021-building-containers-workshop-react-app"&gt;The React exercise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Vlaaaaaaad/aws-ug-dbx-2021-building-containers-workshop-golang-app"&gt;The Go exercise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Vlaaaaaaad/aws-ug-dbx-2021-building-containers-workshop-python-app"&gt;The Python exercise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Vlaaaaaaad/aws-ug-dbx-2021-building-containers-workshop-react-app"&gt;The React exercise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/poc-hello-world"&gt;The source for all the apps: poc-hello-world&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/docker/build-push-action"&gt;Docker's build-and-push GitHub Action&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitpod.io"&gt;Gitpod, the development environment we used&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Application Modernization with Amazon EKS
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Tl6s65IN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sq0897hj7ivyuf0hstgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Tl6s65IN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sq0897hj7ivyuf0hstgt.png" alt="Cover image for the fourth talk"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this session, we will explore the popular workload manager and scheduler Kubernetes. Amazon's managed Kubernetes service, Elastic Container Service for Kubernetes (Amazon EKS), takes care of the heavy lifting and lets you focus on managing your containerized workloads. EKS, however, still gives you the flexibility to choose where and how to efficiently run the data plane that hosts your workloads. We cover what you need to know to get your application up and running with Kubernetes on AWS, and we show how Amazon EKS makes deploying Kubernetes on AWS simple and scalable.&lt;/p&gt;

&lt;p&gt;Agenda:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review the general Kubernetes architecture and relate to EKS&lt;/li&gt;
&lt;li&gt;How to set up and provision your Kubernetes cluster using the console and eksctl&lt;/li&gt;
&lt;li&gt;Discuss the important abstractions that developers use to map their traditional applications onto any Kubernetes platform&lt;/li&gt;
&lt;li&gt;How to deploy software efficiently while sustaining reliable and scalable applications&lt;/li&gt;
&lt;li&gt;Deploy your first microservices on EKS&lt;/li&gt;
&lt;li&gt;A possible EKS development deployment workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keywords:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EKS-Distro, EKS-Anywhere&lt;/li&gt;
&lt;li&gt;Fargate, data plane&lt;/li&gt;
&lt;li&gt;YAML, Helm, Gitops, Operators&lt;/li&gt;
&lt;li&gt;Pod, Deployment, Service, ConfigMap, Secret&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📆 Date: Tuesday, Sept. 28, 2021, at 8:30 PM UTC +4&lt;/p&gt;

&lt;p&gt;🌍 Venue: online, see all the details at &lt;a href="https://www.meetup.com/AWS-Dubai/events/280723703/"&gt;AWS UG Dubai Meetup #11: Part 4 - Application Modernization with Amazon EKS&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎥 Video recording of the meetup: &lt;a href="https://www.youtube.com/c/AWSomeMENA/videos"&gt;to be posted here on YouTube&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>Accidentally a hacker</title>
      <dc:creator>Vlad Ionescu</dc:creator>
      <pubDate>Fri, 26 Jun 2020 10:55:21 +0000</pubDate>
      <link>https://dev.to/vlaaaaaaad/accidentally-a-hacker-1p62</link>
      <guid>https://dev.to/vlaaaaaaad/accidentally-a-hacker-1p62</guid>
      <description>&lt;p&gt;Around the beginning of February 2020, GitHub Security updated their &lt;a href="https://bounty.github.com/bounty-hunters.html" rel="noopener noreferrer"&gt;public security disclosures&lt;/a&gt; and &lt;a href="https://bounty.github.com/index.html" rel="noopener noreferrer"&gt;hacker leaderboard&lt;/a&gt;, and this happened:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjb20rltsjc4xml0aha9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjb20rltsjc4xml0aha9k.png" alt="GitHub Security Leaderboards with Vlad Ionescu on the ninth position out of ten people"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Previously, on October 7, 2019, something popped up on &lt;a href="https://www.hackerone.com" rel="noopener noreferrer"&gt;HackerOne&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmw6vg0hyqnwideov6ofj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmw6vg0hyqnwideov6ofj.png" alt="A 10000 $ bounty awarded to Vlad Ionescu on HackerOne for a GitHub security report"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the story of how it all came to be. Unlikely to ever have a sequel.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vlad likes pretty
&lt;/h2&gt;

&lt;p&gt;Something most people on the internet don't know about me is that I like pretty stuff. Be it a &lt;a href="https://www.eischcrystalglass.com/shop/gentleman-whiskey-pipette-gold/" rel="noopener noreferrer"&gt;whiskey pipette&lt;/a&gt;, a &lt;a href="https://afremov.com/passion.html" rel="noopener noreferrer"&gt;lovely painting&lt;/a&gt;, many &lt;a href="https://twitter.com/dog_feelings" rel="noopener noreferrer"&gt;dog thoughts&lt;/a&gt;, a &lt;a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" rel="noopener noreferrer"&gt;ferocious lover&lt;/a&gt; — it does not matter. I like pretty!&lt;/p&gt;

&lt;p&gt;Although I generally have excellent impulse control[1], pretty is pretty and I want pretty. I want to enjoy pretty. Savor pretty. Surprisingly often, this puts me in weird situations — like accidentally hacking GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  CircleCI is not pretty
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://circleci.com" rel="noopener noreferrer"&gt;CircleCI&lt;/a&gt; is a hosted build service. It can build or compile code to ensure there are no build errors, it can run tests to ensure the correct thing happens, it can do whatever you want it to do!&lt;/p&gt;

&lt;p&gt;But CircleCI is not pretty. For some cruel reason, CircleCI forces people to log in to their UI to get to the test output. Oh, a test failed? Spend 3 minutes following redirects and logging into things. Only then do you get to see why the test failed.&lt;/p&gt;

&lt;p&gt;The typical workflow looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;do some work&lt;/li&gt;
&lt;li&gt;send work to GitHub&lt;/li&gt;
&lt;li&gt;CircleCI is notified and starts building and testing the code&lt;/li&gt;
&lt;li&gt;notification is received from GitHub and/or CircleCI&lt;/li&gt;
&lt;li&gt;go to GitHub, see that the "&lt;em&gt;Test X&lt;/em&gt;" step failed&lt;/li&gt;
&lt;li&gt;click "&lt;em&gt;Details Test X&lt;/em&gt;"&lt;/li&gt;
&lt;li&gt;click "&lt;em&gt;See more in CircleCI&lt;/em&gt;"&lt;/li&gt;
&lt;li&gt;new tab opens&lt;/li&gt;
&lt;li&gt;click "&lt;em&gt;Log In&lt;/em&gt;"&lt;/li&gt;
&lt;li&gt;go through the login process&lt;/li&gt;
&lt;li&gt;finally, see how the test failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;WHY?! I don't understand why CircleCI even has &lt;a href="https://github.com/marketplace/circleci" rel="noopener noreferrer"&gt;a GitHub integration&lt;/a&gt;. Why not have the output &lt;strong&gt;right there&lt;/strong&gt; on GitHub? Why does CircleCI insist on adding another step to the whole process?&lt;/p&gt;

&lt;p&gt;It's not even a limitation on the GitHub side: &lt;a href="https://brigade.sh" rel="noopener noreferrer"&gt;Brigade&lt;/a&gt; can do, and does, &lt;a href="https://github.com/brigadecore/brigade/pull/914/checks?check_run_id=130703731" rel="noopener noreferrer"&gt;precisely that&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;I really do not like the whole user experience around the CircleCI integration with GitHub. It was, and it still is, a constant source of annoyance for me. I mean look at it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxew07t0fi3pqdrumb4tk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxew07t0fi3pqdrumb4tk.png" alt="An image comparing CircleCI in GitHub( three builds, 1 failed, and a link below with 'See more details on CircleCI' versus a Brigade test run with the output right there"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions are pretty
&lt;/h2&gt;

&lt;p&gt;In August 2019, it happened: I finally got access to the &lt;a href="https://techcrunch.com/2018/10/16/github-launches-actions-its-workflow-automation-tool/" rel="noopener noreferrer"&gt;GitHub Actions Beta&lt;/a&gt;! A day I was very much looking forward to.&lt;/p&gt;

&lt;p&gt;I am working on a continuous challenge to be less cranky, and CircleCI was making me cranky. This was a chance for me to be happier!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/features/actions" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt; are similar to CircleCI: they can build code, they can test code, they can do whatever you want them to do!&lt;/p&gt;

&lt;p&gt;Even more, GitHub Actions can respond to more than just new code: they can run when a comment is posted, or when a change is approved! This flexibility empowers people to build even more amazing workflows.&lt;/p&gt;

&lt;p&gt;I &lt;strong&gt;immediately&lt;/strong&gt; started playing around with GitHub Actions. I &lt;strong&gt;loved&lt;/strong&gt; everything about them — they were so flexible. More importantly, the output is right there. On the same page! Look how gorgeous they are:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmd6c1eqrpsf7nln87bj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmd6c1eqrpsf7nln87bj8.png" alt="An image showing very clear and pretty logs from GitHub Actions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything was nice in the world. For a couple days.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions user experience is not pretty
&lt;/h2&gt;

&lt;p&gt;GitHub Actions were unacceptably slow due to a lack of caching, and don't even get me started on the whole slew of unexpected and random limitations.&lt;/p&gt;

&lt;p&gt;You would think GitHub Actions can respond to, say, a comment being posted on a Pull Request, but then, surprise! They can indeed run when a comment is posted, but you don't actually have access to the code in that Pull Request.&lt;/p&gt;

&lt;p&gt;I started working with GitHub Actions more and more, and I even started creating some actions. It was not pretty, but it was prettier than the CircleCI alternative.&lt;/p&gt;

&lt;p&gt;I was doing a lot of &lt;a href="https://www.terraform.io" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; at that time, and I really wanted warnings from &lt;a href="https://github.com/terraform-linters/tflint/" rel="noopener noreferrer"&gt;tflint&lt;/a&gt; on Pull Requests. Output on the separate &lt;em&gt;Checks&lt;/em&gt; tab was not enough anymore — I wanted it in the &lt;em&gt;Conversations&lt;/em&gt; tab. That is where I spend my time, so that is where everything relevant should be. I don't want to even have to press one button.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions are pretty with reviewdog
&lt;/h2&gt;

&lt;p&gt;I ended up discovering the lovely &lt;a href="https://github.com/reviewdog/reviewdog" rel="noopener noreferrer"&gt;reviewdog&lt;/a&gt;: it takes the output from any tool and sends it to GitHub!&lt;/p&gt;

&lt;p&gt;Want the output as a &lt;em&gt;Check annotation&lt;/em&gt;? It can do that. Want the output as a &lt;em&gt;Review comment&lt;/em&gt;? It can do that.&lt;/p&gt;

&lt;p&gt;Look how pretty it is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F39lheymw3l5is0r87dqm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F39lheymw3l5is0r87dqm.png" alt="Reviewdog posting a comment on a Pull Request with a golint error"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was love at first sight! I started working more and more with reviewdog, and I even ended up writing &lt;a href="https://github.com/reviewdog/action-tflint" rel="noopener noreferrer"&gt;a reviewdog tflint GitHub Action&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Life was good again!&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions are not pretty if they have unexpected limitations
&lt;/h2&gt;

&lt;p&gt;Now I get notified that things are wrong, and in the right place. But wouldn't auto-fixes be even prettier? Not fully automatic, but something like "&lt;em&gt;click a button and it's fixed&lt;/em&gt;".&lt;/p&gt;

&lt;p&gt;Since I want pretty, I started working on it! I imagined something along these lines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub Actions sees that code formatting is wrong&lt;/li&gt;
&lt;li&gt;reviewdog posts a comment letting me know&lt;/li&gt;
&lt;li&gt;when the time is right, I comment &lt;code&gt;pls fix&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;automated fixes just appear&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GitHub Actions at that time were very... particular about how they downloaded the code and what version of the code was downloaded. For the life of me, I could not get an auto-fix workflow to do what I wanted.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions are pretty with auto-fixes
&lt;/h2&gt;

&lt;p&gt;After about a week of perusing the &lt;del&gt;incomplete&lt;/del&gt; very beta documentation, I figured it out!&lt;/p&gt;

&lt;p&gt;If I post a comment on the review screen and have the auto-fixer GitHub Action respond to the &lt;code&gt;pull_request_review&lt;/code&gt; event, it all works! WOOOO! The right code is downloaded, the fixes are pushed back, everything works!&lt;/p&gt;

&lt;p&gt;Life was so pretty! It was an exemplary workflow, and it made life so much easier.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions are not pretty if they leak secrets
&lt;/h2&gt;

&lt;p&gt;I was so proud of the thing I built. I was even &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks/pull/541" rel="noopener noreferrer"&gt;telling people about it&lt;/a&gt; and &lt;a href="https://github.com/reviewdog/action-tflint/issues/2#issuecomment-536319167" rel="noopener noreferrer"&gt;insisting they use it&lt;/a&gt; because it is a better user experience.&lt;/p&gt;

&lt;p&gt;In one of the chats, &lt;a href="https://github.com/reviewdog/action-tflint/issues/2#issuecomment-536320327" rel="noopener noreferrer"&gt;it was pointed out to me&lt;/a&gt; that I was kind of breaking the security of GitHub Actions.&lt;/p&gt;

&lt;p&gt;In my mind it was just "&lt;em&gt;beware, you need to set this thing for this to work&lt;/em&gt;", not "&lt;em&gt;please create a secret so I can steal it&lt;/em&gt;".&lt;/p&gt;

&lt;p&gt;I have access to a surprising amount of GitHub Organizations and GitHub Teams, so I went forth and tested if I was actually breaking the Secret protection.&lt;/p&gt;

&lt;p&gt;In 10 minutes I had confirmed that I was indeed able to steal any secret for any public GitHub repository, without any user involvement.&lt;/p&gt;

&lt;p&gt;Oops.&lt;/p&gt;

&lt;h2&gt;
  
  
  Submitting the report
&lt;/h2&gt;

&lt;p&gt;Allow me to set the scene.&lt;/p&gt;

&lt;p&gt;It is late in the evening on September 29. I am on my couch with my laptop and a glass of &lt;a href="https://www.gitana.md/en/shop/reserve/lupi-2/" rel="noopener noreferrer"&gt;Lupi&lt;/a&gt;, a lovely blend of red wines. Rather inebriated, but happy and content. Doing stuff on GitHub.&lt;/p&gt;

&lt;p&gt;I panicked a bit. I allegedly found a security issue. I am also drunk, so I might be wrong. But I could be right. But I am a dum-dum and GitHub people are smarter than me, so there is no way. But what if?&lt;/p&gt;

&lt;p&gt;Fuck. I panicked some more.&lt;/p&gt;

&lt;p&gt;I had to report it — even if there was a small chance of it being real, there was a risk. I never had any issues or reservations about looking dumb: that's how you learn!&lt;/p&gt;

&lt;p&gt;I imagined I would send an email to something like &lt;code&gt;security@github.com&lt;/code&gt;. Looking into it, I saw that GitHub has an &lt;a href="https://help.github.com/en/github/site-policy/responsible-disclosure-of-security-vulnerabilities" rel="noopener noreferrer"&gt;open process on how to report security issues&lt;/a&gt;: they have a &lt;a href="https://www.hackerone.com" rel="noopener noreferrer"&gt;HackerOne&lt;/a&gt; account. HackerOne handles the process and GitHub responds. Nice!&lt;/p&gt;

&lt;p&gt;I quickly created an account with HackerOne, hoping that a 2-minute-old account would be allowed to send a report. Surprisingly, they allow that. Nice!&lt;/p&gt;

&lt;p&gt;Drunkenly, I wrote a report and submitted it. For your pleasure, here it is in all its glory:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Description&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Unsure how much of a security issue this is but better safe than sorry. This is my first report ever so I have no idea what I am doing. Apologies if I am wrong.&lt;/p&gt;

&lt;p&gt;GitHub Actions for the pull_request_review event run in the base repo not in the fork. But a fork can change the code for the action to leak secrets. No input is needed.&lt;/p&gt;

&lt;p&gt;This was discussed a bit on &lt;a href="https://github.com/reviewdog/action-tflint/issues/2" rel="noopener noreferrer"&gt;https://github.com/reviewdog/action-tflint/issues/2&lt;/a&gt; where I was urged to report it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps To Reproduce&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://github.com/org-name-redacted/repo-for-fork-testing/pull/9" rel="noopener noreferrer"&gt;https://github.com/org-name-redacted/repo-for-fork-testing/pull/9&lt;/a&gt; with the &lt;a href="https://github.com/org-name-redacted/repo-for-fork-testing/commit/2dc3d01c3f342d722ca0a0b0543901a15de7fe16/checks" rel="noopener noreferrer"&gt;https://github.com/org-name-redacted/repo-for-fork-testing/commit/2dc3d01c3f342d722ca0a0b0543901a15de7fe16/checks&lt;/a&gt; check, the Test step. I base64 encoded the secrets to get around the secret hiding thing.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Have a public repository with GitHub Actions access and some Secrets set for the repo&lt;/li&gt;
&lt;li&gt;Fork the repo to an account&lt;/li&gt;
&lt;li&gt;Add an action that runs for pull_request_review that tries to get the value of a secret&lt;/li&gt;
&lt;li&gt;Create PR to the public repository&lt;/li&gt;
&lt;li&gt;Leave a review comment on the created PR&lt;/li&gt;
&lt;li&gt;Go to the Actions tab where the action ran and secrets values can be found&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This does require knowing the secrets name, but I guess a brute force attack searching for secrets names is not improbable.&lt;/p&gt;

&lt;p&gt;Again, apologies if Actions is out of scope/ this issue is known.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Leaking of secrets( whose names are known) for any public repository with GitHub Actions enabled.&lt;/p&gt;


&lt;/blockquote&gt;

&lt;p&gt;I woke up the next day. A ridiculously sunny Monday. I thought about it some more, and I updated my initial report.&lt;/p&gt;

&lt;p&gt;I remember it vividly: I was in an Uber on my way to a client onsite, and I was sitting in the back trying to write a "&lt;em&gt;serious&lt;/em&gt;" and "&lt;em&gt;adult&lt;/em&gt;" report:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;I don't see a way to edit this, but I'd like to raise this to Critical. I am still new and confused, but being able to leak any secret from GitHub Actions seems huge to me.&lt;/p&gt;

&lt;p&gt;While Actions are still in preview, multiple projects use them already. This exploit has a huge impact for GitHub Actions and for any org using Actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/Homebrew/homebrew-core/blob/e9b3867c54009a63d396e6642ddeaf0063152205/.github/workflows/generate_formulae.brew.sh_data.yml#L17" rel="noopener noreferrer"&gt;Homebrew can be totally taken over&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/ruby/ruby/blob/8ba48c1b8509bc52c2fc1f020990c8a3a8eca2c9/.github/workflows/draft-release.yml#L80" rel="noopener noreferrer"&gt;Ruby's AWS Keys are easily retrievable&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pretty much every secret on GitHub can be retrieved in less than 5 minutes, with no action required from the person being attacked. A very easy way to find names &lt;a href="https://github.com/search?q=secrets+path%3A.github%2Fworkflows" rel="noopener noreferrer"&gt;is to search for secrets in .github/workflows across all GitHub&lt;/a&gt;. Once a name is found, the repo is forked, a PR is created back with the following code, a comment is left on the PR, and in less than 1 minute the secrets are out:&lt;/p&gt;


&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Leak&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request_review&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;leak&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Leak GitHub Seecrets&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Leak the secre&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;echo $VALUE | base64&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;VALUE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.SECRET_NAME_WHICH_IS_EASY_TO_GET }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Please treat this with the highest priority.&lt;/p&gt;

&lt;p&gt;Thank you,&lt;/p&gt;

&lt;p&gt;Vlad Ionescu&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It had a typo and all that, but it was a better report.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Bounties are pretty
&lt;/h2&gt;

&lt;p&gt;GitHub, to their credit, responded super-fast to the report. Keeping in mind that times are local to Bucharest and the US is 10-ish hours behind, the timeline looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dead-of-the-night Sunday, September 29: initial report&lt;/li&gt;
&lt;li&gt;Early Monday, September 30: updated report&lt;/li&gt;
&lt;li&gt;Late Monday, September 30: report triaged and GitHub started working on it&lt;/li&gt;
&lt;li&gt;October 7: $10.000 bounty awarded for the bug report [2][3]&lt;/li&gt;
&lt;li&gt;October 18: lifetime GitHub Pro subscription awarded, formal invitation into the &lt;a href="https://github.com/GitHubBounty" rel="noopener noreferrer"&gt;@GitHubBounty organization&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  GitHub Bounties spam is weird
&lt;/h2&gt;

&lt;p&gt;As soon as GitHub awarded the bounty, their HackerOne page was updated and showed that &lt;em&gt;"Vlad Ionescu reported something secret and got 10 grand from GitHub"&lt;/em&gt;. That brought more attention to me than I would've liked. It was a bit overwhelming.&lt;/p&gt;

&lt;p&gt;I started getting a bunch of messages asking me for details about the vulnerability, and a lot more asking me if I want to pair on researching bugs.&lt;/p&gt;

&lt;p&gt;On the one hand, that is very spammy. On the other hand, awww, people hack together and collaborate and that seems nice? I have doubts about how nice it actually is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The fact that a random human with zero experience can just create an account on HackerOne and report a security issue to a major company is fantastic. And they take these reports seriously! And give out rewards!&lt;/p&gt;

&lt;p&gt;Major props to &lt;a href="https://hackerone.com" rel="noopener noreferrer"&gt;HackerOne&lt;/a&gt; and the &lt;a href="https://github.com/security/team" rel="noopener noreferrer"&gt;GitHub Security Team&lt;/a&gt; for all this! They managed to create an outstanding process, and they stick to it.&lt;/p&gt;

&lt;p&gt;Also, I can now put &lt;em&gt;Security Researcher&lt;/em&gt; on my resume! Ahem, it's a resume; I have to sell myself. "&lt;em&gt;Established myself as part of the &lt;a href="https://duo.com/decipher/taking-hype-out-of-bug-bounty-programs" rel="noopener noreferrer"&gt;top 0.3 percent of hackers&lt;/a&gt;&lt;/em&gt;".&lt;/p&gt;




&lt;p&gt;[1]: Shhhhhhhh. I really do!&lt;br&gt;
[2]: I, of course, tried negotiating for a higher bounty. Well, "&lt;em&gt;asking&lt;/em&gt;" cannot really be called negotiating.&lt;br&gt;
[3]: The reward was split with the person that suggested reporting the issue to GitHub.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>security</category>
      <category>github</category>
    </item>
    <item>
      <title>Scaling containers in AWS</title>
      <dc:creator>Vlad Ionescu</dc:creator>
      <pubDate>Fri, 19 Jun 2020 15:30:41 +0000</pubDate>
      <link>https://dev.to/aws-heroes/scaling-containers-in-aws-3e82</link>
      <guid>https://dev.to/aws-heroes/scaling-containers-in-aws-3e82</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2020 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This all started with a tech curiosity: what's the fastest way to scale containers on AWS? Is ECS faster than EKS? What about Fargate? Is there a difference between Fargate on ECS and Fargate on EKS?&lt;/p&gt;

&lt;p&gt;For the fiscally responsible person in me, there is another interesting side to this. Does everybody have to bear the complexities of EKS? Where is the line? Say for a job that &lt;strong&gt;must&lt;/strong&gt; finish in less than 1 hour and that on average uses 20.000 vCPUs and 50.000 GBs, what is the cost-efficient option considering the ramp-up time?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnip0lq1nvih2xooqbhae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnip0lq1nvih2xooqbhae.png" alt="Hand-drawn graph showing EKS scaling from 0 to 3500 containers in about 5 minutes. Fargate on ECS and Fargate on EKS scale up to the same 3500 containers in about an hour. Tuned Fargate scales faster than Fargate, but slower than EKS. There is minimal variance in the results." width="800" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2020 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Tl;dr&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fargate on ECS scales up a single &lt;code&gt;Service&lt;/code&gt; at a surprisingly consistent 23 containers per minute&lt;/li&gt;
&lt;li&gt;Fargate on ECS scales up multiple &lt;code&gt;Services&lt;/code&gt; at 60 containers per minute &lt;a href="https://forums.aws.amazon.com/thread.jspa?threadID=276814" rel="noopener noreferrer"&gt;as per the default AWS Limit&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fargate on EKS scales up at 60 containers per minute, regardless of the number of &lt;code&gt;Deployments&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Fargate scales up the same way, no matter if it's running on ECS or EKS&lt;/li&gt;
&lt;li&gt;Fargate on EKS scales down &lt;a href="https://twitter.com/iamvlaaaaaaad/status/1231980069008166912" rel="noopener noreferrer"&gt;significantly faster&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fargate limits can be increased for relevant workloads, significantly improving performance&lt;/li&gt;
&lt;li&gt;Fargate starts scaling with a burst of 10 containers in the first second&lt;/li&gt;
&lt;li&gt;EKS does have a delay of 1-2 minutes before it starts scaling up. The &lt;a href="https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/" rel="noopener noreferrer"&gt;&lt;code&gt;kube-scheduler&lt;/code&gt;&lt;/a&gt; has to do some magic, &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;&lt;code&gt;cluster-autoscaler&lt;/code&gt;&lt;/a&gt; has to decide what nodes to add, and so on. Fargate starts scaling up immediately&lt;/li&gt;
&lt;li&gt;EKS scales up suuuper fast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beware, &lt;strong&gt;this benchmark is utterly useless for web workloads&lt;/strong&gt; — the focus here is on background work or batch processing. For a scaling benchmark relevant to web services, you can check out &lt;a href="https://www.youtube.com/watch?v=OdzaTbaQwTg&amp;amp;t=1366" rel="noopener noreferrer"&gt;Clare Liguori's awesome demo at re:Invent 2019&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt; &lt;br&gt;
 &lt;/p&gt;

&lt;p&gt;That's it! If you want to check out details about how I tested this, read on.&lt;/p&gt;





&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2020 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Preparation
&lt;/h2&gt;

&lt;p&gt;Before any test can be done, we have to prepare.&lt;/p&gt;
&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;p&gt;I created a completely separate AWS Account for this — my brand new "&lt;em&gt;Container scaling&lt;/em&gt;" account. Any performance tests I plan to do in the future will happen here.&lt;/p&gt;

&lt;p&gt;In this account, I submitted requests to increase several limits.&lt;/p&gt;

&lt;p&gt;By default, Fargate allows a maximum of 250 tasks to run. The &lt;em&gt;Fargate Spot Concurrent Tasks&lt;/em&gt; limit was raised to 10.000.&lt;/p&gt;

&lt;p&gt;By default, Fargate allows scaling at 1 task per second (after an initial burst of 10 tasks in the first second). After discussions with AWS and validating my workflow, this limit was raised to 3 tasks per second.&lt;/p&gt;

&lt;p&gt;By default, EKS does a great job of scaling the Kubernetes Control Plane components (really — I tested this extensively for my customers). As there will be lots of time with 0 work, and as we are not benchmarking EKS scaling, I wanted to take that variable out. AWS is happy to pre-scale clusters depending on the workload, and they did precisely that after some discussions and validation.&lt;/p&gt;

&lt;p&gt;By default, EC2 Spot allows for a maximum of 20 Spot Instances. After many talks, the &lt;em&gt;EC2 Spot Instances&lt;/em&gt; limit was raised to 250 &lt;code&gt;c5.4xlarge&lt;/code&gt; Spot Instances.&lt;/p&gt;
&lt;h3&gt;
  
  
  Test plan
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Numbers
&lt;/h4&gt;

&lt;p&gt;After an initial desired value of 30.000, and after changing my mind multiple times, it was finally decided: we will test what is the fastest option to scale to about 3.500 containers!&lt;/p&gt;

&lt;p&gt;Multiple reasons went into this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the joys of limit increases for EC2 Spot Instances on a new AWS Organization with a history of sustained usage at around $5/month&lt;/li&gt;
&lt;li&gt;I had to run these tests in my own AWS Organization — I really couldn't do this in one of my customers' accounts&lt;/li&gt;
&lt;li&gt;I really &lt;em&gt;really&lt;/em&gt; &lt;strong&gt;really&lt;/strong&gt; did not want to wait 6 hours for a single test&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Regions
&lt;/h4&gt;

&lt;p&gt;We don't want to test just in &lt;code&gt;us-east-1&lt;/code&gt; due to its... size and &lt;em&gt;particularities&lt;/em&gt;, so we should also run the tests in &lt;code&gt;eu-west-1&lt;/code&gt;, which is the largest European region.&lt;/p&gt;

&lt;p&gt;Like any European AWS user, I've had &lt;code&gt;us-east-1&lt;/code&gt; issues that &lt;strong&gt;only&lt;/strong&gt; happened during my night-time — actual US day-time. We must run the tests during US day-time, too, for full relevance.&lt;/p&gt;
&lt;h4&gt;
  
  
  Measuring
&lt;/h4&gt;

&lt;p&gt;To measure the container creation, we'll use &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html" rel="noopener noreferrer"&gt;CloudWatch Container Insights&lt;/a&gt;. It has the &lt;code&gt;RunningTaskCount&lt;/code&gt; metric for Fargate and &lt;code&gt;namespace_number_of_running_pods&lt;/code&gt; metric for Kubernetes that give us exactly what we need.&lt;/p&gt;

&lt;p&gt;To scale up, we'll just edit &lt;code&gt;Desired Tasks&lt;/code&gt; or &lt;code&gt;Replicas&lt;/code&gt; to &lt;code&gt;3.500&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;All the tests will start from a base of 1 task/pod already running.&lt;/p&gt;

&lt;p&gt;For EKS testing, the start point will be just 1 node running; cluster-autoscaler will have to add all the relevant nodes.&lt;/p&gt;
&lt;h4&gt;
  
  
  Container
&lt;/h4&gt;

&lt;p&gt;The container image used was &lt;a href="https://github.com/poc-hello-world/namer-service" rel="noopener noreferrer"&gt;poc-hello-world/namer-service&lt;/a&gt; — a straightforward app I plan to use in upcoming workshops and posts.&lt;/p&gt;

&lt;p&gt;The container size was decided to be 1 vCPU and 2 GBs of memory. Not too small, but not too large either.&lt;/p&gt;

&lt;p&gt;Both tasks and pods can run multiple containers, but for simplicity, we will use just 1 container.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2020 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Expectations
&lt;/h2&gt;

&lt;p&gt;Before starting, it is crucial to set expectations. We don't want to be surprised with a 5-digit bill, for example.&lt;/p&gt;
&lt;h3&gt;
  
  
  Expectations for Fargate on ECS
&lt;/h3&gt;

&lt;p&gt;AWS Fargate on ECS has to respect the &lt;a href="https://forums.aws.amazon.com/thread.jspa?threadID=276814" rel="noopener noreferrer"&gt;default 1 task per second launch limit&lt;/a&gt;, and so time to scale from 1 to 3.500 tasks should be around 3.500 seconds, which is about 1 hour. Reasonable.&lt;/p&gt;

&lt;p&gt;As I want to focus on realistic results for everybody, we will mostly test Fargate with the default rate limits. AWS does increase the rate limits for relevant workloads, but that is the exception and not the rule.&lt;/p&gt;

&lt;p&gt;A scaling test using the default limits would reach 10.000 Tasks in about 2:47 hours, which is indeed not that relevant. The first hour should be more than enough.&lt;/p&gt;
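&lt;p&gt;The timing arithmetic above can be sketched as a quick back-of-the-envelope model (my own sketch, assuming the initial burst of 10 tasks mentioned earlier and a constant launch rate afterwards):&lt;/p&gt;

```python
# Rough Fargate scale-up time: an initial burst of 10 tasks,
# then a constant launch rate in tasks per second.
def scale_up_seconds(target_tasks, rate_per_second=1.0, burst=10):
    return max(target_tasks - burst, 0) / rate_per_second

print(scale_up_seconds(3_500) / 60)       # ~58 minutes at the default 1 task/s
print(scale_up_seconds(10_000) / 3600)    # ~2.8 hours to 10.000 tasks
print(scale_up_seconds(3_500, 3.0) / 60)  # ~19 minutes at the raised 3 tasks/s
```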

&lt;p&gt;For Fargate, the pricing can be &lt;em&gt;kind-of-estimated&lt;/em&gt; by multiplying the number of tasks with the hourly cost.&lt;/p&gt;

&lt;p&gt;As per &lt;a href="https://aws.amazon.com/fargate/pricing/" rel="noopener noreferrer"&gt;AWS Fargate pricing page&lt;/a&gt; we get the following values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-Demand: 3.500 tasks * ($0.040480 per vCPU-hour * 1 vCPU + $0.0044450 per GB-hour * 2 GB) * 1 hour = ~$180 per test&lt;/li&gt;
&lt;li&gt;Spot: 3.500 tasks * ($0.012144 per vCPU-hour * 1 vCPU + $0.0013335 per GB-hour * 2 GB) * 1 hour = ~$60 per test&lt;/li&gt;
&lt;/ul&gt;
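&lt;p&gt;The same estimate as a tiny script (my own sketch; it uses the per-hour rates quoted above and rounds less aggressively than the bullet points do):&lt;/p&gt;

```python
# Rough per-test Fargate cost: tasks x (vCPU rate + memory rate) x hours.
# Each task is 1 vCPU and 2 GB, running for about 1 hour.
TASKS, VCPUS, GBS, HOURS = 3_500, 1, 2, 1

def fargate_test_cost(vcpu_hourly, gb_hourly):
    return TASKS * (vcpu_hourly * VCPUS + gb_hourly * GBS) * HOURS

print(round(fargate_test_cost(0.040480, 0.0044450)))  # On-Demand: ~$173
print(round(fargate_test_cost(0.012144, 0.0013335)))  # Spot: ~$52
```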

&lt;p&gt;Yes, Fargate pricing is the same in Northern Virginia and Ireland. Nice!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;THESE COSTS DO NOT INCLUDE ECR, NAT GATEWAY, AND MANY MANY OTHER COSTS!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To be safe, I'll more than double the $60 Spot cost.&lt;/p&gt;

&lt;p&gt;Final expectations for a Fargate test: $150 and about 1 hour.&lt;/p&gt;
&lt;h3&gt;
  
  
  Expectations for EKS
&lt;/h3&gt;

&lt;p&gt;AWS AutoScaling Group is responsible for adding EC2 instances, and &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;cluster-autoscaler&lt;/a&gt; is in charge of deciding the actual number of EC2s needed.&lt;/p&gt;

&lt;p&gt;It is a bit unclear how fast AWS would give us an instance. Let's go with 60 seconds.&lt;/p&gt;

&lt;p&gt;From my testing a while ago, an EC2 instance took about 21 seconds from actually starting to becoming a &lt;code&gt;Ready&lt;/code&gt; node in Kubernetes. Add say 30 seconds for image pulling and starting the container (an overestimation, but let's go wild).&lt;/p&gt;

&lt;p&gt;Now, all of this can be modeled mathematically, and we’d get a time estimate. Unfortunately, I am not that smart, so we’ll skip getting a time estimate. It would take me less time to run a test than to do the math.&lt;/p&gt;

&lt;p&gt; &lt;br&gt;
 &lt;/p&gt;

&lt;p&gt;A &lt;code&gt;c5.4xlarge&lt;/code&gt; EC2 instance has 16 vCPUs and 32 GBs of memory. Some of it is reserved for Kubernetes components, monitoring, NodeLocal DNS, and so on. Let's say a total of 2 vCPUs and 4 GBs for cluster-level pods on each node.&lt;/p&gt;

&lt;p&gt;We are left with 14 vCPUs and 28 GBs, which at our task size is 14 pods per node. At 250 nodes, that is precisely 3.500 pods. Purrfect!&lt;/p&gt;
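&lt;p&gt;The capacity math above, as a quick sketch:&lt;/p&gt;

```python
# Pods that fit on one c5.4xlarge (16 vCPUs, 32 GB) after reserving
# 2 vCPUs and 4 GB for cluster-level pods; each pod needs 1 vCPU and 2 GB.
node_vcpus, node_gbs = 16 - 2, 32 - 4
pods_per_node = min(node_vcpus // 1, node_gbs // 2)

print(pods_per_node)        # 14 pods per node
print(pods_per_node * 250)  # 3500 pods across 250 nodes
```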

&lt;p&gt;As per &lt;a href="https://aws.amazon.com/ec2/pricing/" rel="noopener noreferrer"&gt;AWS EC2 pricing page&lt;/a&gt;, we get the following values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-Demand in Northern Virginia: $0.680 per hour of &lt;code&gt;c5.4xlarge&lt;/code&gt; usage * 250 nodes = $170 per hour&lt;/li&gt;
&lt;li&gt;Spot in Northern Virginia: $0.260 per hour of &lt;code&gt;c5.4xlarge&lt;/code&gt; usage * 250 nodes = $65 per hour&lt;/li&gt;
&lt;li&gt;On-Demand in Ireland: $0.768 per hour of &lt;code&gt;c5.4xlarge&lt;/code&gt; usage * 250 nodes = $195 per hour&lt;/li&gt;
&lt;li&gt;Spot in Ireland: $0.291 per hour of &lt;code&gt;c5.4xlarge&lt;/code&gt; usage * 250 nodes = $75 per hour&lt;/li&gt;
&lt;/ul&gt;
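&lt;p&gt;As a sanity check on the list above, the hourly node cost is just the instance rate times 250 nodes (my own sketch, using the &lt;code&gt;c5.4xlarge&lt;/code&gt; rates as listed, with the Northern Virginia On-Demand rate at $0.680/hour):&lt;/p&gt;

```python
# Hourly cost of the 250-node fleet at each c5.4xlarge rate.
NODES = 250
rates = {
    "On-Demand, N. Virginia": 0.680,
    "Spot, N. Virginia": 0.260,
    "On-Demand, Ireland": 0.768,
    "Spot, Ireland": 0.291,
}
for label, hourly in rates.items():
    print(f"{label}: ~${hourly * NODES:.0f} per hour")
```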

&lt;blockquote&gt;
&lt;p&gt;THESE COSTS DO NOT INCLUDE ECR, NAT GATEWAY, AND MANY MANY OTHER COSTS!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So to be safe, I'll more than double the $75 Spot cost.&lt;/p&gt;

&lt;p&gt;Final expectations for an EKS test: $200 per hour and unknown running time.&lt;/p&gt;
&lt;h3&gt;
  
  
  Expectations for Fargate on EKS
&lt;/h3&gt;

&lt;p&gt;Pricing is the same as Fargate on ECS, so about 150$ per test.&lt;/p&gt;

&lt;p&gt;Time is the same as EKS? Is the time closer to Fargate on ECS than to EKS?&lt;/p&gt;

&lt;p&gt;I have no idea, and I really look forward to seeing what happens.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2020 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Running the tests
&lt;/h2&gt;

&lt;p&gt;I ran a total of about 10 Fargate on EKS and ECS tests — for some of them, I was still figuring out some stuff. I ran a total of 4 EKS tests.&lt;/p&gt;

&lt;p&gt;I ran the tests both during the weekend and during the week. I made sure to run the tests during day-time and night-time too.&lt;/p&gt;

&lt;p&gt;The data used for the graph on top of the page can be downloaded as a CSV &lt;a href="///posts/scaling-containers-in-aws/graph-results.csv"&gt;at this link&lt;/a&gt; (CSV, less than 1 MB).&lt;/p&gt;
&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Let's recap the results from the top of the page!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnip0lq1nvih2xooqbhae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnip0lq1nvih2xooqbhae.png" alt="Hand-drawn graph showing EKS scaling from 0 to 3500 containers in about 5 minutes. Fargate on ECS and Fargate on EKS scale up to the same 3500 containers in about an hour. Tuned Fargate scales faster than Fargate, but slower than EKS. There is minimal variance in the results." width="800" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Terraform code used for all the tests can be found on GitHub at &lt;a href="https://github.com/vlaaaaaaad/blog-scaling-containers-in-aws" rel="noopener noreferrer"&gt;Vlaaaaaaad/blog-scaling-containers-in-aws&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is not pretty or correct code at all. On the other hand, it is code that works! I firmly believe in the &lt;a href="https://plato.stanford.edu/entries/scientific-reproducibility/" rel="noopener noreferrer"&gt;reproducibility of any results&lt;/a&gt;, and I believe I have a moral duty to share everything I used to reach the conclusions presented here.&lt;/p&gt;

&lt;p&gt;Unfortunately, the costs were not fully tracked.&lt;br&gt;
Due to AWS pre-scaling the EKS Control Plane for my clusters, I had to keep them running for the whole duration. I did multiple tests a day in multiple regions, so there was no easy way to get the cost of a single test run. The estimates did help a lot, but they were there to ensure I did not end up spending tens of thousands of dollars.&lt;/p&gt;

&lt;p&gt;The total bill for the account was a little under 2.000$.&lt;/p&gt;
&lt;h3&gt;
  
  
  Fargate results
&lt;/h3&gt;

&lt;p&gt;Fargate on ECS and Fargate on EKS scaled surprisingly similarly.&lt;/p&gt;

&lt;p&gt;I expected more variance in the results, but they were almost identical when scaling up: an initial burst and then 60 containers per minute.&lt;br&gt;
Maybe for a minute it was 58, and maybe for another minute it was 62, but the variance was minimal.&lt;/p&gt;

&lt;p&gt;When running the tests, I did discover a &lt;a href="https://github.com/aws/containers-roadmap/issues/768" rel="noopener noreferrer"&gt;previously-unknown Fargate on ECS limitation&lt;/a&gt;: 1 &lt;code&gt;Service&lt;/code&gt; can have at most &lt;code&gt;1.000&lt;/code&gt; Tasks running. To reach the 3.500 running tasks, I had to create 4 separate services. That led to the Fargate on ECS tests starting with 4 containers instead of 1.&lt;/p&gt;

&lt;p&gt;Something that I did not thoroughly test was downscaling. I did notice Fargate on EKS scaled down &lt;a href="https://twitter.com/iamvlaaaaaaad/status/1231980069008166912" rel="noopener noreferrer"&gt;significantly faster&lt;/a&gt;. Fargate on ECS scaled down more slowly: from what I saw, at about the same speed as it scaled up.&lt;/p&gt;

&lt;p&gt;It turns out my cost math was flawed but landed in the right place: I totally forgot to account for downscaling!&lt;br&gt;
Scaling down takes about the same time as scaling up — I calculated the costs just for the scale-up. On the other hand, because I doubled the costs to account for unexpected things, I was in the right ballpark!&lt;/p&gt;
&lt;h3&gt;
  
  
  EKS results
&lt;/h3&gt;

&lt;p&gt;EKS results were also very similar between different runs. I tested both at night and during the day, I tested in multiple regions, and the results were almost identical.&lt;/p&gt;

&lt;p&gt;EKS being so much faster than Fargate was a bit of a surprise. While Fargate would scale in about 50 minutes, EKS was consistently done in less than 10 minutes.&lt;/p&gt;

&lt;p&gt;Since I am using EKS extensively in my work, there were no other surprises.&lt;/p&gt;

&lt;p&gt;From a cost perspective, I made the same mistake: I did not account for scaling down.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2020 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Disclaimers
&lt;/h2&gt;

&lt;p&gt;First of all, I would like to reiterate that this is utterly useless for web workloads — the focus here is on background work or batch processing. For a relevant web scaling benchmark, you can check out &lt;a href="https://www.youtube.com/watch?v=OdzaTbaQwTg&amp;amp;t=1366" rel="noopener noreferrer"&gt;Clare Liguori's awesome demo at re:Invent 2019&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As an aside, Lambda would scale up to 3.500 containers &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html" rel="noopener noreferrer"&gt;in 1.5 to 7 seconds, depending on the region&lt;/a&gt;. It's an entirely different beast, but I thought it was worth mentioning.&lt;/p&gt;

&lt;p&gt;Clearly, this was &lt;strong&gt;not a very strict or statistically relevant testing process&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I did just a few tests because I saw little variance — I did not see the point in running more.&lt;/p&gt;

&lt;p&gt;I only tested in &lt;code&gt;us-east-1&lt;/code&gt; and &lt;code&gt;eu-west-1&lt;/code&gt;, which are large regions. Numbers may or may not differ for smaller regions or during weird times — say Black Friday, Cyber Monday, EOY Report time.&lt;/p&gt;

&lt;p&gt;The container image I used was the image for &lt;a href="https://github.com/poc-hello-world/namer-service" rel="noopener noreferrer"&gt;poc-hello-world/namer-service&lt;/a&gt;, which is small and maybe not that well suited. I did not want to go down the whole "&lt;em&gt;let's optimize image pulling speed&lt;/em&gt;" rabbit hole.&lt;br&gt;
A study of all images on Dockerhub can be &lt;a href="https://ieeexplore.ieee.org/abstract/document/8891000" rel="noopener noreferrer"&gt;read here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I only ran pods and tasks with a single container. Both pods and tasks can have multiple containers running.&lt;/p&gt;

&lt;p&gt;I did not optimize the tests at all. I wanted to showcase the average experience, not a super-custom solution that would not help anybody. The values in here can likely be improved — multiple ECS clusters, multiple ASGs for EKS, and so on.&lt;/p&gt;

&lt;p&gt;ECS with EC2 was completely ignored. I cannot say there was a reason; I just did not think of it. Fargate has the cost advantage, the simplicity, and the top-notch AWS integration. EKS has the complexity, all the knobs, and all the buzzwords. Between the two, I did not consider ECS with EC2, and that is on me.&lt;/p&gt;

&lt;p&gt;Overall I am happy. After all this, we now have a ballpark figure we can use when designing systems 🙂&lt;/p&gt;
&lt;h2&gt;
  
  
  Acknowledgments
&lt;/h2&gt;

&lt;p&gt;First of all, thank you so much to Ignacio and Michael for helping me escalate all my Support tickets and for connecting me to the right people in AWS!&lt;/p&gt;

&lt;p&gt;Special thanks go out to Mats and Massimo for all their Fargate help and reviews! Your feedback was priceless!&lt;/p&gt;

&lt;p&gt;Thanks to everybody else that helped review this, gave feedback, or supported me!&lt;/p&gt;
&lt;h2&gt;
  
  
  Generating the graph
&lt;/h2&gt;

&lt;p&gt;This part took &lt;strong&gt;forever&lt;/strong&gt; to do, so I decided to add it to the post.&lt;/p&gt;

&lt;p&gt;The results are the most exciting thing about this whole post. I desperately wanted to have a pretty graph image showcasing the results.&lt;/p&gt;

&lt;p&gt;Since the testing was not statistically correct, I thought a hand-drawn graph would be perfect. It would nicely showcase the data, while at the same time hinting that experiences may vary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.apple.com/numbers/" rel="noopener noreferrer"&gt;Numbers&lt;/a&gt; and &lt;a href="https://products.office.com/en/excel" rel="noopener noreferrer"&gt;Excel&lt;/a&gt; could not easily do this. &lt;a href="https://livegap.com/charts/" rel="noopener noreferrer"&gt;LiveGap Charts&lt;/a&gt; had a bunch of hand-drawn options and was outstanding, but not exactly what I wanted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://matplotlib.org" rel="noopener noreferrer"&gt;Matplotlib&lt;/a&gt; to the rescue! After a bunch of research and about a day of playing around with it, I got a working script.&lt;/p&gt;

&lt;p&gt;It is not pretty or optimized, nor is it correct. But it works!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the required font&lt;/span&gt;
&lt;span class="c"&gt;#  BEWARE: this only works on macOS&lt;/span&gt;
&lt;span class="c"&gt;#          Linuxbrew does not install fonts&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;homebrew/cask-fonts/font-humor-sans

&lt;span class="c"&gt;# Install Python dependencies&lt;/span&gt;
pip3 &lt;span class="nb"&gt;install &lt;/span&gt;matplotlib numpy

&lt;span class="c"&gt;# Run and generate the image&lt;/span&gt;
python3 draw.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt; &lt;br&gt;
 &lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# File: draw.py
# The actual image-generation script
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;matplotlib.lines&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Line2D&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Load data from CSV
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;genfromtxt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;graph-results.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Start an XKCD graph
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xkcd&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Make the image pretty
&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_dpi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_size_inches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;7.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suptitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Scaling containers on AWS&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;@iamvlaaaaaaad&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fontsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Minutes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Containers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Colors from https://jfly.uni-koeln.de/color/
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EKS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EKS with EC2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TunedFargate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tuned Fargate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FargateOnECS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fargate on ECS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FargateOnEKS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fargate on EKS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.90&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add a legend to the graph
#  using default labels
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lower right&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;borderaxespad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Export the image
&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;savefig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;containers.svg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;savefig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;containers.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
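&lt;p&gt;For reference, &lt;code&gt;graph-results.csv&lt;/code&gt; needs a header row matching the column names the script reads (&lt;code&gt;names=True&lt;/code&gt; turns the header row into field names for &lt;code&gt;np.genfromtxt&lt;/code&gt;). A minimal file, with hypothetical values, would look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EKS,TunedFargate,FargateOnECS,FargateOnEKS
0,0,0,0
120,340,280,210
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;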



&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2021 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
