<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vlad Ionescu</title>
    <description>The latest articles on DEV Community by Vlad Ionescu (@vlaaaaaaad).</description>
    <link>https://dev.to/vlaaaaaaad</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F411056%2F73d233d5-133f-416d-8c5f-732f7be0b523.png</url>
      <title>DEV Community: Vlad Ionescu</title>
      <link>https://dev.to/vlaaaaaaad</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vlaaaaaaad"/>
    <language>en</language>
    <item>
      <title>Scaling containers on AWS in 2022</title>
      <dc:creator>Vlad Ionescu</dc:creator>
      <pubDate>Wed, 13 Apr 2022 00:00:00 +0000</pubDate>
      <link>https://dev.to/aws-heroes/scaling-containers-on-aws-in-2022-3lml</link>
      <guid>https://dev.to/aws-heroes/scaling-containers-on-aws-in-2022-3lml</guid>
      <description>&lt;p&gt;This all started with &lt;a href="https://www.vladionescu.me/posts/scaling-containers-in-aws/" rel="noopener noreferrer"&gt;a blog post back in 2020&lt;/a&gt;, from a tech curiosity: what's the fastest way to scale containers on AWS? Is ECS faster than EKS? What about Fargate? Is there a difference between ECS on Fargate and EKS on Fargate? I &lt;strong&gt;had to know&lt;/strong&gt; this to build better architectures for my clients.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;2021&lt;/strong&gt;, containers got even better, and &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;I was lucky enough to get a preview and present just how fast they got at re:Invent&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;What about &lt;strong&gt;2022&lt;/strong&gt;? What's next in the landscape of scaling containers? Did the previous trends continue? How will containers scale this year? What about Lambda? We now have the answer!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Foverview.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Foverview.min.svg" alt="Hand-drawn-style graph showing how long it takes to scale from 0 to 3500 containers: Lambda instantly spikes to 3000 and then jumps to 3500, ECS on Fargate starts scaling after 30 seconds and reaches close to 3500 around the four and a half minute mark, EKS on Fargate starts scaling after about a minute and reaches close to 3500 around the eight and a half minute mark, EKS on EC2 starts scaling after two and a half minutes and reaches 3500 around the six and a half minute mark, and ECS on EC2 starts scaling after two and a half minutes and reaches 3500 around the ten minute mark"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Hand-drawn-style graph showing how long it takes to scale from 0 to 3500 containers: Lambda instantly spikes to 3000 and then jumps to 3500, ECS on Fargate starts scaling after 30 seconds and reaches close to 3500 around the four and a half minute mark, EKS on Fargate starts scaling after about a minute and reaches close to 3500 around the eight and a half minute mark, EKS on EC2 starts scaling after two and a half minutes and reaches 3500 around the six and a half minute mark, and ECS on EC2 starts scaling after two and a half minutes and reaches 3500 around the ten minute mark&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fargate is now faster than EC2&lt;/li&gt;
&lt;li&gt;ECS on Fargate improved so much and is the perfect example of why offloading engineering effort to AWS is a good idea&lt;/li&gt;
&lt;li&gt;ECS on Fargate using Windows containers is surprisingly fast&lt;/li&gt;
&lt;li&gt;App Runner is on the way to becoming a fantastic service&lt;/li&gt;
&lt;li&gt;Up to a point, EKS on Fargate is faster than EKS on EC2&lt;/li&gt;
&lt;li&gt;EKS on EC2 scales faster when using &lt;a href="https://karpenter.sh" rel="noopener noreferrer"&gt;karpenter&lt;/a&gt; rather than &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;cluster-autoscaler&lt;/a&gt;, even in the worst possible scenario&lt;/li&gt;
&lt;li&gt;EKS on EC2 is a tiny bit faster when using IPv6&lt;/li&gt;
&lt;li&gt;Lambda with increased limits scales &lt;em&gt;ridiculously&lt;/em&gt; fast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beware: &lt;strong&gt;this benchmark is extremely specific and meant to provide a &lt;em&gt;frame of reference&lt;/em&gt;, not completely accurate results&lt;/strong&gt; — the focus here is on making informed architectural decisions, not on squeezing out the most performance!&lt;/p&gt;

&lt;p&gt;That's it! If you want to get more insights or if you want details about how I tested all this, continue reading on &lt;a href="https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/" rel="noopener noreferrer"&gt;https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Friendly warning: the &lt;strong&gt;estimated reading time for this blog post is about 45 minutes&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;Might I suggest getting comfortable? A cup of tea or water goes great with container scaling.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Preparation&lt;/h2&gt;

&lt;p&gt;Before any testing can be done, we have to set up.&lt;/p&gt;

&lt;h3&gt;Limit increases&lt;/h3&gt;

&lt;p&gt;First up, we will reuse the same dedicated &lt;strong&gt;AWS Account&lt;/strong&gt; we used in 2020 and 2021 — my "&lt;em&gt;Container scaling&lt;/em&gt;" account.&lt;/p&gt;

&lt;p&gt;To be able to create non-trivial amounts of resources, we have to &lt;a href="https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html" rel="noopener noreferrer"&gt;increase a couple of &lt;strong&gt;AWS quotas&lt;/strong&gt;&lt;/a&gt;.&lt;br&gt;
I do not want to showcase exotic levels of performance that only the top 1% of the top 1% of AWS customers can achieve. At the same time, we can't look at out-of-the-box performance since AWS accounts have safeguards in place. The goal of this testing is to see what "ordinary" performance levels all of us can get, and for that we need some quota increases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  by default, one can run a maximum of 1 000 &lt;strong&gt;&lt;em&gt;Fargate Concurrent Tasks&lt;/em&gt;&lt;/strong&gt;. We're scaling to more than that, so the limit was increased to 10 000&lt;/li&gt;
&lt;li&gt;  by default, one can run at most 5 EC2 Spot Instances. That is not enough for our testing, and, after chatting with AWS Support, the &lt;strong&gt;&lt;em&gt;EC2 Spot Instances&lt;/em&gt;&lt;/strong&gt; limit was raised to 4 500 vCPUs, which is about 280 EC2 Spot instances&lt;/li&gt;
&lt;li&gt;  by default, EKS does a fantastic job of scaling the Kubernetes Control Plane components (really — I tested this extensively with my customers). That said, our test clusters will be spending a lot of time idle, with zero containers running. We are not benchmarking EKS Control Plane scaling, and I'd rather eliminate this variable. AWS is happy to pre-scale clusters depending on the workload, and they did precisely that after some discussions and validation of my workload: the &lt;strong&gt;Kubernetes Control Plane was pre-scaled&lt;/strong&gt; for all our EKS clusters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it — &lt;strong&gt;these are not performance quota increases, but capacity quota increases&lt;/strong&gt;! To scale a bunch, we need to be able to create a bunch of instances. These are quotas that everybody should be able to get without too much work. Unless explicitly stated, these are all the quota increases I got.&lt;/p&gt;

&lt;h3&gt;Testing setup&lt;/h3&gt;

&lt;p&gt;Based on the previous tests in 2020 and 2021, we will &lt;strong&gt;scale&lt;/strong&gt; up to 3 500 containers, as fast as possible: we'll force scaling by manually changing the desired number of containers from 1 to 3 500.&lt;br&gt;
I think this keeps a good balance: we're not scaling to a small number, but we're not wasting resources scaling to millions of containers. By forcing the scaling, we're eliminating a lot of complexity: there is a ridiculous number of ways to scale containers! We're avoiding some deep rabbit holes: no optimizing CloudWatch or AWS Autoscaling intervals and reaction times, no stressing about the granularity at which the application exposes metrics, no optimizing the app → Prometheus Push Gateway → Prometheus → Metrics Server flow. For an application to actually scale up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  we first must detect that scaling is required (that can happen with a multitude of tools, from CloudWatch Alarms to KEDA to custom application events)&lt;/li&gt;
&lt;li&gt;  we must then decide how much to scale (which can again be complex logic — is it a defined step or is it dynamic?)&lt;/li&gt;
&lt;li&gt;  we then have to actually scale up (how is new capacity added and how does it join the cluster?)&lt;/li&gt;
&lt;li&gt;  and finally we have to gracefully utilize that capacity (how are new instances connected to load balancers, and how do they impact the scaling metrics?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This whole process can be very complex, and it is often application-specific and company-specific. We want results that are relevant to everybody, so we will ignore all this and focus on how quickly we can get new capacity from AWS.&lt;/p&gt;
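
&lt;p&gt;As a rough illustration of what "forcing the scaling" means — setting the desired count directly instead of waiting for metric-driven autoscaling — here is a small Python sketch. The deployment, namespace, and ECS cluster/service names are hypothetical placeholders, and the actual benchmark drives this through its own tooling:&lt;/p&gt;

```python
# Sketch: forcing a scale-up by setting the desired container count directly,
# skipping metric-driven autoscaling entirely. The deployment, namespace, and
# ECS cluster/service names below are hypothetical placeholders.
import subprocess


def kubectl_scale_cmd(deployment, replicas, namespace="default"):
    """Build the kubectl command that forces a Deployment to a given size."""
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "--namespace", namespace,
    ]


def ecs_update_service_args(cluster, service, desired):
    """Arguments for boto3's ecs.update_service() call, the ECS equivalent."""
    return {"cluster": cluster, "service": service, "desiredCount": desired}


if __name__ == "__main__":
    # 1 -> 3500, like the benchmark's forced scale-up
    cmd = kubectl_scale_cmd("web", 3500, namespace="scaling-test")
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # needs a configured cluster
```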

&lt;p&gt;We will run all the tests in AWS' North Virginia &lt;strong&gt;region&lt;/strong&gt; (&lt;code&gt;us-east-1&lt;/code&gt;), as we've seen in previous years that different AWS regions have the same performance levels. No need to also run the tests in South America, Europe, the Middle East, Africa, and Asia Pacific.&lt;/p&gt;

&lt;p&gt;In terms of &lt;strong&gt;networking&lt;/strong&gt;, each test that requires a networking setup will use a dedicated VPC spanning four availability zones (&lt;code&gt;use1-az1&lt;/code&gt;, &lt;code&gt;use1-az4&lt;/code&gt;, &lt;code&gt;use1-az5&lt;/code&gt;, and &lt;code&gt;use1-az6&lt;/code&gt;), and all containers will be created across four private &lt;code&gt;/16&lt;/code&gt; subnets. For increased performance and lower cost, each VPC will use three &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints.html" rel="noopener noreferrer"&gt;VPC Endpoints&lt;/a&gt;: &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html" rel="noopener noreferrer"&gt;S3 Gateway&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html" rel="noopener noreferrer"&gt;ECR API, and ECR Docker API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In terms of &lt;strong&gt;servers&lt;/strong&gt;, for the tests that require servers, we will use the latest and the greatest: AWS' &lt;code&gt;c6g.4xlarge&lt;/code&gt; EC2 instances.&lt;br&gt;
In the previous years, we used &lt;code&gt;c5.4xlarge&lt;/code&gt; instances, but the landscape has evolved since then. Ideally, we'd like to keep the same server size to accurately compare results across 2020, 2021, and 2022.&lt;br&gt;
This year we have two options: &lt;code&gt;c6i.4xlarge&lt;/code&gt;, which are Intel-based servers with 32 GB of memory and 16 vCPUs, or &lt;code&gt;c6g.4xlarge&lt;/code&gt;, which are ARM-based servers using AWS' Graviton 2 processors with 32 GB of memory and 16 vCPUs. AWS also announced the next-generation Graviton 3 processors and &lt;code&gt;c7g&lt;/code&gt; servers, but those are only available in limited preview. Seeing how &lt;em&gt;16 Graviton vCPUs&lt;/em&gt; are both faster and cheaper than &lt;em&gt;16 Intel vCPUs&lt;/em&gt;, &lt;code&gt;c6g.4xlarge&lt;/code&gt; sounds like the best option, so that's what we will use for our EC2 instances. To further optimize our costs, we will use &lt;a href="https://aws.amazon.com/ec2/spot/" rel="noopener noreferrer"&gt;EC2 Spot Instances&lt;/a&gt;, which are up to 90% cheaper than On-Demand EC2 instances. More things have to happen on the AWS side when a Spot Instance is requested, but the scaling impact should not be significant and the cost savings are a big draw.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;operating system&lt;/strong&gt; landscape has evolved too. In previous years we used the default &lt;em&gt;Amazon Linux 2&lt;/em&gt; operating system, but in &lt;a href="https://aws.amazon.com/about-aws/whats-new/2020/08/announcing-general-availability-of-bottlerocket/" rel="noopener noreferrer"&gt;late 2020, Amazon launched an open-source operating system focused on containers: Bottlerocket OS&lt;/a&gt;. Since then, Bottlerocket matured and grew into an awesome operating system! Seeing how Bottlerocket OS is optimized for containers, we'll run &lt;a href="https://github.com/bottlerocket-os/bottlerocket" rel="noopener noreferrer"&gt;Bottlerocket&lt;/a&gt; as the operating system on all our EC2 servers.&lt;/p&gt;

&lt;p&gt;How should we &lt;strong&gt;measure&lt;/strong&gt; how fast containers scale?&lt;br&gt;
In past years, we used &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html" rel="noopener noreferrer"&gt;CloudWatch Container Insights&lt;/a&gt;, but that won't really work this year: the best metrics we can get are minute-level metrics. Last year we had services scale from 1 to 3 500 in a couple of minutes, and with minute-level data we won't get proper insights, will we?&lt;br&gt;
To get the most relevant results, I decided we should move the measurement directly into the container: the application running in the container should record its own start time! That gives us the best possible metric: we will know exactly when each container started.&lt;br&gt;
In past years, we used &lt;a href="https://github.com/poc-hello-world/namer-service" rel="noopener noreferrer"&gt;the poc-hello-world/namer-service application&lt;/a&gt;: a &lt;strong&gt;small web application&lt;/strong&gt; that returns &lt;code&gt;hello world&lt;/code&gt;. For this year, I extended the code a bit based on this idea: as soon as the application starts, it records its start time! It then does the normal web stuff — configuring a couple of web routes using the &lt;a href="https://flask.palletsprojects.com/" rel="noopener noreferrer"&gt;Flask micro-framework&lt;/a&gt;.&lt;br&gt;
Besides the timestamp, the application also records details about the container (name, unique id, and whatever else was easily available) and sends all this data synchronously to &lt;a href="https://www.honeycomb.io" rel="noopener noreferrer"&gt;Honeycomb&lt;/a&gt; and asynchronously to &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html" rel="noopener noreferrer"&gt;CloudWatch Logs&lt;/a&gt; — by using two providers, we are protected against failures or errors. We'll use Honeycomb as a live UI with proper observability that allows us to explore, and CloudWatch Logs as the definitive source of truth.&lt;/p&gt;
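
&lt;p&gt;The measurement idea can be sketched in a few lines of Python: capture a timestamp at import time, before any web-framework setup, and emit it once. This is a simplified stand-in for the real app — the field names are illustrative, and the Honeycomb and CloudWatch Logs shipping is omitted:&lt;/p&gt;

```python
# Simplified stand-in for the test application's measurement logic: record the
# start timestamp at import time, before any web-framework setup. Field names
# are illustrative; shipping to Honeycomb/CloudWatch Logs is omitted.
import json
import os
import time

STARTED_AT = time.time()  # captured as early as possible in the process


def startup_event():
    """Build the event each container emits exactly once, right after starting."""
    return {
        "started_at": STARTED_AT,          # the metric we actually care about
        "hostname": os.uname().nodename,   # container/pod name in practice
        "pid": os.getpid(),
    }


if __name__ == "__main__":
    print(json.dumps(startup_event()))
    # ...then configure the Flask routes and serve "hello world" as usual
```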

&lt;p&gt;Now that we have the application defined, we need to put it in a &lt;strong&gt;container&lt;/strong&gt;! I built the app using GitHub Actions and &lt;a href="https://github.com/docker/build-push-action" rel="noopener noreferrer"&gt;Docker's &lt;code&gt;build-push-action&lt;/code&gt;&lt;/a&gt;, which produced a 380 MB multi-arch (Intel and ARM) container image. The image was then pushed to &lt;a href="https://aws.amazon.com/ecr/" rel="noopener noreferrer"&gt;AWS' Elastic Container Registry (ECR)&lt;/a&gt;, which is where all our tests will download it from.&lt;/p&gt;
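
&lt;p&gt;A multi-arch build-and-push workflow along these lines might look as follows — this is a sketch, not the repository's actual workflow, and the credentials setup and image name are placeholders:&lt;/p&gt;

```yaml
# Sketch of a multi-arch build-and-push workflow; secrets and the image
# name are placeholders, not the benchmark repository's actual pipeline.
name: build-image
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: docker/setup-qemu-action@v1    # emulation for the ARM build
      - uses: docker/setup-buildx-action@v1
      - uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - uses: aws-actions/amazon-ecr-login@v1
        id: ecr
      - uses: docker/build-push-action@v2
        with:
          platforms: linux/amd64,linux/arm64   # Intel and ARM in one image
          push: true
          tags: ${{ steps.ecr.outputs.registry }}/namer-service:latest
```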

&lt;p&gt;Keeping in line with the "&lt;em&gt;default setup&lt;/em&gt;" we are doing, the &lt;strong&gt;container image&lt;/strong&gt; is not at all optimized!&lt;br&gt;
There are many ways to optimize image sizes, many ways to optimize image pulling, and many ways to optimize image use — we would never get to testing if we kept optimizing. Again, this testing is meant to provide a frame of reference, not to showcase the highest levels of performance.&lt;br&gt;
For latency-sensitive workloads, people are optimizing image sizes, reusing layers, using custom schedulers, using base server images (AMIs) that have large layers or even full images already cached, and much more. It's not only that: container runtimes &lt;a href="https://events19.linuxfoundation.org/wp-content/uploads/2017/11/How-Container-Runtime-Matters-in-Kubernetes_-OSS-Kunal-Kushwaha.pdf" rel="noopener noreferrer"&gt;matter too&lt;/a&gt;, and &lt;a href="https://www.scitepress.org/Papers/2020/93404/93404.pdf" rel="noopener noreferrer"&gt;performance can differ between, say, Docker and containerd&lt;/a&gt;. And it's not only image sizes and runtimes, it's also container setup: a container without any storage differs from a container with a 2 TB RAM disk, which differs from a container with an EBS volume attached, which differs from a container with an EFS volume attached.&lt;br&gt;
We are testing to get an idea of how fast containers scale, and a lightweight but not minimal container is good enough. Performance in real life will vary.&lt;/p&gt;

&lt;p&gt;That's all in terms of common setup used by all the tests in our benchmark! With the base defined, let's get into the setup required for each container service and the results we got!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally posted at &lt;a href="https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/" rel="noopener noreferrer"&gt;https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/&lt;/a&gt; and the dev.to version may contain errors and less-than-ideal presentation&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Kubernetes scaling&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt; — the famous container orchestrator.&lt;br&gt;
Kubernetes has its components divided into two sections: the Control Plane and the Worker Plane. The Control Plane is like a brain: it decides where containers run, what happens to them, and it talks to us. The Worker Plane is made up of the servers on which our containers actually run.&lt;/p&gt;

&lt;p&gt;We could run our own &lt;strong&gt;Kubernetes Control Plane&lt;/strong&gt;, with tools like &lt;a href="https://github.com/kubernetes/kops/" rel="noopener noreferrer"&gt;kops&lt;/a&gt;, &lt;a href="https://github.com/kubernetes/kubeadm" rel="noopener noreferrer"&gt;kubeadm&lt;/a&gt;, or the newer &lt;a href="https://github.com/kubernetes-sigs/cluster-api" rel="noopener noreferrer"&gt;cluster-api&lt;/a&gt;, and that would allow us to optimize each component to the extreme. Or, we could let AWS handle that for us through a managed service: &lt;a href="https://aws.amazon.com/eks/" rel="noopener noreferrer"&gt;Amazon Elastic Kubernetes Service, or &lt;strong&gt;EKS&lt;/strong&gt; for short&lt;/a&gt;. By using EKS, we can let AWS stress about scaling and optimizations and we can focus on our applications. Few companies manage their own Kubernetes Control Planes now, and AWS offers enough customization for our use-case, so we'll use EKS!&lt;/p&gt;

&lt;p&gt;For the &lt;strong&gt;Worker Plane&lt;/strong&gt;, we have a lot more options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  self-managed workers. These are EC2 instances: we manage them, we configure them, we update them, we do everything&lt;/li&gt;
&lt;li&gt;  AWS-managed workers. If we want to stress a bit less, we can take advantage of &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html" rel="noopener noreferrer"&gt;EKS Managed Node Groups&lt;/a&gt; where we still have EC2 instances in our AWS account, but AWS handles the lifecycle management of those EC2s&lt;/li&gt;
&lt;li&gt;  serverless workers. If we want the least amount of stress, like not even caring about configuration or the operating system of servers and patches, we can use serverless workers through &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/fargate.html" rel="noopener noreferrer"&gt;EKS on Fargate&lt;/a&gt;: we give AWS a container and we tell them to run it using EKS on Fargate — AWS will run it on a server they manage, update, and control&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;fancy&lt;/em&gt; 5G workers through &lt;a href="https://aws.amazon.com/wavelength/" rel="noopener noreferrer"&gt;AWS Wavelength&lt;/a&gt;. These are also EC2 instances, but hosted by 5G networking providers for super-low latency&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;extended-region&lt;/em&gt; workers through &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/local-zones.html" rel="noopener noreferrer"&gt;AWS Local Zones&lt;/a&gt;. These are still EC2 instances, but hosted in popular cities for lower latency&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;your own&lt;/em&gt; workers running on a big AWS-managed server that you buy from AWS and install in your own datacenter through &lt;a href="https://aws.amazon.com/outposts/" rel="noopener noreferrer"&gt;AWS Outposts&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For our testing, we'll ignore the &lt;em&gt;less common&lt;/em&gt; options, and we'll use both EC2 workers managed through EKS Managed Node Groups and AWS-managed serverless workers through EKS on Fargate.&lt;/p&gt;

&lt;p&gt;To get &lt;strong&gt;visibility&lt;/strong&gt; into what is happening on the clusters, we will install two helper tools on the cluster: &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html" rel="noopener noreferrer"&gt;CloudWatch Container Insights&lt;/a&gt; for metrics and logs, and &lt;a href="https://github.com/AliyunContainerService/kube-eventer" rel="noopener noreferrer"&gt;kube-eventer&lt;/a&gt; for event insights.&lt;/p&gt;

&lt;p&gt;Our &lt;strong&gt;application container&lt;/strong&gt; will use its own AWS IAM Role through &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html" rel="noopener noreferrer"&gt;IAM Roles for Service Accounts&lt;/a&gt;. When using EC2 workers, we will use the underlying node's Security Group and not &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html" rel="noopener noreferrer"&gt;a dedicated security group for the pod&lt;/a&gt; due to the impact that has on scaling — a lot more networking setup has to happen and that slows things down. When using Fargate workers there is no performance impact, so we will configure a &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html" rel="noopener noreferrer"&gt;per-pod security group&lt;/a&gt; in that case.&lt;br&gt;
Containers are not run by themselves — what happens if the container has an error, restarts, or needs an update? — but as part of larger concepts. In the ECS world we have &lt;code&gt;Services&lt;/code&gt; and in the Kubernetes world we have &lt;code&gt;Deployments&lt;/code&gt;, but both do the same work: they manage containers. For example, a &lt;code&gt;Deployment&lt;/code&gt; or a &lt;code&gt;Service&lt;/code&gt; of 30 containers will constantly try to make sure 30 containers are running. If a container has an error, it is restarted. If a container dies, it is replaced by a new container. If the application has to be updated, the &lt;code&gt;Deployment&lt;/code&gt; will handle the complex logic around replacing each of the 30 running containers with new and updated containers, and so on. In 2021, we saw that multiple &lt;code&gt;Deployments&lt;/code&gt; or multiple &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/fargate-profile.html" rel="noopener noreferrer"&gt;Fargate Profiles&lt;/a&gt; have no impact on the scaling speed, so our application's containers will be part of a single &lt;code&gt;Deployment&lt;/code&gt;, using a single Fargate Profile. The Kubernetes &lt;code&gt;Pod&lt;/code&gt; will use 1 vCPU and 2 GB of memory and will have a single container: our test application.&lt;/p&gt;
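
&lt;p&gt;Put together, the single &lt;code&gt;Deployment&lt;/code&gt; could be sketched like this — names, the image URI, and the service account are placeholders, not the benchmark's actual manifest:&lt;/p&gt;

```yaml
# Sketch of the single Deployment used for the test; names, the image URI,
# and the service account are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: namer-service
spec:
  replicas: 1            # manually bumped to 3500 to force the scale-up
  selector:
    matchLabels:
      app: namer-service
  template:
    metadata:
      labels:
        app: namer-service
    spec:
      serviceAccountName: namer-service  # bound to an IAM Role via IRSA
      containers:
        - name: app
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/namer-service:latest
          resources:
            requests:
              cpu: "1"       # 1 vCPU per Pod
              memory: 2Gi    # 2 GB per Pod
            limits:
              cpu: "1"
              memory: 2Gi
```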

&lt;p&gt;There are multiple other configuration options, and if you want to see the full configuration used, you can check out the Terraform infrastructure code in the &lt;a href="https://github.com/vlaaaaaaad/blog-scaling-containers-on-aws-in-2022" rel="noopener noreferrer"&gt;&lt;code&gt;eks-*&lt;/code&gt; folders in &lt;code&gt;vlaaaaaaad/blog-scaling-containers-on-aws-in-2022&lt;/code&gt; on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, with the setup covered, let's get to testing! I ran all the tests between December 2021 and February 2022, using all the latest versions available at the time of testing.&lt;/p&gt;

&lt;p&gt;In terms of yearly evolution, &lt;strong&gt;EKS on EC2&lt;/strong&gt; remained pretty constant — in the graph below there are actually three different lines for three different years, all overlapping:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-ec2-yearly.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-ec2-yearly.min.svg" alt="Hand-drawn-style graph showing yearly evolution for EKS: the 2020, 2021, and 2022 lines are all overlapping each other. Scaling starts after the two minute mark and reaches 3500 around the seven minute mark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing yearly evolution for EKS: the 2020, 2021, and 2022 lines are all overlapping each other. Scaling starts after the two minute mark and reaches 3500 around the seven minute mark&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But that is not the whole story! For 2022, we get a couple of new options when using EC2 workers: an &lt;strong&gt;alternative scaler&lt;/strong&gt; and &lt;strong&gt;IPv6 support&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Until late 2021, the default way of scaling EC2 nodes was &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;cluster-autoscaler&lt;/a&gt; — there were other scalers too, but nothing as popular or as adopted. &lt;strong&gt;Cluster Autoscaler&lt;/strong&gt; is an official Kubernetes project, with support for scaling worker nodes on a bunch of providers — from the classic big 3 clouds all the way to Hetzner or Equinix Metal. Cluster Autoscaler works on homogeneous node groups: we have an AutoScaling Group that has instances of the exact same size and type, and Cluster Autoscaler sets the number of desired instances. If new instances are needed, the number is increased. If instances are unused, servers are cleanly drained and the number is decreased. For our benchmark, we're using a single AutoScaling Group with &lt;code&gt;c6g.4xlarge&lt;/code&gt; instances.&lt;/p&gt;

&lt;p&gt;For folks running Kubernetes on AWS, Cluster Autoscaler is great, but not perfect.&lt;br&gt;
Cluster Autoscaler does its own node management, lifecycle management, and scaling, which means a lot of EC2 AutoScaling Group (ASG) features are disabled. For example, ASGs have support for launching different instance sizes as part of the same group: if one instance with 16 vCPUs is not available, ASGs can figure out that two instances with 8 vCPUs are available, and launch those to satisfy the requested number. That feature and many others (predictive scaling, node refreshes, rebalancing) have to be disabled.&lt;br&gt;
Cluster Autoscaler also knows little about AWS and what happens with EC2 nodes — it configures the desired number of instances, and then waits to see if &lt;code&gt;desired_number == running_number&lt;/code&gt;. In cases of EC2 Spot exhaustion, it takes a while for Cluster Autoscaler to figure out what happened, for example.&lt;br&gt;
By design, and by having to support a multitude of infrastructure providers, Cluster Autoscaler works close to the lowest common denominator: barebones node groups. This approach also forces architectural decisions: by only supporting homogeneous node groups, a bunch of node groups are defined and applications are allocated to one of those node groups. Does your team have an application with custom needs? Too bad: it needs to fit in a pre-existing node group, or it has to justify the creation of a new node group.&lt;/p&gt;

&lt;p&gt;To offer an alternative, &lt;a href="https://karpenter.sh" rel="noopener noreferrer"&gt;&lt;strong&gt;Karpenter&lt;/strong&gt;&lt;/a&gt; was built and it &lt;a href="https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/" rel="noopener noreferrer"&gt;was released in late 2021&lt;/a&gt;. Karpenter, instead of working with AutoScaling Groups that must have servers of the same size (same RAM and CPU), directly calls the EC2 APIs to launch or remove nodes. Karpenter does not use AutoScaling Groups at all — it manages EC2 instances directly! This is a fundamental shift: there is no longer a need to think about server groups! Each application's needs can be considered individually.&lt;br&gt;
Karpenter will look at what containers have to run and at what EC2 instances AWS offers, and make the best decision on what server to launch. Karpenter will then manage the server for its full lifecycle, with full knowledge of what AWS does and what happens with each server. All this detailed information gives Karpenter an advantage in resource-constrained or complex environments — say, not enough EC2 Spot capacity, or hardware requirements like GPUs or specific types of processors: Karpenter knows why an EC2 instance could not be launched and does not have to wait a while to confirm that indeed &lt;code&gt;desired_number == running_number&lt;/code&gt;. That sounds great, and it sounds like it will impact scaling speeds!&lt;/p&gt;

&lt;p&gt;That said, how should we compare Cluster Autoscaler with Karpenter? Given free rein to scale from 1 container to 3 500 containers, Karpenter will choose the best option: launching the biggest instances possible. While that may be a fair way to compare them, the results would not be directly comparable.&lt;br&gt;
I decided to compare Cluster Autoscaler (with one ASG of &lt;code&gt;c6g.4xlarge&lt;/code&gt;) with Karpenter also limited to launching just &lt;code&gt;c6g.4xlarge&lt;/code&gt; instances. This is the worst possible case for Karpenter and the absolute best case for Cluster Autoscaler, but it should give us enough information about how the two compare.&lt;/p&gt;
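
&lt;p&gt;Restricting Karpenter to a single instance type is done through a requirement in its &lt;code&gt;Provisioner&lt;/code&gt;. A sketch, using the &lt;code&gt;v1alpha5&lt;/code&gt; API that was current at the time and placeholder names — not the benchmark's actual configuration:&lt;/p&gt;

```yaml
# Sketch of a Provisioner pinned to a single instance type, matching the
# "worst case for Karpenter" setup; names and values are placeholders, and
# the provider/subnet configuration is omitted for brevity.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: benchmark
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["c6g.4xlarge"]   # the restriction to one instance type
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]          # EC2 Spot, like the rest of the benchmark
  limits:
    resources:
      cpu: "4500"               # mirrors the EC2 Spot vCPU quota
```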

&lt;p&gt;Surprisingly, even in this worst possible case, EKS on EC2 using Karpenter is faster than EKS on EC2 using Cluster Autoscaler:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-ec2-ipv4.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-ec2-ipv4.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. EKS on EC2 with cluster-autoscaler starts around the two and a half minute mark, reaches 3000 containers around the six and a half minute mark, and 3500 containers around the seven minute mark. EKS on EC2 with Karpenter is faster: it starts around the same two and a half minute mark, but reaches 3000 containers a minute earlier, around the five and a half minute mark, and reaches 3500 containers around the same seven minute mark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. EKS on EC2 with cluster-autoscaler starts around the two and a half minute mark, reaches 3000 containers around the six and a half minute mark, and 3500 containers around the seven minute mark. EKS on EC2 with Karpenter is faster: it starts around the same two and a half minute mark, but reaches 3000 containers a minute earlier, around the five and a half minute mark, and reaches 3500 containers around the same seven minute mark&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Another enhancement we got this year is the &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/cni-ipv6.html" rel="noopener noreferrer"&gt;&lt;strong&gt;support for IPv6&lt;/strong&gt;, released in early 2022&lt;/a&gt;. Existing clusters cannot be migrated, which means that for our testing we have to create a new IPv6 EKS cluster in an IPv6 VPC. You can see the full code used in &lt;a href="https://github.com/vlaaaaaaad/blog-scaling-containers-on-aws-in-2022" rel="noopener noreferrer"&gt;the &lt;code&gt;eks-on-ec2-ipv6&lt;/code&gt; folder in the &lt;code&gt;vlaaaaaaad/blog-scaling-containers-on-aws-in-2022&lt;/code&gt; repository on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As AWS pointed out in their &lt;a href="https://aws.amazon.com/blogs/containers/amazon-eks-launches-ipv6-support/" rel="noopener noreferrer"&gt;announcement blog post&lt;/a&gt;, IPv6 reduces the work that the EKS network plugin (&lt;a href="https://github.com/aws/amazon-vpc-cni-k8s" rel="noopener noreferrer"&gt;amazon-vpc-cni-k8s&lt;/a&gt;) has to do, giving us a nice bump in scaling speed, for both Cluster Autoscaler and Karpenter:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-ec2-ipv6.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-ec2-ipv6.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. There are four lines, in two clusters. The first cluster is EKS on EC2 using Karpenter where we can see IPv6 is around 30 seconds faster than IPv4. The second cluster is EKS on EC2 using cluster-autoscaler where IPv6 is again faster than IPv4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. There are four lines, in two clusters. The first cluster is EKS on EC2 using Karpenter where we can see IPv6 is around 30 seconds faster than IPv4. The second cluster is EKS on EC2 using cluster-autoscaler where IPv6 is again faster than IPv4&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I envision people will start migrating towards IPv6 and Karpenter, but that will be a slow migration: both are fundamental changes! Migrating from IPv4 to IPv6 requires a complete networking revamp, with multiple components and integrations affected. Migrating from Cluster Autoscaler to Karpenter is easier, as it can be done gradually and in place (workloads that fit Karpenter-only clusters are rare), but taking full advantage of Karpenter requires deeply understanding what resources applications need: no more just putting applications in groups.&lt;/p&gt;

&lt;p&gt;Now, let's move to &lt;strong&gt;serverless Kubernetes&lt;/strong&gt; workers!&lt;br&gt;
As mentioned above, Fargate is an alternative to EC2s: no more stressing about servers, operating systems, patches, and updates. We only have to care about the container image!&lt;/p&gt;

&lt;p&gt;This year, &lt;strong&gt;EKS on Fargate&lt;/strong&gt; is neck-and-neck with EKS on EC2 in terms of scaling:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-fargate.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-fargate.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. EKS on Fargate starts around the one-minute mark and reaches close to 3500 containers around the eight minute mark. EKS on EC2 with Karpenter and EKS on EC2 with cluster-autoscaler both start around the two minute mark, and reach 3500 containers around the seven minute mark, but EKS on EC2 with Karpenter scales faster initially"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. EKS on Fargate starts around the one-minute mark and reaches close to 3500 containers around the eight minute mark. EKS on EC2 with Karpenter and EKS on EC2 with cluster-autoscaler both start around the two minute mark, and reach 3500 containers around the seven minute mark, but EKS on EC2 with Karpenter scales faster initially&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The graph above does not paint the full picture, though. If we look at the yearly evolution of EKS on Fargate, we see how much AWS improved it, without any effort required from users:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-fargate-yearly.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-on-fargate-yearly.min.svg" alt="Hand-drawn-style graph showing yearly evolution for EKS on Fargate: in 2020 it took about 55 minutes to reach 3500 containers. In 2021, it takes around 20 minutes, and in 2022 it takes a little over 8 minutes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing yearly evolution for EKS on Fargate: in 2020 it took about 55 minutes to reach 3500 containers. In 2021, it takes around 20 minutes, and in 2022 it takes a little over 8 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Massive improvements! If we built and ran an application using EKS on Fargate in 2020, we would have scaled to 3 500 containers in about an hour. Without any effort or changes, in 2021 the scaling would be done in 20 minutes. Without having to invest any effort or change any lines of code, in 2022 the same application would finish scaling in less than 10 minutes! &lt;strong&gt;IN LESS THAN 2 YEARS, WITHOUT ANY EFFORT, WE WENT FROM 1 HOUR TO 10 MINUTES!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I do see people migrating from EKS on EC2 to EKS on Fargate, but in small numbers. EKS on Fargate has &lt;a href="https://github.com/aws/containers-roadmap/issues/622" rel="noopener noreferrer"&gt;no support for Spot pricing&lt;/a&gt;, which makes Fargate an expensive proposition when compared with &lt;a href="https://aws.amazon.com/ec2/spot/" rel="noopener noreferrer"&gt;EC2 Spot Instances&lt;/a&gt;. EC2 Spot Instances are cheaper by up to 90% (and ECS on Fargate Spot by up to 70%), but they may be interrupted with a 2-minute warning. For a lot of containerized workloads, that is not a problem: a container gets interrupted and is quickly replaced by a new one. No harm done, and way lower bills.&lt;/p&gt;
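&lt;p&gt;As a back-of-the-envelope sketch of why Spot matters so much for the bill, here is the blended hourly cost of a mixed worker plane (all prices, discounts, and fractions below are made up for illustration; real Spot prices vary per pool and over time):&lt;/p&gt;

```python
def blended_hourly_cost(on_demand_price, spot_discount_pct, spot_fraction, instances):
    # Placeholder cost model: a fraction of the fleet runs on Spot at a
    # fixed discount, the rest pays the On-Demand price
    spot_price = on_demand_price * (1 - spot_discount_pct / 100)
    spot_count = instances * spot_fraction
    on_demand_count = instances - spot_count
    return on_demand_count * on_demand_price + spot_count * spot_price

# 100 instances at a fictional $1/h, 80% of them on Spot at a 90% discount
print(round(blended_hourly_cost(1.00, 90, 0.8, 100), 2))  # 28.0
```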

&lt;p&gt;The most common setup I see for Kubernetes on AWS is using a combination of workers. &lt;strong&gt;EKS on Fargate is ideal for long-running and critical components&lt;/strong&gt; like &lt;a href="https://github.com/kubernetes-sigs/aws-load-balancer-controller" rel="noopener noreferrer"&gt;AWS Load Balancer Controller&lt;/a&gt;, &lt;a href="https://karpenter.sh" rel="noopener noreferrer"&gt;Karpenter&lt;/a&gt;, or &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;Cluster Autoscaler&lt;/a&gt;. &lt;strong&gt;EKS on EC2 is ideal for interruption-sensitive workloads&lt;/strong&gt; like stateful applications. What I see most often with my customers is the largest part of the worker plane using &lt;strong&gt;EKS on EC2 Spot, which is good enough for most applications&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Keep in mind that these are &lt;strong&gt;DEFAULT PERFORMANCE RESULTS&lt;/strong&gt;, with an extreme test case, and with manual scaling! Performance levels will differ depending on your applications and what setup you're running.&lt;/p&gt;

&lt;p&gt;One would think the upper limit on performance is EKS on EC2 with all the servers already up and ready to run containers. No servers have to be requested, created, or started; containers just need to start:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-overview.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Feks-overview.min.svg" alt="Hand-drawn-style graph showing the scaling performance from 0 to 3500 containers. All the previous graphs are merged into this graph, in a mess of lines"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling performance from 0 to 3500 containers. All the previous graphs are merged into this graph, in a mess of lines&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That gets close, but even that could be optimized by having the container images already cached on the servers and by tuning the networking stack!&lt;br&gt;
There are always optimizations to be done, and &lt;strong&gt;more performance can always be squeezed out&lt;/strong&gt;. For example, I spent 3 months with a customer tuning, experimenting, and optimizing Cluster Autoscaler for their applications and their specific workload. After all the work was done, my customer's costs decreased by 20% while their end-users saw a 15% speed-up! For them and their scale, it was worth it. For other customers, spending this much time would represent a lot of wasted effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ALWAYS MAKE SURE TO CONSIDER ALL THE TRADE-OFFS WHEN DESIGNING!&lt;/strong&gt; Pre-scaled, pre-warmed, and finely tuned EKS on EC2 servers may be fast, but the associated costs will be huge — both pure AWS costs, but also development time costs, and &lt;em&gt;missed opportunity costs&lt;/em&gt; 💸&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally posted at &lt;a href="https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/" rel="noopener noreferrer"&gt;https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/&lt;/a&gt; and the dev.to version may contain errors and less-than-ideal presentation&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  Elastic Container Service scaling
&lt;/h2&gt;

&lt;p&gt;Amazon's &lt;strong&gt;Elastic Container Service&lt;/strong&gt; — the pole-position orchestrator that needed competition to become awesome.&lt;/p&gt;

&lt;p&gt;Just like Kubernetes, &lt;a href="https://aws.amazon.com/ecs/" rel="noopener noreferrer"&gt;Amazon Elastic Container Service&lt;/a&gt;, or ECS for short, has its components divided into two parts: the Control Plane and the Worker Plane. The Control Plane is like the brain: it decides where containers run, what happens to them, and it talks to us. The Worker Plane is the servers on which our containers actually run.&lt;/p&gt;

&lt;p&gt;For ECS, the &lt;strong&gt;Control Plane&lt;/strong&gt; is proprietary: AWS built it, AWS runs it, AWS manages it, and AWS develops it further. As users, we can create a Control Plane by creating an ECS Cluster. That's it!&lt;/p&gt;

&lt;p&gt;For the &lt;strong&gt;Worker Plane&lt;/strong&gt; we have more options, with the first five options being the same as the options we had when using Kubernetes as the orchestrator:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  self-managed workers. These are EC2 instances: we manage them, we configure them, we update them, we do everything&lt;/li&gt;
&lt;li&gt;  serverless workers. For the least amount of stress — not having to care about servers, operating systems, patches, and all that — we can use serverless workers through &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html" rel="noopener noreferrer"&gt;ECS on Fargate&lt;/a&gt;: we give AWS a container and we tell them to run it — AWS will run it on a server they manage, update, and control&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;fancy&lt;/em&gt; 5G workers through &lt;a href="https://aws.amazon.com/wavelength/" rel="noopener noreferrer"&gt;AWS Wavelength&lt;/a&gt;. These are also EC2 instances, but hosted by 5G networking providers for super-low latency&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;extended-region&lt;/em&gt; workers through &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cluster-regions-zones.html" rel="noopener noreferrer"&gt;AWS Local Zones&lt;/a&gt;. These are still EC2 instances, but hosted in popular cities for lower latency&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;your own&lt;/em&gt; workers running on a big AWS-managed server that you buy from AWS and install in your own datacenter through &lt;a href="https://aws.amazon.com/outposts/" rel="noopener noreferrer"&gt;AWS Outposts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  an &lt;em&gt;extra&lt;/em&gt; option, exclusively for ECS: &lt;em&gt;your own workers on your own hardware&lt;/em&gt; that is connected to AWS through &lt;a href="https://aws.amazon.com/ecs/anywhere/" rel="noopener noreferrer"&gt;ECS Anywhere&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For our testing, we'll focus on the most common options, using both EC2 workers that we manage ourselves and AWS-managed serverless workers through Fargate.&lt;/p&gt;

&lt;p&gt;To get &lt;strong&gt;visibility&lt;/strong&gt; into what is happening on the clusters, we don't have to install anything, we just have to enable &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-ECS-cluster.html" rel="noopener noreferrer"&gt;CloudWatch Container Insights&lt;/a&gt; on the ECS cluster.&lt;/p&gt;

&lt;p&gt;Our &lt;strong&gt;application container&lt;/strong&gt; will use its own dedicated AWS IAM Role and its own dedicated Security Group. Containers are not run by themselves (what happens if a container errors out, restarts, or needs an update?) but as part of larger concepts. In the Kubernetes world we have &lt;code&gt;Deployments&lt;/code&gt; and in the ECS world we have &lt;code&gt;Services&lt;/code&gt;, but both do the same job: they manage containers. For example, a &lt;code&gt;Deployment&lt;/code&gt; or a &lt;code&gt;Service&lt;/code&gt; of 30 containers will always try to make sure 30 containers are running. If a container has an error, it is restarted. If a container dies, it is replaced by a new one. If the 30 containers have to be updated, the &lt;code&gt;Service&lt;/code&gt; handles the complex logic of gradually replacing each of them with new, updated containers, always making sure at least 30 containers are running. In 2021, we saw that ECS scales faster when multiple &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html" rel="noopener noreferrer"&gt;ECS Services&lt;/a&gt; are used, so our application's containers will be part of multiple &lt;code&gt;Services&lt;/code&gt;, all in the same ECS Cluster. The ECS &lt;code&gt;Task&lt;/code&gt; will use 1 vCPU and 2 GB of memory and will have a single container: our test application.&lt;/p&gt;
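&lt;p&gt;The "always keep 30 containers running" behavior boils down to a reconciliation loop. A toy Python version of one pass (real controllers also handle rolling updates, health checks, and placement):&lt;/p&gt;

```python
import itertools

_ids = itertools.count(1)  # toy id generator for replacement containers

def reconcile(desired, containers):
    # One pass of a Deployment/Service-style control loop: drop dead
    # containers, then start replacements until the desired count is met
    alive = [c for c in containers if c["healthy"]]
    while len(alive) < desired:
        alive.append({"id": f"container-{next(_ids)}", "healthy": True})
    return alive

fleet = [{"id": "a", "healthy": True}, {"id": "b", "healthy": False}]
fleet = reconcile(30, fleet)  # the unhealthy container is replaced
print(len(fleet))  # 30
```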

&lt;p&gt;There are multiple other configuration options, and if you want to see the full configuration used, you can check out the Terraform infrastructure code in the &lt;a href="https://github.com/vlaaaaaaad/blog-scaling-containers-on-aws-in-2022" rel="noopener noreferrer"&gt;&lt;code&gt;ecs-*&lt;/code&gt; folders in &lt;code&gt;vlaaaaaaad/blog-scaling-containers-on-aws-in-2022&lt;/code&gt; on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, with the setup covered, let's get to testing! I ran all the tests between December 2021 and April 2022, using all the latest versions available at the time of testing.&lt;/p&gt;

&lt;p&gt;In 2020, I did not test &lt;strong&gt;ECS on EC2&lt;/strong&gt; at all. In 2021, I tested ECS on EC2, but the performance was not great. This year, the &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-ecs-improved-capacity-providers-cluster-auto-scaling/" rel="noopener noreferrer"&gt;announcement for improved &lt;strong&gt;capacity provider auto-scaling&lt;/strong&gt;&lt;/a&gt; gave me hope, and I thought we should re-test ECS on EC2.&lt;/p&gt;

&lt;p&gt;Scaling ECS clusters with EC2 workers as part of an AutoScaling Group managed by Capacity Providers is very similar to cluster-autoscaler from the Kubernetes world: based on demand, EC2 instances are added or removed from the cluster. Not enough space to run all the containers? New EC2 instances are added. Too many underutilized EC2 instances? Instances are cleanly removed from the cluster.&lt;/p&gt;
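&lt;p&gt;Under the hood, ECS cluster auto scaling steers on a metric called &lt;code&gt;CapacityProviderReservation&lt;/code&gt;: roughly, how much capacity the scheduler wants, as a percentage of what is currently running. A simplified sketch (the real metric handles more edge cases, such as scaling from zero):&lt;/p&gt;

```python
def capacity_provider_reservation(instances_needed, instances_running):
    # Above the targetCapacity setting, the ASG scales out; below it,
    # the ASG scales in. Simplified for illustration.
    return 100 * instances_needed / max(instances_running, 1)

# The scheduler wants 20 instances' worth of tasks, only 10 are running:
# the reservation is 200%, well above a 100% target, so the ASG adds capacity
print(capacity_provider_reservation(20, 10))  # 200.0
```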

&lt;p&gt;In 2021, we saw that ECS can scale a lot faster when using multiple Services (there's a bit of extra configuration to do, but it's worth it). We first have to figure out which number of ECS Services scales the fastest:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-ec2-services.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-ec2-services.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers, with different number of services. They all start scaling around the two and a half minute mark, with ECS on EC2 with 1 Service reaching about 1600 containers after ten minutes, with 5 Services reaching 3500 containers after ten minutes, with 7 Services also reaching 3500 containers after ten minutes, and with 10 Services reaching 2500 containers after ten minutes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers, with different number of services. They all start scaling around the two and a half minute mark, with ECS on EC2 with 1 Service reaching about 1600 containers after ten minutes, with 5 Services reaching 3500 containers after ten minutes, with 7 Services also reaching 3500 containers after ten minutes, and with 10 Services reaching 2500 containers after ten minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now that we know the ideal number of Services, we can focus on tuning Capacity Providers. A very important setting for the Capacity Provider is the &lt;strong&gt;target capacity&lt;/strong&gt;, which can be anything between 1% and 100%.&lt;br&gt;
If we set a low target capacity of 30%, we are keeping 70% of the EC2 capacity free, ready to host new containers. That's awesome from a scaling performance perspective, but terrible from a cost perspective: we are overpaying by 70%! Using a larger target of, say, 95% offers better cost efficiency (only 5% unused space), but scaling would be slower since we have to wait for new EC2 instances to start. How does this impact scaling performance? How much slower would scaling be? To figure it out, let's test:&lt;/p&gt;
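&lt;p&gt;The trade-off is easy to put in numbers (a trivial sketch, just to make the idle headroom and the cost multiplier explicit):&lt;/p&gt;

```python
def idle_headroom_pct(target_capacity_pct):
    # targetCapacity is the utilization the Capacity Provider aims for;
    # everything above it is idle EC2 kept warm for new containers
    return 100 - target_capacity_pct

def cost_per_used_vcpu_multiplier(target_capacity_pct):
    # At 30% utilization you pay for 100 vCPUs to use 30: a 3.33x multiplier
    return 100 / target_capacity_pct

print(idle_headroom_pct(30))  # 70 -> fast scaling, 70% idle spend
print(idle_headroom_pct(95))  # 5  -> cheap, but scaling waits on new EC2s
```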

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-ec2-target.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-ec2-target.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. ECS on EC2 with 5 Services and 30% Target Capacity starts scaling after about thirty seconds, and reaches 3500 containers just before the six minute mark. ECS on EC2 with 5 Services and 80% Target Capacity starts scaling around the two and a half minute mark and reaches 3500 containers after about ten minutes. ECS on EC2 with 5 Services and 95% Target Capacity starts scaling around the two and a half minute mark and reaches about 1600 containers after ten minutes of scaling"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. ECS on EC2 with 5 Services and 30% Target Capacity starts scaling after about thirty seconds, and reaches 3500 containers just before the six minute mark. ECS on EC2 with 5 Services and 80% Target Capacity starts scaling around the two and a half minute mark and reaches 3500 containers after about ten minutes. ECS on EC2 with 5 Services and 95% Target Capacity starts scaling around the two and a half minute mark and reaches about 1600 containers after ten minutes of scaling&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the real world, I mostly see ECS on EC2 used when ECS on Fargate is not enough — for things like GPU support, high-bandwidth networking, and so on.&lt;/p&gt;

&lt;p&gt;Let's move to &lt;strong&gt;ECS on Fargate&lt;/strong&gt; — serverless containers!&lt;/p&gt;

&lt;p&gt;As mentioned above, Fargate is an alternative to EC2s: no more stressing about servers, operating systems, patches, and updates. We only have to care about the container image!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FARGATE DIFFERS MASSIVELY BETWEEN ECS AND EKS&lt;/strong&gt;! ECS is AWS-native and serverless by design, which means ECS on Fargate can move faster and it can fully utilize the power of Fargate. Besides the "default" Fargate — &lt;em&gt;On-Demand Intel-based Fargate&lt;/em&gt; by its full name — ECS on Fargate also has support for &lt;strong&gt;Spot&lt;/strong&gt; (up to 70% discount, but containers may be interrupted), &lt;strong&gt;ARM&lt;/strong&gt; support (faster and cheaper than the default), &lt;strong&gt;Windows&lt;/strong&gt; support, and additional &lt;strong&gt;storage&lt;/strong&gt; options.&lt;/p&gt;

&lt;p&gt;In 2021, we saw that ECS on Fargate can scale a lot faster when using multiple &lt;code&gt;Services&lt;/code&gt; (there's a bit of extra configuration to do, but it's worth it). We first have to figure out which number of ECS Services scales the fastest:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-services.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-services.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers, with different numbers of services. ECS on Fargate with 2 Services takes about eight minutes, with 3 Services it takes about six minutes, with 5 Services it takes around five minutes, and with 7 Services it takes the same around five minutes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers, with different numbers of services. ECS on Fargate with 2 Services takes about eight minutes, with 3 Services it takes about six minutes, with 5 Services it takes around five minutes, and with 7 Services it takes the same around five minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For our test application, the ideal number seems to be 5 Services: our application needs to be split into 5 Services, each ECS Service launching and taking care of 700 containers. If we use fewer than 5 Services, performance is lower. If we use more than 5 Services, performance does not improve.&lt;/p&gt;
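&lt;p&gt;Splitting the desired count across Services is simple arithmetic; a quick sketch matching the layout used in the test:&lt;/p&gt;

```python
def split_across_services(total_containers, num_services):
    # Spread the desired count across N ECS Services as evenly as
    # possible, giving any remainder to the first Services
    base, remainder = divmod(total_containers, num_services)
    return [base + 1 if i < remainder else base for i in range(num_services)]

print(split_across_services(3500, 5))  # [700, 700, 700, 700, 700]
```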

&lt;p&gt;Amazing performance from ECS on Fargate! If we compare this year's best result with the results from previous years, we get a fabulous graph showing how much ECS on Fargate has evolved:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-yearly.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-yearly.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. In 2020 ECS on Fargate took about 55 minutes to reach 3500 containers. In 2021, it takes around 12 minutes, and in 2022 it takes a little over 5 minutes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. In 2020 ECS on Fargate took about 55 minutes to reach 3500 containers. In 2021, it takes around 12 minutes, and in 2022 it takes a little over 5 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If we built and ran an application using ECS on Fargate in 2020, we would have scaled to 3 500 containers in about 60 minutes. Without any effort or changes, in 2021 the scaling would be done in a little over 10 minutes. Without having to invest any effort or change any lines of code, in 2022 the same application would finish scaling in a little over 5 minutes! &lt;strong&gt;IN LESS THAN 2 YEARS, WITHOUT ANY EFFORT, WE WENT FROM 1 HOUR TO 5 MINUTES!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To further optimize our costs, we can run &lt;a href="https://aws.amazon.com/fargate/pricing/" rel="noopener noreferrer"&gt;ECS on Fargate Spot, which is discounted by up to 70%&lt;/a&gt;, but AWS can interrupt our containers with a 2-minute warning. For our testing, and for a lot of real-life workloads, we don't care if our containers get interrupted and then replaced by other containers. The up-to-70% discount is… appealing, but AWS mentions that Spot might be slower due to additional work that has to happen on their end. Let's test and see if there's any impact:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-spot.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-spot.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. The lines for ECS on Fargate On-Demand using 5 Services and ECS on Fargate Spot using 5 Services are really close, with the Spot line having a small bump"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. The lines for ECS on Fargate On-Demand using 5 Services and ECS on Fargate Spot using 5 Services are really close, with the Spot line having a small bump&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What a surprise! As per AWS, "&lt;em&gt;customers can expect a greater variability in performance when using ECS on Fargate Spot&lt;/em&gt;", and we saw exactly that: for the test I ran, ECS on Fargate Spot was just a smidge faster than ECS on Fargate On-Demand. &lt;em&gt;SPOT PERFORMANCE AND AVAILABILITY VARY&lt;/em&gt;, so make sure to account for that when architecting!&lt;/p&gt;

&lt;p&gt;That said, how sustained is this ECS on Fargate scaling performance? We can see that scaling slows to a crawl as we get close to our target of 3 500 containers. Is that because there are only a few remaining containers left to start, or because we are hitting a performance limitation? Let's test what happens when we try to scale to 10 000 containers!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-to-10k.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-to-10k.min.svg" alt="Hand-drawn-style graph showing the scaling performance from 0 to 3500 containers. There are lines for ECS on Fargate with 2, 3, 5, and 7 Services, and they all scale super-fast to 3400-ish containers and then slow down. There is a tall line for ECS on Fargate to 10000 containers with 5 Services taking close to ten minutes to scale, also slowing down when reaching the top"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling performance from 0 to 3500 containers. There are lines for ECS on Fargate with 2, 3, 5, and 7 Services, and they all scale super-fast to 3400-ish containers and then slow down. There is a tall line for ECS on Fargate to 10000 containers with 5 Services taking close to ten minutes to scale, also slowing down when reaching the top&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ok, ECS on Fargate has awesome &lt;strong&gt;and&lt;/strong&gt; sustained performance!&lt;/p&gt;

&lt;p&gt;That is not all though! This year, we have even more options: &lt;strong&gt;ECS on Fargate ARM&lt;/strong&gt; and &lt;strong&gt;ECS on Fargate Windows&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In late 2021, &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/aws-fargate-amazon-ecs-aws-graviton2-processors/" rel="noopener noreferrer"&gt;AWS announced &lt;strong&gt;ECS on Fargate ARM&lt;/strong&gt;&lt;/a&gt;, which takes advantage of &lt;a href="https://aws.amazon.com/ec2/graviton/" rel="noopener noreferrer"&gt;AWS' Graviton2 processors&lt;/a&gt;. ECS on Fargate ARM is both faster and cheaper than the "default" ECS on Fargate On-Demand, which uses Intel processors. There is no option for &lt;a href="https://github.com/aws/containers-roadmap/issues/1594" rel="noopener noreferrer"&gt;ECS on Fargate ARM Spot right now&lt;/a&gt;, so ECS on Fargate Spot remains the most cost-effective option.&lt;/p&gt;

&lt;p&gt;To run a container on ARM processors (be they AWS' Graviton processors or Apple Silicon in the latest Macs), we have to build a container image for ARM architectures. In our case, this was easy: we add a single line to Docker's &lt;a href="https://github.com/docker/build-push-action" rel="noopener noreferrer"&gt;&lt;code&gt;build-push-action&lt;/code&gt;&lt;/a&gt; to build a multi-architecture image for both Intel and ARM processors: &lt;code&gt;platforms: linux/amd64, linux/arm64&lt;/code&gt;. It's that easy! The image is then pushed to ECR, which has &lt;a href="https://aws.amazon.com/blogs/containers/introducing-multi-architecture-container-images-for-amazon-ecr/" rel="noopener noreferrer"&gt;supported multi-architecture images since 2020&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With the image built and pushed, we can test how ECS on Fargate ARM scales:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-arm.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-arm.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. ECS on Fargate is a bit slower than ECS on Fargate ARM, with about 20 seconds of difference between the two"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. ECS on Fargate is a bit slower than ECS on Fargate ARM, with about 20 seconds of difference between the two&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Interesting: ECS on Fargate ARM is faster! AWS' Graviton2 processors are faster, and I expected that to be visible in the application processing time, but I did not expect it to have an impact on scaling too. Thinking about it, though, it makes sense: a faster processor also helps with container image extraction and application startup. Even better!&lt;/p&gt;

&lt;p&gt;Also in late 2021, &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/10/aws-fargate-amazon-ecs-windows-containers/" rel="noopener noreferrer"&gt;AWS announced &lt;strong&gt;ECS on Fargate Windows&lt;/strong&gt;&lt;/a&gt;, which can run Windows Server containers. Since Windows has licensing fees, the pricing is a bit different: there is an extra OS licensing fee and, while billing is still done per-second, there is a 15-minute minimum charge.&lt;/p&gt;

&lt;p&gt;Some folks would dismiss ECS on Fargate Windows, but it is a major announcement! People who had to run specific Windows-only dependencies can now easily adopt containerized applications or, for the first time, run Windows containers serverlessly on AWS. Windows support is great news for folks running complex .NET applications that cannot be moved to Linux: they can now move at way higher velocity!&lt;/p&gt;

&lt;p&gt;To build Windows container images, we have to make &lt;strong&gt;a couple of changes&lt;/strong&gt;.&lt;br&gt;
First, we have to use a Windows base image for our container. For our Python web app it's easy: the official Python base images &lt;a href="https://github.com/docker-library/python/pull/142" rel="noopener noreferrer"&gt;have had Windows support since 2016&lt;/a&gt;. Unfortunately, Docker's &lt;code&gt;build-and-push&lt;/code&gt; Action &lt;a href="https://github.com/docker/build-push-action/issues/18" rel="noopener noreferrer"&gt;has no support for building Windows containers&lt;/a&gt;. To get the image built, we'll have to run the &lt;code&gt;docker build&lt;/code&gt; commands manually. Since GitHub Actions has native support for Windows runners, this is straightforward.&lt;br&gt;
Like all our images, the Windows container image is pushed to ECR, which &lt;a href="https://aws.amazon.com/about-aws/whats-new/2017/01/amazon-ecr-supports-docker-image-manifest-v2-schema-2/" rel="noopener noreferrer"&gt;has supported Windows images since 2017&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I tried running our application and it failed: the &lt;strong&gt;web server&lt;/strong&gt; could not start. As of right now, our &lt;a href="https://github.com/benoitc/gunicorn/issues/524" rel="noopener noreferrer"&gt;&lt;code&gt;gunicorn&lt;/code&gt; web server does not support Windows&lt;/a&gt;. No worries, we can use a drop-in alternative: &lt;a href="https://docs.pylonsproject.org/projects/waitress/en/latest/" rel="noopener noreferrer"&gt;&lt;code&gt;waitress&lt;/code&gt;&lt;/a&gt;! This will lead to a small difference in the test application code between Windows and Linux, but no fundamental changes.&lt;/p&gt;
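&lt;p&gt;A minimal sketch of the idea (not the actual test application): the same WSGI callable can be served by &lt;code&gt;gunicorn&lt;/code&gt; on Linux and by &lt;code&gt;waitress&lt;/code&gt; on Windows, so only the launcher differs:&lt;/p&gt;

```python
# Minimal WSGI app, a simplified stand-in for the actual test application.
import platform


def app(environ, start_response):
    """Tiny WSGI application: answer every request with 200 OK."""
    body = b"Hello from " + platform.system().encode()
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


if __name__ == "__main__" and platform.system() == "Windows":
    # gunicorn does not run on Windows, so use the drop-in waitress server
    from waitress import serve

    serve(app, host="0.0.0.0", port=8080)
# On Linux, the very same callable is served by gunicorn instead:
#   gunicorn app:app --bind 0.0.0.0:8080
```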

&lt;p&gt;Because I used Honeycomb for &lt;strong&gt;proper&lt;/strong&gt; observability, I was able to discover there is one more thing we have to do for the best scaling results: push non-distributable artifacts! You can &lt;a href="https://www.honeycomb.io/blog/observability-power-of-asking-questions" rel="noopener noreferrer"&gt;read the whole story in a guest post I wrote on the Honeycomb blog&lt;/a&gt;, but the short version is that, right now, Windows container images are special and an &lt;strong&gt;extra configuration&lt;/strong&gt; option has to be enabled.&lt;br&gt;
Windows has &lt;strong&gt;complex licensing&lt;/strong&gt; and the "base" container image is a non-distributable artifact: we are not allowed to distribute it! That means that when we build our container image in GitHub Actions and then run &lt;code&gt;docker push&lt;/code&gt; to upload our image to ECR, only some parts of the image are pushed to ECR — the parts we are allowed to share, which in our case are our application code and our app's dependencies. If we open the AWS Console and look at our image, we will see that only 76 MB were pushed to ECR.&lt;br&gt;
When ECS on Fargate wants to start a Windows container with our application, it first has to download the container image: our Python application and its dependencies, totaling about 76 MB, will be downloaded from ECR, but the base Windows Server image of about 2.7 GB will be downloaded from Microsoft. Unless everything is perfect in the universe and on the Internet between AWS and Microsoft, download performance can vary wildly!&lt;br&gt;
As per Steve Lasker, PM Architect at Microsoft Azure, Microsoft recognizes this licensing constraint has caused frustration and is working to remove the constraint and the default configuration. Until then, for the highest consistent performance, &lt;a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html" rel="noopener noreferrer"&gt;both AWS&lt;/a&gt; and &lt;a href="https://docs.microsoft.com/en-us/azure/container-registry/container-registry-faq#how-do-i-push-non-distributable-layers-to-a-registry-" rel="noopener noreferrer"&gt;Microsoft&lt;/a&gt; recommend setting a &lt;strong&gt;Docker daemon flag&lt;/strong&gt; for private-use images: &lt;code&gt;--allow-nondistributable-artifacts&lt;/code&gt;. With this flag set, the full image totaling 2.8 GB is pushed to ECR — both the base image and our application code. When ECS on Fargate has to download the container image, it will download the whole thing from the close-by ECR.&lt;/p&gt;
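&lt;p&gt;For reference, the same setting can be made persistent in the Docker daemon configuration (&lt;code&gt;daemon.json&lt;/code&gt;); the account ID and region below are placeholders for your private ECR registry:&lt;/p&gt;

```json
{
  "allow-nondistributable-artifacts": [
    "111122223333.dkr.ecr.us-east-1.amazonaws.com"
  ]
}
```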

&lt;p&gt;With this extra flag set, and with the full image pushed to ECR, we can test ECS on Fargate Windows and get astonishing performance:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-windows.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-on-fargate-windows.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. ECS on Fargate and ECS on Fargate ARM both start around the 30 second mark and reach close to 3500 containers around the four to five minute mark. ECS on Fargate Windows starts just before the six minute mark, reaching 3500 containers in about eleven minutes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers. ECS on Fargate and ECS on Fargate ARM both start around the 30 second mark and reach close to 3500 containers around the four to five minute mark. ECS on Fargate Windows starts just before the six minute mark, reaching 3500 containers in about eleven minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ECS on Fargate Windows is slower to start — about 5 minutes of delay compared to about 30 seconds of delay when using Linux containers — but that was expected: Fargate has to do licensing stuff and Windows containers are, for good reasons, bigger. After that initial delay, ECS on Fargate Windows is scaling just as fast, which is awesome!&lt;/p&gt;

&lt;p&gt;I am seeing a massive migration to ECS on Fargate — it's so much easier!&lt;br&gt;
Since ECS on Fargate launched in 2017, for the best-case scenario, approximate pricing per vCPU-hour saw a whopping 76% reduction from &lt;a href="https://aws.amazon.com/blogs/aws/aws-fargate/" rel="noopener noreferrer"&gt;$ 0.05&lt;/a&gt; to &lt;a href="https://aws.amazon.com/fargate/pricing/" rel="noopener noreferrer"&gt;$ 0.01&lt;/a&gt;, and pricing per GB-hour saw a shocking 89% reduction from &lt;a href="https://aws.amazon.com/blogs/aws/aws-fargate/" rel="noopener noreferrer"&gt;$ 0.010&lt;/a&gt; to &lt;a href="https://aws.amazon.com/fargate/pricing/" rel="noopener noreferrer"&gt;$ 0.001&lt;/a&gt;. Since I started testing in 2020, ECS on Fargate got 12 times faster!&lt;/p&gt;

&lt;p&gt;While initially a slow and expensive service, ECS on Fargate grew into an outstanding service. I started advising my clients to start moving smaller applications to ECS on Fargate in late 2019, and that recommendation became stronger each year. I think &lt;a href="https://www.vladionescu.me/posts/flowchart-how-should-i-run-containers-on-aws-2021/" rel="noopener noreferrer"&gt;ECS on Fargate should be the default choice for any new container deployments on AWS&lt;/a&gt;. As a bonus, &lt;a href="https://aws.amazon.com/ecs/anywhere/" rel="noopener noreferrer"&gt;running in other data-centers became a thing instantly, without any effort, through ECS Anywhere&lt;/a&gt;! That enables some amazingly easy cross-cloud SaaS scenarios and instant edge computing use-cases.&lt;/p&gt;

&lt;p&gt;That said, ECS on Fargate is not an ideal fit for everything: there is &lt;a href="https://github.com/aws/containers-roadmap/issues/164" rel="noopener noreferrer"&gt;limited support for CPU and memory size&lt;/a&gt;, &lt;a href="https://github.com/aws/containers-roadmap/issues/384" rel="noopener noreferrer"&gt;limited storage support&lt;/a&gt;, &lt;a href="https://github.com/aws/containers-roadmap/issues/88" rel="noopener noreferrer"&gt;no GPU support yet&lt;/a&gt;, &lt;a href="https://github.com/aws/containers-roadmap/issues/715" rel="noopener noreferrer"&gt;no dedicated network bandwidth&lt;/a&gt;, and so on. Using EC2 workers with ECS on EC2 offers a lot more flexibility and power — say servers with multiple TBs of memory, hundreds of CPUs, and dedicated network bandwidth in the range of 100s of Gigabits. It's always tradeoffs!&lt;/p&gt;

&lt;p&gt;Again, keep in mind that these are &lt;strong&gt;DEFAULT AND FORCED PERFORMANCE RESULTS&lt;/strong&gt;, with an extreme test case, and with manual scaling! Performance levels will differ depending on your applications and what setup you're running! &lt;a href="https://aws.amazon.com/blogs/containers/under-the-hood-amazon-elastic-container-service-and-aws-fargate-increase-task-launch-rates/" rel="noopener noreferrer"&gt;This post on the AWS Containers Blog&lt;/a&gt; gets into more details, if you're curious.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-overview.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fecs-overview.min.svg" alt="Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers, using ECS. There are a lot of lines and it's messy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling  performance from 0 to 3500 containers, using ECS. There are a lot of lines and it's messy&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally posted at &lt;a href="https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/" rel="noopener noreferrer"&gt;https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/&lt;/a&gt; and the dev.to version may contain errors and less-than-ideal presentation&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;App Runner&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;App Runner&lt;/strong&gt; is a higher-level service &lt;a href="https://aws.amazon.com/blogs/containers/introducing-aws-app-runner/" rel="noopener noreferrer"&gt;released in May 2021&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For App Runner, we don't see any complex Control Plane and Worker Plane separation: we tell App Runner to run our services, and it does that for us!&lt;br&gt;
App Runner is an easier way of running containers, further building on the experience offered by ECS on Fargate. If you're really curious how this works under the covers, &lt;a href="https://tty.neveragain.de/2021/06/18/app-runner-deep-dive.html" rel="noopener noreferrer"&gt;an awesome deep-dive can be read here&lt;/a&gt; and &lt;a href="https://aws.amazon.com/blogs/containers/deep-dive-on-aws-app-runner-vpc-networking/" rel="noopener noreferrer"&gt;AWS has a splendid networking deep-dive here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For our use case, there is no need to create a VPC since App Runner does not require one (but App Runner &lt;a href="https://aws.amazon.com/blogs/aws/new-for-app-runner-vpc-support/" rel="noopener noreferrer"&gt;can connect to resources in a VPC&lt;/a&gt;). To test App Runner, we create an &lt;strong&gt;App Runner service&lt;/strong&gt; configured to run our 1 vCPU 2 GB container using our container image from ECR. That's it!&lt;/p&gt;

&lt;p&gt;To force &lt;strong&gt;scaling&lt;/strong&gt;, we can edit the "&lt;em&gt;Minimum number of instances&lt;/em&gt;" to equal the "&lt;em&gt;Maximum number of instances&lt;/em&gt;", and we quickly get the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fapprunner.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Fapprunner.min.svg" alt="Hand-drawn-style graph showing the scaling performance from 0 to 30 containers instead of 3500 containers. There are 2 lines for ECS on Fargate using 2 and 5 services: they both start around the 30 seconds mark and go straight up. App Runner has a shorter line that starts around the one minute mark, goes straight up, and stops abruptly at 25 containers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling performance from 0 to 30 containers instead of 3500 containers. There are 2 lines for ECS on Fargate using 2 and 5 services: they both start around the 30 seconds mark and go straight up. App Runner has a shorter line that starts around the one minute mark, goes straight up, and stops abruptly at 25 containers&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;App Runner starts scaling a bit slower than ECS on Fargate, but then scales just as fast. Scaling finishes quickly, as &lt;strong&gt;APP RUNNER SUPPORTS A MAXIMUM OF 25 CONTAINERS&lt;/strong&gt; per service. There is no way to run more than 25 containers per service, but multiple services could be used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't be fooled&lt;/strong&gt; by the seemingly small number! Each App Runner container can use a maximum of 2 vCPUs and 4 GB of RAM, for a grand total of 50 vCPUs and 100 GB of memory in a single service. For many applications, this is more than enough, and the advantage of AWS managing things is not to be underestimated!&lt;/p&gt;

&lt;p&gt;In the future, I expect App Runner will continue to mature, and I think it might become the default way of running containers sometime in 2023 — but those are my hopes and dreams. With AWS managing capacity for App Runner, there are a lot of optimizations that AWS could implement. We'll see what the future brings!&lt;/p&gt;

&lt;p&gt;Keep in mind that these are &lt;strong&gt;DEFAULT PERFORMANCE RESULTS&lt;/strong&gt;, with manual, forced scaling! Performance levels will differ depending on your applications and what setup you're running! App Runner also requires way less effort to set up and has some awesome scaling features 😉&lt;/p&gt;


&lt;h2&gt;Lambda&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Lambda&lt;/strong&gt; is different: it's Function-as-a-Service, not containers.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;previous years&lt;/strong&gt; I saw no need to test Lambda as &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html" rel="noopener noreferrer"&gt;AWS publishes the exact speed at which Lambda scales&lt;/a&gt;. In our Northern Virginia (&lt;code&gt;us-east-1&lt;/code&gt;) region, that is an initial burst of 3 000 instances in the first minute, and then 500 instances every minute after. No need to test when &lt;strong&gt;we know the exact results we are going to get&lt;/strong&gt;: Lambda will scale to 3 500 instances in 2 minutes!&lt;br&gt;
This year, based on countless requests, I thought we should include Lambda, even if only to confirm AWS' claims of performance. &lt;strong&gt;Lambda instances are not directly comparable with containers&lt;/strong&gt;, but we'll get to that in a few.&lt;/p&gt;
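&lt;p&gt;The advertised numbers compose simply, and we can sanity-check the claim with a back-of-the-envelope model (a simplification: it ignores how requests arrive within each minute):&lt;/p&gt;

```python
# Back-of-the-envelope model of Lambda's documented scaling behavior:
# an initial burst in the first minute, then a fixed number of extra
# instances every following minute.
def lambda_instances(minutes, burst=3000, per_minute=500):
    """Instances reachable `minutes` minutes after scaling starts."""
    if minutes < 1:
        return 0  # simplification: the burst lands within the first minute
    return burst + (minutes - 1) * per_minute


# With the default us-east-1 limits, 3 500 instances in 2 minutes:
print(lambda_instances(2))  # 3500
```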

&lt;p&gt;Lambda is a fully managed &lt;strong&gt;event-driven Function-as-a-Service&lt;/strong&gt; product. In plain language, when events happen (HTTP requests, messages put in a queue, a file gets created) Lambda will run code for us. We send Lambda the code, say "&lt;em&gt;run it when X happens&lt;/em&gt;", and Lambda will do everything for us, without any Control Plane and Worker Plane separation — we don't even see the workers directly!&lt;/p&gt;

&lt;p&gt;The code that Lambda will run can be &lt;strong&gt;packaged&lt;/strong&gt; in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  in a &lt;code&gt;.zip&lt;/code&gt; archive, of at most 50 MB. This is the "classic" way of sending code to Lambda&lt;/li&gt;
&lt;li&gt;  in a container image (which, at its lowest level, is a collection of archives too), which can be as large as 10 GB. This was launched in &lt;a href="https://aws.amazon.com/blogs/aws/new-for-aws-lambda-container-image-support/" rel="noopener noreferrer"&gt;late 2020&lt;/a&gt; and is an alternative way of packaging the code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To &lt;strong&gt;adapt our application&lt;/strong&gt; for Lambda, we have first to figure out what &lt;em&gt;event&lt;/em&gt; our Lambda function will react to.&lt;br&gt;
We could configure Lambda to &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html" rel="noopener noreferrer"&gt;have a bunch of containers waiting ready to accept traffic&lt;/a&gt;, but that is not the same thing as Lambda creating workers for us when traffic spikes. For accurate results, I think we need to have at least a semi-realistic scenario in place.&lt;br&gt;
Looking at the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-services.html" rel="noopener noreferrer"&gt;many integrations Lambda has&lt;/a&gt;, I think the easiest one to use for our testing is the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/services-apigateway.html" rel="noopener noreferrer"&gt;Amazon API Gateway integration&lt;/a&gt;: API Gateway will receive HTTP requests, run our code for each request, and return the response.&lt;/p&gt;

&lt;p&gt;Our &lt;strong&gt;test application code&lt;/strong&gt; can be adapted now that we know what will call our Lambda. I decided to re-write our test application. While Lambda can run whatever code we send it (including &lt;a href="https://lamby.custominktech.com" rel="noopener noreferrer"&gt;big frameworks&lt;/a&gt; with &lt;a href="https://twitter.com/metaskills/status/1377219340936826881" rel="noopener noreferrer"&gt;awesome performance&lt;/a&gt;), I see no need for a big framework for our use-case. I re-wrote the code (still in Python, but without the Flask micro-framework), packaged it into a &lt;code&gt;.zip&lt;/code&gt; file, and configured it as the source for my Lambda function.&lt;/p&gt;

&lt;p&gt;To &lt;strong&gt;scale&lt;/strong&gt; the Lambda function, since there is no option of manually editing a number, we will have to create a lot of events: flood the API Gateway with a lot of HTTP requests. I did some experiments, and the easiest option seems to be &lt;a href="https://httpd.apache.org/docs/2.4/programs/ab.html" rel="noopener noreferrer"&gt;Apache Benchmark&lt;/a&gt; running on multiple EC2 instances with a lot of &lt;a href="https://www.ec2throughput.info" rel="noopener noreferrer"&gt;sustained network bandwidth&lt;/a&gt;. We will run Apache Benchmark on each EC2 instance, and send a gigantic flood of requests to API Gateway, which will in turn send that to a lot of Lambdas.&lt;br&gt;
Since &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html" rel="noopener noreferrer"&gt;Lambda publishes its scaling performance targets&lt;/a&gt;, we know that Lambda will scale to our target of 3 500 containers in just 2 minutes. That's not enough time — we have to scale even higher! When we wanted to confirm that ECS on Fargate has sustained scaling performance, we scaled to 10 000 containers in about 10 minutes. That seems like a good target, right? Let's scale Lambda!&lt;/p&gt;
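&lt;p&gt;The actual flood was generated with Apache Benchmark on EC2; purely as an illustration of the idea, a similar request flood can be sketched in Python (the URL, request count, and concurrency below are whatever your setup needs):&lt;/p&gt;

```python
# Hypothetical illustration of the flood; the real tests used Apache
# Benchmark (`ab`) running on multiple EC2 instances, not this script.
import concurrent.futures
import urllib.request


def flood(url, total_requests, concurrency):
    """Send `total_requests` GET requests at `url` with `concurrency` workers.

    Returns the number of HTTP 200 responses. Keeping many requests
    in flight at once is what forces Lambda to create new instances.
    """
    def hit(_):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.status == 200
        except OSError:
            return False

    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        return sum(pool.map(hit, range(total_requests)))
```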

&lt;p&gt;There are multiple other configuration options, and if you want to see the full configuration used, you can check out the Terraform infrastructure code in the &lt;a href="https://github.com/vlaaaaaaad/blog-scaling-containers-on-aws-in-2022" rel="noopener noreferrer"&gt;&lt;code&gt;lambda&lt;/code&gt; folder in &lt;code&gt;vlaaaaaaad/blog-scaling-containers-on-aws-in-2022&lt;/code&gt; on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, with the setup covered, let's get to testing! I ran all the tests between January and February 2022, using all the latest versions available at the time of testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Flambda-vs-ecs-on-fargate.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Flambda-vs-ecs-on-fargate.min.svg" alt="Hand-drawn-style graph showing the scaling performance from 0 to 10000 containers. ECS on Fargate starts around the 30 seconds mark and grows smoothly until 10000 around the ten minute mark. The Lambda line spikes instantly to 3000 containers, and then spikes again to 3500 containers. After that, the Lambda line follows a stair pattern, every minutes spiking an additional 500 containers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling performance from 0 to 10000 containers. ECS on Fargate starts around the 30 seconds mark and grows smoothly until 10000 around the ten minute mark. The Lambda line spikes instantly to 3000 containers, and then spikes again to 3500 containers. After that, the Lambda line follows a stair pattern, every minute spiking an additional 500 containers&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Lambda followed the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html" rel="noopener noreferrer"&gt;advertised performance&lt;/a&gt; to the letter: an initial burst of 3 000 containers in the first minute, and 500 containers each minute after. I am surprised by Lambda scaling in steps — I expected the scaling to be spread over the minutes, not to have those spikes when each minute starts.&lt;/p&gt;

&lt;p&gt;Funnily enough, at about 7 minutes after the scaling command, both Lambda and ECS on Fargate were running almost the same number of containers: 6 500, give or take. That does not tell us much though — these are pure containers launched. How would this work in the real world? How would this impact our applications? How would this scaling impact response times and customer happiness?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LAMBDA AND ECS ON FARGATE WORK IN DIFFERENT WAYS&lt;/strong&gt; and direct comparisons between the two cannot be easily done.&lt;br&gt;
Lambda has &lt;a href="https://www.sentiatechblog.com/aws-re-invent-2020-day-3-optimizing-lambda-cost-with-multi-threading" rel="noopener noreferrer"&gt;certain container sizes&lt;/a&gt; and ECS on Fargate has &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html" rel="noopener noreferrer"&gt;other container sizes&lt;/a&gt; with &lt;a href="https://github.com/aws/containers-roadmap/issues/164" rel="noopener noreferrer"&gt;even more sizes in the works&lt;/a&gt;. Lambda has &lt;a href="https://aws.amazon.com/lambda/pricing/" rel="noopener noreferrer"&gt;one pricing for x86 and one pricing for ARM&lt;/a&gt; and ECS on Fargate has &lt;a href="https://aws.amazon.com/fargate/pricing/" rel="noopener noreferrer"&gt;two pricing options for x86 and one pricing option for ARM&lt;/a&gt;. Lambda is more managed and tightly integrated with AWS, which means, for example, that there is no need to write code to receive a web request, or to get a message from an SQS Queue. But ECS on Fargate can more easily use mature and validated frameworks. Lambda can only handle 1 HTTP request at a time per container, while ECS on Fargate can do as many as the application can handle, but that means more complex configuration. And so on and so forth.&lt;br&gt;
I won't even attempt to compare them here.&lt;/p&gt;

&lt;p&gt;And that is not all — it gets worse/better!&lt;br&gt;
We all know that ECS on Fargate and EKS on Fargate limits can be increased, and we even saw that in the 2021 tests. In the middle of testing Lambda, I got some &lt;strong&gt;surprising news&lt;/strong&gt;: Lambda's &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html" rel="noopener noreferrer"&gt;default scaling limits of 3 000 burst and 500 sustained&lt;/a&gt; can be increased!&lt;br&gt;
Unlike the previous limit increases (which were actually capacity limit increases), this is a performance limit increase, and it is not a straightforward request. It's not something that only the top 1% of the top 1% of AWS customers can achieve, but it's not something that is easily available either. With a &lt;strong&gt;LEGITIMATE WORKLOAD&lt;/strong&gt; and some conversations with the AWS teams and engineers through AWS Support, I was able to get my Lambda limits increased by a shocking amount: from the default initial burst of 3 000 and a sustained rate of 500, my limits were increased to an initial burst of 15 000 instances and a sustained rate of 3 000 instances each minute 🤯&lt;/p&gt;

&lt;p&gt;To see these increased limits in action, we have to scale even higher. A test to 10 000 containers is useless:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Flambda-increased-limits.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Flambda-increased-limits.min.svg" alt="Hand-drawn-style graph showing the scaling performance from 0 to 10000 containers. The same graph as before, with an additional line going straight up from 0 to 10000, instantly"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling performance from 0 to 10000 containers. The same graph as before, with an additional line going straight up from 0 to 10000, instantly&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Did you notice the vertical line? Yeah, scaling to 10 000 is not a great benchmark when Lambda has increased limits to burst to 15 000.&lt;/p&gt;

&lt;p&gt;To properly test Lambda with increased limits, we have to go even higher! If we want to scale for the same 10-minute duration, we have to scale up to 50 000 containers!&lt;br&gt;
At this scale, things start getting complicated. To support this many requests, we also have to increase the API Gateway traffic limits. I talked to AWS Support and after validating my workflow, we got the &lt;em&gt;Throttle quota per account, per Region across HTTP APIs, REST APIs, WebSocket APIs, and WebSocket callback APIs&lt;/em&gt; limit increased from the default 10 000 to our required 50 000.&lt;/p&gt;

&lt;p&gt;With the extra limits increased for our setup, we can run our test and see the results. Prepare to scroll for a while:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Flambda-to-50k.min.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.vladionescu.me%2Fposts%2Fscaling-containers-on-aws-in-2022%2Flambda-to-50k.min.svg" alt="Hand-drawn-style graph showing the scaling performance form 0 to 50000 containers. It is a comically tall graph, requiring a lot of scrolling. ECS on Fargate starts around the 30 seconds mark and grows smoothly until 10000 around the same ten minute mark. There is an additional line for Lambda with increased limits which goes straight up to 12000-ish containers and then spikes again to 18000-ish. After that, the line follows the same stair pattern, every minute spiking an additional 3000 containers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hand-drawn-style graph showing the scaling performance from 0 to 50000 containers. It is a comically tall graph, requiring a lot of scrolling. ECS on Fargate starts around the 30 seconds mark and grows smoothly until 10000 around the same ten minute mark. There is an additional line for Lambda with increased limits which goes straight up to 12000-ish containers and then spikes again to 18000-ish. After that, the line follows the same stair pattern, every minute spiking an additional 3000 containers&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Yeah… that is &lt;strong&gt;&lt;em&gt;a lot&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We are talking about large numbers here: 3 500, 10 000, and now 50 000 containers. We are getting desensitized, and I think it would help to put these numbers in &lt;strong&gt;perspective&lt;/strong&gt;. The biggest Lambda size is 6 vCPUs with 10 GB of memory.&lt;br&gt;
With the default limits, Lambda scales to 3 000 containers in a couple of seconds. That means that with default limits, we get &lt;strong&gt;30 TB OF MEMORY AND 18 000 VCPUS IN A COUPLE OF SECONDS&lt;/strong&gt;.&lt;br&gt;
With a &lt;strong&gt;legitimate workload&lt;/strong&gt; and increased limits, as we just saw, we are now living in a world where we can &lt;strong&gt;INSTANTLY GET 150 TB OF RAM AND 90 000 VCPUS&lt;/strong&gt; for our apps 🤯&lt;/p&gt;
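&lt;p&gt;The arithmetic behind those numbers, spelled out:&lt;/p&gt;

```python
# The raw capacity behind the instance counts, assuming every instance
# uses the biggest Lambda size: 6 vCPUs and 10 GB of memory.
VCPUS_PER_INSTANCE = 6
GB_PER_INSTANCE = 10


def capacity(instances):
    """Total (vCPUs, TB of memory) for `instances` Lambda instances."""
    return instances * VCPUS_PER_INSTANCE, instances * GB_PER_INSTANCE / 1000


print(capacity(3_000))   # default burst -> (18000, 30.0)
print(capacity(15_000))  # increased burst -> (90000, 150.0)
```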

&lt;h2&gt;
  
  
  Acknowledgments
&lt;/h2&gt;

&lt;p&gt;First, massive thanks go to &lt;a href="https://twitter.com/FarrahC32" rel="noopener noreferrer"&gt;Farrah&lt;/a&gt; and &lt;a href="https://twitter.com/mreferre" rel="noopener noreferrer"&gt;Massimo&lt;/a&gt;, and everybody at AWS that helped with this! You are all awesome and the time and care you took to answer all my annoying emails is much appreciated!&lt;br&gt;
I also want to thank &lt;a href="https://twitter.com/SteveLasker" rel="noopener noreferrer"&gt;Steve Lasker&lt;/a&gt; and the nice folks at Microsoft that answered my questions around Windows containers!&lt;/p&gt;

&lt;p&gt;Special thanks go to all my friends who helped with this — from friends being interviewed about what they would like to see and their scaling use-cases, all the way to friends reading rough drafts and giving feedback. Thank you all so much!&lt;/p&gt;

&lt;p&gt;Because I value transparency, and because "&lt;em&gt;oh, just write a blog post&lt;/em&gt;" fills me with anger, here are a couple of stats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  this whole thing took &lt;strong&gt;almost 6 months&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;  November 2021: early discovery and calls to figure out how to test, showcase, and contextualize Lambda and App Runner&lt;/li&gt;
&lt;li&gt;  December 2021: continued discovery, coding, testing&lt;/li&gt;
&lt;li&gt;  January 2022: continued testing, data visualization explorations&lt;/li&gt;
&lt;li&gt;  February 2022: final testing, early drafts, and data visualization explorations&lt;/li&gt;
&lt;li&gt;  March 2022: a lot of &lt;a href="https://twitter.com/iamvlaaaaaaad/status/1499049324864577543" rel="noopener noreferrer"&gt;writing and reviews&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  April 2022: final reviews, re-testing&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  final results from the tests were exported to &lt;strong&gt;over 80 MB of spreadsheets&lt;/strong&gt;, from about &lt;strong&gt;5 GB of raw data&lt;/strong&gt;. That's 80 MB of CSVs and 5 GB of compressed JSONs, not counting the discarded data from experiments or failed tests&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;over 200 emails&lt;/strong&gt; were sent between me and &lt;strong&gt;more than 30 engineers at AWS&lt;/strong&gt;. Again, thank you all so much, and I apologize for the spam!&lt;/li&gt;

&lt;li&gt;  the total &lt;strong&gt;AWS bill was about 7 000 $&lt;/strong&gt;. Thank you &lt;a href="https://twitter.com/FarrahC32" rel="noopener noreferrer"&gt;Farrah&lt;/a&gt; and AWS for the Credits!&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;more than 300 000 containers&lt;/strong&gt; were launched&lt;/li&gt;

&lt;li&gt;  one major bug was discovered&lt;/li&gt;

&lt;li&gt;  carbon emissions are TBD&lt;/li&gt;

&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally posted at &lt;a href="https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/" rel="noopener noreferrer"&gt;https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022/&lt;/a&gt; and the dev.to version may contain errors and less-than-ideal presentation&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>serverless</category>
      <category>cloud</category>
    </item>
    <item>
      <title>AWS User Group Dubai 2021 Container Series Meetups</title>
      <dc:creator>Vlad Ionescu</dc:creator>
      <pubDate>Tue, 14 Sep 2021 20:37:50 +0000</pubDate>
      <link>https://dev.to/aws-heroes/aws-user-group-dubai-2021-container-series-meetups-49fi</link>
      <guid>https://dev.to/aws-heroes/aws-user-group-dubai-2021-container-series-meetups-49fi</guid>
      <description>&lt;h2&gt;
  
  
  Details
&lt;/h2&gt;

&lt;p&gt;This series of talks and hands-on workshops around the concept of "&lt;em&gt;AWS cloud-native modern applications&lt;/em&gt;" introduces the AWS cloud platform in that light, starting with the core building blocks of modernizing traditional applications and showing how to capitalize on AWS services and capabilities to build more resilient, reliable applications with cloud-native design in mind.&lt;/p&gt;

&lt;p&gt;It's 2021. We need ways to speed up deploying, scaling, and automating our applications, to enable developers and operations teams ("&lt;em&gt;DevOps&lt;/em&gt;") to collaborate effectively, work efficiently, and save resources, and to solve the matrix-from-Hell problem. The magic word here is containers. This is a pragmatic, hands-on series of workshops introducing members to AWS container workloads, focusing on the fundamentals and where to start: an AWS Containers 101 workshop.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conception &amp;amp; leadership: AWS Container Hero &lt;strong&gt;Walid Shaari&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Organizer: AWS Community Hero &lt;strong&gt;Anas Khattar&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Outreach: AWS Community Hero &lt;strong&gt;Ahmed Samir&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Speakers:

&lt;ul&gt;
&lt;li&gt;AWS Container Hero &lt;strong&gt;Walid Shaari&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;AWS Container Hero &lt;strong&gt;Vlad Ionescu&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow AWS UG Dubai on Twitter: &lt;a href="https://twitter.com/awsdubai"&gt;@awsdubai&lt;/a&gt; and &lt;a href="https://twitter.com/AWSomeMENA"&gt;@AWSomeMENA&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro to Containers 101
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wlsme5iQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/07rqf2tjsg5yefrb5up6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wlsme5iQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/07rqf2tjsg5yefrb5up6.png" alt="Cover image for the first talk"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the first talk of this new series, we will introduce the format for this multi-talk series. After that, we’ll discuss containers in an abstract way. Why do we use containers? What are some different ways of running containers?&lt;/p&gt;

&lt;p&gt;Agenda:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intro to this series: speakers, format, sponsors&lt;/li&gt;
&lt;li&gt;What are some ideal use cases for containers?&lt;/li&gt;
&lt;li&gt;How can we run containers in production?&lt;/li&gt;
&lt;li&gt;Open discussions, Q&amp;amp;A&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keywords:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Containers, Docker, OCI&lt;/li&gt;
&lt;li&gt;AWS ECS, Amazon EKS, Kubernetes, Lambda Containers, Serverless&lt;/li&gt;
&lt;li&gt;Immutable Infrastructure, Virtual Machines, VMs&lt;/li&gt;
&lt;li&gt;Gitpod, GitHub Codespaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📆 Date: Wednesday, Sept. 15, 2021, at 8:30 PM UTC +4&lt;/p&gt;

&lt;p&gt;🌍 Venue: online, see all the details at &lt;a href="https://www.meetup.com/AWS-Dubai/events/280711543/"&gt;AWS UG Dubai Meetup #8 (virtual): Part 1 - Intro to Containers 101&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎥 Video recording of the meetup: &lt;a href="https://www.youtube.com/watch?v=PwbPlk-KE7s"&gt;here on YouTube&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Containers 102
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ivBjqAOe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oo0xm708vdyg3q206zdj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ivBjqAOe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oo0xm708vdyg3q206zdj.png" alt="Cover image for the second talk"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this session, we will discuss everything about building containers: what a container image is, how we build one, and what the OCI is. We will end this talk with some best practices, and we will build up some excitement for the workshop!&lt;/p&gt;

&lt;p&gt;Agenda:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Container workflows&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;Dockerfiles&lt;/li&gt;
&lt;li&gt;Building an image&lt;/li&gt;
&lt;li&gt;Tagging an image&lt;/li&gt;
&lt;li&gt;Running a container from an image&lt;/li&gt;
&lt;li&gt;Best practices and helpful tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keywords:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dockerfiles&lt;/li&gt;
&lt;li&gt;Docker, OCI, Open Container Initiative&lt;/li&gt;
&lt;li&gt;GitHub Actions, CircleCI&lt;/li&gt;
&lt;li&gt;AWS ECR, Github Container Registry, Dockerhub, Quay&lt;/li&gt;
&lt;li&gt;Hadolint, Snyk, Dependabot, Dive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📆 Date: Tuesday, Sept. 21, 2021, at 8:30 PM UTC +4&lt;/p&gt;

&lt;p&gt;🌍 Venue: online, see all the details at &lt;a href="https://www.meetup.com/AWS-Dubai/events/280722294/"&gt;AWS UG Dubai Meetup #9 (virtual): Part 2 - Building Containers 102&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎥 Video recording of the meetup: &lt;a href="https://www.youtube.com/watch?v=mTzAD0s3kd4"&gt;here on YouTube&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Containers 102 Lab
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jy2_2HDE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/79jq2bko622tqt8rwpuv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jy2_2HDE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/79jq2bko622tqt8rwpuv.png" alt="Cover image for the workshop"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this hands-on lab, you will learn how to build containers for different applications. After a short introduction, each student will open a pre-configured Gitpod workspace and build container images for 3 applications: one Go app, one Python app, and a React app! Once an image is built, we will push the image to Amazon Elastic Container Registry. For extra credit, we will also inspect the image we built. No previous experience with Docker, Go, Python, React, or JavaScript is required!&lt;/p&gt;

&lt;p&gt;This will be a live and guided workshop, leveraging Zoom, Slack, and Gitpod.&lt;/p&gt;

&lt;p&gt;Requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub account&lt;/li&gt;
&lt;li&gt;Gitpod account (can be created instantly)&lt;/li&gt;
&lt;li&gt;An active AWS account (if you don’t have an AWS account, you can do the first half of the workshop)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keywords:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS, ECR, Amazon Elastic Container Registry, Github Container Registry, Dockerhub, Quay&lt;/li&gt;
&lt;li&gt;Workshop, Interactive, Live, Guided&lt;/li&gt;
&lt;li&gt;Docker, OCI, Containerd, Buildkit, GitHub Actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📆 Date: Wednesday, Sept. 22, 2021, at 8:30 PM UTC +4&lt;/p&gt;

&lt;p&gt;🌍 Venue: online, see all the details at &lt;a href="https://www.meetup.com/AWS-Dubai/events/280722765/"&gt;AWS UG Dubai Meetup #10 (virtual): Part 3 - Building Containers Lab&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎥 Video recording of the meetup: &lt;a href="https://www.youtube.com/watch?v=87KSsZtH1uw"&gt;here on YouTube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Vlaaaaaaad/aws-ug-dbx-2021-building-containers-workshop-react-app"&gt;The React exercise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Vlaaaaaaad/aws-ug-dbx-2021-building-containers-workshop-golang-app"&gt;The Go exercise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Vlaaaaaaad/aws-ug-dbx-2021-building-containers-workshop-python-app"&gt;The Python exercise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Vlaaaaaaad/aws-ug-dbx-2021-building-containers-workshop-react-app"&gt;The React exercise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/poc-hello-world"&gt;The source for all the apps: poc-hello-world&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/docker/build-push-action"&gt;Docker's build-and-push GitHub Action&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitpod.io"&gt;Gitpod, the development environment we used&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Application Modernization with Amazon EKS
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Tl6s65IN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sq0897hj7ivyuf0hstgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Tl6s65IN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sq0897hj7ivyuf0hstgt.png" alt="Cover image for the fourth talk"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this session, we will explore the popular workload manager and scheduler Kubernetes. Amazon's managed Kubernetes service, Elastic Container Service for Kubernetes (Amazon EKS), takes care of the heavy lifting and lets you focus on managing your containerized workloads. EKS, however, still gives you the flexibility to choose where and how to efficiently run the data plane that hosts your workloads. We cover what you need to know to get your application up and running with Kubernetes on AWS, and we show how Amazon EKS makes deploying Kubernetes on AWS simple and scalable.&lt;/p&gt;

&lt;p&gt;Agenda:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review the general Kubernetes architecture and relate to EKS&lt;/li&gt;
&lt;li&gt;How to set up and provision your Kubernetes cluster using the console and eksctl&lt;/li&gt;
&lt;li&gt;Discuss the important abstractions that developers use to map their traditional applications onto any Kubernetes platform&lt;/li&gt;
&lt;li&gt;How to deploy software efficiently while sustaining reliable and scalable applications&lt;/li&gt;
&lt;li&gt;Deploy your first microservices on EKS&lt;/li&gt;
&lt;li&gt;A possible EKS development deployment workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keywords:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EKS-Distro, EKS-Anywhere&lt;/li&gt;
&lt;li&gt;Fargate, data plane&lt;/li&gt;
&lt;li&gt;YAML, Helm, Gitops, Operators&lt;/li&gt;
&lt;li&gt;Pod, Deployment, Service, ConfigMap, Secret&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📆 Date: Tuesday, Sept. 28, 2021, at 8:30 PM UTC +4&lt;/p&gt;

&lt;p&gt;🌍 Venue: online, see all the details at &lt;a href="https://www.meetup.com/AWS-Dubai/events/280723703/"&gt;AWS UG Dubai Meetup #11: Part 4 - Application Modernization with Amazon EKS&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎥 Video recording of the meetup: &lt;a href="https://www.youtube.com/c/AWSomeMENA/videos"&gt;to be posted here on YouTube&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>Accidentally a hacker</title>
      <dc:creator>Vlad Ionescu</dc:creator>
      <pubDate>Fri, 26 Jun 2020 10:55:21 +0000</pubDate>
      <link>https://dev.to/vlaaaaaaad/accidentally-a-hacker-1p62</link>
      <guid>https://dev.to/vlaaaaaaad/accidentally-a-hacker-1p62</guid>
      <description>&lt;p&gt;Around the beginning of February 2020, GitHub Security updated their &lt;a href="https://bounty.github.com/bounty-hunters.html" rel="noopener noreferrer"&gt;public security disclosures&lt;/a&gt; and &lt;a href="https://bounty.github.com/index.html" rel="noopener noreferrer"&gt;hacker leaderboard&lt;/a&gt;, and this happened:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjb20rltsjc4xml0aha9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjb20rltsjc4xml0aha9k.png" alt="GitHub Security Leaderboards with Vlad Ionescu on the ninth position out of ten people"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Previously, on October 7, 2019, something popped up on &lt;a href="https://www.hackerone.com" rel="noopener noreferrer"&gt;HackerOne&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmw6vg0hyqnwideov6ofj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmw6vg0hyqnwideov6ofj.png" alt="A 10000 $ bounty awarded to Vlad Ionescu on HackerOne for a GitHub security report"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the story of how it all came to be. Unlikely to ever have a sequel.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vlad likes pretty
&lt;/h2&gt;

&lt;p&gt;Something most people on the internet don't know about me is that I like pretty stuff. Be it a &lt;a href="https://www.eischcrystalglass.com/shop/gentleman-whiskey-pipette-gold/" rel="noopener noreferrer"&gt;whiskey pipette&lt;/a&gt;, a &lt;a href="https://afremov.com/passion.html" rel="noopener noreferrer"&gt;lovely painting&lt;/a&gt;, many &lt;a href="https://twitter.com/dog_feelings" rel="noopener noreferrer"&gt;dog thoughts&lt;/a&gt;, a &lt;a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" rel="noopener noreferrer"&gt;ferocious lover&lt;/a&gt; — it does not matter. I like pretty!&lt;/p&gt;

&lt;p&gt;Although I generally have excellent impulse control[1], pretty is pretty and I want pretty. I want to enjoy pretty. Savor pretty. Surprisingly often, this puts me in weird situations — like accidentally hacking GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  CircleCI is not pretty
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://circleci.com" rel="noopener noreferrer"&gt;CircleCI&lt;/a&gt; is a hosted build service. It can build or compile code to ensure there are no build errors, it can run tests to ensure the correct thing happens, it can do whatever you want it to do!&lt;/p&gt;

&lt;p&gt;But CircleCI is not pretty. For some cruel reason, CircleCI forces people to log in to their UI to get to the test output. Oh, a test failed? Spend 3 minutes following redirects and logging into things. Only then do you get to see why the test failed.&lt;/p&gt;

&lt;p&gt;The typical workflow looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;do some work&lt;/li&gt;
&lt;li&gt;send work to GitHub&lt;/li&gt;
&lt;li&gt;CircleCI is notified and starts building and testing the code&lt;/li&gt;
&lt;li&gt;notification is received from GitHub and/or CircleCI&lt;/li&gt;
&lt;li&gt;go to GitHub, see that the "&lt;em&gt;Test X&lt;/em&gt;" step failed&lt;/li&gt;
&lt;li&gt;click "&lt;em&gt;Details Test X&lt;/em&gt;"&lt;/li&gt;
&lt;li&gt;click "&lt;em&gt;See more in CircleCI&lt;/em&gt;"&lt;/li&gt;
&lt;li&gt;new tab opens&lt;/li&gt;
&lt;li&gt;click "&lt;em&gt;Log In&lt;/em&gt;"&lt;/li&gt;
&lt;li&gt;go through the login process&lt;/li&gt;
&lt;li&gt;finally, see how the test failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;WHY?! I don't understand why CircleCI even has &lt;a href="https://github.com/marketplace/circleci" rel="noopener noreferrer"&gt;a GitHub integration&lt;/a&gt;. Why not have the output &lt;strong&gt;right there&lt;/strong&gt; on GitHub? Why does CircleCI insist on adding another step to the whole process?&lt;/p&gt;

&lt;p&gt;It's not even a limitation on the GitHub side: &lt;a href="https://brigade.sh" rel="noopener noreferrer"&gt;Brigade&lt;/a&gt; can do, and does, &lt;a href="https://github.com/brigadecore/brigade/pull/914/checks?check_run_id=130703731" rel="noopener noreferrer"&gt;precisely that&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;I really do not like the whole user experience around the CircleCI integration with GitHub. It was, and it still is, a constant source of annoyance for me. I mean look at it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxew07t0fi3pqdrumb4tk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxew07t0fi3pqdrumb4tk.png" alt="An image comparing CircleCI in GitHub( three builds, 1 failed, and a link below with 'See more details on CircleCI' versus a Brigade test run with the output right there"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions are pretty
&lt;/h2&gt;

&lt;p&gt;In August 2019, it happened: I finally got access to the &lt;a href="https://techcrunch.com/2018/10/16/github-launches-actions-its-workflow-automation-tool/" rel="noopener noreferrer"&gt;GitHub Actions Beta&lt;/a&gt;! A day I was very much looking forward to.&lt;/p&gt;

&lt;p&gt;I am working on a continuous challenge to be less cranky, and CircleCI was making me cranky. This was a chance for me to be happier!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/features/actions" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt; are similar to CircleCI: they can build code, they can test code, they can do whatever you want them to do!&lt;/p&gt;

&lt;p&gt;Even more, GitHub Actions can respond to more than just new code: they can run when a comment is posted, or when a change is approved! This flexibility empowers people to build even more amazing workflows.&lt;/p&gt;

&lt;p&gt;I &lt;strong&gt;immediately&lt;/strong&gt; started playing around with GitHub Actions. I &lt;strong&gt;loved&lt;/strong&gt; everything about them — they were so flexible. More importantly, the output is right there. On the same page! Look how gorgeous they are:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmd6c1eqrpsf7nln87bj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmd6c1eqrpsf7nln87bj8.png" alt="An image showing very clear and pretty logs from GitHub Actions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything was nice in the world. For a couple days.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions user experience is not pretty
&lt;/h2&gt;

&lt;p&gt;GitHub Actions were unacceptably slow due to a lack of caching, and don't even get me started on the whole slew of unexpected and random limitations.&lt;/p&gt;

&lt;p&gt;You would think GitHub Actions can respond to, say, a comment being posted on a Pull Request, but then, surprise! They can indeed run when a comment is posted, but you don't actually have access to the code in that Pull Request.&lt;/p&gt;

&lt;p&gt;I started working with GitHub Actions more and more, and I even started creating some actions. It was not pretty, but it was prettier than the CircleCI alternative.&lt;/p&gt;

&lt;p&gt;I was doing a lot of &lt;a href="https://www.terraform.io" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; at that time, and I really wanted warnings from &lt;a href="https://github.com/terraform-linters/tflint/" rel="noopener noreferrer"&gt;tflint&lt;/a&gt; on Pull Requests. Output on the separate &lt;em&gt;Checks&lt;/em&gt; tab was not enough anymore — I wanted it in the &lt;em&gt;Conversations&lt;/em&gt; tab. That is where I spend my time, so that is where everything relevant should be. I don't want to even have to press one button.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions are pretty with reviewdog
&lt;/h2&gt;

&lt;p&gt;I ended up discovering the lovely &lt;a href="https://github.com/reviewdog/reviewdog" rel="noopener noreferrer"&gt;reviewdog&lt;/a&gt;: it takes the output from any tool and sends it to GitHub!&lt;/p&gt;

&lt;p&gt;Want the output as a &lt;em&gt;Check annotation&lt;/em&gt;? It can do that. Want the output as a &lt;em&gt;Review comment&lt;/em&gt;? It can do that.&lt;/p&gt;

&lt;p&gt;Look how pretty it is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F39lheymw3l5is0r87dqm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F39lheymw3l5is0r87dqm.png" alt="Reviewdog posting a comment on a Pull Request with a golint error"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was love at first sight! I started working more and more with reviewdog, and I even ended up writing &lt;a href="https://github.com/reviewdog/action-tflint" rel="noopener noreferrer"&gt;a reviewdog tflint GitHub Action&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Life was good again!&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions are not pretty if they have unexpected limitations
&lt;/h2&gt;

&lt;p&gt;Now I get notified that things are wrong, and in the right place. But wouldn't auto-fixes be even prettier? Not fully automatic, but something like "&lt;em&gt;click a button and it's fixed&lt;/em&gt;".&lt;/p&gt;

&lt;p&gt;Since I want pretty, I started working on it! I imagined something along these lines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub Actions sees that code formatting is wrong&lt;/li&gt;
&lt;li&gt;reviewdog posts a comment letting me know&lt;/li&gt;
&lt;li&gt;when the time is right, I comment &lt;code&gt;pls fix&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;automated fixes just appear&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GitHub Actions at that time were very... particular about how they downloaded the code and what version of the code was downloaded. For the life of me, I could not get an auto-fix workflow to do what I wanted.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions are pretty with auto-fixes
&lt;/h2&gt;

&lt;p&gt;After about a week of perusing the &lt;del&gt;incomplete&lt;/del&gt; very beta documentation, I figured it out!&lt;/p&gt;

&lt;p&gt;If I post a comment on the review screen and have the auto-fixer GitHub Action respond to the &lt;code&gt;pull_request_review&lt;/code&gt; event, it all works! WOOOO! The right code is downloaded, the fixes are pushed back, everything works!&lt;/p&gt;

&lt;p&gt;Life was so pretty! It was an exemplary workflow, and it made life so much easier.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions are not pretty if they leak secrets
&lt;/h2&gt;

&lt;p&gt;I was so proud of the thing I built. I was even &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks/pull/541" rel="noopener noreferrer"&gt;telling people about it&lt;/a&gt; and &lt;a href="https://github.com/reviewdog/action-tflint/issues/2#issuecomment-536319167" rel="noopener noreferrer"&gt;insisting they use it&lt;/a&gt; because it is a better user experience.&lt;/p&gt;

&lt;p&gt;In one of the chats, &lt;a href="https://github.com/reviewdog/action-tflint/issues/2#issuecomment-536320327" rel="noopener noreferrer"&gt;it was pointed out to me&lt;/a&gt; that I was kind of breaking the security of GitHub Actions.&lt;/p&gt;

&lt;p&gt;In my mind it was just "&lt;em&gt;beware, you need to set this thing for this to work&lt;/em&gt;", not "&lt;em&gt;please create a secret so I can steal it&lt;/em&gt;".&lt;/p&gt;

&lt;p&gt;I have access to a surprising amount of GitHub Organizations and GitHub Teams, so I went forth and tested if I was actually breaking the Secret protection.&lt;/p&gt;

&lt;p&gt;In 10 minutes I had confirmed that I was indeed able to steal any secret for any public GitHub repository, without any user involvement.&lt;/p&gt;

&lt;p&gt;Oops.&lt;/p&gt;

&lt;h2&gt;
  
  
  Submitting the report
&lt;/h2&gt;

&lt;p&gt;Allow me to set the scene.&lt;/p&gt;

&lt;p&gt;It is late in the evening on September 29. I am on my couch with my laptop and a glass of &lt;a href="https://www.gitana.md/en/shop/reserve/lupi-2/" rel="noopener noreferrer"&gt;Lupi&lt;/a&gt;, a lovely blend of red wines. Rather inebriated, but happy and content. Doing stuff on GitHub.&lt;/p&gt;

&lt;p&gt;I panicked a bit. I allegedly found a security issue. I am also drunk, so I might be wrong. But I could be right. But I am a dum-dum and GitHub people are smarter than me, so there is no way. But what if?&lt;/p&gt;

&lt;p&gt;Fuck. I panicked some more.&lt;/p&gt;

&lt;p&gt;I had to report it — even if there was a small chance of it being real, there was a risk. I never had any issues or reservations about looking dumb: that's how you learn!&lt;/p&gt;

&lt;p&gt;I imagined I would send an email to something like &lt;code&gt;security@github.com&lt;/code&gt;. Looking into it, I saw that GitHub has an &lt;a href="https://help.github.com/en/github/site-policy/responsible-disclosure-of-security-vulnerabilities" rel="noopener noreferrer"&gt;open process on how to report security issues&lt;/a&gt;: they have a &lt;a href="https://www.hackerone.com" rel="noopener noreferrer"&gt;HackerOne&lt;/a&gt; account. HackerOne handles the process and GitHub responds. Nice!&lt;/p&gt;

&lt;p&gt;I quickly created an account with HackerOne, hoping that a 2-minute-old account would be allowed to send a report. Surprisingly, they allow that. Nice!&lt;/p&gt;

&lt;p&gt;Drunkenly, I wrote a report and submitted it. For your pleasure, here it is in all its glory:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Description&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Unsure how much of a security issue this is but better safe than sorry. This is my first report ever so I have no idea what I am doing. Apologies if I am wrong.&lt;/p&gt;

&lt;p&gt;GitHub Actions for the pull_request_review event run in the base repo not in the fork. But a fork can change the code for the action to leak secrets. No input is needed.&lt;/p&gt;

&lt;p&gt;This was discussed a bit on &lt;a href="https://github.com/reviewdog/action-tflint/issues/2" rel="noopener noreferrer"&gt;https://github.com/reviewdog/action-tflint/issues/2&lt;/a&gt; where I was urged to report it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps To Reproduce&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://github.com/org-name-redacted/repo-for-fork-testing/pull/9" rel="noopener noreferrer"&gt;https://github.com/org-name-redacted/repo-for-fork-testing/pull/9&lt;/a&gt; with the &lt;a href="https://github.com/org-name-redacted/repo-for-fork-testing/commit/2dc3d01c3f342d722ca0a0b0543901a15de7fe16/checks" rel="noopener noreferrer"&gt;https://github.com/org-name-redacted/repo-for-fork-testing/commit/2dc3d01c3f342d722ca0a0b0543901a15de7fe16/checks&lt;/a&gt; check, the Test step. I base64 encoded the secrets to get around the secret hiding thing.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Have a public repository with GitHub Actions access and some Secrets set for the repo&lt;/li&gt;
&lt;li&gt;Fork the repo to an account&lt;/li&gt;
&lt;li&gt;Add an action that runs for pull_request_review that tries to get the value of a secret&lt;/li&gt;
&lt;li&gt;Create PR to the public repository&lt;/li&gt;
&lt;li&gt;Leave a review comment on the created PR&lt;/li&gt;
&lt;li&gt;Go to the Actions tab where the action ran and secrets values can be found&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This does require knowing the secrets name, but I guess a brute force attack searching for secrets names is not improbable.&lt;/p&gt;

&lt;p&gt;Again, apologies if Actions is out of scope/ this issue is known.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Leaking of secrets( whose names are known) for any public repository with GitHub Actions enabled.&lt;/p&gt;


&lt;/blockquote&gt;

&lt;p&gt;I woke up the next day. A ridiculously sunny Monday. I thought about it some more, and I updated my initial report.&lt;/p&gt;

&lt;p&gt;I remember it vividly: I was in an Uber on my way to a client onsite, and I was sitting in the back trying to write a "&lt;em&gt;serious&lt;/em&gt;" and "&lt;em&gt;adult&lt;/em&gt;" report:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;I don't see a way to edit this, but I'd like to raise this to Critical. I am still new and confused, but being able to leak any secret from GitHub Actions seems huge to me.&lt;/p&gt;

&lt;p&gt;While Actions are still in preview, multiple projects use them already. This exploit has a huge impact for GitHub Actions and for any org using Actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/Homebrew/homebrew-core/blob/e9b3867c54009a63d396e6642ddeaf0063152205/.github/workflows/generate_formulae.brew.sh_data.yml#L17" rel="noopener noreferrer"&gt;Homebrew can be totally taken over&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/ruby/ruby/blob/8ba48c1b8509bc52c2fc1f020990c8a3a8eca2c9/.github/workflows/draft-release.yml#L80" rel="noopener noreferrer"&gt;Ruby's AWS Keys are easily retrievable&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pretty much every secret on GitHub can be retrieved in less than 5 minutes, with no action required from the person being attacked. A very easy way to find names &lt;a href="https://github.com/search?q=secrets+path%3A.github%2Fworkflows" rel="noopener noreferrer"&gt;is to search for secrets in .github/workflows across all GitHub&lt;/a&gt;. Once a name is found, the repo is forked, a PR is created back with the following code, a comment is left on the PR, and in less than 1 minute the secrets are out:&lt;/p&gt;


&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Leak&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request_review&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;leak&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Leak GitHub Seecrets&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Leak the secre&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;echo $VALUE | base64&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;VALUE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.SECRET_NAME_WHICH_IS_EASY_TO_GET }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Please treat this with the highest priority.&lt;/p&gt;

&lt;p&gt;Thank you,&lt;/p&gt;

&lt;p&gt;Vlad Ionescu&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It had a typo and all that, but it was a better report.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Bounties are pretty
&lt;/h2&gt;

&lt;p&gt;GitHub, to their credit, responded super-fast to the report. Keeping in mind that times are local to Bucharest and the US is 10-ish hours behind, the timeline looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dead-of-the-night Sunday, September 29: initial report&lt;/li&gt;
&lt;li&gt;Early Monday, September 30: updated report&lt;/li&gt;
&lt;li&gt;Late Monday, September 30: report triaged and GitHub started working on it&lt;/li&gt;
&lt;li&gt;October 7: $10.000 bounty awarded for the bug report [2][3]&lt;/li&gt;
&lt;li&gt;October 18: lifetime GitHub Pro subscription awarded, formal invitation into the &lt;a href="https://github.com/GitHubBounty" rel="noopener noreferrer"&gt;@GitHubBounty organization&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  GitHub Bounties spam is weird
&lt;/h2&gt;

&lt;p&gt;As soon as GitHub awarded the bounty, their HackerOne page was updated and showed that &lt;em&gt;"Vlad Ionescu reported something secret and got 10 grand from GitHub"&lt;/em&gt;. That brought more attention to me than I would've liked. It was a bit overwhelming.&lt;/p&gt;

&lt;p&gt;I started getting a bunch of messages asking me for details about the vulnerability, and a lot more asking me if I want to pair on researching bugs.&lt;/p&gt;

&lt;p&gt;On the one hand, that is very spammy. On the other hand, awww, people hack together and collaborate and that seems nice? I have doubts about how nice it actually is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The fact that a random human with zero experience can just create an account on HackerOne and report a security issue to a major company is fantastic. And they take these reports seriously! And give out rewards!&lt;/p&gt;

&lt;p&gt;Major props to &lt;a href="https://hackerone.com" rel="noopener noreferrer"&gt;HackerOne&lt;/a&gt; and the &lt;a href="https://github.com/security/team" rel="noopener noreferrer"&gt;GitHub Security Team&lt;/a&gt; for all this! They managed to create an outstanding process, and they stick to it.&lt;/p&gt;

&lt;p&gt;Also, I can now put &lt;em&gt;Security Researcher&lt;/em&gt; on my resume! Ahem, it's a resume; I have to sell myself. "&lt;em&gt;Established myself as part of the &lt;a href="https://duo.com/decipher/taking-hype-out-of-bug-bounty-programs" rel="noopener noreferrer"&gt;top 0.3 percent of hackers&lt;/a&gt;&lt;/em&gt;".&lt;/p&gt;




&lt;p&gt;[1]: Shhhhhhhh. I really do!&lt;br&gt;
[2]: I, of course, tried negotiating for a higher bounty. Well, "&lt;em&gt;asking&lt;/em&gt;" cannot really be called negotiating.&lt;br&gt;
[3]: The reward was split with the person that suggested reporting the issue to GitHub.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>security</category>
      <category>github</category>
    </item>
    <item>
      <title>Scaling containers in AWS</title>
      <dc:creator>Vlad Ionescu</dc:creator>
      <pubDate>Fri, 19 Jun 2020 15:30:41 +0000</pubDate>
      <link>https://dev.to/aws-heroes/scaling-containers-in-aws-3e82</link>
      <guid>https://dev.to/aws-heroes/scaling-containers-in-aws-3e82</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2020 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This all started with a tech curiosity: what's the fastest way to scale containers on AWS? Is ECS faster than EKS? What about Fargate? Is there a difference between Fargate on ECS and Fargate on EKS?&lt;/p&gt;

&lt;p&gt;For the fiscally responsible person in me, there is another interesting side to this. Does everybody have to bear the complexities of EKS? Where is the line? Say for a job that &lt;strong&gt;must&lt;/strong&gt; finish in less than 1 hour and that on average uses 20.000 vCPUs and 50.000 GBs, what is the cost-efficient option considering the ramp-up time?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnip0lq1nvih2xooqbhae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnip0lq1nvih2xooqbhae.png" alt="Hand-drawn graph showing EKS scaling from 0 to 3500 containers in about 5 minutes. Fargate on ECS and Fargate on EKS scale up to the same 3500 containers in about an hour. Tuned Fargate scales faster than Fargate, but slower than EKS. There is minimal variance in the results." width="800" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2020 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Tl;dr&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fargate on ECS scales up a single &lt;code&gt;Service&lt;/code&gt; at a surprisingly consistent 23 containers per minute&lt;/li&gt;
&lt;li&gt;Fargate on ECS scales up multiple &lt;code&gt;Services&lt;/code&gt; at 60 containers per minute &lt;a href="https://forums.aws.amazon.com/thread.jspa?threadID=276814" rel="noopener noreferrer"&gt;as per the default AWS Limit&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fargate on EKS scales up at 60 containers per minute, regardless of the number of &lt;code&gt;Deployments&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Fargate scales up the same way, no matter if it's running on ECS or EKS&lt;/li&gt;
&lt;li&gt;Fargate on EKS scales down &lt;a href="https://twitter.com/iamvlaaaaaaad/status/1231980069008166912" rel="noopener noreferrer"&gt;significantly faster&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fargate limits can be increased for relevant workloads, significantly improving performance&lt;/li&gt;
&lt;li&gt;Fargate starts scaling with a burst of 10 containers in the first second&lt;/li&gt;
&lt;li&gt;EKS does have a delay of 1-2 minutes before it starts scaling up. The &lt;a href="https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/" rel="noopener noreferrer"&gt;&lt;code&gt;kube-scheduler&lt;/code&gt;&lt;/a&gt; has to do some magic, &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;&lt;code&gt;cluster-autoscaler&lt;/code&gt;&lt;/a&gt; has to decide what nodes to add, and so on. Fargate starts scaling up immediately&lt;/li&gt;
&lt;li&gt;EKS scales up suuuper fast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beware, &lt;strong&gt;this benchmark is utterly useless for web workloads&lt;/strong&gt; — the focus here is on background work or batch processing. For a scaling benchmark relevant to web services, you can check out &lt;a href="https://www.youtube.com/watch?v=OdzaTbaQwTg&amp;amp;t=1366" rel="noopener noreferrer"&gt;Clare Liguori's awesome demo at re:Invent 2019&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt; &lt;br&gt;
 &lt;/p&gt;

&lt;p&gt;That's it! If you want to check out details about how I tested this, read on.&lt;/p&gt;





&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2020 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Preparation
&lt;/h2&gt;

&lt;p&gt;Before any test can be done, we have to prepare.&lt;/p&gt;
&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;p&gt;I created a completely separate AWS Account for this — my brand new "&lt;em&gt;Container scaling&lt;/em&gt;" account. Any performance tests I plan to do in the future will happen here.&lt;/p&gt;

&lt;p&gt;In this account, I submitted requests to increase several limits.&lt;/p&gt;

&lt;p&gt;By default, Fargate allows a maximum of 250 tasks to run. The &lt;em&gt;Fargate Spot Concurrent Tasks&lt;/em&gt; limit was raised to 10.000.&lt;/p&gt;

&lt;p&gt;By default, Fargate allows scaling at 1 task per second (after an initial burst of 10 tasks in the first second). After discussions with AWS and validating my workflow, this limit was raised to 3 tasks per second.&lt;/p&gt;

&lt;p&gt;By default, EKS does a great job of scaling the Kubernetes Control Plane components (really — I tested this extensively for my customers). As there will be lots of time with 0 work, and as we are not benchmarking EKS scaling, I wanted to take that variable out. AWS is happy to pre-scale clusters depending on the workload, and they did precisely that after some discussions and validation.&lt;/p&gt;

&lt;p&gt;By default, EC2 Spot allows for a maximum of 20 Spot Instances. After many talks, the &lt;em&gt;EC2 Spot Instances&lt;/em&gt; limit was raised to 250 &lt;code&gt;c5.4xlarge&lt;/code&gt; Spot Instances.&lt;/p&gt;
&lt;h3&gt;
  
  
  Test plan
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Numbers
&lt;/h4&gt;

&lt;p&gt;After an initial desired value of 30.000, and after changing my mind multiple times, it was finally decided: we will test what is the fastest option to scale to about 3.500 containers!&lt;/p&gt;

&lt;p&gt;Multiple reasons went into this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the joys of limit increases for EC2 Spot Instances on a new AWS Organization with a history of sustained usage at around $5/month&lt;/li&gt;
&lt;li&gt;I had to run these tests in my own AWS Organization — I really couldn't do this in one of my customers' accounts&lt;/li&gt;
&lt;li&gt;I really &lt;em&gt;really&lt;/em&gt; &lt;strong&gt;really&lt;/strong&gt; did not want to wait 6 hours for a single test&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Regions
&lt;/h4&gt;

&lt;p&gt;We don't want to test just in &lt;code&gt;us-east-1&lt;/code&gt; due to its... size and &lt;em&gt;particularities&lt;/em&gt;, so we should also run the tests in &lt;code&gt;eu-west-1&lt;/code&gt;, which is the largest European region.&lt;/p&gt;

&lt;p&gt;Like any European AWS user, I've had &lt;code&gt;us-east-1&lt;/code&gt; issues that &lt;strong&gt;only&lt;/strong&gt; happened during my night-time — actual US day-time. We must run the tests during US day-time, too, for full relevance.&lt;/p&gt;
&lt;h4&gt;
  
  
  Measuring
&lt;/h4&gt;

&lt;p&gt;To measure the container creation, we'll use &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html" rel="noopener noreferrer"&gt;CloudWatch Container Insights&lt;/a&gt;. It has the &lt;code&gt;RunningTaskCount&lt;/code&gt; metric for Fargate and &lt;code&gt;namespace_number_of_running_pods&lt;/code&gt; metric for Kubernetes that give us exactly what we need.&lt;/p&gt;

&lt;p&gt;To scale up, we'll just edit &lt;code&gt;Desired Tasks&lt;/code&gt; or &lt;code&gt;Replicas&lt;/code&gt; to &lt;code&gt;3.500&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;All the tests will start from a base of 1 task/pod already running.&lt;/p&gt;

&lt;p&gt;For EKS testing, the start point will be just 1 node running; cluster-autoscaler will have to add all the relevant nodes.&lt;/p&gt;
&lt;h4&gt;
  
  
  Container
&lt;/h4&gt;

&lt;p&gt;The container image used was &lt;a href="https://github.com/poc-hello-world/namer-service" rel="noopener noreferrer"&gt;poc-hello-world/namer-service&lt;/a&gt; — a straightforward app I plan to use in upcoming workshops and posts.&lt;/p&gt;

&lt;p&gt;The container size was decided to be 1 vCPU and 2 GBs of memory. Not too small, but not too large either.&lt;/p&gt;

&lt;p&gt;Both tasks and pods can run multiple containers, but for simplicity, we will use just 1 container.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2020 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Expectations
&lt;/h2&gt;

&lt;p&gt;Before starting, it is crucial to set expectations. We don't want to be surprised with a 5-digit bill, for example.&lt;/p&gt;
&lt;h3&gt;
  
  
  Expectations for Fargate on ECS
&lt;/h3&gt;

&lt;p&gt;AWS Fargate on ECS has to respect the &lt;a href="https://forums.aws.amazon.com/thread.jspa?threadID=276814" rel="noopener noreferrer"&gt;default 1 task per second launch limit&lt;/a&gt;, and so time to scale from 1 to 3.500 tasks should be around 3.500 seconds, which is about 1 hour. Reasonable.&lt;/p&gt;

&lt;p&gt;As I want to focus on realistic results for everybody, we will mostly test Fargate with the default rate limits. AWS does increase the rate limits for relevant workloads, but that is the exception and not the rule.&lt;/p&gt;

&lt;p&gt;A scaling test using the default limits would reach 10.000 Tasks in about 2:47 hours, which is indeed not that relevant. The first hour should be more than enough.&lt;/p&gt;
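&lt;p&gt;The timing arithmetic above can be sketched as a quick back-of-the-envelope model (my own sketch, assuming the initial burst of 10 tasks mentioned earlier and a constant launch rate afterwards):&lt;/p&gt;

```python
# Rough Fargate scale-up time: an initial burst of 10 tasks,
# then a constant launch rate in tasks per second.
def scale_up_seconds(target_tasks, rate_per_second=1.0, burst=10):
    return max(target_tasks - burst, 0) / rate_per_second

print(scale_up_seconds(3_500) / 60)       # ~58 minutes at the default 1 task/s
print(scale_up_seconds(10_000) / 3600)    # ~2.8 hours to 10.000 tasks
print(scale_up_seconds(3_500, 3.0) / 60)  # ~19 minutes at the raised 3 tasks/s
```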

&lt;p&gt;For Fargate, the pricing can be &lt;em&gt;kind-of-estimated&lt;/em&gt; by multiplying the number of tasks with the hourly cost.&lt;/p&gt;

&lt;p&gt;As per &lt;a href="https://aws.amazon.com/fargate/pricing/" rel="noopener noreferrer"&gt;AWS Fargate pricing page&lt;/a&gt; we get the following values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-Demand: 3.500 tasks * ($0.040480 per vCPU-hour * 1 vCPU + $0.0044450 per GB-hour * 2 GB) * 1 hour = ~$180 per test&lt;/li&gt;
&lt;li&gt;Spot: 3.500 tasks * ($0.012144 per vCPU-hour * 1 vCPU + $0.0013335 per GB-hour * 2 GB) * 1 hour = ~$60 per test&lt;/li&gt;
&lt;/ul&gt;
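&lt;p&gt;The same estimate as a tiny script (my own sketch; it uses the per-hour rates quoted above and rounds less aggressively than the bullet points do):&lt;/p&gt;

```python
# Rough per-test Fargate cost: tasks x (vCPU rate + memory rate) x hours.
# Each task is 1 vCPU and 2 GB, running for about 1 hour.
TASKS, VCPUS, GBS, HOURS = 3_500, 1, 2, 1

def fargate_test_cost(vcpu_hourly, gb_hourly):
    return TASKS * (vcpu_hourly * VCPUS + gb_hourly * GBS) * HOURS

print(round(fargate_test_cost(0.040480, 0.0044450)))  # On-Demand: ~$173
print(round(fargate_test_cost(0.012144, 0.0013335)))  # Spot: ~$52
```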

&lt;p&gt;Yes, Fargate pricing is the same in Northern Virginia and Ireland. Nice!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;THESE COSTS DO NOT INCLUDE ECR, NAT GATEWAY, AND MANY MANY OTHER COSTS!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To be safe, I'll more than double the $60 Spot cost.&lt;/p&gt;

&lt;p&gt;Final expectations for a Fargate test: $150 and about 1 hour.&lt;/p&gt;
&lt;h3&gt;
  
  
  Expectations for EKS
&lt;/h3&gt;

&lt;p&gt;AWS AutoScaling Group is responsible for adding EC2 instances, and &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;cluster-autoscaler&lt;/a&gt; is in charge of deciding the actual number of EC2s needed.&lt;/p&gt;

&lt;p&gt;It is a bit unclear how fast AWS would give us an instance. Let's go with 60 seconds.&lt;/p&gt;

&lt;p&gt;From my testing a while ago, an EC2 instance took about 21 seconds from actually starting to becoming a &lt;code&gt;Ready&lt;/code&gt; node in Kubernetes. Add say 30 seconds for image pulling and starting the container (an overestimation, but let's go wild).&lt;/p&gt;

&lt;p&gt;Now, all of this can be modeled mathematically, and we’d get a time estimate. Unfortunately, I am not that smart, so we’ll skip getting a time estimate. It would take me less time to run a test than to do the math.&lt;/p&gt;

&lt;p&gt; &lt;br&gt;
 &lt;/p&gt;

&lt;p&gt;A &lt;code&gt;c5.4xlarge&lt;/code&gt; EC2 instance has 16 vCPUs and 32 GBs of memory. Some of it is reserved for Kubernetes components, monitoring, NodeLocal DNS, and so on. Let's say a total of 2 vCPUs and 4 GBs for cluster-level pods on each node.&lt;/p&gt;

&lt;p&gt;We are left with 14 vCPUs and 28 GBs, which at our task size is 14 pods per node. At 250 nodes, that is precisely 3.500 pods. Purrfect!&lt;/p&gt;
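&lt;p&gt;The capacity math above, as a quick sketch:&lt;/p&gt;

```python
# Pods that fit on one c5.4xlarge (16 vCPUs, 32 GB) after reserving
# 2 vCPUs and 4 GB for cluster-level pods; each pod needs 1 vCPU and 2 GB.
node_vcpus, node_gbs = 16 - 2, 32 - 4
pods_per_node = min(node_vcpus // 1, node_gbs // 2)

print(pods_per_node)        # 14 pods per node
print(pods_per_node * 250)  # 3500 pods across 250 nodes
```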

&lt;p&gt;As per &lt;a href="https://aws.amazon.com/ec2/pricing/" rel="noopener noreferrer"&gt;AWS EC2 pricing page&lt;/a&gt;, we get the following values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-Demand in Northern Virginia: $0.680 per hour of &lt;code&gt;c5.4xlarge&lt;/code&gt; usage * 250 nodes = $170 per hour&lt;/li&gt;
&lt;li&gt;Spot in Northern Virginia: $0.260 per hour of &lt;code&gt;c5.4xlarge&lt;/code&gt; usage * 250 nodes = $65 per hour&lt;/li&gt;
&lt;li&gt;On-Demand in Ireland: $0.768 per hour of &lt;code&gt;c5.4xlarge&lt;/code&gt; usage * 250 nodes = $195 per hour&lt;/li&gt;
&lt;li&gt;Spot in Ireland: $0.291 per hour of &lt;code&gt;c5.4xlarge&lt;/code&gt; usage * 250 nodes = $75 per hour&lt;/li&gt;
&lt;/ul&gt;
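&lt;p&gt;As a sanity check on the list above, the hourly node cost is just the instance rate times 250 nodes (my own sketch, using the &lt;code&gt;c5.4xlarge&lt;/code&gt; rates as listed, with the Northern Virginia On-Demand rate at $0.680/hour):&lt;/p&gt;

```python
# Hourly cost of the 250-node fleet at each c5.4xlarge rate.
NODES = 250
rates = {
    "On-Demand, N. Virginia": 0.680,
    "Spot, N. Virginia": 0.260,
    "On-Demand, Ireland": 0.768,
    "Spot, Ireland": 0.291,
}
for label, hourly in rates.items():
    print(f"{label}: ~${hourly * NODES:.0f} per hour")
```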

&lt;blockquote&gt;
&lt;p&gt;THESE COSTS DO NOT INCLUDE ECR, NAT GATEWAY, AND MANY MANY OTHER COSTS!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So to be safe, I'll more than double the $75 Spot cost.&lt;/p&gt;

&lt;p&gt;Final expectations for an EKS test: $200 per hour and unknown running time.&lt;/p&gt;
&lt;h3&gt;
  
  
  Expectations for Fargate on EKS
&lt;/h3&gt;

&lt;p&gt;Pricing is the same as Fargate on ECS, so about 150$ per test.&lt;/p&gt;

&lt;p&gt;Time is the same as EKS? Is the time closer to Fargate on ECS than to EKS?&lt;/p&gt;

&lt;p&gt;I have no idea, and I really look forward to seeing what happens.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2020 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Running the tests
&lt;/h2&gt;

&lt;p&gt;I ran a total of about 10 Fargate on EKS and ECS tests — for some of them, I was still figuring out some stuff. I ran a total of 4 EKS tests.&lt;/p&gt;

&lt;p&gt;I ran the tests both during the weekend and during the week. I made sure to run the tests during day-time and night-time too.&lt;/p&gt;

&lt;p&gt;The data used for the graph on top of the page can be downloaded as a CSV &lt;a href="///posts/scaling-containers-in-aws/graph-results.csv"&gt;at this link&lt;/a&gt; (CSV, less than 1 MB).&lt;/p&gt;
&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Let's recap the results from the top of the page!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnip0lq1nvih2xooqbhae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnip0lq1nvih2xooqbhae.png" alt="Hand-drawn graph showing EKS scaling from 0 to 3500 containers in about 5 minutes. Fargate on ECS and Fargate on EKS scale up to the same 3500 containers in about an hour. Tuned Fargate scales faster than Fargate, but slower than EKS. There is minimal variance in the results." width="800" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Terraform code used for all the tests can be found on GitHub at &lt;a href="https://github.com/vlaaaaaaad/blog-scaling-containers-in-aws" rel="noopener noreferrer"&gt;Vlaaaaaaad/blog-scaling-containers-in-aws&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is not pretty or correct code at all. On the other hand, it is code that works! I firmly believe in the &lt;a href="https://plato.stanford.edu/entries/scientific-reproducibility/" rel="noopener noreferrer"&gt;reproducibility of any results&lt;/a&gt;, and I believe I have a moral duty to share everything I used to reach the conclusions presented here.&lt;/p&gt;

&lt;p&gt;Unfortunately, the costs were not fully tracked.&lt;br&gt;
Due to AWS pre-scaling the EKS Control Plane for my clusters, I had to keep them running for the whole duration. I did multiple tests a day in multiple regions, so there was no easy way to get the cost of a single test run. The estimates did help a lot, but they were there to ensure I did not end up spending tens of thousands of dollars.&lt;/p&gt;

&lt;p&gt;The total bill for the account was a little under 2.000$.&lt;/p&gt;
&lt;h3&gt;
  
  
  Fargate results
&lt;/h3&gt;

&lt;p&gt;Fargate on ECS and Fargate on EKS scaled surprisingly similarly.&lt;/p&gt;

&lt;p&gt;I expected more variance in the results, but they were almost identical when scaling up: an initial burst and then 60 containers per minute.&lt;br&gt;
Maybe for a minute it was 58, and maybe for another minute it was 62, but the variance was minimal.&lt;/p&gt;

&lt;p&gt;When running the tests, I did discover a &lt;a href="https://github.com/aws/containers-roadmap/issues/768" rel="noopener noreferrer"&gt;previously-unknown Fargate on ECS limitation&lt;/a&gt;: 1 &lt;code&gt;Service&lt;/code&gt; can have at most &lt;code&gt;1.000&lt;/code&gt; Tasks running. To reach the 3.500 running tasks, I had to create 4 separate services. That led to the Fargate on ECS tests starting with 4 containers instead of 1.&lt;/p&gt;

&lt;p&gt;Something that I did not thoroughly test was downscaling. I did notice Fargate on EKS scaled down &lt;a href="https://twitter.com/iamvlaaaaaaad/status/1231980069008166912" rel="noopener noreferrer"&gt;significantly faster&lt;/a&gt;. Fargate on ECS scaled down more slowly: from what I saw, at about the same speed as it scaled up.&lt;/p&gt;

&lt;p&gt;It turns out my cost math was flawed but landed in the right place: I totally forgot to account for downscaling!&lt;br&gt;
Scaling down takes about the same time as scaling up — I calculated the costs just for the scale-up. On the other hand, because I doubled the costs to account for unexpected things, I was in the right ballpark!&lt;/p&gt;
&lt;h3&gt;
  
  
  EKS results
&lt;/h3&gt;

&lt;p&gt;EKS results were also very similar between different runs. I tested both at night and during the day, I tested in multiple regions, and the results were almost identical.&lt;/p&gt;

&lt;p&gt;EKS being so much faster than Fargate was a bit of a surprise. While Fargate would scale in about 50 minutes, EKS was consistently done in less than 10 minutes.&lt;/p&gt;

&lt;p&gt;Since I am using EKS extensively in my work, there were no other surprises.&lt;/p&gt;

&lt;p&gt;From a cost perspective, I made the same mistake: I did not account for scaling down.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2020 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Disclaimers
&lt;/h2&gt;

&lt;p&gt;First of all, I would like to reiterate that this is utterly useless for web workloads — the focus here is on background work or batch processing. For a relevant web scaling benchmark, you can check out &lt;a href="https://www.youtube.com/watch?v=OdzaTbaQwTg&amp;amp;t=1366" rel="noopener noreferrer"&gt;Clare Liguori's awesome demo at re:Invent 2019&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As an aside, Lambda would scale up to 3.500 containers &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html" rel="noopener noreferrer"&gt;in 1.5 to 7 seconds, depending on the region&lt;/a&gt;. It's an entirely different beast, but I thought it was worth mentioning.&lt;/p&gt;

&lt;p&gt;Clearly, this was &lt;strong&gt;not a very strict or statistically relevant testing process&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I did just a few tests because I saw little variance — I did not see the point in running more.&lt;/p&gt;

&lt;p&gt;I only tested in &lt;code&gt;us-east-1&lt;/code&gt; and &lt;code&gt;eu-west-1&lt;/code&gt;, which are large regions. Numbers may or may not differ for smaller regions or during weird times — say Black Friday, Cyber Monday, EOY Report time.&lt;/p&gt;

&lt;p&gt;The container image I used was the image for &lt;a href="https://github.com/poc-hello-world/namer-service" rel="noopener noreferrer"&gt;poc-hello-world/namer-service&lt;/a&gt;, which is small and maybe not that well suited. I did not want to go down the whole "&lt;em&gt;let's optimize image pulling speed&lt;/em&gt;" rabbit hole.&lt;br&gt;
A study of all images on Dockerhub can be &lt;a href="https://ieeexplore.ieee.org/abstract/document/8891000" rel="noopener noreferrer"&gt;read here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I only ran pods and tasks with a single container. Both pods and tasks can have multiple containers running.&lt;/p&gt;

&lt;p&gt;I did not optimize the tests at all. I wanted to showcase the average experience, not a super-custom solution that would not help anybody. The values in here can likely be improved — multiple ECS clusters, multiple ASGs for EKS, and so on.&lt;/p&gt;

&lt;p&gt;ECS with EC2 was completely ignored. I cannot say there was a reason; I just did not think of it. Fargate has the cost advantage, the simplicity, and the top-notch AWS integration. EKS has the complexity, all the knobs, and all the buzzwords. Between the two, I did not consider ECS with EC2, and that is on me.&lt;/p&gt;

&lt;p&gt;Overall I am happy. After all this, we now have a ballpark figure we can use when designing systems 🙂&lt;/p&gt;
&lt;h2&gt;
  
  
  Acknowledgments
&lt;/h2&gt;

&lt;p&gt;First of all, thank you so much to Ignacio and Michael for helping me escalate all my Support tickets and for connecting me to the right people in AWS!&lt;/p&gt;

&lt;p&gt;Special thanks go out to Mats and Massimo for all their Fargate help and reviews! Your feedback was priceless!&lt;/p&gt;

&lt;p&gt;Thanks to everybody else that helped review this, gave feedback, or supported me!&lt;/p&gt;
&lt;h2&gt;
  
  
  Generating the graph
&lt;/h2&gt;

&lt;p&gt;This part took &lt;strong&gt;forever&lt;/strong&gt; to do, so I decided to add it to the post.&lt;/p&gt;

&lt;p&gt;The results are the most exciting thing about this whole post. I desperately wanted to have a pretty graph image showcasing the results.&lt;/p&gt;

&lt;p&gt;Since the testing was not statistically correct, I thought a hand-drawn graph would be perfect. It would nicely showcase the data, while at the same time hinting that experiences may vary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.apple.com/numbers/" rel="noopener noreferrer"&gt;Numbers&lt;/a&gt; and &lt;a href="https://products.office.com/en/excel" rel="noopener noreferrer"&gt;Excel&lt;/a&gt; could not easily do this. &lt;a href="https://livegap.com/charts/" rel="noopener noreferrer"&gt;LiveGap Charts&lt;/a&gt; had a bunch of hand-drawn options and was outstanding, but not exactly what I wanted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://matplotlib.org" rel="noopener noreferrer"&gt;Matplotlib&lt;/a&gt; to the rescue! After a bunch of research and about a day of playing around with it, I got a working script.&lt;/p&gt;

&lt;p&gt;It is not pretty or optimized, nor is it correct. But it works!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the required font&lt;/span&gt;
&lt;span class="c"&gt;#  BEWARE: this only works on macOS&lt;/span&gt;
&lt;span class="c"&gt;#          Linuxbrew does not install fonts&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;homebrew/cask-fonts/font-humor-sans

&lt;span class="c"&gt;# Install Python dependencies&lt;/span&gt;
pip3 &lt;span class="nb"&gt;install &lt;/span&gt;matplotlib numpy

&lt;span class="c"&gt;# Run and generate the image&lt;/span&gt;
python3 draw.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt; &lt;br&gt;
 &lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# File: draw.py
# The actual image-generation script
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;matplotlib.lines&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Line2D&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Load data from CSV
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;genfromtxt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;graph-results.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Start an XKCD graph
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xkcd&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Make the image pretty
&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_dpi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_size_inches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;7.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suptitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Scaling containers on AWS&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;@iamvlaaaaaaad&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fontsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Minutes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Containers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Colors from https://jfly.uni-koeln.de/color/
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EKS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EKS with EC2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TunedFargate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tuned Fargate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FargateOnECS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fargate on ECS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FargateOnEKS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fargate on EKS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.90&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add a legend to the graph
#  using default labels
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lower right&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;borderaxespad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Export the image
&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;savefig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;containers.svg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;savefig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;containers.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
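&lt;p&gt;For reference, &lt;code&gt;graph-results.csv&lt;/code&gt; needs a header row matching the column names the script reads (&lt;code&gt;names=True&lt;/code&gt; turns the header row into field names for &lt;code&gt;np.genfromtxt&lt;/code&gt;). A minimal file, with hypothetical values, would look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EKS,TunedFargate,FargateOnECS,FargateOnEKS
0,0,0,0
120,340,280,210
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;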



&lt;blockquote&gt;
&lt;p&gt;THIS IS OUTDATED!&lt;br&gt;
For the latest information, check out &lt;a href="https://www.youtube.com/watch?v=UhRiLCxYNbo" rel="noopener noreferrer"&gt;my re:Invent 2021 talk on YouTube: COM401 - Scaling Containers on AWS&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
