<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: oo00oo00oo00</title>
    <description>The latest articles on DEV Community by oo00oo00oo00 (@oo00oo00oo00).</description>
    <link>https://dev.to/oo00oo00oo00</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1263368%2F56f7b7f6-2b8c-4e0f-9439-1214bfd8b0d4.jpeg</url>
      <title>DEV Community: oo00oo00oo00</title>
      <link>https://dev.to/oo00oo00oo00</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/oo00oo00oo00"/>
    <language>en</language>
    <item>
      <title>Choosing an orchestrator for multi-tenant code execution system</title>
      <dc:creator>oo00oo00oo00</dc:creator>
      <pubDate>Mon, 12 Feb 2024 18:09:11 +0000</pubDate>
      <link>https://dev.to/oo00oo00oo00/choosing-an-orchestrator-for-multi-tenant-code-execution-system-pmm</link>
      <guid>https://dev.to/oo00oo00oo00/choosing-an-orchestrator-for-multi-tenant-code-execution-system-pmm</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Here at my current company, &lt;a href="https://tripleten.com/" rel="noopener noreferrer"&gt;Tripleten&lt;/a&gt;, we have multiple code execution systems deployed for our students to use as interactive trainers. One of them is designed to deploy student web servers (e.g. Django, Express.js) on our cloud. Recently we decided that this system needed to be refactored and redesigned (you’ll see why later). In this article I will guide you through the process of choosing the underlying &lt;strong&gt;&lt;em&gt;Orchestrator&lt;/em&gt;&lt;/strong&gt; for our new refactored system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problem statement
&lt;/h2&gt;

&lt;p&gt;First, let’s take a quick look at the problem we’re solving. In a nutshell, we need our system to create a new isolated environment on HTTP request. Containers or VMs - it does not matter to us, as long as the user can’t escape it, so let’s just call such an environment a &lt;code&gt;container&lt;/code&gt;. The student’s &lt;strong&gt;web application&lt;/strong&gt; will reside in this container, so we’ll need to put the student’s code inside the container and execute it. The user should also be able to interact with their container for at least an hour - that’s a realistic upper bound on the time a user needs to interact with their application before making any changes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: There are other approaches to solving our problem, e.g. having multiple pre-deployed “workers” with limited capacity for users’ web servers, but our chosen tactic of creating a new container each time makes it easier to comply with our security requirements.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Design overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hb9u9lx9s2a9aamm5k4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hb9u9lx9s2a9aamm5k4.png" alt="Simple design" width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, &lt;code&gt;Hub&lt;/code&gt; is the component of our system that acts as the main gateway (or facade, as you might call it), and &lt;code&gt;Orchestrator&lt;/code&gt; is an interface over &lt;em&gt;the thing that can create a container.&lt;/em&gt; The newly created container has a generated name here - &lt;code&gt;cnt-ff6dg&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We also want to be able to manipulate a container: add files to it, execute commands, start background processes, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey2mmngkev2jdigbkcgv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey2mmngkev2jdigbkcgv.png" alt="Requests flow" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here you can see two incoming requests to the Hub - the first request puts a file into container &lt;code&gt;cnt-ff6dg&lt;/code&gt; and the second one executes the &lt;code&gt;cat&lt;/code&gt; command on this file inside the container. The Orchestrator is not responsible for performing operations on a container, only for creating and getting containers.&lt;/p&gt;
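
&lt;p&gt;To make the flow above a bit more concrete, here is a minimal client-side sketch of those two requests in Go. The &lt;code&gt;/files&lt;/code&gt; and &lt;code&gt;/run&lt;/code&gt; routes come from the Hub shown later in this article, but the JSON payload shapes and the Hub address are assumptions made purely for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "bytes"
    "fmt"
    "net/http"
)

func main() {
    hub := "http://localhost:8080" // hypothetical Hub address

    // 1. Put a file into container cnt-ff6dg (payload shape is assumed).
    putFiles := []byte(`{"container_id":"cnt-ff6dg","files":[{"path":"/tmp/hello.txt","content":"hi"}]}`)
    resp, err := http.Post(hub+"/files", "application/json", bytes.NewReader(putFiles))
    if err != nil {
        panic(err)
    }
    resp.Body.Close()

    // 2. Execute `cat` on that file inside the same container (payload shape is assumed).
    runCmd := []byte(`{"container_id":"cnt-ff6dg","cmd":["cat","/tmp/hello.txt"]}`)
    resp, err = http.Post(hub+"/run", "application/json", bytes.NewReader(runCmd))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("run request status:", resp.Status)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;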

&lt;p&gt;There are quite a few other things we want to do with those containers, but for the sake of simplicity I won’t go into further detail - let’s concentrate on the requirements we have so far.&lt;/p&gt;

&lt;p&gt;At this point you probably think that the orchestrator just looks like a very limited web interface for Docker, and you have a point! So far we only want to create and get containers with it, so let’s see what implementation options are available for our orchestrator.&lt;/p&gt;




&lt;h2&gt;
  
  
  Orchestrator options
&lt;/h2&gt;

&lt;p&gt;As you noticed, it could be just a Docker engine installed on a server. Or perhaps we could use Kubernetes? Here is the list of alternatives we came up with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;Docker Swarm&lt;/li&gt;
&lt;li&gt;AWS Lambda&lt;/li&gt;
&lt;li&gt;AWS ECS&lt;/li&gt;
&lt;li&gt;Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of the options are pretty easy to exclude right away, but let’s review them all. &lt;/p&gt;

&lt;h3&gt;
  
  
  Docker
&lt;/h3&gt;

&lt;p&gt;Conceptually, Docker should work fine: you can create a container with it and get information about a running container. Spin up a &lt;code&gt;Hub&lt;/code&gt; with Docker on the same server and you’re good to go - it can’t get any simpler. But suppose our business does well and the resources of a single server are no longer enough? We could probably assign more CPU and memory to the server to support more containers, but what if this server goes down? We’d lose all the containers at once, and more importantly, we wouldn’t be able to create new ones for some time, which means that our students wouldn’t be able to learn properly.&lt;/p&gt;
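
&lt;p&gt;As a rough illustration of how thin such an orchestrator could be, here is a sketch of “create” and “get” on top of the Docker Engine API using the official Go SDK. This is not our actual implementation - the image and container names are placeholders, pulling the image is omitted, and error handling is reduced to panics.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "context"
    "fmt"

    "github.com/docker/docker/api/types"
    "github.com/docker/docker/api/types/container"
    "github.com/docker/docker/client"
)

func main() {
    ctx := context.Background()
    cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    if err != nil {
        panic(err)
    }

    // "Create a container": image name and container name are placeholders.
    created, err := cli.ContainerCreate(ctx, &amp;amp;container.Config{Image: "python:3.11-slim"}, nil, nil, nil, "cnt-ff6dg")
    if err != nil {
        panic(err)
    }
    if err := cli.ContainerStart(ctx, created.ID, types.ContainerStartOptions{}); err != nil {
        panic(err)
    }

    // "Get a container": inspect it to read back its current state.
    info, err := cli.ContainerInspect(ctx, created.ID)
    if err != nil {
        panic(err)
    }
    fmt.Println(info.Name, info.State.Status)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;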

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qp388uk54qrvytjvnfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qp388uk54qrvytjvnfu.png" alt="Docker" width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This leads us to a distributed design (unfortunately), so we can scale our resources horizontally and keep our &lt;code&gt;Hub&lt;/code&gt; replicas and &lt;code&gt;containers&lt;/code&gt; on different servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker Swarm
&lt;/h3&gt;

&lt;p&gt;For those who are not familiar with Docker Swarm, here’s a note from the &lt;a href="https://docs.docker.com/engine/swarm/" rel="noopener noreferrer"&gt;Docker docs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Swarm mode is an advanced feature for managing a cluster of Docker daemons.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Read a little more for a better understanding, but you can think of it as a distributed Docker that supports multiple &lt;em&gt;nodes&lt;/em&gt; (servers), which means our &lt;code&gt;Hub&lt;/code&gt; replicas can be placed separately from &lt;code&gt;containers&lt;/code&gt;, and we can also add more nodes to our cluster in case we run out of resources.&lt;/p&gt;

&lt;p&gt;This is perfect and fits all the requirements we have so far - we can create and get containers, while having that precious scalability and fault tolerance. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzluedc3hh4jd1xrynjn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzluedc3hh4jd1xrynjn6.png" alt="Docker swarm" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Remember how I said in the intro that we’re actually &lt;strong&gt;refactoring&lt;/strong&gt; the system and not implementing it from scratch? Yes, our current solution uses Docker Swarm. And we probably would have kept it, but we faced some problems that are specific to Docker Swarm. Under sufficient load it proved to be unstable in our hands: after reaching a certain number of actively running containers on multiple nodes, Docker Swarm just stopped working properly. We tried adding new nodes, restarting the service, and looking for memory leaks, but nothing worked except removing all existing containers and rebooting the whole thing, which we would like to do &lt;em&gt;never&lt;/em&gt;. Summarizing the downsides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;History of instability under sufficient load&lt;/li&gt;
&lt;li&gt;There is &lt;a href="https://dockerswarm.rocks/swarm-or-kubernetes/" rel="noopener noreferrer"&gt;no strong support&lt;/a&gt; for Docker Swarm currently&lt;/li&gt;
&lt;li&gt;A lot of maintenance work needed from infrastructure engineers

&lt;ul&gt;
&lt;li&gt;Lack of “managed” solutions - hard to set up a staging environment&lt;/li&gt;
&lt;li&gt;Node scaling is done by hand&lt;/li&gt;
&lt;li&gt;Log collection and monitoring differ from how they’re done in the rest of our infrastructure&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Overall, it’s a good option, but we won’t consider it further.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Lambda
&lt;/h3&gt;

&lt;p&gt;I can imagine some hacky way to make it work, and it would not be trivial, but all that engineering wouldn’t be worth it in our case, because the maximum Lambda invocation timeout is &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-timeout-console" rel="noopener noreferrer"&gt;15 minutes&lt;/a&gt;. We want our students to interact with their containers for a bit longer - at least an hour. There is also a &lt;a href="https://repost.aws/knowledge-center/lambda-concurrency-limit-increase" rel="noopener noreferrer"&gt;concurrency limit&lt;/a&gt; per account per AWS &lt;strong&gt;region&lt;/strong&gt;, which can be increased, but still, I feel this shows that Lambda is not the right instrument for our task.&lt;/p&gt;

&lt;p&gt;On the other hand, we could probably create a new Lambda function for each “create &lt;code&gt;container&lt;/code&gt;” request, but I can’t think of a valid reason to do so. Those Lambda functions would contain code our students submitted, yet they would sit among our production Lambdas. Something doesn’t feel right about it, but I might be wrong. Anyway, we’re skipping this option.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS ECS
&lt;/h3&gt;

&lt;p&gt;From AWS &lt;a href="https://aws.amazon.com/ecs/getting-started/?pg=ln&amp;amp;cp=bn" rel="noopener noreferrer"&gt;landing page&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Amazon Elastic Container Service (ECS) is a fully managed container orchestration service that simplifies your deployment, management, and scaling of containerized applications.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sounds great - it even has the word &lt;code&gt;orchestration&lt;/code&gt; in it! On a more serious note, using ECS can be a good choice for at least a couple of reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It supports creating/getting containers (“tasks” in ECS terminology) across multiple servers, which is what we need.&lt;/li&gt;
&lt;li&gt;Support for &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html" rel="noopener noreferrer"&gt;Fargate&lt;/a&gt;. Fargate allows you to automagically create containers without thinking about scaling and managing servers. You just specify the amount of resources your service needs and Fargate deploys it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The option to use Fargate really caught our attention, since both picking the right number of nodes and configuring an autoscaler are places where things can go wrong, so why not eliminate them (or delegate them to Amazon)?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n9oks8m6fmd16oiie36.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n9oks8m6fmd16oiie36.png" alt="AWS ECS" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The downside is clear - vendor lock-in. Regardless, ECS seems like a valid option; we’ll take a closer look at it later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes (AWS EKS)
&lt;/h3&gt;

&lt;p&gt;Using Kubernetes has looked like the best option on our list since we started digging into the problem, and I’ll explain why. First and foremost, it supports all of our requirements - you can create containers (“pods” in k8s terminology, each of which wraps the actual containers), perform any type of querying on them, label them, and so on. Also, much like Docker Swarm, you can have multiple nodes (servers) and groups of nodes, which solves fault tolerance, and Kubernetes also provides a &lt;code&gt;cluster autoscaler&lt;/code&gt;, which can automatically scale nodes up and down based on configuration - no more tossing tokens around trying to register a new Swarm node. Cherry on top: Kubernetes is open source, and many cloud providers have managed Kubernetes on their product list - all of them share the core k8s API, so no vendor lock-in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbedhhercz2kveif71o1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbedhhercz2kveif71o1b.png" alt="EKS + EC2" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Digging a bit deeper into the EKS documentation, we found that EKS &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/fargate.html" rel="noopener noreferrer"&gt;supports Fargate&lt;/a&gt;. That’s great news: we can potentially have the best of both worlds - the open-source API of k8s and the serverless nature of Fargate! Just in case, EKS also supports “normal” node groups with “normal” virtual servers (a.k.a. EC2 instances), as shown in the diagram above - the “worker node group” consists of “data node 1” and “data node 2”, which are EC2 instances.&lt;/p&gt;

&lt;p&gt;In the diagram below we can see how our architecture would look with a Fargate profile instead of a node group with EC2 instances as nodes - no cluster autoscaler needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9et2bxzqkulsrfck7vw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9et2bxzqkulsrfck7vw.png" alt="EKS + Fargate" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Finalists
&lt;/h3&gt;

&lt;p&gt;We reviewed a handful of orchestrator options; some were more suitable for our requirements than others. Here are the ones worth competing in the MVP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECS + Fargate&lt;/li&gt;
&lt;li&gt;EKS + Fargate&lt;/li&gt;
&lt;li&gt;EKS + EC2&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;So how do we know which one to pick? Besides factors like how familiar our engineers are with a given technology, how well maintained it is, or simply what’s more appealing, we came up with a set of performance-oriented metrics. It is important to us that containers can be created swiftly, so our students don’t get frustrated waiting for their first Node.js server to be deployed. We would also like to measure the latency of commands we run inside a container, given a variable number of currently active containers. Here are our metrics:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetou4t4w329rz39ezqnm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetou4t4w329rz39ezqnm.png" alt="Metrics" width="800" height="52"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“&lt;strong&gt;CCT&lt;/strong&gt;” here is &lt;strong&gt;container creation time&lt;/strong&gt; in seconds; “burst each 100ms” means a new container is created every 100 ms.&lt;/li&gt;
&lt;li&gt;“CCT gradual each 1s” - same thing, but creating a container every second.&lt;/li&gt;
&lt;li&gt;“Avg run (ms)” is the &lt;strong&gt;average time to run&lt;/strong&gt; a simple &lt;code&gt;ls /&lt;/code&gt; command over 10 consecutive runs for each container.&lt;/li&gt;
&lt;li&gt;Each metric is calculated for a given total number of containers (“&lt;strong&gt;# containers&lt;/strong&gt;”). E.g. we calculate CCT where containers are created every 100 ms and the total number of containers created is 1000.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Calculating those metrics should give us a rough understanding of how the core features of our service would behave under different loads.&lt;/p&gt;
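
&lt;p&gt;For reference, measuring CCT is conceptually just timing the create request end to end. Below is a minimal sketch of that idea, assuming the Hub’s &lt;code&gt;/create&lt;/code&gt; endpoint responds only once the container is ready; the payload shape and address are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "bytes"
    "fmt"
    "net/http"
    "time"
)

func main() {
    hub := "http://localhost:8080" // hypothetical Hub address
    spec := []byte(`{"image":"registry.example.com/sandbox:latest","cpu":"0.25","mem":"128"}`) // assumed payload

    start := time.Now()
    resp, err := http.Post(hub+"/create", "application/json", bytes.NewReader(spec))
    if err != nil {
        panic(err)
    }
    resp.Body.Close()

    fmt.Printf("CCT: %.2fs\n", time.Since(start).Seconds())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;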




&lt;h2&gt;
  
  
  Measuring
&lt;/h2&gt;

&lt;p&gt;In order to measure our metrics, we have to implement our &lt;code&gt;Hub&lt;/code&gt;, so we can send HTTP requests for container creation, and create sample infrastructure for each orchestrator option that we chose. After all of that is done, we’ll need a tool for sending batches of requests for container creation and command runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementing Hub
&lt;/h3&gt;

&lt;p&gt;Let’s start with our &lt;code&gt;Hub&lt;/code&gt;. It should be a simple web app; we chose Go for our experiment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;orchestrator&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;kubeclient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;KubeClient&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;Controller&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Orchestrator&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;orchestrator&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/get"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/create"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/delete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DeleteContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/files"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PutFiles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/run"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RunCommand&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Server is running on %v&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;serverPort&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serverHost&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;":"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;serverPort&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The naming of our URIs is pretty self-explanatory - those are the operations we want to perform on our containers. The interesting part is &lt;code&gt;KubeClient&lt;/code&gt; - this is one of our implementations of the &lt;code&gt;Orchestrator&lt;/code&gt; interface. Here’s what it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Orchestrator&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;GetContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containerID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;CreateContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containerSpec&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ContainerSpec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;DeleteContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containerID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;ListContainers&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We never noted that we’d need &lt;code&gt;DeleteContainer&lt;/code&gt; and &lt;code&gt;ListContainers&lt;/code&gt;, but I’m sure everyone saw it coming. Those methods just helped us throughout our research.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementing orchestrators
&lt;/h3&gt;

&lt;p&gt;We’ll only need two implementations of our &lt;code&gt;Orchestrator&lt;/code&gt; interface, despite having three candidates. That’s because &lt;em&gt;EKS + Fargate&lt;/em&gt; and &lt;em&gt;EKS + EC2&lt;/em&gt; do not have major differences, except for the node group selectors, which can be parameterized. So we’ll only have implementations for EKS and ECS.&lt;/p&gt;
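
&lt;p&gt;To illustrate that parameterization, here is a hypothetical helper that picks the node selector depending on whether a pod should land on Fargate or on the EC2-backed worker node group. The label keys and values are placeholders, not the ones from our actual cluster.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// nodeSelectorFor returns the node selector used in the Pod spec shown below.
// Both label sets are hypothetical examples.
func nodeSelectorFor(useFargate bool) map[string]string {
    if useFargate {
        // Pods matching the Fargate profile's selector are scheduled onto Fargate.
        return map[string]string{"profile": "sandbox-fargate"}
    }
    // Otherwise target the EC2-backed worker node group.
    return map[string]string{"role": "worker"}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;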

&lt;p&gt;Neither implementation will easily fit into a code snippet, so I’ll just show you the important details. Basically, container creation for our Kubernetes orchestrator boils down to creating a Pod specification and passing it to the k8s API, like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;KubeClient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;CreateContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containerSpec&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ContainerSpec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;

    &lt;span class="c"&gt;// Making pod specification&lt;/span&gt;
    &lt;span class="n"&gt;pod&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;apiv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pod&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ObjectMeta&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;metav1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ObjectMeta&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;podName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Labels&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;Spec&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;apiv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PodSpec&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c"&gt;// ...&lt;/span&gt;
            &lt;span class="n"&gt;NodeSelector&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;nodeSelector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// Can be for Fargate or EC2&lt;/span&gt;
            &lt;span class="n"&gt;Containers&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;apiv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;"sandbox"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;containerSpec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="c"&gt;// ...&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="c"&gt;// ...&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Creating pod via API&lt;/span&gt;
    &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clientset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;CoreV1&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Pods&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TODO&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metav1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateOptions&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;

    &lt;span class="c"&gt;// ...&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;IP&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PodIP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;UpdatedAt&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;updatedAt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
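
&lt;p&gt;One detail hidden behind the &lt;code&gt;// ...&lt;/code&gt; comments: right after &lt;code&gt;Create&lt;/code&gt; the pod usually has no IP assigned yet. Below is a minimal sketch of waiting for it with a polling loop that slots into the function above; the real implementation could just as well use a watch, so treat this as an illustration of the idea.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    // Sketch: wait until the pod is Running and has an IP (no deadline shown here;
    // a production version would use a context timeout or a watch).
    for {
        p, err := clientset.CoreV1().Pods(k.Namespace).Get(context.TODO(), podName, metav1.GetOptions{})
        if err != nil {
            return nil, err
        }
        if p.Status.Phase == apiv1.PodRunning {
            if p.Status.PodIP != "" {
                pod = p
                break
            }
        }
        time.Sleep(time.Second)
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;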



&lt;p&gt;The ECS implementation looks very similar, except that we need to create a template for our container (a Task Definition) separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ECSClient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;CreateContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containerSpec&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ContainerSpec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// ... &lt;/span&gt;

    &lt;span class="c"&gt;// Creating task definition&lt;/span&gt;
    &lt;span class="n"&gt;taskDefInput&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RegisterTaskDefinitionInput&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// ...&lt;/span&gt;
        &lt;span class="n"&gt;RequiresCompatibilities&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;LaunchType&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c"&gt;// Fargate&lt;/span&gt;
        &lt;span class="n"&gt;ContainerDefinitions&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ContainerDefinition&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c"&gt;// ...&lt;/span&gt;
                &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;podName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containerSpec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;Cpu&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Cpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Memory&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;taskDefResp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ecsSvc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RegisterTaskDefinition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskDefInput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// ...&lt;/span&gt;

    &lt;span class="c"&gt;// Running the task&lt;/span&gt;
    &lt;span class="n"&gt;runTaskInput&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RunTaskInput&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClusterName&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;TaskDefinition&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;taskDefResp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskDefinition&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskDefinitionArn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;LaunchType&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;LaunchType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c"&gt;// ...&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ecsSvc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RunTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runTaskInput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// ...&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;IP&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;UpdatedAt&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
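
&lt;p&gt;Here too, part of the elided code does extra work: &lt;code&gt;RunTask&lt;/code&gt; returns before the task has an IP, so the &lt;code&gt;cnt&lt;/code&gt; value above is filled in by polling the task with &lt;code&gt;DescribeTasks&lt;/code&gt; until it is running. The sketch below shows the idea, assuming awsvpc networking (which Fargate requires) and a single container per task; it is an illustration, not our exact code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    // Sketch: poll the task until it reports RUNNING, then read its private IP.
    // No deadline shown here; a production version would enforce a timeout.
    taskArn := res.Tasks[0].TaskArn
    for {
        out, err := ecsSvc.DescribeTasks(&amp;amp;ecs.DescribeTasksInput{
            Cluster: aws.String(c.ClusterName),
            Tasks:   []*string{taskArn},
        })
        if err != nil {
            return nil, err
        }
        task := out.Tasks[0]
        if aws.StringValue(task.LastStatus) == "RUNNING" {
            // First container, first ENI - enough for a single-container task.
            cnt.ID = aws.StringValue(taskArn)
            cnt.Ip = aws.StringValue(task.Containers[0].NetworkInterfaces[0].PrivateIpv4Address)
            break
        }
        time.Sleep(time.Second)
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;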



&lt;blockquote&gt;
&lt;p&gt;Note: creating a task definition for each container might not be the optimal strategy here, but we ran our experiment both with a predefined task ARN and with a newly created one - it did not affect the speed of container (task) creation in any significant way, so we’re showing the simplest version.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Quick note on infrastructure
&lt;/h3&gt;

&lt;p&gt;Having the &lt;code&gt;Hub&lt;/code&gt; and &lt;code&gt;Orchestrators&lt;/code&gt; implemented, we need to prepare our infrastructure. Unfortunately, this is out of the scope of the current article, so I’ll just list the most significant requirements (that I can remember):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EKS:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EKS cluster&lt;/li&gt;
&lt;li&gt;Manager node group (specs don’t really matter), where &lt;code&gt;Hub&lt;/code&gt; will be deployed&lt;/li&gt;
&lt;li&gt;Worker node group with 13 x m6i.2xlarge EC2 machines&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/fargate-profile.html" rel="noopener noreferrer"&gt;Fargate profile&lt;/a&gt; (~ special node group)&lt;/li&gt;
&lt;li&gt;Service account, bound to our &lt;code&gt;Hub&lt;/code&gt; deployment, so it can create/get/remove pods&lt;/li&gt;
&lt;li&gt;Correctly applied taints on our node groups, so that only &lt;code&gt;containers&lt;/code&gt; can be scheduled on the worker node group (a toleration sketch follows this list)&lt;/li&gt;
&lt;/ul&gt;
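
&lt;p&gt;For the taints item above, here is a hypothetical toleration fragment that would go into the &lt;code&gt;PodSpec&lt;/code&gt; from the &lt;code&gt;CreateContainer&lt;/code&gt; snippet, so that sandbox pods tolerate the worker node group’s taint; the key and value are placeholders, not our real ones.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;            // Hypothetical taint key/value; pairs with a matching taint on the worker node group.
            Tolerations: []apiv1.Toleration{
                {
                    Key:      "dedicated",
                    Operator: apiv1.TolerationOpEqual,
                    Value:    "sandbox",
                    Effect:   apiv1.TaintEffectNoSchedule,
                },
            },
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;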

&lt;p&gt;&lt;strong&gt;ECS:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECS cluster&lt;/li&gt;
&lt;li&gt;Manager service for our &lt;code&gt;Hub&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;VPC configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementing testing tool
&lt;/h3&gt;

&lt;p&gt;With our &lt;code&gt;Hub&lt;/code&gt; deployed and infrastructure ready, we can finally make some requests and create some containers. We wrote a CLI tool that allows us to make batches of requests and measure our metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./sh-dos spawn -h
NAME:
   sh-dos spawn - spawn containers

USAGE:
   sh-dos spawn [command options] [arguments...]

OPTIONS:
   -n value                    Number of containers to spawn (default: "1")
   --interval value, -i value  Sleep interval between spawns in Millisecond (default: "100")
   --cpu value                 CPU requests for each container (default: "0.1")
   --mem value                 Memory requests for each container in Mi (default: "128")
   --fargate                   Use fargate profile in node selector (default: false)
   --image value               Image for user container
   --files value               Files to put inside container
   --scenario value            Scenario to run (default: "basic")
   --help, -h                  show help
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
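
&lt;p&gt;A typical invocation looked something like this (the image below is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./sh-dos spawn -n 50 -i 100 --cpu 0.25 --mem 256 --image registry.example.com/sandbox:latest --fargate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;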



&lt;p&gt;As a result, it shows us the distribution of values for each metric we collected; here we can see container creation time and average command execution time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--------------------------------------------------
Container creation time
--------------------------------------------------
(2)    1.0s   |##
(24)   2.0s   |########################
(24)   3.0s   |########################
(0)    4.0s   |
bins: 4; total: 50; step: 1.00
--------------------------------------------------
--------------------------------------------------
Average time to exec command in container
--------------------------------------------------
(0)    0.1s   |
(50)   0.2s   |##################################################
(0)    0.3s   |
bins: 3; total: 50; step: 0.10
--------------------------------------------------
✅ Success!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;EKS + Fargate&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I can feel everyone’s excited for the results, so here they are! The first contestant is &lt;strong&gt;EKS + Fargate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s see how our most promising candidate did…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4yl2iix28l1wusmh90j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4yl2iix28l1wusmh90j.png" alt="EKS + Fargate results" width="800" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Sad trombone)&lt;/p&gt;

&lt;p&gt;As you can see, the table is not fully filled in, and it does not even have all the metrics. That’s because even the creation of a single container turned out to be too slow for our needs. We were hoping to see something under 10 seconds at least. Here (row 3), a single container with a realistic 1.1 GB image and 0.25 allocated virtual CPU took &lt;strong&gt;160&lt;/strong&gt; seconds to start up. When adding more CPU to the container (row 4), the time dropped to &lt;strong&gt;80&lt;/strong&gt; seconds. Just for the sake of the experiment, we then created 20, 100, and 1000 containers - the mean creation time was still around 80 seconds. Also note the “Node scaling (sec)” column - it’s 0 in all rows because we don’t need to add or remove any resources from our cluster; Fargate just does the job.&lt;/p&gt;

&lt;h3&gt;
  
  
  ECS + Fargate
&lt;/h3&gt;

&lt;p&gt;Trying our luck with Fargate again, next up is &lt;strong&gt;ECS + Fargate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrg5nud2wlcf4fusbq6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrg5nud2wlcf4fusbq6z.png" alt="ECS + Fargate results" width="800" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Somehow, here we see a better picture. I guess AWS’s proprietary infrastructure has some sort of synergy. Node scaling is still 0, since we use Fargate, which is great, but a realistic 1.1 GB image still takes quite some time to create a container from - &lt;strong&gt;40-50&lt;/strong&gt; seconds, depending on allocated CPU. On the other hand, containers from the small 7.5 MB image were created faster, in around &lt;strong&gt;20&lt;/strong&gt; seconds. An interesting observation is that the difference in creation time for same-sized images with different CPU allocations is not as significant as it was with EKS + Fargate (rows 1 and 2, and rows 3 and 4). I’m sure there’s a reason for this, but even the best time for the small image is still too slow for our requirements, so we won’t dig much deeper into this option.&lt;/p&gt;

&lt;h3&gt;
  
  
  EKS + EC2
&lt;/h3&gt;

&lt;p&gt;The last candidate we’ll look at is &lt;strong&gt;EKS + EC2&lt;/strong&gt;. It does not use Fargate; instead, we create our own node groups from EC2 instances, as I mentioned earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bicygdno9g1s1wcxdr4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bicygdno9g1s1wcxdr4.png" alt="EKS + EC2 results" width="800" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that’s better. We can see a significant improvement in our “CCT” metric, and the other metrics also look promising. Even containers from the large 1.1 GB image are created at the desired speed. The results show that regardless of the number of containers, image size, and creation frequency, containers are created in &lt;strong&gt;under 10 seconds&lt;/strong&gt;. &lt;em&gt;Or are they?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you take a closer look, you’ll see that “Node scaling” is not zero anymore - it’s roughly 90 seconds. This means that if we run out of resources on our nodes, the cluster autoscaler will add a new node to our node group, which takes around 90 seconds, so we can’t instantly adapt to spikes in load. The cluster autoscaler in its default configuration is triggered by a single pod being pending; after that it starts adding a node. This means an unlucky user who tries to create a container via our system will just have to wait for at least 90 seconds. We can reduce this number or even get rid of it completely by using “hacks” like cluster &lt;a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler" rel="noopener noreferrer"&gt;overprovisioning&lt;/a&gt;, but the wait time will vary depending on pod creation frequency and the overprovisioning configuration, and there will always be a scenario in which a user has to wait for node scaling. Sounds like a problem for future me.&lt;/p&gt;

&lt;p&gt;Another important note is that you need to pull the image from the registry onto each node first, and only then can you create a container from it. The results in this table do not show that, because the images were already present on our nodes, which is a realistic circumstance for our use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Having conducted our research, we chose EKS + EC2 as the orchestrator for our new service. This was an easy decision for us, since container creation time was just not fast enough with the other options. EKS + EC2 also has other advantages that are most likely specific to our company, one of which is that our infrastructure was already set up for easy monitoring, logging, and deployment of newly added Kubernetes services.&lt;/p&gt;

&lt;p&gt;As always, problems in software engineering and system design require compromise and risk. Thank you for reading this article; I hope I explained my thought process well enough and provided a reasonable amount of data and instructions, so you can use them in your own search for that compromise and risk mitigation while building a similar system.&lt;/p&gt;




&lt;p&gt;P.S.: It would be unfair not to mention my colleagues, the talented engineers with whom I implemented this research as a team - &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/timur-kushukov-b99552206/" rel="noopener noreferrer"&gt;Timur Kushukov&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/anatfulin/" rel="noopener noreferrer"&gt;Adel Natfulin&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>docker</category>
      <category>aws</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
