<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: oo00oo00oo00</title>
    <description>The latest articles on DEV Community by oo00oo00oo00 (@oo00oo00oo00).</description>
    <link>https://dev.to/oo00oo00oo00</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1263368%2F56f7b7f6-2b8c-4e0f-9439-1214bfd8b0d4.jpeg</url>
      <title>DEV Community: oo00oo00oo00</title>
      <link>https://dev.to/oo00oo00oo00</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/oo00oo00oo00"/>
    <language>en</language>
    <item>
      <title>Choosing an orchestrator for multi-tenant code execution system</title>
      <dc:creator>oo00oo00oo00</dc:creator>
      <pubDate>Mon, 12 Feb 2024 18:09:11 +0000</pubDate>
      <link>https://dev.to/oo00oo00oo00/choosing-an-orchestrator-for-multi-tenant-code-execution-system-pmm</link>
      <guid>https://dev.to/oo00oo00oo00/choosing-an-orchestrator-for-multi-tenant-code-execution-system-pmm</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Here at my current company, &lt;a href="https://tripleten.com/" rel="noopener noreferrer"&gt;Tripleten&lt;/a&gt;, we have multiple code execution systems deployed for our students to use as interactive trainers. One of them is designed to deploy student web servers (e.g. Django, Express.js) on our cloud. Recently we decided that this system needed to be refactored and redesigned (you’ll see why later). In this article I will guide you through the process of choosing the underlying &lt;strong&gt;&lt;em&gt;Orchestrator&lt;/em&gt;&lt;/strong&gt; for our new refactored system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problem statement
&lt;/h2&gt;

&lt;p&gt;First, let’s take a quick look at the problem we’re solving. In a nutshell, we need our system to create a new isolated environment on HTTP request. Containers or VMs - it does not matter to us, as long as the user can’t escape it, so let’s just call such an environment a &lt;code&gt;container&lt;/code&gt;. The student’s &lt;strong&gt;web application&lt;/strong&gt; will reside in this container, so we’ll need to put the student’s code inside the container and execute it. The user should also be able to interact with their container for at least an hour - that’s a realistic upper bound on the time a user needs to interact with their application before making any changes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: There are other approaches to solving our problem, e.g. having multiple pre-deployed “workers” with limited capacity for users’ web servers, but our chosen tactic of creating a new container each time makes it easier to comply with our security requirements.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Design overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hb9u9lx9s2a9aamm5k4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hb9u9lx9s2a9aamm5k4.png" alt="Simple design" width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, &lt;code&gt;Hub&lt;/code&gt; is the component of our system that acts as the main gateway (or facade, as you might call it), and &lt;code&gt;Orchestrator&lt;/code&gt; is an interface over &lt;em&gt;the thing that can create a container.&lt;/em&gt; The newly created container has a generated name here - &lt;code&gt;cnt-ff6dg&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We also want to be able to manipulate a container: add files to it, execute commands, start background processes, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey2mmngkev2jdigbkcgv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey2mmngkev2jdigbkcgv.png" alt="Requests flow" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here you can see two incoming requests to the Hub - the first request puts a file into container &lt;code&gt;cnt-ff6dg&lt;/code&gt; and the second one executes the &lt;code&gt;cat&lt;/code&gt; command on this file inside the container. The Orchestrator is not responsible for performing operations on a container, only for creating and getting containers.&lt;/p&gt;
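
&lt;p&gt;To make the flow above a bit more concrete, here is a minimal client-side sketch of those two requests in Go. The &lt;code&gt;/files&lt;/code&gt; and &lt;code&gt;/run&lt;/code&gt; routes come from the Hub shown later in this article, but the JSON payload shapes and the Hub address are assumptions made purely for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "bytes"
    "fmt"
    "net/http"
)

func main() {
    hub := "http://localhost:8080" // hypothetical Hub address

    // 1. Put a file into container cnt-ff6dg (payload shape is assumed).
    putFiles := []byte(`{"container_id":"cnt-ff6dg","files":[{"path":"/tmp/hello.txt","content":"hi"}]}`)
    resp, err := http.Post(hub+"/files", "application/json", bytes.NewReader(putFiles))
    if err != nil {
        panic(err)
    }
    resp.Body.Close()

    // 2. Execute `cat` on that file inside the same container (payload shape is assumed).
    runCmd := []byte(`{"container_id":"cnt-ff6dg","cmd":["cat","/tmp/hello.txt"]}`)
    resp, err = http.Post(hub+"/run", "application/json", bytes.NewReader(runCmd))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("run request status:", resp.Status)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;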

&lt;p&gt;There are quite a few other things we want to do with those containers, but for the sake of simplicity I won’t go into further detail - let’s concentrate on the requirements we have so far.&lt;/p&gt;

&lt;p&gt;At this point you probably think that the orchestrator just looks like a very limited web interface for Docker, and you have a point! So far we only want to create and get containers with it, so let’s see what implementation options are available for our orchestrator.&lt;/p&gt;




&lt;h2&gt;
  
  
  Orchestrator options
&lt;/h2&gt;

&lt;p&gt;As you noticed, it could be just a Docker engine installed on a server. Or perhaps we could use Kubernetes? Here is the list of alternatives we came up with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;Docker Swarm&lt;/li&gt;
&lt;li&gt;AWS Lambda&lt;/li&gt;
&lt;li&gt;AWS ECS&lt;/li&gt;
&lt;li&gt;Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of the options are pretty easy to exclude right away, but let’s review them all. &lt;/p&gt;

&lt;h3&gt;
  
  
  Docker
&lt;/h3&gt;

&lt;p&gt;Conceptually, Docker should work fine: you can create a container with it and get information about a running container. Spin up a &lt;code&gt;Hub&lt;/code&gt; with Docker on the same server and you’re good to go - it can’t get any simpler. But suppose our business does well and the resources of a single server are no longer enough? We could probably assign more CPU and memory to the server to support more containers, but what if this server goes down? We’d lose all the containers at once, and more importantly, we wouldn’t be able to create new ones for some time, which means that our students wouldn’t be able to learn properly.&lt;/p&gt;
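
&lt;p&gt;As a rough illustration of how thin such an orchestrator could be, here is a sketch of “create” and “get” on top of the Docker Engine API using the official Go SDK. This is not our actual implementation - the image and container names are placeholders, pulling the image is omitted, and error handling is reduced to panics.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "context"
    "fmt"

    "github.com/docker/docker/api/types"
    "github.com/docker/docker/api/types/container"
    "github.com/docker/docker/client"
)

func main() {
    ctx := context.Background()
    cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    if err != nil {
        panic(err)
    }

    // "Create a container": image name and container name are placeholders.
    created, err := cli.ContainerCreate(ctx, &amp;amp;container.Config{Image: "python:3.11-slim"}, nil, nil, nil, "cnt-ff6dg")
    if err != nil {
        panic(err)
    }
    if err := cli.ContainerStart(ctx, created.ID, types.ContainerStartOptions{}); err != nil {
        panic(err)
    }

    // "Get a container": inspect it to read back its current state.
    info, err := cli.ContainerInspect(ctx, created.ID)
    if err != nil {
        panic(err)
    }
    fmt.Println(info.Name, info.State.Status)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;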

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qp388uk54qrvytjvnfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qp388uk54qrvytjvnfu.png" alt="Docker" width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This leads us to a distributed design (unfortunately), so we can scale our resources horizontally and keep our &lt;code&gt;Hub&lt;/code&gt; replicas and &lt;code&gt;containers&lt;/code&gt; on different servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker Swarm
&lt;/h3&gt;

&lt;p&gt;For those who are not familiar with Docker Swarm, here’s a note from the &lt;a href="https://docs.docker.com/engine/swarm/" rel="noopener noreferrer"&gt;Docker docs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Swarm mode is an advanced feature for managing a cluster of Docker daemons.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Read a little more for a better understanding, but you can think of it as a distributed Docker that supports multiple &lt;em&gt;nodes&lt;/em&gt; (servers), which means our &lt;code&gt;Hub&lt;/code&gt; replicas can be placed separately from &lt;code&gt;containers&lt;/code&gt;, and we can also add more nodes to our cluster in case we run out of resources.&lt;/p&gt;

&lt;p&gt;This is perfect and fits all the requirements we have so far - we can create and get containers, while having that precious scalability and fault tolerance. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzluedc3hh4jd1xrynjn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzluedc3hh4jd1xrynjn6.png" alt="Docker swarm" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Remember how I said in the intro that we’re actually &lt;strong&gt;refactoring&lt;/strong&gt; the system and not implementing it from scratch? Yes, our current solution uses Docker Swarm. And we probably would have kept it, but we faced some problems that are specific to Docker Swarm. Under sufficient load it proved to be unstable in our hands: after reaching a certain number of actively running containers on multiple nodes, Docker Swarm just stopped working properly. We tried adding new nodes, restarting the service, and looking for memory leaks, but nothing worked except removing all existing containers and rebooting the whole thing, which we would like to do &lt;em&gt;never&lt;/em&gt;. Summarizing the downsides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;History of instability under sufficient load&lt;/li&gt;
&lt;li&gt;There is &lt;a href="https://dockerswarm.rocks/swarm-or-kubernetes/" rel="noopener noreferrer"&gt;no strong support&lt;/a&gt; for Docker Swarm currently&lt;/li&gt;
&lt;li&gt;A lot of maintenance work needed from infrastructure engineers

&lt;ul&gt;
&lt;li&gt;Lack of “managed” solutions - hard to set up a staging environment&lt;/li&gt;
&lt;li&gt;Node scaling is done by hand&lt;/li&gt;
&lt;li&gt;Log collection and monitoring differ from how they’re done in the rest of our infrastructure&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Overall, it’s a good option, but we won’t consider it further.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Lambda
&lt;/h3&gt;

&lt;p&gt;I can imagine some hacky way to make it work, and it would not be trivial, but all that engineering wouldn’t be worth it in our case, because the maximum Lambda invocation timeout is &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-timeout-console" rel="noopener noreferrer"&gt;15 minutes&lt;/a&gt;. We want our students to interact with their containers for a bit longer - at least an hour. There is also a &lt;a href="https://repost.aws/knowledge-center/lambda-concurrency-limit-increase" rel="noopener noreferrer"&gt;concurrency limit&lt;/a&gt; per account per AWS &lt;strong&gt;region&lt;/strong&gt;, which can be increased, but still, I feel this shows that Lambda is not the right instrument for our task.&lt;/p&gt;

&lt;p&gt;On the other hand, we could probably create a new Lambda function for each “create &lt;code&gt;container&lt;/code&gt;” request, but I can’t think of a valid reason to do so. Those Lambda functions would contain code our students submitted, yet they would sit among our production Lambdas. Something doesn’t feel right about it, but I might be wrong. Anyway, we’re skipping this option.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS ECS
&lt;/h3&gt;

&lt;p&gt;From AWS &lt;a href="https://aws.amazon.com/ecs/getting-started/?pg=ln&amp;amp;cp=bn" rel="noopener noreferrer"&gt;landing page&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Amazon Elastic Container Service (ECS) is a fully managed container orchestration service that simplifies your deployment, management, and scaling of containerized applications.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sounds great - it even has the word &lt;code&gt;orchestration&lt;/code&gt; in it! On a more serious note, using ECS can be a good choice for at least a couple of reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It supports creating/getting containers (“tasks” in ECS terminology) across multiple servers, which is what we need.&lt;/li&gt;
&lt;li&gt;Support for &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html" rel="noopener noreferrer"&gt;Fargate&lt;/a&gt;. Fargate allows you to automagically create containers without thinking about scaling and managing servers. You just specify the amount of resources your service needs and Fargate deploys it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The option to use Fargate really caught our attention, since both picking the right number of nodes and configuring an autoscaler are places where things can go wrong, so why not eliminate them (or delegate them to Amazon)?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n9oks8m6fmd16oiie36.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n9oks8m6fmd16oiie36.png" alt="AWS ECS" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The downside is clear - vendor lock-in. Regardless, ECS seems like a valid option; we’ll take a closer look at it later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes (AWS EKS)
&lt;/h3&gt;

&lt;p&gt;Using Kubernetes has looked like the best option on our list since we started digging into the problem, and I’ll explain why. First and foremost, it supports all of our requirements - you can create containers (“pods” in k8s terminology, each of which wraps the actual containers), perform any type of querying on them, label them, and so on. Also, much like Docker Swarm, you can have multiple nodes (servers) and groups of nodes, which solves fault tolerance, and Kubernetes also provides a &lt;code&gt;cluster autoscaler&lt;/code&gt;, which can automatically scale nodes up and down based on configuration - no more tossing tokens around trying to register a new Swarm node. Cherry on top: Kubernetes is open source, and many cloud providers have managed Kubernetes on their product list - all of them share the core k8s API, so no vendor lock-in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbedhhercz2kveif71o1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbedhhercz2kveif71o1b.png" alt="EKS + EC2" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Digging a bit deeper into the EKS documentation, we found that EKS &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/fargate.html" rel="noopener noreferrer"&gt;supports Fargate&lt;/a&gt;. That’s great news: we can potentially have the best of both worlds - the open-source API of k8s and the serverless nature of Fargate! Just in case, EKS also supports “normal” node groups with “normal” virtual servers (a.k.a. EC2 instances), as shown in the diagram above - the “worker node group” consists of “data node 1” and “data node 2”, which are EC2 instances.&lt;/p&gt;

&lt;p&gt;In the diagram below we can see how our architecture would look with a Fargate profile instead of a node group with EC2 instances as nodes - no cluster autoscaler needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9et2bxzqkulsrfck7vw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9et2bxzqkulsrfck7vw.png" alt="EKS + Fargate" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Finalists
&lt;/h3&gt;

&lt;p&gt;We reviewed a handful of orchestrator options; some were more suitable for our requirements than others. Here are the ones worth competing in the MVP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECS + Fargate&lt;/li&gt;
&lt;li&gt;EKS + Fargate&lt;/li&gt;
&lt;li&gt;EKS + EC2&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;So how do we know which one to pick? Besides factors like how familiar our engineers are with a given technology, how well maintained it is, or simply what’s more appealing, we came up with a set of performance-oriented metrics. It is important to us that containers can be created swiftly, so our students don’t get frustrated waiting for their first Node.js server to be deployed. We would also like to measure the latency of commands we run inside a container, given a variable number of currently active containers. Here are our metrics:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetou4t4w329rz39ezqnm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetou4t4w329rz39ezqnm.png" alt="Metrics" width="800" height="52"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“&lt;strong&gt;CCT&lt;/strong&gt;” here is &lt;strong&gt;container creation time&lt;/strong&gt; in seconds; “burst each 100ms” means a new container is created every 100 ms.&lt;/li&gt;
&lt;li&gt;“CCT gradual each 1s” - same thing, but creating a container every second.&lt;/li&gt;
&lt;li&gt;“Avg run (ms)” is the &lt;strong&gt;average time to run&lt;/strong&gt; a simple &lt;code&gt;ls /&lt;/code&gt; command over 10 consecutive runs for each container.&lt;/li&gt;
&lt;li&gt;Each metric is calculated for a given total number of containers (“&lt;strong&gt;# containers&lt;/strong&gt;”). E.g. we calculate CCT where containers are created every 100 ms and the total number of containers created is 1000.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Calculating those metrics should give us a rough understanding of how the core features of our service would behave under different loads.&lt;/p&gt;
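
&lt;p&gt;For reference, measuring CCT is conceptually just timing the create request end to end. Below is a minimal sketch of that idea, assuming the Hub’s &lt;code&gt;/create&lt;/code&gt; endpoint responds only once the container is ready; the payload shape and address are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "bytes"
    "fmt"
    "net/http"
    "time"
)

func main() {
    hub := "http://localhost:8080" // hypothetical Hub address
    spec := []byte(`{"image":"registry.example.com/sandbox:latest","cpu":"0.25","mem":"128"}`) // assumed payload

    start := time.Now()
    resp, err := http.Post(hub+"/create", "application/json", bytes.NewReader(spec))
    if err != nil {
        panic(err)
    }
    resp.Body.Close()

    fmt.Printf("CCT: %.2fs\n", time.Since(start).Seconds())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;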




&lt;h2&gt;
  
  
  Measuring
&lt;/h2&gt;

&lt;p&gt;In order to measure our metrics, we have to implement our &lt;code&gt;Hub&lt;/code&gt;, so we can send HTTP requests for container creation, and create sample infrastructure for each orchestrator option that we chose. After all of that is done, we’ll need a tool for sending batches of requests for container creation and command runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementing Hub
&lt;/h3&gt;

&lt;p&gt;Let’s start with our &lt;code&gt;Hub&lt;/code&gt;. It should be a simple web app; we chose Go for our experiment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;orchestrator&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;kubeclient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;KubeClient&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;Controller&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Orchestrator&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;orchestrator&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/get"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/create"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/delete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DeleteContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/files"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PutFiles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/run"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RunCommand&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Server is running on %v&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;serverPort&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serverHost&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;":"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;serverPort&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The naming of our URIs is pretty self-explanatory - those are the operations we want to perform on our containers. The interesting part is &lt;code&gt;KubeClient&lt;/code&gt; - this is one of our implementations of the &lt;code&gt;Orchestrator&lt;/code&gt; interface. Here’s what it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Orchestrator&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;GetContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containerID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;CreateContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containerSpec&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ContainerSpec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;DeleteContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containerID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;ListContainers&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We never noted that we’d need &lt;code&gt;DeleteContainer&lt;/code&gt; and &lt;code&gt;ListContainers&lt;/code&gt;, but I’m sure everyone saw it coming. Those methods just helped us throughout our research.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementing orchestrators
&lt;/h3&gt;

&lt;p&gt;We’ll only need two implementations of our &lt;code&gt;Orchestrator&lt;/code&gt; interface, despite having three candidates. That’s because &lt;em&gt;EKS + Fargate&lt;/em&gt; and &lt;em&gt;EKS + EC2&lt;/em&gt; do not have major differences, except for the node group selectors, which can be parameterized. So we’ll only have implementations for EKS and ECS.&lt;/p&gt;
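
&lt;p&gt;To illustrate that parameterization, here is a hypothetical helper that picks the node selector depending on whether a pod should land on Fargate or on the EC2-backed worker node group. The label keys and values are placeholders, not the ones from our actual cluster.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// nodeSelectorFor returns the node selector used in the Pod spec shown below.
// Both label sets are hypothetical examples.
func nodeSelectorFor(useFargate bool) map[string]string {
    if useFargate {
        // Pods matching the Fargate profile's selector are scheduled onto Fargate.
        return map[string]string{"profile": "sandbox-fargate"}
    }
    // Otherwise target the EC2-backed worker node group.
    return map[string]string{"role": "worker"}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;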

&lt;p&gt;Neither implementation will easily fit into a code snippet, so I’ll just show you the important details. Basically, container creation for our Kubernetes orchestrator boils down to creating a Pod specification and passing it to the k8s API, like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;KubeClient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;CreateContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containerSpec&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ContainerSpec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;

    &lt;span class="c"&gt;// Making pod specification&lt;/span&gt;
    &lt;span class="n"&gt;pod&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;apiv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pod&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ObjectMeta&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;metav1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ObjectMeta&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;podName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Labels&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;Spec&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;apiv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PodSpec&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c"&gt;// ...&lt;/span&gt;
            &lt;span class="n"&gt;NodeSelector&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;nodeSelector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// Can be for Fargate or EC2&lt;/span&gt;
            &lt;span class="n"&gt;Containers&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;apiv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;"sandbox"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;containerSpec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="c"&gt;// ...&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="c"&gt;// ...&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Creating pod via API&lt;/span&gt;
    &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clientset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;CoreV1&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Pods&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TODO&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metav1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateOptions&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;

    &lt;span class="c"&gt;// ...&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;IP&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PodIP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;UpdatedAt&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;updatedAt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
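
&lt;p&gt;One detail hidden behind the &lt;code&gt;// ...&lt;/code&gt; comments: right after &lt;code&gt;Create&lt;/code&gt; the pod usually has no IP assigned yet. Below is a minimal sketch of waiting for it with a polling loop that slots into the function above; the real implementation could just as well use a watch, so treat this as an illustration of the idea.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    // Sketch: wait until the pod is Running and has an IP (no deadline shown here;
    // a production version would use a context timeout or a watch).
    for {
        p, err := clientset.CoreV1().Pods(k.Namespace).Get(context.TODO(), podName, metav1.GetOptions{})
        if err != nil {
            return nil, err
        }
        if p.Status.Phase == apiv1.PodRunning {
            if p.Status.PodIP != "" {
                pod = p
                break
            }
        }
        time.Sleep(time.Second)
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;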



&lt;p&gt;The ECS implementation looks very similar, except that we need to create a template for our container (a Task Definition) separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ECSClient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;CreateContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containerSpec&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ContainerSpec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// ... &lt;/span&gt;

    &lt;span class="c"&gt;// Creating task definition&lt;/span&gt;
    &lt;span class="n"&gt;taskDefInput&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RegisterTaskDefinitionInput&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// ...&lt;/span&gt;
        &lt;span class="n"&gt;RequiresCompatibilities&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;LaunchType&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c"&gt;// Fargate&lt;/span&gt;
        &lt;span class="n"&gt;ContainerDefinitions&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ContainerDefinition&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c"&gt;// ...&lt;/span&gt;
                &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;podName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containerSpec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;Cpu&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Cpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Memory&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;taskDefResp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ecsSvc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RegisterTaskDefinition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskDefInput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// ...&lt;/span&gt;

    &lt;span class="c"&gt;// Running the task&lt;/span&gt;
    &lt;span class="n"&gt;runTaskInput&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RunTaskInput&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClusterName&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;TaskDefinition&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;taskDefResp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskDefinition&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskDefinitionArn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;LaunchType&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;LaunchType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c"&gt;// ...&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ecsSvc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RunTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runTaskInput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// ...&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;IP&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;UpdatedAt&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
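
&lt;p&gt;Here too, part of the elided code does extra work: &lt;code&gt;RunTask&lt;/code&gt; returns before the task has an IP, so the &lt;code&gt;cnt&lt;/code&gt; value above is filled in by polling the task with &lt;code&gt;DescribeTasks&lt;/code&gt; until it is running. The sketch below shows the idea, assuming awsvpc networking (which Fargate requires) and a single container per task; it is an illustration, not our exact code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    // Sketch: poll the task until it reports RUNNING, then read its private IP.
    // No deadline shown here; a production version would enforce a timeout.
    taskArn := res.Tasks[0].TaskArn
    for {
        out, err := ecsSvc.DescribeTasks(&amp;amp;ecs.DescribeTasksInput{
            Cluster: aws.String(c.ClusterName),
            Tasks:   []*string{taskArn},
        })
        if err != nil {
            return nil, err
        }
        task := out.Tasks[0]
        if aws.StringValue(task.LastStatus) == "RUNNING" {
            // First container, first ENI - enough for a single-container task.
            cnt.ID = aws.StringValue(taskArn)
            cnt.Ip = aws.StringValue(task.Containers[0].NetworkInterfaces[0].PrivateIpv4Address)
            break
        }
        time.Sleep(time.Second)
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;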



&lt;blockquote&gt;
&lt;p&gt;Note: creating a task definition for each container might not be the optimal strategy here, but we ran our experiment both with a predefined task ARN and with a newly created one - it did not affect the speed of container (task) creation in any significant way, so we’re showing the simplest version.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Quick note on infrastructure
&lt;/h3&gt;

&lt;p&gt;Having the &lt;code&gt;Hub&lt;/code&gt; and &lt;code&gt;Orchestrators&lt;/code&gt; implemented, we need to prepare our infrastructure. Unfortunately, this is out of the scope of the current article, so I’ll just list the most significant requirements (that I can remember):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EKS:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EKS cluster&lt;/li&gt;
&lt;li&gt;Manager node group (specs don’t really matter), where &lt;code&gt;Hub&lt;/code&gt; will be deployed&lt;/li&gt;
&lt;li&gt;Worker node group with 13 x m6i.2xlarge EC2 machines&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/fargate-profile.html" rel="noopener noreferrer"&gt;Fargate profile&lt;/a&gt; (~ special node group)&lt;/li&gt;
&lt;li&gt;Service account, bound to our &lt;code&gt;Hub&lt;/code&gt; deployment, so it can create/get/remove pods&lt;/li&gt;
&lt;li&gt;Correctly applied taints on our node groups, so that only &lt;code&gt;containers&lt;/code&gt; can be scheduled on the worker node group (a toleration sketch follows this list)&lt;/li&gt;
&lt;/ul&gt;
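
&lt;p&gt;For the taints item above, here is a hypothetical toleration fragment that would go into the &lt;code&gt;PodSpec&lt;/code&gt; from the &lt;code&gt;CreateContainer&lt;/code&gt; snippet, so that sandbox pods tolerate the worker node group’s taint; the key and value are placeholders, not our real ones.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;            // Hypothetical taint key/value; pairs with a matching taint on the worker node group.
            Tolerations: []apiv1.Toleration{
                {
                    Key:      "dedicated",
                    Operator: apiv1.TolerationOpEqual,
                    Value:    "sandbox",
                    Effect:   apiv1.TaintEffectNoSchedule,
                },
            },
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;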

&lt;p&gt;&lt;strong&gt;ECS:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECS cluster&lt;/li&gt;
&lt;li&gt;Manager service for our &lt;code&gt;Hub&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;VPC configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementing testing tool
&lt;/h3&gt;

&lt;p&gt;With our &lt;code&gt;Hub&lt;/code&gt; deployed and infrastructure ready, we can finally make some requests and create some containers. We wrote a CLI tool that allows us to make batches of requests and measure our metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./sh-dos spawn -h
NAME:
   sh-dos spawn - spawn containers

USAGE:
   sh-dos spawn [command options] [arguments...]

OPTIONS:
   -n value                    Number of containers to spawn (default: "1")
   --interval value, -i value  Sleep interval between spawns in Millisecond (default: "100")
   --cpu value                 CPU requests for each container (default: "0.1")
   --mem value                 Memory requests for each container in Mi (default: "128")
   --fargate                   Use fargate profile in node selector (default: false)
   --image value               Image for user container
   --files value               Files to put inside container
   --scenario value            Scenario to run (default: "basic")
   --help, -h                  show help
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
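
&lt;p&gt;A typical invocation looked something like this (the image below is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./sh-dos spawn -n 50 -i 100 --cpu 0.25 --mem 256 --image registry.example.com/sandbox:latest --fargate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;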



&lt;p&gt;As a result, it shows us the distribution of values for each metric we collected; here we can see container creation time and average command execution time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--------------------------------------------------
Container creation time
--------------------------------------------------
(2)    1.0s   |##
(24)   2.0s   |########################
(24)   3.0s   |########################
(0)    4.0s   |
bins: 4; total: 50; step: 1.00
--------------------------------------------------
--------------------------------------------------
Average time to exec command in container
--------------------------------------------------
(0)    0.1s   |
(50)   0.2s   |##################################################
(0)    0.3s   |
bins: 3; total: 50; step: 0.10
--------------------------------------------------
✅ Success!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;EKS + Fargate&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I can feel everyone’s excited for the results, so here they are! The first contestant is &lt;strong&gt;EKS + Fargate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s see how our most promising candidate did…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4yl2iix28l1wusmh90j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4yl2iix28l1wusmh90j.png" alt="EKS + Fargate results" width="800" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Sad trombone)&lt;/p&gt;

&lt;p&gt;As you can see, the table is not fully filled in, and it does not even have all the metrics. That’s because even the creation of a single container turned out to be too slow for our needs. We were hoping to see something under 10 seconds at least. Here (row 3), a single container with a realistic 1.1 GB image and 0.25 allocated virtual CPU took &lt;strong&gt;160&lt;/strong&gt; seconds to start up. When adding more CPU to the container (row 4), the time dropped to &lt;strong&gt;80&lt;/strong&gt; seconds. Just for the sake of the experiment, we then created 20, 100, and 1000 containers - the mean creation time was still around 80 seconds. Also note the “Node scaling (sec)” column - it’s 0 in all rows because we don’t need to add or remove any resources from our cluster; Fargate just does the job.&lt;/p&gt;

&lt;h3&gt;
  
  
  ECS + Fargate
&lt;/h3&gt;

&lt;p&gt;Trying our luck with Fargate again, next up is &lt;strong&gt;ECS + Fargate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrg5nud2wlcf4fusbq6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrg5nud2wlcf4fusbq6z.png" alt="ECS + Fargate results" width="800" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Somehow, here we see a better picture. I guess AWS’s proprietary infrastructure has some sort of synergy. Node scaling is still 0, since we use Fargate, which is great, but a realistic 1.1 GB image still takes quite some time to create a container from - &lt;strong&gt;40-50&lt;/strong&gt; seconds, depending on allocated CPU. On the other hand, containers from the small 7.5 MB image were created faster, in around &lt;strong&gt;20&lt;/strong&gt; seconds. An interesting observation is that the difference in creation time for same-sized images with different CPU allocations is not as significant as it was with EKS + Fargate (rows 1 and 2, and rows 3 and 4). I’m sure there’s a reason for this, but even the best time for the small image is still too slow for our requirements, so we won’t dig much deeper into this option.&lt;/p&gt;

&lt;h3&gt;
  
  
  EKS + EC2
&lt;/h3&gt;

&lt;p&gt;The last candidate we’ll look at is &lt;strong&gt;EKS + EC2&lt;/strong&gt;. It does not use Fargate; instead, we create our own node groups from EC2 instances, as I mentioned earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bicygdno9g1s1wcxdr4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bicygdno9g1s1wcxdr4.png" alt="EKS + EC2 results" width="800" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that’s better. We can see a significant improvement in our “CCT” metric, and the other metrics also look promising. Even containers from the large 1.1 GB image are created at the desired speed. The results show that regardless of the number of containers, image size, and creation frequency, containers are created in &lt;strong&gt;under 10 seconds&lt;/strong&gt;. &lt;em&gt;Or are they?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you take a closer look, you’ll see that “Node scaling” is not zero anymore - it’s roughly 90 seconds. This means that if we run out of resources on our nodes, the cluster autoscaler will add a new node to our node group, which takes around 90 seconds, so we can’t instantly adapt to spikes in load. The cluster autoscaler in its default configuration is triggered by a single pod being pending; after that it starts adding a node. This means an unlucky user who tries to create a container via our system will just have to wait for at least 90 seconds. We can reduce this number or even get rid of it completely by using “hacks” like cluster &lt;a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler" rel="noopener noreferrer"&gt;overprovisioning&lt;/a&gt;, but the wait time will vary depending on pod creation frequency and the overprovisioning configuration, and there will always be a scenario in which a user has to wait for node scaling. Sounds like a problem for future me.&lt;/p&gt;

&lt;p&gt;Another important note is that you need to pull the image from the registry onto each node first, and only then can you create a container from it. The results in this table do not show that, because the images were already present on our nodes, which is a realistic circumstance for our use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Having conducted our research, we chose EKS + EC2 as the orchestrator for our new service. This was an easy decision for us, since container creation time was just not fast enough with the other options. EKS + EC2 also has other advantages that are most likely specific to our company, one of which is that our infrastructure was already set up for easy monitoring, logging, and deployment of newly added Kubernetes services.&lt;/p&gt;

&lt;p&gt;As always, problems in software engineering and system design require compromise and risk. Thank you for reading this article; I hope I explained my thought process well enough and provided a reasonable amount of data and instructions, so you can use them in your own search for that compromise and risk mitigation while building a similar system.&lt;/p&gt;




&lt;p&gt;P.S.: It would be unfair not to mention my colleagues, the talented engineers with whom I implemented this research as a team - &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/timur-kushukov-b99552206/" rel="noopener noreferrer"&gt;Timur Kushukov&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/anatfulin/" rel="noopener noreferrer"&gt;Adel Natfulin&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>docker</category>
      <category>aws</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
