Topics
- Existing Solutions - to integrate or build custom?
- Technologies Chosen
- High-level Architecture Overview - The Plan
- The Implementation
- Code Part 1 - Getting Started
- Code Part 2 - implementing the core modules
Problem Background
When I started my previous software engineering job search, I had progressed to the final rounds of interviews with a SaaS company in a niche field of pharmaceutical research known as Health Economics and Outcomes Research (HEOR). The interviewer (and soon-to-be-colleague-and-friend) had asked me a very interesting question on how I would solve an architecture problem. Unbeknownst to me, this was a real-world problem they were in the process of trying to solve, but with little-to-no progress.
The interviewer asked:
Let's say you had to execute code provided by a user, and you weren't able to do so on the frontend. How would you do it?
Although he had no way of knowing, this was a problem I had always wanted to solve. Paraphrasing, I stated:
The backbone of the architecture, for security reasons, would be based on containers. I would first evaluate existing containerization solutions to see if they would be a good fit with all of the project requirements, and if not, would implement from scratch. The implementation would consist of a distributed event-based architecture for scalability, using PubSub (or even more robust) patterns around a distributed messaging system, such as Redis, Kafka, or RabbitMQ. The logic flow would be as follows:
- The user makes a request, containing the code to be executed, to the backend API service (in this case Node.js / Express).
- The backend service immediately uploads all user code to AWS S3 or a similar object store, and then relays the path/URL data to the message system - let's say Redis (all encrypted, of course).
- A container outside of the main backend service picks up the task, retrieves the code from the object store, and begins executing it.
- Upon completion, the container relays the result back to Redis so the backend Node.js service can send a websocket message back to the user who made the request.
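The four steps above can be sketched in a few lines of TypeScript. This is only an illustration of the message flow, under loud assumptions: `InMemoryBus` and `InMemoryObjectStore` are hypothetical in-memory stand-ins for Redis and S3, the channel names `tasks` and `results` are made up, the "execution" is a placeholder that only measures the code, and encryption and real websockets are left out entirely.

```typescript
// Hypothetical in-memory stand-ins for the message system (e.g. Redis)
// and the object store (e.g. S3). Nothing here talks to a real service.
type Handler = (payload: string) => void;

class InMemoryBus {
  private handlers = new Map<string, Handler[]>();

  subscribe(channel: string, handler: Handler): void {
    const list = this.handlers.get(channel) ?? [];
    list.push(handler);
    this.handlers.set(channel, list);
  }

  publish(channel: string, payload: string): void {
    for (const handler of this.handlers.get(channel) ?? []) handler(payload);
  }
}

class InMemoryObjectStore {
  private objects = new Map<string, string>();

  put(key: string, body: string): void {
    this.objects.set(key, body);
  }

  get(key: string): string {
    const body = this.objects.get(key);
    if (body === undefined) throw new Error(`missing object: ${key}`);
    return body;
  }
}

interface ExecutionTask { executionId: string; userId: string; codeKey: string }
interface ExecutionResult { executionId: string; userId: string; output: string }

const bus = new InMemoryBus();
const store = new InMemoryObjectStore();
const sentToUser: ExecutionResult[] = []; // stands in for websocket messages

// Steps 1 and 2: the backend stores the code and publishes only a pointer to it.
function submitCode(userId: string, executionId: string, code: string): void {
  const codeKey = `user-code/${userId}/${executionId}`;
  store.put(codeKey, code);
  const task: ExecutionTask = { executionId, userId, codeKey };
  bus.publish("tasks", JSON.stringify(task));
}

// Step 3: a worker (the container) picks up the task and fetches the code.
bus.subscribe("tasks", (payload) => {
  const task = JSON.parse(payload) as ExecutionTask;
  const code = store.get(task.codeKey);
  // Placeholder for sandboxed execution: we only report the code size.
  const output = `received ${code.length} characters of code`;
  const result: ExecutionResult = {
    executionId: task.executionId,
    userId: task.userId,
    output,
  };
  bus.publish("results", JSON.stringify(result));
});

// Step 4: the backend relays the result to the user who made the request.
bus.subscribe("results", (payload) => {
  sentToUser.push(JSON.parse(payload) as ExecutionResult);
});

submitCode("user-1", "exec-1", "print(1 + 1)");
```

The point of the sketch is the decoupling: the backend never runs user code itself. It only hands a storage key to the message system and waits for a result event.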
My answer happened to be in line (tech choices aside) with the proposal from the outsourced engineering firm that this HEOR company had hired, a proposal the firm had failed to deliver after 8 months of work. I was later informed that my answer, along with my experience leading software teams, ultimately got me the job as the Director of Software Engineering at the company. Now - could I succeed where a team of outsourced engineers had failed? I was up for the challenge.
Requirements
So I'm hired, grateful for the opportunity, and ready to architect a solution for one of the scariest problems that engineers intentionally try to prevent from happening every day - Untrusted Remote Code Execution. And although my answer was high-level enough for an interview, there would inevitably be a list of unique requirements that had to be fulfilled in addition to the execution of untrusted code itself. Below is the list in all of its glory, along with my initial reactions:
1. There can be no "spin-up" time between the request to execute user code and the time execution starts.
   Nice, I will use a preload/pooling strategy for containers, and make each one hang right before the line of code that receives user code.
2. Containers cannot be reused for multiple code executions.
   Okay, single-use containers means each container has a specific lifecycle, which I can model as a state machine.
3. Memory and CPU limits must be configurable by a superadmin for each user per organization.
   Alright, so there must be a way to administer isolated groups of pooled containers with different resource configurations.
4. The architecture around memory and CPU resources must be virtually limitless. If one of our big pharma clients (referred to as an org in our platform) is working on early modeling for COVID-19, they should be able to request as many resources as they need.
   Great! I will make sure my architecture runs on a virtually limitless cloud service provider... but it looks like AWS Lambda is off the table due to its 10 GB memory limit.
5. There must be a flexible and configurable scaling mechanism per org or even per user. Even though resources must be virtually limitless, the architecture should not just "always have as many resources as possible available all the time", and it should be able to scale differently per configuration grouping.
   That's doable, and existing container solutions have that feature for the most part.
6. The arch must support a configurable maximum code execution time per org and per user.
   Well, if I choose an existing container solution, this might need some custom code to mark when code execution starts and to work out the time at which a given container should be killed. Not too crazy.
7. During scale-down, current executions of user code should never be killed unless they exceed a per-org configured time limit.
   Well, since there must be a configurable limit per user/org, I'm going to have to wait until the longest configured execution time limit among the executions currently running on the same compute resource (EC2, etc.) has been exceeded before scaling down. More custom code.
8. Memory and CPU must be configurable on the fly. If a request for code execution is made, and that user's configured resource limits differ from what is set on the pool of containers, the resources must be reconfigured at the time of execution.
   Hmmmmm, interesting. So no batched roll-outs for pools/pods of containers when resource limits are changed. Well, Lambda is already a no-go, I don't think Kubernetes can do that, and for ECS, once you define your task, you cannot change the resource limits after the fact (if I tried to preemptively create a pool). Things are now heading in the "build from scratch" direction.
9. The arch should never waste resources if possible. But a user should not have to stay online while their code is executing in order to receive results when they come back online. Also, the user should be able to run multiple executions of the same model from different browser tabs.
   Alright, so I can't assume a user's request for code execution can be cancelled just because their websocket disconnected, and I also can't just use the session of a user, or even the model ID itself, to track the given code execution.
10. If a user requests a code execution from the same browser tab, the previous code execution for that browser tab should automatically be cancelled to avoid wasting resources.
   Similar to #9, I need some strategy to track code executions per browser tab over time in order to cancel previous executions on the same tab.
11. A user should only be able to run a configured number of parallel code executions at one time; assume the number is 2 for now.
   Since this whole arch is going to be distributed, it sounds like I will need to implement a shared data structure that tracks in-flight executions per user, so that users who have run more than 2 executions at the same time have the oldest one automatically cancelled (all while ensuring race conditions don't exist).
12. Realtime execution progress and results responses.
   This will be custom regardless of the core of the architecture. All websockets, keying off of events streaming through the message/event system.
13. Superadmins should be able to monitor memory and CPU usage per container in realtime for every execution that's happening in the platform.
   More events, and more websocket messages. Fun!
14. My own requirement, if possible: build the architecture in a way that allows it to be simulated completely on a local machine with a single command, no cloud provider required when developing.
   I try to take this approach whenever possible so that it's easier to extend, maintain, test, and onboard new developers onto the project. The worst feeling is when you have to make a change in a codebase that you can't guarantee will work correctly in production due to environment issues, even after you've tested.
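To make the single-use container requirement a little more concrete, here is a minimal sketch of what "a lifecycle modeled as a state machine" could look like. It is an illustration only, not the implementation used in the platform: the state names (`warming`, `ready`, `executing`, `finished`, `destroyed`), the transition table, and the `SingleUseContainer` class are all hypothetical.

```typescript
// Hypothetical lifecycle states for a pooled, single-use container.
type ContainerState = "warming" | "ready" | "executing" | "finished" | "destroyed";

// Allowed transitions. There is deliberately no path back to "ready",
// which is what makes a container single-use.
const transitions: Record<ContainerState, readonly ContainerState[]> = {
  warming: ["ready", "destroyed"],
  ready: ["executing", "destroyed"],
  executing: ["finished", "destroyed"],
  finished: ["destroyed"],
  destroyed: [],
};

class SingleUseContainer {
  readonly id: string;
  private state: ContainerState = "warming";

  constructor(id: string) {
    this.id = id;
  }

  get current(): ContainerState {
    return this.state;
  }

  // Moves to the next state, or throws if the transition is not allowed.
  transition(next: ContainerState): void {
    if (transitions[this.state].indexOf(next) === -1) {
      throw new Error(`container ${this.id}: illegal transition ${this.state} -> ${next}`);
    }
    this.state = next;
  }
}

// Returns true only if the container accepted going back to "ready".
function tryReuse(container: SingleUseContainer): boolean {
  try {
    container.transition("ready");
    return true;
  } catch (err) {
    return false;
  }
}

const container = new SingleUseContainer("c-1");
container.transition("ready");     // preloaded, waiting for user code
container.transition("executing"); // user code received
container.transition("finished");  // result produced
const reused = tryReuse(container); // rejected: no way back to "ready"
container.transition("destroyed");
```

Encoding the allowed transitions in a table keeps the "never reuse a container" rule in one place, rather than scattering checks across the code that manages the pool.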
The Domain - Who needs this?
Before jumping into technologies and how I actually built the architecture, it might help to understand who actually needs this kind of solution in the real world. The first SaaS services that come to mind include some kind of code editor: repl.it, HackerRank, LeetCode, etc. These services need to be able to execute user-provided code in a variety of languages, and although their data privacy concerns are significantly less strict than those of pharma analytics, this type of architecture solves the problem of scalability and resource isolation/control. It might not be as obvious, but every cloud provider (Google Cloud, AWS, DigitalOcean) must support a similar architecture (obviously not exactly the same), which shows in their development of tools like AWS Firecracker, Google's gVisor, and others. All of these cloud providers must be able to execute your code on demand and at scale, while allowing highly configurable resources and providing analytics around your code as it executes over time.
In my particular case, our big pharma clients need to be able to create HEOR Models that allow these companies to make better decisions around drugs and medical equipment they are bringing to market. These models can include actual code, written in R Lang (or any language in the future thanks to the architecture), that must be executed during model execution. As a result, our SaaS product must be able to support secure, user-provided remote code execution at scale.
What's next?
I've seen very few posts on remote code execution architectures that go into detail about the when, why, and how of implementing this type of product. If these topics interest you, throw a comment below or share the article on Twitter with my handle @mkralla11 to let me know, and I'll keep this series going!