<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Meadowrun</title>
    <description>The latest articles on DEV Community by Meadowrun (@meadowrun).</description>
    <link>https://dev.to/meadowrun</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F5750%2F0de0fb20-aa1e-45e6-9a40-3b66d7e4b6e5.png</url>
      <title>DEV Community: Meadowrun</title>
      <link>https://dev.to/meadowrun</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/meadowrun"/>
    <language>en</language>
    <item>
      <title>Running pytest in the Cloud for Fun and Profit</title>
      <dc:creator>Hyunho Richard Lee</dc:creator>
      <pubDate>Wed, 07 Sep 2022 14:06:14 +0000</pubDate>
      <link>https://dev.to/meadowrun/running-pytest-in-the-cloud-for-fun-and-profit-ldd</link>
      <guid>https://dev.to/meadowrun/running-pytest-in-the-cloud-for-fun-and-profit-ldd</guid>
      <description>&lt;h4&gt;
  
  
  Introducing pytest-cloudist on Meadowrun
&lt;/h4&gt;

&lt;p&gt;Nobody likes waiting, but it seems to be part of the developer life: we wait for builds, tests, code review, and deployments. This is especially annoying given that computers are more powerful than ever and the cloud promises access to infinite compute resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ghf1BYFh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AUJqhABJcR-7UQ3-CzTUSGg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ghf1BYFh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AUJqhABJcR-7UQ3-CzTUSGg.png" alt="" width="880" height="880"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A python in the clouds. Generated by Stable Diffusion.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this post, we introduce &lt;a href="https://github.com/kurtschelfthout/pytest-cloudist"&gt;pytest-cloudist&lt;/a&gt;, a plugin for the Python testing library &lt;a href="https://pytest.org"&gt;pytest&lt;/a&gt;. pytest-cloudist leverages &lt;a href="https://meadowrun.io"&gt;Meadowrun&lt;/a&gt; to run pytest on any number of cloud machines with a minimal amount of fuss. The name is a riff on the venerable &lt;a href="https://pytest-xdist.readthedocs.io/en/latest/"&gt;pytest-xdist&lt;/a&gt;, which is mainly used for running tests locally in parallel. pytest-xdist does support distributed runs using SSH, but pytest-cloudist develops this capability further by provisioning cloud compute on demand and synchronizing our code and libraries across machines seamlessly.&lt;/p&gt;

&lt;p&gt;We’ll introduce pytest-cloudist through a case study of running the &lt;a href="https://pandas.pydata.org/"&gt;pandas&lt;/a&gt; unit tests on AWS EC2 virtual machines.&lt;/p&gt;
&lt;h3&gt;
  
  
  Running the pandas tests locally
&lt;/h3&gt;

&lt;p&gt;First, we’ll get the pandas tests running locally to establish a baseline. For all of the local runs, we’ll be using a laptop which has an Intel Core i7 with 12 vCPUs and 16 GiB RAM.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://pandas.pydata.org/pandas-docs/stable/development/contributing_environment.html#id5"&gt;development documentation&lt;/a&gt; has a few well-documented options to set up a pandas development environment. We went with a low-tech option, a virtualenv with libraries installed by pip. The &lt;a href="https://github.com/pandas-dev/pandas/blob/main/requirements-dev.txt"&gt;requirements-dev.txt&lt;/a&gt; in the pandas repo is quite large, so we trimmed &lt;a href="https://gist.github.com/kurtschelfthout/d4dbe132771cf96d6b401eddfacfa33a#file-installed-dependencies-txt"&gt;it down to the essentials&lt;/a&gt; which means we’ll skip a few of the integration tests involving other packages like geopandas.&lt;/p&gt;

&lt;p&gt;We’ll also skip tests marked as slow, network, or db, following the logic in &lt;a href="https://github.com/pandas-dev/pandas/blob/main/test_fast.sh#L8"&gt;test_fast.sh&lt;/a&gt;, and we also explicitly disabled a handful of &lt;a href="https://gist.github.com/kurtschelfthout/d4dbe132771cf96d6b401eddfacfa33a#file-disabled-txt"&gt;flaky tests&lt;/a&gt;. This still leaves over 100,000 tests to run.&lt;/p&gt;
&lt;h4&gt;
  
  
  Running vanilla pytest
&lt;/h4&gt;

&lt;p&gt;One last bit of configuration. We used a &lt;a href="https://docs.pytest.org/en/7.1.x/reference/reference.html#pytest.hookspec.pytest_report_teststatus"&gt;pytest hook&lt;/a&gt; to &lt;a href="https://gist.github.com/kurtschelfthout/d4dbe132771cf96d6b401eddfacfa33a#file-conftest-py"&gt;turn off printing dots to the terminal&lt;/a&gt; for each completed test. This was adding a minute to the total runtime and wasn’t providing much value for running more than 100,000 tests. (Don’t get me started on how my laptop runs Fortnite at 60fps but printing dots to the terminal takes ages.)&lt;/p&gt;
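&lt;p&gt;For reference, a minimal version of such a hook might look like the following (a sketch of the idea; the linked gist has the code we actually used):&lt;/p&gt;

```python
# conftest.py -- a sketch of a pytest_report_teststatus hook that
# suppresses the per-test progress dot (the "short letter") for
# passing tests, so pytest prints nothing as each test finishes.
def pytest_report_teststatus(report, config):
    if report.when == "call" and report.passed:
        # (category, short letter printed per test, verbose word)
        return report.outcome, "", "PASSED"
    # Fall through to pytest's default handling for everything else.
```

&lt;p&gt;Dropped into a conftest.py, this keeps the final summary line intact while skipping the per-test output.&lt;/p&gt;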

&lt;p&gt;On my laptop, running pytest on the pandas tests takes about 13.5 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; time pytest --skip-slow --skip-network --skip-db -m "not single_cpu" pandas

============= 145386 passed, 21573 skipped, 1230 xfailed, 1 xpassed, 22 warnings in 810.80s (0:13:30) ==============

real 13m34.698s
user 12m25.510s
sys 0m22.408s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Running in parallel with pytest-xdist
&lt;/h4&gt;

&lt;p&gt;Pandas recommends running pytest-xdist with 4 workers in &lt;a href="https://github.com/pandas-dev/pandas/blob/main/test_fast.sh#L8"&gt;test_fast.sh&lt;/a&gt;, but on my laptop, that actually results in a slowdown because the worker processes run out of memory. I did manage to get good results with 2 worker processes though:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; time pytest --skip-slow --skip-network --skip-db -m "not single_cpu" -n 2 pandas

============= 145386 passed, 21573 skipped, 1230 xfailed, 1 xpassed, 22 warnings in 447.80s (0:07:27) ==============

real 7m32.799s
user 16m18.856s
sys 0m56.269s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nice — almost a 2x speedup.&lt;/p&gt;

&lt;h4&gt;
  
  
  Amdahl’s Law
&lt;/h4&gt;

&lt;p&gt;Why don’t we get the full 2x speedup, i.e. just under 7 minutes? Because of &lt;a href="https://www.wikiwand.com/en/Amdahl%27s_law"&gt;Amdahl’s law&lt;/a&gt; — not all the work is parallelizable.&lt;/p&gt;
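&lt;p&gt;To make that concrete, here’s the textbook formula (an illustration, not code from the original post):&lt;/p&gt;

```python
# Amdahl's law: the speedup from n workers when only a fraction p
# of the total work is parallelizable (the rest must run serially).
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Roughly 2 of the 13.5 minutes are serial (collection plus reporting),
# so p is about 11.5 / 13.5:
print(round(amdahl_speedup(11.5 / 13.5, 2), 2))  # 1.74 -- short of the ideal 2.0
```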

&lt;p&gt;First, pytest collects tests to run on a single thread, which takes over a minute for the pandas tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;time pytest --skip-slow --skip-network --skip-db -m "not single_cpu" --collect-only pandas

=== 168158/169338 tests collected (1180 deselected) in 72.39s (0:01:12) ===

real 1m19.560s
user 0m59.860s
sys 0m7.425s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There’s also time spent aggregating the results as tests complete, which is harder to isolate and measure.&lt;/p&gt;

&lt;p&gt;Back of the envelope, we have 13.5 minutes of work total, of which 1 minute is test collection time, and about another minute is test aggregation and reporting time. That leaves 11.5 minutes of embarrassingly parallel work, i.e. running the tests, which lines up with the roughly 7.5 minute runtime (2 + 11.5 / 2) that we see in practice for two workers.&lt;/p&gt;
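&lt;p&gt;The same estimate as a tiny sketch, using the numbers above:&lt;/p&gt;

```python
# Back-of-envelope from the post: about 13.5 minutes of total work,
# of which roughly 2 minutes (collection plus reporting) cannot be
# parallelized across workers.
SERIAL_MINUTES = 2.0
PARALLEL_MINUTES = 11.5

def expected_runtime(workers):
    # The serial part runs once; the embarrassingly parallel part
    # divides evenly across workers.
    return SERIAL_MINUTES + PARALLEL_MINUTES / workers

print(expected_runtime(2))  # 7.75 -- close to the ~7.5 minutes observed
```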

&lt;p&gt;I’d like to try more workers, but sadly, two is close to what my laptop can handle — I had trouble doing anything else while the test was running as it consumed almost all of my laptop’s memory.&lt;/p&gt;

&lt;p&gt;The current state of the art, then, puts me between a rock and a hard place: I can run tests slowly and sequentially while still using my laptop for something else, or I can run them quickly in parallel but make my laptop unusable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running pandas’ tests in the cloud
&lt;/h3&gt;

&lt;p&gt;With pytest-cloudist, there’s now a third option: I can run the tests in parallel on AWS spot instances. Pytest-cloudist is a fairly thin wrapper around Meadowrun, which does the heavy lifting of creating EC2 instances, starting workers and deploying the environment and local code. In theory this is a seamless experience (Meadowrun maintainer bias warning!).&lt;/p&gt;

&lt;h4&gt;
  
  
  Getting started with pytest-cloudist
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://github.com/kurtschelfthout/pytest-cloudist#installation"&gt;pytest-cloudist installation&lt;/a&gt; is like any other pytest plugin: you install it using pip or poetry alongside pytest, and it’s automagically available. If you haven’t set up Meadowrun in your AWS account before, there’s &lt;a href="https://docs.meadowrun.io/en/stable/tutorial/install/"&gt;an additional step&lt;/a&gt; which we won’t repeat here.&lt;/p&gt;

&lt;p&gt;By default cloud distribution is not enabled; it only kicks in if you pass the &lt;code&gt;--cloudist test&lt;/code&gt; or &lt;code&gt;--cloudist file&lt;/code&gt; arguments to pytest. The former distributes each individual test as a separate task and the latter distributes each test file as a task. Since pandas has over 100,000 tests, making tests into individual tasks introduces way too much overhead, so we’ll use per-file distribution exclusively in this post. There are about 850 test files, so still plenty of parallelization opportunity.&lt;/p&gt;

&lt;p&gt;Further pytest-cloudist options all start with &lt;code&gt;--cd&lt;/code&gt;. There are options to control the number of workers and how much CPU and memory each worker needs. This information is passed straight to Meadowrun which takes care of creating machines of the right size.&lt;/p&gt;

&lt;p&gt;There’s also an option &lt;code&gt;--cd-tasks-per-worker-target&lt;/code&gt; for combining tests or files into bigger tasks to maximize performance. Workers invoke pytest once per task, but there is some overhead for each invocation of pytest. To reduce this overhead, we can ask cloudist to combine multiple files into tasks so each task runs multiple files in one go. For example, if there are 850 test files, a run with &lt;code&gt;--cloudist file --cd-num-workers 4 --cd-tasks-per-worker-target 10&lt;/code&gt; tries to create 40 tasks total (4 workers * 10 tasks/worker). Each task then consists of roughly 21 test files (850 files // 40 tasks).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--cd-tasks-per-worker-target 1&lt;/code&gt; minimizes the pytest invocation overhead, but also introduces a problem: a single slow task could take significantly longer than the other tasks, resulting in a longer overall runtime. We find empirically that a tasks per worker target between 5 and 20 works well.&lt;/p&gt;
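&lt;p&gt;The grouping arithmetic can be sketched like this (an illustration only, not pytest-cloudist’s actual implementation):&lt;/p&gt;

```python
# How a tasks-per-worker target groups test files into tasks:
# more tasks means better load balancing, fewer tasks means less
# per-invocation pytest overhead.
def plan_tasks(num_files, num_workers, tasks_per_worker_target):
    num_tasks = num_workers * tasks_per_worker_target
    files_per_task = num_files // num_tasks
    return num_tasks, files_per_task

print(plan_tasks(850, 4, 10))  # (40, 21): 40 tasks of ~21 files each
```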

&lt;h4&gt;
  
  
  Configuration for the pandas tests
&lt;/h4&gt;

&lt;p&gt;The first issue we ran into when trying to run the pandas tests using pytest-cloudist is that pandas has Cython dependencies which Meadowrun currently doesn’t take care of automatically. To solve this, we compiled locally and then used cloudist’s &lt;code&gt;--cd-extra-files&lt;/code&gt; argument to sync the .so and .pxd files to the remote machines. The &lt;code&gt;--cd-extra-files&lt;/code&gt; argument is similar to pytest-xdist’s &lt;code&gt;--rsyncdir&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The second issue is that some of the tests rely on data files. We can use the same &lt;code&gt;--cd-extra-files&lt;/code&gt; argument to make sure these files are copied to the remote machines.&lt;/p&gt;

&lt;p&gt;Here’s the full command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;time pytest 
--skip-slow --skip-network --skip-db -m "not single_cpu" 
--cloudist file
--cd-extra-files 'pandas/_libs/**/*.pxd'
--cd-extra-files 'pandas/_libs/**/*.so' 
--cd-extra-files 'pandas/io/sas/*.so'
--cd-extra-files 'pandas/tests/**/data/**'
--cd-num-workers 2
--cd-cpu-per-worker 2
--cd-memory-per-worker 6 
--cd-tasks-per-worker-target 20 
pandas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we didn’t have to specify explicitly which python files to make available or what the environment should be. pytest-cloudist, via Meadowrun, figures that out by itself.&lt;/p&gt;

&lt;p&gt;Also, we’ve given each worker two CPUs instead of the default of one as this seems to benefit some of the tests.&lt;/p&gt;

&lt;h4&gt;
  
  
  We have lift off
&lt;/h4&gt;

&lt;p&gt;When starting a run, pytest collects tests as normal, but then pytest-cloudist (or Meadowrun, really) kicks in to create the necessary EC2 instances and synchronize the current Python environment, code, and extra files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mirroring current pip environment
0/2 workers allocated to existing instances: 
The current run_map's id is 6610ce7a-08c6-4995-8a29-59e3c66dac68
Launched 1 new instance(s) (total $0.0411/hr) for the remaining 2 workers:
        ec2-18-219-20-201.us-east-2.compute.amazonaws.com: m5.xlarge (4.0 CPU, 16.0 GB), spot ($0.0411/hr, 2.5% eviction rate), will run 2 workers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, Meadowrun detected that we’re running in a pip virtual environment. Meadowrun recreates virtual environments on the remote machine by building a container. This can take some time, but the resulting container is cached in &lt;a href="https://aws.amazon.com/ecr/"&gt;ECR&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Then, local code and extra files are zipped and uploaded to S3. The file is not re-uploaded if its contents haven’t changed.&lt;/p&gt;

&lt;p&gt;As a final step before running the actual tests, EC2 virtual machines are created or reused from previous jobs if they’re available. Meadowrun keeps machines around for a limited amount of time after they’re idle, to save on startup times. Meadowrun also tries to pack more than one worker on the same machine if it’s cost-effective, so the number of workers is likely greater than the number of machines. Here we’ve asked for 2 workers with 2 vCPUs and 6GiB of memory each, which will all be running on a single m5.xlarge machine. pytest-cloudist uses cheap spot instances by default.&lt;/p&gt;

&lt;p&gt;With a &lt;strong&gt;lukewarm start&lt;/strong&gt; (defined shortly) and two workers, running the tests takes about 11.5 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;==== 145376 passed, 21594 skipped, 1180 deselected, 1225 xfailed, 1 xpassed, 63 warnings in 685.60s (0:11:25) ===

real 11m36.304s
user 1m59.359s
sys 0m6.353s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A lukewarm start means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The container with the virtual environment has been built and cached in ECR which saves about 2 minutes of container building.&lt;/li&gt;
&lt;li&gt;The code has been uploaded to S3. For pandas, the zip file is about 230MB and this takes about 1min 30sec.&lt;/li&gt;
&lt;li&gt;No EC2 instances have been created or warmed up yet. This means an EC2 instance needs to be created and booted, and the instance also needs to pull the cached Docker container image from ECR which takes about 15 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;strong&gt;cold start&lt;/strong&gt; takes about 3 minutes longer than the lukewarm start, although there can be a good amount of variation on creating and launching an EC2 instance.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;warm start&lt;/strong&gt; means that suitable EC2 instances are already running and have the necessary Docker container available locally. Running the tests in this case takes almost 9 minutes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=========================================================== 145376 passed, 21594 skipped, 1180 deselected, 1224 xfailed, 3 xpassed, 63 warnings in 522.77s (0:08:42) ============================================================

real 8m50.467s
user 1m10.620s
sys 0m2.983s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So in the best case of a warm start, pytest-cloudist is about 1min 20 sec slower than pytest-xdist when we use two workers.&lt;/p&gt;

&lt;p&gt;We’re obsessed with &lt;a href="https://medium.com/@meadowrun/why-starting-python-on-a-fresh-ec2-instance-takes-over-a-minute-7a329e79f51"&gt;Meadowrun performance&lt;/a&gt;, and we hope to close this gap over time. The lowest hanging fruit is to &lt;a href="https://github.com/meadowdata/meadowrun/issues/158"&gt;make code upload smarter&lt;/a&gt;. That said, there’s not much we can do about how long it takes for pip to build the virtualenv, or for AWS to launch an EC2 instance.&lt;/p&gt;

&lt;h4&gt;
  
  
  Turn it to eleven
&lt;/h4&gt;

&lt;p&gt;Unlike pytest-xdist, however, we can easily add more workers now. Just by changing &lt;code&gt;--cd-num-workers&lt;/code&gt; we can speed up our tests even more:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oxgTIfuC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/752/1%2ADjJjmxrIubK1HkVMZyoSnw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oxgTIfuC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/752/1%2ADjJjmxrIubK1HkVMZyoSnw.png" alt="" width="752" height="137"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We decreased the tasks per worker target (--cd-tasks-per-worker-target) as we added more workers, because finer-grained tasks come with more overhead. The “Lukewarm” and “Warm” columns give the runtime under those conditions and “Lukewarm-Warm” shows the difference between the two. This difference mostly represents EC2 startup overhead.&lt;/p&gt;

&lt;p&gt;There are clearly diminishing returns as more workers are added, but it’s still cool that we can run all of these tests in about 3 minutes, especially considering that the non-parallelizable portion takes up about 2 of those 3 minutes. On top of that, having my laptop available while I was running these tests in the cloud was great — a much better experience than running them locally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;This post introduced pytest-cloudist, a pytest plugin that distributes tests to EC2 virtual machines using Meadowrun. As a case study, we used it to distribute a subset of the pandas tests. Developing this plugin drove a number of performance and feature enhancements in Meadowrun in the &lt;a href="https://github.com/meadowdata/meadowrun/releases"&gt;recent 0.2 releases&lt;/a&gt;, and has given us ideas for a few more.&lt;/p&gt;

&lt;p&gt;We hope you are inspired to give &lt;a href="https://github.com/kurtschelfthout/pytest-cloudist"&gt;pytest-cloudist&lt;/a&gt; and &lt;a href="https://meadowrun.io"&gt;Meadowrun&lt;/a&gt; a try. Do &lt;a href="https://gitter.im/meadowdata/meadowrun?utm_source=badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=pr-badge&amp;amp;utm_content=badge"&gt;get in touch&lt;/a&gt; for feedback, questions or just to hang out!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;To stay updated, star us on&lt;/em&gt; &lt;a href="https://github.com/meadowdata/meadowrun"&gt;&lt;em&gt;Github&lt;/em&gt;&lt;/a&gt; &lt;em&gt;or follow us on&lt;/em&gt; &lt;a href="https://twitter.com/kurt2001?s=21&amp;amp;t=66yV7Xy4agKRFj4dOaLLew"&gt;&lt;em&gt;Twitter&lt;/em&gt;&lt;/a&gt;&lt;em&gt;!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>pytest</category>
      <category>testing</category>
      <category>python</category>
      <category>aws</category>
    </item>
    <item>
      <title>How to Run Stable Diffusion on EC2</title>
      <dc:creator>Hyunho Richard Lee</dc:creator>
      <pubDate>Fri, 02 Sep 2022 14:36:00 +0000</pubDate>
      <link>https://dev.to/meadowrun/how-to-run-stable-diffusion-on-ec2-g86</link>
      <guid>https://dev.to/meadowrun/how-to-run-stable-diffusion-on-ec2-g86</guid>
      <description>&lt;h4&gt;
  
  
  Use Meadowrun to run the latest text-to-image model on AWS
&lt;/h4&gt;

&lt;p&gt;Stable Diffusion is a new, open text-to-image model from &lt;a href="https://stability.ai/blog/stable-diffusion-announcement"&gt;Stability.ai&lt;/a&gt; which has &lt;a href="https://simonwillison.net/2022/Aug/29/stable-diffusion/"&gt;blown&lt;/a&gt; &lt;a href="https://twitter.com/nabeelqu/status/1562986752885035009?s=20&amp;amp;t=vmXkNyh-cPt8u2nOMr0CFg"&gt;people&lt;/a&gt; &lt;a href="https://news.ycombinator.com/item?id=32555028"&gt;away&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The publicly available &lt;a href="https://beta.dreamstudio.ai/dream"&gt;official tool&lt;/a&gt; gives you 200 image generations for free, and then charges about 1¢ per image generation after that. But because the model is open, you can download the code and the model and run your own version of it. The &lt;a href="https://www.reddit.com/r/StableDiffusion/wiki/guide/"&gt;r/StableDiffusion&lt;/a&gt; subreddit has a good guide to doing this, and the options boil down to using Google Colab which requires a Colab Pro subscription ($9.99/month) to get enough GPUs, or running locally on your laptop which requires a GPU with at least 10GB of VRAM.&lt;/p&gt;

&lt;p&gt;We’ll present a different option here, which is to use Meadowrun to rent a GPU machine from AWS EC2 for just a few minutes at a time. &lt;a href="https://meadowrun.io/"&gt;Meadowrun&lt;/a&gt; is an open source library that makes it easy to run your python code on the cloud. It will take care of launching an EC2 instance, getting our code and libraries onto it, and turning it off when we’re done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Dssaz5wX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AM6NAA1kyyQK4IL6M60o1-g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Dssaz5wX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AM6NAA1kyyQK4IL6M60o1-g.png" alt="" width="880" height="586"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Images generated by Stable Diffusion from “a digital illustration of a steampunk computer floating among clouds, detailed”&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  AWS and Meadowrun Prerequisites
&lt;/h4&gt;

&lt;p&gt;First, we’ll need an AWS account where we’ve increased our quotas for GPU instances (from the default of 0). We’ll also need a local python environment with Meadowrun installed. We covered both of these steps in a previous article on running &lt;a href="https://www.craiyon.com/"&gt;Craiyon&lt;/a&gt; aka DALL·E Mini (not to be confused with OpenAI’s &lt;a href="https://openai.com/dall-e-2/"&gt;DALL·E&lt;/a&gt;) so we’ll link to the &lt;a href="https://dev.to/meadowrun/run-your-own-dalle-mini-craiyon-server-on-ec2-1pjh#prerequisites"&gt;instructions for these steps&lt;/a&gt; from that article rather than repeating them here. We recommend checking this out sooner rather than later, as it seems like AWS has a human in the loop for granting quota increases, and in our experience it can take up to a day or two to get a quota increase granted.&lt;/p&gt;
&lt;h4&gt;
  
  
  Stable Diffusion Prerequisites
&lt;/h4&gt;

&lt;p&gt;Next, we’ll need to go to the &lt;a href="https://huggingface.co/CompVis/stable-diffusion-v-1-4-original"&gt;Stable Diffusion page&lt;/a&gt; on Hugging Face, accept the terms, and download the &lt;a href="https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/blob/main/sd-v1-4.ckpt"&gt;checkpoint file&lt;/a&gt; containing the model weights to our local machine.&lt;/p&gt;

&lt;p&gt;Then, we’ll create an S3 bucket and upload this file to our new bucket so that our EC2 instance can access this file. From the directory where the checkpoint file was downloaded, we’ll run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3 mb s3://meadowrun-sd
aws s3 cp sd-v1-4.ckpt s3://meadowrun-sd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember that S3 bucket names are globally unique, so you’ll need to use a unique bucket name that’s different from what we’re using here (&lt;code&gt;meadowrun-sd&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Finally, we’ll need to grant access to this bucket for the Meadowrun-launched EC2 instances:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meadowrun-manage-ec2 grant-permission-to-s3-bucket meadowrun-sd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Running Stable Diffusion
&lt;/h4&gt;

&lt;p&gt;Now we’re ready to run Stable Diffusion!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;meadowrun&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;folder_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"steampunk_computer"&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"a digital illustration of a steampunk computer floating among clouds, detailed"&lt;/span&gt;

    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;'bash -c &lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
            &lt;span class="s"&gt;'aws s3 sync s3://meadowrun-sd /var/meadowrun/machine_cache --exclude "*" '&lt;/span&gt;
            &lt;span class="s"&gt;'--include sd-v1-4.ckpt '&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'&amp;amp;&amp;amp; python scripts/txt2img.py --prompt "&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;" --plms '&lt;/span&gt;
            &lt;span class="s"&gt;'--ckpt /var/meadowrun/machine_cache/sd-v1-4.ckpt --outdir /tmp/outputs '&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'&amp;amp;&amp;amp; aws s3 sync /tmp/outputs s3://meadowrun-sd/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;folder_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AllocCloudInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"EC2"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;logical_cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_eviction_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;gpu_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"nvidia"&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deployment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;git_repo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"https://github.com/hrichardlee/stable-diffusion"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"meadowrun-compatibility"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;interpreter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CondaEnvironmentYmlFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="s"&gt;"environment.yaml"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;additional_software&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"awscli"&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;environment_variables&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="s"&gt;"TRANSFORMERS_CACHE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"/var/meadowrun/machine_cache/transformers"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s walk through this snippet. The first parameter to &lt;a href="https://docs.meadowrun.io/en/stable/reference/apis/#meadowrun.run_command"&gt;run_command&lt;/a&gt; tells Meadowrun what we want to run on the remote machine. In this case we’re using bash to chain three commands together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we’ll use &lt;code&gt;aws s3 sync&lt;/code&gt; to download the weights from S3. Our command will run in a container, but the &lt;code&gt;/var/meadowrun/machine_cache&lt;/code&gt; folder that we download into &lt;a href="https://docs.meadowrun.io/en/stable/how_to/machine_cache/"&gt;can be used to cache data&lt;/a&gt; for multiple jobs that run on the same instance. &lt;code&gt;aws s3 cp&lt;/code&gt; &lt;a href="https://github.com/aws/aws-cli/issues/2874"&gt;doesn’t have&lt;/a&gt; a &lt;code&gt;--no-overwrite&lt;/code&gt; option, so we use &lt;code&gt;aws s3 sync&lt;/code&gt; to only download the file if we don’t already have it. This isn’t robust to multiple processes running concurrently on the same machine, but in this case we’re only running one command at a time.&lt;/li&gt;
&lt;li&gt;Second, we’ll run the &lt;a href="https://github.com/hrichardlee/stable-diffusion/blob/main/scripts/txt2img.py"&gt;txt2img.py&lt;/a&gt; script which will generate images from the prompt we specify.&lt;/li&gt;
&lt;li&gt;The last part of our command will then upload the outputs of the txt2img.py script into our same S3 bucket.&lt;/li&gt;
&lt;/ul&gt;
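&lt;p&gt;As a sketch of what that chained command looks like (the bucket name, prompt, and file names here are illustrative placeholders, not the exact values from the snippet above):&lt;/p&gt;

```python
# Illustrative sketch of the bash command chain passed to run_command.
# The bucket name, prompt, and file names are hypothetical placeholders.
cache = "/var/meadowrun/machine_cache"
bucket = "s3://my-stable-diffusion-bucket"  # hypothetical bucket

command = " && ".join([
    # 1. download the weights only if they aren't already cached
    f"aws s3 sync {bucket}/weights {cache}",
    # 2. generate images from a text prompt
    f"python scripts/txt2img.py --prompt 'steampunk computer' --ckpt {cache}/sd-v1-4.ckpt",
    # 3. upload the generated images back to the same bucket
    f"aws s3 sync outputs {bucket}/steampunk_computer",
])
```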

&lt;p&gt;The next two parameters tell Meadowrun what kind of instance we need to run our code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AllocCloudInstance("EC2")&lt;/code&gt; tells Meadowrun to provision an EC2 instance.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.meadowrun.io/en/stable/reference/apis/#meadowrun.Resources"&gt;Resources&lt;/a&gt; tells Meadowrun the requirements for the EC2 instance. In this case we’re requiring at least 1 CPU, 8 GB of main memory, and 10GB of GPU memory on an Nvidia GPU. We also set &lt;code&gt;max_eviction_rate&lt;/code&gt; to 80 which means we’re okay with spot instances up to an 80% chance of interruption. The GPU instances we’re using are fairly popular, so if our instance is interrupted or evicted frequently, we might need to switch to an on-demand instance by setting this parameter to 0.&lt;/li&gt;
&lt;/ul&gt;
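&lt;p&gt;The way these requirements narrow down the choice of instance can be pictured as a filter over candidate instance types. This is a toy sketch with made-up candidate data, not Meadowrun’s actual selection logic, which also weighs price:&lt;/p&gt;

```python
# Toy sketch of resource-based instance selection (not Meadowrun's real code).
# The candidate data is illustrative.
candidates = [
    # (name, cpus, memory_gb, gpu_memory_gb, spot_eviction_rate_pct)
    ("t3.large",    2,  8,  0,  5),
    ("g4dn.xlarge", 4, 16, 16, 61),
    ("p3.2xlarge",  8, 61, 16, 90),
]

def acceptable(inst, cpu=1, memory_gb=8, gpu_memory_gb=10, max_eviction_rate=80):
    name, cpus, mem, gpu_mem, eviction = inst
    return (cpus >= cpu and mem >= memory_gb
            and gpu_mem >= gpu_memory_gb and eviction <= max_eviction_rate)

matches = [inst[0] for inst in candidates if acceptable(inst)]
# only g4dn.xlarge meets 1 CPU, 8 GB RAM, 10 GB GPU memory, and 80% max eviction
```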

&lt;p&gt;Finally, &lt;a href="https://docs.meadowrun.io/en/stable/reference/apis/#meadowrun.deployment_spec.Deployment.git_repo"&gt;Deployment.git_repo&lt;/a&gt; specifies our python dependencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first two parameters tell Meadowrun to get the code from the meadowrun-compatibility branch of &lt;a href="https://github.com/hrichardlee/stable-diffusion"&gt;this fork&lt;/a&gt; of the &lt;a href="https://github.com/CompVis/stable-diffusion"&gt;official repo&lt;/a&gt;. We were almost able to use the original repo as-is, but we had to make a &lt;a href="https://github.com/hrichardlee/stable-diffusion/commit/b0cc2e6fa0cd17275bbdee41b974aac0f94b39d3"&gt;small tweak&lt;/a&gt; to the environment.yaml file — Meadowrun doesn’t yet support installing the current code as an editable pip package.&lt;/li&gt;
&lt;li&gt;The third parameter tells Meadowrun to create a conda environment based on the packages specified in the &lt;a href="https://github.com/hrichardlee/stable-diffusion/blob/meadowrun-compatibility/environment.yaml"&gt;environment.yaml&lt;/a&gt; file in the repo.&lt;/li&gt;
&lt;li&gt;We also need to tell Meadowrun to install awscli, which is a non-conda dependency installed via &lt;code&gt;apt&lt;/code&gt;. We’re using the AWS CLI to download and upload files to/from S3.&lt;/li&gt;
&lt;li&gt;The last parameter sets the &lt;code&gt;TRANSFORMERS_CACHE&lt;/code&gt; environment variable. Stable Diffusion uses Hugging Face’s &lt;a href="https://github.com/huggingface/transformers"&gt;transformers&lt;/a&gt; library which downloads model weights. This environment variable points transformers to the &lt;code&gt;/var/meadowrun/machine_cache&lt;/code&gt; folder so that we can reuse this cache across runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To walk through selected parts of the output, first Meadowrun tells us everything we need to know about the instance it started for this job and how much it will cost us (only 16¢ per hour for the spot instance! If we need the on-demand instance it will cost 53¢ per hour).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Launched a new instance for the job: ec2-3-15-146-110.us-east-2.compute.amazonaws.com: g4dn.xlarge (4.0 CPU, 16.0 GB, 1.0 GPU), spot ($0.1578/hr, 61.0% eviction rate), will run 1 workers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, Meadowrun builds a container based on the contents of the environment.yaml file we specified. This takes a while, but Meadowrun caches the image in &lt;a href="https://aws.amazon.com/ecr/"&gt;ECR&lt;/a&gt; for us so this only needs to happen once. Meadowrun also cleans up the image if we don’t use it for a while.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Building python environment in container a07bf5...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we’ll see the output from the txt2img.py script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Global seed set to 42
Loading model from /var/meadowrun/machine_cache/sd-v1-4.ckpt
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script usually takes around 3 minutes and generates 6 images with the default settings. That works out to about 6 images for 1¢ with a spot instance, and about 2 images for 1¢ with an on-demand instance, although we do have to pay some overhead for creating the environment.&lt;/p&gt;
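&lt;p&gt;The arithmetic behind those per-image costs, using the hourly prices quoted above and assuming a 3 minute run that produces 6 images:&lt;/p&gt;

```python
# Back-of-the-envelope cost per image, using the hourly prices quoted above.
spot_dollars_per_hour = 0.1578
on_demand_dollars_per_hour = 0.53
run_minutes = 3
images_per_run = 6

def images_per_cent(dollars_per_hour):
    cents_per_run = dollars_per_hour * 100 * run_minutes / 60
    return images_per_run / cents_per_run

spot_images = images_per_cent(spot_dollars_per_hour)            # about 7.6
on_demand_images = images_per_cent(on_demand_dollars_per_hour)  # about 2.3
```

&lt;p&gt;The raw numbers come out a bit better than the quoted figures; the difference is the environment-creation overhead.&lt;/p&gt;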

&lt;p&gt;Once the last command completes, our images will be available in our S3 bucket! We can view them using an S3 UI like &lt;a href="https://cyberduck.io/"&gt;CyberDuck&lt;/a&gt; or just sync the bucket to our local machine using the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3 sync s3://meadowrun-sd/steampunk_computer steampunk_computer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meadowrun will automatically turn off the machine if we don’t use it for 5 minutes, but if we know we’re done generating images, we can turn it off manually from the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meadowrun-manage-ec2 clean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Closing remarks
&lt;/h4&gt;

&lt;p&gt;Stable Diffusion is remarkable for how good it is, how open it is, and how cheap and easy it is to use. And &lt;a href="https://meadowrun.io"&gt;Meadowrun&lt;/a&gt; makes it even easier!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;To stay updated on Meadowrun, star us on &lt;a href="https://github.com/meadowdata/meadowrun"&gt;GitHub&lt;/a&gt; or follow us on &lt;a href="https://twitter.com/kurt2001"&gt;Twitter&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>stablediffusion</category>
      <category>gpu</category>
      <category>ec2</category>
      <category>python</category>
    </item>
    <item>
      <title>Kubernetes Was Never Designed for Batch Jobs</title>
      <dc:creator>Hyunho Richard Lee</dc:creator>
      <pubDate>Thu, 01 Sep 2022 13:58:29 +0000</pubDate>
      <link>https://dev.to/meadowrun/kubernetes-was-never-designed-for-batch-jobs-92k</link>
      <guid>https://dev.to/meadowrun/kubernetes-was-never-designed-for-batch-jobs-92k</guid>
      <description>&lt;p&gt;In this post we’ll make the case that Kubernetes is philosophically biased towards microservices over batch jobs. This results in an impedance mismatch that makes it harder than it “should” be to use Kubernetes for batch jobs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Batch world vs microservice world
&lt;/h4&gt;

&lt;p&gt;One thing that took me a while to figure out was why so many wildly popular technologies felt so weird to me. It clicked when I realized that I was living in “batch world”, and these technologies were coming out of “microservice world”.&lt;/p&gt;

&lt;p&gt;I’ve mostly worked at hedge funds, and the ones I’ve worked at are part of batch world. This means that code is usually run when triggered by an external event, like a vendor sending the closing prices for US equities for that day or a clearinghouse sending the list of trades that it recorded for the day. A job scheduler like Airflow triggers a job to run depending on which event happened. That job usually triggers other jobs, and the ultimate output of these jobs is something like an API call to an external service that executes trades or a PnL (profit and loss) report that gets emailed to humans.&lt;/p&gt;

&lt;p&gt;In contrast, in microservice world (and take this with a grain of salt, as I’ve only ever touristed there), instead of jobs that run to completion, everything is a long-running service. So instead of a job that ingests closing US equity prices and then kicks off a job that calculates the PnL report, there might be a PnL report service that polls the prices service until the prices it needs are available.&lt;/p&gt;

&lt;h4&gt;
  
  
  But what’s the difference, really?
&lt;/h4&gt;

&lt;p&gt;The difference between batch jobs and services seems easy to define — batch jobs are triggered by some “external” event, do some computation and then exit on their own. Services on the other hand run forever in a loop that accepts requests, does a smaller bit of computation, and then responds to those requests.&lt;/p&gt;

&lt;p&gt;But we could describe batch jobs in a way that makes them sound like services by saying that each invocation of a batch job is a “request”. In that view, the overall pattern of running batch jobs looks like a service with the job scheduler playing the role of the load balancer and each batch job invocation plays the role of handling a request. This pattern of handling each request in its own process is similar to the fairly common “&lt;a href="https://unixism.net/2019/04/linux-applications-performance-part-ii-forking-servers/"&gt;forking server&lt;/a&gt;” pattern.&lt;/p&gt;

&lt;p&gt;And vice versa, consider the thought experiment where we’ve broken up our code into thousands of different microservices but we don’t have enough hardware to run all of these microservices at the same time. So we configure a sophisticated, responsive autoscaler that only starts each microservice for the duration of a single request/reply when a request comes in. At that point, our autoscaler is starting to look more like a job scheduler, and our microservices kind of look like batch jobs!&lt;/p&gt;

&lt;p&gt;Let’s clarify our original intuition — our description of services “running forever” is imprecise. It’s true that most services don’t exit “successfully” on their own, but they might exit due to a catastrophic exception, in response to an autoscaling decision, or as part of a deployment process. So what are we actually getting at when we say that services “run forever”? The most important aspect is that we think of services as “stateless” which can also be stated as “services are easily restartable”. Because each request/reply cycle is usually on the order of milliseconds to seconds and there’s usually minimal state in the server process outside of the request handler, restarting a service shouldn’t lose us more than a few seconds of computation. In contrast, batch jobs are generally “not restartable”, which is to say that if we have a batch job that takes several hours to run and we restart it after one hour, we will need to redo that first hour of work which is usually unacceptable. If we somehow wrote a batch job that created checkpoints of its state every 5 seconds, thereby making it more or less “stateless”, then that batch job would be just as “easily restartable” as a service.&lt;/p&gt;
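&lt;p&gt;The checkpointing idea at the end of that paragraph can be sketched concretely. The storage location and unit of work here are hypothetical:&lt;/p&gt;

```python
import json
import os
import tempfile

# Toy sketch of a checkpointed batch job: progress is saved after every item,
# so restarting the process loses at most one item of work.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "demo_job_checkpoint.json")
if os.path.exists(CHECKPOINT):
    os.remove(CHECKPOINT)  # start fresh for this demo

def load_next_item():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["next_item"]
    return 0  # no checkpoint yet: start from the beginning

def save_next_item(i):
    with open(CHECKPOINT, "w") as f:
        json.dump({"next_item": i}, f)

def run_job(items, process):
    for i in range(load_next_item(), len(items)):
        process(items[i])  # hypothetical unit of work
        save_next_item(i + 1)

processed = []
run_job([10, 20, 30], processed.append)
# a second invocation (simulating a restart) redoes nothing
run_job([10, 20, 30], processed.append)
```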

&lt;p&gt;So the most defensible distinction between services and batch jobs is that services do a small (milliseconds to seconds) amount of work at a time which makes them easy to restart, while batch jobs do a large (minutes to hours) amount of work at a time which makes them harder to restart. Stated this way, the difference between services and batch jobs is fuzzier than it first appears.&lt;/p&gt;

&lt;p&gt;But just because we can come up with hard-to-categorize examples that challenge this binary, it doesn’t mean the difference is meaningless. In practice, services generally run as long-lived processes that respond to requests, do a small bit of work for each request and then respond. Meanwhile batch jobs are usually triggered ad-hoc by a data scientist or from a job scheduler and do a large amount of work at a time which makes them harder to restart.&lt;/p&gt;

&lt;p&gt;Another way to see the difference between services and batch jobs is how they scale. Scaling a service usually means dealing with a large number of concurrent requests by having an autoscaler run replicas of the service with a load balancer sending different requests to different replicas. Scaling a batch job usually means dealing with a large amount of data by running the same code over different chunks of data in parallel. We’ll go into these aspects below in more depth as we walk through how well (or poorly) Kubernetes supports these scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Kubernetes sees the world
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Technology is neither good nor bad; nor is it neutral. &lt;a href="https://en.wikipedia.org/wiki/Melvin_Kranzberg"&gt;Melvin Kranzberg&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We’ll skip the full history and functionality of Kubernetes as there’s plenty of existing coverage (&lt;a href="https://blog.coinbase.com/container-technologies-at-coinbase-d4ae118dcb6c"&gt;this post from Coinbase&lt;/a&gt; gives a good overview). Our focus here is understanding Kubernetes’ philosophy — how does it see the world and what kinds of patterns does it create for users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mTXo_zv_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ABYjUayTLYZUXVEgddclwGA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mTXo_zv_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ABYjUayTLYZUXVEgddclwGA.jpeg" alt="" width="880" height="586"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;It’s all about philosophy. The Thinker in The Gates of Hell at the Musée Rodin via Wikimedia Commons&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Kubernetes is for services
&lt;/h4&gt;

&lt;p&gt;We’ll start with the “&lt;a href="https://kubernetes.io/docs/concepts/overview/#why-you-need-kubernetes-and-what-can-it-do"&gt;what Kubernetes can do&lt;/a&gt;” section on the “Overview” page in the Kubernetes documentation. Most of the key features listed here are focused on services rather than batch jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Service discovery and load balancing” resolves domain names to one or more replicas of a service. This isn’t relevant for batch jobs which usually don’t have a concept of request/response, so there’s no need to resolve domain names to containers, or round-robin requests between different instances of a service.&lt;/li&gt;
&lt;li&gt;“Automated rollouts and rollbacks” makes deployment of services easier by turning off a few instances, restarting them with the new deployment, and then repeating until all of the instances have been updated. This idea doesn’t apply to batch jobs because batch jobs are “hard to restart” and naturally exit on their own, so the right thing to do is to wait until the batch job finishes and then start subsequent jobs with the new deployment rather than losing work to a restart. And we certainly wouldn’t want a rolling deployment to result in a distributed job where different tasks run on different versions of our code!&lt;/li&gt;
&lt;li&gt;“Self-healing”: Restarting jobs that fail is useful, but batch jobs don’t have a concept of a “health check”, and not advertising services to clients until they’re ready isn’t relevant to batch jobs.&lt;/li&gt;
&lt;li&gt;“Automatic bin packing” is only partially relevant for batch jobs — we definitely want to be smart about the initial allocation of where we run a job, but again, batch jobs can’t be restarted willy-nilly, so they can’t be “moved” to a different node.&lt;/li&gt;
&lt;li&gt;“Secret and configuration management” and “Storage orchestration” are equally relevant for services and batch jobs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One theme throughout these features is that Kubernetes assumes that the code it’s running is relatively easy to restart. In other words, it assumes it’s running services.&lt;/p&gt;

&lt;h4&gt;
  
  
  Kubernetes doesn’t believe in orchestration
&lt;/h4&gt;

&lt;p&gt;That same &lt;a href="https://kubernetes.io/docs/concepts/overview/#what-kubernetes-is-not"&gt;Overview page&lt;/a&gt; declares:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Kubernetes is not a mere orchestration system. In fact, it eliminates the need for orchestration. The technical definition of orchestration is execution of a defined workflow: first do A, then B, then C. In contrast, Kubernetes comprises a set of independent, composable control processes that continuously drive the current state towards the provided desired state. It shouldn’t matter how you get from A to C. Centralized control is also not required. This results in a system that is easier to use and more powerful, robust, resilient, and extensible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This paragraph presumably refers to the idea that in Kubernetes you define your configuration declaratively (e.g. make sure there are 3 instances of this service running at all times) rather than imperatively (e.g. check how many instances are running, if there more than 3, kill instances until there are 3 remaining; if there are fewer than 3, start instances until we have 3).&lt;/p&gt;
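&lt;p&gt;A minimal sketch of that declarative style as a reconcile loop (the “start” and “kill” operations are stand-ins for real infrastructure calls):&lt;/p&gt;

```python
# Minimal sketch of declarative reconciliation: drive the current replica
# count toward the desired count, whichever direction that requires.
def reconcile(current_replicas, desired_replicas):
    while current_replicas < desired_replicas:
        current_replicas += 1  # stand-in for starting an instance
    while current_replicas > desired_replicas:
        current_replicas -= 1  # stand-in for killing an instance
    return current_replicas
```

&lt;p&gt;The caller only ever states the desired count; the same loop handles both the “too many” and “too few” cases.&lt;/p&gt;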

&lt;p&gt;Nevertheless, job schedulers like Airflow are orchestration frameworks in the exact way this paragraph describes. And of course we can just run Airflow on top of Kubernetes to work around this bit of philosophy, but Kubernetes intentionally makes it hard to implement this kind of orchestration natively.&lt;/p&gt;

&lt;p&gt;Kubernetes has a “&lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/job/"&gt;job&lt;/a&gt;” concept for running batch jobs, but the aversion to the idea of “first do A, then B” means that the job API will probably never be able to express this core concept. The only hope for expressing job dependencies in Kubernetes would be a declarative model — “in order to run B, A must have run successfully”. But this feature doesn’t exist either, and while a fully declarative model does have its advantages, it’s not currently the dominant paradigm for expressing job dependencies.&lt;/p&gt;
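&lt;p&gt;For concreteness, a declarative “in order to run B, A must have run successfully” model could look like this toy resolver. To be clear, this is not a Kubernetes feature; that is the point of the paragraph above. The job names are illustrative:&lt;/p&gt;

```python
# Toy declarative job-dependency resolver: each job declares which jobs must
# have succeeded before it can run, and the resolver derives a run order.
requires = {
    "ingest_prices": [],
    "calculate_pnl": ["ingest_prices"],
    "email_report": ["calculate_pnl"],
}

def run_order(requires):
    order, done = [], set()
    while len(done) != len(requires):
        progressed = False
        for job, deps in requires.items():
            if job not in done and all(d in done for d in deps):
                order.append(job)
                done.add(job)
                progressed = True
        if not progressed:
            raise ValueError("dependency cycle")
    return order

order = run_order(requires)
```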

&lt;p&gt;Moreover, jobs are clearly secondary to services in Kubernetes. Jobs don’t appear in that Overview page at all, and are mostly ignored throughout the documentation aside from the sections that are explicitly about jobs. One conspicuous example is &lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/"&gt;this page&lt;/a&gt;, which states “Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.” This inexplicably ignores the case where pods for jobs complete naturally.&lt;/p&gt;

&lt;h3&gt;
  
  
  More missing features
&lt;/h3&gt;

&lt;p&gt;Not only are Kubernetes jobs themselves missing features, they also exist in a larger system designed with a philosophy that makes jobs less useful than they could be. For the rest of this article, we’ll examine some of these design details and why they make life harder for batch jobs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pod preemption
&lt;/h4&gt;

&lt;p&gt;Both jobs and services are implemented as “&lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/"&gt;pods&lt;/a&gt;” which are neutral in theory but in practice biased towards services. One example is that pods can be preempted to make room for other pods, which assumes that pods are easily restarted. The documentation acknowledges this &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#non-preempting-priority-class"&gt;isn’t appropriate for batch jobs&lt;/a&gt; but the recommendation it makes is a bit backwards — it suggests that batch jobs should set &lt;code&gt;preemptionPolicy: Never&lt;/code&gt;. This means those pods will never preempt other pods, which only works if all of the pods on the cluster do the same thing. Ideally there would be a way to guarantee that the pod itself would never be preempted even in a cluster that runs both batch jobs and services. There are workarounds like reserving higher priorities for batch jobs or using &lt;a href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/"&gt;pod disruption budgets&lt;/a&gt;, but these aren’t mentioned on that page. This is exactly what we mean by an impedance mismatch — we can ultimately accomplish what we need to, but it takes more effort than it “should”.&lt;/p&gt;
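&lt;p&gt;For reference, the non-preempting priority class the documentation suggests is declared like this (a minimal example of the real PriorityClass API):&lt;/p&gt;

```yaml
# A priority class whose pods never preempt other pods, as the documentation
# suggests for batch jobs. Note that it does not protect these pods from
# being preempted by higher-priority pods, which is the gap described above.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-non-preempting
value: 100000
preemptionPolicy: Never
globalDefault: false
```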

&lt;h4&gt;
  
  
  Composability
&lt;/h4&gt;

&lt;p&gt;Kubernetes only works with containers, and containers themselves are also biased towards services in how they compose. If we have two containers that need to talk to each other, exposing services in one or both of the containers is the only game in town.&lt;/p&gt;

&lt;p&gt;For example, let’s say we want to run some Python code and that code needs to call &lt;a href="https://imagemagick.org/index.php"&gt;ImageMagick&lt;/a&gt; to do some image processing. We want to run the Python code in a container based on the &lt;a href="https://hub.docker.com/_/python"&gt;Python Docker image&lt;/a&gt; and we want to run ImageMagick in a separate container based on &lt;a href="https://hub.docker.com/r/dpokidov/imagemagick"&gt;this ImageMagick Docker image&lt;/a&gt;. Let’s think about our options for calling the ImageMagick container from the Python container, i.e. for composing these two containers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We could use the Python image as a base image and copy parts of the ImageMagick Dockerfile into a new Dockerfile that builds a custom combined image. This is probably the most practical solution, but it has all the usual drawbacks of copy/paste — any improvements to the ImageMagick Dockerfile won’t make it into our Dockerfile without manually updating our Dockerfile.&lt;/li&gt;
&lt;li&gt;We could invoke the ImageMagick container as if it were a command line application. Kubernetes supports &lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/communicate-containers-same-pod-shared-volume/"&gt;sharing files between containers in the same pod&lt;/a&gt; so at least we can send our inputs/outputs back and forth, but there isn’t a great way to invoke a command and get notified when it’s done. Anything is possible, of course (e.g. starting a new job and polling the pods API to see when it completes), but Kubernetes’ philosophical aversion to orchestration is not helping here.&lt;/li&gt;
&lt;li&gt;We could modify the ImageMagick image to expose a service. This seems silly, but is effectively what happens most of the time — instead of building command-line tools like ImageMagick, people end up building services to sidestep this problem.&lt;/li&gt;
&lt;/ul&gt;
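&lt;p&gt;The “start a new job and poll the pods API” workaround from the second bullet can be sketched like this. The &lt;code&gt;get_status&lt;/code&gt; callable stands in for a real call to the Kubernetes API (e.g. reading the job’s status); here we fake it so the sketch is self-contained:&lt;/p&gt;

```python
import time

# Sketch of the "start a job, then poll until it completes" workaround.
# get_status stands in for reading the job's status from the Kubernetes API;
# here it is just a callable so the sketch is self-contained.
def wait_for_job(get_status, poll_interval=0.01, max_polls=100):
    for _ in range(max_polls):
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        time.sleep(poll_interval)  # polling latency is the cost of this approach
    raise TimeoutError("job did not finish in time")

# fake status source: reports "Running" twice, then "Succeeded"
statuses = iter(["Running", "Running", "Succeeded"])
result = wait_for_job(lambda: next(statuses))
```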

&lt;p&gt;In fact in Docker, “&lt;a href="https://docs.docker.com/compose/"&gt;Compose&lt;/a&gt;” means combining services:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Even if we acquiesce to the idea of composing via services, Kubernetes doesn’t give us great options for running a service in the context where a batch job is the consumer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The simplest option is to run one or more copies of the ImageMagick service constantly, but that’s a waste of resources when we’re not running anything that needs the service, and it will be overwhelmed when we launch a distributed job running thousands of concurrent tasks.&lt;/li&gt;
&lt;li&gt;So we might reach for the &lt;a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/"&gt;HorizontalPodAutoscaler&lt;/a&gt; to spin up instances when we need them and turn them off when we don’t. This works, but to get the responsiveness of calling something on the command line, we’ll need to make the autoscaler’s “sync-period” much shorter than the default 15 seconds.&lt;/li&gt;
&lt;li&gt;Another option is “&lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/#workload-resources-for-managing-pods"&gt;sidecar containers&lt;/a&gt;” where we run the Python image and the ImageMagick service image in the same pod. This mostly works, but &lt;a href="https://stackoverflow.com/questions/38600622/sidecar-containers-in-kubernetes-pods"&gt;there’s no way to automatically kill&lt;/a&gt; the service’s sidecar container when the “main” batch job container is done. The proposed feature to allow this was &lt;a href="https://github.com/kubernetes/enhancements/issues/753#issuecomment-713471597"&gt;ultimately rejected&lt;/a&gt; because it “is not an incremental step in the right direction”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Exploring these options for how batch jobs can call other containers shows that we can make it work, but Kubernetes makes it harder than it “should” be.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ad-hoc jobs
&lt;/h4&gt;

&lt;p&gt;One aspect of batch jobs is that we often run them ad-hoc for research, development, or troubleshooting. For example, we edit some code or tweak some data, and then rerun our linear regression to see if we get better results. For these ad-hoc jobs there would ideally be some way to take e.g. a CSV file that we’re working with locally, and “upload” it to a volume that we could read from a pod. This scenario isn’t supported by Kubernetes, though, so we have to figure out other ways to get data into our pod.&lt;/p&gt;

&lt;p&gt;One option would be to set up an NFS (Network File System) that’s accessible from outside of the cluster and &lt;a href="https://kubernetes.io/docs/concepts/storage/volumes/#nfs"&gt;expose it&lt;/a&gt; to our pods in the cluster. The other option is, as usual, a service of some sort. We could use a queue like &lt;a href="https://www.rabbitmq.com/"&gt;RabbitMQ&lt;/a&gt; that will temporarily store our data and make it available to our pod.&lt;/p&gt;

&lt;p&gt;But with a service, we now have another problem of accessing this service from outside of the cluster. One way to solve this problem is to make our dev machine part of the Kubernetes cluster. Then we can configure a simple &lt;a href="https://kubernetes.io/docs/concepts/services-networking/service/"&gt;Service&lt;/a&gt; for RabbitMQ which will be accessible from inside the cluster. If that’s not an option, though, we’ll need to explore &lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/"&gt;Accessing Applications in a Cluster&lt;/a&gt;. The options boil down to using &lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/"&gt;port forwarding&lt;/a&gt;, which is a debugging tool not recommended for “real” use cases, or setting up an &lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/"&gt;Ingress Controller&lt;/a&gt; which is a formalized way of exposing services running in a Kubernetes cluster that is overkill for a single client accessing a single instance of a service from an internal network. None of these options are ideal.&lt;/p&gt;

&lt;p&gt;Kubernetes doesn’t make this easy because when we’re running services, we either don’t want anything outside of the cluster interacting with our service, or we want a relatively heavy-duty load balancer in front of our service to make sure our service doesn’t go down under load from the outside world. The missing feature here is some way for the client that is launching a job/pod to upload a file that the pod can access. This scenario is relatively common when running batch jobs and uncommon when running services, so it doesn’t exist in Kubernetes and we have to use these workarounds.&lt;/p&gt;

&lt;h4&gt;
  
  
  Distributed jobs that kind of work
&lt;/h4&gt;

&lt;p&gt;Another aspect of batch jobs is that we’ll often want to run distributed computations where we split our data into chunks and run a function on each chunk. One &lt;a href="https://news.ycombinator.com/item?id=27915761"&gt;popular option&lt;/a&gt; is to run Spark, which is built for exactly this use case, on top of Kubernetes. And there &lt;a href="https://volcano.sh/en/"&gt;are&lt;/a&gt; &lt;a href="https://github.com/G-Research/armada/"&gt;other&lt;/a&gt; &lt;a href="https://github.com/kubernetes-sigs/kube-batch"&gt;options&lt;/a&gt; for additional software to make running distributed computations on Kubernetes easier.&lt;/p&gt;

&lt;p&gt;The Kubernetes documentation, however, doesn’t cede this ground to third-party frameworks, and instead gives several &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-patterns"&gt;options&lt;/a&gt; for running distributed computations directly on Kubernetes. But none of these options are compelling, especially in contrast to how well-designed Kubernetes is for service workloads.&lt;/p&gt;

&lt;p&gt;The simplest option is to just &lt;a href="https://kubernetes.io/docs/tasks/job/parallel-processing-expansion/"&gt;create a single Job object per task&lt;/a&gt;. As the documentation points out, this won’t work well with a large number of tasks. &lt;a href="https://github.com/kubernetes/kubernetes/issues/95492"&gt;One user’s experience&lt;/a&gt; is that it’s hard to go beyond a few thousand jobs total. It seems like &lt;a href="https://kubernetes.io/docs/tasks/job/indexed-parallel-processing-static/"&gt;the best way around that&lt;/a&gt; is to use Indexed Jobs, which is a relatively new feature for running multiple copies of the same job where the different copies of the job have different values for the &lt;code&gt;JOB_COMPLETION_INDEX&lt;/code&gt; environment variable. This gives us the most fundamental layer for running distributed jobs. As long as each task just needs an index number and doesn’t need to “send back” any outputs, this works. E.g. if all of the tasks are working on a single file and the tasks “know” that they need to process &lt;code&gt;n&lt;/code&gt; rows that come after skipping the first &lt;code&gt;JOB_COMPLETION_INDEX * n&lt;/code&gt; rows, and then write their output to a database, this works great.&lt;/p&gt;
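&lt;p&gt;That &lt;code&gt;JOB_COMPLETION_INDEX&lt;/code&gt; chunking scheme is simple enough to sketch directly; the chunk size and row count here are illustrative:&lt;/p&gt;

```python
import os

# Sketch of the Indexed Job pattern: each worker derives its slice of the
# input purely from JOB_COMPLETION_INDEX. The chunk size is illustrative.
ROWS_PER_TASK = 100

def my_row_range(total_rows):
    index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
    start = index * ROWS_PER_TASK  # skip the rows owned by earlier workers
    return start, min(start + ROWS_PER_TASK, total_rows)

os.environ["JOB_COMPLETION_INDEX"] = "3"  # simulate the fourth worker
first, last = my_row_range(1000)  # this worker processes rows 300 to 400
```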

&lt;p&gt;But in some cases, we’ll want to run tasks that e.g. need a filename to know where their input data is, and it could be convenient to send back results directly to the process that launched the distributed job for post-processing. In other words we might need to send more data back and forth from the tasks beyond a single number. For that, the documentation offers &lt;a href="https://kubernetes.io/docs/tasks/job/coarse-parallel-processing-work-queue/"&gt;two&lt;/a&gt; &lt;a href="https://kubernetes.io/docs/tasks/job/fine-parallel-processing-work-queue/"&gt;variations&lt;/a&gt; of using a message queue service that you start in your Kubernetes cluster. The main difficulty with this approach is that we have the same problem as before of accessing services inside of the Kubernetes cluster from outside of it so that we can add messages to the message queue service. The documentation suggests &lt;a href="https://kubernetes.io/docs/tasks/job/coarse-parallel-processing-work-queue/#testing-the-message-queue-service"&gt;creating a temporary interactive Pod&lt;/a&gt; but that only really makes sense for testing. We have the same options as before — make sure everything including our dev machines run inside the cluster, use port forwarding, or create an Ingress.&lt;/p&gt;

&lt;h4&gt;
  
  
  Distributed groupby
&lt;/h4&gt;

&lt;p&gt;An additional wrinkle with distributed computations is distributed “groupbys”. Distributed groupbys are necessary when our dataset is “chunked” or “grouped” by one column (e.g. date) but we want to group by a different column (e.g. zip code) before applying a computation. This requires re-grouping our original chunks, which is implemented as a “&lt;a href="https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#shuffle-operations"&gt;shuffle&lt;/a&gt;” (also known as a map-reduce). Worker-to-worker communication is central to shuffles. Each per-date worker gets a chunk of data for a particular date, and then sends rows for each zip code to a set of per-zip code downstream workers. The per-zip code workers receive the rows for “their” zip code from each upstream per-date worker.&lt;/p&gt;
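&lt;p&gt;The per-date to per-zip-code re-grouping can be sketched as a hash-partitioned shuffle over made-up data:&lt;/p&gt;

```python
from collections import defaultdict

# Toy shuffle: data arrives chunked by date, and we re-group it by zip code.
# Each per-date "worker" routes every row to the downstream worker that owns
# that row's zip code, chosen here by hashing the zip code.
chunks_by_date = {
    "2022-09-01": [("10001", 5), ("94105", 3)],
    "2022-09-02": [("10001", 2), ("60601", 7)],
}
NUM_DOWNSTREAM_WORKERS = 2

received = defaultdict(list)  # downstream worker id -> rows received
for date, rows in chunks_by_date.items():  # the "map" side
    for zip_code, value in rows:
        worker = hash(zip_code) % NUM_DOWNSTREAM_WORKERS
        received[worker].append((zip_code, value))

# the "reduce" side: each downstream worker aggregates the rows it received;
# all rows for a given zip code landed on the same worker by construction
totals_by_zip = defaultdict(int)
for rows in received.values():
    for zip_code, value in rows:
        totals_by_zip[zip_code] += value
```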

&lt;p&gt;To implement this worker-to-worker communication, Kubernetes could implement some version of &lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/communicate-containers-same-pod-shared-volume/"&gt;shared volumes&lt;/a&gt; that would allow us to expose the results from one pod to a downstream pod that needs those outputs. Again, this functionality would be really useful in “batch world”, but the use case doesn’t exist in “service world”, so this functionality doesn’t exist. Instead we need to write our own service for worker-to-worker communication.&lt;/p&gt;

&lt;p&gt;This is what, for example, Spark does, but it runs into an impedance mismatch as well. Spark’s shuffle implementation makes extensive use of local disk in order to deal with data that doesn’t fit in memory. But Kubernetes makes it hard to get the full performance of your local disks, because it only allows for disk caching at the container level. This means the kernel’s disk caching isn’t used, and Spark’s shuffle performance &lt;a href="https://news.ycombinator.com/item?id=23153981"&gt;suffers&lt;/a&gt; on Kubernetes. A more native way to share files across pods would enable a faster implementation.&lt;/p&gt;

&lt;h4&gt;
  
  
  Caching data
&lt;/h4&gt;

&lt;p&gt;Another aspect of distributed computations that we’ll talk about is caching data. Most distributed computation frameworks have the concept of a distributed dataset that’s cached on the workers so that it can be reused later. For example, Spark has &lt;a href="https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html"&gt;RDDs&lt;/a&gt; (resilient distributed datasets). We can cache an RDD in Spark, which means that each Spark worker will store one or more chunks of the RDD. For any subsequent computations on that RDD, the worker that has a particular chunk will run the computation for that chunk. This general idea of “sending code to the data” is a crucial aspect of implementing an efficient distributed computation system.&lt;/p&gt;
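&lt;p&gt;A toy sketch of what “sending code to the data” means in scheduling terms (the worker names and in-memory “cluster” here are invented for illustration; a real scheduler like Spark’s tracks chunk locations across machines):&lt;/p&gt;

```python
# Simulated cluster state: which worker holds which cached chunk in memory.
# (Invented names for illustration only.)
cached_chunks = {
    "worker-a": {"chunk-0": [1, 2, 3]},
    "worker-b": {"chunk-1": [4, 5, 6]},
}

def run_on_chunk(chunk_id, fn):
    # Route the *function* to the worker that already caches the chunk,
    # instead of moving the chunk's data to the caller.
    for worker, chunks in cached_chunks.items():
        if chunk_id in chunks:
            return worker, fn(chunks[chunk_id])
    raise KeyError(f"no worker caches {chunk_id}")

print(run_on_chunk("chunk-1", sum))  # -> ('worker-b', 15)
```

The point of the sketch is that the scheduler needs to know where each chunk lives; Kubernetes’ pod-assignment machinery has no way to express that constraint.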

&lt;p&gt;Kubernetes in general is a bit unfriendly to the idea of storing data on the node, &lt;a href="https://kubernetes.io/docs/concepts/storage/volumes/#hostpath"&gt;recommending&lt;/a&gt; that “it is a best practice to avoid the use of HostPaths when possible”. And even though there are extensive capabilities for assigning pods to nodes via &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity"&gt;affinity, anti-affinity&lt;/a&gt;, &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/"&gt;taints, tolerations&lt;/a&gt;, and &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/"&gt;topology spread constraints&lt;/a&gt;, none of these work with the concept of “what cached data is available on a particular node”.&lt;/p&gt;

&lt;p&gt;Overall, Kubernetes has only half-hearted support for natively running distributed computations. Anything beyond the simplest scenario requires either implementing workarounds or layering on another platform like Spark.&lt;/p&gt;

&lt;h3&gt;
  
  
  The future of batch jobs on Kubernetes
&lt;/h3&gt;

&lt;p&gt;The point of this article is not to suggest that Kubernetes is poorly thought out or poorly implemented. Arguably, &lt;a href="https://news.ycombinator.com/item?id=23359443"&gt;if you have a monolithic service&lt;/a&gt;, Kubernetes is overkill, and there’s at least one person predicting it will be &lt;a href="https://news.ycombinator.com/item?id=22402315"&gt;gone in 5 years&lt;/a&gt;, but it seems like a great choice for e.g. &lt;a href="https://news.ycombinator.com/item?id=26272144"&gt;this person running 120 microservices&lt;/a&gt;. And it makes deploying your web app with a &lt;a href="https://news.ycombinator.com/item?id=27910150"&gt;database, DNS records, and SSL certificates&lt;/a&gt; easy. What we’re saying here is that Kubernetes, like all technologies, takes a point of view, and that point of view isn’t particularly friendly to batch jobs.&lt;/p&gt;

&lt;p&gt;Instead of the currently half-hearted support for batch jobs, one option we’d love to see is Kubernetes making its stance more explicit and declaring that Kubernetes is designed primarily for services. This could open up space for other platforms more specifically designed for the batch job use case. Alternatively, as Kubernetes is already a “&lt;a href="https://twitter.com/kelseyhightower/status/935252923721793536"&gt;platform for building platforms&lt;/a&gt;” in some ways, we could see something like Spark-on-Kubernetes become better supported. It seems unlikely that Kubernetes would adopt a significantly different overall philosophy where batch jobs are a first-class use case.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Star us on &lt;a href="https://github.com/meadowdata/meadowrun"&gt;Github&lt;/a&gt; or follow us on &lt;a href="https://twitter.com/kurt2001"&gt;Twitter&lt;/a&gt;! We’re working on&lt;/em&gt; &lt;a href="https://meadowrun.io/"&gt;&lt;em&gt;Meadowrun&lt;/em&gt;&lt;/a&gt; &lt;em&gt;which is an open source library that solves some of these problems and makes it easier to run your Python code on the cloud,&lt;/em&gt; &lt;a href="https://docs.meadowrun.io/en/stable/how_to/kubernetes/"&gt;&lt;em&gt;with&lt;/em&gt;&lt;/a&gt; &lt;em&gt;or without Kubernetes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>mlops</category>
      <category>dataengineering</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>How to Deploy ML Models Using Gravity AI and Meadowrun</title>
      <dc:creator>Hyunho Richard Lee</dc:creator>
      <pubDate>Wed, 17 Aug 2022 15:58:00 +0000</pubDate>
      <link>https://dev.to/meadowrun/how-to-deploy-ml-models-using-gravity-ai-and-meadowrun-1mjn</link>
      <guid>https://dev.to/meadowrun/how-to-deploy-ml-models-using-gravity-ai-and-meadowrun-1mjn</guid>
      <description>&lt;h4&gt;
  
  
  Transform your containerized models-as-services into batch jobs
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.gravity-ai.com/" rel="noopener noreferrer"&gt;GravityAI&lt;/a&gt; is a marketplace for ML models where data scientists can &lt;a href="https://www.gravity-ai.com/pages/Partners" rel="noopener noreferrer"&gt;publish&lt;/a&gt; their models as &lt;a href="https://www.gravity-ai.com/blogs/Automatically_Conatainerize_a_Model" rel="noopener noreferrer"&gt;containers&lt;/a&gt;, and consumers can subscribe to access those models. In this article, we’ll talk about different options for deploying these containers and then walk through using them for batch jobs via &lt;a href="https://meadowrun.io/" rel="noopener noreferrer"&gt;Meadowrun&lt;/a&gt;, which is an open-source library for running your Python code and containers in the cloud.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AzyiOfS3rBDj2MEkG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AzyiOfS3rBDj2MEkG" alt="Runner on a starting block"&gt;&lt;/a&gt;&lt;br&gt;&lt;em&gt;A library inside of a container. Photo by Manuel Palmeira on Unsplash&lt;/em&gt;
  &lt;/p&gt;

&lt;h4&gt;
  
  
  Containers vs libraries
&lt;/h4&gt;

&lt;p&gt;This section lays out some motivation for this post — if you’re interested in just getting things working, please go to the next section!&lt;/p&gt;

&lt;p&gt;If you think of an ML model as a library, then it might seem more natural to publish it as a package, either on &lt;a href="https://pypi.org/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; for use with pip or &lt;a href="https://anaconda.org/meadowdata/dashboard" rel="noopener noreferrer"&gt;Anaconda.org&lt;/a&gt; for use with conda, rather than a container. Hugging Face’s &lt;a href="https://huggingface.co/docs/transformers/quicktour" rel="noopener noreferrer"&gt;transformers&lt;/a&gt; is a good example — you run &lt;code&gt;pip install transformers&lt;/code&gt;, then your Python interpreter can do things like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;
&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment-analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;We are very happy to show you the 🤗 Transformers library.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a few disadvantages with this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  For pip specifically (not conda), there’s no way for packages to express non-Python dependencies. With the transformers library, you’ll usually depend on something like CUDA for GPU support, but users need to know to install that separately.&lt;/li&gt;
&lt;li&gt;  The transformers library also needs to download gigabytes of model weights in order to do anything useful. These get &lt;a href="https://huggingface.co/docs/datasets/cache" rel="noopener noreferrer"&gt;cached in a local directory&lt;/a&gt; so they don’t need to be downloaded every time you run, but for some contexts you might want to make these model weights a fixed part of the deployment.&lt;/li&gt;
&lt;li&gt;  Finally, when publishing a pip or conda package, users expect you to specify your dependencies relatively flexibly. E.g. the transformers package specifies “tensorflow&amp;gt;=2.3”, which declares that the package works with any version of tensorflow 2.3 or higher. This means that if the tensorflow maintainers introduce a bug or break backwards compatibility, they can effectively cause your package to stop working (welcome to &lt;a href="https://en.wikipedia.org/wiki/Dependency_hell" rel="noopener noreferrer"&gt;dependency hell&lt;/a&gt;). Flexible dependencies are useful for libraries, because they mean more people can install your package into their environments. But for deployments, e.g. if you’re deploying a model within your company, you’re not taking advantage of that flexibility; you’d rather have the certainty that it will work every time, and a reproducible deployment.&lt;/li&gt;
&lt;/ul&gt;
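&lt;p&gt;To make the library-vs-deployment distinction concrete, here’s a sketch of the difference in dependency specifications (the version numbers are illustrative):&lt;/p&gt;

```
# a *library*'s install_requires: flexible, so it composes with other packages
tensorflow>=2.3

# a *deployment*'s requirements.txt: pinned, so it's reproducible
tensorflow==2.9.1
```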

&lt;p&gt;One common way to solve these problems is to build a container. You can install CUDA into your container; if you include your model weights in the image, Docker will make sure you only have a single copy of that data on each machine, as long as you’re managing your Docker &lt;a href="https://stackoverflow.com/questions/31222377/what-are-docker-image-layers" rel="noopener noreferrer"&gt;layers&lt;/a&gt; correctly; and you’ll be able to choose and test the specific version of tensorflow that goes into the container.&lt;/p&gt;

&lt;p&gt;So we’ve solved a bunch of problems, but we’ve traded them for a new problem, which is that each container runs in its own little world and we have to do some work to expose our API to our user. With a library, the consumer can just call a Python function, pass in some inputs and get back some outputs. With a container, our model is more like an app or service, so there’s no built-in way for a consumer to “call” the container with some input data and get back some output data.&lt;/p&gt;

&lt;p&gt;One option is to create a command line interface, but that requires explicitly binding input and output files to the container. This feels a bit unnatural, but we can see an example of it in &lt;a href="https://hub.docker.com/r/dpokidov/imagemagick" rel="noopener noreferrer"&gt;this container image of ImageMagick&lt;/a&gt; by dpokidov. In the “Usage” section, the author recommends binding a local folder into the container in order to run it as a command line app.&lt;/p&gt;

&lt;p&gt;The traditional answer to this question for Docker images is to expose an HTTP-based API, which is what Gravity AI’s containers do. But this means turning our function (e.g. &lt;code&gt;classifier()&lt;/code&gt;) into a service, which means we need to figure out where to put this service. To give the traditional answer again, we could deploy it to Kubernetes with an autoscaler and a load balancer in front of it, which can work well if e.g. you have a constant stream of processes that need to call &lt;code&gt;classifier()&lt;/code&gt;. But you might instead have a use case where some data shows up from a vendor every few hours which triggers a batch processing job. In that case, things can get a bit weird. You might be in a situation where the batch job is running, calls &lt;code&gt;classifier()&lt;/code&gt;, then has to wait for the autoscaler/Kubernetes to find a free machine that can spin up the service while the machine running the batch job is sitting idle.&lt;/p&gt;

&lt;p&gt;In other words, both of these options (library vs service) are reasonable, but they come with their own disadvantages.&lt;/p&gt;

&lt;p&gt;As a bit of an aside, you could imagine a way to get the best of both worlds with an extension to Docker that would allow you to publish a container that exposes a Python API, so that someone could call &lt;code&gt;sentiment = call_container_api(image="huggingface/transformers", "my input text")&lt;/code&gt; directly from their python code. This would effectively be a remote procedure call into a container that is &lt;em&gt;not&lt;/em&gt; running as a service but instead spun up just for the purpose of executing a function on-demand. This feels like a really heavyweight approach to solving dependency hell, but if your libraries are using a cross-platform memory format (hello &lt;a href="https://arrow.apache.org/" rel="noopener noreferrer"&gt;Apache Arrow&lt;/a&gt;!) under the covers, you could imagine doing some fun tricks like giving the container a read-only view into the caller’s memory space to reduce the overhead. It’s a bit implausible, but sometimes it’s helpful to sketch out these ideas to clarify the tradeoffs we’re making with the more practical bits of technology we have available.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using models-in-containers locally
&lt;/h4&gt;

&lt;p&gt;In this article, we’ll tackle this batch jobs-with-containers scenario. To make this concrete, let’s say that every morning, a vendor gives us a relatively large dump of everything everyone has said on the internet (Twitter, Reddit, Seeking Alpha, etc.) about the companies in the S&amp;amp;P 500 overnight. We want to feed these pieces of text into &lt;a href="https://github.com/ProsusAI/finBERT" rel="noopener noreferrer"&gt;FinBERT&lt;/a&gt;, which is a version of &lt;a href="https://arxiv.org/pdf/1810.04805.pdf" rel="noopener noreferrer"&gt;BERT&lt;/a&gt; that has been &lt;a href="https://flyyufelix.github.io/2016/10/03/fine-tuning-in-keras-part1.html" rel="noopener noreferrer"&gt;fine-tuned&lt;/a&gt; for financial sentiment analysis. BERT is a &lt;a href="https://en.wikipedia.org/wiki/BERT_(language_model)" rel="noopener noreferrer"&gt;language model&lt;/a&gt; from Google that was state of the art when it was published in 2018.&lt;/p&gt;

&lt;p&gt;We’ll be using Gravity AI’s &lt;a href="https://www.gravity-ai.com/catalog/product/dc7843d9-f2eb-4d71-aeb8-8607e646545a" rel="noopener noreferrer"&gt;container for FinBERT&lt;/a&gt;, but we’ll also assume that we operate in a primarily batch process-oriented environment, so we don’t have e.g. a Kubernetes cluster set up and even if we did it would probably be tricky to get the autoscaling right because of our usage pattern.&lt;/p&gt;

&lt;p&gt;If we’re just trying this out on our local machine, it’s pretty straightforward to use, as per &lt;a href="https://docs.gravity-ai.com/using-containers#running-the-container-locally" rel="noopener noreferrer"&gt;Gravity AI’s documentation&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker load &lt;span class="nt"&gt;-i&lt;/span&gt; Sentiment_Analysis_o_77f77f.docker.tar.gz
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 7000:80 gx-images:t-39c447b9e5b94d7ab75060d0a927807f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Sentiment_Analysis_o_77f77f.docker.tar.gz&lt;/code&gt; is the name of the file you download from Gravity AI and &lt;code&gt;gx-images:t-39c447b9e5b94d7ab75060d0a927807f&lt;/code&gt; is the name of the Docker image once it’s loaded from the .tar.gz file, which will show up in the output of the first command.&lt;/p&gt;

&lt;p&gt;And then we can write a bit of glue code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

import requests


def upload_license_file(base_url: str) -&amp;gt; None:
    with open("Sentiment Analysis on Financial Text.gravity-ai.key", "r") as f:
        response = requests.post(f"{base_url}/api/license/file", files={"License": f})
        response.raise_for_status()


def call_finbert_container(base_url: str, input_text: str) -&amp;gt; str:
    add_job_response = requests.post(
        f"{base_url}/data/add-job",
        files={
            "CallbackUrl": (None, ""),
            "File": ("temp.txt", input_text, "text/plain"),
            "MimeType": (None, "text/plain"),
        }
    )
    add_job_response.raise_for_status()
    add_job_response_json = add_job_response.json()
    if (
        add_job_response_json.get("isError", False)
        or add_job_response_json.get("errorMessage") is not None
    ):
        raise ValueError(f"Error from server: {add_job_response_json.get('errorMessage')}")
    job_id = add_job_response_json["data"]["id"]
    job_status = add_job_response_json["data"]["status"]

    while job_status != "Complete":
        status_response = requests.get(f"{base_url}/data/status/{job_id}")
        status_response.raise_for_status()
        job_status = status_response.json()["data"]["status"]
        time.sleep(1)

    result_response = requests.get(f"{base_url}/data/result/{job_id}")
    result_response.raise_for_status()
    return result_response.text


def process_data():
    base_url = "http://localhost:7000"

    upload_license_file(base_url)

    # Pretend to query input data from somewhere
    sample_data = [
        "Finnish media group Talentum has issued a profit warning",
        "The loss for the third quarter of 2007 was EUR 0.3 mn smaller than the loss of"
        " the second quarter of 2007"
    ]

    results = [call_finbert_container(base_url, line) for line in sample_data]

    # Pretend to store the output data somewhere
    print(results)


if __name__ == "__main__":
    process_data()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;call_finbert_container&lt;/code&gt; calls the REST API provided by the container to submit a job, polls for the job completion, and then returns the job result. &lt;code&gt;process_data&lt;/code&gt; pretends to get some text data and processes it using our container, and then pretends to write the output somewhere else. We’re also assuming you’ve downloaded the Gravity AI key to the current directory as &lt;code&gt;Sentiment Analysis on Financial Text.gravity-ai.key&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using models-in-containers on the cloud
&lt;/h4&gt;

&lt;p&gt;This works great for playing around with our model locally, but at some point we’ll probably want to run this on the cloud, either to access additional compute whether that’s CPU or GPU, or to run this as a scheduled job with e.g. Airflow. The usual thing to do is package up our code (finbert_local_example.py and its dependencies) as a container, which means now we have two containers — one containing our glue code, and the FinBERT container that we need to launch together and coordinate (i.e. our glue code container needs to know the address/name of the FinBERT container to access it). We might start reaching for &lt;a href="https://docs.docker.com/compose/" rel="noopener noreferrer"&gt;Docker Compose&lt;/a&gt; which works great for long-running services, but in the context of an ad-hoc distributed batch job or a scheduled job, it will be tricky to work with.&lt;/p&gt;

&lt;p&gt;Instead, we’ll use &lt;a href="https://meadowrun.io/" rel="noopener noreferrer"&gt;Meadowrun&lt;/a&gt; to do most of the heavy lifting. Meadowrun will not only take care of the usual difficulties of allocating instances, deploying our code, etc., but also help launch an extra container and make it available to our code.&lt;/p&gt;

&lt;p&gt;To follow along, you’ll need to set up an environment. We’ll show how this works with pip on Windows, but you should be able to follow along with the package manager of your choice (conda environments don’t work across platforms, so conda will only work if you’re on Linux).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m venv venv
venv\Scripts\activate.bat
pip install requests meadowrun
meadowrun-manage-ec2 install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a new virtualenv, adds requests and meadowrun, and then installs Meadowrun into your AWS account.&lt;/p&gt;

&lt;p&gt;When you download a container from Gravity AI, it comes as a .tar.gz file that needs to get uploaded to a container registry in order to work. There are slightly longer instructions in the Gravity AI documentation, but here’s a short version of how to create an &lt;a href="https://aws.amazon.com/ecr/" rel="noopener noreferrer"&gt;ECR&lt;/a&gt; (Elastic Container Registry) repository then upload a container from Gravity AI to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ecr create-repository &lt;span class="nt"&gt;--repository-name&lt;/span&gt; mygravityaiaws ecr get-login-password | docker login &lt;span class="nt"&gt;--username&lt;/span&gt; AWS &lt;span class="nt"&gt;--password-stdin&lt;/span&gt; 012345678901.dkr.ecr.us-east-2.amazonaws.comdocker tag gx-images:t-39c447b9e5b94d7ab75060d0a927807f 012345678901.dkr.ecr.us-east-2.amazonaws.com/mygravityai:finbertdocker push 012345678901.dkr.ecr.us-east-2.amazonaws.com/mygravityai:finbert
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;012345678901&lt;/code&gt; appears in a few places in this snippet and it needs to be replaced with your account id. You’ll see your account id in the output of the first command as &lt;code&gt;registryId&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One more step before we can run some code: we’ll need to give the Meadowrun role permissions to this ECR repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;meadowrun-manage-ec2 grant-permission-to-ecr-repo mygravityai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can run our code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;finbert_local_example&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;upload_license_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_finbert_container&lt;/span&gt;

&lt;span class="c1"&gt;# Normally this would live in S3 or a database somewhere
&lt;/span&gt;&lt;span class="n"&gt;_SAMPLE_DATA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Company1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Commission income fell to EUR 4.6 mn from EUR 5.1 mn in the corresponding &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;period in 2007&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The purchase price will be paid in cash upon the closure of the transaction, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scheduled for April 1 , 2009&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Company2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The loss for the third quarter of 2007 was EUR 0.3 mn smaller than the loss of&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; the second quarter of 2007&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Consolidated operating profit excluding one-off items was EUR 30.6 mn, up from&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; EUR 29.6 mn a year earlier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_one_company&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://container-service-0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;data_for_company&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_SAMPLE_DATA&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nf"&gt;upload_license_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://container-service-0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;call_finbert_container&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_for_company&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;process_one_company&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Company1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Company2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AllocCloudInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EC2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logical_cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_eviction_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deployment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mirror_local&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;working_directory_globs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;sidecar_containers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ContainerInterpreter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;012345678901.dkr.ecr.us-east-2.amazonaws.com/mygravityai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finbert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;process_data&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s walk through the &lt;code&gt;process_data&lt;/code&gt; function line by line.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Meadowrun’s &lt;a href="https://docs.meadowrun.io/en/stable/reference/apis/#meadowrun.run_map" rel="noopener noreferrer"&gt;run_map&lt;/a&gt; function runs the specified function (in this case &lt;code&gt;process_one_company&lt;/code&gt;) in parallel on the cloud. In this case, we provide two arguments (&lt;code&gt;["Company1", "Company2"]&lt;/code&gt;) so we’ll run these two tasks in parallel. The idea is that we’re splitting up the workload so that we can finish the job quickly.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://docs.meadowrun.io/en/stable/reference/apis/#meadowrun.AllocCloudInstance" rel="noopener noreferrer"&gt;AllocCloudInstance&lt;/a&gt; tells Meadowrun to launch an EC2 instance to run this job if we don’t have one running already, and &lt;a href="https://docs.meadowrun.io/en/stable/reference/apis/#meadowrun.Resources" rel="noopener noreferrer"&gt;Resources&lt;/a&gt; tells Meadowrun what resources are needed to run this code. In this case we’re requesting 1 CPU and 1.5 GB of RAM per task. We’re also specifying that we’re okay with spot instances up to an 80% eviction rate (aka probability of interruption).&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://docs.meadowrun.io/en/stable/reference/apis/#meadowrun.run_job.Deployment.mirror_local" rel="noopener noreferrer"&gt;mirror_local&lt;/a&gt; tells Meadowrun that we want to use the code in the current directory, which is important as we’re reusing some code from finbert_local_example.py. Meadowrun only uploads .py files by default, but in our case, we need to include the .key file in our current working directory so that we can apply the Gravity AI license.&lt;/li&gt;
&lt;li&gt;  Finally, &lt;code&gt;sidecar_containers&lt;/code&gt; tells Meadowrun to launch a container with the specified image alongside every task we have running in parallel. Each task can access its associated container as &lt;code&gt;sidecar-container-0&lt;/code&gt;, which you can see in the code for &lt;code&gt;process_one_company&lt;/code&gt;. If you’re following along, you’ll need to edit the account id again to match yours.&lt;/li&gt;
&lt;/ul&gt;
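&lt;p&gt;Conceptually, &lt;code&gt;run_map&lt;/code&gt; behaves like a distributed version of Python’s built-in &lt;code&gt;map&lt;/code&gt;. As a rough local stand-in (a sketch of the idea, not Meadowrun’s implementation), you can picture it as mapping the function over the arguments with a worker pool:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def process_one_company(company: str) -> str:
    # Stand-in for the real per-company work (calling the FinBERT container, etc.)
    return f"processed {company}"


def run_map_locally(func, args):
    # Local approximation of meadowrun.run_map: run one task per argument in
    # parallel and return results in the same order as the arguments.
    with ThreadPoolExecutor() as executor:
        return list(executor.map(func, args))


if __name__ == "__main__":
    print(run_map_locally(process_one_company, ["Company1", "Company2"]))
```

&lt;p&gt;The real &lt;code&gt;run_map&lt;/code&gt; does the same fan-out, except each task runs in a worker on a cloud instance rather than in a local pool.&lt;/p&gt;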

&lt;p&gt;Let’s look at a few of the more important lines from the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Launched 1 new instance(s) (total $0.0209/hr) for the remaining 2 workers:
 ec2-13-59-48-22.us-east-2.compute.amazonaws.com: r5d.large (2.0 CPU, 16.0 GB), spot ($0.0209/hr, 2.5% chance of interruption), will run 2 workers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here Meadowrun is telling us that it’s launching the cheapest EC2 instance that can run our job. In this case, we’re only paying 2¢ per hour!&lt;/p&gt;
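&lt;p&gt;To make the “cheapest instance that fits” idea concrete, here’s a toy version of that selection. This is an illustration only, not Meadowrun’s actual algorithm, and the candidate list and prices are made up:&lt;/p&gt;

```python
def cheapest_instance(candidates, cpu_needed, memory_gb_needed):
    """Toy illustration: pick the lowest-priced instance type that satisfies
    the job's total CPU and memory requirements."""
    feasible = [
        c for c in candidates
        if c["cpu"] >= cpu_needed and c["memory_gb"] >= memory_gb_needed
    ]
    if not feasible:
        raise ValueError("no instance type satisfies the requirements")
    return min(feasible, key=lambda c: c["price_per_hour"])


# Hypothetical spot prices. 2 workers x (1 CPU, 1.5 GB) means we need
# 2 CPUs and 3 GB in total, so a single r5d.large can run both workers.
candidates = [
    {"name": "r5d.large", "cpu": 2, "memory_gb": 16, "price_per_hour": 0.0209},
    {"name": "m5.xlarge", "cpu": 4, "memory_gb": 16, "price_per_hour": 0.0464},
]
print(cheapest_instance(candidates, 2, 3)["name"])
```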

&lt;p&gt;Next, Meadowrun will replicate our local environment on the EC2 instance by building a new container image, and then also pull the FinBERT container that we specified:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Building python environment in container  4e4e2c...
...
Pulling docker image 012345678901.dkr.ecr.us-east-2.amazonaws.com/mygravityai:finbert
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our final result will be some output from the FinBERT model that looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[['sentence,logit,prediction,sentiment_score\nCommission income fell to EUR 4.6 mn from EUR 5.1 mn in the corresponding period in 2007,[0.24862055 0.44351986 0.30785954],negative,-0.1948993\n',
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Closing remarks
&lt;/h4&gt;

&lt;p&gt;Gravity AI packages up ML models as containers, which are really easy to use as services. Making these containers work naturally for batch jobs takes a bit of work, but Meadowrun makes it easy!&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>python</category>
      <category>docker</category>
    </item>
    <item>
      <title>Run Your Own DALL·E Mini (Craiyon) Server on EC2</title>
      <dc:creator>Hyunho Richard Lee</dc:creator>
      <pubDate>Tue, 26 Jul 2022 15:53:00 +0000</pubDate>
      <link>https://dev.to/meadowrun/run-your-own-dalle-mini-craiyon-server-on-ec2-1pjh</link>
      <guid>https://dev.to/meadowrun/run-your-own-dalle-mini-craiyon-server-on-ec2-1pjh</guid>
      <description>&lt;p&gt;In case you’ve been under a rock for the last few months, &lt;a href="https://openai.com/blog/dall-e/" rel="noopener noreferrer"&gt;DALL·E&lt;/a&gt; is an ML model from OpenAI that &lt;a href="https://www.instagram.com/openaidalle/" rel="noopener noreferrer"&gt;generates&lt;/a&gt; &lt;a href="https://qz.com/2176389/the-best-examples-of-dall-e-2s-strange-beautiful-ai-art/" rel="noopener noreferrer"&gt;images&lt;/a&gt; from text prompts. &lt;a href="https://github.com/borisdayma/dalle-mini" rel="noopener noreferrer"&gt;DALL·E Mini&lt;/a&gt; (renamed to Craiyon) by Boris Dayma et al. is a less powerful but open version of DALL·E, and there’s a hosted version at &lt;a href="https://www.craiyon.com/" rel="noopener noreferrer"&gt;craiyon.com&lt;/a&gt; for everyone to try.&lt;/p&gt;

&lt;p&gt;If you’re anything like us, though, you’ll feel compelled to poke around the code and run the model yourself. We’ll do that in this article using &lt;a href="https://meadowrun.io/" rel="noopener noreferrer"&gt;Meadowrun&lt;/a&gt;, an open-source library that makes it easy to run Python code in the cloud. For ML models in particular, we just added a feature for requesting GPU machines in a &lt;a href="https://github.com/meadowdata/meadowrun/releases/tag/v0.1.14" rel="noopener noreferrer"&gt;recent release&lt;/a&gt;. We’ll also feed the images generated by DALL·E Mini into additional image processing models (&lt;a href="https://github.com/Jack000/glid-3-xl" rel="noopener noreferrer"&gt;GLID-3-xl&lt;/a&gt; and &lt;a href="https://github.com/JingyunLiang/SwinIR" rel="noopener noreferrer"&gt;SwinIR&lt;/a&gt;) to improve the quality of our generated images. Along the way we’ll deal with the speedbumps that come up when running open-source ML models on EC2.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running dalle-playground
&lt;/h2&gt;

&lt;p&gt;For the first half of this article, we’ll show how to run &lt;a href="https://github.com/saharmor/dalle-playground" rel="noopener noreferrer"&gt;saharmor/dalle-playground&lt;/a&gt;, which wraps the DALL·E Mini code in an HTTP API, and provides a simple web page to generate images via that API.&lt;/p&gt;

&lt;p&gt;dalle-playground provides a &lt;a href="https://colab.research.google.com/github/saharmor/dalle-playground/blob/main/backend/dalle_playground_backend.ipynb" rel="noopener noreferrer"&gt;Jupyter notebook&lt;/a&gt; that you can run in Google Colab. If you’re doing anything more than kicking the tires, though, you’ll run into the &lt;a href="https://stackoverflow.com/questions/61126851/how-can-i-use-gpu-on-google-colab-after-exceeding-usage-limit" rel="noopener noreferrer"&gt;dynamic usage limit&lt;/a&gt; in Colab’s free tier. You could upgrade to Colab Pro ($9.99/month) or Colab Pro+ ($49.99/month), but we’ll get this functionality for pennies on the dollar by using AWS directly!&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;First, you’ll need an AWS account. If you’ve never used GPU instances in AWS before, you’ll probably need to increase your quotas. AWS accounts have per-region quotas that limit how many vCPUs from a particular instance family you can run at once. There are 4 quotas for GPU instances:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  L-3819A6DF: “&lt;a href="https://console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-3819A6DF" rel="noopener noreferrer"&gt;All G and VT Spot Instance Requests&lt;/a&gt;”&lt;/li&gt;
&lt;li&gt;  L-7212CCBC: “&lt;a href="https://console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-7212CCBC" rel="noopener noreferrer"&gt;All P Spot Instance Requests&lt;/a&gt;”&lt;/li&gt;
&lt;li&gt;  L-DB2E81BA: “&lt;a href="https://console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-DB2E81BA" rel="noopener noreferrer"&gt;Running On-Demand G and VT instances&lt;/a&gt;”&lt;/li&gt;
&lt;li&gt;  L-417A185B: “&lt;a href="https://console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-417A185B" rel="noopener noreferrer"&gt;Running On-Demand P instances&lt;/a&gt;”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all set to 0 for a new AWS account, so if you try to run the code below, you’ll get this message from Meadowrun:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Unable to launch new g4dn.xlarge spot instances due to the L-3819A6DF
quota which is set to 0. This means you cannot have more than 0 CPUs
across all of your spot instances from the g, vt instance families.
This quota is currently met. Run `aws service-quotas
request-service-quota-increase --service-code ec2 --quota-code
L-3819A6DF --desired-value X` to set the quota to X, where X is
larger than the current quota. (Note that terminated instances
sometimes count against this limit:
https://stackoverflow.com/a/54538652/908704 Also, quota increases are
not granted immediately.)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re giving this a go, we recommend requesting a quota increase either by running the command in that message or by clicking one of the links in the list above (if you use a link, make sure the console region matches your AWS CLI’s region, as given by &lt;code&gt;aws configure get region&lt;/code&gt;). AWS seems to have a human in the loop for granting quota increases, and in our experience it can take a day or two for an increase to be granted.&lt;/p&gt;
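&lt;p&gt;Since there are four quota codes, it can be handy to generate all four commands at once. The snippet below only builds and prints the commands so you can review them before running anything; the desired value of 16 is just an example, pick whatever fits your workload:&lt;/p&gt;

```python
# The four GPU-related quota codes from the list above.
QUOTA_CODES = {
    "L-3819A6DF": "All G and VT Spot Instance Requests",
    "L-7212CCBC": "All P Spot Instance Requests",
    "L-DB2E81BA": "Running On-Demand G and VT instances",
    "L-417A185B": "Running On-Demand P instances",
}


def quota_increase_command(quota_code: str, desired_value: int) -> str:
    # Build (but don't run) the AWS CLI command for one quota increase.
    return (
        "aws service-quotas request-service-quota-increase "
        f"--service-code ec2 --quota-code {quota_code} "
        f"--desired-value {desired_value}"
    )


for code, description in QUOTA_CODES.items():
    print(f"# {description}")
    print(quota_increase_command(code, 16))
```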

&lt;p&gt;Second, we’ll need a local Python environment with Meadowrun, and then we’ll install Meadowrun in our AWS account. Here’s an example using pip in Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv meadowrun-venv
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;meadowrun-venv/bin/activate
&lt;span class="nv"&gt;$ &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;meadowrun
&lt;span class="nv"&gt;$ &lt;/span&gt;meadowrun-manage-ec2 &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--allow-authorize-ips&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Running DALL·E Mini
&lt;/h3&gt;

&lt;p&gt;Now that we have that out of the way, it’s easy to run the dalle-playground backend!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_dallemini&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python backend/app.py --port 8080 --model_version mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AllocCloudInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EC2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;logical_cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;memory_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_eviction_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;gpu_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvidia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deployment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;git_repo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/hrichardlee/dalle-playground&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;interpreter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PipRequirementsFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backend/requirements.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;ports&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run_dallemini&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A quick tour of this snippet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;run_command&lt;/code&gt; tells Meadowrun to run &lt;code&gt;python backend/app.py --port 8080 --model_version mini&lt;/code&gt; on an EC2 instance. This starts the dalle-playground backend on port 8080, using the mini version of DALL·E Mini. The &lt;a href="https://huggingface.co/dalle-mini/dalle-mini" rel="noopener noreferrer"&gt;mini version&lt;/a&gt; is 27 times smaller than the &lt;a href="https://huggingface.co/dalle-mini/dalle-mega" rel="noopener noreferrer"&gt;mega version&lt;/a&gt; of DALL·E Mini, which makes it less powerful but easier to run.&lt;/li&gt;
&lt;li&gt;  The next few lines tell Meadowrun what the requirements for our job are: 1 CPU, 16 GB of main memory, and we’re okay with spot instances up to an 80% probability of eviction (aka interruption). The instance types we’ll be using do tend to get interrupted, so if that becomes a problem we can change this to 0% which tells Meadowrun we want an on-demand instance. We also ask for an Nvidia GPU that has at least 4GB of GPU memory which is what’s needed by the mini model.&lt;/li&gt;
&lt;li&gt;  Next, we want the code in the &lt;a href="https://github.com/hrichardlee/dalle-playground" rel="noopener noreferrer"&gt;https://github.com/hrichardlee/dalle-playground&lt;/a&gt; repo, and we want to construct a pip environment from the backend/requirements.txt file in that repo. We were almost able to use the &lt;a href="https://github.com/saharmor/dalle-playground" rel="noopener noreferrer"&gt;saharmor/dalle-playground&lt;/a&gt; repo as-is, but we had to make &lt;a href="https://github.com/hrichardlee/dalle-playground/commit/85ecaf866ab74c58295fb0f91846b4bb4326f04b" rel="noopener noreferrer"&gt;one change&lt;/a&gt; to add the jax[cuda] package to the requirements.txt file. In case you haven’t seen &lt;a href="https://github.com/google/jax" rel="noopener noreferrer"&gt;jax&lt;/a&gt; before, jax is a machine-learning library from Google, roughly equivalent to Tensorflow or PyTorch. It combines &lt;a href="https://github.com/HIPS/autograd" rel="noopener noreferrer"&gt;Autograd&lt;/a&gt; for automatic differentiation and &lt;a href="https://www.tensorflow.org/xla" rel="noopener noreferrer"&gt;XLA&lt;/a&gt; (accelerated linear algebra) for JIT-compiling numpy-like code for Google’s TPUs or Nvidia’s CUDA API for GPUs. The CUDA support requires explicitly selecting the [cuda] option when we install the package.&lt;/li&gt;
&lt;li&gt;  Finally, we tell Meadowrun that we want to open port 8080 on the machine that’s running this job so that we can access the backend from our current IP address. Be careful with this! dalle-playground doesn’t use TLS and it’s not a good idea to give everyone with your IP address access to this interface forever.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To walk through selected parts of the output from this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Launched a new instance for the job:
ec2-3-138-184-193.us-east-2.compute.amazonaws.com: g4dn.xlarge (4.0
CPU, 16.0 GB, 1.0 GPU), spot ($0.1578/hr, 61.0% chance of
interruption), will run 1 workers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here Meadowrun tells us everything we need to know about the instance it started for this job and how much it will cost us (only about 16¢ per hour!).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Building python environment in container  eccac6...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, Meadowrun is building a container based on the contents of the requirements.txt file we specified. This takes a while, but Meadowrun caches the image in &lt;a href="https://aws.amazon.com/ecr/" rel="noopener noreferrer"&gt;ECR&lt;/a&gt; for you so this only needs to happen once (until your requirements.txt file changes). Meadowrun also cleans up the image if you don’t use it for a while.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\--&amp;gt; Starting DALL-E Server. This might take up to two minutes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we’ve gotten to the code in dalle-playground, which needs to do a few minutes of initialization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\--&amp;gt; DALL-E Server is up and running!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now we’re up and running!&lt;/p&gt;

&lt;p&gt;Now we’ll need to run the front end on our local machine (if you don’t have npm, you’ll need to install &lt;a href="https://nodejs.org/en/download/" rel="noopener noreferrer"&gt;node.js&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/saharmor/dalle-playground
cd dalle-playground/interface
npm start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll want to construct the backend URL in a separate editor, e.g. &lt;code&gt;http://ec2-3-138-184-193.us-east-2.compute.amazonaws.com:8080&lt;/code&gt;, and copy/paste it into the webapp. Typing it in directly fires off requests to each partially complete URL, and those requests fail slowly.&lt;/p&gt;
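&lt;p&gt;You can also skip the web UI entirely and call the backend directly. The endpoint name and JSON shape below are assumptions based on reading the dalle-playground code, not a documented stable API, so treat this as a sketch:&lt;/p&gt;

```python
import json
import urllib.request


def build_backend_url(host: str, port: int = 8080) -> str:
    # Build the full backend URL once instead of typing it into the webapp.
    return f"http://{host}:{port}"


def build_generate_request(backend_url: str, prompt: str, num_images: int):
    # Assumed endpoint ("/dalle") and payload shape from the dalle-playground
    # backend; adjust if the repo's API differs.
    body = json.dumps({"text": prompt, "num_images": num_images}).encode()
    return urllib.request.Request(
        f"{backend_url}/dalle",
        data=body,
        headers={"Content-Type": "application/json"},
    )
```

&lt;p&gt;With the backend up, calling &lt;code&gt;urllib.request.urlopen&lt;/code&gt; on that request should return the generated images as JSON.&lt;/p&gt;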

&lt;p&gt;Time to generate some images!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AxDpc4pYAIALZxG-_P575tw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AxDpc4pYAIALZxG-_P575tw.png" alt="Images generated by DALL·E Mini for batman praying in the garden of gethsemane"&gt;&lt;/a&gt;&lt;br&gt;DALL·E Mini (mini version): batman praying in the garden of gethsemane
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A6hKd9nkzc3TsNxMpvwin2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A6hKd9nkzc3TsNxMpvwin2w.png" alt="Images generated by DALL·E Mini for olive oil and vinegar drizzled on a plate in the shape of the solar system"&gt;&lt;/a&gt;&lt;br&gt;DALL·E Mini (mini version): olive oil and vinegar drizzled on a plate in the shape of the solar system
  &lt;/p&gt;

&lt;p&gt;It was pretty easy to get this working, but this model isn’t really doing what we’re asking it to do. For the first set of images, we clearly have a Batman-like figure, but he’s not really praying and I’m not sure he’s in the garden of Gethsemane. For the second set of images, it looks like we’re either getting olive oil or a planet, but we’re not getting both in the same image, let alone the whole solar system. Let’s see if the “mega” version of DALL·E Mini can do any better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running DALL·E Mega
&lt;/h3&gt;

&lt;p&gt;DALL·E Mega is a larger version of DALL·E Mini, meaning it has the same architecture but more parameters. Theoretically we can just replace &lt;code&gt;--model_version mini&lt;/code&gt; with &lt;code&gt;--model_version mega_full&lt;/code&gt; in the previous snippet and get the mega version. When we do this, though, the dalle-playground initialization code takes about 45 minutes.&lt;/p&gt;

&lt;p&gt;We don’t need any real profiling to figure this one out—if you just kill the process after it’s running for a while, the stack trace will clearly show that the culprit is the &lt;a href="https://github.com/borisdayma/dalle-mini/blob/ec07a902e440077efc84dde401ac6f65af5c0a09/src/dalle_mini/model/utils.py#L14" rel="noopener noreferrer"&gt;from_pretrained&lt;/a&gt; function, which is downloading the pretrained model weights from Weights and Biases (aka wandb). Weights and Biases is an MLOps platform that helps you keep track of the code, data, and analyses that go into training and evaluating an ML model. For the purposes of this article, it’s where we go to download pretrained model weights. We can look at the &lt;a href="https://github.com/saharmor/dalle-playground/blob/da74e4580d881b0a72f21578b3fca35735d9d166/backend/consts.py#L5" rel="noopener noreferrer"&gt;specification for the artifacts&lt;/a&gt; we’re downloading from wandb, browse to the &lt;a href="https://wandb.ai/dalle-mini/dalle-mini/artifacts/DalleBart_model/mega-1/v16/files" rel="noopener noreferrer"&gt;web view&lt;/a&gt; for the mega version and see that the main file we need is about 10GB. If we &lt;a href="https://docs.meadowrun.io/en/stable/how_to/ssh_to_instance/" rel="noopener noreferrer"&gt;ssh into the EC2 instance&lt;/a&gt; that Meadowrun creates to run this command and run &lt;a href="https://www.linuxjournal.com/content/sysadmins-toolbox-iftop" rel="noopener noreferrer"&gt;iftop&lt;/a&gt;, we can see that we’re getting a leisurely 35 Mbps from wandb.&lt;/p&gt;

&lt;p&gt;We don’t want to wait 45 minutes every time we run DALL-E Mega, and it’s painful to see a powerful GPU machine sipping 35 Mbps off the internet while almost all of its resources sit idle. So we made &lt;a href="https://github.com/hrichardlee/dalle-playground/commit/3c39840b55964ac01f47f39c8340b4bcea52a6d6" rel="noopener noreferrer"&gt;some tweaks&lt;/a&gt; to dalle-playground to cache the artifacts in S3. &lt;a href="https://github.com/hrichardlee/dalle-playground/blob/s3cache/backend/cache_in_s3.py" rel="noopener noreferrer"&gt;cache_in_s3.py&lt;/a&gt; effectively calls &lt;code&gt;wandb.Api().artifact("dalle-mini/dalle-mini/mega-1:latest").download()&lt;/code&gt; then uploads the artifacts to S3. To follow along, you’ll first need to create an S3 bucket and give the Meadowrun EC2 role access to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3 mb s3://meadowrun-dallemini
meadowrun-manage-ec2 grant-permission-to-s3-bucket meadowrun-dallemini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember that S3 bucket names need to be globally unique, so you won’t be able to use the exact same name we’re using here.&lt;/p&gt;

&lt;p&gt;Then we can use Meadowrun to kick off the long-running download job on a much cheaper machine—note that we’re only requesting 2 GB of memory and no GPUs for this job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cache_pretrained_model_in_s3&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python backend/cache_in_s3.py --model_version mega_full --s3_bucket meadowrun-dallemini --s3_bucket_region us-east-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AllocCloudInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EC2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deployment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;git_repo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/hrichardlee/dalle-playground&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;interpreter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PipRequirementsFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backend/requirements_for_caching.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;cache_pretrained_model_in_s3&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ve also &lt;a href="https://github.com/hrichardlee/dalle-playground/compare/main...s3cache" rel="noopener noreferrer"&gt;changed the model code&lt;/a&gt; to download files from S3 instead of wandb. We’re downloading the files into the special &lt;code&gt;/var/meadowrun/machine_cache&lt;/code&gt; folder which is shared across Meadowrun-launched containers on a machine. That way, if we run the same container multiple times on the same machine, we won’t need to redownload these files.&lt;/p&gt;

&lt;p&gt;Once that’s in place, we can run the mega version and it will start up relatively quickly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_dallemega&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python backend/app.py --port 8080 --model_version mega_full --s3_bucket meadowrun-dallemini --s3_bucket_region us-east-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AllocCloudInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EC2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gpu_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvidia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deployment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;git_repo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/hrichardlee/dalle-playground&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;interpreter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PipRequirementsFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backend/requirements.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;ports&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run_dallemega&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to note about this snippet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  We’re asking Meadowrun to use the &lt;code&gt;s3cache&lt;/code&gt; branch of our git repo, which includes the changes to allow caching/retrieving the artifacts in S3.&lt;/li&gt;
&lt;li&gt;  We’ve increased the requirements to 32 GB of main memory and 12 GB of GPU memory, which the larger model requires.&lt;/li&gt;
&lt;li&gt;  The first time we run, Meadowrun builds a new image because we added the boto3 package for fetching our cached files from S3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One last note—Meadowrun’s &lt;code&gt;install&lt;/code&gt; sets up an AWS Lambda that runs periodically and cleans up your instances automatically if you haven’t run a job for a while. To be extra safe, you can also manually clean up instances with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meadowrun-manage-ec2 clean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s what we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AgeG0YcfWAOy2woxKKXuknQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AgeG0YcfWAOy2woxKKXuknQ.png" alt="Images generated by DALL·E Mini for batman praying in the garden of gethsemane"&gt;&lt;/a&gt;&lt;br&gt;DALL·E Mega (full version of DALL·E Mini): batman praying in the garden of gethsemane
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ACFlG_YnqLyUJr9UsyCaV3A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ACFlG_YnqLyUJr9UsyCaV3A.png" alt="Images generated by DALL·E Mini for olive oil and vinegar drizzled on a plate in the shape of the solar system"&gt;&lt;/a&gt;&lt;br&gt;DALL·E Mega (full version of DALL·E Mini): olive oil and vinegar drizzled on a plate in the shape of the solar system
  &lt;/p&gt;

&lt;p&gt;Much better! For the first set of images, I’m not sure Batman is praying in all of those images, but he’s definitely Batman and he’s definitely in the garden of Gethsemane. For the second set of images, we have a plate now, some olive oil and vinegar, and it definitely looks like more of a solar system. The images aren’t quite on par with OpenAI’s DALL·E yet, but they are noticeably better! Unfortunately there’s not too much more we can do to improve the translation of text to image short of training our own 12 billion-parameter model, but we’ll try tacking on a diffusion model to improve the finer details in the images. We’ll also add a model for upscaling the images, as they’re only 256x256 pixels right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building an image generation pipeline
&lt;/h2&gt;

&lt;p&gt;For the second half of this article, we’ll use &lt;a href="https://github.com/meadowdata/meadowrun-dallemini-demo" rel="noopener noreferrer"&gt;meadowdata/meadowrun-dallemini-demo&lt;/a&gt; which contains a notebook for running multiple models as sequential batch jobs to generate images using Meadowrun. The combination of models is inspired by &lt;a href="https://github.com/jina-ai/dalle-flow" rel="noopener noreferrer"&gt;jina-ai/dalle-flow&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  DALL·E Mini: The model we’ve been focusing on in the first half of this article. &lt;a href="https://ml.berkeley.edu/blog/posts/dalle2/" rel="noopener noreferrer"&gt;This post&lt;/a&gt; is a good guide to how OpenAI’s DALL·E 2 is built. To simplify, DALL·E is a combination of two models. The first model is trained on images and learns how to “compress” images to vectors and then “decompress” those vectors back into the original images. The second model is trained on image/caption pairs and learns how to turn captions into image vectors. After training, we can put new captions into the second model to produce an image vector, and then we can feed that image vector into the first model to produce a novel image.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/Jack000/glid-3-xl" rel="noopener noreferrer"&gt;GLID-3-xl&lt;/a&gt;: A diffusion model. &lt;a href="https://huggingface.co/blog/annotated-diffusion" rel="noopener noreferrer"&gt;Diffusion models&lt;/a&gt; are trained by taking images, blurring (aka diffusing) them, and training the model on original/blurred image pairs. The model learns to reconstruct the original unblurred version from the blurred version. Diffusion models can be used for a variety of tasks, but in this case we’ll use GLID-3-xl to fill in the finer details in our images.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/JingyunLiang/SwinIR" rel="noopener noreferrer"&gt;SwinIR&lt;/a&gt;: A model for upscaling images (aka image restoration). Image restoration models are trained by taking images and downscaling them. The model learns to produce the original higher resolution image from the downscaled image.&lt;/li&gt;
&lt;/ul&gt;
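&lt;p&gt;Putting those three models together, the pipeline we’re about to build looks roughly like this. This is a toy sketch with stubbed-out model calls and made-up function names; the real inference code runs on a GPU instance via Meadowrun:&lt;/p&gt;

```python
# Toy sketch of the three-stage pipeline. The model calls below are stubs
# with made-up names; they only illustrate how outputs flow between stages.
def dalle_mini(prompt, n=8):
    # Stub: generate n low-detail 256x256 candidate images from a caption.
    return [f"256x256 image {i} for {prompt!r}" for i in range(n)]

def glid3xl(image, prompt, n=8):
    # Stub: diffusion model produces n variants with finer details filled in.
    return [f"refined variant {i} of ({image})" for i in range(n)]

def swinir(image):
    # Stub: upscale the chosen image from 256x256 to 1024x1024.
    return f"1024x1024 upscale of ({image})"

prompt = "batman praying in the garden of gethsemane"
candidates = dalle_mini(prompt)           # a human picks the best candidate...
refined = glid3xl(candidates[5], prompt)  # ...then the best refinement...
final = swinir(refined[2])                # ...which gets upscaled
```

&lt;p&gt;Each stage is a separate batch job, with a human choosing which image to feed forward between stages.&lt;/p&gt;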

&lt;p&gt;To run this pipeline, in addition to the prerequisites from the first half of this article, we’ll get the meadowrun-dallemini-demo git repo and the local dependencies, then launch a Jupyter notebook server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/meadowdata/meadowrun-dallemini-demo
cd meadowrun-dallemini-demo
# assuming you are already in a virtualenv from before
pip install -r local_requirements.txt
jupyter notebook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll then need to open the &lt;a href="https://github.com/meadowdata/meadowrun_dallemini_demo/blob/main/meadowrun-image-generation.ipynb" rel="noopener noreferrer"&gt;main notebook&lt;/a&gt; in Jupyter, and edit &lt;code&gt;S3_BUCKET_NAME&lt;/code&gt; and &lt;code&gt;S3_BUCKET_REGION&lt;/code&gt; to match the bucket we created in the first half of this article.&lt;/p&gt;

&lt;p&gt;The code in the notebook is similar to the first half of this article so we won’t go over it in depth. A few notes on what the rest of the code in the repo is doing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  We’ve adapted the sample code that comes with all of our models to use our S3 cache and provide easy-to-use interfaces in &lt;a href="https://github.com/meadowdata/meadowrun-dallemini-demo/blob/main/linux/dalle_wrapper.py" rel="noopener noreferrer"&gt;dalle_wrapper.py&lt;/a&gt;, &lt;a href="https://github.com/meadowdata/meadowrun-dallemini-demo/blob/main/linux/glid3xl_wrapper.py" rel="noopener noreferrer"&gt;glid3xl_wrapper.py&lt;/a&gt;, and &lt;a href="https://github.com/meadowdata/meadowrun-dallemini-demo/blob/main/linux/swinir_wrapper.py" rel="noopener noreferrer"&gt;swinir_wrapper.py&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  We’re referencing our three models directly as git repos (because they’re not available as packages in &lt;a href="https://pypi.org/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;) in &lt;a href="https://github.com/meadowdata/meadowrun-dallemini-demo/blob/main/model_requirements.txt" rel="noopener noreferrer"&gt;model_requirements.txt&lt;/a&gt;, but we had to make a few changes to make these repos work as pip packages. Pip looks for a setup.py file in the git repo to figure out which files from the repo need to be installed into the environment, as well as what the dependencies of that repo are. GLID-3-xl and &lt;a href="https://github.com/CompVis/latent-diffusion" rel="noopener noreferrer"&gt;latent-diffusion&lt;/a&gt; (another diffusion model that GLID-3-xl depends on) had setup.py files that &lt;a href="https://github.com/hrichardlee/glid-3-xl/commit/ffbdd0077c67ac7e7610e09886565c452bc5c71e" rel="noopener noreferrer"&gt;needed&lt;/a&gt; &lt;a href="https://github.com/hrichardlee/latent-diffusion/commit/d9f6b45f7dd5d5dca214f5803f4f3e72af7649a6" rel="noopener noreferrer"&gt;tweaks&lt;/a&gt; to include all of the code needed to run the models. SwinIR didn’t have a setup.py file at all, so we &lt;a href="https://github.com/hrichardlee/SwinIR/commit/ae43e3893bae0bc1a8102c45ac71cffacc02dca0" rel="noopener noreferrer"&gt;added one&lt;/a&gt;. Finally, all of these setup.py files needed additional dependencies, which we just added to the model_requirements.txt file.&lt;/li&gt;
&lt;li&gt;  All of these models are pretty challenging to run on anything other than Linux, which is why we’ve split out the local_requirements.txt from the model_requirements.txt. Even if you’re running on Windows or Mac, you shouldn’t have any trouble running this notebook—Meadowrun takes care of creating the hairy model environment on an EC2 instance running Linux.&lt;/li&gt;
&lt;/ul&gt;
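&lt;p&gt;For reference, pip installs a git repo by cloning it and running its setup.py, so a repo only needs a minimal one to be installable. Something along these lines is enough (illustrative only; this is not the exact file added to SwinIR, and the name and version are placeholders):&lt;/p&gt;

```python
# setup.py: roughly the minimum pip needs to install a repo as a package.
# The name and version here are illustrative placeholders.
from setuptools import setup, find_packages

setup(
    name="swinir",
    version="0.0.1",
    packages=find_packages(),  # include every package directory in the repo
)
```

&lt;p&gt;With that in place, a requirements file can reference the repo directly using pip’s &lt;code&gt;git+https://github.com/user/repo@branch&lt;/code&gt; syntax.&lt;/p&gt;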

&lt;p&gt;And a couple more notes on Meadowrun:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Because we’re running these models as batch jobs instead of as services, Meadowrun can reuse a single EC2 instance to run all of them.&lt;/li&gt;
&lt;li&gt;  If you’re feeling ambitious, you could even use &lt;a href="https://docs.meadowrun.io/en/stable/reference/apis/#meadowrun.run_map" rel="noopener noreferrer"&gt;meadowrun.run_map&lt;/a&gt; to run these models in parallel on multiple GPU machines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s see some results!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The notebook asks for a text prompt and has DALL·E Mini generate 8 images:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AF7G4YGt7UEMtplp-cLMTCQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AF7G4YGt7UEMtplp-cLMTCQ.png" alt="Images generated by DALL·E Mini for batman praying in the garden of gethsemane"&gt;&lt;/a&gt;&lt;br&gt;DALL·E Mini: batman praying in the garden of gethsemane
  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  We select one of the images and GLID-3-xl produces 8 new images based on our chosen image.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AIBUU0fBtEzLTlwbb3_D7Nw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AIBUU0fBtEzLTlwbb3_D7Nw.png" alt="Images generated by GLID-3-xl based on image 6 above"&gt;&lt;/a&gt;&lt;br&gt;Images generated by GLID-3-xl based on image 6 above
  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Finally, we select one of these images and have SwinIR upscale it from 256x256 to 1024x1024 pixels:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A28ZN2qx0QhjNYBCqsR0bGg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A28ZN2qx0QhjNYBCqsR0bGg.png" alt="Image 3 from above upscaled by SwinIR"&gt;&lt;/a&gt;&lt;br&gt;Image 3 from above upscaled by SwinIR
  &lt;/p&gt;

&lt;p&gt;Not terrible, although we did provide some human help at each stage!&lt;/p&gt;

&lt;p&gt;Here’s what OpenAI’s DALL·E generates from the same prompt:&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1550117634863030276-670" src="https://platform.twitter.com/embed/Tweet.html?id=1550117634863030276"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;p&gt;And here’s one more comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2Aj1G0Wo0wi0FWmW2vOipFLg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2Aj1G0Wo0wi0FWmW2vOipFLg.png" alt="DALL·E Mini: “olive oil and vinegar drizzled on a plate in the shape of the solar system”, upscaled by SwinIR"&gt;&lt;/a&gt;&lt;br&gt;DALL·E Mini: “olive oil and vinegar drizzled on a plate in the shape of the solar system”, upscaled by SwinIR
  &lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1519295975839260675-245" src="https://platform.twitter.com/embed/Tweet.html?id=1519295975839260675"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;p&gt;All of this underscores how impressive OpenAI’s DALL·E is. That said, DALL·E Mini is very fun to play with, is truly open, and will only get better as it &lt;a href="https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mega-Training-Journal--VmlldzoxODMxMDI2" rel="noopener noreferrer"&gt;continues to train&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing remarks
&lt;/h2&gt;

&lt;p&gt;This post demonstrates how to use Meadowrun for running GPU computations like ML inference in EC2. Meadowrun takes care of details like finding the cheapest available GPU instance types, as well as making sure &lt;a href="https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html" rel="noopener noreferrer"&gt;CUDA&lt;/a&gt; and Nvidia’s &lt;a href="https://developer.nvidia.com/blog/gpu-containers-runtime/" rel="noopener noreferrer"&gt;Container Runtime&lt;/a&gt; (previously known as &lt;a href="https://developer.nvidia.com/blog/nvidia-docker-gpu-server-application-deployment-made-easy/" rel="noopener noreferrer"&gt;Nvidia Docker&lt;/a&gt;) are installed in the right places.&lt;/p&gt;

&lt;p&gt;It’s pretty cool that we can point Meadowrun to a repo like dalle-playground, tell it what resources it needs, and get it running with almost no fuss. One of the most annoying things in software is getting other people’s code to work, and it’s great to see that the Python and ML ecosystems have made a ton of progress in this regard. Thanks to better package management tools, MLOps tools like Hugging Face and Weights and Biases, as well as Meadowrun (if we do say so ourselves), it’s easier than ever to build on the work of others.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;To stay updated on Meadowrun, star us on &lt;a href="https://github.com/meadowdata/meadowrun" rel="noopener noreferrer"&gt;Github&lt;/a&gt; or follow us on &lt;a href="https://twitter.com/kurt2001" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Why Starting a Fresh EC2 Instance and Running Python with Meadowrun Took Over a Minute</title>
      <dc:creator>Hyunho Richard Lee</dc:creator>
      <pubDate>Thu, 14 Jul 2022 13:55:36 +0000</pubDate>
      <link>https://dev.to/meadowrun/why-starting-python-on-a-fresh-ec2-instance-takes-over-a-minute-621</link>
      <guid>https://dev.to/meadowrun/why-starting-python-on-a-fresh-ec2-instance-takes-over-a-minute-621</guid>
      <description>&lt;h4&gt;
  
  
  And what we did about it
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://meadowrun.io"&gt;Meadowrun&lt;/a&gt; is a no-ceremony tool to run your Python code in the cloud that automates the boring stuff for you.&lt;/p&gt;

&lt;p&gt;Meadowrun checks if suitable EC2 machines are already running, and starts some if not; it packages up code and environments; logs into the EC2 machine via SSH and runs your code; and finally, gets the results back, as well as logs and output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I2jK81L5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AHkFmFWj7Y3VMAyJn" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I2jK81L5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AHkFmFWj7Y3VMAyJn" alt="Runner on a starting block" width="880" height="586"&gt;&lt;/a&gt;&lt;br&gt;&lt;em&gt;Photo by Braden Collum on Unsplash&lt;/em&gt;
  &lt;/p&gt;

&lt;p&gt;This can take anywhere from a few seconds in the warmest of warm starts to several minutes in the coldest start. In a warm start, a machine is already running, a container is already built, and local code has already been uploaded to S3. In a cold start, all of these things need to happen before your Python code actually runs.&lt;/p&gt;

&lt;p&gt;Nobody likes waiting, so we were keen to improve the wait time. This is the story of what we discovered along the way.&lt;/p&gt;

&lt;h4&gt;
  
  
  You don’t know what you don’t measure
&lt;/h4&gt;

&lt;p&gt;At the risk of sounding like a billboard for telemetry services, the first rule when you’re trying to optimize something is: measure where you’re spending time.&lt;/p&gt;

&lt;p&gt;It sounds absolutely trivial, but if I had a dollar for every time I didn’t follow my own advice, I’d have…well, at least 10 dollars.&lt;/p&gt;

&lt;p&gt;If you’re optimizing a process that just runs locally, and has just one thread, this is all pretty easy to do, and there are various tools for it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;log timings&lt;/li&gt;
&lt;li&gt;attach a debugger and break once in a while&lt;/li&gt;
&lt;li&gt;profiling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most people jump straight to profiling, but personally I like the hands-on approaches, especially when whatever you’re optimizing has a long total runtime.&lt;/p&gt;
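&lt;p&gt;The first option is as simple as it sounds. A minimal version of the “log timings” approach, using only the standard library, might look like this:&lt;/p&gt;

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print how long the wrapped block took; crude, but plenty for
    # steps that take seconds or minutes.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")

with timed("pretend_launch_instances"):
    time.sleep(0.1)  # stand-in for a slow operation
```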

&lt;p&gt;In any case, measuring what Meadowrun does is a bit more tricky. First, it starts processes on remote machines. Ideally, we’d have some kind of overview where activities running on one machine that start activities on another machine are linked appropriately.&lt;/p&gt;

&lt;p&gt;Second, much of meadowrun’s activity consists of calls to AWS APIs, connection requests or other types of I/O.&lt;/p&gt;

&lt;p&gt;For both of these reasons, Meadowrun is not particularly well-suited to profiling. So we’re left with print statements—but surely something better must exist already?&lt;/p&gt;

&lt;p&gt;Indeed it does. After some research online, I came across &lt;a href="https://eliot.readthedocs.io/en/stable/"&gt;Eliot&lt;/a&gt;, a library that allows you to log what it calls actions. Crucially, actions have a beginning and end, and Eliot keeps track of which actions are part of other, higher-level actions. Additionally, it can do this across async, thread and process boundaries.&lt;/p&gt;

&lt;p&gt;The way it works is pretty simple: you annotate functions or methods with a decorator, which turns them into Eliot-tracked actions. For finer-grained tracking, you can also wrap any code block in a context manager.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;eliot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_call&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;choose_instances&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;eliot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"subaction of choose"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make this work across processes, you also need to pass an Eliot identifier to the other process, and put it in context there—just two extra lines of code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# parent process
&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eliot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_action&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;serialize_task_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# child process receives task_id somehow, 
# e.g. via a command line argument
&lt;/span&gt;&lt;span class="n"&gt;eliot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;continue_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the actions are added, Eliot logs information to a JSON file, which you can visualize using a handy command line program called eliot-tree. Here’s the result of eliot-tree showing what Meadowrun does when calling &lt;code&gt;run_function&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kyndIArn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AWI_gVQGTOzfB14ZAqbXE-g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kyndIArn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AWI_gVQGTOzfB14ZAqbXE-g.png" alt="A console window showing functions, code blocks, their relation and how long each took." width="880" height="614"&gt;&lt;/a&gt;&lt;br&gt;&lt;em&gt;Meadowrun being busy&lt;/em&gt;
  &lt;/p&gt;

&lt;p&gt;This is a great breakdown of how long everything takes, and it pretty much directly led to all the investigations we’re about to discuss. For example, the whole run took 93 seconds, of which the job itself used only a handful of seconds—this was a cold start situation. In this instance, it took about 15 seconds for AWS to report that our new instance was running and had an IP address (&lt;code&gt;wait_until_running&lt;/code&gt;), but then we had to wait another 34 seconds before we could actually SSH into the machine (&lt;code&gt;wait_for_ssh_connection&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Based on Eliot’s measurements, we tried to make improvements in a number of areas (but didn’t always succeed!).&lt;/p&gt;

&lt;h4&gt;
  
  
  Cache all the things some of the time
&lt;/h4&gt;

&lt;p&gt;One of the quickest wins: Meadowrun was downloading EC2 instance prices before every run. In practice, spot and on-demand prices don’t change frequently:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZtdPEebV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/799/1%2AL_qrNADOPS41OOBxBx-bQQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZtdPEebV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/799/1%2AL_qrNADOPS41OOBxBx-bQQ.png" alt="A screenshot of some EC2 spot price history." width="799" height="564"&gt;&lt;/a&gt;&lt;br&gt;&lt;em&gt;The patient has flatlined&lt;/em&gt;
  &lt;/p&gt;

&lt;p&gt;So it’s more reasonable to &lt;a href="https://github.com/meadowdata/meadowrun/pull/75"&gt;cache the download&lt;/a&gt; locally for up to 4 hours. That saves us 5–10 seconds on every run.&lt;/p&gt;
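&lt;p&gt;The idea is just a file-based cache with a time-to-live. A simplified sketch (not Meadowrun’s actual implementation) might look like:&lt;/p&gt;

```python
import json
import time
from pathlib import Path

CACHE_TTL_SECONDS = 4 * 60 * 60  # prices rarely move, so 4 hours is plenty

def get_prices(cache_path, fetch_prices):
    """Return cached prices if fresh enough, otherwise download and cache them.

    fetch_prices is whatever function does the slow 5-10 second download.
    """
    path = Path(cache_path)
    if path.exists() and time.time() - path.stat().st_mtime < CACHE_TTL_SECONDS:
        return json.loads(path.read_text())
    prices = fetch_prices()
    path.write_text(json.dumps(prices))
    return prices
```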

&lt;h4&gt;
  
  
  All Linuxes are fast but some are faster than others
&lt;/h4&gt;

&lt;p&gt;We can’t influence how long it takes for AWS to start a virtual machine, but perhaps some Linux distributions are faster to start than others—in particular it’d be nice if they’d start sshd sooner rather than later. That potentially cuts down on waiting for the ssh connection.&lt;/p&gt;

&lt;p&gt;Meadowrun is currently using an Ubuntu-based image, which according to our measurements takes about 30–40 seconds from the time when AWS says the machine is running, to when we can actually ssh into it. (This was measured while running from London to Ohio/us-east-2—if you run Meadowrun closer to home you’ll see better times.)&lt;/p&gt;

&lt;p&gt;In any case, according to &lt;a href="https://www.daemonology.net/blog/2021-08-12-EC2-boot-time-benchmarking.html"&gt;EC2 boot time benchmarking&lt;/a&gt;, some of which we independently verified, &lt;a href="https://clearlinux.org/"&gt;Clear Linux&lt;/a&gt; is the clear winner here. This is borne out in practice: it reduced the “wait until ssh” time from 10 seconds to a couple of seconds when connecting to a nearby region, and from 30 seconds to 5 seconds when connecting across the Atlantic.&lt;/p&gt;

&lt;p&gt;Despite these results, so far we’ve not been able to switch Meadowrun to use Clear Linux—first it took a long time before we figured out &lt;a href="https://github.com/clearlinux/distribution/issues/2667"&gt;how to configure Clear so that it picks up the AWS user-data&lt;/a&gt;. At the moment &lt;a href="https://community.clearlinux.org/t/nvidia-drivers-on-aws-clear-linux-key-was-rejected-by-service/7709"&gt;we’re struggling with installing Nvidia drivers&lt;/a&gt; on it, which is important for machine learning workloads.&lt;/p&gt;

&lt;p&gt;We’ll see what happens with this one, but if Nvidia drivers are not a blocker and startup times are important to you, do consider using Clear Linux as a base AMI for your EC2 machines.&lt;/p&gt;

&lt;h4&gt;
  
  
  EBS volumes are sluggish at start-up
&lt;/h4&gt;

&lt;p&gt;The next issue we looked at is Python startup times. To run a function on the EC2 machine, Meadowrun starts a Python process which reads the function to run and its pickled arguments from disk, and runs it. However, just starting that Python process without actually running anything took about 7–8 seconds on first startup.&lt;/p&gt;

&lt;p&gt;This seems extreme even for Python (yes, we still cracked the inevitable “we should rewrite it in Rust” jokes).&lt;/p&gt;

&lt;p&gt;A run with &lt;code&gt;-X importtime&lt;/code&gt; revealed the worst offenders, but unfortunately it’s mostly boto3, which we need to talk to the AWS API, for example to pull Docker images.&lt;/p&gt;
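&lt;p&gt;If you want to try this yourself, &lt;code&gt;-X importtime&lt;/code&gt; writes a per-module breakdown to stderr; sorting by the cumulative column quickly surfaces the worst offenders. The module imported below is just a placeholder:&lt;/p&gt;

```shell
# Capture per-module import times (the report goes to stderr).
# `json` is a placeholder; substitute the module you suspect, e.g. boto3.
python3 -X importtime -c "import json" 2> import_times.txt
# The second |-separated column is cumulative microseconds;
# sort on it so the slowest import subtrees appear last.
sort -t'|' -k2 -n import_times.txt | tail -n 5
```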

&lt;p&gt;Also, subsequent process startups on the same machine take a more reasonable 1–2 seconds. So what’s going on? Our first idea was file system caching, which was a close but ultimately wrong guess.&lt;/p&gt;

&lt;p&gt;It turned out that AWS’s EBS (Elastic Block Store), which provides the actual storage for the EC2 machines we were measuring, &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html"&gt;has a warmup time&lt;/a&gt;. AWS recommends warming up an EBS volume by reading the entire device with &lt;code&gt;dd&lt;/code&gt;. That works: after the warmup, Python’s startup time drops to 1–2 seconds. Unfortunately, reading even a 16GB volume takes over 100 seconds, and reading just the files in the Meadowrun environment still takes about 20 seconds. So this solution won’t work for us—the warmup takes longer than the cold start itself.&lt;/p&gt;
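&lt;p&gt;For completeness, the recommended warmup is a single sequential read of the whole device, along these lines (the device name below is an assumption; check yours with &lt;code&gt;lsblk&lt;/code&gt;):&lt;/p&gt;

```shell
# Read every block once so that lazily-loaded EBS data is pulled
# down from S3 up front.
warm_volume() {
  dd if="$1" of=/dev/null bs=1M
}
# On the instance, typically something like:
#   sudo warm_volume /dev/nvme0n1
```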

&lt;p&gt;You can pay for “hot” EBS snapshots, which give warmed-up performance from the get-go, but it’s crazy expensive because &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-fast-snapshot-restore.html"&gt;you’re billed per hour, per availability zone, per snapshot as long as fast snapshot restore is enabled&lt;/a&gt;. That comes to $540/month for a single snapshot in a single availability zone, and a single region like us-east-1 has 6 availability zones!&lt;/p&gt;

&lt;p&gt;At the end of the day, it looks like we’ll just have to suck this one up. If you’re an enterprise that’s richer than god, consider enabling fast snapshot restore. For some use cases, reading the entire disk up front is a more appropriate and certainly much cheaper alternative.&lt;/p&gt;

&lt;h4&gt;
  
  
  If your code is async, use async libraries
&lt;/h4&gt;

&lt;p&gt;From the start, meadowrun was using &lt;a href="https://www.fabfile.org/"&gt;fabric&lt;/a&gt; to execute SSH commands on EC2 machines. Fabric isn’t natively async though, while meadowrun is—meadowrun manages almost exclusively I/O bound operations, such as calling AWS APIs and SSH’ing into remote machines, so async makes sense for us.&lt;/p&gt;

&lt;p&gt;The friction with Fabric caused some headaches, like having to start a thread to make Fabric appear async with respect to the rest of the code. This was not only ugly, but also slow: when executing a parallel distributed computation using Meadowrun’s &lt;code&gt;run_map&lt;/code&gt;, meadowrun was spending a significant amount of time waiting for Fabric.&lt;/p&gt;

&lt;p&gt;We have now &lt;a href="https://github.com/meadowdata/meadowrun/pull/101"&gt;fully switched&lt;/a&gt; to a natively async SSH client, &lt;a href="https://asyncssh.readthedocs.io/en/latest/index.html"&gt;asyncssh&lt;/a&gt;. Performance for &lt;code&gt;run_map&lt;/code&gt; has improved considerably: with fabric, for a map with 20 concurrent tasks, median execution time over 10 runs was about 110 seconds. With asyncssh, this dropped to 15 seconds—a 7x speedup.&lt;/p&gt;
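&lt;p&gt;The speedup comes from issuing all the SSH commands concurrently on one event loop instead of funneling them through threads. The pattern, with the SSH call replaced by a stand-in sleep, looks like this:&lt;/p&gt;

```python
import asyncio
import time

async def run_remote_task(i):
    # Stand-in for one SSH command; with asyncssh this would be an
    # `await conn.run(...)` on an open connection.
    await asyncio.sleep(0.2)
    return i

async def run_map_sketch(n=20):
    # All n tasks wait on the event loop at the same time, so the whole
    # map takes roughly as long as the slowest task, not the sum.
    return await asyncio.gather(*(run_remote_task(i) for i in range(n)))

start = time.perf_counter()
results = asyncio.run(run_map_sketch())
elapsed = time.perf_counter() - start
```

&lt;p&gt;Twenty sequential 0.2-second tasks would take about 4 seconds; run concurrently, they finish in roughly 0.2 seconds.&lt;/p&gt;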

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;Not everything you find while optimizing is actionable, and that’s ok—at the very least you’ll have learnt something!&lt;/p&gt;

&lt;p&gt;Takeaways which you may be able to apply in your own adventures with AWS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If startup time is important, try using Clear Linux.&lt;/li&gt;
&lt;li&gt;EBS volumes need some warmup to reach full I/O speed—you can avoid this if you have lots of money to spend.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;To stay updated on Meadowrun, star us on &lt;a href="https://github.com/meadowdata/meadowrun"&gt;Github&lt;/a&gt; or follow us on &lt;a href="https://twitter.com/kurt2001"&gt;Twitter&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ec2</category>
      <category>performance</category>
      <category>python</category>
      <category>aws</category>
    </item>
    <item>
      <title>Use AWS to unzip all of Wikipedia in 10 minutes</title>
      <dc:creator>Hyunho Richard Lee</dc:creator>
      <pubDate>Wed, 29 Jun 2022 19:20:00 +0000</pubDate>
      <link>https://dev.to/meadowrun/use-aws-to-unzip-all-of-wikipedia-in-10-minutes-32h5</link>
      <guid>https://dev.to/meadowrun/use-aws-to-unzip-all-of-wikipedia-in-10-minutes-32h5</guid>
      <description>&lt;p&gt;This is the first article in a series that walks through how to use &lt;a href="https://meadowrun.io/"&gt;Meadowrun&lt;/a&gt; to quickly run regular expressions over a large text dataset. This first article reviews parsing the Wikipedia dump file format, and then walks through using Meadowrun and EC2 to unzip ~67GB (uncompressed) of articles from the English language Wikipedia dump file. The &lt;a href="https://dev.to/meadowrun/using-aws-and-hyperscan-to-match-regular-expressions-on-100gb-of-text-50fe"&gt;second article&lt;/a&gt; will cover running regular expressions over the extracted data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background and Motivation
&lt;/h2&gt;

&lt;p&gt;The goal of this first post is mostly to walk through creating a large text dataset (the contents of English-language Wikipedia) so that we have a real dataset to work with for the second article in this series. This article also introduces Meadowrun as a tool that makes it easy to scale your python code into the cloud.&lt;/p&gt;

&lt;p&gt;If you want to understand some of the details of the Wikipedia dataset, start with this article. If you’re interested in generally applicable examples of searching large text datasets very quickly, start with the &lt;a href="https://dev.to/meadowrun/using-aws-and-hyperscan-to-match-regular-expressions-on-100gb-of-text-50fe"&gt;second article&lt;/a&gt;, and come back if you want to be able to follow along using the same dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unzipping Wikipedia
&lt;/h2&gt;

&lt;p&gt;Wikipedia (as explained &lt;a href="https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia"&gt;here&lt;/a&gt;) provides a &lt;a href="https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2"&gt;“multistream” XML dump&lt;/a&gt; (caution! that’s a link to a ~19GB file). This is a single file that is effectively multiple bz2 files concatenated together. It’s meant to be read using the &lt;a href="https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream-index.txt.bz2"&gt;“index” file&lt;/a&gt;, which is a single bz2 text file whose contents look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;602:594:Apollo
602:595:Andre Agassi
683215:596:Artificial languages
683215:597:Austroasiatic languages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first line says that there’s an article called “Apollo” with article ID 594, which is in the section of the file starting 602 bytes into the multistream dump. The next article, “Andre Agassi”, has article ID 595 and is in that same section starting at byte 602 (each section contains 100 articles). The article after that, “Artificial languages”, has article ID 596 and is in the next section, which starts at byte 683215.&lt;/p&gt;
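&lt;p&gt;A minimal sketch of parsing one of these index lines (the helper name here is ours): split on the first two colons only, since article titles can themselves contain colons.&lt;/p&gt;

```python
def parse_index_line(line):
    # each line is offset:article_id:title; split at most twice because
    # titles like "Wikipedia:Database download" contain colons themselves
    offset, article_id, title = line.split(":", 2)
    return int(offset), int(article_id), title


parse_index_line("602:594:Apollo")  # (602, 594, "Apollo")
```

&lt;p&gt;Grouping consecutive lines by offset then recovers each 100-article bz2 stream.&lt;/p&gt;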

&lt;p&gt;So we’ll write a function &lt;a href="https://gist.github.com/hrichardlee/4be2881a66faaee24f122eeaccf0b2c0#file-unzip_wikipedia_articles-py"&gt;iterate_articles_chunk&lt;/a&gt; that takes a &lt;code&gt;multistream_file&lt;/code&gt; and an &lt;code&gt;index_file&lt;/code&gt;, skips the first &lt;code&gt;article_offset&lt;/code&gt; articles, and then reads &lt;code&gt;num_articles&lt;/code&gt; articles. A few notes on this code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  We’re using &lt;a href="https://github.com/RaRe-Technologies/smart_open"&gt;smart_open&lt;/a&gt;, which is an amazing library that lets you open objects in S3 (and other cloud object stores) as if they’re files on your filesystem. It’s obviously critical that we’re able to seek to an arbitrary position in an S3 file without first downloading the whole thing. We’ll assume you’re using &lt;a href="https://python-poetry.org/"&gt;Poetry&lt;/a&gt;, but you should be able to follow along with any other package manager:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;poetry add smart_open[s3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  We’re ignoring a ton of metadata about each Wikipedia article, but that doesn’t matter for our purposes.&lt;/li&gt;
&lt;li&gt;  A bit of an aside, but the above code is also a good example of how to use &lt;a href="https://docs.python.org/3/library/xml.etree.elementtree.html"&gt;xml.etree.ElementTree.XMLPullParser&lt;/a&gt; to parse an XML file as a stream, which makes sense for large files, as it means you don’t need to hold the entire file in memory. In contrast, &lt;a href="https://docs.python.org/3/library/xml.dom.minidom.html"&gt;xml.dom.minidom&lt;/a&gt; requires enough memory for your entire file, but it does allow processing elements in any order.&lt;/li&gt;
&lt;/ul&gt;
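&lt;p&gt;The linked gist has the full details; as a minimal sketch of the streaming approach (names here are illustrative), &lt;code&gt;XMLPullParser&lt;/code&gt; is fed decompressed chunks and emits each element as soon as its closing tag has been seen, keeping memory bounded:&lt;/p&gt;

```python
from xml.etree.ElementTree import XMLPullParser


def stream_elements(chunks, tag):
    # feed bytes incrementally; yield each completed element of the given
    # tag without ever holding the whole document in memory
    parser = XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for _event, element in parser.read_events():
            if element.tag == tag:
                yield element
```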

&lt;p&gt;Let’s try it out!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;unzip_wikipedia_articles&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;iterate_articles_chunk&lt;/span&gt;

&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;t0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;bytes_read&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;iterate_articles_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"enwiki-latest-pages-articles-multistream-index.txt.bz2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"enwiki-latest-pages-articles-multistream.xml.bz2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;bytes_read&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Read ~&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bytes_read&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; bytes from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; articles in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Read ~34,795,421 bytes from 1000 articles in 2.17s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Man that’s slow! &lt;a href="https://gist.github.com/hrichardlee/4be2881a66faaee24f122eeaccf0b2c0#file-count_articles-py"&gt;Counting the lines in the index file&lt;/a&gt; tells us there are 22,114,834 articles (this is as of the 2022–06–20 dump). So at 2.17s per 1000 articles times 22 million articles, I’m looking at around 13 hours to unzip this entire file on my i7-8550U processor. Presumably most of this time is decompressing bz2, so as a sanity check, let’s see what others are getting for bz2 decompression speeds. &lt;a href="https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison/"&gt;This article&lt;/a&gt; gives 24MB/s for decompression speed, and we’re in the same ballpark at 16MB/s (we’re not counting the bytes for XML tags we’re ignoring, so our true decompression speed is a bit faster than this).&lt;/p&gt;
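&lt;p&gt;For reference, the back-of-envelope arithmetic, using the numbers from the run above:&lt;/p&gt;

```python
bytes_read = 34_795_421
seconds_per_1000 = 2.17
total_articles = 22_114_834

# observed decompression throughput, in MB/s
throughput_mb_per_s = bytes_read / seconds_per_1000 / 1_000_000  # roughly 16 MB/s

# extrapolated single-machine runtime, in hours
estimated_hours = total_articles / 1000 * seconds_per_1000 / 3600  # roughly 13.3 hours
```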

&lt;h2&gt;
  
  
  Scaling with Meadowrun
&lt;/h2&gt;

&lt;p&gt;We could use multiprocessing to make use of all the cores on my laptop, but that would still only get us to 1–2 hours of runtime at best. In order to get through this in a more reasonable amount of time, we’ll need to use multiple machines in EC2. Meadowrun makes this easy!&lt;/p&gt;

&lt;p&gt;We’ll assume you’ve configured your AWS CLI, and we’ll continue using Poetry. (&lt;a href="https://docs.meadowrun.io/en/stable/"&gt;See the docs&lt;/a&gt; for more context, as well as for using Meadowrun with Azure, pip, or conda.) To get started, install the Meadowrun package and then run Meadowrun’s &lt;code&gt;install&lt;/code&gt; command to set up the resources Meadowrun needs in your AWS account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;poetry add meadowrun
poetry run meadowrun-manage-ec2 install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we’ll need to create an S3 bucket, upload the data files there, and then give the Meadowrun IAM role access to that bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;aws&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="n"&gt;mb&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;wikipedia&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;
&lt;span class="n"&gt;aws&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt; &lt;span class="n"&gt;enwiki&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;multistream&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bz2&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;wikipedia&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;
&lt;span class="n"&gt;aws&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt; &lt;span class="n"&gt;enwiki&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;multistream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bz2&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;wikipedia&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;

&lt;span class="n"&gt;poetry&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;manage&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt; &lt;span class="n"&gt;grant&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;permission&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="n"&gt;wikipedia&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we’re ready to run our unzipping on the cloud:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;meadowrun&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;convert_to_tar&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;convert_articles_chunk_to_tar_gz&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;unzip_all_articles&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;total_articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;22_114_834&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100_000&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;convert_articles_chunk_to_tar_gz&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_articles&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AllocCloudInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"EC2"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;logical_cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;memory_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_eviction_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deployment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mirror_local&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;num_concurrent_tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unzip_all_articles&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this snippet, we’re splitting all of the articles into chunks of 100,000, which gives us 222 tasks. We’re telling Meadowrun to start up enough EC2 instances to run 64 of these tasks in parallel, and that each task will need 1 CPU and 2 GB of RAM. And we’re okay with spot instances that have up to an 80% chance of eviction (aka interruption).&lt;/p&gt;
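&lt;p&gt;The task count works out as follows, using the same constants as the snippet:&lt;/p&gt;

```python
total_articles = 22_114_834
chunk_size = 100_000

# one starting offset per chunk of 100,000 articles
offsets = [i * chunk_size for i in range(total_articles // chunk_size + 1)]
# 222 offsets: 0, 100_000, ..., 22_100_000; the final chunk is short
```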

&lt;p&gt;Each task will run &lt;a href="https://gist.github.com/hrichardlee/4be2881a66faaee24f122eeaccf0b2c0#file-convert_to_tar-py"&gt;convert_articles_chunk_to_tar_gz&lt;/a&gt; which:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calls &lt;code&gt;iterate_articles_chunk&lt;/code&gt; to read its chunk of 100,000 articles&lt;/li&gt;
&lt;li&gt;Gets just the title and text of those articles&lt;/li&gt;
&lt;li&gt;Packs those into a .tar.gz file of plain text files where the name of each file in the archive is the title of the article (a .gz file is much faster to decompress than a bz2 file)&lt;/li&gt;
&lt;li&gt;And finally writes that new file back to S3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using Meadowrun and EC2, this takes about 10 minutes from start to finish, where this would have taken 13 hours on my laptop just for reading the articles, not even counting the time to recompress into a .tar.gz.&lt;/p&gt;

&lt;p&gt;The exact instance type that Meadowrun selects will vary based on spot instance availability and real-time pricing, but in an example run Meadowrun prints out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Launched 1 new instance(s) (total $0.9107/hr) for the remaining 64 workers:
    ec2-3-12-160-131.us-east-2.compute.amazonaws.com: r6i.16xlarge (64 CPU/512.0 GB), spot ($0.9107/hr, 2.5% chance of interruption), will run 64 job/worker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At $0.9107/hr, this whole process costs us less than a quarter!&lt;/p&gt;
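&lt;p&gt;Checking that claim, assuming the run takes the full 10 minutes at the printed rate:&lt;/p&gt;

```python
hourly_rate = 0.9107   # $/hr from Meadowrun's output above
runtime_hours = 10 / 60

cost = hourly_rate * runtime_hours  # roughly $0.15
```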

&lt;h2&gt;
  
  
  Closing remarks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  EC2 is amazing! (And so are Azure and GCP.) Spot pricing makes really powerful machines accessible for not very much money.&lt;/li&gt;
&lt;li&gt;  On the other hand, using EC2 for a task like this can require a decent amount of setup in terms of selecting an instance, remembering to turn it off when you’re done, and getting your code and libraries onto the machine. Meadowrun makes all of that easy!&lt;/li&gt;
&lt;li&gt;  The complete code for this series is &lt;a href="https://gist.github.com/hrichardlee/4be2881a66faaee24f122eeaccf0b2c0"&gt;here&lt;/a&gt; in case you want to use it as a template.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;To stay updated on Meadowrun, star us on &lt;a href="https://github.com/meadowdata/meadowrun"&gt;Github&lt;/a&gt; or follow us on &lt;a href="https://twitter.com/kurt2001"&gt;Twitter&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>aws</category>
    </item>
    <item>
      <title>Using AWS and Hyperscan to match regular expressions on 100GB of text</title>
      <dc:creator>Hyunho Richard Lee</dc:creator>
      <pubDate>Wed, 29 Jun 2022 19:10:00 +0000</pubDate>
      <link>https://dev.to/meadowrun/using-aws-and-hyperscan-to-match-regular-expressions-on-100gb-of-text-50fe</link>
      <guid>https://dev.to/meadowrun/using-aws-and-hyperscan-to-match-regular-expressions-on-100gb-of-text-50fe</guid>
      <description>&lt;p&gt;This is the second article in a series that walks through how to use &lt;a href="https://meadowrun.io/"&gt;Meadowrun&lt;/a&gt; to quickly run regular expressions over a large text dataset, using English-language Wikipedia as our example data set. If you want to follow along with this second article, you’ll need the simplified article extracts that we produce in the &lt;a href="https://dev.to/meadowrun/use-aws-to-unzip-all-of-wikipedia-in-10-minutes-32h5"&gt;first article&lt;/a&gt;. Alternatively, it should be pretty simple to translate the code in this post to work with any data set in any data format.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background and motivation
&lt;/h2&gt;

&lt;p&gt;This series is inspired by past projects at hedge funds for parsing credit card transaction data. To oversimplify quite a bit, we would look at the description fields of each transaction to see if it matched a tradeable public company, add up the amounts for all the transactions for each company, and use that to try to predict revenue for each company.&lt;/p&gt;

&lt;p&gt;There were a lot of challenges to making these revenue forecasts accurate. The problem we’ll focus on in this article is that for some reason (which &lt;a href="https://bam.kalzumeus.com/archive/"&gt;patio11&lt;/a&gt; could probably explain in depth), the description field for credit card transactions would come to us completely garbled. For a company like McDonald’s, we would see variations like &lt;code&gt;mcdonalds&lt;/code&gt;, &lt;code&gt;mcdonald's&lt;/code&gt;, &lt;code&gt;mcdonald s&lt;/code&gt;, &lt;code&gt;mcd&lt;/code&gt;, and even misspellings and typos like &lt;code&gt;mcdnalds&lt;/code&gt;. Our solution was to create regular expressions that covered all of the common variations of all of the brands of each company we were interested in.&lt;/p&gt;

&lt;p&gt;This dataset was about a terabyte uncompressed, and we had hundreds of regular expressions, which meant that we needed two main pieces of infrastructure: a really fast regular expression library and a distributed computation engine. For regular expressions we used Hyperscan which we’ll introduce here. Our distributed computation engine isn’t publicly available, but we’ll introduce Meadowrun which works in a similar way.&lt;/p&gt;

&lt;p&gt;The credit card dataset we used obviously isn’t publicly available either, so we’ll use English-language Wikipedia as a stand-in (~67 GB uncompressed); the goal of this article is to walk through the engineering aspects of this analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting up to speed
&lt;/h2&gt;

&lt;p&gt;If you didn’t follow along with the &lt;a href="https://dev.to/meadowrun/use-aws-to-unzip-all-of-wikipedia-in-10-minutes-32h5"&gt;first article&lt;/a&gt; in this series, you should be able to follow this article with your own dataset as long as you install &lt;a href="https://github.com/RaRe-Technologies/smart_open"&gt;smart_open&lt;/a&gt; and Meadowrun. smart_open is an amazing library that lets you open objects in S3 (and other cloud object stores) as if they’re files on your filesystem, and Meadowrun makes it easy to run your Python code on the cloud.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;poetry add smart_open[s3]
poetry add meadowrun
poetry run meadowrun-manage-ec2 install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll assume you’re using Poetry here, but feel free to use any environment manager. We’ll also assume you’ve configured your AWS CLI. (&lt;a href="https://docs.meadowrun.io/en/stable/"&gt;See the docs&lt;/a&gt; for more context, as well as for using Meadowrun with Azure, pip, and/or conda.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Some realistic data
&lt;/h2&gt;

&lt;p&gt;We’ll use simplified extracts of English-language Wikipedia as generated by the code in the first article in this series. As a quick overview, that produces 222 files like s3://wikipedia-meadowrun-demo/extracted-200000.tar.gz, which is a tar.gz file containing the 200,000th through 299,999th Wikipedia articles. The filenames are the titles of the articles and the contents of each file are the text of the corresponding article. We’ll need a little function that can read one of these tar files, &lt;a href="https://gist.github.com/hrichardlee/4be2881a66faaee24f122eeaccf0b2c0#file-read_articles_extract-py"&gt;iterate_extract&lt;/a&gt;.&lt;/p&gt;
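&lt;p&gt;&lt;code&gt;iterate_extract&lt;/code&gt; is linked above; a minimal sketch of streaming such an archive with the standard library (the function name here is ours) might look like:&lt;/p&gt;

```python
import tarfile


def iterate_tar_gz(fileobj):
    # mode "r|gz" streams the archive sequentially, so it also works on
    # non-seekable file objects such as an S3 stream from smart_open
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            extracted = tar.extractfile(member)
            if extracted is not None:
                # member.name is the article title, contents are the text
                yield member.name, extracted.read()
```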

&lt;p&gt;For our regular expressions, we’ll do a simplified version of what I describe above and just take the names of the companies in the S&amp;amp;P500, which we get, of course, from &lt;a href="https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"&gt;Wikipedia&lt;/a&gt;. (If you’re curious, this is the 1379418th article in the 2022–06–20 index, so you can also find it in s3://wikipedia-meadowrun-demo/extracted-1300000.tar.gz.) I used a little bit of elbow grease to get this into a file called &lt;a href="https://gist.github.com/hrichardlee/4be2881a66faaee24f122eeaccf0b2c0?file=companies.txt"&gt;companies.txt&lt;/a&gt;, and the first few lines look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3M
A. O. Smith
Abbott
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll upload this to S3 with &lt;code&gt;aws s3 cp companies.txt s3://wikipedia-meadowrun-demo/companies.txt&lt;/code&gt;, and use this bit of code to turn this into a simple regex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;smart_open&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;company_names_regex&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;smart_open&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"s3://wikipedia-meadowrun-demo/companies.txt"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;companies_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;companies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;companies_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"|"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;companies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will give us a regex like &lt;code&gt;3M|A. O. Smith|Abbott|...&lt;/code&gt; which lets us look for any occurrence of these company names.&lt;/p&gt;
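&lt;p&gt;One caveat worth flagging: names like “A. O. Smith” contain regex metacharacters (an unescaped &lt;code&gt;.&lt;/code&gt; matches any character), so depending on how strict you want to be, a sketch using &lt;code&gt;re.escape&lt;/code&gt; to match the names literally:&lt;/p&gt;

```python
import re

names = ["3M", "A. O. Smith", "Abbott"]

# re.escape turns '.' into a literal dot instead of the any-character wildcard
pattern = "|".join(re.escape(name) for name in names)
compiled = re.compile(pattern, re.IGNORECASE)
```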

&lt;h2&gt;
  
  
  Regex engines: re vs re2 vs Hyperscan
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. (&lt;a href="http://regex.info/blog/2006-09-15/247"&gt;Jamie Zawinski&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you haven’t spent too much time in the world of regular expressions, you’ll probably start with the &lt;a href="https://docs.python.org/3/library/re.html"&gt;re standard library&lt;/a&gt;—Python comes with batteries included, after all. Let’s see how this does. We’ll use Meadowrun’s &lt;code&gt;run_function&lt;/code&gt; to run some exploratory code directly on EC2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;itertools&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# import re2 as re
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;meadowrun&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;company_names&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;company_names_regex&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;read_articles_extract&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;iterate_extract&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_re&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extract_file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;compiled_re&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;company_names_regex&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;t0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;bytes_scanned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article_title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;article_content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;itertools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;islice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;iterate_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extract_file&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;compiled_re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;finditer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"utf-8"&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
            &lt;span class="c1"&gt;# print out a little context around a sample of matches
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;article_title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: "&lt;/span&gt;
                    &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;bytes_scanned&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time_taken&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Scanned &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bytes_scanned&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; bytes in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time_taken&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds "&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bytes_scanned&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;time_taken&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; B/s)"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_re_ec2&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scan_re&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"s3://wikipedia-meadowrun-demo/extracted-200000.tar.gz"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AllocCloudInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"EC2"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;logical_cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;memory_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_eviction_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deployment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mirror_local&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scan_re_ec2&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;scan_re&lt;/code&gt; extracts the first 100 articles from the specified .tar.gz file, runs our company name regex over them, and reports how many bytes per second it can process.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;scan_re_ec2&lt;/code&gt; uses Meadowrun to run &lt;code&gt;scan_re&lt;/code&gt; on an EC2 instance. Here we’re asking for an instance with at least 1 CPU and 2 GB of memory, and saying we’re okay with spot instances that have up to an 80% chance of eviction (aka interruption). We could run &lt;code&gt;scan_re&lt;/code&gt; locally, but this will actually be faster overall, because downloading data from S3 is significantly faster from EC2 than over the internet. In other words, we’re sending our code to our data rather than the other way around.&lt;/p&gt;

&lt;p&gt;Running this gives us:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scanned 1,781,196 bytes in 10.38 seconds (171,554 B/s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So for this regular expression we’re getting about 170 KB/s. We have about 67 GB of text, so at this rate, even distributed over many EC2 instances, the scan is going to be slow and expensive.&lt;/p&gt;

&lt;p&gt;Let’s try &lt;a href="https://github.com/google/re2"&gt;re2&lt;/a&gt;, which is a regular expression engine built by Google &lt;a href="https://github.com/google/re2/wiki/WhyRE2"&gt;primarily with the goal&lt;/a&gt; of taking linear time to search a string for any regular expression. For context, Python’s built-in re library uses a backtracking approach, which can take &lt;a href="https://www.regular-expressions.info/catastrophic.html"&gt;exponential time to search a string&lt;/a&gt;. re2 uses a &lt;a href="https://swtch.com/~rsc/regexp/regexp1.html"&gt;Thompson NFA approach&lt;/a&gt;, which guarantees linear-time search but supports fewer features.&lt;/p&gt;
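&lt;p&gt;To make the backtracking-vs-linear-time distinction concrete, here’s a small illustrative sketch (not part of the benchmark above) of a pattern that triggers catastrophic backtracking in Python’s built-in re module:&lt;/p&gt;

```python
import re
import time

# A classic pathological pattern: the nested alternation gives the
# backtracking engine exponentially many ways to split a run of "a"s,
# all of which it tries before concluding the match fails.
pattern = re.compile(r"(a|a)+$")

for n in (16, 18, 20):
    text = "a" * n + "b"  # the trailing "b" forces the overall match to fail
    t0 = time.perf_counter()
    assert pattern.match(text) is None
    print(f"n={n}: {time.perf_counter() - t0:.4f}s")  # time grows ~4x per step
```

&lt;p&gt;An automaton-based engine like re2 rejects the same string in time linear in its length.&lt;/p&gt;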

&lt;p&gt;The official Python package on PyPI is google-re2, but pyre2 nicely provides pre-compiled wheels for Windows and Mac as well (in addition to Linux):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;poetry add pyre2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pyre2 is designed as a drop-in replacement, so we can just replace &lt;code&gt;import re&lt;/code&gt; with &lt;code&gt;import re2 as re&lt;/code&gt; in our last snippet and rerun to see what re2’s performance is like. Note that Meadowrun automatically syncs our code and environment changes to the remote machine, so we don’t have to manually rebuild a virtual environment or a container image with the new re2 library in it.&lt;/p&gt;
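&lt;p&gt;The swap is a one-line change. As a hedge, the import can fall back to the standard library when pyre2 isn’t installed, so the same script runs anywhere (a sketch; &lt;code&gt;re2&lt;/code&gt; is the module name the pyre2 package installs):&lt;/p&gt;

```python
# Prefer the linear-time re2 engine when available; otherwise fall back
# to the stdlib backtracking engine, which exposes the same API surface.
try:
    import re2 as re
except ImportError:
    import re

# Tiny stand-in for the company names regex used in the article
compiled = re.compile(r"apple|microsoft|alphabet", re.IGNORECASE)
print(compiled.findall("Apple and Microsoft filings mention Alphabet."))
# -> ['Apple', 'Microsoft', 'Alphabet']
```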

&lt;p&gt;Rerunning with re2 and 10,000 articles, we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scanned 190,766,485 bytes in 6.94 seconds (27,479,020 B/s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A massive speedup to 27 MB/s!&lt;/p&gt;

&lt;p&gt;Our last stop is &lt;a href="https://github.com/intel/hyperscan"&gt;Hyperscan&lt;/a&gt;, which is a regular expression engine originally built for deep packet inspection by a startup called Sensory Networks, which Intel acquired in 2013. Hyperscan has a ton of really cool parts to it: there’s a &lt;a href="https://branchfree.org/2019/02/28/paper-hyperscan-a-fast-multi-pattern-regex-matcher-for-modern-cpus/"&gt;good overview&lt;/a&gt; by maintainer &lt;a href="https://twitter.com/geofflangdale"&gt;Geoff Langdale&lt;/a&gt;, and &lt;a href="https://www.usenix.org/system/files/nsdi19-wang-xiang.pdf"&gt;the paper&lt;/a&gt; goes into more depth. I’ll just highlight one of my favorites: its extensive use of &lt;a href="https://branchfree.org/2018/05/30/smh-the-swiss-army-chainsaw-of-shuffle-based-matching-sequences/"&gt;SIMD instructions for searching strings&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There’s a compiled version of Hyperscan with Python bindings on PyPI thanks to &lt;a href="https://python-hyperscan.readthedocs.io/en/latest/"&gt;python-hyperscan&lt;/a&gt;. python-hyperscan only has pre-built wheels for Linux, but that’s fine, as the EC2 instances Meadowrun creates for us run Linux. We can even tell Poetry to install Hyperscan only on Linux, since installing it on Windows or Mac will probably fail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;poetry add hyperscan --platform linux
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Pip requirements.txt files also support “&lt;a href="https://stackoverflow.com/questions/16011379/operating-system-specific-requirements-with-pip"&gt;environment markers&lt;/a&gt;” which allow you to accomplish the same thing.)&lt;/p&gt;
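&lt;p&gt;For example, the pip equivalent of the Poetry command above is a single environment-marker line in requirements.txt:&lt;/p&gt;

```
hyperscan; sys_platform == "linux"
```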

&lt;p&gt;Hyperscan’s API isn’t a drop-in replacement for re2, so we’ll need to adjust our code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;itertools&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;meadowrun&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;company_names&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;company_names_regex&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;read_articles_extract&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;iterate_extract&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_hyperscan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extract_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;take_first_n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;print_matches&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;hyperscan&lt;/span&gt;

    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;match_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;nonlocal&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;print_matches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;article_title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;article_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;
            &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;article_title&lt;/span&gt;
                &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;": "&lt;/span&gt;
                &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article_content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;from_index&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;to_index&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hyperscan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;patterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="c1"&gt;# expression, id, flags
&lt;/span&gt;        &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;company_names_regex&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"utf-8"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;hyperscan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HS_FLAG_CASELESS&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;hyperscan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HS_FLAG_SOM_LEFTMOST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;expressions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expressions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;expressions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;bytes_scanned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;t0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article_title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;article_content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;itertools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;islice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;iterate_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extract_file&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;take_first_n&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;article_content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;match_event_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;on_match&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article_title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;article_content&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;bytes_scanned&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;time_taken&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Scanned &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bytes_scanned&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; bytes in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time_taken&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds "&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bytes_scanned&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;time_taken&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; B/s)"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_hyperscan_ec2&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scan_hyperscan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"s3://wikipedia-meadowrun-demo/extracted-200000.tar.gz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AllocCloudInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"EC2"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;logical_cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;memory_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_eviction_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deployment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mirror_local&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scan_hyperscan_ec2&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;scan_hyperscan&lt;/code&gt; shows a basic usage of the python-hyperscan API, and &lt;code&gt;scan_hyperscan_ec2&lt;/code&gt; uses Meadowrun to run it on EC2, just as before. Running this gives us:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scanned 190,766,485 bytes in 2.74 seconds (69,679,969 B/s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s another solid improvement on top of re2, to about 70 MB/s.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;Now we can use Meadowrun’s &lt;code&gt;run_map&lt;/code&gt; to run this over all of Wikipedia:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;meadowrun&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scan_wikipedia_hyperscan&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scan_hyperscan&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_hyperscan_ec2_full&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="n"&gt;total_articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;22_114_834&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100_000&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scan_hyperscan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"s3://wikipedia-meadowrun-demo/extracted-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.tar.gz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;maxsize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_articles&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_articles&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AllocCloudInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"EC2"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;logical_cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;memory_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_eviction_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;meadowrun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deployment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mirror_local&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;num_concurrent_tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scan_hyperscan_ec2_full&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;run_map&lt;/code&gt; has similar semantics to Python’s built-in &lt;a href="https://docs.python.org/3/library/functions.html#map"&gt;map&lt;/a&gt; function. If you haven’t seen it before, &lt;code&gt;map(f, xs)&lt;/code&gt; is roughly equivalent to &lt;code&gt;[f(x) for x in xs]&lt;/code&gt;. &lt;code&gt;run_map&lt;/code&gt; does the same thing as &lt;code&gt;map&lt;/code&gt;, but in parallel on the cloud. So we’re requesting the same CPU/memory per task as before, and asking Meadowrun to start enough EC2 instances to run 64 tasks in parallel. Each task runs &lt;code&gt;scan_hyperscan&lt;/code&gt; on a different extract file; note that we go over the dataset twice to synthetically make it a bit larger.&lt;/p&gt;
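&lt;p&gt;The &lt;code&gt;map&lt;/code&gt; equivalence in plain Python, as a quick local illustration (no Meadowrun involved):&lt;/p&gt;

```python
def f(x):
    return x * x

xs = [1, 2, 3, 4]

# map(f, xs) produces the same results as the list comprehension...
assert list(map(f, xs)) == [f(x) for x in xs] == [1, 4, 9, 16]

# ...and run_map(f, xs, ...) computes the same list, except each f(x)
# runs as a task on cloud workers instead of in the local process.
print(list(map(f, xs)))  # [1, 4, 9, 16]
```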

&lt;p&gt;The exact instance type that Meadowrun selects will vary based on spot instance availability and real-time pricing, but in an example run Meadowrun prints out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Launched 1 new instance(s) (total $0.6221/hr) for the remaining 64 workers:
    ec2-18-117-89-205.us-east-2.compute.amazonaws.com: c5a.16xlarge (64 CPU/128.0 GB), spot ($0.6221/hr, 2.5% chance of interruption), will run 64 job/worker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The doubled Wikipedia dataset is about 135 GB, and searching it for any occurrence of a company name in the S&amp;amp;P 500 takes about 5 minutes and costs only about 5 cents!&lt;/p&gt;
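&lt;p&gt;Those figures hold up to a quick back-of-the-envelope check (numbers taken from the example run above):&lt;/p&gt;

```python
# Cost: one c5a.16xlarge spot instance at the quoted rate for ~5 minutes
price_per_hour = 0.6221  # $/hr from Meadowrun's output
minutes = 5
cost = price_per_hour * minutes / 60
print(f"cost: ${cost:.3f}")  # about $0.052, i.e. roughly 5 cents

# Throughput: 135 GB split across 64 workers in ~5 minutes
gb_scanned = 135
mb_per_worker_per_s = gb_scanned * 1024 / 64 / (minutes * 60)
print(f"per-worker: {mb_per_worker_per_s:.1f} MB/s")  # about 7 MB/s,
# well below the ~70 MB/s Hyperscan sustained earlier, suggesting the run
# spends much of its time downloading and decompressing rather than scanning
```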

&lt;h2&gt;
  
  
  Closing remarks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Hyperscan can be a bit annoying to use, as it has a different API from re and there aren’t any precompiled wheels for Windows or Mac (it’s possible to compile for those platforms, but it requires a bit of work). In these very unscientific benchmarks it’s “only” 2.5x faster than re2, but in my experience it’s worth it: as your regular expressions get larger and more complicated, the performance gap over re2 widens. And it really is quite a marvel of both computer science theory and engineering.&lt;/li&gt;
&lt;li&gt;  Meadowrun makes it easy to use really powerful machines in EC2 (or Azure) to process your data. Obviously for this exact workload it would be faster to use Google or Wikipedia’s own search functionality, but the approach we’re showing here can be used on any large text dataset with arbitrarily complicated regular expressions. For anything that isn’t already indexed by Google or another tech giant, I don’t think there’s a better combination of tools.&lt;/li&gt;
&lt;li&gt;  The complete code for this series is &lt;a href="https://gist.github.com/hrichardlee/4be2881a66faaee24f122eeaccf0b2c0"&gt;here&lt;/a&gt; in case you want to use it as a template.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;To stay updated on Meadowrun, star us on &lt;a href="https://github.com/meadowdata/meadowrun"&gt;GitHub&lt;/a&gt; or follow us on &lt;a href="https://twitter.com/kurt2001"&gt;Twitter&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
