<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Luis Sena</title>
    <description>The latest articles on DEV Community by Luis Sena (@lsena).</description>
    <link>https://dev.to/lsena</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F555397%2F97585d67-1b12-405e-8b3c-b9ccda9b5434.jpeg</url>
      <title>DEV Community: Luis Sena</title>
      <link>https://dev.to/lsena</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lsena"/>
    <language>en</language>
    <item>
      <title>Sharing big NumPy arrays across python processes</title>
      <dc:creator>Luis Sena</dc:creator>
      <pubDate>Mon, 31 Jan 2022 09:34:44 +0000</pubDate>
      <link>https://dev.to/lsena/sharing-big-numpy-arrays-across-python-processes-2ik8</link>
      <guid>https://dev.to/lsena/sharing-big-numpy-arrays-across-python-processes-2ik8</guid>
      <description>&lt;h4&gt;
  
  
  What is the best way to share huge NumPy arrays between Python processes?
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jqtqF0tk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AZfFUl8TpZcelKgXd9_Xc6Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jqtqF0tk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AZfFUl8TpZcelKgXd9_Xc6Q.png" alt="" width="880" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A situation I’ve come across multiple times is the need to keep one or multiple NumPy arrays in memory that serve as the “database” for specific computations (e.g. doing collaborative or content-based filtering recommendations).&lt;/p&gt;

&lt;p&gt;If you want a web server to use those arrays, you need multiprocessing in order to use more than one CPU core, as I’ve discussed in &lt;a href="https://dev.to/lsena/gunicorn-worker-types-you-re-probably-using-them-wrong-52a2-temp-slug-1492068"&gt;this previous article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Having to use multiple processes means we have some limitations when it comes to sharing those NumPy arrays, but fortunately, we have many options to choose from and that’s exactly what we’ll see in this article.&lt;/p&gt;

&lt;p&gt;We’ll see how to use NumPy with different multiprocessing options and benchmark each one of them, using a ~1.5 GB array with random values.&lt;/p&gt;

&lt;p&gt;For the examples, I’ll mostly use a ProcessPoolExecutor, but these methods are applicable to any multi-process environment (even Gunicorn).&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategies that we’ll explore and benchmark in this article
&lt;/h3&gt;

&lt;h4&gt;
  
  
  IPC with pickle
&lt;/h4&gt;

&lt;p&gt;This is the easiest (and most inefficient) way of sharing data between Python processes. The data you pass as a parameter is automatically pickled so it can be sent from one process to the other.&lt;/p&gt;

&lt;h4&gt;
  
  
  Copy-on-write pattern
&lt;/h4&gt;

&lt;p&gt;As I explained in a &lt;a href="https://luis-sena.medium.com/understanding-and-optimizing-python-multi-process-memory-management-24e1e5e79047"&gt;previous article&lt;/a&gt;, when you use fork() in UNIX compatible systems, each process will point to the same memory address and will be able to read from the same address space until they need to write to it.&lt;/p&gt;

&lt;p&gt;This makes it easy to emulate “thread-like” behaviour. The only issues are that you need to keep that data immutable after the fork() and that it only works for data created before the fork().&lt;/p&gt;

&lt;h4&gt;
  
  
  Shared array
&lt;/h4&gt;

&lt;p&gt;One of the oldest ways to share data in Python is by using sharedctypes. This module provides multiple data structures for this purpose.&lt;/p&gt;

&lt;p&gt;I’ll be using RawArray since I don’t care about locks for this use case. If you need a structure that supports locks out of the box, Array is a better option.&lt;/p&gt;

&lt;h4&gt;
  
  
  Memory-mapped file (mmap)
&lt;/h4&gt;

&lt;p&gt;Memory-mapped files are considered by many as the most efficient way to handle and share big data structures.&lt;/p&gt;

&lt;p&gt;NumPy supports it out of the box and we’ll make use of that. We’ll also explore the difference between mapping it to disk and memory (with tmpfs).&lt;/p&gt;

&lt;h4&gt;
  
  
  SharedMemory (Python 3.8+)
&lt;/h4&gt;

&lt;p&gt;SharedMemory is a module that makes it much easier to share data structures between Python processes. Like many other shared memory strategies, it relies on mmap under the hood.&lt;/p&gt;

&lt;p&gt;It makes it extremely easy to share NumPy arrays between processes as we’ll see in this article.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ray
&lt;/h4&gt;

&lt;p&gt;Ray is an open-source project that makes it simple to scale any compute-intensive Python workload.&lt;br&gt;&lt;br&gt;
It has been growing a lot in popularity, especially with the current need to process huge amounts of data and serve models on a large scale.&lt;/p&gt;

&lt;p&gt;In this article, we’ll be using just 0.001% of its awesome features.&lt;/p&gt;
&lt;h3&gt;
  
  
  Benchmarks
&lt;/h3&gt;

&lt;p&gt;All benchmarks use the same randomly generated NumPy array that is ~1.5GB.&lt;/p&gt;

&lt;p&gt;I’m running everything with Docker and 4 dedicated CPU cores.&lt;/p&gt;

&lt;p&gt;The computation is always the same, numpy.sum().&lt;/p&gt;

&lt;p&gt;The final runtime for each benchmark is the average runtime in milliseconds between &lt;strong&gt;30 runs&lt;/strong&gt; with all the outliers removed.&lt;/p&gt;
&lt;h4&gt;
  
  
  IPC with pickle
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In this approach, a slice of the array is pickled and sent to each process to be processed.&lt;/p&gt;
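The original code embed is gone; a minimal sketch of the pickle-based approach looks like this (names and array sizes are illustrative, not the exact benchmark code):

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def chunk_sum(chunk: np.ndarray) -> float:
    # `chunk` was pickled in the parent and unpickled here;
    # for big arrays, that copy dominates the runtime
    return float(np.sum(chunk))


def parallel_sum(data: np.ndarray, workers: int = 4) -> float:
    chunks = np.array_split(data, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum, chunks))


if __name__ == "__main__":
    data = np.random.rand(100_000)
    print(parallel_sum(data))
```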

&lt;p&gt;Total Runtime: &lt;strong&gt;4137.79ms&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Copy-on-write pattern
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As expected, we get really good performance with this approach.&lt;/p&gt;
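A minimal sketch of the copy-on-write approach, assuming a fork start method (Linux default) and illustrative names:

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

# Created BEFORE the pool forks, so every worker can read it
# through copy-on-write without any serialization.
SHARED = np.arange(100_000, dtype=np.float64)


def partial_sum(bounds) -> float:
    start, end = bounds
    # reading is free; writing to SHARED here would copy the touched pages
    return float(np.sum(SHARED[start:end]))


def parallel_sum(workers: int = 4) -> float:
    step = len(SHARED) // workers
    slices = [(i * step, len(SHARED) if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, slices))
```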

&lt;p&gt;The major downside to this approach is that you can’t change the data (well, you can, but that will create a copy inside the process that tried to change it).&lt;br&gt;&lt;br&gt;
The other major downside is that every new object created after the fork() will only exist inside the process that created it.&lt;/p&gt;

&lt;p&gt;If you’re using Gunicorn to scale your web application, for example, it’s very likely that you’ll need to update that shared data from time to time, making this approach more restrictive.&lt;/p&gt;

&lt;p&gt;Total Runtime:  &lt;strong&gt;80.30ms&lt;/strong&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Shared Array
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This approach will create an array in a shared memory block that allows you to freely read and write from any process.&lt;/p&gt;

&lt;p&gt;If you’re expecting concurrent writes, you might want to use Array instead of RawArray since it allows using locks out of the box.&lt;/p&gt;
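A sketch of the RawArray version, assuming a fork start method so the shared buffer can be inherited by the workers (names are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor
from multiprocessing.sharedctypes import RawArray

import numpy as np

_shared: np.ndarray  # set in each worker by init_worker


def init_worker(raw):
    global _shared
    # wrap the shared buffer as a NumPy array without copying it
    _shared = np.frombuffer(raw, dtype=np.float64)


def partial_sum(bounds) -> float:
    start, end = bounds
    return float(np.sum(_shared[start:end]))


def parallel_sum(data: np.ndarray, workers: int = 4) -> float:
    raw = RawArray("d", data.size)
    np.frombuffer(raw, dtype=np.float64)[:] = data  # one-time copy in
    step = data.size // workers
    slices = [(i * step, data.size if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers,
                             initializer=init_worker,
                             initargs=(raw,)) as pool:
        return sum(pool.map(partial_sum, slices))
```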

&lt;p&gt;Total Runtime:  &lt;strong&gt;102.24ms&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Memory-mapped file (mmap)
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Here, the location of your backing file will matter a lot.&lt;/p&gt;

&lt;p&gt;Ideally, always use a memory-mounted folder (backed by tmpfs). In Linux, that often means the /tmp folder.&lt;/p&gt;

&lt;p&gt;But when using Docker, you need to use /dev/shm, since the /tmp folder is not mounted in memory.&lt;/p&gt;
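A sketch of the memory-mapped approach using numpy.memmap (names are illustrative; pass folder="/dev/shm" to get the tmpfs-backed variant on Docker):

```python
import tempfile
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def chunk_sum(args) -> float:
    path, size, start, end = args
    # each worker maps the same file; pages are shared, not copied
    mm = np.memmap(path, dtype=np.float64, mode="r", shape=(size,))
    return float(np.sum(mm[start:end]))


def parallel_sum(data: np.ndarray, workers: int = 4, folder=None) -> float:
    # folder="/dev/shm" maps the file in memory (tmpfs) on Linux/Docker
    with tempfile.NamedTemporaryFile(dir=folder) as f:
        mm = np.memmap(f.name, dtype=np.float64, mode="w+", shape=data.shape)
        mm[:] = data
        mm.flush()
        step = data.size // workers
        tasks = [(f.name, data.size, i * step,
                  data.size if i == workers - 1 else (i + 1) * step)
                 for i in range(workers)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return sum(pool.map(chunk_sum, tasks))
```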

&lt;p&gt;Total Runtime with /tmp: &lt;strong&gt;159.62ms&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Total Runtime with /dev/shm: &lt;strong&gt;108.68ms&lt;/strong&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  SharedMemory (Python 3.8+)
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;SharedMemory was introduced in Python 3.8. It’s backed by mmap(2) and makes sharing NumPy arrays across processes really simple and efficient.&lt;/p&gt;

&lt;p&gt;It’s usually my recommendation if you don’t want to use any external libraries.&lt;/p&gt;
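A sketch of the SharedMemory version (names are illustrative; note that the NumPy view has to be dropped before the block is closed):

```python
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import shared_memory

import numpy as np


def chunk_sum(args) -> float:
    name, size, start, end = args
    shm = shared_memory.SharedMemory(name=name)  # attach, no copy
    arr = np.frombuffer(shm.buf, dtype=np.float64, count=size)
    total = float(np.sum(arr[start:end]))
    del arr      # release the buffer view before closing
    shm.close()
    return total


def parallel_sum(data: np.ndarray, workers: int = 4) -> float:
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    arr = np.frombuffer(shm.buf, dtype=np.float64, count=data.size)
    try:
        arr[:] = data  # one-time copy into the shared block
        step = data.size // workers
        tasks = [(shm.name, data.size, i * step,
                  data.size if i == workers - 1 else (i + 1) * step)
                 for i in range(workers)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return sum(pool.map(chunk_sum, tasks))
    finally:
        del arr
        shm.close()
        shm.unlink()
```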

&lt;p&gt;Total Runtime:  &lt;strong&gt;99.96ms&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Ray
&lt;/h4&gt;

&lt;p&gt;Ray is an awesome collection of tools/libraries that allow you to tackle many different large scale problems.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Modern workloads like deep learning and hyperparameter tuning are compute-intensive, and require distributed or parallel execution. Ray makes it effortless to parallelize single machine code — go from a single CPU to multi-core, multi-GPU or multi-node with minimal code changes.&lt;br&gt;&lt;br&gt;
 — &lt;a href="https://www.ray.io/"&gt;https://www.ray.io&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here we’ll just explore two different ways to share NumPy arrays using Ray. Soon I’ll showcase better and more detailed use cases for Ray.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;One thing to take note of is that I’m not counting ray.init() in the total runtime. That line of code can take around 3 seconds but you only need to call it once, so it shouldn’t be a problem in production scenarios.&lt;/p&gt;

&lt;p&gt;It does make these benchmarks a bit unfair since, for all the other scenarios, the process pool initialization is being counted in the total runtime.&lt;br&gt;&lt;br&gt;
Because of this, in the final results, I’m also excluding that pool initialization from the runtime.&lt;/p&gt;

&lt;p&gt;Using a naive approach, where Ray needs to serialize/deserialize the data just like the first scenario that uses pickle, we still see a big improvement in total runtime compared to pickle.&lt;/p&gt;

&lt;p&gt;Total Runtime:  &lt;strong&gt;252.08ms&lt;/strong&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;A better approach for this use case is to use Ray Object Store.&lt;/p&gt;

&lt;p&gt;We can even have it backed by Redis, but in this example, it will just use shared memory.&lt;/p&gt;
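Sticking with the same hypothetical partial-sum workload, the object-store version might be sketched like this (illustrative only; it requires a Ray installation and is not the exact benchmark code):

```python
import numpy as np
import ray

ray.init()  # takes seconds, but is only paid once

data = np.random.rand(100_000)
data_ref = ray.put(data)  # store the array once in the shared object store


@ray.remote
def partial_sum(arr, start, end):
    # passing the ObjectRef as an argument hands workers a
    # zero-copy, read-only view of the stored array
    return float(np.sum(arr[start:end]))


futures = [partial_sum.remote(data_ref, i * 25_000, (i + 1) * 25_000)
           for i in range(4)]
total = sum(ray.get(futures))
```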

&lt;p&gt;We can see a big improvement with this small change.&lt;/p&gt;

&lt;p&gt;Total Runtime:  &lt;strong&gt;70.65ms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One thing I really like about Ray is that it allows you to start “small” with very simple and efficient code and then scale your project as your needs get bigger (from a single machine to multi-node cluster).&lt;/p&gt;

&lt;h4&gt;
  
  
  Final Results
&lt;/h4&gt;

&lt;p&gt;For the final results, and to make a fair comparison with Ray, I’m excluding the time taken to initialize the processes inside the ProcessPoolExecutor, since I also excluded ray.init() from the Ray benchmark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t-qGvmUe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AelTbIrQXX6caNuZRmOdMgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t-qGvmUe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AelTbIrQXX6caNuZRmOdMgg.png" alt="" width="880" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Communicating through pickle is so slow that it dwarfs the other benchmarks; let’s remove it for clarity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K5M96Utv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AHOSmmumJzclZCVDBLr5sew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K5M96Utv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AHOSmmumJzclZCVDBLr5sew.png" alt="" width="880" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusions
&lt;/h4&gt;

&lt;p&gt;Sharing a global variable before forking (copy-on-write) seems to be the fastest, although also the most limited option.&lt;/p&gt;

&lt;p&gt;When using mmap, always make sure to map to a path that is in memory (tmpfs mount).&lt;/p&gt;

&lt;p&gt;SharedMemory has really good performance and a simple, easy-to-use API.&lt;/p&gt;

&lt;p&gt;Ray with its Object Store seems to be the winner if you need performance and flexibility. It’s also a good framework to grow your project into a bigger scope.&lt;/p&gt;

&lt;h3&gt;
  
  
  Want to learn more about python? Check these out!
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/lsena/gunicorn-vs-python-gil-43jl-temp-slug-9778611"&gt;Gunicorn vs Python GIL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/lsena/gunicorn-worker-types-you-re-probably-using-them-wrong-52a2-temp-slug-1492068"&gt;Gunicorn Worker Types: You’re Probably Using Them Wrong&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://luis-sena.medium.com/understanding-and-optimizing-python-multi-process-memory-management-24e1e5e79047"&gt;Understanding and optimizing python multi-process memory management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/lsena/creating-the-perfect-python-dockerfile-5fn2-temp-slug-3298889"&gt;Creating the Perfect Python Dockerfile&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!&lt;/p&gt;

&lt;p&gt;Stay tuned for the next post. Follow so you won’t miss it!&lt;/p&gt;

</description>
      <category>numpy</category>
      <category>datascience</category>
      <category>softwaredevelopment</category>
      <category>programming</category>
    </item>
    <item>
      <title>Achieving Sub-Millisecond Latencies With Redis by Using Better Serializers.</title>
      <dc:creator>Luis Sena</dc:creator>
      <pubDate>Thu, 19 Aug 2021 13:17:39 +0000</pubDate>
      <link>https://dev.to/lsena/achieving-sub-millisecond-latencies-with-redis-by-using-better-serializers-mjo</link>
      <guid>https://dev.to/lsena/achieving-sub-millisecond-latencies-with-redis-by-using-better-serializers-mjo</guid>
      <description>&lt;p&gt;How some simple changes can result in less latency and better memory usage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sbpq6Ew8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A9SEE-L33xj-aQyUC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sbpq6Ew8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A9SEE-L33xj-aQyUC.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Redis Strings are probably the most used (and abused) Redis data structure.&lt;/p&gt;

&lt;p&gt;One of their main advantages is that they are &lt;strong&gt;binary-safe&lt;/strong&gt; — this means you can save any type of binary data in Redis.&lt;/p&gt;

&lt;p&gt;But as it turns out, most Redis users are serializing objects to JSON strings and storing them inside Redis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s the problem you might ask?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON serialization/deserialization is incredibly inefficient and costly&lt;/li&gt;
&lt;li&gt;You end up using more space in storage (which is expensive in Redis since it’s an in-memory database)&lt;/li&gt;
&lt;li&gt;You increase your overall service latency without any real benefit&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Using JSON to store data in Redis will increase your latency and resource usage without bringing any real benefit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One other “simple” optimization you can use is compression.&lt;/p&gt;

&lt;p&gt;This one will depend on each use case since it will be a trade-off between size, latency, and CPU usage.&lt;/p&gt;

&lt;p&gt;Algorithms like ZSTD or LZ4 can be used with minimal CPU overhead, resulting in some good storage savings.&lt;/p&gt;

&lt;p&gt;The following charts show how much you gain just by switching from JSON to a binary format like MessagePack.&lt;/p&gt;

&lt;p&gt;These charts also include the &lt;strong&gt;serialization/deserialization times&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We can also see that we can save some storage/memory by using compression at the expense of some latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iu1gPe6w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AABDAxbPzjOnYKHQx83fJYw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iu1gPe6w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AABDAxbPzjOnYKHQx83fJYw.png" alt=""&gt;&lt;/a&gt;Using a random “JSON” object with different attributes&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--shNewFxL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AfoMuU2v0Sm_vgMedUmX2rA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--shNewFxL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AfoMuU2v0Sm_vgMedUmX2rA.png" alt=""&gt;&lt;/a&gt;Using a random “JSON” object with different attributes&lt;/p&gt;

&lt;p&gt;While the previous charts showed a fairly complex JSON object that LZ4 handles pretty well (compression-ratio-wise), the next charts show that ZSTD has the advantage when we need to compress arrays of floats.&lt;/p&gt;

&lt;p&gt;Here I ran the benchmarks with different-sized arrays.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7lbmbIrE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2Agc-Aq4kV8mrDO5uQxsJbXw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7lbmbIrE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2Agc-Aq4kV8mrDO5uQxsJbXw.png" alt=""&gt;&lt;/a&gt;Using a small array of floats&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SGGCRf01--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AQP1Q9ZOmNtwUD52r4lsh6Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SGGCRf01--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AQP1Q9ZOmNtwUD52r4lsh6Q.png" alt=""&gt;&lt;/a&gt;Using a small array of floats&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xd4kKufa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AdTlNG5DUcdqELmGnbasKXA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xd4kKufa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AdTlNG5DUcdqELmGnbasKXA.png" alt=""&gt;&lt;/a&gt;Using a big array of floats&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6nmD0h7r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AoJa_pBSNFxYIwZSKC3OBUA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6nmD0h7r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AoJa_pBSNFxYIwZSKC3OBUA.png" alt=""&gt;&lt;/a&gt;Using a big array of floats&lt;/p&gt;

&lt;p&gt;As you can see, just by switching from JSON to MessagePack, you can reduce your latency by more than 3x without any real disadvantage!&lt;/p&gt;

&lt;p&gt;Simple example using python to set/get a Redis String using JSON and MessagePack:&lt;/p&gt;
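The original embed is a sketch along these lines (the object and key names are illustrative; msgpack is a third-party package, and the Redis calls are shown in comments since they need a running server):

```python
import json

import msgpack  # third-party package: pip install msgpack

user = {"id": 123, "name": "Luis", "scores": [91, 85, 77]}

json_payload = json.dumps(user).encode()
msgpack_payload = msgpack.packb(user)

# the binary payload is smaller, and packb/unpackb is cheaper than dumps/loads
print(len(json_payload), len(msgpack_payload))

# round-trips are symmetric with json.loads
assert msgpack.unpackb(msgpack_payload) == user

# with a real redis-py client, usage is identical to the JSON version:
#   r.set("user:123", msgpack.packb(user))
#   user = msgpack.unpackb(r.get("user:123"))
```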


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As you can see, it’s as simple as using JSON.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://luis-sena.medium.com/using-redis-to-build-a-realtime-nike-sneakers-drop-app-backend-b0bd0fef7056"&gt;Using Redis to Build a Realtime “NIKE Sneakers Drop App” Backend&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/lsena/multi-region-the-final-frontier-how-redis-and-atomic-clocks-save-the-day-268j-temp-slug-6446881"&gt;Multi-Region: the final frontier — How Redis and atomic clocks save the day&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!&lt;/p&gt;

&lt;p&gt;Stay tuned for the next post. Follow so you won’t miss it!&lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>softwaredevelopment</category>
      <category>redis</category>
    </item>
    <item>
      <title>Benchmarking Different Methods For Full-Text Search Using Elasticsearch</title>
      <dc:creator>Luis Sena</dc:creator>
      <pubDate>Mon, 16 Aug 2021 13:55:02 +0000</pubDate>
      <link>https://dev.to/lsena/benchmarking-different-methods-for-full-text-search-using-elasticsearch-na</link>
      <guid>https://dev.to/lsena/benchmarking-different-methods-for-full-text-search-using-elasticsearch-na</guid>
      <description>&lt;p&gt;How to choose between different analyzers and queries to get the best search performance? Benchmarking of course!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A1PlKjPl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AIpHPJfJIe8xG2cox" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A1PlKjPl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AIpHPJfJIe8xG2cox" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@condorito1953?utm_source=medium&amp;amp;utm_medium=referral"&gt;Arie Wubben&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deploying a large-scale full-text search engine can be very hard. Elasticsearch makes the job much easier but it’s not one size fits all — quite the contrary.&lt;/p&gt;

&lt;p&gt;Elasticsearch has many configurations and features, but having many features also means many ways to achieve the same goal and it’s not always straightforward to know what’s the best way for the product you’re building.&lt;/p&gt;

&lt;p&gt;Let’s start with finding out the main ways we can find users by their username/name, measuring their performance, advantages, and drawbacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Experiment Stats&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Match Query
&lt;/h3&gt;

&lt;p&gt;This will match terms using a fuzziness param.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple to use&lt;/li&gt;
&lt;li&gt;Doesn’t use much space&lt;/li&gt;
&lt;li&gt;Allows fuzzy search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the size of the indexed word is bigger than the searched term+fuzziness_size it will not match&lt;/li&gt;
&lt;li&gt;Fuzzy search can slow things down&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Prefix query&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple to use&lt;/li&gt;
&lt;li&gt;Potentially very fast (especially if you use &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-prefixes.html"&gt;index_prefixes&lt;/a&gt; option)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It will only match if the indexed term starts with the searched term&lt;/li&gt;
&lt;li&gt;If you use the index_prefixes option, it will use more space&lt;/li&gt;
&lt;li&gt;No fuzzy search&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Wildcard query&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Works much the same way as “LIKE %term%” in a relational database SELECT.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy to implement and debug&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Usually, the slowest option, especially if the wildcard is placed at the start or very few characters are used&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Match query + ngram analyzer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;will match even if the search term is in the middle of a word&lt;/li&gt;
&lt;li&gt;good search performance&lt;/li&gt;
&lt;li&gt;allows having a “fuzzy” search since it will match segments of each word&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;specialized analyzer&lt;/li&gt;
&lt;li&gt;uses more disk space&lt;/li&gt;
&lt;li&gt;only matches if the search term is at least the size of the smallest “gram”&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Mappings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Standard&lt;/strong&gt;&lt;/p&gt;
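A minimal index body of this shape (field names are illustrative) would be:

```json
{
  "mappings": {
    "properties": {
      "username": { "type": "text" },
      "name":     { "type": "text" }
    }
  }
}
```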


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Ngram&lt;/strong&gt;&lt;/p&gt;
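An ngram mapping along these lines (analyzer name, field names, and gram sizes are illustrative; keep max_gram - min_gram within index.max_ngram_diff, which defaults to 1):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": { "type": "ngram", "min_gram": 3, "max_gram": 4 }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "username": {
        "type": "text",
        "fields": {
          "ngram": { "type": "text", "analyzer": "ngram_analyzer" }
        }
      }
    }
  }
}
```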


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Queries
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Match query&lt;/strong&gt;&lt;/p&gt;
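A match query with fuzziness has this shape (field name and search term are illustrative):

```json
{
  "query": {
    "match": {
      "username": { "query": "luis", "fuzziness": "AUTO" }
    }
  }
}
```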


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Prefix query&lt;/strong&gt;&lt;/p&gt;
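A prefix query has this shape (field name and search term are illustrative):

```json
{
  "query": {
    "prefix": {
      "username": { "value": "lui" }
    }
  }
}
```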


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Wildcard query&lt;/strong&gt;&lt;/p&gt;
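A wildcard query has this shape (field name and pattern are illustrative; a leading wildcard like this is exactly the slow case discussed below):

```json
{
  "query": {
    "wildcard": {
      "username": { "value": "*uis*" }
    }
  }
}
```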


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Match query + Ngram Analyzer&lt;/strong&gt;&lt;/p&gt;
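With the ngram mapping, a plain match query targets the ngram-analyzed subfield (field names are illustrative):

```json
{
  "query": {
    "match": {
      "username.ngram": "uis"
    }
  }
}
```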


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Query Benchmarks
&lt;/h3&gt;

&lt;p&gt;To do the benchmarks, I’ve created a small Python script that spawns 4 parallel processes, each running 1000 consecutive queries.&lt;/p&gt;

&lt;p&gt;It runs that for each kind of query.&lt;/p&gt;

&lt;p&gt;The main objective is not to know how long each query takes but to &lt;strong&gt;compare&lt;/strong&gt; their execution time under the same conditions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time in seconds is calculated by summing the time of 1000 runs and then averaging across the 4 parallel processes&lt;/li&gt;
&lt;/ul&gt;
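The timing skeleton of such a script can be sketched as follows (the actual Elasticsearch call is left as a comment, since it needs a running cluster; names are illustrative):

```python
import time
from concurrent.futures import ProcessPoolExecutor


def run_queries(n: int = 1000) -> float:
    """Run `n` consecutive queries and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        # the real script issues one search here, e.g.:
        #   es.search(index="users", body=query)
        pass
    return time.perf_counter() - start


def benchmark(workers: int = 4, n: int = 1000) -> float:
    # sum 1000 runs per process, then average across the parallel workers
    with ProcessPoolExecutor(max_workers=workers) as pool:
        totals = list(pool.map(run_queries, [n] * workers))
    return sum(totals) / len(totals)
```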


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Conclusions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid the wildcard query at all costs:&lt;/strong&gt; I see the wildcard query being recommended everywhere but as we saw, it is the slowest option and you can get better results with the other options.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you can live with matching only the beginning of a word:&lt;/strong&gt; The prefix query can do this job, and it can do it really fast. If your use case fits this, it’s a good choice. There is also the possibility of using the index_prefix option to speed things up even more at the cost of disk space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you want to save on disk space:&lt;/strong&gt; Using the standard analyzer with a match+fuzziness param should do the trick.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you need to match even when the search term is in the middle of a word, and it really needs to be fast:&lt;/strong&gt; ngram seems to be the choice in this case. It can be “dangerous” to use sometimes, though.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When using the ngram analyzer&lt;/strong&gt;, you should avoid a big distance between the min and max gram sizes, and avoid very small ngram sizes (like 1, used just to return results for single-letter searches).&lt;/p&gt;

&lt;p&gt;If you have a big range of gram sizes, it will become very expensive disk-wise and potentially degrade your performance.&lt;/p&gt;

&lt;p&gt;Instead, you could, for example, fall back to the fields that use the standard analyzer and perform a simple match or prefix query when your search term is shorter than min_ngram_size.&lt;/p&gt;

&lt;h3&gt;
  
  
  Into Elasticsearch? Check these out:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/lsena/the-complete-guide-to-increase-your-elasticsearch-write-throughput-nlo-temp-slug-2628411"&gt;The Complete Guide to Increase Your Elasticsearch Write Throughput&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/lsena/using-event-sourcing-to-increase-elasticsearch-performance-4aph-temp-slug-9860704"&gt;Using Event Sourcing to Increase Elasticsearch Performance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!&lt;/p&gt;

&lt;p&gt;Stay tuned for the next post. Follow so you won’t miss it!&lt;/p&gt;

</description>
      <category>development</category>
      <category>softwaredevelopment</category>
      <category>elasticsearch</category>
      <category>programming</category>
    </item>
    <item>
      <title>Understanding and optimizing python multi-process memory management</title>
      <dc:creator>Luis Sena</dc:creator>
      <pubDate>Sun, 07 Feb 2021 17:10:11 +0000</pubDate>
      <link>https://dev.to/lsena/understanding-and-optimizing-python-multi-process-memory-management-4ech</link>
      <guid>https://dev.to/lsena/understanding-and-optimizing-python-multi-process-memory-management-4ech</guid>
      <description>&lt;h3&gt;
  
  
  Understanding and Optimizing Python multi-process Memory Management
&lt;/h3&gt;

&lt;p&gt;This post will focus on lowering your memory usage while increasing your IPC speed at the same time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This blog post will focus on &lt;a href="https://en.wikipedia.org/wiki/POSIX"&gt;POSIX&lt;/a&gt;-oriented OSes like Linux or macOS&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To avoid the GIL bottleneck, you might have already used multi-processing with Python, be it using a pre-fork worker model (&lt;a href="https://luis-sena.medium.com/gunicorn-worker-types-youre-probably-using-them-wrong-381239e13594"&gt;more on that here&lt;/a&gt;), or just using the &lt;a href="https://docs.python.org/3/library/multiprocessing.html"&gt;multiprocessing&lt;/a&gt; package.&lt;/p&gt;

&lt;p&gt;What that does, under the hood, is use the OS &lt;em&gt;fork()&lt;/em&gt; function, which creates a child process with an exact virtual copy of the parent’s memory.&lt;br&gt;&lt;br&gt;
The OS is really clever about this: it doesn’t copy the memory right away. Instead, it exposes it to each process as its own isolated memory, keeping all the previous addresses intact.&lt;/p&gt;
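A tiny demonstration of this lazy copying, assuming a POSIX system (os.fork is unavailable on Windows):

```python
import os

value = [42]  # created before the fork, so both processes start with it

pid = os.fork()
if pid == 0:
    # child: this write triggers copy-on-write on the touched pages only
    value[0] = 99
    os._exit(0)

os.waitpid(pid, 0)
# the parent's copy is untouched: the address space was duplicated lazily
print(value[0])
```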

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c7sOFLwB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/655/1%2AB3jzse_G1dIwL5US0LlStg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c7sOFLwB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/655/1%2AB3jzse_G1dIwL5US0LlStg.jpeg" alt=""&gt;&lt;/a&gt;The new process generated from fork() keeps the same memory addresses&lt;/p&gt;

&lt;p&gt;This is possible thanks to the concept of &lt;a href="https://en.wikipedia.org/wiki/Virtual_memory"&gt;virtual memory&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let’s take a small detour just to refresh your memory on some of the underlying concepts, feel free to skip this section if it’s old news to you.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  So how can you have two processes with the exact same memory addresses holding different values?
&lt;/h3&gt;

&lt;p&gt;Your process does not interact directly with your computer RAM, in fact, the OS abstracts memory through a mechanism called Virtual Memory. This has many advantages like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can use more memory than the available RAM in your system (using disk)&lt;/li&gt;
&lt;li&gt;Memory address isolation and protection from other processes&lt;/li&gt;
&lt;li&gt;Contiguous address space&lt;/li&gt;
&lt;li&gt;No need to manage shared memory directly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NgR6uGQ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/631/1%2Ayt13_erSJUJJuavgm_Wsnw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NgR6uGQ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/631/1%2Ayt13_erSJUJJuavgm_Wsnw.jpeg" alt=""&gt;&lt;/a&gt;Virtual Memory vs Physical Memory&lt;/p&gt;

&lt;p&gt;In the above picture, you can see two independent processes that have their isolated memory space.&lt;br&gt;&lt;br&gt;
Each process has its contiguous address space and does not need to manage where each &lt;a href="https://en.wikipedia.org/wiki/Paging"&gt;page&lt;/a&gt; is located.&lt;/p&gt;

&lt;p&gt;You probably noticed some of the memory pages are located on disk. This can happen if your process never had to access that page since it was started (the OS will only load pages into RAM when a process needs them) or if those pages were evicted from RAM because the OS needed that space for other processes.&lt;/p&gt;

&lt;p&gt;When the process tries to access a page, the OS will serve it directly from RAM if it is already loaded or fetch it from disk, load it into RAM and then serve it to the process, with the only difference being the latency.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sorry for the detour, now let’s get back to our main topic!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After you fork(), you end up with two processes, a child and a parent that share most of their memory until one of them needs to write to any of the shared memory pages.&lt;/p&gt;

&lt;p&gt;This approach is called copy on write (&lt;a href="https://en.wikipedia.org/wiki/Copy-on-write"&gt;COW&lt;/a&gt;), and this avoids having the OS duplicating the entire process memory right from the beginning, thus saving memory and speeding up the process creation.&lt;/p&gt;

&lt;p&gt;COW works by marking those pages of memory as read-only and keeping a count of the number of references to the page. When data is written to these pages, the kernel intercepts the write attempt and allocates a new physical page, initialized with the copy-on-write data. The kernel then updates the page table with the new (writable) page, decrements the number of references, and performs the write.&lt;/p&gt;
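&lt;p&gt;To make those mechanics concrete, here’s a minimal, POSIX-only sketch (fork() is not available on Windows) of a child reading data allocated before the fork:&lt;/p&gt;

```python
import os

# Allocated before fork(): both processes will see this list at the
# same virtual addresses, backed by the same physical pages.
data = list(range(1_000_000))

pid = os.fork()
if pid == 0:
    # Child: in CPython, merely reading is enough to trigger COW on the
    # touched pages, because each object's refcount header gets updated.
    ok = (data[0] + data[-1] == 999_999)
    os._exit(0 if ok else 1)
else:
    _, status = os.waitpid(pid, 0)
    exit_code = os.waitstatus_to_exitcode(status)  # Python 3.9+
```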

&lt;blockquote&gt;
&lt;p&gt;What this means is that one easy way to avoid bloating your memory is to make sure you load everything you intend to share between processes into memory before you fork().&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you’re using gunicorn to serve your API, this means using the &lt;a href="https://docs.gunicorn.org/en/stable/settings.html#preload-app"&gt;preload&lt;/a&gt; parameter, for example.&lt;br&gt;&lt;br&gt;
Not only can you avoid duplicating memory, but it will also avoid costly &lt;a href="https://en.wikipedia.org/wiki/Inter-process_communication"&gt;IPC&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Loading shared read-only objects before the fork() works great for “well behaved” languages since those object pages will never get copied, unfortunately with python, that’s not the case.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of Python’s GC strategies is &lt;a href="https://en.wikipedia.org/wiki/Reference_counting"&gt;reference counting&lt;/a&gt;, and Python keeps the reference count in each object’s header.&lt;br&gt;&lt;br&gt;
What this means in practice is that each time you read said object, you also write to it.&lt;/p&gt;
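&lt;p&gt;You can watch this happen with sys.getrefcount:&lt;/p&gt;

```python
import sys

x = object()
before = sys.getrefcount(x)
y = x  # a plain read/bind of the object...
after = sys.getrefcount(x)
# ...bumped the counter stored in the object's header: a write,
# which is exactly what invalidates the COW page sharing.
```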

&lt;p&gt;Be it using gunicorn with the preload parameter or just loading your data and then forking using the multiprocessing package, you’ll notice that, after a while, your memory usage bloats to almost 1:1 with the number of processes. This is the work of the GC.&lt;/p&gt;

&lt;p&gt;I have some good and bad news… you’re &lt;a href="https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf"&gt;not alone&lt;/a&gt; in this and it will be a bit more trouble but there are some workarounds to the issue.&lt;/p&gt;

&lt;p&gt;Let’s establish a baseline and run some benchmarks first, and then explore our options. To run these benchmarks, I created a small Flask server with gunicorn to fork the process into 3 workers. You can check the script &lt;a href="https://gist.github.com/lsena/4c39c61ae5900d662f74a5a479b78411"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ldhNfppQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A79bJrzXM2u8Gqmf3IlwXuA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ldhNfppQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A79bJrzXM2u8Gqmf3IlwXuA.png" alt=""&gt;&lt;/a&gt;memory usage multiplies with number of workers&lt;/p&gt;

&lt;p&gt;In the above chart, gunicorn forks before running the server code, which means each worker will run this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;self.big_data = [item _for_ item _in_ range(10000000)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, memory usage grows linearly with each worker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aZfFc5bH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AJJdN10XUMUxp5hC7Nh0l0A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aZfFc5bH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AJJdN10XUMUxp5hC7Nh0l0A.png" alt=""&gt;&lt;/a&gt;memory usage doesn’t change with the number of workers&lt;/p&gt;

&lt;p&gt;In the above chart, since I’m using the preload option, gunicorn will load everything before forking. We can see COW in action here since the memory usage stays constant.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vf-cPucI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Ah4VFHlEqkh_Ll5umytVURQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vf-cPucI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Ah4VFHlEqkh_Ll5umytVURQ.png" alt=""&gt;&lt;/a&gt;Memory multiples as soon as a worker loops through their copy of the shared list&lt;/p&gt;

&lt;p&gt;Unfortunately, as we can see here, as soon as each worker needs to read the shared data, the GC will write into those pages to update the reference counts, triggering a copy-on-write.&lt;br&gt;&lt;br&gt;
In the end, we end up with the same memory usage as if we didn’t use the preload option!&lt;/p&gt;

&lt;p&gt;Ok, we have our baseline, how can we improve?&lt;/p&gt;
&lt;h3&gt;
  
  
  Using joblib
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZoEmUKgY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AjvJdDv9h4qkCJPu_r8Vzeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZoEmUKgY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AjvJdDv9h4qkCJPu_r8Vzeg.png" alt=""&gt;&lt;/a&gt;A very small difference in memory usage after access&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/joblib/joblib"&gt;Joblib&lt;/a&gt; is a python library that is mainly used for data serialization and parallel work. One really good thing about it is that it enables easy memory savings since it won’t COW when you access data loaded by this package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_import_ joblib
# previously created with joblib.dump()
self.big_data = joblib.load('test.pkl') # big_data is a big list()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using numpy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i6x0heDV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A4Tm50nA8Q-L42Lz8auszTA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i6x0heDV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A4Tm50nA8Q-L42Lz8auszTA.png" alt=""&gt;&lt;/a&gt;A very small difference in memory usage after access&lt;/p&gt;

&lt;p&gt;If you’re doing data science, I have really good news for you! You get memory savings for “free” just by using numpy data structures. This also applies if you use pandas or another library, as long as the inner data structure is a numpy array.&lt;/p&gt;

&lt;p&gt;The reason for this is how numpy manages memory. Since the package is basically C with Python bindings, it has the liberty (and responsibility) of managing everything without the interference of CPython.&lt;br&gt;&lt;br&gt;
Its authors made the clever choice of not saving the reference counts in the same pages where those large data structures are kept, avoiding COW when you access them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_import_ numpy _as_ np
self.big_data = np.array([[item, item] _for_ item _in_ range(10000000)])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using mmap
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WoRDHC36--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AK9m3pJYtGKRchQkgFzWX7Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WoRDHC36--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AK9m3pJYtGKRchQkgFzWX7Q.png" alt=""&gt;&lt;/a&gt;Zero overhead in memory usage&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Mmap"&gt;mmap&lt;/a&gt; is a POSIX-compliant Unix system call that maps files or devices into memory. This allows you to interact with huge files that exist on disk without having to load them into memory as a whole.&lt;/p&gt;

&lt;p&gt;Another big advantage is that you can even create a block of shared “unmanaged” memory without a file backing it, by passing -1 instead of a file descriptor, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_import_ mmap
mmap.mmap(-1, length=....)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another great advantage is that you can write to it as well without incurring COW. As long as you handle concurrency yourself (e.g. with locks), it is an efficient way to share memory/data between processes, although it’s probably easier/safer to use &lt;a href="https://docs.python.org/3/library/multiprocessing.shared_memory.html"&gt;multiprocessing.shared_memory&lt;/a&gt;.&lt;/p&gt;
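&lt;p&gt;A minimal sketch of the multiprocessing.shared_memory route (a single process here for brevity; a second process would attach with the same name):&lt;/p&gt;

```python
from multiprocessing import shared_memory

# Create a named block of shared memory backed by the OS, not by a file.
shm = shared_memory.SharedMemory(create=True, size=1024)
shm.buf[:5] = b"hello"

# Another process (or, as here, the same one) attaches by name:
# no copy, no pickling, no IPC round trip.
other = shared_memory.SharedMemory(name=shm.name)
result = bytes(other.buf[:5])

other.close()
shm.close()
shm.unlink()  # free the block once every process is done with it
```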

&lt;p&gt;How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Python will generally copy shared data to each process when you access it&lt;/li&gt;
&lt;li&gt;“Preload” is a great way to save memory if you need to share a big read-only data structure in your API&lt;/li&gt;
&lt;li&gt;To avoid COW when you read data, you’ll need to use joblib, numpy, mmap, shared_memory or similar&lt;/li&gt;
&lt;li&gt;Sharing data instead of communicating data between processes can save you a lot of latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stay tuned for the next post. Follow so you won’t miss it!&lt;/p&gt;

</description>
      <category>memorymanagement</category>
      <category>python</category>
      <category>numpy</category>
      <category>performance</category>
    </item>
    <item>
      <title>Gunicorn Worker Types: How to choose the right one</title>
      <dc:creator>Luis Sena</dc:creator>
      <pubDate>Mon, 25 Jan 2021 07:29:24 +0000</pubDate>
      <link>https://dev.to/lsena/gunicorn-worker-types-how-to-choose-the-right-one-4n2c</link>
      <guid>https://dev.to/lsena/gunicorn-worker-types-how-to-choose-the-right-one-4n2c</guid>
      <description>&lt;p&gt;Scale your wsgi project to the next level by leveraging everything Gunicorn has to offer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AU3OFfUPCKV7qMmLRRiiYDA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AU3OFfUPCKV7qMmLRRiiYDA.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article assumes you’re using a sync framework like flask or Django and won’t explore the possibility of using the async/await pattern.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;First, let’s briefly discuss how python handles concurrency and parallelism.&lt;/p&gt;

&lt;p&gt;Python never runs more than 1 thread at a time per process because of the &lt;a href="https://wiki.python.org/moin/GlobalInterpreterLock" rel="noopener noreferrer"&gt;GIL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Even if you have 100 threads inside your process, the GIL will only allow a single thread to run at the same time. That means that, at any time, 99 of those threads are paused and 1 thread is working. The GIL is responsible for that orchestration.&lt;/p&gt;

&lt;p&gt;To get around this limitation, we can use Gunicorn. From the docs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Gunicorn is based on the pre-fork worker model. This means that there is a central master process that manages a set of worker processes. The master never knows anything about individual clients. All requests and responses are handled completely by worker processes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means that Gunicorn will spawn the specified number of individual processes and load your application into each process/worker allowing parallel processing for your python application.&lt;/p&gt;

&lt;p&gt;Since one size will never fit everyone’s needs, it offers different worker types in order to suit a broader range of use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  sync
&lt;/h3&gt;

&lt;p&gt;This is the default worker class. Each process will handle 1 request at a time and you can use the &lt;em&gt;-w&lt;/em&gt; parameter to set the number of workers.&lt;/p&gt;

&lt;p&gt;The recommendation for the number of workers is 2–4 x $(NUM_CORES), although it will depend on how your application works.&lt;/p&gt;
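&lt;p&gt;As a sketch, that rule of thumb translated into a gunicorn.conf.py (which is itself a Python file; values are illustrative, tune against your own traffic):&lt;/p&gt;

```python
# gunicorn.conf.py -- illustrative sync-worker setup.
import multiprocessing

worker_class = "sync"
# The 2-4 x cores rule of thumb; a common concrete form is 2n + 1.
workers = multiprocessing.cpu_count() * 2 + 1
timeout = 30  # seconds before a stuck sync worker is killed and restarted
```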

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your work is almost entirely CPU bound;&lt;/li&gt;
&lt;li&gt;Low to zero I/O operations (this includes database access, network requests, etc).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Signs to look for in production:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitor CPU usage and incoming requests to make sure you have the right average number of processes for your machine size and also request patterns.&lt;/p&gt;

&lt;p&gt;If you have too many processes, it can slow down your average latency since it will force a lot of context switching on your machine’s CPU.&lt;/p&gt;

&lt;p&gt;If you see a lot of timeout errors between your reverse proxy (e.g. nginx) and your workers, it’s a sign that you don’t have enough concurrency to handle your traffic patterns/load.&lt;/p&gt;

&lt;h3&gt;
  
  
  gthread
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;If you try to use the sync worker type and set the threads setting to more than 1, the gthread worker type will be used instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you use gthread, Gunicorn will allow each worker to have multiple threads. In this case, the Python application is loaded once per worker, and each of the threads spawned by the same worker shares the same memory space.&lt;/p&gt;

&lt;p&gt;Those threads will be at the mercy of the GIL, but it’s still useful for when you have some I/O blocking happening. It will allow you to handle more concurrency without increasing your memory too much.&lt;/p&gt;

&lt;p&gt;The recommendation for the total number of parallel requests (workers × threads) is still the same.&lt;br&gt;&lt;br&gt;
This is probably the most used configuration you’ll see out in the wild.&lt;/p&gt;
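&lt;p&gt;A gthread sketch in the same gunicorn.conf.py terms (illustrative values); total concurrency is workers × threads:&lt;/p&gt;

```python
# gunicorn.conf.py -- illustrative gthread setup: fewer processes,
# each handling several requests concurrently via threads.
import multiprocessing

worker_class = "gthread"
workers = multiprocessing.cpu_count()
threads = 4  # each worker handles up to 4 requests at the same time
```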

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Moderate I/O operations;&lt;/li&gt;
&lt;li&gt;Moderate CPU usage;&lt;/li&gt;
&lt;li&gt;You’re using packages/extensions that are not patched to run async and/or are unable to patch them yourself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Signs to look for in production:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ones I described for the sync worker type.&lt;/p&gt;

&lt;p&gt;…with the caveat of balancing processes vs. threads. That balance will depend a lot on your usage patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  eventlet/gevent
&lt;/h3&gt;

&lt;p&gt;Eventlet and gevent make use of “green threads” or “pseudo threads” and are based on &lt;a href="https://greenlet.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;greenlet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In practice, if your application’s work is mainly I/O bound, this allows it to scale to potentially thousands of concurrent requests on a single process.&lt;/p&gt;

&lt;p&gt;Even with the rise of async frameworks (fastapi, sanic, etc), this is still relevant today since it allows you to optimize for I/O without having the extra code complexity.&lt;/p&gt;

&lt;p&gt;The way they manage to do it is by “&lt;a href="https://en.wikipedia.org/wiki/Monkey_patch" rel="noopener noreferrer"&gt;monkey patching&lt;/a&gt;” your code, mainly replacing blocking parts with compatible cooperative counterparts from the gevent package.&lt;/p&gt;

&lt;p&gt;It uses epoll, kqueue, or libevent for highly scalable non-blocking I/O. Coroutines let the developer keep a blocking style of programming similar to threading, while providing the benefits of non-blocking I/O.&lt;/p&gt;

&lt;p&gt;This is usually the most efficient way to run your django/flask/etc web application, since most of the time the bulk of the latency comes from I/O related work.&lt;/p&gt;

&lt;p&gt;That being said, it can be tricky to configure 100% correctly, and if you’re not serving hundreds or more requests/sec, it’s probably easier to just use the gthread worker class.&lt;/p&gt;
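&lt;p&gt;For reference, a minimal gevent setup in gunicorn.conf.py form (illustrative values; gunicorn’s gevent worker class takes care of monkey patching your application for you):&lt;/p&gt;

```python
# gunicorn.conf.py -- illustrative gevent configuration. Each worker is
# still a process; worker_connections caps the number of greenlets
# (in-flight requests) a single worker will juggle cooperatively.
import multiprocessing

worker_class = "gevent"
workers = multiprocessing.cpu_count()
worker_connections = 1000
```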

&lt;p&gt;&lt;strong&gt;Signs to look for in production:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make sure all parts of your code cooperate with these async frameworks (i.e. are properly patched).
Without that, you could have blocked threads sitting idle, unable to execute work (like accepting new requests or answering previously accepted requests whose I/O calls have finished).
In production, if your CPU usage is low but you’re seeing a lot of timeouts in your nginx logs, there’s a good chance that’s happening.
You should audit this before deploying to production (I’ll describe how to handle this later in this post).&lt;/li&gt;
&lt;li&gt;Connections to your databases. If you have thousands of concurrent connections and you’re using a DBMS like PostgreSQL without a connection pooler, chances are you’re going to have a bad time (I’ll describe how to handle this later in this post).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  tornado
&lt;/h3&gt;

&lt;p&gt;There’s also a Tornado worker class. It can be used to write applications using the Tornado framework. Although the Tornado workers are capable of serving a WSGI application, this is not a recommended configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tips and best practices when using the “green thread” worker types
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;I’ll focus on &lt;a href="http://www.gevent.org/intro.html" rel="noopener noreferrer"&gt;gevent&lt;/a&gt; instead of eventlet since it has become the popular choice.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Make sure everything on your project is gevent friendly. This includes packages and drivers. I’ll list some of the most used packages and how to patch them if needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The official package is psycopg2, but it’s not prepared to be patched by gevent on its own.&lt;br&gt;&lt;br&gt;
You also need psycogreen:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/psycopg/psycogreen/" rel="noopener noreferrer"&gt;psycopg/psycogreen&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MySQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The recommended package is PyMySQL and it is gevent friendly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PyMySQL/PyMySQL" rel="noopener noreferrer"&gt;PyMySQL/PyMySQL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The recommended package is redis-py and it is gevent friendly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/andymccurdy/redis-py" rel="noopener noreferrer"&gt;andymccurdy/redis-py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The recommended package is PyMongo and it is gevent friendly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/mongodb/mongo-python-driver" rel="noopener noreferrer"&gt;mongodb/mongo-python-driver&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Elasticsearch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The recommended package is elasticsearch-py and it is gevent friendly.&lt;br&gt;&lt;br&gt;
Quote from a maintainer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

The library itself just passes whatever is returned from the connection class. It uses standard sockets by default (via urllib3) so it can be made compatible by monkey patching. Alternatively you can create your own connection_class and plug it in.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://github.com/elastic/elasticsearch-py" rel="noopener noreferrer"&gt;elastic/elasticsearch-py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cassandra&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The recommended package is from datastax and it is gevent friendly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/datastax/python-driver" rel="noopener noreferrer"&gt;datastax/python-driver&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection Pooling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One thing to take into consideration when using gevent is to understand that it’s really easy to end up with a lot of concurrent connections to, for example, your database. For some DBMS like PostgreSQL, that can be really dangerous.&lt;br&gt;&lt;br&gt;
The standard practice for these cases is to use a connection pool. In the case of PostgreSQL, the SQLAlchemy framework or PgBouncer will work very well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blocked thread monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s really important to make sure parts of your code are not blocking a greenlet from returning to the hub.&lt;/p&gt;

&lt;p&gt;Fortunately, since gevent version 1.3, it’s simple to monitor using the monitor_thread property, and you can even enable it inside your unit tests:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.gevent.org/configuration.html#gevent._config.Config.monitor_thread" rel="noopener noreferrer"&gt;gevent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s also a good idea to have it enabled in your development environment, since some blocks might be missed during your CI runs because it’s common to mock some of the I/O.&lt;/p&gt;
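&lt;p&gt;A sketch of enabling it (assuming gevent ≥ 1.3; the settings must be applied before the hub starts):&lt;/p&gt;

```python
# Development/test sketch: ask gevent to run a background monitor
# thread that reports greenlets blocking the event loop for too long.
import gevent

gevent.config.monitor_thread = True
gevent.config.max_blocking_time = 0.1  # seconds before a block is reported
```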

&lt;h3&gt;
  
  
  Conclusions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Gunicorn/wsgi is still a valid choice even with the rise of async frameworks like fastapi and sanic;&lt;/li&gt;
&lt;li&gt;gthread is usually the preferred worker type for many due to its ease of configuration coupled with the ability to scale concurrency without bloating your memory too much;&lt;/li&gt;
&lt;li&gt;gevent is the best choice when you need concurrency and most of your work is I/O bound (network calls, file access, databases, etc…).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;p&gt;A Deep Dive into Gunicorn and the Python GIL: &lt;a href="https://luis-sena.medium.com/gunicorn-vs-python-gil-221e673d692" rel="noopener noreferrer"&gt;Gunicorn vs Python GIL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!&lt;/p&gt;

&lt;p&gt;Stay tuned for the next post. Follow so you won’t miss it!&lt;/p&gt;

</description>
      <category>python</category>
      <category>gunicorn</category>
      <category>softwaredevelopment</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
