<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Luis Sena</title>
    <description>The latest articles on DEV Community by Luis Sena (@lsena).</description>
    <link>https://dev.to/lsena</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F555397%2F97585d67-1b12-405e-8b3c-b9ccda9b5434.jpeg</url>
      <title>DEV Community: Luis Sena</title>
      <link>https://dev.to/lsena</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lsena"/>
    <language>en</language>
    <item>
      <title>Sharing big NumPy arrays across python processes</title>
      <dc:creator>Luis Sena</dc:creator>
      <pubDate>Mon, 31 Jan 2022 09:34:44 +0000</pubDate>
      <link>https://dev.to/lsena/sharing-big-numpy-arrays-across-python-processes-2ik8</link>
      <guid>https://dev.to/lsena/sharing-big-numpy-arrays-across-python-processes-2ik8</guid>
      <description>&lt;h4&gt;
  
  
  What is the best way to share huge NumPy arrays between Python processes?
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jqtqF0tk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AZfFUl8TpZcelKgXd9_Xc6Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jqtqF0tk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AZfFUl8TpZcelKgXd9_Xc6Q.png" alt="" width="880" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A situation I’ve come across multiple times is the need to keep one or multiple NumPy arrays in memory that serve as the “database” for specific computations (e.g. doing collaborative or content-based filtering recommendations).&lt;/p&gt;

&lt;p&gt;If you want a web server to use those arrays, you need multiprocessing in order to use more than one CPU core, as I’ve discussed in &lt;a href="https://dev.to/lsena/gunicorn-worker-types-you-re-probably-using-them-wrong-52a2-temp-slug-1492068"&gt;this previous article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Having to use multiple processes means we have some limitations when it comes to sharing those NumPy arrays, but fortunately, we have many options to choose from and that’s exactly what we’ll see in this article.&lt;/p&gt;

&lt;p&gt;We’ll see how to use NumPy with different multiprocessing options and benchmark each one of them, using a ~1.5 GB array with random values.&lt;/p&gt;

&lt;p&gt;For the examples, I’ll mostly use a ProcessPoolExecutor, but these methods are applicable to any multi-process environment (even Gunicorn).&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategies that we’ll explore and benchmark in this article
&lt;/h3&gt;

&lt;h4&gt;
  
  
  IPC with pickle
&lt;/h4&gt;

&lt;p&gt;This is the easiest (and most inefficient) way of sharing data between Python processes. The data you pass as a parameter is automatically pickled so it can be sent from one process to the other.&lt;/p&gt;

&lt;h4&gt;
  
  
  Copy-on-write pattern
&lt;/h4&gt;

&lt;p&gt;As I explained in a &lt;a href="https://luis-sena.medium.com/understanding-and-optimizing-python-multi-process-memory-management-24e1e5e79047"&gt;previous article&lt;/a&gt;, when you use fork() in UNIX compatible systems, each process will point to the same memory address and will be able to read from the same address space until they need to write to it.&lt;/p&gt;

&lt;p&gt;This makes it easy to emulate “thread-like” behaviour. The only issues are that you need to keep that data immutable after the fork() and that it only works for data created before the fork().&lt;/p&gt;

&lt;h4&gt;
  
  
  Shared array
&lt;/h4&gt;

&lt;p&gt;One of the oldest ways to share data in Python is by using sharedctypes. This module provides multiple data structures for this purpose.&lt;/p&gt;

&lt;p&gt;I’ll be using RawArray since I don’t care about locks for this use case. If you need a structure that supports locks out of the box, Array is a better option.&lt;/p&gt;

&lt;h4&gt;
  
  
  Memory-mapped file (mmap)
&lt;/h4&gt;

&lt;p&gt;Memory-mapped files are considered by many as the most efficient way to handle and share big data structures.&lt;/p&gt;

&lt;p&gt;NumPy supports it out of the box and we’ll make use of that. We’ll also explore the difference between mapping it to disk and memory (with tmpfs).&lt;/p&gt;

&lt;h4&gt;
  
  
  SharedMemory (Python 3.8+)
&lt;/h4&gt;

&lt;p&gt;SharedMemory is a module that makes it much easier to share data structures between Python processes. Like many other shared memory strategies, it relies on mmap under the hood.&lt;/p&gt;

&lt;p&gt;It makes it extremely easy to share NumPy arrays between processes as we’ll see in this article.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ray
&lt;/h4&gt;

&lt;p&gt;Ray is an open-source project that makes it simple to scale any compute-intensive Python workload.&lt;br&gt;&lt;br&gt;
It has been growing a lot in popularity, especially with the current need to process huge amounts of data and serve models on a large scale.&lt;/p&gt;

&lt;p&gt;In this article, we’ll be using just 0.001% of its awesome features.&lt;/p&gt;
&lt;h3&gt;
  
  
  Benchmarks
&lt;/h3&gt;

&lt;p&gt;All benchmarks use the same randomly generated NumPy array that is ~1.5GB.&lt;/p&gt;

&lt;p&gt;I’m running everything with Docker and 4 dedicated CPU cores.&lt;/p&gt;

&lt;p&gt;The computation is always the same, numpy.sum().&lt;/p&gt;

&lt;p&gt;The final runtime for each benchmark is the average runtime in milliseconds between &lt;strong&gt;30 runs&lt;/strong&gt; with all the outliers removed.&lt;/p&gt;
&lt;h4&gt;
  
  
  IPC with pickle
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In this approach, a slice of the array is pickled and sent to each process to be processed.&lt;/p&gt;
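The original code embed is gone; a minimal sketch of the pickle-based approach looks like this (names and array sizes are illustrative, not the exact benchmark code):

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def chunk_sum(chunk: np.ndarray) -> float:
    # `chunk` was pickled in the parent and unpickled here;
    # for big arrays, that copy dominates the runtime
    return float(np.sum(chunk))


def parallel_sum(data: np.ndarray, workers: int = 4) -> float:
    chunks = np.array_split(data, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum, chunks))


if __name__ == "__main__":
    data = np.random.rand(100_000)
    print(parallel_sum(data))
```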

&lt;p&gt;Total Runtime: &lt;strong&gt;4137.79ms&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Copy-on-write pattern
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As expected, we get really good performance with this approach.&lt;/p&gt;
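A minimal sketch of the copy-on-write approach, assuming a fork start method (Linux default) and illustrative names:

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

# Created BEFORE the pool forks, so every worker can read it
# through copy-on-write without any serialization.
SHARED = np.arange(100_000, dtype=np.float64)


def partial_sum(bounds) -> float:
    start, end = bounds
    # reading is free; writing to SHARED here would copy the touched pages
    return float(np.sum(SHARED[start:end]))


def parallel_sum(workers: int = 4) -> float:
    step = len(SHARED) // workers
    slices = [(i * step, len(SHARED) if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, slices))
```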

&lt;p&gt;The major downside to this approach is that you can’t change the data (well, you can, but that will create a copy inside the process that tried to change it).&lt;br&gt;&lt;br&gt;
The other major downside is that every new object created after the fork() will only exist inside the process that created it.&lt;/p&gt;

&lt;p&gt;If you’re using Gunicorn to scale your web application, for example, it’s very likely that you’ll need to update that shared data from time to time, making this approach more restrictive.&lt;/p&gt;

&lt;p&gt;Total Runtime:  &lt;strong&gt;80.30ms&lt;/strong&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Shared Array
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This approach will create an array in a shared memory block that allows you to freely read and write from any process.&lt;/p&gt;

&lt;p&gt;If you’re expecting concurrent writes, you might want to use Array instead of RawArray since it allows using locks out of the box.&lt;/p&gt;
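A sketch of the RawArray version, assuming a fork start method so the shared buffer can be inherited by the workers (names are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor
from multiprocessing.sharedctypes import RawArray

import numpy as np

_shared: np.ndarray  # set in each worker by init_worker


def init_worker(raw):
    global _shared
    # wrap the shared buffer as a NumPy array without copying it
    _shared = np.frombuffer(raw, dtype=np.float64)


def partial_sum(bounds) -> float:
    start, end = bounds
    return float(np.sum(_shared[start:end]))


def parallel_sum(data: np.ndarray, workers: int = 4) -> float:
    raw = RawArray("d", data.size)
    np.frombuffer(raw, dtype=np.float64)[:] = data  # one-time copy in
    step = data.size // workers
    slices = [(i * step, data.size if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers,
                             initializer=init_worker,
                             initargs=(raw,)) as pool:
        return sum(pool.map(partial_sum, slices))
```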

&lt;p&gt;Total Runtime:  &lt;strong&gt;102.24ms&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Memory-mapped file (mmap)
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Here, the location of your backing file will matter a lot.&lt;/p&gt;

&lt;p&gt;Ideally, always use a memory-mounted folder (backed by tmpfs). In Linux, that often means the /tmp folder.&lt;/p&gt;

&lt;p&gt;But when using Docker, you need to use /dev/shm, since the /tmp folder is not mounted in memory.&lt;/p&gt;
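A sketch of the memory-mapped approach using numpy.memmap (names are illustrative; pass folder="/dev/shm" to get the tmpfs-backed variant on Docker):

```python
import tempfile
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def chunk_sum(args) -> float:
    path, size, start, end = args
    # each worker maps the same file; pages are shared, not copied
    mm = np.memmap(path, dtype=np.float64, mode="r", shape=(size,))
    return float(np.sum(mm[start:end]))


def parallel_sum(data: np.ndarray, workers: int = 4, folder=None) -> float:
    # folder="/dev/shm" maps the file in memory (tmpfs) on Linux/Docker
    with tempfile.NamedTemporaryFile(dir=folder) as f:
        mm = np.memmap(f.name, dtype=np.float64, mode="w+", shape=data.shape)
        mm[:] = data
        mm.flush()
        step = data.size // workers
        tasks = [(f.name, data.size, i * step,
                  data.size if i == workers - 1 else (i + 1) * step)
                 for i in range(workers)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return sum(pool.map(chunk_sum, tasks))
```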

&lt;p&gt;Total Runtime with /tmp: &lt;strong&gt;159.62ms&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Total Runtime with /dev/shm: &lt;strong&gt;108.68ms&lt;/strong&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  SharedMemory (Python 3.8+)
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;SharedMemory was introduced in Python 3.8. It’s backed by mmap(2) and makes sharing NumPy arrays across processes really simple and efficient.&lt;/p&gt;

&lt;p&gt;It’s usually my recommendation if you don’t want to use any external libraries.&lt;/p&gt;
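A sketch of the SharedMemory version (names are illustrative; note that the NumPy view has to be dropped before the block is closed):

```python
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import shared_memory

import numpy as np


def chunk_sum(args) -> float:
    name, size, start, end = args
    shm = shared_memory.SharedMemory(name=name)  # attach, no copy
    arr = np.frombuffer(shm.buf, dtype=np.float64, count=size)
    total = float(np.sum(arr[start:end]))
    del arr      # release the buffer view before closing
    shm.close()
    return total


def parallel_sum(data: np.ndarray, workers: int = 4) -> float:
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    arr = np.frombuffer(shm.buf, dtype=np.float64, count=data.size)
    try:
        arr[:] = data  # one-time copy into the shared block
        step = data.size // workers
        tasks = [(shm.name, data.size, i * step,
                  data.size if i == workers - 1 else (i + 1) * step)
                 for i in range(workers)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return sum(pool.map(chunk_sum, tasks))
    finally:
        del arr
        shm.close()
        shm.unlink()
```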

&lt;p&gt;Total Runtime:  &lt;strong&gt;99.96ms&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Ray
&lt;/h4&gt;

&lt;p&gt;Ray is an awesome collection of tools/libraries that allow you to tackle many different large scale problems.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Modern workloads like deep learning and hyperparameter tuning are compute-intensive, and require distributed or parallel execution. Ray makes it effortless to parallelize single machine code — go from a single CPU to multi-core, multi-GPU or multi-node with minimal code changes.&lt;br&gt;&lt;br&gt;
 — &lt;a href="https://www.ray.io/"&gt;https://www.ray.io&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here we’ll just explore two different ways to share NumPy arrays using Ray. Soon I’ll showcase better and more detailed use cases for Ray.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;One thing to take note of is that I’m not counting ray.init() in the total runtime. That line of code can take around 3 seconds but you only need to call it once, so it shouldn’t be a problem in production scenarios.&lt;/p&gt;

&lt;p&gt;It does make these benchmarks a bit unfair since, for all the other scenarios, the process pool initialization is being counted in the total runtime.&lt;br&gt;&lt;br&gt;
Because of this, in the final results, I’m also excluding that pool initialization from the runtime.&lt;/p&gt;

&lt;p&gt;Using a naive approach, where Ray needs to serialize/deserialize the data just like the first scenario that uses pickle, we still see a big improvement in total runtime compared to pickle.&lt;/p&gt;

&lt;p&gt;Total Runtime:  &lt;strong&gt;252.08ms&lt;/strong&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;A better approach for this use case is to use Ray Object Store.&lt;/p&gt;

&lt;p&gt;We can even have it backed by Redis, but in this example, it will just use shared memory.&lt;/p&gt;
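Sticking with the same hypothetical partial-sum workload, the object-store version might be sketched like this (illustrative only; it requires a Ray installation and is not the exact benchmark code):

```python
import numpy as np
import ray

ray.init()  # takes seconds, but is only paid once

data = np.random.rand(100_000)
data_ref = ray.put(data)  # store the array once in the shared object store


@ray.remote
def partial_sum(arr, start, end):
    # passing the ObjectRef as an argument hands workers a
    # zero-copy, read-only view of the stored array
    return float(np.sum(arr[start:end]))


futures = [partial_sum.remote(data_ref, i * 25_000, (i + 1) * 25_000)
           for i in range(4)]
total = sum(ray.get(futures))
```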

&lt;p&gt;We can see a big improvement with this small change.&lt;/p&gt;

&lt;p&gt;Total Runtime:  &lt;strong&gt;70.65ms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One thing I really like about Ray is that it allows you to start “small” with very simple and efficient code and then scale your project as your needs get bigger (from a single machine to multi-node cluster).&lt;/p&gt;

&lt;h4&gt;
  
  
  Final Results
&lt;/h4&gt;

&lt;p&gt;For the final results, and to make a fair comparison with Ray, I’m excluding the time taken to initialize the processes inside the ProcessPoolExecutor, since I also excluded ray.init() from the Ray benchmark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t-qGvmUe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AelTbIrQXX6caNuZRmOdMgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t-qGvmUe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AelTbIrQXX6caNuZRmOdMgg.png" alt="" width="880" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Communicating through pickle is so slow that it dwarfs the other benchmarks; let’s remove it for clarity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K5M96Utv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AHOSmmumJzclZCVDBLr5sew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K5M96Utv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AHOSmmumJzclZCVDBLr5sew.png" alt="" width="880" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusions
&lt;/h4&gt;

&lt;p&gt;Sharing a global variable before forking (copy-on-write) seems to be the fastest, although also the most limited option.&lt;/p&gt;

&lt;p&gt;When using mmap, always make sure to map to a path that is in memory (tmpfs mount).&lt;/p&gt;

&lt;p&gt;SharedMemory has really good performance and a simple, easy-to-use API.&lt;/p&gt;

&lt;p&gt;Ray with its Object Store seems to be the winner if you need performance and flexibility. It’s also a good framework to grow your project into a bigger scope.&lt;/p&gt;

&lt;h3&gt;
  
  
  Want to learn more about python? Check these out!
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/lsena/gunicorn-vs-python-gil-43jl-temp-slug-9778611"&gt;Gunicorn vs Python GIL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/lsena/gunicorn-worker-types-you-re-probably-using-them-wrong-52a2-temp-slug-1492068"&gt;Gunicorn Worker Types: You’re Probably Using Them Wrong&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://luis-sena.medium.com/understanding-and-optimizing-python-multi-process-memory-management-24e1e5e79047"&gt;Understanding and optimizing python multi-process memory management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/lsena/creating-the-perfect-python-dockerfile-5fn2-temp-slug-3298889"&gt;Creating the Perfect Python Dockerfile&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!&lt;/p&gt;

&lt;p&gt;Stay tuned for the next post. Follow so you won’t miss it!&lt;/p&gt;

</description>
      <category>numpy</category>
      <category>datascience</category>
      <category>softwaredevelopment</category>
      <category>programming</category>
    </item>
    <item>
      <title>Achieving Sub-Millisecond Latencies With Redis by Using Better Serializers.</title>
      <dc:creator>Luis Sena</dc:creator>
      <pubDate>Thu, 19 Aug 2021 13:17:39 +0000</pubDate>
      <link>https://dev.to/lsena/achieving-sub-millisecond-latencies-with-redis-by-using-better-serializers-mjo</link>
      <guid>https://dev.to/lsena/achieving-sub-millisecond-latencies-with-redis-by-using-better-serializers-mjo</guid>
      <description>&lt;p&gt;How some simple changes can result in less latency and better memory usage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sbpq6Ew8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A9SEE-L33xj-aQyUC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sbpq6Ew8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A9SEE-L33xj-aQyUC.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Redis Strings are probably the most used (and abused) Redis data structure.&lt;/p&gt;

&lt;p&gt;One of their main advantages is that they are &lt;strong&gt;binary-safe&lt;/strong&gt; — this means you can save any type of binary data in Redis.&lt;/p&gt;

&lt;p&gt;But as it turns out, most Redis users are serializing objects to JSON strings and storing them inside Redis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s the problem you might ask?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON serialization/deserialization is incredibly inefficient and costly&lt;/li&gt;
&lt;li&gt;You end up using more space in storage (which is expensive in Redis since it’s an in-memory database)&lt;/li&gt;
&lt;li&gt;You increase your overall service latency without any real benefit&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Using JSON to store data in Redis will increase your latency and resource usage without bringing any real benefit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One other “simple” optimization you can use is compression.&lt;/p&gt;

&lt;p&gt;This one will depend on each use case since it will be a trade-off between size, latency, and CPU usage.&lt;/p&gt;

&lt;p&gt;Algorithms like ZSTD or LZ4 can be used with minimal CPU overhead, resulting in some good storage savings.&lt;/p&gt;

&lt;p&gt;The following charts show how much you gain just by switching from JSON to a binary format like MessagePack.&lt;/p&gt;

&lt;p&gt;These charts also include the &lt;strong&gt;serialization/deserialization times&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We can also see that we can save some storage/memory by using compression at the expense of some latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iu1gPe6w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AABDAxbPzjOnYKHQx83fJYw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iu1gPe6w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AABDAxbPzjOnYKHQx83fJYw.png" alt=""&gt;&lt;/a&gt;Using a random “JSON” object with different attributes&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--shNewFxL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AfoMuU2v0Sm_vgMedUmX2rA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--shNewFxL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AfoMuU2v0Sm_vgMedUmX2rA.png" alt=""&gt;&lt;/a&gt;Using a random “JSON” object with different attributes&lt;/p&gt;

&lt;p&gt;While the previous charts showed a fairly complex JSON object that LZ4 handles pretty well (compression-ratio-wise), the next charts show that ZSTD has the advantage when we need to compress arrays of floats.&lt;/p&gt;

&lt;p&gt;Here I ran the benchmarks with different-sized arrays.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7lbmbIrE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2Agc-Aq4kV8mrDO5uQxsJbXw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7lbmbIrE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2Agc-Aq4kV8mrDO5uQxsJbXw.png" alt=""&gt;&lt;/a&gt;Using a small array of floats&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SGGCRf01--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AQP1Q9ZOmNtwUD52r4lsh6Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SGGCRf01--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AQP1Q9ZOmNtwUD52r4lsh6Q.png" alt=""&gt;&lt;/a&gt;Using a small array of floats&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xd4kKufa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AdTlNG5DUcdqELmGnbasKXA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xd4kKufa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AdTlNG5DUcdqELmGnbasKXA.png" alt=""&gt;&lt;/a&gt;Using a big array of floats&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6nmD0h7r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AoJa_pBSNFxYIwZSKC3OBUA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6nmD0h7r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/640/1%2AoJa_pBSNFxYIwZSKC3OBUA.png" alt=""&gt;&lt;/a&gt;Using a big array of floats&lt;/p&gt;

&lt;p&gt;As you can see, just by switching from JSON to MessagePack, you can reduce your latency by more than 3x without any real disadvantage!&lt;/p&gt;

&lt;p&gt;Simple example using python to set/get a Redis String using JSON and MessagePack:&lt;/p&gt;
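The original embed is a sketch along these lines (the object and key names are illustrative; msgpack is a third-party package, and the Redis calls are shown in comments since they need a running server):

```python
import json

import msgpack  # third-party package: pip install msgpack

user = {"id": 123, "name": "Luis", "scores": [91, 85, 77]}

json_payload = json.dumps(user).encode()
msgpack_payload = msgpack.packb(user)

# the binary payload is smaller, and packb/unpackb is cheaper than dumps/loads
print(len(json_payload), len(msgpack_payload))

# round-trips are symmetric with json.loads
assert msgpack.unpackb(msgpack_payload) == user

# with a real redis-py client, usage is identical to the JSON version:
#   r.set("user:123", msgpack.packb(user))
#   user = msgpack.unpackb(r.get("user:123"))
```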


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As you can see, it’s as simple as using JSON.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://luis-sena.medium.com/using-redis-to-build-a-realtime-nike-sneakers-drop-app-backend-b0bd0fef7056"&gt;Using Redis to Build a Realtime “NIKE Sneakers Drop App” Backend&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/lsena/multi-region-the-final-frontier-how-redis-and-atomic-clocks-save-the-day-268j-temp-slug-6446881"&gt;Multi-Region: the final frontier — How Redis and atomic clocks save the day&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!&lt;/p&gt;

&lt;p&gt;Stay tuned for the next post. Follow so you won’t miss it!&lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>softwaredevelopment</category>
      <category>redis</category>
    </item>
    <item>
      <title>Benchmarking Different Methods For Full-Text Search Using Elasticsearch</title>
      <dc:creator>Luis Sena</dc:creator>
      <pubDate>Mon, 16 Aug 2021 13:55:02 +0000</pubDate>
      <link>https://dev.to/lsena/benchmarking-different-methods-for-full-text-search-using-elasticsearch-na</link>
      <guid>https://dev.to/lsena/benchmarking-different-methods-for-full-text-search-using-elasticsearch-na</guid>
      <description>&lt;p&gt;How to choose between different analyzers and queries to get the best search performance? Benchmarking of course!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A1PlKjPl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AIpHPJfJIe8xG2cox" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A1PlKjPl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AIpHPJfJIe8xG2cox" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@condorito1953?utm_source=medium&amp;amp;utm_medium=referral"&gt;Arie Wubben&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deploying a large-scale full-text search engine can be very hard. Elasticsearch makes the job much easier but it’s not one size fits all — quite the contrary.&lt;/p&gt;

&lt;p&gt;Elasticsearch has many configurations and features, but having many features also means many ways to achieve the same goal and it’s not always straightforward to know what’s the best way for the product you’re building.&lt;/p&gt;

&lt;p&gt;Let’s start with finding out the main ways we can find users by their username/name, measuring their performance, advantages, and drawbacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Experiment Stats&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Match Query
&lt;/h3&gt;

&lt;p&gt;This will match terms using a fuzziness param.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple to use&lt;/li&gt;
&lt;li&gt;Doesn’t use much space&lt;/li&gt;
&lt;li&gt;Allows fuzzy search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the size of the indexed word is bigger than the searched term+fuzziness_size it will not match&lt;/li&gt;
&lt;li&gt;Fuzzy search can slow things down&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Prefix query&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple to use&lt;/li&gt;
&lt;li&gt;Potentially very fast (especially if you use &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-prefixes.html"&gt;index_prefixes&lt;/a&gt; option)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It will only match if the indexed term starts with the searched term&lt;/li&gt;
&lt;li&gt;If you use the index_prefixes option, it will use more space&lt;/li&gt;
&lt;li&gt;No fuzzy search&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Wildcard query&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Works much the same way as “LIKE %term%” in a relational database SELECT.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy to implement and debug&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Usually, the slowest option, especially if the wildcard is placed at the start or very few characters are used&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Match query + ngram analyzer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;will match even if the search term is in the middle of a word&lt;/li&gt;
&lt;li&gt;good search performance&lt;/li&gt;
&lt;li&gt;allows having a “fuzzy” search since it will match segments of each word&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;specialized analyzer&lt;/li&gt;
&lt;li&gt;uses more disk space&lt;/li&gt;
&lt;li&gt;only matches if the search term is at least the size of the smallest “gram”&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Mappings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Standard&lt;/strong&gt;&lt;/p&gt;
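A minimal index body of this shape (field names are illustrative) would be:

```json
{
  "mappings": {
    "properties": {
      "username": { "type": "text" },
      "name":     { "type": "text" }
    }
  }
}
```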


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Ngram&lt;/strong&gt;&lt;/p&gt;
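An ngram mapping along these lines (analyzer name, field names, and gram sizes are illustrative; keep max_gram - min_gram within index.max_ngram_diff, which defaults to 1):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": { "type": "ngram", "min_gram": 3, "max_gram": 4 }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "username": {
        "type": "text",
        "fields": {
          "ngram": { "type": "text", "analyzer": "ngram_analyzer" }
        }
      }
    }
  }
}
```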


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Queries
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Match query&lt;/strong&gt;&lt;/p&gt;
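A match query with fuzziness has this shape (field name and search term are illustrative):

```json
{
  "query": {
    "match": {
      "username": { "query": "luis", "fuzziness": "AUTO" }
    }
  }
}
```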


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Prefix query&lt;/strong&gt;&lt;/p&gt;
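A prefix query has this shape (field name and search term are illustrative):

```json
{
  "query": {
    "prefix": {
      "username": { "value": "lui" }
    }
  }
}
```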


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Wildcard query&lt;/strong&gt;&lt;/p&gt;
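A wildcard query has this shape (field name and pattern are illustrative; a leading wildcard like this is exactly the slow case discussed below):

```json
{
  "query": {
    "wildcard": {
      "username": { "value": "*uis*" }
    }
  }
}
```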


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Match query + Ngram Analyzer&lt;/strong&gt;&lt;/p&gt;
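With the ngram mapping, a plain match query targets the ngram-analyzed subfield (field names are illustrative):

```json
{
  "query": {
    "match": {
      "username.ngram": "uis"
    }
  }
}
```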


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Query Benchmarks
&lt;/h3&gt;

&lt;p&gt;To do the benchmarks, I’ve created a small Python script that spawns 4 parallel processes, each running 1000 consecutive queries.&lt;/p&gt;

&lt;p&gt;It runs that for each kind of query.&lt;/p&gt;

&lt;p&gt;The main objective is not to know how long each query takes but to &lt;strong&gt;compare&lt;/strong&gt; their execution time under the same conditions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time in seconds is calculated by summing the time of 1000 runs and then averaging across the 4 parallel processes&lt;/li&gt;
&lt;/ul&gt;
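The timing skeleton of such a script can be sketched as follows (the actual Elasticsearch call is left as a comment, since it needs a running cluster; names are illustrative):

```python
import time
from concurrent.futures import ProcessPoolExecutor


def run_queries(n: int = 1000) -> float:
    """Run `n` consecutive queries and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        # the real script issues one search here, e.g.:
        #   es.search(index="users", body=query)
        pass
    return time.perf_counter() - start


def benchmark(workers: int = 4, n: int = 1000) -> float:
    # sum 1000 runs per process, then average across the parallel workers
    with ProcessPoolExecutor(max_workers=workers) as pool:
        totals = list(pool.map(run_queries, [n] * workers))
    return sum(totals) / len(totals)
```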


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Conclusions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid the wildcard query at all costs:&lt;/strong&gt; I see the wildcard query being recommended everywhere but as we saw, it is the slowest option and you can get better results with the other options.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you can live with matching only the beginning of a word:&lt;/strong&gt; The prefix query can do this job, and it can do it really fast. If your use case fits this, it’s a good choice. There is also the possibility of using the index_prefix option to speed things up even more at the cost of disk space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you want to save on disk space:&lt;/strong&gt; Using the standard analyzer with a match+fuzziness param should do the trick.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you need to match even when the search term is in the middle of a word, and it really needs to be fast:&lt;/strong&gt; ngram seems to be the choice in this case. It can be “dangerous” to use sometimes, though.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When using the ngram analyzer&lt;/strong&gt;, you should avoid a big distance between the min and max gram sizes, and avoid very small ngram sizes (like 1, used just to return results for single-letter searches).&lt;/p&gt;

&lt;p&gt;If you have a big range of gram sizes, it will become very expensive disk-wise and potentially degrade your performance.&lt;/p&gt;

&lt;p&gt;Instead, you could, for example, fall back to the fields that use the standard analyzer and perform a simple match or prefix query when your search term is shorter than min_ngram_size.&lt;/p&gt;

&lt;h3&gt;
  
  
  Into Elasticsearch? Check these out:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/lsena/the-complete-guide-to-increase-your-elasticsearch-write-throughput-nlo-temp-slug-2628411"&gt;The Complete Guide to Increase Your Elasticsearch Write Throughput&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/lsena/using-event-sourcing-to-increase-elasticsearch-performance-4aph-temp-slug-9860704"&gt;Using Event Sourcing to Increase Elasticsearch Performance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!&lt;/p&gt;

&lt;p&gt;Stay tuned for the next post. Follow so you won’t miss it!&lt;/p&gt;

</description>
      <category>development</category>
      <category>softwaredevelopment</category>
      <category>elasticsearch</category>
      <category>programming</category>
    </item>
    <item>
      <title>Understanding and optimizing python multi-process memory management</title>
      <dc:creator>Luis Sena</dc:creator>
      <pubDate>Sun, 07 Feb 2021 17:10:11 +0000</pubDate>
      <link>https://dev.to/lsena/understanding-and-optimizing-python-multi-process-memory-management-4ech</link>
      <guid>https://dev.to/lsena/understanding-and-optimizing-python-multi-process-memory-management-4ech</guid>
      <description>&lt;h3&gt;
  
  
  Understanding and Optimizing Python multi-process Memory Management
&lt;/h3&gt;

&lt;p&gt;This post will focus on lowering your memory usage while increasing your IPC speed at the same time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This blog post will focus on &lt;a href="https://en.wikipedia.org/wiki/POSIX"&gt;POSIX&lt;/a&gt;-oriented OSes like Linux or macOS&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To avoid the GIL bottleneck, you might have already used multi-processing with Python, be it using a pre-fork worker model (&lt;a href="https://luis-sena.medium.com/gunicorn-worker-types-youre-probably-using-them-wrong-381239e13594"&gt;more on that here&lt;/a&gt;), or just using the &lt;a href="https://docs.python.org/3/library/multiprocessing.html"&gt;multiprocessing&lt;/a&gt; package.&lt;/p&gt;

&lt;p&gt;What that does, under the hood, is use the OS &lt;em&gt;fork()&lt;/em&gt; function, which creates a child process with an exact virtual copy of the parent’s memory.&lt;br&gt;&lt;br&gt;
The OS is really clever about this: it doesn’t copy the memory right away. Instead, it exposes it to each process as its own isolated memory, keeping all the previous addresses intact.&lt;/p&gt;
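A tiny demonstration of this lazy copying, assuming a POSIX system (os.fork is unavailable on Windows):

```python
import os

value = [42]  # created before the fork, so both processes start with it

pid = os.fork()
if pid == 0:
    # child: this write triggers copy-on-write on the touched pages only
    value[0] = 99
    os._exit(0)

os.waitpid(pid, 0)
# the parent's copy is untouched: the address space was duplicated lazily
print(value[0])
```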

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c7sOFLwB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/655/1%2AB3jzse_G1dIwL5US0LlStg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c7sOFLwB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/655/1%2AB3jzse_G1dIwL5US0LlStg.jpeg" alt=""&gt;&lt;/a&gt;The new process generated from fork() keeps the same memory addresses&lt;/p&gt;

&lt;p&gt;This is possible thanks to the concept of &lt;a href="https://en.wikipedia.org/wiki/Virtual_memory"&gt;virtual memory&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let’s take a small detour just to refresh your memory on some of the underlying concepts, feel free to skip this section if it’s old news to you.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  So how can you have two processes with the exact same memory addresses holding different values?
&lt;/h3&gt;

&lt;p&gt;Your process does not interact directly with your computer RAM, in fact, the OS abstracts memory through a mechanism called Virtual Memory. This has many advantages like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can use more memory than the available RAM in your system (using disk)&lt;/li&gt;
&lt;li&gt;Memory address isolation and protection from other processes&lt;/li&gt;
&lt;li&gt;Contiguous address space&lt;/li&gt;
&lt;li&gt;No need to manage shared memory directly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NgR6uGQ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/631/1%2Ayt13_erSJUJJuavgm_Wsnw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NgR6uGQ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/631/1%2Ayt13_erSJUJJuavgm_Wsnw.jpeg" alt=""&gt;&lt;/a&gt;Virtual Memory vs Physical Memory&lt;/p&gt;

&lt;p&gt;In the above picture, you can see two independent processes that have their isolated memory space.&lt;br&gt;&lt;br&gt;
Each process has its contiguous address space and does not need to manage where each &lt;a href="https://en.wikipedia.org/wiki/Paging"&gt;page&lt;/a&gt; is located.&lt;/p&gt;

&lt;p&gt;You probably noticed some of the memory pages are located on disk. This can happen if your process never had to access that page since it was started (the OS will only load pages into RAM when a process needs them) or if those pages were evicted from RAM because the OS needed that space for other processes.&lt;/p&gt;

&lt;p&gt;When the process tries to access a page, the OS will serve it directly from RAM if it is already loaded or fetch it from disk, load it into RAM and then serve it to the process, with the only difference being the latency.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sorry for the detour, now let’s get back to our main topic!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After you fork(), you end up with two processes, a child and a parent that share most of their memory until one of them needs to write to any of the shared memory pages.&lt;/p&gt;

&lt;p&gt;This approach is called copy on write (&lt;a href="https://en.wikipedia.org/wiki/Copy-on-write"&gt;COW&lt;/a&gt;), and this avoids having the OS duplicating the entire process memory right from the beginning, thus saving memory and speeding up the process creation.&lt;/p&gt;

&lt;p&gt;COW works by marking those pages of memory as read-only and keeping a count of the number of references to the page. When data is written to these pages, the kernel intercepts the write attempt and allocates a new physical page, initialized with the copy-on-write data. The kernel then updates the page table with the new (writable) page, decrements the number of references, and performs the write.&lt;/p&gt;
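&lt;p&gt;To make those mechanics concrete, here’s a minimal, POSIX-only sketch (fork() is not available on Windows) of a child reading data allocated before the fork:&lt;/p&gt;

```python
import os

# Allocated before fork(): both processes will see this list at the
# same virtual addresses, backed by the same physical pages.
data = list(range(1_000_000))

pid = os.fork()
if pid == 0:
    # Child: in CPython, merely reading is enough to trigger COW on the
    # touched pages, because each object's refcount header gets updated.
    ok = (data[0] + data[-1] == 999_999)
    os._exit(0 if ok else 1)
else:
    _, status = os.waitpid(pid, 0)
    exit_code = os.waitstatus_to_exitcode(status)  # Python 3.9+
```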

&lt;blockquote&gt;
&lt;p&gt;What this means is that one easy way to avoid bloating your memory is to make sure you load everything you intend to share between processes into memory before you fork().&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you’re using gunicorn to serve your API, this means using the &lt;a href="https://docs.gunicorn.org/en/stable/settings.html#preload-app"&gt;preload&lt;/a&gt; parameter, for example.&lt;br&gt;&lt;br&gt;
Not only can you avoid duplicating memory, but it will also avoid costly &lt;a href="https://en.wikipedia.org/wiki/Inter-process_communication"&gt;IPC&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Loading shared read-only objects before the fork() works great for “well behaved” languages since those object pages will never get copied, unfortunately with python, that’s not the case.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of Python’s GC strategies is &lt;a href="https://en.wikipedia.org/wiki/Reference_counting"&gt;reference counting&lt;/a&gt;, and Python keeps the reference count in each object’s header.&lt;br&gt;&lt;br&gt;
What this means in practice is that each time you read said object, you also write to it.&lt;/p&gt;
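&lt;p&gt;You can watch this happen with sys.getrefcount:&lt;/p&gt;

```python
import sys

x = object()
before = sys.getrefcount(x)
y = x  # a plain read/bind of the object...
after = sys.getrefcount(x)
# ...bumped the counter stored in the object's header: a write,
# which is exactly what invalidates the COW page sharing.
```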

&lt;p&gt;Be it using gunicorn with the preload parameter or just loading your data and then forking using the multiprocessing package, you’ll notice that, after a while, your memory usage bloats to almost 1:1 with the number of processes. This is the work of the GC.&lt;/p&gt;

&lt;p&gt;I have some good and bad news… you’re &lt;a href="https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf"&gt;not alone&lt;/a&gt; in this and it will be a bit more trouble but there are some workarounds to the issue.&lt;/p&gt;

&lt;p&gt;Let’s establish a baseline and run some benchmarks first, and then explore our options. To run these benchmarks, I created a small Flask server with gunicorn to fork the process into 3 workers. You can check the script &lt;a href="https://gist.github.com/lsena/4c39c61ae5900d662f74a5a479b78411"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ldhNfppQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A79bJrzXM2u8Gqmf3IlwXuA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ldhNfppQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A79bJrzXM2u8Gqmf3IlwXuA.png" alt=""&gt;&lt;/a&gt;memory usage multiplies with number of workers&lt;/p&gt;

&lt;p&gt;In the above chart, gunicorn forks before running the server code, which means each worker will run this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;self.big_data = [item _for_ item _in_ range(10000000)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, memory usage grows linearly with each worker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aZfFc5bH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AJJdN10XUMUxp5hC7Nh0l0A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aZfFc5bH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AJJdN10XUMUxp5hC7Nh0l0A.png" alt=""&gt;&lt;/a&gt;memory usage doesn’t change with the number of workers&lt;/p&gt;

&lt;p&gt;In the above chart, since I’m using the preload option, gunicorn will load everything before forking. We can see COW in action here since the memory usage stays constant.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vf-cPucI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Ah4VFHlEqkh_Ll5umytVURQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vf-cPucI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Ah4VFHlEqkh_Ll5umytVURQ.png" alt=""&gt;&lt;/a&gt;Memory multiples as soon as a worker loops through their copy of the shared list&lt;/p&gt;

&lt;p&gt;Unfortunately, as we can see here, as soon as each worker needs to read the shared data, the GC will write into those pages to update the reference counts, triggering a copy-on-write.&lt;br&gt;&lt;br&gt;
In the end, we end up with the same memory usage as if we didn’t use the preload option!&lt;/p&gt;

&lt;p&gt;Ok, we have our baseline, how can we improve?&lt;/p&gt;
&lt;h3&gt;
  
  
  Using joblib
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZoEmUKgY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AjvJdDv9h4qkCJPu_r8Vzeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZoEmUKgY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AjvJdDv9h4qkCJPu_r8Vzeg.png" alt=""&gt;&lt;/a&gt;A very small difference in memory usage after access&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/joblib/joblib"&gt;Joblib&lt;/a&gt; is a python library that is mainly used for data serialization and parallel work. One really good thing about it is that it enables easy memory savings since it won’t COW when you access data loaded by this package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_import_ joblib
# previously created with joblib.dump()
self.big_data = joblib.load('test.pkl') # big_data is a big list()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using numpy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i6x0heDV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A4Tm50nA8Q-L42Lz8auszTA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i6x0heDV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A4Tm50nA8Q-L42Lz8auszTA.png" alt=""&gt;&lt;/a&gt;A very small difference in memory usage after access&lt;/p&gt;

&lt;p&gt;If you’re doing data science, I have really good news for you! You get memory savings for “free” just by using numpy data structures. This also applies if you use pandas or another library, as long as the inner data structure is a numpy array.&lt;/p&gt;

&lt;p&gt;The reason for this is how numpy manages memory. Since the package is basically C with Python bindings, it has the liberty (and responsibility) of managing everything without the interference of CPython.&lt;br&gt;&lt;br&gt;
Its authors made the clever choice of not saving the reference counts in the same pages where those large data structures are kept, avoiding COW when you access them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_import_ numpy _as_ np
self.big_data = np.array([[item, item] _for_ item _in_ range(10000000)])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using mmap
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WoRDHC36--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AK9m3pJYtGKRchQkgFzWX7Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WoRDHC36--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AK9m3pJYtGKRchQkgFzWX7Q.png" alt=""&gt;&lt;/a&gt;Zero overhead in memory usage&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Mmap"&gt;mmap&lt;/a&gt; is a POSIX-compliant Unix system call that maps files or devices into memory. This allows you to interact with huge files that exist on disk without having to load them into memory as a whole.&lt;/p&gt;

&lt;p&gt;Another big advantage is that you can even create a block of shared “unmanaged” memory without a file backing it, by passing -1 instead of a file descriptor, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_import_ mmap
mmap.mmap(-1, length=....)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another great advantage is that you can write to it as well without incurring COW. As long as you handle concurrency yourself (e.g. with locks), it is an efficient way to share memory/data between processes, although it’s probably easier/safer to use &lt;a href="https://docs.python.org/3/library/multiprocessing.shared_memory.html"&gt;multiprocessing.shared_memory&lt;/a&gt;.&lt;/p&gt;
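&lt;p&gt;A minimal sketch of the multiprocessing.shared_memory route (a single process here for brevity; a second process would attach with the same name):&lt;/p&gt;

```python
from multiprocessing import shared_memory

# Create a named block of shared memory backed by the OS, not by a file.
shm = shared_memory.SharedMemory(create=True, size=1024)
shm.buf[:5] = b"hello"

# Another process (or, as here, the same one) attaches by name:
# no copy, no pickling, no IPC round trip.
other = shared_memory.SharedMemory(name=shm.name)
result = bytes(other.buf[:5])

other.close()
shm.close()
shm.unlink()  # free the block once every process is done with it
```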

&lt;p&gt;How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Python will generally copy shared data to each process when you access it&lt;/li&gt;
&lt;li&gt;“Preload” is a great way to save memory if you need to share a big read-only data structure in your API&lt;/li&gt;
&lt;li&gt;To avoid COW when you read data, you’ll need to use joblib, numpy, mmap, shared_memory or similar&lt;/li&gt;
&lt;li&gt;Sharing data instead of communicating data between processes can save you a lot of latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stay tuned for the next post. Follow so you won’t miss it!&lt;/p&gt;

</description>
      <category>memorymanagement</category>
      <category>python</category>
      <category>numpy</category>
      <category>performance</category>
    </item>
    <item>
      <title>Gunicorn Worker Types: How to choose the right one</title>
      <dc:creator>Luis Sena</dc:creator>
      <pubDate>Mon, 25 Jan 2021 07:29:24 +0000</pubDate>
      <link>https://dev.to/lsena/gunicorn-worker-types-how-to-choose-the-right-one-4n2c</link>
      <guid>https://dev.to/lsena/gunicorn-worker-types-how-to-choose-the-right-one-4n2c</guid>
      <description>&lt;p&gt;Scale your wsgi project to the next level by leveraging everything Gunicorn has to offer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AU3OFfUPCKV7qMmLRRiiYDA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AU3OFfUPCKV7qMmLRRiiYDA.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article assumes you’re using a sync framework like flask or Django and won’t explore the possibility of using the async/await pattern.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;First, let’s briefly discuss how python handles concurrency and parallelism.&lt;/p&gt;

&lt;p&gt;Python never runs more than 1 thread at a time per process because of the &lt;a href="https://wiki.python.org/moin/GlobalInterpreterLock" rel="noopener noreferrer"&gt;GIL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Even if you have 100 threads inside your process, the GIL will only allow a single thread to run at the same time. That means that, at any time, 99 of those threads are paused and 1 thread is working. The GIL is responsible for that orchestration.&lt;/p&gt;

&lt;p&gt;To get around this limitation, we can use Gunicorn. From the docs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Gunicorn is based on the pre-fork worker model. This means that there is a central master process that manages a set of worker processes. The master never knows anything about individual clients. All requests and responses are handled completely by worker processes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means that Gunicorn will spawn the specified number of individual processes and load your application into each process/worker allowing parallel processing for your python application.&lt;/p&gt;

&lt;p&gt;Since one size will never fit everyone’s needs, it offers different worker types in order to suit a broader range of use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  sync
&lt;/h3&gt;

&lt;p&gt;This is the default worker class. Each process will handle 1 request at a time and you can use the &lt;em&gt;-w&lt;/em&gt; parameter to set the number of workers.&lt;/p&gt;

&lt;p&gt;The recommendation for the number of workers is 2–4 x $(NUM_CORES), although it will depend on how your application works.&lt;/p&gt;
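&lt;p&gt;As a sketch, that rule of thumb translated into a gunicorn.conf.py (which is itself a Python file; values are illustrative, tune against your own traffic):&lt;/p&gt;

```python
# gunicorn.conf.py -- illustrative sync-worker setup.
import multiprocessing

worker_class = "sync"
# The 2-4 x cores rule of thumb; a common concrete form is 2n + 1.
workers = multiprocessing.cpu_count() * 2 + 1
timeout = 30  # seconds before a stuck sync worker is killed and restarted
```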

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your work is almost entirely CPU bound;&lt;/li&gt;
&lt;li&gt;Low to zero I/O operations (this includes database access, network requests, etc).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Signs to look for in production:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitor CPU usage and incoming requests to make sure you have the right average number of processes for your machine size and also request patterns.&lt;/p&gt;

&lt;p&gt;If you have too many processes, it can slow down your average latency since it will force a lot of context switching on your machine’s CPU.&lt;/p&gt;

&lt;p&gt;If you see a lot of timeout errors between your reverse proxy (e.g. nginx) and your workers, it’s a sign that you don’t have enough concurrency to handle your traffic patterns/load.&lt;/p&gt;

&lt;h3&gt;
  
  
  gthread
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;If you try to use the sync worker type and set the threads setting to more than 1, the gthread worker type will be used instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you use gthread, Gunicorn will allow each worker to have multiple threads. In this case, the Python application is loaded once per worker, and each of the threads spawned by the same worker shares the same memory space.&lt;/p&gt;

&lt;p&gt;Those threads will be at the mercy of the GIL, but it’s still useful for when you have some I/O blocking happening. It will allow you to handle more concurrency without increasing your memory too much.&lt;/p&gt;

&lt;p&gt;The recommendation for the total number of parallel requests (workers × threads) is still the same.&lt;br&gt;&lt;br&gt;
This is probably the most used configuration you’ll see out in the wild.&lt;/p&gt;
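&lt;p&gt;A gthread sketch in the same gunicorn.conf.py terms (illustrative values); total concurrency is workers × threads:&lt;/p&gt;

```python
# gunicorn.conf.py -- illustrative gthread setup: fewer processes,
# each handling several requests concurrently via threads.
import multiprocessing

worker_class = "gthread"
workers = multiprocessing.cpu_count()
threads = 4  # each worker handles up to 4 requests at the same time
```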

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Moderate I/O operations;&lt;/li&gt;
&lt;li&gt;Moderate CPU usage;&lt;/li&gt;
&lt;li&gt;You’re using packages/extensions that are not patched to run async and/or are unable to patch them yourself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Signs to look for in production:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ones I described for the sync worker type.&lt;/p&gt;

&lt;p&gt;…with the caveat of balancing processes vs. threads. That balance will depend a lot on your usage patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  eventlet/gevent
&lt;/h3&gt;

&lt;p&gt;Eventlet and gevent make use of “green threads” or “pseudo threads” and are based on &lt;a href="https://greenlet.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;greenlet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In practice, if your application’s work is mainly I/O bound, this allows it to scale to potentially thousands of concurrent requests on a single process.&lt;/p&gt;

&lt;p&gt;Even with the rise of async frameworks (fastapi, sanic, etc), this is still relevant today since it allows you to optimize for I/O without having the extra code complexity.&lt;/p&gt;

&lt;p&gt;The way they manage to do it is by “&lt;a href="https://en.wikipedia.org/wiki/Monkey_patch" rel="noopener noreferrer"&gt;monkey patching&lt;/a&gt;” your code, mainly replacing blocking parts with compatible cooperative counterparts from the gevent package.&lt;/p&gt;

&lt;p&gt;It uses epoll, kqueue, or libevent for highly scalable non-blocking I/O. Coroutines let the developer keep a blocking style of programming similar to threading, while providing the benefits of non-blocking I/O.&lt;/p&gt;

&lt;p&gt;This is usually the most efficient way to run your django/flask/etc web application, since most of the time the bulk of the latency comes from I/O related work.&lt;/p&gt;

&lt;p&gt;That being said, it can be tricky to configure 100% correctly, and if you’re not serving hundreds or more requests/sec, it’s probably easier to just use the gthread worker class.&lt;/p&gt;
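&lt;p&gt;For reference, a minimal gevent setup in gunicorn.conf.py form (illustrative values; gunicorn’s gevent worker class takes care of monkey patching your application for you):&lt;/p&gt;

```python
# gunicorn.conf.py -- illustrative gevent configuration. Each worker is
# still a process; worker_connections caps the number of greenlets
# (in-flight requests) a single worker will juggle cooperatively.
import multiprocessing

worker_class = "gevent"
workers = multiprocessing.cpu_count()
worker_connections = 1000
```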

&lt;p&gt;&lt;strong&gt;Signs to look for in production:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make sure all parts of your code cooperate with these async frameworks (i.e. are properly patched).
Without that, you could have blocked threads sitting idle, unable to execute work (like accepting new requests or answering previously accepted requests whose I/O calls have finished).
In production, if your CPU usage is low but you’re seeing a lot of timeouts in your nginx logs, there’s a good chance that’s happening.
You should audit this before deploying to production (I’ll describe how to handle this later in this post).&lt;/li&gt;
&lt;li&gt;Connections to your databases. If you have thousands of concurrent connections and you’re using a DBMS like PostgreSQL without a connection pooler, chances are you’re going to have a bad time (I’ll describe how to handle this later in this post).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  tornado
&lt;/h3&gt;

&lt;p&gt;There’s also a Tornado worker class. It can be used to write applications using the Tornado framework. Although the Tornado workers are capable of serving a WSGI application, this is not a recommended configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tips and best practices when using the “green thread” worker types
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;I’ll focus on &lt;a href="http://www.gevent.org/intro.html" rel="noopener noreferrer"&gt;gevent&lt;/a&gt; instead of eventlet since it has become the popular choice.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Make sure everything on your project is gevent friendly. This includes packages and drivers. I’ll list some of the most used packages and how to patch them if needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The official package is psycopg2, but it’s not prepared to be patched by gevent on its own.&lt;br&gt;&lt;br&gt;
You also need psycogreen:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/psycopg/psycogreen/" rel="noopener noreferrer"&gt;psycopg/psycogreen&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MySQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The recommended package is PyMySQL and it is gevent friendly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PyMySQL/PyMySQL" rel="noopener noreferrer"&gt;PyMySQL/PyMySQL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The recommended package is redis-py and it is gevent friendly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/andymccurdy/redis-py" rel="noopener noreferrer"&gt;andymccurdy/redis-py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The recommended package is PyMongo and it is gevent friendly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/mongodb/mongo-python-driver" rel="noopener noreferrer"&gt;mongodb/mongo-python-driver&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Elasticsearch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The recommended package is elasticsearch-py and it is gevent friendly.&lt;br&gt;&lt;br&gt;
Quote from a maintainer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

The library itself just passes whatever is returned from the connection class. It uses standard sockets by default (via urllib3) so it can be made compatible by monkey patching. Alternatively you can create your own connection_class and plug it in.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://github.com/elastic/elasticsearch-py" rel="noopener noreferrer"&gt;elastic/elasticsearch-py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cassandra&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The recommended package is from datastax and it is gevent friendly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/datastax/python-driver" rel="noopener noreferrer"&gt;datastax/python-driver&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection Pooling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One thing to take into consideration when using gevent is to understand that it’s really easy to end up with a lot of concurrent connections to, for example, your database. For some DBMS like PostgreSQL, that can be really dangerous.&lt;br&gt;&lt;br&gt;
The standard practice for these cases is to use a connection pool. In the case of PostgreSQL, the SQLAlchemy framework or PgBouncer will work very well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blocked thread monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s really important to make sure parts of your code are not blocking a greenlet from returning to the hub.&lt;/p&gt;

&lt;p&gt;Fortunately, since gevent version 1.3, it’s simple to monitor using the monitor_thread property, and you can even enable it inside your unit tests:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.gevent.org/configuration.html#gevent._config.Config.monitor_thread" rel="noopener noreferrer"&gt;gevent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s also a good idea to have it enabled in your development environment, since some blocks might be missed during your CI runs because it’s common to mock some of the I/O.&lt;/p&gt;
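&lt;p&gt;A sketch of enabling it (assuming gevent ≥ 1.3; the settings must be applied before the hub starts):&lt;/p&gt;

```python
# Development/test sketch: ask gevent to run a background monitor
# thread that reports greenlets blocking the event loop for too long.
import gevent

gevent.config.monitor_thread = True
gevent.config.max_blocking_time = 0.1  # seconds before a block is reported
```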

&lt;h3&gt;
  
  
  Conclusions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Gunicorn/wsgi is still a valid choice even with the rise of async frameworks like fastapi and sanic;&lt;/li&gt;
&lt;li&gt;gthread is usually the preferred worker type for many due to its ease of configuration coupled with the ability to scale concurrency without bloating your memory too much;&lt;/li&gt;
&lt;li&gt;gevent is the best choice when you need concurrency and most of your work is I/O bound (network calls, file access, databases, etc…).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;p&gt;A Deep Dive into Gunicorn and the Python GIL: &lt;a href="https://luis-sena.medium.com/gunicorn-vs-python-gil-221e673d692" rel="noopener noreferrer"&gt;Gunicorn vs Python GIL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!&lt;/p&gt;

&lt;p&gt;Stay tuned for the next post. Follow so you won’t miss it!&lt;/p&gt;

</description>
      <category>python</category>
      <category>gunicorn</category>
      <category>softwaredevelopment</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
