<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ming</title>
    <description>The latest articles on DEV Community by Ming (@keming).</description>
    <link>https://dev.to/keming</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F372112%2Ff5920f16-2073-48e4-a997-89d8e21b887a.jpeg</url>
      <title>DEV Community: Ming</title>
      <link>https://dev.to/keming</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/keming"/>
    <language>en</language>
    <item>
      <title>Lessons learned from improving a Rust program</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Sun, 13 Oct 2024 10:29:29 +0000</pubDate>
      <link>https://dev.to/keming/improve-an-algorithm-performance-step-by-step-1jnf</link>
      <guid>https://dev.to/keming/improve-an-algorithm-performance-step-by-step-1jnf</guid>
      <description>&lt;p&gt;Recently, I've been working on a new approximate nearest neighbor search algorithm called &lt;a href="https://arxiv.org/abs/2405.12497" rel="noopener noreferrer"&gt;RaBitQ&lt;/a&gt;. The author has already provided a &lt;a href="https://github.com/gaoj0017/RaBitQ" rel="noopener noreferrer"&gt;C++ implementation&lt;/a&gt; that runs quite fast. I tried to &lt;a href="https://github.com/kemingy/rabitq" rel="noopener noreferrer"&gt;rewrite it in Rust&lt;/a&gt; (yet another RiiR). However, I found that my implementation was much slower than the original one. Here is how I improve the performance step by step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prepare the environment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Datasets
&lt;/h3&gt;

&lt;p&gt;The most important thing is to have some reasonable datasets. Since the paper already demonstrates results on the &lt;code&gt;sift_dim128_1m_l2&lt;/code&gt; and &lt;code&gt;gist_dim960_1m_l2&lt;/code&gt; datasets, 128 and 960 dimensions are typical, and 1_000_000 vectors should be sufficient for benchmarking purposes, so I decided to use them as well. The datasets can be downloaded from &lt;a href="http://corpus-texmex.irisa.fr/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. (Yes, I know this site doesn't have TLS and only provides FTP downloads.)&lt;/p&gt;

&lt;p&gt;The format used by these datasets is called &lt;code&gt;fvecs/ivecs&lt;/code&gt;, which is a common vector format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;dim &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;vector &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;dim &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;vector &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;dim &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;vector &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can get the read/write script from my &lt;a href="https://gist.github.com/kemingy/2f503fcfff86b9e0197e975c02359157" rel="noopener noreferrer"&gt;gist&lt;/a&gt;.&lt;/p&gt;
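
&lt;p&gt;As a rough sketch of what the reading side looks like (this is not the gist's exact code; it assumes little-endian data and parses from an in-memory buffer):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;fn parse_fvecs(data: &amp;amp;[u8]) -&amp;gt; Vec&amp;lt;Vec&amp;lt;f32&amp;gt;&amp;gt; {
    let mut vectors = Vec::new();
    let mut offset = 0;
    while offset &amp;lt; data.len() {
        // each record starts with the dimension as a little-endian u32
        let dim = u32::from_le_bytes(data[offset..offset + 4].try_into().unwrap()) as usize;
        offset += 4;
        let mut vector = Vec::with_capacity(dim);
        for _ in 0..dim {
            vector.push(f32::from_le_bytes(data[offset..offset + 4].try_into().unwrap()));
            offset += 4;
        }
        vectors.push(vector);
    }
    vectors
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;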

&lt;h3&gt;
  
  
  Profiling tool
&lt;/h3&gt;

&lt;p&gt;I use &lt;a href="https://github.com/mstange/samply" rel="noopener noreferrer"&gt;samply&lt;/a&gt; to profile the Rust code. It has a nice integration with the &lt;a href="https://profiler.firefox.com/" rel="noopener noreferrer"&gt;Firefox Profiler&lt;/a&gt;. You can also share the profiling results with others by uploading them to the cloud. Here is &lt;a href="https://share.firefox.dev/3Y4Hppz" rel="noopener noreferrer"&gt;an example of the C++ version profiling on GIST&lt;/a&gt;. The FlameGraph and CallTree are the most common views. Remember to grant the performance event permission and increase the &lt;code&gt;mlock&lt;/code&gt; limit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'1'&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /proc/sys/kernel/perf_event_paranoid
&lt;span class="nb"&gt;sudo &lt;/span&gt;sysctl kernel.perf_event_mlock_kb&lt;span class="o"&gt;=&lt;/span&gt;2048
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://godbolt.org/" rel="noopener noreferrer"&gt;GodBolt&lt;/a&gt; compiler explorer is also useful for comparing the assembly function code between C++ and Rust.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cargo profile
&lt;/h3&gt;

&lt;p&gt;To include the debug information in the release build, you can add another profile to the &lt;code&gt;Cargo.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[profile.perf]&lt;/span&gt;
&lt;span class="py"&gt;inherits&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"release"&lt;/span&gt;
&lt;span class="py"&gt;debug&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;codegen-units&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compilation time and runtime speed both greatly affect the profiling experience.&lt;/p&gt;
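
&lt;p&gt;Assuming the extra profile above is named &lt;code&gt;perf&lt;/code&gt; and the binary is called &lt;code&gt;rabitq&lt;/code&gt; (adjust to your own crate), building and profiling looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo build --profile perf
samply record ./target/perf/rabitq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;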

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cargo build&lt;/code&gt; compiles quickly, but the unoptimized code may be slower than pure Python&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cargo build --release&lt;/code&gt; runs fast but it might take a long time to compile&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For benchmarking, we have no choice but to use &lt;code&gt;opt-level = 3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I saw some advice to use the following settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;codegen-units&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;lto&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"fat"&lt;/span&gt;
&lt;span class="py"&gt;panic&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"abort"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my case, this only slows down the compilation speed and doesn't improve the performance at all.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark tool
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/bheisler/criterion.rs" rel="noopener noreferrer"&gt;Criterion&lt;/a&gt; is a good statistics-driven benchmark tool. I create another &lt;a href="https://github.com/kemingy/rs_bench" rel="noopener noreferrer"&gt;repo&lt;/a&gt; to store all the related benchmark codes. It turns out that I should put them in the same repo.&lt;/p&gt;

&lt;p&gt;One thing to note is that the benchmark results are not very stable. I have seen &lt;strong&gt;&lt;code&gt;±10%&lt;/code&gt;&lt;/strong&gt; differences without modifying the code. If you're benchmarking on a laptop, this can be even worse, since the CPU might be underclocked due to high temperature.&lt;/p&gt;

&lt;p&gt;I suggest benchmarking the function with several different parameters. In this case, I use different vector dimensions. If the results for all the dimensions are positive, it usually means that the improvement is effective.&lt;/p&gt;
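
&lt;p&gt;A minimal sketch of such a parameterized benchmark with Criterion (the &lt;code&gt;l2_squared&lt;/code&gt; function here is just a placeholder for whatever is being measured):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn l2_squared(x: &amp;amp;[f32], y: &amp;amp;[f32]) -&amp;gt; f32 {
    x.iter().zip(y.iter()).map(|(a, b)| (a - b) * (a - b)).sum()
}

fn bench_l2(c: &amp;amp;mut Criterion) {
    // run the same benchmark over several typical dimensions
    for dim in [128usize, 960] {
        let x = vec![1.0f32; dim];
        let y = vec![2.0f32; dim];
        c.bench_function(&amp;amp;format!("l2_squared_dim{dim}"), |b| {
            b.iter(|| l2_squared(black_box(&amp;amp;x), black_box(&amp;amp;y)))
        });
    }
}

criterion_group!(benches, bench_l2);
criterion_main!(benches);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;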

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;Remember to add some metrics from the start. Many bugs and performance issues can be found by checking the metrics. I use &lt;code&gt;AtomicU64&lt;/code&gt; directly since the current requirements are simple. I may switch to the &lt;a href="https://github.com/prometheus/client_rust" rel="noopener noreferrer"&gt;Prometheus metrics&lt;/a&gt; later.&lt;/p&gt;

&lt;p&gt;Note that too many metrics/logging/traces can also affect the performance. So be careful when adding them.&lt;/p&gt;
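
&lt;p&gt;A minimal sketch of what such counters look like (the metric names are made up for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::sync::atomic::{AtomicU64, Ordering};

// global counters; relaxed ordering is enough for simple statistics
static QUERY_COUNT: AtomicU64 = AtomicU64::new(0);
static PRECISE_DISTANCE_COUNT: AtomicU64 = AtomicU64::new(0);

fn record_query() {
    QUERY_COUNT.fetch_add(1, Ordering::Relaxed);
}

fn snapshot() -&amp;gt; (u64, u64) {
    (
        QUERY_COUNT.load(Ordering::Relaxed),
        PRECISE_DISTANCE_COUNT.load(Ordering::Relaxed),
    )
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;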

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;During the benchmark, I noticed that the end-to-end QPS was extremely unstable. I could get a &lt;strong&gt;15%&lt;/strong&gt; improvement or deterioration the next morning without recompiling the code. It turned out that the CPUs were not completely idle: I had VSCode + Rust Analyzer running in the background. They don't seem to consume much CPU, but they affect the benchmark results heavily, even though I'm using an &lt;a href="https://www.intel.com/content/www/us/en/products/sku/230500/intel-core-i713700k-processor-30m-cache-up-to-5-40-ghz/specifications.html" rel="noopener noreferrer"&gt;Intel Core i7-13700K&lt;/a&gt;, which has 8 performance cores and 8 efficient cores, and the program is single-threaded.&lt;/p&gt;

&lt;p&gt;I use &lt;a href="https://www.man7.org/linux/man-pages/man1/taskset.1.html" rel="noopener noreferrer"&gt;&lt;code&gt;taskset&lt;/code&gt;&lt;/a&gt; to bind the process to a specific CPU. This way it won't be affected by mixed cores scheduling.&lt;/p&gt;
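
&lt;p&gt;For example, to pin the benchmark to one performance core (the core index and binary name here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;taskset -c 0 ./target/perf/rabitq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;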

&lt;p&gt;Note that Intel Core 13th/14th gen CPUs are affected by an instability problem caused by excessively high voltage. I have fixed this in the BIOS.&lt;/p&gt;

&lt;p&gt;Cloud VMs may not be affected by the CPU temperature, but the cloud providers may have their own CPU throttling and overbooking policies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step by Step Improvement
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Start with a naive implementation
&lt;/h3&gt;

&lt;p&gt;My &lt;a href="https://github.com/kemingy/rabitq/tree/dbfd54bd5d739b0729dc28e6fbd8d5413b019561" rel="noopener noreferrer"&gt;first release&lt;/a&gt; implemented the RaBitQ algorithm based on an algebra library called &lt;a href="https://docs.rs/nalgebra" rel="noopener noreferrer"&gt;nalgebra&lt;/a&gt;. The main reason is that I need the QR decomposition to obtain the orthogonal matrix, which is a key step in the RaBitQ algorithm. Also, a mature linear algebra library provides many useful functions for manipulating matrices and vectors, making it easier to implement the algorithm. Imagine implementing an algorithm involving matrix multiplication, projection and decomposition in Python without &lt;code&gt;numpy&lt;/code&gt;: it's a nightmare.&lt;/p&gt;

&lt;p&gt;I thought the performance would be good since &lt;code&gt;nalgebra&lt;/code&gt; is optimized for this kind of scenario. But the benchmark shows that it is much slower than I expected. I guess reimplementing it in &lt;code&gt;numpy&lt;/code&gt; would be much faster :(&lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://share.firefox.dev/3AwiVNR" rel="noopener noreferrer"&gt;profiling&lt;/a&gt;, there are lots of &lt;code&gt;f32::clone()&lt;/code&gt; calls. It takes about 33% of the total time, or 44% if you focus on the &lt;code&gt;query_one&lt;/code&gt; function. This reminds me that I can preallocate the memory for some vectors and reuse it in the iteration, a very common trick. So instead of using &lt;code&gt;(x - y).norm_squared()&lt;/code&gt;, I need to pre-declare another vector that stores the result of &lt;code&gt;(x - y)&lt;/code&gt;, which ends up being &lt;code&gt;x.sub_to(y, &amp;amp;mut z); z.norm_squared()&lt;/code&gt;. See the &lt;a href="https://github.com/kemingy/rabitq/commit/23f9aff4c8b3303c0a03ac9a7472ada8cc915a3b" rel="noopener noreferrer"&gt;commit 23f9aff&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Like most linear algebra libraries, it stores matrices in column-major order, which means iterating over a column can be faster than over a row. It's a bit annoying because I have to transpose the matrix before the iteration, and not all the vector/matrix multiplications can detect the dimension mismatch error (&lt;code&gt;1 x dyn&lt;/code&gt; or &lt;code&gt;dyn x 1&lt;/code&gt;) during compilation.&lt;/p&gt;

&lt;h3&gt;
  
  
  CPU target
&lt;/h3&gt;

&lt;p&gt;RaBitQ uses the binary dot product distance to estimate the approximate distance, which is computed by:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;binary_dot_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;assert_eq!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="nf"&gt;.count_ones&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;res&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I thought the &lt;a href="https://doc.rust-lang.org/std/primitive.u64.html#method.count_ones" rel="noopener noreferrer"&gt;&lt;code&gt;u64::count_ones()&lt;/code&gt;&lt;/a&gt; here would use the intrinsic directly. It turns out that I still need to enable the &lt;code&gt;popcnt&lt;/code&gt; feature during compilation. This can be done with &lt;code&gt;RUSTFLAGS="-C target-feature=+popcnt"&lt;/code&gt;, but I prefer &lt;code&gt;RUSTFLAGS="-C target-cpu=native"&lt;/code&gt;, which enables all the CPU features supported by the current CPU at the cost of making the binary non-portable, which is fine for now. The following sections also require this &lt;code&gt;env&lt;/code&gt; to enable the AVX2 features.&lt;/p&gt;
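
&lt;p&gt;If you don't want to export the environment variable for every command, the same flag can live in &lt;code&gt;.cargo/config.toml&lt;/code&gt; (note that it then applies to all builds in that directory tree):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;[build]
rustflags = ["-C", "target-cpu=native"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;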

&lt;p&gt;You can use the following command to check your CPU features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rustc &lt;span class="nt"&gt;--print&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cfg &lt;span class="nt"&gt;-C&lt;/span&gt; target-cpu&lt;span class="o"&gt;=&lt;/span&gt;native | rg target_feature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  SIMD
&lt;/h3&gt;

&lt;p&gt;The key function for the nearest neighbor search is the distance function, which in this case is the Euclidean distance. We usually use the L2 square distance to avoid the square root computation. The naive implementation is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="nf"&gt;.sub_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;residual&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;residual&lt;/span&gt;&lt;span class="nf"&gt;.norm_squared&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After profiling, I found that it still has &lt;code&gt;f32::clone()&lt;/code&gt;. By checking the source code of &lt;code&gt;nalgebra&lt;/code&gt;, I found that there are many &lt;code&gt;clone&lt;/code&gt; calls for reasons I don't understand. I decided to write the SIMD code by hand. Fortunately, &lt;a href="https://github.com/nmslib/hnswlib" rel="noopener noreferrer"&gt;hnswlib&lt;/a&gt; (a popular HNSW implementation) already implements &lt;a href="https://github.com/nmslib/hnswlib/blob/master/hnswlib/space_l2.h" rel="noopener noreferrer"&gt;this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This eliminates the &lt;code&gt;f32::clone()&lt;/code&gt; in the distance computation and improves the QPS by &lt;strong&gt;28%&lt;/strong&gt; for SIFT. Check the &lt;a href="https://github.com/kemingy/rabitq/commit/5f82fccf8b39964ef1f66e9927fb126fd6886765" rel="noopener noreferrer"&gt;commit 5f82fcc&lt;/a&gt;.&lt;/p&gt;
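
&lt;p&gt;The hand-written AVX2 version follows the same shape as the hnswlib code. Roughly (a sketch of the idea, not the exact commit):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn l2_squared_avx2(x: &amp;amp;[f32], y: &amp;amp;[f32]) -&amp;gt; f32 {
    use std::arch::x86_64::*;
    let mut sum = _mm256_setzero_ps();
    let chunks = x.len() / 8;
    for i in 0..chunks {
        // accumulate (x - y)^2 for 8 lanes at a time
        let a = _mm256_loadu_ps(x.as_ptr().add(i * 8));
        let b = _mm256_loadu_ps(y.as_ptr().add(i * 8));
        let d = _mm256_sub_ps(a, b);
        sum = _mm256_add_ps(sum, _mm256_mul_ps(d, d));
    }
    // horizontal sum of the 8 lanes
    let mut buf = [0.0f32; 8];
    _mm256_storeu_ps(buf.as_mut_ptr(), sum);
    let mut res: f32 = buf.iter().sum();
    // scalar loop for the tail elements
    for i in (chunks * 8)..x.len() {
        let d = x[i] - y[i];
        res += d * d;
    }
    res
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The caller checks &lt;code&gt;is_x86_feature_detected!("avx2")&lt;/code&gt; once and dispatches to this or the scalar version.&lt;/p&gt;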

&lt;p&gt;My CPU doesn't support AVX512, so I use the AVX2 version. You can check the &lt;a href="https://store.steampowered.com/hwsurvey/" rel="noopener noreferrer"&gt;Steam Hardware Stats&lt;/a&gt;, which lists SIMD support under "&lt;em&gt;Other Settings&lt;/em&gt;": &lt;strong&gt;100%&lt;/strong&gt; of users have SSE3, &lt;strong&gt;94.61%&lt;/strong&gt; have AVX2, and only &lt;strong&gt;13.06%&lt;/strong&gt; have AVX512F. Of course, this statistic is biased: most cloud Intel CPUs have AVX512 support, and game players cannot represent all users.&lt;/p&gt;

&lt;p&gt;To use SIMD, the most useful guide is the &lt;a href="https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#" rel="noopener noreferrer"&gt;Intel Intrinsics Guide&lt;/a&gt;. It's better to download the website for offline use, since the online experience is not good. Remember to check the "&lt;strong&gt;latency&lt;/strong&gt;" and "&lt;strong&gt;throughput&lt;/strong&gt;" of the intrinsics, otherwise your code may be slower than the normal version.&lt;/p&gt;

&lt;p&gt;Another resource is the &lt;a href="https://db.in.tum.de/~finis/x86%20intrinsics%20cheat%20sheet%20v1.0.pdf" rel="noopener noreferrer"&gt;x86 Intrinsics Cheat Sheet&lt;/a&gt;. This is good for newbies like me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ashvardanian" rel="noopener noreferrer"&gt;@ashvardanian&lt;/a&gt; has a &lt;a href="https://ashvardanian.com/posts/simsimd-faster-scipy/#tails-of-the-past-the-significance-of-masked-loads" rel="noopener noreferrer"&gt;post&lt;/a&gt; about the "mask load" that solves the tail elements problem (requires AVX512).&lt;/p&gt;

&lt;p&gt;To make the code work on other platforms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[cfg(any(target_arch&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"x86_64"&lt;/span&gt;&lt;span class="nd"&gt;,&lt;/span&gt; &lt;span class="nd"&gt;target_arch&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"x86"&lt;/span&gt;&lt;span class="nd"&gt;))]&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nd"&gt;is_x86_feature_detected!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"avx2"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// AVX2 version&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// normal version&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are some useful crates for writing better &lt;code&gt;cfg&lt;/code&gt; declarations for SIMD, but let's keep it simple for now.&lt;/p&gt;

&lt;h3&gt;
  
  
  More SIMD
&lt;/h3&gt;

&lt;p&gt;SIMD is like a hammer, now I need to find more nails in the code.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rewriting the &lt;code&gt;binarize_vector&lt;/code&gt; function with AVX2 in &lt;a href="https://github.com/kemingy/rabitq/commit/f114fc1ec58686596ade0df02a96fcf04b0bf828" rel="noopener noreferrer"&gt;commit f114fc1&lt;/a&gt; improved the QPS by &lt;strong&gt;32%&lt;/strong&gt; for GIST.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;@andrewaylett pointed out that &lt;code&gt;opt-level=3&lt;/code&gt; can optimize this&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;del&gt;Compared to the original C++ version, this implementation is also branchless.&lt;/del&gt; When enabling &lt;code&gt;opt-level=3&lt;/code&gt;, this can be optimized by the compiler. See the &lt;a href="https://godbolt.org/z/hjP5qjabz" rel="noopener noreferrer"&gt;assembly&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- let shift = if (i / 32) % 2 == 0 { 32 } else { 0 };
&lt;/span&gt;&lt;span class="gi"&gt;+ let shift = ((i &amp;gt;&amp;gt; 5) &amp;amp; 1) &amp;lt;&amp;lt; 5;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;a class="mentioned-user" href="https://dev.to/novax"&gt;@novax&lt;/a&gt; first pointed out that it's equivalent to &lt;code&gt;i &amp;amp; 32&lt;/code&gt;, which is more readable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;See the &lt;a href="https://godbolt.org/z/YbP5vW34q" rel="noopener noreferrer"&gt;assembly&lt;/a&gt; for the difference.&lt;/p&gt;

&lt;p&gt;Well, going branchless doesn't make the overall performance much better since the &lt;code&gt;binarize_vector&lt;/code&gt; function is called only once for each query. But it's a good learning opportunity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalar quantization
&lt;/h3&gt;

&lt;p&gt;To eliminate more &lt;code&gt;f32::clone()&lt;/code&gt; in the code, I decided to replace more &lt;code&gt;nalgebra&lt;/code&gt; functions with the manual implementation. The &lt;code&gt;min&lt;/code&gt; and &lt;code&gt;max&lt;/code&gt; functions are the most common ones. The &lt;code&gt;nalgebra&lt;/code&gt; version is like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;lower_bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;residual&lt;/span&gt;&lt;span class="nf"&gt;.min&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;upper_bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;residual&lt;/span&gt;&lt;span class="nf"&gt;.max&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This can be done by:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;min_max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;min&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;min&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;min&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I used to use &lt;code&gt;f32::min()&lt;/code&gt; and &lt;code&gt;f32::max()&lt;/code&gt; because they are convenient. But for vectors that aren't sorted in ascending/descending order, the &lt;code&gt;if&lt;/code&gt; comparisons perform better.&lt;/p&gt;

&lt;p&gt;Instead of iterating through the vector several times in a function chain, computing the scalar quantization and its sum in separate passes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;y_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;residual&lt;/span&gt;&lt;span class="nf"&gt;.add_scalar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lower_bound&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;one_over_delta&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.rand_bias&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;y_quantized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_scaled&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="nf"&gt;.to_u8&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"convert to u8 error"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;scalar_sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_quantized&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.fold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can do this in one loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0u32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;lower_bound&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;multiplier&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;quantized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;sum&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For scalar quantization, we know the &lt;code&gt;f32&lt;/code&gt; value fits in a &lt;code&gt;u8&lt;/code&gt;, so we can use &lt;code&gt;as u8&lt;/code&gt; instead of &lt;code&gt;to_u8().unwrap()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/kemingy/rabitq/commit/af39c1ce47eb8ea32e11f47b99548e77846397ea" rel="noopener noreferrer"&gt;commit af39c1c&lt;/a&gt; &amp;amp; &lt;a href="https://github.com/kemingy/rabitq/commit/d2d51b0785f0234df4d83a60eea96a36486a1120" rel="noopener noreferrer"&gt;commit d2d51b0&lt;/a&gt; improved the QPS by &lt;strong&gt;31%&lt;/strong&gt; for GIST.&lt;/p&gt;
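&lt;p&gt;To make the idea concrete outside of Rust, here is a minimal Python sketch of the same one-pass scalar quantization (names are illustrative, not the crate's API; &lt;code&gt;bias&lt;/code&gt; stands in for the per-dimension randomization used above):&lt;/p&gt;

```python
def scalar_quantize(vec, bias):
    """Quantize float values into [0, 255] and accumulate the sum
    in the same loop, mirroring the Rust snippet above."""
    lower = min(vec)
    upper = max(vec)
    multiplier = 255.0 / (upper - lower)
    quantized = []
    total = 0
    for v, b in zip(vec, bias):
        # the result is guaranteed to fit in a u8, so a plain cast is safe
        q = int((v - lower) * multiplier + b)
        quantized.append(q)
        total += q
    return quantized, total

# three values spread over [0.0, 2.0] with zero bias
assert scalar_quantize([0.0, 1.0, 2.0], [0.0, 0.0, 0.0]) == ([0, 127, 255], 382)
```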

&lt;p&gt;The following parts can also be rewritten with SIMD, which improves the QPS by &lt;strong&gt;12%&lt;/strong&gt; for GIST:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;min/max: &lt;a href="https://github.com/kemingy/rabitq/commit/c97be68c13c7b4498b564afe3de2a1f6d8bca5ce" rel="noopener noreferrer"&gt;commit c97be68&lt;/a&gt; &amp;amp; &lt;a href="https://github.com/kemingy/rabitq/commit/e5a4af05433bf724da6902d34a745b4b2bdefd8d" rel="noopener noreferrer"&gt;commit e5a4af0&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;scalar quantization: &lt;a href="https://github.com/kemingy/rabitq/commit/28efe097a46696bb1a5469db22e500bafdc04514" rel="noopener noreferrer"&gt;commit 28efe09&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also tried replacing &lt;code&gt;tr_mul&lt;/code&gt; (a vector projection) with SIMD. It turns out that &lt;code&gt;nalgebra&lt;/code&gt; already uses &lt;a href="https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms" rel="noopener noreferrer"&gt;&lt;code&gt;BLAS&lt;/code&gt;&lt;/a&gt; here, so the performance stays the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  Yet another algebra crate: faer
&lt;/h3&gt;

&lt;p&gt;I found another Rust algebra crate called &lt;a href="https://github.com/sarah-quinones/faer-rs" rel="noopener noreferrer"&gt;faer&lt;/a&gt; while investigating the &lt;code&gt;f32::clone()&lt;/code&gt; problem. It's heavily optimized with SIMD and offers better row/column iteration performance. Its QR decomposition is also much faster than &lt;code&gt;nalgebra&lt;/code&gt;'s. &lt;a href="https://github.com/kemingy/rabitq/commit/04118219d28bd0d43594c98c71e752faa81ff79d" rel="noopener noreferrer"&gt;Commit 0411821&lt;/a&gt; makes the training part faster.&lt;/p&gt;

&lt;p&gt;Also, I can now use these vectors as a normal slice without the &lt;code&gt;ColRef&lt;/code&gt; or &lt;code&gt;RowRef&lt;/code&gt; wrapper after &lt;a href="https://github.com/kemingy/rabitq/commit/0d969bdcfb331f87e938e043e01acc648e1cf963" rel="noopener noreferrer"&gt;commit 0d969bd&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I have to admit that if I had used &lt;code&gt;faer&lt;/code&gt; from the beginning, I could have avoided a lot of trouble. Anyway, I learned a lot from this experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Binary dot product
&lt;/h3&gt;

&lt;p&gt;I thought &lt;code&gt;popcnt&lt;/code&gt; had already solved the binary dot product, but the &lt;a href="https://share.firefox.dev/3Yk3Ok8" rel="noopener noreferrer"&gt;FlameGraph&lt;/a&gt; shows that &lt;code&gt;count_ones()&lt;/code&gt; takes only 7% of the time in &lt;code&gt;binary_dot_product&lt;/code&gt;. Although AVX512 has the &lt;code&gt;vpopcntq&lt;/code&gt; instruction, I prefer the AVX2 emulation since AVX2 is more widely available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/komrad36/popcount/blob/master/popcnt.h" rel="noopener noreferrer"&gt;This&lt;/a&gt; is a good reference for the &lt;code&gt;popcnt&lt;/code&gt; implementation with AVX2. The &lt;a href="https://github.com/kemingy/rabitq/commit/edabd4a64c5b8ea2637b5332105638edf16afa7c" rel="noopener noreferrer"&gt;commit edabd4a&lt;/a&gt; re-implement this in Rust which improves the QPS by &lt;strong&gt;11%&lt;/strong&gt; for GIST. This trick only works when the vector has more than 256 dimensions, which means 256 bits for the binary representation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inline
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://doc.rust-lang.org/reference/attributes/codegen.html#the-inline-attribute" rel="noopener noreferrer"&gt;#[inline]&lt;/a&gt; attribute should be used with caution. Adding this attribute to all the SIMD functions improves the QPS by &lt;strong&gt;5%&lt;/strong&gt; for GIST.&lt;/p&gt;

&lt;h3&gt;
  
  
  IO
&lt;/h3&gt;

&lt;p&gt;I need to add some background information here.&lt;/p&gt;

&lt;p&gt;The current implementation is based on the IVF algorithm, which uses &lt;a href="https://en.wikipedia.org/wiki/K-means_clustering" rel="noopener noreferrer"&gt;&lt;em&gt;k&lt;/em&gt;-means&lt;/a&gt; to cluster the vectors and stores the centroids in memory. The query vector is only compared against the clusters with the smallest &lt;code&gt;l2_squared_distance(query, centroid)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A parameter called &lt;code&gt;n_probe&lt;/code&gt; controls how many of the nearest clusters are probed. A larger &lt;code&gt;n_probe&lt;/code&gt; increases the recall but decreases the QPS.&lt;/p&gt;

&lt;p&gt;RaBitQ uses the binary dot product to estimate the approximate distance. If it's smaller than the threshold, it will re-rank with the original L2 squared distance and update the threshold accordingly.&lt;/p&gt;
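&lt;p&gt;A minimal sketch of that filtering loop in Python (&lt;code&gt;approx&lt;/code&gt; and &lt;code&gt;exact&lt;/code&gt; are hypothetical distance functions, not the crate's API):&lt;/p&gt;

```python
def rerank(candidates, approx, exact, threshold):
    """Only compute the expensive exact distance when the cheap
    estimate beats the current threshold; tighten the threshold
    whenever a better exact distance is found."""
    best = None
    for c in candidates:
        if approx(c) < threshold:
            d = exact(c)
            if d < threshold:
                threshold = d
                best = c
    return best, threshold

# with an optimistic estimate, only the true improvement survives
best, threshold = rerank([4.0, 2.0, 3.0], lambda c: c - 0.5, lambda c: c, 3.0)
assert (best, threshold) == (2.0, 2.0)
```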

&lt;p&gt;Previously, I used &lt;a href="https://doc.rust-lang.org/std/primitive.slice.html#method.select_nth_unstable" rel="noopener noreferrer"&gt;&lt;code&gt;slice::select_nth_unstable&lt;/code&gt;&lt;/a&gt;, which selects the n nearest clusters but doesn't sort them. Visiting clusters that are far from the query first increases the re-ranking ratio, which requires more L2 squared distance computations. Re-sorting the selected clusters improved the QPS by &lt;strong&gt;4%&lt;/strong&gt; for GIST.&lt;/p&gt;
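&lt;p&gt;The fix is essentially "select, then sort the selected prefix". In Python the same behavior can be sketched with &lt;code&gt;heapq.nsmallest&lt;/code&gt;, which returns its selection already sorted, i.e. exactly the visiting order we want:&lt;/p&gt;

```python
import heapq

def probe_order(distances, n_probe):
    """Pick the n_probe closest clusters and visit them in ascending
    distance order, so near clusters tighten the threshold early."""
    return heapq.nsmallest(
        n_probe, range(len(distances)), key=lambda i: distances[i]
    )

# cluster 1 (distance 1.0) is probed before cluster 2 (distance 3.0)
assert probe_order([5.0, 1.0, 3.0, 9.0], 2) == [1, 2]
```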

&lt;p&gt;Another trick is to sort the vectors in each cluster by their distance to the centroid; &lt;a href="https://github.com/kemingy/rabitq/commit/ea13ebca46257d7c2e22250fe02a481e7681f0a9" rel="noopener noreferrer"&gt;commit ea13ebc&lt;/a&gt; also improved the QPS by &lt;strong&gt;4%&lt;/strong&gt; for GIST.&lt;/p&gt;

&lt;p&gt;Each vector has some metadata used to estimate its approximate distance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;factor_ip: f32&lt;/li&gt;
&lt;li&gt;factor_ppc: f32&lt;/li&gt;
&lt;li&gt;error: f32&lt;/li&gt;
&lt;li&gt;x_c_distance_square: f32&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Previously, I used four &lt;code&gt;Vec&amp;lt;f32&amp;gt;&lt;/code&gt;s to store them, which is not IO-friendly since the calculation needs element &lt;code&gt;i&lt;/code&gt; of all four. Combining them into one &lt;code&gt;struct&lt;/code&gt; in &lt;a href="https://github.com/kemingy/rabitq/commit/bb440e3e8b150f590523eaa77e7c62165a5ee764" rel="noopener noreferrer"&gt;commit bb440e3&lt;/a&gt; improved the QPS by &lt;strong&gt;2.5%&lt;/strong&gt; for GIST. This works well because it's four f32s, so I can use the C representation directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[derive(Debug,&lt;/span&gt; &lt;span class="nd"&gt;Clone,&lt;/span&gt; &lt;span class="nd"&gt;Copy,&lt;/span&gt; &lt;span class="nd"&gt;Default,&lt;/span&gt; &lt;span class="nd"&gt;Serialize,&lt;/span&gt; &lt;span class="nd"&gt;Deserialize)]&lt;/span&gt;
&lt;span class="nd"&gt;#[repr(C)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Factor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;factor_ip&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;factor_ppc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;error_bound&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;center_distance_square&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, &lt;code&gt;faer&lt;/code&gt; doesn't support u64 vectors, so I had to store the binary representations in a &lt;code&gt;Vec&amp;lt;Vec&amp;lt;u64&amp;gt;&amp;gt;&lt;/code&gt;. Flattening it to a single &lt;code&gt;Vec&amp;lt;u64&amp;gt;&lt;/code&gt; in &lt;a href="https://github.com/kemingy/rabitq/commit/48236b23069db92bdb741fc6693e126b52c397ce" rel="noopener noreferrer"&gt;commit 48236b2&lt;/a&gt; improved the QPS by &lt;strong&gt;2%&lt;/strong&gt; for GIST.&lt;/p&gt;
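&lt;p&gt;The flattened layout keeps all rows contiguous in memory; each row then becomes a fixed-size slice computed from the index (a sketch of the idea, not the crate's code):&lt;/p&gt;

```python
def get_binary_row(flat, words_per_vec, i):
    """Row i of a flattened binary matrix: one contiguous list of
    64-bit words instead of a list of lists."""
    start = i * words_per_vec
    return flat[start : start + words_per_vec]

# 3 vectors of 2 words each, stored back to back
flat = [1, 2, 3, 4, 5, 6]
assert get_binary_row(flat, 2, 1) == [3, 4]
```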

&lt;h3&gt;
  
  
  Const generics
&lt;/h3&gt;

&lt;p&gt;The C++ version uses templates to generate code for different dimensions, and Rust offers the same capability through const generics. I didn't try it, because re-compiling the code for each dimension is only practical in specific settings, such as inside a company with a few fixed dimensions. For a public library, it's better to provide a general solution so users don't have to re-compile it themselves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other tools
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/Shnatsel/bounds-check-cookbook/" rel="noopener noreferrer"&gt;bounds-check-cookbook&lt;/a&gt; provides several examples of how to eliminate bounds checking in safe Rust.&lt;/p&gt;

&lt;p&gt;I tried &lt;a href="https://doc.rust-lang.org/rustc/profile-guided-optimization.html" rel="noopener noreferrer"&gt;PGO&lt;/a&gt; and &lt;a href="https://github.com/llvm/llvm-project/tree/main/bolt" rel="noopener noreferrer"&gt;BOLT&lt;/a&gt; but didn't get any improvement.&lt;/p&gt;

&lt;p&gt;Switching to &lt;a href="https://github.com/tikv/jemallocator" rel="noopener noreferrer"&gt;jemalloc&lt;/a&gt; or &lt;a href="https://github.com/microsoft/mimalloc" rel="noopener noreferrer"&gt;mimalloc&lt;/a&gt; doesn't improve the performance either.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SIMD is awesome when it's used properly&lt;/li&gt;
&lt;li&gt;IO also matters, especially for large datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The current performance matches the C++ version on the GIST dataset. I use more SIMD, while the C++ version relies on const generics.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.algorithmica.org/hpc/algorithms/matmul/" rel="noopener noreferrer"&gt;Algorithmica / HPC&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rust</category>
      <category>nlp</category>
      <category>algorithms</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>User Authorization with Postgres Row Level Security Policy</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Tue, 04 Jun 2024 13:31:02 +0000</pubDate>
      <link>https://dev.to/keming/user-authorization-with-postgres-row-level-security-policy-4g91</link>
      <guid>https://dev.to/keming/user-authorization-with-postgres-row-level-security-policy-4g91</guid>
      <description>&lt;p&gt;Supabase has a &lt;a href="https://github.com/supabase/storage"&gt;storage gateway&lt;/a&gt; that uses &lt;a href="https://www.postgresql.org/docs/current/ddl-rowsecurity.html"&gt;RLS&lt;/a&gt; for authorization.&lt;/p&gt;

&lt;p&gt;It requires a JWT that provides the role information for executing the SQL. Here is an example of the JWT payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"authenticated"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1516239022&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"f918ffd9-a611-4b2a-b4bb-df8f25d7569f"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The storage bucket table is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                        Table "storage.buckets"
   Column   |           Type           | Collation | Nullable | Default 
------------+--------------------------+-----------+----------+---------
 id         | text                     |           | not null | 
 name       | text                     |           | not null | 
 owner      | uuid                     |           |          | 
 created_at | timestamp with time zone |           |          | now()
 updated_at | timestamp with time zone |           |          | now()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You only need to set up the correct RLS policy in the database. Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- generate a UUID as the role name since it needs to match the owner type&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;-- f918ffd9-a611-4b2a-b4bb-df8f25d7569f&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="nv"&gt;"f918ffd9-a611-4b2a-b4bb-df8f25d7569f"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;schema&lt;/span&gt; &lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;"f918ffd9-a611-4b2a-b4bb-df8f25d7569f"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;"f918ffd9-a611-4b2a-b4bb-df8f25d7569f"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- generate another role&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="nv"&gt;"11b795e0-a566-491b-9ee7-62c025175dd8"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;schema&lt;/span&gt; &lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;"11b795e0-a566-491b-9ee7-62c025175dd8"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;"11b795e0-a566-491b-9ee7-62c025175dd8"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;user_record_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;limit_user_crud&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt;
    &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_record_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="nv"&gt;"f918ffd9-a611-4b2a-b4bb-df8f25d7569f"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'one'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'two'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'3'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'three'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- check before the insertion&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'4'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'four'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- ERROR:  new row violates row-level security policy for table "buckets"&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- this returns 3 rows&lt;/span&gt;

&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt; &lt;span class="nv"&gt;"11b795e0-a566-491b-9ee7-62c025175dd8"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- this returns nothing&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'4'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'four'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- success&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'4'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- success&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- delete 0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>postgres</category>
      <category>sql</category>
      <category>database</category>
      <category>security</category>
    </item>
    <item>
      <title>HTTP Rate Limit</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Thu, 04 Jan 2024 06:37:29 +0000</pubDate>
      <link>https://dev.to/keming/http-rate-limit-1hmj</link>
      <guid>https://dev.to/keming/http-rate-limit-1hmj</guid>
      <description>&lt;h2&gt;
  
  
  Draft
&lt;/h2&gt;

&lt;p&gt;The story starts with a &lt;a href="https://youtu.be/BIguvia6AvM?t=1313"&gt;link checker sharing&lt;/a&gt; that mentions the &lt;a href="https://www.ietf.org/archive/id/draft-polli-ratelimit-headers-02.html"&gt;HTTP rate limit header&lt;/a&gt; in the IETF proposed standard.&lt;/p&gt;

&lt;p&gt;Ideally, we expect something like this in the HTTP response headers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   RateLimit-Limit: 10
   RateLimit-Remaining: 1
   RateLimit-Reset: 7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;RateLimit-Reset&lt;/code&gt; specifies the number of &lt;strong&gt;seconds&lt;/strong&gt; remaining in the current time window. It should not be treated as a fixed value.&lt;/p&gt;

&lt;p&gt;It may also contain a &lt;a href="https://datatracker.ietf.org/doc/html/rfc7231#section-7.1.3"&gt;&lt;code&gt;Retry-After&lt;/code&gt;&lt;/a&gt; header, usually with a &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429"&gt;429&lt;/a&gt; status code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ietf-wg-httpapi/ratelimit-headers"&gt;ratelimit-headers&lt;/a&gt; has a test implementation of this draft.&lt;/p&gt;

&lt;p&gt;Sadly, some HTTP APIs do not strictly implement this draft (others may not have these headers at all). You can find different names like &lt;code&gt;X-RateLimit-Reset&lt;/code&gt;, &lt;code&gt;X-RateLimit-Requests-Reset&lt;/code&gt;, &lt;code&gt;X-RateLimit-Reset-After&lt;/code&gt;, etc. Some official SDKs account for these variations.&lt;/p&gt;
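&lt;p&gt;A client that wants to tolerate these variations can probe the known spellings in order, draft name first. A minimal sketch (the header names are ones seen in the wild; real services may use yet other spellings, or epoch timestamps instead of seconds):&lt;/p&gt;

```python
def reset_seconds(headers):
    """Return the rate-limit reset value from the first matching
    header name, or None if no known header is present."""
    for name in (
        "RateLimit-Reset",  # IETF draft name
        "X-RateLimit-Reset",
        "X-RateLimit-Requests-Reset",
        "X-RateLimit-Reset-After",
    ):
        if name in headers:
            return float(headers[name])
    return None

assert reset_seconds({"X-RateLimit-Reset": "7"}) == 7.0
assert reset_seconds({"Content-Type": "text/html"}) is None
```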

&lt;h2&gt;
  
  
  Python &lt;code&gt;httpx&lt;/code&gt; with rate limit
&lt;/h2&gt;

&lt;p&gt;There are already some rate-limit implementations for Python HTTP clients, such as &lt;a href="https://github.com/florimondmanca/aiometer"&gt;aiometer&lt;/a&gt;, but it doesn't fit my use case. Since &lt;a href="https://github.com/encode/httpx/"&gt;&lt;code&gt;httpx&lt;/code&gt;&lt;/a&gt; already has an internal connection pool, it's better to reuse that design.&lt;/p&gt;

&lt;p&gt;By the way, my use case is a web crawler client: I want to request each URL directly in the code (with rate limiting) instead of gathering lots of URLs and mapping over them.&lt;/p&gt;

&lt;p&gt;Here is a simple implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RateLimitTransport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AsyncHTTPTransport&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_per_second&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Async HTTP transport with rate limit.

        Args:
            max_per_second: Maximum number of requests per second.

        Other args are passed to httpx.AsyncHTTPTransport.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_per_second&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;notify_task_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        https://github.com/florimondmanca/aiometer/blob/358976e0b60bce29b9fe8c59807fafbad3e62cbc/src/aiometer/_impl/meters.py#L57
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_running_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;next_start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;until_now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;next_start_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;until_now&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;until_now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_async_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;notify_task_start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;handle_async_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__aenter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;notify_task_start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__aenter__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__aexit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__aexit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
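&lt;p&gt;The waiting logic above can be boiled down to a small standalone sketch in plain &lt;code&gt;asyncio&lt;/code&gt; (no httpx involved; the method name mirrors the transport above). Each call reserves a start slot and pushes &lt;code&gt;next_start_time&lt;/code&gt; forward by one interval, so task starts stay spaced out:&lt;/p&gt;

```python
import asyncio


class RateMeter:
    """Standalone sketch of the waiting logic above (mirrors aiometer's meter).

    Task starts are spaced at least `interval` seconds apart (with one slot
    of burst allowance) by pushing `next_start_time` forward on every call.
    """

    def __init__(self, max_per_second):
        self.interval = 1 / max_per_second
        self.next_start_time = 0.0

    async def notify_task_start(self):
        loop = asyncio.get_running_loop()
        while True:
            now = loop.time()
            next_start_time = max(self.next_start_time, now)
            until_now = next_start_time - now
            # break as soon as the pending delay fits within one interval
            if self.interval >= until_now:
                break
            await asyncio.sleep(until_now - self.interval)
        # reserve the next start slot
        self.next_start_time = max(self.next_start_time, loop.time()) + self.interval


async def demo():
    meter = RateMeter(max_per_second=50)  # one slot every 20 ms
    loop = asyncio.get_running_loop()
    start = loop.time()
    for _ in range(5):
        await meter.notify_task_start()
    return loop.time() - start


elapsed = asyncio.run(demo())
print(f"5 task starts took {elapsed:.3f}s")
```

&lt;p&gt;Note that the first two starts may fire back-to-back (one slot of burst), after which each start waits roughly one interval.&lt;/p&gt;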



&lt;p&gt;You can specify the rate limit when initializing your HTTP client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RateLimitTransport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_per_second&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>http</category>
      <category>webdev</category>
      <category>networking</category>
    </item>
    <item>
      <title>Serving fine-tuned large language model with vLLM</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Sat, 26 Aug 2023 12:50:58 +0000</pubDate>
      <link>https://dev.to/keming/serving-fine-tuned-large-language-model-with-vllm-39a0</link>
      <guid>https://dev.to/keming/serving-fine-tuned-large-language-model-with-vllm-39a0</guid>
      <description>&lt;p&gt;Fine-tuned large language models (LLM) are becoming increasingly popular in AI applications. These powerful language models are widely used to automate a series of tasks, improve customer service, and generate domain-specific content.&lt;/p&gt;

&lt;p&gt;However, serving these fine-tuned LLMs at scale comes with challenges. These models are computationally expensive, and they are much larger than traditional microservices. Both factors make it hard to achieve high-throughput serving and fast cold-start scaling.&lt;/p&gt;

&lt;p&gt;This post shares our experience serving LLMs with vLLM and scaling the service on Modelz.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use vLLM for high throughput LLM serving
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/vllm-project/vllm"&gt;vLLM&lt;/a&gt; is a high-throughput and memory-efficient LLM serving engine. It offers OpenAI compatible API, which makes it easy to be integrated with the existing LLM applications.&lt;/p&gt;

&lt;p&gt;The first hurdle is preparing a GPU environment in which to build and install vLLM. With the help of &lt;a href="https://github.com/tensorchord/envd"&gt;envd&lt;/a&gt;, this can be done in a single file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# syntax=v1
&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"11.8.0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apt_packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"build-essential"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# install torch here to reuse the cache
&lt;/span&gt;    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;python_packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"torch"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# install from source
&lt;/span&gt;    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;python_packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"git+https://github.com/vllm-project/vllm.git"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By running &lt;code&gt;envd up&lt;/code&gt;, you can get into the development environment with everything you need. If you prefer Dockerfile, we also have a &lt;a href="https://github.com/kemingy/vllm-env/blob/main/Dockerfile"&gt;template&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;vLLM already supports many LLMs, such as LLaMA, Falcon, and MPT. However, to serve your own LLM, you may need to provide a model-specific prompt template. To address this, we created a tool called &lt;a href="https://github.com/tensorchord/llmspec"&gt;llmspec&lt;/a&gt;, which provides prompt templates behind an OpenAI-compatible interface. You can build your prompt generator on top of this library.&lt;/p&gt;

&lt;p&gt;To run the vLLM serving in a Kubernetes cluster, there are some necessary configurations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always set &lt;code&gt;--worker-use-ray&lt;/code&gt; to run the model inference in a separate Python process and avoid health-probe failures.&lt;/li&gt;
&lt;li&gt;Provide enough shared memory (at least 30% of RAM).&lt;/li&gt;
&lt;li&gt;Reduce &lt;code&gt;--gpu-memory-utilization&lt;/code&gt; to avoid GPU OOM on long sequences.&lt;/li&gt;
&lt;li&gt;Increase &lt;code&gt;--max-num-batched-tokens&lt;/code&gt; if you need to generate long sequences.&lt;/li&gt;
&lt;/ul&gt;
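&lt;p&gt;Putting those flags together, a launch command might look like the following sketch (the model name and the numeric values are illustrative only; tune them for your own cluster):&lt;/p&gt;

```python
# Illustrative launch command for the vLLM OpenAI-compatible server.
# The flag values below are examples, not recommendations.
args = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "mosaicml/mpt-30b-chat",   # example model
    "--worker-use-ray",                   # inference in a separate process
    "--gpu-memory-utilization", "0.8",    # leave headroom to avoid GPU OOM
    "--max-num-batched-tokens", "8192",   # allow longer sequences
]
print(" ".join(args))
```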

&lt;p&gt;To simulate multiple concurrent requests, you can use the following script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;random&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;randint&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;concurrent.futures&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;openai&lt;/span&gt;

&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"EMPTY"&lt;/span&gt;
&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:8000/v1"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"mosaicml/mpt-30b-chat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="s"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Who are you?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'content'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;batch_test&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;batch_test&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Scaling with Modelz
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cloud.modelz.ai/"&gt;Modelz&lt;/a&gt; is a fully managed platform that provides users with a simple API for deploying machine learning models. By using our platform, your service can be scaled according to the real-time API invocation. The docker image will also be optimized to minimize the container cold start time.&lt;/p&gt;

&lt;p&gt;If you want to deploy models to your private cluster or single GPU server, try &lt;a href="https://github.com/tensorchord/openmodelz"&gt;openmodelz&lt;/a&gt;. It takes care of the underlying technical details and provides a simple and easy-to-use CLI to deploy and manage your machine learning services.&lt;/p&gt;

&lt;p&gt;If you have any questions about deploying models into production, feel free to &lt;a href="https://docs.modelz.ai/community"&gt;reach out&lt;/a&gt; by joining our &lt;a href="https://discord.gg/F4WnzqmeNj"&gt;Discord&lt;/a&gt; or emailing &lt;a href="mailto:modelz-support@tensorchord.ai"&gt;modelz-support@tensorchord.ai&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advertisement Time
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/mosecorg/mosec"&gt;mosec&lt;/a&gt; - A general high-performance and easy-to-use machine learning serving framework.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/tensorchord/pgvecto.rs"&gt;pgvecto.rs&lt;/a&gt; - A powerful Postgres extension for vector similarity search.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>python</category>
      <category>llm</category>
    </item>
    <item>
      <title>My Journey with envd</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Sat, 26 Aug 2023 12:44:02 +0000</pubDate>
      <link>https://dev.to/keming/my-journey-with-envd-35h5</link>
      <guid>https://dev.to/keming/my-journey-with-envd-35h5</guid>
      <description>&lt;p&gt;&lt;code&gt;envd&lt;/code&gt; is a frontend of &lt;a href="https://github.com/moby/buildkit"&gt;BuildKit&lt;/a&gt;. Just like the Dockerfile. It has been more than a year since I started working on this project. Since the features are relatively stable, I'd like to write a blog about my journey with &lt;code&gt;envd&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we need this tool
&lt;/h2&gt;

&lt;p&gt;The machine learning development environment has been a pain point for a while. "Which Python are you using now?" is definitely a newbie slayer. It's even worse if you need to use CUDA. "It works on my machine!" happens a lot.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;envd&lt;/code&gt; was created to solve the problem of the machine learning development environment. However, it goes far beyond that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure as code (IaC)
&lt;/h2&gt;

&lt;p&gt;What a fancy name! Here it means that with the &lt;code&gt;envd&lt;/code&gt; config file, you can get the same environment on different machines, whether that's a local machine, a remote server, or a Kubernetes cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Naming
&lt;/h2&gt;

&lt;p&gt;It was named &lt;code&gt;MIDI&lt;/code&gt; in the beginning, but that name is not SEO-friendly.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;d&lt;/code&gt; in &lt;code&gt;envd&lt;/code&gt; has no official meaning (as far as I know). It can be "docker", "deep learning", "dev", etc.&lt;/p&gt;

&lt;p&gt;For more information, check this &lt;a href="https://github.com/tensorchord/envd/issues/2#issuecomment-1119175904"&gt;issue&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logo
&lt;/h2&gt;

&lt;p&gt;We have a cute logo designed by &lt;a href="https://github.com/lilylee1874"&gt;Lily&lt;/a&gt;. It's a cat face with the &lt;code&gt;envd&lt;/code&gt; characters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t7z64uL5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://user-images.githubusercontent.com/12974685/200007223-cd94fe9a-266d-4bbd-ac23-e71043d0c3dc.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t7z64uL5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://user-images.githubusercontent.com/12974685/200007223-cd94fe9a-266d-4bbd-ac23-e71043d0c3dc.svg" alt="envd" width="139" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Actually, the cat only blinked once when we created the GIF; the recording tool on macOS is tricky to use, which is why it ended up blinking twice. By the way, we later replaced the GIF with an SVG to make the animation clear and smooth. Writing the SVG animation from scratch is not that hard.&lt;/p&gt;

&lt;p&gt;You can find the drafts &lt;a href="https://github.com/tensorchord/envd/issues/326"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;envd&lt;/code&gt; is a Go project, but our target audience mainly uses Python. That's why we spent a lot of effort supporting installation through &lt;code&gt;pip&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;As we know, Python has never done a good job of distributing pre-compiled binaries. I couldn't find any good documentation on how to create a pre-compiled binary distribution for Python; people just copy &amp;amp; paste the code from other projects, and so did &lt;code&gt;envd&lt;/code&gt;: the code is mainly copied from &lt;a href="https://github.com/mosecorg/mosec"&gt;&lt;code&gt;mosec&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I did learn some new things from others' contributions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/pypa/cibuildwheel"&gt;cibuildwheel&lt;/a&gt; has become mature nowadays. It's a great tool for setting up the multi-platform distribution pipeline in CI.&lt;/li&gt;
&lt;li&gt;You can &lt;a href="https://github.com/tensorchord/envd/pull/1254"&gt;package a binary file without any Python code&lt;/a&gt;. (by &lt;a href="https://github.com/frostming"&gt;frostming&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;You can create the Python &lt;a href="https://github.com/tensorchord/envd/pull/1324"&gt;ABI-agnostic wheel&lt;/a&gt;. (by &lt;a href="https://github.com/frostming"&gt;frostming&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, you can use &lt;code&gt;conda-forge&lt;/code&gt;. I have tried to create &lt;a href="https://github.com/conda-forge/staged-recipes/pull/22367"&gt;a recipe for &lt;code&gt;mosec&lt;/code&gt;&lt;/a&gt;. It has a totally different packaging logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rootless
&lt;/h2&gt;

&lt;p&gt;As a developer, I don't like to run the command with &lt;code&gt;sudo&lt;/code&gt; unless I have to. When I was trying to debug with the &lt;code&gt;buildkit&lt;/code&gt; daemon, I found that we can run it &lt;a href="https://github.com/moby/buildkit/blob/master/docs/rootless.md"&gt;in rootless mode&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starlark
&lt;/h2&gt;

&lt;p&gt;Starlark is a dialect of Python, which makes it easy to use for machine learning engineers and data scientists.&lt;/p&gt;

&lt;p&gt;I know that lots of configuration files are written in YAML. I personally don't like it, and you may also have heard plenty of complaints about the YAML format. I think a configuration file should be able to validate itself.&lt;/p&gt;

&lt;p&gt;You can use if-condition, for-loop, etc. in Starlark. The following code works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;libs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ubuntu:20.04"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04"&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;lib&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;libs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;python_packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more information, check the &lt;a href="https://github.com/bazelbuild/starlark/blob/master/spec.md"&gt;Starlark spec&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Although Starlark has an interpretation order, we don't rely on it. We parse the file into an internal graph and construct the BuildKit Low-Level Build (LLB) graph on top of it. This tradeoff makes it easy to cache the layers.&lt;/p&gt;

&lt;p&gt;Starlark is also easy to extend. We added lots of &lt;code&gt;envd&lt;/code&gt;-specific functions to make it more powerful; you can find them in the &lt;a href="https://envd.tensorchord.ai/api/starlark/v1/global.html"&gt;reference&lt;/a&gt;. Starlark has a &lt;code&gt;load&lt;/code&gt; function, similar to &lt;code&gt;import&lt;/code&gt; in Python, for loading another file. We created a new one called &lt;a href="https://envd.tensorchord.ai/api/starlark/v1/global.html#include"&gt;&lt;code&gt;include&lt;/code&gt;&lt;/a&gt; (because &lt;code&gt;import&lt;/code&gt; is reserved) to import functions from a git repository, so people can create their own &lt;code&gt;envd&lt;/code&gt; build functions and share them with others.&lt;/p&gt;
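&lt;p&gt;As a sketch, an &lt;code&gt;envd&lt;/code&gt; build file can pull shared functions from a git repository like this (the repository and the &lt;code&gt;tensorboard&lt;/code&gt; function follow the envdlib examples and may differ from the current API):&lt;/p&gt;

```python
# envd build file (Starlark); `include` and `base` are envd built-ins.
# `envdlib.tensorboard` is taken from the envdlib examples and is illustrative.
envdlib = include("https://github.com/tensorchord/envdlib")

def build():
    base(dev=True)
    envdlib.tensorboard(host_port=8888)
```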

&lt;h2&gt;
  
  
  VSCode support
&lt;/h2&gt;

&lt;p&gt;To make it more user-friendly, we have a VSCode &lt;a href="https://github.com/tensorchord/vscode-envd"&gt;extension&lt;/a&gt; for &lt;code&gt;envd&lt;/code&gt;, which provides the following features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/tensorchord/envd-lsp"&gt;LSP&lt;/a&gt;: enables Starlark auto-completion&lt;/li&gt;
&lt;li&gt;managing &lt;code&gt;envd&lt;/code&gt; environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  BuildKit
&lt;/h2&gt;

&lt;p&gt;This is the backend of &lt;code&gt;envd&lt;/code&gt;. Integrating with it is troublesome, mainly because it barely has any documentation; the only way to learn it is to read the &lt;a href="https://github.com/moby/buildkit/tree/master/examples"&gt;examples&lt;/a&gt;. Since the source code is written in a functional style, it's a bit hard to understand at first, but once you get used to it, things become easier.&lt;/p&gt;

&lt;p&gt;There are some nice features in BuildKit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parallel build&lt;/li&gt;
&lt;li&gt;Distributable workers&lt;/li&gt;
&lt;li&gt;Better cache&lt;/li&gt;
&lt;li&gt;Advanced operators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will go through them one by one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parallel build
&lt;/h3&gt;

&lt;p&gt;The main idea is to split the build graph into multiple sub-graphs and run them in parallel if possible. This is a great feature when some steps take a long time to finish while there is no overlap among them. For example, we can install the system packages and Conda environments in parallel.&lt;/p&gt;

&lt;p&gt;The related operators are &lt;code&gt;diff&lt;/code&gt; and &lt;code&gt;merge&lt;/code&gt;. In the &lt;code&gt;merge&lt;/code&gt; list, later states override previous states if they change the same directories. Sometimes it takes longer than you expect to compute the diffs and merge them together, so these operators should only be used when you're sure the parallelism will save time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distributable workers
&lt;/h3&gt;

&lt;p&gt;Basically, the frontend constructs the build graph, serializes it into Protocol Buffers, and sends it to the backend workers over TCP or a Unix domain socket.&lt;/p&gt;

&lt;p&gt;It's recommended to set up a long-running BuildKit daemon and use it as a remote worker, since that way builds can benefit from the cache.&lt;/p&gt;

&lt;p&gt;By default, we will create a &lt;code&gt;buildkitd&lt;/code&gt; container for &lt;code&gt;envd&lt;/code&gt; to build the image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Better cache
&lt;/h3&gt;

&lt;p&gt;BuildKit can import and export the build cache from and to local storage, inline (embedded in the image), or a registry. You can choose whether to export the intermediate layers.&lt;/p&gt;

&lt;p&gt;By default, the cache limit is 10% of your disk space. You can configure this through the &lt;a href="https://docs.docker.com/build/buildkit/toml-configuration/"&gt;buildkit config&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;envd&lt;/code&gt; v0 will download a pre-built base image that contains the basic development tools and the Python environment. This image can serve as a cache layer if none of the dependencies change, which is a great way to speed up the build process. You can check the nightly build &lt;a href="https://github.com/tensorchord/envd-nightly"&gt;benchmark&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moby
&lt;/h2&gt;

&lt;p&gt;For now, the best user experience is to use &lt;code&gt;envd&lt;/code&gt; v1 with the &lt;code&gt;moby&lt;/code&gt; worker. This requires Docker Engine version &amp;gt;= 22. To enable it, create a new &lt;code&gt;envd&lt;/code&gt; context like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;envd context create &lt;span class="nt"&gt;--name&lt;/span&gt; moby &lt;span class="nt"&gt;--builder&lt;/span&gt; moby-worker &lt;span class="nt"&gt;--use&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the &lt;code&gt;moby&lt;/code&gt; worker is still experimental. Due to this &lt;a href="https://github.com/moby/moby/issues/45111"&gt;issue&lt;/a&gt;, we have to &lt;a href="https://github.com/tensorchord/envd/pull/1699"&gt;disable the &lt;code&gt;merge&lt;/code&gt; operator&lt;/a&gt; used in &lt;code&gt;envd&lt;/code&gt; when using the &lt;code&gt;moby&lt;/code&gt; worker. The build step might therefore be slower, but the export step is much faster. Overall it's still faster, especially for large images, which are the common case in machine learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cache
&lt;/h2&gt;

&lt;p&gt;Docker layer caching is a common optimization for image building. Beyond that, we also enable caching for APT packages, Python wheels, VSCode extensions, and &lt;code&gt;oh-my-zsh&lt;/code&gt; plugins by mounting a cache directory at build time. Machine-learning pip wheels can be huge, which makes this cache especially useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Horust
&lt;/h2&gt;

&lt;p&gt;I agree that in a production environment, one container should do only one thing, which usually means running a single service. For a development environment, however, it's fine to run as many processes as you like, as long as they don't conflict with each other.&lt;/p&gt;

&lt;p&gt;That's why we need a process management tool to control all of these processes. We explored several options, including &lt;a href="https://systemd.io/"&gt;systemd&lt;/a&gt;, &lt;a href="https://github.com/just-containers/s6-overlay"&gt;s6-overlay&lt;/a&gt;, and &lt;a href="https://github.com/Supervisor/supervisor"&gt;Supervisor&lt;/a&gt;. In the end, we chose &lt;a href="https://github.com/FedericoPonzi/Horust"&gt;Horust&lt;/a&gt;, which is both simple and powerful. You can check the &lt;a href="https://github.com/tensorchord/envd/issues/930"&gt;discussion&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shell prompt
&lt;/h2&gt;

&lt;p&gt;I personally use &lt;code&gt;fish&lt;/code&gt; with &lt;code&gt;starship&lt;/code&gt;, which gives a great out-of-the-box shell experience. &lt;code&gt;starship&lt;/code&gt; works well with most common shells like &lt;code&gt;bash&lt;/code&gt;, &lt;code&gt;zsh&lt;/code&gt;, &lt;code&gt;fish&lt;/code&gt;, etc. It's easy to configure and extend. You can check the &lt;a href="https://starship.rs/"&gt;starship documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It works best with a &lt;a href="https://www.nerdfonts.com/"&gt;Nerd Font&lt;/a&gt;, but since we cannot control the users' terminal configuration, we have to disable some of the fancy icons.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coding in Jupyter Notebook and VSCode
&lt;/h2&gt;

&lt;p&gt;These are the most common coding tools for machine learning engineers and data scientists.&lt;/p&gt;

&lt;p&gt;Whether it's Jupyter Notebook or Jupyter Lab, it can be exposed as a normal web service.&lt;/p&gt;

&lt;p&gt;VSCode does a really good job with remote development. You can use VSCode on your local machine to connect to a remote server, or even to a container running on a remote server.&lt;/p&gt;

&lt;p&gt;Due to licensing restrictions, we have to use the &lt;a href="https://open-vsx.org/"&gt;Open VSX Registry&lt;/a&gt;. Sometimes the related CI tests fail due to its instability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Develop in the Kubernetes cluster
&lt;/h2&gt;

&lt;p&gt;We had hoped to monetize &lt;code&gt;envd&lt;/code&gt; with this feature, but not many people were interested in it. The code is open sourced as &lt;a href="https://github.com/tensorchord/envd-server/"&gt;&lt;code&gt;envd-server&lt;/code&gt;&lt;/a&gt;. Maybe we can bring this feature to the new &lt;a href="https://github.com/tensorchord/openmodelz/issues/105"&gt;openmodelz&lt;/a&gt; project. Although you can run &lt;code&gt;mdz exec {name} -ti bash&lt;/code&gt; to get into a container, it doesn't support VSCode Remote for now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use pointer receivers
&lt;/h2&gt;

&lt;p&gt;This is the most common kind of bug during &lt;code&gt;envd&lt;/code&gt; development. We have an internal build graph with many methods that build the LLB graph. Not all of these methods use pointer receivers, which results in an inconsistent state of the internal graph. I would prefer to use pointer receivers for all of the methods.&lt;/p&gt;

&lt;p&gt;You might wonder why the linter doesn't catch this. That's because the methods can be called in a nested way, with the outer function using a value receiver while the inner function uses a pointer receiver.&lt;/p&gt;

&lt;p&gt;This is also a good example of how language design matters (personal opinion). You won't see this kind of bug in Rust. But Rust doesn't have a good container ecosystem. :(&lt;/p&gt;

&lt;h2&gt;
  
  
  Progress bar
&lt;/h2&gt;

&lt;p&gt;The default Docker progress bar is really complex. When I implemented the &lt;a href="https://github.com/tensorchord/envd/pull/1708"&gt;&lt;code&gt;moby&lt;/code&gt; push&lt;/a&gt; feature, I chose to reuse another progress bar library to make life easier, although it lacks multi-line log support.&lt;/p&gt;

&lt;h2&gt;
  
  
  SSH agent forwarding
&lt;/h2&gt;

&lt;p&gt;We can actually forward the host's SSH credentials to the container, so we can use the &lt;code&gt;git&lt;/code&gt; command as if we were on the host machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;envd&lt;/code&gt; v1
&lt;/h2&gt;

&lt;p&gt;This new version was created to address design issues in &lt;code&gt;envd&lt;/code&gt; v0. The main idea is that the &lt;code&gt;envd&lt;/code&gt; file should be a more general frontend for BuildKit: it should be able to build any image, not only machine learning development environments.&lt;/p&gt;

&lt;p&gt;Here is a comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Features&lt;/th&gt;
&lt;th&gt;v0&lt;/th&gt;
&lt;th&gt;v1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;is default for &lt;code&gt;envd&amp;lt;v1.0&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;support dev&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;support CUDA&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;support serving&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;support custom base image&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;support installing multiple languages&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;support &lt;code&gt;moby&lt;/code&gt; builder&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Make it faster
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/tensorchord/envd/blob/3b5fae2de801b6e8fee98d1f2e743dce63a20085/pkg/lang/ir/v1/system.go#L346"&gt;&lt;code&gt;compileBaseImage&lt;/code&gt;&lt;/a&gt; function should be able to run faster. You can give it a try if you're interested in &lt;code&gt;envd&lt;/code&gt; development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Regrets
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/tensorchord/envd/pull/972"&gt;state-based implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This feature would make &lt;code&gt;envd&lt;/code&gt; much more powerful, but it also comes with complexity.&lt;/p&gt;

&lt;p&gt;Users could use low-level operators to build the graph, and we could execute the commands from the &lt;code&gt;envd&lt;/code&gt; file in a user-defined order.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/tensorchord/envd/pull/1459"&gt;incremental development environment&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lots of development environments are not built in one shot. This proposal aimed to track changes in the running environment and update the &lt;code&gt;envd&lt;/code&gt; file accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This is the first time I've been able to work on an open source project as my daily job. I have learned a lot from the community, and I hope more people can benefit from &lt;code&gt;envd&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>docker</category>
      <category>buildkit</category>
    </item>
    <item>
      <title>Develop machine learning applications inside the containers</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Thu, 05 Jan 2023 00:42:54 +0000</pubDate>
      <link>https://dev.to/keming/develop-machine-learning-applications-inside-the-containers-1mbo</link>
      <guid>https://dev.to/keming/develop-machine-learning-applications-inside-the-containers-1mbo</guid>
      <description>&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://docs.google.com/presentation/d/e/2PACX-1vTPrXjF_ae__fJv5F7W_n8W10NT8Fqu04sLbucd7vtgjEsV67De5xPMj1cOdEnif5IXOMLCu_yxZf0v/embed?start=false&amp;amp;amp%3Bloop=false&amp;amp;amp%3Bdelayms=3000" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-us.googleusercontent.com%2Fdocs%2FAHkbwyK-2pmI4pYHWq-TZaMsXcyHwEhP6kw840FqEhYW16veWOwAMyXXhQdXYzN1x6iGzYUOPQQqLqNYzhVh9wxijJ8spC2EWN5vbjbuo7Q-rliowdUJ00I%3Dw1200-h630-p" height="630" class="m-0" width="1200"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://docs.google.com/presentation/d/e/2PACX-1vTPrXjF_ae__fJv5F7W_n8W10NT8Fqu04sLbucd7vtgjEsV67De5xPMj1cOdEnif5IXOMLCu_yxZf0v/embed?start=false&amp;amp;amp%3Bloop=false&amp;amp;amp%3Bdelayms=3000" rel="noopener noreferrer" class="c-link"&gt;
          pycon China 2022 envd - Google Slides
        &lt;/a&gt;
      &lt;/h2&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fssl.gstatic.com%2Fdocs%2Fpresentations%2Fimages%2Ffavicon-2023q4.ico" width="256" height="256"&gt;
        docs.google.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>mentorship</category>
      <category>community</category>
      <category>career</category>
      <category>gratitude</category>
    </item>
    <item>
      <title>Machine learning container environment should be easy</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Sun, 18 Sep 2022 13:43:11 +0000</pubDate>
      <link>https://dev.to/keming/machine-learning-container-environment-should-be-easy-1jn4</link>
      <guid>https://dev.to/keming/machine-learning-container-environment-should-be-easy-1jn4</guid>
      <description>&lt;p&gt;As a machine learning engineer that works on different deep learning models, unexpected environmental issues always bother me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iZHdEqyR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/52693877/191025031-b3b1822f-7c54-4641-90a8-986fadff606f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iZHdEqyR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/52693877/191025031-b3b1822f-7c54-4641-90a8-986fadff606f.png" alt="ds_blame" width="880" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Do these scenarios look familiar to you? What happens to the machine learning development environment?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Even though you can &lt;code&gt;pip install torch&lt;/code&gt;, it doesn't mean you don't need to deal with the low-level code dependencies.&lt;/li&gt;
&lt;li&gt;Containers are necessary for a consistent environment, especially for the GPU part.&lt;/li&gt;
&lt;li&gt;Dockerfiles are hard to reuse conveniently.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Dealing with the environment is just the first step of your work. It should be easy, but it never is. Although we have to admit it's much easier than the days when we had to search for how to install NumPy.&lt;/p&gt;

&lt;p&gt;Meanwhile, from the machine learning infra engineers' perspective:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--exY6A7ZG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/52693877/191036993-922c27cb-36d3-4db3-a6eb-03f8b16207c9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--exY6A7ZG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/52693877/191036993-922c27cb-36d3-4db3-a6eb-03f8b16207c9.png" alt="infra_blame" width="880" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Infra engineers are never the enemies of machine learning engineers. A better tool can make everyone happy.&lt;/p&gt;

&lt;p&gt;Let's sum up our requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Machine learning engineers should submit container images instead of raw code, because they know the model dependencies best.&lt;/li&gt;
&lt;li&gt;Infra engineers should maintain a utility that helps machine learning engineers build container images following best practices.&lt;/li&gt;
&lt;li&gt;Meanwhile, machine learning engineers don't want to sacrifice the development experience. They should be able to use Jupyter Notebook and VSCode as usual.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So far, everything looks good. Clearly, it's not impossible.&lt;/p&gt;



&lt;p&gt;Let's introduce the new tool: &lt;a href="https://github.com/tensorchord/envd"&gt;envd&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It provides the following features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write Python-like functions instead of a Dockerfile, and share them across your team&lt;/li&gt;
&lt;li&gt;Built on BuildKit, with better caching and parallel building&lt;/li&gt;
&lt;li&gt;Integrated with Jupyter Notebook and VSCode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The syntax looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"ubuntu20.04"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"11.6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cudnn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"8"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;python_packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s"&gt;"torch"&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the command &lt;code&gt;envd up&lt;/code&gt;, and you are in an isolated container environment.&lt;/p&gt;

&lt;p&gt;To reuse functions written by your teammates, you can import them like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lib&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://github.com/tensorchord/envdlib"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jupyter_lab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8888&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's also much faster. See the benchmark below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7lF99oDo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/15q3ur0jyqdilpn6ct0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7lF99oDo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/15q3ur0jyqdilpn6ct0u.png" alt="benchmark" width="614" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More features are coming! Feel free to open an issue or join the &lt;a href="https://discord.gg/KqswhpVgdU"&gt;Discord&lt;/a&gt; community to discuss with us.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>docker</category>
      <category>python</category>
    </item>
    <item>
      <title>Why not multiprocessing</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Thu, 14 Oct 2021 14:08:32 +0000</pubDate>
      <link>https://dev.to/keming/why-not-multiprocessing-35bc</link>
      <guid>https://dev.to/keming/why-not-multiprocessing-35bc</guid>
      <description>&lt;p&gt;During the development of a machine learning serving project &lt;a href="https://github.com/mosecorg/mosec"&gt;Mosec&lt;/a&gt;, I used a lot of multiprocessing to make it more efficient. I want to share some experiences and researches related to Python multiprocessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start from a segmentation fault
&lt;/h2&gt;

&lt;p&gt;Here is a code snippet that runs well on macOS (Darwin) but triggers a segmentation fault on Linux.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sleep&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wait_for_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_set&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;trigger_segment_fault&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spawn"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;wait_for_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# this will show the exitcode=-SIGSEGV
&lt;/span&gt;    &lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;terminate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;trigger_segment_fault&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, pure Python code can trigger a segmentation fault.&lt;/p&gt;

&lt;p&gt;The reason is the process start method. According to the &lt;a href="https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods"&gt;Python documentation&lt;/a&gt;, &lt;code&gt;spawn&lt;/code&gt; is the default on macOS (starting from Python 3.8) while &lt;code&gt;fork&lt;/code&gt; is the default on Unix. But the start method also affects the &lt;code&gt;Event&lt;/code&gt; creation. Let's check the source code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_cond&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_flag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The initialization takes a &lt;code&gt;ctx&lt;/code&gt; that corresponds to the start method. So when you try to access a forked event in a spawned process, this segmentation fault occurs. The fix is simple: use the same context. (Actually, you can use a &lt;em&gt;spawn&lt;/em&gt; event in a forked process.)&lt;/p&gt;
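&lt;p&gt;A minimal sketch of that fix (my own, not from the original C++/Rust discussion): create the &lt;code&gt;Event&lt;/code&gt; from the same &lt;code&gt;spawn&lt;/code&gt; context that starts the process, so the synchronization primitives always match the start method.&lt;/p&gt;

```python
import multiprocessing as mp


def wait_for_event(event):
    # Block until the main process sets the event.
    event.wait()


def run():
    # Create the Event from the SAME context used to start the process,
    # so its underlying primitives match the "spawn" start method.
    ctx = mp.get_context("spawn")
    event = ctx.Event()
    p = ctx.Process(target=wait_for_event, args=(event,))
    p.start()
    event.set()
    p.join()
    return p.exitcode  # 0 means the child exited cleanly, no SIGSEGV


if __name__ == "__main__":
    print(run())
```

The same pattern applies to every primitive (queues, locks, semaphores): create them all from one explicit context instead of mixing module-level `mp.Event()` with a context-created process.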

&lt;h2&gt;
  
  
  &lt;em&gt;fork&lt;/em&gt; or &lt;em&gt;spawn&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Another question is: which start method should I use?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;spawn&lt;/em&gt;: The parent process starts a &lt;strong&gt;fresh&lt;/strong&gt; python interpreter process. The child process will only inherit those resources necessary to run the process objects run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;fork&lt;/em&gt;: The parent process uses &lt;code&gt;os.fork()&lt;/code&gt; to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child's process. Note that safely forking a multithreaded process is &lt;strong&gt;problematic&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We can see that &lt;em&gt;spawn&lt;/em&gt; creates a new Python process and only inherits the necessary resources, while &lt;em&gt;fork&lt;/em&gt; calls the underlying &lt;code&gt;os.fork()&lt;/code&gt;, which is problematic when the parent process is multithreaded (as a CPython process often is).&lt;/p&gt;
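&lt;p&gt;A small sketch (mine, not from the article) of how to check the platform default and opt into an explicit method instead of relying on it:&lt;/p&gt;

```python
import multiprocessing as mp

# The platform default start method:
# "fork" on Linux, "spawn" on macOS (Python 3.8+) and Windows.
print(mp.get_start_method())

# Rather than relying on the platform default, request an explicit
# context and create every primitive (Process, Event, Queue) from it,
# so the behavior is the same on every platform.
ctx = mp.get_context("spawn")
queue = ctx.Queue()  # this queue matches the "spawn" start method
```

Using an explicit context also avoids the global side effect of `mp.set_start_method()`, which can only be called once per program.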

&lt;p&gt;When you are using &lt;em&gt;spawn&lt;/em&gt;, accidentally accessing the main process's variables may have unexpected consequences.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Dummy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"init in pid: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getpid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;Dummy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;task&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"x is None"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spawn"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code snippet, the &lt;em&gt;spawn&lt;/em&gt;ed process re-imports the module before running &lt;code&gt;task&lt;/code&gt;, which executes both &lt;code&gt;Dummy()&lt;/code&gt; and &lt;code&gt;x = None&lt;/code&gt; again. So the terminal will print "init in pid" twice, with different PIDs.&lt;/p&gt;

&lt;p&gt;So what kind of problem can &lt;em&gt;fork&lt;/em&gt; cause? Let's take a look at this article: &lt;a href="https://pythonspeed.com/articles/python-multiprocessing/"&gt;Why your multiprocessing Pool is stuck (it’s full of sharks!)&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;threading&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AreYouOK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"init in:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getpid&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locked&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;acquire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acquire&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;delay_release&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;release&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;greeter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AreYouOK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;greeter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acquire&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;greeter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay_release&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;greeting&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getpid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;greeter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"fork"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;greeting&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;greeting&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above example, even after the lock is released in the parent, the child process still cannot acquire it. Why?&lt;/p&gt;

&lt;p&gt;The main point is that fork doesn't copy everything.&lt;/p&gt;

&lt;p&gt;Let's check the &lt;a href="https://man7.org/linux/man-pages/man2/fork.2.html"&gt;man page of fork&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The child does not inherit its parent's memory locks&lt;/p&gt;

&lt;p&gt;The child does not inherit semaphore adjustments from its parent&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So what happens here is that the child process inherits a lock that has already been acquired, but no thread will ever release it, because the running thread is not copied to the &lt;em&gt;forked&lt;/em&gt; process. The parent's and child's locks are not the same object (copied, not shared). Clearly, &lt;code&gt;threading.Lock&lt;/code&gt; is not process-safe and should be handled with caution when it's used inside other libraries (e.g. &lt;code&gt;queue.Queue&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;If we use &lt;em&gt;spawn&lt;/em&gt; instead of &lt;em&gt;fork&lt;/em&gt;, everything the child needs is &lt;strong&gt;rebuilt&lt;/strong&gt; in the new process (including the thread). That's why we should use &lt;em&gt;spawn&lt;/em&gt; instead of &lt;em&gt;fork&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;multiprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;set_start_method&lt;/span&gt;
&lt;span class="n"&gt;set_start_method&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spawn"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code snippet above has a pitfall: &lt;code&gt;set_start_method&lt;/code&gt; raises a &lt;code&gt;RuntimeError&lt;/code&gt; if it's called more than once in the same process.&lt;/p&gt;
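&lt;p&gt;A quick way to see this (the second call raises &lt;code&gt;RuntimeError&lt;/code&gt; unless you pass &lt;code&gt;force=True&lt;/code&gt;):&lt;/p&gt;

```python
import multiprocessing as mp

try:
    mp.set_start_method("spawn")
except RuntimeError:
    pass  # a start method was already chosen somewhere else

try:
    # calling it again always raises: "context has already been set"
    mp.set_start_method("spawn")
    raised = False
except RuntimeError:
    raised = True
print(raised)
```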

&lt;p&gt;My suggestion is to use the start method context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;


&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spawn"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Garbage collection with deadlock
&lt;/h2&gt;

&lt;p&gt;Let's take a look at another article: &lt;a href="https://codewithoutrules.com/2017/08/16/concurrency-python/"&gt;The tragic tale of the deadlocking Python queue&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This code snippet is copied from the above article.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;queue&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Queue&lt;/span&gt;

&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Circular&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;circular&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__del__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Adding to queue in GC"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"iteration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Create an object that will be garbage collected
&lt;/span&gt;    &lt;span class="c1"&gt;# asynchronously, and therefore have its __del__
&lt;/span&gt;    &lt;span class="c1"&gt;# method called later:
&lt;/span&gt;    &lt;span class="n"&gt;Circular&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Adding to queue regularly"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We usually assume that Python runs one line at a time. But that's not true.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Garbage collection can interrupt Python functions at any point, and run arbitrary other Python code: &lt;code&gt;__del__&lt;/code&gt; methods and &lt;a href="https://docs.python.org/3/library/weakref.html"&gt;weakref&lt;/a&gt; callbacks. So can signal handlers, which happen e.g. when you hit Ctrl-C (your process gets the SIGINT signal) or a subprocess dies (your process gets the SIGCHLD signal).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So when we call &lt;code&gt;q.put(2)&lt;/code&gt;, the queue acquires its internal lock. While the lock is held, the GC may kick in and run &lt;code&gt;__del__&lt;/code&gt;, which calls &lt;code&gt;q.put(1)&lt;/code&gt; and blocks on the same lock. But &lt;code&gt;q.put(2)&lt;/code&gt; cannot release the lock until the GC returns, and the GC never returns. Deadlock!&lt;/p&gt;

&lt;p&gt;Thanks to the Python core developers, this was fixed in Python 3.7 by introducing &lt;code&gt;queue.SimpleQueue&lt;/code&gt;, whose &lt;code&gt;put()&lt;/code&gt; is reentrant.&lt;/p&gt;
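&lt;p&gt;A minimal sketch of why &lt;code&gt;SimpleQueue&lt;/code&gt; is safe here: its &lt;code&gt;put()&lt;/code&gt; is reentrant, so calling it from &lt;code&gt;__del__&lt;/code&gt; during garbage collection cannot deadlock:&lt;/p&gt;

```python
import gc
import queue

q = queue.SimpleQueue()


class Circular:
    def __init__(self):
        self.circular = self  # reference cycle: only the cyclic GC can reclaim it

    def __del__(self):
        # SimpleQueue.put() is reentrant, so enqueueing from a finalizer
        # that interrupts another put() cannot deadlock.
        q.put(1)


Circular()    # immediately unreachable, but kept alive by its own cycle
gc.collect()  # runs __del__, which enqueues 1
q.put(2)
print(q.qsize())
```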

&lt;h2&gt;
  
  
  Copy on write
&lt;/h2&gt;

&lt;p&gt;When running with multiprocessing, we hope the child process can share some data with the main process instead of copying it, especially data the child never touches. This sounds reasonable. However, it overlooks another important part of Python: reference counting.&lt;/p&gt;

&lt;p&gt;CPython has two garbage collection mechanisms: reference counting and generational garbage collection. Reference counting is the fundamental one and cannot be disabled. Generational garbage collection mainly exists to break reference cycles. Check these articles for more details: &lt;a href="https://rushter.com/blog/python-garbage-collector/"&gt;Garbage collection in Python: things you need to know&lt;/a&gt; and &lt;a href="https://devguide.python.org/garbage_collector/"&gt;Design of CPython’s Garbage Collector&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's take a look at the CPython implementation of PyObject:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;typedef&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;_object&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;_PyObject_HEAD_EXTRA&lt;/span&gt;
    &lt;span class="n"&gt;Py_ssize_t&lt;/span&gt; &lt;span class="n"&gt;ob_refcnt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;PyTypeObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ob_type&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;PyObject&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The struct field &lt;code&gt;ob_refcnt&lt;/code&gt; holds the reference count. After a &lt;code&gt;fork()&lt;/code&gt;, whenever the child process so much as reads a Python object, CPython updates its reference count. That write dirties the memory page, so the operating system copies it, even though the data the user sees hasn't changed.&lt;/p&gt;
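&lt;p&gt;You can watch the reference count machinery at work with &lt;code&gt;sys.getrefcount&lt;/code&gt; (which itself adds a temporary reference when you call it):&lt;/p&gt;

```python
import sys

payload = object()
# getrefcount reports at least 2: the `payload` name plus the temporary
# reference created by passing it as an argument. Every such access
# writes to ob_refcnt, so after fork() it dirties the shared page.
count = sys.getrefcount(payload)
print(count)
```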

&lt;p&gt;To handle this problem, the Instagram Engineering team has come up with a solution: &lt;a href="https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf"&gt;Copy-on-write friendly Python garbage collection&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="nf"&gt;gc_freeze_impl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="cm"&gt;/*[clinic end generated code: output=502159d9cdc4c139 input=b602b16ac5febbe5]*/&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;GCState&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gcstate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_gc_state&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;NUM_GENERATIONS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;gc_list_merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GEN_HEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gcstate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gcstate&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;permanent_generation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;gcstate&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;generations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;Py_RETURN_NONE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's check the &lt;a href="https://docs.python.org/3/library/gc.html#gc.freeze"&gt;Python document for GC&lt;/a&gt;. In Python 3.7, it introduced a new method called &lt;code&gt;gc.freeze&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Freeze all the objects tracked by gc - move them to a permanent generation and ignore all the future collections.&lt;/p&gt;
&lt;/blockquote&gt;
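&lt;p&gt;The API is straightforward to try (&lt;code&gt;gc.get_freeze_count&lt;/code&gt; reports how many objects are in the permanent generation):&lt;/p&gt;

```python
import gc

gc.collect()                    # collect garbage before freezing
gc.freeze()                     # move all tracked objects to the permanent generation
frozen = gc.get_freeze_count()  # number of objects now ignored by collections
gc.unfreeze()                   # move them back to the oldest generation
print(frozen)
```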

&lt;p&gt;So does this solve the copy-on-write problem? I'm not sure, because I couldn't come up with an example that reproduces it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;psutil&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;display_memory_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_info&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;processing&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;display_memory_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"child "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"fork"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;processing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;display_memory_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"parent"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code snippet above will print the memory usage of the main process and child process. You may get something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;child  &amp;gt; pmem(rss=414748672, vms=427634688, shared=2969600, text=2035712, lib=0, data=411791360, dirty=0)
parent &amp;gt; pmem(rss=419000320, vms=427634688, shared=7221248, text=2035712, lib=0, data=411791360, dirty=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can see that they don't share much, even though, by default, a &lt;em&gt;forked&lt;/em&gt; process is supposed to share its pages with the parent process.&lt;/p&gt;

&lt;p&gt;But if we change it to &lt;em&gt;spawn&lt;/em&gt;, we will get something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;child  &amp;gt; pmem(rss=13848576, vms=23044096, shared=7069696, text=2035712, lib=0, data=7163904, dirty=0)
parent &amp;gt; pmem(rss=419139584, vms=428081152, shared=7196672, text=2035712, lib=0, data=412200960, dirty=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Since &lt;code&gt;data&lt;/code&gt; is not used by the &lt;em&gt;spawned&lt;/em&gt; process, it isn't copied to the new process at all.&lt;/p&gt;

&lt;p&gt;I tried adding &lt;code&gt;gc.freeze()&lt;/code&gt; before creating the new process, but it made no difference. I'm not sure what I missed.&lt;/p&gt;

&lt;p&gt;I found some discussion in the &lt;a href="https://github.com/python/cpython/pull/3705#issuecomment-420191452"&gt;&lt;code&gt;gc.freeze()&lt;/code&gt; PR&lt;/a&gt;. It looks like untouched data should be shareable among processes. Also, Gunicorn has spent four years on this issue: &lt;a href="https://github.com/benoitc/gunicorn/issues/1640"&gt;support for &lt;code&gt;gc.freeze()&lt;/code&gt; for apps that use preloading&lt;/a&gt;. I couldn't find a good example demonstrating that this method works well.&lt;/p&gt;

&lt;p&gt;To my understanding, &lt;code&gt;gc.freeze()&lt;/code&gt; exempts the frozen objects from generational garbage collection, but reference counting cannot be disabled. So after we &lt;em&gt;fork&lt;/em&gt; a new process, any access to the shared objects still changes their reference counts, which dirties the pages anyway.&lt;/p&gt;

&lt;p&gt;If we change the start method from &lt;em&gt;spawn&lt;/em&gt; to &lt;em&gt;fork&lt;/em&gt;, it doesn't seem to need &lt;code&gt;gc.freeze()&lt;/code&gt; to freeze the reference counts, which conflicts with the description in the Instagram blog.&lt;/p&gt;

&lt;p&gt;Is there any method to avoid this? Yes. Check another blog written before the Instagram blog: &lt;a href="https://llvllatrix.wordpress.com/2016/02/19/python-vs-copy-on-write/"&gt;Python vs Copy on Write&lt;/a&gt;. The solution is very straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can use &lt;a href="https://www.pypy.org/"&gt;PyPy&lt;/a&gt;, which has &lt;a href="https://doc.pypy.org/en/latest/cpython_differences.html#differences-related-to-garbage-collection-strategies"&gt;a different garbage collection strategy&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You can use &lt;a href="https://docs.python.org/3/library/multiprocessing.html#shared-ctypes-objects"&gt;shared &lt;code&gt;ctypes&lt;/code&gt; objects&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You can use &lt;a href="https://docs.python.org/3/library/multiprocessing.shared_memory.html"&gt;shared memory&lt;/a&gt; for Python &amp;gt;= 3.8.&lt;/li&gt;
&lt;li&gt;You can use &lt;a href="https://docs.python.org/3/library/mmap.html"&gt;mmap&lt;/a&gt; to &lt;a href="https://pythonspeed.com/articles/reduce-memory-array-copies/"&gt;reduce the memory usage of array copies&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
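&lt;p&gt;For example, a minimal sketch with &lt;code&gt;multiprocessing.shared_memory&lt;/code&gt; (single-process here for brevity; a child would attach to the block by &lt;code&gt;name&lt;/code&gt;):&lt;/p&gt;

```python
from multiprocessing import shared_memory

# Create a 16-byte block; another process could attach to the same bytes
# with shared_memory.SharedMemory(name=shm.name) instead of copying them.
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b"hello"
data = bytes(shm.buf[:5])
shm.close()
shm.unlink()  # free the block once no process needs it
print(data)
```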

&lt;h2&gt;
  
  
  Suggestions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Consider Go, Rust, or C++ for concurrent computing.&lt;/li&gt;
&lt;li&gt;Use &lt;em&gt;spawn&lt;/em&gt; instead of &lt;em&gt;fork&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Be careful about the garbage collection behavior.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>multiprocessing</category>
      <category>fork</category>
      <category>spawn</category>
    </item>
    <item>
      <title>Yet another deep learning serving framework</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Wed, 13 May 2020 17:06:19 +0000</pubDate>
      <link>https://dev.to/keming/yet-another-deep-learning-serving-framework-21di</link>
      <guid>https://dev.to/keming/yet-another-deep-learning-serving-framework-21di</guid>
      <description>&lt;p&gt;Yet another deep learning serving framework that is easy to use.&lt;/p&gt;

&lt;p&gt;Previously, I tested the performance of some &lt;a href="https://dev.to/kemingy/deep-learning-serving-benchmark-1nko"&gt;deep learning serving frameworks&lt;/a&gt; like TensorFlow Serving and Triton, and I found that these frameworks are not that easy to use. Moreover, they don't offer much of a performance advantage. So I wrote one as a prototype.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kemingy/ventu"&gt;ventu&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kemingy/batching"&gt;batching&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;del&gt;Feel free to give it a try&lt;/del&gt;. For production usage, check &lt;a href="https://github.com/mosecorg/mosec"&gt;MOSEC&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;serve the deep learning models (HTTP)&lt;/li&gt;
&lt;li&gt;preprocess and postprocess (optional)&lt;/li&gt;
&lt;li&gt;dynamic batching (increase the throughput)&lt;/li&gt;
&lt;li&gt;health check (need to provide examples)&lt;/li&gt;
&lt;li&gt;request &amp;amp; response validation&lt;/li&gt;
&lt;li&gt;model inference warm-up (need to provide examples)&lt;/li&gt;
&lt;li&gt;OpenAPI document&lt;/li&gt;
&lt;li&gt;supports both JSON and msgpack serialization&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advantages
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;support all kinds of deep learning runtime&lt;/li&gt;
&lt;li&gt;easy to implement the preprocess and postprocess part&lt;/li&gt;
&lt;li&gt;validation for request&lt;/li&gt;
&lt;li&gt;health check and warm-up with examples&lt;/li&gt;
&lt;li&gt;OpenAPI document&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dynamic Batching
&lt;/h3&gt;

&lt;p&gt;To implement dynamic batching, we need a high-performance job queue that can be consumed by multiple workers. A Go channel is a good fit. In this situation, we have one producer and multiple consumers, so it's easy to close the channel for a graceful shutdown.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Batching&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;       &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="c"&gt;// socket name&lt;/span&gt;
    &lt;span class="n"&gt;socket&lt;/span&gt;     &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Listener&lt;/span&gt;
    &lt;span class="n"&gt;maxLatency&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt; &lt;span class="c"&gt;// max latency for a batch inference to wait&lt;/span&gt;
    &lt;span class="n"&gt;batchSize&lt;/span&gt;  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="c"&gt;// max batch size for a batch inference&lt;/span&gt;
    &lt;span class="n"&gt;capacity&lt;/span&gt;   &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="c"&gt;// the capacity of the batching queue&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt; &lt;span class="c"&gt;// timeout for jobs in the queue&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;     &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;zap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Logger&lt;/span&gt;
    &lt;span class="n"&gt;queue&lt;/span&gt;      &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Job&lt;/span&gt; &lt;span class="c"&gt;// job queue&lt;/span&gt;
    &lt;span class="n"&gt;jobs&lt;/span&gt;       &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Job&lt;/span&gt; &lt;span class="c"&gt;// use job id as the key to find the job&lt;/span&gt;
    &lt;span class="n"&gt;jobsLock&lt;/span&gt;   &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mutex&lt;/span&gt; &lt;span class="c"&gt;// lock for jobs&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each job in this queue, we create a UUID as its key. After inference, we can find the job by looking up the key in a hash table, which means we also need a mutex to guard that table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Job&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;        &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt;      &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;      &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="c"&gt;// request data&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;    &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="c"&gt;// inference result or error message&lt;/span&gt;
    &lt;span class="n"&gt;errorCode&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="c"&gt;// HTTP Error Code&lt;/span&gt;
    &lt;span class="n"&gt;expire&lt;/span&gt;    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
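&lt;p&gt;The same idea as a rough Python sketch (the names are illustrative, not the actual implementation):&lt;/p&gt;

```python
import threading
import uuid

jobs = {}                     # job id -> request data
jobs_lock = threading.Lock()  # guards the hash table


def register(data):
    job_id = uuid.uuid4().hex
    with jobs_lock:
        jobs[job_id] = data
    return job_id


def finish(job_id):
    # called after inference to hand the result back to the waiting handler
    with jobs_lock:
        return jobs.pop(job_id)


jid = register(b"request payload")
result = finish(jid)
print(result)
```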



&lt;p&gt;Because the batching service and the Python inference workers run on the same machine (or the same pod), the most efficient communication channel is a &lt;a href="https://en.wikipedia.org/wiki/Unix_domain_socket"&gt;Unix domain socket&lt;/a&gt;. We also need to define a simple protocol for our use case. Since we only need to transfer the data of batch jobs, let's keep everything as simple as we can.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| length  |       data        |
| 4 bytes |   {length} bytes  |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
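&lt;p&gt;Encoding and decoding such a frame is a few lines in Python (big-endian length assumed here for illustration):&lt;/p&gt;

```python
import struct

def encode(payload: bytes) -> bytes:
    # 4-byte length prefix followed by {length} bytes of data
    return struct.pack(">I", len(payload)) + payload

def decode(frame: bytes) -> bytes:
    (length,) = struct.unpack(">I", frame[:4])
    return frame[4:4 + length]

frame = encode(b"batch-job")
decoded = decode(frame)
print(len(frame), decoded)
```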



&lt;ol&gt;
&lt;li&gt;workers send the first request with empty data to the batching service&lt;/li&gt;
&lt;li&gt;batching service collects a batch of jobs and sends to the workers&lt;/li&gt;
&lt;li&gt;worker processes these jobs

&lt;ul&gt;
&lt;li&gt;preprocess (for a single job)&lt;/li&gt;
&lt;li&gt;inference (for a batch of jobs)&lt;/li&gt;
&lt;li&gt;postprocess (for a single job)&lt;/li&gt;
&lt;li&gt;send the results to the batching service&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;batching service notifies the handler that this job is done, then the handler sends the result to the original client and goes to #2&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Error handling
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;timeout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a job is not processed by one of the workers within the timeout, the batching service deletes it from the hash table and returns HTTP 408 (Request Timeout).&lt;/p&gt;

&lt;p&gt;When the batching service tries to collect these jobs from the queue channel, it will check the &lt;code&gt;expire&lt;/code&gt; attribute first.&lt;/p&gt;
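&lt;p&gt;A rough sketch of that expiry check (the job shape here is hypothetical, mirroring the &lt;code&gt;expire&lt;/code&gt; field):&lt;/p&gt;

```python
import time

def collect_batch(pending, batch_size):
    # skip jobs whose `expire` time has passed; the real service would
    # answer those with HTTP 408 instead of batching them
    batch = []
    now = time.time()
    for job in pending:
        if job["expire"] < now:
            continue
        batch.append(job)
        if len(batch) == batch_size:
            break
    return batch

pending = [
    {"id": "fresh", "expire": time.time() + 5},
    {"id": "stale", "expire": time.time() - 1},
]
picked = collect_batch(pending, 2)
print([job["id"] for job in picked])
```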

&lt;ul&gt;
&lt;li&gt;validation error&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make sure the request data is valid, we use &lt;a href="//pydantic-docs.helpmanual.io/"&gt;pydantic&lt;/a&gt; for validation, so the user needs to define the data schema with &lt;code&gt;pydantic&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If one job's data is invalid, it is marked and its result becomes the validation error message generated by &lt;code&gt;pydantic&lt;/code&gt;. This doesn't affect the other jobs in the same batch. That part is handled by &lt;code&gt;ventu&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple HTTP service without dynamic batching
&lt;/h3&gt;

&lt;p&gt;For this part, we use &lt;a href="//falcon.readthedocs.io/"&gt;falcon&lt;/a&gt; which is a very powerful Python framework for web APIs. To generate the OpenAPI document and validate the request data, we use &lt;a href="https://github.com/0b01001001/spectree"&gt;spectree&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you would like to use &lt;code&gt;gunicorn&lt;/code&gt;, &lt;code&gt;ventu&lt;/code&gt; also exposes the &lt;code&gt;app&lt;/code&gt; object.&lt;/p&gt;

&lt;h2&gt;
  
  
  TODO
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;metrics

&lt;ul&gt;
&lt;li&gt;users can add these in the model inference part&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;increase the number of workers dynamically&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Deep Learning Serving Benchmark</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Thu, 23 Apr 2020 06:30:30 +0000</pubDate>
      <link>https://dev.to/keming/deep-learning-serving-benchmark-1nko</link>
      <guid>https://dev.to/keming/deep-learning-serving-benchmark-1nko</guid>
      <description>&lt;p&gt;There is no black magic, everything follows the rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  What do deep learning serving frameworks do?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;respond to requests (RESTful HTTP or RPC)&lt;/li&gt;
&lt;li&gt;model inference (with runtime)&lt;/li&gt;
&lt;li&gt;preprocessing &amp;amp; postprocessing (optional)&lt;/li&gt;
&lt;li&gt;queries dynamic batching (increase throughput)&lt;/li&gt;
&lt;li&gt;monitoring metrics&lt;/li&gt;
&lt;li&gt;service health check&lt;/li&gt;
&lt;li&gt;versioning&lt;/li&gt;
&lt;li&gt;multiple instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Actually, when deploying models with Kubernetes, we only need some of these features. But we do care about the performance of these frameworks, so let's run a benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Environments&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz&lt;/li&gt;
&lt;li&gt;GPU: NVIDIA V100&lt;/li&gt;
&lt;li&gt;Memory: 251GiB&lt;/li&gt;
&lt;li&gt;OS: Ubuntu 16.04.6 LTS (Xenial Xerus)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Docker Images&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tensorflow/tensorflow:latest-gpu&lt;/li&gt;
&lt;li&gt;tensorflow/serving:latest-gpu&lt;/li&gt;
&lt;li&gt;nvcr.io/nvidia/tensorrtserver:19.10-py3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The elapsed time is recorded after &lt;strong&gt;warmup&lt;/strong&gt;. Dynamic batching is &lt;strong&gt;disabled&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;All the code can be found in this &lt;a href="https://gist.github.com/kemingy/a382528b29f6e34c47b464cf16806731"&gt;gist&lt;/a&gt;.&lt;/p&gt;
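
&lt;p&gt;The timing pattern is roughly the following (a generic sketch, not the exact code from the gist): a few warmup batches run first so one-time costs such as CUDA context creation and graph optimization are excluded from the measurement.&lt;/p&gt;

```python
import time

def fake_infer(batch):
    # stand-in for the real model call
    return [x * 2 for x in batch]

def bench(batches, warmup=5):
    # run the first few batches untimed so one-time setup costs
    # are excluded from the measurement
    for batch in batches[:warmup]:
        fake_infer(batch)
    start = time.perf_counter()
    for batch in batches[warmup:]:
        fake_infer(batch)
    return time.perf_counter() - start

batches = [[0] * 32 for _ in range(100)]
elapsed = bench(batches)
print(f"{elapsed:.4f}s for {len(batches) - 5} timed batches")
```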

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Model Type&lt;/th&gt;
&lt;th&gt;Images&lt;/th&gt;
&lt;th&gt;Batch size&lt;/th&gt;
&lt;th&gt;Time(s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tensorflow&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;83.189&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tensorflow&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;86.897&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tensorflow Serving&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;120.496&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tensorflow Serving&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;116.887&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Triton (TensorRT Inference Server)&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;201.855&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Triton (TensorRT Inference Server)&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;171.056&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Falcon + msgpack + Tensorflow&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;115.686&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Falcon + msgpack + Tensorflow&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;115.572&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
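
&lt;p&gt;For an apples-to-apples view, the table converts to throughput by dividing the 32000 images by the measured wall time (batch-size-32 rows shown here):&lt;/p&gt;

```python
# throughput implied by the batch-size-32 rows of the table above
times = {
    "Tensorflow": 83.189,
    "Tensorflow Serving": 120.496,
    "Triton": 201.855,
    "Falcon + msgpack + Tensorflow": 115.686,
}
images = 32000
for name, seconds in times.items():
    print(f"{name}: {images / seconds:.1f} images/s")
# Tensorflow comes out around 384.7 images/s, Triton around 158.5
```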

&lt;p&gt;According to this benchmark, Triton is not yet ready for production, TF Serving is a good option for TensorFlow models, and a self-hosted service also performs quite well (though you may need to implement dynamic batching yourself for production).&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tensorflow Serving
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.tensorflow.org/tfx/serving"&gt;https://www.tensorflow.org/tfx/serving&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;coupled with the TensorFlow ecosystem (other formats are supported, but not out of the box)&lt;/li&gt;
&lt;li&gt;A/B testing&lt;/li&gt;
&lt;li&gt;provide both gRPC and HTTP RESTful API&lt;/li&gt;
&lt;li&gt;prometheus integration&lt;/li&gt;
&lt;li&gt;batching&lt;/li&gt;
&lt;li&gt;multiple models&lt;/li&gt;
&lt;li&gt;preprocessing &amp;amp; postprocessing can be implemented with &lt;a href="https://github.com/tensorflow/tensorflow/issues/31055"&gt;signatures&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Triton Inference Server
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/NVIDIA/triton-inference-server/"&gt;https://github.com/NVIDIA/triton-inference-server/&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;supports multiple backends: ONNX, PyTorch, TensorFlow, Caffe2, TensorRT&lt;/li&gt;
&lt;li&gt;both gRPC and HTTP with SDK&lt;/li&gt;
&lt;li&gt;internal health check and prometheus metrics&lt;/li&gt;
&lt;li&gt;batching&lt;/li&gt;
&lt;li&gt;concurrent model execution&lt;/li&gt;
&lt;li&gt;preprocessing &amp;amp; postprocessing can be done with ensemble models&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;shm-size&lt;/code&gt;, &lt;code&gt;memlock&lt;/code&gt;, &lt;code&gt;stack&lt;/code&gt; configurations are not available for Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multi Model Server
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/awslabs/multi-model-server"&gt;https://github.com/awslabs/multi-model-server&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requires Java 8&lt;/li&gt;
&lt;li&gt;provides an HTTP API&lt;/li&gt;
&lt;li&gt;Java layer communicates with Python workers through Unix Domain Socket or TCP&lt;/li&gt;
&lt;li&gt;batching (not mature)&lt;/li&gt;
&lt;li&gt;multiple models&lt;/li&gt;
&lt;li&gt;&lt;code&gt;log4j&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;management API&lt;/li&gt;
&lt;li&gt;you need to write the model loading and inference code yourself (which means you can use any runtime you want)&lt;/li&gt;
&lt;li&gt;easy to add preprocessing and postprocessing to the service&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GraphPipe
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://oracle.github.io/graphpipe"&gt;https://oracle.github.io/graphpipe&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;uses FlatBuffers, which is more efficient&lt;/li&gt;
&lt;li&gt;last updated 2 years ago...&lt;/li&gt;
&lt;li&gt;Oracle laid off the whole team&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  TorchServe
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/pytorch/serve"&gt;https://github.com/pytorch/serve&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;forked from Multi Model Server&lt;/li&gt;
&lt;li&gt;still under development...&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>tensorflow</category>
    </item>
  </channel>
</rss>
