<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ming</title>
    <description>The latest articles on DEV Community by Ming (@keming).</description>
    <link>https://dev.to/keming</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F372112%2Ff5920f16-2073-48e4-a997-89d8e21b887a.jpeg</url>
      <title>DEV Community: Ming</title>
      <link>https://dev.to/keming</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/keming"/>
    <language>en</language>
    <item>
      <title>Lessons learned from improving a Rust program</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Sun, 13 Oct 2024 10:29:29 +0000</pubDate>
      <link>https://dev.to/keming/improve-an-algorithm-performance-step-by-step-1jnf</link>
      <guid>https://dev.to/keming/improve-an-algorithm-performance-step-by-step-1jnf</guid>
      <description>&lt;p&gt;Recently, I've been working on a new approximate nearest neighbor search algorithm called &lt;a href="https://arxiv.org/abs/2405.12497" rel="noopener noreferrer"&gt;RaBitQ&lt;/a&gt;. The author has already provided a &lt;a href="https://github.com/gaoj0017/RaBitQ" rel="noopener noreferrer"&gt;C++ implementation&lt;/a&gt; that runs quite fast. I tried to &lt;a href="https://github.com/kemingy/rabitq" rel="noopener noreferrer"&gt;rewrite it in Rust&lt;/a&gt; (yet another RiiR). However, I found that my implementation was much slower than the original one. Here is how I improve the performance step by step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prepare the environment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Datasets
&lt;/h3&gt;

&lt;p&gt;The most important thing is to have some reasonable datasets. Since the paper already demonstrates results on the &lt;code&gt;sift_dim128_1m_l2&lt;/code&gt; and &lt;code&gt;gist_dim960_1m_l2&lt;/code&gt; datasets, 128 and 960 dimensions are typical, and 1_000_000 vectors should be sufficient for benchmarking purposes, so I decided to use them as well. The datasets can be downloaded from &lt;a href="http://corpus-texmex.irisa.fr/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. (Yes, I know this site doesn't have TLS and only provides FTP downloads.)&lt;/p&gt;

&lt;p&gt;The format used by these datasets is called &lt;code&gt;fvecs/ivecs&lt;/code&gt;, which is a common vector format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;dim &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;vector &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;dim &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;vector &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;dim &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;vector &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can get the read/write script from my &lt;a href="https://gist.github.com/kemingy/2f503fcfff86b9e0197e975c02359157" rel="noopener noreferrer"&gt;gist&lt;/a&gt;.&lt;/p&gt;
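
&lt;p&gt;As a rough sketch of what the reading side looks like (this is not the gist's exact code; it assumes little-endian data and parses from an in-memory buffer):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;fn parse_fvecs(data: &amp;amp;[u8]) -&amp;gt; Vec&amp;lt;Vec&amp;lt;f32&amp;gt;&amp;gt; {
    let mut vectors = Vec::new();
    let mut offset = 0;
    while offset &amp;lt; data.len() {
        // each record starts with the dimension as a little-endian u32
        let dim = u32::from_le_bytes(data[offset..offset + 4].try_into().unwrap()) as usize;
        offset += 4;
        let mut vector = Vec::with_capacity(dim);
        for _ in 0..dim {
            vector.push(f32::from_le_bytes(data[offset..offset + 4].try_into().unwrap()));
            offset += 4;
        }
        vectors.push(vector);
    }
    vectors
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;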

&lt;h3&gt;
  
  
  Profiling tool
&lt;/h3&gt;

&lt;p&gt;I use &lt;a href="https://github.com/mstange/samply" rel="noopener noreferrer"&gt;samply&lt;/a&gt; to profile the Rust code. It has a nice integration with the &lt;a href="https://profiler.firefox.com/" rel="noopener noreferrer"&gt;Firefox Profiler&lt;/a&gt;. You can also share the profiling results with others by uploading them to the cloud. Here is &lt;a href="https://share.firefox.dev/3Y4Hppz" rel="noopener noreferrer"&gt;an example of the C++ version profiling on GIST&lt;/a&gt;. The FlameGraph and CallTree are the most common views. Remember to grant the performance event permission and increase the &lt;code&gt;mlock&lt;/code&gt; limit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'1'&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /proc/sys/kernel/perf_event_paranoid
&lt;span class="nb"&gt;sudo &lt;/span&gt;sysctl kernel.perf_event_mlock_kb&lt;span class="o"&gt;=&lt;/span&gt;2048
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://godbolt.org/" rel="noopener noreferrer"&gt;GodBolt&lt;/a&gt; compiler explorer is also useful for comparing the assembly function code between C++ and Rust.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cargo profile
&lt;/h3&gt;

&lt;p&gt;To include the debug information in the release build, you can add another profile to the &lt;code&gt;Cargo.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[profile.perf]&lt;/span&gt;
&lt;span class="py"&gt;inherits&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"release"&lt;/span&gt;
&lt;span class="py"&gt;debug&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;codegen-units&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compilation time and runtime speed both greatly affect the profiling experience.&lt;/p&gt;
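
&lt;p&gt;Assuming the extra profile above is named &lt;code&gt;perf&lt;/code&gt; and the binary is called &lt;code&gt;rabitq&lt;/code&gt; (adjust to your own crate), building and profiling looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo build --profile perf
samply record ./target/perf/rabitq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;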

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cargo build&lt;/code&gt; compiles quickly, but the unoptimized code may be slower than pure Python&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cargo build --release&lt;/code&gt; runs fast but it might take a long time to compile&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For benchmarking, we have no choice but to use &lt;code&gt;opt-level = 3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I saw some advice to use the following settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;codegen-units&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;lto&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"fat"&lt;/span&gt;
&lt;span class="py"&gt;panic&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"abort"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my case, this only slows down the compilation speed and doesn't improve the performance at all.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark tool
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/bheisler/criterion.rs" rel="noopener noreferrer"&gt;Criterion&lt;/a&gt; is a good statistics-driven benchmark tool. I create another &lt;a href="https://github.com/kemingy/rs_bench" rel="noopener noreferrer"&gt;repo&lt;/a&gt; to store all the related benchmark codes. It turns out that I should put them in the same repo.&lt;/p&gt;

&lt;p&gt;One thing to note is that the benchmark results are not very stable. I have seen &lt;strong&gt;&lt;code&gt;±10%&lt;/code&gt;&lt;/strong&gt; differences without modifying the code. If you're benchmarking on a laptop, this can be even worse, since the CPU might be underclocked due to high temperature.&lt;/p&gt;

&lt;p&gt;I suggest benchmarking the function with several different parameters. In this case, I use different vector dimensions. If the results for all the dimensions are positive, it usually means that the improvement is effective.&lt;/p&gt;
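
&lt;p&gt;A minimal sketch of such a parameterized benchmark with Criterion (the &lt;code&gt;l2_squared&lt;/code&gt; function here is just a placeholder for whatever is being measured):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn l2_squared(x: &amp;amp;[f32], y: &amp;amp;[f32]) -&amp;gt; f32 {
    x.iter().zip(y.iter()).map(|(a, b)| (a - b) * (a - b)).sum()
}

fn bench_l2(c: &amp;amp;mut Criterion) {
    // run the same benchmark over several typical dimensions
    for dim in [128usize, 960] {
        let x = vec![1.0f32; dim];
        let y = vec![2.0f32; dim];
        c.bench_function(&amp;amp;format!("l2_squared_dim{dim}"), |b| {
            b.iter(|| l2_squared(black_box(&amp;amp;x), black_box(&amp;amp;y)))
        });
    }
}

criterion_group!(benches, bench_l2);
criterion_main!(benches);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;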

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;Remember to add some metrics from the start. Many bugs and performance issues can be found by checking the metrics. I use &lt;code&gt;AtomicU64&lt;/code&gt; directly since the current requirements are simple. I may switch to the &lt;a href="https://github.com/prometheus/client_rust" rel="noopener noreferrer"&gt;Prometheus metrics&lt;/a&gt; later.&lt;/p&gt;

&lt;p&gt;Note that too many metrics/logging/traces can also affect the performance. So be careful when adding them.&lt;/p&gt;
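
&lt;p&gt;A minimal sketch of what such counters look like (the metric names are made up for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::sync::atomic::{AtomicU64, Ordering};

// global counters; relaxed ordering is enough for simple statistics
static QUERY_COUNT: AtomicU64 = AtomicU64::new(0);
static PRECISE_DISTANCE_COUNT: AtomicU64 = AtomicU64::new(0);

fn record_query() {
    QUERY_COUNT.fetch_add(1, Ordering::Relaxed);
}

fn snapshot() -&amp;gt; (u64, u64) {
    (
        QUERY_COUNT.load(Ordering::Relaxed),
        PRECISE_DISTANCE_COUNT.load(Ordering::Relaxed),
    )
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;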

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;During the benchmark, I noticed that the end-to-end QPS was extremely unstable. I could get a &lt;strong&gt;15%&lt;/strong&gt; improvement or deterioration the next morning without recompiling the code. It turned out that the CPUs were not completely idle: I had VSCode + Rust Analyzer running in the background. They don't seem to consume much CPU, but they affect the benchmark results heavily, even though I'm using an &lt;a href="https://www.intel.com/content/www/us/en/products/sku/230500/intel-core-i713700k-processor-30m-cache-up-to-5-40-ghz/specifications.html" rel="noopener noreferrer"&gt;Intel Core i7-13700K&lt;/a&gt;, which has 8 performance cores and 8 efficient cores, and the program is single-threaded.&lt;/p&gt;

&lt;p&gt;I use &lt;a href="https://www.man7.org/linux/man-pages/man1/taskset.1.html" rel="noopener noreferrer"&gt;&lt;code&gt;taskset&lt;/code&gt;&lt;/a&gt; to bind the process to a specific CPU. This way it won't be affected by mixed cores scheduling.&lt;/p&gt;
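
&lt;p&gt;For example, to pin the benchmark to one performance core (the core index and binary name here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;taskset -c 0 ./target/perf/rabitq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;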

&lt;p&gt;Note that Intel Core 13th/14th gen CPUs are affected by an instability problem caused by excessively high voltage. I have fixed this in the BIOS.&lt;/p&gt;

&lt;p&gt;Cloud VMs may not be affected by the CPU temperature, but the cloud providers may have their own CPU throttling and overbooking policies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step by Step Improvement
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Start with a naive implementation
&lt;/h3&gt;

&lt;p&gt;My &lt;a href="https://github.com/kemingy/rabitq/tree/dbfd54bd5d739b0729dc28e6fbd8d5413b019561" rel="noopener noreferrer"&gt;first release&lt;/a&gt; implemented the RaBitQ algorithm based on an algebra library called &lt;a href="https://docs.rs/nalgebra" rel="noopener noreferrer"&gt;nalgebra&lt;/a&gt;. The main reason is that I need the QR decomposition to obtain the orthogonal matrix, which is a key step in the RaBitQ algorithm. Also, a mature linear algebra library provides many useful functions for manipulating matrices and vectors, making it easier to implement the algorithm. Imagine implementing an algorithm involving matrix multiplication, projection and decomposition in Python without &lt;code&gt;numpy&lt;/code&gt;: it's a nightmare.&lt;/p&gt;

&lt;p&gt;I thought the performance would be good since &lt;code&gt;nalgebra&lt;/code&gt; is optimized for this kind of scenario. But the benchmark shows that it is much slower than I expected. I guess reimplementing it in &lt;code&gt;numpy&lt;/code&gt; would be much faster :(&lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://share.firefox.dev/3AwiVNR" rel="noopener noreferrer"&gt;profiling&lt;/a&gt;, there are lots of &lt;code&gt;f32::clone()&lt;/code&gt; calls. It takes about 33% of the total time, or 44% if you focus on the &lt;code&gt;query_one&lt;/code&gt; function. This reminds me that I can preallocate the memory for some vectors and reuse it in the iteration, a very common trick. So instead of using &lt;code&gt;(x - y).norm_squared()&lt;/code&gt;, I need to pre-declare another vector that stores the result of &lt;code&gt;(x - y)&lt;/code&gt;, which ends up being &lt;code&gt;x.sub_to(y, &amp;amp;mut z); z.norm_squared()&lt;/code&gt;. See the &lt;a href="https://github.com/kemingy/rabitq/commit/23f9aff4c8b3303c0a03ac9a7472ada8cc915a3b" rel="noopener noreferrer"&gt;commit 23f9aff&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Like most linear algebra libraries, it stores matrices in column-major order, which means iterating over a column can be faster than over a row. It's a bit annoying because I have to transpose the matrix before the iteration, and not all the vector/matrix multiplications can detect the dimension mismatch error (&lt;code&gt;1 x dyn&lt;/code&gt; or &lt;code&gt;dyn x 1&lt;/code&gt;) during compilation.&lt;/p&gt;

&lt;h3&gt;
  
  
  CPU target
&lt;/h3&gt;

&lt;p&gt;RaBitQ uses the binary dot product distance to estimate the approximate distance, which is computed by:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;binary_dot_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;assert_eq!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="nf"&gt;.count_ones&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;res&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I thought the &lt;a href="https://doc.rust-lang.org/std/primitive.u64.html#method.count_ones" rel="noopener noreferrer"&gt;&lt;code&gt;u64::count_ones()&lt;/code&gt;&lt;/a&gt; here would use the intrinsic directly. It turns out that I still need to enable the &lt;code&gt;popcnt&lt;/code&gt; feature during compilation. This can be done with &lt;code&gt;RUSTFLAGS="-C target-feature=+popcnt"&lt;/code&gt;, but I prefer &lt;code&gt;RUSTFLAGS="-C target-cpu=native"&lt;/code&gt;, which enables all the CPU features supported by the current CPU at the cost of making the binary non-portable, which is fine for now. The following sections also require this &lt;code&gt;env&lt;/code&gt; to enable the AVX2 features.&lt;/p&gt;
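
&lt;p&gt;If you don't want to export the environment variable for every command, the same flag can live in &lt;code&gt;.cargo/config.toml&lt;/code&gt; (note that it then applies to all builds in that directory tree):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;[build]
rustflags = ["-C", "target-cpu=native"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;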

&lt;p&gt;You can use the following command to check your CPU features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rustc &lt;span class="nt"&gt;--print&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cfg &lt;span class="nt"&gt;-C&lt;/span&gt; target-cpu&lt;span class="o"&gt;=&lt;/span&gt;native | rg target_feature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  SIMD
&lt;/h3&gt;

&lt;p&gt;The key function for the nearest neighbor search is the distance function, which in this case is the Euclidean distance. We usually use the L2 square distance to avoid the square root computation. The naive implementation is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="nf"&gt;.sub_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;residual&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;residual&lt;/span&gt;&lt;span class="nf"&gt;.norm_squared&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After profiling, I found that it still has &lt;code&gt;f32::clone()&lt;/code&gt;. By checking the source code of &lt;code&gt;nalgebra&lt;/code&gt;, I found that there are many &lt;code&gt;clone&lt;/code&gt; calls for reasons I don't understand. I decided to write the SIMD code by hand. Fortunately, &lt;a href="https://github.com/nmslib/hnswlib" rel="noopener noreferrer"&gt;hnswlib&lt;/a&gt; (a popular HNSW implementation) already implements &lt;a href="https://github.com/nmslib/hnswlib/blob/master/hnswlib/space_l2.h" rel="noopener noreferrer"&gt;this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This eliminates the &lt;code&gt;f32::clone()&lt;/code&gt; in the distance computation and improves the QPS by &lt;strong&gt;28%&lt;/strong&gt; for SIFT. Check the &lt;a href="https://github.com/kemingy/rabitq/commit/5f82fccf8b39964ef1f66e9927fb126fd6886765" rel="noopener noreferrer"&gt;commit 5f82fcc&lt;/a&gt;.&lt;/p&gt;
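
&lt;p&gt;The hand-written AVX2 version follows the same shape as the hnswlib code. Roughly (a sketch of the idea, not the exact commit):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn l2_squared_avx2(x: &amp;amp;[f32], y: &amp;amp;[f32]) -&amp;gt; f32 {
    use std::arch::x86_64::*;
    let mut sum = _mm256_setzero_ps();
    let chunks = x.len() / 8;
    for i in 0..chunks {
        // accumulate (x - y)^2 for 8 lanes at a time
        let a = _mm256_loadu_ps(x.as_ptr().add(i * 8));
        let b = _mm256_loadu_ps(y.as_ptr().add(i * 8));
        let d = _mm256_sub_ps(a, b);
        sum = _mm256_add_ps(sum, _mm256_mul_ps(d, d));
    }
    // horizontal sum of the 8 lanes
    let mut buf = [0.0f32; 8];
    _mm256_storeu_ps(buf.as_mut_ptr(), sum);
    let mut res: f32 = buf.iter().sum();
    // scalar loop for the tail elements
    for i in (chunks * 8)..x.len() {
        let d = x[i] - y[i];
        res += d * d;
    }
    res
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The caller checks &lt;code&gt;is_x86_feature_detected!("avx2")&lt;/code&gt; once and dispatches to this or the scalar version.&lt;/p&gt;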

&lt;p&gt;My CPU doesn't support AVX512, so I use the AVX2 version. You can check the &lt;a href="https://store.steampowered.com/hwsurvey/" rel="noopener noreferrer"&gt;Steam Hardware Stats&lt;/a&gt;, which lists SIMD support under "&lt;em&gt;Other Settings&lt;/em&gt;": &lt;strong&gt;100%&lt;/strong&gt; of users have SSE3, &lt;strong&gt;94.61%&lt;/strong&gt; have AVX2, and only &lt;strong&gt;13.06%&lt;/strong&gt; have AVX512F. Of course, this statistic is biased: most cloud Intel CPUs have AVX512 support, and game players cannot represent all users.&lt;/p&gt;

&lt;p&gt;To use SIMD, the most useful guide is the &lt;a href="https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#" rel="noopener noreferrer"&gt;Intel Intrinsics Guide&lt;/a&gt;. It's better to download the website for offline use, since the online experience is not good. Remember to check the "&lt;strong&gt;latency&lt;/strong&gt;" and "&lt;strong&gt;throughput&lt;/strong&gt;" of the intrinsics, otherwise your code may be slower than the normal version.&lt;/p&gt;

&lt;p&gt;Another resource is the &lt;a href="https://db.in.tum.de/~finis/x86%20intrinsics%20cheat%20sheet%20v1.0.pdf" rel="noopener noreferrer"&gt;x86 Intrinsics Cheat Sheet&lt;/a&gt;. This is good for newbies like me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ashvardanian" rel="noopener noreferrer"&gt;@ashvardanian&lt;/a&gt; has a &lt;a href="https://ashvardanian.com/posts/simsimd-faster-scipy/#tails-of-the-past-the-significance-of-masked-loads" rel="noopener noreferrer"&gt;post&lt;/a&gt; about the "mask load" that solves the tail elements problem (requires AVX512).&lt;/p&gt;

&lt;p&gt;To make the code work on other platforms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[cfg(any(target_arch&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"x86_64"&lt;/span&gt;&lt;span class="nd"&gt;,&lt;/span&gt; &lt;span class="nd"&gt;target_arch&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"x86"&lt;/span&gt;&lt;span class="nd"&gt;))]&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nd"&gt;is_x86_feature_detected!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"avx2"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// AVX2 version&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// normal version&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are some useful crates for writing better &lt;code&gt;cfg&lt;/code&gt; declarations for SIMD, but let's keep it simple for now.&lt;/p&gt;

&lt;h3&gt;
  
  
  More SIMD
&lt;/h3&gt;

&lt;p&gt;SIMD is like a hammer, now I need to find more nails in the code.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rewriting the &lt;code&gt;binarize_vector&lt;/code&gt; function with AVX2 in &lt;a href="https://github.com/kemingy/rabitq/commit/f114fc1ec58686596ade0df02a96fcf04b0bf828" rel="noopener noreferrer"&gt;commit f114fc1&lt;/a&gt; improved the QPS by &lt;strong&gt;32%&lt;/strong&gt; for GIST.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;@andrewaylett pointed out that &lt;code&gt;opt-level=3&lt;/code&gt; can optimize this&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;del&gt;Compared to the original C++ version, this implementation is also branchless.&lt;/del&gt; When enabling &lt;code&gt;opt-level=3&lt;/code&gt;, this can be optimized by the compiler. See the &lt;a href="https://godbolt.org/z/hjP5qjabz" rel="noopener noreferrer"&gt;assembly&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- let shift = if (i / 32) % 2 == 0 { 32 } else { 0 };
&lt;/span&gt;&lt;span class="gi"&gt;+ let shift = ((i &amp;gt;&amp;gt; 5) &amp;amp; 1) &amp;lt;&amp;lt; 5;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;a class="mentioned-user" href="https://dev.to/novax"&gt;@novax&lt;/a&gt; first pointed out that it's equivalent to &lt;code&gt;i &amp;amp; 32&lt;/code&gt;, which is more readable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;See the &lt;a href="https://godbolt.org/z/YbP5vW34q" rel="noopener noreferrer"&gt;assembly&lt;/a&gt; for the difference.&lt;/p&gt;

&lt;p&gt;Well, going branchless doesn't make the overall performance much better since the &lt;code&gt;binarize_vector&lt;/code&gt; function is called only once for each query. But it's a good learning opportunity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalar quantization
&lt;/h3&gt;

&lt;p&gt;To eliminate more &lt;code&gt;f32::clone()&lt;/code&gt; in the code, I decided to replace more &lt;code&gt;nalgebra&lt;/code&gt; functions with the manual implementation. The &lt;code&gt;min&lt;/code&gt; and &lt;code&gt;max&lt;/code&gt; functions are the most common ones. The &lt;code&gt;nalgebra&lt;/code&gt; version is like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;lower_bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;residual&lt;/span&gt;&lt;span class="nf"&gt;.min&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;upper_bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;residual&lt;/span&gt;&lt;span class="nf"&gt;.max&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This can be done by:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;min_max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;min&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;min&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;min&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I used to use &lt;code&gt;f32::min()&lt;/code&gt; and &lt;code&gt;f32::max()&lt;/code&gt; because they are convenient. But for vectors that aren't sorted in ascending/descending order, the &lt;code&gt;if&lt;/code&gt; comparisons perform better.&lt;/p&gt;

&lt;p&gt;Instead of iterating through the vector several times in a function chain, computing the scalar quantization and its sum in separate passes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;y_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;residual&lt;/span&gt;&lt;span class="nf"&gt;.add_scalar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lower_bound&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;one_over_delta&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.rand_bias&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;y_quantized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_scaled&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="nf"&gt;.to_u8&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"convert to u8 error"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;scalar_sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_quantized&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.fold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can do this in one loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0u32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;lower_bound&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;multiplier&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;quantized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;sum&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For scalar quantization, we know the &lt;code&gt;f32&lt;/code&gt; value fits in a &lt;code&gt;u8&lt;/code&gt;, so we can use &lt;code&gt;as u8&lt;/code&gt; instead of &lt;code&gt;to_u8().unwrap()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/kemingy/rabitq/commit/af39c1ce47eb8ea32e11f47b99548e77846397ea" rel="noopener noreferrer"&gt;commit af39c1c&lt;/a&gt; &amp;amp; &lt;a href="https://github.com/kemingy/rabitq/commit/d2d51b0785f0234df4d83a60eea96a36486a1120" rel="noopener noreferrer"&gt;commit d2d51b0&lt;/a&gt; improved the QPS by &lt;strong&gt;31%&lt;/strong&gt; for GIST.&lt;/p&gt;
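&lt;p&gt;To make the idea concrete outside of Rust, here is a minimal Python sketch of the same one-pass scalar quantization (names are illustrative, not the crate's API; &lt;code&gt;bias&lt;/code&gt; stands in for the per-dimension randomization used above):&lt;/p&gt;

```python
def scalar_quantize(vec, bias):
    """Quantize float values into [0, 255] and accumulate the sum
    in the same loop, mirroring the Rust snippet above."""
    lower = min(vec)
    upper = max(vec)
    multiplier = 255.0 / (upper - lower)
    quantized = []
    total = 0
    for v, b in zip(vec, bias):
        # the result is guaranteed to fit in a u8, so a plain cast is safe
        q = int((v - lower) * multiplier + b)
        quantized.append(q)
        total += q
    return quantized, total

# three values spread over [0.0, 2.0] with zero bias
assert scalar_quantize([0.0, 1.0, 2.0], [0.0, 0.0, 0.0]) == ([0, 127, 255], 382)
```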

&lt;p&gt;The following parts can also be rewritten with SIMD, which improves the QPS by &lt;strong&gt;12%&lt;/strong&gt; for GIST:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;min/max: &lt;a href="https://github.com/kemingy/rabitq/commit/c97be68c13c7b4498b564afe3de2a1f6d8bca5ce" rel="noopener noreferrer"&gt;commit c97be68&lt;/a&gt; &amp;amp; &lt;a href="https://github.com/kemingy/rabitq/commit/e5a4af05433bf724da6902d34a745b4b2bdefd8d" rel="noopener noreferrer"&gt;commit e5a4af0&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;scalar quantization: &lt;a href="https://github.com/kemingy/rabitq/commit/28efe097a46696bb1a5469db22e500bafdc04514" rel="noopener noreferrer"&gt;commit 28efe09&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also tried replacing &lt;code&gt;tr_mul&lt;/code&gt; (a vector projection) with SIMD. It turns out that &lt;code&gt;nalgebra&lt;/code&gt; already uses &lt;a href="https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms" rel="noopener noreferrer"&gt;&lt;code&gt;BLAS&lt;/code&gt;&lt;/a&gt; here, so the performance stays the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  Yet another algebra crate: faer
&lt;/h3&gt;

&lt;p&gt;I found another Rust algebra crate called &lt;a href="https://github.com/sarah-quinones/faer-rs" rel="noopener noreferrer"&gt;faer&lt;/a&gt; while investigating the &lt;code&gt;f32::clone()&lt;/code&gt; problem. It's heavily optimized with SIMD and offers better row/column iteration performance. Its QR decomposition is also much faster than &lt;code&gt;nalgebra&lt;/code&gt;'s. &lt;a href="https://github.com/kemingy/rabitq/commit/04118219d28bd0d43594c98c71e752faa81ff79d" rel="noopener noreferrer"&gt;Commit 0411821&lt;/a&gt; makes the training part faster.&lt;/p&gt;

&lt;p&gt;Also, I can now use these vectors as a normal slice without the &lt;code&gt;ColRef&lt;/code&gt; or &lt;code&gt;RowRef&lt;/code&gt; wrapper after &lt;a href="https://github.com/kemingy/rabitq/commit/0d969bdcfb331f87e938e043e01acc648e1cf963" rel="noopener noreferrer"&gt;commit 0d969bd&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I have to admit that if I had used &lt;code&gt;faer&lt;/code&gt; from the beginning, I could have avoided a lot of trouble. Anyway, I learned a lot from this experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Binary dot product
&lt;/h3&gt;

&lt;p&gt;I thought &lt;code&gt;popcnt&lt;/code&gt; had already solved the binary dot product, but the &lt;a href="https://share.firefox.dev/3Yk3Ok8" rel="noopener noreferrer"&gt;FlameGraph&lt;/a&gt; shows that &lt;code&gt;count_ones()&lt;/code&gt; takes only 7% of the time in &lt;code&gt;binary_dot_product&lt;/code&gt;. Although AVX512 has the &lt;code&gt;vpopcntq&lt;/code&gt; instruction, I prefer the AVX2 emulation since AVX2 is more widely available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/komrad36/popcount/blob/master/popcnt.h" rel="noopener noreferrer"&gt;This&lt;/a&gt; is a good reference for the &lt;code&gt;popcnt&lt;/code&gt; implementation with AVX2. The &lt;a href="https://github.com/kemingy/rabitq/commit/edabd4a64c5b8ea2637b5332105638edf16afa7c" rel="noopener noreferrer"&gt;commit edabd4a&lt;/a&gt; re-implement this in Rust which improves the QPS by &lt;strong&gt;11%&lt;/strong&gt; for GIST. This trick only works when the vector has more than 256 dimensions, which means 256 bits for the binary representation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inline
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://doc.rust-lang.org/reference/attributes/codegen.html#the-inline-attribute" rel="noopener noreferrer"&gt;#[inline]&lt;/a&gt; attribute should be used with caution. Adding this attribute to all the SIMD functions improves the QPS by &lt;strong&gt;5%&lt;/strong&gt; for GIST.&lt;/p&gt;

&lt;h3&gt;
  
  
  IO
&lt;/h3&gt;

&lt;p&gt;I need to add some background information here.&lt;/p&gt;

&lt;p&gt;The current implementation is based on the IVF algorithm, which uses &lt;a href="https://en.wikipedia.org/wiki/K-means_clustering" rel="noopener noreferrer"&gt;&lt;em&gt;k&lt;/em&gt;-means&lt;/a&gt; to cluster the vectors and stores the centroids in memory. The query vector is only compared against the clusters with the smallest &lt;code&gt;l2_squared_distance(query, centroid)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A parameter called &lt;code&gt;n_probe&lt;/code&gt; controls how many of the nearest clusters are probed. A larger &lt;code&gt;n_probe&lt;/code&gt; increases the recall but decreases the QPS.&lt;/p&gt;

&lt;p&gt;RaBitQ uses the binary dot product to estimate the approximate distance. If it's smaller than the threshold, it will re-rank with the original L2 squared distance and update the threshold accordingly.&lt;/p&gt;
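&lt;p&gt;A minimal sketch of that filtering loop in Python (&lt;code&gt;approx&lt;/code&gt; and &lt;code&gt;exact&lt;/code&gt; are hypothetical distance functions, not the crate's API):&lt;/p&gt;

```python
def rerank(candidates, approx, exact, threshold):
    """Only compute the expensive exact distance when the cheap
    estimate beats the current threshold; tighten the threshold
    whenever a better exact distance is found."""
    best = None
    for c in candidates:
        if approx(c) < threshold:
            d = exact(c)
            if d < threshold:
                threshold = d
                best = c
    return best, threshold

# with an optimistic estimate, only the true improvement survives
best, threshold = rerank([4.0, 2.0, 3.0], lambda c: c - 0.5, lambda c: c, 3.0)
assert (best, threshold) == (2.0, 2.0)
```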

&lt;p&gt;Previously, I used &lt;a href="https://doc.rust-lang.org/std/primitive.slice.html#method.select_nth_unstable" rel="noopener noreferrer"&gt;&lt;code&gt;slice::select_nth_unstable&lt;/code&gt;&lt;/a&gt;, which selects the n nearest clusters but doesn't sort them. Visiting clusters that are far from the query first increases the re-ranking ratio, which requires more L2 squared distance computations. Re-sorting the selected clusters improved the QPS by &lt;strong&gt;4%&lt;/strong&gt; for GIST.&lt;/p&gt;
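&lt;p&gt;The fix is essentially "select, then sort the selected prefix". In Python the same behavior can be sketched with &lt;code&gt;heapq.nsmallest&lt;/code&gt;, which returns its selection already sorted, i.e. exactly the visiting order we want:&lt;/p&gt;

```python
import heapq

def probe_order(distances, n_probe):
    """Pick the n_probe closest clusters and visit them in ascending
    distance order, so near clusters tighten the threshold early."""
    return heapq.nsmallest(
        n_probe, range(len(distances)), key=lambda i: distances[i]
    )

# cluster 1 (distance 1.0) is probed before cluster 2 (distance 3.0)
assert probe_order([5.0, 1.0, 3.0, 9.0], 2) == [1, 2]
```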

&lt;p&gt;Another trick is to sort the vectors in each cluster by their distance to the centroid; &lt;a href="https://github.com/kemingy/rabitq/commit/ea13ebca46257d7c2e22250fe02a481e7681f0a9" rel="noopener noreferrer"&gt;commit ea13ebc&lt;/a&gt; also improved the QPS by &lt;strong&gt;4%&lt;/strong&gt; for GIST.&lt;/p&gt;

&lt;p&gt;Each vector has some metadata used to estimate its approximate distance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;factor_ip: f32&lt;/li&gt;
&lt;li&gt;factor_ppc: f32&lt;/li&gt;
&lt;li&gt;error: f32&lt;/li&gt;
&lt;li&gt;x_c_distance_square: f32&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Previously, I used four &lt;code&gt;Vec&amp;lt;f32&amp;gt;&lt;/code&gt;s to store them, which is not IO-friendly since the calculation needs element &lt;code&gt;i&lt;/code&gt; of all four. Combining them into one &lt;code&gt;struct&lt;/code&gt; in &lt;a href="https://github.com/kemingy/rabitq/commit/bb440e3e8b150f590523eaa77e7c62165a5ee764" rel="noopener noreferrer"&gt;commit bb440e3&lt;/a&gt; improved the QPS by &lt;strong&gt;2.5%&lt;/strong&gt; for GIST. This works well because it's four f32s, so I can use the C representation directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[derive(Debug,&lt;/span&gt; &lt;span class="nd"&gt;Clone,&lt;/span&gt; &lt;span class="nd"&gt;Copy,&lt;/span&gt; &lt;span class="nd"&gt;Default,&lt;/span&gt; &lt;span class="nd"&gt;Serialize,&lt;/span&gt; &lt;span class="nd"&gt;Deserialize)]&lt;/span&gt;
&lt;span class="nd"&gt;#[repr(C)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Factor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;factor_ip&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;factor_ppc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;error_bound&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;center_distance_square&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, &lt;code&gt;faer&lt;/code&gt; doesn't support u64 vectors, so I had to store the binary representations in a &lt;code&gt;Vec&amp;lt;Vec&amp;lt;u64&amp;gt;&amp;gt;&lt;/code&gt;. Flattening it to a single &lt;code&gt;Vec&amp;lt;u64&amp;gt;&lt;/code&gt; in &lt;a href="https://github.com/kemingy/rabitq/commit/48236b23069db92bdb741fc6693e126b52c397ce" rel="noopener noreferrer"&gt;commit 48236b2&lt;/a&gt; improved the QPS by &lt;strong&gt;2%&lt;/strong&gt; for GIST.&lt;/p&gt;
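&lt;p&gt;The flattened layout keeps all rows contiguous in memory; each row then becomes a fixed-size slice computed from the index (a sketch of the idea, not the crate's code):&lt;/p&gt;

```python
def get_binary_row(flat, words_per_vec, i):
    """Row i of a flattened binary matrix: one contiguous list of
    64-bit words instead of a list of lists."""
    start = i * words_per_vec
    return flat[start : start + words_per_vec]

# 3 vectors of 2 words each, stored back to back
flat = [1, 2, 3, 4, 5, 6]
assert get_binary_row(flat, 2, 1) == [3, 4]
```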

&lt;h3&gt;
  
  
  Const generics
&lt;/h3&gt;

&lt;p&gt;The C++ version uses templates to generate code for different dimensions, and Rust offers the same capability through const generics. I didn't try it, because re-compiling the code for each dimension is only practical in specific settings, such as inside a company with a few fixed dimensions. For a public library, it's better to provide a general solution so users don't have to re-compile it themselves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other tools
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/Shnatsel/bounds-check-cookbook/" rel="noopener noreferrer"&gt;bounds-check-cookbook&lt;/a&gt; provides several examples of how to eliminate bounds checking in safe Rust.&lt;/p&gt;

&lt;p&gt;I tried &lt;a href="https://doc.rust-lang.org/rustc/profile-guided-optimization.html" rel="noopener noreferrer"&gt;PGO&lt;/a&gt; and &lt;a href="https://github.com/llvm/llvm-project/tree/main/bolt" rel="noopener noreferrer"&gt;BOLT&lt;/a&gt; but didn't get any improvement.&lt;/p&gt;

&lt;p&gt;Switching to &lt;a href="https://github.com/tikv/jemallocator" rel="noopener noreferrer"&gt;jemalloc&lt;/a&gt; or &lt;a href="https://github.com/microsoft/mimalloc" rel="noopener noreferrer"&gt;mimalloc&lt;/a&gt; doesn't improve the performance either.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SIMD is awesome when it's used properly&lt;/li&gt;
&lt;li&gt;IO also matters, especially for large datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The current performance matches the C++ version on the GIST dataset. I use more SIMD, while the C++ version relies on const generics.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.algorithmica.org/hpc/algorithms/matmul/" rel="noopener noreferrer"&gt;Algorithmica / HPC&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rust</category>
      <category>nlp</category>
      <category>algorithms</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>User Authorization with Postgres Row Level Security Policy</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Tue, 04 Jun 2024 13:31:02 +0000</pubDate>
      <link>https://dev.to/keming/user-authorization-with-postgres-row-level-security-policy-4g91</link>
      <guid>https://dev.to/keming/user-authorization-with-postgres-row-level-security-policy-4g91</guid>
      <description>&lt;p&gt;Supabase has a &lt;a href="https://github.com/supabase/storage"&gt;storage gateway&lt;/a&gt; that uses &lt;a href="https://www.postgresql.org/docs/current/ddl-rowsecurity.html"&gt;RLS&lt;/a&gt; for authorization.&lt;/p&gt;

&lt;p&gt;It requires a JWT that provides the role information for executing the SQL. Here is an example of the JWT payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"authenticated"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1516239022&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"f918ffd9-a611-4b2a-b4bb-df8f25d7569f"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The storage bucket table is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                        Table "storage.buckets"
   Column   |           Type           | Collation | Nullable | Default 
------------+--------------------------+-----------+----------+---------
 id         | text                     |           | not null | 
 name       | text                     |           | not null | 
 owner      | uuid                     |           |          | 
 created_at | timestamp with time zone |           |          | now()
 updated_at | timestamp with time zone |           |          | now()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You only need to set up the correct RLS policy in the database. Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- generate a UUID as the role name since it needs to match the owner type&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;-- f918ffd9-a611-4b2a-b4bb-df8f25d7569f&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="nv"&gt;"f918ffd9-a611-4b2a-b4bb-df8f25d7569f"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;schema&lt;/span&gt; &lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;"f918ffd9-a611-4b2a-b4bb-df8f25d7569f"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;"f918ffd9-a611-4b2a-b4bb-df8f25d7569f"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- generate another role&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="nv"&gt;"11b795e0-a566-491b-9ee7-62c025175dd8"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;schema&lt;/span&gt; &lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;"11b795e0-a566-491b-9ee7-62c025175dd8"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;"11b795e0-a566-491b-9ee7-62c025175dd8"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;user_record_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;limit_user_crud&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt;
    &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_record_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="nv"&gt;"f918ffd9-a611-4b2a-b4bb-df8f25d7569f"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'one'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'two'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'3'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'three'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- check before the insertion&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'4'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'four'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- ERROR:  new row violates row-level security policy for table "buckets"&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- this returns 3 rows&lt;/span&gt;

&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt; &lt;span class="nv"&gt;"11b795e0-a566-491b-9ee7-62c025175dd8"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- this returns nothing&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'4'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'four'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- success&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'4'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- success&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- delete 0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>postgres</category>
      <category>sql</category>
      <category>database</category>
      <category>security</category>
    </item>
    <item>
      <title>HTTP Rate Limit</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Thu, 04 Jan 2024 06:37:29 +0000</pubDate>
      <link>https://dev.to/keming/http-rate-limit-1hmj</link>
      <guid>https://dev.to/keming/http-rate-limit-1hmj</guid>
      <description>&lt;h2&gt;
  
  
  Draft
&lt;/h2&gt;

&lt;p&gt;The story starts with a &lt;a href="https://youtu.be/BIguvia6AvM?t=1313"&gt;link checker sharing&lt;/a&gt; that mentions the &lt;a href="https://www.ietf.org/archive/id/draft-polli-ratelimit-headers-02.html"&gt;HTTP rate limit header&lt;/a&gt; in the IETF proposed standard.&lt;/p&gt;

&lt;p&gt;Ideally, we expect something like this in the HTTP response headers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   RateLimit-Limit: 10
   RateLimit-Remaining: 1
   RateLimit-Reset: 7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;RateLimit-Reset&lt;/code&gt; specifies the number of &lt;strong&gt;seconds&lt;/strong&gt; remaining in the current time window. It should not be treated as a fixed value.&lt;/p&gt;

&lt;p&gt;It may also contain a &lt;a href="https://datatracker.ietf.org/doc/html/rfc7231#section-7.1.3"&gt;&lt;code&gt;Retry-After&lt;/code&gt;&lt;/a&gt; header, usually with a &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429"&gt;429&lt;/a&gt; status code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ietf-wg-httpapi/ratelimit-headers"&gt;ratelimit-headers&lt;/a&gt; has a test implementation of this draft.&lt;/p&gt;

&lt;p&gt;Sadly, some HTTP APIs do not strictly implement this draft (others may not have these headers at all). You can find different names like &lt;code&gt;X-RateLimit-Reset&lt;/code&gt;, &lt;code&gt;X-RateLimit-Requests-Reset&lt;/code&gt;, &lt;code&gt;X-RateLimit-Reset-After&lt;/code&gt;, etc. Some official SDKs account for these variations.&lt;/p&gt;
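&lt;p&gt;A client that wants to tolerate these variations can probe the known spellings in order, draft name first. A minimal sketch (the header names are ones seen in the wild; real services may use yet other spellings, or epoch timestamps instead of seconds):&lt;/p&gt;

```python
def reset_seconds(headers):
    """Return the rate-limit reset value from the first matching
    header name, or None if no known header is present."""
    for name in (
        "RateLimit-Reset",  # IETF draft name
        "X-RateLimit-Reset",
        "X-RateLimit-Requests-Reset",
        "X-RateLimit-Reset-After",
    ):
        if name in headers:
            return float(headers[name])
    return None

assert reset_seconds({"X-RateLimit-Reset": "7"}) == 7.0
assert reset_seconds({"Content-Type": "text/html"}) is None
```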

&lt;h2&gt;
  
  
  Python &lt;code&gt;httpx&lt;/code&gt; with rate limit
&lt;/h2&gt;

&lt;p&gt;There are already some rate-limit implementations for Python HTTP clients, such as &lt;a href="https://github.com/florimondmanca/aiometer"&gt;aiometer&lt;/a&gt;, but it doesn't fit my use case. Since &lt;a href="https://github.com/encode/httpx/"&gt;&lt;code&gt;httpx&lt;/code&gt;&lt;/a&gt; already has an internal connection pool, it's better to reuse that design.&lt;/p&gt;

&lt;p&gt;By the way, my use case is a web crawler client: I want to request each URL directly in the code (with rate limiting) instead of gathering lots of URLs and mapping over them.&lt;/p&gt;

&lt;p&gt;Here is a simple implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RateLimitTransport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AsyncHTTPTransport&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_per_second&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Async HTTP transport with rate limit.

        Args:
            max_per_second: Maximum number of requests per second.

        Other args are passed to httpx.AsyncHTTPTransport.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_per_second&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;notify_task_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        https://github.com/florimondmanca/aiometer/blob/358976e0b60bce29b9fe8c59807fafbad3e62cbc/src/aiometer/_impl/meters.py#L57
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_running_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;next_start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;until_now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;next_start_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;until_now&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;until_now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_async_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;notify_task_start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;handle_async_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__aenter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;notify_task_start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__aenter__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__aexit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__aexit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
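&lt;p&gt;The waiting logic above can be boiled down to a small standalone sketch in plain &lt;code&gt;asyncio&lt;/code&gt; (no httpx involved; the method name mirrors the transport above). Each call reserves a start slot and pushes &lt;code&gt;next_start_time&lt;/code&gt; forward by one interval, so task starts stay spaced out:&lt;/p&gt;

```python
import asyncio


class RateMeter:
    """Standalone sketch of the waiting logic above (mirrors aiometer's meter).

    Task starts are spaced at least `interval` seconds apart (with one slot
    of burst allowance) by pushing `next_start_time` forward on every call.
    """

    def __init__(self, max_per_second):
        self.interval = 1 / max_per_second
        self.next_start_time = 0.0

    async def notify_task_start(self):
        loop = asyncio.get_running_loop()
        while True:
            now = loop.time()
            next_start_time = max(self.next_start_time, now)
            until_now = next_start_time - now
            # break as soon as the pending delay fits within one interval
            if self.interval >= until_now:
                break
            await asyncio.sleep(until_now - self.interval)
        # reserve the next start slot
        self.next_start_time = max(self.next_start_time, loop.time()) + self.interval


async def demo():
    meter = RateMeter(max_per_second=50)  # one slot every 20 ms
    loop = asyncio.get_running_loop()
    start = loop.time()
    for _ in range(5):
        await meter.notify_task_start()
    return loop.time() - start


elapsed = asyncio.run(demo())
print(f"5 task starts took {elapsed:.3f}s")
```

&lt;p&gt;Note that the first two starts may fire back-to-back (one slot of burst), after which each start waits roughly one interval.&lt;/p&gt;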



&lt;p&gt;You can specify the rate limit when initializing your HTTP client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RateLimitTransport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_per_second&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>http</category>
      <category>webdev</category>
      <category>networking</category>
    </item>
    <item>
      <title>Serving fine-tuned large language model with vLLM</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Sat, 26 Aug 2023 12:50:58 +0000</pubDate>
      <link>https://dev.to/keming/serving-fine-tuned-large-language-model-with-vllm-39a0</link>
      <guid>https://dev.to/keming/serving-fine-tuned-large-language-model-with-vllm-39a0</guid>
      <description>&lt;p&gt;Fine-tuned large language models (LLM) are becoming increasingly popular in AI applications. These powerful language models are widely used to automate a series of tasks, improve customer service, and generate domain-specific content.&lt;/p&gt;

&lt;p&gt;However, serving these fine-tuned LLMs at scale comes with challenges. These models are computationally expensive, and they are much larger than traditional microservices. Both factors make it hard to achieve high-throughput serving and fast cold-start scaling.&lt;/p&gt;

&lt;p&gt;This post shares our experience serving LLMs with vLLM and scaling the service on Modelz.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use vLLM for high throughput LLM serving
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/vllm-project/vllm"&gt;vLLM&lt;/a&gt; is a high-throughput and memory-efficient LLM serving engine. It offers OpenAI compatible API, which makes it easy to be integrated with the existing LLM applications.&lt;/p&gt;

&lt;p&gt;The first hurdle is preparing a GPU environment in which to build and install vLLM. With the help of &lt;a href="https://github.com/tensorchord/envd"&gt;envd&lt;/a&gt;, this can be done in a single file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# syntax=v1
&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"11.8.0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apt_packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"build-essential"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# install torch here to reuse the cache
&lt;/span&gt;    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;python_packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"torch"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# install from source
&lt;/span&gt;    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;python_packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"git+https://github.com/vllm-project/vllm.git"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By running &lt;code&gt;envd up&lt;/code&gt;, you can get into the development environment with everything you need. If you prefer Dockerfile, we also have a &lt;a href="https://github.com/kemingy/vllm-env/blob/main/Dockerfile"&gt;template&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;vLLM already supports many LLMs, such as LLaMA, Falcon, and MPT. However, to serve your own LLM, you may need to provide a model-specific prompt template. To address this, we created a tool called &lt;a href="https://github.com/tensorchord/llmspec"&gt;llmspec&lt;/a&gt;, which provides prompt templates behind an OpenAI-compatible interface. You can build your prompt generator on top of this library.&lt;/p&gt;

&lt;p&gt;To run the vLLM serving in a Kubernetes cluster, there are some necessary configurations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always set &lt;code&gt;--worker-use-ray&lt;/code&gt; to run the model inference in a separate Python process and avoid health-probe failures.&lt;/li&gt;
&lt;li&gt;Provide enough shared memory (at least 30% of RAM).&lt;/li&gt;
&lt;li&gt;Reduce &lt;code&gt;--gpu-memory-utilization&lt;/code&gt; to avoid GPU OOM on long sequences.&lt;/li&gt;
&lt;li&gt;Increase &lt;code&gt;--max-num-batched-tokens&lt;/code&gt; if you need to generate long sequences.&lt;/li&gt;
&lt;/ul&gt;
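&lt;p&gt;Putting those flags together, a launch command might look like the following sketch (the model name and the numeric values are illustrative only; tune them for your own cluster):&lt;/p&gt;

```python
# Illustrative launch command for the vLLM OpenAI-compatible server.
# The flag values below are examples, not recommendations.
args = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "mosaicml/mpt-30b-chat",   # example model
    "--worker-use-ray",                   # inference in a separate process
    "--gpu-memory-utilization", "0.8",    # leave headroom to avoid GPU OOM
    "--max-num-batched-tokens", "8192",   # allow longer sequences
]
print(" ".join(args))
```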

&lt;p&gt;To simulate multiple concurrent requests, you can use the following script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;random&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;randint&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;concurrent.futures&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;openai&lt;/span&gt;

&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"EMPTY"&lt;/span&gt;
&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:8000/v1"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"mosaicml/mpt-30b-chat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="s"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Who are you?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'content'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;batch_test&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;batch_test&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Scaling with Modelz
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cloud.modelz.ai/"&gt;Modelz&lt;/a&gt; is a fully managed platform that provides users with a simple API for deploying machine learning models. By using our platform, your service can be scaled according to the real-time API invocation. The docker image will also be optimized to minimize the container cold start time.&lt;/p&gt;

&lt;p&gt;If you want to deploy models to your private cluster or single GPU server, try &lt;a href="https://github.com/tensorchord/openmodelz"&gt;openmodelz&lt;/a&gt;. It takes care of the underlying technical details and provides a simple and easy-to-use CLI to deploy and manage your machine learning services.&lt;/p&gt;

&lt;p&gt;If you have any questions about deploying models into production, feel free to &lt;a href="https://docs.modelz.ai/community"&gt;reach out&lt;/a&gt; by joining our &lt;a href="https://discord.gg/F4WnzqmeNj"&gt;Discord&lt;/a&gt; or emailing &lt;a href="mailto:modelz-support@tensorchord.ai"&gt;modelz-support@tensorchord.ai&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advertisement Time
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/mosecorg/mosec"&gt;mosec&lt;/a&gt; - A general high-performance and easy-to-use machine learning serving framework.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/tensorchord/pgvecto.rs"&gt;pgvecto.rs&lt;/a&gt; - A powerful Postgres extension for vector similarity search.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>python</category>
      <category>llm</category>
    </item>
    <item>
      <title>My Journey with envd</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Sat, 26 Aug 2023 12:44:02 +0000</pubDate>
      <link>https://dev.to/keming/my-journey-with-envd-35h5</link>
      <guid>https://dev.to/keming/my-journey-with-envd-35h5</guid>
      <description>&lt;p&gt;&lt;code&gt;envd&lt;/code&gt; is a frontend of &lt;a href="https://github.com/moby/buildkit"&gt;BuildKit&lt;/a&gt;. Just like the Dockerfile. It has been more than a year since I started working on this project. Since the features are relatively stable, I'd like to write a blog about my journey with &lt;code&gt;envd&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we need this tool
&lt;/h2&gt;

&lt;p&gt;The machine learning development environment has been a pain point for a while. "Which Python are you using now?" is definitely a newbie slayer. It's even worse if you need to use CUDA. "It works on my machine!" happens a lot.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;envd&lt;/code&gt; was created to solve the problem of the machine learning development environment. However, it goes far beyond that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure as code (IaC)
&lt;/h2&gt;

&lt;p&gt;What a fancy name! Here it means that with the &lt;code&gt;envd&lt;/code&gt; config file, you can get the same environment on different machines, whether that's a local machine, a remote server, or a Kubernetes cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Naming
&lt;/h2&gt;

&lt;p&gt;It was named &lt;code&gt;MIDI&lt;/code&gt; in the beginning, but that name is not SEO-friendly.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;d&lt;/code&gt; in &lt;code&gt;envd&lt;/code&gt; has no official meaning (as far as I know). It can be "docker", "deep learning", "dev", etc.&lt;/p&gt;

&lt;p&gt;For more information, check this &lt;a href="https://github.com/tensorchord/envd/issues/2#issuecomment-1119175904"&gt;issue&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logo
&lt;/h2&gt;

&lt;p&gt;We have a cute logo designed by &lt;a href="https://github.com/lilylee1874"&gt;Lily&lt;/a&gt;. It's a cat face with the &lt;code&gt;envd&lt;/code&gt; characters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t7z64uL5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://user-images.githubusercontent.com/12974685/200007223-cd94fe9a-266d-4bbd-ac23-e71043d0c3dc.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t7z64uL5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://user-images.githubusercontent.com/12974685/200007223-cd94fe9a-266d-4bbd-ac23-e71043d0c3dc.svg" alt="envd" width="139" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Actually, the cat only blinked once when we created the GIF; the recording tool on macOS is tricky to use, which is why it ended up blinking twice. By the way, we later replaced the GIF with an SVG to make the animation clear and smooth. Writing the SVG animation from scratch is not that hard.&lt;/p&gt;

&lt;p&gt;You can find the drafts &lt;a href="https://github.com/tensorchord/envd/issues/326"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;envd&lt;/code&gt; is a Go project, but our target audience mainly uses Python. That's why we spent a lot of effort supporting installation through &lt;code&gt;pip&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;As we know, Python has never done a good job of distributing pre-compiled binaries. I couldn't find any good documentation on how to create a pre-compiled binary distribution for Python; people just copy &amp;amp; paste the code from other projects, and so did &lt;code&gt;envd&lt;/code&gt;: the code is mainly copied from &lt;a href="https://github.com/mosecorg/mosec"&gt;&lt;code&gt;mosec&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I did learn some new things from others' contributions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/pypa/cibuildwheel"&gt;cibuildwheel&lt;/a&gt; has become mature nowadays. It's a great tool for setting up the multi-platform distribution pipeline in CI.&lt;/li&gt;
&lt;li&gt;You can &lt;a href="https://github.com/tensorchord/envd/pull/1254"&gt;package a binary file without any Python code&lt;/a&gt;. (by &lt;a href="https://github.com/frostming"&gt;frostming&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;You can create the Python &lt;a href="https://github.com/tensorchord/envd/pull/1324"&gt;ABI-agnostic wheel&lt;/a&gt;. (by &lt;a href="https://github.com/frostming"&gt;frostming&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, you can use &lt;code&gt;conda-forge&lt;/code&gt;. I have tried to create &lt;a href="https://github.com/conda-forge/staged-recipes/pull/22367"&gt;a recipe for &lt;code&gt;mosec&lt;/code&gt;&lt;/a&gt;. It has a totally different packaging logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rootless
&lt;/h2&gt;

&lt;p&gt;As a developer, I don't like to run the command with &lt;code&gt;sudo&lt;/code&gt; unless I have to. When I was trying to debug with the &lt;code&gt;buildkit&lt;/code&gt; daemon, I found that we can run it &lt;a href="https://github.com/moby/buildkit/blob/master/docs/rootless.md"&gt;in rootless mode&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starlark
&lt;/h2&gt;

&lt;p&gt;Starlark is a dialect of Python, which makes it easy to use for machine learning engineers and data scientists.&lt;/p&gt;

&lt;p&gt;I know that lots of configuration files are written in YAML. I personally don't like it, and you may also have heard plenty of complaints about the YAML format. I think a configuration file should be able to validate itself.&lt;/p&gt;

&lt;p&gt;You can use if-condition, for-loop, etc. in Starlark. The following code works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;libs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ubuntu:20.04"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04"&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;lib&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;libs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;python_packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more information, check the &lt;a href="https://github.com/bazelbuild/starlark/blob/master/spec.md"&gt;Starlark spec&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Although Starlark has an interpretation order, we don't rely on it. We parse the file into an internal graph and construct the BuildKit Low-Level Build (LLB) graph on top of it. This tradeoff makes it easy to cache the layers.&lt;/p&gt;

&lt;p&gt;Starlark is also easy to extend. We added lots of &lt;code&gt;envd&lt;/code&gt;-specific functions to make it more powerful; you can find them in the &lt;a href="https://envd.tensorchord.ai/api/starlark/v1/global.html"&gt;reference&lt;/a&gt;. Starlark has a &lt;code&gt;load&lt;/code&gt; function, similar to &lt;code&gt;import&lt;/code&gt; in Python, for loading another file. We created a new one called &lt;a href="https://envd.tensorchord.ai/api/starlark/v1/global.html#include"&gt;&lt;code&gt;include&lt;/code&gt;&lt;/a&gt; (because &lt;code&gt;import&lt;/code&gt; is reserved) to import functions from a git repository, so people can create their own &lt;code&gt;envd&lt;/code&gt; build functions and share them with others.&lt;/p&gt;
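&lt;p&gt;As a sketch, an &lt;code&gt;envd&lt;/code&gt; build file can pull shared functions from a git repository like this (the repository and the &lt;code&gt;tensorboard&lt;/code&gt; function follow the envdlib examples and may differ from the current API):&lt;/p&gt;

```python
# envd build file (Starlark); `include` and `base` are envd built-ins.
# `envdlib.tensorboard` is taken from the envdlib examples and is illustrative.
envdlib = include("https://github.com/tensorchord/envdlib")

def build():
    base(dev=True)
    envdlib.tensorboard(host_port=8888)
```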

&lt;h2&gt;
  
  
  VSCode support
&lt;/h2&gt;

&lt;p&gt;To make it more user-friendly, we have a VSCode &lt;a href="https://github.com/tensorchord/vscode-envd"&gt;extension&lt;/a&gt; for &lt;code&gt;envd&lt;/code&gt;, which provides the following features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/tensorchord/envd-lsp"&gt;LSP&lt;/a&gt;: enables Starlark auto-completion&lt;/li&gt;
&lt;li&gt;managing &lt;code&gt;envd&lt;/code&gt; environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  BuildKit
&lt;/h2&gt;

&lt;p&gt;This is the backend of &lt;code&gt;envd&lt;/code&gt;. Integrating with it is troublesome, mainly because it barely has any documentation; the only way to learn it is to read the &lt;a href="https://github.com/moby/buildkit/tree/master/examples"&gt;examples&lt;/a&gt;. Since the source code is written in a functional style, it's a bit hard to understand at first, but once you get used to it, things become easier.&lt;/p&gt;

&lt;p&gt;There are some nice features in BuildKit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parallel build&lt;/li&gt;
&lt;li&gt;Distributable workers&lt;/li&gt;
&lt;li&gt;Better cache&lt;/li&gt;
&lt;li&gt;Advanced operators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will go through them one by one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parallel build
&lt;/h3&gt;

&lt;p&gt;The main idea is to split the build graph into multiple sub-graphs and run them in parallel if possible. This is a great feature when some steps take a long time to finish while there is no overlap among them. For example, we can install the system packages and Conda environments in parallel.&lt;/p&gt;

&lt;p&gt;The related operators are &lt;code&gt;diff&lt;/code&gt; and &lt;code&gt;merge&lt;/code&gt;. In the &lt;code&gt;merge&lt;/code&gt; list, later states override previous states if they change the same directories. Sometimes it takes longer than you expect to compute the diffs and merge them together, so these operators should only be used when you're sure the parallelism will save time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distributable workers
&lt;/h3&gt;

&lt;p&gt;Basically, the frontend constructs the build graph, serializes it into Protocol Buffers, and sends it to the backend workers over TCP or a Unix domain socket.&lt;/p&gt;

&lt;p&gt;It's recommended to set up a long-running BuildKit daemon and use it as a remote worker, since that way builds can benefit from the cache.&lt;/p&gt;

&lt;p&gt;By default, we will create a &lt;code&gt;buildkitd&lt;/code&gt; container for &lt;code&gt;envd&lt;/code&gt; to build the image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Better cache
&lt;/h3&gt;

&lt;p&gt;BuildKit can import and export the build cache from and to local storage, inline (embedded in the image), or a registry. You can choose whether to export the intermediate layers.&lt;/p&gt;

&lt;p&gt;By default, the cache limit is 10% of your disk space. You can configure this through the &lt;a href="https://docs.docker.com/build/buildkit/toml-configuration/"&gt;buildkit config&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;envd&lt;/code&gt; v0 will download a pre-built base image that contains the basic development tools and the Python environment. This image can serve as a cache layer if none of the dependencies change, which is a great way to speed up the build process. You can check the nightly build &lt;a href="https://github.com/tensorchord/envd-nightly"&gt;benchmark&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moby
&lt;/h2&gt;

&lt;p&gt;For now, the best user experience is to use &lt;code&gt;envd&lt;/code&gt; v1 with the &lt;code&gt;moby&lt;/code&gt; worker. This requires Docker Engine version &amp;gt;= 22. To enable it, create a new &lt;code&gt;envd&lt;/code&gt; context like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;envd context create &lt;span class="nt"&gt;--name&lt;/span&gt; moby &lt;span class="nt"&gt;--builder&lt;/span&gt; moby-worker &lt;span class="nt"&gt;--use&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the &lt;code&gt;moby&lt;/code&gt; worker is still experimental. Due to this &lt;a href="https://github.com/moby/moby/issues/45111"&gt;issue&lt;/a&gt;, we have to &lt;a href="https://github.com/tensorchord/envd/pull/1699"&gt;disable the &lt;code&gt;merge&lt;/code&gt; operator&lt;/a&gt; used in &lt;code&gt;envd&lt;/code&gt; when using the &lt;code&gt;moby&lt;/code&gt; worker. The build step might therefore be slower, but the export step is much faster. Overall it's still faster, especially for large images, which are the common case in machine learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cache
&lt;/h2&gt;

&lt;p&gt;Docker layer caching is a common optimization for image building. Beyond that, we also enable caching for APT packages, Python wheels, VSCode extensions, and &lt;code&gt;oh-my-zsh&lt;/code&gt; plugins by mounting a cache directory at build time. Machine-learning pip wheels can be huge, which makes this cache especially useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Horust
&lt;/h2&gt;

&lt;p&gt;I agree that in a production environment, one container should do only one thing, which usually means running a single service. For a development environment, however, it's fine to run as many processes as you like, as long as they don't conflict with each other.&lt;/p&gt;

&lt;p&gt;That's why we need a process management tool to control all of these processes. We explored several options, including &lt;a href="https://systemd.io/"&gt;systemd&lt;/a&gt;, &lt;a href="https://github.com/just-containers/s6-overlay"&gt;s6-overlay&lt;/a&gt;, and &lt;a href="https://github.com/Supervisor/supervisor"&gt;Supervisor&lt;/a&gt;. In the end, we chose &lt;a href="https://github.com/FedericoPonzi/Horust"&gt;Horust&lt;/a&gt;, which is both simple and powerful. You can check the &lt;a href="https://github.com/tensorchord/envd/issues/930"&gt;discussion&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shell prompt
&lt;/h2&gt;

&lt;p&gt;I personally use &lt;code&gt;fish&lt;/code&gt; with &lt;code&gt;starship&lt;/code&gt;, which gives a great out-of-the-box shell experience. &lt;code&gt;starship&lt;/code&gt; works well with most common shells like &lt;code&gt;bash&lt;/code&gt;, &lt;code&gt;zsh&lt;/code&gt;, &lt;code&gt;fish&lt;/code&gt;, etc. It's easy to configure and extend. You can check the &lt;a href="https://starship.rs/"&gt;starship documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It works best with a &lt;a href="https://www.nerdfonts.com/"&gt;Nerd Font&lt;/a&gt;, but since we cannot control the users' terminal configuration, we have to disable some of the fancy icons.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coding in Jupyter Notebook and VSCode
&lt;/h2&gt;

&lt;p&gt;These are the most common coding tools for machine learning engineers and data scientists.&lt;/p&gt;

&lt;p&gt;Whether it's Jupyter Notebook or Jupyter Lab, it can be exposed as a normal web service.&lt;/p&gt;

&lt;p&gt;VSCode does a really good job with remote development. You can use VSCode on your local machine to connect to a remote server, or even to a container running on a remote server.&lt;/p&gt;

&lt;p&gt;Due to licensing restrictions, we have to use the &lt;a href="https://open-vsx.org/"&gt;Open VSX Registry&lt;/a&gt;. Sometimes the related CI tests fail due to its instability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Develop in the Kubernetes cluster
&lt;/h2&gt;

&lt;p&gt;We had hoped to monetize &lt;code&gt;envd&lt;/code&gt; with this feature, but not many people were interested in it. The code is open sourced as &lt;a href="https://github.com/tensorchord/envd-server/"&gt;&lt;code&gt;envd-server&lt;/code&gt;&lt;/a&gt;. Maybe we can bring this feature to the new &lt;a href="https://github.com/tensorchord/openmodelz/issues/105"&gt;openmodelz&lt;/a&gt; project. Although you can run &lt;code&gt;mdz exec {name} -ti bash&lt;/code&gt; to get into a container, it doesn't support VSCode Remote for now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use pointer receivers
&lt;/h2&gt;

&lt;p&gt;This is the most common kind of bug during &lt;code&gt;envd&lt;/code&gt; development. We have an internal build graph with many methods that build the LLB graph. Not all of these methods use pointer receivers, which results in an inconsistent state of the internal graph. I would prefer to use pointer receivers for all of the methods.&lt;/p&gt;

&lt;p&gt;You might wonder why the linter doesn't catch this. That's because the methods can be called in a nested way, with the outer function using a value receiver while the inner function uses a pointer receiver.&lt;/p&gt;

&lt;p&gt;This is also a good example of how language design matters (personal opinion). You won't see this kind of bug in Rust. But Rust doesn't have a good container ecosystem. :(&lt;/p&gt;

&lt;h2&gt;
  
  
  Progress bar
&lt;/h2&gt;

&lt;p&gt;The default Docker progress bar is really complex. When I implemented the &lt;a href="https://github.com/tensorchord/envd/pull/1708"&gt;&lt;code&gt;moby&lt;/code&gt; push&lt;/a&gt; feature, I chose to reuse another progress bar library to make life easier, although it lacks multi-line log support.&lt;/p&gt;

&lt;h2&gt;
  
  
  SSH agent forwarding
&lt;/h2&gt;

&lt;p&gt;We can actually forward the host's SSH credentials to the container, so we can use the &lt;code&gt;git&lt;/code&gt; command as if we were on the host machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;envd&lt;/code&gt; v1
&lt;/h2&gt;

&lt;p&gt;This new version was created to address design issues in &lt;code&gt;envd&lt;/code&gt; v0. The main idea is that the &lt;code&gt;envd&lt;/code&gt; file should be a more general frontend for BuildKit: it should be able to build any image, not only machine learning development environments.&lt;/p&gt;

&lt;p&gt;Here is a comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Features&lt;/th&gt;
&lt;th&gt;v0&lt;/th&gt;
&lt;th&gt;v1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;is default for &lt;code&gt;envd&amp;lt;v1.0&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;support dev&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;support CUDA&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;support serving&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;support custom base image&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;support installing multiple languages&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;support &lt;code&gt;moby&lt;/code&gt; builder&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Make it faster
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/tensorchord/envd/blob/3b5fae2de801b6e8fee98d1f2e743dce63a20085/pkg/lang/ir/v1/system.go#L346"&gt;&lt;code&gt;compileBaseImage&lt;/code&gt;&lt;/a&gt; function should be able to run faster. You can give it a try if you're interested in &lt;code&gt;envd&lt;/code&gt; development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Regrets
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/tensorchord/envd/pull/972"&gt;state-based implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This feature would make &lt;code&gt;envd&lt;/code&gt; much more powerful, but it also comes with complexity.&lt;/p&gt;

&lt;p&gt;Users could use low-level operators to build the graph, and we could execute the commands from the &lt;code&gt;envd&lt;/code&gt; file in a user-defined order.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/tensorchord/envd/pull/1459"&gt;incremental development environment&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lots of development environments are not built in one shot. This proposal aimed to track changes in the running environment and update the &lt;code&gt;envd&lt;/code&gt; file accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This is the first time I've been able to work on an open source project as my daily job. I have learned a lot from the community, and I hope more people can benefit from &lt;code&gt;envd&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>docker</category>
      <category>buildkit</category>
    </item>
    <item>
      <title>Develop machine learning applications inside the containers</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Thu, 05 Jan 2023 00:42:54 +0000</pubDate>
      <link>https://dev.to/keming/develop-machine-learning-applications-inside-the-containers-1mbo</link>
      <guid>https://dev.to/keming/develop-machine-learning-applications-inside-the-containers-1mbo</guid>
      <description>&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://docs.google.com/presentation/d/e/2PACX-1vTPrXjF_ae__fJv5F7W_n8W10NT8Fqu04sLbucd7vtgjEsV67De5xPMj1cOdEnif5IXOMLCu_yxZf0v/embed?start=false&amp;amp;amp%3Bloop=false&amp;amp;amp%3Bdelayms=3000" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-us.googleusercontent.com%2Fdocs%2FAHkbwyK-2pmI4pYHWq-TZaMsXcyHwEhP6kw840FqEhYW16veWOwAMyXXhQdXYzN1x6iGzYUOPQQqLqNYzhVh9wxijJ8spC2EWN5vbjbuo7Q-rliowdUJ00I%3Dw1200-h630-p" height="630" class="m-0" width="1200"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://docs.google.com/presentation/d/e/2PACX-1vTPrXjF_ae__fJv5F7W_n8W10NT8Fqu04sLbucd7vtgjEsV67De5xPMj1cOdEnif5IXOMLCu_yxZf0v/embed?start=false&amp;amp;amp%3Bloop=false&amp;amp;amp%3Bdelayms=3000" rel="noopener noreferrer" class="c-link"&gt;
          pycon China 2022 envd - Google Slides
        &lt;/a&gt;
      &lt;/h2&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fssl.gstatic.com%2Fdocs%2Fpresentations%2Fimages%2Ffavicon-2023q4.ico" width="256" height="256"&gt;
        docs.google.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>mentorship</category>
      <category>community</category>
      <category>career</category>
      <category>gratitude</category>
    </item>
    <item>
      <title>Machine learning container environment should be easy</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Sun, 18 Sep 2022 13:43:11 +0000</pubDate>
      <link>https://dev.to/keming/machine-learning-container-environment-should-be-easy-1jn4</link>
      <guid>https://dev.to/keming/machine-learning-container-environment-should-be-easy-1jn4</guid>
      <description>&lt;p&gt;As a machine learning engineer that works on different deep learning models, unexpected environmental issues always bother me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iZHdEqyR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/52693877/191025031-b3b1822f-7c54-4641-90a8-986fadff606f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iZHdEqyR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/52693877/191025031-b3b1822f-7c54-4641-90a8-986fadff606f.png" alt="ds_blame" width="880" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Do these scenarios look familiar to you? What happens to the machine learning development environment?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Even though you can &lt;code&gt;pip install torch&lt;/code&gt;, it doesn't mean you don't need to deal with the low-level code dependencies.&lt;/li&gt;
&lt;li&gt;Containers are necessary for a consistent environment, especially for the GPU part.&lt;/li&gt;
&lt;li&gt;Dockerfiles are hard to reuse conveniently.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Dealing with the environment is just the first step of your work. It should be easy, but it never is. Although we have to admit it's much easier than the days when we had to search for how to install NumPy.&lt;/p&gt;

&lt;p&gt;Meanwhile, from the machine learning infra engineers' perspective:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--exY6A7ZG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/52693877/191036993-922c27cb-36d3-4db3-a6eb-03f8b16207c9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--exY6A7ZG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/52693877/191036993-922c27cb-36d3-4db3-a6eb-03f8b16207c9.png" alt="infra_blame" width="880" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Infra engineers are never the enemies of machine learning engineers. A better tool can make everyone happy.&lt;/p&gt;

&lt;p&gt;Let's sum up our requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Machine learning engineers should submit container images instead of raw code, because they know the model dependencies best.&lt;/li&gt;
&lt;li&gt;Infra engineers should maintain a utility that helps machine learning engineers build container images following best practices.&lt;/li&gt;
&lt;li&gt;Meanwhile, machine learning engineers don't want to sacrifice the development experience. They should be able to use Jupyter Notebook and VSCode as usual.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So far, everything looks good. Clearly, it's not impossible.&lt;/p&gt;



&lt;p&gt;Let's introduce the new tool: &lt;a href="https://github.com/tensorchord/envd"&gt;envd&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It provides the following features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write Python-like functions instead of a Dockerfile, and share them across your team&lt;/li&gt;
&lt;li&gt;Built on BuildKit, with better caching and parallel building&lt;/li&gt;
&lt;li&gt;Integrated with Jupyter Notebook and VSCode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The syntax looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"ubuntu20.04"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"11.6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cudnn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"8"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;python_packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s"&gt;"torch"&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the command &lt;code&gt;envd up&lt;/code&gt;, and you are in an isolated container environment.&lt;/p&gt;

&lt;p&gt;To reuse functions written by your teammates, you can import them like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lib&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://github.com/tensorchord/envdlib"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jupyter_lab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8888&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's also much faster. See the benchmark below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7lF99oDo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/15q3ur0jyqdilpn6ct0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7lF99oDo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/15q3ur0jyqdilpn6ct0u.png" alt="benchmark" width="614" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More features are coming! Feel free to open an issue or join the &lt;a href="https://discord.gg/KqswhpVgdU"&gt;Discord&lt;/a&gt; community to discuss with us.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>docker</category>
      <category>python</category>
    </item>
    <item>
      <title>Why not multiprocessing</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Thu, 14 Oct 2021 14:08:32 +0000</pubDate>
      <link>https://dev.to/keming/why-not-multiprocessing-35bc</link>
      <guid>https://dev.to/keming/why-not-multiprocessing-35bc</guid>
      <description>&lt;p&gt;During the development of a machine learning serving project &lt;a href="https://github.com/mosecorg/mosec"&gt;Mosec&lt;/a&gt;, I used a lot of multiprocessing to make it more efficient. I want to share some experiences and researches related to Python multiprocessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start from a segmentation fault
&lt;/h2&gt;

&lt;p&gt;Here is a code snippet that runs well on macOS (Darwin) but triggers a segmentation fault on Linux.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sleep&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wait_for_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_set&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;trigger_segment_fault&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spawn"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;wait_for_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# this will show the exitcode=-SIGSEGV
&lt;/span&gt;    &lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;terminate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;trigger_segment_fault&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, pure Python code can trigger a segmentation fault.&lt;/p&gt;

&lt;p&gt;The reason is the process start method. According to the &lt;a href="https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods"&gt;Python documentation&lt;/a&gt;, &lt;code&gt;spawn&lt;/code&gt; is the default on macOS (starting from Python 3.8) while &lt;code&gt;fork&lt;/code&gt; is the default on Unix. But the start method also affects the &lt;code&gt;Event&lt;/code&gt; creation. Let's check the source code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_cond&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_flag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The initialization takes a &lt;code&gt;ctx&lt;/code&gt; that corresponds to the start method. So when you try to access a forked event in a spawned process, this segmentation fault occurs. The fix is simple: use the same context. (Actually, you can use a &lt;em&gt;spawn&lt;/em&gt; event in a forked process.)&lt;/p&gt;
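&lt;p&gt;A minimal sketch of that fix (my own, not from the original C++/Rust discussion): create the &lt;code&gt;Event&lt;/code&gt; from the same &lt;code&gt;spawn&lt;/code&gt; context that starts the process, so the synchronization primitives always match the start method.&lt;/p&gt;

```python
import multiprocessing as mp


def wait_for_event(event):
    # Block until the main process sets the event.
    event.wait()


def run():
    # Create the Event from the SAME context used to start the process,
    # so its underlying primitives match the "spawn" start method.
    ctx = mp.get_context("spawn")
    event = ctx.Event()
    p = ctx.Process(target=wait_for_event, args=(event,))
    p.start()
    event.set()
    p.join()
    return p.exitcode  # 0 means the child exited cleanly, no SIGSEGV


if __name__ == "__main__":
    print(run())
```

The same pattern applies to every primitive (queues, locks, semaphores): create them all from one explicit context instead of mixing module-level `mp.Event()` with a context-created process.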

&lt;h2&gt;
  
  
  &lt;em&gt;fork&lt;/em&gt; or &lt;em&gt;spawn&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Another question is: which start method should I use?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;spawn&lt;/em&gt;: The parent process starts a &lt;strong&gt;fresh&lt;/strong&gt; python interpreter process. The child process will only inherit those resources necessary to run the process objects run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;fork&lt;/em&gt;: The parent process uses &lt;code&gt;os.fork()&lt;/code&gt; to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child's process. Note that safely forking a multithreaded process is &lt;strong&gt;problematic&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We can see that &lt;em&gt;spawn&lt;/em&gt; creates a new Python process and only inherits the necessary resources, while &lt;em&gt;fork&lt;/em&gt; calls the underlying &lt;code&gt;os.fork()&lt;/code&gt;, which is problematic when the parent process is multithreaded (as a CPython process often is).&lt;/p&gt;
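&lt;p&gt;A small sketch (mine, not from the article) of how to check the platform default and opt into an explicit method instead of relying on it:&lt;/p&gt;

```python
import multiprocessing as mp

# The platform default start method:
# "fork" on Linux, "spawn" on macOS (Python 3.8+) and Windows.
print(mp.get_start_method())

# Rather than relying on the platform default, request an explicit
# context and create every primitive (Process, Event, Queue) from it,
# so the behavior is the same on every platform.
ctx = mp.get_context("spawn")
queue = ctx.Queue()  # this queue matches the "spawn" start method
```

Using an explicit context also avoids the global side effect of `mp.set_start_method()`, which can only be called once per program.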

&lt;p&gt;When you are using &lt;em&gt;spawn&lt;/em&gt;, accidentally accessing the main process's variables may have unexpected consequences.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Dummy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"init in pid: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getpid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;Dummy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;task&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"x is None"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spawn"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code snippet, the &lt;em&gt;spawn&lt;/em&gt;ed process re-imports the module before running &lt;code&gt;task&lt;/code&gt;, which executes both &lt;code&gt;Dummy()&lt;/code&gt; and &lt;code&gt;x = None&lt;/code&gt; again. So the terminal will print "init in pid" twice, with different PIDs.&lt;/p&gt;

&lt;p&gt;So what kind of problem can &lt;em&gt;fork&lt;/em&gt; cause? Let's take a look at this article: &lt;a href="https://pythonspeed.com/articles/python-multiprocessing/"&gt;Why your multiprocessing Pool is stuck (it’s full of sharks!)&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;threading&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AreYouOK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"init in:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getpid&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locked&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;acquire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acquire&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;delay_release&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;release&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;greeter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AreYouOK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;greeter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acquire&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;greeter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay_release&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;greeting&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getpid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;greeter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"fork"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;greeting&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;greeting&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above example, even after the lock is released in the parent, the child process still cannot acquire it. Why?&lt;/p&gt;

&lt;p&gt;The main point is that fork doesn't copy everything.&lt;/p&gt;

&lt;p&gt;Let's check the &lt;a href="https://man7.org/linux/man-pages/man2/fork.2.html"&gt;man page of fork&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The child does not inherit its parent's memory locks&lt;/p&gt;

&lt;p&gt;The child does not inherit semaphore adjustments from its parent&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So what happens here is that the child process inherits a lock that has already been acquired, but no thread will ever release it, because the running thread is not copied to the &lt;em&gt;forked&lt;/em&gt; process. The parent's and child's locks are not the same object (copied, not shared). Clearly, &lt;code&gt;threading.Lock&lt;/code&gt; is not process-safe and should be handled with caution when it's used inside other libraries (e.g. &lt;code&gt;queue.Queue&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;If we use &lt;em&gt;spawn&lt;/em&gt; instead of &lt;em&gt;fork&lt;/em&gt;, everything the child needs is &lt;strong&gt;rebuilt&lt;/strong&gt; in the new process (including the thread). That's why we should use &lt;em&gt;spawn&lt;/em&gt; instead of &lt;em&gt;fork&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;multiprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;set_start_method&lt;/span&gt;
&lt;span class="n"&gt;set_start_method&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spawn"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code snippet above has a pitfall: &lt;code&gt;set_start_method&lt;/code&gt; raises a &lt;code&gt;RuntimeError&lt;/code&gt; if it's called more than once in the same process.&lt;/p&gt;
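&lt;p&gt;A quick way to see this (the second call raises &lt;code&gt;RuntimeError&lt;/code&gt; unless you pass &lt;code&gt;force=True&lt;/code&gt;):&lt;/p&gt;

```python
import multiprocessing as mp

try:
    mp.set_start_method("spawn")
except RuntimeError:
    pass  # a start method was already chosen somewhere else

try:
    # calling it again always raises: "context has already been set"
    mp.set_start_method("spawn")
    raised = False
except RuntimeError:
    raised = True
print(raised)
```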

&lt;p&gt;My suggestion is to use the start method context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;


&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spawn"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Garbage collection with deadlock
&lt;/h2&gt;

&lt;p&gt;Let's take a look at another article: &lt;a href="https://codewithoutrules.com/2017/08/16/concurrency-python/"&gt;The tragic tale of the deadlocking Python queue&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This code snippet is copied from the above article.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;queue&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Queue&lt;/span&gt;

&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Circular&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;circular&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__del__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Adding to queue in GC"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"iteration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Create an object that will be garbage collected
&lt;/span&gt;    &lt;span class="c1"&gt;# asynchronously, and therefore have its __del__
&lt;/span&gt;    &lt;span class="c1"&gt;# method called later:
&lt;/span&gt;    &lt;span class="n"&gt;Circular&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Adding to queue regularly"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We usually assume that Python runs one line at a time. But that's not true.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Garbage collection can interrupt Python functions at any point, and run arbitrary other Python code: &lt;code&gt;__del__&lt;/code&gt; methods and &lt;a href="https://docs.python.org/3/library/weakref.html"&gt;weakref&lt;/a&gt; callbacks. So can signal handlers, which happen e.g. when you hit Ctrl-C (your process gets the SIGINT signal) or a subprocess dies (your process gets the SIGCHLD signal).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So when we call &lt;code&gt;q.put(2)&lt;/code&gt;, the queue acquires its internal lock. While the lock is held, the GC may kick in and run &lt;code&gt;__del__&lt;/code&gt;, which calls &lt;code&gt;q.put(1)&lt;/code&gt; and blocks on the same lock. But &lt;code&gt;q.put(2)&lt;/code&gt; cannot release the lock until the GC returns, and the GC never returns. Deadlock!&lt;/p&gt;

&lt;p&gt;Thanks to the Python core developers, this was fixed in Python 3.7 by introducing &lt;code&gt;queue.SimpleQueue&lt;/code&gt;, whose &lt;code&gt;put()&lt;/code&gt; is reentrant.&lt;/p&gt;
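&lt;p&gt;A minimal sketch of why &lt;code&gt;SimpleQueue&lt;/code&gt; is safe here: its &lt;code&gt;put()&lt;/code&gt; is reentrant, so calling it from &lt;code&gt;__del__&lt;/code&gt; during garbage collection cannot deadlock:&lt;/p&gt;

```python
import gc
import queue

q = queue.SimpleQueue()


class Circular:
    def __init__(self):
        self.circular = self  # reference cycle: only the cyclic GC can reclaim it

    def __del__(self):
        # SimpleQueue.put() is reentrant, so enqueueing from a finalizer
        # that interrupts another put() cannot deadlock.
        q.put(1)


Circular()    # immediately unreachable, but kept alive by its own cycle
gc.collect()  # runs __del__, which enqueues 1
q.put(2)
print(q.qsize())
```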

&lt;h2&gt;
  
  
  Copy on write
&lt;/h2&gt;

&lt;p&gt;When running with multiprocessing, we hope the child process can share some data with the main process instead of copying it, especially data the child never touches. This sounds reasonable. However, it overlooks another important part of Python: reference counting.&lt;/p&gt;

&lt;p&gt;CPython has two garbage collection mechanisms: reference counting and generational garbage collection. Reference counting is the fundamental one and cannot be disabled. Generational garbage collection mainly exists to break reference cycles. Check these articles for more details: &lt;a href="https://rushter.com/blog/python-garbage-collector/"&gt;Garbage collection in Python: things you need to know&lt;/a&gt; and &lt;a href="https://devguide.python.org/garbage_collector/"&gt;Design of CPython’s Garbage Collector&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's take a look at the CPython implementation of PyObject:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;typedef&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;_object&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;_PyObject_HEAD_EXTRA&lt;/span&gt;
    &lt;span class="n"&gt;Py_ssize_t&lt;/span&gt; &lt;span class="n"&gt;ob_refcnt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;PyTypeObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ob_type&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;PyObject&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The struct field &lt;code&gt;ob_refcnt&lt;/code&gt; holds the reference count. After a &lt;code&gt;fork()&lt;/code&gt;, whenever the child process so much as reads a Python object, CPython updates its reference count. That write dirties the memory page, so the operating system copies it, even though the data the user sees hasn't changed.&lt;/p&gt;
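&lt;p&gt;You can watch the reference count machinery at work with &lt;code&gt;sys.getrefcount&lt;/code&gt; (which itself adds a temporary reference when you call it):&lt;/p&gt;

```python
import sys

payload = object()
# getrefcount reports at least 2: the `payload` name plus the temporary
# reference created by passing it as an argument. Every such access
# writes to ob_refcnt, so after fork() it dirties the shared page.
count = sys.getrefcount(payload)
print(count)
```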

&lt;p&gt;To handle this problem, the Instagram Engineering team has come up with a solution: &lt;a href="https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf"&gt;Copy-on-write friendly Python garbage collection&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="nf"&gt;gc_freeze_impl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PyObject&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="cm"&gt;/*[clinic end generated code: output=502159d9cdc4c139 input=b602b16ac5febbe5]*/&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;GCState&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gcstate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_gc_state&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;NUM_GENERATIONS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;gc_list_merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GEN_HEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gcstate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gcstate&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;permanent_generation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;gcstate&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;generations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;Py_RETURN_NONE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's check the &lt;a href="https://docs.python.org/3/library/gc.html#gc.freeze"&gt;Python document for GC&lt;/a&gt;. In Python 3.7, it introduced a new method called &lt;code&gt;gc.freeze&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Freeze all the objects tracked by gc - move them to a permanent generation and ignore all the future collections.&lt;/p&gt;
&lt;/blockquote&gt;
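&lt;p&gt;The API is straightforward to try (&lt;code&gt;gc.get_freeze_count&lt;/code&gt; reports how many objects are in the permanent generation):&lt;/p&gt;

```python
import gc

gc.collect()                    # collect garbage before freezing
gc.freeze()                     # move all tracked objects to the permanent generation
frozen = gc.get_freeze_count()  # number of objects now ignored by collections
gc.unfreeze()                   # move them back to the oldest generation
print(frozen)
```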

&lt;p&gt;So does this solve the copy-on-write problem? I'm not sure, because I couldn't come up with an example that reproduces it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;psutil&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;display_memory_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_info&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;processing&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;display_memory_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"child "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"fork"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;processing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;display_memory_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"parent"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code snippet above will print the memory usage of the main process and child process. You may get something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;child  &amp;gt; pmem(rss=414748672, vms=427634688, shared=2969600, text=2035712, lib=0, data=411791360, dirty=0)
parent &amp;gt; pmem(rss=419000320, vms=427634688, shared=7221248, text=2035712, lib=0, data=411791360, dirty=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can see that they don't share much, even though, by default, a &lt;em&gt;forked&lt;/em&gt; process is supposed to share its pages with the parent process.&lt;/p&gt;

&lt;p&gt;But if we change it to &lt;em&gt;spawn&lt;/em&gt;, we will get something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;child  &amp;gt; pmem(rss=13848576, vms=23044096, shared=7069696, text=2035712, lib=0, data=7163904, dirty=0)
parent &amp;gt; pmem(rss=419139584, vms=428081152, shared=7196672, text=2035712, lib=0, data=412200960, dirty=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Since &lt;code&gt;data&lt;/code&gt; is not used by the &lt;em&gt;spawned&lt;/em&gt; process, it isn't copied to the new process at all.&lt;/p&gt;

&lt;p&gt;I tried adding &lt;code&gt;gc.freeze()&lt;/code&gt; before creating the new process, but it made no difference. I'm not sure what I missed.&lt;/p&gt;

&lt;p&gt;I found some discussion in the &lt;a href="https://github.com/python/cpython/pull/3705#issuecomment-420191452"&gt;&lt;code&gt;gc.freeze()&lt;/code&gt; PR&lt;/a&gt;. It looks like untouched data should be shareable among processes. Also, Gunicorn has spent four years on this issue: &lt;a href="https://github.com/benoitc/gunicorn/issues/1640"&gt;support for &lt;code&gt;gc.freeze()&lt;/code&gt; for apps that use preloading&lt;/a&gt;. I couldn't find a good example demonstrating that this method works well.&lt;/p&gt;

&lt;p&gt;To my understanding, &lt;code&gt;gc.freeze()&lt;/code&gt; exempts the frozen objects from generational garbage collection, but reference counting cannot be disabled. So after we &lt;em&gt;fork&lt;/em&gt; a new process, any access to the shared objects still changes their reference counts, which dirties the pages anyway.&lt;/p&gt;

&lt;p&gt;If we change the start method from &lt;em&gt;spawn&lt;/em&gt; to &lt;em&gt;fork&lt;/em&gt;, it doesn't seem to need &lt;code&gt;gc.freeze()&lt;/code&gt; to freeze the reference counts, which conflicts with the description in the Instagram blog.&lt;/p&gt;

&lt;p&gt;Is there any method to avoid this? Yes. Check another blog written before the Instagram blog: &lt;a href="https://llvllatrix.wordpress.com/2016/02/19/python-vs-copy-on-write/"&gt;Python vs Copy on Write&lt;/a&gt;. The solution is very straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can use &lt;a href="https://www.pypy.org/"&gt;PyPy&lt;/a&gt;, which has &lt;a href="https://doc.pypy.org/en/latest/cpython_differences.html#differences-related-to-garbage-collection-strategies"&gt;a different garbage collection strategy&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You can use &lt;a href="https://docs.python.org/3/library/multiprocessing.html#shared-ctypes-objects"&gt;shared &lt;code&gt;ctypes&lt;/code&gt; objects&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You can use &lt;a href="https://docs.python.org/3/library/multiprocessing.shared_memory.html"&gt;shared memory&lt;/a&gt; for Python &amp;gt;= 3.8.&lt;/li&gt;
&lt;li&gt;You can use &lt;a href="https://docs.python.org/3/library/mmap.html"&gt;mmap&lt;/a&gt; to &lt;a href="https://pythonspeed.com/articles/reduce-memory-array-copies/"&gt;reduce the memory usage of array copies&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
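&lt;p&gt;For example, a minimal sketch with &lt;code&gt;multiprocessing.shared_memory&lt;/code&gt; (single-process here for brevity; a child would attach to the block by &lt;code&gt;name&lt;/code&gt;):&lt;/p&gt;

```python
from multiprocessing import shared_memory

# Create a 16-byte block; another process could attach to the same bytes
# with shared_memory.SharedMemory(name=shm.name) instead of copying them.
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b"hello"
data = bytes(shm.buf[:5])
shm.close()
shm.unlink()  # free the block once no process needs it
print(data)
```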

&lt;h2&gt;
  
  
  Suggestions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Consider Go, Rust, or C++ for concurrent computing.&lt;/li&gt;
&lt;li&gt;Use &lt;em&gt;spawn&lt;/em&gt; instead of &lt;em&gt;fork&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Be careful about the garbage collection behavior.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>multiprocessing</category>
      <category>fork</category>
      <category>spawn</category>
    </item>
    <item>
      <title>Yet another deep learning serving framework</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Wed, 13 May 2020 17:06:19 +0000</pubDate>
      <link>https://dev.to/keming/yet-another-deep-learning-serving-framework-21di</link>
      <guid>https://dev.to/keming/yet-another-deep-learning-serving-framework-21di</guid>
      <description>&lt;p&gt;Yet another deep learning serving framework that is easy to use.&lt;/p&gt;

&lt;p&gt;Previously, I tested the performance of some &lt;a href="https://dev.to/kemingy/deep-learning-serving-benchmark-1nko"&gt;deep learning serving frameworks&lt;/a&gt; like TensorFlow Serving and Triton, and I found that these frameworks are not that easy to use. Moreover, they don't offer much of a performance advantage. So I wrote one as a prototype.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kemingy/ventu"&gt;ventu&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kemingy/batching"&gt;batching&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;del&gt;Feel free to give it a try&lt;/del&gt;. For production usage, check &lt;a href="https://github.com/mosecorg/mosec"&gt;MOSEC&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;serve the deep learning models (HTTP)&lt;/li&gt;
&lt;li&gt;preprocess and postprocess (optional)&lt;/li&gt;
&lt;li&gt;dynamic batching (increase the throughput)&lt;/li&gt;
&lt;li&gt;health check (need to provide examples)&lt;/li&gt;
&lt;li&gt;request &amp;amp; response validation&lt;/li&gt;
&lt;li&gt;model inference warm-up (need to provide examples)&lt;/li&gt;
&lt;li&gt;OpenAPI document&lt;/li&gt;
&lt;li&gt;supports both JSON and msgpack serialization&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advantages
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;support all kinds of deep learning runtime&lt;/li&gt;
&lt;li&gt;easy to implement the preprocess and postprocess part&lt;/li&gt;
&lt;li&gt;validation for request&lt;/li&gt;
&lt;li&gt;health check and warm-up with examples&lt;/li&gt;
&lt;li&gt;OpenAPI document&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dynamic Batching
&lt;/h3&gt;

&lt;p&gt;To implement dynamic batching, we need a high-performance job queue that can be consumed by multiple workers. A Go channel is a good fit. In this situation, we have one producer and multiple consumers, so it's easy to close the channel for a graceful shutdown.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Batching&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;       &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="c"&gt;// socket name&lt;/span&gt;
    &lt;span class="n"&gt;socket&lt;/span&gt;     &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Listener&lt;/span&gt;
    &lt;span class="n"&gt;maxLatency&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt; &lt;span class="c"&gt;// max latency for a batch inference to wait&lt;/span&gt;
    &lt;span class="n"&gt;batchSize&lt;/span&gt;  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="c"&gt;// max batch size for a batch inference&lt;/span&gt;
    &lt;span class="n"&gt;capacity&lt;/span&gt;   &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="c"&gt;// the capacity of the batching queue&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt; &lt;span class="c"&gt;// timeout for jobs in the queue&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;     &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;zap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Logger&lt;/span&gt;
    &lt;span class="n"&gt;queue&lt;/span&gt;      &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Job&lt;/span&gt; &lt;span class="c"&gt;// job queue&lt;/span&gt;
    &lt;span class="n"&gt;jobs&lt;/span&gt;       &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Job&lt;/span&gt; &lt;span class="c"&gt;// use job id as the key to find the job&lt;/span&gt;
    &lt;span class="n"&gt;jobsLock&lt;/span&gt;   &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mutex&lt;/span&gt; &lt;span class="c"&gt;// lock for jobs&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each job in this queue, we create a UUID as its key. After inference, we can find the job by looking up the key in a hash table, which means we also need a mutex to guard that table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Job&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;        &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt;      &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;      &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="c"&gt;// request data&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;    &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="c"&gt;// inference result or error message&lt;/span&gt;
    &lt;span class="n"&gt;errorCode&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="c"&gt;// HTTP Error Code&lt;/span&gt;
    &lt;span class="n"&gt;expire&lt;/span&gt;    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
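&lt;p&gt;The same idea as a rough Python sketch (the names are illustrative, not the actual implementation):&lt;/p&gt;

```python
import threading
import uuid

jobs = {}                     # job id -> request data
jobs_lock = threading.Lock()  # guards the hash table


def register(data):
    job_id = uuid.uuid4().hex
    with jobs_lock:
        jobs[job_id] = data
    return job_id


def finish(job_id):
    # called after inference to hand the result back to the waiting handler
    with jobs_lock:
        return jobs.pop(job_id)


jid = register(b"request payload")
result = finish(jid)
print(result)
```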



&lt;p&gt;Because the batching service and the Python inference workers run on the same machine (or the same pod), the most efficient communication channel is a &lt;a href="https://en.wikipedia.org/wiki/Unix_domain_socket"&gt;Unix domain socket&lt;/a&gt;. We also need to define a simple protocol for our use case. Since we only need to transfer the data of batch jobs, let's keep everything as simple as we can.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| length  |       data        |
| 4 bytes |   {length} bytes  |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
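&lt;p&gt;Encoding and decoding such a frame is a few lines in Python (big-endian length assumed here for illustration):&lt;/p&gt;

```python
import struct

def encode(payload: bytes) -> bytes:
    # 4-byte length prefix followed by {length} bytes of data
    return struct.pack(">I", len(payload)) + payload

def decode(frame: bytes) -> bytes:
    (length,) = struct.unpack(">I", frame[:4])
    return frame[4:4 + length]

frame = encode(b"batch-job")
decoded = decode(frame)
print(len(frame), decoded)
```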



&lt;ol&gt;
&lt;li&gt;workers send the first request with empty data to the batching service&lt;/li&gt;
&lt;li&gt;batching service collects a batch of jobs and sends to the workers&lt;/li&gt;
&lt;li&gt;worker processes these jobs

&lt;ul&gt;
&lt;li&gt;preprocess (for a single job)&lt;/li&gt;
&lt;li&gt;inference (for a batch of jobs)&lt;/li&gt;
&lt;li&gt;postprocess (for a single job)&lt;/li&gt;
&lt;li&gt;send the results to the batching service&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;batching service notifies the handler that this job is done, then the handler sends the result to the original client and goes to #2&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Error handling
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;timeout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a job is not processed by one of the workers within the timeout, the batching service deletes it from the hash table and returns HTTP 408 (Request Timeout).&lt;/p&gt;

&lt;p&gt;When the batching service tries to collect these jobs from the queue channel, it will check the &lt;code&gt;expire&lt;/code&gt; attribute first.&lt;/p&gt;
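&lt;p&gt;A rough sketch of that expiry check (the job shape here is hypothetical, mirroring the &lt;code&gt;expire&lt;/code&gt; field):&lt;/p&gt;

```python
import time

def collect_batch(pending, batch_size):
    # skip jobs whose `expire` time has passed; the real service would
    # answer those with HTTP 408 instead of batching them
    batch = []
    now = time.time()
    for job in pending:
        if job["expire"] < now:
            continue
        batch.append(job)
        if len(batch) == batch_size:
            break
    return batch

pending = [
    {"id": "fresh", "expire": time.time() + 5},
    {"id": "stale", "expire": time.time() - 1},
]
picked = collect_batch(pending, 2)
print([job["id"] for job in picked])
```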

&lt;ul&gt;
&lt;li&gt;validation error&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make sure the request data is valid, we use &lt;a href="//pydantic-docs.helpmanual.io/"&gt;pydantic&lt;/a&gt; for validation, so the user needs to define the data schema with &lt;code&gt;pydantic&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If one job's data is invalid, it is marked and its result becomes the validation error message generated by &lt;code&gt;pydantic&lt;/code&gt;. This doesn't affect the other jobs in the same batch. That part is handled by &lt;code&gt;ventu&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple HTTP service without dynamic batching
&lt;/h3&gt;

&lt;p&gt;For this part, we use &lt;a href="//falcon.readthedocs.io/"&gt;falcon&lt;/a&gt; which is a very powerful Python framework for web APIs. To generate the OpenAPI document and validate the request data, we use &lt;a href="https://github.com/0b01001001/spectree"&gt;spectree&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you would like to use &lt;code&gt;gunicorn&lt;/code&gt;, &lt;code&gt;ventu&lt;/code&gt; also exposes the &lt;code&gt;app&lt;/code&gt; object.&lt;/p&gt;

&lt;h2&gt;
  
  
  TODO
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;metrics

&lt;ul&gt;
&lt;li&gt;users can add these in the model inference part&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;increase the number of workers dynamically&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Deep Learning Serving Benchmark</title>
      <dc:creator>Ming</dc:creator>
      <pubDate>Thu, 23 Apr 2020 06:30:30 +0000</pubDate>
      <link>https://dev.to/keming/deep-learning-serving-benchmark-1nko</link>
      <guid>https://dev.to/keming/deep-learning-serving-benchmark-1nko</guid>
      <description>&lt;p&gt;There is no black magic, everything follows the rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  What do deep learning serving frameworks do?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;respond to requests (RESTful HTTP or RPC)&lt;/li&gt;
&lt;li&gt;model inference (with runtime)&lt;/li&gt;
&lt;li&gt;preprocessing &amp;amp; postprocessing (optional)&lt;/li&gt;
&lt;li&gt;queries dynamic batching (increase throughput)&lt;/li&gt;
&lt;li&gt;monitoring metrics&lt;/li&gt;
&lt;li&gt;service health check&lt;/li&gt;
&lt;li&gt;versioning&lt;/li&gt;
&lt;li&gt;multiple instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Actually, when deploying models with Kubernetes, we only need some of these features. But we do care about the performance of these frameworks, so let's run a benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Environments&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz&lt;/li&gt;
&lt;li&gt;GPU: NVIDIA V100&lt;/li&gt;
&lt;li&gt;Memory: 251GiB&lt;/li&gt;
&lt;li&gt;OS: Ubuntu 16.04.6 LTS (Xenial Xerus)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Docker Images&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tensorflow/tensorflow:latest-gpu&lt;/li&gt;
&lt;li&gt;tensorflow/serving:latest-gpu&lt;/li&gt;
&lt;li&gt;nvcr.io/nvidia/tensorrtserver:19.10-py3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The elapsed time is recorded after &lt;strong&gt;warmup&lt;/strong&gt;. Dynamic batching is &lt;strong&gt;disabled&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;All the code can be found in this &lt;a href="https://gist.github.com/kemingy/a382528b29f6e34c47b464cf16806731"&gt;gist&lt;/a&gt;.&lt;/p&gt;
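
&lt;p&gt;The timing pattern is roughly the following (a generic sketch, not the exact code from the gist): a few warmup batches run first so one-time costs such as CUDA context creation and graph optimization are excluded from the measurement.&lt;/p&gt;

```python
import time

def fake_infer(batch):
    # stand-in for the real model call
    return [x * 2 for x in batch]

def bench(batches, warmup=5):
    # run the first few batches untimed so one-time setup costs
    # are excluded from the measurement
    for batch in batches[:warmup]:
        fake_infer(batch)
    start = time.perf_counter()
    for batch in batches[warmup:]:
        fake_infer(batch)
    return time.perf_counter() - start

batches = [[0] * 32 for _ in range(100)]
elapsed = bench(batches)
print(f"{elapsed:.4f}s for {len(batches) - 5} timed batches")
```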

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Model Type&lt;/th&gt;
&lt;th&gt;Images&lt;/th&gt;
&lt;th&gt;Batch size&lt;/th&gt;
&lt;th&gt;Time(s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tensorflow&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;83.189&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tensorflow&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;86.897&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tensorflow Serving&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;120.496&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tensorflow Serving&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;116.887&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Triton (TensorRT Inference Server)&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;201.855&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Triton (TensorRT Inference Server)&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;171.056&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Falcon + msgpack + Tensorflow&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;115.686&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Falcon + msgpack + Tensorflow&lt;/td&gt;
&lt;td&gt;ResNet50&lt;/td&gt;
&lt;td&gt;TF SavedModel&lt;/td&gt;
&lt;td&gt;32000&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;115.572&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
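
&lt;p&gt;For an apples-to-apples view, the table converts to throughput by dividing the 32000 images by the measured wall time (batch-size-32 rows shown here):&lt;/p&gt;

```python
# throughput implied by the batch-size-32 rows of the table above
times = {
    "Tensorflow": 83.189,
    "Tensorflow Serving": 120.496,
    "Triton": 201.855,
    "Falcon + msgpack + Tensorflow": 115.686,
}
images = 32000
for name, seconds in times.items():
    print(f"{name}: {images / seconds:.1f} images/s")
# Tensorflow comes out around 384.7 images/s, Triton around 158.5
```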

&lt;p&gt;According to this benchmark, Triton is not yet ready for production, TF Serving is a good option for TensorFlow models, and a self-hosted service also performs quite well (though you may need to implement dynamic batching yourself for production).&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tensorflow Serving
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.tensorflow.org/tfx/serving"&gt;https://www.tensorflow.org/tfx/serving&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;coupled with the TensorFlow ecosystem (other formats are supported, but not out of the box)&lt;/li&gt;
&lt;li&gt;A/B testing&lt;/li&gt;
&lt;li&gt;provide both gRPC and HTTP RESTful API&lt;/li&gt;
&lt;li&gt;prometheus integration&lt;/li&gt;
&lt;li&gt;batching&lt;/li&gt;
&lt;li&gt;multiple models&lt;/li&gt;
&lt;li&gt;preprocessing &amp;amp; postprocessing can be implemented with &lt;a href="https://github.com/tensorflow/tensorflow/issues/31055"&gt;signatures&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Triton Inference Server
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/NVIDIA/triton-inference-server/"&gt;https://github.com/NVIDIA/triton-inference-server/&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;supports multiple backends: ONNX, PyTorch, TensorFlow, Caffe2, TensorRT&lt;/li&gt;
&lt;li&gt;both gRPC and HTTP with SDK&lt;/li&gt;
&lt;li&gt;internal health check and prometheus metrics&lt;/li&gt;
&lt;li&gt;batching&lt;/li&gt;
&lt;li&gt;concurrent model execution&lt;/li&gt;
&lt;li&gt;preprocessing &amp;amp; postprocessing can be done with ensemble models&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;shm-size&lt;/code&gt;, &lt;code&gt;memlock&lt;/code&gt;, &lt;code&gt;stack&lt;/code&gt; configurations are not available for Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multi Model Server
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/awslabs/multi-model-server"&gt;https://github.com/awslabs/multi-model-server&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requires Java 8&lt;/li&gt;
&lt;li&gt;provides an HTTP API&lt;/li&gt;
&lt;li&gt;Java layer communicates with Python workers through Unix Domain Socket or TCP&lt;/li&gt;
&lt;li&gt;batching (not mature)&lt;/li&gt;
&lt;li&gt;multiple models&lt;/li&gt;
&lt;li&gt;&lt;code&gt;log4j&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;management API&lt;/li&gt;
&lt;li&gt;you need to write the model loading and inference code yourself (which means you can use any runtime you want)&lt;/li&gt;
&lt;li&gt;easy to add preprocessing and postprocessing to the service&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GraphPipe
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://oracle.github.io/graphpipe"&gt;https://oracle.github.io/graphpipe&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;uses FlatBuffers, which is more efficient&lt;/li&gt;
&lt;li&gt;last updated 2 years ago...&lt;/li&gt;
&lt;li&gt;Oracle laid off the whole team&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  TorchServe
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/pytorch/serve"&gt;https://github.com/pytorch/serve&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;forked from Multi Model Server&lt;/li&gt;
&lt;li&gt;still under development...&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>tensorflow</category>
    </item>
  </channel>
</rss>
