Deb

Posted on Dec 7, 2023

8 Rusty open source data projects to watch in 2024 🤩

#rust #data #opensource #dataengineering

Context

The open source ecosystem is my favourite aspect of tech. It has been since the early days of big data in the late 2000s. As software, infra, and data engineers, open source projects are a source of great inspiration for all of us. Open source projects are also a great help in building solutions to validate and resolve customer challenges without reinventing the wheel.

In my first post wearing the developer advocacy hat, as a technical product leader, I want to share 8 data projects that I am tinkering with during the December break going into 2024.

First I will share why data projects, and why Rust. Then I will share the 8 open source projects that I am looking deeper into over the coming days and months.

Why data projects?

Data is the centre of attention in tech. For the past 12 years, I have heard it said in several software companies that "we want to treat data as a first class citizen." The reality behind that statement is that data has been anything but a first class citizen in the majority of organizations.

Data has come into the limelight every now and then. But it's been off and on.

In my personal experience, data has received bursts of attention since the early 21st century with the rise of the big data ecosystem, distributed file systems, map-reduce, NoSQL, machine learning, deep learning, artificial intelligence, and more recently large language models.

I have been working with complex data use cases since 2009. While I got enthused by augmented and virtual reality for a while, and blockchain and cryptography for a bit, my affinity towards data has been consistent throughout my career.

There has never been a better time to dive deep into data.

Why Rust based?

Since my first C/C++ programs implementing trees and linked lists in the early 2000s, I have written code in C#, ASP.NET, Java, Python, and R between 2010 and now. It has taken extra work to keep up with writing code since I moved to technical product management in 2014.

That's why I have not tried Go or Rust yet, and have remained content with SQL and Python scripts. As someone with a bit of experience in data-centric applications, I would argue that the existing stack has served us well. But the demands of data-centric applications have increased along with the innovations in hardware infrastructure, and we need to adapt.

Rust has been dismissed as hype for a while. But it is boring. It is complex to get started with. It works at the same system level as C++, with some additional promises.

Yet it is the most loved language among its adopters, and it has proved to be better than alternative system level programming languages in terms of performance, security, reliability, and efficiency. Several big tech companies have made the incredibly rare choice of rewriting critical distributed systems in Rust.

Since January 2023, I have been working with a team that has been building with Rust for 5 years. The more I dabble with various projects, the more I believe Rust will inevitably power several distributed data systems. It's about time I took a deeper dive!

The 8 Rust based open source data projects to try in 2024

Fluvio [GitHub: https://github.com/infinyon/fluvio]

The word Fluvio comes from Latin fluvius ‘river’. A river is made up of continuous 'streams' and powers entire civilizations through agriculture. It is an apt name for a data streaming platform.

I learnt about Fluvio in mid-2022, when a genius tech architect friend recommended the project. In January 2023 I left my last role, and higher pay, to work in product leadership at InfinyOn, the creators of Fluvio. So far I have used the cloud instances a lot more, since I did not want to spend the additional time configuring Kubernetes on my local system to play around with the open source version.

Here is the description that I recently updated on the Fluvio GitHub Repo:

Lean and mean distributed stream processing system written in rust and web assembly.

We just released a single binary installer for Fluvio Open Source with no dependency on Kubernetes. I am now going to start tinkering with the dev kits in the open source project. I will be sharing more about the workflows soon.

Arroyo [GitHub: https://github.com/ArroyoSystems/arroyo]

Arroyo is a scalable, stateful stream processing engine that is written in Rust and provides a SQL interface. Arroyo is a Y Combinator-backed startup project with a mission to build a streaming-first future.

About Arroyo on the GitHub Repo is succinct:

Distributed stream processing engine in Rust

The ability to express stateful streaming operations using declarative SQL is what data engineers love. On top of that, Arroyo has an integration with Fluvio, and its Rust-based engine makes the two a great combination.
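To make "stateful" concrete for myself, here is a minimal sketch in plain Python of the kind of operation a streaming SQL query with a windowed GROUP BY describes: counting events per key in fixed time windows. This is not Arroyo's API or SQL dialect. The function name, the event shape, and the sample data are all my own assumptions for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs (a hypothetical
    event shape). Returns a dict mapping (window_start, key) -> count.
    """
    # The running counts are the "state" that has to survive between events.
    state = defaultdict(int)
    for timestamp, key in events:
        # Round the timestamp down to the start of the window it falls in.
        window_start = (timestamp // window_seconds) * window_seconds
        state[(window_start, key)] += 1
    return dict(state)


# Made-up sample events: (seconds since start, event type).
events = [(0, "click"), (3, "view"), (7, "click"), (12, "click"), (14, "view")]
counts = tumbling_window_counts(events, window_seconds=10)
```

A real engine has to keep this state across machines, handle late or out-of-order events, and recover it after failures, which is exactly the hard part I would rather describe in SQL than build by hand.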

I am looking forward to trying out the Arroyo example projects and running SQL-based stateful operations on data from Fluvio topics. I will be sharing all about these workflows everywhere.

Qdrant [GitHub: https://github.com/qdrant/qdrant]

Qdrant is a vector database and similarity search engine written in Rust. It's making waves powering better search capabilities for several well known products like Twitter, GitBook and more.

The GitHub Repo About section describes Qdrant as:

Qdrant - High-performance, massive-scale Vector Database for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/

The ability to store embeddings from large language models and deep neural network based autoencoders is in high demand, given the massive accessibility and popularity of large language models.
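For anyone new to the idea, similarity search over embeddings boils down to "find the stored vectors closest to my query vector". Below is a minimal brute-force sketch in plain Python using cosine similarity. It is not Qdrant's API, and the tiny 3-dimensional vectors and their labels are made up for illustration. Vector databases typically rely on approximate nearest-neighbour indexes rather than scanning every vector like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Return the k (id, score) pairs most similar to `query`, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
embeddings = {
    "rivers": [0.9, 0.1, 0.0],
    "streams": [0.8, 0.2, 0.1],
    "databases": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], embeddings, k=2)
```

The brute-force scan costs one comparison per stored vector on every query, which is why a dedicated engine with proper indexing becomes interesting once the collection grows.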

Search and chat are highly used functionalities, and I am looking forward to building out the demo projects and digging into the interfaces in Qdrant.

LanceDB [GitHub: https://github.com/lancedb/lancedb]

LanceDB calls itself a database for vector search. It seems like LanceDB is a bridge between Python and Rust. The LanceDB core is written in Rust, and it natively supports Python and JavaScript/TypeScript.

The GitHub Repo About section describes LanceDB as:

Developer-friendly, serverless vector database for AI applications. Easily add long-term memory to your LLM apps!

I am looking forward to testing out the ecosystem integrations with LangChain 🦜️🔗, LlamaIndex 🦙, Apache-Arrow, Pandas, Polars, DuckDB and more on the way. It would be interesting to see the similarities and differences between LanceDB and Qdrant.

Linfa [GitHub: https://github.com/rust-ml/linfa]

Having spent a decent amount of time over the years tinkering around with Scikit-Learn in Python, I am curious to try out Linfa. It would be cool to try out the basic machine learning and statistical models to get a feel for how to implement them in Rust using a reasonable framework.

About Linfa on GitHub is precise:

A Rust machine learning framework.

Linfa in Italian translates to sap in English. The project "aims to provide a comprehensive toolkit to build Machine Learning applications with Rust. Kin in spirit to Python's scikit-learn, it focuses on common preprocessing tasks and classical ML algorithms for your everyday ML tasks."

I hope this becomes a decent bridge from Python to Rust. Data processing for machine learning and statistical modelling has been a useful way for me to learn about different use cases, and I am looking forward to playing around with Linfa.

Polars [GitHub: https://github.com/pola-rs/polars]

Polars has the highest momentum among data engineers who are used to Pandas DataFrames. I have used Pandas in a decent bit of experimentation. The interesting thing is that Polars is a DataFrame interface on top of an OLAP engine.

About Polars from GitHub reads:

Dataframes powered by a multithreaded, vectorized query engine, written in Rust.

I am specifically looking forward to trying out the hybrid streaming functionality in Polars and testing out the possibilities of using the interface for interactive real time visualization. I also want to test out the PyO3 extensions for Python functions compiled in Rust.

Lance [GitHub: https://github.com/lancedb/lance]

Lance is a columnar data format which deliberately calls out the ease of converting from Parquet files. Lance is the format used by the LanceDB vector database. Lance is also compatible with Pandas, DuckDB, Polars, and PyArrow.

About Lance from GitHub reads:

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming.

Converting Parquet files with 2 lines of code for 100X faster random access is insane, given the popularity of the Parquet format. I am looking to try out a couple of end to end streaming flows, switch around the data formats, and compare the different engines in different parts of the pipeline. The goal is to identify the integration challenges, and the pros and cons of using different systems.

SurrealDB [GitHub: https://github.com/surrealdb/surrealdb]

SurrealDB is the most popular project on this list in terms of community engagement. It is a document-graph data store for modern web applications.

About SurrealDB from GitHub reads:

A scalable, distributed, collaborative, document-graph database, for the realtime web.

SurrealDB is interesting as it is transactional, strongly typed, and offers granular access control. It is also multi-model to a degree, with representations in tables, documents, and graphs, making it an interesting player among distributed serverless databases.

I am looking forward to checking out the community projects and trying out their data models for streaming data.

Conclusion

It's always a party in tech, and with new things getting announced every day, there will surely be more options in the coming year. There are a few other projects that I am looking at trying out in the process. Apache OpenDAL [https://opendal.apache.org/] and OneTable [https://github.com/onetable-io/onetable] are a couple that come to mind.

Anyways, it sounds like 2024 is going to take me back to my software and data engineering days in the early 2010s, and I am looking forward to sharing my experiences with these tools, performance profiles, opinions about the landscape and more.

I am also a big believer in sharing my work in the places where people find their information. I have a single link to my profiles on various developer communities, and social media. Feel free to connect with me wherever you engage with developer insights.

Here is my link tree - https://www.singlel.ink/u/debroychowdhury

See you around.

Deb.