4 best open-source projects about big data you should try out

#database #datascience #programming #github

With the development of big data, the data lake era is arriving, and engineers with the relevant skills are scarce. More and more data engineers and data lake projects are coming into public view. Many of these projects are open source, but not every open-source product is worth trying. Let's look at some open-source data lake projects that are great, and in some respects even better than paid products.

1.Hudi
Hudi is an open-source project that provides tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency, all while keeping your data in open-source file formats.

Apache Hudi brings core warehouse and database functionality directly to a data lake, which is great for streaming workloads and lets users build efficient incremental batch pipelines. Hudi is also widely compatible: it can be used on any cloud, and it supports Apache Spark, Flink, Presto, Trino, Hive and many other query engines.
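To make "upserts" and "incremental pipelines" concrete, here is a minimal in-memory sketch in plain Python. It is a toy model, not Hudi's API: the `ToyTable` class, its methods, and the `id` record key are all hypothetical names chosen for illustration. The idea is that each write is a numbered commit, an upsert overwrites rows that share a key, and a downstream job can ask only for rows changed after a given commit.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory table keyed by 'id' (illustration only, not Hudi)."""
    rows: dict = field(default_factory=dict)  # id -> (commit_id, row)
    commit_id: int = 0

    def upsert(self, batch):
        """Insert new rows and overwrite existing rows with the same id."""
        self.commit_id += 1
        for row in batch:
            self.rows[row["id"]] = (self.commit_id, row)
        return self.commit_id

    def delete(self, ids):
        """Remove rows by id; missing ids are ignored."""
        self.commit_id += 1
        for record_id in ids:
            self.rows.pop(record_id, None)
        return self.commit_id

    def incremental(self, since_commit):
        """Return only the rows written after `since_commit`, sorted by id."""
        changed = [row for cid, row in self.rows.values() if cid > since_commit]
        return sorted(changed, key=lambda r: r["id"])


table = ToyTable()
first = table.upsert([{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lima"}])
table.upsert([{"id": 2, "city": "Quito"}, {"id": 3, "city": "Hanoi"}])

# A downstream batch job that already processed `first` reads only the changes.
changes = table.incremental(since_commit=first)
print(changes)
```

The second write updates row 2 and adds row 3, so the incremental read returns just those two rows instead of rescanning the whole table.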

2.Iceberg
Iceberg is an open table format for huge analytic datasets, with schema evolution, hidden partitioning, partition layout evolution, time travel, version rollback, etc.

Iceberg was built for huge tables. It is used in production where a single table can contain tens of petabytes of data, and even these huge tables can be read without a distributed SQL engine. Iceberg is known for fast scan planning, advanced filtering, compatibility with any cloud store, serializable isolation, support for multiple concurrent writers, etc.
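Time travel and version rollback are easier to picture with a small model. The sketch below is plain Python and does not use Iceberg's API; `ToySnapshotTable` and the file names are hypothetical. It assumes the simplest possible design: every write produces a new immutable snapshot (a list of data files), reads can target any past snapshot, and a rollback just moves the "current" pointer.

```python
class ToySnapshotTable:
    """Hypothetical snapshot-based table (illustration only, not Iceberg)."""

    def __init__(self):
        self.snapshots = [()]  # snapshot 0 is the empty table
        self.current = 0       # id of the snapshot that readers see by default

    def append(self, *files):
        """Create a new snapshot containing the current files plus `files`."""
        new_snapshot = self.snapshots[self.current] + tuple(files)
        self.snapshots.append(new_snapshot)
        self.current = len(self.snapshots) - 1
        return self.current

    def scan(self, snapshot_id=None):
        """List data files in the current snapshot, or in an older one."""
        sid = self.current if snapshot_id is None else snapshot_id
        return list(self.snapshots[sid])

    def rollback(self, snapshot_id):
        """Make an older snapshot current again; no snapshot is deleted."""
        if not 0 <= snapshot_id < len(self.snapshots):
            raise ValueError(f"unknown snapshot {snapshot_id}")
        self.current = snapshot_id


t = ToySnapshotTable()
s1 = t.append("a.parquet")
s2 = t.append("b.parquet")

latest = t.scan()                  # reads the newest snapshot
as_of_s1 = t.scan(snapshot_id=s1)  # time travel to the first write

t.rollback(s1)                     # undo the second write
after_rollback = t.scan()
```

Because old snapshots are never modified, a time-travel read and a rollback are both cheap: neither one rewrites any data.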

3.LakeSoul
LakeSoul is a unified streaming and batch table storage solution built on top of the Apache Spark engine. It supports scalable metadata management, ACID transactions, efficient and flexible upsert operations, schema evolution, and streaming and batch unification.

LakeSoul specializes in row-level and column-level incremental upserts, highly concurrent writes, and bulk scans of data on cloud storage. Its cloud-native architecture, which separates compute from storage, makes deployment very simple while supporting huge amounts of data at lower cost.
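A column-level upsert means an incoming record may carry only some of the columns, and only those columns are updated. The following plain-Python sketch shows that merge rule under simple assumptions; it is not LakeSoul's API, and `merge_upsert`, the `id` primary key, and the sample rows are hypothetical.

```python
def merge_upsert(table, batch, key="id"):
    """Return a new table where each partial row in `batch` is merged by key.

    `table` maps primary key -> full row (a dict). A partial row overwrites
    only the columns it contains; unknown keys are inserted as new rows.
    """
    merged = {pk: dict(row) for pk, row in table.items()}  # copy, keep input intact
    for partial in batch:
        pk = partial[key]
        merged.setdefault(pk, {}).update(partial)
    return merged


table = {1: {"id": 1, "name": "Ann", "score": 10}}
batch = [
    {"id": 1, "score": 12},   # column-level update: only `score` changes
    {"id": 2, "name": "Bo"},  # row-level insert of a new key
]
result = merge_upsert(table, batch)
print(result)
```

Row 1 keeps its `name` and gets the new `score`, while row 2 is inserted with only the columns it arrived with.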

4.Delta Lake
Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and APIs for Scala, Java, Rust, Ruby, and Python. It provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing on top of existing data lakes such as S3, ADLS, GCS, and HDFS.
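One common way to get ACID transactions on top of plain files is an ordered commit log with optimistic concurrency: a writer notes which version it read, and its commit is accepted only if nobody else committed in the meantime. The sketch below illustrates that idea in plain Python. It is a deliberately strict toy, not Delta Lake's API or its actual conflict rules; `ToyLog`, `CommitConflict`, and the file names are hypothetical.

```python
class CommitConflict(Exception):
    """Raised when a writer's view of the table is stale."""


class ToyLog:
    """Hypothetical ordered commit log (illustration only, not Delta Lake)."""

    def __init__(self):
        self.entries = []  # entries[i] holds the actions committed as version i

    @property
    def version(self):
        return len(self.entries) - 1  # -1 means nothing has been committed yet

    def commit(self, read_version, actions):
        """Append `actions` atomically if the writer saw the latest version."""
        if read_version != self.version:
            raise CommitConflict(
                f"read version {read_version}, but table is at {self.version}"
            )
        self.entries.append(list(actions))
        return self.version

    def live_files(self):
        """Replay the log to find the files that make up the current table."""
        files = set()
        for actions in self.entries:
            for op, path in actions:
                if op == "add":
                    files.add(path)
                elif op == "remove":
                    files.discard(path)
        return sorted(files)


log = ToyLog()
log.commit(-1, [("add", "part-0.parquet")])            # becomes version 0

# Writers A and B both start from version 0; A commits first.
log.commit(0, [("add", "part-1.parquet")])             # becomes version 1
try:
    log.commit(0, [("remove", "part-0.parquet")])      # B is stale
    conflict = False
except CommitConflict:
    conflict = True
    # B re-reads the table at version 1 and retries.
    log.commit(1, [("remove", "part-0.parquet"), ("add", "part-2.parquet")])
```

Readers never see a half-applied write, because a commit is either appended to the log as a whole or rejected. A real system can be more permissive than this toy, for example by letting non-overlapping writes through.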

Here is a comparison of these data lake projects.

Hudi focuses more on landing streaming data quickly and on correcting late-arriving data. Iceberg focuses on providing a unified operation API that hides the differences between underlying data storage formats, forming a standard, open, and universal way of organizing data that different engines can access through the API. LakeSoul, currently based on Spark, focuses more on building a standardized data lakehouse pipeline. Delta Lake, an open-source project from Databricks, tends to work at the Spark level on top of data stored as Parquet files.

As a newcomer to data lakes and lakehouses, I will keep learning about them and record my learning process here. Next, I will focus on these four open-source products, Hudi, Iceberg, LakeSoul, and Delta Lake, and write some code and tutorials as I learn. I hope my notes are helpful to you, and I welcome your advice.
