build-your-own-x
build-your-own-x is a project that grew out of the GitHub community. As the name suggests, it shares code that builds well-known projects and applications (databases, server frameworks, operating systems, and more) from the ground up.
As you know, there is a common programming tip, "Don't reinvent the wheel," which says that building something new from scratch is often a foolish endeavor. However, I believe this advice is only partially correct.
In terms of time efficiency, developing something from the ground up can indeed be time-consuming. But it's worth noting that long-established systems often carry significant technical debt and can be challenging to modify, especially when they're widely used in production environments.
So, if you have a clear purpose and compelling reasons for creating a new system from scratch, it's not necessarily a waste of time. In fact, it can be a valuable and justified approach in certain situations.
Motivation
Of course, what I'm starting here doesn't have such a grand purpose. I've worked as a data engineer for several years, and in that time I've dealt with file formats specialized for big data, such as Parquet and ORC.
Datalake format
One recent trend in data engineering is the datalake format: a way of building a table from a group of data files plus metadata arranged in a particular structure. It brings flexibility to stored data through ACID transactions, versioning, and schema enforcement.
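As a rough picture, such a table boils down to a directory of data files plus metadata describing them. The sketch below is a simplified, hypothetical layout (the subdirectory and file names are made up; it only mirrors how vine's reader, shown later, locates vine_meta.json and *.parquet files):
vine-test/              <- one table = one directory
├── vine_meta.json      <- table metadata: name, schema
└── part-dir/
    └── data-0.parquet  <- actual data files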
As you may know, there are already widely used datalake formats (Iceberg, Hudi, Delta, ...). These formats are being rapidly developed with strong community support, and I also use them as part of my job.
However, aside from using them effectively, I have always been curious about how they work internally. While they are open source and I could read through their code, I believe that building one from the bottom up is also an effective way to understand them.
That’s why I decided to start this project. Additionally, since most existing formats are based on Java/Scala, I wanted to explore whether using Rust could be an effective alternative.
How to start
So for this project, the first parts to build are:
- metadata
- the core file reading/writing logic
It should also offer an interface to data processing frameworks such as Spark/Flink, so a JNI binding is needed as well.
Result
Here's the current result, which is named vine:
https://github.com/kination/vine
I'll describe only the core part in this Part 1 post, and cover the rest in following posts.
Core
use std::ffi::CString;

use jni::objects::{JClass, JString};
use jni::sys::jobject;
use jni::JNIEnv;

// read_data / write_data are vine's core reader/writer functions.
#[no_mangle]
#[allow(non_snake_case)]
#[allow(unused_variables)]
pub extern "C" fn Java_io_kination_vine_VineModule_readDataFromVine(
    mut env: JNIEnv,
    class: JClass,
    dir_path: JString,
) -> jobject {
    // Convert the Java string argument into a Rust String.
    let path: String = env.get_string(&dir_path).expect("Cannot get data from dir_path").into();
    // Read all rows and join them into a single newline-delimited string.
    let rows = read_data(&path);
    let mut result = String::new();
    for row in rows {
        result.push_str(&row);
        result.push('\n');
    }
    // Hand the result back to the JVM as a java.lang.String.
    let output = CString::new(result).expect("Cannot generate CString from result");
    env.new_string(output.to_str().unwrap()).expect("Cannot create java string").into_raw()
}

#[no_mangle]
#[allow(non_snake_case)]
#[allow(unused_variables)]
pub extern "C" fn Java_io_kination_vine_VineModule_writeDataToVine(
    mut env: JNIEnv,
    class: JClass,
    path: JString,
    data: JString,
) {
    let path_str: String = env.get_string(&path).expect("Fail getting path").into();
    let data_str: String = env.get_string(&data).expect("Fail getting data").into();
    // Each line of the incoming string is one row to write.
    let rows: Vec<&str> = data_str.lines().collect();
    write_data(&path_str, &rows).expect("Failed to write data");
}
Currently, it assumes data is exchanged with the data processing framework as raw strings. Of course, I'll do more research on the best format for reading and writing through JNI to optimize performance.
Metadata
The metadata file is JSON-based and contains information about fields and the table name. Currently, it's stored as a single file, but additional files will be needed to support versioning and schema evolution.
{
  "table_name": "vine-test",
  "fields": [
    {
      "id": 1,
      "name": "id",
      "data_type": "integer",
      "is_required": true
    },
    {
      "id": 2,
      "name": "name",
      "data_type": "string",
      "is_required": false
    }
  ]
}
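On the Rust side, the writer deserializes this file with serde; the Metadata type appears in the writer snippet below. The following is a minimal sketch of struct shapes that would match this JSON. Apart from Metadata and the JSON keys, the names are my assumption rather than vine's exact definitions.
use serde::Deserialize;

// Hypothetical shapes mirroring vine_meta.json above;
// vine's actual definitions may differ in detail.
#[derive(Deserialize, Clone, Debug)]
struct Field {
    id: i64,
    name: String,
    data_type: String, // "integer" | "string" | "boolean" | "double"
    is_required: bool,
}

#[derive(Deserialize, Debug)]
struct Metadata {
    table_name: String,
    fields: Vec<Field>,
}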
Writer
Writing goes as follows:
- Read the raw data, which is a list of strings
- Read the metadata and check the field types
- Match each field type against the data
- Write to file
When parsing the metadata, it currently supports only 4 kinds of type, but more should be added.
...
// Deserialize the metadata file, then build a parquet message-type
// schema string from the field definitions.
let metadata: Metadata = serde_json::from_str(&meta_str).expect("Failed to deserialize metadata");
let meta_fields = metadata.fields.clone();
let mut schema_str = String::from("message schema {\n");
for field in meta_fields {
    let field_type = match field.data_type.as_str() {
        "integer" => "REQUIRED INT32",
        "string" => "REQUIRED BINARY",
        "boolean" => "REQUIRED BOOLEAN",
        "double" => "REQUIRED DOUBLE",
        _ => continue, // unsupported types are skipped for now
    };
    match field_type {
        // Strings are stored as BINARY annotated with UTF8.
        "REQUIRED BINARY" => schema_str.push_str(&format!(" {} {} (UTF8);\n", field_type, field.name)),
        _ => schema_str.push_str(&format!(" {} {};\n", field_type, field.name)),
    }
}
...
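For the sample vine_meta.json above, this loop produces a schema string like the following (note that is_required is not consulted yet, so every field comes out as REQUIRED):
message schema {
 REQUIRED INT32 id;
 REQUIRED BINARY name (UTF8);
}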
After defining the types, it writes the data in parquet format (of course, this can be changed).
One known issue is that the current logic follows these steps:
- Define the type of the raw data.
- Store the raw data in a temporary variable of that type.
- Write the data through the temporary variable.
However, I believe there's a more efficient approach to streamline this process.
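To make that pattern concrete, here is a minimal, hypothetical sketch (not vine's exact code) using the parquet crate's low-level writer, with a single REQUIRED INT32 column: the raw strings are first parsed into a typed temporary buffer, which is then handed to the typed column writer.
use std::{fs::File, sync::Arc};
use parquet::data_type::Int32Type;
use parquet::file::{properties::WriterProperties, writer::SerializedFileWriter};
use parquet::schema::parser::parse_message_type;

fn write_int_column(path: &str, raw_rows: &[&str]) -> parquet::errors::Result<()> {
    // 1. Define the type of the raw data (here: one REQUIRED INT32 column).
    let schema = Arc::new(parse_message_type("message schema { REQUIRED INT32 id; }")?);
    // 2. Store the raw data in a temporary variable of that type.
    let ids: Vec<i32> = raw_rows.iter().map(|s| s.parse().unwrap_or(0)).collect();
    // 3. Write the data through the temporary variable.
    let file = File::create(path).expect("Cannot create file");
    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(file, schema, props)?;
    let mut row_group = writer.next_row_group()?;
    if let Some(mut col) = row_group.next_column()? {
        col.typed::<Int32Type>().write_batch(&ids, None, None)?;
        col.close()?;
    }
    row_group.close()?;
    writer.close()?;
    Ok(())
}
Avoiding the intermediate Vec<i32>, for example by parsing and writing in batches, is one possible direction for the streamlining mentioned above.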
Reader
...
// Context: `directories` and `row_list` are defined by the surrounding
// function. Needs serde_json::Value and
// parquet::file::reader::{FileReader, SerializedFileReader} in scope.
// Load the table metadata to know each field's position and type.
let meta_str = read_to_string("vine-test/vine_meta.json").expect("Failed to read vine_meta.json");
let meta: Value = serde_json::from_str(&meta_str).expect("Failed to parse metadata JSON");
for (_, path) in directories {
    let sub_entries = fs::read_dir(path).expect("Cannot read path");
    for se in sub_entries {
        let file_path = se.expect("Cannot get sub entry").path();
        if file_path.extension().map_or(false, |ext| ext == "parquet") {
            let fields = meta["fields"].as_array().expect("fields should be an array");
            let file = File::open(file_path).expect("Cannot open file from file_path");
            let reader = SerializedFileReader::new(file).expect("cannot serialize file");
            let iter = reader.get_row_iter(None).expect("Cannot get row iterator");
            for row_result in iter {
                if let Ok(row) = row_result {
                    let mut values = Vec::new();
                    for field in fields {
                        // Fields are 1-indexed in metadata, but 0-indexed in parquet
                        let col_index = (field["id"].as_i64().unwrap_or(0) - 1) as usize;
                        let data_type = field["data_type"].as_str().expect("data_type should be string");
                        let value = match data_type {
                            "integer" => row.get_int(col_index).unwrap_or_default().to_string(),
                            "string" => row.get_string(col_index).unwrap().clone(),
                            ...
                            _ => String::from(""),
                        };
                        values.push(value);
                    }
                    // Join column values into a single comma-separated row string.
                    row_list.push(values.join(","));
                }
            }
        }
    }
}
...