<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hana Wang</title>
    <description>The latest articles on DEV Community by Hana Wang (@elliezza).</description>
    <link>https://dev.to/elliezza</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1186254%2Fb3a47378-3576-44b3-9cd3-e940992605b7.JPG</url>
      <title>DEV Community: Hana Wang</title>
      <link>https://dev.to/elliezza</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elliezza"/>
    <language>en</language>
    <item>
      <title>Introducing OpenMLDB’s New Feature: Feature Signatures — Enabling Complete Feature Engineering with SQL</title>
      <dc:creator>Hana Wang</dc:creator>
      <pubDate>Thu, 23 May 2024 08:42:00 +0000</pubDate>
      <link>https://dev.to/elliezza/introducing-openmldbs-new-feature-feature-signatures-enabling-complete-feature-engineering-with-sql-26l1</link>
      <guid>https://dev.to/elliezza/introducing-openmldbs-new-feature-feature-signatures-enabling-complete-feature-engineering-with-sql-26l1</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Rewinding to 2020, the Feature Engine team of Fourth Paradigm filed and was granted an invention patent titled “&lt;a href="https://patents.google.com/patent/CN111752967A" rel="noopener noreferrer"&gt;Data Processing Method, Device, Electronic Equipment, and Storage Medium Based on SQL&lt;/a&gt;”. This patent innovatively combines the SQL data processing language with machine learning feature signatures, greatly expanding the functional boundaries of SQL statements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2560%2F1%2AV5fQ3koN8HFikmZWJPtykA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2560%2F1%2AV5fQ3koN8HFikmZWJPtykA.png" alt="Screenshot of Patent in Cinese"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At that time, no SQL database or OLAP engine on the market supported this syntax, and even on Fourth Paradigm’s machine learning platform, the feature signature function could only be implemented using a custom DSL (Domain-Specific Language).&lt;/p&gt;

&lt;p&gt;Finally, in version v0.9.0, OpenMLDB introduced the feature signature function, supporting sample output in formats such as CSV and LIBSVM. This allows direct integration with machine learning training or prediction while ensuring consistency between offline and online environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Signatures and Label Signatures
&lt;/h2&gt;

&lt;p&gt;The feature signature function in OpenMLDB is implemented based on a series of OpenMLDB-customized UDFs (User-Defined Functions) on top of standard SQL. Currently, OpenMLDB supports the following signature functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;continuous(column)&lt;/code&gt;: Indicates that the column is a continuous feature; the column can be of any numerical type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;discrete(column[, bucket_size])&lt;/code&gt;: Indicates that the column is a discrete feature; the column can be of boolean type, integer type, or date and time type. The optional parameter &lt;code&gt;bucket_size&lt;/code&gt; sets the number of buckets. If &lt;code&gt;bucket_size&lt;/code&gt; is not specified, the range of values is the entire range of the int64 type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;binary_label(column)&lt;/code&gt;: Indicates that the column is a binary classification label; the column must be of boolean type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;multiclass_label(column)&lt;/code&gt;: Indicates that the column is a multiclass classification label; the column can be of boolean type or integer type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;regression_label(column)&lt;/code&gt;: Indicates that the column is a regression label; the column can be of any numerical type.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
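
&lt;p&gt;As a rough illustration of how a bucketed discrete signature behaves (a sketch only, not OpenMLDB’s internal implementation; the engine’s actual hash function may differ): the value is hashed, and when &lt;code&gt;bucket_size&lt;/code&gt; is given the hash is folded into that many buckets.&lt;/p&gt;

```python
# Illustrative sketch only: not OpenMLDB's internal implementation.
def discrete_sign(value, bucket_size=None):
    """Map a discrete value to a signature index.

    With no bucket_size the index ranges over the full int64 space;
    with bucket_size it is folded into [0, bucket_size).
    """
    h = hash(value)  # stand-in for the engine's hash function
    if bucket_size is None:
        return h
    return h % bucket_size  # Python's % keeps the result non-negative

print(discrete_sign(42, 10))
```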

&lt;p&gt;These functions must be used in conjunction with the sample format functions &lt;code&gt;csv&lt;/code&gt; or &lt;code&gt;libsvm&lt;/code&gt; and cannot be used independently. &lt;code&gt;csv&lt;/code&gt; and &lt;code&gt;libsvm&lt;/code&gt; can accept any number of parameters, and each parameter needs to be specified using functions like &lt;code&gt;continuous&lt;/code&gt; to determine how to sign it. OpenMLDB handles null and erroneous data appropriately, retaining the maximum amount of sample information.&lt;/p&gt;
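
&lt;p&gt;To make the two sample formats concrete, here is a toy rendering (the numbers are made up, not produced by OpenMLDB): a CSV sample joins the label and feature values with commas, while a LIBSVM sample writes the label followed by &lt;code&gt;index:value&lt;/code&gt; pairs numbered from 1.&lt;/p&gt;

```python
# Hypothetical signed sample: one label plus three feature values.
label = 980                    # e.g. regression_label(trip_duration)
features = [1.0, 40.77, 3.0]   # e.g. continuous(...) values

# CSV rendering: label and features joined by commas.
csv_line = ",".join(str(v) for v in [label] + features)

# LIBSVM rendering: "label index:value" pairs, indices starting at 1.
libsvm_line = str(label) + " " + " ".join(
    f"{i}:{v}" for i, v in enumerate(features, start=1))

print(csv_line)
print(libsvm_line)
```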

&lt;h2&gt;
  
  
  Usage Example
&lt;/h2&gt;

&lt;p&gt;First, follow the &lt;a href="https://openmldb.ai/docs/en/main/tutorial/standalone_use.html" rel="noopener noreferrer"&gt;quick start&lt;/a&gt; guide to get the image and start the OpenMLDB server and client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; 4pdosc/openmldb:0.9.0 bash
/work/init.sh
/work/openmldb/sbin/openmldb-cli.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a database and import data in the OpenMLDB client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--OpenMLDB CLI&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;demo_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;demo_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dropoff_datetime&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;passenger_count&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pickup_longitude&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pickup_latitude&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dropoff_longitude&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dropoff_latitude&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;store_and_fwd_flag&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trip_duration&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt;&lt;span class="n"&gt;execute_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'offline'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;LOAD&lt;/span&gt; &lt;span class="k"&gt;DATA&lt;/span&gt; &lt;span class="n"&gt;INFILE&lt;/span&gt; &lt;span class="s1"&gt;'/work/taxi-trip/data/taxi_tour_table_train_simple.snappy.parquet'&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="k"&gt;options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'parquet'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'append'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the &lt;code&gt;SHOW JOBS&lt;/code&gt; command to check the task running status. After the task is successfully executed, perform feature engineering and export the training data in CSV format.&lt;/p&gt;

&lt;p&gt;Currently, OpenMLDB does not support overly long column names, so the sample column must be given a short alias such as &lt;code&gt;instance&lt;/code&gt; by writing &lt;code&gt;SELECT csv(...) AS instance&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--OpenMLDB CLI&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;demo_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt;&lt;span class="n"&gt;execute_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'offline'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;trip_duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_latitude&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;vendor_sum_pl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;vendor_cnt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;
    &lt;span class="k"&gt;WINDOW&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt; &lt;span class="n"&gt;ROWS_RANGE&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;regression_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trip_duration&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor_sum_pl&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor_cnt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;discrete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor_cnt&lt;/span&gt; &lt;span class="n"&gt;DIV&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;OUTFILE&lt;/span&gt; &lt;span class="s1"&gt;'/tmp/feature_data_csv'&lt;/span&gt; &lt;span class="k"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quote&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If LIBSVM format training data is needed, simply change &lt;code&gt;SELECT csv(...)&lt;/code&gt; to &lt;code&gt;SELECT libsvm(...)&lt;/code&gt;. Note that the &lt;code&gt;OPTIONS&lt;/code&gt; should still use the CSV format because the exported data only has one column, which already contains the complete LIBSVM format sample.&lt;/p&gt;

&lt;p&gt;Moreover, the &lt;code&gt;libsvm&lt;/code&gt; function numbers continuous features, and discrete features with a known bucket count, starting from 1. Specifying the number of buckets therefore ensures that the feature encoding ranges of different columns do not conflict. If the bucket count of a discrete feature is left unspecified, there is a small probability of feature signature conflicts in some samples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--OpenMLDB CLI&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;demo_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt;&lt;span class="n"&gt;execute_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'offline'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;trip_duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_latitude&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;vendor_sum_pl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;vendor_cnt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;
    &lt;span class="k"&gt;WINDOW&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt; &lt;span class="n"&gt;ROWS_RANGE&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;libsvm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;regression_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trip_duration&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor_sum_pl&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor_cnt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;discrete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor_cnt&lt;/span&gt; &lt;span class="n"&gt;DIV&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;OUTFILE&lt;/span&gt; &lt;span class="s1"&gt;'/tmp/feature_data_libsvm'&lt;/span&gt; &lt;span class="k"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quote&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
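
&lt;p&gt;The index numbering described above can be pictured as follows (a sketch under the stated numbering-from-1 behavior, not the engine’s actual code): each continuous feature occupies one index, and a discrete feature with a known bucket count of &lt;code&gt;B&lt;/code&gt; reserves &lt;code&gt;B&lt;/code&gt; indices, so the encoding ranges of different columns cannot overlap.&lt;/p&gt;

```python
# Sketch of LIBSVM index-range allocation, assuming numbering starts at 1
# as stated above; not OpenMLDB's actual implementation.
def allocate_ranges(specs):
    """specs: list of ("continuous", name) or ("discrete", name, bucket_size).

    Returns {name: (first_index, last_index)} for each feature column.
    """
    ranges, next_idx = {}, 1
    for spec in specs:
        width = 1 if spec[0] == "continuous" else spec[2]
        ranges[spec[1]] = (next_idx, next_idx + width - 1)
        next_idx += width
    return ranges

# Mirrors the query above: three continuous features, then a
# discrete feature with 100 buckets.
print(allocate_ranges([
    ("continuous", "passenger_count"),
    ("continuous", "vendor_sum_pl"),
    ("continuous", "vendor_cnt"),
    ("discrete", "vendor_cnt_div10", 100),
]))
```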



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By combining SQL with machine learning, feature signatures simplify the data processing workflow and make feature engineering more efficient and consistent. This innovation extends the functional boundaries of SQL: samples can be output in multiple formats and fed directly into machine learning training and prediction, improving the flexibility and accuracy of data processing, with significant implications for data science and engineering practice.&lt;/p&gt;

&lt;p&gt;OpenMLDB introduces signature functions to further bridge the gap between feature engineering and machine learning frameworks. By uniformly signing samples with OpenMLDB, offline and online consistency can be improved throughout the entire process, reducing maintenance and change costs. In the future, OpenMLDB will add more signature functions, including one-hot encoding and feature crossing, to make the information in sample feature data more easily utilized by machine learning frameworks.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;For more information on OpenMLDB:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official website: &lt;a href="https://openmldb.ai/" rel="noopener noreferrer"&gt;https://openmldb.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/4paradigm/OpenMLDB" rel="noopener noreferrer"&gt;https://github.com/4paradigm/OpenMLDB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://openmldb.ai/docs/en/" rel="noopener noreferrer"&gt;https://openmldb.ai/docs/en/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join us on &lt;a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg" rel="noopener noreferrer"&gt;&lt;strong&gt;Slack&lt;/strong&gt;&lt;/a&gt;!&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is a re-post from &lt;a href="https://openmldb.medium.com/" rel="noopener noreferrer"&gt;OpenMLDB Blogs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>featureengineering</category>
      <category>sql</category>
      <category>featuresignatures</category>
      <category>openmldb</category>
    </item>
    <item>
      <title>OpenMLDB v0.9.0 Release: Major Upgrade in SQL Capabilities Covering the Entire Feature Servicing Process</title>
      <dc:creator>Hana Wang</dc:creator>
      <pubDate>Fri, 03 May 2024 03:22:41 +0000</pubDate>
      <link>https://dev.to/elliezza/openmldb-v090-release-major-upgrade-in-sql-capabilities-covering-the-entire-feature-servicing-process-bfo</link>
      <guid>https://dev.to/elliezza/openmldb-v090-release-major-upgrade-in-sql-capabilities-covering-the-entire-feature-servicing-process-bfo</guid>
      <description>&lt;p&gt;OpenMLDB has just released a new version v0.9.0, including SQL syntax extensions, MySQL protocol compatibility, TiDB storage support, online feature computation, feature signatures, and more. Among these, the most noteworthy features are the MySQL protocol and ANSI SQL compatibility, along with the extended SQL syntax capabilities.&lt;/p&gt;

&lt;p&gt;Firstly, MySQL protocol compatibility allows OpenMLDB users to access OpenMLDB clusters with any MySQL client: not only GUI applications like Navicat or Sequel Ace, but also the Java JDBC MySQL Driver, Python SQLAlchemy, the Go MySQL Driver, and MySQL SDKs for various other programming languages. For more information, refer to “&lt;a href="https://openmldb.medium.com/ultra-high-performance-database-openm-ysq-ldb-seamless-compatibility-with-mysql-protocol-and-d3f60210feea"&gt;&lt;strong&gt;Ultra High-Performance Database OpenM(ysq)LDB: Seamless Compatibility with MySQL Protocol and Multi-Language MySQL Client&lt;/strong&gt;&lt;/a&gt;”.&lt;/p&gt;

&lt;p&gt;Secondly, the new version significantly expands SQL capabilities, in particular implementing OpenMLDB’s unique request mode and stored procedure execution within standard SQL syntax. Compared to traditional SQL databases, OpenMLDB covers the entire machine learning process in both offline and online modes. In online mode, users can submit a sample row and obtain feature results through SQL feature extraction. Previously, this required deploying the SQL as a stored procedure with the &lt;code&gt;DEPLOY&lt;/code&gt; command and then performing online feature computation through SDKs or HTTP interfaces. The new version adds &lt;code&gt;SELECT CONFIG&lt;/code&gt; and &lt;code&gt;CALL&lt;/code&gt; statements, allowing users to specify the request mode and sample data directly in SQL to compute feature results, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Execute online request mode query for action (10, "foo", timestamp(4000))
SELECT id, count(val) over (partition by id order by ts rows between 10 preceding and current row)
FROM t1
CONFIG (execute_mode = 'online', values = (10, "foo", timestamp(4000)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use the ANSI SQL &lt;code&gt;CALL&lt;/code&gt; statement to invoke stored procedures with sample rows as parameters, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Execute online request mode query for action (10, "foo", timestamp(4000))
DEPLOY window_features SELECT id, count(val) over (partition by id order by ts rows between 10 preceding and current row)
FROM t1;

CALL window_features(10, "foo", timestamp(4000))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For detailed release notes, please refer to: &lt;a href="https://github.com/4paradigm/OpenMLDB/releases/tag/v0.9.0"&gt;https://github.com/4paradigm/OpenMLDB/releases/tag/v0.9.0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please feel free to download and explore the latest release. Your feedback is highly valued and appreciated. We encourage you to share your thoughts and suggestions to help us improve and enhance the platform. Thank you for your support!&lt;/p&gt;

&lt;h2&gt;
  
  
  Release Date
&lt;/h2&gt;

&lt;p&gt;April 25, 2024&lt;/p&gt;

&lt;h2&gt;
  
  
  Release Note
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/4paradigm/OpenMLDB/releases/tag/v0.9.0"&gt;https://github.com/4paradigm/OpenMLDB/releases/tag/v0.9.0&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Highlighted Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Added support for the latest version of SQLAlchemy 2, seamlessly integrating with popular Python libraries such as pandas and NumPy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Expanded support for more data backends, integrating TiDB’s distributed file storage capability with OpenMLDB’s high-performance in-memory feature computation capability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enhanced ANSI SQL support, fixed &lt;code&gt;first_value&lt;/code&gt; semantics, supported &lt;code&gt;MAP&lt;/code&gt; type and feature signatures, and added offline mode support for &lt;code&gt;INSERT&lt;/code&gt; statements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Added support for the MySQL protocol, allowing access to OpenMLDB clusters using MySQL clients like Navicat and Sequel Ace, as well as MySQL SDKs for various programming languages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extended SQL syntax support, enabling online feature computation directly through &lt;code&gt;SELECT CONFIG&lt;/code&gt; or &lt;code&gt;CALL&lt;/code&gt; statements.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;For more information on OpenMLDB:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official website: &lt;a href="https://openmldb.ai/"&gt;https://openmldb.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/4paradigm/OpenMLDB"&gt;https://github.com/4paradigm/OpenMLDB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://openmldb.ai/docs/en/"&gt;https://openmldb.ai/docs/en/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join us on &lt;a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"&gt;&lt;strong&gt;Slack&lt;/strong&gt;&lt;/a&gt;!&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is a re-post from &lt;a href="https://openmldb.medium.com/"&gt;OpenMLDB Blogs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>Comparative Analysis of Memory Consumption: OpenMLDB vs Redis Test Report</title>
      <dc:creator>Hana Wang</dc:creator>
      <pubDate>Wed, 03 Apr 2024 03:56:42 +0000</pubDate>
      <link>https://dev.to/elliezza/comparative-analysis-of-memory-consumption-openmldb-vs-redis-test-report-2gf5</link>
      <guid>https://dev.to/elliezza/comparative-analysis-of-memory-consumption-openmldb-vs-redis-test-report-2gf5</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;OpenMLDB is an open-source high-performance in-memory SQL database with numerous innovations and optimizations particularly tailored for time-series data storage, real-time feature computation, and other advanced functionalities. On the other hand, Redis is the most popular in-memory storage database widely used in high-performance online scenarios such as caching. While their respective application landscapes differ, both databases share a common trait of utilizing memory as their storage medium.&lt;/p&gt;

&lt;p&gt;The objective of this article is to perform a comparative analysis of memory consumption under identical data row counts for both databases. Our goal is to provide users with a clear and intuitive understanding of the respective memory resource consumptions of each database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Environment
&lt;/h2&gt;

&lt;p&gt;This test is based on physical machine deployment (40C250G * 3) with the following hardware specifications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz&lt;/li&gt;
&lt;li&gt;Processor: 40 Cores&lt;/li&gt;
&lt;li&gt;Memory: 250 G&lt;/li&gt;
&lt;li&gt;Storage: HDD 7.3T * 4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The software versions are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UOsbIhEV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ACMmxhQ8_Tny81f6_vDkKBw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UOsbIhEV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ACMmxhQ8_Tny81f6_vDkKBw.png" alt="" width="596" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Methods
&lt;/h2&gt;

&lt;p&gt;We have developed a Java-based testing tool using the OpenMLDB Java SDK and Jedis to compare memory usage between OpenMLDB and Redis. The objective is to insert identical data into both databases and analyze their respective memory usage. Due to variations in supported data types and storage methods, the data insertion process differs slightly between the two platforms. Since the data being tested consists of timestamped feature data, we have devised the following two distinct testing approaches to closely mimic real-world usage scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Method One: Random Data Generation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In this method, each test dataset comprises m keys serving as primary identifiers, with each key potentially having n different values (simulating time series data). For simplicity, each value is represented by a single field, and the lengths of the key and value fields can be controlled via configuration parameters. For OpenMLDB, we create a test table with two columns (key, value) and insert each key-value pair as a data entry. In the case of Redis, we use each key as an identifier and store multiple values corresponding to that key as a sorted set (zset) within Redis.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example
&lt;/h4&gt;

&lt;p&gt;We plan to test with 1 million (referred to as 1M) keys, each corresponding to 100 time-series data entries. Therefore, the actual data stored in OpenMLDB would be 1M * 100 = 100M, which is equivalent to 100 million data entries. In Redis, we store 1M keys, each key corresponding to a sorted set (zset) containing 100 members.&lt;/p&gt;
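
&lt;p&gt;The random key/value layout described above can be sketched as follows (a toy generator for illustration only; the actual tests use the Java tool from the benchmark repository, and the key and value lengths shown are arbitrary placeholders):&lt;/p&gt;

```python
import random
import string

def gen_dataset(m_keys, n_values, key_len=16, val_len=16, seed=0):
    """Yield (key, value) pairs: m_keys keys, each with n_values values.

    Each pair becomes one row in OpenMLDB's (key, value) table; in Redis,
    the n_values values of a key become members of one sorted set (zset).
    """
    rng = random.Random(seed)  # seeded for reproducibility

    def rand_str(n):
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(n))

    for _ in range(m_keys):
        key = rand_str(key_len)
        for _ in range(n_values):
            yield key, rand_str(val_len)

# Tiny demo: 3 keys x 2 values = 6 rows (the real test uses 1M x 100).
pairs = list(gen_dataset(m_keys=3, n_values=2))
print(len(pairs))
```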

&lt;h4&gt;
  
  
  Configurable Parameters
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YOfk8RWH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A7unY38PGdSYuRiZNNKopqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YOfk8RWH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A7unY38PGdSYuRiZNNKopqw.png" alt="" width="665" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Operation Steps (Reproducible)
&lt;/h4&gt;

&lt;h5&gt;
  
  
  a. Deploy OpenMLDB and Redis
&lt;/h5&gt;

&lt;p&gt;Deployment can be done through containerization or directly on physical machines using software packages. There is no significant difference between the two methods. Below is an example of using containerization for deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;OpenMLDB&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker image: &lt;code&gt;docker pull 4pdosc/openmldb:0.8.5&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://openmldb.ai/docs/zh/main/quickstart/openmldb_quickstart.html"&gt;https://openmldb.ai/docs/zh/main/quickstart/openmldb_quickstart.html&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;Redis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker image: &lt;code&gt;docker pull redis:7.2.4&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://hub.docker.com/_/redis"&gt;https://hub.docker.com/_/redis&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  b. Pull the &lt;a href="https://github.com/4paradigm/OpenMLDB/tree/main/benchmark"&gt;testing code&lt;/a&gt;
&lt;/h5&gt;

&lt;h5&gt;
  
  
  c. Modify configuration
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Configuration file: &lt;code&gt;src/main/resources/memory.properties&lt;/code&gt; [&lt;a href="https://github.com/4paradigm/OpenMLDB/blob/main/benchmark/src/main/resources/memory.properties"&gt;link&lt;/a&gt;]&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Note: Ensure that &lt;code&gt;REDIS_HOST_PORT&lt;/code&gt; and &lt;code&gt;ZK_CLUSTER&lt;/code&gt; configurations match the actual testing environment. Other configurations are related to the amount of test data and should be adjusted as needed. If the data volume is large, the testing process may take longer.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
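&lt;p&gt;For reference, the two environment-specific entries in &lt;code&gt;memory.properties&lt;/code&gt; look roughly like this (the addresses below are placeholders; substitute those of your own deployment):&lt;/p&gt;

```properties
# Redis server address used by the benchmark (placeholder value)
REDIS_HOST_PORT=127.0.0.1:6379
# ZooKeeper address of the OpenMLDB cluster (placeholder value)
ZK_CLUSTER=127.0.0.1:2181
```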

&lt;h5&gt;
  
  
  d. Run the tests
&lt;/h5&gt;

&lt;p&gt;[Related paths in the GitHub benchmark Readme]&lt;/p&gt;

&lt;h5&gt;
  
  
  e. Check the output results
&lt;/h5&gt;

&lt;h3&gt;
  
  
  Method Two: Using the Open Source Dataset TalkingData
&lt;/h3&gt;

&lt;p&gt;To enhance the credibility of the results, cover a broader range of data types, and facilitate result reproduction and comparison, we have designed a test using an open-source dataset — the TalkingData dataset. This dataset is used as a typical case in &lt;a href="https://openmldb.ai/docs/en/main/use_case/talkingdata_demo.html"&gt;OpenMLDB for ad fraud detection&lt;/a&gt;. Here, we utilize the TalkingData train dataset, which can be obtained as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Sample data: &lt;a href="https://github.com/4paradigm/OpenMLDB/blob/main/demo/talkingdata-adtracking-fraud-detection/train_sample.csv"&gt;sample data used in OpenMLDB&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Full data: Available on &lt;a href="https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data"&gt;Kaggle&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike the first method, the TalkingData dataset includes multiple columns with string, numeric, and time types. To align storage and usage more closely with real-world scenarios, we use the “ip” column from TalkingData as the key for storage. In OpenMLDB, this means creating a table corresponding to the TalkingData dataset and creating an index on the “ip” column (OpenMLDB defaults to creating an index on the first column). In Redis, we use “ip” as the key and store a JSON string composed of the other columns in a zset (as TalkingData is time-series data, multiple rows can share the same “ip”).&lt;/p&gt;
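&lt;p&gt;The Redis-side layout can be illustrated with a short Python sketch (the sample rows and the insertion-order score scheme are hypothetical; the benchmark code may differ in detail):&lt;/p&gt;

```python
import json
from collections import defaultdict

# A few made-up rows in TalkingData's schema:
# (ip, app, device, os, channel, click_time, is_attributed)
rows = [
    (83230, 3, 1, 13, 379, "2017-11-06 14:32:21", 0),
    (83230, 12, 1, 19, 245, "2017-11-06 15:01:02", 0),
    (17357, 3, 1, 19, 379, "2017-11-06 14:33:34", 0),
]
columns = ["app", "device", "os", "channel", "click_time", "is_attributed"]

# Redis layout: ip -> zset whose members are JSON strings of the other columns
zsets = defaultdict(dict)
for ip, *rest in rows:
    member = json.dumps(dict(zip(columns, rest)))
    zsets[ip][member] = float(len(zsets[ip]))  # score: insertion order

print(len(zsets[83230]))  # 2 -- two time-series entries share ip 83230
```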

&lt;h4&gt;
  
  
  Example
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3XvINGfI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AXPXdzaRIq68z6LdBwEaY8Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3XvINGfI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AXPXdzaRIq68z6LdBwEaY8Q.png" alt="" width="662" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Configurable Parameters
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BNl53h6r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AovUBAeCcDeADBcedmJapeQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BNl53h6r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AovUBAeCcDeADBcedmJapeQ.png" alt="" width="663" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Operation Steps (Reproducible Steps)
&lt;/h4&gt;

&lt;h5&gt;
  
  
  a. Deploy OpenMLDB and Redis
&lt;/h5&gt;

&lt;p&gt;Same as in method one.&lt;/p&gt;

&lt;h5&gt;
  
  
  b. Pull the &lt;a href="https://github.com/4paradigm/OpenMLDB/tree/main/benchmark"&gt;testing code&lt;/a&gt;
&lt;/h5&gt;

&lt;h5&gt;
  
  
  c. Modify configuration
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;Configuration file: &lt;code&gt;src/main/resources/memory.properties&lt;/code&gt; [&lt;a href="https://github.com/4paradigm/OpenMLDB/blob/main/benchmark/src/main/resources/memory.properties"&gt;link&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;Note:

&lt;ul&gt;
&lt;li&gt;Ensure that &lt;code&gt;REDIS_HOST_PORT&lt;/code&gt; and &lt;code&gt;ZK_CLUSTER&lt;/code&gt; configurations match the actual testing environment.&lt;/li&gt;
&lt;li&gt;Modify &lt;code&gt;TALKING_DATASET_PATH&lt;/code&gt; (defaults to &lt;code&gt;resources/data/talking_data_sample.csv&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  d. Obtain the test data file
&lt;/h5&gt;

&lt;p&gt;Place the test data file in the &lt;code&gt;resources/data&lt;/code&gt; directory, which is consistent with the &lt;code&gt;TALKING_DATASET_PATH&lt;/code&gt; configuration path.&lt;/p&gt;

&lt;h5&gt;
  
  
  e. Run the tests
&lt;/h5&gt;

&lt;p&gt;[Related paths in the GitHub benchmark Readme]&lt;/p&gt;

&lt;h5&gt;
  
  
  f. Check the output results
&lt;/h5&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Random Data Test
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vk55fTyS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AB0ZdlXLfn9ENGdGKjnz1Qg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vk55fTyS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AB0ZdlXLfn9ENGdGKjnz1Qg.png" alt="" width="659" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TfPrRum0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2800/0%2AOVmEh9P1vfJHvXay" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TfPrRum0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2800/0%2AOVmEh9P1vfJHvXay" alt="" width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under the experimental conditions above, when storing the same amount of data, OpenMLDB (in memory storage mode) consumes over 30% less memory than Redis.&lt;/p&gt;

&lt;h3&gt;
  
  
  TalkingData Test
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VdtfRBmn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AgzgkhGue0sXvwD-zhY4H1A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VdtfRBmn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AgzgkhGue0sXvwD-zhY4H1A.png" alt="" width="715" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IyxtqLnz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2800/0%2Ar5EJqYOtgistOp3D" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IyxtqLnz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2800/0%2Ar5EJqYOtgistOp3D" alt="" width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks to OpenMLDB’s data compression capabilities, when storing small sampled batches of the TalkingData train dataset, OpenMLDB’s memory usage is 74.77% lower than Redis’s. As the volume of test data grows, the TalkingData train dataset produces a high proportion of duplicate keys, which narrows OpenMLDB’s storage advantage over Redis. Once the entire train dataset is stored, OpenMLDB’s memory usage is 45.66% lower than Redis’s.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;For the open-source TalkingData dataset, when storing data of similar magnitude, OpenMLDB reduces memory usage by 45.66% compared to Redis. Even on datasets consisting purely of string data, OpenMLDB still reduces memory usage by over 30%.&lt;/p&gt;

&lt;p&gt;This is due to OpenMLDB’s compact row encoding format, which optimizes the storage of various data types. The optimization reduces the memory footprint of in-memory databases and lowers serving costs. The comparison with Redis, a mainstream in-memory database, further demonstrates OpenMLDB’s advantage in memory usage and Total Cost of Ownership (TCO).&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;For more information on OpenMLDB:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official website: &lt;a href="https://openmldb.ai/"&gt;https://openmldb.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/4paradigm/OpenMLDB"&gt;https://github.com/4paradigm/OpenMLDB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://openmldb.ai/docs/en/"&gt;https://openmldb.ai/docs/en/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join us on &lt;a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"&gt;&lt;strong&gt;Slack&lt;/strong&gt;&lt;/a&gt;!&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is a re-post from &lt;a href="https://openmldb.medium.com/"&gt;OpenMLDB Blogs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>redis</category>
      <category>report</category>
      <category>opensource</category>
      <category>memory</category>
    </item>
    <item>
      <title>Ultra High-Performance Database OpenM(ysq)LDB: Seamless Compatibility with MySQL Protocol and Multi-Language MySQL Client</title>
      <dc:creator>Hana Wang</dc:creator>
      <pubDate>Tue, 26 Mar 2024 01:49:00 +0000</pubDate>
      <link>https://dev.to/elliezza/ultra-high-performance-database-openmysqldb-seamless-compatibility-with-mysql-protocol-and-multi-language-mysql-client-2d9l</link>
      <guid>https://dev.to/elliezza/ultra-high-performance-database-openmysqldb-seamless-compatibility-with-mysql-protocol-and-multi-language-mysql-client-2d9l</guid>
      <description>&lt;h2&gt;
  
  
  What’s OpenM(ysq)LDB?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/4paradigm/OpenMLDB"&gt;OpenMLDB&lt;/a&gt; has introduced a new service module called OpenM(ysq)LDB, expanding its capabilities to integrate with MySQL infrastructure. This extension redefines the “ML” in OpenMLDB to signify both Machine Learning and MySQL compatibility. Through OpenM(ysq)LDB, users gain the ability to utilize MySQL command-line clients or MySQL SDKs in various programming languages, enabling seamless access to OpenMLDB’s unique online and offline feature calculation capabilities.&lt;/p&gt;

&lt;p&gt;OpenMLDB itself is a distributed high-performance memory time-series database built on C++ and LLVM technologies. Its architectural design and implementation logic significantly differ from traditional standalone relational databases like MySQL. OpenMLDB has garnered widespread adoption, particularly in hard real-time online feature calculation scenarios such as financial risk control and recommendation systems. While OpenMLDB’s capabilities are robust, its adoption was initially hindered by perceived high adaptation costs.&lt;/p&gt;

&lt;p&gt;However, the introduction of OpenM(ysq)LDB addresses this barrier by facilitating direct integration with MySQL Clients and SDKs. Through standard ANSI SQL interfaces, OpenMLDB is now compatible with MySQL protocol, allowing customers to directly use the familiar MySQL clients to access OpenMLDB data and perform special OpenMLDB SQL feature extraction syntax. This enhancement streamlines the transition for users familiar with MySQL environments, making OpenMLDB’s advanced features more accessible and user-friendly.&lt;/p&gt;

&lt;p&gt;For more details, check the &lt;a href="//../app_ecosystem/open_mysql_db/index.rst"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use a Compatible MySQL Command Line
&lt;/h3&gt;

&lt;p&gt;After deploying the OpenMLDB distributed cluster, developers do not need to install an additional OpenMLDB command-line tool. Using the pre-installed MySQL command-line client, they can connect to the OpenMLDB cluster directly for testing (note that the following SQL connections and execution results are all returned by the OpenMLDB cluster, not by a remote MySQL service).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TeEI2mHh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2800/0%2A5QIIjVAzsut4WoQ-" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TeEI2mHh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2800/0%2A5QIIjVAzsut4WoQ-" alt="" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By executing customized OpenMLDB SQL, developers can not only view the status of the OpenMLDB cluster but also switch between offline mode and online mode to realize the offline and online feature extraction functions of MLOps.&lt;/p&gt;
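&lt;p&gt;The mode switch is driven by OpenMLDB’s &lt;code&gt;execute_mode&lt;/code&gt; session variable; a sketch of the statements, issued from any connected MySQL client (the table name &lt;code&gt;db1.t1&lt;/code&gt; is illustrative):&lt;/p&gt;

```sql
-- Switch the session to offline mode for batch feature computation
SET @@execute_mode='offline';

-- Switch back to online mode for real-time feature extraction
SET @@execute_mode='online';
SELECT * FROM db1.t1;
```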

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iDfVmKer--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2800/0%2AptCWEInPe6Lc-BJe" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iDfVmKer--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2800/0%2AptCWEInPe6Lc-BJe" alt="" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a Compatible JDBC Driver
&lt;/h3&gt;

&lt;p&gt;Java developers generally use the MySQL JDBC driver to connect to MySQL. The same code can directly connect to the OpenMLDB cluster without any modification.&lt;/p&gt;

&lt;p&gt;Write the Java application code as follows, adjusting the IP, port, username, and password to match your actual cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class Main {
    public static void main(String[] args) {
        String url = "jdbc:mysql://localhost:3307/db1";
        String user = "root";
        String password = "root";
        Connection connection = null;
        Statement statement = null;
        ResultSet resultSet = null;
        try {
            connection = DriverManager.getConnection(url, user, password);
            statement = connection.createStatement();
            resultSet = statement.executeQuery("SELECT * FROM db1.t1");
            while (resultSet.next()) {
                int id = resultSet.getInt("id");
                String name = resultSet.getString("name");
                System.out.println("ID: " + id + ", Name: " + name);
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            // Close the result set, statement, and connection
            try {
                if (resultSet != null) {
                    resultSet.close();
                }
                if (statement != null) {
                    statement.close();
                }
                if (connection != null) {
                    connection.close();
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then compile and execute it, and the data queried from the OpenMLDB database appears in the command-line output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hpNULpUW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2258/0%2AaivjJLsx6yA2yshj" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hpNULpUW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2258/0%2AaivjJLsx6yA2yshj" alt="" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a Compatible SQLAlchemy Driver
&lt;/h3&gt;

&lt;p&gt;Python developers often use SQLAlchemy with a MySQL driver, and the same code can be applied directly to query OpenMLDB’s online data.&lt;/p&gt;

&lt;p&gt;Write the Python application code as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sqlalchemy import create_engine, text

def main():
    engine = create_engine("mysql+pymysql://root:root@127.0.0.1:3307/db1", echo=True)
    with engine.connect() as conn:
        result = conn.execute(text("SELECT * FROM db1.t1"))
        for row in result:
            print(row)

if __name__ == "__main__":
  main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run it directly, and the corresponding OpenMLDB query results appear in the command-line output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fN4h1G04--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2414/0%2AEyMXQBhAIXYsleC_" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fN4h1G04--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2414/0%2AEyMXQBhAIXYsleC_" alt="" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a Compatible Go MySQL Driver
&lt;/h3&gt;

&lt;p&gt;Golang developers generally use the officially recommended &lt;code&gt;github.com/go-sql-driver/mysql&lt;/code&gt; driver to access MySQL. The same code can access the OpenMLDB cluster directly, without modification.&lt;/p&gt;

&lt;p&gt;Write the Golang application code as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/go-sql-driver/mysql"
)

func main() {
        // MySQL database connection parameters
        dbUser := "root"         // Replace with your MySQL username
        dbPass := "root"         // Replace with your MySQL password
        dbName := "db1"    // Replace with your MySQL database name
        dbHost := "localhost:3307"        // Replace with your MySQL host address
        dbCharset := "utf8mb4"            // Replace with your MySQL charset

        // Create a database connection
        db, err := sql.Open("mysql", fmt.Sprintf("%s:%s@tcp(%s)/%s?charset=%s", dbUser, dbPass, dbHost, dbName, dbCharset))
        if err != nil {
                log.Fatalf("Error connecting to the database: %v", err)
        }
        defer db.Close()

        // Perform a simple query
        rows, err := db.Query("SELECT id, name FROM db1.t1")
        if err != nil {
                log.Fatalf("Error executing query: %v", err)
        }
        defer rows.Close()

        // Iterate over the result set
        for rows.Next() {
                var id int
                var name string
                if err := rows.Scan(&amp;amp;id, &amp;amp;name); err != nil {
                        log.Fatalf("Error scanning row: %v", err)
                }
                fmt.Printf("ID: %d, Name: %s\n", id, name)
        }
        if err := rows.Err(); err != nil {
                log.Fatalf("Error iterating over result set: %v", err)
        }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compile and run it directly, and the query results appear in the command-line output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---vMV-RXg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AAPMul6cHPvXo1g6y" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---vMV-RXg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AAPMul6cHPvXo1g6y" alt="" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a Compatible Sequel Ace Client
&lt;/h3&gt;

&lt;p&gt;MySQL developers usually use GUI applications to simplify database management. If developers want to connect to an OpenMLDB cluster, they can also use such open-source GUI tools.&lt;/p&gt;

&lt;p&gt;Taking Sequel Ace as an example, developers do not need to modify any project code. They only need to fill in the address and port of the OpenM(ysq)LDB service when connecting, along with the username and password of the OpenMLDB service, and can then access OpenMLDB in the same way they would operate MySQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ep9xdEes--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2800/0%2AR8vfgFm3GN4nG8Is" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ep9xdEes--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2800/0%2AR8vfgFm3GN4nG8Is" alt="" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a Compatible Navicat Client
&lt;/h3&gt;

&lt;p&gt;In addition to Sequel Ace, Navicat is another popular MySQL client. Again, no project code needs to change: when creating a new MySQL connection, fill in the address and port of the OpenM(ysq)LDB service together with the username and password of the OpenMLDB service, and then access OpenMLDB the same way as MySQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wGSwmYNt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2800/0%2APYBRNTVd2A2Br4NB" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wGSwmYNt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2800/0%2APYBRNTVd2A2Br4NB" alt="" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Compatibility Principle of MySQL Protocol
&lt;/h2&gt;

&lt;p&gt;The MySQL protocol (including that of later derivatives such as MariaDB) is publicly documented. On the server side, OpenM(ysq)LDB fully implements and is compatible with the MySQL protocol, while at the backend it manages connections to the distributed OpenMLDB cluster through the OpenMLDB SDK, enabling access from various MySQL clients.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bnzAxR24--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2452/0%2AMzGK9PxpjgLgBAw4" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bnzAxR24--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2452/0%2AMzGK9PxpjgLgBAw4" alt="" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently, OpenM(ysq)LDB maintains client interaction with OpenMLDB through long-lived connections. This ensures that each connection has a unique client object accessing the OpenMLDB cluster: SQL queries on the same connection need no additional initialization, and resources are released automatically when the connection closes. The overhead of the service itself is almost negligible, and performance is on par with connecting to OpenMLDB directly.&lt;/p&gt;

&lt;p&gt;For more usage documentation, please refer to the &lt;a href="//../app_ecosystem/open_mysql_db/index.rst"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;OpenM(ysq)LDB is a bold experiment within the OpenMLDB project. After 39 releases from 0.1.5 to 0.8.5, with continuous improvement in functionality and SQL syntax compatibility, it has finally achieved full compatibility with the MySQL protocol. It not only covers basic SQL query functionality but also provides a lower-level storage implementation and AI capabilities that outperform MySQL. From now on, MySQL/MariaDB users can seamlessly switch their database storage engine, and developers in different programming languages can directly use mature MySQL SDKs. The barrier to entry for OpenMLDB services has been significantly lowered, providing a “shortcut” for DBAs and data developers transitioning to AI.&lt;/p&gt;

&lt;p&gt;Please note that, as of now, MySQL Workbench is not yet supported with OpenM(ysq)LDB. The relevant testing work is still ongoing, and interested developers can follow the project’s progress on GitHub.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;For more information on OpenMLDB:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official website: &lt;a href="https://openmldb.ai/"&gt;https://openmldb.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/4paradigm/OpenMLDB"&gt;https://github.com/4paradigm/OpenMLDB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://openmldb.ai/docs/en/"&gt;https://openmldb.ai/docs/en/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join us on &lt;a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"&gt;&lt;strong&gt;Slack&lt;/strong&gt;&lt;/a&gt;!&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is a re-post from &lt;a href="https://openmldb.medium.com/"&gt;OpenMLDB Blogs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>mysql</category>
      <category>opensource</category>
      <category>database</category>
      <category>featureengineering</category>
    </item>
    <item>
      <title>Integrating Apache Hive — Offline Data for OpenMLDB</title>
      <dc:creator>Hana Wang</dc:creator>
      <pubDate>Wed, 13 Mar 2024 06:21:49 +0000</pubDate>
      <link>https://dev.to/elliezza/integrating-apache-hive-offline-data-for-openmldb-2n9k</link>
      <guid>https://dev.to/elliezza/integrating-apache-hive-offline-data-for-openmldb-2n9k</guid>
      <description>&lt;p&gt;The &lt;a href="https://hive.apache.org/"&gt;Apache Hive&lt;/a&gt;™ is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL. OpenMLDB extends its capabilities by offering seamless import and export functionalities for Hive as a data warehousing solution. While Hive is primarily used as an offline data source, it can also function as a data source for online data ingestion during the initialization phase of online engines.&lt;/p&gt;

&lt;p&gt;Note that currently only reading from and writing to non-ACID (EXTERNAL) tables in Hive is supported. ACID tables (full ACID or insert-only, i.e., MANAGED tables) are not supported at the moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenMLDB Deployment
&lt;/h2&gt;

&lt;p&gt;You can refer to the official documentation for &lt;a href="https://openmldb.ai/docs/en/main/deploy/install_deploy.html"&gt;deployment&lt;/a&gt;. An easier way is to deploy with an official docker image, as described in &lt;a href="https://openmldb.ai/docs/en/main/quickstart/openmldb_quickstart.html"&gt;Quickstart&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In addition, you will also need Spark; please refer to &lt;a href="https://openmldb.ai/docs/en/main/tutorial/openmldbspark_distribution.html"&gt;OpenMLDB Spark Distribution&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hive-OpenMLDB Integration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;p&gt;For users employing &lt;a href="https://openmldb.ai/docs/en/main/tutorial/openmldbspark_distribution.html"&gt;OpenMLDB Spark Distribution Version&lt;/a&gt;, specifically v0.6.7 and newer iterations, the essential Hive dependencies are already integrated.&lt;/p&gt;

&lt;p&gt;However, if you are working with an alternative Spark distribution, you can follow these steps for installation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute the following command in the Spark source directory to compile the Hive dependencies:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
    ./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;After successful execution, the dependency packages are located in the directory &lt;code&gt;assembly/target/scala-xx/jars&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add all dependent packages to Spark’s class path.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;

&lt;p&gt;At present, OpenMLDB supports connections to Hive exclusively through metastore services. You can adopt either of the two configuration methods below to access the Hive data source. For a simple Hive environment, configuring &lt;code&gt;hive.metastore.uris&lt;/code&gt; suffices. In a production environment where full Hive configurations are required, configuration through &lt;code&gt;hive-site.xml&lt;/code&gt; is recommended.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using the &lt;code&gt;spark.conf&lt;/code&gt; Approach: You can set up &lt;code&gt;spark.hadoop.hive.metastore.uris&lt;/code&gt; within the Spark configuration. This can be accomplished in two ways:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;a. taskmanager.properties: include &lt;code&gt;spark.hadoop.hive.metastore.uris=thrift://...&lt;/code&gt; in the &lt;code&gt;spark.default.conf&lt;/code&gt; configuration item, then restart the taskmanager.&lt;/p&gt;

&lt;p&gt;b. CLI: add this configuration directive to the ini configuration file and pass it with &lt;code&gt;--spark_conf&lt;/code&gt; when starting the CLI. Please refer to &lt;a href="https://openmldb.ai/docs/en/main/reference/client_config/client_spark_config.html"&gt;Client Spark Configuration&lt;/a&gt;.&lt;/p&gt;
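&lt;p&gt;A sketch of the corresponding &lt;code&gt;taskmanager.properties&lt;/code&gt; entry (the thrift address is a placeholder for your own metastore URI):&lt;/p&gt;

```properties
# Forwarded to Spark jobs launched by the taskmanager; restart it after editing
spark.default.conf=spark.hadoop.hive.metastore.uris=thrift://localhost:9083
```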

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;hive-site.xml&lt;/code&gt;: You can configure &lt;code&gt;hive.metastore.uris&lt;/code&gt; within the &lt;code&gt;hive-site.xml&lt;/code&gt; file. Place this configuration file within the &lt;code&gt;conf/&lt;/code&gt; directory of the Spark home. If the &lt;code&gt;HADOOP_CONF_DIR&lt;/code&gt; environment variable is already set, you can also position the configuration file there. For instance:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    &amp;lt;configuration&amp;gt;
      &amp;lt;property&amp;gt;
        &amp;lt;name&amp;gt;hive.metastore.uris&amp;lt;/name&amp;gt;
        &amp;lt;!-- Make sure that &amp;lt;value&amp;gt; points to the Hive Metastore URI in your cluster --&amp;gt;
        &amp;lt;value&amp;gt;thrift://localhost:9083&amp;lt;/value&amp;gt;
        &amp;lt;description&amp;gt;URI for client to contact metastore server&amp;lt;/description&amp;gt;
      &amp;lt;/property&amp;gt;
    &amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apart from configuring the Hive connection, it is crucial to grant the user that launches the TaskManager (both the OS user and group) Read/Write permissions in Hive. Additionally, Read/Write/Execute permissions should be granted on the HDFS path associated with the Hive table.&lt;/p&gt;
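&lt;p&gt;The exact commands depend on your Hadoop setup; as a sketch using standard HDFS shell commands, where the user, group, and warehouse path are placeholders for your own deployment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    # Grant the TaskManager OS user access to the HDFS path backing the Hive table
    hdfs dfs -chown -R taskmanager_user:taskmanager_group /user/hive/warehouse/t1
    hdfs dfs -chmod -R 770 /user/hive/warehouse/t1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;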

&lt;h3&gt;
  
  
  Check
&lt;/h3&gt;

&lt;p&gt;Verify whether the task is connected to the appropriate Hive cluster by examining the task log. Here’s how you can proceed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;INFO HiveConf&lt;/code&gt;: indicates which Hive configuration file was used. If you need further information about the loading process, review the Spark logs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When connecting to the Hive metastore, there should be a log entry similar to &lt;code&gt;INFO metastore: Trying to connect to metastore with URI&lt;/code&gt;. A successful connection will be denoted by a log entry reading &lt;code&gt;INFO metastore: Connected to metastore&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
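&lt;p&gt;For instance, assuming the TaskManager writes job output to a log file such as &lt;code&gt;taskmanager.log&lt;/code&gt; (the file name and location depend on your deployment), the relevant entries can be filtered with:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    grep -E "HiveConf|Trying to connect to metastore|Connected to metastore" taskmanager.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;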

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Table Creation with &lt;code&gt;LIKE&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;You can use the &lt;code&gt;LIKE&lt;/code&gt; syntax to create an OpenMLDB table with the same schema as an existing Hive table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    CREATE TABLE db1.t1 LIKE HIVE 'hive://hive_db.t1';
    -- SUCCEED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Import Hive Data to OpenMLDB
&lt;/h3&gt;

&lt;p&gt;Importing data from Hive sources is done through the &lt;a href="https://openmldb.ai/docs/en/main/openmldb_sql/dml/LOAD_DATA_STATEMENT.html"&gt;LOAD DATA INFILE&lt;/a&gt; statement, which uses the dedicated URI format &lt;code&gt;hive://[db].table&lt;/code&gt; to import data from Hive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    LOAD DATA INFILE 'hive://db1.t1' INTO TABLE t1 OPTIONS(deep_copy=false);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data loading process also supports using SQL queries to filter specific data from Hive tables. The table name used should be the registered name without the &lt;code&gt;hive://&lt;/code&gt; prefix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    LOAD DATA INFILE 'hive://db1.t1' INTO TABLE db1.t1 OPTIONS(deep_copy=true, sql='SELECT * FROM db1.t1 WHERE key=\"foo\"');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Export OpenMLDB Data to Hive
&lt;/h3&gt;

&lt;p&gt;Exporting data to Hive sources is done through the &lt;a href="https://openmldb.ai/docs/en/main/openmldb_sql/dql/SELECT_INTO_STATEMENT.html"&gt;SELECT INTO&lt;/a&gt; statement, which uses the same dedicated URI format &lt;code&gt;hive://[db].table&lt;/code&gt; to transfer data to Hive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    SELECT col1, col2, col3 FROM t1 INTO OUTFILE 'hive://db1.t1';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;This has been a brief guide to integrating Hive as an offline data source for OpenMLDB. For more details, see the official documentation on &lt;a href="https://openmldb.ai/docs/en/main/integration/offline_data_sources/hive.html"&gt;Hive integration&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The OpenMLDB community has recently released &lt;a href="https://github.com/4paradigm/FeatInsight"&gt;FeatInsight&lt;/a&gt;, a sophisticated feature store service that leverages OpenMLDB for efficient feature computation, management, and orchestration. The service is available for trial at &lt;a href="http://152.136.144.33/"&gt;http://152.136.144.33/&lt;/a&gt;. Contact us for a user ID and password to gain access!&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;For more information on OpenMLDB:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official website: &lt;a href="https://openmldb.ai/"&gt;https://openmldb.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/4paradigm/OpenMLDB"&gt;https://github.com/4paradigm/OpenMLDB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://openmldb.ai/docs/en/"&gt;https://openmldb.ai/docs/en/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join us on &lt;a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"&gt;&lt;strong&gt;Slack&lt;/strong&gt;&lt;/a&gt;!&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is a re-post from &lt;a href="https://openmldb.medium.com/"&gt;OpenMLDB Blogs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>database</category>
      <category>distributedsystems</category>
      <category>opensource</category>
      <category>sql</category>
    </item>
    <item>
      <title>Mastering Distributed Database Development in 10 Minutes with OpenMLDB Developer Docker Image</title>
      <dc:creator>Hana Wang</dc:creator>
      <pubDate>Wed, 13 Mar 2024 05:21:07 +0000</pubDate>
      <link>https://dev.to/elliezza/mastering-distributed-database-development-in-10-minutes-with-openmldb-developer-docker-image-2e6g</link>
      <guid>https://dev.to/elliezza/mastering-distributed-database-development-in-10-minutes-with-openmldb-developer-docker-image-2e6g</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/4paradigm/OpenMLDB"&gt;OpenMLDB&lt;/a&gt; is an open-source, distributed in-memory database system designed for time-series data. It focuses on high performance, reliability, and scalability, making it suitable for handling massive time-series data and real-time computation of online features. In the wave of big data and machine learning, OpenMLDB has emerged as a promising player in the open-source database field, thanks to its powerful data processing capabilities and efficient support for machine learning.&lt;/p&gt;

&lt;p&gt;The core storage and SQL engine of OpenMLDB consist of over 360,000 lines of C++ code and a large number of C headers. To lower the barrier to compiling the project and improve developer efficiency, we have introduced a newly designed OpenMLDB Docker image. It allows developers to compile the database from source on any operating system, including Linux, macOS, and Windows. In just ten minutes, developers can get started contributing to distributed database development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;The image is currently hosted in the Alibaba Cloud image registry. The process for using it is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start container: Use Docker commands to start the container. This will initiate an environment containing the OpenMLDB source code and all dependencies.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; registry.cn-beijing.aliyuncs.com/openmldb/openmldb-build bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Compile OpenMLDB: Inside the container, you can directly navigate to the OpenMLDB source code directory and execute the compilation script.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;OpenMLDB
make
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Install OpenMLDB: the default installation path is &lt;code&gt;${PROJECT_ROOT}/openmldb&lt;/code&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Deployment and Testing: After the compilation is complete, you can proceed with deployment and testing accordingly. All necessary tools and dependencies are already prepared and ready to use.&lt;/li&gt;
&lt;/ol&gt;
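&lt;p&gt;As one possible flow after installation, a local deployment can be started with the scripts shipped in the installation directory. The script names below are an assumption based on typical OpenMLDB releases; check your installation for the exact names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cd openmldb
sbin/deploy-all.sh   # distribute binaries and configuration
sbin/start-all.sh    # start the cluster components
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;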

&lt;h2&gt;
  
  
  Concurrent Compilation Time
&lt;/h2&gt;

&lt;p&gt;OpenMLDB disables concurrent compilation by default. If the compilation machine has sufficient resources, however, you can enable it with the &lt;code&gt;NPROC&lt;/code&gt; build parameter. The timings below show compilation at several concurrency levels.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. 4-core Compilation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    make &lt;span class="nv"&gt;NPROC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8e1KL8MD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2APxiBv02dcLlqT6v4" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8e1KL8MD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2APxiBv02dcLlqT6v4" alt="" width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. 8-core Compilation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    make &lt;span class="nv"&gt;NPROC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--niNnqum8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AN1-PWLtPlGHTkXBl" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--niNnqum8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AN1-PWLtPlGHTkXBl" alt="" width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. 16-core Compilation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    make &lt;span class="nv"&gt;NPROC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CoUzzeTO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2Anj1LbYr7AJQIDUIE" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CoUzzeTO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2Anj1LbYr7AJQIDUIE" alt="" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Highlights
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quick Start&lt;/strong&gt;: Eliminates complex setup steps, allowing developers to quickly enter development mode on different operating system platforms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unified Environment&lt;/strong&gt;: Whether for individual development or team collaboration, the Docker image ensures that each member develops in a consistent environment, effectively avoiding the “it works on my machine” problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Easy Sharing&lt;/strong&gt;: The image can be easily shared with other team members or distributed in the community, accelerating the adoption and application of OpenMLDB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complete OpenMLDB Environment&lt;/strong&gt;: The image comes pre-installed with the complete source code of OpenMLDB, enabling developers to easily explore and modify the OpenMLDB source code and contribute to the OpenMLDB community.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Offline Compilation and Deployment Capabilities&lt;/strong&gt;: By pre-downloading the third-party libraries required by OpenMLDB, the image can compile and deploy OpenMLDB in a completely offline environment. This greatly improves work efficiency in network-restricted environments, enhancing the flexibility and feasibility of development.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compilation Efficiency&lt;/strong&gt;: Since all dependencies are already built into the image, this avoids lengthy dependency download and installation processes, making the compilation process much faster.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This custom Docker image, tailored for offline builds of OpenMLDB, not only simplifies onboarding for developers but also provides robust support for compiling, deploying, and testing the project. We anticipate that it will help more developers and enterprises work with OpenMLDB efficiently, giving them source-level control over its compilation and development. With these enhanced capabilities, we look forward to seeing OpenMLDB further developed and applied in industry ecosystems such as financial risk control, recommendation systems, and quantitative trading.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;For more information on OpenMLDB:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official website: &lt;a href="https://openmldb.ai/"&gt;https://openmldb.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/4paradigm/OpenMLDB"&gt;https://github.com/4paradigm/OpenMLDB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://openmldb.ai/docs/en/"&gt;https://openmldb.ai/docs/en/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join us on &lt;a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"&gt;&lt;strong&gt;Slack&lt;/strong&gt;&lt;/a&gt;!&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is a re-post from &lt;a href="https://openmldb.medium.com/"&gt;OpenMLDB Blogs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>docker</category>
      <category>opensource</category>
      <category>database</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>OpenMLDB v0.8.5 Release: Enhanced Authentication Feature, Comprehensive Security Upgrade</title>
      <dc:creator>Hana Wang</dc:creator>
      <pubDate>Thu, 29 Feb 2024 07:37:16 +0000</pubDate>
      <link>https://dev.to/elliezza/openmldb-v085-release-enhanced-authentication-feature-comprehensive-security-upgrade-2nc2</link>
      <guid>https://dev.to/elliezza/openmldb-v085-release-enhanced-authentication-feature-comprehensive-security-upgrade-2nc2</guid>
      <description>&lt;h2&gt;
  
  
  Release Date
&lt;/h2&gt;

&lt;p&gt;28 February 2024&lt;/p&gt;

&lt;h2&gt;
  
  
  Release Notes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/4paradigm/OpenMLDB/releases/tag/v0.8.5" rel="noopener noreferrer"&gt;https://github.com/4paradigm/OpenMLDB/releases/tag/v0.8.5&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenMLDB released a new version v0.8.5, including SQL syntax extensions, Iceberg data lake support, TTL type extensions, and improved user authentication functionality. The most noteworthy feature is the integration of the Iceberg engine and the enhancement of user authentication.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AEPSO-1u5BuGYR_t-" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AEPSO-1u5BuGYR_t-"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Iceberg is an open-source table format data lake management tool, focusing on providing a highly reliable, scalable data lake management solution. Its core features include atomic write mechanisms, multi-version data management, metadata management, etc., aiming to provide comprehensive data lake management functionality for enterprises. OpenMLDB integrates Iceberg into its platform, allowing users to directly read and write Iceberg data lakes while using OpenMLDB’s features. This results in higher data reliability and consistency, more flexible data operations and management, and more efficient data query performance, providing enterprises with a comprehensive and reliable data lake management solution.&lt;/p&gt;

&lt;p&gt;OpenMLDB has also introduced user authentication functionality, allowing users to manage and control database access permissions more flexibly through SQL statements such as &lt;code&gt;CREATE / ALTER / DROP USER&lt;/code&gt;. This feature not only ensures data security but also enhances the convenience and flexibility of database management. Users can manage the creation, modification, and deletion of user accounts according to their actual needs, better meeting enterprise requirements for data access permission management and improving the overall system's security and manageability.&lt;/p&gt;
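&lt;p&gt;As a brief sketch of the new user-management statements, where the user name and passwords are placeholders (refer to the release notes for the exact option syntax):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE USER alice OPTIONS (password='123456');
ALTER USER alice SET OPTIONS (password='654321');
DROP USER alice;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;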

&lt;p&gt;For detailed release notes, please refer to: &lt;a href="https://github.com/4paradigm/OpenMLDB/releases/tag/v0.8.5" rel="noopener noreferrer"&gt;https://github.com/4paradigm/OpenMLDB/releases/tag/v0.8.5&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please feel free to download and explore the latest release. Your feedback is highly valued and appreciated. We encourage you to share your thoughts and suggestions to help us improve and enhance the platform. Thank you for your support!&lt;/p&gt;

&lt;h2&gt;
  
  
  Highlights
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Added integration with Apache Iceberg offline storage engine, supporting data import and feature data export functionalities, further strengthening the integration of OpenMLDB with the ecosystem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Added standard SQL syntax &lt;code&gt;UNION ALL&lt;/code&gt;, expanding &lt;code&gt;WINDOW UNION&lt;/code&gt; and &lt;code&gt;LAST JOIN&lt;/code&gt; to achieve multi-table join.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SELECT INTO OUTFILE&lt;/code&gt; now supports targeting OpenMLDB online tables, enabling synchronization between online and offline storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In offline mode, &lt;code&gt;LAST JOIN&lt;/code&gt; and &lt;code&gt;WINDOW&lt;/code&gt; operations support not specifying &lt;code&gt;ORDER BY&lt;/code&gt; parameters, making usage more flexible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Added user management functionality, enabling user addition, modification, and deletion through standard SQL statements &lt;code&gt;CREATE / ALTER / DROP USER&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support for configuring Spark task parameters through SDK, providing more flexible offline task resource configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;INSERT&lt;/code&gt; statement supports configuring server-side memory limits, providing more user-friendly error messages for insertion failures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt; statement deployment supports automatic index creation, eliminating the need for manual index creation and data re-importation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;File storage engine supports TTL types &lt;code&gt;absandlat / absorlat&lt;/code&gt;, aligning with in-memory storage engine functionality.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
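&lt;p&gt;As an illustrative sketch of the new &lt;code&gt;UNION ALL&lt;/code&gt; syntax, which follows standard SQL (the table and column names below are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT id, val FROM t1
UNION ALL
SELECT id, val FROM t2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;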




&lt;p&gt;For more information on OpenMLDB:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://openmldb.ai/" rel="noopener noreferrer"&gt;https://openmldb.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/4paradigm/OpenMLDB" rel="noopener noreferrer"&gt;https://github.com/4paradigm/OpenMLDB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://openmldb.ai/docs/en/" rel="noopener noreferrer"&gt;https://openmldb.ai/docs/en/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join us on &lt;a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;!&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is a re-post from &lt;a href="https://openmldb.medium.com/" rel="noopener noreferrer"&gt;OpenMLDB Blogs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>apacheiceberg</category>
      <category>tooling</category>
    </item>
    <item>
      <title>FeatInsight: Leveraging OpenMLDB for Highly Efficient Feature Management and Orchestration</title>
      <dc:creator>Hana Wang</dc:creator>
      <pubDate>Wed, 28 Feb 2024 09:53:04 +0000</pubDate>
      <link>https://dev.to/elliezza/featinsight-leveraging-openmldb-for-highly-efficient-feature-management-and-orchestration-15pp</link>
      <guid>https://dev.to/elliezza/featinsight-leveraging-openmldb-for-highly-efficient-feature-management-and-orchestration-15pp</guid>
      <description>&lt;p&gt;The OpenMLDB community has recently released a new open-source feature platform product — &lt;a href="https://github.com/4paradigm/FeatInsight" rel="noopener noreferrer"&gt;FeatInsight &lt;/a&gt;(&lt;a href="https://github.com/4paradigm/FeatInsight" rel="noopener noreferrer"&gt;https://github.com/4paradigm/FeatInsight&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;FeatInsight is a sophisticated feature store service, leveraging &lt;a href="https://github.com/4paradigm/OpenMLDB" rel="noopener noreferrer"&gt;OpenMLDB&lt;/a&gt; for efficient feature computation, management, and orchestration.&lt;/p&gt;

&lt;p&gt;FeatInsight provides a user-friendly interface that covers the entire feature engineering workflow for machine learning: data import, viewing and updating, feature generation, storage, and online deployment. For offline scenarios, users can select features to generate training samples for ML training; for online scenarios, users can deploy feature services for real-time feature computation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3800%2F1%2AvAJ9X3jRWmQnRbRLEAah1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3800%2F1%2AvAJ9X3jRWmQnRbRLEAah1w.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;p&gt;The main objective of FeatInsight is to address common challenges in machine learning development, including facilitating easy and quick feature extraction, transformation, combination, and selection, managing feature lineage, enabling feature reuse and sharing, version control for feature services, and ensuring consistency and reliability of feature data used in both training and inference processes. Application scenarios include the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Online Feature Service Deployment: Provides high-performance feature storage and online feature computation functions for localized deployment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MLOps Platform: Establishes MLOps workflow with OpenMLDB online-offline consistent computations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;FeatureStore Platform: Provides comprehensive feature extraction, deletion, online deployment, and lineage management functionality to achieve low-cost local FeatureStore services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Open-Source Feature Solution Reuse: Supports solution reuse locally for feature reuse and sharing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Business Component for Machine Learning: Provides a one-stop feature engineering solution for machine learning models in recommendation systems, natural language processing, finance, healthcare, and other areas of machine learning implementation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more content, please refer to FeatInsight &lt;a href="https://openmldb.ai/docs/en/main/app_ecosystem/feat_insight/index.html" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  QuickStart
&lt;/h2&gt;

&lt;p&gt;We will use a simple example to show how to use FeatInsight to perform feature engineering. The usage process includes the following four steps: data import, feature creation, offline scenarios, and online scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Data Import
&lt;/h3&gt;

&lt;p&gt;First, create the database test_db and the table test_table. You can use SQL to create them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE DATABASE test_db;
CREATE TABLE test_db.test_table (id STRING, trx_time DATE);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or you can use the UI and create it under “Data Import”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2016%2F0%2AfAsNuEcuAuKxPxHG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2016%2F0%2AfAsNuEcuAuKxPxHG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For easier testing, we prepare a CSV file and save it to /tmp/test_table.csv. Note that this path is local to the machine running the OpenMLDB TaskManager, which is usually also the machine running FeatInsight, so you will need access to that machine to create the file.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;id,trx_time
user1,2024-01-01
user2,2024-01-02
user3,2024-01-03
user4,2024-01-04
user5,2024-01-05
user6,2024-01-06
user7,2024-01-07
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For online scenarios, you can use the LOAD DATA or INSERT commands. Here we use "Import from CSV".&lt;/p&gt;
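&lt;p&gt;The same online import can also be sketched in SQL using the CSV file prepared above; the options shown are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOAD DATA INFILE 'file:///tmp/test_table.csv' INTO TABLE test_db.test_table
OPTIONS(header=true, mode='append');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;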

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2012%2F0%2ASA_ahc-gr-KdxQoW" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2012%2F0%2ASA_ahc-gr-KdxQoW"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The imported data can be previewed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AzslCR2m4JDLRX2F9" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AzslCR2m4JDLRX2F9"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For offline scenarios, you can also use LOAD DATA or "Import from CSV".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AnJXy3ZVC6_wDGMjZ" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AnJXy3ZVC6_wDGMjZ"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wait for about half a minute for the task to finish. You can also check the status and log.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2326%2F0%2AuRfWllebyLSTIPlK" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2326%2F0%2AuRfWllebyLSTIPlK"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Feature Creation
&lt;/h3&gt;

&lt;p&gt;After the data is imported, we can create features. Here we use SQL to create two basic features.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT id, dayofweek(trx_time) as trx_day FROM test_table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In “Features”, the button beside “All Features” is to create new features. Fill in the form accordingly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AW_yPog3z6G6d1QP8" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AW_yPog3z6G6d1QP8"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After successful creation, you can check the features. Click on the name to go into details. You can check the basic information, as well as preview feature values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AHZbp68_k0Pr7EptI" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AHZbp68_k0Pr7EptI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Offline Samples Export
&lt;/h3&gt;

&lt;p&gt;In “Offline Scenario”, you can choose to export offline samples. You can choose the features to export and specify the export path. There are “More Options” for you to specify the file format and other advanced parameters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2A9JIclPEPFGQlb3m6" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2A9JIclPEPFGQlb3m6"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wait for about half a minute and you can check the status at “Offline Samples”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AH7F-Dny4Mydo36ft" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AH7F-Dny4Mydo36ft"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can check the content of the exported samples. To verify online-offline consistency provided by FeatInsight, you can record the result and compare it with online feature computation results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AXTsFasaMSdEa6mJu" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AXTsFasaMSdEa6mJu"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Online Feature Service
&lt;/h3&gt;

&lt;p&gt;In “Feature Services”, use the button beside “All Feature Services” to create a new feature service. Choose the features to deploy, then fill in the service name and version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2APR26Hq2qZb09j0iz" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2APR26Hq2qZb09j0iz"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After successful creation, you can check service details, including the feature list, dependent tables, and lineage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AgqIk-leJBHbIwFBz" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AgqIk-leJBHbIwFBz"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lastly, on the “Request Feature Service” page, you can key in test data to perform online feature computation and compare the output with the offline computation results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2638%2F0%2AC-2aUiA8wi9q36Ww" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2638%2F0%2AC-2aUiA8wi9q36Ww"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This example demonstrates the complete process of using FeatInsight. By writing simple SQL statements, users can define features for both online and offline scenarios. By selecting different features or combining feature sets, users can quickly reuse and deploy feature services. Lastly, the consistency of feature computation can be validated by comparing offline and online calculation results.&lt;/p&gt;

&lt;p&gt;For a deeper understanding of how to use FeatInsight and its application scenarios, please refer to &lt;a href="https://openmldb.ai/docs/en/main/app_ecosystem/feat_insight/use_cases/index.html" rel="noopener noreferrer"&gt;Application Scenarios&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Appendix: Advanced Functions
&lt;/h2&gt;

&lt;p&gt;In addition to the basic functionalities of feature engineering, FeatInsight also provides advanced functionalities to facilitate feature development for users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SQL Playground: Offers debugging and execution capabilities for OpenMLDB SQL statements, allowing users to execute arbitrary SQL operations and debug SQL statements for feature extraction.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Computed Features: Enables the direct storage of feature values obtained through external batch computation or stream processing into OpenMLDB online tables. Users can then access and manipulate feature data in online tables.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Read More
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;FeatInsight Github: &lt;a href="https://github.com/4paradigm/FeatInsight" rel="noopener noreferrer"&gt;https://github.com/4paradigm/FeatInsight&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;FeatInsight documentation: &lt;a href="https://openmldb.ai/docs/en/main/app_ecosystem/feat_insight/index.html" rel="noopener noreferrer"&gt;https://openmldb.ai/docs/en/main/app_ecosystem/feat_insight/index.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;For more information on OpenMLDB:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://openmldb.ai/" rel="noopener noreferrer"&gt;https://openmldb.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/4paradigm/OpenMLDB" rel="noopener noreferrer"&gt;https://github.com/4paradigm/OpenMLDB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://openmldb.ai/docs/en/" rel="noopener noreferrer"&gt;https://openmldb.ai/docs/en/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join us on &lt;a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg" rel="noopener noreferrer"&gt;Slack&lt;/a&gt; !&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is a re-post from &lt;a href="https://openmldb.medium.com/" rel="noopener noreferrer"&gt;OpenMLDB Blogs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>tools</category>
      <category>featureplatform</category>
    </item>
    <item>
      <title>OpenMLDB Selected as the Sole Feature Store Vendor from China in the 2023 Gartner Report</title>
      <dc:creator>Hana Wang</dc:creator>
      <pubDate>Mon, 26 Feb 2024 08:05:23 +0000</pubDate>
      <link>https://dev.to/elliezza/openmldb-selected-as-the-sole-feature-store-vendor-from-china-in-the-2023-gartner-report-ckg</link>
      <guid>https://dev.to/elliezza/openmldb-selected-as-the-sole-feature-store-vendor-from-china-in-the-2023-gartner-report-ckg</guid>
      <description>&lt;p&gt;In the report “The Logical Feature Store: Data Management for Machine Learning” published by the International Authoritative Consulting and Research firm, Gartner, OpenMLDB is honored to be selected as the sole representative feature store vendor from China.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2560%2F1%2AkEqB4-TF4xCGr6wNp5krxg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2560%2F1%2AkEqB4-TF4xCGr6wNp5krxg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The report thoroughly analyzes the three major challenges that machine learning applications face in practical implementation: low end-to-end efficiency, lack of reusability, and inconsistency between training and production environments. Together, these explain the urgent need for a feature store. Given the high complexity and resource commitment involved in developing a feature store in-house, Gartner believes that external procurement, especially from MLOps vendors with built-in feature stores, is the more cost-effective choice. OpenMLDB has been included in Gartner’s recommended list of vendors for its outstanding performance, becoming the only Chinese ML vendor with a built-in feature store. The report provides valuable professional guidance for enterprises eager to scale their AI implementations in business.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenMLDB: Providing a Consistent Production-level Feature Store Online and Offline, Achieving a 500% Efficiency Improvement per Unit Cost
&lt;/h2&gt;

&lt;p&gt;Gartner emphasizes the challenges of machine learning in practical applications. Typically, enterprise machine learning teams invest significant time in addressing data issues, leaving little room for actual model development and optimization, with inconsistent feature definitions and frequent repetitive rework along the way. OpenMLDB’s research reveals a similar picture: “in artificial intelligence engineering practice, enterprises often allocate a staggering 95% of their overall time and effort to tasks such as data processing and feature validation”.&lt;/p&gt;

&lt;p&gt;In the traditional approach without OpenMLDB, the deployment of real-time feature computations typically involves the following three steps: (1) Data scientists develop feature scripts offline using SparkSQL or Python; (2) As the developed offline scripts cannot meet the requirements of the production environment, the engineering team needs to reoptimize them based on a different tool stack; (3) Finally, there is a need for consistency validation of the offline feature scripts developed by data scientists and the online services developed by the engineering team. The entire process involves two groups of developers and two sets of tool stacks, resulting in significant deployment costs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ArpGKEuu6O_QNOxGqQXFpvw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ArpGKEuu6O_QNOxGqQXFpvw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenMLDB aims for a seamless transition from development to deployment, allowing feature scripts developed by data scientists to be deployed directly to the production environment. The platform provides both offline and online processing engines; the online engine is deeply optimized to meet production-level requirements, while an automatic consistency execution plan generator ensures consistency between online and offline. With OpenMLDB, the feature phase of a machine learning application involves only two steps: (1) data scientists develop offline feature scripts using SQL, and (2) the feature script is deployed to the online engine with a single deployment command. This ensures online-offline consistency while achieving millisecond-level low latency, high concurrency, and high availability for online services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aem_Qs0k2LW0SYqdTtQuVvA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aem_Qs0k2LW0SYqdTtQuVvA.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Therefore, the greatest value of OpenMLDB lies in significantly reducing the engineering deployment costs of artificial intelligence. In a large business scenario, OpenMLDB can reduce the effort from the 6 person-months required by the traditional approach to just 1 person-month, a 500% efficiency improvement per unit cost, by eliminating the need for the engineering team to develop online services and conduct online-offline consistency checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenMLDB X Akulaku: A Scenario-driven Approach Brings Windowed Feature Computation for One Billion Orders to 4 ms Latency, Saving over 4 Million in Resources
&lt;/h2&gt;

&lt;p&gt;OpenMLDB is committed to addressing the data governance challenges of AI engineering and has already been deployed in over a hundred enterprise-level AI scenarios. Among its users, Akulaku, a leading internet finance service provider in Southeast Asia, covers the entire e-commerce chain, with applications spanning financial risk control, intelligent customer service, and e-commerce recommendations. In e-commerce finance, the demands on a feature store are high: online serving must be low-latency and efficient, reflecting new data in real-time feature computation as much as possible; offline demand analysis requires high throughput; and consistency between online and offline must be guaranteed. Meeting all three requirements simultaneously is a challenging task.&lt;/p&gt;

&lt;p&gt;To address this challenge, OpenMLDB has assisted Akulaku in building an intelligent computing architecture. This involves embedding OpenMLDB’s online engine into the model computation layer and embedding the offline engine into the feature computation layer. Through a scenario-driven approach, real-time computation results are invoked in the business calling process. This approach has successfully performed windowed feature computation for one billion orders, achieving a 4-millisecond latency performance and conservatively estimated resource savings of over 4 million.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AL4ppIB9S6ctPoDV2-MZV1A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AL4ppIB9S6ctPoDV2-MZV1A.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition, OpenMLDB has helped numerous enterprises optimize their database architecture, facilitating more effective implementation of AI scenarios. For example, it helped Vipshop reduce the feature development and iteration time for personalized product recommendations from 5 person-days to 2 person-days, a 60% improvement in feature development iteration speed. A leading bank’s anti-fraud system used OpenMLDB for feature computation and management across its offline development, online inference, and self-learning stages, resolving long-standing issues of data traversal (i.e., time-travel data leakage) and inconsistent results, and eliminating expensive consistency verification costs. Huawei, after adopting OpenMLDB for real-time personalized product recommendations, achieved minute-level data updates and hour-level feature deployment. Looking ahead, OpenMLDB aims to help more enterprises address real-world challenges in data and feature processing for successful business implementation.&lt;/p&gt;

&lt;p&gt;As the sole representative of a database feature store from China selected in the Gartner report “The Logical Feature Store: Data Management for Machine Learning,” OpenMLDB will continue refining its product, optimizing performance, and leveraging its strengths in the field of database feature platforms. The aim is to liberate AI practitioners from tedious and inefficient data processing, assisting enterprises in achieving simpler and more efficient implementations of machine learning applications.&lt;/p&gt;




&lt;p&gt;For more information on OpenMLDB:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://openmldb.ai/" rel="noopener noreferrer"&gt;https://openmldb.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/4paradigm/OpenMLDB" rel="noopener noreferrer"&gt;https://github.com/4paradigm/OpenMLDB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://openmldb.ai/docs/en/" rel="noopener noreferrer"&gt;https://openmldb.ai/docs/en/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join us on &lt;a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg" rel="noopener noreferrer"&gt;Slack&lt;/a&gt; !&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is a re-post from &lt;a href="https://openmldb.medium.com/" rel="noopener noreferrer"&gt;OpenMLDB Blogs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>Kubernetes Deployment Guide for OpenMLDB</title>
      <dc:creator>Hana Wang</dc:creator>
      <pubDate>Tue, 16 Jan 2024 05:16:15 +0000</pubDate>
      <link>https://dev.to/elliezza/kubernetes-deployment-guide-for-openmldb-3b61</link>
      <guid>https://dev.to/elliezza/kubernetes-deployment-guide-for-openmldb-3b61</guid>
      <description>&lt;p&gt;Kubernetes is a widely adopted cloud-native container orchestration and management tool in the industry that has been extensively used in project implementations. Currently, both the offline and online engines of OpenMLDB have complete support for deployment based on Kubernetes, enabling more convenient management functionalities. This article will respectively introduce the deployment strategies of the offline and online engines based on Kubernetes.&lt;/p&gt;

&lt;p&gt;It’s important to note that the deployment of the offline engine and the online engine based on Kubernetes are entirely decoupled. Users have the flexibility to deploy either the offline or online engine based on their specific requirements.&lt;/p&gt;

&lt;p&gt;Besides Kubernetes-based deployment, the offline engine also supports deployment in &lt;code&gt;local&lt;/code&gt; mode and &lt;code&gt;yarn&lt;/code&gt; mode. Similarly, the online engine supports a native deployment method that doesn't rely on containers. These deployment strategies can be flexibly mixed and matched in practical scenarios to meet the demands of production environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Offline Engine with Kubernetes Backend
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Deployment of Kubernetes Operator for Apache Spark
&lt;/h3&gt;

&lt;p&gt;Please refer to the &lt;a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator"&gt;spark-on-k8s-operator official documentation&lt;/a&gt;. The following commands deploy it to the &lt;code&gt;default&lt;/code&gt; namespace using &lt;code&gt;Helm&lt;/code&gt;; modify the namespace and permissions as required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install my-release spark-operator/spark-operator --namespace default --create-namespace --set webhook.enable=true
kubectl create serviceaccount spark --namespace default
kubectl create clusterrolebinding binding --clusterrole=edit --serviceaccount=default:spark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After successful deployment, you can use the code examples provided by spark-operator to test whether Spark tasks can be submitted normally.&lt;/p&gt;
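&lt;p&gt;For instance, the operator repository ships a spark-pi example; a minimal &lt;code&gt;SparkApplication&lt;/code&gt; manifest looks roughly like the following (the image tag and Spark version are illustrative; take the actual file from the spark-on-k8s-operator examples):&lt;/p&gt;

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v3.1.1"  # illustrative tag
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    serviceAccount: spark  # the service account created above
  executor:
    cores: 1
    instances: 1
```

&lt;p&gt;Apply it with &lt;code&gt;kubectl apply -f spark-pi.yaml&lt;/code&gt; and watch the application state with &lt;code&gt;kubectl get sparkapplications&lt;/code&gt;.&lt;/p&gt;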
&lt;h3&gt;
  
  
  HDFS Support
&lt;/h3&gt;

&lt;p&gt;If you need to configure Kubernetes tasks to read and write HDFS data, you need to prepare a Hadoop configuration file in advance and create a ConfigMap. You can modify the ConfigMap name and file path as needed. The creation command example is as follows:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create configmap hadoop-config --from-file=/tmp/hadoop/etc/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Offline Engine Configurations for Kubernetes Support
&lt;/h3&gt;

&lt;p&gt;The configuration file for TaskManager in the offline engine can be configured for Kubernetes support; the respective settings are:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
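&lt;p&gt;(The embedded configuration snippet did not survive the repost. As an illustrative sketch only, the idea is to point the offline engine's Spark master at Kubernetes in &lt;code&gt;taskmanager.properties&lt;/code&gt;; verify the exact key names against the TaskManager configuration reference.)&lt;/p&gt;

```ini
# Illustrative sketch only; check key names in the taskmanager.properties docs.
spark.master=k8s
offline.data.prefix=hdfs:///foo/bar/
```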

&lt;p&gt;If Kubernetes is used to run the offline engine, the user’s computation tasks will run on the cluster. Therefore, it’s recommended to configure the offline storage path as an HDFS path; otherwise, tasks may fail to read and write data. An example configuration is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;offline.data.prefix=hdfs:///foo/bar/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;💡&lt;em&gt;For a complete configuration file example for TaskManager in OpenMLDB offline engine, visit: &lt;a href="https://openmldb.ai/docs/en/main/deploy/conf.html#he-configuration-file-for-taskmanager-conf-taskmanager-properties"&gt;https://openmldb.ai/docs/en/main/deploy/conf.html#he-configuration-file-for-taskmanager-conf-taskmanager-properties&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Task Submission and Management
&lt;/h3&gt;

&lt;p&gt;After configuring TaskManager and Kubernetes, you can submit offline tasks via the command line. The usage is similar to that of the local or YARN mode; tasks can be submitted not only from the SQL command-line client but also via the SDKs for various programming languages.&lt;/p&gt;

&lt;p&gt;For instance, to submit a data import task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOAD DATA INFILE 'hdfs:///hosts' INTO TABLE db1.t1 OPTIONS(delimiter = ',', mode='overwrite');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check Hadoop ConfigMap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get configmap hdfs-config -o yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check Spark job and Pod log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get SparkApplicationkubectl get pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Online Engine Deployment with Kubernetes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Github
&lt;/h3&gt;

&lt;p&gt;The Kubernetes-based deployment of the online engine is supported as a separate tool for OpenMLDB. Its source code repository is located at: &lt;a href="https://github.com/4paradigm/openmldb-k8s"&gt;https://github.com/4paradigm/openmldb-k8s&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Requirement
&lt;/h3&gt;

&lt;p&gt;This deployment tool offers a Kubernetes-based deployment solution for the OpenMLDB online engine, implemented using Helm Charts. The tool has been tested and verified with the following versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes 1.19+&lt;/li&gt;
&lt;li&gt;Helm 3.2.0+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, for users who utilize pre-compiled OpenMLDB images from Docker Hub, only OpenMLDB versions &amp;gt;= 0.8.2 are supported. Users also have the option to create other versions of OpenMLDB images using the tool described in the last section of this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preparation: Deploy ZooKeeper
&lt;/h3&gt;

&lt;p&gt;If there is an available ZooKeeper instance, you can skip this step. Otherwise, proceed with the installation process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install zookeeper oci://registry-1.docker.io/bitnamicharts/zookeeper --set persistence.enabled=false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can specify a previously created storage class for persistent storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install zookeeper oci://registry-1.docker.io/bitnamicharts/zookeeper --set persistence.storageClass=local-storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more parameter settings, refer to &lt;a href="https://github.com/bitnami/charts/tree/main/bitnami/zookeeper"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenMLDB Deployment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Download Source Code&lt;/strong&gt;&lt;br&gt;
Download the source code and set the working directory to the root directory of the repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/4paradigm/openmldb-k8s.git
cd openmldb-k8s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configure ZooKeeper Address&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modify the &lt;code&gt;zk_cluster&lt;/code&gt; in the &lt;code&gt;charts/openmldb/conf/tablet.flags&lt;/code&gt; and &lt;code&gt;charts/openmldb/conf/nameserver.flags&lt;/code&gt; files to the actual ZooKeeper address, with the default &lt;code&gt;zk_root_path&lt;/code&gt; set to &lt;code&gt;/openmldb&lt;/code&gt;.&lt;/p&gt;
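&lt;p&gt;For example (the ZooKeeper address below is hypothetical; OpenMLDB flag files use gflags-style entries):&lt;/p&gt;

```ini
--zk_cluster=zookeeper.default.svc.cluster.local:2181
--zk_root_path=/openmldb
```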

&lt;p&gt;&lt;strong&gt;Deploy OpenMLDB&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can achieve one-click deployment using Helm with the following commands:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install openmldb ./charts/openmldb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Users have the flexibility to configure additional deployment options using the &lt;code&gt;--set&lt;/code&gt; command. Detailed information about supported options can be found in the &lt;a href="https://github.com/4paradigm/openmldb-k8s/blob/main/charts/openmldb/README.md"&gt;OpenMLDB Chart Configuration&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Important configuration considerations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;By default, temporary files are used for data storage, which means that data may be lost if the pod restarts. It is recommended to associate a Persistent Volume Claim (PVC) with a specific storage class using the following method:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install openmldb ./charts/openmldb --set persistence.dataDir.enabled=true --set  persistence.dataDir.storageClass=local-storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;By default, the &lt;code&gt;4pdosc/openmldb-online&lt;/code&gt; image from Docker Hub is used (supporting OpenMLDB &amp;gt;= 0.8.2). To use a custom image, specify the image name during installation with &lt;code&gt;--set image.openmldbImage&lt;/code&gt;. For information on creating custom images, refer to the last section of this article.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install openmldb ./charts/openmldb --set image.openmldbImage=openmldb-online:0.8.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployed OpenMLDB services can only be accessed from within the same Kubernetes namespace.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The OpenMLDB cluster deployed using this method does not include a TaskManager module. Consequently, statements such as &lt;a href="https://openmldb.ai/docs/en/main/openmldb_sql/dml/LOAD_DATA_STATEMENT.html"&gt;LOAD DATA&lt;/a&gt; and &lt;a href="https://openmldb.ai/docs/en/main/openmldb_sql/dql/SELECT_INTO_STATEMENT.html"&gt;SELECT INTO&lt;/a&gt;, and offline-related functions are not supported. If you need to import data into OpenMLDB, you can use OpenMLDB’s &lt;a href="https://openmldb.ai/docs/en/main/tutorial/data_import.html"&gt;Online Import Tool&lt;/a&gt;, &lt;a href="https://openmldb.ai/docs/en/main/integration/online_datasources/index.html"&gt;OpenMLDB Connector&lt;/a&gt;, or SDK. For exporting table data, the &lt;a href="https://openmldb.ai/docs/en/main/tutorial/data_export.html"&gt;Online Data Export Tool&lt;/a&gt; can be utilized.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For production, it’s necessary to disable Transparent Huge Pages (THP) on the physical node where Kubernetes deploys the tablet. Failure to do so may result in issues where deleted tables cannot be fully released. For instructions on disabling THP, please refer to &lt;a href="https://openmldb.ai/docs/en/main/deploy/install_deploy.html#disable-thp-transparent-huge-pages"&gt;this link&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Create Docker Image
&lt;/h2&gt;

&lt;p&gt;The default deployment uses the OpenMLDB docker image from Docker Hub. Users can also create their local docker image. The creation tool is located in the repository (&lt;a href="https://github.com/4paradigm/openmldb-k8s"&gt;https://github.com/4paradigm/openmldb-k8s&lt;/a&gt;) as &lt;code&gt;docker/build.sh&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This script supports two parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenMLDB version number.&lt;/li&gt;
&lt;li&gt;Source of the OpenMLDB package. By default, it pulls the package from a mirror in mainland China. If you want to pull it from GitHub, you can set the second parameter to &lt;code&gt;github&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd docker
sh build.sh 0.8.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;For more information on OpenMLDB:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://openmldb.ai/"&gt;https://openmldb.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/4paradigm/OpenMLDB"&gt;https://github.com/4paradigm/OpenMLDB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://openmldb.ai/docs/en/"&gt;https://openmldb.ai/docs/en/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join us on &lt;a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"&gt;Slack&lt;/a&gt; !&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is a re-post from &lt;a href="https://openmldb.medium.com/"&gt;OpenMLDB Blogs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>kubernetes</category>
      <category>featurestore</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>OpenMLDB SQL Emulator — a Development and Debugging Tool for OpenMLDB SQL</title>
      <dc:creator>Hana Wang</dc:creator>
      <pubDate>Tue, 26 Dec 2023 09:39:21 +0000</pubDate>
      <link>https://dev.to/elliezza/openmldb-sql-emulator-a-development-and-debugging-tool-for-openmldb-sql-2j41</link>
      <guid>https://dev.to/elliezza/openmldb-sql-emulator-a-development-and-debugging-tool-for-openmldb-sql-2j41</guid>
      <description>&lt;p&gt;In this blog, we would like to introduce an excellent standalone tool from the OpenMLDB community — OpenMLDB SQL Emulator (&lt;a href="https://github.com/vagetablechicken/OpenMLDBSQLEmulator"&gt;https://github.com/vagetablechicken/OpenMLDBSQLEmulator&lt;/a&gt;). This tool allows users to develop and debug OpenMLDB SQL more efficiently and conveniently.&lt;/p&gt;

&lt;p&gt;To efficiently implement time-series feature calculations, OpenMLDB SQL improves and extends standard SQL. In practical use, beginners often encounter problems such as unfamiliar syntax and confusing execution modes. If one develops and debugs directly on OpenMLDB itself, overheads such as deployment, index building, and handling large data volumes waste a lot of time on irrelevant tasks and make it difficult to pinpoint the root cause of a problem.&lt;/p&gt;

&lt;p&gt;The OpenMLDB SQL Emulator is a lightweight simulation development and debugging tool for OpenMLDB SQL. It allows SQL validation and debugging without deploying an OpenMLDB cluster. We strongly recommend it to application developers, who can use it to quickly verify the correctness and deployability of their SQL before switching to an actual OpenMLDB environment for deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;Download the runtime package &lt;code&gt;emulator-1.0.jar&lt;/code&gt; from the project release page at &lt;a href="https://github.com/vagetablechicken/OpenMLDBSQLEmulator/releases"&gt;https://github.com/vagetablechicken/OpenMLDBSQLEmulator/releases&lt;/a&gt;, then start it as follows (note that the current release 1.0 corresponds to the SQL syntax of OpenMLDB 0.8.3):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java -jar emulator-1.0.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: If you want to execute SQL using the run command to validate results, you will also need to download toydb_run_engine from the same page and store it in the system's /tmp directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creation of Virtual Databases and Tables
&lt;/h2&gt;

&lt;p&gt;Once started, it will directly enter the default database emudb, and no additional database creation is required.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Databases don’t need to be explicitly created. Just run the use command or specify the database name when creating tables, and the database will be created automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the addtable command (alias t) to create a virtual table. Creating a table with an existing name is treated as an update and applies the latest schema. Tables are managed with a simplified SQL-like syntax; for example, the following creates a table with two columns:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;addtable t1 a int, b int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Use the showtables command (alias st) to view all current databases and tables.&lt;/li&gt;
&lt;/ul&gt;
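
&lt;p&gt;For instance, to redefine t1 with an extra column (the added c column here is purely illustrative) and then inspect the result, issue addtable again with the same table name, followed by st:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;addtable t1 a int, b int64, c double
st
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;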

&lt;h2&gt;
  
  
  Validation of OpenMLDB SQL
&lt;/h2&gt;

&lt;p&gt;In a real cluster, you can verify whether an OpenMLDB SQL statement is deployable by executing DEPLOY directly. However, DEPLOYMENTs and indexes then have to be managed: a DEPLOYMENT that is no longer needed must be deleted manually, and any unnecessary indexes it created must likewise be cleaned up.&lt;/p&gt;

&lt;p&gt;Hence, it is better to test and verify in the Emulator instead. The val and valreq commands validate OpenMLDB SQL in online batch mode and online request mode (i.e., service deployment), respectively. For example, to test whether a SQL statement can be DEPLOYed online, use valreq:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;addtable t1 a int, b int64
valreq select count(*) over w1 from t1 window w1 as (partition by a order by b rows between unbounded preceding and current row);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If validation fails, the Emulator prints the SQL compilation errors; if it succeeds, it prints &lt;code&gt;validate * success&lt;/code&gt;. The entire process happens in a virtual environment, so there are no concerns about resource utilization after table creation and no side effects. Any SQL that passes valreq validation is guaranteed to be deployable in a real cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing of OpenMLDB SQL
&lt;/h2&gt;

&lt;p&gt;The OpenMLDB SQL Emulator can also return computation results, so you can verify that your SQL produces the correct output, iterating between computation and online validation until it meets expectations. This is done with the run command. Note that run requires additional support from toydb_run_engine: use the toydb built into the emulator package, or download it from this page (&lt;a href="https://github.com/vagetablechicken/OpenMLDBSQLEmulator/releases"&gt;https://github.com/vagetablechicken/OpenMLDBSQLEmulator/releases&lt;/a&gt;) and place it directly in /tmp.&lt;/p&gt;

&lt;p&gt;Assuming the Emulator already has toydb installed, the steps for testing SQL are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# step 1, generate a yaml template
gencase
# step 2, modify the yaml file to add table and data
# ...
# step 3, load yaml to get table catalog, 
# then using val/valreq sql to validate the sql in emulator
loadcase
valreq &amp;lt;sql&amp;gt;
# step 4, dump the sql, this will rewrite the yaml file
dumpcase &amp;lt;sql&amp;gt;

# step 5, run sql using toydb
run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gencase command generates a YAML template file, written by default to /tmp/emu-case.yaml. You'll need to edit this file as shown below. When editing, keep the following in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Set the table names, table schemas, and their data in the file; these cannot be changed from within the Emulator.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set the run mode to either batch or request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You may leave the SQL section blank; the SQL can be written to the file from the Emulator using dumpcase. A common workflow is to validate the SQL first, then dump it into the case, and finally use run to confirm that the SQL computes the expected result.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The tables’ indexes don’t need to be filled in manually. When you use dumpcase, indexes are generated automatically from the table schema (these indexes are not tied to the SQL; a table simply needs at least one index when it is created). If you are not using dumpcase, manually specify at least one index.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# call toydb_run_engine to run this yaml file
# you can generate yaml cases for reproduction by emulator dump or by yourself

# you can set the global default db
db: emudb
cases:
  - id: 0
    desc: describe this case
    # you can set batch mode
    mode: request
    db: emudb # you can set default db for case, if not set, use the global default db
    inputs:
      - name: t1
        db: emudb # you can set db for each table, if not set, use the default db(table db &amp;gt; case db &amp;gt; global db)
        # must set table schema, emulator can't do this
        columns: ["id int", "pk1 string","col1 int32", "std_ts timestamp"]
        # gen by emulator, just to init table, not the deployment index
        indexs: []
        # must set the data, emulator can't do this
        data: |
          1, A, 1, 1590115420000
          2, B, 1, 1590115420000
    # query: only support single query, to check the result by `expect`
    sql: |

    # optional, you can just check the output, or add your expect
    # expect:
    #   schema: id:int, pk1:string, col1:int, std_ts:timestamp, w1_col1_sum:int, w2_col1_sum:int, w3_col1_sum:int
    #   order: id
    #   data: |
    #     1, A, 1, 1590115420000, 1, 1, 1
    #     2, B, 1, 1590115420000, 1, 1, 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For simplicity, let’s make no modifications and use this template as-is to demonstrate the workflow. In the Emulator, executing loadcase loads the table information from this case into the Emulator. You can confirm that the case’s tables loaded successfully using st/showtables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;emudb&amp;gt; st
emudb={t1=id:int32,pk1:string,col1:int32,std_ts:timestamp}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that the table information has been loaded successfully. Now we can use valreq to confirm that the SQL we’ve written is syntactically correct and deployable, and then run a computation test on it using the dumpcase and run commands. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;valreq select count(*) over w1 from t1 window w1 as (partition by id order by std_ts rows between unbounded preceding and current row);
dumpcase select count(*) over w1 from t1 window w1 as (partition by id order by std_ts rows between unbounded preceding and current row);
run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dumpcase command writes the SQL and default indexes into the case file, and the run command executes that file. If you are comfortable with the format, you can therefore edit the case file directly and run it in the Emulator with run, or run it outside the Emulator with toydb_run_engine --yaml_path=... . After execution, you obtain the computed results for debugging and inspection.&lt;/p&gt;
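
&lt;p&gt;For example, assuming the default case path produced by gencase, the case can be run outside the Emulator as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;toydb_run_engine --yaml_path=/tmp/emu-case.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;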

&lt;h2&gt;
  
  
  More
&lt;/h2&gt;

&lt;p&gt;The OpenMLDB SQL Emulator also provides a genddl function that generates optimal index-creation statements directly from SQL, helping users avoid redundant indexes. It currently supports only a single database. Future releases will improve index handling further, offering simpler and more convenient operations to guide users in index management.&lt;/p&gt;

&lt;p&gt;Additionally, when using the Emulator, the ?help and ?list-all prompts are worth consulting. Commands are lowercase, while SQL parameter inputs are case-insensitive and need no extra double quotes, in line with CLI conventions. Features such as command history and exporting the current environment will be added in future updates to ease user operations and integration with real OpenMLDB clusters.&lt;/p&gt;




&lt;p&gt;For more information on OpenMLDB:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://openmldb.ai/"&gt;https://openmldb.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/4paradigm/OpenMLDB"&gt;https://github.com/4paradigm/OpenMLDB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://openmldb.ai/docs/en/"&gt;https://openmldb.ai/docs/en/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join us on &lt;a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"&gt;Slack&lt;/a&gt; !&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is a re-post from &lt;a href="https://openmldb.medium.com/"&gt;OpenMLDB Blogs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>featureops</category>
      <category>sql</category>
    </item>
    <item>
      <title>OpenMLDB Integration in Real-Time Decision-Making Systems</title>
      <dc:creator>Hana Wang</dc:creator>
      <pubDate>Mon, 11 Dec 2023 07:43:52 +0000</pubDate>
      <link>https://dev.to/elliezza/openmldb-integration-in-real-time-decision-making-systems-eim</link>
      <guid>https://dev.to/elliezza/openmldb-integration-in-real-time-decision-making-systems-eim</guid>
      <description>&lt;p&gt;OpenMLDB provides a real-time feature computing platform that ensures consistency between online and offline environments. It also offers flexible support for integrating into actual business systems and building complete machine-learning platforms. This article focuses on the common architectures used to integrate OpenMLDB into enterprise-level business systems, with a particular emphasis on &lt;strong&gt;storage&lt;/strong&gt; and &lt;strong&gt;application&lt;/strong&gt; computation architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Offline and online data &lt;strong&gt;storage&lt;/strong&gt; architecture: How to store &lt;strong&gt;offline&lt;/strong&gt; and &lt;strong&gt;online&lt;/strong&gt; data in a reasonable manner while maintaining consistency between the two.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time decision-making &lt;strong&gt;application&lt;/strong&gt; architecture: How to build online applications based on OpenMLDB’s real-time request model, including architectures for real-time &lt;strong&gt;computation&lt;/strong&gt; and real-time &lt;strong&gt;query&lt;/strong&gt; applications.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Storage Architecture for Offline and Online Data
&lt;/h2&gt;

&lt;p&gt;Due to different performance and data volume requirements, in general, OpenMLDB’s offline and online data are stored separately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Online Data Storage:&lt;/strong&gt; OpenMLDB provides an efficient real-time database (based on memory or hard disk) primarily for storing online data used for real-time feature computations, rather than full data. The main features are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Millisecond-level access for time series data (based on memory, by default).&lt;/li&gt;
&lt;li&gt;The ability to automatically expire data (TTL). TTL can be set per table, so that only the necessary data within a certain time window is stored.&lt;/li&gt;
&lt;li&gt;The memory-based storage engine offers high performance but can consume a large amount of memory. A disk-based storage engine can be used instead when it still meets the performance requirements.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Offline Data Warehouse:&lt;/strong&gt; OpenMLDB does not provide a standalone offline storage engine, but can flexibly support different offline data warehouses and architecture forms.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
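
&lt;p&gt;As a sketch of how table-level expiration is expressed, the following hypothetical table declaration keeps only roughly the last three days of transactions per card; the column names and the 3d window are illustrative, and the index options follow the general pattern of OpenMLDB’s CREATE TABLE syntax:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE txn (
    card_id string,
    amount double,
    txn_time timestamp,
    INDEX(KEY=card_id, TS=txn_time, TTL_TYPE=absolute, TTL=3d)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;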

&lt;p&gt;The following section discusses the common storage architectures for offline and online data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Full Data Storage in a Real-Time Database (not recommended)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--etwgtrOT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ADVotWivbgQlmql287iWetQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--etwgtrOT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ADVotWivbgQlmql287iWetQ.png" alt="" width="386" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Users can choose to store the full data in OpenMLDB’s real-time database. The advantage of this method is its simplicity: only one copy of the data exists in physical storage, which saves management and maintenance costs. However, this approach is rarely used in practice due to the following potential problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Full data is generally large, and to ensure online performance, OpenMLDB uses a memory-based storage engine by default. Storing the full data in memory results in a large hardware cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Although OpenMLDB provides disk-based storage engines, disk storage can result in a 3–7x performance decrease, which may not meet the requirements of some online business scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storing offline and online data on the same physical medium may significantly impact the performance and stability of real-time computations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, in practice, to fully leverage OpenMLDB’s real-time computation capabilities, we do not recommend storing full data in OpenMLDB, but rather using it in conjunction with an offline data warehouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Separate Storage for Offline Data Warehouse and Online Real-Time Database
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n5TIq43H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ALRDFl6WpSB3fJA9NicuYUw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n5TIq43H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ALRDFl6WpSB3fJA9NicuYUw.png" alt="" width="677" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently, in practice, most users adopt a separate storage architecture for offline and online data. Based on this architecture, data is simultaneously written to both the offline data warehouse and the real-time database. The real-time database of OpenMLDB sets table-level data expiration (TTL), which corresponds to the size of the time window required in the feature script. In this way, the real-time database only stores necessary data for real-time feature computation, rather than the entire dataset. Some important points to note are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In actual enterprise architecture, data sources are generally based on subscription mechanisms of message queues such as Kafka. Different applications will consume data separately. Under this architecture, the channels for writing to the real-time database of OpenMLDB and storing it in the offline data warehouse can be considered as two independent consumers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If a message-queue subscription mechanism is not used, the setup can instead be viewed as one or more data-receiving programs upstream of OpenMLDB that implement and manage its online and offline storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The expiration time (TTL) of OpenMLDB’s real-time database must be set correctly so that the data it retains is sufficient for correct real-time feature computation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The main disadvantage of this architecture is that it is more complicated to manage, as users need to manage offline and online data separately.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Unified Storage for Offline Data Warehouse and Online Real-Time Database (Support from v0.8.0)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jtUVj_PK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AMJGizdZUw56BVp_598uW_Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jtUVj_PK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AMJGizdZUw56BVp_598uW_Q.png" alt="" width="715" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this architecture, online and offline data are managed in a unified way, simplifying the user’s view of data synchronization and management. Physically there are still two storage engines, a real-time database and an offline data warehouse, but a single write channel is presented to the user. Users only need to write new data into OpenMLDB’s real-time database and set up the synchronization mechanism; OpenMLDB then automatically synchronizes the data, in real time or periodically, to one or more offline data warehouses. The real-time database still relies on the data-expiration mechanism to keep only the data needed for online feature computation, while the offline data warehouse retains the full data. This feature is available from version 0.8.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-time Decision-making Application Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  OpenMLDB Execution Mode: Real-Time Request Mode
&lt;/h3&gt;

&lt;p&gt;Let us first look at the real-time request mode offered by OpenMLDB’s online real-time computing engine. It mainly involves three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The client sends a computation request through the REST API or an OpenMLDB SDK, optionally including state information about the current event, such as the amount and shop ID of the current card-swipe event.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The OpenMLDB real-time engine accepts the request and performs on-demand real-time feature computation based on the deployed feature computation logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OpenMLDB returns the real-time computation results to the client who initiated the request, completing the real-time computation request.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
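
&lt;p&gt;As an illustrative sketch of step 1, a client could call a deployed feature service through the APIServer’s REST interface; the host, port, database name demo_db, and deployment name demo below are placeholders, and the endpoint shape follows the pattern of OpenMLDB’s REST API documentation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl http://127.0.0.1:9080/dbs/demo_db/deployments/demo -X POST \
    -d '{"input": [["aaa", 22.0, 1636097890000]]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;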

&lt;p&gt;With this execution mode in mind, we will start from practical application scenarios and explain two application architectures for real-time decision-making: &lt;strong&gt;computation&lt;/strong&gt; and &lt;strong&gt;query&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Application Architecture for Real-Time Computation — for Real-Time Decision-Making
&lt;/h3&gt;

&lt;p&gt;The real-time request mode (default) of OpenMLDB supports in-event (real-time) decision-making applications, meaning that decision-making behavior takes place during the event occurrence. Therefore, its main characteristic is that &lt;strong&gt;behavioral data generated by the current event is also taken into consideration for decision-making&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The most typical example is credit card anti-fraud. When a credit card transaction occurs, the anti-fraud system makes a decision before the transaction is completed, taking into account the current transaction behavioral data (such as the amount, time, and location of the current transaction), along with data from a recent time window. This architecture is widely used in fields such as anti-fraud and risk control.&lt;/p&gt;

&lt;p&gt;Let’s take a concrete example. The following figure shows the functional logic of the entire system when a credit card transaction occurs. As shown in the figure, the system maintains a history of transaction records, and when a new transaction behavior occurs, the current behavioral data is &lt;strong&gt;virtually inserted into the table&lt;/strong&gt; along with recent transaction records for feature computation, then given to the model for inference, and finally evaluated to determine whether it is a fraudulent transaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0TyfNJGT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4020/1%2AOWz4CFxpbgTCf_Pwmpzqfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0TyfNJGT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4020/1%2AOWz4CFxpbgTCf_Pwmpzqfw.png" alt="" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that in the above figure, the new swipe record is shown as being &lt;strong&gt;virtually inserted&lt;/strong&gt; into the historical transaction table. This is because, in OpenMLDB’s default execution mode, the system virtually inserts the in-event data carried by the request into the table so that it participates in the feature computation (for the special case where the request row is not needed for decision-making, see the section “Application Architecture for Real-time Query” below). Since the current request row is generally also useful for subsequent decisions, it is physically inserted into the database after the current feature computation completes. A typical architecture flowchart for building an in-event decision-making system with this business process is shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2M02exdE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AoeA2kL8ZMxjG_WlvEJ4lFA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2M02exdE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AoeA2kL8ZMxjG_WlvEJ4lFA.png" alt="" width="790" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture is based on the OpenMLDB SDK and achieves strict in-event decision-making, which consists of two stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Steps 1, 2, and 3 in the above diagram constitute a real-time feature computation with OpenMLDB. The request includes the necessary data (card number, transaction amount, timestamp) at the time of the event.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After the real-time computation request is completed, the client initiates an additional data insertion request through the OpenMLDB SDK to insert the current transaction data into OpenMLDB for subsequent real-time request computations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The above strict in-event decision-making architecture based on the OpenMLDB SDK is the default and recommended one. In actual enterprise architectures there may be variations, due to the complexity of surrounding systems or to permission constraints. For example, the data-writing path can be separated entirely, using Kafka or other channels for writes. However, unless such an architecture is verified and guaranteed, read and write ordering problems may arise, causing data to be computed twice or missed. In general, we still recommend the strict in-event decision-making architecture shown in the above diagram.&lt;/p&gt;
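
&lt;p&gt;The second stage, writing the current transaction back for future requests, amounts to a plain insert into the online table; the table and column values below reuse the txn example from the appendix and are purely illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO txn VALUES ('aaa', 22.0, 1636097890000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;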

&lt;h3&gt;
  
  
  Application Architecture for Real-time Query
&lt;/h3&gt;

&lt;p&gt;In some scenarios, the application may need to perform a real-time query only, one that does not carry any meaningful data. For example, when a user browses products, it is helpful to query the most popular products matching the user’s interests on the platform in the last ten minutes, in order to recommend relevant products. In such scenarios, the user’s request carries no meaningful data and can therefore be completely decoupled from the data-writing path. Such query requests only trigger a computation (read-only, no data writes), which is achieved using &lt;code&gt;EXCLUDE CURRENT_ROW&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EBAvDR_f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A7H50QOlufUpFQjKAAcUVZg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EBAvDR_f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A7H50QOlufUpFQjKAAcUVZg.png" alt="" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above architecture, the real-time query request (read-only) and the data writing path are decoupled.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For the data writing path, users can continuously write data to the OpenMLDB database through streaming (such as Kafka connector) or OpenMLDB SDK.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For real-time query requests, there are two main features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After the query request is completed, there is no additional step to write into real-time data (data write is completed by the data writing path).&lt;/li&gt;
&lt;li&gt;Since the data carried by the query request is not meaningful, the extended SQL keyword &lt;code&gt;EXCLUDE CURRENT_ROW&lt;/code&gt; is required to avoid the virtual insertion.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Other Architectures
&lt;/h3&gt;

&lt;p&gt;In addition to the two architectures mentioned above, OpenMLDB can also be extended to support architectures for online queries of offline features and architectures for streaming features. We will gradually introduce other enterprise-level architectures applied in practical scenarios in subsequent articles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Appendix: EXCLUDE CURRENT_ROW Semantics
&lt;/h2&gt;

&lt;p&gt;By default, the real-time request mode of OpenMLDB virtually inserts the current request row into the table and includes it in the window computation. If the current row’s data should not take part in the computation, &lt;code&gt;EXCLUDE CURRENT_ROW&lt;/code&gt; can be used. This syntax excludes the current request row’s data from the window computation, but the &lt;code&gt;PARTITION BY&lt;/code&gt; key and &lt;code&gt;ORDER BY&lt;/code&gt; timestamp provided by the request row are still used to locate the specific data and time window for the request.&lt;/p&gt;

&lt;p&gt;The following example illustrates its semantics, assuming the schema of the data table &lt;code&gt;txn&lt;/code&gt; used to store transaction records is as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q4J-dIb7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AdbgptDtV-qMPTxR3gSQLfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q4J-dIb7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AdbgptDtV-qMPTxR3gSQLfw.png" alt="" width="324" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following SQL with &lt;code&gt;EXCLUDE CURRENT_ROW&lt;/code&gt; is used:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT card_id, sum(amount) OVER (w1) AS w1_amount_sum FROM txn 
    WINDOW w1 AS (PARTITION BY card_id ORDER BY txn_time 
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW EXCLUDE CURRENT_ROW);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The statement defines a window keyed on &lt;code&gt;card_id&lt;/code&gt; and sorted by &lt;code&gt;txn_time&lt;/code&gt;, covering the two rows preceding the current request row. Because &lt;code&gt;EXCLUDE CURRENT_ROW&lt;/code&gt; is specified, the current request row itself is excluded from the window computation.&lt;/p&gt;
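
&lt;p&gt;For contrast, the same query without the keyword lets the virtually inserted request row participate in the window computation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT card_id, sum(amount) OVER (w1) AS w1_amount_sum FROM txn 
    WINDOW w1 AS (PARTITION BY card_id ORDER BY txn_time 
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;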

&lt;p&gt;For simplicity, let’s assume that the table only has the following two rows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--------- ----------- --------------- 
  card_id   amount      txn_time       
 --------- ----------- --------------- 
  aaa       22.000000   1636097890000  
  aaa       20.000000   1636097290000  
 --------- ----------- ---------------
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We send a real-time computing request, which includes the following request data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SP7sLpC5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AZJ245r5je1zXM5hKsSQrtA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SP7sLpC5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AZJ245r5je1zXM5hKsSQrtA.png" alt="" width="474" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;EXCLUDE CURRENT_ROW&lt;/code&gt; is not used, both the current request row and the two rows that are already in the database will be included in the window for computation, resulting in a return value of "aaa, 65.0". However, since &lt;code&gt;EXCLUDE CURRENT_ROW&lt;/code&gt; is used, the current row will not be included in the window computation, so the return value is actually "aaa, 42.0".&lt;/p&gt;

&lt;p&gt;Note that although the &lt;code&gt;amount&lt;/code&gt; value of the current row does not participate in the window computation, its &lt;code&gt;card_id&lt;/code&gt; (key) and &lt;code&gt;txn_time&lt;/code&gt; (timestamp) must still be set correctly so that the time window can be located.&lt;/p&gt;
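
&lt;p&gt;Putting the numbers together (the request row’s amount of 23.0 is not stated in the text above but is implied by the two results, since 65.0 − 42.0 = 23.0):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;without EXCLUDE CURRENT_ROW: 23.0 (request row) + 22.0 + 20.0 = 65.0
with EXCLUDE CURRENT_ROW:                         22.0 + 20.0 = 42.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;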




&lt;p&gt;For more information on OpenMLDB:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Website: &lt;a href="https://openmldb.ai/"&gt;https://openmldb.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/4paradigm/OpenMLDB"&gt;https://github.com/4paradigm/OpenMLDB&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://openmldb.ai/docs/en/"&gt;https://openmldb.ai/docs/en/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Join us on &lt;a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"&gt;Slack&lt;/a&gt; !&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is a re-post from &lt;a href="https://openmldb.medium.com/"&gt;OpenMLDB Blogs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
  </channel>
</rss>
