Apache Doris

Posted on Jan 23, 2023

Query TBs of Data in Seconds, Apache Doris Accelerates Data Warehouse of 360 DigiTech

#database #security #java #opensource

Author: Middleware Team of 360 DigiTech

About Apache Doris

Apache Doris is a real-time analytical database based on MPP architecture, known for its high performance and ease of use. It supports both high-concurrency point queries and high-throughput complex analysis. (https://github.com/apache/doris)
SelectDB is a cloud-native real-time data warehouse developed based on the Apache Doris open source project by the same key developers. The SelectDB company focuses on the enterprise-grade cloud-native distribution for Apache Doris.

About 360 DigiTech

360 DigiTech (NASDAQ: QFIN) is an AI-driven FinTech platform. It works in collaboration with financial institutions to provide tailored online consumer financial products, including 360 IOU APP, 360 Small Business Loans, and 360 Installments, for a broad customer base as a way to achieve inclusive finance. Till now, it has assisted 141 financial institutions in providing credit services for 43 million users, borrowing services for 26.3 million users, and facilitated quarterly transactions worth around US$ 16.5 billion. As a leading credit technology service brand in China, 360 DigiTech had accumulated over 200 million registered users by the end of Q3, 2022.

Requirements of 360 DigiTech

The ever-evolving FinTech business requires higher levels of data security, data accuracy, and real-time data updates. In business scenarios like analysis and tagging, early Clickhouse clusters reveal disadvantages including low stability, complex operation and maintenance, and slow Join queries. In the case of 360 DigiTech, our report data is distributed across various databases, adding to the difficulty of data maintenance and management. We are in urgent need of a powerful and handy tool.

Ideal Data Warehouse for 360 DigiTech

We were looking for a real-time MPP data warehouse that could meet the following requirements:

Quick data writing
Quick query response time of only seconds
Be compatible with standard SQL protocol
High Join query performance
A rich variety of data models
Low maintenance complexity
An active user community
Be business-friendly, with no potential legal risks We found Apache Doris an ideal candidate and started a 2-month survey on it in March, 2022. Here are the results:

We decided to go for Apache Doris due to its almost perfect fulfillment of our requirements and a few other considerations:

ClickHouse is more of a log storage and historical data processing tool due to its low maintenability and limitations on Join queries, functions, and data models. We need a tool that can better serve our needs.
Apache Doris comes with an active user community where technical communications are frequent. Plus, SelectDB, the company that commercializes Apache Doris, has established a full-time technical support team for the Doris community that can offer timely response to our problems.
Apache Doris, together with the Apache open source license, are business-friendly and impose no legal risks. Also, the Apache projects have constituted de facto standards in the big data industry and have been embraced by 360 DigiTech internally.

Platform Architecture of 360 DigiTech

The Yushu big data platform of 360 DigiTech provides one-stop big data management, development, and analysis services across the whole data lifecycle. Currently, Apache Doris is mainly used for offline data warehouse analysis acceleration and self-service BI reporting in 360 DigiTech.

After introducing Apache Doris into our toolkit, we ingested some of our high-priority data into Doris clusters. As is shown in the above figure, we placed Doris on top of Apache Hive to give full play to Doris' strong query performance so it could accelerate queries in certain scenarios. This was an easy move since we only needed to make simple adaptations to smoothen the data ingestion from Hive to Doris. The Doris clusters could directly reuse the existing data lineage in Hive.

Data Ingestion Solution

We compared two different data ingestion methods in Doris: Stream Load and Broker Load. We found that Broker Load slightly outperformed Stream Load in ingestion speed while there was no noticeable difference in development costs between the two. Also, Broker Load was able to ingest one large table as one transaction; however, with the Stream Load method, any table of over 10G data would need to be split before it could be ingested. Thus, we opted for Broker Load.

Ad-Hoc Query Solution

Our self-developed query engines support a dynamic switching mechanism. That means the system can identify the meta information of the query data and then conduct automatic routing switch and failover for the query engine (Doris/Presto/Spark/Hive) of the current query.
Since Apache Doris supports the native MySQL protocol and standard SQL, it can be seamlessly integrated with popular BI tools such as Tableau. Hence, we built a Doris report analysis cluster separately as the data source for BI tools.

Application of Doris

How Doris Accelerates Hive Queries

In ad-hoc queries, the performance of traditional query engines (Hive/Spark/Presto) is lagging behind the expectations of data developers and data analysts. Query response time of dozens of seconds or even minutes has become a big hindrance for developers.
To solve this, we have built a data warehouse acceleration layer, where Doris plays a role. As a result, Doris has reduced the average ad-hoc query response time to less than 5 seconds. This was achieved without enabling optimization strategies such as cache and materialized view.
As a follow-up effort, we will analyze the features of relevant queries and adopt proper optimization methods to further improve query performance.

Doris accelerates queries by the query engine dynamic switching mechanism, which works as follows:

As is shown above, the query engine collects the meta information of data in Hive and Doris in a timely manner, including information about databases, tables, fields, and rows. When users submit an ad-hoc query request, the system will parse the target table, and make the following judgements one by one:

Is the target table existing in Doris?
Does the Doris table have the same scheme as the Hive table?
Does the Doris table have the same number of rows as the Hive table? If all answers are "Yes", the current query will be routed to Doris; otherwise the query will be conducted in Presto, then Spark, and then Hive. If problems occur during the query, a failover will be performed in that same order. Analysis of Slow Queries and Slow Ingestion Doris has a well-developed Profile mechanism for slow queries and slow data ingestion. After learning about this mechanism, we started to regularly collect and store profile information of slow queries and slow data ingestion through scheduling tasks on our online clusters.

The profile information contains lots of details. For example, OLAP_SCAN_NODE is the number of the overall scanned rows and the filtered rows of each index; the EXCHANGE_NODE of each instance is the number of received rows and the total received data volume. Based on these details, we conducted tailored optimization and removed performance bottlenecks effectively.

How We Create Tables

In our application scenarios, we use the following tables:

PDA Table: fully updated day to day, with daily partitions storing full snapshot data
PDI Table: incrementally updated day to day, with daily partitions storing incremental data
A Table: full table with no partitions
S Table: static data with no day-to-day updates Since all tables in Doris clusters are ingested from and synchronized with those in Hive, we have only adopted the Duplicate and Unique data models so far. When creating PDA, PDI, and A Tables, in order to reduce the number of partitions and management complexity of FE metadata, we used a dynamic data partitioning method: Old data is partitioned and archived by year and month, while recent data is partitioned by day and hour. We also plan to automize the merging of historical data partitions.

For PDA Tables, we used the Duplicate model instead of the Unique model. The reason is twofold. Firstly, the Doris data ingestion model itself provides data consistency guarantee based on the task label. One ingestion task of one partition of a PDA table in one scheduling cycle generates only one unique and immutable label. Thus, no matter how many execution errors occur, data in that partition will not be duplicated. Secondly, unlike the Unique model, the Duplicate model does not perform pre-aggregate de-duplication, making data ingestion and queries faster.

For PDI Tables, we used the Unique model. The usage of PDI tables involves partial update of historical data (mostly data update and basically no data deletion). Thus, considering the partition availability, we chose the Unique model so that we don't need to rebuild the partitions when updating historical data.

For A Tables, we used dynamic partitioning since temporary data unavailability was acceptable. Before each ingestion, we deleted the historical partitions and loaded all data into the current day's partition. This was easy to implement and could avoid recreation of tables.

Older versions of Apache Doris did not support Hive metadata sychronization and management. We had to select and configure the clusters, Hive table names, data model, and bucket numbers before we could relate to the corresponding Hive table. Then the system would parse table fields, and generate CREATE TABLE statements. An exciting news is that Apache Doris 1.2.0 supports Multi Catalog, which means it can perform auto-synchronization of Hive metadata and schema. This relieves so much of our workload!

Monitoring System

The current Doris cluster monitoring system consists of three parts: host metrics monitoring alerts, log alerts, and cluster metrics monitoring alerts. The overall system is shown as follows:

The host metrics monitoring platform is developed based on Open-Falcon. It collects CPU, IO, memory, and disk metrics of Doris cluster nodes, and creates alerts after detecting abnormal situations.

The cluster metrics monitoring platform is built based on the monitoring solutions of Prometheus and Grafana, which are mentioned in the Doris documentation.

Log alert is a complement to the above two. In some cases, it is a less expensive way to monitor the Doris service logs. By this method, we can accurately identify the root reason for data ingestion failures and receive prompt notifications.

Troubleshooting and Audit Log

In addition to the monitoring system, we need extra measures to address some extreme cluster issues. In order to quickly locate the stacks in the case of unusual downtime of cluster BE, we need to enable Core Dump for all BE nodes. Besides, audit logs also play a significant role in everyday operation and maintenance.

Audit logs of Doris clusters can be collected in two ways:

Collect components from the logs and fe.audit.log from each FE node
Use the Auditloader plugin provided by Doris (which locates in doris/fe_plugins/auditloader in the Doris source code). See detailed usage in Audit Log Plugin. We followed the second method since it is simpler. In actual practice, we found several problems regarding the use of the Auditloader plugin and committed PRs to the Doris project. Meanwhile, we customized our Doris console, on which we could check the data distribution and synchronization details of the clusters, and conduct audit log retrieval.

If cluster BE collapses, audit logs can provide information about SQL location, client access, response time of query SQL, and features of access SQL. For example, our data developers once reported a Doris SQL failure with the log showing that we had hit the connection limit. By looking into the audit log, we realized that it was because we had created too many database connections due to a bug in an upstream workflow import task. Also, for several BE downtime that occurred when we used the older versions of Apache Doris, we located the query_id of the relevant SQL using the GDB debugger. Based on that, we quickly found the culprit SQL in the audit log.

Summary and Future Plans

Summary

In July, 2022, we applied Apache Doris in our production environment. Now we have 2 clusters containing hundreds of tables and dozens of TBs of data. Every day, we run hundreds of workflows simultaneously and handle billions of data updates. Apache Doris manages to support such big business with excellent performance and stability all the way.

Why Doris Stands Out:

Doris clusters are of neat architecture with no dependencies on other components. With user-friendly data models and various data ingestion methods that bring low adaptation costs, Doris has made the trial use and application of it an easy journey for us.
As Doris clusters have become a major data source of our BI tools, Doris has greatly accelerated our report analysis workloads and largely reduced the response time of ad-hoc queries.
Doris has a well-developed monitoring and auditing mechanism, which makes maintenance no longer a nightmare.
Doris comes with an active user community and a professional technical support team that can respond to and address our problems quickly.

Future Plans

We are planning to apply Apache Doris in more business scenarios, such as real-time data warehouse creation, user behavior profiling, and federated queries to data lakes such as Hive, Iceberg, and Hudi. We also plan to further improve query performance by utilizing the query caching feature and materialized views of Doris based on user query SQL features. We will build a cluster probing tool that monitors data distribution (checking for overly large Tablets or unevenly distributed data in Tablets) and running status of clusters and provides optimization suggestions automatically.
Special thanks to the SelectDB technical support team for their kind help.

DEV Community