Apache Doris

Beginner's Guide to Data Analytics: Diving into Our Data Management Platform

In this post, I'll walk you through the components of our Data Management Platform (DMP) and how we improved analytic efficiency through architectural optimization.

Let's start from the basics.

As the raw material of our DMP, the data sources include:

  • Business logs from all sales ends;
  • Sales data from third-party platforms;
  • Basic data from within the company.

These constitute our data assets, from which we derive a number of tags describing each customer's age, address, preferred products, devices used, and so on. Using these tags as filters, we pick out a group of customers that match certain characteristics (the "Grouping" process). Then we analyze the behavior patterns of the target group.
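As a minimal illustration, the grouping step can be thought of as set filtering. The customers, IDs, and tag names below are invented stand-ins, not our real data model:

```python
# Hypothetical in-memory illustration of tag-based grouping:
# each customer carries a set of tags, and "grouping" picks out
# the customers whose tags contain every requested filter tag.

customers = {
    "u001": {"age:20-30", "city:shanghai", "device:ios"},
    "u002": {"age:20-30", "city:beijing", "device:android"},
    "u003": {"age:30-40", "city:shanghai", "device:ios"},
}

def group_by_tags(customers, required_tags):
    """Return the IDs of customers matching all required tags."""
    required = set(required_tags)
    return {uid for uid, tags in customers.items() if required <= tags}

print(group_by_tags(customers, ["age:20-30", "device:ios"]))  # {'u001'}
```

In production, of course, this filtering is pushed down into the storage engines rather than done in application memory.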


What is the DMP used for?

From the data user's perspective, they mainly use the DMP for two purposes: tag query and grouping.

  • Tag Query: They look up a certain customer (or a group of customers) and check which tags are attached to them.
  • Grouping: After grouping, they might want to check whether a certain customer belongs to a specified group, to support their marketing decisions. They might also pull the grouping result set from the DMP into their own business systems for further development. Most of the time, though, they analyze the shopping patterns of the target group directly.

How does the DMP work?

Firstly, we, as data platform engineers, define the tags and the rules for grouping.

Next, we define the domain-specific language (DSL) that describes what we want to do, so we can submit computation tasks to Apache Spark.

Then, computing results will be stored in Apache Hive and Apache Doris.

Lastly, data users can perform whatever queries they need in Hive and Doris.
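The DSL-to-SQL translation in the second step can be sketched roughly as follows. The post doesn't show our actual DSL, so the grammar, tag names, and table name below are all hypothetical stand-ins, assuming tags are boolean columns in a wide table:

```python
# Toy translation of a grouping DSL expression into a SQL query.
# Assumes each tag is a 0/1 column in a hypothetical wide table.

OPERATORS = {"AND", "OR", "NOT", "(", ")"}

def dsl_to_sql(expr, table="tag_wide_table"):
    """Translate e.g. 'tag_a AND tag_b' into a SELECT statement."""
    tokens = expr.replace("(", " ( ").replace(")", " ) ").split()
    # Keep boolean operators as-is; turn each tag token into a predicate.
    where = " ".join(t if t in OPERATORS else f"{t} = 1" for t in tokens)
    return f"SELECT user_id FROM {table} WHERE {where}"

print(dsl_to_sql("age_20_30 AND city_beijing"))
```

The real pipeline submits the generated computation to Spark rather than emitting a single SELECT, but the core idea of compiling tag expressions into SQL semantics is the same.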


At a higher level of abstraction, the DMP can be seen as a four-layered architecture:

  • Metadata Management: All meta information about the tags is stored in the source data tables;
  • Computation & Storage: This layer is supported by Spark, Hive, Doris, and Redis;
  • Scheduling: This is like a command center of the DMP. It arranges the tasks throughout the whole platform, such as aggregating data into basic tags, converting data of basic tags into SQL semantics for queries based on the DSL rules, and passing computing results from Spark to Hive and Doris;
  • Service: This is where data users conduct grouping, profile analysis, and check the tags.


Tags

Tags are the most important elements of our DMP, so I'm going to take a whole chapter to introduce them.

Every tag in our DMP goes through a five-phased lifecycle:

  1. Demand: Data users put forward what kind of tags they need.
  2. Production: Data developers sort out the data and produce the tags.
  3. Grouping: Tags are put into use for grouping.
  4. Marketing: Operators launch precision marketing practice on the targeted groups.
  5. Evaluation: All relevant parties evaluate the utilization rate and effectiveness of tags, and then plan for subsequent improvement.

Every day, the grouping results are updated and outdated data is cleaned; both processes are automatic. We have also partially automated the tag production process, and our next priority is to automate it fully.

How Are Tags Produced?

To derive tags from data, we first need to turn our raw data into a more organized, structured form:


  • ODS (Operational Data Store): This layer contains the user login history, event tracking logs, transaction data, and the binlogs from various databases.
  • DWD (Data Warehouse Details): Data processed by the ODS layer will be sent here to form user login tables, user event tables, and order information sheets.
  • DM (Data Mart): The DM layer stores the aggregated data from DWD.

Data in the DM layer is combined with "and", "or", and "xor" logic to produce tags that are more comprehensible for data users.
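If you picture each tag as the set of user IDs carrying it, these combinations map directly onto set operations. The tag names and user IDs below are invented for illustration:

```python
# Two hypothetical DM-layer tags, each a set of user IDs.
bought_shoes = {"u1", "u2", "u3"}
used_coupon  = {"u2", "u3", "u4"}

and_tag = bought_shoes & used_coupon   # "and": users with both tags
or_tag  = bought_shoes | used_coupon   # "or":  users with either tag
xor_tag = bought_shoes ^ used_coupon   # "xor": users with exactly one

print(and_tag)  # {'u2', 'u3'}
```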

Types of Tags

Tags can be categorized in several ways depending on the metric, such as by update frequency (offline vs. real-time) or by data type.


Where Do We Store Our Data?

We made a list of what would look like an ideal data warehouse for us:

  • Support high-performance queries so as to handle massive consumer end traffic;
  • Support SQL for data analysis;
  • Support data update;
  • Be able to store huge amounts of data;
  • Support extension functions to deal with user-defined data structures;
  • Be closely integrated into the big data ecosystem.

It was hard to find a single tool that met all these needs, so we tried a mixture of multiple tools.

We stored part of our offline and real-time data in HBase for basic tag queries, most of the offline data in Hive, and double-wrote the rest of our real-time data into Kudu and Elasticsearch for real-time grouping and data queries. The grouping results were produced by Impala and then cached in Redis.


As you can imagine, such a complicated storage setup made maintenance tricky. Double-writing also added the risk of data inconsistency, since either write might fail.

So, in Storage Architecture 2.0, we introduced Apache Doris and Apache Spark. The whole data pipeline formed a Y shape.


We stored offline data in Hive, and the basic tags and real-time data in Doris. Then, with Spark, we conducted federated queries across Hive and Doris. The query results were stored in Redis.
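The "compute once, cache the result" pattern here can be sketched as below. `run_federated_query` is a stub standing in for the Spark job over Hive and Doris, and a plain dict stands in for Redis, so the example runs without any external services; the key scheme is also hypothetical:

```python
import hashlib
import json

cache = {}  # dict standing in for Redis in this sketch

def run_federated_query(sql):
    # Placeholder for the heavy Spark job joining Hive and Doris.
    return ["u1", "u2", "u3"]

def cached_grouping(sql):
    """Return the grouping result set for `sql`, caching by query hash."""
    key = "group:" + hashlib.sha256(sql.encode()).hexdigest()
    if key not in cache:                 # cache miss: run the heavy job once
        cache[key] = json.dumps(run_federated_query(sql))
    return json.loads(cache[key])        # cache hit: cheap lookup
```

With real Redis you would also set a TTL on each key so that stale grouping results expire on their own.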

With this new architecture, we compromised a little bit of performance for much lower maintenance costs.

P.S.

For your reference, we also compared the applicable scenarios of the various engines we've used or investigated.


What Are Our High-Performance Queries?

Some of our data queries demand high performance, such as grouping checks and customer group analysis.

Grouping Check

A grouping check determines whether certain users fall into one or several given groups. It is a two-step check:

  1. Check in static group packets: perform pre-computations and store the results in Redis; use Lua script for bulk check and thus increase performance.
  2. Check in real-time behavior groups: extract data from contexts, APIs, and Apache Doris for rule-based judgment. Meanwhile, we increase performance via asynchronous checks, short-circuit evaluation, query statement optimization, and limiting the number of joined tables.
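For the first step, the bulk check runs server-side as a Lua script against Redis to avoid one round trip per (user, group) pair. Below is a pure-Python stand-in for the same bulk-check logic, with made-up group contents:

```python
# Groups as sets of user IDs, mirroring Redis SETs. The bulk check
# answers "which of these groups is each user in?" in one pass,
# which is what the Lua script computes server-side in Redis.

groups = {
    "g1": {"u1", "u2"},
    "g2": {"u2", "u3"},
}

def bulk_check(user_ids, group_ids):
    """Map each user ID to the list of groups that contain it."""
    return {
        uid: [gid for gid in group_ids if uid in groups[gid]]
        for uid in user_ids
    }

print(bulk_check(["u1", "u2"], ["g1", "g2"]))
```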

Customer Group Analysis

Customer group analysis traces the behavioral paths of consumers. It entails join queries across the group packets and multiple tables. Apache Doris does not yet provide built-in path analysis functions, but its computing model is friendly to user-defined function (UDF) development, so we built a UDF for this purpose, and it works well.
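Our actual UDF isn't shown in this post, but as a simplified sketch of what a path-analysis function computes, consider counting consecutive page-to-page steps across each user's time-ordered events (the event names below are made up):

```python
from collections import Counter

def path_steps(events_per_user):
    """Count each length-2 step (page A -> page B) across all users,
    given each user's events already sorted by time."""
    steps = Counter()
    for events in events_per_user:
        for a, b in zip(events, events[1:]):   # consecutive event pairs
            steps[(a, b)] += 1
    return steps

print(path_steps([["home", "search", "cart"], ["home", "search"]]))
```

The real UDF runs inside Doris's execution engine over the grouped users' event data, but the step-counting idea is the same.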

Our Gains After Architectural Simplification

The newly introduced data warehouse, Apache Doris, can be applied in multiple scenarios, including point queries, batch queries, behavioral path analysis, and grouping. In point queries and small-scale join queries, it delivers over 10,000 QPS with a 99th-percentile response time under 50 ms. Apart from its strong scalability and easy maintenance, we also benefit from the much simpler tag models enabled by the integration of real-time and offline data.

Conclusion

In this post, I zoomed in on various parts of our DMP and explained how data tagging, storage, and queries work. We believe that a good tag system and fast queries are the recipe for efficient data analytics, so our follow-up efforts will focus on these aspects.

I'm sharing our practice with the data engineering community in the hope of collecting some valuable suggestions, so if you've got any ideas, meet me in the comment section.
