How to build real-time user-facing analytics with Kafka + Flink + Doris

In the data-driven era, when people hear the term data analysis, their first thought is often that it is a skill reserved for corporate executives, managers, or professional data analysts. However, with the widespread adoption of the internet and the full digitalization of consumer behavior, data analysis has long transcended professional circles. It has quietly permeated every aspect of daily life, becoming a practical tool that even ordinary people can leverage. Examples include:

  • For e-commerce operators: By analyzing real-time product sales data and advertising performance, they can precisely adjust promotion strategies for key events like Black Friday. This makes every marketing investment more efficient and effectively boosts return on marketing investment.

  • For restaurant managers: Using order volume data from food delivery platforms, they can scientifically plan ingredient procurement and stock levels. This not only prevents order fulfillment issues due to insufficient stock but also reduces waste from excess ingredients, balancing costs and supply.

  • Even for ordinary stock investors: Analyzing the revenue data and quarterly profit-and-loss statements of their holdings helps them gain a clearer understanding of investment dynamics, providing references for future decisions.

Today, every online interaction—from online shopping and food delivery to ride-hailing and apartment hunting—generates massive amounts of data. User-Facing Analytics transforms these fragmented data points into intuitive, easy-to-understand insights. This enables small business owners, individual operators, and even ordinary consumers to easily interpret the information behind the data and truly benefit from it.

Core Challenges of User-Facing Analytics

Unlike traditional enterprise-internal Business Intelligence (BI), User-Facing Analytics may serve millions or even billions of users. These users have scattered, diverse needs and higher requirements for real-time performance and usability, leading to three core challenges:

Data Freshness

Traditional BI typically relies on T+1 (previous day) data. For example, a company manager reviewing last month’s sales report would not be significantly affected by a 1-day delay. However, in User-Facing Analytics, minimizing the time from data generation to user visibility is critical—especially in scenarios requiring real-time decisions (e.g., algorithmic trading), where real-time market data directly impacts decision-making responsiveness. The challenges here include:

  • High-throughput data inflow: A top live-streaming e-commerce platform can generate tens of thousands of log entries per second (from user clicks, cart additions, and purchases) during a single live broadcast, with daily data volumes reaching dozens of terabytes. Traditional data processing systems struggle to handle this load.

  • High-frequency data updates: In addition to user behavior data, information such as product inventory, prices, and coupons may update multiple times per second (e.g., temporary discount adjustments during live streams). Systems must simultaneously handle read (users viewing data) and write (data updates) requests, which easily leads to delays.

High Concurrency & Low Latency

Traditional BI users are mostly internal employees (tens to thousands of people), so systems only need to support low-concurrency requests. In contrast, User-Facing Analytics serves a massive number of end-users. If system response latency exceeds 1 second, users may refresh the page or abandon viewing, harming the experience. Key challenges include:

  • High-concurrency requests: Systems must handle a large number of user requests simultaneously, significantly increasing load.

  • Low-latency requirements: Users expect data response times in the millisecond range; any delay may impact experience and decision efficiency.

Complex Queries

Traditional BI primarily offers predefined reports (e.g., the finance department reviewing monthly revenue reports with fixed dimensions like time, region, and product). User-Facing Analytics, however, requires support for custom queries due to diverse user needs:

  • A small business owner may want to check the sales share of a product among users aged 18-25 in the past 3 days.

  • An ordinary consumer may want to view the trend of spending on a product category in the past month.

The challenges here are:

  • Computational resource consumption: Complex queries require real-time integration of multiple data sources and multi-dimensional calculations (e.g., SUM, COUNT, GROUP BY), which consume significant computational resources. If multiple users initiate complex queries simultaneously, system performance degrades sharply.

  • Query flexibility: Users may adjust query dimensions at any time (e.g., switching from daily analysis to hourly analysis, or from regional analysis to user age analysis). Systems must support Ad-Hoc Queries instead of relying on precomputed results.

Designing a User-Facing Analytics Solution with Kafka + Flink + Doris

A typical real-time User-Facing Analytics architecture consists of a three-tier real-time data warehouse, with Kafka as the unified data ingestion bus, Flink as the real-time computing engine, and Doris as the core data service layer. Through close coordination among these components, the architecture handles high-throughput ingestion of multi-source data, enables low-latency stream processing, and provides flexible data services, meeting enterprises' diverse needs for real-time analysis, business queries, and metric statistics.

Data Ingestion Layer

The core goal of this layer is to aggregate all data sources stably and in real time. Kafka is the preferred component here thanks to its high throughput and reliability, with the following advantages:

  • High throughput & low latency: Based on an architecture of partition parallelism + sequential disk I/O, a single Kafka cluster can easily handle millions of messages per second (both writes and reads) with millisecond-level latency. For example, during an e-commerce peak promotion, Kafka processes 500,000 user order records per second, preventing data backlogs.

  • High data reliability: Default 3-replica mechanism ensures no data loss even if a server fails. For instance, user behavior logs from a live-streaming platform are stored via Kafka’s multi-replica feature, ensuring every click or comment is fully preserved.

  • Rich ecosystem: Via Kafka Connect, it can connect to various data sources (structured data like MySQL/PostgreSQL, semi-structured data like JSON/CSV, and unstructured data like log files/images) without custom development, reducing data ingestion costs.
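
As a minimal sketch of this ingestion path, the snippet below publishes a user-behavior event to a Kafka topic with the kafka-python client. The broker address, topic name, and event schema are illustrative assumptions rather than anything prescribed by the architecture:

```python
# Minimal Kafka producer sketch (kafka-python).
# Broker address, topic name, and event fields are illustrative assumptions.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumption: local dev broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait for all in-sync replicas, leaning on the replication described above
)

event = {
    "user_id": 10001,
    "action_type": "add_to_cart",
    "product_id": 2002,
    "timestamp": int(time.time() * 1000),
}

# Keying by user_id keeps one user's events ordered within a partition.
producer.send("user_events", key=str(event["user_id"]).encode("utf-8"), value=event)
producer.flush()
```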

Stream Processing Layer

The core goal of this layer is to transform raw data into usable analytical data. As a unified batch-stream computing engine, Flink efficiently processes real-time data streams to perform cleaning, transformation, and aggregation:

Real-Time ETL

Raw data often arrives with inconsistent formats, invalid values, and unmasked sensitive information. Flink handles all of this in real time:

  • Format standardization: Convert JSON-format APP logs into structured data (e.g., splitting the user_behavior field into user_id, action_type, timestamp).

  • Data cleaning: Filter invalid data (e.g., negative order amounts, empty user IDs) and fill missing fields (e.g., using default values for unprovided user gender).

  • Sensitive data masking: Mask fields like phone numbers (e.g., 138****5678) and ID numbers (e.g., 110101********1234) to ensure data security.
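
To make these three steps concrete, here is a hedged PyFlink sketch that parses raw JSON logs, drops invalid records, fills a missing field, and masks the phone number before re-serializing. The field names and sample records are assumptions, and a production job would read from Kafka rather than an in-memory collection:

```python
# PyFlink ETL sketch: standardize, clean, and mask raw JSON logs.
# Field names and sample records are illustrative assumptions.
import json
import re

from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Assumption: two sample log lines stand in for a Kafka source.
raw = env.from_collection([
    '{"user_id": "u1", "order_amount": 59.9, "phone": "13812345678"}',
    '{"user_id": "", "order_amount": -1, "phone": "13987654321"}',
])

def etl(line):
    record = json.loads(line)                    # format standardization
    if not record.get("user_id") or record.get("order_amount", 0) < 0:
        return                                   # cleaning: drop invalid records
    record.setdefault("user_gender", "unknown")  # cleaning: fill missing fields
    record["phone"] = re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", record["phone"])  # masking
    yield json.dumps(record)

cleaned = raw.flat_map(etl, output_type=Types.STRING())
cleaned.print()
env.execute("realtime_etl_sketch")
```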

Dimension Table Join

This solves the integration of stream data and static data. In data analysis, stream data (e.g., order data) often needs to be joined with static dimension data (e.g., user information, product categories) to generate complete insights. Flink achieves low-latency joins by collaborating with Doris row-store dimension tables:

  • Stream data: Real-time order data in Kafka (including user_id, product_id, order_amount).

  • Dimension data: User information tables (user_id, user_age, user_city) and product category tables (product_id, product_category) stored in Doris.

  • Join result: A wide order table including user age, city, and product category—supporting subsequent queries like sales analysis by city or consumption preference analysis by user age.
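
In Flink itself this is typically expressed as a lookup join (for example via the JDBC connector, since Doris speaks the MySQL wire protocol). The standalone Python sketch below shows the equivalent enrichment logic; the host, credentials, and the dim_user table and its columns are illustrative assumptions:

```python
# Standalone sketch of a dimension lookup join against Doris.
# Host, credentials, and table/column names are illustrative assumptions.
import pymysql  # Doris is MySQL-protocol compatible, so pymysql can query it

conn = pymysql.connect(host="doris-fe", port=9030, user="root", password="", database="dw")

dim_cache = {}  # tiny in-process cache to avoid one lookup query per event

def lookup_user(user_id):
    if user_id not in dim_cache:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT user_age, user_city FROM dim_user WHERE user_id = %s", (user_id,)
            )
            dim_cache[user_id] = cur.fetchone()  # None if the user is unknown
    return dim_cache[user_id]

def enrich(order):
    """Join a streaming order record with the user dimension into a wide record."""
    dim = lookup_user(order["user_id"])
    age, city = dim if dim else (None, None)
    return {**order, "user_age": age, "user_city": city}

print(enrich({"user_id": 42, "product_id": 7, "order_amount": 19.9}))
```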

Real-Time Metric Calculation

Flink supports multiple window types (tumbling, sliding, and session windows) to aggregate key metrics in real time, meeting User-Facing Analytics' need for timely insights; a sketch of the tumbling case follows this list:

  • Tumbling window: Aggregate at fixed time intervals (e.g., calculating total order amount in the last 1 minute every minute).

  • Sliding window: Slide at fixed steps (e.g., calculating active user count in the last 5 minutes every 1 minute).

  • Session window: Aggregate based on user inactivity gaps (e.g., ending a session after 30 consecutive minutes of inactivity, then calculating the number of products viewed in that session).
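
As a hedged sketch of the tumbling-window case, the snippet below uses the PyFlink Table API with a datagen source and Flink SQL's tumbling group window to total order amounts per minute. The table and column names are assumptions, and the datagen connector stands in for the real Kafka-backed stream:

```python
# PyFlink sketch: tumbling-window aggregation over a synthetic order stream.
# Table and column names are illustrative assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Assumption: a datagen source stands in for the Kafka-backed order stream.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_amount DOUBLE,
        ts AS PROCTIME()
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5'
    )
""")

# Total order amount per 1-minute tumbling window.
# Runs continuously on the unbounded source; cancel the job to stop it.
t_env.execute_sql("""
    SELECT
        TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
        SUM(order_amount) AS total_order_amount
    FROM orders
    GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE)
""").print()
```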

Online Data Serving Layer

The Online Data Serving Layer is the last mile of the real-time data pipeline and the key to turning raw data into business value. Whether e-commerce merchants check real-time sales reports, food delivery riders access order heatmaps, or ordinary users query consumption bills, all of them rely on this layer to obtain insights. Doris, with its in-depth optimizations for high-throughput ingestion, high-concurrency queries, and flexible updates, serves as the core of the Online Data Serving Layer for User-Facing Analytics. Its advantages are detailed below:

Ultra-High Throughput Ingestion

In User-Facing Analytics, data ingestion faces the dual challenges of massive volume and high frequency. Doris, via its HTTP-based Stream Load API, provides an efficient batch ingestion mechanism with two core advantages:

  • High performance per thread: With batched, compressed transmission and asynchronous writes, the Stream Load API achieves over 50MB/s of ingestion per thread and supports concurrent loads. For example, when an upstream Flink cluster runs 10 parallel write tasks, total ingestion throughput easily exceeds 500MB/s, covering the real-time write needs of medium-to-large enterprises.

  • Validation in ultra-large-scale scenarios: In core data storage scenarios for the telecommunications industry, Doris demonstrates strong ultra-large-scale data storage and high-throughput write capabilities. It supports stable storage of 500 trillion records and 13PB of data in a single large table. Additionally, it handles 145TB of daily incremental user behavior data and business logs while maintaining stability and timeliness—addressing pain points of traditional storage solutions (e.g., difficult storage, slow writes, poor scalability) in ultra-large-scale data scenarios.
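
A minimal sketch of a Stream Load request with Python's requests library is shown below. The FE host and port, database and table names, credentials, and row schema are all assumptions; in practice the Flink Doris connector wraps this same HTTP interface:

```python
# Minimal Doris Stream Load sketch over HTTP.
# FE host, database/table names, credentials, and row schema are assumptions.
import json
import uuid

import requests

rows = [
    {"order_id": 1, "user_id": 42, "order_amount": 19.9},
    {"order_id": 2, "user_id": 43, "order_amount": 5.0},
]

resp = requests.put(
    "http://doris-fe:8030/api/dw/orders/_stream_load",  # /api/{db}/{table}/_stream_load
    data="\n".join(json.dumps(r) for r in rows).encode("utf-8"),
    headers={
        "label": f"orders-{uuid.uuid4()}",  # unique label makes the load idempotent on retry
        "format": "json",
        "read_json_by_line": "true",        # one JSON object per line
        "Expect": "100-continue",           # lets the FE redirect the body to a BE node
    },
    auth=("root", ""),
)
print(resp.json())  # "Status": "Success" once the load is committed
```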

High Concurrency & Low Latency Queries

User-Facing Analytics is characterized by a large user scale: tens of thousands of merchants and millions of ordinary users may initiate queries simultaneously. For example, during an e-commerce peak promotion, over 100,000 merchants frequently refresh real-time transaction dashboards, and nearly 1 million users query the delivery status of their orders. Doris balances high concurrency and low latency via in-depth query engine optimizations:

  • Distributed query scheduling: Adopting an MPP (Massively Parallel Processing) architecture, queries are automatically split into sub-tasks executed in parallel across multiple Backend (BE) nodes. For example, a query like order volume by city nationwide in the last hour is split into 30 parallel sub-tasks (one per city partition), with results aggregated after node-level computation—greatly reducing query time.

  • Inverted indexes & multi-level caching: Inverted indexes quickly skip irrelevant data (e.g., a query for a product's orders in May 2024 skips data from other months, improving efficiency by 5-10x). Built-in multi-level caching (memory cache, disk cache) lets popular queries (e.g., merchants checking today's sales) return results directly from memory, compressing latency to milliseconds.

  • Performance validation: In standard stress tests, a Doris cluster (10 BE nodes) supports 100,000 concurrent queries per second, with 99% of responses completed within 500ms. Even in extreme scenarios (e.g., 200,000 queries per second during e-commerce peaks), the system remains stable without timeouts or crashes—fully meeting User-Facing Analytics’ user experience requirements.
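
Because Doris speaks the MySQL wire protocol, each of those user-facing dashboards ultimately boils down to a SQL query. Here is a hedged sketch with pymysql, where the host, schema, and the dashboard query itself are illustrative assumptions:

```python
# Sketch: querying Doris through its MySQL-compatible protocol.
# Host, schema, and the dashboard query are illustrative assumptions.
import pymysql

conn = pymysql.connect(host="doris-fe", port=9030, user="root", password="", database="dw")

with conn.cursor() as cur:
    # The kind of per-city dashboard query described above; Doris fans it out
    # across BE nodes (MPP) and serves hot queries from its caches.
    cur.execute(
        """
        SELECT user_city, COUNT(*) AS orders, SUM(order_amount) AS gmv
        FROM orders_wide
        WHERE order_time >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
        GROUP BY user_city
        ORDER BY gmv DESC
        LIMIT 10
        """
    )
    for row in cur.fetchall():
        print(row)
```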

Flexible Data Update Mechanism

In real business, data is not write-once and immutable: food delivery order status changes from pending acceptance to delivered, e-commerce product inventory decreases in real time with sales, and user membership levels may rise after qualifying purchases. Slow or complex data updates lead to stale data (e.g., users seeing in-stock products but receiving out-of-stock messages after ordering), eroding business trust. Doris addresses traditional data warehouse pain points (e.g., difficult updates, high costs) via native CRUD support, primary key models, and partial-column updates:

  • Primary key models ensure uniqueness: Supports primary key tables with business keys (e.g., order_id, user_id) as unique identifiers—preventing duplicate data writes. When upstream data is updated, Upsert operations (update existing data or insert new data) are performed based on primary keys, eliminating manual duplicate handling and simplifying business logic.

  • Partial-column updates reduce costs: Traditional data warehouses rewrite entire rows even for single-field updates (e.g., changing order status from pending payment to paid), consuming significant storage and computing resources. Doris supports partial-column updates, writing only changed fields—improving update efficiency by 3-5x and reducing storage usage.

Example: An e-commerce platform builds a product 360° table (over 2,000 columns, covering basic product info, inventory, price, sales, and user ratings). Multiple Flink tasks update different columns by primary key (a sketch of one such partial-column load follows the list):

  1. Flink Task 1: Syncs real-time basic product info (e.g., name, specifications) to update basic info columns (50 columns total).

  2. Flink Task 2: Syncs real-time inventory data (e.g., current stock, pre-order stock) to update inventory columns (10 columns total).

  3. Flink Task 3: Calculates hourly sales (24-hour sales, 7-day sales) to update sales columns (8 columns total).

  4. Flink Task 4: Updates daily user ratings (overall score, positive rate) to update rating columns (5 columns total).
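
As a hedged sketch of how a task like Flink Task 2 might land its columns: on a Doris Unique Key (merge-on-write) table, a Stream Load request carrying a partial_columns header writes only the columns listed in the columns header and leaves the other 1,990+ untouched. Connection details, table, and column names are assumptions, and the feature requires a Doris version with partial-update support:

```python
# Sketch: partial-column update via Stream Load on a Unique Key (merge-on-write) table.
# Only the columns named in the "columns" header are written; all others stay intact.
# FE host, table, and column names are illustrative assumptions.
import json
import uuid

import requests

inventory_updates = [
    {"product_id": 2002, "current_stock": 137, "preorder_stock": 20},
]

resp = requests.put(
    "http://doris-fe:8030/api/dw/product_360/_stream_load",
    data="\n".join(json.dumps(r) for r in inventory_updates).encode("utf-8"),
    headers={
        "label": f"inventory-{uuid.uuid4()}",
        "format": "json",
        "read_json_by_line": "true",
        "partial_columns": "true",  # update only the listed columns by primary key
        "columns": "product_id,current_stock,preorder_stock",
        "Expect": "100-continue",
    },
    auth=("root", ""),
)
print(resp.json())
```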

Conclusion

In the future, as digitalization deepens, User-Facing Analytics demands will become more diverse, evolving from real-time to instant and expanding from single-dimensional analysis to multi-scenario linked insights. Technical architectures such as Kafka + Flink + Doris will remain core enablers thanks to their scalability, flexibility, and scenario adaptability. Ultimately, the goal of User-Facing Analytics is not stacking technologies but making data a truly inclusive tool, empowering every user and every business link to achieve fully data-driven decision-making.
