<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Darren XU</title>
    <description>The latest articles on DEV Community by Darren XU (@darren_xu).</description>
    <link>https://dev.to/darren_xu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2824151%2Fd8382dbd-7674-48fa-9ffa-455e5bcf75ab.png</url>
      <title>DEV Community: Darren XU</title>
      <link>https://dev.to/darren_xu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/darren_xu"/>
    <language>en</language>
    <item>
      <title>Best Practices for Syncing Hive Data to Apache Doris — From Scenario Matching to Performance Tuning</title>
      <dc:creator>Darren XU</dc:creator>
      <pubDate>Mon, 19 May 2025 08:31:41 +0000</pubDate>
      <link>https://dev.to/darren_xu/best-practices-for-syncing-hive-data-to-apache-doris-from-scenario-matching-to-performance-tuning-299m</link>
      <guid>https://dev.to/darren_xu/best-practices-for-syncing-hive-data-to-apache-doris-from-scenario-matching-to-performance-tuning-299m</guid>
      <description>&lt;p&gt;Hive to Apache Doris Data Synchronization: A Comprehensive Guide&lt;/p&gt;

&lt;p&gt;In the realm of big data, Hive has long been a cornerstone for massive data warehousing and offline processing, while Apache Doris shines in real-time analytics and ad-hoc query scenarios with its robust OLAP capabilities. When enterprises aim to combine Hive's storage prowess with Doris's analytical agility, the challenge lies in efficiently and reliably syncing data between these two systems. This article provides a comprehensive guide to Hive-to-Doris data synchronization, covering use cases, technical solutions, model design, and performance optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  I. Core Use Cases and Scope
&lt;/h2&gt;

&lt;p&gt;When target data resides in a Hive data warehouse and requires accelerated analysis via Doris's OLAP capabilities, key scenarios include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reporting &amp;amp; Ad-Hoc Queries&lt;/strong&gt;: Enable fast analytics through synchronization or federated queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unified Data Warehouse Construction&lt;/strong&gt;: Build layered data models in Doris to enhance query efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Federated Query Acceleration&lt;/strong&gt;: Directly access Hive tables from Doris to avoid frequent data ingestion.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  II. Technical Pathways and Synchronization Modes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  (1) Synchronization Mode
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Full/Incremental Sync&lt;/strong&gt;: Suitable for low-update-frequency scenarios (e.g., log data, dimension tables) where a complete data model is needed in Doris.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Federated Query Mode&lt;/strong&gt;: Ideal for high-frequency, small-data-volume scenarios (e.g., real-time pricing data) to reduce storage costs and ingestion latency by querying Hive directly from Doris.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  (2) Technical Solutions Overview
&lt;/h3&gt;

&lt;p&gt;Four mainstream approaches exist (Broker Load, Doris On Hive, Spark Load, and DataX), chosen based on data volume, update frequency, and ETL complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  III. In-Depth Analysis of Four Synchronization Solutions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  (1) Broker Load: Asynchronous Sync for Large Datasets
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Core Principle&lt;/strong&gt;: Leverage Doris's built-in Broker service to asynchronously load data from HDFS (where Hive data resides) into Doris, supporting full and incremental modes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Suitable for datasets ranging from tens to hundreds of GB, stored in HDFS accessible by Doris.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Performance&lt;/strong&gt;: Syncing a 5.8GB SSB dataset (60M rows) takes 140–164 seconds, achieving 370k–420k rows/sec (cluster-dependent).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Key Operations&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Table Optimization&lt;/strong&gt;: Temporarily set &lt;code&gt;replication_num=1&lt;/code&gt; during ingestion for speed, then adjust to 3 replicas for durability.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Partition Conversion&lt;/strong&gt;: Convert Hive partition fields (e.g., &lt;code&gt;yyyymm&lt;/code&gt;) to Doris-compatible date types using &lt;code&gt;str_to_date&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;HA Configuration&lt;/strong&gt;: Include namenode addresses in &lt;code&gt;WITH BROKER&lt;/code&gt; for HDFS high-availability setups.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
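
&lt;p&gt;The operations above can be sketched as a single Broker Load statement. This is an illustrative sketch only: the label, table, column names, broker name, and namenode addresses are assumptions, not values from a real cluster.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOAD LABEL example_db.label_lineorder_202405
(
    DATA INFILE("hdfs://nameservice1/user/hive/warehouse/ssb.db/lineorder/*")
    INTO TABLE lineorder
    (lo_orderkey, lo_custkey, lo_revenue, yyyymm)
    -- convert the Hive partition field to a Doris date column
    SET (order_month = str_to_date(yyyymm, '%Y%m'))
)
WITH BROKER "hdfs_broker"
(
    -- HA configuration: list both namenodes
    "dfs.nameservices" = "nameservice1",
    "dfs.ha.namenodes.nameservice1" = "nn1,nn2",
    "dfs.namenode.rpc-address.nameservice1.nn1" = "namenode01:8020",
    "dfs.namenode.rpc-address.nameservice1.nn2" = "namenode02:8020",
    "dfs.client.failover.proxy.provider.nameservice1" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)
PROPERTIES ("timeout" = "3600");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;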

&lt;h3&gt;
  
  
  (2) Doris On Hive: Low-Latency Federated Queries
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Core Principle&lt;/strong&gt;: Use a Catalog to access Hive metadata, enabling direct queries or &lt;code&gt;INSERT INTO SELECT&lt;/code&gt; syncs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Small datasets (e.g., pricing tables) with frequent updates (minute-level), no pre-aggregation needed in Doris.&lt;/li&gt;
&lt;li&gt;  Supports Text, Parquet, ORC formats (Hive ≥2.3.7).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  No data landing in Doris; direct join queries between Hive and Doris tables with sub-0.2-second latency.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
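
&lt;p&gt;As a sketch of the Catalog approach (the metastore address, database, and table names below are placeholders, not real endpoints):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Register the Hive metastore as a catalog in Doris
CREATE CATALOG hive_catalog PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://hive-metastore-host:9083"
);

-- Query the Hive table directly (no data landing in Doris)
SELECT COUNT(*) FROM hive_catalog.ssb.lineorder;

-- Or materialize it into a Doris table when needed
INSERT INTO internal.test.lineorder
SELECT * FROM hive_catalog.ssb.lineorder;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;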

&lt;h3&gt;
  
  
  (3) Spark Load: Performance Acceleration for Complex ETL
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Core Principle&lt;/strong&gt;: Offload data preprocessing to an external Spark cluster, reducing Doris's computational pressure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Complex data cleaning (e.g., multi-table JOINs, field transformations) with Spark accessing HDFS.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Performance&lt;/strong&gt;: 5.8GB synced in 137 seconds (440k rows/sec), outperforming Broker Load.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Spark Settings&lt;/strong&gt;: Update the Doris FE config (&lt;code&gt;fe.conf&lt;/code&gt;) with &lt;code&gt;spark_home_default_dir&lt;/code&gt; and &lt;code&gt;spark_resource_path&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;enable_spark_load = true 

spark_home_default_dir = /opt/cloudera/parcels/CDH/lib/spark 

spark_resource_path = /opt/cloudera/parcels/CDH/lib/spark/spark-2x.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;External Resource Creation&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTERNAL RESOURCE "spark0"
PROPERTIES
(
"type" = "spark",
"spark.master" = "yarn",
"spark.submit.deployMode" = "cluster",
"spark.executor.memory" = "1g",
"spark.yarn.queue" = "queue0",
"spark.hadoop.yarn.resourcemanager.address" = "hdfs://nodename:8032",
"spark.hadoop.fs.defaultFS" = "hdfs://nodename:8020",
"working_dir" = "hdfs://nodename:8020/tmp/doris",
"broker" = "broker_name_1"
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
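
&lt;p&gt;With the resource in place, a Spark Load job reuses the familiar &lt;code&gt;LOAD LABEL&lt;/code&gt; syntax, pointing at the resource instead of a broker. The label, path, and table below are illustrative assumptions based on the names already used in this article:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOAD LABEL test.label_lineorder_spark
(
    DATA INFILE("hdfs://nodename:8020/data/ssb/lineorder/*")
    INTO TABLE lineorder3
)
WITH RESOURCE 'spark0'
(
    -- per-job overrides of the resource defaults
    "spark.executor.memory" = "2g"
)
PROPERTIES ("timeout" = "3600");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;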



&lt;h3&gt;
  
  
  (4) DataX: Heterogeneous Data Source Compatibility
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Core Principle&lt;/strong&gt;: Use Alibaba's open-source DataX tool with custom &lt;code&gt;hdfsreader&lt;/code&gt; and &lt;code&gt;doriswriter&lt;/code&gt; plugins.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Non-standard file formats (e.g., CSV) or non-HA HDFS environments.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Drawback&lt;/strong&gt;: Lower performance (5.8GB in 1,421 seconds, 40k rows/sec) – use as a fallback.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Configuration Example&lt;/strong&gt;:&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ 

 "job": { 

   "content": [ 

     { 

       "reader": { 

         "name": "hdfsreader", 

         "parameter": { 

           "path": "/data/ssb/*", 

           "defaultFS": "hdfs://xxxx:9000", 

           "fileType": "text" 
         } 

       }, 

       "writer": { 

         "name": "doriswriter", 

         "parameter": { 

           "feLoadUrl": ["xxxx:18040"], 

           "database": "test", 

           "table": "lineorder3" 

         } 

       } 

     } 

   ] 

 } 

} 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  IV. Decision Tree for Solution Selection
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Priority&lt;/strong&gt;: Broker Load – Large datasets (≥10GB), minimal ETL, high throughput needs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Second Choice&lt;/strong&gt;: Doris On Hive – Small datasets (&amp;lt;1GB), frequent updates, federated query requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complex ETL&lt;/strong&gt;: Spark Load – Data preprocessing needed; leverage Spark cluster resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fallback&lt;/strong&gt;: DataX – Special formats or network constraints; prioritize compatibility over performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  V. Data Modeling and Storage Optimization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  (1) Data Model Selection
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Aggregate Model&lt;/strong&gt;: Ideal for log statistics; stores aggregated metrics by key to reduce data volume.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unique Model&lt;/strong&gt;: Ensures key uniqueness for slowly changing dimensions (equivalent to &lt;code&gt;Replace&lt;/code&gt; in Aggregate).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Duplicate Model&lt;/strong&gt;: Stores raw data for multi-dimensional analysis without aggregation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
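
&lt;p&gt;For illustration, the three models differ mainly in the key declaration. The tables and columns below are hypothetical examples, not schemas from the article's benchmark:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Aggregate model: metrics rolled up by key columns
CREATE TABLE site_pv (
    event_day DATE,
    site_id INT,
    pv BIGINT SUM DEFAULT "0"
)
AGGREGATE KEY(event_day, site_id)
DISTRIBUTED BY HASH(site_id) BUCKETS 10;

-- Unique model: latest row wins per key (slowly changing dimensions)
CREATE TABLE dim_user (
    user_id BIGINT,
    city VARCHAR(64)
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

-- Duplicate model: raw rows kept as-is for multi-dimensional analysis
CREATE TABLE raw_events (
    event_time DATETIME,
    event_type VARCHAR(32),
    payload VARCHAR(1024)
)
DUPLICATE KEY(event_time, event_type)
DISTRIBUTED BY HASH(event_type) BUCKETS 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;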

&lt;h3&gt;
  
  
  (2) Data Type Mapping
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;String to Varchar&lt;/strong&gt;: Use Varchar for Doris key columns (avoid String); reserve 3x the Hive field length, since Chinese characters take up to 3 bytes in UTF-8.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Type Consistency&lt;/strong&gt;: Convert Hive dates to Doris &lt;code&gt;Date/DateTime&lt;/code&gt; and numeric types to &lt;code&gt;Decimal/Float&lt;/code&gt; to avoid query-time conversions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  (3) Partitioning &amp;amp; Bucketing Strategies
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partition Keys&lt;/strong&gt;: Reuse Hive partition fields (e.g., year-month) converted via &lt;code&gt;str_to_date&lt;/code&gt; for pruning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bucket Keys&lt;/strong&gt;: Choose high-cardinality fields (e.g., order ID); keep single bucket size under 10GB to avoid skew and segment limits (default ≤200).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
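
&lt;p&gt;A partitioned and bucketed table following these rules might look like this; the columns and partition boundaries are illustrative assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE lineorder (
    order_date DATE,
    lo_orderkey BIGINT,
    lo_revenue BIGINT
)
DUPLICATE KEY(order_date, lo_orderkey)
-- partition key reuses the (converted) Hive partition field for pruning
PARTITION BY RANGE(order_date) (
    PARTITION p202401 VALUES LESS THAN ("2024-02-01"),
    PARTITION p202402 VALUES LESS THAN ("2024-03-01")
)
-- high-cardinality bucket key; size buckets so each stays under ~10GB
DISTRIBUTED BY HASH(lo_orderkey) BUCKETS 32;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;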

&lt;h2&gt;
  
  
  VI. Performance Comparison and Best Practices
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;th&gt;5.8GB Sync Time&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Query Latency&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Broker Load&lt;/td&gt;
&lt;td&gt;140–164&lt;/td&gt;
&lt;td&gt;370k–420k rows/&lt;/td&gt;
&lt;td&gt;0.2–0.5&lt;/td&gt;
&lt;td&gt;Large-scale full sync&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spark Load&lt;/td&gt;
&lt;td&gt;137&lt;/td&gt;
&lt;td&gt;440k rows/&lt;/td&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;td&gt;ETL-intensive sync&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Doris On Hive&lt;/td&gt;
&lt;td&gt;Immediate&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;0.2–0.4&lt;/td&gt;
&lt;td&gt;High-frequency federated querie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DataX&lt;/td&gt;
&lt;td&gt;1,421&lt;/td&gt;
&lt;td&gt;40k rows/&lt;/td&gt;
&lt;td&gt;1–3&lt;/td&gt;
&lt;td&gt;Special format compatibility&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Optimization Tips:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Small File Merging&lt;/strong&gt;: Use HDFS commands to merge small files and reduce Broker Load scanning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model Tuning&lt;/strong&gt;: Use Duplicate model for fast ingestion, then create materialized views for query speed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;: Track load status with &lt;code&gt;SHOW LOAD&lt;/code&gt; in Doris.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
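
&lt;p&gt;For instance, the status of a load job can be checked like this (the database and label are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- inspect the most recent load job for a given label
SHOW LOAD FROM test WHERE LABEL = "label_lineorder_202405" ORDER BY CreateTime DESC LIMIT 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;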

&lt;h2&gt;
  
  
  VII. Conclusion
&lt;/h2&gt;

&lt;p&gt;Combining Hive and Doris unlocks synergies between offline storage and real-time analytics. By choosing the right sync strategy (prioritizing Broker/Spark Load), optimizing data models (Aggregate for storage, bucketing for skew), and leveraging federated queries (Doris On Hive), enterprises can build efficient data architectures. Test with small datasets (e.g., SSB) before scaling to production, and stay updated with Doris community improvements (e.g., predicate pushdown) for ongoing performance gains.&lt;/p&gt;

&lt;p&gt;If you want more information about Doris or help with it, you can join us on Slack:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-31dcopb90-zqBVqBrOIYhmy4U29fv9yQ" rel="noopener noreferrer"&gt;https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-31dcopb90-zqBVqBrOIYhmy4U29fv9yQ&lt;/a&gt;&lt;/p&gt;

</description>
      <category>doris</category>
      <category>database</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Doris: Breaking Down the Barriers of SQL Dialects and Building a Unified Data Query Ecosystem</title>
      <dc:creator>Darren XU</dc:creator>
      <pubDate>Tue, 15 Apr 2025 08:51:52 +0000</pubDate>
      <link>https://dev.to/darren_xu/doris-breaking-down-the-barriers-of-sql-dialects-and-building-a-unified-data-query-ecosystem-37k1</link>
      <guid>https://dev.to/darren_xu/doris-breaking-down-the-barriers-of-sql-dialects-and-building-a-unified-data-query-ecosystem-37k1</guid>
      <description>&lt;p&gt;In the realm of big data, different database systems frequently employ distinct SQL dialects. This is analogous to people from various regions speaking different languages, posing substantial challenges to data analysts and developers. When enterprises need to integrate multiple data sources for analysis, they often have to invest a great deal of time and effort in switching between different SQL syntaxes. However, Apache Doris, with its robust SQL dialect compatibility capabilities, has shattered this barrier and constructed a unified data query ecosystem for users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3x0y98w9qgirrfqen3yn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3x0y98w9qgirrfqen3yn.png" alt="Image description" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Dialect Compatibility: The "Universal Language" in a Complex Data Environment
&lt;/h2&gt;

&lt;p&gt;In today's enterprise data architectures, it is a common occurrence for data to be dispersed across multiple database systems. These database systems each have their own characteristics. For instance, MySQL is commonly utilized for online transaction processing (OLTP), excelling in high-concurrency writing and transaction handling. Hive, on the other hand, is a leading force in big data offline analysis, capable of processing vast amounts of data. The use of different SQL dialects by these diverse database systems makes it extremely difficult for data analysts and developers to query and integrate data across systems.&lt;/p&gt;

&lt;p&gt;The SQL dialect compatibility feature of Apache Doris is like a master interpreter, enabling users to communicate freely between different database systems. Doris not only supports standard SQL syntax but also is compatible with the SQL dialects of multiple mainstream databases, significantly reducing the learning and usage costs. Users no longer need to worry about the syntax differences among different database systems and can effortlessly query and analyze data from multiple data sources through Doris.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Doris Achieves SQL Dialect Compatibility
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The "Intelligent Collaboration" of the Parser and Optimizer
&lt;/h3&gt;

&lt;p&gt;Doris realizes support for multiple SQL dialects through its unique parser and optimizer design. When a user submits an SQL query, the parser first conducts lexical and syntactic analysis on the query statement, transforming it into an abstract syntax tree (AST). During this process, the parser can identify the syntax structures of different dialects and handle them appropriately.&lt;/p&gt;

&lt;p&gt;Subsequently, the optimizer optimizes the abstract syntax tree. It generates an efficient execution plan based on the query semantics and data distribution. In this process, the optimizer fully takes into account the characteristics of different data sources and selects the optimal query strategy, ensuring that the query can be executed efficiently on different data sources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqhepfxb52f2ua5zvzzh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqhepfxb52f2ua5zvzzh.png" alt="Image description" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The "Seamless Connection" of Metadata Management
&lt;/h3&gt;

&lt;p&gt;To achieve unified querying of different data sources, Doris has established a comprehensive metadata management mechanism. It can automatically discover and synchronize the metadata information of multiple data sources, including table structures, field types, indexes, etc. In this way, when users query data in Doris, it is as convenient as querying local tables, and they do not need to be concerned about the actual storage location of the data.&lt;/p&gt;

&lt;p&gt;Moreover, Doris's metadata management mechanism also supports real-time updates, ensuring that users can always obtain the latest data source information. This provides great convenience for users, enabling them to respond promptly to business changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analysis of Practical Application Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Replacing the Original OLAP System with Doris
&lt;/h3&gt;

&lt;p&gt;For example, suppose the original systems are Trino and ClickHouse and the enterprise switches to Doris. The upstream business carries a large amount of existing SQL logic, and requiring the business side to rewrite it in a new dialect would be extremely costly, so the business hopes to keep using the original SQL dialect when querying Doris.&lt;/p&gt;
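
&lt;p&gt;In recent Doris versions this is exposed as a session variable. The sketch below assumes the Trino dialect is available in your version; supported dialect names vary by release, so check the documentation for your deployment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- switch the current session to the Trino dialect
SET sql_dialect = "trino";

-- existing Trino-flavored SQL can now be submitted to Doris unchanged
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;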

&lt;h3&gt;
  
  
  2. Unified SQL Entrance
&lt;/h3&gt;

&lt;p&gt;Doris serves as a unified entrance for OLAP. Users may query Hive tables through Doris and hope to use the SQL dialects of Hive or Spark.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Query Degradation
&lt;/h3&gt;

&lt;p&gt;Users use Doris as a high-speed query engine. However, if some queries are not supported or fail (such as due to insufficient memory), the SQL needs to be downgraded and routed to, for example, a Spark cluster for execution. In such cases, users hope to uniformly use the Spark dialect, first send it to Doris, and if it fails, directly send it to Spark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advantages of Achieving SQL Dialect Compatibility with Doris
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Reducing the Technical Threshold
&lt;/h3&gt;

&lt;p&gt;For data analysts and developers, the SQL dialect compatibility feature of Doris reduces the learning and usage costs. They do not need to spend a significant amount of time learning the SQL syntaxes of different database systems and can easily query and analyze data from multiple data sources through Doris. This allows them to focus more on business analysis and improve work efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Improving Data Integration Efficiency
&lt;/h3&gt;

&lt;p&gt;Doris breaks down the barriers between different database systems, enabling rapid data integration and analysis. Enterprises can establish a unified data query platform through Doris, allowing personnel from different departments to easily obtain the required data, promoting data sharing and utilization, and providing strong support for enterprise decision-making.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Ensuring Business Continuity
&lt;/h3&gt;

&lt;p&gt;In the process of continuous evolution of enterprise data architectures, the SQL dialect compatibility feature of Doris provides assurance for business continuity. Even if enterprises replace or add new data sources, Doris can still seamlessly connect, ensuring that data querying and analysis are not affected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The SQL dialect compatibility feature of Apache Doris offers an efficient and convenient data query solution for enterprises in a complex data environment. It breaks down the barriers of SQL dialects, allowing data to flow freely and injecting powerful impetus into the digital transformation of enterprises. It is believed that in the future, with the continuous development and improvement of Doris, it will play an even more important role in more fields and help enterprises maximize the value of data.&lt;/p&gt;

&lt;p&gt;If you are interested in the &lt;a href="https://doris.apache.org/zh-CN/docs/lakehouse/sql-dialect" rel="noopener noreferrer"&gt;SQL dialect&lt;/a&gt; compatibility feature of Doris, you might as well give it a try and experience the convenience and efficiency it brings!&lt;/p&gt;

</description>
      <category>sql</category>
      <category>doris</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Is Storage-Computing Separation Really Necessary? From the Architectural Debate to the Practical Analysis of Doris</title>
      <dc:creator>Darren XU</dc:creator>
      <pubDate>Mon, 31 Mar 2025 10:00:39 +0000</pubDate>
      <link>https://dev.to/darren_xu/is-storage-computing-separation-really-necessary-from-the-architectural-debate-to-the-practical-j4i</link>
      <guid>https://dev.to/darren_xu/is-storage-computing-separation-really-necessary-from-the-architectural-debate-to-the-practical-j4i</guid>
      <description>&lt;h1&gt;
  
  
  Introduction: A Decade-Long Debate on “Storage and Computing”
&lt;/h1&gt;

&lt;p&gt;In the field of databases and big data, the architectural debate between “storage-computing integration” and “storage-computing separation” has never ceased. Some people question, “Is storage-computing separation really necessary? Isn’t the performance of local disks sufficient?” The answer is not black and white — the key to technology selection lies in the precise matching of business scenarios and resource requirements. This article takes Apache Doris as an example to analyze the essential differences, advantages and disadvantages, and implementation scenarios of the two architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  I. Storage-Computing Integration vs. Storage-Computing Separation: Core Concepts and Evolution Logic
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.Storage-Computing Integration: The Tightly-Coupled “All-Rounder”
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;: Data storage and computing resources are bound to the same node (such as a local disk + server), and local reading and writing are used to reduce network overhead. Typical examples include the early architecture of Hadoop and traditional OLTP databases.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaiydk173j7otzwckad3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaiydk173j7otzwckad3.png" alt="Image description" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Historical Origin&lt;/strong&gt;: In the early days of IT systems, the data volume was small (such as IBM mainframes in the 1960s), and a single machine could meet the storage and computing requirements, naturally forming a storage-computing integration architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.Storage-Computing Separation: The Decoupled “Perfect Partners”
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;: The storage layer (such as object storage, HDFS) and the computing layer (such as cloud servers, container clusters) are independently scalable and connected through a high-speed network to achieve data sharing. Typical representatives include the cloud-native database Snowflake and the storage-computing separation mode of Doris.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fennvs37xroek9hcw9iav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fennvs37xroek9hcw9iav.png" alt="Image description" width="800" height="843"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Driving Forces&lt;/strong&gt;: Exponential growth of data volume, elastic requirements of cloud computing, and fine-grained cost control.&lt;/p&gt;

&lt;h2&gt;
  
  
  II. Architectural Duel: The Ultimate Game of Performance, Cost, and Elasticity
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.Advantages and Shortcomings of Storage-Computing Integration
&lt;/h3&gt;

&lt;h4&gt;
  
  
  (1)Advantages
&lt;/h4&gt;

&lt;p&gt;Minimal Deployment: It does not rely on external storage systems and can run on a single machine, which is suitable for quick trials or small to medium-scale scenarios (for example, the storage-computing integration mode of Doris only requires the deployment of FE/BE processes).&lt;/p&gt;

&lt;p&gt;Ultimate Performance: Local reading and writing reduce network latency, making it suitable for high-concurrency, low-latency scenarios (for example, in the YCSB benchmark, Doris in storage-computing integration mode can reach 30,000 QPS with a 99th-percentile latency as low as 0.6 ms).&lt;/p&gt;

&lt;h4&gt;
  
  
  (2)Shortcomings
&lt;/h4&gt;

&lt;p&gt;Inflexible Expansion: Storage and computing need to be scaled simultaneously, which is likely to cause resource waste (for example, the CPU is idle while the disk is full).&lt;/p&gt;

&lt;p&gt;High Cost: The price of local SSD disks is high, and redundant backups increase hardware investment (for example, the storage-computing integration version of Doris requires three copies to ensure high data reliability).&lt;/p&gt;

&lt;h3&gt;
  
  
  2.Breakthroughs and Challenges of Storage-Computing Separation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  (1)Advantages
&lt;/h4&gt;

&lt;p&gt;Elastic Scalability: Computing resources can be scaled on demand, and storage can be independently expanded (for example, the computing group of Doris can dynamically add or remove nodes).&lt;/p&gt;

&lt;p&gt;Cost Optimization: Shared storage (such as object storage) can cost as little as one-third as much as local disks and supports tiered management of hot and cold data.&lt;/p&gt;

&lt;p&gt;High Availability: The storage layer has independent disaster recovery, and there is no risk of data loss in case of computing node failures.&lt;/p&gt;

&lt;h4&gt;
  
  
  (2)Challenges
&lt;/h4&gt;

&lt;p&gt;Network Bottleneck: Remote reading and writing may introduce latency (relying on intelligent caching optimization).&lt;/p&gt;

&lt;p&gt;Operation and Maintenance Complexity: It is necessary to manage shared storage (such as HDFS, S3) and network stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  III. Scenarios Matter: How to Choose the Most Suitable Architecture?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.The “Main Battlefield” of Storage-Computing Integration
&lt;/h3&gt;

&lt;p&gt;Small to Medium-Scale Real-Time Analysis: The data volume is within the TB level, and low latency is pursued (such as the high-concurrency query scenario of Doris).&lt;/p&gt;

&lt;p&gt;Independent Business Lines: There is no dedicated DBA team, and simple operation and maintenance are required (such as start-ups trying out data analysis).&lt;/p&gt;

&lt;p&gt;No Dependence on a Cloud Environment: On-premises deployment without reliable shared storage resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.The “Killer Scenarios” of Storage-Computing Separation
&lt;/h3&gt;

&lt;p&gt;Cloud Native and Elastic Requirements: In public cloud or hybrid cloud environments where pay-as-you-go billing is required (for example, the cloud-native version of Doris supports K8s containerization).&lt;/p&gt;

&lt;p&gt;Massive Data Lake Warehouses: PB-level data storage, and multiple computing clusters share the same data source (such as financial risk control, e-commerce user portraits).&lt;/p&gt;

&lt;p&gt;Cost-Sensitive Businesses: Archiving historical data, low-cost storage of cold data (such as the hot and cold layering technology of Doris).&lt;/p&gt;

&lt;h2&gt;
  
  
  IV. Practical Insights from Doris: Can You Have Your Cake and Eat It Too?
&lt;/h2&gt;

&lt;p&gt;As a new-generation real-time analysis database, Apache Doris supports both storage-computing integration and storage-computing separation modes, becoming a benchmark for architectural flexibility:&lt;/p&gt;

&lt;h3&gt;
  
  
  1.Storage-Computing Integration Mode
&lt;/h3&gt;

&lt;p&gt;Applicable Scenarios: Development and testing, small to medium-scale real-time analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.Storage-Computing Separation Mode
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Technical Highlights
&lt;/h4&gt;

&lt;p&gt;Shared Storage: Supports HDFS/S3, decoupling the main data storage from computing nodes.&lt;/p&gt;

&lt;p&gt;Local Cache: BE nodes cache hot data to offset network latency.&lt;/p&gt;
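
&lt;p&gt;As a rough sketch, in the compute-storage decoupled mode the local cache is typically configured on the BE side in &lt;code&gt;be.conf&lt;/code&gt;. The keys, path, and size below are assumptions for illustration and should be verified against the documentation for your Doris version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# be.conf: enable a local file cache for hot data (~100 GB on one disk)
enable_file_cache = true
file_cache_path = [{"path":"/mnt/disk1/doris_cache","total_size":107374182400}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;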

&lt;p&gt;&lt;strong&gt;Doris case&lt;/strong&gt;: Slash your cost by 90% with Apache Doris Compute-Storage Decoupled Mode&lt;/p&gt;

&lt;h2&gt;
  
  
  V. Conclusion: There Is No Absolutely Optimal, Only the Most Suitable Match
&lt;/h2&gt;

&lt;p&gt;Storage-computing separation is not a “panacea”, and storage-computing integration is not an “outdated product”. Technical decisions should return to the essence of the business:&lt;/p&gt;

&lt;p&gt;Choose Storage-Computing Integration: When performance is sensitive, the data scale is controllable, and operation and maintenance resources are limited.&lt;/p&gt;

&lt;p&gt;Embrace Storage-Computing Separation: When cost and elasticity are the core requirements and a cloud-native technology stack is available.&lt;/p&gt;

&lt;p&gt;In the future, with breakthroughs in storage networking (such as RDMA) and intelligent caching technologies, the “performance ceiling” of storage-computing separation will be pushed even higher. The continuous evolution of open-source technologies such as Doris is providing more possibilities for this architectural debate.&lt;/p&gt;

</description>
      <category>doris</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>A Deep Dive into Apache Doris Indexes</title>
      <dc:creator>Darren XU</dc:creator>
      <pubDate>Mon, 31 Mar 2025 09:48:05 +0000</pubDate>
      <link>https://dev.to/darren_xu/a-deep-dive-into-apache-doris-indexes-3fi8</link>
      <guid>https://dev.to/darren_xu/a-deep-dive-into-apache-doris-indexes-3fi8</guid>
      <description>&lt;h1&gt;
  
  
  A Deep Dive into Apache Doris Indexes
&lt;/h1&gt;

&lt;p&gt;Developers in the big data field know that quickly retrieving data from a vast amount of information is like searching for a specific star in the constellations—extremely challenging. But don't worry! Database indexes are our “positioning magic tools,” capable of significantly boosting query efficiency.&lt;/p&gt;

&lt;p&gt;Take Apache Doris, a popular analytical database, for example. It supports several types of indexes, each with its own unique features, enabling it to excel in various query scenarios. Today, let's explore Apache Doris indexes in detail and uncover the secrets behind their remarkable performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  I. Classification and Principles of Indexes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  (A) Point Query Indexes: Precise Targeting of Data Points
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Prefix Indexes: Shortcuts on Sorted Keys
&lt;/h4&gt;

&lt;p&gt;Apache Doris stores data in an ordered structure similar to SSTable, sorted by specified columns. For the three data models—Aggregate, Unique, and Duplicate—when creating a table, they are sorted based on the Aggregate Key, Unique Key, and Duplicate Key specified in the table creation statements.&lt;/p&gt;

&lt;p&gt;These sorted keys are like the category labels on a well-organized bookshelf. Prefix indexes, in turn, are sparse indexes built on top of these sorted keys.&lt;/p&gt;

&lt;p&gt;Imagine that every 1,024 rows of data form a logical data block, similar to a partition on a bookshelf. Each partition has an index entry in the prefix index table. This index entry serves as a “mini-directory” for the partition, with its content being the prefix formed by the sorted columns of the first row in the partition.&lt;/p&gt;

&lt;p&gt;When queries involve these sorted columns, the system can quickly locate the relevant data block through this compact “directory,” much like finding the required book category through the partition directory on a bookshelf. This significantly reduces the search range and accelerates queries.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note ⚠️ The length of Doris prefix indexes does not exceed 36 bytes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For example, if the sorted columns of a table are &lt;code&gt;user_id&lt;/code&gt; (8 bytes), &lt;code&gt;age&lt;/code&gt; (4 bytes), and &lt;code&gt;message&lt;/code&gt; (VARCHAR(100)), the prefix index might consist of &lt;code&gt;user_id&lt;/code&gt; + &lt;code&gt;age&lt;/code&gt; + the first 20 bytes of &lt;code&gt;message&lt;/code&gt; (as long as the total length does not exceed 36 bytes).&lt;/p&gt;

&lt;p&gt;When the query condition is &lt;code&gt;SELECT * FROM table WHERE user_id = 1829239 and age = 20;&lt;/code&gt;, the prefix index can quickly locate the logical data block containing the matching data. The query efficiency is much higher than &lt;code&gt;SELECT * FROM table WHERE age = 20;&lt;/code&gt; because the latter cannot effectively utilize the prefix index.&lt;/p&gt;
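
&lt;p&gt;As a minimal sketch of the layout above (the table name, bucket count, and replication setting are illustrative, not from the article), such a table could be declared like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- user_id, age, and message form the sort key, so the prefix index
-- covers them up to the first 36 bytes of the Key columns.
CREATE TABLE user_messages (
    user_id BIGINT,
    age INT,
    message VARCHAR(100)
)
DUPLICATE KEY(user_id, age, message)
DISTRIBUTED BY HASH(user_id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- Can use the prefix index (filters on the leading Key columns):
SELECT * FROM user_messages WHERE user_id = 1829239 AND age = 20;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;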

&lt;h4&gt;
  
  
  Inverted Indexes: Keyword Locators for Information Retrieval
&lt;/h4&gt;

&lt;p&gt;Since version 2.0.0, Doris has introduced inverted indexes, a powerful tool that plays a crucial role in the field of information retrieval. In the world of Doris, a row in a table is like a document, and a column is a field within the document.&lt;/p&gt;

&lt;p&gt;Inverted indexes are like highly efficient “keyword locators,” breaking down text into individual words and constructing an index from words to document numbers (i.e., rows in the table).&lt;/p&gt;

&lt;p&gt;For example, for a table containing user comments, after creating an inverted index on the comment column, when we want to query comments containing a specific keyword (such as “OLAP”), the inverted index can quickly locate the rows containing the keyword.&lt;/p&gt;

&lt;p&gt;It not only accelerates full-text retrieval on string types, supporting keyword matching methods such as matching all keywords (MATCH_ALL), matching any keyword (MATCH_ANY), and phrase queries (MATCH_PHRASE), but also accelerates ordinary equality and range queries, replacing the earlier BITMAP index functionality.&lt;/p&gt;

&lt;p&gt;In terms of storage, inverted indexes use independent files, physically separated from data files. This allows indexes to be created and deleted without rewriting data files, greatly reducing processing overhead.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note ⚠️ Floating-point types with precision issues (FLOAT and DOUBLE) and some complex data types (such as MAP and STRUCT) do not currently support inverted indexes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  (B) Skip Indexes: Smartly Skipping “Irrelevant Data Blocks”
&lt;/h3&gt;

&lt;h4&gt;
  
  
  ZoneMap Indexes: Statistical Detectives for Data Blocks
&lt;/h4&gt;

&lt;p&gt;ZoneMap indexes are like silent “statistical detectives,” automatically maintaining statistical information for each column. For each data file (Segment) and data block (Page), they record the maximum value, minimum value, and whether there are NULL values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When performing equality queries, range queries, or IS NULL queries&lt;/strong&gt;, they can quickly determine whether the data file or data block is likely to contain data that meets the conditions based on this statistical information. If it is determined not to contain such data, just like a detective eliminating an irrelevant clue, the file or data block is skipped without being read, reducing I/O operations and accelerating queries.&lt;/p&gt;

&lt;p&gt;For example, in a table containing user ages, when querying data within a certain age range, the ZoneMap index can quickly exclude data blocks that clearly do not meet the conditions based on the maximum and minimum ages of the data blocks, improving query efficiency.&lt;/p&gt;

&lt;h4&gt;
  
  
  BloomFilter Indexes: Probabilistic Fast Sieves
&lt;/h4&gt;

&lt;p&gt;BloomFilter indexes are skip indexes based on the BloomFilter algorithm, acting like highly efficient “fast sieves.” A BloomFilter is a space-efficient probabilistic data structure consisting of a long binary array and a series of hash functions.&lt;/p&gt;

&lt;p&gt;In Doris, BloomFilter indexes are constructed on a per-data-block (Page) basis. When writing data, each value in the data block is hashed and recorded in the corresponding BloomFilter. During queries, the value in the equality condition is tested against the BloomFilter; if it is definitely absent, the corresponding data block is skipped.&lt;/p&gt;

&lt;p&gt;For example, in a table containing a large number of user IDs, after creating a BloomFilter index on the user ID column, when querying for a specific user ID, if the BloomFilter determines that the user ID is not in the BloomFilter corresponding to a certain data block, the data block can be skipped without being read, greatly reducing I/O.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note ⚠️ It is only effective for IN and = equality queries. It does not support TINYINT, FLOAT, or DOUBLE columns, and has limited acceleration effect on low-cardinality fields.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  NGram BloomFilter Indexes: Boosters for Text LIKE Queries
&lt;/h4&gt;

&lt;p&gt;NGram BloomFilter indexes are specifically designed for text LIKE queries, serving as “boosters” for text queries. They are similar to BloomFilter indexes, but instead of storing the original text values in the BloomFilter, each word obtained by NGram tokenization of the text is stored.&lt;/p&gt;

&lt;p&gt;For LIKE queries, the LIKE pattern is also tokenized using NGram, and it is determined whether each word is in the BloomFilter. If a word is not present, the corresponding data block does not meet the LIKE condition, and the data block can be skipped.&lt;/p&gt;

&lt;p&gt;For example, in a table storing product descriptions, after creating an NGram BloomFilter index on the description column, when querying for product descriptions containing a specific phrase (such as “super awesome”), it can quickly filter out data blocks that may contain the phrase, accelerating the query.&lt;/p&gt;

&lt;p&gt;However, it only supports string columns, and the LIKE pattern must contain at least N consecutive characters, where N is the gram size defined in the index.&lt;/p&gt;

&lt;h2&gt;
  
  
  II. Detailed Comparison of Index Characteristics
&lt;/h2&gt;

&lt;p&gt;Different types of indexes have their own advantages and limitations. Let's compare them intuitively through the following table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Index&lt;/th&gt;
&lt;th&gt;Advantages&lt;/th&gt;
&lt;th&gt;Limitations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Point Query Indexes&lt;/td&gt;
&lt;td&gt;Prefix Indexes&lt;/td&gt;
&lt;td&gt;Built-in index, best performance&lt;/td&gt;
&lt;td&gt;Only one set of prefix indexes per table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Point Query Indexes&lt;/td&gt;
&lt;td&gt;Inverted Indexes&lt;/td&gt;
&lt;td&gt;Supports tokenization and keyword matching, indexes can be created on any column, supports multi-condition combinations, and function acceleration is continuously being added&lt;/td&gt;
&lt;td&gt;Large index storage footprint, roughly equivalent to the original data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skip Indexes&lt;/td&gt;
&lt;td&gt;ZoneMap Indexes&lt;/td&gt;
&lt;td&gt;Built-in index, small storage footprint&lt;/td&gt;
&lt;td&gt;Limited query types: only equality, range, and IS NULL queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skip Indexes&lt;/td&gt;
&lt;td&gt;BloomFilter Indexes&lt;/td&gt;
&lt;td&gt;More fine-grained than ZoneMap, moderate index size&lt;/td&gt;
&lt;td&gt;Limited query types: only equality queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skip Indexes&lt;/td&gt;
&lt;td&gt;NGram BloomFilter Indexes&lt;/td&gt;
&lt;td&gt;Accelerates LIKE queries, moderate index size&lt;/td&gt;
&lt;td&gt;Limited query types: only LIKE queries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  III. List of Operators and Functions Accelerated by Indexes
&lt;/h2&gt;

&lt;p&gt;Understanding the support of indexes for different operators and functions helps us better utilize indexes to accelerate queries:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operators / Functions&lt;/th&gt;
&lt;th&gt;Prefix Indexes&lt;/th&gt;
&lt;th&gt;Inverted Indexes&lt;/th&gt;
&lt;th&gt;ZoneMap Indexes&lt;/th&gt;
&lt;th&gt;BloomFilter Indexes&lt;/th&gt;
&lt;th&gt;NGram BloomFilter Indexes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;=&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;!=&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IN&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NOT IN&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt;, &amp;gt;=, &amp;lt;, &amp;lt;=, BETWEEN&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IS NULL&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IS NOT NULL&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LIKE&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MATCH, MATCH_*&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;array_contains&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;array_overlaps&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;is_ip_address_in_range&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  IV. Guide to Index Usage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  (A) Suggestions for Selecting Prefix Indexes
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Select the Most Frequently Filtered Fields
&lt;/h4&gt;

&lt;p&gt;Since there is only one set of prefix indexes per table, it is advisable to use the fields most frequently used in WHERE filtering conditions as the Key.&lt;/p&gt;

&lt;p&gt;For example, in a user behavior analysis table, if queries are often made based on user IDs, it is a wise choice to use user ID as the Key column of the prefix index.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Order of Fields Matters
&lt;/h4&gt;

&lt;p&gt;The more frequently used fields should be placed at the front. Prefix indexes are only effective when the fields in the WHERE condition are in the prefix of the Key.&lt;/p&gt;

&lt;p&gt;For instance, if the query condition is often &lt;code&gt;WHERE user_id = 123 AND age = 25&lt;/code&gt;, it is better to place &lt;code&gt;user_id&lt;/code&gt; before &lt;code&gt;age&lt;/code&gt; as the sorted columns when creating the table to make better use of the prefix index for query acceleration.&lt;/p&gt;
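
&lt;p&gt;In table-creation terms, the difference is just the order of the Key columns. A hypothetical fragment of the CREATE TABLE statement:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Good for WHERE user_id = ... AND age = ...: user_id leads the Key.
DUPLICATE KEY(user_id, age)

-- Less useful for that workload: age leads instead, so filters on
-- user_id alone can no longer use the prefix index.
DUPLICATE KEY(age, user_id)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;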

&lt;h3&gt;
  
  
  (B) Suggestions for Selecting Other Indexes
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Filtering of Non-Key Fields
&lt;/h4&gt;

&lt;p&gt;For non-Key fields that require filtering acceleration, it is advisable to create inverted indexes first because of their wide applicability and support for multi-condition combinations.&lt;/p&gt;

&lt;p&gt;For example, in a table containing user comments and ratings, if queries need to be filtered based on both comment content and rating range, an inverted index can meet the requirements effectively.&lt;/p&gt;

&lt;h4&gt;
  
  
  String LIKE Matching
&lt;/h4&gt;

&lt;p&gt;If there is a need for string LIKE matching, an NGram BloomFilter index can be added. For example, in a product description search scenario, using an NGram BloomFilter index can effectively accelerate LIKE queries.&lt;/p&gt;

&lt;h4&gt;
  
  
  Sensitivity to Index Storage Space
&lt;/h4&gt;

&lt;p&gt;When index storage space is a major concern, inverted indexes can be replaced with BloomFilter indexes.&lt;/p&gt;

&lt;p&gt;For example, in a table storing a massive amount of high-cardinality user attribute data, BloomFilter indexes can reduce storage space while still meeting the acceleration requirements for equality queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  (C) Performance Optimization and Analysis
&lt;/h3&gt;

&lt;p&gt;If the performance does not meet expectations, analyze the amount of data filtered by indexes and the time consumed through QueryProfile. Refer to the detailed documentation of each index for specific analysis.&lt;/p&gt;

&lt;p&gt;For example, evaluate the filtering effect of indexes by checking indicators such as &lt;code&gt;RowsKeyRangeFiltered&lt;/code&gt; (the number of rows filtered by prefix indexes) and &lt;code&gt;RowsInvertedIndexFiltered&lt;/code&gt; (the number of rows filtered by inverted indexes), and then optimize the index design.&lt;/p&gt;
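
&lt;p&gt;For reference, profile collection is typically switched on per session before running the query (a sketch; the exact workflow for viewing profiles may vary slightly between Doris versions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Enable profile collection for the current session; the resulting
-- QueryProfile can then be inspected in the FE web UI.
SET enable_profile = true;

SELECT * FROM table_name WHERE user_id = 123;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;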

&lt;h2&gt;
  
  
  V. Management and Usage of Indexes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  (A) Prefix Indexes
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Management
&lt;/h4&gt;

&lt;p&gt;Prefix indexes do not require a specific syntax for definition. When creating a table, the first 36 bytes of the table's Key are automatically taken as the prefix index.&lt;/p&gt;

&lt;h4&gt;
  
  
  Usage
&lt;/h4&gt;

&lt;p&gt;They are used to accelerate equality and range queries in WHERE conditions. They take effect automatically when applicable, with no special syntax required.&lt;/p&gt;

&lt;p&gt;For example, in a query like &lt;code&gt;SELECT * FROM table WHERE user_id = 123 AND age &amp;gt; 20;&lt;/code&gt;, if &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;age&lt;/code&gt; are sorted columns, the prefix index will automatically play its role.&lt;/p&gt;

&lt;h3&gt;
  
  
  (B) Inverted Indexes
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Management
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Definition at Table Creation&lt;/strong&gt;: In the table creation statement, define the index after the COLUMN definition. For example,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE table_name (
    column_name1 TYPE1,
    column_name2 TYPE2,
    INDEX idx_name1(column_name1) USING INVERTED [PROPERTIES(...)] [COMMENT 'your comment']
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Specify the index column name, index type (USING INVERTED), and additional attributes such as tokenizers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding to Existing Tables&lt;/strong&gt;: Both &lt;code&gt;CREATE INDEX&lt;/code&gt; and &lt;code&gt;ALTER TABLE ADD INDEX&lt;/code&gt; syntaxes are supported. After adding a new index definition, new data written will generate inverted indexes. For existing data, use &lt;code&gt;BUILD INDEX&lt;/code&gt; to trigger index construction. For example,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE INDEX idx_name ON table_name(column_name) USING INVERTED;
BUILD INDEX idx_name ON table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deleting from Existing Tables&lt;/strong&gt;: Use &lt;code&gt;DROP INDEX idx_name ON table_name;&lt;/code&gt; or &lt;code&gt;ALTER TABLE table_name DROP INDEX idx_name;&lt;/code&gt; to delete inverted indexes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Usage
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Full-Text Retrieval Keyword Matching&lt;/strong&gt;: Achieved through &lt;code&gt;MATCH_ANY&lt;/code&gt;, &lt;code&gt;MATCH_ALL&lt;/code&gt;, etc. For example, &lt;code&gt;SELECT * FROM table_name WHERE column_name MATCH_ANY 'keyword1 ...';&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full-Text Retrieval Phrase Matching&lt;/strong&gt;: Achieved through &lt;code&gt;MATCH_PHRASE&lt;/code&gt;. For example, &lt;code&gt;SELECT * FROM table_name WHERE content MATCH_PHRASE 'keyword1 keyword2';&lt;/code&gt; Note that the &lt;code&gt;support_phrase&lt;/code&gt; attribute needs to be set.&lt;/p&gt;
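
&lt;p&gt;As a sketch of a table definition that enables phrase matching (the table, column, and index names are illustrative; the &lt;code&gt;parser&lt;/code&gt; and &lt;code&gt;support_phrase&lt;/code&gt; properties are set when the index is created):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE comments (
    id BIGINT,
    content STRING,
    INDEX idx_content (content) USING INVERTED
        PROPERTIES("parser" = "english", "support_phrase" = "true")
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- Phrase query: requires support_phrase to have been enabled.
SELECT * FROM comments WHERE content MATCH_PHRASE 'keyword1 keyword2';
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;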

&lt;p&gt;&lt;strong&gt;Ordinary Equality, Range, IN, NOT IN Queries&lt;/strong&gt;: Use normal SQL statements. For example, &lt;code&gt;SELECT * FROM table_name WHERE id = 123;&lt;/code&gt; Analyze the acceleration effect of inverted indexes through Query Profile indicators such as &lt;code&gt;RowsInvertedIndexFiltered&lt;/code&gt; and &lt;code&gt;InvertedIndexFilterTime&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  (C) BloomFilter Indexes
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Management
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Creation at Table Creation&lt;/strong&gt;: Specify which fields to create BloomFilter indexes on through the table's PROPERTIES "bloom_filter_columns", for example, &lt;code&gt;PROPERTIES ("bloom_filter_columns" = "column_name1,column_name2");&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding and Deleting from Existing Tables&lt;/strong&gt;: Modify the bloom_filter_columns property of the table through ALTER TABLE. For example,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE table_name SET ("bloom_filter_columns" = "column_name1,column_name2,column_name3");
ALTER TABLE table_name SET ("bloom_filter_columns" = "column_name2,column_name3");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The former adds a BloomFilter index (on &lt;code&gt;column_name3&lt;/code&gt;); the latter removes one (on &lt;code&gt;column_name1&lt;/code&gt;).&lt;/p&gt;

&lt;h4&gt;
  
  
  Usage
&lt;/h4&gt;

&lt;p&gt;They are used to accelerate equality queries in WHERE conditions, taking effect automatically with no special syntax required. Analyze the acceleration effect through Query Profile indicators such as &lt;code&gt;RowsBloomFilterFiltered&lt;/code&gt; and &lt;code&gt;BlockConditionsFilteredBloomFilterTime&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  (D) NGram BloomFilter Indexes
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Management
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Creation&lt;/strong&gt;: Define the index after the COLUMN definition in the table creation statement. For example,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INDEX `idx_column_name` (`column_name`) USING NGRAM_BF
PROPERTIES("gram_size"="3", "bf_size"="1024")
COMMENT 'username ngram_bf index'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Specify the index column name, index type (USING NGRAM_BF), and tokenization - related attributes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Viewing&lt;/strong&gt;: Use &lt;code&gt;SHOW CREATE TABLE table_name;&lt;/code&gt; or &lt;code&gt;SHOW INDEX FROM table_name;&lt;/code&gt; to view indexes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deleting&lt;/strong&gt;: Use &lt;code&gt;ALTER TABLE table_ngrambf DROP INDEX idx_ngrambf;&lt;/code&gt; to delete indexes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modifying&lt;/strong&gt;: To change an index definition, drop the existing index and recreate it using &lt;code&gt;CREATE INDEX&lt;/code&gt; or &lt;code&gt;ALTER TABLE ADD INDEX&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Usage
&lt;/h4&gt;

&lt;p&gt;They are used to accelerate LIKE queries, for example, &lt;code&gt;SELECT count(*) FROM table1 WHERE message LIKE '%error%';&lt;/code&gt; Analyze the acceleration effect through Query Profile indicators such as &lt;code&gt;RowsBloomFilterFiltered&lt;/code&gt; and &lt;code&gt;BlockConditionsFilteredBloomFilterTime&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  VI. Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, the Apache Doris index system is rich and powerful, with various indexes having their own strengths. Prefix indexes locate data based on the sorted structure, inverted indexes facilitate full - text retrieval, ZoneMap indexes skip irrelevant data blocks using statistical information, and BloomFilter indexes and NGram BloomFilter indexes accelerate equality and text LIKE queries respectively.&lt;/p&gt;

&lt;p&gt;By thoroughly understanding their principles, application scenarios, and usage methods, users can make accurate selections according to their needs, maximizing the performance of Doris in data queries. Whether it's point queries on massive data or complex text retrievals, Doris can handle them with ease.&lt;/p&gt;

&lt;p&gt;If you're really stuck, check the QueryProfile to see if the indexes are taking effect. There's nothing worse than implementing indexes that don't work!&lt;/p&gt;

</description>
      <category>apache</category>
      <category>doris</category>
      <category>bigdata</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
