GBASE Database

Posted on Mar 12

GBase 8a Optimization Techniques for Precise Queries

#gbase #database #数据库

“In database systems, the fastest query is not just about hardware power, but about how intelligently the data is located and returned.”

This article summarizes several known optimization techniques for precise query scenarios in the GBase 8a MPP Cluster database.
The focus is on the simplest form of exact-match queries, where the query condition is already known.

The techniques discussed include:

Hash distribution
Global hash index
Row-storage columns
Parallel materialization
Pre-sorting

More optimization methods may be added in the future as additional practices are verified.

References

Performance improvement test of Global Hash Index in GBase 8a
Parallel materialization parameter gbase_parallel_threshold in GBase 8a
Performance improvement test of row-storage grouped columns in GBase 8a
Improving simple queries with LIMIT using the first_rows hint in GBase 8a

What Is a Precise Query?

A precise query refers to an equality-based query where the exact value is specified.

For example, retrieving the call records of a specific phone number on a specific date:

SELECT * 
FROM MyTable_20210808 
WHERE calling_number = 13912345678;

Optimization Analysis

From a database execution perspective, this type of query can be divided into two stages:

Location stage
Identify which rows match the query condition.
Result retrieval stage
Read the full row data and return the result set.

“Efficient queries are essentially a two-step dance: find the rows quickly, then assemble the results efficiently.”

Location Stage

This stage determines which rows satisfy the given condition.

Global Hash Index

Indexes are the most common method for improving query performance.

In the GBase 8a database, this stage is optimized using a hash index.

This index works best for low-duplication columns, such as:

Phone numbers
ID numbers

For columns with high duplication, such as:

Ethnicity
Region

performance may actually decrease.

A practical rule of thumb:

If the duplication rate is below 0.1% (1‰), the column can generally be considered suitable for a hash index.

Actual performance should always be validated through testing.

Pre-Sorting

Sorting data based on query columns can increase the hit rate of the intelligent indexing mechanism used in the GBase database engine.

Result Retrieval Stage

In a traditional row-based database, once a row is located, retrieving the entire row is straightforward.

However, in a columnar database, the system must reconstruct a row by combining data from multiple columns.

The more columns involved, the greater the cost of this reconstruction.

In GBase 8a, this process is called Materialization.

Row-Storage Columns

GBase allows certain columns to be stored using row-based storage.

This approach trades storage space for faster query performance, improving efficiency during result reconstruction.

Parallel Materialization

In scenarios with:

small query volumes
low concurrency
limited disk resource contention

parallel materialization can be considered.

When the result set is small and system resources are not heavily contested, adjusting the parameter:

gbase_parallel_threshold

can enable parallel result materialization and improve query performance.

Pre-Sorting (Again)

If data is stored sorted by the query column, the matching rows are typically physically adjacent.

This dramatically reduces disk I/O and can significantly improve query performance.

However, sorting has trade-offs:

Sorting large datasets takes time
The larger the sorting range, the longer the sorting operation
A table can have only one primary sorting column

In many cases, this approach works well for historical data, where data is written once but queried many times.

Choosing a Hash Distribution Column

Whether a query column should be used for hash distribution depends largely on query concurrency.

Hash distribution ensures that rows with the same key value are stored on the same node.

Low-Concurrency Scenarios

If only a small number of queries are executed simultaneously:

Hash distribution is not recommended.

Allowing more nodes to participate in the query computation can actually improve performance.

High-Concurrency Scenarios

A classic example comes from the telecommunications industry, where customers frequently query call detail records (CDR).

Although each user may only have tens or hundreds of records per month, there may be:

hundreds
or even thousands

of users querying simultaneously.

At peak times, especially at the beginning of the month, the concurrency can increase even further.

In such cases, hash distribution on the query column is recommended.

With hash distribution:

queries for the same condition can be completed on a single node
different queries are likely distributed across different nodes

This reduces resource contention across the cluster and increases the overall number of parallel queries the database can support.

It also makes capacity planning during cluster expansion much easier.

Precise Queries with LIMIT

For exact-match queries that include a LIMIT clause, query performance can be further optimized using the first_rows hint.

This technique allows the GBase database optimizer to prioritize returning the first rows of the result set more quickly.

“Database optimization is rarely about a single trick. It is usually a combination of distribution strategy, indexing design, and execution tuning working together.”

DEV Community