GBASE Database

Posted on Sep 26, 2024

GBase 8a MPP Cluster Performance Optimization: Storage Optimization

#database

I. Storage Optimization

GBase 8a MPP Cluster is a columnar relational database, distinct from traditional row-based databases. Currently, GBase 8a MPP Cluster supports two storage modes: columnar and row-column mixed storage. This article focuses on storage features that provide optimization value.

1. Columnar Storage

In columnar storage mode, DML (Data Manipulation Language) operations on a column scan only the database page chain (column) corresponding to that column, avoiding full table scans. This effectively reduces the I/O overhead during DML operations.

1.1. Optimization Features:

Smart Indexing: Columnar databases often use sparse indexing, known as smart indexing. The key idea is to create indexes (or statistical information) for data blocks (pages), enabling coarse-grained filtering. These indexes occupy minimal storage and are maintained automatically without manual intervention.
Compression: Columnar databases typically achieve higher compression ratios than row-based databases due to the similarity of data in the same column and specialized compression algorithms. Row-based databases typically achieve a 2-3x compression ratio, whereas columnar databases can reach 5-10x (and in some cases, even 20-40x).
Delayed Materialization: In a columnar database execution plan, data is not decompressed until the final step, whether filtering, projection, joining, or aggregation. This reduces CPU, memory, and network transmission overhead, as well as storage requirements.

1.2. Optimization Limitations:

Columnar storage offers a performance advantage primarily when scanning single columns. When selecting multiple columns or using SELECT *, I/O performance can degrade significantly, negatively impacting execution performance.
Coarse-grained indexes are effective primarily for queries involving statistics on data chunks (DC).

2. Row-Column Hybrid Storage (Row Storage in Columns)

As GBase 8a MPP Cluster is columnar in architecture, when dealing with a large number of columns and sparse data access, significant random I/O can occur, leading to reduced performance.

The row-column hybrid storage feature helps mitigate this by storing certain columns redundantly as rows in a single column (of VARCHAR type). This column contains concatenated row data, allowing selective page loading without reading the entire DC. This feature improves I/O performance, particularly for queries like SELECT * FROM, by an order of magnitude.

2.1. Optimization Features:

Compressed Storage: Row-column data supports compression, reducing redundancy.
Smaller Data Pages: Data is read in smaller Data Pages, minimizing unnecessary I/O and improving query performance.
Configurable Hybrid Storage: Controlled by the gbase_hybrid_store parameter, which can be set as:
- 0: Disabled.
- 1: Automatically determined by the server.
- 2: Force enabled.

Automatic usage occurs under the following conditions:

The field is defined as a row-column.
The number of rows per DC is less than or equal to _gbase_hybrid_store_limit.
- Flexible Redundancy: Users can customize which fields are stored as rows, primarily selecting those frequently used in query projections.

2.2. Optimization Limitations:

Currently, row-column data cannot be used for SCAN, JOIN, or GROUP operations. It is only applicable for single-table queries with ORDER BY, such as:

  SELECT * FROM t [WHERE …] ORDER BY …;

2.3. Optimization Parameters:

Hybrid Store Usage: Controlled by the parameter gbase_hybrid_store (0 = disabled, 1 = auto, 2 = forced), default is 1.
Page Size for Row Storage: Set by gbase_hybrid_store_page_size=<1k –1G>, default is 32K.
Limit on Rows Per DC: Set by _gbase_hybrid_store_limit=<1-65536>, default is 100.

2.4. Additional Features and Limitations:

Supports specifying column storage at the time of table creation.
Supports adding column storage to existing tables.
Supports deleting column storage.
Supports creating multiple column stores on a single table, but column stores cannot have duplicate names.
Data updates for column storage are automatically maintained by the system and are transparent to the user. When performing INSERT, fast UPDATE, DELETE, or LOAD operations, the system will automatically update the redundant data.
Does not support modifying the definition of column storage. To change it, you must first manually delete and then recreate the column storage.
The original column definition of the column store cannot be deleted or have its data attributes modified (although column names and column order can be changed).
Bulk UPDATE operations (entire column replacement mode) are not allowed on the original column's data.
The total length of all columns in the same group cannot exceed 32KB.
Column storage supports partitioned tables and allows indexes on the original columns.

3. Principles for Selecting Hash Distribution Columns

When working with large tables and equality queries, consider using hash distribution for better performance. Guidelines for selecting hash distribution columns:

Even Data Distribution: Select columns with a high COUNT(DISTINCT) to ensure even data distribution.
Distributed Operations: Prioritize using hash distribution for columns involved in JOIN conditions in large tables to enable distributed execution across nodes.
GROUP BY Columns: Prefer hash-distributed columns in GROUP BY clauses for optimized group aggregation.

II. Hash Index Utilization

Hash indexes are particularly useful for exact match queries, such as those commonly found in telecommunications applications involving concurrent record searches.

GBase 8a MPP Cluster first filters data using smart indexes, then uses the hash index if applicable, or defaults to a full DC scan. This can be verified in the Trace Log.

Key Notes:

Indexes introduce maintenance costs and can affect data load and DML performance. Use them judiciously based on your application needs.
Hash indexes should be created on columns with low duplicate values to avoid hash collisions.
Binary columns are not suitable for hash indexes.

III. Ordered Data Insertion

1. Benefits of Ordered Data Insertion:

Improves smart index hit rate.
Significantly boosts query performance.

2. Local Range Sorting:

When inserting incremental data, perform local range sorting before loading the data to maintain order within the database and improve read speeds.

For example:

Sort daily data before inserting it into a monthly table.
Analyze SQL queries to identify key filtering columns and sort data accordingly.

3. Sorting Methods:

External Sorting: Use external tools (like psort) to sort data files before loading them into the table.
In-Database Sorting: Use a temporary table to store unsorted data, then sort it with an INSERT INTO … SELECT * … ORDER BY … statement before inserting it into the main table.

Recommended Usage:

External sorting is better for non-real-time data loading, while in-database sorting is preferred for real-time applications.

This concludes today's discussion on storage optimization in GBase 8a MPP Cluster. Thank you for reading!

DEV Community