In a distributed analytical gbase database, many performance issues are baked in at the table design stage. Data types, partitioning, distribution keys, and replicated table strategies largely determine query cost down the line. This guide walks through these four core design decisions with practical, implementable advice.
1. Modeling Matters More Than Post‑Hoc Tuning
The GBase 8a community consensus on query optimisation is clear: prioritise business SQL and table structure first, then tune database parameters, and only then add hardware. The way data is organised sets the upper bound for query performance.
| Design Area | Common Shortcut | Later Pain | Better Approach |
|---|---|---|---|
| Data types | Store everything as strings | Heavy scans, poor compression, constant casting | Choose types by actual semantics |
| Partitioning | Skip it initially, add later | Hard to manage, clean, and query large tables | Partition time‑based large tables early |
| Distribution key | Pick any familiar column | Node skew, slow GROUP/JOIN | Prefer high‑cardinality columns used in frequent JOINs/GROUPs |
| Replicated tables | Build everything as a distribution table | Extra redistribution on small‑table JOINs | Consider replication for small, frequently‑joined dimension tables |
2. Data Types: They Dictate Compression, Scanning, and Computation
The clearer the business semantics, the less you should compromise on types.
-
Status and type codes: Use
TINYINT/SMALLINT/INT, notVARCHARfor enumerated values. -
Monetary amounts: Use
DECIMAL; avoidFLOAT/DOUBLEprecision issues. -
Time‑based filter columns: Use
DATE/DATETIME/TIMESTAMP; never store dates asVARCHAR. -
Distributed sequence numbers: Use
BIGINT;INTrisks overflow on large tables.
Anti‑pattern vs. correct approach:
-- Anti‑pattern: string‑everything
CREATE TABLE ods_order_raw (
order_id VARCHAR(64),
user_id VARCHAR(64),
order_status VARCHAR(20),
pay_amt DOUBLE,
create_time VARCHAR(19)
);
-- Correct: semantic types
CREATE TABLE ods_order_raw (
order_id BIGINT,
user_id BIGINT,
order_status TINYINT,
pay_amt DECIMAL(18,2),
create_time DATETIME
);
3. Partitioning: Plan for Large Tables from the Start
GBase 8a supports RANGE, LIST, HASH, and KEY partitioning. Total partitions cannot exceed 8,192; production best practice is to keep per‑table partitions under 50. The partition key column cannot be updated.
Tables that benefit from partitioning: daily/monthly fact tables, historical log tables — data with natural time boundaries that need periodic cleanup and range queries. Skip partitioning for small dimension tables and high‑update small tables.
CREATE TABLE dwd_trade_detail (
trade_id BIGINT,
user_id BIGINT,
shop_id BIGINT,
pay_amt DECIMAL(18,2),
trade_date DATE
)
PARTITION BY RANGE(trade_date) (
PARTITION p202601 VALUES LESS THAN ('2026-02-01'),
PARTITION p202602 VALUES LESS THAN ('2026-03-01'),
PARTITION p202603 VALUES LESS THAN ('2026-04-01')
);
Partition pruning is the real payoff: partitioning helps only when queries land on a subset of partitions. Avoid wrapping the partition key in functions (DATE_FORMAT); use direct range filters to let partition pruning work.
4. Hash Distribution Key: The Foundation of Horizontal Data Placement
The distribution key determines how evenly data is spread across nodes and directly impacts whether GROUP BY and JOIN can execute locally. Evaluate in this order: data uniformity → frequent JOIN column → frequent GROUP BY column → still uniform after filtering.
Common mistake: using low‑cardinality columns like province_code as the distribution key, causing severe node skew and forcing extra redistribution during aggregation and JOINs.
5. Replicated Tables: Best for Small Dimension Tables
A replicated table stores a full copy on every gnode, enabling fully local JOINs with fact tables — zero network transfer. Ideal for small, frequently‑read dimension and dictionary tables. Avoid for large fact tables and high‑churn large tables.
CREATE TABLE dim_region (
region_id INT,
region_name VARCHAR(64)
) REPLICATED;
6. Recommended Modeling Sequence
- Define types by business semantics — lock down the real meaning of status codes, amounts, times, and primary keys first.
- Decide on partitioning — time‑accumulating large tables and log tables are the prime candidates.
- Choose the distribution strategy — for distribution tables, prioritise uniformity, then JOIN/GROUP needs; evaluate replication for small dimension tables.
- Review expected query patterns — verify that future queries will filter by the partition key and frequently JOIN/GROUP by the chosen distribution key.
In a gbase database, slow queries are often not "discovered" — they are "built in" at the design stage. Getting data types, partitioning, distribution keys, and replication right from the start dramatically reduces the tuning burden later.
Top comments (0)