GBase 8a Table Design and Modeling: Choosing Data Types, Partitions, Distribution Keys, and Replicated Tables

#gbase #database #数据库 #performance

In a distributed analytical gbase database, many performance issues are baked in at the table design stage. Data types, partitioning, distribution keys, and replicated table strategies largely determine query cost down the line. This guide walks through these four core design decisions with practical, implementable advice.

1. Modeling Matters More Than Post‑Hoc Tuning

The GBase 8a community consensus on query optimisation is clear: prioritise business SQL and table structure first, then tune database parameters, and only then add hardware. The way data is organised sets the upper bound for query performance.

Design Area	Common Shortcut	Later Pain	Better Approach
Data types	Store everything as strings	Heavy scans, poor compression, constant casting	Choose types by actual semantics
Partitioning	Skip it initially, add later	Hard to manage, clean, and query large tables	Partition time‑based large tables early
Distribution key	Pick any familiar column	Node skew, slow GROUP/JOIN	Prefer high‑cardinality columns used in frequent JOINs/GROUPs
Replicated tables	Build everything as a distribution table	Extra redistribution on small‑table JOINs	Consider replication for small, frequently‑joined dimension tables

2. Data Types: They Dictate Compression, Scanning, and Computation

The clearer the business semantics, the less you should compromise on types.

Status and type codes: Use TINYINT/SMALLINT/INT, not VARCHAR for enumerated values.
Monetary amounts: Use DECIMAL; avoid FLOAT/DOUBLE precision issues.
Time‑based filter columns: Use DATE/DATETIME/TIMESTAMP; never store dates as VARCHAR.
Distributed sequence numbers: Use BIGINT; INT risks overflow on large tables.

Anti‑pattern vs. correct approach:

-- Anti‑pattern: string‑everything
CREATE TABLE ods_order_raw (
    order_id     VARCHAR(64),
    user_id      VARCHAR(64),
    order_status VARCHAR(20),
    pay_amt      DOUBLE,
    create_time  VARCHAR(19)
);

-- Correct: semantic types
CREATE TABLE ods_order_raw (
    order_id     BIGINT,
    user_id      BIGINT,
    order_status TINYINT,
    pay_amt      DECIMAL(18,2),
    create_time  DATETIME
);

3. Partitioning: Plan for Large Tables from the Start

GBase 8a supports RANGE, LIST, HASH, and KEY partitioning. Total partitions cannot exceed 8,192; production best practice is to keep per‑table partitions under 50. The partition key column cannot be updated.

Tables that benefit from partitioning: daily/monthly fact tables, historical log tables — data with natural time boundaries that need periodic cleanup and range queries. Skip partitioning for small dimension tables and high‑update small tables.

CREATE TABLE dwd_trade_detail (
    trade_id   BIGINT,
    user_id    BIGINT,
    shop_id    BIGINT,
    pay_amt    DECIMAL(18,2),
    trade_date DATE
)
PARTITION BY RANGE(trade_date) (
    PARTITION p202601 VALUES LESS THAN ('2026-02-01'),
    PARTITION p202602 VALUES LESS THAN ('2026-03-01'),
    PARTITION p202603 VALUES LESS THAN ('2026-04-01')
);

Partition pruning is the real payoff: partitioning helps only when queries land on a subset of partitions. Avoid wrapping the partition key in functions (DATE_FORMAT); use direct range filters to let partition pruning work.

4. Hash Distribution Key: The Foundation of Horizontal Data Placement

The distribution key determines how evenly data is spread across nodes and directly impacts whether GROUP BY and JOIN can execute locally. Evaluate in this order: data uniformity → frequent JOIN column → frequent GROUP BY column → still uniform after filtering.

Common mistake: using low‑cardinality columns like province_code as the distribution key, causing severe node skew and forcing extra redistribution during aggregation and JOINs.

5. Replicated Tables: Best for Small Dimension Tables

A replicated table stores a full copy on every gnode, enabling fully local JOINs with fact tables — zero network transfer. Ideal for small, frequently‑read dimension and dictionary tables. Avoid for large fact tables and high‑churn large tables.

CREATE TABLE dim_region (
    region_id   INT,
    region_name VARCHAR(64)
) REPLICATED;

6. Recommended Modeling Sequence

Define types by business semantics — lock down the real meaning of status codes, amounts, times, and primary keys first.
Decide on partitioning — time‑accumulating large tables and log tables are the prime candidates.
Choose the distribution strategy — for distribution tables, prioritise uniformity, then JOIN/GROUP needs; evaluate replication for small dimension tables.
Review expected query patterns — verify that future queries will filter by the partition key and frequently JOIN/GROUP by the chosen distribution key.

In a gbase database, slow queries are often not "discovered" — they are "built in" at the design stage. Getting data types, partitioning, distribution keys, and replication right from the start dramatically reduces the tuning burden later.