GBase 8a MPP Cluster offers three primary data distribution strategies: hash distribution, random distribution, and replicated tables. Each determines how rows are physically placed across data nodes, which directly affects join and aggregation performance in a GBase database.
Strategy Overview
| Strategy | Syntax | Distribution Method | Key Feature |
|---|---|---|---|
| Hash | DISTRIBUTED BY (column) | Rows with the same column value are mapped to the same node via hash. | Data locality: related rows are physically together, ideal for joins and grouping. |
| Random | Default (no keyword) or DISTRIBUTED RANDOMLY | Data is scattered evenly across all data nodes. | Perfect load balancing: equal data volume and compute pressure on every node. |
| Replicated | REPLICATED | A full copy of the table exists on every data node. | Global redundancy: no network transfer needed for joins. |
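To make the three placement rules concrete, here is a minimal Python sketch that models them outside the database. It is only an illustration: the node count, the SHA-256 based `hash_node` helper, and the `place_*` function names are assumptions made for this example, not GBase's actual hash function or internals.

```python
import hashlib
import random

NUM_NODES = 4  # hypothetical cluster size


def hash_node(key, num_nodes=NUM_NODES):
    # Deterministic stand-in hash: equal keys always land on the same node.
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def place_hash(rows, key, num_nodes=NUM_NODES):
    # Hash distribution: the node is chosen from the distribution column.
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash_node(row[key], num_nodes)].append(row)
    return nodes


def place_random(rows, num_nodes=NUM_NODES, seed=0):
    # Random distribution: the node is chosen without looking at the data.
    rng = random.Random(seed)
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[rng.randrange(num_nodes)].append(row)
    return nodes


def place_replicated(rows, num_nodes=NUM_NODES):
    # Replicated table: every node keeps a full copy.
    return [list(rows) for _ in range(num_nodes)]


rows = [{"order_id": i, "user_id": i % 10} for i in range(100)]
hashed = place_hash(rows, "user_id")
scattered = place_random(rows)
copies = place_replicated(rows)

print([len(n) for n in hashed])
print([len(n) for n in scattered])
print([len(n) for n in copies])
```

In the hashed layout every `user_id` lives on exactly one node, while the replicated layout stores all 100 rows on each node.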
When to Use Each Strategy
1. Hash Distribution: Large Fact Tables, Frequent Joins
- Best for: Huge fact tables like orders or transaction logs that are frequently joined on a specific column (e.g., user_id) or grouped.
- Why: Placing matching rows on the same node eliminates cross‑node data shuffling, delivering high‑performance local computation.
- Watch out for: Data skew. Picking a low‑cardinality column (e.g., gender) can overload a few nodes. Choose a column with many unique values and even distribution.
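The skew warning can be checked with a quick simulation. The sketch below counts rows per node for a two-value key versus a unique key. The 8-node cluster, the row count, and the SHA-256 based hash are assumptions chosen for illustration and do not reflect GBase's real hash function.

```python
import hashlib
from collections import Counter

NUM_NODES = 8   # hypothetical cluster size
N = 10_000      # hypothetical row count


def hash_node(key, num_nodes=NUM_NODES):
    # Stand-in hash used only for this simulation.
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def rows_per_node(keys, num_nodes=NUM_NODES):
    # Count how many rows each node would receive for the given key values.
    counts = Counter(hash_node(k, num_nodes) for k in keys)
    return [counts.get(n, 0) for n in range(num_nodes)]


gender_keys = ["M" if i % 2 else "F" for i in range(N)]  # 2 distinct values
user_keys = list(range(N))                               # N distinct values

low = rows_per_node(gender_keys)
high = rows_per_node(user_keys)

print("gender :", low)
print("user_id:", high)
```

With only two distinct values, at most two nodes can ever receive rows, no matter how many nodes the cluster has. The unique key spreads rows over all nodes.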
2. Random Distribution: Standalone Large Tables or Staging Tables
- Best for: Access‑pattern logs, monitoring data, or intermediate ETL tables that are mostly full‑scanned.
- Why: Rows are spread evenly across nodes regardless of their values, so there is no skew, maximizing parallel I/O and CPU utilization.
- Trade‑off: Joins and groupings will trigger heavy network redistribution, increasing query latency.
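The join trade-off can be seen in a small model. The sketch below places a hypothetical `users` table and `orders` table on four nodes and counts how many orders sit on a different node than their matching user row. It is a simplified simulation with assumed table sizes and a stand-in hash, not a description of how the GBase planner actually moves data.

```python
import hashlib
import random

NUM_NODES = 4  # hypothetical cluster size


def hash_node(key):
    # Stand-in hash used only for this simulation.
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


rng = random.Random(42)
users = list(range(1000))                            # user_id values
orders = [rng.choice(users) for _ in range(5000)]    # user_id of each order


def remote_matches(user_node, order_node):
    # Orders whose matching user row is stored on another node.
    return sum(1 for i, uid in enumerate(orders) if order_node[i] != user_node[uid])


# Both tables hash-distributed on user_id.
user_hash = {u: hash_node(u) for u in users}
order_hash = [hash_node(uid) for uid in orders]

# Both tables randomly distributed.
user_rand = {u: rng.randrange(NUM_NODES) for u in users}
order_rand = [rng.randrange(NUM_NODES) for _ in orders]

hash_remote = remote_matches(user_hash, order_hash)
rand_remote = remote_matches(user_rand, order_rand)

print("hash-distributed, remote matches:  ", hash_remote)
print("random-distributed, remote matches:", rand_remote)
```

When both tables are hashed on the join key, no match crosses a node boundary. With random placement, most matches do, and those rows would have to travel over the network before the join can run.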
3. Replicated Tables: Small Dimension or Lookup Tables
- Best for: Small tables (usually under a few GB) like country codes, product categories, or compact user dimensions.
- Why: Every node has a full copy, so joins with any distributed table happen entirely locally — the fastest possible join.
- Cost: Storage multiplies by the node count, and every insert/update/delete must be applied to all nodes. Suited for read‑mostly or read‑only tables.
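The cost side is simple arithmetic, sketched below. The function names and the 2 GB / 16-node figures are hypothetical, picked only to show how the multiplication works.

```python
def replicated_footprint_gb(table_gb, num_nodes):
    """Total storage when every data node keeps a full copy of the table."""
    return table_gb * num_nodes


def node_writes_per_change(num_nodes):
    """Each insert/update/delete has to be applied once on every node."""
    return num_nodes


# Hypothetical numbers: a 2 GB dimension table on a 16-node cluster.
total_gb = replicated_footprint_gb(2, 16)
writes = node_writes_per_change(16)

print(f"storage: {total_gb} GB, node-level writes per change: {writes}")
```

A 2 GB table therefore occupies 32 GB across the cluster, which is acceptable for a lookup table but quickly becomes expensive as the table or the cluster grows.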
Best Practice
Follow the rule: “Large fact tables → Hash, small dimension tables → Replicated, everything else → Random.” Choosing the right strategy and distribution key based on actual query patterns is essential for keeping a GBase database running at peak performance.
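As a rough starting point, the rule of thumb can be written as a tiny helper. This is a sketch under stated assumptions: the function name, the `joined_or_grouped_on_key` flag, and the 2 GB cutoff for "small" are all hypothetical, and real decisions should be checked against actual query patterns.

```python
def suggest_distribution(size_gb, joined_or_grouped_on_key, small_table_gb=2.0):
    """Return a suggested strategy name following the rule of thumb.

    small_table_gb is an assumed cutoff for "small"; tune it per cluster.
    """
    if size_gb <= small_table_gb:
        return "REPLICATED"   # small dimension or lookup table
    if joined_or_grouped_on_key:
        return "HASH"         # large table joined or grouped on one column
    return "RANDOM"           # everything else, e.g. logs or staging tables


print(suggest_distribution(0.1, True))    # small lookup table
print(suggest_distribution(500, True))    # large fact table joined on a key
print(suggest_distribution(500, False))   # large log table, mostly full scans
```

The helper only encodes the size and join-pattern checks; it does not test for skew, so a candidate hash column still needs a cardinality check before use.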