
Michael

Posted on • Originally published at gbase.cn

GBase 8a Data Distribution Strategies: Hash, Random, and Replicated Tables

GBase 8a MPP Cluster offers three primary data distribution strategies: hash distribution, random distribution, and replicated tables. Each determines how rows are physically placed across data nodes, which directly affects join and aggregation performance in a GBase database.

Strategy Overview

| Strategy | Syntax | Distribution Method | Key Feature |
|---|---|---|---|
| Hash | DISTRIBUTED BY (column) | Rows with the same column value are mapped to the same node via hash. | Data locality: related rows are physically together, ideal for joins and grouping. |
| Random | Default (no keyword) or DISTRIBUTED RANDOMLY | Data is scattered evenly across all data nodes. | Load balancing: near-equal data volume and compute pressure on every node. |
| Replicated | REPLICATED | A full copy of the table exists on every data node. | Global redundancy: joins against the table need no network transfer. |

When to Use Each Strategy

1. Hash Distribution: Large Fact Tables, Frequent Joins

  • Best for: Huge fact tables, such as orders or transaction logs, that are frequently joined or grouped on a specific column (e.g., user_id).
  • Why: Rows with the same key value land on the same node. When both sides of a join are hash-distributed on the join column, matching rows are already co-located, so the join runs locally without cross-node data shuffling.
  • Watch out for: Data skew. Picking a low‑cardinality column (e.g., gender) can overload a few nodes. Choose a column with many unique values and even distribution.
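
The skew warning above can be illustrated with a small simulation. This is a minimal sketch, not GBase 8a's actual placement logic: `zlib.crc32` stands in for whatever hash function the cluster uses, and the node count, row count, and helper names (`node_for`, `rows_per_node`) are assumptions made for illustration.

```python
from collections import Counter
import zlib


def node_for(value, num_nodes):
    """Map a key to a node with a deterministic hash (CRC32 is a stand-in)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_nodes


def rows_per_node(keys, num_nodes):
    """Count how many rows each node would hold if rows are placed by key hash."""
    counts = Counter(node_for(k, num_nodes) for k in keys)
    return [counts.get(n, 0) for n in range(num_nodes)]


NUM_NODES = 8        # hypothetical cluster size
NUM_ROWS = 100_000   # hypothetical table size

# High-cardinality key: one distinct user_id per row.
by_user_id = rows_per_node(range(NUM_ROWS), NUM_NODES)

# Low-cardinality key: only two distinct values, so at most two nodes get data.
by_gender = rows_per_node(
    ("M" if i % 2 else "F" for i in range(NUM_ROWS)), NUM_NODES
)

print("rows per node, key = user_id:", by_user_id)
print("rows per node, key = gender: ", by_gender)
```

With the two-value key, every row hashes to one of at most two nodes, while the remaining nodes stay empty. The many-value key spreads rows over all nodes.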

2. Random Distribution: Standalone Large Tables or Staging Tables

  • Best for: Access‑pattern logs, monitoring data, or intermediate ETL tables that are mostly full‑scanned.
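  • Why: Rows are spread near-evenly across nodes regardless of their values, which maximizes parallel I/O and CPU utilization for full scans.
  • Trade-off: Rows with the same key can sit on any node, so joins and groupings generally require redistributing data over the network, which increases query latency.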
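
A rough simulation shows both sides of this trade-off. It is a sketch under stated assumptions: random distribution is modeled as round-robin placement, `zlib.crc32` stands in for the cluster's hash function, and the node count and data are made up. None of this reflects GBase 8a internals.

```python
import random
import zlib


def hash_node(key, num_nodes):
    """Node a row would need to be on to be co-located by key (CRC32 stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


NUM_NODES = 4                      # hypothetical cluster size
rng = random.Random(42)            # fixed seed so the run is repeatable
user_ids = [rng.randrange(1000) for _ in range(20_000)]

# Random-style placement, modeled here as round-robin: row i goes to node i % N.
placement = [i % NUM_NODES for i in range(len(user_ids))]
rows_on_node = [placement.count(n) for n in range(NUM_NODES)]

# For a join or GROUP BY on user_id, a row is only useful locally if it already
# sits on the node its key hashes to. Count the rows that would have to move.
rows_to_move = sum(
    1
    for uid, node in zip(user_ids, placement)
    if hash_node(uid, NUM_NODES) != node
)
fraction_moved = rows_to_move / len(user_ids)

print("rows per node:", rows_on_node)
print(f"fraction of rows to redistribute: {fraction_moved:.2f}")
```

The row counts per node come out equal, but roughly (N - 1) / N of the rows are on the "wrong" node for a key-based operation, and that share grows with the node count.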

3. Replicated Tables: Small Dimension or Lookup Tables

  • Best for: Small tables (usually under a few GB) like country codes, product categories, or compact user dimensions.
  • Why: Every node has a full copy, so joins with any distributed table run entirely locally, with no data movement for the replicated side.
  • Cost: Storage multiplies by the node count, and every insert/update/delete must be applied to all nodes. Suited for read‑mostly or read‑only tables.
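
The cost side is simple arithmetic, sketched below. The function names and the example sizes are hypothetical, and the calculation ignores compression and any backup copies the cluster keeps.

```python
def replicated_footprint_gb(table_size_gb, num_nodes):
    """Cluster-wide storage for a replicated table: one full copy per node."""
    return table_size_gb * num_nodes


def replicated_write_operations(rows_changed, num_nodes):
    """Row-level writes performed cluster-wide when every node applies each change."""
    return rows_changed * num_nodes


NUM_NODES = 16  # hypothetical cluster size

# A 2 GB lookup table is cheap to replicate.
small_dim_total = replicated_footprint_gb(2, NUM_NODES)

# A 500 GB fact table is not.
large_fact_total = replicated_footprint_gb(500, NUM_NODES)

# Updating 10,000 rows turns into 10,000 writes on each of the 16 nodes.
write_ops = replicated_write_operations(10_000, NUM_NODES)

print(f"2 GB table replicated:   {small_dim_total} GB")
print(f"500 GB table replicated: {large_fact_total} GB")
print(f"row writes for a 10,000-row update: {write_ops}")
```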

Best Practice

Follow the rule: “Large fact tables → Hash, small dimension tables → Replicated, everything else → Random.” Choosing the right strategy and distribution key based on actual query patterns is essential for keeping a GBase database running at peak performance.
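
As a rough illustration, the rule of thumb can be written as a tiny decision helper. Everything here is an assumption for the sake of the sketch: the function name, its parameters, and especially the 2 GB cutoff, which stands in for "a few GB" and should be tuned to the actual cluster.

```python
def choose_strategy(size_gb, has_join_key, read_mostly, small_table_limit_gb=2.0):
    """Suggest a distribution strategy from coarse table characteristics.

    size_gb:              approximate table size
    has_join_key:         table is often joined or grouped on one column
    read_mostly:          table is rarely inserted into, updated, or deleted from
    small_table_limit_gb: hypothetical cutoff for "small enough to replicate"
    """
    if size_gb <= small_table_limit_gb and read_mostly:
        return "REPLICATED"
    if has_join_key:
        return "HASH"
    return "RANDOM"


# Hypothetical tables: (name, size in GB, joined on a key?, read-mostly?)
tables = [
    ("orders", 800.0, True, False),
    ("country_codes", 0.01, True, True),
    ("etl_staging", 120.0, False, False),
]

for name, size_gb, has_join_key, read_mostly in tables:
    print(name, "->", choose_strategy(size_gb, has_join_key, read_mostly))
```

A helper like this only gives a starting point. The hash key itself still has to be checked for cardinality and skew against real data.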
