The query optimizer in GBase 8a, the China‑domestically developed MPP database from GBASE, relies heavily on statistics to choose efficient execution plans. Those statistics live in two system tables: gclusterdb.gc_stats_table for table‑level row counts, and gclusterdb.gc_stats_column for column‑level distributions. Let's break down what they store, how the numbers are calculated, and which parameters matter most.
1. Table‑Level Stats: gc_stats_table
This table holds the estimated row count and the last collection timestamp for each table.
SELECT * FROM gc_stats_table WHERE db = 'tpch1t';
Sample output:
| db | table_name | tuples | timestamp |
|---|---|---|---|
| tpch1t | lineitem | 5999989709 | 2023-12-29 11:37:05 |
| tpch1t | orders | 1500000000 | 2023-12-29 11:47:35 |
| tpch1t | part | 200000000 | 2023-12-29 11:49:58 |
| ... | ... | ... | ... |
The tuples column guides cost estimation and join ordering across the gbase database.
2. Column‑Level Stats: gc_stats_column
Column statistics are stored in gclusterdb.gc_stats_column with the following schema:
DESC gclusterdb.gc_stats_column;
| Column | Type | Description |
|---|---|---|
| db | varchar(64) | Database name |
| table_name | varchar(64) | Table name |
| column_name | varchar(64) | Column name |
| null_frac | float | Fraction of null values |
| avg_width | int(11) | Average width (variable‑length types only) |
| n_distinct | float | Distinct count (positive) or ratio (negative) |
| mcv_vals | varchar(10922) | Most common values list |
| mcv_freqs | varchar(10922) | Frequencies of those values |
| histogram | varchar(10922) | Histogram boundaries (numeric columns) |
How the Key Columns Are Computed
-
null_frac:
null rows / sampled rows. -
avg_width: Applies only to variable‑length columns (
VARCHAR, etc.) — computed asSELECT AVG(LENGTH(c1)) FROM table. -
n_distinct: If positive, it represents
COUNT(DISTINCT c1). If negative, it's the ratioCOUNT(DISTINCT c1) / COUNT(*), used when the ratio exceeds 10%. -
mcv_vals & mcv_freqs: The most frequent values and their frequencies, limited to the number specified by
gcluster_statistics_target(default 25). Queries:SELECT v, COUNT(col)/COUNT(*) FROM table GROUP BY v ORDER BY COUNT(*) DESC LIMIT 25. -
histogram: Equi‑height histogram boundaries, only for numeric types. The number of bins equals
gcluster_statistics_target. Algorithm:-
bucket_size = COUNT(*) / gcluster_statistics_target; - After sorting values, record the minimum of every
bucket_size‑th row to define left‑closed, right‑open intervals.
-
3. Important Parameters
- gcluster_statistics_sampling_threshold: Tables with fewer rows are scanned entirely; larger tables are sampled randomly.
- gcluster_statistics_target: Controls the number of MCV entries and histogram bins; the default is usually fine.
- gcluster_analyze_relative_error: Desired relative error for statistical approximations.
By understanding these two statistics tables and their parameters, you can better evaluate query plans and keep your gbase database performing well — a must‑have skill for any DBA working with GBASE's analytical engine.
Top comments (0)