GBase 8a Statistics Tables: Understanding gc_stats_table and gc_stats_column

#gbase #database #数据库

The query optimizer in GBase 8a, the China‑domestically developed MPP database from GBASE, relies heavily on statistics to choose efficient execution plans. Those statistics live in two system tables: gclusterdb.gc_stats_table for table‑level row counts, and gclusterdb.gc_stats_column for column‑level distributions. Let's break down what they store, how the numbers are calculated, and which parameters matter most.

1. Table‑Level Stats: gc_stats_table

This table holds the estimated row count and the last collection timestamp for each table.

SELECT * FROM gc_stats_table WHERE db = 'tpch1t';

Sample output:

db	table_name	tuples	timestamp
tpch1t	lineitem	5999989709	2023-12-29 11:37:05
tpch1t	orders	1500000000	2023-12-29 11:47:35
tpch1t	part	200000000	2023-12-29 11:49:58
...	...	...	...

The tuples column guides cost estimation and join ordering across the gbase database.

2. Column‑Level Stats: gc_stats_column

Column statistics are stored in gclusterdb.gc_stats_column with the following schema:

DESC gclusterdb.gc_stats_column;

Column	Type	Description
db	varchar(64)	Database name
table_name	varchar(64)	Table name
column_name	varchar(64)	Column name
null_frac	float	Fraction of null values
avg_width	int(11)	Average width (variable‑length types only)
n_distinct	float	Distinct count (positive) or ratio (negative)
mcv_vals	varchar(10922)	Most common values list
mcv_freqs	varchar(10922)	Frequencies of those values
histogram	varchar(10922)	Histogram boundaries (numeric columns)

How the Key Columns Are Computed

null_frac: null rows / sampled rows.
avg_width: Applies only to variable‑length columns (VARCHAR, etc.) — computed as SELECT AVG(LENGTH(c1)) FROM table.
n_distinct: If positive, it represents COUNT(DISTINCT c1). If negative, it's the ratio COUNT(DISTINCT c1) / COUNT(*), used when the ratio exceeds 10%.
mcv_vals & mcv_freqs: The most frequent values and their frequencies, limited to the number specified by gcluster_statistics_target (default 25). Queries: SELECT v, COUNT(col)/COUNT(*) FROM table GROUP BY v ORDER BY COUNT(*) DESC LIMIT 25.
histogram: Equi‑height histogram boundaries, only for numeric types. The number of bins equals gcluster_statistics_target. Algorithm:
1. bucket_size = COUNT(*) / gcluster_statistics_target;
2. After sorting values, record the minimum of every bucket_size‑th row to define left‑closed, right‑open intervals.

3. Important Parameters

gcluster_statistics_sampling_threshold: Tables with fewer rows are scanned entirely; larger tables are sampled randomly.
gcluster_statistics_target: Controls the number of MCV entries and histogram bins; the default is usually fine.
gcluster_analyze_relative_error: Desired relative error for statistical approximations.

By understanding these two statistics tables and their parameters, you can better evaluate query plans and keep your gbase database performing well — a must‑have skill for any DBA working with GBASE's analytical engine.