DEV Community

Michael
Michael

Posted on • Originally published at gbase.cn

GBase 8a Statistics Tables: Understanding gc_stats_table and gc_stats_column

The query optimizer in GBase 8a, the China‑domestically developed MPP database from GBASE, relies heavily on statistics to choose efficient execution plans. Those statistics live in two system tables: gclusterdb.gc_stats_table for table‑level row counts, and gclusterdb.gc_stats_column for column‑level distributions. Let's break down what they store, how the numbers are calculated, and which parameters matter most.

1. Table‑Level Stats: gc_stats_table

This table holds the estimated row count and the last collection timestamp for each table.

SELECT * FROM gc_stats_table WHERE db = 'tpch1t';
Enter fullscreen mode Exit fullscreen mode

Sample output:

db table_name tuples timestamp
tpch1t lineitem 5999989709 2023-12-29 11:37:05
tpch1t orders 1500000000 2023-12-29 11:47:35
tpch1t part 200000000 2023-12-29 11:49:58
... ... ... ...

The tuples column guides cost estimation and join ordering across the gbase database.

2. Column‑Level Stats: gc_stats_column

Column statistics are stored in gclusterdb.gc_stats_column with the following schema:

DESC gclusterdb.gc_stats_column;
Enter fullscreen mode Exit fullscreen mode
Column Type Description
db varchar(64) Database name
table_name varchar(64) Table name
column_name varchar(64) Column name
null_frac float Fraction of null values
avg_width int(11) Average width (variable‑length types only)
n_distinct float Distinct count (positive) or ratio (negative)
mcv_vals varchar(10922) Most common values list
mcv_freqs varchar(10922) Frequencies of those values
histogram varchar(10922) Histogram boundaries (numeric columns)

How the Key Columns Are Computed

  • null_frac: null rows / sampled rows.
  • avg_width: Applies only to variable‑length columns (VARCHAR, etc.) — computed as SELECT AVG(LENGTH(c1)) FROM table.
  • n_distinct: If positive, it represents COUNT(DISTINCT c1). If negative, it's the ratio COUNT(DISTINCT c1) / COUNT(*), used when the ratio exceeds 10%.
  • mcv_vals & mcv_freqs: The most frequent values and their frequencies, limited to the number specified by gcluster_statistics_target (default 25). Queries: SELECT v, COUNT(col)/COUNT(*) FROM table GROUP BY v ORDER BY COUNT(*) DESC LIMIT 25.
  • histogram: Equi‑height histogram boundaries, only for numeric types. The number of bins equals gcluster_statistics_target. Algorithm:
    1. bucket_size = COUNT(*) / gcluster_statistics_target;
    2. After sorting values, record the minimum of every bucket_size‑th row to define left‑closed, right‑open intervals.

3. Important Parameters

  • gcluster_statistics_sampling_threshold: Tables with fewer rows are scanned entirely; larger tables are sampled randomly.
  • gcluster_statistics_target: Controls the number of MCV entries and histogram bins; the default is usually fine.
  • gcluster_analyze_relative_error: Desired relative error for statistical approximations.

By understanding these two statistics tables and their parameters, you can better evaluate query plans and keep your gbase database performing well — a must‑have skill for any DBA working with GBASE's analytical engine.

Top comments (0)