Michael

Posted on Jun 19 • Originally published at gbase.cn

GBase 8a Query Optimization in Practice: EXPLAIN, Materialized Views, CTE, and Common Tuning Techniques

#gbase #database #数据库

This article starts from real slow queries and explains how to read execution plans with EXPLAIN, use materialized views correctly, when to apply CTEs, and several high‑frequency query tuning tips in a gbase database.

1. Reading Execution Plans with EXPLAIN

Basic Usage

EXPLAIN
SELECT dept_id, SUM(amount)
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY dept_id;

The EXPLAIN output in GBase 8a is a tree structure. Each row represents an operator, and execution proceeds from bottom to top, inside to outside.

Key Operators

Operator	Meaning	Performance Concern
SeqScan	Sequential scan	Are row estimates accurate?
HashAgg	Hash aggregation	Memory sufficiency, spills
HashJoin	Hash join	Correct choice of driving table?
Redistribute	Data shuffle across nodes	Can it be avoided? High cost
Broadcast	Broadcast small table	Lower cost than Redistribute, but table must be small
Gather	Collect results from gnodes	Final collection point
Sort	Sort	Expensive on large datasets

Focus on Redistribute

Redistribute means cross‑node data transfer, the largest network overhead in MPP. The goal is to reduce its occurrence, ideally to zero.

A Real Case

Original slow query (~30 seconds):

SELECT o.dept_id, d.dept_name, SUM(o.amount) AS total
FROM orders o
JOIN dept d ON o.dept_id = d.dept_id
WHERE o.order_date >= '2024-01-01'
GROUP BY o.dept_id, d.dept_name;

EXPLAIN showed that orders required Redistribute by dept_id (orders is distributed by customer_id), and dept also required Redistribute — yet dept has only 100 rows. It should be a replicated table.

-- Rebuild dept as a replicated table
CREATE TABLE dept_rep (
    dept_id INT, dept_name VARCHAR(64)
) REPLICATED;
INSERT INTO dept_rep SELECT * FROM dept;

After this change, both Redistributes were eliminated and execution time dropped to 3 seconds.

2. Materialized Views: Pre‑computation for Analytical Queries

A materialized view persists query results, ideal for aggregated reports that are read frequently but whose underlying data changes rarely.

Creating a Materialized View

CREATE MATERIALIZED VIEW mv_sales_daily AS
SELECT dept_id, order_date,
       COUNT(*) AS order_cnt,
       SUM(amount) AS total_amount,
       AVG(amount) AS avg_amount
FROM orders
GROUP BY dept_id, order_date;

Permissions

Materialized views need to read metadata in gclusterdb. If you encounter a permission error, grant access:

GRANT SELECT ON gclusterdb.* TO 'your_user'@'%';

Refresh and Query Rewrite

Only full refresh is currently supported: REFRESH MATERIALIZED VIEW mv_sales_daily;. Run it during off‑peak hours. GBase 8a supports automatic query rewrite based on materialized views; use EXPLAIN to verify whether a view was hit.

3. CTE (WITH AS): Readability and Performance for Complex Queries

CTEs must be enabled in both gcluster and gnode config files: _t_gcluster_support_cte = 1.

CTE Example

WITH
  valid_orders AS (
      SELECT order_id, customer_id, dept_id, amount
      FROM orders WHERE order_date >= '2024-01-01' AND status = 1
  ),
  customer_summary AS (
      SELECT customer_id, SUM(amount) AS total, COUNT(*) AS cnt
      FROM valid_orders GROUP BY customer_id
  )
SELECT * FROM customer_summary WHERE total > 10000 ORDER BY total DESC LIMIT 100;

When a CTE is referenced multiple times, enable _t_gcluster_reuse_tmp_table_optimize = 1 to avoid redundant computation. If referenced only once, a CTE may add unnecessary materialization overhead compared to a regular subquery.

4. Common Slow‑Query Scenarios and Tuning

COUNT(DISTINCT) slow: Enable two‑phase distinct optimization: _t_gcluster_agg_distinct_redist_optimize = 1 and _gbase_optimizer_aggr_distinct = 1.
ORDER BY + LIMIT slow: Avoid sorting huge result sets without LIMIT; GBase 8a usually optimizes local Top‑N automatically.
GROUP BY with high cardinality causing memory overflow: Enable gcluster_delayed_group_by_optimize = 1.
Many small JOINs causing single‑node execution: Adjust the broadcast threshold gcluster_hash_redist_threshold_row = 1000000 and enable JOIN redistribution optimization.

5. Query Tuning Methodology

Run EXPLAIN first. Look for Redistribute and full table scans.
Check whether filters hit partition pruning.
Check whether the distribution keys of joined tables align.
Are small tables created as REPLICATED?
Handle high‑cardinality DISTINCT or large GROUP BY with specific parameters.
Check data skew via gclusterdb.dql_statistic by comparing per‑node execution times.
Use materialized views for pre‑computation when appropriate.

Characteristics of a good execution plan: at most one Redistribute (preferably zero), early data filtering, small tables joined via Broadcast, and roughly equal execution time across gnodes (no data skew).

Good query tuning in a gbase database starts with reading the execution plan, fixing distribution issues, and knowing when to pre‑compute. Apply these patterns and you'll see consistent performance improvements across your analytical workloads.

DEV Community