DEV Community

Michael
Michael

Posted on • Originally published at gbase.cn

GBase 8a Query Optimization in Practice: EXPLAIN, Materialized Views, CTE, and Common Tuning Techniques

This article starts from real slow queries and explains how to read execution plans with EXPLAIN, use materialized views correctly, when to apply CTEs, and several high‑frequency query tuning tips in a gbase database.

1. Reading Execution Plans with EXPLAIN

Basic Usage

EXPLAIN
SELECT dept_id, SUM(amount)
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY dept_id;
Enter fullscreen mode Exit fullscreen mode

The EXPLAIN output in GBase 8a is a tree structure. Each row represents an operator, and execution proceeds from bottom to top, inside to outside.

Key Operators

Operator Meaning Performance Concern
SeqScan Sequential scan Are row estimates accurate?
HashAgg Hash aggregation Memory sufficiency, spills
HashJoin Hash join Correct choice of driving table?
Redistribute Data shuffle across nodes Can it be avoided? High cost
Broadcast Broadcast small table Lower cost than Redistribute, but table must be small
Gather Collect results from gnodes Final collection point
Sort Sort Expensive on large datasets

Focus on Redistribute

Redistribute means cross‑node data transfer, the largest network overhead in MPP. The goal is to reduce its occurrence, ideally to zero.

A Real Case

Original slow query (~30 seconds):

SELECT o.dept_id, d.dept_name, SUM(o.amount) AS total
FROM orders o
JOIN dept d ON o.dept_id = d.dept_id
WHERE o.order_date >= '2024-01-01'
GROUP BY o.dept_id, d.dept_name;
Enter fullscreen mode Exit fullscreen mode

EXPLAIN showed that orders required Redistribute by dept_id (orders is distributed by customer_id), and dept also required Redistribute — yet dept has only 100 rows. It should be a replicated table.

-- Rebuild dept as a replicated table
CREATE TABLE dept_rep (
    dept_id INT, dept_name VARCHAR(64)
) REPLICATED;
INSERT INTO dept_rep SELECT * FROM dept;
Enter fullscreen mode Exit fullscreen mode

After this change, both Redistributes were eliminated and execution time dropped to 3 seconds.

2. Materialized Views: Pre‑computation for Analytical Queries

A materialized view persists query results, ideal for aggregated reports that are read frequently but whose underlying data changes rarely.

Creating a Materialized View

CREATE MATERIALIZED VIEW mv_sales_daily AS
SELECT dept_id, order_date,
       COUNT(*) AS order_cnt,
       SUM(amount) AS total_amount,
       AVG(amount) AS avg_amount
FROM orders
GROUP BY dept_id, order_date;
Enter fullscreen mode Exit fullscreen mode

Permissions

Materialized views need to read metadata in gclusterdb. If you encounter a permission error, grant access:

GRANT SELECT ON gclusterdb.* TO 'your_user'@'%';
Enter fullscreen mode Exit fullscreen mode

Refresh and Query Rewrite

Only full refresh is currently supported: REFRESH MATERIALIZED VIEW mv_sales_daily;. Run it during off‑peak hours. GBase 8a supports automatic query rewrite based on materialized views; use EXPLAIN to verify whether a view was hit.

3. CTE (WITH AS): Readability and Performance for Complex Queries

CTEs must be enabled in both gcluster and gnode config files: _t_gcluster_support_cte = 1.

CTE Example

WITH
  valid_orders AS (
      SELECT order_id, customer_id, dept_id, amount
      FROM orders WHERE order_date >= '2024-01-01' AND status = 1
  ),
  customer_summary AS (
      SELECT customer_id, SUM(amount) AS total, COUNT(*) AS cnt
      FROM valid_orders GROUP BY customer_id
  )
SELECT * FROM customer_summary WHERE total > 10000 ORDER BY total DESC LIMIT 100;
Enter fullscreen mode Exit fullscreen mode

When a CTE is referenced multiple times, enable _t_gcluster_reuse_tmp_table_optimize = 1 to avoid redundant computation. If referenced only once, a CTE may add unnecessary materialization overhead compared to a regular subquery.

4. Common Slow‑Query Scenarios and Tuning

  • COUNT(DISTINCT) slow: Enable two‑phase distinct optimization: _t_gcluster_agg_distinct_redist_optimize = 1 and _gbase_optimizer_aggr_distinct = 1.
  • ORDER BY + LIMIT slow: Avoid sorting huge result sets without LIMIT; GBase 8a usually optimizes local Top‑N automatically.
  • GROUP BY with high cardinality causing memory overflow: Enable gcluster_delayed_group_by_optimize = 1.
  • Many small JOINs causing single‑node execution: Adjust the broadcast threshold gcluster_hash_redist_threshold_row = 1000000 and enable JOIN redistribution optimization.

5. Query Tuning Methodology

  1. Run EXPLAIN first. Look for Redistribute and full table scans.
  2. Check whether filters hit partition pruning.
  3. Check whether the distribution keys of joined tables align.
  4. Are small tables created as REPLICATED?
  5. Handle high‑cardinality DISTINCT or large GROUP BY with specific parameters.
  6. Check data skew via gclusterdb.dql_statistic by comparing per‑node execution times.
  7. Use materialized views for pre‑computation when appropriate.

Characteristics of a good execution plan: at most one Redistribute (preferably zero), early data filtering, small tables joined via Broadcast, and roughly equal execution time across gnodes (no data skew).

Good query tuning in a gbase database starts with reading the execution plan, fixing distribution issues, and knowing when to pre‑compute. Apply these patterns and you'll see consistent performance improvements across your analytical workloads.

Top comments (0)