This article starts from real slow queries and explains how to read execution plans with EXPLAIN, use materialized views correctly, when to apply CTEs, and several high‑frequency query tuning tips in a gbase database.
1. Reading Execution Plans with EXPLAIN
Basic Usage
EXPLAIN
SELECT dept_id, SUM(amount)
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY dept_id;
The EXPLAIN output in GBase 8a is a tree structure. Each row represents an operator, and execution proceeds from bottom to top, inside to outside.
Key Operators
| Operator | Meaning | Performance Concern |
|---|---|---|
| SeqScan | Sequential scan | Are row estimates accurate? |
| HashAgg | Hash aggregation | Memory sufficiency, spills |
| HashJoin | Hash join | Correct choice of driving table? |
| Redistribute | Data shuffle across nodes | Can it be avoided? High cost |
| Broadcast | Broadcast small table | Lower cost than Redistribute, but table must be small |
| Gather | Collect results from gnodes | Final collection point |
| Sort | Sort | Expensive on large datasets |
Focus on Redistribute
Redistribute means cross‑node data transfer, the largest network overhead in MPP. The goal is to reduce its occurrence, ideally to zero.
A Real Case
Original slow query (~30 seconds):
SELECT o.dept_id, d.dept_name, SUM(o.amount) AS total
FROM orders o
JOIN dept d ON o.dept_id = d.dept_id
WHERE o.order_date >= '2024-01-01'
GROUP BY o.dept_id, d.dept_name;
EXPLAIN showed that orders required Redistribute by dept_id (orders is distributed by customer_id), and dept also required Redistribute — yet dept has only 100 rows. It should be a replicated table.
-- Rebuild dept as a replicated table
CREATE TABLE dept_rep (
dept_id INT, dept_name VARCHAR(64)
) REPLICATED;
INSERT INTO dept_rep SELECT * FROM dept;
After this change, both Redistributes were eliminated and execution time dropped to 3 seconds.
2. Materialized Views: Pre‑computation for Analytical Queries
A materialized view persists query results, ideal for aggregated reports that are read frequently but whose underlying data changes rarely.
Creating a Materialized View
CREATE MATERIALIZED VIEW mv_sales_daily AS
SELECT dept_id, order_date,
COUNT(*) AS order_cnt,
SUM(amount) AS total_amount,
AVG(amount) AS avg_amount
FROM orders
GROUP BY dept_id, order_date;
Permissions
Materialized views need to read metadata in gclusterdb. If you encounter a permission error, grant access:
GRANT SELECT ON gclusterdb.* TO 'your_user'@'%';
Refresh and Query Rewrite
Only full refresh is currently supported: REFRESH MATERIALIZED VIEW mv_sales_daily;. Run it during off‑peak hours. GBase 8a supports automatic query rewrite based on materialized views; use EXPLAIN to verify whether a view was hit.
3. CTE (WITH AS): Readability and Performance for Complex Queries
CTEs must be enabled in both gcluster and gnode config files: _t_gcluster_support_cte = 1.
CTE Example
WITH
valid_orders AS (
SELECT order_id, customer_id, dept_id, amount
FROM orders WHERE order_date >= '2024-01-01' AND status = 1
),
customer_summary AS (
SELECT customer_id, SUM(amount) AS total, COUNT(*) AS cnt
FROM valid_orders GROUP BY customer_id
)
SELECT * FROM customer_summary WHERE total > 10000 ORDER BY total DESC LIMIT 100;
When a CTE is referenced multiple times, enable _t_gcluster_reuse_tmp_table_optimize = 1 to avoid redundant computation. If referenced only once, a CTE may add unnecessary materialization overhead compared to a regular subquery.
4. Common Slow‑Query Scenarios and Tuning
-
COUNT(DISTINCT) slow: Enable two‑phase distinct optimization:
_t_gcluster_agg_distinct_redist_optimize = 1and_gbase_optimizer_aggr_distinct = 1. - ORDER BY + LIMIT slow: Avoid sorting huge result sets without LIMIT; GBase 8a usually optimizes local Top‑N automatically.
-
GROUP BY with high cardinality causing memory overflow: Enable
gcluster_delayed_group_by_optimize = 1. -
Many small JOINs causing single‑node execution: Adjust the broadcast threshold
gcluster_hash_redist_threshold_row = 1000000and enable JOIN redistribution optimization.
5. Query Tuning Methodology
- Run EXPLAIN first. Look for Redistribute and full table scans.
- Check whether filters hit partition pruning.
- Check whether the distribution keys of joined tables align.
- Are small tables created as REPLICATED?
- Handle high‑cardinality DISTINCT or large GROUP BY with specific parameters.
- Check data skew via
gclusterdb.dql_statisticby comparing per‑node execution times. - Use materialized views for pre‑computation when appropriate.
Characteristics of a good execution plan: at most one Redistribute (preferably zero), early data filtering, small tables joined via Broadcast, and roughly equal execution time across gnodes (no data skew).
Good query tuning in a gbase database starts with reading the execution plan, fixing distribution issues, and knowing when to pre‑compute. Apply these patterns and you'll see consistent performance improvements across your analytical workloads.
Top comments (0)