GBase 8a Slow SQL Diagnosis: 6 Real‑World Cases from Symptom to Fix

#database #数据库 #gbase

Six real slow‑query scenarios with complete trace: data skew, wrong execution plans, oversized intermediate results, and functions that disable partition pruning.

Case 1: Partition Pruning Disabled, 300M Row Full Scan

Symptom: A date‑filtered query that should take 5 seconds ran for 3 minutes.

SELECT dept_id, SUM(amount)
FROM orders
WHERE DATE_FORMAT(order_date, '%Y%m') = '202406'
GROUP BY dept_id;

Diagnosis: EXPLAIN showed SeqScan orders，rows=300000000 — no partition filtering.

Root Cause: The DATE_FORMAT function on the partition key prevented the optimizer from recognising partition boundaries.

Fix:

SELECT dept_id, SUM(amount)
FROM orders
WHERE order_date >= '2024-06-01' AND order_date < '2024-07-01'
GROUP BY dept_id;

Result: Rows dropped from 300M to 25M; time fell from 3 minutes to 8 seconds. Golden rule: never wrap the partition key in a function.

Case 2: Data Skew — One Node Did 95% of the Work

Symptom: A GROUP BY query on province looked fine in EXPLAIN but took 20 minutes.

Diagnosis: gclusterdb.dql_statistic revealed node1 consumed 1198 seconds while other nodes took 12–15 seconds. Node1 processed 280M rows; the others under 500K each.

Root Cause: The distribution key was province, which has only 34 distinct values. High‑volume provinces overloaded a single node.

Fix:

Permanent: Rebuild the table with a high‑cardinality key like user_id.
Temporary: Enable multi‑column hash redistribution for skewed GROUP BYs.

Result: After rebuilding, the query dropped from 20 minutes to 45 seconds.

Case 3: Cartesian Product Filled /tmp

Symptom: A multi‑table JOIN ran for 2 hours and then crashed when /tmp was full.

Diagnosis: EXPLAIN showed CrossJoin.

Root Cause: Three tables were listed in the FROM clause, but the WHERE clause contained only a date filter — no JOIN condition at all.

Fix: Add proper JOIN conditions and set a row cap on intermediate results.

SELECT a.order_id, b.product_name, c.customer_name
FROM orders a
JOIN products b ON a.product_id = b.product_id
JOIN customers c ON a.customer_id = c.customer_id
WHERE a.order_date = '2024-06-01';

-- Safety cap: error when intermediate rows exceed 1 billion
SET _gbase_result_threshold = 1000000000;

Case 4: Small Table Exceeded Broadcast Threshold

Symptom: A JOIN with an 800K‑row dimension table took 40 seconds.

Diagnosis: 800K rows exceeded the broadcast threshold (default 100K), so the optimizer chose Hash Redistribute, but misaligned distribution keys caused multiple shuffles.

Fix: Convert the dimension table to a replicated table (800K rows ≈ 100MB, easily affordable).

CREATE TABLE dim_channel_rep (
    channel_id INT, channel_name VARCHAR(64)
) REPLICATED;

Result: The JOIN ran entirely locally; 40 seconds became 3 seconds.

Case 5: COUNT(DISTINCT) on High‑Cardinality Column

Symptom: A daily active user count on 500M rows took 25 minutes with frequent "heap exceeds limit" warnings.

Diagnosis: user_id cardinality exceeded 100M; the HashAgg memory spilled to disk.

Fix:

Enable the two‑phase distinct optimization parameters.
Long‑term: Pre‑aggregate with a daily ETL materialised view.

Result: Parameters brought it to 6 minutes; pre‑aggregation reduced report queries to milliseconds.

Case 6: OR Condition Caused Full Table Scan

Symptom: A query with an OR filter scanned the entire table, even though each branch ran fast individually.

Root Cause: The columnar engine cannot effectively push down OR conditions for partition pruning and column projection.

Fix: Rewrite as UNION ALL so each branch uses its optimal path.

SELECT * FROM orders WHERE customer_id = 10001
UNION ALL
SELECT * FROM orders WHERE order_date = '2024-06-01' AND customer_id != 10001;

Result: 45 seconds down to 4 seconds.

Slow SQL Troubleshooting Flow

EXPLAIN — look for CrossJoin, missing partition pruning, multiple Redistribute steps.
Check gclusterdb.dql_statistic — a >10× gap between node execution times indicates data skew.
Inspect specific operators — COUNT DISTINCT, Sort, large table broadcast.
For fixed reports, prefer pre‑aggregation or materialised views.

Quick Reference

Symptom	Likely Cause	First Check
Query 10×+ slower than expected	Partition pruning disabled	EXPLAIN rows
One node at 100% CPU, others idle	Data skew	dql_statistic node times
/tmp ballooned	Cartesian product or huge ORDER BY	EXPLAIN for CrossJoin
Memory warning then query error	COUNT DISTINCT high cardinality	gnode heap logs
OR queries extremely slow	OR can't push down	Rewrite as UNION ALL
JOIN runs on a single node	Small table exceeds broadcast threshold	Replicate table or raise threshold

These six patterns cover the most frequent slow‑query culprits in a gbase database. Working through them systematically gets you to the root cause faster and keeps your GBASE analytical workloads running at full speed.