Michael

Posted on Jun 14 • Originally published at gbase.cn

Troubleshooting Slow Queries Caused by Data Redistribution in GBase 8a

#gbase #database #数据库

In production GBase 8a clusters, many queries that look simple and deal with reasonable data volumes can become unexpectedly slow. The real bottleneck is often not table scanning, but cross‑node data redistribution during execution. This article shows how to identify, analyze, and fix performance problems caused by unnecessary or excessive redistribution in a gbase database.

1. What Redistribution Is and Why It Hurts Performance

GBase 8a stores data across multiple nodes. Ideally each node processes its local data and only a small aggregated result is collected. But when the data needed for a JOIN, GROUP BY, or other operation is not co‑located on the same node, the system must redistribute it — moving rows between nodes according to a new key. Common triggers include:

JOIN columns that are not the table's distribution key
GROUP BY on columns that don't match the distribution scheme
Filtering after large intermediate results have already been built
Applying functions or implicit type conversions on JOIN columns, breaking local matching
Long join chains that scatter data early

Redistribution itself is not evil, but when it moves billions of rows, the network transfer and temporary disk usage dominate the query cost. Typical symptoms: the query runs fast at first then suddenly slows down, inter‑node network traffic spikes, temporary space usage grows, and some nodes become overloaded while others wait.

2. SQL Patterns That Often Cause Expensive Redistribution

Direct JOINs between two large fact tables on non‑distribution keys.
JOIN first, aggregate later — building a huge intermediate result before reducing it.
Functions on dimension columns in JOIN conditions, e.g., CAST(a.user_id AS VARCHAR) = b.user_code, which prevents the optimizer from recognizing co‑location.
Distribution keys chosen only for balance, ignoring core query paths — no data skew, but every query triggers a full reshuffle.

3. Diagnostic Workflow: Look Before You Leap

Identify where the slowness comes from: scanning? joining? a single step? consistent or sporadic?
Inspect the execution plan for redistribution steps: look for data motion before JOINs or aggregations, check the volume involved, and whether multiple redistribution steps chain together.
Compare the distribution keys with the query predicates: if the fact table is distributed by order_id but the most frequent JOIN is on user_id, local JOIN is impossible by design.
Check for implicit conversions or expression‑based JOINs: functions like TRIM(), CAST(), SUBSTR() hide the real key and force redistribution. Standardise columns beforehand.

4. Four Effective Optimization Strategies

1. Make high‑frequency JOINs local

Align distribution keys with the most critical JOIN columns, even if it means slightly less uniform data distribution. Protect the dominant access path first.

2. Filter and aggregate before JOINing

Reduce the intermediate result size before the join. For example, pre‑aggregate the fact table on the join key:

SELECT u.region_id,
       SUM(t.user_amt) AS total_amt
FROM (
    SELECT user_id, SUM(pay_amount) AS user_amt
    FROM fact_order
    WHERE order_date BETWEEN '2026-03-01' AND '2026-03-31'
    GROUP BY user_id
) t
JOIN dim_user u ON t.user_id = u.user_id
GROUP BY u.region_id;

3. Move expressions to the data preparation layer

Perform cleansing and standardisation (e.g., trimming, casting) during ETL so that JOINs use consistent, index‑friendly columns.

4. Split ultra‑long query chains into stages

For complex reports, materialize intermediate results into staging tables step by step, making each step controllable and easier to debug.

5. Real‑World Case: From 30 Seconds to Under 3 Seconds

A report aggregating paid amounts by region for the last 7 days:

SELECT u.region_name, SUM(o.pay_amount) AS total_amt
FROM fact_order o
JOIN dim_user u ON o.user_id = u.user_id
WHERE o.order_status = 'PAID'
  AND o.order_date >= '2026-03-20'
GROUP BY u.region_name;

Fact table: 1.8 billion rows. Dimension: 30 million. Original runtime: ~31 seconds. The execution plan showed a large redistribution before the JOIN. Root cause: fact_order was distributed by order_id, dim_user by user_id — the join key user_id was not the distribution key of the fact table.

Optimization: Pre‑aggregated the fact table to one row per user, then joined:

SELECT u.region_name, SUM(t.user_amt) AS total_amt
FROM (
    SELECT user_id, SUM(pay_amount) AS user_amt
    FROM fact_order
    WHERE order_status = 'PAID'
      AND order_date >= '2026-03-20'
    GROUP BY user_id
) t
JOIN dim_user u ON t.user_id = u.user_id
GROUP BY u.region_name;

Runtime dropped to ~8 seconds. Later, a dedicated analytical summary table distributed by user_id was created, bringing the final runtime to a stable 2.7–3.2 seconds, with significantly lower network traffic and resource fluctuations.

6. Modeling Principles to Prevent Redistribution Issues

Distribution key should serve the dominant access path: balance JOIN columns, filter predicates, and aggregation dimensions — not just evenness.
Minimise direct JOINs between two huge tables: consider pre‑built wide tables or summary tables to shift work to off‑peak time.
Unify join key definitions across tables: same business key should have identical data type, length, and format everywhere.
Don't mistake "works in test" for "healthy at scale": redistribution costs grow non‑linearly with data volume.

7. Summary

Many slow queries in a gbase database are slow not because the computation is heavy, but because too much data is moved between nodes. Diagnose by checking the execution plan for redistribution steps, verifying whether distribution keys match query paths, and then applying techniques like pre‑aggregation, early filtering, and, where necessary, adjusting the table design for long‑term gains. In GBase 8a, a query is often slow not because it computes too much, but because it shuffles too much.

DEV Community