In day‑to‑day operations with GBase 8a, the real headache often isn’t creating tables or databases — it’s the “production issues” that appear after the system goes live. A SQL query that used to run smoothly suddenly becomes slow. The cluster looks healthy, but business starts seeing result fluctuations, uneven node execution times, or local replica inconsistencies. These problems are rarely isolated. Drawing from the GBase community’s collective experience, this article lays out a structured troubleshooting path — from slow query detection all the way to primary‑replica consistency handling.
- Don’t Rewrite SQL First: Figure Out Where It’s Slow

The most common reflex when a query slows down is to immediately rewrite JOINs, break down subqueries, or add filters. That approach isn’t wrong, but it should come after you know whether the entire SQL is slow or only a few nodes are dragging it down.
GBase 8a provides the parameter gcluster_dql_statistic_threshold. When enabled, it records SQL execution information for queries exceeding the threshold into the system tables gclusterdb.sys_sqls and gclusterdb.sys_sql_elapsepernode. Meanwhile, the in‑memory table information_schema.SELECT_STATISTIC lets you observe currently running queries per node. The default value is 0 (disabled); when set to a positive value, only queries running longer than that threshold are recorded.
Practical approach: Set a reasonable threshold, then examine node‑level differences.
set global gcluster_dql_statistic_threshold = 3000;
Check recently recorded slow SQLs:
select *
from gclusterdb.sys_sqls
order by create_time desc
limit 20;
Then look at the execution time distribution for a specific SQL across nodes:
select *
from gclusterdb.sys_sql_elapsepernode
where sql_id = 'replace_with_actual_sql_id';
If most nodes are fast but a few are significantly slower, the problem is not “bad SQL” — it’s likely node resource imbalance, data skew, execution blocking, or even a node‑level issue. Conversely, if all nodes are slow, then the bottleneck is global: the SQL logic itself, large intermediate results, or configuration.
- When Node Times Are Uneven, Suspect Data Skew First

In a distributed analytical database like GBase 8a, many “slow queries” aren’t due to a global performance deficiency; they’re caused by a few nodes doing most of the work.
Scenarios like distinct, group by, dynamic hash redistribution, or joining multiple small tables can easily trigger data skew. Relevant optimization parameters include gcluster_hash_redistribute_join_optimize and gcluster_hash_redistribute_groupby_optimize.
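As a sketch of how these levers are typically applied: the statements below assume the parameters accept simple numeric toggles, as in community write-ups. The exact valid values and defaults vary by version, so verify them against your GBase 8a documentation before use.

```sql
-- Hypothetical toggles for redistribution optimization; the parameter
-- names come from the text above, but valid values should be confirmed
-- against your GBase 8a version before applying them in production.
set global gcluster_hash_redistribute_join_optimize = 1;
set global gcluster_hash_redistribute_groupby_optimize = 1;
```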
A classic example:
select region_code, count(distinct user_id)
from fact_order
group by region_code;
This looks simple, but if region_code is highly skewed — say, a few values dominate — a handful of nodes will stay busy while others finish quickly. Even if total cluster CPU isn’t maxed out, query time gets dragged down by the slowest nodes.
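Before tuning anything, it is worth confirming the skew hypothesis with a plain distribution count. This sketch reuses the table and column from the example above; if a handful of values account for most rows, redistribution on that key will concentrate work on a few nodes.

```sql
-- Quick skew check: do a few region_code values dominate the table?
select region_code, count(*) as row_cnt
from fact_order
group by region_code
order by row_cnt desc
limit 10;
```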
When node‑level time differences are large, a solid sequence is: confirm whether it’s skew; then determine if the skew comes from the original distribution key or from dynamic redistribution during execution; finally decide whether to adjust the data model, rewrite the SQL, or tune parameters.
- Slow Queries Aren’t Always About Scanning: Intermediate Results Can Be Heavy

In many analytical environments, the real bottleneck isn’t reading data; it’s oversized intermediate results, too many temporary objects, or high data exchange costs after redistribution.
For example, in a wide‑table scenario, this kind of query easily blows up:
select *
from fact_a a
join fact_b b
on a.user_id = b.user_id
where a.stat_date >= '2026-01-01';
The problem isn’t necessarily the JOIN itself — it’s that select * makes the intermediate result set explode, and everything downstream (sorting, aggregation, data exchange) becomes heavier. For an analytical system like GBase 8a, a better practice is to select only necessary columns, push filters down as early as possible, and reduce the data volume before joining large tables.
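A trimmed version of the query above might look like the following. The specific column names (`order_amt`, `user_level`) are hypothetical stand-ins for whichever columns the downstream logic actually needs.

```sql
-- Select only the needed columns and filter before the join; the column
-- names order_amt and user_level are illustrative, not from a real schema.
select a.user_id, a.stat_date, a.order_amt, b.user_level
from fact_a a
join fact_b b
  on a.user_id = b.user_id
where a.stat_date >= '2026-01-01';
```

The filter on `stat_date` is already pushed down to `fact_a` here; the main win over the original is that the projected row width, and therefore every downstream sort, aggregation, and exchange, shrinks dramatically.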
When you suspect the bottleneck is not at the SQL text level but within the execution chain, logs become critical. express.log records important events inside the express engine, and gcluster_log_level controls log verbosity.
- Look at Local Logs Alongside SQL Statistics

System tables alone aren’t always enough. Some problems don’t show up in SQL statistics but appear first in service, recovery, loading, or daemon logs. The key log files include:
express.log: important events and exceptions from the execution engine.
system.log: stack traces when a service crashes.
gc_recover: event recovery processes.
gcware.log: node status, resource conditions, and multi‑replica operations.
Also loader_logs, loader_result, gcinstall.log, replace.log, etc.
These logs are invaluable during troubleshooting. For instance, if node‑level statistics show one node is clearly abnormal, check its logs for crashes, recovery, load conflicts, disk issues, or resource exhaustion.
- Audit Information Isn’t Just for Compliance: It Helps Troubleshooting Too

Many environments only look at audit tables when compliance demands it, but they’re also useful for root-cause analysis. Combining “who did what when” with “which node was slow and how slow” often helps connect the dots faster.
If a statistical SQL suddenly slows down, node‑level stats tell you that it’s slow, but they don’t tell you what happened in the preceding hours. Pairing that with recent bulk loads, deletes, table changes, or parameter modifications makes the investigation much more efficient.
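One hedged way to build that timeline is to pull everything recorded in the hours before the slowdown from `gclusterdb.sys_sqls` and scan it for loads, deletes, and DDL. The column names used here (`create_time`, `sql_text`) are assumptions; check the actual table definition on your version.

```sql
-- Hypothetical correlation query: list activity recorded in the window
-- before the slowdown. Column names are assumptions about sys_sqls.
select create_time, sql_text
from gclusterdb.sys_sqls
where create_time between '2026-01-15 06:00:00' and '2026-01-15 10:00:00'
order by create_time;
```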
- When Problems Look Like Replica Inconsistency, Stop Purely Performance-Focused Diagnosis

Some issues that seem like query anomalies are actually primary-replica consistency problems. Common causes include inconsistent local parameters, power outages, server crashes, RAID/driver issues, virtual machine failures, or manual misoperations.
GBase 8a uses direct I/O for writes, and the database only treats a write as successful once it has been acknowledged; even so, if the underlying environment fails, there is still a risk of a write that succeeds logically without ever becoming physically durable.
The parameter gcluster_suffix_consistency_resolve defaults to 0 (do not resolve shard consistency). When set to 1, it attempts resolution. This feature requires at least three host nodes; a two‑node cluster may not support all capabilities.
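Enabling it is a one-line change, sketched below on the assumption that the parameter can be set globally at runtime like other gcluster parameters in this article; confirm the prerequisites (at least three host nodes) before turning it on.

```sql
-- Enable automatic shard-consistency resolution (requires >= 3 hosts;
-- a two-node cluster may not support all capabilities).
set global gcluster_suffix_consistency_resolve = 1;
```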
If you encounter any of these symptoms, stop treating it as a normal slow‑query problem:
The same table returns unstable results at different times.
After a node recovers, local queries start erroring.
DML operations appear successful but subsequent reads behave abnormally.
There have been recent power outages, node failures, manual event deletions, or local parameter changes.
At this point, first verify whether a consistency issue exists — otherwise, you’ll waste time on SQL tuning that can’t solve the real problem.
- A Cleaner Troubleshooting Path

Putting the above together gives a repeatable sequence:
Step 1: Determine whether the problem is “global slowdown” or “a few nodes are significantly slower.” Use gcluster_dql_statistic_threshold together with sys_sqls and sys_sql_elapsepernode.
Step 2: If it’s node‑specific, check for data skew, dynamic redistribution, or hot spots triggered by distinct/group by/JOINs. Look at execution logs to see if any stage is unusually long.
Step 3: If it’s not purely a performance issue but involves unstable results, post‑recovery anomalies, or inconsistent reads/writes, shift focus to primary‑replica consistency and underlying environment events. Don’t keep tweaking SQL.
This sequence prevents the two extremes: rewriting SQL without evidence, or rebuilding everything at the first sign of trouble. Narrow down the scope first, then decide whether to modify SQL, adjust parameters, examine logs, or handle consistency issues.
Things Often Overlooked in Production
Don’t rely on average time: In a distributed environment, averages hide node‑level drags. The slowest node often determines overall query time.
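To make the point concrete, a query like this compares the slowest node against the per-node average for one SQL. The column names `node_name` and `elapsed_time` are assumptions about the schema of `sys_sql_elapsepernode`; substitute the real ones on your version. A large ratio between the max and the average is the signature of node-level skew that an average alone would hide.

```sql
-- Slowest node vs. average per-node time for a single SQL.
-- elapsed_time is an assumed column name in sys_sql_elapsepernode.
select sql_id,
       max(elapsed_time) as slowest_node_time,
       avg(elapsed_time) as avg_node_time,
       max(elapsed_time) / avg(elapsed_time) as skew_ratio
from gclusterdb.sys_sql_elapsepernode
where sql_id = 'replace_with_actual_sql_id'
group by sql_id;
```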
Parameters aren’t magic bullets: They’re optimization levers for specific scenarios, not universal fixes.
Don’t ignore the underlying environment: Power outages, crashes, drivers, and virtual machine issues are repeatedly mentioned in consistency‑related content — database behavior often reflects underlying conditions.
Combine logs with statistics: System tables tell you which SQL is slow and where; logs tell you why it’s slow and whether anomalies occurred alongside it.

Conclusion
The hardest part of GBase 8a performance and stability isn’t finding a “magic parameter” — it’s breaking the problem into the right layers. A slow query may not be poorly written SQL; it could be node skew. A healthy cluster status doesn’t guarantee a healthy execution path. And once a problem reaches the primary‑replica consistency layer, ordinary SQL tuning will only lead you in circles.
The GBase community’s resources on node‑level statistics, execution logs, parameter tuning, and consistency handling together offer a practical methodology: first observe node differences, then analyze execution phases, then decide whether it’s a distribution/intermediate‑result issue, and finally determine if you need to move into consistency troubleshooting.